Google Exam Professional-Data-Engineer Topic 3 Question 80 Discussion

Actual exam question for Google's Google Cloud Certified Professional Data Engineer exam
Question #: 80
Topic #: 3
[All Google Cloud Certified Professional Data Engineer Questions]

You are loading CSV files from Cloud Storage to BigQuery. The files have known data quality issues, including mismatched data types, such as STRINGS and INT64s in the same column, and inconsistent formatting of values such as phone numbers or addresses. You need to create the data pipeline to maintain data quality and perform the required cleansing and transformation. What should you do?

Suggested Answer: A

Data Fusion's advantages:

Visual interface: Provides a user-friendly, drag-and-drop environment for designing data pipelines without extensive coding, making it accessible to a wider range of users.

Built-in transformations: Includes a wide range of pre-built transformations to handle common data quality issues, such as:

Data type conversions

Data cleansing (e.g., removing invalid characters, correcting formatting)

Data validation (e.g., checking for missing values, enforcing constraints)

Data enrichment (e.g., adding derived fields, joining with other datasets)

Custom transformations: Allows for custom transformations using SQL or Java code for more complex cleaning tasks.

Scalability: Can handle large datasets efficiently, making it suitable for processing CSV files with potential data quality issues.

Integration with BigQuery: Integrates seamlessly with BigQuery, allowing for direct loading of transformed data.
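
Whichever option is chosen, the cleansing itself comes down to type coercion and string normalization. Below is a minimal sketch, in Python with the google-cloud-bigquery client, of the staging-table variant debated in the comments; the bucket, project, dataset, table, and column names are hypothetical placeholders, not part of the exam question.

```python
# Hypothetical sketch: land the raw CSVs in an all-STRING staging table,
# then cleanse and cast with SQL. All names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# 1. Load the files as-is. Declaring every column as STRING means rows with
#    mixed STRING/INT64 values cannot fail the load job.
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    schema=[
        bigquery.SchemaField("customer_id", "STRING"),
        bigquery.SchemaField("phone", "STRING"),
        bigquery.SchemaField("address", "STRING"),
    ],
)
client.load_table_from_uri(
    "gs://example-bucket/raw/*.csv",
    "example-project.staging.customers_raw",
    job_config=load_config,
).result()

# 2. Cleanse and cast into the final table. SAFE_CAST yields NULL instead of
#    aborting the query when a value cannot be converted to INT64.
client.query("""
CREATE OR REPLACE TABLE `example-project.warehouse.customers` AS
SELECT
  SAFE_CAST(customer_id AS INT64)      AS customer_id,
  REGEXP_REPLACE(phone, r'[^0-9]', '') AS phone,    -- keep digits only
  TRIM(address)                        AS address
FROM `example-project.staging.customers_raw`
""").result()
```

The same steps (type casting, reformatting phone numbers, trimming addresses) map onto the built-in transformations listed above; the SQL form is shown only because it is compact.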


Contribute your Thoughts:

Carylon
4 months ago
But using Data Fusion to convert the CSV files to a self-describing data format like AVRO could also be a good option. It helps with data consistency (see the sketch below this comment).
upvoted 0 times
...
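To illustrate the "self-describing" point above: an Avro file stores its schema together with the records, so downstream readers recover the column types from the file itself instead of re-inferring them from CSV text. The snippet below is a toy, hypothetical example using the fastavro library; the schema, field names, and sample record are invented for illustration.

```python
# Toy illustration of a self-describing format: the Avro schema travels
# inside the file, alongside the data. Names and values are made up.
import io
import fastavro

schema = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "customer_id", "type": "long"},
        {"name": "phone", "type": "string"},
    ],
}

buf = io.BytesIO()
fastavro.writer(buf, schema, [{"customer_id": 42, "phone": "+1-555-0100"}])

buf.seek(0)
for record in fastavro.reader(buf):  # types come back from the embedded schema
    print(record)                    # {'customer_id': 42, 'phone': '+1-555-0100'}
```

BigQuery can load Avro files directly, which is why converting the CSVs first comes up as an option in this thread.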
Louvenia
4 months ago
I prefer creating a table with the desired schema, loading the CSV files into the table, and performing the transformations in place using SQL. It's more straightforward.
upvoted 0 times
...
Dorothy
4 months ago
I disagree. I believe we should load the CSV files into a staging table with the desired schema and perform transformations with SQL. It gives more control over the process.
upvoted 0 times
...
Carylon
4 months ago
I think we should use Data Fusion to transform the data before loading it into BigQuery. It will help maintain data quality.
upvoted 0 times
...
Marvel
4 months ago
I still think option B is the most practical solution for handling data quality issues in this scenario.
upvoted 0 times
...
Micaela
4 months ago
That's true, using a self-describing data format can be beneficial for data consistency.
upvoted 0 times
...
Yong
4 months ago
I think option D could work too, converting the CSV files to AVRO format can help maintain data quality.
upvoted 0 times
...
Marvel
5 months ago
I prefer option B, loading into a staging table allows for easier transformations with SQL.
upvoted 0 times
...
Oneida
5 months ago
I agree, it's important to clean the data before loading it into BigQuery.
upvoted 0 times
...
Micaela
5 months ago
I think option A makes sense, using Data Fusion to transform the data first.
upvoted 0 times
...
Tawny
6 months ago
Hey, guys, I've got a crazy idea. What if we just load the files as-is and let BigQuery handle the data type and formatting issues? That way, we can skip the whole transformation process and save a ton of time. *winks*
upvoted 0 times
...
Elza
6 months ago
Haha, 'load the CSV files into a table and perform the transformations in place'? That sounds like a recipe for disaster! I can just imagine the table getting super messy and hard to manage. Hard pass on option C.
upvoted 0 times
...
Narcisa
6 months ago
Option A with Data Fusion sounds interesting, but I'm not sure how well it would handle the data quality issues mentioned in the question. I'd be a bit worried about potential performance or scalability problems.
upvoted 0 times
...
Juan
6 months ago
Hmm, this is a tricky one. I think I'm leaning towards option B. Loading the data into a staging table and then using SQL to perform the transformations seems like a pretty robust and flexible approach.
upvoted 0 times
...
Winfred
6 months ago
I don't know, Gearldine. Relying on a third-party tool like Data Fusion seems a bit risky to me. What if it doesn't play nice with our existing infrastructure? I think I'm leaning more towards option B as well.
upvoted 0 times
Ilene
5 months ago
I think we're on the same page here, option B seems like the best choice for our situation.
upvoted 0 times
...
Galen
5 months ago
And we can easily adjust the transformations as needed without relying on external tools.
upvoted 0 times
...
Adela
6 months ago
Exactly, it's a more hands-on approach that gives us flexibility.
upvoted 0 times
...
Filiberto
6 months ago
That way we have more control over the process and can ensure compatibility with our existing infrastructure.
upvoted 0 times
...
Nichelle
6 months ago
It's safer to load the files into a staging table and perform the transformations with SQL.
upvoted 0 times
...
Olga
6 months ago
I agree, using Data Fusion might introduce compatibility issues.
upvoted 0 times
...
...
Elly
6 months ago
I'm not a big fan of this question. It seems to be testing very specific knowledge about data pipelines and data transformation tools, which isn't really my strong suit. I'll have to think carefully about this one.
upvoted 0 times
...
Gearldine
6 months ago
Hmm, I'm not so sure. Option D with Data Fusion might be worth considering. It could save us a lot of time and effort in the long run, especially if we have to deal with this kind of data quality issue regularly.
upvoted 0 times
Huey
5 months ago
I think Option D could be more efficient. Using Data Fusion to convert the files to AVRO format could streamline the process.
upvoted 0 times
...
Alline
5 months ago
I agree with Deandrea. That seems like a practical approach to maintain data quality.
upvoted 0 times
...
Deandrea
5 months ago
Option B sounds good. We can load the data into a staging table and then perform the necessary transformations with SQL.
upvoted 0 times
...
...
Stephaine
6 months ago
I'm with Emogene on this one. Option B is the way to go. Who wants to deal with manually converting the files to a self-describing format? That sounds like a headache waiting to happen.
upvoted 0 times
...
Emogene
6 months ago
Option B sounds like the way to go. Staging the data first and then transforming it with SQL gives you more control and flexibility. Plus, you can easily track the changes and audit the process.
upvoted 0 times
...
Cherry
6 months ago
Ugh, this question is a real doozy! I've dealt with data quality issues before, and it's definitely not a walk in the park. I'm leaning towards option B - it seems like the most comprehensive approach to handling the data cleansing and transformation.
upvoted 0 times
...
