Free Access to Cloudera.CDP-3002.v2025-11-21.q109 with Valid Practice Test (Page 20)

Question 91

You are designing a data pipeline that involves ingesting data from multiple sources, performing data transformations using Spark, and storing the results in a data lake. How would you leverage the Cloudera Data Engineering service to ensure efficient and fault-tolerant execution?

A.Develop a single Spark job containing all transformation logic.
B.Design the pipeline with stages and steps, leveraging Spark operators for transformations and utilizing retries and error handling mechanisms.
C.Utilize separate Spark jobs for each data source and transformation step.
D.Implement custom logic within the YAML configuration file to manage data flow and error handling.

Question 92

When using Apache Airflow to schedule quality checks, which strategy helps ensure that checks are only run on the most recent data partition?

A.Use the LatestOnlyOperator to skip tasks that are not the latest in a series of executions.
B.Utilize the execution_date in your SQL query to target the latest partition.
C.Apply the depends_on_past=True parameter to the quality check task.
D.Implement a custom PythonOperator that queries the database for the latest partition.

Question 93

You have a DataFrame containing sales data with columns "product_id", "customer_id", and "amount". How can you efficiently calculate the total sales per customer?

A.Use a loop to iterate through the DataFrame and accumulate the sales for each customer.
B.Utilize Spark SQL's GROUP BY and SUM functions.
C.Implement a custom function to group and sum the sales.
D.Leverage Spark's machine learning library (MLlib) for aggregation.

Question 94

You're working with an ETL pipeline that extracts data from multiple sources. How can you ensure that the pipeline only processes the latest data and avoids re-processing already processed data?

A.Use timestamps or versioning information provided by the data sources to identify new data.
B.Implement a custom mechanism to track the last processed record for each source and filter data accordingly.
C.Configure the data sources to only provide new data by default.
D.Rely on Airflow's built-in mechanisms to handle data freshness automatically.
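
As a rough illustration of the timestamp-based idea in the options above, the sketch below filters a batch against a stored "last processed" timestamp (often called a watermark) and then advances it. The function name select_new_records, the updated_at field, and the in-memory watermark are all hypothetical; a real pipeline would persist this state per source between runs.

```python
from datetime import datetime, timezone


def select_new_records(records, last_processed):
    """Return (records newer than the watermark, advanced watermark)."""
    new_records = [r for r in records if r["updated_at"] > last_processed]
    # Advance the watermark only as far as the newest record actually seen.
    new_watermark = max(
        (r["updated_at"] for r in new_records), default=last_processed
    )
    return new_records, new_watermark


# Hypothetical state for one source; assumed to be loaded from wherever
# the pipeline stores it (for example a metadata table).
watermark = datetime(2025, 1, 2, tzinfo=timezone.utc)

batch = [
    {"id": 1, "updated_at": datetime(2025, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2025, 1, 2, tzinfo=timezone.utc)},
    {"id": 3, "updated_at": datetime(2025, 1, 3, tzinfo=timezone.utc)},
]

new_records, watermark = select_new_records(batch, watermark)
print([r["id"] for r in new_records])  # only id 3 is newer than the watermark
```

Running the same batch again returns nothing new, because the watermark has already moved past every record in it.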

Question 95

Which of the following commands is used to install PySpark in your development environment?

A.pip install pyspark
B.npm install pyspark
C.yarn add pyspark
D.brew install pyspark
