Free Access to Cloudera.CDP-3002.v2025-11-21.q109 with Valid Practice Test (Page 22)

Question 101

An Airflow DAG is designed to ingest data from multiple sources, transform it, and load it into a data warehouse. The transformation step is resource-intensive and should not run during peak hours (9 AM to 5 PM). How can you configure the DAG to meet this requirement?

A. Use a TimeSensor to delay the transformation task until off-peak hours.
B. Set the max_active_runs parameter to limit executions during peak hours.
C. Utilize the BranchPythonOperator to dynamically skip the transformation task during peak hours.
D. Configure the DAG's schedule interval and use the TimeDeltaSensor for precise timing.
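
The peak-hours rule in this question can be sketched as a plain Python predicate, of the kind a branching callable might evaluate. This is a minimal illustration that does not import Airflow: the function names `is_peak_hour` and `choose_branch` and the task ids `transform` and `skip_transform` are hypothetical, and treating 5 PM itself as off-peak is an assumption.

```python
from datetime import datetime, time

PEAK_START = time(9, 0)   # 9 AM, taken from the question
PEAK_END = time(17, 0)    # 5 PM, taken from the question


def is_peak_hour(now: datetime) -> bool:
    """Return True when `now` falls inside the peak window [9 AM, 5 PM)."""
    return PEAK_START <= now.time() < PEAK_END


def choose_branch(now: datetime) -> str:
    """Return a hypothetical task id to follow, as a branching callable would."""
    return "skip_transform" if is_peak_hour(now) else "transform"
```

In a real DAG the time passed in would normally come from the run's context rather than from the wall clock, so that reruns behave the same way.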

Question 102

What is the impact of setting the Spark configuration spark.sql.autoBroadcastJoinThreshold to -1?

A. It disables the broadcast join feature, forcing all joins to be shuffled joins.
B. It sets an unlimited threshold for broadcasting tables, which may cause out-of-memory errors.
C. It automatically selects the optimal threshold for broadcasting based on the cluster's current workload.
D. It increases the threshold for choosing which table to broadcast in a join, potentially improving join performance.
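
As a rough aid for reasoning about this setting, the size check behind automatic broadcasting can be modeled in a few lines of plain Python. This is a simplified sketch, not Spark's planner: the function name `would_auto_broadcast` is hypothetical, the 10 MB default is the documented default for `spark.sql.autoBroadcastJoinThreshold`, and explicit broadcast hints are a separate mechanism that this model does not cover.

```python
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024  # documented default (10 MB)


def would_auto_broadcast(table_size_bytes: int,
                         threshold_bytes: int = DEFAULT_THRESHOLD_BYTES) -> bool:
    """Simplified model: a table is auto-broadcast only if it fits the threshold.

    A negative threshold (for example -1) means no table size can qualify,
    so automatic broadcasting is effectively turned off.
    """
    if threshold_bytes < 0:
        return False
    return table_size_bytes <= threshold_bytes
```

In a PySpark session the value would typically be set with `spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)`.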

Question 103

You're working with a Spark application that processes streaming data in real-time. How does Spark handle persistence in this context?

A. Spark uses the same persistence mechanisms as batch processing for streaming data.
B. Streaming data cannot be persisted due to its continuous nature.
C. Spark leverages micro-batching and checkpointing for persistence in streaming applications.
D. Streaming data is automatically persisted to HDFS for historical analysis.

Question 104

What is the primary consideration when choosing the number of buckets in a Hive table?

A. The number of rows in each partition
B. The total storage capacity of the HDFS cluster
C. The expected distribution of data across the bucketing column
D. The number of nodes in the Hadoop cluster
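
To see why the distribution of values in the bucketing column matters, bucket assignment can be approximated as a hash of the column value modulo the bucket count. The sketch below assumes non-negative integer keys and uses the key itself as the hash. Hive's actual hash function depends on the column type and version, so this is an illustration only, and the name `bucket_counts` is hypothetical.

```python
from collections import Counter


def bucket_counts(keys, num_buckets):
    """Count how many rows land in each bucket under `key % num_buckets`."""
    counts = Counter(key % num_buckets for key in keys)
    return [counts.get(bucket, 0) for bucket in range(num_buckets)]


# Evenly spread keys fill the buckets evenly.
uniform_keys = list(range(1000))
print(bucket_counts(uniform_keys, 4))   # [250, 250, 250, 250]

# One dominant key value overloads a single bucket.
skewed_keys = [0] * 900 + list(range(1, 101))
print(bucket_counts(skewed_keys, 4))    # [925, 25, 25, 25]
```

With the skewed keys, adding more buckets does not help much, because every row with the dominant value still lands in the same bucket.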

Question 105

If you want to set a minimum and maximum number of Executor pods for a Spark application in Kubernetes, which pair of PySpark configuration settings would you use?

A. 'spark.executor.instances', 'spark.executor.cores'
B. 'spark.dynamicAllocation.minExecutors', 'spark.dynamicAllocation.maxExecutors'
C. 'spark.executor.memory', 'spark.executor.memoryOverhead'
D. 'spark.kubernetes.container.image', 'spark.kubernetes.executor.limit.cores'
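
For reference, executor bounds of this kind are usually passed as ordinary Spark properties. The sketch below builds such a property set in plain Python and turns it into `--conf` arguments. The helper names `dynamic_allocation_conf` and `to_submit_args` are hypothetical, and enabling `spark.dynamicAllocation.shuffleTracking.enabled` is an assumption for a Kubernetes deployment without an external shuffle service.

```python
def dynamic_allocation_conf(min_executors: int, max_executors: int) -> dict:
    """Build Spark properties that bound the number of executors."""
    if min_executors < 0 or max_executors < min_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        # Assumed here: shuffle tracking instead of an external shuffle service.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def to_submit_args(conf: dict) -> list:
    """Render the properties as a flat list of spark-submit `--conf` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args
```

The same key/value pairs could instead be applied one by one through `SparkSession.builder.config(key, value)` in a PySpark program.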
