Free Access to Databricks.Associate-Developer-Apache-Spark-3.5.v2025-11-20.q72 with Valid Practice Test (Page 11)

Question 46

Given the code:

df = spark.read.csv("large_dataset.csv")
filtered_df = df.filter(col("error_column").contains("error"))
mapped_df = filtered_df.select(split(col("timestamp"), " ").getItem(0).alias("date"), lit(1).alias("count"))
reduced_df = mapped_df.groupBy("date").sum("count")
reduced_df.count()
reduced_df.show()

At which point will Spark actually begin processing the data?

A.When the filter transformation is applied
B.When the count action is applied
C.When the groupBy transformation is applied
D.When the show action is applied
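
As a rough analogy only, the distinction between transformations and actions that this question relies on can be sketched in plain Python with generators. This is not Spark code and not how Spark is implemented; the sample rows, the `log` list and the `read_rows` helper are hypothetical names introduced purely for illustration.

```python
# Plain-Python analogy for lazy pipelines (NOT Spark; all names are hypothetical).
log = []  # records every time a row is actually read from the "source"

def read_rows():
    """Pretend data source: yields rows one at a time and logs each read."""
    sample = [
        "2025-01-01 10:00 error",
        "2025-01-01 11:00 ok",
        "2025-01-02 09:00 error",
    ]
    for row in sample:
        log.append("read")
        yield row

# "Transformations": each step only describes work, nothing is read yet.
rows = read_rows()
errors = (r for r in rows if "error" in r)
dates = (r.split(" ")[0] for r in errors)

work_before_action = len(log)  # still zero reads at this point

# "Action": consuming the pipeline forces every earlier step to run.
n = sum(1 for _ in dates)

print(work_before_action, n, len(log))
```

In this sketch no row is read while the pipeline is being described; the reads happen only once a result is demanded.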

Question 47

A data engineer is building an Apache Spark™ Structured Streaming application to process a stream of JSON events in real time. The engineer wants the application to be fault-tolerant and resume processing from the last successfully processed record in case of a failure. To achieve this, the data engineer decides to implement checkpoints.
Which code snippet should the data engineer use?

A.query = streaming_df.writeStream \
.format("console") \
.option("checkpoint", "/path/to/checkpoint") \
.outputMode("append") \
.start()
B.query = streaming_df.writeStream \
.format("console") \
.outputMode("append") \
.option("checkpointLocation", "/path/to/checkpoint") \
.start()
C.query = streaming_df.writeStream \
.format("console") \
.outputMode("complete") \
.start()
D.query = streaming_df.writeStream \
.format("console") \
.outputMode("append") \
.start()

Question 48

A data engineer is running a Spark job to process a dataset of 1 TB stored in distributed storage. The cluster has 10 nodes, each with 16 CPUs. Spark UI shows:
Low number of Active Tasks
Many tasks complete in milliseconds
Fewer tasks than available CPUs
Which approach should be used to adjust the partitioning for optimal resource allocation?

A.Set the number of partitions equal to the total number of CPUs in the cluster
B.Set the number of partitions to a fixed value, such as 200
C.Set the number of partitions equal to the number of nodes in the cluster
D.Set the number of partitions by dividing the dataset size (1 TB) by a reasonable partition size, such as 128 MB
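
The arithmetic described in option D can be worked through in a few lines. This is a minimal sketch that assumes binary units (1 TB taken as 1024^4 bytes, 128 MB as 128 × 1024^2 bytes); the variable names are illustrative and the figures come from the question itself.

```python
import math

# Figures from the question; binary units are an assumption of this sketch.
dataset_bytes = 1 * 1024**4            # 1 TB
target_partition_bytes = 128 * 1024**2  # 128 MB per partition

# Partition count obtained by dividing dataset size by target partition size.
num_partitions = math.ceil(dataset_bytes / target_partition_bytes)

# Cluster capacity from the question: 10 nodes with 16 CPUs each.
total_cores = 10 * 16

# Rough number of task "waves" if every core runs one task at a time.
waves = num_partitions / total_cores

print(num_partitions, total_cores, waves)
```

Under these assumptions the division gives 8192 partitions against 160 cores, so each core would work through roughly 51 tasks in turn.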

Question 49

What is the benefit of using Pandas on Spark for data transformations?
Options:

A.It is available only with Python, thereby reducing the learning curve.
B.It computes results immediately using eager execution, making it simple to use.
C.It runs on a single node only, utilizing the memory with memory-bound DataFrames and hence cost-efficient.
D.It executes queries faster using all the available cores in the cluster as well as provides Pandas's rich set of features.

Question 50

A developer notices that all the post-shuffle partitions in a dataset are smaller than the value set for spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold.
Which type of join will Adaptive Query Execution (AQE) choose in this case?

A.A Cartesian join
B.A shuffled hash join
C.A broadcast nested loop join
D.A sort-merge join
