Databricks Certification - Associate-Developer-Apache-Spark-3.5 Exam
Databricks.Associate-Developer-Apache-Spark-3.5.v2025-11-20.q72 Dumps
  • «
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • …
  • »
  • »»
Download Now

Question 11

A data scientist is analyzing a large dataset and has written a PySpark script that includes several transformations and actions on a DataFrame. The script ends with a collect() action to retrieve the results.
How does Apache Spark™'s execution hierarchy process the operations when the data scientist runs this script?

Correct Answer: C
In Apache Spark, the execution hierarchy is structured as follows:
Application: The highest-level unit, representing the user program built on Spark.
Job: Triggered by an action (e.g., collect(), count()). Each action corresponds to a job.
Stage: A job is divided into stages based on shuffle boundaries. Each stage contains tasks that can be executed in parallel.
Task: The smallest unit of work, representing a single operation applied to a partition of the data.
When the collect() action is invoked, Spark initiates a job. This job is then divided into stages at points where data shuffling is required (i.e., wide transformations). Each stage comprises tasks that are distributed across the cluster's executors, operating on individual data partitions.
This hierarchical execution model allows Spark to efficiently process large-scale data by parallelizing tasks and optimizing resource utilization.
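A minimal sketch of how this hierarchy maps to code is shown below; the SparkSession setup, the column names, and the modulo bucketing are illustrative assumptions, not part of the original question.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("execution-hierarchy-demo").getOrCreate()  # one Spark application

df = spark.range(1_000_000)                       # narrow transformation: no shuffle, stays in one stage
df = df.withColumn("bucket", F.col("id") % 10)    # still narrow: same stage
counts = df.groupBy("bucket").count()             # wide transformation: shuffle boundary -> new stage
result = counts.collect()                         # action: triggers one job, split into stages and tasks

In this sketch, collect() is the only point at which a job appears in the Spark UI, and the groupBy() introduces the single shuffle boundary that splits that job into two stages.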

Question 12

A developer runs:

What is the result?
Options:

Correct Answer: D
The partitionBy() method in Spark organizes output into subdirectories based on unique combinations of the specified columns:
e.g.
/path/to/output/color=red/fruit=apple/part-0000.parquet
/path/to/output/color=green/fruit=banana/part-0001.parquet
This improves query performance via partition pruning.
It does not consolidate into a single file.
Null values are allowed in partitions.
It does not "append" unless .mode("append") is used.
Reference: Spark Write with Partitioning
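As a hedged illustration of this write pattern (the DataFrame df, the color and fruit columns, and the output path are assumptions taken from the example paths above):

df.write \
  .mode("overwrite") \
  .partitionBy("color", "fruit") \
  .parquet("/path/to/output")   # one subdirectory per color=.../fruit=... combination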

Question 13

A data engineer has been asked to produce a Parquet table which is overwritten every day with the latest data.
The downstream consumer of this Parquet table has a hard requirement that the data in this table is produced with all records sorted by the market_time field.
Which line of Spark code will produce a Parquet table that meets these requirements?

Correct Answer: D
To ensure that data written out to disk is sorted, it is important to consider how Spark writes data when saving to Parquet tables. The methods .sort() or .orderBy() apply a global sort but do not guarantee that the sorting will persist in the final output files unless certain conditions are met (e.g. a single partition via .coalesce(1), which is not scalable).
Instead, the proper method in distributed Spark processing to ensure rows are sorted within their respective partitions when written out is:
sortWithinPartitions("column_name")
According to Apache Spark documentation:
"sortWithinPartitions()ensures each partition is sorted by the specified columns. This is useful for downstream systems that require sorted files." This method works efficiently in distributed settings, avoids the performance bottleneck of global sorting (as in.orderBy()or.sort()), and guarantees each output partition has sorted records - which meets the requirement of consistently sorted data.
Thus:
Option A and B do not guarantee the persisted file contents are sorted.
Option C introduces a bottleneck via .coalesce(1) (a single partition).
Option D correctly applies sorting within partitions and is scalable.
Reference: Databricks & Apache Spark 3.5 Documentation - DataFrame API - sortWithinPartitions()
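A minimal sketch of option D's approach, assuming a DataFrame df that contains the market_time column and a hypothetical output path:

df.sortWithinPartitions("market_time") \
  .write \
  .mode("overwrite") \
  .parquet("/path/to/market_data")   # each output file is sorted by market_time within its partition

The .mode("overwrite") matches the requirement that the table be overwritten every day with the latest data.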

Question 14

Given the code fragment:
import pyspark.pandas as ps
pdf = ps.DataFrame(data)
Which method is used to convert a Pandas API on Spark DataFrame (pyspark.pandas.DataFrame) into a standard PySpark DataFrame (pyspark.sql.DataFrame)?

Correct Answer: B
In Pandas API on Spark (previously Koalas), the method .to_spark() converts a pyspark.pandas.DataFrame into a PySpark DataFrame.
Correct usage:
spark_df = pdf.to_spark()
This enables interoperability between the Pandas API on Spark and the PySpark SQL API, allowing developers to switch seamlessly between both for transformations or performance optimization.
Why the other options are incorrect:
A (to_pandas): Converts to a local Pandas DataFrame, not a PySpark DataFrame.
C (to_dataframe): Not a valid API method.
D (spark): Not an existing DataFrame method.
Reference:
PySpark Pandas API Reference - DataFrame.to_spark() method.
Databricks Exam Guide (June 2025): Section "Using Pandas API on Apache Spark" - covers DataFrame conversions and interoperability.
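A self-contained sketch of the conversion; the sample data dictionary is hypothetical, since the question leaves the contents of data unspecified:

import pyspark.pandas as ps

data = {"id": [1, 2, 3], "value": ["a", "b", "c"]}   # hypothetical sample data
pdf = ps.DataFrame(data)                             # pyspark.pandas.DataFrame
spark_df = pdf.to_spark()                            # convert to pyspark.sql.DataFrame
spark_df.printSchema()                               # now usable with the PySpark SQL API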

Question 15

A data engineer observes that the upstream streaming source feeds the event table frequently and sends duplicate records. Upon analyzing the current production table, the data engineer found that the time difference in the event_timestamp column of the duplicate records is, at most, 30 minutes.
To remove the duplicates, the engineer adds the code:
df = df.withWatermark("event_timestamp", "30 minutes")
What is the result?

Correct Answer: C
In Structured Streaming, a watermark defines the maximum delay for event-time data to be considered in stateful operations like deduplication or window aggregations.
Behavior:
df = df.withWatermark("event_timestamp", "30 minutes")
This sets a 30-minute watermark, meaning Spark will only keep track of events that arrive within 30 minutes of the latest event time seen so far. When used with:
df.dropDuplicates(["event_id", "event_timestamp"])
Spark removes duplicates that arrive within the watermark threshold (in this case, within 30 minutes).
Why other options are incorrect:
A: Watermarks do not remove all duplicates; they only manage those within the defined event-time window.
B: Watermark durations can be expressed as strings like "30 minutes", "10 seconds", etc., not only seconds.
D: Structured Streaming supports deduplication using withWatermark() and dropDuplicates().
Reference (Databricks Apache Spark 3.5 - Python / Study Guide):
PySpark Structured Streaming Guide - withWatermark() and dropDuplicates() methods for event-time deduplication.
Databricks Certified Associate Developer for Apache Spark Exam Guide (June 2025): Section "Structured Streaming" - Topic: Streaming Deduplication with and without watermark usage.
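A hedged end-to-end sketch, assuming a streaming DataFrame df with event_id and event_timestamp columns and a console sink chosen purely for illustration:

deduped = (df
           .withWatermark("event_timestamp", "30 minutes")       # bound dedup state to 30 minutes of event time
           .dropDuplicates(["event_id", "event_timestamp"]))     # drop duplicates arriving within the watermark

query = (deduped.writeStream
         .format("console")          # illustrative sink; any supported sink works
         .outputMode("append")
         .start())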
  • «
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • …
  • »
  • »»