Databricks.Associate-Developer-Apache-Spark-3.5.v2025-11-20.q72 Dumps

Question 61

A developer is working with a pandas DataFrame containing user behavior data from a web application.
Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?
A)
Use the applyInPandas API
B)
Use the mapInPandas API
C)
Use built-in Spark aggregation functions
D)
Use a scalar Pandas UDF
Correct Answer: A
The correct approach for performing a parallelized groupBy operation across Spark worker nodes with the pandas API is applyInPandas. This function enables grouped map operations with pandas logic in a distributed Spark environment: it applies a user-defined function to each group of data, represented as a pandas DataFrame.
As per the Databricks documentation:
"applyInPandas() allows for vectorized operations on grouped data in Spark. It applies a user-defined function to each group of a DataFrame and outputs a new DataFrame. This is the recommended approach for using Pandas logic across grouped data with parallel execution." Option A is correct and achieves this parallel execution.
Option B (mapInPandas) applies to the entire DataFrame, not grouped operations.
Option C uses built-in aggregation functions, which are efficient but not customizable with Pandas logic.
Option D creates a scalar Pandas UDF which does not perform a group-wise transformation.
Therefore, to run a groupBy with parallel Pandas logic on Spark workers, Option A using applyInPandas is the only correct answer.
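For illustration, a minimal sketch of the grouped-map pattern; the DataFrame, its user_id and session_duration columns, and the centering function are hypothetical stand-ins, not taken from the question:

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# Hypothetical user-behavior data for illustration.
df = spark.createDataFrame(
    [("u1", 3.0), ("u1", 5.0), ("u2", 7.0)],
    ["user_id", "session_duration"],
)

# The function receives each group as a pandas DataFrame and returns one.
def center_durations(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf["session_duration"] = pdf["session_duration"] - pdf["session_duration"].mean()
    return pdf

# Each user_id group is shipped to a worker and processed in parallel.
result = df.groupBy("user_id").applyInPandas(center_durations, schema=df.schema)
result.show()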

Question 62

Given:
spark.sparkContext.setLogLevel("<LOG_LEVEL>")
Which set contains only valid LOG_LEVEL settings for the Spark driver?

Correct Answer: B
The setLogLevel() method of SparkContext sets the logging level on the driver, which controls the verbosity of logs emitted during job execution. Supported levels are inherited from log4j and include the following:
ALL
DEBUG
ERROR
FATAL
INFO
OFF
TRACE
WARN
According to official Spark and Databricks documentation:
"Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, and WARN." Among the choices provided, only option B (ERROR, WARN, TRACE, OFF) includes four valid log levels and excludes invalid ones like "FAIL" or "NONE".
Reference: Apache Spark API docs, SparkContext.setLogLevel
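For illustration, a short sketch of setting the driver log level; any of the eight valid levels quoted above may be passed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Valid levels: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
spark.sparkContext.setLogLevel("ERROR")  # only ERROR and FATAL messages reach the driver logs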

Question 63

A data engineer needs to write a Streaming DataFrame as Parquet files.
Given the code:

Which code fragment should be inserted to meet the requirement?

Correct Answer: D
To write a structured streaming DataFrame to Parquet files, the correct way to specify the format and output directory is:
.writeStream
.format("parquet")
.option("path", "path/to/destination/dir")
According to Spark documentation:
"When writing to file-based sinks (like Parquet), you must specify the path using the .option("path", ...) method. Unlike batch writes, .save() is not supported." Option A incorrectly uses .option("location", ...) (invalid for Parquet sink).
Option B incorrectly sets the format via .option("format", ...), which is not the correct method.
Option C repeats the same issue.
Option D is correct: .format("parquet") + .option("path", ...) is the required syntax.
Final answer: D
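A hedged end-to-end sketch of this pattern; the built-in "rate" source, the output path, and the checkpoint location are illustrative stand-ins, not taken from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Built-in "rate" source used as a stand-in streaming input.
streaming_df = spark.readStream.format("rate").load()

query = (
    streaming_df.writeStream
    .format("parquet")                                   # file sink format
    .option("path", "path/to/destination/dir")           # required for file-based sinks
    .option("checkpointLocation", "path/to/checkpoint")  # required for any streaming write
    .start()
)

Note that streaming writes are launched with .start() rather than .save(), and a checkpoint location must be supplied for fault tolerance.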

Question 64

A data scientist at a financial services company is working with a Spark DataFrame containing transaction records. The DataFrame has millions of rows and includes columns for transaction_id, account_number, transaction_amount, and timestamp. Due to an issue with the source system, some transactions were accidentally recorded multiple times with identical information across all fields. The data scientist needs to remove duplicate rows that are identical across all fields to ensure accurate financial reporting.
Which approach should the data scientist use to deduplicate the transactions using PySpark?

Correct Answer: A
dropDuplicates() with no column list removes duplicates based on all columns.
It's the most efficient and semantically correct way to deduplicate records that are completely identical across all fields.
From the PySpark documentation:
dropDuplicates(): Return a new DataFrame with duplicate rows removed, considering all columns if none are specified.
- Source: PySpark DataFrame.dropDuplicates() API
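A minimal sketch with made-up rows (the second row is an exact duplicate of the first):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "ACC-1", 100.0, "2025-01-01 10:00:00"),
     (1, "ACC-1", 100.0, "2025-01-01 10:00:00"),   # accidental duplicate
     (2, "ACC-2", 250.0, "2025-01-01 11:00:00")],
    ["transaction_id", "account_number", "transaction_amount", "timestamp"],
)

# With no column list, dropDuplicates() compares all columns.
deduped = df.dropDuplicates()
deduped.show()  # the duplicated transaction now appears only once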

Question 65

A data engineer has been asked to produce a Parquet table which is overwritten every day with the latest data. The downstream consumer of this Parquet table has a hard requirement that the data in this table is produced with all records sorted by the market_time field.
Which line of Spark code will produce a Parquet table that meets these requirements?

Correct Answer: D
To ensure that data written out to disk is sorted, it is important to consider how Spark writes data when saving to Parquet tables. The methods .sort() and .orderBy() apply a global sort, but they do not guarantee that the sorting will persist in the final output files unless certain conditions are met (e.g., a single partition via .coalesce(1), which is not scalable).
Instead, the proper method in distributed Spark processing to ensure rows are sorted within their respective partitions when written out is:
.sortWithinPartitions("column_name")
According to Apache Spark documentation:
"sortWithinPartitions() ensures each partition is sorted by the specified columns. This is useful for downstream systems that require sorted files." This method works efficiently in distributed settings, avoids the performance bottleneck of global sorting (as in .orderBy() or .sort()), and guarantees each output partition has sorted records - which meets the requirement of consistently sorted data.
Thus:
Options A and B do not guarantee that the persisted file contents are sorted.
Option C introduces a bottleneck via .coalesce(1) (a single partition).
Option D correctly applies sorting within partitions and is scalable.
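A minimal sketch of the daily overwrite, assuming a hypothetical transactions DataFrame; the output path is illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical transaction rows for illustration.
df = spark.createDataFrame(
    [("t2", "2025-01-01 10:05:00"), ("t1", "2025-01-01 09:00:00")],
    ["transaction_id", "market_time"],
)

# Sort rows inside each partition, then overwrite the Parquet table.
(df.sortWithinPartitions("market_time")
   .write
   .mode("overwrite")
   .parquet("path/to/parquet_table"))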