Free Access to Databricks.Associate-Developer-Apache-Spark-3.5.v2025-11-20.q72 with Valid Practice Test (Page 10)

Question 41

Given the schema:

event_ts TIMESTAMP,
sensor_id STRING,
metric_value LONG,
ingest_ts TIMESTAMP,
source_file_path STRING
The goal is to deduplicate based on: event_ts, sensor_id, and metric_value.
Options:

A.dropDuplicates on all columns (wrong criteria)
B.dropDuplicates with no arguments (removes based on all columns)
C.groupBy without aggregation (invalid use)
D.dropDuplicates on the exact matching fields

Question 42

A Spark developer wants to improve the performance of an existing PySpark UDF that runs a hash function that is not available in the standard Spark functions library. The existing UDF code is:

import hashlib
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType
def shake_256(raw):
    return hashlib.shake_256(raw.encode()).hexdigest(20)
shake_256_udf = sf.udf(shake_256, StringType())
The developer wants to replace this existing UDF with a Pandas UDF to improve performance. The developer changes the definition of shake_256_udf to this:
shake_256_udf = sf.pandas_udf(shake_256, StringType())
However, the developer then receives an error.
What should the signature of the shake_256() function be changed to in order to fix this error?

A.def shake_256(df: pd.Series) -> str:
B.def shake_256(df: Iterator[pd.Series]) -> Iterator[pd.Series]:
C.def shake_256(raw: str) -> str:
D.def shake_256(df: pd.Series) -> pd.Series:
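
For reference, a scalar Pandas UDF receives a whole batch of values as a pandas Series and is expected to return a Series of the same length. The sketch below shows one way the hashing function could be written in that shape. It is called directly on a pandas Series here, without Spark, and it assumes every input value is a non-null string; the sample inputs are made up.

```python
import hashlib

import pandas as pd


def shake_256(raw: pd.Series) -> pd.Series:
    # Apply the hash element-wise; the result has one digest per input value.
    # Assumption: no nulls in the batch (None has no .encode()).
    return raw.map(lambda value: hashlib.shake_256(value.encode()).hexdigest(20))


# Hypothetical sample batch, standing in for what Spark would pass in.
digests = shake_256(pd.Series(["alpha", "beta"]))

# In Spark, the same function would be wrapped as in the question:
#   shake_256_udf = sf.pandas_udf(shake_256, StringType())
```

hexdigest(20) yields 20 bytes, so each result is a 40-character hexadecimal string.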

Question 43

A data scientist has been investigating user profile data to build features for their model. After some exploratory data analysis, the data scientist identified that some records in the user profiles contain NULL values in too many fields to be useful.
The schema of the user profile table looks like this:
user_id STRING,
username STRING,
date_of_birth DATE,
country STRING,
created_at TIMESTAMP
The data scientist decided that if any record contains a NULL value in any field, they want to remove that record from the output before further processing.
Which block of Spark code can be used to achieve these requirements?

A.filtered_users = raw_users.na.drop("any")
B.filtered_users = raw_users.na.drop("all")
C.filtered_users = raw_users.dropna(how="any")
D.filtered_users = raw_users.dropna(how="all")

Question 44

An organization has been running a Spark application in production and is considering disabling the Spark History Server to reduce resource usage.
What will be the impact of disabling the Spark History Server in production?

A.Prevention of driver log accumulation during long-running jobs
B.Improved job execution speed due to reduced logging overhead
C.Loss of access to past job logs and reduced debugging capability for completed jobs
D.Enhanced executor performance due to reduced log size

Question 45

A Spark engineer is troubleshooting a Spark application that has been encountering out-of-memory errors during execution. By reviewing the Spark driver logs, the engineer notices multiple "GC overhead limit exceeded" messages.
Which action should the engineer take to resolve this issue?

A.Optimize the data processing logic by repartitioning the DataFrame.
B.Modify the Spark configuration to disable garbage collection
C.Increase the memory allocated to the Spark Driver.
D.Cache large DataFrames to persist them in memory.
