Databricks Certification: Associate-Developer-Apache-Spark-3.5 Exam Dumps (v2025-11-20, q72)

Question 56

A data scientist is working with a Spark DataFrame called customerDF that contains customer information. The DataFrame has a column named email with customer email addresses. The data scientist needs to split this column into username and domain parts.
Which code snippet splits the email column into username and domain columns?

Correct Answer: B
Option B is the correct and idiomatic approach in PySpark to split a string column (like email) based on a delimiter such as "@".
The split(col("email"), "@") function returns an array with two elements: username and domain.
getItem(0) retrieves the first part (username).
getItem(1) retrieves the second part (domain).
withColumn() is used to create new columns from the extracted values.
Example from official Databricks Spark documentation on splitting columns:
from pyspark.sql.functions import split, col
df.withColumn("username", split(col("email"), "@").getItem(0)) \
    .withColumn("domain", split(col("email"), "@").getItem(1))
Why other options are incorrect:
Option A uses fixed substring indices (substr(0, 5)), which won't correctly extract usernames and domains of varying lengths.
Option C uses substring_index, which works but is less idiomatic for splitting emails and slightly less readable.
Option D removes "@" from the email entirely, losing the separation between username and domain and duplicating values across both fields.
Therefore, Option B is the most accurate and reliable solution according to Apache Spark 3.5 best practices.
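As a quick sanity check, here is a minimal runnable sketch of the Option B approach; the SparkSession setup and sample rows are illustrative additions, not part of the original question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.appName("email-split").getOrCreate()

# Illustrative stand-in for the question's customerDF
customerDF = spark.createDataFrame(
    [("alice@example.com",), ("bob@corp.io",)], ["email"]
)

# split() yields an array column; getItem(0)/getItem(1) pull out the parts
result = (customerDF
    .withColumn("username", split(col("email"), "@").getItem(0))
    .withColumn("domain", split(col("email"), "@").getItem(1)))
result.show(truncate=False)

Note that split() treats its second argument as a regular expression; a literal "@" is safe here because "@" has no special regex meaning.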

Question 57

A developer needs to produce a Python dictionary using data stored in a small Parquet table with two columns, region and region_id.

The resulting Python dictionary must contain a mapping of region -> region_id for the smallest 3 region_id values.
Which code fragment meets the requirements?
A)

B)

C)

D)


Correct Answer: A
The question requires creating a dictionary where keys are region values and values are the corresponding region_id integers, restricted to the smallest 3 region_id values.
Key observations:
select('region', 'region_id') puts the columns in the order expected by dict(): the first column becomes the key and the second the value.
sort('region_id') sorts in ascending order, so the smallest IDs come first.
take(3) retrieves exactly 3 rows.
Wrapping the result in dict(...) builds the required Python dictionary: {'AFRICA': 0, 'AMERICA': 1, 'ASIA': 2}.
Incorrect options:
Option B flips the order to region_id first, resulting in a dictionary with integer keys - not what's asked.
Option C uses .limit(3) without sorting, which leads to non-deterministic rows depending on partition layout.
Option D sorts in descending order, giving the largest rather than the smallest region_ids.
Hence, Option A meets all the requirements precisely.
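For concreteness, a minimal runnable sketch of the Option A pattern, using in-memory rows in place of the question's Parquet table (the sample data and session setup are assumptions for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("region-dict").getOrCreate()

# Stand-in for the small Parquet table from the question
df = spark.createDataFrame(
    [("AFRICA", 0), ("AMERICA", 1), ("ASIA", 2), ("EUROPE", 3)],
    ["region", "region_id"],
)

# Each Row behaves like a (key, value) tuple, so dict() can consume take(3) directly
result = dict(df.select("region", "region_id").sort("region_id").take(3))
print(result)  # {'AFRICA': 0, 'AMERICA': 1, 'ASIA': 2}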

Question 58

A Spark engineer must select an appropriate deployment mode for the Spark jobs.
What is the benefit of using cluster mode in Apache Spark™?

Correct Answer: D
In Apache Spark's cluster mode:
"The driver program runs on the cluster's worker node instead of the client's local machine. This allows the driver to be close to the data and other executors, reducing network overhead and improving fault tolerance for production jobs." (Source: Apache Spark documentation - Cluster Mode Overview)
"The driver program runs on the cluster's worker node instead of the client's local machine. This allows the driver to be close to the data and other executors, reducing network overhead and improving fault tolerance for production jobs." (Source: Apache Spark documentation - Cluster Mode Overview) This deployment is ideal for production environments where the job is submitted from a gateway node, and Spark manages the driver lifecycle on the cluster itself.
Option A is partially true but less specific than D.
Option B is incorrect: the driver never executes all tasks; executors handle distributed tasks.
Option C describes client mode, not cluster mode.
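Deploy mode is chosen when the application is submitted. A hedged illustration of submitting in cluster mode; the master and the application file name are placeholders, not anything from the question:

# Driver runs on a worker node inside the cluster:
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  my_spark_job.py
# With --deploy-mode client (the default), the driver would instead run
# on the submitting machine, as described in Option C.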

Question 59

An engineer has a large ORC file located at /file/test_data.orc and wants to read only specific columns to reduce memory usage.
Which code fragment selects only the columns col1 and col2 during the reading process?

Correct Answer: D
The correct way to load specific columns from an ORC file is to load the file with .load() first and then apply .select() on the resulting DataFrame; this works with .read.format("orc") as well as the shortcut .read.orc(). Option D does exactly that:
df = spark.read.format("orc").load("/file/test_data.orc").select("col1", "col2")
Why the other options are incorrect:
Option A performs the selection after filtering, which does not match the intention of minimizing memory during the read.
Option B tries to call .select() before .load(), which is invalid.
Option C uses a non-existent .selected() method.
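A minimal sketch under the assumption that an ORC file actually exists at the question's path (the session setup is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-column-pruning").getOrCreate()

# Load the ORC file, then keep only the two needed columns;
# Spark prunes the unselected columns when scanning the file.
df = (spark.read.format("orc")
    .load("/file/test_data.orc")
    .select("col1", "col2"))
df.printSchema()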

Question 60

A data engineer is working with a DataFrame named num_df and has a Python UDF defined as:
def cube_func(val):
    return val * val * val
Which code fragment registers and uses this UDF as a Spark SQL function to work with the DataFrame num_df?

Correct Answer: A
To use a Python function as a UDF (User Defined Function) in Spark SQL, it must first be registered using spark.udf.register().
Correct usage:
spark.udf.register("cube_func", cube_func)
num_df.selectExpr("cube_func(num)").show()
This registers cube_func as a callable SQL function available in expressions or queries.
Why the other options are incorrect:
Option B: the function must be registered (or wrapped with udf()) before it can be used in a SQL expression; calling a plain Python function on a column won't work.
Option C: createDataFrame() builds DataFrames; it does not call or register UDFs.
Option D: a DataFrame cannot register UDFs directly; registration happens through spark.udf.
Reference:
PySpark SQL Functions - spark.udf.register() and selectExpr().
Databricks Exam Guide (June 2025): Section "Using Spark SQL" - user-defined functions and Spark SQL integration.
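A minimal runnable sketch of the registration pattern; the sample rows and the explicit LongType return type are illustrative additions (spark.udf.register defaults to a string return type when none is given):

from pyspark.sql import SparkSession
from pyspark.sql.types import LongType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()
num_df = spark.createDataFrame([(1,), (2,), (3,)], ["num"])

def cube_func(val):
    return val * val * val

# Register the Python function as a SQL function named "cube_func"
spark.udf.register("cube_func", cube_func, LongType())

# Once registered, it is callable inside SQL expressions
num_df.selectExpr("num", "cube_func(num) AS cubed").show()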