You're working with a DataFrame containing customer data, including a "purchase_date" column. How can you calculate the average purchase amount per month for the past year?
You're working with a complex data pipeline involving Spark and Hive, and you need to monitor its performance and identify potential bottlenecks. Which tools and techniques can you employ for effective monitoring?
When you cache data through Spark's DataFrame API, what implicit optimization does Spark perform to improve processing efficiency?
Consider a PySpark application that counts the rows in a CSV file. The main application file is 'count_rows.py', and it uses a custom library, 'data_utils.py'. After packaging, which 'spark-submit' command correctly submits this job together with the custom library?
In a distributed environment like Apache Spark, under what conditions might a sort merge join perform more efficiently than a broadcast join?