An Airflow DAG is designed to ingest data from multiple sources, transform it, and load it into a data warehouse. The transformation step is resource-intensive and should not run during peak hours (9 AM to 5 PM). How can you configure the DAG to meet this requirement?
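One possible approach to the question above is to gate the transformation task on a time check. The sketch below is a minimal illustration under stated assumptions, not a definitive answer: the helper name `is_off_peak` and the two constants are hypothetical, and the peak window is assumed to be in the scheduler's local time. In Airflow, a callable like this could be passed to a `ShortCircuitOperator` placed ahead of the transformation task, so that downstream tasks are skipped whenever it returns False.

```python
from datetime import datetime, time

# Peak window taken from the question: 9 AM to 5 PM (assumed local time).
PEAK_START = time(9, 0)
PEAK_END = time(17, 0)


def is_off_peak(now: datetime) -> bool:
    """Return True when `now` falls outside the peak window [09:00, 17:00)."""
    return not (PEAK_START <= now.time() < PEAK_END)
```

Treating 17:00 itself as off-peak is a design choice made here for illustration; a stricter policy could make the upper bound inclusive.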
What is the impact of setting the Spark configuration spark.sql.autoBroadcastJoinThreshold to -1?
You're working with a Spark application that processes streaming data in real time. How does Spark handle persistence in this context?
What is the primary consideration when choosing the number of buckets in a Hive table?
If you want to set a minimum and maximum number of executor pods for a Spark application running on Kubernetes, which pair of PySpark configuration settings would you use?
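As a hedged illustration of the kind of settings this question refers to, the sketch below collects Spark's dynamic allocation properties in a plain mapping and renders them as `spark-submit` arguments. The bounds 2 and 10, the variable name `executor_bounds`, and the helper `to_submit_args` are assumptions made for the example. Shuffle tracking is included on the assumption that the cluster has no external shuffle service, which is the usual situation on Kubernetes.

```python
# Illustrative values only; the bounds 2 and 10 are assumptions, not recommendations.
executor_bounds = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "10",
}


def to_submit_args(conf: dict) -> list:
    """Render a config mapping as spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args
```

In PySpark the same key/value pairs could instead be applied one at a time with `SparkSession.builder.config(key, value)` before the session is created.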