You're debugging a slow-running Spark job that writes a large Iceberg table. Which optimization techniques could improve performance? (Choose three.)
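By way of illustration, here is a minimal PySpark sketch of three such techniques: enabling adaptive query execution, setting a hash write-distribution mode (with an explicit target file size) to avoid many small files, and sorting within partitions before the write so Iceberg's file-level statistics can prune more data at read time. The catalog and table names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-write-tuning")
    # (1) Adaptive query execution coalesces shuffle partitions at runtime.
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)

# (2) Hash write distribution clusters rows per partition so each write
# task produces fewer, larger files; the file-size target is an explicit knob.
spark.sql("""
    ALTER TABLE demo.db.events SET TBLPROPERTIES (
        'write.distribution-mode' = 'hash',
        'write.target-file-size-bytes' = '536870912'
    )
""")

# (3) Sorting within partitions clusters values so Iceberg's column
# statistics can skip whole data files at query time.
df = spark.table("demo.db.staging_events")
df.sortWithinPartitions("event_date").writeTo("demo.db.events").append()
```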
In the context of big data processing, what is a potential downside of relying heavily on schema inference?
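As a concrete illustration, the PySpark sketch below contrasts inference with an explicit schema; the file path and column names are hypothetical. Inference costs an extra scan of the input and can guess types wrong, e.g. reading a zero-padded ID column as an integer and silently dropping the leading zeros.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-inference-demo").getOrCreate()

# Inference triggers an extra pass over the data and may mistype columns
# (an ID like "00042" can come back as the integer 42).
inferred = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/data/orders.csv")
)

# Declaring the schema avoids the extra scan and pins the types explicitly.
schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("amount_cents", LongType(), nullable=True),
])
explicit = spark.read.option("header", "true").schema(schema).csv("/data/orders.csv")
```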
What role does Spark's Catalyst optimizer play in optimizing join operations, specifically in selecting join strategies?
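One place Catalyst's strategy choice becomes visible is the broadcast threshold and join hints, sketched below with hypothetical table names: inputs under the threshold become broadcast hash joins, while larger ones fall back to sort-merge joins.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("catalyst-join-demo").getOrCreate()

# Catalyst chooses a physical join strategy when planning the logical join;
# tables smaller than this threshold are broadcast to every executor.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))  # 10 MB

orders = spark.table("demo.db.orders")
countries = spark.table("demo.db.countries")  # small dimension table

# An explicit hint overrides size estimates when statistics are missing.
joined = orders.join(broadcast(countries), "country_code")
joined.explain()  # the physical plan should show BroadcastHashJoin
```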
You're building an Airflow ETL pipeline that involves data validation checks. How can you integrate these checks into the pipeline and handle potential failures?
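A minimal sketch of one common pattern, assuming Airflow 2.x: a dedicated validation task sits between extract and load and raises AirflowFailException so the load never runs on bad data, with a failure callback for alerting. The row-count check and callback body are placeholders.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.exceptions import AirflowFailException
from airflow.operators.python import PythonOperator


def validate_rows(**context):
    row_count = 0  # placeholder: fetch a real count from the staging store
    if row_count <= 0:
        # Fails immediately without retrying and blocks the downstream load.
        raise AirflowFailException("Validation failed: staging table is empty")


def alert_on_failure(context):
    # Placeholder: send a Slack message or page here.
    print(f"Task {context['task_instance'].task_id} failed")


with DAG(
    dag_id="etl_with_validation",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: None)
    validate = PythonOperator(
        task_id="validate",
        python_callable=validate_rows,
        on_failure_callback=alert_on_failure,
    )
    load = PythonOperator(task_id="load", python_callable=lambda: None)

    extract >> validate >> load
```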
Which tool or API is primarily used for monitoring and inspecting the performance of Spark applications in real time?
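The Spark Web UI (served on port 4040 by default while an application runs) is the usual answer; the same live job, stage, and executor data are available programmatically, as the sketch below shows. The localhost URL assumes the driver runs on the same machine.

```python
import json
import urllib.request

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("monitoring-demo").getOrCreate()

# SparkContext.statusTracker() exposes live job/stage information in-process.
tracker = spark.sparkContext.statusTracker()
print("active stages:", tracker.getActiveStageIds())

# The Web UI serves the same data over a REST API while the app is running.
app_id = spark.sparkContext.applicationId
url = f"http://localhost:4040/api/v1/applications/{app_id}/jobs"
with urllib.request.urlopen(url) as resp:
    print(json.loads(resp.read()))
```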