How to cancel pyspark foreachPartition operation
How can I cancel a long pyspark foreachPartition operation?
For example, my code processes a very large amount of data (which takes a long time), but I want to let the user cancel the operation. How do I do it?
    def get_data(self, spark_session):
        query = 'Some query...'
        my_data_frame = spark_session.sql(query)
        my_data_frame.foreachPartition(handle_data)
        # How to cancel on user request?
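It can be done using job groups: tag the job with setJobGroup in the thread that launches it, then call cancelJobGroup with the same group ID from another thread.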
    sc = spark_session.sparkContext
    sc.setJobGroup(...)

    # In a separate thread:
    sc.cancelJobGroup(...)
There is a full example in the PySpark API documentation for SparkContext.setJobGroup.
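To make the job-group approach concrete, here is a minimal sketch of how it might be wired up. The `CancellableJob` class, its method names, and the `job-<uuid>` group ID scheme are all hypothetical and not part of PySpark; the only Spark calls used are `sparkContext.setJobGroup`, `sparkContext.cancelJobGroup`, `sql`, and `foreachPartition`. It assumes a plain `threading.Thread` is enough to carry the job group; depending on your Spark version, the PySpark docs suggest `pyspark.InheritableThread` instead.

```python
import threading
import uuid


class CancellableJob:
    """Hypothetical helper: runs foreachPartition in a worker thread
    under its own Spark job group so another thread can cancel it."""

    def __init__(self, spark_session, query, handle_partition):
        self._spark = spark_session
        self._sc = spark_session.sparkContext
        self._query = query
        self._handle_partition = handle_partition
        # Hypothetical naming scheme; any unique string works as a group ID.
        self.group_id = "job-" + uuid.uuid4().hex
        self.error = None  # exception raised by the Spark action, if any
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        try:
            # The job group is per-thread, so set it in the same thread
            # that triggers the action.
            self._sc.setJobGroup(
                self.group_id,
                "cancellable foreachPartition",
                interruptOnCancel=True,
            )
            self._spark.sql(self._query).foreachPartition(self._handle_partition)
        except Exception as exc:
            # A cancelled job surfaces here as an exception from the action.
            self.error = exc

    def start(self):
        self._thread.start()

    def cancel(self):
        # Safe to call from any thread, e.g. a UI or request handler.
        self._sc.cancelJobGroup(self.group_id)

    def wait(self, timeout=None):
        """Block until the worker finishes; return True if it has."""
        self._thread.join(timeout)
        return not self._thread.is_alive()
```

With the code from the question, this would be used roughly as `job = CancellableJob(spark_session, 'Some query...', handle_data)`, then `job.start()`, and later `job.cancel()` when the user asks to stop. After cancellation the action fails inside the worker thread, and the sketch keeps that exception in `job.error` rather than re-raising it. Without `interruptOnCancel=True`, tasks that are already running may be left to finish on their own.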