How to cancel a PySpark foreachPartition operation

How can I cancel a long-running PySpark foreachPartition operation?

For example, I have code that handles a very large amount of data (and takes a long time), but I want to allow the user to cancel the operation. How do I do that?

    def get_data(self, spark_session):
        query = 'Some query...'
        my_data_frame = spark_session.sql(query)
        my_data_frame.foreachPartition(handle_data)
        # How to cancel on user request?
Asked on July 16, 2020 in Python.
1 Answer(s)

It can be done using Spark job groups:

    sc = spark_session.sparkContext
    sc.setJobGroup(...)

    # In a separate thread:
    sc.cancelJobGroup(...)

There is a full example in the PySpark API documentation.
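
A minimal sketch of how this could be wired up, assuming a running SparkSession, a handle_data function like the one in the question, and a placeholder table name (the query, group id, and appName are illustrative, not from the original): the action runs in a worker thread tagged with a job group, and the main thread cancels that group when the user asks.

    import threading
    import time

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cancellable-job").getOrCreate()
    sc = spark.sparkContext

    def handle_data(rows):
        # Placeholder per-partition handler.
        for _ in rows:
            pass

    def run_job():
        # setJobGroup is per-thread, so it must be called in the same thread
        # that triggers the action. interruptOnCancel=True asks Spark to also
        # interrupt the running executor tasks.
        sc.setJobGroup("user-job", "long foreachPartition", interruptOnCancel=True)
        df = spark.sql("SELECT * FROM some_table")  # hypothetical query
        try:
            df.foreachPartition(handle_data)
        except Exception as exc:
            # A cancelled job group surfaces here as a Spark/Py4J exception.
            print("Job stopped:", exc)

    worker = threading.Thread(target=run_job)
    worker.start()

    # Elsewhere, when the user requests cancellation (simulated after 5 seconds):
    time.sleep(5)
    sc.cancelJobGroup("user-job")  # cancels every job submitted under "user-job"
    worker.join()

The exact exception type raised on cancellation depends on the Spark version, so the broad except above is only there to keep the sketch short.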

Answered on July 16, 2020.
