Memory problems with Dask distributed: workers load multiple times the size of the data into memory, and spilling to disk never happens

I’m running some simple tests with Dask distributed and Datashader, but I’m running into two problems that I haven’t been able to solve or understand.

The data I’m working with consists of 1.7 billion rows with 97 columns each, distributed across 64 Parquet files. My test code is the following; it simply plots two columns of the data in a scatter plot, following the example code at the bottom of https://datashader.org/user_guide/Performance.html :

import dask.dataframe as dd
import datashader as ds
import datashader.transfer_functions as tf

def plot(file_path):
    # Load the file and render a 600x300 scatter plot of columns 'x' and 'y'
    dask_df = dd.read_parquet(file_path, engine='pyarrow')
    cvs = ds.Canvas(plot_width=600, plot_height=300)
    agg = cvs.points(dask_df, 'x', 'y')
    img = tf.shade(agg, cmap=['lightblue', 'darkblue'])
    return img

futures = [dask_client.submit(plot, f) for f in files_paths]
result = [f.result() for f in futures]  # array with one plot per file

The two problems are the following:

First, my workers load far too much data into memory. For example, I’ve run the previous code with just one worker and one file. Even though the file is 11 GB on disk, the Dask dashboard shows around 50 GB loaded into memory. The only solution I have found is to change the following line, explicitly selecting a small subset of the columns:

def plot(file_path):
    dask_df = dd.read_parquet(file_path, columns=['x','y',...], engine='pyarrow')
    …

Although this works (and makes sense, since I’m only using two columns for the plot), it’s still unclear to me why the workers use that much memory in the first place.
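To quantify this, here is a minimal sketch of the kind of check I can run (dd and files_paths as above; the partition choice is arbitrary, and the rough math is mine):

import dask.dataframe as dd

# Sketch: compare the compressed on-disk size of one Parquet file against
# the in-memory footprint of a single materialized partition.
dask_df = dd.read_parquet(files_paths[0], engine='pyarrow')
part = dask_df.get_partition(0).compute()  # plain pandas DataFrame
print(f"one partition: {part.memory_usage(deep=True).sum() / 1e9:.2f} GB in memory")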

The second problem is that, even though I have configured my ~/.config/dask/distributed.yaml file so that workers spill to disk at 70% memory usage, my workers keep crashing because they run out of memory:

distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
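For reference, the spill threshold lives under the documented distributed.worker.memory keys in that file; the 0.70 value is the one I set, and the other fractions shown here are the library defaults:

distributed:
  worker:
    memory:
      target: 0.60     # fraction at which data starts moving to disk
      spill: 0.70      # spill to disk at 70% of process memory
      pause: 0.80      # pause worker threads
      terminate: 0.95  # nanny restarts the worker (the warning above)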

Finally, when I plot all the points, reading only 5 columns with columns=['x','y','a','b','c'], I’m getting unreasonably slow times. Despite the files being split across 8 disks to speed up I/O, and working with 8 cores (8 workers), it takes 5 minutes to plot the 1.7 billion points.
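For completeness, this is roughly the run I’m timing (a sketch: the timing wrapper is illustrative, and plot here is the 5-column variant described above):

import time

start = time.time()
futures = [dask_client.submit(plot, f) for f in files_paths]
result = [f.result() for f in futures]
print(f"elapsed: {time.time() - start:.0f} s")  # around 300 s for all 64 files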

I’ve been struggling with this for a whole week, so any advice would be highly appreciated. Please feel free to ask me for any other information that may be missing.

