Optimization of json.load() to reduce in-memory usage and time in Python
I have 10K folders, each containing 200 records stored as 200 JSON files (one record per file). I'm trying to compile all the records into one data frame and then finally into a CSV (other format suggestions are welcome).
Here is my working solution, which takes around 8.3 hours just for the dataframe-building step (not counting the conversion to CSV):
%%time
finalDf = pd.DataFrame()
rootdir = '/path/foldername'
all_files = Path(rootdir).rglob('*.json')
for filename in all_files:
    with open(filename, 'r+') as f:
        data = json.load(f)
        df = pd.json_normalize(data).drop(columns=[A]).rename(columns={'B': 'Date'})
        finalDf = finalDf.append(df, ignore_index=True)
Any suggestions to optimize this and bring the time down?
If the goal is just to write the CSV, you can use multiprocessing to parallelize the read/deserialize/serialize steps and control the file writes with a lock. With a CSV you don't have to hold the whole thing in memory; just append each DF as it's generated. If you are using hard drives instead of an SSD, you may also get a boost if the CSV is on a different drive (not just a different partition).
import multiprocessing as mp
import json
import pandas as pd
from pathlib import Path
import os

def update_csv(args):
    lock, infile, outfile = args
    with open(infile) as f:
        data = json.load(f)
    df = pd.json_normalize(data).drop(columns=[A]).rename(columns={'B': 'Date'})
    with lock:
        with open(outfile, mode="a", newline="") as f:
            # write the header only while the file is still empty, and skip the index
            df.to_csv(f, header=f.tell() == 0, index=False)

if __name__ == "__main__":
    rootdir = '/path/foldername'
    outfile = 'myoutput.csv'
    if os.path.exists(outfile):
        os.remove(outfile)
    all_files = [str(p) for p in Path(rootdir).rglob('*.json')]
    mgr = mp.Manager()
    lock = mgr.Lock()
    # pool sizing is a bit of a guess....
    with mp.Pool(mp.cpu_count() - 1) as pool:
        result = pool.map(update_csv, [(lock, fn, outfile) for fn in all_files], chunksize=1)
Personally, I prefer to use a file-system lock file for this type of thing, but that's platform dependent and you may have problems on some file system types (like a mounted remote file system). multiprocessing.Manager uses background synchronization, and I'm not sure whether its Lock is efficient or not, but it's good enough here: the locking will only account for a minor fraction of the total cost.
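If the Manager lock ever shows up as a bottleneck, a lock-free variant (a sketch, not the exact approach above) is to have the workers only parse and normalize, and let the parent process do all the writing; pool.imap_unordered hands back each frame as it finishes, so a single writer can append it to the CSV. The paths and the A/'B' column names below are the same placeholders used in the question.

import multiprocessing as mp
import json
import os
import pandas as pd
from pathlib import Path

def load_one(infile):
    # worker: parse and normalize one JSON file, return the frame
    with open(infile) as f:
        data = json.load(f)
    # A and 'B' are the placeholder column names from the question
    return pd.json_normalize(data).drop(columns=[A]).rename(columns={'B': 'Date'})

if __name__ == "__main__":
    rootdir = '/path/foldername'   # placeholder path from the question
    outfile = 'myoutput.csv'
    if os.path.exists(outfile):
        os.remove(outfile)
    all_files = [str(p) for p in Path(rootdir).rglob('*.json')]
    with mp.Pool(max(mp.cpu_count() - 1, 1)) as pool, open(outfile, "a", newline="") as out:
        # only the parent writes, so no lock is needed; write the header once
        for i, df in enumerate(pool.imap_unordered(load_one, all_files, chunksize=32)):
            df.to_csv(out, header=(i == 0), index=False)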
One important issue comes from the dataframe appending, which runs in O(n^2) time: for each newly processed JSON file, finalDf is entirely copied! Here is a modified version running in O(n) time:
%%time
finalDf = pd.DataFrame()
rootdir = '/path/foldername'
all_files = Path(rootdir).rglob('*.json')
allDf = []
for filename in all_files:
    with open(filename, 'r+') as f:
        data = json.load(f)
        df = pd.json_normalize(data).drop(columns=[A]).rename(columns={'B': 'Date'})
        allDf.append(df)
finalDf = pd.concat(allDf, ignore_index=True)
If this is not enough, the JSON parsing and the pandas post-processing could be executed in parallel using the multiprocessing module.
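A minimal sketch of that combination, reusing the question's placeholder path and A/'B' column names: the workers handle json.load and pd.json_normalize, and the parent concatenates everything once at the end, so the build stays linear.

import multiprocessing as mp
import json
import pandas as pd
from pathlib import Path

def parse_one(filename):
    # worker: parse and post-process a single JSON file
    with open(filename) as f:
        data = json.load(f)
    # A and 'B' are the placeholder column names from the question
    return pd.json_normalize(data).drop(columns=[A]).rename(columns={'B': 'Date'})

if __name__ == "__main__":
    rootdir = '/path/foldername'  # placeholder path from the question
    all_files = [str(p) for p in Path(rootdir).rglob('*.json')]
    with mp.Pool(mp.cpu_count() - 1) as pool:
        allDf = pool.map(parse_one, all_files, chunksize=64)
    # a single concat keeps the whole build linear in the number of files
    finalDf = pd.concat(allDf, ignore_index=True)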