Optimization of json.load() to reduce in-memory usage and time in Python
I have 10K folders, each containing 200 records stored as 200 JSON files (one record per file). I'm trying to compile all the records into one data frame and then finally into a CSV (other format suggestions are welcome).
Here is my working solution, which takes around 8.3 hours just for the dataframe-building step (not counting the conversion to CSV):
%%time
finalDf = pd.DataFrame()
rootdir = '/path/foldername'
all_files = Path(rootdir).rglob('*.json')
for filename in all_files:
    with open(filename, 'r+') as f:
        data = json.load(f)
        df = pd.json_normalize(data).drop(columns=[A]).rename(columns={'B': 'Date'})
        finalDf = finalDf.append(df, ignore_index=True)
Any suggestions to optimize this and bring the time down?
If the goal is just to write the CSV, you can use multiprocessing to parallelize the read/deserialize/serialize steps and control the file writes with a lock. With a CSV you don't have to hold the whole thing in memory; just append each DF as it's generated. If you are using hard drives instead of an SSD, you may also get a boost if the CSV is on a different drive (not just a different partition).
import multiprocessing as mp
import json
import pandas as pd
from pathlib import Path
import os

def update_csv(args):
    lock, infile, outfile = args
    with open(infile) as f:
        data = json.load(f)
    df = pd.json_normalize(data).drop(columns=[A]).rename(columns={'B': 'Date'})
    with lock:
        with open(outfile, mode="a", newline="") as f:
            # write the header only while the file is still empty, and skip the index
            df.to_csv(f, header=f.tell() == 0, index=False)

if __name__ == "__main__":
    rootdir = '/path/foldername'
    outfile = 'myoutput.csv'
    if os.path.exists(outfile):
        os.remove(outfile)
    all_files = [str(p) for p in Path(rootdir).rglob('*.json')]
    mgr = mp.Manager()
    lock = mgr.Lock()
    # pool sizing is a bit of a guess....
    with mp.Pool(mp.cpu_count() - 1) as pool:
        result = pool.map(update_csv, [(lock, fn, outfile) for fn in all_files], chunksize=1)
Personally, I prefer to use a file-system lock file for this type of thing, but that's platform dependent and you may have problems on some file system types (like a mounted remote file system). multiprocessing.Manager uses background synchronization, and I'm not sure whether its Lock is efficient or not, but it's good enough here: the locking will only account for a minor fraction of the total cost.
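If the Manager lock ever shows up as a bottleneck, a lock-free variant (a sketch, not the exact approach above) is to have the workers only parse and normalize, and let the parent process do all the writing; pool.imap_unordered hands back each frame as it finishes, so a single writer can append it to the CSV. The paths and the A/'B' column names below are the same placeholders used in the question.

import multiprocessing as mp
import json
import os
import pandas as pd
from pathlib import Path

def load_one(infile):
    # worker: parse and normalize one JSON file, return the frame
    with open(infile) as f:
        data = json.load(f)
    # A and 'B' are the placeholder column names from the question
    return pd.json_normalize(data).drop(columns=[A]).rename(columns={'B': 'Date'})

if __name__ == "__main__":
    rootdir = '/path/foldername'   # placeholder path from the question
    outfile = 'myoutput.csv'
    if os.path.exists(outfile):
        os.remove(outfile)
    all_files = [str(p) for p in Path(rootdir).rglob('*.json')]
    with mp.Pool(max(mp.cpu_count() - 1, 1)) as pool, open(outfile, "a", newline="") as out:
        # only the parent writes, so no lock is needed; write the header once
        for i, df in enumerate(pool.imap_unordered(load_one, all_files, chunksize=32)):
            df.to_csv(out, header=(i == 0), index=False)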
One important issue comes from the dataframe appending, which runs in O(n^2) time: for each newly processed JSON file, finalDf is entirely copied! Here is a modified version running in O(n) time:
%%time
finalDf = pd.DataFrame()
rootdir = '/path/foldername'
all_files = Path(rootdir).rglob('*.json')
allDf = []
for filename in all_files:
    with open(filename, 'r+') as f:
        data = json.load(f)
        df = pd.json_normalize(data).drop(columns=[A]).rename(columns={'B': 'Date'})
        allDf.append(df)
finalDf = pd.concat(allDf, ignore_index=True)
If this is not enough, the JSON parsing and the pandas post-processing could be executed in parallel using the multiprocessing module.
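A minimal sketch of that combination, reusing the question's placeholder path and A/'B' column names: the workers handle json.load and pd.json_normalize, and the parent concatenates everything once at the end, so the build stays linear.

import multiprocessing as mp
import json
import pandas as pd
from pathlib import Path

def parse_one(filename):
    # worker: parse and post-process a single JSON file
    with open(filename) as f:
        data = json.load(f)
    # A and 'B' are the placeholder column names from the question
    return pd.json_normalize(data).drop(columns=[A]).rename(columns={'B': 'Date'})

if __name__ == "__main__":
    rootdir = '/path/foldername'  # placeholder path from the question
    all_files = [str(p) for p in Path(rootdir).rglob('*.json')]
    with mp.Pool(mp.cpu_count() - 1) as pool:
        allDf = pool.map(parse_one, all_files, chunksize=64)
    # a single concat keeps the whole build linear in the number of files
    finalDf = pd.concat(allDf, ignore_index=True)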