python file operation slowing down on massive text files

This Python code slows down the longer it runs.

Can anyone please tell me why?

I hope it is not re-indexing and counting from the start again for every line I query; I thought it would be some kind of file stream?!

From line 10k to 20k it takes 2 seconds; from 300k to 310k it takes about 5 minutes, and it keeps getting worse. Up to that point the code only ever runs the ELSE branch, 'listoflines' is constant (850,000 lines in the list) and of type 'list', and 'offset' is just a constant 'int'.

The source file has millions of lines, in some cases over 20 million.

‘dummyline not in listoflines’ should take the same time every time.

with open(filename, "rt") as source:
    for dummyline in source:
        if (len(dummyline) > 1) and (dummyline not in listoflines):
            # RUN compute
            # this part is not reached where I have the problem
            pass
        else:
            if dummyalreadycheckedcounter % 10000 == 0:
                print("%d/%d: %s already checked or not valid " % (dummyalreadycheckedcounter, offset, dummyline))
            dummyalreadycheckedcounter = dummyalreadycheckedcounter + 1
2 Answer(s)

Same opinion as @Sedy Vlk. Use a hash (that is, a dictionary in Python) instead.

# clines: the lines already known; nlines: the lines read from the source file
clines_count = {l: 0 for l in clines}
counter = 0
for line in nlines:
    if len(line) > 1 and line in clines_count:
        pass
    else:
        if counter % 10000 == 0:
            print("%d: %s already checked or not valid " % (counter, line))
        counter += 1
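Mapped back onto the names from the question (just a hypothetical sketch; it assumes 'listoflines', 'filename', 'offset' and 'dummyalreadycheckedcounter' already exist as described above), the same idea would look like:

# Build the hash-based lookup once, before the loop; dict membership is an
# O(1) average-case hash lookup instead of an O(n) list scan
knownlines = {l: 0 for l in listoflines}

with open(filename, "rt") as source:
    for dummyline in source:
        if (len(dummyline) > 1) and (dummyline not in knownlines):
            pass  # RUN compute
        else:
            if dummyalreadycheckedcounter % 10000 == 0:
                print("%d/%d: %s already checked or not valid " % (dummyalreadycheckedcounter, offset, dummyline))
            dummyalreadycheckedcounter = dummyalreadycheckedcounter + 1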
Answered on July 16, 2020.

Actually, the 'in' operation on a list does not take the same time every time; it is O(n), so it gets slower and slower as the list grows.

You want to use a set instead. See https://wiki.python.org/moin/TimeComplexity
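To make the difference concrete, here is a minimal, self-contained timing sketch (the data and sizes are made up, roughly mirroring the 850,000-line list from the question):

import timeit

# Hypothetical sample data: 850,000 short strings
lines_list = ["line %d\n" % i for i in range(850_000)]
lines_set = set(lines_list)

probe = "line 849999\n"  # near the end of the list: worst case for a list scan

# Membership in a list compares elements one by one: O(n)
print(timeit.timeit(lambda: probe in lines_list, number=100))

# Membership in a set is a hash lookup: O(1) on average
print(timeit.timeit(lambda: probe in lines_set, number=100))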

You didn't ask for this, but I'd suggest turning this into a processing pipeline, so your compute part is not mixed with the dedup logic:

def dedupped_stream(filename):
    seen = set()
    with open(filename, "rt") as source:
        for each_line in source:
            if len(each_line) > 1 and each_line not in seen:
                seen.add(each_line)
                yield each_line

Then you can just do:

for line in dedupped_stream(filename):
    ...

and you would not need to worry about deduplication here at all.

