python file operation slowing down on massive text files

This Python code slows down the longer it runs.

Can anyone please tell me why?

I hope it is not re-indexing and counting from the start again for every line I query; I thought it would be some kind of file stream?!

From line 10k to 20k it takes 2 seconds; from 300k to 310k it takes about 5 minutes, and it keeps getting worse. Up to that point the code only ever runs the ELSE branch, 'listoflines' is constant (850,000 lines in the list) and of type 'list', and 'offset' is just a constant 'int'.

The source file has millions of lines, in some cases over 20 million.

‘dummyline not in listoflines’ should take the same time every time.

with open(filename, "rt") as source:
    for dummyline in source:
        if (len(dummyline) > 1) and (dummyline not in listoflines):
            # RUN compute
            # this part is not reached where I have the problem
            pass
        else:
            if dummyalreadycheckedcounter % 10000 == 0:
                print("%d/%d: %s already checked or not valid " % (dummyalreadycheckedcounter, offset, dummyline))
            dummyalreadycheckedcounter = dummyalreadycheckedcounter + 1
2 Answer(s)

Same opinion as @Sedy Vlk. Use a hash (that is, a dictionary in Python) instead.

# clines: the lines already known; nlines: the lines read from the source file
clines_count = {l: 0 for l in clines}
counter = 0
for line in nlines:
    if len(line) > 1 and line in clines_count:
        pass
    else:
        if counter % 10000 == 0:
            print("%d: %s already checked or not valid " % (counter, line))
        counter += 1
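Mapped back onto the names from the question (just a hypothetical sketch; it assumes 'listoflines', 'filename', 'offset' and 'dummyalreadycheckedcounter' already exist as described above), the same idea would look like:

# Build the hash-based lookup once, before the loop; dict membership is an
# O(1) average-case hash lookup instead of an O(n) list scan
knownlines = {l: 0 for l in listoflines}

with open(filename, "rt") as source:
    for dummyline in source:
        if (len(dummyline) > 1) and (dummyline not in knownlines):
            pass  # RUN compute
        else:
            if dummyalreadycheckedcounter % 10000 == 0:
                print("%d/%d: %s already checked or not valid " % (dummyalreadycheckedcounter, offset, dummyline))
            dummyalreadycheckedcounter = dummyalreadycheckedcounter + 1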
Answered on July 16, 2020.

Actually, the 'in' operation on a list does not take the same time every time; it is O(n), so it gets slower and slower as the list grows.

You want to use a set instead. See https://wiki.python.org/moin/TimeComplexity
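To make the difference concrete, here is a minimal, self-contained timing sketch (the data and sizes are made up, roughly mirroring the 850,000-line list from the question):

import timeit

# Hypothetical sample data: 850,000 short strings
lines_list = ["line %d\n" % i for i in range(850_000)]
lines_set = set(lines_list)

probe = "line 849999\n"  # near the end of the list: worst case for a list scan

# Membership in a list compares elements one by one: O(n)
print(timeit.timeit(lambda: probe in lines_list, number=100))

# Membership in a set is a hash lookup: O(1) on average
print(timeit.timeit(lambda: probe in lines_set, number=100))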

You didn't ask for this, but I'd suggest turning this into a processing pipeline, so your compute part is not mixed with the dedup logic:

def dedupped_stream(filename):
    seen = set()
    with open(filename, "rt") as source:
        for each_line in source:
            if len(each_line) > 1 and each_line not in seen:
                seen.add(each_line)
                yield each_line

Then you can just do:

for line in dedupped_stream(filename):
    ...

and you would not need to worry about deduplication here at all.

