Faster nested row comparison with 2D numpy arrays
I have two very large arrays, X_pos and X_neg, with shapes (900000, 100) and (1200000, 100). I'd like to find out whether any row of X_pos is also present in X_neg. If there is a match, I need a list of indices for the matching rows; later I need to remove those rows so that the two arrays no longer share any identical row. For now, I'm using for loops:
list_conflict = []
for i in range(len(X_pos)):
    for j in range(len(X_neg)):
        if (X_pos[i] == X_neg[j]).all():
            list_conflict.append([i, j])
fault_index_pos = np.unique([x[0] for x in list_conflict])
fault_index_neg = np.unique([x[1] for x in list_conflict])
X_neg = np.delete(X_neg, fault_index_neg, axis=0)
X_pos = np.delete(X_pos, fault_index_pos, axis=0)
It takes a row of X_pos in the outer loop and compares it exhaustively with every row of X_neg. If it finds a match, it appends the index pair to list_conflict, the first element being the position in X_pos and the second the position in X_neg. Then fault_index_pos and fault_index_neg are reduced to unique values, since a row of X_pos could match several rows of X_neg and the list may contain repeated positions. Finally, the matching rows are removed with np.delete, using the fault_index lists as the indices to delete.
I'm looking for a faster approach to this conflict comparison, whether via sets, vectorization, or anything else.
P.S. Example arrays:

X_pos = np.array([[0, 1, 2, 3, 4],
                  [0, 1, 2, 3, 5],
                  [0, 1, 2, 3, 6],
                  [2, 4, 5, 6, 7]])
X_neg = np.array([[2, 4, 5, 6, 7],
                  [0, 1, 2, 3, 7],
                  [0, 1, 2, 3, 4],
                  [0, 1, 2, 3, 9],
                  [1, 9, 3, 2, 5]])
It should return the indices of the rows that also appear in the other array, which here is list_conflict = [[0, 2], [3, 0]] (X_pos[0] equals X_neg[2], and X_pos[3] equals X_neg[0]). Using these indices, I should then be able to delete the shared rows from both arrays. A possible duplicate is the question I asked yesterday, but it oversimplifies and fails to explain the problem precisely; those answers do not solve this problem and had to be altered in some way to work.
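For reference, a minimal sketch of the expected end state on these example arrays (assuming import numpy as np and the X_pos / X_neg defined above):

list_conflict = [[0, 2], [3, 0]]                             # X_pos[0] == X_neg[2], X_pos[3] == X_neg[0]
fault_index_pos = np.unique([x[0] for x in list_conflict])   # array([0, 3])
fault_index_neg = np.unique([x[1] for x in list_conflict])   # array([0, 2])
print(np.delete(X_pos, fault_index_pos, axis=0))             # rows 1 and 2 of X_pos remain
print(np.delete(X_neg, fault_index_neg, axis=0))             # rows 1, 3 and 4 of X_neg remain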
Find the rows of a whose elements all appear in b:
duplicate_a = np.isin(a, b).all(axis=1)
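As a quick aside, np.isin is an element-wise membership test (it checks each element of the first array against all values of the second), so .all(axis=1) keeps the rows of a whose elements all occur somewhere in b. A tiny illustration with made-up arrays p and q:

import numpy as np

p = np.array([[1, 2], [3, 4]])
q = np.array([[2, 1], [5, 6]])
print(np.isin(p, q))              # [[ True  True]
                                  #  [False False]]
print(np.isin(p, q).all(axis=1))  # [ True False]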
For these rows, find the corresponding rows in b:
duplicate_b = np.isin(b, a[duplicate_a, :]).all(axis=1)
To get the indices of these duplicates:
duplicate_b_rows = b[duplicate_b]
duplicate_b_indices = duplicate_b.nonzero()[0]
duplicate_a_indices = duplicate_a.nonzero()[0]
duplicates = [
    (duplicate_a_indices[i], duplicate_b_indices[np.isin(duplicate_b_rows, row).all(axis=1)])
    for i, row in enumerate(a[duplicate_a, :])
]
For example,
a = np.random.random((1000, 100))
b = np.random.random((500, 100))
b[1, :] = a[3, :]
b[5, :] = a[3, :]
b[2, :] = a[1, :]
this would output
[(1, array([2])), (3, array([1, 5]))]
indicating that a[1, :] corresponds to b[2, :], and a[3, :] corresponds to b[1, :] as well as b[5, :].
To flatten this completely:
duplicates = [
    (duplicate_a_indices[i], duplicate_b_indices[j])
    for i, row in enumerate(a[duplicate_a, :])
    for j in np.isin(duplicate_b_rows, row).all(axis=1).nonzero()[0]
]
It would now output [(1, 2), (3, 1), (3, 5)] for the example above.
But a faster way would be to use a lookup table for the rows:
a_lookup = {row.tobytes(): i for i, row in enumerate(a)}
duplicates = [(a_lookup[row.tobytes()], i)
              for i, row in enumerate(b)
              if row.tobytes() in a_lookup]
This runs much faster than the numpy version when you have many rows, since its complexity is O(N*K + M*K), where N and M are the numbers of rows in a and b respectively and K is the number of elements per row, while the complexity of the numpy solution is O(M*N*K).
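Applied back to the question's setup, a sketch of the same lookup-table idea combined with the deletion step from the question (assuming X_pos plays the role of a and X_neg the role of b; pos_lookup is just a name for the sketch):

# Hash each row of X_pos by its raw bytes, then probe with the rows of X_neg.
# Assumes X_pos and X_neg share the same dtype, so identical rows give identical bytes.
pos_lookup = {row.tobytes(): i for i, row in enumerate(X_pos)}
list_conflict = [[pos_lookup[row.tobytes()], j]
                 for j, row in enumerate(X_neg)
                 if row.tobytes() in pos_lookup]

fault_index_pos = np.unique([x[0] for x in list_conflict])
fault_index_neg = np.unique([x[1] for x in list_conflict])
X_pos = np.delete(X_pos, fault_index_pos, axis=0)
X_neg = np.delete(X_neg, fault_index_neg, axis=0)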
Note that this works fine for finding exact duplicates, but if your data consists of floating point numbers you may miss "duplicates" that are essentially the same yet, due to the limited resolution of floats, differ in a digit or more.
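One possible workaround for that, not part of the original approach, is to round the rows to a fixed number of decimals before hashing, assuming a tolerance expressed in decimal places is acceptable for your data:

# Sketch: quantize rows before hashing so nearly-equal floats hash identically.
# `decimals` is an assumed tolerance parameter, chosen here for illustration only.
decimals = 8
a_lookup = {np.round(row, decimals).tobytes(): i for i, row in enumerate(a)}
duplicates = [(a_lookup[np.round(row, decimals).tobytes()], i)
              for i, row in enumerate(b)
              if np.round(row, decimals).tobytes() in a_lookup]

Keep in mind that rounding is only a crude tolerance: two values that are arbitrarily close can still fall on opposite sides of a rounding boundary.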