Randomly sampled a subset of a pandas DataFrame and cannot retrieve the remaining rows
My input data has the following form:
gold,Program,MethodType,CallersT,CallersN,CallersU,CallersCallersT,CallersCallersN,CallersCallersU,CalleesT,CalleesN,CalleesU,CalleesCalleesT,CalleesCalleesN,CalleesCalleesU,CompleteCallersCallees,classGold
T,chess,Inner,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
T,chess,Inner,Low,-1,-1,Low,-1,-1,Medium,-1,Medium,High,-1,High,0,Trace,
T,chess,Inner,Low,-1,-1,Low,-1,-1,Medium,-1,Medium,High,-1,High,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,Low,Low,High,Medium,-1,Medium,0,Trace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,Low,Low,Medium,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
T,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,Low,Low,Medium,0,Trace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,Low,Low,High,Low,Low,Medium,0,Trace,
N,chess,Inner,Low,-1,-1,-1,-1,-1,Low,Low,High,Low,Low,Medium,0,Trace,
N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
....
N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,NoTrace,
T,chess,Inner,Low,-1,-1,Low,Low,-1,Low,-1,Low,-1,-1,-1,0,Trace,
T,chess,Inner,Low,-1,-1,Medium,-1,-1,Low,-1,Low,-1,-1,-1,0,Trace,
N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,NoTrace,
I would like to sample my input file and keep only the rows where the feature CompleteCallersCallees equals 1, which I do with the following line of code:
TrainingSet=dataset.loc[dataset['CompleteCallersCallees'] == 1]
Then I would like to select a random sample of this TrainingSet, which I do with:
TrainingSet1=TrainingSet.sample(frac=0.7)
This randomly selects 70% of TrainingSet. The problem is that I would also like to retrieve the remaining 30% of TrainingSet, that is, the rows that are not part of TrainingSet1. I am trying to do so with:
TrainingSet2=pd.concat([TrainingSet, TrainingSet1]).drop_duplicates(keep=False)
However, this does not work. The size of TrainingSet is 2269 and the size of TrainingSet1 is 1588, so the size of TrainingSet2 should be 2269-1588=681, but when I print it, it is only 34.
My full input data file can be found at this link: https://drive.google.com/file/d/1vF4ZAPSps_aO2Umsp2hEgK7UxwcXLiZR/view?usp=sharing
Here is the code I am using:
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
# Feature Scaling
from sklearn.preprocessing import StandardScaler

SeparateProjectLearning=False
CompleteCallersCallees=False
PartialTrainingSetCompleteCallersCallees=True

def main():
    X_train={}
    X_test={}
    y_train={}
    y_test={}
    dataset = pd.read_csv( 'InputData.txt', sep= ',', index_col=False)
    #convert T into 1 and N into 0
    dataset['gold'] = dataset['gold'].astype('category').cat.codes
    dataset['Program'] = dataset['Program'].astype('category').cat.codes
    dataset['classGold'] = dataset['classGold'].astype('category').cat.codes
    dataset['MethodType'] = dataset['MethodType'].astype('category').cat.codes
    dataset['CallersT'] = dataset['CallersT'].astype('category').cat.codes
    dataset['CallersN'] = dataset['CallersN'].astype('category').cat.codes
    dataset['CallersU'] = dataset['CallersU'].astype('category').cat.codes
    dataset['CallersCallersT'] = dataset['CallersCallersT'].astype('category').cat.codes
    dataset['CallersCallersN'] = dataset['CallersCallersN'].astype('category').cat.codes
    dataset['CallersCallersU'] = dataset['CallersCallersU'].astype('category').cat.codes
    dataset['CalleesT'] = dataset['CalleesT'].astype('category').cat.codes
    dataset['CalleesN'] = dataset['CalleesN'].astype('category').cat.codes
    dataset['CalleesU'] = dataset['CalleesU'].astype('category').cat.codes
    dataset['CalleesCalleesT'] = dataset['CalleesCalleesT'].astype('category').cat.codes
    dataset['CalleesCalleesN'] = dataset['CalleesCalleesN'].astype('category').cat.codes
    dataset['CalleesCalleesU'] = dataset['CalleesCalleesU'].astype('category').cat.codes
    pd.set_option('display.max_columns', None)
    row_count, column_count = dataset.shape
    Xcol = dataset.iloc[:, 1:column_count]
    TrainingSet=dataset.loc[dataset['CompleteCallersCallees'] == 1]
    print('TrainingSet',len(TrainingSet))
    TrainingSet1=TrainingSet.sample(frac=0.7)
    TrainingSet2=pd.concat([TrainingSet, TrainingSet1]).drop_duplicates(keep=False)
    print('TrainingSet2',len(TrainingSet2),'TrainingSet1',len(TrainingSet1))
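
A likely explanation, assuming the full file contains many identical rows (the excerpt above suggests it does): drop_duplicates(keep=False) compares row values, not index labels, so it discards every row whose values occur more than once, including rows that were already repeated inside TrainingSet itself. The result is therefore not the complement of the sample. The following is a minimal sketch on a hypothetical toy frame, not the real data; the names toy, sample, via_duplicates and via_index are illustrative only. It compares that approach with dropping the sampled rows by index label.

```python
import pandas as pd

# Hypothetical stand-in for TrainingSet: 10 rows but only 3 distinct
# value patterns, so every row has at least one exact duplicate.
toy = pd.DataFrame({
    "CallersT": ["Low", "-1", "-1", "-1", "Low", "-1", "-1", "Low", "-1", "-1"],
    "classGold": ["Trace", "NoTrace", "NoTrace", "Trace", "Trace",
                  "NoTrace", "Trace", "Trace", "NoTrace", "Trace"],
})

# 70% random sample; random_state is fixed only to make the sketch repeatable.
sample = toy.sample(frac=0.7, random_state=0)

# Value-based approach from the question: every row value appears more than
# once in the concatenation, so keep=False removes all of them.
via_duplicates = pd.concat([toy, sample]).drop_duplicates(keep=False)

# Index-based approach: remove exactly the rows whose labels were sampled.
# This assumes the index labels of the frame are unique.
via_index = toy.drop(sample.index)

print(len(toy), len(sample), len(via_duplicates), len(via_index))
```

Applied to the code above, the analogous call would be TrainingSet.drop(TrainingSet1.index), which relies on the index labels of TrainingSet being unique. train_test_split, which the script already imports, is another option, since it returns both partitions from a single call.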