Randomly selected a sample subset from a dataset and cannot retrieve the remaining rows

My input data has the following form:

    gold,Program,MethodType,CallersT,CallersN,CallersU,CallersCallersT,CallersCallersN,CallersCallersU,CalleesT,CalleesN,CalleesU,CalleesCalleesT,CalleesCalleesN,CalleesCalleesU,CompleteCallersCallees,classGold
    T,chess,Inner,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,-1,Low,1,Trace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,Trace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,Trace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
    T,chess,Inner,Low,-1,-1,Low,-1,-1,Medium,-1,Medium,High,-1,High,0,Trace,
    T,chess,Inner,Low,-1,-1,Low,-1,-1,Medium,-1,Medium,High,-1,High,0,Trace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
    N,chess,Inner,-1,Low,-1,-1,-1,-1,Low,Low,High,Medium,-1,Medium,0,Trace,
    N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,Low,Low,Medium,0,NoTrace,
    N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
    T,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,Low,Low,Medium,0,Trace,
    N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
    N,chess,Inner,-1,Low,-1,-1,-1,-1,Low,Low,High,Low,Low,Medium,0,Trace,
    N,chess,Inner,Low,-1,-1,-1,-1,-1,Low,Low,High,Low,Low,Medium,0,Trace,
    N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
    ....
    N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,Trace,
    N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,NoTrace,
    T,chess,Inner,Low,-1,-1,Low,Low,-1,Low,-1,Low,-1,-1,-1,0,Trace,
    T,chess,Inner,Low,-1,-1,Medium,-1,-1,Low,-1,Low,-1,-1,-1,0,Trace,
    N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,NoTrace,

I would like to sample my input file and select only the rows where the feature CompleteCallersCallees equals 1, which I do with `TrainingSet=dataset.loc[dataset['CompleteCallersCallees'] == 1]`. Then I would like to take a random sample of this TrainingSet, which I do with `TrainingSet1=TrainingSet.sample(frac=0.7)`. This selects 70% of TrainingSet at random.

The problem is that I would like to retrieve the remaining 30% of TrainingSet, i.e. the rows that are not part of TrainingSet1. I am doing so with `TrainingSet2=pd.concat([TrainingSet, TrainingSet1]).drop_duplicates(keep=False)`. However, this does not work: the size of TrainingSet is 2269 and the size of TrainingSet1 is 1588, so the size of TrainingSet2 should be 2269-1588=681, but when I print it the size is only 34.

My full input data file can be found at this link: https://drive.google.com/file/d/1vF4ZAPSps_aO2Umsp2hEgK7UxwcXLiZR/view?usp=sharing
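For context on the count mismatch, here is a small illustration of how `drop_duplicates(keep=False)` behaves when a frame contains rows with identical contents, as the sample rows above do. The toy frame and the fixed "sample" are hypothetical and chosen only to keep the result deterministic; whether this fully accounts for the 34 rows in the real data is an assumption.

```python
import pandas as pd

# Hypothetical toy frame: 6 rows, but only 3 distinct row contents.
toy = pd.DataFrame({
    "gold": ["N", "N", "N", "N", "T", "N"],
    "CalleesT": ["High", "High", "High", "High", "Medium", "Low"],
    "CompleteCallersCallees": [1, 1, 1, 1, 1, 1],
})

# A fixed stand-in for sample(), so the outcome does not depend on randomness.
sampled = toy.iloc[[0, 4]]

# keep=False drops every row whose contents occur more than once in the
# concatenated frame, including rows of `toy` that only duplicate each other.
remaining = pd.concat([toy, sampled]).drop_duplicates(keep=False)

print(len(toy) - len(sampled))  # number of rows that were not sampled
print(len(remaining))           # number of rows that survive drop_duplicates
```

In this toy case four rows were not sampled, yet only the single row with unique contents survives, because the comparison is on row values rather than on row identity.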

Here is the code I am using:

    import pandas as pd
    import numpy as np
    from sklearn.feature_selection import SelectFromModel
    from sklearn.model_selection import train_test_split
    # Feature Scaling
    from sklearn.preprocessing import StandardScaler

    SeparateProjectLearning=False
    CompleteCallersCallees=False
    PartialTrainingSetCompleteCallersCallees=True

    def main():
        X_train={}
        X_test={}
        y_train={}
        y_test={}
        dataset = pd.read_csv( 'InputData.txt', sep= ',', index_col=False)

        #convert T into 1 and N into 0
        dataset['gold'] = dataset['gold'].astype('category').cat.codes
        dataset['Program'] = dataset['Program'].astype('category').cat.codes
        dataset['classGold'] = dataset['classGold'].astype('category').cat.codes
        dataset['MethodType'] = dataset['MethodType'].astype('category').cat.codes
        dataset['CallersT'] = dataset['CallersT'].astype('category').cat.codes
        dataset['CallersN'] = dataset['CallersN'].astype('category').cat.codes
        dataset['CallersU'] = dataset['CallersU'].astype('category').cat.codes
        dataset['CallersCallersT'] = dataset['CallersCallersT'].astype('category').cat.codes
        dataset['CallersCallersN'] = dataset['CallersCallersN'].astype('category').cat.codes
        dataset['CallersCallersU'] = dataset['CallersCallersU'].astype('category').cat.codes
        dataset['CalleesT'] = dataset['CalleesT'].astype('category').cat.codes
        dataset['CalleesN'] = dataset['CalleesN'].astype('category').cat.codes
        dataset['CalleesU'] = dataset['CalleesU'].astype('category').cat.codes
        dataset['CalleesCalleesT'] = dataset['CalleesCalleesT'].astype('category').cat.codes
        dataset['CalleesCalleesN'] = dataset['CalleesCalleesN'].astype('category').cat.codes
        dataset['CalleesCalleesU'] = dataset['CalleesCalleesU'].astype('category').cat.codes

        pd.set_option('display.max_columns', None)
        row_count, column_count = dataset.shape
        Xcol = dataset.iloc[:, 1:column_count]

        TrainingSet=dataset.loc[dataset['CompleteCallersCallees'] == 1]
        print('TrainingSet',len(TrainingSet))
        TrainingSet1=TrainingSet.sample(frac=0.7)
        TrainingSet2=pd.concat([TrainingSet, TrainingSet1]).drop_duplicates(keep=False)
        print('TrainingSet2',len(TrainingSet2),'TrainingSet1',len(TrainingSet1))
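
A minimal sketch of an index-based way to obtain the rows that were not sampled, offered as one possible approach rather than the only one. `sample()` keeps the original index labels, so `drop()` with the sampled index returns the complement regardless of whether row contents repeat. The toy frame, the names `subset`, `part1`, `part2`, and the `random_state` argument are illustrative assumptions, and the sketch assumes the index labels are unique (as with the default index produced by `read_csv`).

```python
import pandas as pd

# Hypothetical toy frame with repeated row contents, as in the sample data.
toy = pd.DataFrame({
    "gold": ["N"] * 8 + ["T"] * 2,
    "CompleteCallersCallees": [1] * 10,
})

subset = toy.loc[toy["CompleteCallersCallees"] == 1]

# random_state is an added assumption, used only to make the split repeatable.
part1 = subset.sample(frac=0.7, random_state=0)

# Drop the sampled index labels; what is left is the unsampled remainder.
part2 = subset.drop(part1.index)

print(len(subset), len(part1), len(part2))
```

Applied to the code above, the equivalent line would be `TrainingSet2=TrainingSet.drop(TrainingSet1.index)`, and the two parts should then add up to the size of TrainingSet.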