Randomly sampled a subset of a dataset and cannot retrieve the remaining rows


My input data has the following form:

    gold,Program,MethodType,CallersT,CallersN,CallersU,CallersCallersT,CallersCallersN,CallersCallersU,CalleesT,CalleesN,CalleesU,CalleesCalleesT,CalleesCalleesN,CalleesCalleesU,CompleteCallersCallees,classGold
    T,chess,Inner,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,-1,Low,1,Trace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,Trace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,High,-1,-1,-1,Low,1,NoTrace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,Trace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,Trace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
    T,chess,Inner,Low,-1,-1,Low,-1,-1,Medium,-1,Medium,High,-1,High,0,Trace,
    T,chess,Inner,Low,-1,-1,Low,-1,-1,Medium,-1,Medium,High,-1,High,0,Trace,
    N,chess,Inner,-1,Low,-1,-1,Low,-1,-1,Medium,Medium,-1,High,High,0,NoTrace,
    N,chess,Inner,-1,Low,-1,-1,-1,-1,Low,Low,High,Medium,-1,Medium,0,Trace,
    N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,Low,Low,Medium,0,NoTrace,
    N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
    T,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,Low,Low,Medium,0,Trace,
    N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
    N,chess,Inner,-1,Low,-1,-1,-1,-1,Low,Low,High,Low,Low,Medium,0,Trace,
    N,chess,Inner,Low,-1,-1,-1,-1,-1,Low,Low,High,Low,Low,Medium,0,Trace,
    N,chess,Inner,-1,Low,-1,-1,-1,-1,-1,Medium,High,-1,Medium,Medium,0,NoTrace,
    ....
    N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,Trace,
    N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,NoTrace,
    T,chess,Inner,Low,-1,-1,Low,Low,-1,Low,-1,Low,-1,-1,-1,0,Trace,
    T,chess,Inner,Low,-1,-1,Medium,-1,-1,Low,-1,Low,-1,-1,-1,0,Trace,
    N,chess,Inner,-1,Low,-1,-1,Medium,-1,-1,Low,Low,-1,-1,-1,0,NoTrace,

I would like to sample my input file and keep only the rows where CompleteCallersCallees equals 1, which I do with TrainingSet=dataset.loc[dataset['CompleteCallersCallees'] == 1]. I then take a random 70% sample of this TrainingSet with TrainingSet1=TrainingSet.sample(frac=0.7).

The problem is retrieving the remaining 30% of TrainingSet, i.e. the rows that are not part of TrainingSet1. I tried TrainingSet2=pd.concat([TrainingSet, TrainingSet1]).drop_duplicates(keep=False), but it does not work: TrainingSet has 2269 rows and TrainingSet1 has 1588, so TrainingSet2 should have 2269-1588=681 rows, yet it has only 34 rows when I print its length.

My full input data file can be found under this link: https://drive.google.com/file/d/1vF4ZAPSps_aO2Umsp2hEgK7UxwcXLiZR/view?usp=sharing

Here is the code I am using:

    import pandas as pd
    import numpy as np
    from sklearn.feature_selection import SelectFromModel
    from sklearn.model_selection import train_test_split
    # Feature Scaling
    from sklearn.preprocessing import StandardScaler

    SeparateProjectLearning=False
    CompleteCallersCallees=False
    PartialTrainingSetCompleteCallersCallees=True

    def main():
        X_train={}
        X_test={}
        y_train={}
        y_test={}
        dataset = pd.read_csv( 'InputData.txt', sep= ',', index_col=False)

        #convert T into 1 and N into 0
        dataset['gold'] = dataset['gold'].astype('category').cat.codes
        dataset['Program'] = dataset['Program'].astype('category').cat.codes
        dataset['classGold'] = dataset['classGold'].astype('category').cat.codes
        dataset['MethodType'] = dataset['MethodType'].astype('category').cat.codes
        dataset['CallersT'] = dataset['CallersT'].astype('category').cat.codes
        dataset['CallersN'] = dataset['CallersN'].astype('category').cat.codes
        dataset['CallersU'] = dataset['CallersU'].astype('category').cat.codes
        dataset['CallersCallersT'] = dataset['CallersCallersT'].astype('category').cat.codes
        dataset['CallersCallersN'] = dataset['CallersCallersN'].astype('category').cat.codes
        dataset['CallersCallersU'] = dataset['CallersCallersU'].astype('category').cat.codes
        dataset['CalleesT'] = dataset['CalleesT'].astype('category').cat.codes
        dataset['CalleesN'] = dataset['CalleesN'].astype('category').cat.codes
        dataset['CalleesU'] = dataset['CalleesU'].astype('category').cat.codes
        dataset['CalleesCalleesT'] = dataset['CalleesCalleesT'].astype('category').cat.codes
        dataset['CalleesCalleesN'] = dataset['CalleesCalleesN'].astype('category').cat.codes
        dataset['CalleesCalleesU'] = dataset['CalleesCalleesU'].astype('category').cat.codes

        pd.set_option('display.max_columns', None)
        row_count, column_count = dataset.shape
        Xcol = dataset.iloc[:, 1:column_count]

        # keep only the rows with CompleteCallersCallees == 1
        TrainingSet=dataset.loc[dataset['CompleteCallersCallees'] == 1]
        print('TrainingSet',len(TrainingSet))
        # random 70% sample
        TrainingSet1=TrainingSet.sample(frac=0.7)
        # attempt to get the remaining 30%: this is the line that gives 34 rows instead of 681
        TrainingSet2=pd.concat([TrainingSet, TrainingSet1]).drop_duplicates(keep=False)
        print('TrainingSet2',len(TrainingSet2),'TrainingSet1',len(TrainingSet1))

    if __name__ == '__main__':
        main()
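
A likely explanation, offered as a sketch rather than a confirmed diagnosis of the full file: drop_duplicates(keep=False) compares row values, not row labels, and it drops every row whose values occur more than once. The sample data above contains many rows with identical values, so those rows are removed whether or not they were sampled, which would explain a count far below 681. One way around this is to take the complement by index label with DataFrame.drop. The toy DataFrame, the snake_case names and random_state below are illustrative assumptions, not part of the original script, and the approach assumes the index labels are unique (as they are with the default index from read_csv).

```python
import pandas as pd

# Toy stand-in for the filtered TrainingSet (hypothetical data):
# rows 0-3 have identical values, like the repeated rows in the sample input.
training_set = pd.DataFrame(
    {
        "gold": ["N", "N", "N", "N", "T", "N"],
        "CalleesT": ["High", "High", "High", "High", "Medium", "Low"],
        "CompleteCallersCallees": [1, 1, 1, 1, 1, 1],
    }
)

# random_state is only here to make the sketch reproducible.
sample = training_set.sample(frac=0.5, random_state=0)

# Approach from the question: rows are compared by value, so the four
# identical rows are all dropped, sampled or not.
by_values = pd.concat([training_set, sample]).drop_duplicates(keep=False)

# Label-based complement: drop exactly the index labels that were sampled.
remainder = training_set.drop(sample.index)

print("total", len(training_set), "sample", len(sample))
print("drop_duplicates", len(by_values), "drop by index", len(remainder))
```

Applied to the script above, this would correspond to TrainingSet2=TrainingSet.drop(TrainingSet1.index), which should give len(TrainingSet) - len(TrainingSet1) rows as long as the index labels of TrainingSet are unique.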

Alexanderjeanlourdes Asked on July 16, 2020 in Python.

0 Answer(s)
