Error when using pandas trying to retrieve column data for categorical_subset
I am trying to use scikit-learn alongside pandas and NumPy. When I run my code (shown below the error), I get the following traceback:
Traceback (most recent call last):
  File "C:/PycharmProjects/AISyiff/testingAi.py", line 129, in <module>
    categorical_subset = pd.get_dummies(categorical_subset[categorical_subset.columns.drop("protocol")])
  File "C:\PycharmProjects\AISyiff\venv\lib\site-packages\pandas\core\indexes\base.py", line 5018, in drop
    raise KeyError(f"{labels[mask]} not found in axis")
KeyError: "['protocol'] not found in axis"
Please let me know where I’ve made the mistake and what I can do to fix this!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn import preprocessing as preprocessing
from sklearn.metrics import accuracy_score
import matplotlib as mpl
mpl.use('TkAgg')
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

sns.set(style="white", context="talk")
mpl.rcParams['figure.dpi'] = 200

df = pd.read_csv("datasets_for_paper.csv", low_memory=False)
# firstPaint provides time info about page rendering, as does rumSpeedIndex (avg page render)
print(df.dtypes)
df["nodeId"] = df["nodeId"].astype(int)
df["numObj"] = df["numObj"].astype(int)
df["rumSpeedIndex"] = df['rumSpeedIndex'].astype(int)
df["pageLoadTime"] = df['pageLoadTime'].astype(int)
df["firstPaint"] = df['firstPaint'].astype(int)


# convert from name into pure string
def changeProtName(value):
    if value == 'H1s':
        return str('Hs')
    else:
        return str('Hl')


df['protocol'] = df['protocol'].map(lambda x: changeProtName(x))

# hot encode categories as categorical data
df['protocol'] = pd.Categorical(df["protocol"])
df['browser'] = pd.Categorical(df['browser'])
df['nodeType'] = pd.Categorical(df['nodeType'])
df['url'] = pd.Categorical(df['url'])


# list a bunch of details about categorical data
def summerize_data(df1):
    for column in df1.columns:
        print(column)
        if df.dtypes[column] == np.object:
            print(df1[column].value_counts())
        else:
            print(df1[column].describe())
        print('\n')


summerize_data(df)


def hotEncodingCats(df1):
    results = df1.copy()
    encoders = {}
    for column in results.columns:
        encoders[column] = preprocessing.LabelEncoder()
        results[column] = encoders[column].fit_transform(results[column])
    return results, encoders


print(df.dtypes)
encoded_data, _ = hotEncodingCats(df)
sns.heatmap(encoded_data.corr(), square=True)
encoded_data.tail(5)

encoded_data, encoders = hotEncodingCats(df)
new_series = encoded_data["protocol"]
X_train, X_test, y_train, y_test = train_test_split(
    encoded_data[encoded_data.columns.drop("protocol")], new_series, train_size=0.70)

scaler = preprocessing.StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = scaler.transform(X_test)

cls = linear_model.LogisticRegression()
cls.fit(X_train, y_train)
y_pred = cls.predict(X_test)
print(df.dtypes)
print(accuracy_score(y_test, y_pred))
print(df.dtypes)
print("cookieprint")


def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))


print("cookie3")


def fit_and_evaluate(model):
    # Train the model
    model.fit(X_train, y_train)
    # Make predictions and evaluate
    model_pred = model.predict(X_test)
    model_mae = mae(y_test, model_pred)
    # Return the performance metric
    return model_mae


print(fit_and_evaluate(cls))
print("cookie1")

random_forest = RandomForestRegressor(random_state=60)
coefs = pd.Series(cls.coef_[0], index=X_train.columns)
print(X_train.columns)
print("cookie2")
coefs = coefs.sort_values()
plt.subplot(1, 1, 1)
plt.figure(figsize=(10, 10))
coefs.plot(kind="bar", alpha=0.4)
plt.show()
print(coefs.sort_values(ascending=False))

features = df.copy()
numeric_subset = df.select_dtypes('number')
categorical_subset = df.select_dtypes('object')
categorical_subset = pd.get_dummies(categorical_subset[categorical_subset.columns.drop("protocol")])
features = pd.concat([numeric_subset, categorical_subset], axis=1)
print(features.head())
I was able to reproduce your problem like this:
>>> df = pd.DataFrame()
>>> df['protocol'] = pd.Categorical(['A', 'B', 'C', 'D', 'A'])
>>> df.select_dtypes('object')
Empty DataFrame
Columns: []
You can see that the line

categorical_subset = df.select_dtypes('object')

is probably returning an empty DataFrame (when in doubt, it is worth checking that categorical_subset actually contains what you expect it to contain). This is because when you re-assigned df['protocol'], which originally contained strings, to a pd.Categorical, its dtype (as well as those of the other categorical columns) is no longer object, but rather category:
>>> df.dtypes
protocol    category
dtype: object
(This output looks a little confusing: it says the dtype of protocol is category, but underneath it says dtype: object. The return value of DataFrame.dtypes is actually a Series, indexed by column name, whose values are the dtypes, so the deceptive dtype: object at the bottom refers to the dtype of that Series itself, not of any column.)
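You can verify this directly, continuing from the small example above:

>>> type(df.dtypes)
<class 'pandas.core.series.Series'>
>>> df.dtypes.dtype
dtype('O')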
This is probably what you actually wanted:
>>> df.select_dtypes('category')
  protocol
0        A
1        B
2        C
3        D
4        A
In fact, the docs for select_dtypes say: "To select Pandas categorical dtypes, use 'category'".
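Applied to your script, that means selecting the category columns instead (or both object and category, in case some columns are still plain strings). I have not run this against your data, but something along these lines should get past the KeyError:

numeric_subset = df.select_dtypes('number')
categorical_subset = df.select_dtypes('category')   # or df.select_dtypes(include=['object', 'category'])
categorical_subset = pd.get_dummies(categorical_subset[categorical_subset.columns.drop("protocol")])
features = pd.concat([numeric_subset, categorical_subset], axis=1)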
All of this is also a good example of how to create a Minimal, Reproducible Example, and in general of how to debug small programs. We first zeroed in on the problem area, the line

categorical_subset.columns.drop("protocol")

where pandas apparently thinks there is no column called 'protocol'. Then we worked backwards to how categorical_subset was created (we called df.select_dtypes('object') on our original dataframe). And beyond that, all we need is an example dataframe that has some pd.Categorical columns.
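For completeness, here is the whole reproduction as a standalone snippet that raises the same KeyError; the data is made up, only the dtype matters:

import pandas as pd

df = pd.DataFrame()
df['protocol'] = pd.Categorical(['A', 'B', 'C', 'D', 'A'])

categorical_subset = df.select_dtypes('object')   # empty: 'protocol' is category, not object
pd.get_dummies(categorical_subset[categorical_subset.columns.drop("protocol")])
# KeyError: "['protocol'] not found in axis"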