Error when using pandas trying to retrieve column data for categorical_subset
I am trying to use scikit-learn alongside pandas and NumPy. When I run my code (shown below the error), I get the following traceback:
Traceback (most recent call last):
  File "C:/PycharmProjects/AISyiff/testingAi.py", line 129, in <module>
    categorical_subset = pd.get_dummies(categorical_subset[categorical_subset.columns.drop("protocol")])
  File "C:\PycharmProjects\AISyiff\venv\lib\site-packages\pandas\core\indexes\base.py", line 5018, in drop
    raise KeyError(f"{labels[mask]} not found in axis")
KeyError: "['protocol'] not found in axis"
Please let me know where I’ve made the mistake and what I can do to fix this!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn import preprocessing as preprocessing
from sklearn.metrics import accuracy_score
import matplotlib as mpl
mpl.use('TkAgg')
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

sns.set(style="white", context="talk")
mpl.rcParams['figure.dpi'] = 200

df = pd.read_csv("datasets_for_paper.csv", low_memory=False)
# firstPaint provides time info about page rendering, as does rumSpeedIndex (avg page render)
print(df.dtypes)
df["nodeId"] = df["nodeId"].astype(int)
df["numObj"] = df["numObj"].astype(int)
df["rumSpeedIndex"] = df['rumSpeedIndex'].astype(int)
df["pageLoadTime"] = df['pageLoadTime'].astype(int)
df["firstPaint"] = df['firstPaint'].astype(int)


# convert from name into pure string
def changeProtName(value):
    if value == 'H1s':
        return str('Hs')
    else:
        return str('Hl')


df['protocol'] = df['protocol'].map(lambda x: changeProtName(x))

# hot encode categories as categorical data
df['protocol'] = pd.Categorical(df["protocol"])
df['browser'] = pd.Categorical(df['browser'])
df['nodeType'] = pd.Categorical(df['nodeType'])
df['url'] = pd.Categorical(df['url'])


# list a bunch of details about categorical data
def summerize_data(df1):
    for column in df1.columns:
        print(column)
        if df.dtypes[column] == np.object:
            print(df1[column].value_counts())
        else:
            print(df1[column].describe())
        print('\n')


summerize_data(df)


def hotEncodingCats(df1):
    results = df1.copy()
    encoders = {}
    for column in results.columns:
        encoders[column] = preprocessing.LabelEncoder()
        results[column] = encoders[column].fit_transform(results[column])
    return results, encoders


print(df.dtypes)
encoded_data, _ = hotEncodingCats(df)
sns.heatmap(encoded_data.corr(), square=True)
encoded_data.tail(5)

encoded_data, encoders = hotEncodingCats(df)
new_series = encoded_data["protocol"]
X_train, X_test, y_train, y_test = train_test_split(
    encoded_data[encoded_data.columns.drop("protocol")], new_series, train_size=0.70)

scaler = preprocessing.StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = scaler.transform(X_test)

cls = linear_model.LogisticRegression()
cls.fit(X_train, y_train)
y_pred = cls.predict(X_test)
print(df.dtypes)
print(accuracy_score(y_test, y_pred))
print(df.dtypes)
print("cookieprint")


def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))


print("cookie3")


def fit_and_evaluate(model):
    # Train the model
    model.fit(X_train, y_train)
    # Make predictions and evaluate
    model_pred = model.predict(X_test)
    model_mae = mae(y_test, model_pred)
    # Return the performance metric
    return model_mae


print(fit_and_evaluate(cls))
print("cookie1")

random_forest = RandomForestRegressor(random_state=60)
coefs = pd.Series(cls.coef_[0], index=X_train.columns)
print(X_train.columns)
print("cookie2")
coefs = coefs.sort_values()
plt.subplot(1, 1, 1)
plt.figure(figsize=(10, 10))
coefs.plot(kind="bar", alpha=0.4)
plt.show()
print(coefs.sort_values(ascending=False))

features = df.copy()
numeric_subset = df.select_dtypes('number')
categorical_subset = df.select_dtypes('object')
categorical_subset = pd.get_dummies(categorical_subset[categorical_subset.columns.drop("protocol")])
features = pd.concat([numeric_subset, categorical_subset], axis=1)
print(features.head())
I was able to reproduce your problem like this:
>>> df = pd.DataFrame()
>>> df['protocol'] = pd.Categorical(['A', 'B', 'C', 'D', 'A'])
>>> df.select_dtypes('object')
Empty DataFrame
Columns: []
You can see that the line

categorical_subset = df.select_dtypes('object')

is probably returning an empty DataFrame (when in doubt, it is worth checking that categorical_subset actually contains what you expect it to contain). This is because when you re-assigned df['protocol'], which originally contained strings, to a pd.Categorical, its dtype (as well as those of the other categorical columns) is no longer object, but rather category:
>>> df.dtypes
protocol    category
dtype: object
(This output looks a little confusing: it says the dtype of protocol is category, but underneath it says dtype: object. The return value of DataFrame.dtypes is actually a Series, indexed by column name, whose values are the dtypes, so the deceptive dtype: object at the bottom refers to the dtype of that Series itself, not of any column.)
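You can verify this directly, continuing from the small example above:

>>> type(df.dtypes)
<class 'pandas.core.series.Series'>
>>> df.dtypes.dtype
dtype('O')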
This is probably what you actually wanted:
>>> df.select_dtypes('category')
  protocol
0        A
1        B
2        C
3        D
4        A
In fact, the docs for select_dtypes say: "To select Pandas categorical dtypes, use 'category'".
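Applied to your script, that means selecting the category columns instead (or both object and category, in case some columns are still plain strings). I have not run this against your data, but something along these lines should get past the KeyError:

numeric_subset = df.select_dtypes('number')
categorical_subset = df.select_dtypes('category')   # or df.select_dtypes(include=['object', 'category'])
categorical_subset = pd.get_dummies(categorical_subset[categorical_subset.columns.drop("protocol")])
features = pd.concat([numeric_subset, categorical_subset], axis=1)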
All of this is also a good example of how to create a Minimal, Reproducible Example, and in general of how to debug small programs. We first zeroed in on the problem area, the line

categorical_subset.columns.drop("protocol")

where pandas apparently thinks there is no column called 'protocol'. Then we worked backwards to how categorical_subset was created (we called df.select_dtypes('object') on our original dataframe). And beyond that, all we need is an example dataframe that has some pd.Categorical columns.
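For completeness, here is the whole reproduction as a standalone snippet that raises the same KeyError; the data is made up, only the dtype matters:

import pandas as pd

df = pd.DataFrame()
df['protocol'] = pd.Categorical(['A', 'B', 'C', 'D', 'A'])

categorical_subset = df.select_dtypes('object')   # empty: 'protocol' is category, not object
pd.get_dummies(categorical_subset[categorical_subset.columns.drop("protocol")])
# KeyError: "['protocol'] not found in axis"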