Why should LabelEncoder from sklearn be used only for the target variable?
I was trying to create a pipeline with a LabelEncoder to transform categorical values.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Impute missing categorical values, then encode them
cat_variable = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('lencoder', LabelEncoder())
])
num_variable = SimpleImputer(strategy='mean')

preprocess = ColumnTransformer(transformers=[
    ('categorical', cat_variable, cat_columns),
    ('numerical', num_variable, num_columns)
])

model = RandomForestRegressor(n_estimators=100, random_state=0)
final_pipe = Pipeline(steps=[
    ('preprocessor', preprocess),
    ('model', model)
])

scores = -1 * cross_val_score(final_pipe, X_train, y, cv=5,
                              scoring='neg_mean_absolute_error')
But this is throwing a TypeError:
TypeError: fit_transform() takes 2 positional arguments but 3 were given
On further reading, I found out that transformers like LabelEncoder are not supposed to be used on features and should only be used on the prediction target. The documentation says:
class sklearn.preprocessing.LabelEncoder
Encode target labels with value between 0 and n_classes-1.
This transformer should be used to encode target values, i.e. y, and not the input X.
My question is: why can't we use LabelEncoder on feature variables, and are there any other transformers with a similar restriction?
LabelEncoder is meant to normalize target labels or to convert non-numerical labels to numerical ones. Because it only ever sees the target, its signature is fit_transform(self, y). A Pipeline, however, calls fit_transform(X, y) on every intermediate step, which passes three positional arguments (including self) and raises exactly the TypeError you got. For categorical input features you should use OneHotEncoder instead.
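A minimal sketch of the mismatch (the toy X and y here are hypothetical, just to trigger the call):

from sklearn.preprocessing import LabelEncoder

X = [['a'], ['b'], ['b']]
y = [0, 1, 1]

le = LabelEncoder()
le.fit_transform(y)     # fine: takes only y -> array([0, 1, 1])
le.fit_transform(X, y)  # TypeError: fit_transform() takes 2 positional arguments but 3 were given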
The difference:
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder

le = preprocessing.LabelEncoder()
le.fit_transform([1, 2, 2, 6])
# array([0, 1, 1, 2])

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit_transform([[1], [2], [2], [6]]).toarray()
# array([[1., 0., 0.],
#        [0., 1., 0.],
#        [0., 1., 0.],
#        [0., 0., 1.]])
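So a fixed version of your pipeline swaps LabelEncoder for OneHotEncoder. This is a sketch assuming the same cat_columns, num_columns, X_train, and y from your question:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# OneHotEncoder accepts 2-D feature input, so it works inside a Pipeline
cat_variable = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
num_variable = SimpleImputer(strategy='mean')

preprocess = ColumnTransformer(transformers=[
    ('categorical', cat_variable, cat_columns),
    ('numerical', num_variable, num_columns)
])

model = RandomForestRegressor(n_estimators=100, random_state=0)
final_pipe = Pipeline(steps=[
    ('preprocessor', preprocess),
    ('model', model)
])

scores = -1 * cross_val_score(final_pipe, X_train, y, cv=5,
                              scoring='neg_mean_absolute_error')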