Why should LabelEncoder from sklearn be used only for the target variable?

I was trying to create a pipeline with a LabelEncoder to transform categorical values.

    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import LabelEncoder
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    cat_variable = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('lencoder', LabelEncoder())
    ])

    num_variable = SimpleImputer(strategy='mean')

    preprocess = ColumnTransformer(transformers=[
        ('categorical', cat_variable, cat_columns),
        ('numerical', num_variable, num_columns)
    ])

    model = RandomForestRegressor(n_estimators=100, random_state=0)

    final_pipe = Pipeline(steps=[
        ('preprocessor', preprocess),
        ('model', model)
    ])

    scores = -1 * cross_val_score(final_pipe, X_train, y, cv=5,
                                  scoring='neg_mean_absolute_error')

But this is throwing a TypeError:

 TypeError: fit_transform() takes 2 positional arguments but 3 were given  

On further reading, I found that transformers like LabelEncoder are not supposed to be used on features and should only be used on the prediction target.

From the documentation:

class sklearn.preprocessing.LabelEncoder

Encode target labels with value between 0 and n_classes-1.

This transformer should be used to encode target values, i.e. y, and not the input X.

My question is: why can't we use LabelEncoder on feature variables, and are there any other transformers with a restriction like this?

1 Answer

LabelEncoder can be used to normalize labels or to transform non-numerical labels into numerical ones, but it is designed to operate on a single 1-D array: its fit_transform signature is fit_transform(y). A Pipeline, by contrast, calls fit_transform(X, y) on each intermediate step, so passing LabelEncoder the extra argument is exactly what raises the TypeError you saw. For categorical input features you should use OneHotEncoder instead.
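
You can reproduce the mismatch directly. A minimal sketch (the arrays here are made up for illustration):

    import numpy as np
    from sklearn.preprocessing import LabelEncoder

    X = np.array([['a'], ['b'], ['a']])  # hypothetical feature column
    y = np.array(['yes', 'no', 'yes'])   # hypothetical target

    le = LabelEncoder()
    le.fit_transform(y)     # fine: a single array, as intended for targets
    le.fit_transform(X, y)  # TypeError: fit_transform() takes 2 positional
                            # arguments but 3 were given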

The difference:

    from sklearn import preprocessing
    from sklearn.preprocessing import OneHotEncoder

    le = preprocessing.LabelEncoder()
    le.fit_transform([1, 2, 2, 6])
    # array([0, 1, 1, 2])

    enc = OneHotEncoder(handle_unknown='ignore')
    enc.fit_transform([[1], [2], [2], [6]]).toarray()
    # array([[1., 0., 0.],
    #        [0., 1., 0.],
    #        [0., 1., 0.],
    #        [0., 0., 1.]])

LabelEncoder flattens the categories into a single integer column, while OneHotEncoder expands each category into its own binary column.
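
Applied to the pipeline from the question, swapping LabelEncoder for OneHotEncoder is enough to make cross-validation run. A minimal sketch, assuming cat_columns, num_columns, X_train, and y are defined as in the question:

    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    # OneHotEncoder implements fit_transform(X, y=None), so it composes
    # with Pipeline and ColumnTransformer, unlike LabelEncoder.
    cat_variable = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('ohencoder', OneHotEncoder(handle_unknown='ignore'))
    ])

    num_variable = SimpleImputer(strategy='mean')

    preprocess = ColumnTransformer(transformers=[
        ('categorical', cat_variable, cat_columns),
        ('numerical', num_variable, num_columns)
    ])

    model = RandomForestRegressor(n_estimators=100, random_state=0)

    final_pipe = Pipeline(steps=[
        ('preprocessor', preprocess),
        ('model', model)
    ])

    scores = -1 * cross_val_score(final_pipe, X_train, y, cv=5,
                                  scoring='neg_mean_absolute_error')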
