Adjusting the number of features for TF-IDF/logistic regression sentiment analysis
I’m doing a sentiment analysis project on a Twitter dataset. I used TF-IDF feature extraction and a logistic regression model for classification. So far I’ve trained the model with the following:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def get_tfidf_features(train_fit, ngrams=(1, 1)):
    vector = TfidfVectorizer(ngram_range=ngrams, sublinear_tf=True)
    vector.fit(train_fit)
    return vector

tf_vector = get_tfidf_features(traintest['text'])
X = tf_vector.transform(traintest['text'])
y = traintest['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=42)

LR_model = LogisticRegression(solver='lbfgs')
LR_model.fit(X_train, y_train)
y_predict_lr = LR_model.predict(X_test)
```
This logistic regression model was trained on a dataset of about 1.5 million tweets. I have a set of about 1.7 million tweets, `df_april`, that I'm trying to run this sentiment analysis model on. On my first attempt, I extracted the features as follows:
```python
tfidf = TfidfVectorizer(ngram_range=unigrams, max_features=None, sublinear_tf=True)
X_april = tfidf.fit_transform(df_april['text'].values.astype('U'))
```
My first thought was to just call `predict` on `X_april`, but this gives me an error:
```python
y_predict_april = LR_model.predict(X_april)
```

```
ValueError: X has 208976 features per sample; expecting 271794
```
This made sense to me, since the shapes of the two feature matrices are different:
```python
X.shape        # (1578614, 271794)
X_april.shape  # (1705758, 208976)
```
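For context, here is a minimal, self-contained reproduction of the mismatch (the mini-corpora are made up for illustration): two `TfidfVectorizer` instances fitted on different corpora learn different vocabularies, so their output matrices have different numbers of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpora standing in for the training tweets and df_april.
train_texts = ["good movie", "bad movie", "great film"]
april_texts = ["terrible plot", "good acting"]

# Each fit_transform learns its own vocabulary from the corpus it sees,
# so the two matrices end up with different column counts.
X_a = TfidfVectorizer().fit_transform(train_texts)
X_b = TfidfVectorizer().fit_transform(april_texts)
print(X_a.shape[1], X_b.shape[1])  # 5 4 — the feature spaces don't line up
```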
So I know I need to somehow adjust the number of features so that `X_april` matches `X` before I can call `predict` on `X_april`. My attempt to do this was:
```python
x = pd.DataFrame.sparse.from_spmatrix(X)
x_april = pd.DataFrame.sparse.from_spmatrix(X_april)
not_existing_cols = [c for c in x.columns.tolist() if c not in x_april]
x_april = x_april.reindex(x_april.columns.tolist() + not_existing_cols, axis=1)
x_april = x_april[x.columns.tolist()]
```
I’m working in a Jupyter notebook, and this code results in a dead kernel every time I’ve tried it. How can I adjust the features so that I can call the logistic regression model?
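Is reusing the original fitted vectorizer's `transform` (rather than calling `fit_transform` again) the right approach? A self-contained toy sketch of what I mean, with made-up data standing in for my real tweets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up stand-ins for the real tweets and labels.
train_texts = ["i love this", "i hate this", "great day", "awful day"]
train_labels = [1, 0, 1, 0]
new_texts = ["what a great day", "i hate mondays"]

# Fit the vectorizer once, on the training corpus only.
tf_vector = TfidfVectorizer(sublinear_tf=True)
X = tf_vector.fit_transform(train_texts)

LR_model = LogisticRegression(solver='lbfgs')
LR_model.fit(X, train_labels)

# transform() (not fit_transform) maps the new texts into the SAME
# feature space; out-of-vocabulary words like "mondays" are dropped.
X_new = tf_vector.transform(new_texts)
print(X.shape[1] == X_new.shape[1])  # True
y_predict_new = LR_model.predict(X_new)
```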