I am trying out the code from this page. I ran it up to the section LR (tf-idf) and got similar results.
After that, I decided to try GridSearchCV.
1)
#lets try gridsearchcv
#https://www.kaggle.com/enespolat/grid-search-with-logistic-regression
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
grid={"C":np.logspace(-3,3,7), "penalty":["l2"]}# l1 lasso l2 ridge
logreg=LogisticRegression(solver = 'liblinear')
logreg_cv=GridSearchCV(logreg,grid,cv=3,scoring='f1')
logreg_cv.fit(X_train_vectors_tfidf, y_train)
print("tuned hpyerparameters :(best parameters) ",logreg_cv.best_params_)
print("best score :",logreg_cv.best_score_)
#tuned hpyerparameters :(best parameters) {'C': 10.0, 'penalty': 'l2'}
#best score : 0.7390325593588823
Then I computed the F1 score manually. Why does it not match?
logreg_cv.predict_proba(X_train_vectors_tfidf)[:,1]
final_prediction=np.where(logreg_cv.predict_proba(X_train_vectors_tfidf)[:,1]>=0.5,1,0)
#https://www.statology.org/f1-score-in-python/
from sklearn.metrics import f1_score
#calculate F1 score
f1_score(y_train, final_prediction)
0.9839388145315489
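If I try scoring='precision', why does it give the following error? This is unclear to me mainly because my dataset is relatively balanced (55-45 %) and f1, which requires precision, is computed without any problem.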
#lets try gridsearchcv #https://www.kaggle.com/enespolat/grid-search-with-logistic-regression
from sklearn.model_selection import GridSearchCV
grid={"C":np.logspace(-3,3,7), "penalty":["l2"]}# l1 lasso l2 ridge
logreg=LogisticRegression(solver = 'liblinear')
logreg_cv=GridSearchCV(logreg,grid,cv=3,scoring='precision')
logreg_cv.fit(X_train_vectors_tfidf, y_train)
print("tuned hpyerparameters :(best parameters) ",logreg_cv.best_params_)
print("best score :",logreg_cv.best_score_)
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
tuned hpyerparameters :(best parameters) {'C': 0.1, 'penalty': 'l2'}
best score : 0.9474200393672962
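Is there a simpler way to get predictions on the training data? We already have the logreg_cv object. I used the method below to get the predictions. Is there a better way to do the same thing?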
logreg_cv.predict_proba(X_train_vectors_tfidf)[:,1]
#######################
############# Update 1
Regarding question 1 above, a comment on the question says: "The best score in GridSearchCV is calculated by taking the average score from cross validation for the best estimators. That is, it is calculated from data that is held out during fitting. From what I can tell, you are calculating predicted values from the training data and calculating an F1 score on that. Since the model was trained on that data, that is why the F1 score is so much larger compared to the results in the grid search."
This is the reason the grid search gave me the following result:
#tuned hpyerparameters :(best parameters) {'C': 10.0, 'penalty': 'l2'} #best score : 0.7390325593588823
whereas the manual calculation gave f1_score(y_train, final_prediction) 0.9839388145315489
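The point made in that comment can be checked directly. The sketch below is only an illustration under assumptions: it uses synthetic data from make_classification in place of X_train_vectors_tfidf and y_train (which are not shown here), and the names X, y, search, cv_mean and train_f1 are hypothetical. It suggests that best_score_ is the mean of the held-out fold scores for the best parameters, while an F1 score computed on the training rows comes from a model that has already seen those rows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the tf-idf features (assumption, not the original data)
X, y = make_classification(n_samples=600, n_features=40, n_informative=5,
                           weights=[0.55, 0.45], random_state=0)

grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l2"]}
search = GridSearchCV(LogisticRegression(solver="liblinear"), grid,
                      cv=3, scoring="f1")
search.fit(X, y)

# Re-run 3-fold cross validation for the best parameters only:
# the mean of the held-out fold scores is what best_score_ reports
best_model = LogisticRegression(solver="liblinear", **search.best_params_)
cv_mean = cross_val_score(best_model, X, y, cv=3, scoring="f1").mean()

# F1 on the same rows the refitted model was trained on (typically optimistic)
train_f1 = f1_score(y, search.predict(X))

print("best_score_      :", search.best_score_)
print("mean CV F1       :", cv_mean)
print("F1 on train data :", train_f1)
```

To compare like with like, the held-out score from the search should be set against a score on data the model has not seen, for example a separate test set.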
2)
I tried tuning with f1_micro, as suggested in the answer below. There is no error message. It is still not clear to me why f1_micro does not fail when precision fails.
from sklearn.model_selection import GridSearchCV
grid={"C":np.logspace(-3,3,7), "penalty":["l2"], "solver":['liblinear','newton-cg'], 'class_weight':[{ 0:0.95, 1:0.05 }, { 0:0.55, 1:0.45 }, { 0:0.45, 1:0.55 },{ 0:0.05, 1:0.95 }]}# l1 lasso l2 ridge
#logreg=LogisticRegression(solver = 'liblinear')
logreg=LogisticRegression()
logreg_cv=GridSearchCV(logreg,grid,cv=3,scoring='f1_micro')
logreg_cv.fit(X_train_vectors_tfidf, y_train)
tuned hpyerparameters :(best parameters) {'C': 10.0, 'class_weight': {0: 0.45, 1: 0.55}, 'penalty': 'l2', 'solver': 'newton-cg'}
best score : 0.7894909688013136

You end up with the precision warning because some of your penalties are too strong for this model. If you check the results, the score is 0 when C = 0.001 and C = 0.01:
res = pd.DataFrame(logreg_cv.cv_results_)
res.iloc[:,res.columns.str.contains("split[0-9]_test_score|params",regex=True)]
                           params  split0_test_score  split1_test_score  split2_test_score
0   {'C': 0.001, 'penalty': 'l2'}           0.000000           0.000000           0.000000
1    {'C': 0.01, 'penalty': 'l2'}           0.000000           0.000000           0.000000
2     {'C': 0.1, 'penalty': 'l2'}           0.973568           0.952607           0.952174
3     {'C': 1.0, 'penalty': 'l2'}           0.863934           0.851064           0.836449
4    {'C': 10.0, 'penalty': 'l2'}           0.811634           0.769547           0.787838
5   {'C': 100.0, 'penalty': 'l2'}           0.789826           0.762162           0.773438
6  {'C': 1000.0, 'penalty': 'l2'}           0.781003           0.750000           0.763871
You can check this:
lr = LogisticRegression(C=0.01).fit(X_train_vectors_tfidf,y_train)
np.unique(lr.predict(X_train_vectors_tfidf))
array([0])
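When a model predicts only class 0, precision for class 1 is TP / (TP + FP) with a zero denominator, which is what the warning reports. Micro-averaged F1 pools the counts over both classes, so for a single-label problem it equals accuracy and its denominator is never zero. The sketch below illustrates this with made-up labels; y_true and y_pred are hypothetical and are not taken from the data in the question.

```python
import warnings
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score

# Made-up labels: the "model" predicts class 0 for every sample
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.zeros_like(y_true)

# Default behaviour: emits UndefinedMetricWarning and returns 0.0
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    p_default = precision_score(y_true, y_pred)

# zero_division sets the returned value explicitly and silences the warning
p_silent = precision_score(y_true, y_pred, zero_division=0)

# Micro-averaged F1 counts hits and misses over both classes
f1_micro = f1_score(y_true, y_pred, average="micro")
acc = accuracy_score(y_true, y_pred)

print(p_default, p_silent, f1_micro, acc)
```

This is also why a grid search scored with f1_micro can rank a model that never predicts the positive class without any warning, so a quiet run does not mean every candidate is useful.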
The predicted probabilities drift towards the intercept:
# expected probability
np.exp(lr.intercept_)/(1+np.exp(lr.intercept_))
array([0.41764462])
lr.predict_proba(X_train_vectors_tfidf)
array([[0.58732636, 0.41267364],
       [0.57074279, 0.42925721],
       [0.57219143, 0.42780857],
       ...,
       [0.57215605, 0.42784395],
       [0.56988186, 0.43011814],
       [0.58966184, 0.41033816]])
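The same effect can be reproduced on synthetic data. This is a rough sketch under assumptions: make_classification stands in for the tf-idf matrix, C=1e-6 is an arbitrarily strong penalty chosen for illustration, and baseline and max_gap are hypothetical names. With the coefficients shrunk towards zero, every predicted probability sits close to the sigmoid of the intercept, so no sample crosses the 0.5 threshold unless the intercept alone does.

```python
import numpy as np
from scipy.special import expit  # logistic sigmoid, exp(b) / (1 + exp(b))
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption, not the data from the question)
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.55, 0.45], random_state=0)

# Very small C means a very strong L2 penalty on the coefficients
lr = LogisticRegression(C=1e-6).fit(X, y)

# Probability implied by the intercept alone
baseline = expit(lr.intercept_)[0]

# Largest distance between any predicted probability and that baseline
proba = lr.predict_proba(X)[:, 1]
max_gap = np.abs(proba - baseline).max()

print("sigmoid(intercept):", baseline)
print("largest deviation :", max_gap)
```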
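As for the question about getting predictions on the training data, I think this is the only way. The model is refitted on the whole training set using the best parameters, but the predictions or predicted probabilities are not stored. If you are looking for the values obtained during train / test, you can check cross_val_predict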
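A minimal sketch of both options, assuming synthetic data in place of X_train_vectors_tfidf and y_train; the names search, train_labels, train_proba and oof_proba are hypothetical. With the default refit=True, calling predict or predict_proba on the fitted GridSearchCV object delegates to best_estimator_, so nothing has to be rebuilt by hand. cross_val_predict instead returns out-of-fold predictions, where each row is predicted by a model that did not see it during fitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_predict

# Synthetic stand-in data (assumption, not the data from the question)
X, y = make_classification(n_samples=600, n_features=40, n_informative=5,
                           weights=[0.55, 0.45], random_state=0)

grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l2"]}
search = GridSearchCV(LogisticRegression(solver="liblinear"), grid,
                      cv=3, scoring="f1")
search.fit(X, y)

# Predictions from the model refitted on all of X with the best parameters
train_labels = search.predict(X)
train_proba = search.predict_proba(X)[:, 1]

# Out-of-fold probabilities: the estimator is cloned and refitted per fold
oof_proba = cross_val_predict(search.best_estimator_, X, y,
                              cv=3, method="predict_proba")[:, 1]

print(train_labels[:5], train_proba[:5], oof_proba[:5])
```

The out-of-fold values are the ones comparable to best_score_, since both come from data held out during fitting.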