Machine Learning PCA Dimensionality Reduction Worked Example (Feature Selection and Dimensionality Reduction in Practice)
"Feature selection is the process of selecting a subset of relevant features for use in model construction," or, in other words, choosing the most important features.
Normally, domain knowledge plays an important role and we can pick the features we believe matter most. For example, when predicting house prices, the number of bedrooms and the floor area are usually considered important. Unfortunately, in the Don't Overfit II competition (https://www.kaggle.com/c/dont-overfit-ii/data) domain knowledge cannot be used: we have a binary target and 300 anonymous continuous variables, which forces us to try feature selection techniques.
Introduction

Feature selection and dimensionality reduction are often used together. Although both reduce the number of features in a dataset, there is an important difference:
Feature selection simply selects and excludes given features without changing them.
Dimensionality reduction transforms the features into a lower dimension.
In this article we will explore the following feature selection and dimensionality reduction techniques:
Feature Selection
- Removing features with missing values
- Removing features with low variance
- Removing highly correlated features
- Univariate feature selection
- Recursive feature elimination
- Feature selection using SelectFromModel
Dimensionality Reduction
- PCA
Import the Required Python Libraries
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
Set Default Plotting Parameters
%matplotlib inline
plt.rcParams['figure.figsize'] = [20.0, 7.0]
plt.rcParams.update({'font.size': 22})
sns.set_palette('viridis')
sns.set_style('white')
sns.set_context('talk', font_scale=0.8)
Load the Dataset
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
print('Train Shape: ', train.shape)
print('Test Shape: ', test.shape)
train.head()
Train Shape: (250, 302)
Test Shape: (19750, 301)
Use seaborn's countplot to show the distribution of the target in the dataset:
fig, ax = plt.subplots()
g = sns.countplot(train.target, palette='viridis')
g.set_xticklabels(['0', '1'])
g.set_yticklabels([])

# function to show values on bars
def show_values_on_bars(axs):
    def _show_on_single_plot(ax):
        for p in ax.patches:
            _x = p.get_x() + p.get_width() / 2
            _y = p.get_y() + p.get_height()
            value = '{:.0f}'.format(p.get_height())
            ax.text(_x, _y, value, ha="center")
    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs):
            _show_on_single_plot(ax)
    else:
        _show_on_single_plot(axs)

show_values_on_bars(ax)
sns.despine(left=True, bottom=True)
plt.xlabel('')
plt.ylabel('')
plt.title('Distribution of Target', fontsize=30)
plt.tick_params(axis='x', which='major', labelsize=15)
plt.show()
Baseline Model

We will use logistic regression as our baseline model. We first separate the features from the target and scale the data:
# prepare for modeling
X_train_df = train.drop(['id', 'target'], axis=1)
y_train = train['target']
X_test = test.drop(['id'], axis=1)

# scaling data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_df)
X_test = scaler.transform(X_test)

lr = LogisticRegression(solver='liblinear')
rfc = RandomForestClassifier(n_estimators=100)

lr_scores = cross_val_score(lr, X_train, y_train, cv=5, scoring='roc_auc')
rfc_scores = cross_val_score(rfc, X_train, y_train, cv=5, scoring='roc_auc')
print('LR Scores: ', lr_scores)
print('RFC Scores: ', rfc_scores)
LR Scores: [0.80729167 0.71875 0.734375 0.80034722 0.66319444]
RFC Scores: [0.66753472 0.61371528 0.69618056 0.63715278 0.65104167]
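The article compares these fold scores with the later feature-selection runs; a small addition (not in the original code) that makes that comparison easier is to also print the fold means:

# mean AUC across the 5 folds, for easier comparison with later experiments
print('LR mean AUC: ', lr_scores.mean())
print('RFC mean AUC: ', rfc_scores.mean())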
Checking which features are the most important:
# checking which are the most important features
feature_importance = rfc.fit(X_train, y_train).feature_importances_

# Make importances relative to max importance.
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
sorted_idx = sorted_idx[-20:-1:1]
pos = np.arange(sorted_idx.shape[0]) + .5

plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X_train_df.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Feature Importance', fontsize=30)
plt.tick_params(axis='x', which='major', labelsize=15)
sns.despine(left=True, bottom=True)
plt.show()
The variation in the cross-validation scores shows that the model is overfitting. We can try to improve these scores with feature selection.
Removing Features with Missing Values

Checking for missing values is the first step in any machine learning problem. We can then remove columns whose share of missing values exceeds a threshold we define (a sketch of this appears below).
train.isnull().any().any()
False
The dataset has no missing values, so there are no features to remove in this step.
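If the dataset had contained missing values, the removal step described above could look like the following minimal sketch; the 50% cut-off is a hypothetical choice, not one from the original article:

# drop any column whose fraction of missing values exceeds a chosen threshold
missing_threshold = 0.5  # hypothetical cut-off
missing_fraction = train.isnull().mean()
cols_to_drop = missing_fraction[missing_fraction > missing_threshold].index
train_cleaned = train.drop(columns=cols_to_drop)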
Removing Features with Low Variance

In sklearn's feature selection module we find VarianceThreshold. It removes all features whose variance does not meet a given threshold. By default it removes features with zero variance, i.e. features that have the same value in every sample.
from sklearn import feature_selection

sel = feature_selection.VarianceThreshold()
train_variance = sel.fit_transform(train)
train_variance.shape
(250, 302)
We can see from the output above that no feature has the same value in every sample, so there is nothing to remove here.
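For completeness, a non-default threshold can also be set. The sketch below uses a hypothetical cut-off of 0.1, which is not from the article, just an illustration of the parameter:

# remove features whose variance falls below 0.1 (hypothetical threshold);
# applied to the unscaled data, since standardized features all have variance 1
sel_custom = feature_selection.VarianceThreshold(threshold=0.1)
train_low_variance = sel_custom.fit_transform(train)
print('Shape after thresholding: ', train_low_variance.shape)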
Removing Highly Correlated Features

Highly correlated or collinear features can lead to overfitting.
When a pair of variables is highly correlated, we can drop one of them to reduce the dimensionality without losing much information. Which one should we keep? The one with the higher correlation to the target (a sketch of this rule appears after the correlation check below).
Let's explore the correlations among our features:
# find correlations to target
corr_matrix = train.corr().abs()
print(corr_matrix['target'].sort_values(ascending=False).head(10))
Here we see the features that correlate most strongly with the target variable. Feature 33 has the highest correlation with the target, but at only 0.37 it is still a weak correlation.
We can also check the correlations between features. Below is a correlation matrix; it appears that none of our features are highly correlated with one another.
# Select upper triangle of correlation matrix
matrix = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
sns.heatmap(matrix)
plt.show()
Correlation matrix
Let's try removing features with a correlation value greater than 0.5:
# Find index of feature columns with high correlation
to_drop = [column for column in matrix.columns if any(matrix[column] > 0.50)]
print('Columns to drop: ', (len(to_drop)))
Columns to drop: 0
As the correlation matrix suggested, there are no highly correlated features to drop in this dataset; the strongest correlation is only 0.37.
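Although nothing qualifies here, the rule mentioned earlier (keep the feature of each highly correlated pair with the stronger relationship to the target) could be implemented roughly as follows. This is a sketch, not code from the original article:

# Hypothetical follow-up: for each flagged pair, drop the feature that is
# less correlated with the target (no pairs qualify in this dataset)
feature_cols = [c for c in matrix.columns if c not in ('id', 'target')]
to_drop = set()
for col in feature_cols:
    for row in feature_cols:
        if matrix.loc[row, col] > 0.50:  # NaNs in the lower triangle compare as False
            pair = [row, col]
            weaker = min(pair, key=lambda f: corr_matrix.loc[f, 'target'])
            to_drop.add(weaker)

train_reduced = train.drop(columns=list(to_drop))
print('Shape after dropping correlated features: ', train_reduced.shape)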
Univariate Feature Selection

Univariate feature selection selects the best features based on univariate statistical tests.
We can use sklearn's SelectKBest to choose a number of features to keep. This method uses statistical tests to select the features with the strongest relationship to the target. Here we will keep the top 100 features.
# feature extraction
k_best = feature_selection.SelectKBest(score_func=feature_selection.f_classif, k=100)
# fit on train set
fit = k_best.fit(X_train, y_train)
# transform train set
univariate_features = fit.transform(X_train)

# checking which are the most important features
feature_importance = rfc.fit(univariate_features, y_train).feature_importances_

# Make importances relative to max importance.
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
sorted_idx = sorted_idx[-20:-1:1]
pos = np.arange(sorted_idx.shape[0]) + .5

# map the indices back to the names of the columns kept by SelectKBest
selected_columns = X_train_df.columns[fit.get_support()]

plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, selected_columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Feature Importance', fontsize=30)
plt.tick_params(axis='x', which='major', labelsize=15)
sns.despine(left=True, bottom=True)
plt.show()
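The next paragraph refers to cross-validation scores for the selected features, but the scoring cell does not appear above; presumably it mirrors the baseline evaluation, along these lines:

# evaluate the 100 selected features with the same 5-fold ROC AUC setup as the baseline
lr = LogisticRegression(solver='liblinear')
rfc = RandomForestClassifier(n_estimators=100)
lr_scores = cross_val_score(lr, univariate_features, y_train, cv=5, scoring='roc_auc')
rfc_scores = cross_val_score(rfc, univariate_features, y_train, cv=5, scoring='roc_auc')
print('LR Scores: ', lr_scores)
print('RFC Scores: ', rfc_scores)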
The cross-validation scores are better than the baseline above, but we still see variation across the folds, which indicates overfitting.
Recursive Feature Elimination

Recursive feature elimination works by recursively removing the least important features. It recurses until the specified number of features is reached. It can be used with any model that assigns weights to features, either through coef_ or feature_importances_.
Here we will use recursive feature elimination with logistic regression to select the 100 best features:
# feature extraction
rfe = feature_selection.RFE(lr, n_features_to_select=100)
# fit on train set
fit = rfe.fit(X_train, y_train)
# transform train set
recursive_features = fit.transform(X_train)

lr = LogisticRegression(solver='liblinear')
rfc = RandomForestClassifier(n_estimators=10)

lr_scores = cross_val_score(lr, recursive_features, y_train, cv=5, scoring='roc_auc')
rfc_scores = cross_val_score(rfc, recursive_features, y_train, cv=5, scoring='roc_auc')
print('LR Scores: ', lr_scores)
print('RFC Scores: ', rfc_scores)
LR Scores: [0.99826389 0.99652778 0.984375 1. 0.99652778]
RFC Scores: [0.63368056 0.72569444 0.66666667 0.77430556 0.59895833]
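A caveat worth noting (not raised in the original article): the RFE selection above was fit on the full training set before cross-validation, so the very high LR scores partly reflect selection leakage. A leakage-free variant would wrap selection and model in a Pipeline so the selection is re-fit inside each fold; a sketch:

from sklearn.pipeline import Pipeline

# selecting features inside each CV fold avoids leaking information
# from the validation folds into the selection step
rfe_pipeline = Pipeline([
    ('rfe', feature_selection.RFE(LogisticRegression(solver='liblinear'),
                                  n_features_to_select=100)),
    ('lr', LogisticRegression(solver='liblinear')),
])
pipeline_scores = cross_val_score(rfe_pipeline, X_train, y_train, cv=5, scoring='roc_auc')
print('Pipeline LR Scores: ', pipeline_scores)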
# checking which are the most important features
feature_importance = rfc.fit(recursive_features, y_train).feature_importances_

# Make importances relative to max importance.
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
sorted_idx = sorted_idx[-20:-1:1]
pos = np.arange(sorted_idx.shape[0]) + .5

# map the indices back to the names of the columns kept by RFE
selected_columns = X_train_df.columns[fit.get_support()]

plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, selected_columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Feature Importance', fontsize=30)
plt.tick_params(axis='x', which='major', labelsize=15)
sns.despine(left=True, bottom=True)
plt.show()
Feature Selection Using SelectFromModel

Like recursive feature elimination, sklearn's SelectFromModel works with any estimator that has a coef_ or feature_importances_ attribute. It removes features whose importance falls below a set threshold.
# feature extraction
select_model = feature_selection.SelectFromModel(lr)
# fit on train set
fit = select_model.fit(X_train, y_train)
# transform train set
model_features = fit.transform(X_train)

lr = LogisticRegression(solver='liblinear')
rfc = RandomForestClassifier(n_estimators=100)

lr_scores = cross_val_score(lr, model_features, y_train, cv=5, scoring='roc_auc')
rfc_scores = cross_val_score(rfc, model_features, y_train, cv=5, scoring='roc_auc')
print('LR Scores: ', lr_scores)
print('RFC Scores: ', rfc_scores)
LR Scores: [0.984375 0.99479167 0.97222222 0.99305556 0.99305556]
RFC Scores: [0.70659722 0.80729167 0.76475694 0.84461806 0.77170139]
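It can also be useful to see how many features survived the default cut-off, and the threshold can be set explicitly. The 'median' value below is only an illustrative choice, not one used in the article:

# number of features kept by SelectFromModel with its default threshold
print('Features kept: ', model_features.shape[1])

# an explicit threshold, e.g. the median absolute coefficient (illustrative only)
select_model_median = feature_selection.SelectFromModel(lr, threshold='median')
median_features = select_model_median.fit_transform(X_train, y_train)
print('Features kept with a median threshold: ', median_features.shape[1])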
# checking which are the most important features
feature_importance = rfc.fit(model_features, y_train).feature_importances_

# Make importances relative to max importance.
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
sorted_idx = sorted_idx[-20:-1:1]
pos = np.arange(sorted_idx.shape[0]) + .5

# map the indices back to the names of the columns kept by SelectFromModel
selected_columns = X_train_df.columns[fit.get_support()]

plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, selected_columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Feature Importance', fontsize=30)
plt.tick_params(axis='x', which='major', labelsize=15)
sns.despine(left=True, bottom=True)
plt.show()
PCA

Principal component analysis (PCA) is a dimensionality reduction technique that projects the data into a lower-dimensional space. PCA is useful in many situations, particularly when features are highly collinear, but it is less suitable when the predictors need to remain interpretable, because each principal component is a linear combination of the original features.
Here we will use PCA, keeping 90% of the variance:
from sklearn.decomposition import PCA

# pca - keep 90% of variance
pca = PCA(0.90)
principal_components = pca.fit_transform(X_train)
principal_df = pd.DataFrame(data=principal_components)
principal_df.shape

(250, 139)
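A quick check (not part of the original code) of how many components were kept and how much variance they capture:

# number of principal components retained and the total variance they explain
print('Components kept: ', pca.n_components_)
print('Cumulative explained variance: ', pca.explained_variance_ratio_.sum())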
lr = LogisticRegression(solver='liblinear')
rfc = RandomForestClassifier(n_estimators=100)

lr_scores = cross_val_score(lr, principal_df, y_train, cv=5, scoring='roc_auc')
rfc_scores = cross_val_score(rfc, principal_df, y_train, cv=5, scoring='roc_auc')
print('LR Scores: ', lr_scores)
print('RFC Scores: ', rfc_scores)
LR Scores: [0.80902778 0.703125 0.734375 0.80555556 0.66145833]
RFC Scores: [0.60503472 0.703125 0.69878472 0.56597222 0.72916667]
# pca - keep 75% of variance
pca = PCA(0.75)
principal_components = pca.fit_transform(X_train)
principal_df = pd.DataFrame(data=principal_components)
principal_df.shape

(250, 93)

lr = LogisticRegression(solver='liblinear')
rfc = RandomForestClassifier(n_estimators=100)

lr_scores = cross_val_score(lr, principal_df, y_train, cv=5, scoring='roc_auc')
rfc_scores = cross_val_score(rfc, principal_df, y_train, cv=5, scoring='roc_auc')
print('LR Scores: ', lr_scores)
print('RFC Scores: ', rfc_scores)
LR Scores: [0.72048611 0.60069444 0.68402778 0.71006944 0.61284722]
RFC Scores: [0.61545139 0.71440972 0.57465278 0.59722222 0.640625 ]
# checking which are the most important features
feature_importance = rfc.fit(principal_df, y_train).feature_importances_

# Make importances relative to max importance.
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
sorted_idx = sorted_idx[-20:-1:1]
pos = np.arange(sorted_idx.shape[0]) + .5

plt.barh(pos, feature_importance[sorted_idx], align='center')
# label the bars with principal component indices (components no longer correspond to original feature names)
plt.yticks(pos, principal_df.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Feature Importance', fontsize=30)
plt.tick_params(axis='x', which='major', labelsize=15)
sns.despine(left=True, bottom=True)
plt.show()
Finally, we select 100 features with recursive feature elimination and fit a tuned logistic regression to generate predictions for submission:

# feature extraction
rfe = feature_selection.RFE(lr, n_features_to_select=100)
# fit on train set
fit = rfe.fit(X_train, y_train)
# transform train and test sets
recursive_X_train = fit.transform(X_train)
recursive_X_test = fit.transform(X_test)

lr = LogisticRegression(C=1, class_weight={1: 0.6, 0: 0.4}, penalty='l1', solver='liblinear')
lr_scores = cross_val_score(lr, recursive_X_train, y_train, cv=5, scoring='roc_auc')
lr_scores.mean()
0.9059027777777778
predictions = lr.fit(recursive_X_train, y_train).predict_proba(recursive_X_test)[:, 1]  # probability of the positive class
submission = pd.read_csv('../input/sample_submission.csv')
submission['target'] = predictions
submission.to_csv('submission.csv', index=False)
submission.head()
Conclusion

Feature selection is an important part of any machine learning process. In this article we explored several feature selection and dimensionality reduction methods that can help improve model performance.