Python scikit tutorial (Scientific Computing with Python: Tools)

威哥 2023-04-10 08:51:28


Required base libraries

numpy
scipy
matplotlib

Chapter contents [this section: Supervised learning: regression of housing data]

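Introduction: problem settings
Basic principles of machine learning with scikit-learn
Supervised learning: classification of handwritten digits
Supervised learning: regression of housing data
Measuring prediction performance
Unsupervised learning: dimensionality reduction and visualization
The eigenfaces example: chaining PCA and SVMs
Parameter selection, validation, and testing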

6.4. Supervised learning: regression of housing data

Here we work through a short example of a regression problem: learning a continuous value from a set of features.

6.4.1. A quick look at the data

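We will use the simple Boston house prices dataset that ships with scikit-learn. It records measurements of 13 attributes of housing markets around Boston, together with the median price. The question is: can you predict the price of a new market from its attributes?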

>>> # Note: load_boston was removed in scikit-learn 1.2; this example needs an older release.
>>> from sklearn.datasets import load_boston
>>> data = load_boston()
>>> print(data.data.shape)
(506, 13)
>>> print(data.target.shape)
(506,)

We can see that there are just over 500 data points.

The DESCR attribute holds a long description of the dataset:

>>> print(data.DESCR)
Boston House Prices dataset
===========================
Notes
------
Data Set Characteristics:
    :Number of Instances: 506
    :Number of Attributes: 13 numeric/categorical predictive
    :Median Value (attribute 14) is usually the target
    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's
...

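It often helps to quickly visualize pieces of the data using histograms, scatter plots, or other plot types. With matplotlib, let us show a histogram of the target values, that is, the median price in each neighborhood: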

>>> import matplotlib.pyplot as plt
>>> plt.hist(data.target)
(array([...
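The bin counts behind such a histogram can also be inspected numerically. The following is a minimal sketch, assuming a synthetic stand-in for the target values (the `prices` array is hypothetical, not the Boston data). It uses `np.histogram`, which returns bin counts and bin edges without opening a figure.

```python
import numpy as np

# Hypothetical stand-in for data.target: 506 synthetic "median prices".
rng = np.random.default_rng(0)
prices = rng.normal(loc=22.0, scale=9.0, size=506).clip(5.0, 50.0)

# np.histogram returns the per-bin counts and the bin edges;
# with bins=10 there are 10 counts and 11 edges.
counts, edges = np.histogram(prices, bins=10)

# Print each bin as "left edge - right edge: count".
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} - {right:6.2f}: {count}")
```

Printing the counts this way is handy in scripts or on remote machines where displaying a plot is inconvenient.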

Let us take a quick look to see whether some features are more relevant than others for our problem:

>>> for index, feature_name in enumerate(data.feature_names):
...     plt.figure()
...     plt.scatter(data.data[:, index], data.target)
<Figure size...
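A numeric companion to the scatter plots is the correlation between each feature column and the target. The sketch below is only an illustration on synthetic data: the array `X`, the names in `feature_names`, and the coefficients used to build `y` are all made up for the example. Pearson correlation only captures linear relationships, so it complements the visual check rather than replacing it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 506

# Hypothetical features: one strongly informative, one weak, one pure noise.
X = rng.normal(size=(n_samples, 3))
feature_names = ["informative", "weak", "noise"]
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n_samples)

# Pearson correlation between each feature column and the target.
correlations = {
    name: np.corrcoef(X[:, index], y)[0, 1]
    for index, name in enumerate(feature_names)
}

for name, r in correlations.items():
    print(f"{name:12s} r = {r:+.3f}")
```

A feature whose scatter plot shows a clear linear trend should have a correlation close to +1 or -1, while an uninformative feature should sit near 0.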

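In machine learning it is sometimes useful to apply feature selection to decide which features matter most for a particular problem. Automated methods exist that quantify this exercise of choosing the most informative features.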
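As one possible illustration of automated feature selection, the sketch below uses scikit-learn's `SelectKBest` with the `f_regression` score function. The data is synthetic and the choice of informative columns (2 and 7) is an assumption made for the example; the original text does not say which method it has in mind.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)

# Synthetic data with the same shape as the housing data: 506 samples, 13 features.
X = rng.normal(size=(506, 13))
# By construction, only columns 2 and 7 influence the target.
y = 4.0 * X[:, 2] - 3.0 * X[:, 7] + rng.normal(scale=0.5, size=506)

# Score each feature with a univariate F-test and keep the two best.
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X, y)

# Indices of the columns that were kept.
selected = selector.get_support(indices=True)
print("Selected feature indices:", selected)
```

Univariate scoring looks at each feature in isolation, so it can miss features that are only useful in combination with others.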

6.4.2. Predicting home prices: a simple linear regression

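Now we will use scikit-learn to perform a simple linear regression on the housing data. There are many regressors to choose from. A particularly simple one is LinearRegression, which is essentially a wrapper around an ordinary least squares calculation.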

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)
>>> from sklearn.linear_model import LinearRegression
>>> clf = LinearRegression()
>>> clf.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> predicted = clf.predict(X_test)
>>> expected = y_test
>>> print("RMS: %s" % np.sqrt(np.mean((predicted - expected) ** 2)))
RMS: 5.0059...
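To make the "ordinary least squares" idea concrete, here is a rough sketch of the same fit-predict-score steps using only NumPy. It is not scikit-learn's actual implementation: the data is synthetic, the names `true_coef` and `true_intercept` are hypothetical, and the train/test split is a plain slice rather than a random shuffle.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem with known coefficients (assumed for illustration).
X = rng.normal(size=(200, 3))
true_coef = np.array([1.5, -2.0, 0.5])
true_intercept = 4.0
y = X @ true_coef + true_intercept + rng.normal(scale=0.1, size=200)

# Hold out the last 50 rows for testing.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Append a column of ones so that the intercept is fitted as well.
A_train = np.column_stack([X_train, np.ones(len(X_train))])

# Ordinary least squares: minimize ||A_train @ w - y_train||^2.
solution, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
coef, intercept = solution[:-1], solution[-1]

# Predict on the held-out rows and compute the root-mean-square error.
predicted = X_test @ coef + intercept
rms = np.sqrt(np.mean((predicted - y_test) ** 2))

print("coef:", coef, "intercept:", intercept)
print("RMS: %s" % rms)
```

Because the noise added to `y` has a standard deviation of 0.1, the RMS error on the held-out rows should come out close to that value, and the fitted coefficients should be close to `true_coef`.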
