Kobe Bryant's Career Statistics (Analyzing Kobe's Twenty-Year Career in Python with 30,000 Rows of Data)

忙族 2023-04-11 14:48:19 509



Author | 高羊羊羊羊羊杨

Source | CSDN Blog

Header image | Paid download from Visual China (视觉中国)

Produced by | CSDN (ID: CSDNnews)

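A while ago, Lakers star Kobe Bryant died in a tragic accident. For countless fans the news was a bolt from the blue. In the end he could not defy fate, yet fate never managed to change him either, and the Mamba Mentality will live on. With the arrival of the big data era, it seems that anything can be connected to the words "big data". Big data analysis has long been widely used in athletes' career planning, medicine, finance, and other fields. In this article we use Python to analyze Kobe's career from multiple dimensions, as a tribute to the "Boss" (老大).

Background

It was the early morning of January 27, 2020. I could not sleep and had been tossing in bed until 4 a.m. I unlocked my phone, squinted at the glaring screen, and opened Weibo, only to find shocking news: Kobe Bryant had died in an accident. Normally I would report a post like that on the spot, since clickbait deserves it, but more and more similar posts kept appearing, until I saw the official announcement.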


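As my own post put it: I have never seen Los Angeles at 4 a.m., but at 4 a.m. I learned that you were gone. 1978-2020.

As fans, all we can do is grieve and remember. Do not spread rumors, and do not exploit the "Mamba Mentality".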


Getting the Data

Source: the data set covering Kobe Bryant's nearly twenty-year career, provided by the NBA (it is fairly large, about 30,000 rows).


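Data Processing

Browsing the file quickly shows that it contains many missing values. The crude approach is to delete every row that has a missing value, but that has to be weighed against the completeness of the sample and the accuracy of later predictions.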
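Before deciding whether to drop rows, it helps to see where the missing values actually are. The following is a minimal sketch on a made-up toy table (the values are invented for illustration; only the column names `shot_id`, `shot_distance` and `shot_made_flag` come from the data set). It counts the missing values per column and separates labelled rows from unlabelled ones instead of discarding everything with a gap.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the shot log; the values are invented for illustration.
shots = pd.DataFrame({
    'shot_id': [1, 2, 3, 4, 5],
    'shot_distance': [18, 15, 0, 22, 7],
    'shot_made_flag': [np.nan, 0.0, 1.0, np.nan, 1.0],
})

# How many values are missing in each column?
missing_per_column = shots.isnull().sum()

# Keep the rows that have a label, and set the unlabelled rows aside
# rather than throwing them away.
labelled = shots.dropna(subset=['shot_made_flag'])
unlabelled = shots[shots['shot_made_flag'].isnull()]

print(missing_per_column)
print(len(labelled), 'labelled rows;', len(unlabelled), 'unlabelled rows set aside')
```

Restricting `dropna` to the column that matters keeps rows whose other fields are complete, which is usually gentler than dropping on any missing value.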


First, we run a simple outlier check on shot distance, visualized as a box plot.

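# -*- coding: utf-8 -*-
import pandas as pd
import matplotlib.pyplot as plt  # plotting library

catering_sale = '2.csv'
data = pd.read_csv(catering_sale, index_col='shot_id')  # read the data, using the "shot_id" column as the index

plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly

plt.figure()  # create the figure
p = data[['shot_distance']].boxplot(return_type='dict')  # box plot of shot distance, using the DataFrame method
x = p['fliers'][0].get_xdata()  # 'fliers' holds the outlier points
y = sorted(p['fliers'][0].get_ydata())  # outlier values in ascending order
print('{} records in total, {} of them outliers'.format(len(data), len(y)))

# Label the outliers with annotate.
# Points that are close together produce overlapping labels, so each label is
# shifted horizontally by an amount that depends on the gap to the previous outlier.
for i in range(len(x)):
    gap = y[i] - y[i - 1] if i > 0 else 0
    if gap > 0:
        plt.annotate(str(y[i]), xy=(x[i], y[i]), xytext=(x[i] + 0.05 - 0.8 / gap, y[i]))
    else:
        plt.annotate(str(y[i]), xy=(x[i], y[i]), xytext=(x[i] + 0.08, y[i]))

plt.show()  # display the box plot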

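This gives the following result:

[Figure: box plot of shot distance with the outliers annotated]

By this check the column contains 68 outliers. Here we delete the rows that contain them; the other columns are handled in the same way.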
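The deletion step itself is not shown, so here is one way it could be done. This is a sketch under assumptions: it uses the same rule that matplotlib's box plot applies by default (points more than 1.5 times the interquartile range beyond the quartiles are drawn as fliers), the helper name `iqr_outlier_mask` is hypothetical, and the distances are invented toy values rather than the real column.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, whis: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - whis * iqr) | (values > q3 + whis * iqr)

# Invented toy distances; the last one is a heave from far beyond half court.
distances = pd.Series([0, 2, 5, 8, 12, 15, 16, 18, 20, 22, 25, 74], name='shot_distance')

mask = iqr_outlier_mask(distances)
cleaned = distances[~mask]  # keep only the rows that are not flagged

print('flagged:', distances[mask].tolist())
print('kept', len(cleaned), 'of', len(distances), 'rows')
```

On a full DataFrame the same mask can index the rows directly, for example `data[~iqr_outlier_mask(data['shot_distance'])]`.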


Data Integration

Load the data, then combine fields and add the new columns our analysis needs.

import pandas as pd

allData = pd.read_csv('data.csv')
# keep only the shots whose outcome is known
data = allData[allData['shot_made_flag'].notnull()].reset_index()

# add the new columns
data['game_date_DT'] = pd.to_datetime(data['game_date'])
data['dayOfWeek'] = data['game_date_DT'].dt.dayofweek
data['dayOfYear'] = data['game_date_DT'].dt.dayofyear
data['secondsFromPeriodEnd'] = 60 * data['minutes_remaining'] + data['seconds_remaining']
data['secondsFromPeriodStart'] = 60 * (11 - data['minutes_remaining']) + (60 - data['seconds_remaining'])
data['secondsFromGameStart'] = (data['period'] <= 4).astype(int) * (data['period'] - 1) * 12 * 60 + (
    data['period'] > 4).astype(int) * ((data['period'] - 4) * 5 * 60 + 3 * 12 * 60) + data['secondsFromPeriodStart']

# Where:
# secondsFromPeriodEnd    seconds left until the end of the period
# secondsFromPeriodStart  seconds elapsed since the start of the period
# secondsFromGameStart    seconds elapsed since the start of the game

# sanity-check the new column
print(data.loc[:10, ['period', 'minutes_remaining', 'seconds_remaining', 'secondsFromGameStart']])

Running it prints the first rows of these columns, and everything looks fine.
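To see why the output looks right, the time formula can be checked by hand on single values. The sketch below restates the column arithmetic as a scalar function (the name `seconds_from_game_start` is mine, and it reflects my reading of the formula above: 12-minute regular periods and 5-minute overtime periods).

```python
def seconds_from_game_start(period, minutes_remaining, seconds_remaining):
    """Scalar version of the secondsFromGameStart column."""
    # Same expression as the secondsFromPeriodStart column; it assumes a 12-minute period.
    seconds_from_period_start = 60 * (11 - minutes_remaining) + (60 - seconds_remaining)
    if period <= 4:
        offset = (period - 1) * 12 * 60
    else:
        # The overtime offset looks 7 minutes short, but the 12-minute assumption
        # above over-counts a 5-minute overtime by exactly 7 minutes, so the two cancel.
        offset = (period - 4) * 5 * 60 + 3 * 12 * 60
    return offset + seconds_from_period_start

print(seconds_from_game_start(1, 10, 27))  # 10:27 left in the 1st period
print(seconds_from_game_start(4, 0, 0))    # final buzzer of regulation
print(seconds_from_game_start(5, 4, 59))   # one second into the first overtime
```

With 10:27 left in the first period, 1:33 has elapsed, so the first call gives 93. Regulation ends at 4 x 12 x 60 = 2880 seconds, and one second into overtime is 2881.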


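Plotting Shot Attempts

We plot the number of shot attempts as a function of time, measured from the start of the game.

We use the matplotlib package here.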

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (16, 16)
plt.rcParams['font.size'] = 16
binsSizes = [24, 12, 6]  # one subplot per bin size (in seconds)
plt.figure()

for k, binSizeInSeconds in enumerate(binsSizes):
    timeBins = np.arange(0, 60 * (4 * 12 + 3 * 5), binSizeInSeconds) + 0.01
    attemptsAsFunctionOfTime, b = np.histogram(data['secondsFromGameStart'], bins=timeBins)

    maxHeight = max(attemptsAsFunctionOfTime) + 30
    barWidth = 0.999 * (timeBins[1] - timeBins[0])
    plt.subplot(len(binsSizes), 1, k + 1)
    plt.bar(timeBins[:-1], attemptsAsFunctionOfTime, align='edge', width=barWidth)
    plt.title(str(binSizeInSeconds) + ' second time bins')
    # red vertical lines mark the end of each period and of up to three overtimes
    plt.vlines(x=[0, 12 * 60, 2 * 12 * 60, 3 * 12 * 60, 4 * 12 * 60, 4 * 12 * 60 + 5 * 60, 4 * 12 * 60 + 2 * 5 * 60,
                  4 * 12 * 60 + 3 * 5 * 60], ymin=0, ymax=maxHeight, colors='r')
    plt.xlim((-20, 3200))
    plt.ylim((0, maxHeight))
    plt.ylabel('attempts')
plt.xlabel('time [seconds from start of game]')
plt.show()

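Here is the result:

[Figure: histograms of shot attempts over game time, with 24, 12 and 6 second bins]

The plots suggest that Kobe's shot attempts increase as the game goes on.

Plotting Accuracy Against Attempts

Here we make a comparison to judge how accurate Kobe's shooting is.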

# Plot shooting accuracy as a function of time during the game.
plt.rcParams['figure.figsize'] = (15, 10)
plt.rcParams['font.size'] = 16

binSizeInSeconds = 20
timeBins = np.arange(0, 60 * (4 * 12 + 3 * 5), binSizeInSeconds) + 0.01
attemptsAsFunctionOfTime, b = np.histogram(data['secondsFromGameStart'], bins=timeBins)
madeAttemptsAsFunctionOfTime, b = np.histogram(data.loc[data['shot_made_flag'] == 1, 'secondsFromGameStart'],
                                               bins=timeBins)
attemptsAsFunctionOfTime[attemptsAsFunctionOfTime < 1] = 1  # avoid division by zero
accuracyAsFunctionOfTime = madeAttemptsAsFunctionOfTime.astype(float) / attemptsAsFunctionOfTime
accuracyAsFunctionOfTime[attemptsAsFunctionOfTime <= 50] = 0  # zero accuracy in bins that don't have enough samples

maxHeight = max(attemptsAsFunctionOfTime) + 30
barWidth = 0.999 * (timeBins[1] - timeBins[0])

plt.figure()
plt.subplot(2, 1, 1)
plt.bar(timeBins[:-1], attemptsAsFunctionOfTime, align='edge', width=barWidth)
plt.xlim((-20, 3200))
plt.ylim((0, maxHeight))

# y-axis of the top plot: number of attempts
plt.ylabel('attempts')
plt.title(str(binSizeInSeconds) + ' second time bins')
plt.vlines(x=[0, 12 * 60, 2 * 12 * 60, 3 * 12 * 60, 4 * 12 * 60, 4 * 12 * 60 + 5 * 60, 4 * 12 * 60 + 2 * 5 * 60,
              4 * 12 * 60 + 3 * 5 * 60], ymin=0, ymax=maxHeight, colors='r')
plt.subplot(2, 1, 2)
plt.bar(timeBins[:-1], accuracyAsFunctionOfTime, align='edge', width=barWidth)
plt.xlim((-20, 3200))
# y-axis of the bottom plot: accuracy
plt.ylabel('accuracy')
plt.xlabel('time [seconds from start of game]')
plt.vlines(x=[0, 12 * 60, 2 * 12 * 60, 3 * 12 * 60, 4 * 12 * 60, 4 * 12 * 60 + 5 * 60, 4 * 12 * 60 + 2 * 5 * 60,
              4 * 12 * 60 + 3 * 5 * 60], ymin=0.0, ymax=0.7, colors='r')
plt.show()

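Let's look at the result:

[Figure: shot attempts (top) and shooting accuracy (bottom) over game time, with 20 second bins]

Kobe's shooting accuracy hovers around 0.4, but this alone is not the insight we are after.

To mine the data further, we need to bring in some algorithms.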


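GMM Clustering

So what is GMM clustering?

GMM is short for Gaussian mixture model. The rough idea is that a distribution can be treated as a weighted combination of several Gaussian distributions, so a complicated distribution can be approximated by a mixture of Gaussian components.

Many natural phenomena are known to follow a Gaussian (normal) distribution. In practice, however, a distribution is usually shaped by several factors, some of them man-made. Each factor may give rise to its own Gaussian, and the combined effect is a mixture of several Gaussians. (This is my own understanding.)

Clustering with a Gaussian mixture model therefore works as follows: from the samples, estimate the means and variances of K Gaussian distributions, which determines the K components. During clustering, a sample is not assigned outright to a single class; instead, the model computes the probability that the sample belongs to each component.

A Gaussian mixture is usually fitted with the EM algorithm, which serves as its likelihood estimation procedure.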
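The soft assignment described above can be seen directly in scikit-learn. This is a small sketch on synthetic 2D points (two invented blobs, not Kobe's shot data): `fit` runs EM, `predict_proba` returns the per-component probabilities for each sample, and `predict` simply picks the most probable component.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic blobs standing in for shot locations (invented for illustration).
rng = np.random.default_rng(0)
near_basket = rng.normal(loc=[0.0, 10.0], scale=8.0, size=(300, 2))
right_wing = rng.normal(loc=[150.0, 150.0], scale=20.0, size=(300, 2))
points = np.vstack([near_basket, right_wing])

# Fit a 2-component mixture with EM.
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
gmm.fit(points)

# Soft assignment: one probability per component, each row sums to 1.
responsibilities = gmm.predict_proba(points)
# Hard assignment: the most probable component for each point.
hard_labels = gmm.predict(points)

print('mixing weights:', gmm.weights_)
print('first point probabilities:', responsibilities[0])
```

The mixing weights `gmm.weights_` are the share of points attributed to each component, which is the quantity the article later prints inside each ellipse.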

from sklearn import mixture

# Let's continue the initial exploration and look at where on the court Kobe shoots from.
# We do this by building a Gaussian mixture model that gives a compact summary of his shot locations:
# the GMM clusters his shot attempts by location.

numGaussians = 13
gaussianMixtureModel = mixture.GaussianMixture(n_components=numGaussians, covariance_type='full',
                                               init_params='kmeans', n_init=50,
                                               verbose=0, random_state=5)
gaussianMixtureModel.fit(data.loc[:, ['loc_x', 'loc_y']])

# add the GMM cluster as a field in the data set
data['shotLocationCluster'] = gaussianMixtureModel.predict(data.loc[:, ['loc_x', 'loc_y']])


Court Visualization

This borrows the draw_court function from MichaelKrueger's excellent script.

The draw_court function

from matplotlib.patches import Circle, Rectangle, Arc

def draw_court(ax=None, color='black', lw=2, outer_lines=False):
    # If no axes object is provided to plot onto, use the current one
    if ax is None:
        ax = plt.gca()

    # Create the parts of an NBA court

    # Create the hoop
    # A hoop is 18" in diameter, so its radius is 9",
    # which is 7.5 in our coordinate system
    hoop = Circle((0, 0), radius=7.5, linewidth=lw, color=color, fill=False)

    # Create the backboard
    backboard = Rectangle((-30, -7.5), 60, -1, linewidth=lw, color=color)

    # The paint
    # Outer box of the paint, width=16ft, height=19ft
    outer_box = Rectangle((-80, -47.5), 160, 190, linewidth=lw, color=color,
                          fill=False)
    # Inner box of the paint, width=12ft, height=19ft
    inner_box = Rectangle((-60, -47.5), 120, 190, linewidth=lw, color=color,
                          fill=False)

    # Create the top arc of the free throw circle
    top_free_throw = Arc((0, 142.5), 120, 120, theta1=0, theta2=180,
                         linewidth=lw, color=color, fill=False)

    # Create the bottom arc of the free throw circle
    bottom_free_throw = Arc((0, 142.5), 120, 120, theta1=180, theta2=0,
                            linewidth=lw, color=color, linestyle='dashed')

    # Restricted zone: an arc with a 4ft radius around the center of the hoop
    restricted = Arc((0, 0), 80, 80, theta1=0, theta2=180, linewidth=lw,
                     color=color)

    # Three point line
    # Create the side 3pt lines, which are 14ft long
    corner_three_a = Rectangle((-220, -47.5), 0, 140, linewidth=lw,
                               color=color)
    corner_three_b = Rectangle((220, -47.5), 0, 140, linewidth=lw, color=color)

    # The 3pt arc is centered on the hoop, at a distance of 23'9"
    # Adjust the theta values until the arc lines up with the side 3pt lines
    three_arc = Arc((0, 0), 475, 475, theta1=22, theta2=158, linewidth=lw,
                    color=color)

    # Center court
    center_outer_arc = Arc((0, 422.5), 120, 120, theta1=180, theta2=0,
                           linewidth=lw, color=color)
    center_inner_arc = Arc((0, 422.5), 40, 40, theta1=180, theta2=0,
                           linewidth=lw, color=color)

    # List of the court elements to be drawn onto the axes
    court_elements = [hoop, backboard, outer_box, inner_box, top_free_throw,
                      bottom_free_throw, restricted, corner_three_a,
                      corner_three_b, three_arc, center_outer_arc,
                      center_inner_arc]

    if outer_lines:
        # Draw the half court line, baseline and side out-of-bounds lines
        outer_lines = Rectangle((-250, -47.5), 500, 470, linewidth=lw,
                                color=color, fill=False)
        court_elements.append(outer_lines)

    # Add the court elements onto the axes
    for element in court_elements:
        ax.add_patch(element)

    return ax

2D Gaussian Plot

Define a function that draws the 2D Gaussians.

Draw2DGaussians

import matplotlib as mpl

def Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages):
    fig, h = plt.subplots()
    for i, (mean, covarianceMatrix) in enumerate(zip(gaussianMixtureModel.means_, gaussianMixtureModel.covariances_)):
        # get the eigenvalues (v) and eigenvectors (w) of the covariance matrix
        v, w = np.linalg.eigh(covarianceMatrix)
        v = 2.5 * np.sqrt(v)  # go to units of standard deviation instead of variance

        # compute the ellipse angle and the two axis lengths, then draw it
        u = w[0] / np.linalg.norm(w[0])
        angle = np.arctan(u[1] / u[0])
        angle = 180 * angle / np.pi  # convert to degrees
        # angle is passed by keyword because newer matplotlib versions no longer accept it positionally
        currEllipse = mpl.patches.Ellipse(mean, v[0], v[1], angle=180 + angle, color=ellipseColors[i])
        currEllipse.set_alpha(0.5)
        h.add_artist(currEllipse)
        h.text(mean[0] + 7, mean[1] - 1, ellipseTextMessages[i], fontsize=13, color='blue')

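Now we plot the 2D Gaussians for shot attempts. Each ellipse is the contour 2.5 standard deviations away from the center of its Gaussian, and each blue number is the percentage of shots attributed to that Gaussian.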

# show the Gaussian mixture ellipses of the shot attempts
plt.rcParams['figure.figsize'] = (13, 10)
plt.rcParams['font.size'] = 15

ellipseTextMessages = [str(100 * gaussianMixtureModel.weights_[x])[:4] + '%' for x in range(numGaussians)]
ellipseColors = ['red', 'green', 'purple', 'cyan', 'magenta', 'yellow', 'blue', 'orange', 'silver', 'maroon', 'lime',
                 'olive', 'brown', 'darkblue']
Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages)
draw_court(outer_lines=True)
plt.ylim(-60, 440)
plt.xlim(270, -270)
plt.title('shot attempts')
plt.show()

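Here is the result:

[Figure: 2D Gaussian ellipses of shot attempts drawn over the court]

In the colored 2D Gaussian plot we can see that Kobe takes more shots from the left side of the court (the right side from his point of view), probably because he is right-handed. We can also see that a large share of his attempts (16.8%) come from directly under the basket, and a further 5.06% come from very close to the basket.

It is not perfect, but it does show something useful.

Next we plot the shooting accuracy of each Gaussian cluster. The blue number now shows the accuracy of the shots taken from that cluster, which tells us which shots are easy and which are hard.

For each cluster, compute its accuracy and plot it.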

plt.rcParams['figure.figsize'] = (13, 10)
plt.rcParams['font.size'] = 15

variableCategories = data['shotLocationCluster'].value_counts().index.tolist()

# accuracy of each cluster = shots made / shots attempted
clusterAccuracy = {}
for category in variableCategories:
    shotsAttempted = np.array(data['shotLocationCluster'] == category).sum()
    shotsMade = np.array(data.loc[data['shotLocationCluster'] == category, 'shot_made_flag'] == 1).sum()
    clusterAccuracy[category] = float(shotsMade) / shotsAttempted

ellipseTextMessages = [str(100 * clusterAccuracy[x])[:4] + '%' for x in range(numGaussians)]
Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages)
draw_court(outer_lines=True)
plt.ylim(-60, 440)
plt.xlim(270, -270)
plt.title('shot accuracy')
plt.show()

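Here is the resulting plot:

[Figure: 2D Gaussian ellipses labeled with the shooting accuracy of each cluster]

The relationship between shot distance and accuracy is clearly visible.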
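The per-cluster accuracy computed in the loop above is just the mean of the 0/1 `shot_made_flag` within each cluster, so a pandas `groupby` gives the same numbers in one line. A minimal sketch on an invented toy table (only the column names come from the article):

```python
import pandas as pd

# Invented toy shots: cluster id and whether the shot went in.
toy = pd.DataFrame({
    'shotLocationCluster': [0, 0, 0, 1, 1, 2, 2, 2, 2],
    'shot_made_flag':      [1, 1, 0, 0, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 flag is the fraction of made shots in each cluster.
cluster_accuracy = toy.groupby('shotLocationCluster')['shot_made_flag'].mean().to_dict()

for cluster, acc in sorted(cluster_accuracy.items()):
    print('cluster', cluster, 'accuracy', round(acc, 3))
```

The resulting dictionary has the same shape as `clusterAccuracy`, so it could be passed to the same plotting code.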

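Plotting a 2D Spatio-Temporal Map

Another interesting fact: Kobe not only takes more shots from the right side (from his point of view), he is also more accurate on them.

Now let's plot a 2D spatio-temporal histogram of Kobe's career. The x-axis is the time from the start of the game. The y-axis is the index of the shot cluster, sorted by cluster accuracy. The intensity of each cell reflects how many shots Kobe attempted from that cluster at that time, and the red vertical lines separate the periods of the game.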

import operator

# Build a 2D spatio-temporal histogram of Kobe's games over his whole career
plt.rcParams['figure.figsize'] = (18, 10)  # figure size
plt.rcParams['font.size'] = 18  # font size

# sort the clusters by their accuracy
sortedClustersByAccuracyTuple = sorted(clusterAccuracy.items(), key=operator.itemgetter(1), reverse=True)
sortedClustersByAccuracy = [x[0] for x in sortedClustersByAccuracyTuple]

binSizeInSeconds = 12
timeInUnitsOfBins = ((data['secondsFromGameStart'] + 0.0001) / binSizeInSeconds).astype(int)
locationInUintsOfClusters = np.array(
    [sortedClustersByAccuracy.index(data.loc[x, 'shotLocationCluster']) for x in range(data.shape[0])])

# build the spatio-temporal histogram: one row per cluster, one column per time bin
shotAttempts = np.zeros((gaussianMixtureModel.n_components, 1 + max(timeInUnitsOfBins)))
for shot in range(data.shape[0]):
    shotAttempts[locationInUintsOfClusters[shot], timeInUnitsOfBins[shot]] += 1

# stretch the y-axis so that each cluster row is easier to see
shotAttempts = np.kron(shotAttempts, np.ones((5, 1)))

# positions where each period ends
vlinesList = 0.5001 + np.array([0, 12 * 60, 2 * 12 * 60, 3 * 12 * 60, 4 * 12 * 60, 4 * 12 * 60 + 5 * 60]).astype(
    int) / binSizeInSeconds

plt.figure(figsize=(13, 8))  # width and height
plt.imshow(shotAttempts, cmap='copper', interpolation="nearest")  # nearest-neighbor interpolation keeps cell edges sharp
plt.xlim(0, float(4 * 12 * 60 + 6 * 60) / binSizeInSeconds)
plt.vlines(x=vlinesList, ymin=-0.5, ymax=shotAttempts.shape[0] - 0.5, colors='r')
plt.xlabel('time from start of game [sec]')
plt.ylabel('cluster (sorted by accuracy)')
plt.show()

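Here is the result:

[Figure: spatio-temporal histogram of shot attempts, clusters sorted by accuracy against time from the start of the game]

The clusters are sorted by accuracy in descending order: high-accuracy shots are at the top, and the low-accuracy half-court shots are at the bottom. We can now see that the "last second shots" in the first, second and third periods are actually desperation heaves from very far away. Interestingly, though, in the fourth period the last second shots do not belong to the desperation cluster but to the regular 3-point cluster (still hard to make, but not hopeless).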
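The histogram behind this plot is filled one shot at a time in a Python loop. As a side note, the same counting can be sketched without a loop using `np.add.at`, which accumulates correctly even when the same (cluster, time bin) cell appears more than once. The index arrays below are invented toy values; the variable names are mine.

```python
import numpy as np

# Toy indices: for each shot, its cluster rank (row) and its time bin (column).
cluster_rank = np.array([0, 0, 2, 1, 0, 2])
time_bin = np.array([0, 0, 3, 1, 0, 3])

counts = np.zeros((3, 4))
# Unbuffered in-place add: repeated (row, column) pairs are each counted.
np.add.at(counts, (cluster_rank, time_bin), 1)

print(counts)
```

A plain `counts[cluster_rank, time_bin] += 1` would not work here, because with repeated index pairs NumPy applies the increment only once per distinct cell.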

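In the analysis that follows, we assess shot difficulty based on shot attributes (such as shot type and shot distance).

Below we create a new table for the shot difficulty model.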

def FactorizeCategoricalVariable(inputDB, categoricalVarName):
    # one-hot encode a categorical column: one 0/1 column per category
    opponentCategories = inputDB[categoricalVarName].value_counts().index.tolist()

    outputDB = pd.DataFrame()
    for category in opponentCategories:
        featureName = categoricalVarName + ': ' + str(category)
        outputDB[featureName] = (inputDB[categoricalVarName] == category).astype(int)

    return outputDB


featuresDB = pd.DataFrame()
# a matchup without '@' is a home game
featuresDB['homeGame'] = data['matchup'].apply(lambda x: 1 if (x.find('@') < 0) else 0)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'opponent')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'action_type')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shot_type')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'combined_shot_type')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shot_zone_basic')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shot_zone_area')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shot_zone_range')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shotLocationCluster')], axis=1)

featuresDB['playoffGame'] = data['playoffs']
featuresDB['locX'] = data['loc_x']
featuresDB['locY'] = data['loc_y']
featuresDB['distanceFromBasket'] = data['shot_distance']
featuresDB['secondsFromPeriodEnd'] = data['secondsFromPeriodEnd']

# encode the cyclic time features as points on a circle
featuresDB['dayOfWeek_cycX'] = np.sin(2 * np.pi * (data['dayOfWeek'] / 7))
featuresDB['dayOfWeek_cycY'] = np.cos(2 * np.pi * (data['dayOfWeek'] / 7))
featuresDB['timeOfYear_cycX'] = np.sin(2 * np.pi * (data['dayOfYear'] / 365))
featuresDB['timeOfYear_cycY'] = np.cos(2 * np.pi * (data['dayOfYear'] / 365))

labelsDB = data['shot_made_flag']
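The sine and cosine columns at the end of the feature table encode day of week and day of year as points on a circle. The sketch below (helper names are hypothetical) shows why that is useful: with pandas' convention of Monday = 0 and Sunday = 6, the raw numbers put Monday and Sunday six units apart, while on the circle they are neighbors.

```python
import numpy as np

def cyclic_encode(value, period):
    """Map a cyclic value (e.g. day of week) to a point on the unit circle."""
    angle = 2 * np.pi * (np.asarray(value, dtype=float) / period)
    return np.sin(angle), np.cos(angle)

def cyclic_distance(a, b, period):
    """Euclidean distance between the encoded points of a and b."""
    ax, ay = cyclic_encode(a, period)
    bx, by = cyclic_encode(b, period)
    return float(np.hypot(ax - bx, ay - by))

# pandas dayofweek convention: Monday = 0, ..., Sunday = 6
d_mon_sun = cyclic_distance(0, 6, 7)  # adjacent days across the week boundary
d_mon_thu = cyclic_distance(0, 3, 7)  # days in the middle of the week apart

print('Monday-Sunday distance:', round(d_mon_sun, 3))
print('Monday-Thursday distance:', round(d_mon_thu, 3))
```

Two columns are needed per feature because a single sine or cosine value is shared by two different days; the pair identifies the position on the circle uniquely.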

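Build a model from the featuresDB table and make sure it does not overfit (that is, the training error and the test error are about the same).

We use an ExtraTreesClassifier.

Build a simple model and make sure it does not overfit.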

import time
from sklearn import model_selection, ensemble
from sklearn.metrics import accuracy_score as accuracy, log_loss

randomSeed = 1
numFolds = 4

stratifiedCV = model_selection.StratifiedKFold(n_splits=numFolds, shuffle=True, random_state=randomSeed)

mainLearner = ensemble.ExtraTreesClassifier(n_estimators=500, max_depth=5,
                                            min_samples_leaf=120, max_features=120,
                                            criterion='entropy', bootstrap=False,
                                            n_jobs=-1, random_state=randomSeed)

startTime = time.time()
trainAccuracy = []
validAccuracy = []
trainLogLosses = []
validLogLosses = []
for trainInds, validInds in stratifiedCV.split(featuresDB, labelsDB):
    # split into training and validation sets
    X_train_CV = featuresDB.iloc[trainInds, :]
    y_train_CV = labelsDB.iloc[trainInds]
    X_valid_CV = featuresDB.iloc[validInds, :]
    y_valid_CV = labelsDB.iloc[validInds]

    # train
    mainLearner.fit(X_train_CV, y_train_CV)

    # predict
    y_train_hat_mainLearner = mainLearner.predict_proba(X_train_CV)[:, 1]
    y_valid_hat_mainLearner = mainLearner.predict_proba(X_valid_CV)[:, 1]

    # store the results
    trainAccuracy.append(accuracy(y_train_CV, y_train_hat_mainLearner > 0.5))
    validAccuracy.append(accuracy(y_valid_CV, y_valid_hat_mainLearner > 0.5))
    trainLogLosses.append(log_loss(y_train_CV, y_train_hat_mainLearner))
    validLogLosses.append(log_loss(y_valid_CV, y_valid_hat_mainLearner))

print("-----------------------------------------------------")
print("total (train,valid) Accuracy = (%.5f,%.5f). took %.2f minutes" % (
    np.mean(trainAccuracy), np.mean(validAccuracy), (time.time() - startTime) / 60))
print("total (train,valid) Log Loss = (%.5f,%.5f). took %.2f minutes" % (
    np.mean(trainLogLosses), np.mean(validLogLosses), (time.time() - startTime) / 60))
print("-----------------------------------------------------")

# refit on all the data and store the predicted probability of making each shot
mainLearner.fit(featuresDB, labelsDB)
data['shotDifficulty'] = mainLearner.predict_proba(featuresDB)[:, 1]

# for more insight, look at the feature importances
featureInds = mainLearner.feature_importances_.argsort()[::-1]
featureImportance = pd.DataFrame({'featureName': featuresDB.columns[featureInds],
                                  'importanceET': mainLearner.feature_importances_[featureInds]})

print(featureImportance.iloc[:30, :])

Here is the output:

total (train,valid) Accuracy = (0.67912,0.67860). took 0.29 minutes
total (train,valid) Log Loss = (0.60812,0.61100). took 0.29 minutes
-----------------------------------------------------
    featureName                        importanceET
0   action_type: Jump Shot             0.578036
1   action_type: Layup Shot            0.173274
2   combined_shot_type: Dunk           0.113341
3   homeGame                           0.0288043
4   action_type: Dunk Shot             0.0161591
5   shotLocationCluster: 9             0.0136386
6   combined_shot_type: Layup          0.00949568
7   distanceFromBasket                 0.0084703
8   shot_zone_range: 16-24 ft.         0.0072107
9   action_type: Slam Dunk Shot        0.00690316
10  combined_shot_type: Jump Shot      0.00592586
11  secondsFromPeriodEnd               0.00589391
12  action_type: Running Jump Shot     0.00544904
13  shotLocationCluster: 11            0.00449125
14  locY                               0.00388509
15  action_type: Driving Layup Shot    0.00364757
16  shot_zone_range: Less Than 8 ft.   0.00349615
17  combined_shot_type: Tip Shot       0.00260399
18  shot_zone_area: Center(C)          0.0011585
19  opponent: DEN                      0.000882106
20  action_type: Driving Dunk Shot     0.000848156
21  shot_zone_basic: Restricted Area   0.000650022
22  shotLocationCluster: 2             0.000513476
23  action_type: Tip Shot              0.000489918
24  shot_zone_basic: Mid-Range         0.000487306
25  action_type: Pullup Jump shot      0.000453641
26  shot_zone_range: 8-16 ft.          0.000452574
27  timeOfYear_cycX                    0.000432267
28  dayOfWeek_cycX                     0.00039668
29  shotLocationCluster: 8             0.000254077

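Here I want to look at some aspects of Kobe Bryant's decision making. To do so, we collect two different groups of shots and analyze the differences between them:

Shots taken right after a made shot
Shots taken right after a missed shot

I collect the data separately depending on whether Kobe made or missed his previous shot.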

timeBetweenShotsDict = {}
timeBetweenShotsDict['madeLast'] = []
timeBetweenShotsDict['missedLast'] = []

changeInDistFromBasketDict = {}
changeInDistFromBasketDict['madeLast'] = []
changeInDistFromBasketDict['missedLast'] = []

changeInShotDifficultyDict = {}
changeInShotDifficultyDict['madeLast'] = []
changeInShotDifficultyDict['missedLast'] = []

afterMadeShotsList = []
afterMissedShotsList = []

for shot in range(1, data.shape[0]):

    # make sure the current shot and the previous shot are in the same period of the same game
    sameGame = data.loc[shot, 'game_date'] == data.loc[shot - 1, 'game_date']
    samePeriod = data.loc[shot, 'period'] == data.loc[shot - 1, 'period']

    if samePeriod and sameGame:
        madeLastShot = data.loc[shot - 1, 'shot_made_flag'] == 1
        missedLastShot = data.loc[shot - 1, 'shot_made_flag'] == 0

        timeDifferenceFromLastShot = data.loc[shot, 'secondsFromGameStart'] - data.loc[shot - 1, 'secondsFromGameStart']
        distDifferenceFromLastShot = data.loc[shot, 'shot_distance'] - data.loc[shot - 1, 'shot_distance']
        shotDifficultyDifferenceFromLastShot = data.loc[shot, 'shotDifficulty'] - data.loc[shot - 1, 'shotDifficulty']

        # check for corrupt data points (assuming all samples should have been chronologically ordered)
        if timeDifferenceFromLastShot < 0:
            continue

        if madeLastShot:
            timeBetweenShotsDict['madeLast'].append(timeDifferenceFromLastShot)
            changeInDistFromBasketDict['madeLast'].append(distDifferenceFromLastShot)
            changeInShotDifficultyDict['madeLast'].append(shotDifficultyDifferenceFromLastShot)
            afterMadeShotsList.append(shot)

        if missedLastShot:
            timeBetweenShotsDict['missedLast'].append(timeDifferenceFromLastShot)
            changeInDistFromBasketDict['missedLast'].append(distDifferenceFromLastShot)
            changeInShotDifficultyDict['missedLast'].append(shotDifficultyDifferenceFromLastShot)
            afterMissedShotsList.append(shot)

afterMissedData = data.iloc[afterMissedShotsList, :]
afterMadeData = data.iloc[afterMadeShotsList, :]

shotChancesListAfterMade = afterMadeData['shotDifficulty'].tolist()
totalAttemptsAfterMade = afterMadeData.shape[0]
totalMadeAfterMade = np.array(afterMadeData['shot_made_flag'] == 1).sum()

shotChancesListAfterMissed = afterMissedData['shotDifficulty'].tolist()
totalAttemptsAfterMissed = afterMissedData.shape[0]
totalMadeAfterMissed = np.array(afterMissedData['shot_made_flag'] == 1).sum()

Histograms

Plot histograms of the "time since last shot" for the two groups.

plt.rcParams['figure.figsize'] = (13, 10)

# use the joint histogram of both groups to get a common set of bins
jointHist, timeBins = np.histogram(timeBetweenShotsDict['madeLast'] + timeBetweenShotsDict['missedLast'], bins=200)
barWidth = 0.999 * (timeBins[1] - timeBins[0])

timeDiffHist_GivenMadeLastShot, b = np.histogram(timeBetweenShotsDict['madeLast'], bins=timeBins)
timeDiffHist_GivenMissedLastShot, b = np.histogram(timeBetweenShotsDict['missedLast'], bins=timeBins)
maxHeight = max(max(timeDiffHist_GivenMadeLastShot), max(timeDiffHist_GivenMissedLastShot)) + 30

plt.figure()
plt.subplot(2, 1, 1)
plt.bar(timeBins[:-1], timeDiffHist_GivenMadeLastShot, width=barWidth)
plt.xlim((0, 500))
plt.ylim((0, maxHeight))
plt.title('made last shot')
plt.ylabel('counts')
plt.subplot(2, 1, 2)
plt.bar(timeBins[:-1], timeDiffHist_GivenMissedLastShot, width=barWidth)
plt.xlim((0, 500))
plt.ylim((0, maxHeight))
plt.title('missed last shot')
plt.xlabel('time since last shot')
plt.ylabel('counts')
plt.show()

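Here is the result:

[Figure: histograms of the time since the last shot, after a made shot (top) and after a missed shot (bottom)]

The plot suggests that after making a shot, Kobe is somewhat eager to take the next one. The flatter stretches in the plot probably correspond to the other team having possession, since it takes some time to get the ball back.

Cumulative Histograms

To visualize the difference between the two histograms better, let's look at the cumulative histograms.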

plt.rcParams['figure.figsize'] = (13, 6)

# cumulative sums, normalized so that each curve ends at 1
timeDiffCumHist_GivenMadeLastShot = np.cumsum(timeDiffHist_GivenMadeLastShot).astype(float)
timeDiffCumHist_GivenMadeLastShot = timeDiffCumHist_GivenMadeLastShot / max(timeDiffCumHist_GivenMadeLastShot)
timeDiffCumHist_GivenMissedLastShot = np.cumsum(timeDiffHist_GivenMissedLastShot).astype(float)
timeDiffCumHist_GivenMissedLastShot = timeDiffCumHist_GivenMissedLastShot / max(timeDiffCumHist_GivenMissedLastShot)

maxHeight = max(timeDiffCumHist_GivenMadeLastShot[-1], timeDiffCumHist_GivenMissedLastShot[-1])

plt.figure()
madePrev = plt.plot(timeBins[:-1], timeDiffCumHist_GivenMadeLastShot, label='made Prev')
plt.xlim((0, 500))
missedPrev = plt.plot(timeBins[:-1], timeDiffCumHist_GivenMissedLastShot, label='missed Prev')
plt.xlim((0, 500))
plt.ylim((0, 1))
plt.title('cumulative density function - CDF')
plt.xlabel('time since last shot')
plt.legend(loc='lower right')
plt.show()

The result looks like this:

[Figure: cumulative distribution of the time since the last shot, after made and after missed shots]
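The cumulative histogram is obtained by taking a running sum of the bin counts and dividing by the total. A small worked sketch with invented waiting times (in seconds) and hand-picked bin edges:

```python
import numpy as np

# Invented waiting times between consecutive shots, in seconds.
waits = np.array([5, 12, 18, 25, 25, 40, 65, 90, 120, 300])

# Bin the values, then accumulate and normalize so the curve ends at 1.
counts, edges = np.histogram(waits, bins=[0, 30, 60, 120, 600])
cdf = np.cumsum(counts).astype(float) / counts.sum()

for right_edge, value in zip(edges[1:], cdf):
    print('share of waits up to', right_edge, 's:', value)
```

Here half of the waits fall in the first bin, so the curve starts at 0.5 and climbs to 1.0 at the last bin. Comparing two such curves shows at a glance which group tends to shoot again sooner: the curve that rises faster.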

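A difference between the two distributions is visible, but it is not very clear, so let's display the data with the Gaussian clusters instead.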

# show where the shots are taken after made shots and after missed shots
plt.rcParams['figure.figsize'] = (13, 10)

variableCategories = afterMadeData['shotLocationCluster'].value_counts().index.tolist()
clusterFrequency = {}
for category in variableCategories:
    shotsAttempted = np.array(afterMadeData['shotLocationCluster'] == category).sum()
    clusterFrequency[category] = float(shotsAttempted) / afterMadeData.shape[0]

ellipseTextMessages = [str(100 * clusterFrequency[x])[:4] + '%' for x in range(numGaussians)]
Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages)
draw_court(outer_lines=True)
plt.ylim(-60, 440)
plt.xlim(270, -270)
plt.title('after made shots')

variableCategories = afterMissedData['shotLocationCluster'].value_counts().index.tolist()
clusterFrequency = {}
for category in variableCategories:
    shotsAttempted = np.array(afterMissedData['shotLocationCluster'] == category).sum()
    clusterFrequency[category] = float(shotsAttempted) / afterMissedData.shape[0]

ellipseTextMessages = [str(100 * clusterFrequency[x])[:4] + '%' for x in range(numGaussians)]
Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages)
draw_court(outer_lines=True)
plt.ylim(-60, 440)
plt.xlim(270, -270)
plt.title('after missed shots')
plt.show()