A Simple Implementation of a Naive Bayes Classifier in Python
In this post, I will walk through a simple implementation of a Naive Bayes classifier that predicts whether a patient has diabetes. For this I use the machine learning dataset "Pima Indians Diabetes Database" (https://www.kaggle.com/uciml/pima-indians-diabetes-database).
As always, we start by loading all the Python libraries we are going to use:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('classic')
import numpy as np
import pandas as pd
import random
import math
from IPython.display import display

pi = math.pi
We first load the dataset file and take a look at the data: the features used to make predictions, the size of the catalogue, and whether any data is missing.
full_catalog = pd.read_csv('/home/ealmaraz/dscience/sandbox/datasets/unzip/diabetes.csv')
print(full_catalog.columns)
print("Size of the catalogue: {}".format(len(full_catalog)))
print("Is there any NaN?: {}".format(full_catalog.isnull().any().any()))
full_catalog.head()
positive = full_catalog[full_catalog['Outcome'] == 1]
print('number of patients with diabetes:', len(positive))
negative = full_catalog[full_catalog['Outcome'] == 0]
print('number of healthy patients:', len(negative))
number of patients with diabetes: 268
number of healthy patients: 500
Let's see whether some of the features can help explain the data.
a) Blue denotes healthy patients, red denotes diabetic patients
#according to the color map, blue -> 0 (negative) & red -> 1 (positive)
df = pd.DataFrame(full_catalog, columns=full_catalog.columns.drop('Outcome'))
pd.plotting.scatter_matrix(df, c=full_catalog['Outcome'].values, figsize=(15, 15), marker='o',
                           hist_kwds={'bins': 10, 'color': 'green'}, s=10, alpha=.2,
                           cmap=plt.get_cmap('bwr'));
b) Diabetic patients
df = pd.DataFrame(positive, columns=positive.columns.drop('Outcome'))
pd.plotting.scatter_matrix(df, c='red', figsize=(15, 15), marker='o',
                           hist_kwds={'bins': 10, 'color': 'red'}, s=10, alpha=.2);
c) Healthy patients
df = pd.DataFrame(negative, columns=negative.columns.drop('Outcome'))
pd.plotting.scatter_matrix(df, c='blue', figsize=(15, 15), marker='o',
                           hist_kwds={'bins': 10, 'color': 'blue'}, s=10, alpha=.2);
The create_training_test function splits the whole dataset into a training set and a test set. Its arguments are:
- dataset: the dataset to analyze
- fraction_training: the fraction of the dataset (between 0 and 1) assigned to the training set
- msg: a debugging flag. If True, the program prints information about what it is currently doing
Outputs:
- training_set: a DataFrame containing the training set
- test_set: a DataFrame containing the test set
def create_training_test(dataset, fraction_training, msg):
    size_dataset = len(dataset)
    size_training = round(size_dataset*fraction_training)
    size_test = size_dataset - size_training
    #initially both the training set and the test set are made from the whole dataset
    training_set = dataset.copy()
    test_set = dataset.copy()
    #index of the dataset dataframe
    total_idx_list = list(dataset.index.values)
    #index of the test set. We use random.sample to pick out non-repeated integers in the dataset.index.values array
    test_idx_list = random.sample(list(dataset.index.values), size_test)
    test_idx_list.sort()
    #index of the training set. This is simply the difference between total_idx_list and test_idx_list
    training_idx_list = list(set(total_idx_list) - set(test_idx_list))
    #once we have the two lists we drop the corresponding rows from the training and the test dataframes
    training_set.drop(training_set.index[test_idx_list], inplace=True)
    test_set.drop(test_set.index[training_idx_list], inplace=True)
    if msg == True:
        training_positive = training_set[training_set['Outcome'] == 1]
        training_negative = training_set[training_set['Outcome'] == 0]
        print("size of the dataset : {} samples".format(size_dataset))
        print('size of the training set : {} samples ({} of the whole dataset)'.format(len(training_set), fraction_training))
        print('\tpositive cases in the training set: {}'.format(len(training_positive)))
        print('\tnegative cases in the training set: {}'.format(len(training_negative)))
        print('size of the test set : {} samples'.format(len(test_set)))
    return training_set, test_set
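As a quick sanity check, here is a minimal usage sketch (assuming the full_catalog DataFrame loaded above; the 75/25 split and the variable names are illustrative, not part of the original pipeline):

training_set, test_set = create_training_test(full_catalog, 0.75, True)
#with the 768-sample catalogue this reserves round(768*0.75) = 576 rows for training and 192 for testing
print(len(training_set), len(test_set))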
The get_parameters function creates a dictionary that stores the mean and standard deviation of each feature. Its arguments are:
- dataset: the DataFrame to analyze
- msg: a debugging flag.
Outputs:
- dict_parameters: a dictionary containing the mean and standard deviation of each feature, e.g. {'Pregnancies': (3.02, 0.23), 'Age': (25.34, 3.2), ...}
def get_parameters(dataset, msg):
    features = dataset.columns.values
    nbins = 10
    dict_parameters = {}
    #we are excluding 'Outcome' from the loop
    for i in range(0, len(features)-1):
        #we single out the column 'features[i]' from dataset
        aux_df = pd.DataFrame(dataset[features[i]])
        #here we make the partition into nbins. aux_df has an extra column indicating
        #to which bin each instance belongs
        aux_df['bin'] = pd.cut(aux_df[features[i]], nbins)
        #'counts' is a series whose index is the bin interval and the values are the number
        #of counts in each bin
        counts = pd.value_counts(aux_df['bin'])
        points_X = np.zeros(nbins)
        points_Y = np.zeros(nbins)
        for j in range(0, nbins):
            points_X[j] = counts.index[j].mid  #the mid point of each bin
            points_Y[j] = counts.iloc[j]       #the number of counts
        total_Y = np.sum(points_Y)
        #we compute the mean and the standard deviation. The results are stored in the dictionary 'dict_parameters'
        #whose keys are the labels of the columns and the values are (mu, sigma)
        mu = np.sum(points_X*points_Y)/total_Y
        sigma2 = np.sum((points_X-mu)**2*points_Y)/(total_Y-1)
        sigma = math.sqrt(sigma2)
        dict_parameters[features[i]] = (mu, sigma)
        if msg == True:
            print('\t\tfeature: {} mean: {} standard deviation: {}'.format(features[i], mu, sigma))
    return dict_parameters
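A minimal sketch of inspecting its output on the positive subsample defined earlier (the param_positive name is illustrative):

param_positive = get_parameters(positive, False)
print(param_positive['Glucose'])  #a (mean, standard deviation) tuple for the Glucose feature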
The likelihood function evaluates the probability density function of each feature. Its arguments are:
- instance: a pandas Series whose index contains the feature names and whose values are the measurements for each feature
- dictionary: a dictionary with the means and standard deviations used to evaluate the probability density function of each feature of the instance
Output:
- dict_likelihood: a dictionary with the conditional probability (density) for each feature
Based on the exploratory analysis, we use an exponential distribution for Pregnancies, Insulin, DiabetesPedigreeFunction, and Age, and a Gaussian distribution for Glucose, BloodPressure, SkinThickness, and BMI.
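For reference, the two density functions evaluated in the likelihood function below, with $\mu$ and $\sigma$ the per-feature mean and standard deviation returned by get_parameters, are

$$P(x\mid\mu)=\frac{1}{\mu}\,e^{-x/\mu} \quad\text{(exponential)}, \qquad P(x\mid\mu,\sigma)=\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(x-\mu)^2/(2\sigma^2)} \quad\text{(Gaussian)}.$$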
Strictly speaking, these are probability densities rather than probabilities. To obtain a probability we must multiply $P(x)$ by $dx$; in fact $P(x)$ may be greater than 1, but once multiplied by $dx$ the result is always smaller than 1. This does not matter here, however: the $dx$ factors are the same for Outcome=1 and Outcome=0, so they enter Bayes' theorem as identical multiplicative factors and have no effect when deciding whether P(Outcome=1|features) is greater than P(Outcome=0|features).
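In equation form, the quantity the classifier compares for each class $k\in\{0,1\}$ is the unnormalized naive Bayes posterior (the denominator $P(\text{features})$ is common to both classes and cancels as well):

$$P(\text{Outcome}=k\mid\text{features}) \;\propto\; P(\text{Outcome}=k)\,\prod_i P(x_i\mid\text{Outcome}=k).$$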
def likelihood(instance, dictionary):
    instance = instance[instance.index != 'Outcome']
    dict_likelihood = {}
    for feature in instance.index:
        mu = dictionary[feature][0]
        sigma = dictionary[feature][1]
        measurement = instance[feature]
        if feature in ['Pregnancies', 'Insulin', 'DiabetesPedigreeFunction', 'Age']:
            dict_likelihood[feature] = 1./mu*math.exp(-measurement/mu)
        elif feature in ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI']:
            dict_likelihood[feature] = 1./(math.sqrt(2.*pi)*sigma)*math.exp(-(measurement-mu)**2/(2.*sigma**2))
    return dict_likelihood
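And a short sketch of evaluating the densities for a single patient, reusing the illustrative param_positive dictionary from the earlier sketch:

instance = full_catalog.loc[0]  #the first row of the catalogue, as a pandas Series
lkh = likelihood(instance, param_positive)  #e.g. lkh['Glucose'] approximates P(glucose measurement | Outcome=1)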
The bayes function applies Bayes' theorem to classify an instance. Inputs:
- lkh_positive: a dictionary with the conditional probabilities P(feature|Outcome=1) for each feature
- lkh_negative: a dictionary with the conditional probabilities P(feature|Outcome=0) for each feature
- prob_positive: the prior probability of finding a diabetic patient (the fraction of positive cases in the training set)
The output is the classifier's prediction: 1 (the patient has diabetes) or 0 (the patient does not).
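Because the product of many small densities can underflow numerically, the function compares log-posteriors instead; the decision rule it implements is

$$\log P(\text{Outcome}=1)+\sum_i\log P(x_i\mid 1) \;>\; \log P(\text{Outcome}=0)+\sum_i\log P(x_i\mid 0)\;\Longrightarrow\;\text{predict } 1,$$

and 0 otherwise.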
def bayes(lkh_positive, lkh_negative, prob_positive):
    logPositive = 0; logNegative = 0
    #sum the log-likelihoods of all features
    for feature in lkh_positive:
        logPositive += math.log(lkh_positive[feature])
        logNegative += math.log(lkh_negative[feature])
    #add the log-priors
    logPositive += math.log(prob_positive)
    logNegative += math.log(1.-prob_positive)
    if logPositive > logNegative:
        return 1
    else:
        return 0
def pima_indians_NBClassifier(training_fraction, msg):
    #we import the catalog
    dataset = pd.read_csv('/home/ealmaraz/dscience/sandbox/datasets/unzip/diabetes.csv')
    #here we create the training and the test sets
    training, test = create_training_test(dataset, training_fraction, msg)
    #we split the training set into positive (1) and negative (0) values of 'Outcome'
    training_positive = training[training['Outcome'] == 1]
    training_negative = training[training['Outcome'] == 0]
    prob_positive = len(training_positive)/len(training)
    #we get the parameters for the positive (negative) subsamples in the training set
    if msg == True:
        print('getting the parameters for the training set...')
        print('\tpositive cases subsample')
    param_positive = get_parameters(training_positive, msg)
    if msg == True:
        print('\tnegative cases subsample')
    param_negative = get_parameters(training_negative, msg)
    if msg == True:
        print('\tprobability of finding a positive case: {}'.format(prob_positive))
        print('analyzing the test set...')
    #here we compute the accuracy of the classifier by looping over the instances of the test set
    error_count = 0
    for idx in test.index.values:
        instance = test.loc[idx]
        likelihood_positive = likelihood(instance, param_positive)
        likelihood_negative = likelihood(instance, param_negative)
        prediction = bayes(likelihood_positive, likelihood_negative, prob_positive)
        answer = int(instance['Outcome'])
        if prediction != answer:
            error_count += 1
            if msg == True:
                print('\tclassification error!')
    error_rate = float(error_count)/len(test)
    if msg == True:
        print('Results for this implementation:')
        print('\terror rate : {}'.format(error_rate))
        print('\tsuccessful classification rate : {}'.format(1.-error_rate))
    return error_rate
a) Single realization. Here we show the results of running a single realization of the classifier:
training_fraction = 0.75; msg = True
pima_indians_NBClassifier(training_fraction, msg)
b) Multiple realizations. To estimate the accuracy of the classifier, we need to run it many times and average over all realizations.
training_fraction = 0.75; nrealizations = 500; msg = False
error_rate = np.zeros(nrealizations)
success_rate = np.zeros(nrealizations)
for i in range(0, nrealizations):
    aux = pima_indians_NBClassifier(training_fraction, msg)
    error_rate[i] = aux
    success_rate[i] = 1.-aux
print('Results after {} realizations and training the classifier with {} of the whole sample...'.format(nrealizations, training_fraction))
print('error rate mean: {} std {}'.format(np.mean(error_rate), np.std(error_rate)))
print('successful rate mean: {} std {}'.format(np.mean(success_rate), np.std(success_rate)))