A Simple Implementation of a Naive Bayes Classifier in Python
In this post, I will walk through a simple implementation of a Naive Bayes classifier that predicts whether a patient has diabetes. For this I use the machine learning dataset "Pima Indians Diabetes Database" (https://www.kaggle.com/uciml/pima-indians-diabetes-database).
As always, we start by loading all the Python libraries we are going to use:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('classic')
import numpy as np
import pandas as pd
import random
import math
from IPython.display import display

pi = math.pi
We first load the dataset file and take a look at the data: the features used to make predictions, the size of the catalogue, and whether any data is missing.
full_catalog = pd.read_csv('/home/ealmaraz/dscience/sandbox/datasets/unzip/diabetes.csv')
print(full_catalog.columns)
print("Size of the catalogue: {}".format(len(full_catalog)))
print("Is there any NaN?: {}".format(full_catalog.isnull().any().any()))
full_catalog.head()
positive = full_catalog[full_catalog['Outcome'] == 1]
print('number of patients with diabetes:', len(positive))
negative = full_catalog[full_catalog['Outcome'] == 0]
print('number of healthy patients:', len(negative))
number of patients with diabetes: 268
number of healthy patients: 500
Let's see whether some of the features can help explain the data.
a) Blue denotes healthy patients, red denotes diabetic patients
#according to the color map, blue -> 0 (negative) & red -> 1 (positive)
df = pd.DataFrame(full_catalog, columns=full_catalog.columns.drop('Outcome'))
pd.plotting.scatter_matrix(df, c=full_catalog['Outcome'].values, figsize=(15, 15), marker='o',
                           hist_kwds={'bins': 10, 'color': 'green'}, s=10, alpha=.2,
                           cmap=plt.get_cmap('bwr'));
b) Diabetic patients
df = pd.DataFrame(positive, columns=positive.columns.drop('Outcome'))
pd.plotting.scatter_matrix(df, c='red', figsize=(15, 15), marker='o',
                           hist_kwds={'bins': 10, 'color': 'red'}, s=10, alpha=.2);
c) Healthy patients
df = pd.DataFrame(negative, columns=negative.columns.drop('Outcome'))
pd.plotting.scatter_matrix(df, c='blue', figsize=(15, 15), marker='o',
                           hist_kwds={'bins': 10, 'color': 'blue'}, s=10, alpha=.2);
The create_training_test function splits the whole dataset into a training set and a test set. Its arguments are:
- dataset: the dataset to analyze
- fraction_training: the fraction of the dataset (between 0 and 1) assigned to the training set
- msg: a debugging flag. If True, the program prints information about what it is currently doing
Outputs:
- training_set: a DataFrame containing the training set
- test_set: a DataFrame containing the test set
def create_training_test(dataset, fraction_training, msg):
    size_dataset = len(dataset)
    size_training = round(size_dataset*fraction_training)
    size_test = size_dataset - size_training
    #initially both the training set and the test set are made from the whole dataset
    training_set = dataset.copy()
    test_set = dataset.copy()
    #index of the dataset dataframe
    total_idx_list = list(dataset.index.values)
    #index of the test set. We use random.sample to pick out non-repeated integers in the dataset.index.values array
    test_idx_list = random.sample(list(dataset.index.values), size_test)
    test_idx_list.sort()
    #index of the training set. This is simply the difference between total_idx_list and test_idx_list
    training_idx_list = list(set(total_idx_list) - set(test_idx_list))
    #once we have the two lists we drop the corresponding rows from the training and the test dataframes
    training_set.drop(training_set.index[test_idx_list], inplace=True)
    test_set.drop(test_set.index[training_idx_list], inplace=True)
    if msg == True:
        training_positive = training_set[training_set['Outcome'] == 1]
        training_negative = training_set[training_set['Outcome'] == 0]
        print("size of the dataset : {} samples".format(size_dataset))
        print('size of the training set : {} samples ({} of the whole dataset)'.format(len(training_set), fraction_training))
        print('\tpositive cases in the training set: {}'.format(len(training_positive)))
        print('\tnegative cases in the training set: {}'.format(len(training_negative)))
        print('size of the test set : {} samples'.format(len(test_set)))
    return training_set, test_set
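As a quick sanity check, here is a minimal usage sketch (assuming the full_catalog DataFrame loaded above; the 75/25 split and the variable names are illustrative, not part of the original pipeline):

training_set, test_set = create_training_test(full_catalog, 0.75, True)
#with the 768-sample catalogue this reserves round(768*0.75) = 576 rows for training and 192 for testing
print(len(training_set), len(test_set))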
The get_parameters function creates a dictionary that stores the mean and standard deviation of each feature. Its arguments are:
- dataset: the DataFrame to analyze
- msg: a debugging flag.
Outputs:
- dict_parameters: a dictionary containing the mean and standard deviation of each feature, e.g. {'Pregnancies': (3.02, 0.23), 'Age': (25.34, 3.2), ...}
def get_parameters(dataset, msg):
    features = dataset.columns.values
    nbins = 10
    dict_parameters = {}
    #we are excluding 'Outcome' from the loop
    for i in range(0, len(features)-1):
        #we single out the column 'features[i]' from dataset
        aux_df = pd.DataFrame(dataset[features[i]])
        #here we make the partition into nbins. aux_df has an extra column indicating
        #to which bin each instance belongs
        aux_df['bin'] = pd.cut(aux_df[features[i]], nbins)
        #'counts' is a series whose index is the bin interval and the values are the number
        #of counts in each bin
        counts = pd.value_counts(aux_df['bin'])
        points_X = np.zeros(nbins)
        points_Y = np.zeros(nbins)
        for j in range(0, nbins):
            points_X[j] = counts.index[j].mid  #the mid point of each bin
            points_Y[j] = counts.iloc[j]       #the number of counts
        total_Y = np.sum(points_Y)
        #we compute the mean and the standard deviation. The results are stored in the dictionary 'dict_parameters'
        #whose keys are the labels of the columns and the values are (mu, sigma)
        mu = np.sum(points_X*points_Y)/total_Y
        sigma2 = np.sum((points_X-mu)**2*points_Y)/(total_Y-1)
        sigma = math.sqrt(sigma2)
        dict_parameters[features[i]] = (mu, sigma)
        if msg == True:
            print('\t\tfeature: {} mean: {} standard deviation: {}'.format(features[i], mu, sigma))
    return dict_parameters
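A minimal sketch of inspecting its output on the positive subsample defined earlier (the param_positive name is illustrative):

param_positive = get_parameters(positive, False)
print(param_positive['Glucose'])  #a (mean, standard deviation) tuple for the Glucose feature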
The likelihood function evaluates the probability density function of each feature. Its arguments are:
- instance: a pandas Series whose index contains the feature names and whose values are the measurements for each feature
- dictionary: a dictionary with the means and standard deviations used to evaluate the probability density function of each feature of the instance
Output:
- dict_likelihood: a dictionary with the conditional probability (density) for each feature
Based on the exploratory analysis, we use an exponential distribution for Pregnancies, Insulin, DiabetesPedigreeFunction, and Age, and a Gaussian distribution for Glucose, BloodPressure, SkinThickness, and BMI.
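For reference, the two density functions evaluated in the likelihood function below, with $\mu$ and $\sigma$ the per-feature mean and standard deviation returned by get_parameters, are

$$P(x\mid\mu)=\frac{1}{\mu}\,e^{-x/\mu} \quad\text{(exponential)}, \qquad P(x\mid\mu,\sigma)=\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(x-\mu)^2/(2\sigma^2)} \quad\text{(Gaussian)}.$$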
Strictly speaking, these are probability densities rather than probabilities. To obtain a probability we must multiply $P(x)$ by $dx$; in fact $P(x)$ may be greater than 1, but once multiplied by $dx$ the result is always smaller than 1. This does not matter here, however: the $dx$ factors are the same for Outcome=1 and Outcome=0, so they enter Bayes' theorem as identical multiplicative factors and have no effect when deciding whether P(Outcome=1|features) is greater than P(Outcome=0|features).
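In equation form, the quantity the classifier compares for each class $k\in\{0,1\}$ is the unnormalized naive Bayes posterior (the denominator $P(\text{features})$ is common to both classes and cancels as well):

$$P(\text{Outcome}=k\mid\text{features}) \;\propto\; P(\text{Outcome}=k)\,\prod_i P(x_i\mid\text{Outcome}=k).$$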
def likelihood(instance, dictionary):
    instance = instance[instance.index != 'Outcome']
    dict_likelihood = {}
    for feature in instance.index:
        mu = dictionary[feature][0]
        sigma = dictionary[feature][1]
        measurement = instance[feature]
        if feature in ['Pregnancies', 'Insulin', 'DiabetesPedigreeFunction', 'Age']:
            dict_likelihood[feature] = 1./mu*math.exp(-measurement/mu)
        elif feature in ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI']:
            dict_likelihood[feature] = 1./(math.sqrt(2.*pi)*sigma)*math.exp(-(measurement-mu)**2/(2.*sigma**2))
    return dict_likelihood
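And a short sketch of evaluating the densities for a single patient, reusing the illustrative param_positive dictionary from the earlier sketch:

instance = full_catalog.loc[0]  #the first row of the catalogue, as a pandas Series
lkh = likelihood(instance, param_positive)  #e.g. lkh['Glucose'] approximates P(glucose measurement | Outcome=1)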
The bayes function applies Bayes' theorem to classify an instance. Inputs:
- lkh_positive: a dictionary with the conditional probabilities P(feature|Outcome=1) for each feature
- lkh_negative: a dictionary with the conditional probabilities P(feature|Outcome=0) for each feature
- prob_positive: the prior probability of finding a diabetic patient (the fraction of positive cases in the training set)
The output is the classifier's prediction: 1 (the patient has diabetes) or 0 (the patient does not).
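Because the product of many small densities can underflow numerically, the function compares log-posteriors instead; the decision rule it implements is

$$\log P(\text{Outcome}=1)+\sum_i\log P(x_i\mid 1) \;>\; \log P(\text{Outcome}=0)+\sum_i\log P(x_i\mid 0)\;\Longrightarrow\;\text{predict } 1,$$

and 0 otherwise.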
def bayes(lkh_positive, lkh_negative, prob_positive):
    logPositive = 0; logNegative = 0
    #sum the log-likelihoods of all features
    for feature in lkh_positive:
        logPositive += math.log(lkh_positive[feature])
        logNegative += math.log(lkh_negative[feature])
    #add the log-priors
    logPositive += math.log(prob_positive)
    logNegative += math.log(1.-prob_positive)
    if logPositive > logNegative:
        return 1
    else:
        return 0
def pima_indians_NBClassifier(training_fraction, msg):
    #we import the catalog
    dataset = pd.read_csv('/home/ealmaraz/dscience/sandbox/datasets/unzip/diabetes.csv')
    #here we create the training and the test sets
    training, test = create_training_test(dataset, training_fraction, msg)
    #we split the training set into positive (1) and negative (0) values of 'Outcome'
    training_positive = training[training['Outcome'] == 1]
    training_negative = training[training['Outcome'] == 0]
    prob_positive = len(training_positive)/len(training)
    #we get the parameters for the positive (negative) subsamples in the training set
    if msg == True:
        print('getting the parameters for the training set...')
        print('\tpositive cases subsample')
    param_positive = get_parameters(training_positive, msg)
    if msg == True:
        print('\tnegative cases subsample')
    param_negative = get_parameters(training_negative, msg)
    if msg == True:
        print('\tprobability of finding a positive case: {}'.format(prob_positive))
        print('analyzing the test set...')
    #here we compute the accuracy of the classifier by looping over the instances of the test set
    error_count = 0
    for idx in test.index.values:
        instance = test.loc[idx]
        likelihood_positive = likelihood(instance, param_positive)
        likelihood_negative = likelihood(instance, param_negative)
        prediction = bayes(likelihood_positive, likelihood_negative, prob_positive)
        answer = int(instance['Outcome'])
        if prediction != answer:
            error_count += 1
            if msg == True:
                print('\tclassification error!')
    error_rate = float(error_count)/len(test)
    if msg == True:
        print('Results for this implementation:')
        print('\terror rate : {}'.format(error_rate))
        print('\tsuccessful classification rate : {}'.format(1.-error_rate))
    return error_rate
a) Single realization. Here we show the results of running a single realization of the classifier:
training_fraction = 0.75; msg = True
pima_indians_NBClassifier(training_fraction, msg)
b) Multiple realizations. To estimate the accuracy of the classifier, we need to run it many times and average over all realizations.
training_fraction = 0.75; nrealizations = 500; msg = False
error_rate = np.zeros(nrealizations)
success_rate = np.zeros(nrealizations)
for i in range(0, nrealizations):
    aux = pima_indians_NBClassifier(training_fraction, msg)
    error_rate[i] = aux
    success_rate[i] = 1.-aux
print('Results after {} realizations and training the classifier with {} of the whole sample...'.format(nrealizations, training_fraction))
print('error rate mean: {} std {}'.format(np.mean(error_rate), np.std(error_rate)))
print('successful rate mean: {} std {}'.format(np.mean(success_rate), np.std(success_rate)))