Naive Bayes

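Naive Bayes methods are a set of supervised learning algorithms based on Bayes' theorem, with the "naive" assumption that every pair of features is conditionally independent given the class. Given a class variable $y$ and a dependent feature vector $x_1$ through $x_n$, Bayes' theorem states the following relationship:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$$

Using the naive assumption that, given the class, each feature is independent of all the others:

$$P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y)$$

for all $i$, this relationship simplifies to:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}$$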

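Since $P(x_1, \dots, x_n)$ is constant for a given input, we can use the following classification rule:

$$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y) \quad \Rightarrow \quad \hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y),$$

and we can use maximum a posteriori (MAP) estimation to estimate $P(y)$ and $P(x_i \mid y)$; the former is then the relative frequency of class $y$ in the training set.

The different naive Bayes classifiers differ mainly in the assumptions they make about the distribution of $P(x_i \mid y)$.

Despite these over-simplified assumptions, naive Bayes classifiers work quite well in many real-world situations, notably document classification and spam filtering. They need only a small amount of training data to estimate the necessary parameters.

Naive Bayes learners and classifiers can be extremely fast compared with more sophisticated methods. Decoupling the class-conditional feature distributions means that each distribution can be estimated independently as a one-dimensional distribution, which in turn helps to alleviate problems stemming from the curse of dimensionality.

On the other hand, although naive Bayes is known as a decent classifier, it is a poor estimator, so the probabilities returned by predict_proba should not be taken too seriously.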
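
To make the decision rule concrete, here is a minimal sketch in plain Python. The function name `map_predict`, the class names and all probabilities are made up for illustration and do not come from any library. The sketch sums log-probabilities instead of multiplying probabilities, because a product of many small likelihoods can underflow to zero.

```python
import math

def map_predict(priors, likelihoods, x):
    """Return the class maximizing log P(y) + sum_i log P(x_i | y).

    priors:      dict mapping class -> P(y)
    likelihoods: dict mapping class -> list with one dict per feature,
                 each mapping a feature value to P(x_i = value | y)
    x:           list of observed feature values
    """
    best_class, best_score = None, -math.inf
    for y, prior in priors.items():
        score = math.log(prior)
        for i, value in enumerate(x):
            score += math.log(likelihoods[y][i][value])
        if score > best_score:
            best_class, best_score = y, score
    return best_class

# Toy example (hypothetical numbers): two classes, two binary features.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [{0: 0.2, 1: 0.8}, {0: 0.7, 1: 0.3}],
    "ham": [{0: 0.9, 1: 0.1}, {0: 0.5, 1: 0.5}],
}
print(map_predict(priors, likelihoods, [1, 0]))  # prints "spam"
```

For the input `[1, 0]` the unnormalized posteriors are 0.4 × 0.8 × 0.7 = 0.224 for "spam" and 0.6 × 0.1 × 0.5 = 0.03 for "ham", so "spam" wins without ever computing the denominator $P(x_1, \dots, x_n)$.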

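Gaussian Naive Bayes

GaussianNB implements the Gaussian naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)$$

The parameters $\sigma_y$ and $\mu_y$ are estimated using maximum likelihood.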
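
As an illustration of the formula above, the following sketch estimates the mean and variance of one feature within one class by maximum likelihood and then evaluates the Gaussian density. The helper names and sample values are hypothetical. This is not how GaussianNB is implemented internally; in particular, scikit-learn also adds a small smoothing term to the variances (the `var_smoothing` parameter), which is omitted here.

```python
import math

def fit_gaussian(values):
    """Maximum-likelihood mean and variance (divides by N, not N - 1)."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return mu, var

def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Made-up values of a single feature for the training samples of one class y.
feature_values_class_y = [22.0, 25.0, 31.0, 38.0]
mu_y, var_y = fit_gaussian(feature_values_class_y)
print(mu_y, var_y)                     # 29.0 37.5
print(gaussian_pdf(30.0, mu_y, var_y))
```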

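Multinomial Naive Bayes

MultinomialNB implements the naive Bayes algorithm for multinomially distributed data. It is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word count vectors, although tf-idf vectors are also known to work well in practice). The distribution is parametrized by a vector $\theta_y = (\theta_{y1}, \ldots, \theta_{yn})$ for each class $y$, where $n$ is the number of features (in text classification, the size of the vocabulary) and $\theta_{yi}$ is the probability $P(x_i \mid y)$ of feature $i$ appearing in a sample belonging to class $y$.

The parameters $\theta_y$ are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting:

$$\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}$$

where $N_{yi} = \sum_{x \in T} x_i$ is the number of times feature $i$ appears in samples of class $y$ in the training set $T$, and $N_y = \sum_{i=1}^{n} N_{yi}$ is the total count of all features for class $y$.

The smoothing prior $\alpha \ge 0$ accounts for features that do not appear in the learning samples and prevents zero probabilities in later computations.
Setting $\alpha = 1$ is called Laplace smoothing, while $\alpha < 1$ is called Lidstone smoothing.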
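
A short worked example may help here. The sketch below applies the smoothed estimate to made-up feature counts for a single class; `smoothed_theta` is an illustrative helper, not a scikit-learn function.

```python
def smoothed_theta(feature_counts, alpha=1.0):
    """theta_yi = (N_yi + alpha) / (N_y + alpha * n) for one class y."""
    n = len(feature_counts)      # number of features
    total = sum(feature_counts)  # N_y
    return [(count + alpha) / (total + alpha * n) for count in feature_counts]

# Hypothetical word counts for one class; the third word never occurs in it.
counts = [3, 1, 0]
laplace = smoothed_theta(counts, alpha=1.0)     # [4/7, 2/7, 1/7]
unsmoothed = smoothed_theta(counts, alpha=0.0)  # [0.75, 0.25, 0.0]
print(laplace)
print(unsmoothed)
```

Without smoothing the unseen word gets probability 0, which would zero out the whole product $\prod_i P(x_i \mid y)$ for any document containing it. With $\alpha = 1$ it gets 1/7 and the estimates still sum to 1.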

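Bernoulli Naive Bayes

BernoulliNB implements the naive Bayes training and classification algorithms for data distributed according to multivariate Bernoulli distributions: there may be multiple features, but each one is assumed to be a binary-valued (Bernoulli, boolean) variable. This class therefore requires samples to be represented as binary-valued feature vectors; if handed any other kind of data, a BernoulliNB instance may binarize its input (depending on the binarize parameter).

The decision rule for Bernoulli naive Bayes is based on:

$$P(x_i \mid y) = P(i \mid y) x_i + (1 - P(i \mid y)) (1 - x_i)$$

This differs from the multinomial rule in that it explicitly penalizes the non-occurrence of a feature $i$ that is an indicator for class $y$, whereas the multinomial variant simply ignores a feature that does not occur.

In text classification, word occurrence vectors (rather than word count vectors) can be used to train and use this classifier. BernoulliNB may perform better on some datasets, especially those with shorter documents.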
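
The penalty for absent features can be seen in a small sketch of the rule above. The probabilities $P(i \mid y)$ and the two documents are invented for illustration, and `bernoulli_log_likelihood` is a hypothetical helper name.

```python
import math

def bernoulli_log_likelihood(x, p):
    """Sum over features of log(P(i|y) * x_i + (1 - P(i|y)) * (1 - x_i))."""
    return sum(
        math.log(p_i * x_i + (1 - p_i) * (1 - x_i))
        for x_i, p_i in zip(x, p)
    )

# Hypothetical P(i | y) for three words in one class; word 0 is typical of it.
p_given_y = [0.9, 0.5, 0.1]
with_word_0 = bernoulli_log_likelihood([1, 0, 0], p_given_y)
without_word_0 = bernoulli_log_likelihood([0, 0, 0], p_given_y)
print(with_word_0, without_word_0)
```

The document that lacks word 0 contributes a factor of $1 - 0.9 = 0.1$ for that word, so its likelihood under class $y$ is lower even though no word "against" the class appears in it.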

Assessing Personal Credit Risk with Naive Bayes

Data source and a first look at the data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

credit = pd.read_csv("./input/credit.csv")
credit.head(5)

|   | checking_balance | months_loan_duration | credit_history | purpose | amount | savings_balance | employment_length | installment_rate | personal_status | other_debtors | ... | property | age | installment_plan | housing | existing_credits | job | dependents | telephone | foreign_worker | default |
| 0 | < 0 DM | 6 | critical | radio/tv | 1169 | unknown | > 7 yrs | 4 | single male | none | ... | real estate | 67 | none | own | 2 | skilled employee | 1 | yes | yes | 1 |
| 1 | 1 - 200 DM | 48 | repaid | radio/tv | 5951 | < 100 DM | 1 - 4 yrs | 2 | female | none | ... | real estate | 22 | none | own | 1 | skilled employee | 1 | none | yes | 2 |
| 2 | unknown | 12 | critical | education | 2096 | < 100 DM | 4 - 7 yrs | 2 | single male | none | ... | real estate | 49 | none | own | 1 | unskilled resident | 2 | none | yes | 1 |
| 3 | < 0 DM | 42 | repaid | furniture | 7882 | < 100 DM | 4 - 7 yrs | 2 | single male | guarantor | ... | building society savings | 45 | none | for free | 1 | skilled employee | 2 | none | yes | 1 |
| 4 | < 0 DM | 24 | delayed | car (new) | 4870 | < 100 DM | 1 - 4 yrs | 3 | single male | none | ... | unknown/none | 53 | none | for free | 2 | skilled employee | 2 | none | yes | 2 |

5 rows × 21 columns

Data preprocessing

The columns checking_balance, credit_history, purpose, savings_balance, employment_length, personal_status, other_debtors, property, installment_plan, housing, job, telephone and foreign_worker hold string values, so they must be encoded as integers before modeling.

col_dicts = {}
cols = ['checking_balance','credit_history', 'purpose', 'savings_balance', 'employment_length', 'personal_status',
'other_debtors','property','installment_plan','housing','job','telephone','foreign_worker']

col_dicts = {'checking_balance': {'unknown': 0,
'< 0 DM': 1,
'1 - 200 DM': 2,
'> 200 DM': 3
},
'credit_history': {'critical': 0,
'repaid': 1,
'delayed': 2,
'fully repaid': 3,
'fully repaid this bank': 4
},
'employment_length': {'unemployed': 0,
'0 - 1 yrs': 1,
'1 - 4 yrs': 2,
'4 - 7 yrs': 3,
'> 7 yrs': 4
},
'foreign_worker': {'yes': 0 ,'no': 1},
'housing': {'own': 0, 'for free': 1, 'rent': 2},
'installment_plan': {'none': 0, 'bank': 1, 'stores': 2},
'job': {'unemployed non-resident': 0,
'unskilled resident': 1,
'skilled employee': 2,
'mangement self-employed': 3
},
'other_debtors': {'none': 0,
'guarantor': 1,
'co-applicant': 2 },
'personal_status': {'single male': 0,
'female': 1,
'divorced male': 2,
'married male': 3
},
'property': {'real estate': 0,
'building society savings': 1,
'unknown/none': 2,
'other': 3
},
'purpose': {'radio/tv': 0,
'education': 1,
'furniture': 2,
'car (new)': 3,
'car (used)': 4,
'business': 5,
'domestic appliances': 6,
'repairs': 7,
'others': 8,
'retraining': 9},
'savings_balance': {'unknown': 0,
'< 100 DM': 1,
'101 - 500 DM': 2,
'501 - 1000 DM': 3,
'> 1000 DM': 4
},
'telephone': {'none': 1, 'yes': 0}}

# Replace each string category with its integer code.
for col in cols:
    credit[col] = credit[col].map(col_dicts[col])


credit.head(5)

checking_balance months_loan_duration credit_history purpose amount savings_balance employment_length installment_rate personal_status other_debtors ... property age installment_plan housing existing_credits job dependents telephone foreign_worker default
0 1 6 0 0 1169 0 4 4 0 0 ... 0 67 0 0 2 2 1 0 0 1
1 2 48 1 0 5951 1 2 2 1 0 ... 0 22 0 0 1 2 1 1 0 2
2 0 12 0 1 2096 1 3 2 0 0 ... 0 49 0 0 1 1 2 1 0 1
3 1 42 1 2 7882 1 3 2 0 1 ... 1 45 0 1 1 2 2 1 0 1
4 1 24 2 3 4870 1 2 3 0 0 ... 2 53 0 1 2 2 2 1 0 2

5 rows × 21 columns

Feature analysis

Compute the correlation matrix of the features to inspect how strongly the variables depend on one another.

import numpy as np

corrmat = credit.corr()  # compute the correlation matrix
corrmat

checking_balance months_loan_duration credit_history purpose amount savings_balance employment_length installment_rate personal_status other_debtors ... property age installment_plan housing existing_credits job dependents telephone foreign_worker default
checking_balance 1.000000 0.035050 0.138210 0.017272 0.024561 -0.005614 -0.108536 -0.057942 0.069946 0.041970 ... -0.005623 -0.049058 0.033566 0.032925 -0.093081 -0.054255 -0.040889 0.039209 -0.000205 0.197788
months_loan_duration 0.035050 1.000000 0.142631 0.105305 0.624984 -0.064526 0.057381 0.074749 -0.116029 0.006711 ... 0.245655 -0.036136 0.076992 0.011950 -0.011284 0.210910 -0.023834 -0.164718 -0.138196 0.214927
credit_history 0.138210 0.142631 1.000000 0.143938 0.113776 0.019657 -0.097325 -0.024740 -0.005519 -0.008955 ... 0.071606 -0.070046 0.239431 0.077417 -0.207960 0.001718 0.051849 0.018283 -0.041784 0.232157
purpose 0.017272 0.105305 0.143938 1.000000 0.203234 0.005263 -0.052126 -0.092747 -0.035918 -0.020423 ... 0.027161 0.066020 0.049489 0.028464 0.071995 0.025409 0.077245 -0.116031 0.035655 0.051311
amount 0.024561 0.624984 0.113776 0.203234 1.000000 -0.107538 -0.008367 -0.271316 -0.159434 0.037921 ... 0.224550 0.032716 0.045815 0.056119 0.020795 0.285385 0.017142 -0.276995 -0.050050 0.154739
savings_balance -0.005614 -0.064526 0.019657 0.005263 -0.107538 1.000000 0.014600 -0.000805 0.062953 -0.047575 ... -0.004121 -0.017997 0.009373 0.003268 -0.004176 -0.040803 -0.021302 0.037452 0.005318 -0.033871
employment_length -0.108536 0.057381 -0.097325 -0.052126 -0.008367 0.014600 1.000000 0.126161 -0.181745 -0.028758 ... 0.065533 0.256227 -0.008676 -0.044583 0.125791 0.101225 0.097192 -0.060518 -0.027232 -0.116002
installment_rate -0.057942 0.074749 -0.024740 -0.092747 -0.271316 -0.000805 0.126161 1.000000 -0.081121 -0.014835 ... 0.039353 0.058266 0.034750 -0.073955 0.021669 0.097755 -0.071207 -0.014413 -0.090024 0.072404
personal_status 0.069946 -0.116029 -0.005519 -0.035918 -0.159434 0.062953 -0.181745 -0.081121 1.000000 -0.011880 ... -0.099575 -0.186563 -0.065461 0.083146 -0.089640 -0.064335 -0.238327 0.057207 0.009204 0.042643
other_debtors 0.041970 0.006711 -0.008955 -0.020423 0.037921 -0.047575 -0.028758 -0.014835 -0.011880 1.000000 ... -0.101378 -0.028294 -0.000955 0.036219 -0.017662 -0.021106 -0.010990 0.050996 0.107639 0.028441
residence_history -0.059555 0.034067 -0.027989 0.073651 0.028926 -0.011772 0.245081 0.049302 -0.106742 -0.012690 ... 0.055260 0.266419 -0.034517 0.255106 0.089625 0.012655 0.042643 -0.095359 -0.054097 0.002967
property -0.005623 0.245655 0.071606 0.027161 0.224550 -0.004121 0.065533 0.039353 -0.099575 -0.101378 ... 1.000000 -0.054186 0.041147 0.022420 0.001209 0.244946 -0.041111 -0.155051 -0.138772 0.090146
age -0.049058 -0.036136 -0.070046 0.066020 0.032716 -0.017997 0.256227 0.058266 -0.186563 -0.028294 ... -0.054186 1.000000 0.021858 -0.108437 0.149254 0.015673 0.118201 -0.145259 -0.006151 -0.091127
installment_plan 0.033566 0.076992 0.239431 0.049489 0.045815 0.009373 -0.008676 0.034750 -0.065461 -0.000955 ... 0.041147 0.021858 1.000000 -0.077624 0.046993 0.009872 0.057595 -0.030704 -0.036734 0.104885
housing 0.032925 0.011950 0.077417 0.028464 0.056119 0.003268 -0.044583 -0.073955 0.083146 0.036219 ... 0.022420 -0.108437 -0.077624 1.000000 -0.052609 0.015201 -0.015004 0.003307 0.005155 0.123815
existing_credits -0.093081 -0.011284 -0.207960 0.071995 0.020795 -0.004176 0.125791 0.021669 -0.089640 -0.017662 ... 0.001209 0.149254 0.046993 -0.052609 1.000000 -0.026321 0.109667 -0.065553 -0.009717 -0.045732
job -0.054255 0.210910 0.001718 0.025409 0.285385 -0.040803 0.101225 0.097755 -0.064335 -0.021106 ... 0.244946 0.015673 0.009872 0.015201 -0.026321 1.000000 -0.093559 -0.383022 -0.100944 0.032735
dependents -0.040889 -0.023834 0.051849 0.077245 0.017142 -0.021302 0.097192 -0.071207 -0.238327 -0.010990 ... -0.041111 0.118201 0.057595 -0.015004 0.109667 -0.093559 1.000000 0.014753 0.077071 -0.003015
telephone 0.039209 -0.164718 0.018283 -0.116031 -0.276995 0.037452 -0.060518 -0.014413 0.057207 0.050996 ... -0.155051 -0.145259 -0.030704 0.003307 -0.065553 -0.383022 0.014753 1.000000 0.107401 0.036466
foreign_worker -0.000205 -0.138196 -0.041784 0.035655 -0.050050 0.005318 -0.027232 -0.090024 0.009204 0.107639 ... -0.138772 -0.006151 -0.036734 0.005155 -0.009717 -0.100944 0.077071 0.107401 1.000000 -0.082079
default 0.197788 0.214927 0.232157 0.051311 0.154739 -0.033871 -0.116002 0.072404 0.042643 0.028441 ... 0.090146 -0.091127 0.104885 0.123815 -0.045732 0.032735 -0.003015 0.036466 -0.082079 1.000000

21 rows × 21 columns
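
For readers who want to see what each entry of this matrix measures: by default `DataFrame.corr()` computes the Pearson correlation coefficient. The sketch below spells that formula out in plain Python and applies it to the five rows of months_loan_duration and amount printed by `credit.head(5)`. Five rows are far too few to reproduce the 0.62 in the full matrix; the point is only to illustrate the computation. The helper name `pearson` is an assumption of this sketch.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First five rows of two columns, copied from the credit.head(5) output.
months_loan_duration = [6, 48, 12, 42, 24]
amount = [1169, 5951, 2096, 7882, 4870]
print(pearson(months_loan_duration, amount))
```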

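Plotting the correlation matrix as a heat map with seaborn shows that most pairwise correlations are weak; the largest off-diagonal value visible in the output above is about 0.62, between months_loan_duration and amount. As a working approximation, we can therefore "naively" assume that the features are independent of one another.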

import seaborn as sns
sns.set(font_scale=1.5)  # set the font scale
plt.figure(figsize=(15, 15))
hm=sns.heatmap(corrmat, cbar=True, square=True, yticklabels=credit.columns, xticklabels=credit.columns,cmap="YlGnBu")
plt.show()


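Model selection

First split the data into a training set and a test set. The training set is used to build the naive Bayes model, and the test set is used to evaluate its performance.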

from sklearn import model_selection
from sklearn import metrics

y = credit['default']
#del credit['default']
X = credit.loc[:,'checking_balance':'foreign_worker']

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3, random_state=1)

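Fit a multinomial naive Bayes model to the training data. For each row of X_test, predict_proba returns the posterior probability of each of the two classes, and the sample is assigned to the class with the larger posterior.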

from sklearn.naive_bayes import MultinomialNB
clf_multi = MultinomialNB()
clf_multi.fit(X_train,y_train)
y_pred = clf_multi.predict(X_test)

print(clf_multi.predict_proba(X_test))

[[7.08355206e-09 9.99999993e-01]
[4.05129188e-26 1.00000000e+00]
[9.79691264e-01 2.03087364e-02]
[9.82619335e-01 1.73806654e-02]
[4.94859998e-04 9.99505140e-01]
[9.99998409e-01 1.59110968e-06]
[8.85299264e-04 9.99114701e-01]
[9.99912434e-01 8.75657002e-05]
[9.99998245e-01 1.75510442e-06]
[9.87150312e-01 1.28496881e-02]
[9.99791955e-01 2.08045167e-04]
[7.17413046e-02 9.28258695e-01]
[1.42769922e-09 9.99999999e-01]
[9.99997304e-01 2.69595250e-06]
[5.61504493e-08 9.99999944e-01]
[9.99995078e-01 4.92187029e-06]
[8.54170929e-01 1.45829071e-01]
[9.99988770e-01 1.12298436e-05]
[9.99290453e-01 7.09546685e-04]
[5.59657372e-14 1.00000000e+00]
[9.99999766e-01 2.33633118e-07]
[9.99991893e-01 8.10675856e-06]
[9.27312798e-01 7.26872020e-02]
[9.99999972e-01 2.81816265e-08]
[9.99572578e-01 4.27421788e-04]
[9.99554836e-01 4.45164490e-04]
[7.54262725e-11 1.00000000e+00]
[9.99583183e-01 4.16817416e-04]
[9.99999892e-01 1.07557333e-07]
[2.03671867e-20 1.00000000e+00]
[9.30936726e-03 9.90690633e-01]
[4.22137734e-01 5.77862266e-01]
[3.07988013e-24 1.00000000e+00]
[9.99892134e-01 1.07866386e-04]
[9.98896579e-01 1.10342119e-03]
[1.27516939e-19 1.00000000e+00]
[3.21326686e-13 1.00000000e+00]
[9.99978102e-01 2.18978094e-05]
[9.99543666e-01 4.56334493e-04]
[9.75120153e-01 2.48798467e-02]
[9.99783249e-01 2.16751016e-04]
[7.53225441e-01 2.46774559e-01]
[1.00000000e+00 1.52390295e-10]
[9.23327518e-01 7.66724815e-02]
[9.99999996e-01 3.96981750e-09]
[9.99201910e-01 7.98090310e-04]
[7.42536445e-01 2.57463555e-01]
[9.99779916e-01 2.20084349e-04]
[9.97465543e-01 2.53445731e-03]
[9.99522680e-01 4.77319741e-04]
[9.99729186e-01 2.70814335e-04]
[9.99736721e-01 2.63278865e-04]
[5.64344942e-04 9.99435655e-01]
[4.00026914e-05 9.99959997e-01]
[9.53117144e-06 9.99990469e-01]
[8.69420434e-01 1.30579566e-01]
[9.99999997e-01 3.07306827e-09]
[1.57057604e-06 9.99998429e-01]
[9.99530449e-01 4.69551156e-04]
[4.44235731e-07 9.99999556e-01]
[9.99997479e-01 2.52079810e-06]
[9.99985346e-01 1.46542234e-05]
[3.09048564e-19 1.00000000e+00]
[7.49952563e-01 2.50047437e-01]
[9.98645456e-01 1.35454403e-03]
[5.61016777e-15 1.00000000e+00]
[9.99599379e-01 4.00620678e-04]
[1.00000000e+00 4.94688483e-10]
[9.93921444e-01 6.07855588e-03]
[2.55830477e-12 1.00000000e+00]
[9.98128166e-01 1.87183359e-03]
[9.40583266e-01 5.94167339e-02]
[9.99999664e-01 3.35665419e-07]
[9.99965174e-01 3.48257351e-05]
[7.96495634e-01 2.03504366e-01]
[2.30045586e-01 7.69954414e-01]
[3.43845989e-01 6.56154011e-01]
[9.99999399e-01 6.00685647e-07]
[9.99092122e-01 9.07878038e-04]
[9.99932474e-01 6.75263730e-05]
[3.18979875e-11 1.00000000e+00]
[9.79179080e-01 2.08209204e-02]
[1.00000000e+00 1.36448477e-11]
[9.99993882e-01 6.11801758e-06]
[9.65786631e-01 3.42133689e-02]
[9.76111368e-01 2.38886319e-02]
[9.99999941e-01 5.85847263e-08]
[9.99999819e-01 1.80804466e-07]
[9.99922821e-01 7.71793093e-05]
[9.99993133e-01 6.86748630e-06]
[9.99268149e-01 7.31851195e-04]
[5.22545593e-01 4.77454407e-01]
[1.97973112e-10 1.00000000e+00]
[9.88164915e-01 1.18350852e-02]
[9.99969019e-01 3.09809787e-05]
[1.20452595e-05 9.99987955e-01]
[9.95753421e-01 4.24657901e-03]
[1.50879629e-01 8.49120371e-01]
[3.69758990e-03 9.96302410e-01]
[1.32728342e-07 9.99999867e-01]
[9.67105940e-01 3.28940603e-02]
[1.00000000e+00 2.51889396e-10]
[4.20607438e-05 9.99957939e-01]
[9.99999995e-01 4.50441505e-09]
[9.98196982e-01 1.80301823e-03]
[2.60895496e-01 7.39104504e-01]
[9.99964355e-01 3.56449651e-05]
[3.40718013e-08 9.99999966e-01]
[6.68017653e-28 1.00000000e+00]
[9.99982158e-01 1.78417815e-05]
[9.99999723e-01 2.77421602e-07]
[9.99797624e-01 2.02375590e-04]
[8.59527536e-01 1.40472464e-01]
[1.81821662e-08 9.99999982e-01]
[1.89849679e-08 9.99999981e-01]
[1.90552896e-07 9.99999809e-01]
[9.99991452e-01 8.54833987e-06]
[9.99999999e-01 8.90283237e-10]
[9.86238948e-01 1.37610516e-02]
[9.19056892e-01 8.09431077e-02]
[5.20189940e-15 1.00000000e+00]
[1.07176072e-06 9.99998928e-01]
[1.96524734e-08 9.99999980e-01]
[9.22312119e-02 9.07768788e-01]
[9.99999561e-01 4.38736439e-07]
[3.14262425e-02 9.68573758e-01]
[9.99656508e-01 3.43491997e-04]
[4.49764172e-01 5.50235828e-01]
[9.99551027e-01 4.48973406e-04]
[4.67593429e-01 5.32406571e-01]
[9.99995983e-01 4.01738206e-06]
[9.79193431e-01 2.08065690e-02]
[4.33184050e-06 9.99995668e-01]
[9.99993077e-01 6.92318920e-06]
[9.99999812e-01 1.88485229e-07]
[3.02205341e-06 9.99996978e-01]
[9.07568653e-01 9.24313468e-02]
[1.22717545e-01 8.77282455e-01]
[9.99982544e-01 1.74558917e-05]
[9.99999996e-01 3.78412084e-09]
[5.61424124e-05 9.99943858e-01]
[9.82620365e-01 1.73796352e-02]
[1.94171276e-05 9.99980583e-01]
[9.83179110e-01 1.68208896e-02]
[9.99971324e-01 2.86761851e-05]
[1.86978428e-09 9.99999998e-01]
[7.06519691e-01 2.93480309e-01]
[8.80296829e-01 1.19703171e-01]
[9.87132431e-01 1.28675686e-02]
[1.65934444e-08 9.99999983e-01]
[9.99999995e-01 5.22521465e-09]
[9.97170829e-01 2.82917130e-03]
[9.99995505e-01 4.49460473e-06]
[9.97536742e-01 2.46325758e-03]
[1.17003871e-05 9.99988300e-01]
[2.75965897e-01 7.24034103e-01]
[4.72459215e-04 9.99527541e-01]
[9.99603650e-01 3.96350493e-04]
[9.99993266e-01 6.73357777e-06]
[9.95930147e-01 4.06985314e-03]
[9.98430108e-01 1.56989215e-03]
[1.02950719e-14 1.00000000e+00]
[9.95504721e-01 4.49527932e-03]
[9.88755899e-01 1.12441013e-02]
[5.76970096e-29 1.00000000e+00]
[9.99994030e-01 5.96964214e-06]
[8.04594587e-01 1.95405413e-01]
[2.73498848e-02 9.72650115e-01]
[9.98062495e-01 1.93750460e-03]
[9.99976044e-01 2.39560643e-05]
[5.51307112e-05 9.99944869e-01]
[9.99999921e-01 7.91559779e-08]
[9.99999870e-01 1.30319104e-07]
[9.99999957e-01 4.27677987e-08]
[8.68652943e-01 1.31347057e-01]
[9.99878314e-01 1.21685501e-04]
[9.97220154e-01 2.77984582e-03]
[9.99998005e-01 1.99475475e-06]
[1.12048195e-01 8.87951805e-01]
[9.98556552e-01 1.44344822e-03]
[4.74835052e-01 5.25164948e-01]
[8.85006321e-01 1.14993679e-01]
[9.99791624e-01 2.08376168e-04]
[7.14924653e-14 1.00000000e+00]
[9.11613644e-01 8.83863562e-02]
[9.99168495e-01 8.31504595e-04]
[9.99999999e-01 1.40036211e-09]
[8.95294053e-01 1.04705947e-01]
[9.94302903e-01 5.69709688e-03]
[2.58934387e-12 1.00000000e+00]
[9.99968031e-01 3.19694335e-05]
[1.00483240e-01 8.99516760e-01]
[9.99883869e-01 1.16131177e-04]
[9.99998999e-01 1.00050972e-06]
[9.41217448e-01 5.87825518e-02]
[9.99999144e-01 8.56187393e-07]
[3.04245716e-02 9.69575428e-01]
[9.99950646e-01 4.93539798e-05]
[9.94764447e-01 5.23555342e-03]
[9.99725874e-01 2.74126170e-04]
[9.99675935e-01 3.24064676e-04]
[9.99988898e-01 1.11021676e-05]
[4.96753226e-01 5.03246774e-01]
[9.92510677e-01 7.48932261e-03]
[3.71443861e-02 9.62855614e-01]
[9.99880010e-01 1.19989622e-04]
[5.43341069e-01 4.56658931e-01]
[2.23740846e-02 9.77625915e-01]
[2.16734441e-07 9.99999783e-01]
[8.28053468e-03 9.91719465e-01]
[6.42953055e-04 9.99357047e-01]
[2.31551620e-05 9.99976845e-01]
[9.99999954e-01 4.59849329e-08]
[9.99999995e-01 4.68993640e-09]
[9.99999996e-01 4.44118198e-09]
[1.95727744e-13 1.00000000e+00]
[9.97044910e-01 2.95509029e-03]
[1.00000000e+00 4.70059652e-11]
[4.91843429e-12 1.00000000e+00]
[9.99151819e-01 8.48180737e-04]
[9.99990247e-01 9.75330022e-06]
[9.99900851e-01 9.91486749e-05]
[5.19925707e-01 4.80074293e-01]
[9.98595280e-01 1.40472032e-03]
[9.99998228e-01 1.77218388e-06]
[9.70866147e-01 2.91338528e-02]
[7.48246743e-07 9.99999252e-01]
[9.99973845e-01 2.61547528e-05]
[1.20010195e-08 9.99999988e-01]
[9.99989583e-01 1.04166523e-05]
[3.28748185e-08 9.99999967e-01]
[9.99999966e-01 3.38654087e-08]
[5.05940992e-03 9.94940590e-01]
[9.99980420e-01 1.95803984e-05]
[9.99748556e-01 2.51444300e-04]
[9.99882532e-01 1.17467621e-04]
[9.98448749e-01 1.55125101e-03]
[9.99999884e-01 1.16129977e-07]
[1.53291657e-03 9.98467083e-01]
[9.76220766e-01 2.37792338e-02]
[8.44717999e-03 9.91552820e-01]
[9.99886188e-01 1.13812358e-04]
[3.90368809e-03 9.96096312e-01]
[9.99999858e-01 1.42492789e-07]
[9.99999698e-01 3.01588597e-07]
[9.30141719e-01 6.98582809e-02]
[9.99985365e-01 1.46346301e-05]
[9.99920174e-01 7.98259279e-05]
[9.99996587e-01 3.41321375e-06]
[9.96987845e-01 3.01215503e-03]
[9.99483133e-01 5.16866706e-04]
[9.91463705e-01 8.53629509e-03]
[3.15926974e-10 1.00000000e+00]
[2.14783690e-03 9.97852163e-01]
[7.50823010e-01 2.49176990e-01]
[9.99999137e-01 8.62819165e-07]
[6.99934104e-03 9.93000659e-01]
[5.46894966e-15 1.00000000e+00]
[2.13290238e-05 9.99978671e-01]
[9.99793159e-01 2.06840609e-04]
[8.74970112e-01 1.25029888e-01]
[9.99867579e-01 1.32420724e-04]
[1.00000000e+00 2.04284020e-10]
[9.99997195e-01 2.80502361e-06]
[9.99999157e-01 8.43225198e-07]
[9.99999948e-01 5.17726713e-08]
[2.40872121e-02 9.75912788e-01]
[9.47413892e-01 5.25861076e-02]
[6.54889629e-09 9.99999993e-01]
[3.40621192e-06 9.99996594e-01]
[4.30630753e-01 5.69369247e-01]
[9.99993574e-01 6.42617921e-06]
[9.98944910e-01 1.05508978e-03]
[9.16489524e-01 8.35104757e-02]
[9.99418072e-01 5.81928176e-04]
[5.40842574e-04 9.99459157e-01]
[9.99998061e-01 1.93903762e-06]
[3.47609654e-02 9.65239035e-01]
[9.99812899e-01 1.87101404e-04]
[9.99746283e-01 2.53716520e-04]
[2.81628794e-04 9.99718371e-01]
[9.99828951e-01 1.71048982e-04]
[9.99969589e-01 3.04105630e-05]
[8.55977708e-02 9.14402229e-01]
[3.44376719e-11 1.00000000e+00]
[9.99999699e-01 3.01499794e-07]
[9.99998707e-01 1.29347853e-06]
[9.99958928e-01 4.10716828e-05]
[9.93314781e-01 6.68521864e-03]
[3.21042931e-09 9.99999997e-01]
[9.86472042e-01 1.35279576e-02]
[9.99904973e-01 9.50269676e-05]
[2.83671375e-04 9.99716329e-01]
[9.98868733e-01 1.13126730e-03]
[2.44074089e-01 7.55925911e-01]
[4.94339246e-04 9.99505661e-01]
[3.39748678e-05 9.99966025e-01]
[9.99999809e-01 1.91303777e-07]
[1.16822919e-03 9.98831771e-01]
[9.99997637e-01 2.36342666e-06]]
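
To connect these probabilities to the predicted labels, the sketch below takes the first three rows of the output above and picks, for each row, the class with the largest posterior. It assumes the columns follow the sorted class labels (1, then 2), which is how scikit-learn orders `classes_`; `labels_from_proba` is an illustrative helper, not a library function.

```python
# Class labels in sorted order, matching the columns of predict_proba.
classes = [1, 2]

# First three rows of the predict_proba output shown above.
proba = [
    [7.08355206e-09, 9.99999993e-01],
    [4.05129188e-26, 1.00000000e+00],
    [9.79691264e-01, 2.03087364e-02],
]

def labels_from_proba(rows, class_labels):
    """For each row, return the label whose posterior is largest."""
    return [class_labels[row.index(max(row))] for row in rows]

print(labels_from_proba(proba, classes))  # [2, 2, 1]
```

Many of the posteriors are extremely close to 0 or 1, which matches the earlier caveat that naive Bayes probability estimates tend to be overconfident and should not be taken too seriously.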

Model evaluation

print('Multinomial naive Bayes classification results:')
print('Test set accuracy:')
print(clf_multi.score(X_test,y_test))
print('Precision (class 1):')
print(metrics.precision_score(y_test,y_pred))
print('Confusion matrix:')
print(metrics.confusion_matrix(y_true=y_test,y_pred=y_pred,labels=list(set(y))))

print(metrics.classification_report(y_test,y_pred))

Multinomial naive Bayes classification results:
Test set accuracy:
0.6233333333333333
Precision (class 1):
0.7537688442211056
Confusion matrix:
[[150  64]
 [ 49  37]]
              precision    recall  f1-score   support

           1       0.75      0.70      0.73       214
           2       0.37      0.43      0.40        86

    accuracy                           0.62       300
   macro avg       0.56      0.57      0.56       300
weighted avg       0.64      0.62      0.63       300

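The precision for class 1 is 75.38% and the accuracy on the test set is 62.33%, but recall for defaults (class 2) is only 43%. Multinomial naive Bayes is therefore not accurate enough on this data.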
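
These figures can be re-derived by hand from the confusion matrix, which is a useful check on what each metric means. The sketch below uses only the four counts printed above. It assumes scikit-learn's convention that rows are true classes and columns are predicted classes; `per_class_metrics` is an illustrative helper.

```python
# Confusion matrix from the multinomial model: rows are true classes (1, 2),
# columns are predicted classes (1, 2).
cm = [[150, 64],
      [49, 37]]

def per_class_metrics(matrix, k):
    """Return (precision, recall) for the class at index k."""
    true_pos = matrix[k][k]
    predicted_k = sum(row[k] for row in matrix)  # column sum
    actual_k = sum(matrix[k])                    # row sum
    return true_pos / predicted_k, true_pos / actual_k

accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))
precision_1, recall_1 = per_class_metrics(cm, 0)
precision_2, recall_2 = per_class_metrics(cm, 1)
print(round(accuracy, 4), round(precision_1, 4), round(recall_2, 4))
# 0.6233 0.7538 0.4302
```

Here 150 / (150 + 49) gives the 75.38% precision for class 1, (150 + 37) / 300 gives the 62.33% accuracy, and 37 / (49 + 37) gives the 43% recall for class 2.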

Model tuning

Next, fit a Gaussian naive Bayes model to the training data.

from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print('Gaussian naive Bayes classification results:')
print('Test set accuracy:')
print(clf.score(X_test,y_test))
print('Precision (class 1):')
print(metrics.precision_score(y_test,y_pred))

print('Confusion matrix:')
print(metrics.confusion_matrix(y_true=y_test,y_pred=y_pred,labels=list(set(y))))

print(metrics.classification_report(y_test,y_pred))

Gaussian naive Bayes classification results:
Test set accuracy:
0.7233333333333334
Precision (class 1):
0.8046511627906977
Confusion matrix:
[[173  41]
 [ 42  44]]
              precision    recall  f1-score   support

           1       0.80      0.81      0.81       214
           2       0.52      0.51      0.51        86

    accuracy                           0.72       300
   macro avg       0.66      0.66      0.66       300
weighted avg       0.72      0.72      0.72       300

The precision for class 1 rises to 80.47%, the accuracy on the test set rises to 72.33%, and recall for defaults (class 2) rises to 51%.