Logistic Regression Model

Let's use scikit-learn, a machine learning library for Python, to tackle a simple classification problem with a logistic regression model.

0. Importing libraries

import numpy as np
import pandas as pd

import sklearn

import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

np.set_printoptions(precision=4)

print("numpy :", np.__version__)
print("pandas :", pd.__version__)
print("sklearn :", sklearn.__version__)
print("seaborn :", sns.__version__)
print("matplotlib :", matplotlib.__version__)

numpy : 1.16.1
pandas : 0.24.2
sklearn : 0.20.2
seaborn : 0.9.0
matplotlib : 3.0.2

1. Loading and shaping the data

Load the Breast Cancer dataset from sklearn.datasets.

# load the dataset
from sklearn.datasets import load_breast_cancer
bc = load_breast_cancer()

Next, define the variables df_data, df_target, and df as instances of the pandas DataFrame class.

Reference: pandas.DataFrame (pandas 1.0.1 documentation)

df_data = pd.DataFrame(bc.data, columns=bc.feature_names)
df_data = df_data.loc[:, ['mean concave points', 'symmetry error', 'texture error', 'worst radius']]
df_target = pd.DataFrame(bc.target, columns=['class'])
df = pd.concat([df_data, df_target], axis=1)
df.head()

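Explanatory variables (features):
- mean concave points - mean number of concave portions of the cell nucleus contour
- symmetry error - standard error of the symmetry of the cell nuclei
- texture error - standard error of the texture (texture is the standard deviation of gray-scale values)
- worst radius - "worst" radius, i.e. the mean of the three largest radius values
Target variable (class label):
- class - diagnosis of the tumor (malignant: 0, benign: 1)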
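The label encoding and the column names listed above can be checked directly against the dataset object. The following is a minimal sketch; the variable names `label_names`, `counts`, `selected`, and `missing` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()

# target_names[i] is the name of the class encoded as integer i
label_names = [str(name) for name in bc.target_names]
print("label names:", label_names)

# Number of samples per encoded label (index 0, then index 1)
counts = np.bincount(bc.target)
for name, count in zip(label_names, counts):
    print(f"{name}: {count}")

# Confirm that the four selected columns exist in the dataset
selected = ['mean concave points', 'symmetry error', 'texture error', 'worst radius']
available = set(bc.feature_names)
missing = [c for c in selected if c not in available]
print("missing columns:", missing)
```

Since label 1 means benign, the mean of the class column is simply the fraction of benign samples.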

Look at the summary statistics of the data (sample count, mean, standard deviation, minimum, quartiles including the median, and maximum).

df.describe().T

Compute the correlation matrix of the data.
Each diagonal element is the correlation of a variable with itself, so it is always 1.0.

df.corr()

Let's visualize the correlation matrix with seaborn.
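One possible way to draw it is a heatmap, sketched below. The notebook's own plotting cell is not available, so the choice of `sns.heatmap` and its styling arguments (`cmap`, `fmt`, `vmin`, `vmax`) are assumptions, not the original code. The block rebuilds `df` so that it runs on its own.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()
cols = ['mean concave points', 'symmetry error', 'texture error', 'worst radius']

# Rebuild the same DataFrame as above: four features plus the class label
df = pd.DataFrame(bc.data, columns=bc.feature_names).loc[:, cols].copy()
df['class'] = bc.target

# Pearson correlation matrix (5 x 5)
corr = df.corr()

# Heatmap with the coefficient printed in each cell; the color scale is fixed to [-1, 1]
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm',
            vmin=-1.0, vmax=1.0, square=True, ax=ax)
ax.set_title('Correlation matrix')
plt.tight_layout()
plt.show()
```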

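Draw the scatter plot matrix of the data.
For pairs of explanatory variables with a large correlation, such as mean concave points and worst radius (about 0.83), multicollinearity should be considered.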

sns.pairplot(df, height=2.0, diag_kind='hist', markers='+')
plt.show()
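The multicollinearity remark above can also be checked numerically. A common diagnostic is the variance inflation factor (VIF), which for the j-th explanatory variable equals the j-th diagonal element of the inverse of the correlation matrix of the explanatory variables. The sketch below is an addition to the notebook, not part of it, and uses only NumPy; the names `R` and `vif` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()
cols = ['mean concave points', 'symmetry error', 'texture error', 'worst radius']
idx = [list(bc.feature_names).index(c) for c in cols]
X = bc.data[:, idx]

# Correlation matrix of the explanatory variables only (columns are variables)
R = np.corrcoef(X, rowvar=False)

# VIF_j is the j-th diagonal element of the inverse correlation matrix
vif = np.diag(np.linalg.inv(R))
for name, value in zip(cols, vif):
    print(f"{name:>20s}: VIF = {value:.2f}")
```

A VIF of 1 means the variable is uncorrelated with the others; larger values indicate that it is increasingly well explained by the remaining variables. Which threshold counts as problematic is a rule of thumb and depends on the application.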

In a classification dataset, each sample comes with a class label.
Color each point of the scatter plot matrix above according to which of the two classes it belongs to.

sns.pairplot(df, height=2.0, diag_kind='hist', markers='+', hue='class')
plt.show()

2. Splitting the data

From the pandas DataFrame variables df_data and df_target, extract the data for the explanatory variables and the target variable, and store them in the numpy.ndarray variables X and y.

X = df_data.values
y = df_target.values
y = y.reshape(-1)

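Split the whole dataset into train data and test data. That is, split X into X_train and X_test, and y into y_train and y_test. With test_size=0.2, 20% of the samples are held out for testing.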

# split the data with the hold-out method
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Check the array shapes with print().

print("X_train: ", X_train.shape)
print("y_train: ", y_train.shape)
print("X_test: ", X_test.shape)
print("y_test: ", y_test.shape)

X_train:  (455, 4)
y_train:  (455,)
X_test:  (114, 4)
y_test:  (114,)

X_train: holds 455 four-dimensional samples.
y_train: holds 455 class labels (one scalar per sample).
X_test: holds 114 four-dimensional samples.
y_test: holds 114 class labels (one scalar per sample).
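Because the two classes are not equally frequent, it can be useful to check that the train and test parts contain them in similar proportions. The sketch below adds `stratify=y` to the same split, which is an optional variation and not what the notebook uses; `benign_ratio` is a hypothetical helper. For brevity it loads all 30 features, which does not affect the split sizes.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Same hold-out split as above, plus stratify=y (an addition, not in the original)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

def benign_ratio(labels):
    """Fraction of samples labelled 1 (benign)."""
    return float(np.mean(labels == 1))

print("all   : {:.3f}".format(benign_ratio(y)))
print("train : {:.3f}".format(benign_ratio(y_train)))
print("test  : {:.3f}".format(benign_ratio(y_test)))
```

With stratification, the three ratios agree up to rounding; without it, they can differ by chance, especially for small test sets.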

3. Creating the model

# Logistic Regression
from sklearn.linear_model import LogisticRegression
clf_lr = LogisticRegression(random_state=0, 
                            solver='lbfgs', 
                            multi_class='auto')

4. Fitting the model to the data

# fit
clf_lr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='auto',
          n_jobs=None, penalty='l2', random_state=0, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)
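To see what the fitted model actually computes, the predicted probability can be reproduced by hand: for binary logistic regression it is the sigmoid of a linear score, p = 1 / (1 + exp(-(w·x + b))). The sketch below refits the model so that it runs on its own. It omits multi_class and raises max_iter to 1000, both of which are assumptions made here to keep the example free of warnings, so the coefficients may differ slightly from those of the cell above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

bc = load_breast_cancer()
cols = ['mean concave points', 'symmetry error', 'texture error', 'worst radius']
idx = [list(bc.feature_names).index(c) for c in cols]
X, y = bc.data[:, idx], bc.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# max_iter is raised (assumption) to give lbfgs room to converge on unscaled data
clf = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=0)
clf.fit(X_train, y_train)

print("coef_      :", clf.coef_)       # weights w, shape (1, 4)
print("intercept_ :", clf.intercept_)  # bias b, shape (1,)

# Linear score z = w.x + b, then the sigmoid gives P(class = 1 | x)
z = X_test @ clf.coef_.ravel() + clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))
p_sklearn = clf.predict_proba(X_test)[:, 1]
print("probabilities match:", np.allclose(p_manual, p_sklearn))

# predict() returns class 1 when the probability exceeds 0.5 (i.e. z > 0)
y_manual = (p_manual > 0.5).astype(int)
print("labels match       :", np.array_equal(y_manual, clf.predict(X_test)))
```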

5. Evaluating the model

# predictions
y_train_pred = clf_lr.predict(X_train)
y_test_pred = clf_lr.predict(X_test)

# Accuracy
from sklearn.metrics import accuracy_score

print('Accuracy (train)  : {:>.4f}'.format(accuracy_score(y_train, y_train_pred)))
print('Accuracy (test)   : {:>.4f}'.format(accuracy_score(y_test, y_test_pred)))

Accuracy (train)  : 0.9253
Accuracy (test)   : 0.9386

Let's draw the confusion matrix.

# Confusion matrix
from sklearn.metrics import confusion_matrix

cmat_train = confusion_matrix(y_train, y_train_pred)
cmat_test = confusion_matrix(y_test, y_test_pred)

def print_confusion_matrix(cmat, class_names, plt_title='Confusion matrix: ', cmap='BuGn', figsize=(6.25, 5), fontsize=10):
    # Rows are true labels and columns are predicted labels, as returned by confusion_matrix().
    # The parameter is named cmat so that it does not shadow the imported confusion_matrix function.
    df_cm = pd.DataFrame(cmat, index=class_names, columns=class_names)
    plt.figure(figsize=figsize)
    # Annotate each cell with its integer count
    heatmap = sns.heatmap(df_cm, annot=True, fmt="d", cmap=cmap)
    heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=fontsize)
    heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=fontsize)
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
    plt.title(plt_title, fontsize=fontsize*1.25)
    plt.show()

print_confusion_matrix(cmat_train, 
                       bc.target_names, 
                       plt_title='Confusion matrix (train, 455 samples)')

print_confusion_matrix(cmat_test, 
                       bc.target_names, 
                       plt_title='Confusion matrix (test, 114 samples)')
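The numbers in a binary confusion matrix are enough to derive accuracy, precision, and recall. The sketch below uses a small set of hypothetical labels (not the notebook's predictions) so that the arithmetic is easy to follow. Note that scikit-learn treats label 1 (benign here) as the positive class by default.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical toy labels for illustration (0: malignant, 1: benign)
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 0, 1])

# Rows are true labels, columns are predicted labels: [[TN, FP], [FN, TP]]
cmat = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cmat.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the samples predicted as 1, how many are truly 1
recall = tp / (tp + fn)      # of the samples that are truly 1, how many were found

print("accuracy  :", accuracy)
print("precision :", precision, "(sklearn:", precision_score(y_true, y_pred), ")")
print("recall    :", recall, "(sklearn:", recall_score(y_true, y_pred), ")")

# Recall for the malignant class (label 0) uses the first row instead
recall_malignant = tn / (tn + fp)
print("recall (malignant):", recall_malignant)
```

In a screening setting the recall for the malignant class is often the quantity of most interest, since it measures how many malignant tumors are caught; `recall_score(y_true, y_pred, pos_label=0)` returns the same value.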

First five rows of the selected features:

   mean concave points  symmetry error  texture error  worst radius
0              0.14710         0.03003         0.9053         25.38
1              0.07017         0.01389         0.7339         24.99
2              0.12790         0.02250         0.7869         23.57
3              0.10520         0.05963         1.1560         14.91
4              0.10430         0.01756         0.7813         22.54

Output of df.describe().T:

                     count       mean       std       min       25%       50%       75%       max
mean concave points  569.0   0.048919  0.038803  0.000000   0.02031   0.03350   0.07400   0.20120
symmetry error       569.0   0.020542  0.008266  0.007882   0.01516   0.01873   0.02348   0.07895
texture error        569.0   1.216853  0.551648  0.360200   0.83390   1.10800   1.47400   4.88500
worst radius         569.0  16.269190  4.833242  7.930000  13.01000  14.97000  18.79000  36.04000
class                569.0   0.627417  0.483918  0.000000   0.00000   1.00000   1.00000   1.00000

Output of df.corr():

                     mean concave points  symmetry error  texture error  worst radius      class
mean concave points             1.000000        0.095351       0.021480      0.830318  -0.776614
symmetry error                  0.095351        1.000000       0.411621     -0.128121   0.006522
texture error                   0.021480        0.411621       1.000000     -0.111690   0.008303
worst radius                    0.830318       -0.128121      -0.111690      1.000000  -0.776454
class                          -0.776614        0.006522       0.008303     -0.776454   1.000000