機械学習入門：線形回帰をPythonで実装する

機械学習は、データからパターンを学習し、予測や分類を行う技術です。この記事では、機械学習の最も基本的なアルゴリズムである線形回帰を、NumPyを使って一から実装します。数学的な背景を理解しながら、実際に動作するコードを作成することで、機械学習の本質を学びます。

scikit-learnのようなライブラリを使う前に、アルゴリズムを自分で実装することで、より深い理解が得られます。

1. 機械学習とは

機械学習は、明示的にプログラムされなくても、データから学習して改善するシステムを構築する技術です。

1.1 機械学習の種類

教師あり学習: 正解データ（ラベル）から学習
教師なし学習: ラベルなしデータからパターンを発見
強化学習: 環境との相互作用から学習

1.2 線形回帰とは

線形回帰は、連続値を予測する教師あり学習アルゴリズムです。特徴量（入力）とターゲット（出力）の間の線形関係を学習します。

1.3 環境の準備

# 必要なライブラリのインストール
pip install numpy matplotlib scikit-learn pandas

# Jupyter Notebookのインストール（推奨）
pip install jupyter
jupyter notebook

2. 線形回帰の数学的基礎

2.1 単回帰の数式

単回帰（1つの特徴量）の場合、以下の式で表されます：

y = w * x + b

y: 予測値（ターゲット）
x: 特徴量（入力）
w: 重み（傾き）
b: バイアス（切片）

2.2 目的関数（コスト関数）

線形回帰の目的は、予測値と実際の値の差の二乗和（平均二乗誤差：MSE）を最小化することです。

MSE = (1/n) * Σ(y_pred - y_true)²

2.3 最急降下法（Gradient Descent）

最急降下法は、コスト関数を最小化するための最適化アルゴリズムです。

w = w - α * (∂MSE/∂w)
b = b - α * (∂MSE/∂b)

α: 学習率（learning rate）

3. NumPyを使った線形回帰の実装

3.1 基本的な線形回帰クラス

import numpy as np
import matplotlib.pyplot as plt

class LinearRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        """
        線形回帰クラスの初期化
        
        Parameters:
        learning_rate: 学習率
        n_iterations: イテレーション回数
        """
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.cost_history = []
    
    def fit(self, X, y):
        """
        モデルの学習
        
        Parameters:
        X: 特徴量（2次元配列）
        y: ターゲット（1次元配列）
        """
        # データの形状を取得
        n_samples, n_features = X.shape
        
        # 重みとバイアスの初期化
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        # 最急降下法による学習
        for i in range(self.n_iterations):
            # 予測値の計算
            y_pred = np.dot(X, self.weights) + self.bias
            
            # コスト（MSE）の計算
            cost = (1 / n_samples) * np.sum((y_pred - y) ** 2)
            self.cost_history.append(cost)
            
            # 勾配の計算
            dw = (1 / n_samples) * np.dot(X.T, (y_pred - y))
            db = (1 / n_samples) * np.sum(y_pred - y)
            
            # 重みとバイアスの更新
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db
            
            # 進捗の表示（100回ごと）
            if i % 100 == 0:
                print(f"Iteration {i}, Cost: {cost:.4f}")
    
    def predict(self, X):
        """
        予測の実行
        
        Parameters:
        X: 特徴量（2次元配列）
        
        Returns:
        予測値（1次元配列）
        """
        return np.dot(X, self.weights) + self.bias
    
    def score(self, X, y):
        """
        決定係数（R²）の計算
        
        Parameters:
        X: 特徴量
        y: 実際の値
        
        Returns:
        R²スコア
        """
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        r2 = 1 - (ss_res / ss_tot)
        return r2

3.2 単回帰の例

# サンプルデータの生成
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2.5 * X.flatten() + 1.5 + np.random.randn(100) * 2

# モデルの学習
model = LinearRegression(learning_rate=0.01, n_iterations=1000)
model.fit(X, y)

# 予測
y_pred = model.predict(X)

# 結果の可視化
plt.figure(figsize=(12, 5))

# データと回帰直線
plt.subplot(1, 2, 1)
plt.scatter(X, y, alpha=0.6, label='Actual')
plt.plot(X, y_pred, color='red', linewidth=2, label='Predicted')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression')
plt.legend()

# コストの推移
plt.subplot(1, 2, 2)
plt.plot(model.cost_history)
plt.xlabel('Iteration')
plt.ylabel('Cost (MSE)')
plt.title('Cost History')

plt.tight_layout()
plt.show()

# モデルのパラメータ
print(f"Weight: {model.weights[0]:.4f}")
print(f"Bias: {model.bias:.4f}")
print(f"R² Score: {model.score(X, y):.4f}")

4. 重回帰の実装

複数の特徴量を使う場合の実装も、基本的には同じです。

# 重回帰用のサンプルデータ生成
np.random.seed(42)
n_samples = 200
n_features = 3

X = np.random.randn(n_samples, n_features)
true_weights = np.array([2.5, -1.2, 0.8])
true_bias = 1.5

y = np.dot(X, true_weights) + true_bias + np.random.randn(n_samples) * 0.5

# モデルの学習
model = LinearRegression(learning_rate=0.01, n_iterations=1000)
model.fit(X, y)

# 結果の比較
print("True weights:", true_weights)
print("Learned weights:", model.weights)
print("True bias:", true_bias)
print("Learned bias:", model.bias)
print(f"R² Score: {model.score(X, y):.4f}")

5. 正規化（Normalization）

特徴量のスケールが異なる場合、正規化することで学習が安定します。

class LinearRegressionWithNormalization(LinearRegression):
    def __init__(self, learning_rate=0.01, n_iterations=1000, normalize=True):
        super().__init__(learning_rate, n_iterations)
        self.normalize = normalize
        self.mean = None
        self.std = None
    
    def normalize_features(self, X):
        """特徴量の正規化（標準化）"""
        if self.normalize:
            if self.mean is None:
                self.mean = np.mean(X, axis=0)
                self.std = np.std(X, axis=0)
                # 標準偏差が0の場合は1に設定
                self.std[self.std == 0] = 1
            return (X - self.mean) / self.std
        return X
    
    def fit(self, X, y):
        X = self.normalize_features(X)
        super().fit(X, y)
    
    def predict(self, X):
        X = self.normalize_features(X)
        return super().predict(X)

6. 実践的な例：住宅価格の予測

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# データセットの生成
X, y = make_regression(
    n_samples=1000,
    n_features=5,
    n_informative=5,
    noise=10,
    random_state=42
)

# 訓練データとテストデータに分割
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# モデルの学習
model = LinearRegressionWithNormalization(
    learning_rate=0.01,
    n_iterations=2000,
    normalize=True
)
model.fit(X_train, y_train)

# 予測
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# 評価
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)

print(f"Training MSE: {train_mse:.4f}")
print(f"Test MSE: {test_mse:.4f}")
print(f"Training R²: {train_r2:.4f}")
print(f"Test R²: {test_r2:.4f}")

# 予測結果の可視化
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.scatter(y_train, y_train_pred, alpha=0.6)
plt.plot([y_train.min(), y_train.max()], 
         [y_train.min(), y_train.max()], 'r--', lw=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Training Set')

plt.subplot(1, 2, 2)
plt.scatter(y_test, y_test_pred, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], 
         [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Test Set')

plt.tight_layout()
plt.show()

7. 正則化（L1/L2正則化）

過学習を防ぐために、正則化項を追加します。

7.1 Ridge回帰（L2正則化）

class RidgeRegression(LinearRegression):
    def __init__(self, learning_rate=0.01, n_iterations=1000, alpha=1.0):
        """
        Ridge回帰（L2正則化）
        
        Parameters:
        alpha: 正則化の強さ
        """
        super().__init__(learning_rate, n_iterations)
        self.alpha = alpha
    
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        for i in range(self.n_iterations):
            y_pred = np.dot(X, self.weights) + self.bias
            
            # L2正則化項を含むコスト
            cost = (1 / n_samples) * np.sum((y_pred - y) ** 2) + \
                   self.alpha * np.sum(self.weights ** 2)
            self.cost_history.append(cost)
            
            # L2正則化項を含む勾配
            dw = (1 / n_samples) * np.dot(X.T, (y_pred - y)) + \
                 2 * self.alpha * self.weights
            db = (1 / n_samples) * np.sum(y_pred - y)
            
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

7.2 Lasso回帰（L1正則化）

class LassoRegression(LinearRegression):
    def __init__(self, learning_rate=0.01, n_iterations=1000, alpha=1.0):
        """
        Lasso回帰（L1正則化）
        
        Parameters:
        alpha: 正則化の強さ
        """
        super().__init__(learning_rate, n_iterations)
        self.alpha = alpha
    
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        for i in range(self.n_iterations):
            y_pred = np.dot(X, self.weights) + self.bias
            
            # L1正則化項を含むコスト
            cost = (1 / n_samples) * np.sum((y_pred - y) ** 2) + \
                   self.alpha * np.sum(np.abs(self.weights))
            self.cost_history.append(cost)
            
            # L1正則化項を含む勾配（符号関数を使用）
            dw = (1 / n_samples) * np.dot(X.T, (y_pred - y)) + \
                 self.alpha * np.sign(self.weights)
            db = (1 / n_samples) * np.sum(y_pred - y)
            
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

8. 学習曲線とバリデーション

def plot_learning_curves(model, X_train, y_train, X_val, y_val):
    """学習曲線のプロット"""
    train_errors = []
    val_errors = []
    
    for m in range(10, len(X_train), 20):
        # 部分データで学習
        model.fit(X_train[:m], y_train[:m])
        train_pred = model.predict(X_train[:m])
        val_pred = model.predict(X_val)
        
        train_errors.append(mean_squared_error(y_train[:m], train_pred))
        val_errors.append(mean_squared_error(y_val, val_pred))
    
    plt.plot(range(10, len(X_train), 20), train_errors, label='Train')
    plt.plot(range(10, len(X_train), 20), val_errors, label='Validation')
    plt.xlabel('Training Set Size')
    plt.ylabel('MSE')
    plt.title('Learning Curves')
    plt.legend()
    plt.show()

# データの分割
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

# 学習曲線のプロット
model = LinearRegression(learning_rate=0.01, n_iterations=1000)
plot_learning_curves(model, X_train, y_train, X_val, y_val)

9. 実データでの実践

import pandas as pd

# 実データの読み込み（例：Boston Housing Dataset）
from sklearn.datasets import load_boston

boston = load_boston()
X = boston.data
y = boston.target
feature_names = boston.feature_names

# データフレームの作成
df = pd.DataFrame(X, columns=feature_names)
df['PRICE'] = y

print(df.head())
print(df.describe())

# データの分割
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# モデルの学習
model = LinearRegressionWithNormalization(
    learning_rate=0.01,
    n_iterations=2000,
    normalize=True
)
model.fit(X_train, y_train)

# 予測と評価
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Test MSE: {mse:.4f}")
print(f"Test R²: {r2:.4f}")

# 特徴量の重要度（重みの絶対値）
feature_importance = pd.DataFrame({
    'feature': feature_names,
    'importance': np.abs(model.weights)
}).sort_values('importance', ascending=False)

print("\nFeature Importance:")
print(feature_importance)

10. バッチ勾配降下法と確率的勾配降下法

10.1 確率的勾配降下法（SGD）

class SGDLinearRegression(LinearRegression):
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        for i in range(self.n_iterations):
            # ランダムにサンプルを選択
            random_index = np.random.randint(n_samples)
            xi = X[random_index:random_index+1]
            yi = y[random_index:random_index+1]
            
            # 予測と勾配の計算
            y_pred = np.dot(xi, self.weights) + self.bias
            
            dw = np.dot(xi.T, (y_pred - yi))
            db = np.sum(y_pred - yi)
            
            # 更新
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db
            
            # コストの計算（全データで）
            if i % 100 == 0:
                y_pred_full = np.dot(X, self.weights) + self.bias
                cost = (1 / n_samples) * np.sum((y_pred_full - y) ** 2)
                self.cost_history.append(cost)

10.2 ミニバッチ勾配降下法

class MiniBatchLinearRegression(LinearRegression):
    def __init__(self, learning_rate=0.01, n_iterations=1000, batch_size=32):
        super().__init__(learning_rate, n_iterations)
        self.batch_size = batch_size
    
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        for i in range(self.n_iterations):
            # ランダムにバッチを選択
            indices = np.random.choice(n_samples, self.batch_size, replace=False)
            X_batch = X[indices]
            y_batch = y[indices]
            
            # 予測と勾配の計算
            y_pred = np.dot(X_batch, self.weights) + self.bias
            
            dw = (1 / self.batch_size) * np.dot(X_batch.T, (y_pred - y_batch))
            db = (1 / self.batch_size) * np.sum(y_pred - y_batch)
            
            # 更新
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db
            
            # コストの計算
            if i % 100 == 0:
                y_pred_full = np.dot(X, self.weights) + self.bias
                cost = (1 / n_samples) * np.sum((y_pred_full - y) ** 2)
                self.cost_history.append(cost)

11. まとめと次のステップ

この記事を通じて、線形回帰を一から実装し、機械学習の基礎を学びました。

学んだこと

線形回帰の数学的基礎: 目的関数、最急降下法
NumPyによる実装: 基本的な線形回帰クラス
正規化: 特徴量の標準化
正則化: Ridge、Lasso回帰
最適化手法: バッチ、確率的、ミニバッチ勾配降下法
評価指標: MSE、R²スコア
学習曲線: モデルの性能評価

次のステップ

多項式回帰: 非線形関係のモデリング
ロジスティック回帰: 分類問題への応用
他の回帰手法: サポートベクターマシン、決定木
深層学習: ニューラルネットワーク
実践プロジェクト: 実際のデータセットでの予測

機械学習の基礎を理解することで、より複雑なアルゴリズムも理解しやすくなります。継続的な学習と実践を通じて、データサイエンスの世界を探索しましょう！

Happy Learning!

機械学習入門：線形回帰をPythonで実装する

機械学習入門：線形回帰をPythonで実装する

1. 機械学習とは

1.1 機械学習の種類

1.2 線形回帰とは

1.3 環境の準備

2. 線形回帰の数学的基礎

2.1 単回帰の数式

2.2 目的関数（コスト関数）

2.3 最急降下法（Gradient Descent）

3. NumPyを使った線形回帰の実装

3.1 基本的な線形回帰クラス

3.2 単回帰の例

4. 重回帰の実装

5. 正規化（Normalization）

6. 実践的な例：住宅価格の予測

7. 正則化（L1/L2正則化）

7.1 Ridge回帰（L2正則化）

7.2 Lasso回帰（L1正則化）

8. 学習曲線とバリデーション

9. 実データでの実践

10. バッチ勾配降下法と確率的勾配降下法

10.1 確率的勾配降下法（SGD）

10.2 ミニバッチ勾配降下法

11. まとめと次のステップ

学んだこと

次のステップ

コメント