Lesson 16: Training Linear Regression with sklearn (your first code)

1. Lesson objectives

Lesson 15 introduced the formula $ \hat{y} = \mathbf{w}^T \mathbf{x} + b $ and demonstrated it with plain Python lists. This lesson uses sklearn.linear_model.LinearRegression to run the workflow you would follow on the job: load a real dataset, split, fit, read the parameters, predict, and evaluate.

After this lesson, you will be able to:

Train LinearRegression on California housing (8 features, 20,640 samples).
Read and interpret model.coef_ and model.intercept_.
Predict on the test set and on a single new sample, and understand the 2D shape requirement.
Use model.score() to get R² (covered in depth in Lesson 18).
Wrap StandardScaler + LinearRegression in a Pipeline (introduced in Lesson 14).
Tell when to use LinearRegression (OLS) versus SGDRegressor (gradient descent).

This is the "first code" lesson of Module 3: all the ideas were introduced in Lesson 15, and here we only add the sklearn syntax.

2. Import and the California housing dataset

The three core imports used throughout the lesson:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

California housing is sklearn's standard dataset for introductory regression. It has 20,640 samples, one per census block group in California (1990 US Census), 8 numeric features, and target = median house value in the block group (in units of $100,000).

The 8 features:

MedInc: median household income (in units of $10,000).
HouseAge: median house age (years).
AveRooms: average number of rooms per household.
AveBedrms: average number of bedrooms per household.
Population: population of the block group.
AveOccup: average number of people per household.
Latitude: latitude of the block group's center.
Longitude: longitude of the block group's center.

Load it quickly as (X, y):

X, y = fetch_california_housing(return_X_y=True)
print(X.shape, y.shape)
# (20640, 8) (20640,)

Why not Boston housing? Boston housing used to be the go-to sample dataset, but it was deprecated in scikit-learn 1.0 and removed in 1.2 (released December 2022) because it contains an ethically problematic feature (B, derived from the racial composition of the neighborhood). Any older tutorial that still calls load_boston() should switch to fetch_california_housing() or load_diabetes().

3. Train/test split

Covered in Lesson 6. Reusing it here:

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
)
print(X_train.shape, X_test.shape)
# (16512, 8) (4128, 8)

random_state=42 makes the result reproducible: repeated runs produce the same split. It is not a magic number; any fixed int works.

Note: the California housing target is continuous, so class imbalance does not apply and stratify is not needed. Time series or grouped data require a different split strategy, discussed in the advanced lessons.
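As a quick check of the reproducibility claim, here is a minimal sketch on toy arrays (X_demo and y_demo are made-up stand-ins, not the real dataset). It only shows that a fixed random_state yields the same split on every call:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real dataset (assumption, for illustration only).
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Same random_state -> identical split on every call.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)

# A different seed keeps the same sizes but usually picks different rows.
X_tr_other, X_te_other, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr_other.shape, X_te_other.shape)
```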

4. Fit and the 5 core lines

The whole Linear Regression workflow with sklearn fits in 5 lines:

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)   # R²
print(score)
# 0.5757877060324508

This is the default fit / predict / score pattern shared by every sklearn estimator (seen in Lesson 13 with KNeighborsClassifier). Because the interface is the same, swapping LinearRegression() for RandomForestRegressor(), or for the sklearn-compatible XGBRegressor() from the xgboost package, works without restructuring the code.

R² ≈ 0.58 means the model explains about 58% of the variance in house values using only the 8 raw numeric features, with no feature engineering. This is the baseline: Lesson 18 analyzes R² in more detail, and Lesson 21 (Ridge/Lasso) tries to improve on it.
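To illustrate the shared interface, the sketch below fits two different estimators with the same loop. It uses synthetic data from make_regression rather than California housing (an assumption made to keep the example offline and fast), so the scores are not comparable to the 0.58 above:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic linear data (illustrative only).
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

# Same three calls regardless of the estimator class.
scores = {}
for est in (LinearRegression(), RandomForestRegressor(n_estimators=50, random_state=0)):
    est.fit(X_tr, y_tr)
    scores[type(est).__name__] = est.score(X_te, y_te)  # R² on the test split

print(scores)
```

Only the constructor call changes; the fit and score lines stay the same.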

5. Inspecting the model: coef_, intercept_, score

After fit() is called, sklearn attaches attributes whose names end in _ (the convention for "learned from data"):

model.coef_: ndarray of shape (n_features,). Each element is the weight $ w_i $ for feature i.
model.intercept_: a scalar float. This is the bias $ b $.
model.n_features_in_: int, the number of features seen during fit. Calling predict with a different number of columns raises ValueError.
model.feature_names_in_: array of column names, available only when fit was given a DataFrame with string column names.

feature_names = fetch_california_housing().feature_names

print("Intercept:", model.intercept_)
# Intercept: -37.02327770606397

for name, w in zip(feature_names, model.coef_):
    print(f"  {name:12s} = {w:+.6f}")
# MedInc        = +0.448675
# HouseAge      = +0.009724
# AveRooms      = -0.123323
# AveBedrms     = +0.783145
# Population    = -0.000002
# AveOccup      = -0.003526
# Latitude      = -0.419792
# Longitude     = -0.433708

model.score(X, y) is the default scorer; for regression it returns R² (the coefficient of determination). The formula:

\[ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \]

The range is $ (-\infty, 1] $. A value of 1 is a perfect fit, 0 is no better than always predicting the mean, and a negative value is worse than predicting the mean. Details in Lesson 18.
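The formula can be checked by hand. The sketch below, on synthetic data (the weights and noise level are arbitrary assumptions), computes R² from the two sums of squares and compares it with what score() returns:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship (illustrative only).
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))
y_syn = X_syn @ np.array([1.5, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.5, size=200)

reg = LinearRegression().fit(X_syn, y_syn)
y_hat = reg.predict(X_syn)

ss_res = np.sum((y_syn - y_hat) ** 2)         # numerator: residual sum of squares
ss_tot = np.sum((y_syn - y_syn.mean()) ** 2)  # denominator: total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, reg.score(X_syn, y_syn)))

# Always predicting the mean makes the numerator equal the denominator, so R² = 0.
mean_pred = np.full_like(y_syn, y_syn.mean())
r2_mean = 1 - np.sum((y_syn - mean_pred) ** 2) / ss_tot
print(r2_mean)
```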

6. Predicting on the test set and on a single sample

model.predict(X) accepts an ndarray or DataFrame of shape (n_samples, n_features) and returns an array of shape (n_samples,).

y_pred = model.predict(X_test)
print(y_pred.shape)
# (4128,)

print("y_true:", y_test[:5])
print("y_pred:", y_pred[:5])
# y_true: [0.477   0.458   5.00001 2.186   2.78   ]
# y_pred: [0.71912 1.76401 2.70914 2.83878 2.60470]

The target is in units of $100,000, so y_pred = 0.719 means the model predicts a median house value of about $71,900.

Predicting one sample: sklearn always requires 2D input, so wrap the values in one extra level of list/array:

import numpy as np

new_sample = np.array([[8.3, 41, 6.98, 1.02, 322, 2.55, 37.88, -122.23]])
print(new_sample.shape)   # (1, 8)

pred = model.predict(new_sample)
print(pred)
# [4.13554]   ~ $413,554

Passing np.array([8.3, 41, ...]) (1D) raises ValueError; see section 11.
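When a sample already exists as a 1D row, reshape(1, -1) is a common way to turn it into the required 2D shape. A minimal sketch on synthetic data (the 8-column layout only mimics the dataset; the values are random):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Random 8-feature data, standing in for the real features (assumption).
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(50, 8))
y_syn = X_syn.sum(axis=1)
reg = LinearRegression().fit(X_syn, y_syn)

row_1d = X_syn[0]                # shape (8,): one sample as a flat vector
row_2d = row_1d.reshape(1, -1)   # shape (1, 8): one sample, eight features
pred = reg.predict(row_2d)
print(row_1d.shape, row_2d.shape, pred.shape)

# The flat vector is rejected.
try:
    reg.predict(row_1d)
    raised = False
except ValueError:
    raised = True
print(raised)
```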

7. Interpreting coefficients: why you need to standardize

Look back at the coefficients in section 5: AveBedrms = +0.78, Population = -0.000002. Concluding that "AveBedrms is 400,000 times more important than Population" is WRONG.

The reason: a coefficient depends on the unit of its feature. Most AveBedrms values sit close to 1 (a handful of outliers go much higher), whereas Population ranges from 3 to about 35,000. Increasing Population by 1 unit (one person) is tiny relative to its range, so its coefficient is naturally very small.

Rule: only compare coefficient magnitudes across features once the data has been standardized (mean 0, std 1; Lesson 8). After standardization, one unit of every feature is "1 std", so each coefficient directly measures how many target units the prediction changes when that feature rises by 1 std.

An illustration with StandardScaler:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

m_scaled = LinearRegression().fit(X_train_scaled, y_train)
for name, w in zip(feature_names, m_scaled.coef_):
    print(f"  {name:12s} = {w:+.4f}")
# MedInc        = +0.8540
# HouseAge      = +0.1224
# AveRooms      = -0.3050
# AveBedrms     = +0.3711
# Population    = -0.0023
# AveOccup      = -0.0367
# Latitude      = -0.8962
# Longitude     = -0.8689

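Now the comparison is meaningful: MedInc, Latitude, and Longitude have the largest magnitudes, while Population is close to 0 (almost no effect). The negative signs on Latitude and Longitude mean that, with the other features held fixed, the predicted value falls as you move north (higher latitude) or east, away from the coast (longitude closer to 0), and rises toward the south and west. Because Latitude and Longitude are strongly correlated across California, read the two coefficients together rather than one at a time.

Important: standardizing before Linear Regression does not change R² (section 8 verifies this). The coefficients and intercept change, but the predictions are the same. With OLS, standardization affects interpretation, not the quality of the fit.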
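The relationship between the two sets of coefficients can be made explicit: each standardized coefficient is the raw coefficient multiplied by that feature's standard deviation. The sketch below checks this on synthetic features with deliberately mismatched scales (all numbers are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Three features on very different scales (illustrative only).
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(500, 3)) * np.array([1.0, 50.0, 2000.0]) + np.array([0.0, 10.0, 500.0])
y_syn = X_syn @ np.array([0.8, 0.02, -0.0005]) + 1.0 + rng.normal(scale=0.1, size=500)

raw = LinearRegression().fit(X_syn, y_syn)

scaler = StandardScaler().fit(X_syn)
X_std = scaler.transform(X_syn)
std = LinearRegression().fit(X_std, y_syn)

# coef after scaling = raw coef * feature std (scaler.scale_)
print(np.allclose(std.coef_, raw.coef_ * scaler.scale_))
# predictions are unchanged by the rescaling
print(np.allclose(std.predict(X_std), raw.predict(X_syn)))
# with centered features, the intercept is the mean of the target
print(np.isclose(std.intercept_, y_syn.mean()))
```

This is why a feature with a large spread, such as Population, can have a tiny raw coefficient and still matter, or not, once the scale is removed.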

8. Pipeline with StandardScaler

The right way to standardize without leaking data is a Pipeline (Lesson 14). make_pipeline is the short form that names the steps for you, using the lowercased class name:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
# 0.5757877060324508   (same as the unscaled model, up to floating-point rounding)

The pipeline's R² equals that of the original model. The reason: when an intercept is fit, OLS predictions are invariant to any invertible affine transformation of the features. Rescaling does not change the optimal fit; it only redistributes weight between the coefficients and the intercept.

Accessing a step inside the pipeline:

lr_inside = pipe.named_steps["linearregression"]
print(lr_inside.coef_)        # same as m_scaled's coefficients in section 7
print(lr_inside.intercept_)   # ≈ 2.072 (= mean of y_train)

The Pipeline enforces train/test discipline: the scaler is fit on train only, and both train and test are transformed with the mean/std learned from train. There is no need to call scaler.transform(X_test) by hand; pipe.predict(X_test) does it for you.
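One way to convince yourself that the scaler only sees the training rows is to inspect its learned statistics. A minimal sketch on synthetic data (pipe_demo and the data are assumptions, separate from the lesson's pipe):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
X_syn = rng.normal(loc=5.0, scale=3.0, size=(400, 4))
y_syn = X_syn @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.2, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

pipe_demo = make_pipeline(StandardScaler(), LinearRegression()).fit(X_tr, y_tr)
scaler_step = pipe_demo.named_steps["standardscaler"]
lr_step = pipe_demo.named_steps["linearregression"]

# The scaler's mean comes from the training rows only, not from the full data.
print(np.allclose(scaler_step.mean_, X_tr.mean(axis=0)))
print(np.allclose(scaler_step.mean_, X_syn.mean(axis=0)))

# Because the training features are centered, the intercept equals mean(y_train).
print(np.isclose(lr_step.intercept_, y_tr.mean()))
```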

9. LinearRegression parameters

The constructor is LinearRegression(fit_intercept=True, copy_X=True, n_jobs=None, positive=False). These four parameters are configuration switches, not hyperparameters to tune (newer scikit-learn releases may list additional ones; check the documentation for your version):

fit_intercept=True (default): learn $ b $. Set it to False only if you are sure the line must pass through the origin (e.g. the data is already centered, or physically $ y = 0 $ when $ x = 0 $). Usually keep True.
copy_X=True: sklearn copies X before computing. Set it to False to save RAM if you do not need the original X after fit, since X may be overwritten.
n_jobs=None: number of CPU cores to use; -1 uses all of them. According to the scikit-learn documentation, this only gives a speedup for sufficiently large problems: multi-output targets (n_targets > 1) combined with either a sparse X or positive=True.
positive=False: if True, constrains all coef_ >= 0 (solved with Non-Negative Least Squares, NNLS). Useful when domain knowledge says coefficients must be non-negative, e.g. regressing ingredient quantities in a recipe.

Unlike Ridge, Lasso, and SGDRegressor, which have hyperparameters that need a grid search, LinearRegression has nothing to tune. That simplicity is a strength.
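To see what positive=True does in practice, the sketch below fits both variants on synthetic data whose true second weight is negative (the weights are assumptions chosen to trigger the constraint):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: the second true weight is negative on purpose.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 3))
y_syn = X_syn @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=300)

free = LinearRegression().fit(X_syn, y_syn)
nonneg = LinearRegression(positive=True).fit(X_syn, y_syn)

print("unconstrained:", free.coef_)
print("positive=True:", nonneg.coef_)

# The constrained fit cannot beat the unconstrained one on the training data.
print(free.score(X_syn, y_syn), nonneg.score(X_syn, y_syn))
```

The constraint is a modeling decision: it costs fit quality whenever the data disagrees with it, as it does here by construction.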

10. Visualization: prediction vs actual and the residual plot

Two standard plots for diagnosing a regression model. The matplotlib code:

import matplotlib.pyplot as plt

y_pred = model.predict(X_test)

# Plot 1: prediction vs actual
plt.figure(figsize=(6, 6))
plt.scatter(y_test, y_pred, alpha=0.3, s=8)
# the y = x line: ideal predictions
lim = [y_test.min(), y_test.max()]
plt.plot(lim, lim, "r--", linewidth=1)
plt.xlabel("y_test (actual value)")
plt.ylabel("y_pred (model)")
plt.title("Prediction vs Actual")
plt.show()

# Plot 2: residual plot
residual = y_test - y_pred
plt.figure(figsize=(8, 4))
plt.scatter(y_pred, residual, alpha=0.3, s=8)
plt.axhline(0, color="r", linestyle="--", linewidth=1)
plt.xlabel("y_pred")
plt.ylabel("residual = y_test - y_pred")
plt.title("Residual vs Predicted")
plt.show()

Reading plot 1: points hugging the diagonal $ y = x $ indicate a good model. For California housing the points spread widely around the line, consistent with R² ≈ 0.58. Because y_test is on the x-axis, you will also see a vertical strip at the right edge ($ y_{test} \approx 5.0 $): that is the cap on the original target, where median values were clipped at about $500,000.

Reading plot 2: residuals should scatter evenly around 0 with no visible shape. If you see:

A funnel shape (residual spread grows with the prediction) → heteroscedasticity, which violates the assumption from Lesson 15, section 9.
A curve / parabola → a nonlinear relationship that a linear model cannot capture.
Separate clusters → sub-populations that a single model does not represent well.

For California housing the residual plot shows a mild funnel plus a straight band sloping downward, which crosses into negative residuals at high $ y_{pred} $. That band is the clipped target: for every capped sample, residual = 5.0 − y_pred.

11. Common errors when using fit/predict

1. ValueError: Expected 2D array, got 1D array instead

This happens when X is a 1D array or a flat list. The typical case is a single feature.

x_1d = np.array([1, 2, 3, 4, 5])     # shape (5,)
y = np.array([2, 4, 6, 8, 10])

# WRONG
LinearRegression().fit(x_1d, y)
# ValueError: Expected 2D array...

# RIGHT
X_2d = x_1d.reshape(-1, 1)            # shape (5, 1)
LinearRegression().fit(X_2d, y)

Rule: X must always have shape (n_samples, n_features), even when n_features = 1. This applies to both fit and predict.

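2. Passing a 2D y instead of 1D

y_2d = y.reshape(-1, 1)
LinearRegression().fit(X_2d, y_2d)
# No error and no warning: LinearRegression treats this as a multi-output problem.
# coef_ becomes shape (1, n_features) and predict() returns shape (n_samples, 1).

LinearRegression accepts a column-vector y silently, but the extra dimension propagates into coef_, intercept_, and the predictions, which can break downstream code such as y_test - y_pred. Estimators that require a 1D target (e.g. SGDRegressor) instead emit DataConversionWarning: A column-vector y was passed when a 1d array was expected. Pass y as 1D: use y.ravel() if it has shape (n, 1).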
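The shape difference is easy to miss because nothing fails at fit time. A minimal sketch of how the shape of y carries through LinearRegression, using the same five-point toy data (m1, m2, and diff are illustrative names):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_2d = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y_1d = np.array([2, 4, 6, 8, 10])
y_col = y_1d.reshape(-1, 1)          # shape (5, 1)

m1 = LinearRegression().fit(X_2d, y_1d)
m2 = LinearRegression().fit(X_2d, y_col)

print(m1.coef_.shape, m1.predict(X_2d).shape)   # (1,) (5,)
print(m2.coef_.shape, m2.predict(X_2d).shape)   # (1, 1) (5, 1)

# The trap: subtracting a (5, 1) prediction from a (5,) target broadcasts
# to a (5, 5) matrix instead of giving 5 residuals.
diff = y_1d - m2.predict(X_2d)
print(diff.shape)                               # (5, 5)

# Flattening y before fit (or the prediction after) restores the 1D behavior.
residual = y_1d - m2.predict(X_2d).ravel()
print(residual.shape)                           # (5,)
```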

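3. Column count at predict time differs from training

model.fit(X_train, y_train)             # X_train shape (n, 8)
model.predict(np.zeros((1, 7)))          # one column missing
# ValueError: X has 7 features, but LinearRegression is expecting 8 features as input.

Always check X_test.shape[1] == X_train.shape[1] (or compare against model.n_features_in_) before predicting. A Pipeline does not repair a missing column: its first step raises the same kind of error. What it does guarantee is that every preprocessing step applied at fit time is replayed, in the same order, at predict time.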
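One possible guard is a small wrapper that validates the shape against n_features_in_ before calling predict. The helper safe_predict below is hypothetical (not part of sklearn), and the data is random:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_fit = rng.normal(size=(30, 8))
reg = LinearRegression().fit(X_fit, X_fit.sum(axis=1))


def safe_predict(model, X_new):
    """Hypothetical helper: check the shape first, then delegate to predict."""
    X_new = np.asarray(X_new)
    if X_new.ndim != 2 or X_new.shape[1] != model.n_features_in_:
        raise ValueError(
            f"expected shape (n_samples, {model.n_features_in_}), got {X_new.shape}"
        )
    return model.predict(X_new)


print(safe_predict(reg, np.zeros((1, 8))).shape)

try:
    safe_predict(reg, np.zeros((1, 7)))   # one column short
    mismatch_caught = False
except ValueError as err:
    mismatch_caught = True
    print(err)
```

sklearn already raises on its own; the wrapper only makes the message specific to your code path.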

4. NaN in the input

LinearRegression does not handle NaN; it raises a ValueError saying the input contains NaN. Impute first (e.g. SimpleImputer) or drop the affected rows in the DataFrame.
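A minimal sketch of the imputation route, assuming median imputation is acceptable for your data (the strategy and the synthetic data are illustrative choices, not a recommendation):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic data with NaN injected into every 10th row of column 1.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(100, 3))
y_syn = X_syn @ np.array([1.0, 2.0, -1.0])
X_nan = X_syn.copy()
X_nan[::10, 1] = np.nan

# Plain LinearRegression rejects the NaN values.
try:
    LinearRegression().fit(X_nan, y_syn)
    nan_rejected = False
except ValueError:
    nan_rejected = True
print(nan_rejected)

# Putting the imputer in a pipeline fills NaN at both fit and predict time.
imputed = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
imputed.fit(X_nan, y_syn)
preds = imputed.predict(X_nan[:5])
print(np.isfinite(preds).all())
```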

12. LinearRegression (OLS) vs SGDRegressor

sklearn has two estimators that solve the same Linear Regression problem in different ways:

LinearRegression: closed-form OLS (SVD-based least squares). Gives the exact solution with nothing to tune. Cost is roughly $ O(n \cdot d^2 + d^3) $, and the full X must fit in RAM. A good fit when d is below a few thousand and n fits in memory.
SGDRegressor(loss="squared_error"): Stochastic Gradient Descent. It updates the weights one sample at a time, and with partial_fit it can consume the data chunk by chunk, so the full X never has to be in RAM at once. You need to tune learning_rate, eta0, and max_iter, and the default settings add a small L2 penalty (alpha=0.0001), so the result is not exactly OLS. Scaling the features first is REQUIRED, not optional. A good fit for very large n (roughly above 10⁶) or streaming data.

A quick comparison on California housing:

from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
X_tr_s = scaler.transform(X_train)
X_te_s = scaler.transform(X_test)

sgd = SGDRegressor(max_iter=1000, tol=1e-4, random_state=42)
sgd.fit(X_tr_s, y_train)
print("SGD R²:", sgd.score(X_te_s, y_test))
# SGD R²: ~0.5755   (close to OLS; slightly off because the solver is iterative)

Practical rules:

Start with LinearRegression: simple, nothing to tune.
When fit uses too much RAM or is too slow → switch to SGDRegressor or Ridge(solver="sag"/"saga").
When you need regularization → Ridge / Lasso (Lesson 21), not bare LinearRegression.
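The streaming use case can be sketched with partial_fit, which updates the model from one chunk at a time. The example below slices an in-memory synthetic array to imitate chunks arriving from disk; the chunk size of 500, the 5 passes, and the data itself are assumptions, and the features are generated already standardized so no scaler is needed:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic, already-standardized features (illustrative only).
rng = np.random.default_rng(0)
n, d = 5000, 3
X_syn = rng.normal(size=(n, d))
y_syn = X_syn @ np.array([2.0, -1.0, 0.5]) + 1.0 + rng.normal(scale=0.1, size=n)

sgd_stream = SGDRegressor(random_state=42)

# Each partial_fit call sees only one chunk; in a real pipeline the chunk
# would be read from a file or a stream instead of sliced from an array.
for epoch in range(5):
    for start in range(0, n, 500):
        chunk = slice(start, start + 500)
        sgd_stream.partial_fit(X_syn[chunk], y_syn[chunk])

r2_stream = sgd_stream.score(X_syn, y_syn)
print(sgd_stream.coef_, sgd_stream.intercept_)
print(r2_stream)
```

With real data the scaler has to be fit incrementally too; StandardScaler also offers partial_fit for that purpose.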

13. Complete code, end to end

The whole workflow in one runnable script:

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1) Load
data = fetch_california_housing()
X, y = data.data, data.target
feature_names = data.feature_names
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")

# 2) Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3) Train plain LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

# 4) Inspect
print(f"\nIntercept: {model.intercept_:+.4f}")
print("Coefficients (unscaled; do not compare magnitudes across features):")
for name, w in zip(feature_names, model.coef_):
    print(f"  {name:12s} = {w:+.6f}")

# 5) Predict, then show the first 5 test samples
y_pred = model.predict(X_test)
print("\nFirst 5 predictions on the test set (units of $100k):")
for i in range(5):
    print(f"  y_true={y_test[i]:.3f}  y_pred={y_pred[i]:.3f}")

# 6) Score R²
print(f"\nR² on test: {model.score(X_test, y_test):.4f}")

# 7) Bonus: pipeline with StandardScaler so coefficients are comparable
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)
print(f"\nR² pipeline (StandardScaler + LR): {pipe.score(X_test, y_test):.4f}")

lr_scaled = pipe.named_steps["linearregression"]
print("Coefficients (standardized; comparable):")
for name, w in zip(feature_names, lr_scaled.coef_):
    print(f"  {name:12s} = {w:+.4f}")

# 8) Predict one new sample
new = np.array([[8.3, 41, 6.98, 1.02, 322, 2.55, 37.88, -122.23]])
new_pred = pipe.predict(new)[0]
print(f"\nPrediction for the new sample: {new_pred:.4f}  (~${new_pred * 100_000:,.0f})")

Abridged output:

Dataset: 20640 samples, 8 features

Intercept: -37.0233
Coefficients (unscaled; do not compare magnitudes across features):
  MedInc       = +0.448675
  HouseAge     = +0.009724
  AveRooms     = -0.123323
  AveBedrms    = +0.783145
  Population   = -0.000002
  AveOccup     = -0.003526
  Latitude     = -0.419792
  Longitude    = -0.433708

First 5 predictions on the test set (units of $100k):
  y_true=0.477  y_pred=0.719
  y_true=0.458  y_pred=1.764
  y_true=5.000  y_pred=2.709
  y_true=2.186  y_pred=2.839
  y_true=2.780  y_pred=2.605

R² on test: 0.5758

R² pipeline (StandardScaler + LR): 0.5758
Coefficients (standardized; comparable):
  MedInc       = +0.8540
  HouseAge     = +0.1224
  AveRooms     = -0.3050
  AveBedrms    = +0.3711
  Population   = -0.0023
  AveOccup     = -0.0367
  Latitude     = -0.8962
  Longitude    = -0.8689

Prediction for the new sample: 4.1355  (~$413,555)

Observation: R² is the same before and after scaling, as sections 7-8 stated. The standardized coefficients are easier to read: MedInc, Latitude, and Longitude have the largest magnitudes.

14. Practice exercises

Exercise 1: load_diabetes. Replace fetch_california_housing with load_diabetes (10 pre-scaled features, target = diabetes disease progression after one year). Train LinearRegression, then print coef_ and R² on the test set. Compare R² with California housing: which is higher, and why?

Exercise 2: scale or not. On California housing, fit 2 models: (a) plain LinearRegression, (b) Pipeline(StandardScaler, LinearRegression). Compare:

R² on the test set (expected: equal).
Top-3 features by coefficient magnitude (expected: the ranking changes a lot).
Intercept (expected: model (b) has an intercept ≈ the mean of y_train).

Explain why R² stays the same while the coefficient ranking changes.

Exercise 3: predict a new sample. Make up one plausible California housing sample (e.g. a Bay Area home: MedInc=10, HouseAge=20, AveRooms=6, AveBedrms=1.1, Population=500, AveOccup=2.5, Latitude=37.7, Longitude=-122.4). Predict with both models from Exercise 2 and compare the results. Convert from units of $100k to actual dollars.

Exercise 4: positive=True. Refit with LinearRegression(positive=True). Compare R² and coef_ with the default. Which features get "clamped" to 0? Does a non-negativity constraint make sense for a house-price problem?

Exercise 5: OLS vs SGD on diabetes. Using load_diabetes from Exercise 1, fit both LinearRegression and SGDRegressor(max_iter=1000, random_state=42) (remember to scale before SGD). Compare R², fit time (use time.time()), and coef_. On this 442-sample dataset, do the results differ meaningfully?

Hint for Exercise 1: R² for load_diabetes with LinearRegression is around 0.45–0.48, lower than California housing. Likely reasons: the dataset is small (442 samples), the target is noisy, and part of the feature–target relationship is nonlinear. The pre-scaling of the 10 features is not the cause, since rescaling does not change the OLS fit.

15. Next lesson

Lesson 17: MSE and RMSE, evaluation metrics for regression. The model is trained and R² ≈ 0.58. The next lesson digs into the two most common regression metrics, MSE and RMSE: their formulas, units, when to prefer which, and how they relate to the loss function of LinearRegression.
