Lesson 30: Gradient Boosting and XGBoost

1

What is boosting

Lesson 29 introduced Random Forest, a bagging ensemble: many independent trees, trained in parallel on bootstrap samples, with results combined by voting or averaging. This lesson covers the second ensemble family, boosting: trees are trained sequentially, and each new tree tries to correct the errors of the trees before it.

The basic boosting procedure:

Train the first tree on the original dataset.
Find the samples that tree 1 predicts wrongly (or that still have large residuals). Train tree 2 with a focus on those samples.
Add tree 2 to the ensemble. Find the samples the ensemble (tree 1 + tree 2) still gets wrong. Train tree 3 with a focus on those.
Repeat until there are \( M \) trees.

The final prediction is a weighted sum of all trees, not a majority vote as in RF. Each tree is a small correction added to the current ensemble.

This family started with AdaBoost (Freund & Schapire, 1996), which Friedman (2001) generalized into Gradient Boosting. XGBoost, LightGBM, and CatBoost are all variants of that original idea.

2

Bagging vs Boosting

The two ensemble strategies differ both in how training is structured and in which kind of error they reduce.

Bagging (Random Forest)
- Trees are independent of each other.
- Trees can be trained in parallel, so multiple CPU cores are easy to exploit.
- Each tree is usually deep: low bias, high variance. The ensemble reduces variance by averaging.
- Adding trees does not increase overfitting; it only makes predictions more stable.
Boosting (Gradient Boosting)
- Trees are dependent: tree \( m \) needs the output of trees \( 1, \ldots, m-1 \).
- Training is sequential and cannot be parallelized across trees (only across split finding inside each tree).
- Each tree is usually shallow (depth 3–8): high bias, low variance. The ensemble reduces bias by stacking many small corrections.
- Too many trees cause overfitting, so early stopping or a cap on the number of trees is needed.

Practical consequence: RF is easy to tune and safe to run for a long time; on tabular data, a carefully tuned GB often ends up roughly 1–3% more accurate.

3

Gradient Boosting: the generalization

Friedman (2001) formalized the idea of "focusing on the samples that are still wrong": each new tree is fit to the negative gradient of the loss function, evaluated at the current ensemble. These values are called pseudo-residuals (generalized residuals).

Update rule:

\[ F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta \cdot h_m(\mathbf{x}) \]

where:

\( F_{m-1}(\mathbf{x}) \): the ensemble after \( m-1 \) trees.
\( h_m(\mathbf{x}) \): the new tree, fit to the pseudo-residuals \( r_i = -\frac{\partial L(y_i, F_{m-1}(\mathbf{x}_i))}{\partial F_{m-1}(\mathbf{x}_i)} \).
\( \eta \in (0, 1] \): the learning rate, which shrinks each tree's contribution to limit overfitting.

The loss function can be any differentiable function:

Squared error for regression: with \( L = \tfrac{1}{2}(y_i - F(\mathbf{x}_i))^2 \), the pseudo-residual is the ordinary residual \( r_i = y_i - F_{m-1}(\mathbf{x}_i) \).
Log-loss (binary cross-entropy) for binary classification.
Multi-class cross-entropy for multi-class classification.
Quantile loss, Huber loss, ranking losses, and so on, depending on the problem.

The name "Gradient Boosting" comes from the fact that each tree is one step of gradient descent in function space: a step in the direction that decreases the loss fastest.
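The update rule above can be sketched from scratch for the squared-error case. This is a minimal illustration, not how sklearn or XGBoost implement it: the function names `fit_gradient_boosting` and `predict_gradient_boosting`, the synthetic sine data, and the choice of `DecisionTreeRegressor` as the weak learner are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gradient_boosting(X, y, n_trees=100, eta=0.1, max_depth=3):
    """Fit F_M(x) = F_0 + eta * sum_m h_m(x) for the loss L = 0.5 * (y - F)^2."""
    f0 = float(np.mean(y))             # F_0: the constant that minimizes squared error
    pred = np.full(len(y), f0)         # current ensemble output F_{m-1}(x_i)
    trees = []
    for _ in range(n_trees):
        residual = y - pred            # negative gradient of 0.5 * (y - F)^2 w.r.t. F
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)          # h_m approximates the pseudo-residuals
        pred = pred + eta * tree.predict(X)   # F_m = F_{m-1} + eta * h_m
        trees.append(tree)
    return f0, eta, trees


def predict_gradient_boosting(model, X):
    f0, eta, trees = model
    return f0 + eta * sum(tree.predict(X) for tree in trees)


# Hypothetical toy data: a noisy sine curve.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=300)

gb_model = fit_gradient_boosting(X_demo, y_demo, n_trees=100, eta=0.1)
mse_constant = float(np.mean((y_demo - y_demo.mean()) ** 2))
mse_boosted = float(np.mean((y_demo - predict_gradient_boosting(gb_model, X_demo)) ** 2))
print(f"train MSE: constant={mse_constant:.4f}  boosted={mse_boosted:.4f}")
```

For other losses only the `residual` line changes: it becomes the negative gradient of that loss with respect to the current prediction.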

4

GradientBoostingClassifier in sklearn

The reference implementation ships with sklearn, so nothing extra needs to be installed:

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42,
)
gb.fit(X_train, y_train)
print(gb.score(X_test, y_test))

The regression counterpart is analogous: GradientBoostingRegressor.

Characteristics:

Implemented in Python plus Cython, without the heavy optimizations of XGBoost/LightGBM.
Slow on larger datasets (roughly > 10k samples × 100 features): each tree scans the raw feature values of the whole dataset to find its splits.
A good fit for small datasets, or when you want a baseline with no extra dependency.

For production or large datasets, sklearn recommends switching to HistGradientBoosting (section 5).

5

HistGradientBoosting: the fast version

sklearn 0.21 added HistGradientBoostingClassifier / HistGradientBoostingRegressor, inspired by LightGBM and built on histogram-based split finding. The estimators were experimental at first and have been stable since sklearn 1.0.

from sklearn.ensemble import HistGradientBoostingClassifier

hgb = HistGradientBoostingClassifier(
    max_iter=200,
    learning_rate=0.1,
    max_depth=None,
    early_stopping=True,
    random_state=42,
)
hgb.fit(X_train, y_train)

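Differences from GradientBoostingClassifier:

Continuous features are binned into at most 256 buckets before training (with the default max_bins=255, that is 255 bins for values plus one reserved for missing values). Split finding then scans only the bin boundaries instead of every distinct value, which is often 10–100 times faster on large datasets.
Parameter names differ: max_iter replaces n_estimators, and tree size is controlled mainly by max_leaf_nodes (default 31), with max_depth left at None by default.
Missing values are supported natively, so no imputation is required beforehand.
Early stopping is built in.

Performance is close to LightGBM without installing another package. For many real tabular problems, HistGradientBoosting is good enough that there is no need to reach for XGBoost.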

6

XGBoost

XGBoost (eXtreme Gradient Boosting), by Chen & Guestrin (2016), has been the most widely used boosting implementation since the mid-2010s.

pip install xgboost

import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=4,
    random_state=42,
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

Strengths over classic GB:

Speed: an optimized, cache-aware C++ implementation, parallelized at the split-finding level.
Built-in regularization: reg_alpha (L1) and reg_lambda (L2) on leaf weights make it harder to overfit than basic GB.
Missing values: the best split direction for NaN is learned automatically.
GPU training: tree_method="hist", device="cuda" trains on the GPU (XGBoost 2.0 or later).
Built-in cross-validation: xgb.cv() supports early stopping.

Weaknesses:

More hyperparameters than RF: no default combination is right for every dataset, so tuning is required.
Hard to tune: many parameters interact (learning_rate × n_estimators × max_depth × subsample × colsample_bytree).
Prone to overfitting if n_estimators is set too large, so early stopping is needed.

It dominated Kaggle throughout 2015–2020 and is still the default choice in many tabular competitions today.

7

LightGBM

LightGBM, by Ke et al. (2017) at Microsoft, focuses on speed and very large datasets.

import lightgbm as lgb

model = lgb.LGBMClassifier(
    n_estimators=200,
    learning_rate=0.1,
    num_leaves=31,
    random_state=42,
)
model.fit(X_train, y_train)

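Structural differences from XGBoost:

Histogram-based split finding, as in HistGB (XGBoost has since added its own histogram method, tree_method="hist").
Leaf-wise growth instead of level-wise: at each step the leaf whose split reduces the loss the most is split, rather than growing the tree one level at a time. The resulting trees are asymmetric, but accuracy is usually higher for the same number of leaves.
GOSS (Gradient-based One-Side Sampling): rows are sampled according to their gradients. Rows with large gradients (the hard ones) are all kept, and rows with small gradients are subsampled.
Native categorical support: categorical columns can be passed without one-hot encoding.
Usually the fastest of the popular implementations on large datasets (> 100k samples).

Trade-off: leaf-wise growth overfits more easily than level-wise growth, so num_leaves and min_data_in_leaf (min_child_samples in the sklearn API) should be constrained. The default num_leaves=31 is a reasonable starting point for mid-sized datasets.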
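The GOSS idea can be sketched in a few lines of NumPy. This is a simplified illustration of the sampling step as described by Ke et al. (2017), not LightGBM's actual code; the function name `goss_sample` is hypothetical, and the parameter names `top_rate` and `other_rate` are borrowed from LightGBM's parameters of the same name.

```python
import numpy as np


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) of the rows kept by a GOSS-style sampler."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    order = np.argsort(-np.abs(gradients))      # largest |gradient| first
    top_idx = order[:n_top]                     # always keep the hardest rows
    rest_idx = order[n_top:]
    other_idx = rng.choice(rest_idx, size=n_other, replace=False)

    # Up-weight the sampled small-gradient rows by (1 - a) / b so that their
    # total contribution to the split-gain estimate is roughly preserved.
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return np.concatenate([top_idx, other_idx]), weights


# Hypothetical gradients for 1000 rows.
grads = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(grads)
print(f"kept {len(idx)} of {len(grads)} rows")
```

With these rates each tree is built on 30% of the rows, yet every row with a large gradient still takes part in split finding.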

8

CatBoost

CatBoost, by Prokhorenkova et al. (2018) at Yandex, targets datasets with many categorical features.

from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=200,
    learning_rate=0.1,
    depth=6,
    cat_features=[0, 2, 5],  # indices of the categorical columns
    verbose=False,
)
model.fit(X_train, y_train)

Distinguishing features:

Ordered Target Statistics: categorical values are encoded with a target mean computed along a random permutation, so each row only sees the labels of the rows before it. This avoids the target leakage that is common with hand-rolled mean encoding.
Symmetric (oblivious) trees: every node at the same depth uses the same split. The simpler trees give very fast inference and compact models.
Fewer hyperparameters to tune than XGBoost/LightGBM; the defaults work well out of the box.
Built-in support for text features and embeddings.

It often beats XGBoost/LightGBM on datasets with many high-cardinality categorical columns (product codes, city codes, user IDs).
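To make the ordered target statistic concrete, here is a simplified sketch that uses a single random permutation. CatBoost itself uses several permutations and further refinements, so treat this as an illustration of the principle only; the function name `ordered_target_stats`, the `prior` value, and the toy city data are assumptions made for the example.

```python
import numpy as np


def ordered_target_stats(categories, targets, prior=0.5, seed=0):
    """Encode each row using only rows that precede it in a random permutation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(categories))
    sums, counts = {}, {}
    encoded = np.empty(len(categories), dtype=float)
    for i in perm:
        c = categories[i]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        encoded[i] = (s + prior) / (k + 1)    # smoothed mean over earlier rows only
        sums[c] = s + targets[i]              # the row's own label is added afterwards
        counts[c] = k + 1
    return encoded


# Hypothetical toy data: a city column and a binary label.
cats = np.array(["hanoi", "hue", "hanoi", "hanoi", "hue", "danang"])
ys = np.array([1, 0, 1, 0, 1, 1])
enc = ordered_target_stats(cats, ys)
print(enc)
```

The single "danang" row is encoded with the prior alone, because no earlier row of that category exists. A naive mean encoding would assign it its own label (1.0), which is exactly the leakage the ordered scheme avoids.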

9

Summary comparison of the 5 implementations

GradientBoostingClassifier (sklearn): available out of the box, nothing to install. Slow. Use it for baselines on small datasets.
HistGradientBoostingClassifier (sklearn): histogram-based, about as fast as LightGBM. Supports missing values and early stopping. Good enough for most real tabular problems.
XGBoost: the standard implementation on Kaggle. Many knobs to tune, strong regularization, GPU support.
LightGBM: the fastest, with leaf-wise growth and native categorical support. Usually the choice for large datasets (> 100k samples).
CatBoost: the best fit when there are many categorical features; strong defaults and fast inference thanks to oblivious trees.

A pragmatic way to choose:

Small dataset, no extra installs wanted → HistGradientBoosting.
Kaggle competition → XGBoost or LightGBM (often an ensemble of both).
Very large dataset → LightGBM.
Many high-cardinality categorical features → CatBoost.

After careful tuning, the accuracy gap between the 4 modern implementations (HistGB, XGBoost, LightGBM, CatBoost) is usually below 1%, so the choice comes down mainly to ergonomics and speed.

10

Important hyperparameters

Parameter names differ across frameworks, but their meanings group into the following families.

n_estimators / num_boost_round / iterations: the number of trees. Unlike RF, too many trees lead to overfitting. Usually set high (1000–5000) and combined with early stopping.
learning_rate (\( \eta \)): the contribution of each tree. Rule of thumb: a small \( \eta \) with many trees beats a large \( \eta \) with few trees, at the cost of longer training. The common range is 0.01–0.3; 0.1 is a reasonable starting point (it is the default in sklearn and LightGBM, while XGBoost defaults to 0.3).
max_depth: the depth of each tree. Unlike RF (usually left at None), GB prefers shallow trees, 3–8. Deeper trees overfit more easily.
min_child_weight / min_samples_leaf: the minimum size of a leaf. min_samples_leaf (sklearn) is a minimum number of samples; min_child_weight (XGBoost, LightGBM) is a minimum sum of hessians in the leaf, which equals the sample count only for squared-error loss. Increase either one to reduce overfitting.
subsample: the fraction of samples used in each round (0.5–1.0). Values below 1.0 turn the method into Stochastic Gradient Boosting: more randomness, less overfitting.
colsample_bytree: the fraction of features used for each tree (0.5–1.0). Similar in spirit to max_features in RF.
reg_alpha: L1 regularization on leaf weights (pushes some weights to exactly zero).
reg_lambda: L2 regularization on leaf weights (shrinks them smoothly).

A pragmatic tuning strategy:

Fix learning_rate=0.1 and set n_estimators=1000 with early stopping to find a suitable number of trees.
Tune max_depth and min_child_weight with grid search or Optuna.
Tune subsample and colsample_bytree.
Tune reg_alpha and reg_lambda.
Finally, lower learning_rate to 0.01 and raise n_estimators accordingly to squeeze out the last fraction of a percent of accuracy.

11

Early stopping

Because boosting overfits with too many trees, early stopping is standard practice, not an optional extra.

Procedure:

Hold out a separate validation set (or use CV folds).
After each tree is added, evaluate the metric on the validation set.
If the metric has not improved for early_stopping_rounds consecutive rounds (for example 50), stop training.
Use the ensemble at the round with the best metric, not the last round.

XGBoost code (passing early_stopping_rounds to the constructor requires XGBoost 1.6 or later):

import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=2000,
    learning_rate=0.05,
    max_depth=5,
    early_stopping_rounds=50,
    eval_metric="logloss",
    random_state=42,
)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False,
)
print("Best iteration:", model.best_iteration)
print("Best score:    ", model.best_score)

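LightGBM and CatBoost offer equivalent mechanisms under slightly different names (for example the early_stopping_round parameter or the lgb.early_stopping() callback in LightGBM, and early_stopping_rounds in CatBoost's fit()). HistGradientBoosting in sklearn enables it with early_stopping=True, validation_fraction=0.1, n_iter_no_change=10.

Consequence: there is no need to guess n_estimators precisely. Set it high and let early stopping cut training off.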
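The four-step procedure can also be replayed by hand with sklearn alone, which makes the logic explicit. The sketch below assumes the Breast Cancer dataset and a patience of 50 rounds. It trains all 500 trees first and then scans the validation loss with `staged_predict_proba`, so it finds the best round after the fact but does not save training time the way real early stopping does.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

gb_full = GradientBoostingClassifier(
    n_estimators=500, learning_rate=0.1, max_depth=3, random_state=42
)
gb_full.fit(X_tr, y_tr)

patience = 50                      # plays the role of early_stopping_rounds
best_loss, best_iter = np.inf, 0
# staged_predict_proba yields validation predictions after 1, 2, ..., 500 trees.
for i, proba in enumerate(gb_full.staged_predict_proba(X_val), start=1):
    loss = log_loss(y_val, proba)
    if loss < best_loss:
        best_loss, best_iter = loss, i     # remember the best round so far
    elif i - best_iter >= patience:
        break                              # no improvement for `patience` rounds

print(f"best number of trees: {best_iter}  val logloss: {best_loss:.4f}")
```

GradientBoostingClassifier can also stop during training through its own n_iter_no_change and validation_fraction parameters; the manual loop is shown only to expose the mechanics.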

12

Gradient Boosting vs Random Forest

Comparing the two ensemble families covered so far (lessons 29 and 30):

Accuracy: after tuning, GB is typically about 1–3% higher than RF on tabular data. The gap is small on easy datasets and larger on harder, competition-style datasets.
Tuning effort: RF runs well with 2–3 parameters. GB has 5–8 that need careful tuning.
Training speed: RF parallelizes well and is usually faster than classic sklearn GB. HistGB/LightGBM/XGBoost have closed that gap.
Overfitting: RF barely overfits as trees are added. GB does, so early stopping is mandatory.
Robustness: RF is stable and less sensitive to outliers. GB is more sensitive to label noise.
Interpretability: both provide feature importance, and SHAP works well for both.

Practical recommendations:

First baseline on a new dataset: RF. It runs fast and needs no tuning.
Need another 1–3% accuracy: move to GB (HistGB or XGBoost).
Kaggle and other competitions: usually an ensemble of RF + GB + a linear model.

GB has been the dominant family on Kaggle for tabular problems from 2015 through 2020 and beyond.

13

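Feature importance and SHAP

As with RF, most GB implementations expose feature_importances_ after fitting (sklearn's HistGradientBoosting estimators are an exception and do not provide this attribute):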

import pandas as pd

importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))

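What feature_importances_ measures depends on the library. sklearn's GradientBoosting and XGBoost's sklearn wrapper report gain-based importance by default (the loss improvement contributed by splits on that feature), while LightGBM defaults to split counts (importance_type="split"). Gain is more reliable than split count, but it is still biased toward high-cardinality features.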

The current standard for explanations is SHAP (SHapley Additive exPlanations, Lundberg & Lee 2017), which assigns each feature a contribution value for every individual prediction, based on Shapley values from game theory.

import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)

SHAP has a dedicated implementation for trees (TreeExplainer) that is exact and fast, rather than approximate as it is for other model types. This is why GB + SHAP is a popular combination for explainable ML on tabular data. SHAP is covered in detail in Series 5 (Capstone & Job).

14

Use cases and where GB stands for tabular data

Gradient Boosting shows up in most tabular problems with high economic value:

Kaggle tabular competitions: the de facto standard; top-10 solutions almost always use XGBoost/LightGBM/CatBoost.
Banking (credit risk, default prediction): tabular features (income, repayment history, demographics) with a binary label.
Fraud detection: tabular transactions with imbalanced labels. GB + class weights + threshold tuning.
Churn prediction: usage features over time with a binary stay/leave label.
Recommendation ranking: learning-to-rank objectives in LightGBM/XGBoost, such as LambdaRank-style losses (lambdarank in LightGBM, rank:pairwise and rank:ndcg in XGBoost).
Insurance pricing, marketing response, sales forecasting: classic tabular problems.

A frequently asked question: "Will deep learning replace GB for tabular data?"

Short answer: not yet. Grinsztajn et al. (2022), "Why do tree-based models still outperform deep learning on tabular data?", benchmarked 45 tabular datasets and found that tree-based models (XGBoost, Random Forest, and sklearn's gradient boosting) still outperform deep architectures designed for tabular data (such as FT-Transformer and SAINT) in both accuracy and training time. The reasons given: tree-based models cope with heterogeneous features and irregular (non-smooth) target functions, and they are robust to uninformative features, all of which are common in tabular data.

Deep learning still dominates images, text, and audio (data with spatial or sequential structure). Tabular data still belongs to GB.

15

Python code: comparing GB, HistGB, XGBoost, and RF

On Breast Cancer (569 samples, 30 features, binary):

import time
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    HistGradientBoostingClassifier,
)
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "RandomForest":      RandomForestClassifier(n_estimators=200, random_state=42),
    "GradientBoosting":  GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42),
    "HistGradientBoost": HistGradientBoostingClassifier(max_iter=200, learning_rate=0.1, random_state=42),
}

for name, m in models.items():
    t0 = time.time()
    m.fit(X_train, y_train)
    fit_time = time.time() - t0
    acc = accuracy_score(y_test, m.predict(X_test))
    print(f"{name:20s}  acc={acc:.4f}  fit={fit_time:.3f}s")

Adding XGBoost with early stopping (the validation set is split off from the training set):

import xgboost as xgb

# Split a validation set off the training set
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

xgb_model = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=4,
    early_stopping_rounds=30,
    eval_metric="logloss",
    random_state=42,
)
t0 = time.time()
xgb_model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
fit_time = time.time() - t0
acc = accuracy_score(y_test, xgb_model.predict(X_test))
print(f"XGBoost (early stop)  acc={acc:.4f}  fit={fit_time:.3f}s  "
      f"best_iter={xgb_model.best_iteration}")

Typical observations:

The 4 models differ by only about 1–2% in accuracy on Breast Cancer; the dataset is not hard enough to separate them clearly.
HistGradientBoosting and XGBoost often fit several times faster than classic GradientBoosting, although on a dataset this small the gap can shrink or disappear because fixed overhead dominates.
Early stopping typically cuts XGBoost off at around 100–300 iterations (depending on the seed), so it never runs all 1000.

To see clearer differences, repeat the experiment on a larger dataset (Adult Income, Higgs Boson, or a Kaggle dataset).

16

Practice exercises

Exercise 1: The effect of the learning rate. On Breast Cancer, train XGBoost with 3 values of learning_rate: 0.01, 0.1, 0.3. Keep n_estimators=500, max_depth=4, random_state=42 fixed. Measure test accuracy for each configuration. Which learning rate works best? Repeat with n_estimators=2000 plus early stopping, then draw a conclusion about the rule "small learning rate + many trees".

Exercise 2: Compare RF, GB, and XGBoost on the Adult Income dataset. Load it with fetch_openml("adult", version=2) (48k samples, a mix of numeric and categorical features). One-hot encode the categorical columns, split 70/30, and scale (optional for tree-based models). Train the three models with default settings. Compare accuracy, F1, and fit time. Which one performs best on this dataset?

Exercise 3: Implement early stopping with XGBoost. On Breast Cancer, split off an additional validation set (60/20/20 train/val/test). Train XGBoost with n_estimators=2000, learning_rate=0.05, early_stopping_rounds=50. Read model.best_iteration. Plot model.evals_result() against the iteration number and mark the point where validation loss starts to rise (overfitting).

Exercise 4: Too many trees with GB and no early stopping. On Breast Cancer, train GradientBoostingClassifier with n_estimators ∈ {10, 100, 500, 2000}, learning_rate=0.1, max_depth=3. Measure train and test accuracy. At what point does test accuracy start to drop while train accuracy stays high? Compare with the same experiment on RF (lesson 29) and draw a conclusion about how the two differ in overfitting.

Exercise 5: Feature importance from XGBoost. Train XGBoost on Breast Cancer, take model.feature_importances_, sort it, and plot a bar chart of the top 10 features. Compare with the feature importance from RandomForest on the same dataset. Are the top features the same? If they differ, explain why.

17

Next lesson

Lesson 31: Support Vector Machine (SVM), a classifier family that differs from tree-based models: it finds the hyperplane that lies farthest from both classes (maximum margin). It extends to non-linear problems through the kernel trick and remains useful for small datasets and problems with clear geometric structure.
