Lesson 11: Project 1, Customer Churn Prediction (end-to-end classical ML)

1

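Project overview

Domain: customer churn prediction for a telecom or SaaS company.

Goal: predict which customers will leave the service (churn) within the next 30 days, so the retention team can intervene before they go.

Dataset: Telco Customer Churn (IBM Watson / Kaggle), 7,043 rows and 21 columns (19 input features plus customerID and the Churn label). It is free, needs no GPU, and downloads in seconds. The Churn label marks customers who left within the last month, so the 30-day horizon is a framing of that label rather than a separately engineered target.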

Tech stack

Data: pandas 2.x, scikit-learn 1.5+
Model: XGBoost 2.x, category-encoders
Explainability: shap 0.45+
API: FastAPI 0.115+, pydantic v2
Infra: Docker, Render (or Railway)

Suggested timeline (part-time)

Week 1: EDA, preprocessing, feature engineering
Week 2: Model training, tuning, evaluation
Week 3: SHAP, FastAPI, Dockerfile
Week 4: Deploy, README, polish

Skills practiced

Exploratory Data Analysis (EDA) with a clear objective
sklearn Pipeline to avoid data leakage
Model comparison and hyperparameter tuning
SHAP to explain predictions
FastAPI with Pydantic schema validation
Containerizing and deploying to the cloud

2

Why this project works for a portfolio

Churn prediction is a classic ML problem. Recruiters know the domain, so they can judge the quality of the solution without a long background explanation.

Technical advantages

Tabular data: no GPU needed. The dataset is small enough to train on an ordinary laptop in a few minutes.
Public dataset: no data-privacy concerns, so the EDA notebook can be committed to GitHub.
Full pipeline: covers EDA, model, API, and deployment, unlike a standalone notebook with no production output.
Clear business value: lower churn means less money spent acquiring replacement customers. Recruiters see the ROI immediately.

What recruiters check

Whether the pipeline avoids data leakage (scaler fit on the training set only)
Whether the metrics suit imbalanced data (AUC and F1, not accuracy)
Whether the model is explained (SHAP or feature importance)
Whether the deploy URL works
Whether the README has enough technical detail to reproduce the results

3

Step 1: EDA and dataset checks

Save the EDA in notebooks/01-eda.ipynb. The goals are to understand the data distributions, find problems that need handling, and form hypotheses about which features correlate with churn.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("data/raw/telco_churn.csv")
print(df.shape)         # (7043, 21)
print(df.dtypes)
print(df.isnull().sum())

# Class balance
print(df["Churn"].value_counts(normalize=True))
# No     0.7346
# Yes    0.2654

What to check

Missing values: TotalCharges has 11 blank values stored as the string " ", so isnull() reports zero for that column. Convert it to numeric with errors="coerce" to turn the blanks into NaN.
Data types: SeniorCitizen is a 0/1 integer but is really categorical. TotalCharges loads as object instead of float.
Class imbalance: about 26.5% of customers churn. This is not severe, but it calls for suitable metrics (AUC, F1) rather than accuracy.
Leakage candidates: customerID is an identifier and must be dropped before training.
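The TotalCharges check can be reproduced on a tiny frame. This is a minimal sketch with made-up rows (the `toy` frame is not the real dataset); it shows why `isnull()` reports zero for blank strings and how `pd.to_numeric(..., errors="coerce")` turns them into NaN.

```python
import pandas as pd

# Made-up rows that mimic the raw CSV: a blank string instead of a number.
toy = pd.DataFrame({
    "tenure": [0, 5, 40],
    "TotalCharges": [" ", "350.5", "4020.1"],
})

# A blank string is a valid object value, so isnull() does not flag it.
missing_before = int(toy["TotalCharges"].isnull().sum())

# Count blanks explicitly by stripping whitespace first.
blank_count = int((toy["TotalCharges"].str.strip() == "").sum())

# Coerce to float; strings that cannot be parsed become NaN.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
missing_after = int(toy["TotalCharges"].isnull().sum())

print(missing_before, blank_count, missing_after)
```

After the conversion, the median imputer in the preprocessing pipeline can handle the resulting NaN values.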

Hypotheses from EDA

Short tenure (under 12 months) comes with a clearly higher churn rate.
Customers on Month-to-month contracts churn more than those on One year or Two year contracts.
High MonthlyCharges combined with low tenure is a strong signal.
Lacking TechSupport or OnlineSecurity correlates with churn.

# Visualize churn rate by contract type
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Churn rate by contract
churn_by_contract = df.groupby("Contract")["Churn"].apply(
    lambda x: (x == "Yes").mean()
).reset_index()
churn_by_contract.columns = ["Contract", "ChurnRate"]
axes[0].bar(churn_by_contract["Contract"], churn_by_contract["ChurnRate"])
axes[0].set_title("Churn rate by contract type")
axes[0].set_ylabel("Churn rate")

# Tenure distribution
axes[1].hist(df[df["Churn"] == "Yes"]["tenure"], bins=20, alpha=0.6, label="Churn")
axes[1].hist(df[df["Churn"] == "No"]["tenure"], bins=20, alpha=0.6, label="No churn")
axes[1].set_title("Tenure distribution")
axes[1].legend()

plt.tight_layout()
plt.savefig("notebooks/figures/eda_overview.png", dpi=120)

4

Step 2: Data preprocessing pipeline

Use sklearn.pipeline.Pipeline with ColumnTransformer to package all preprocessing. The scaler and encoder are then fit only on the training data passed to fit(), which prevents leakage.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from category_encoders import TargetEncoder

# Fix TotalCharges: blank string → NaN → float
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Target
y = (df["Churn"] == "Yes").astype(int)
X = df.drop(columns=["customerID", "Churn"])

# Split features by type
cat_cols = X.select_dtypes("object").columns.tolist()
num_cols = X.select_dtypes(["int64", "float64"]).columns.tolist()

# Preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), num_cols),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", TargetEncoder()),
        ]), cat_cols),
    ],
    remainder="drop",
)

Note on TargetEncoder: the TargetEncoder from category_encoders needs the target y_train when it is fit. Inside a Pipeline run through cross_val_score, sklearn passes the target of each training fold automatically, so nothing has to be done by hand.

To stay within sklearn (1.3 or later), sklearn.preprocessing.TargetEncoder can replace it: the API is similar and there is no extra package to install.
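To make the fold-wise fitting concrete, here is a simplified target encoder in plain Python. It is a sketch of the idea (category mean shrunk toward the global churn rate), not the exact formula that category_encoders or sklearn use; the function names, the smoothing value, and the rows are all illustrative.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Learn one encoded value per category from TRAINING rows only."""
    prior = sum(targets) / len(targets)  # global churn rate of the training rows
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    # Shrink each category mean toward the prior; rare categories shrink more.
    mapping = {
        cat: (sums[cat] + smoothing * prior) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, prior


def transform_target_encoding(categories, mapping, prior):
    """Apply a fitted mapping; unseen categories fall back to the prior."""
    return [mapping.get(cat, prior) for cat in categories]


# Made-up training fold
train_contract = [
    "Month-to-month", "Month-to-month", "Month-to-month",
    "Two year", "Two year", "One year",
]
train_churn = [1, 1, 0, 0, 0, 0]

mapping, prior = fit_target_encoding(train_contract, train_churn, smoothing=2.0)

# Validation rows are only transformed; their labels never enter the mapping.
encoded_valid = transform_target_encoding(
    ["Month-to-month", "Two year", "Unknown plan"], mapping, prior
)
print(encoded_valid)
```

The key point is the split between fit and transform: the validation fold's labels are never used, which is what the Pipeline guarantees automatically.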

5

Step 3: Feature engineering

On tabular data, feature engineering often improves results more than trying additional models. Here are three features with a clear business meaning:

def add_features(X: pd.DataFrame) -> pd.DataFrame:
    X = X.copy()

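    # tenure_bucket: customer lifecycle stage.
    # include_lowest=True keeps tenure = 0 in "new"; without it those rows become NaN.
    X["tenure_bucket"] = pd.cut(
        X["tenure"],
        bins=[0, 12, 24, 48, 72],
        labels=["new", "developing", "mature", "loyal"],
        include_lowest=True,
    ).astype(object)  # object dtype, so select_dtypes("object") picks the column up

    # charges_per_month: TotalCharges normalized by tenure.
    # The +1 avoids division by zero when tenure = 0.
    X["charges_per_month"] = X["TotalCharges"] / (X["tenure"] + 1)

    # has_multiple_services: despite the name, a count (0 to 2) of phone and internet service.
    # Hypothesis to verify in EDA: customers with more services churn less.
    X["has_multiple_services"] = (
        (X["PhoneService"] == "Yes").astype(int)
        + (X["InternetService"] != "No").astype(int)
    )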

    return X

X = add_features(X)

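Put this function in src/features/engineering.py so that training and inference share it. Lesson 8 of the series covered how to organize modules; apply the same pattern here.

Note on the new columns: the preprocessor from Step 2 selects columns by name and uses remainder="drop", so any engineered column missing from cat_cols or num_cols is silently discarded. Add tenure_bucket to cat_cols and the two numeric features to num_cols before building the preprocessor (or recompute both lists after calling add_features), or use FunctionTransformer to embed this step in the Pipeline.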
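One way to follow the FunctionTransformer suggestion is sketched below. It is an illustration on made-up rows, not the project's training code: `add_features_sketch` is a trimmed copy of add_features, and OneHotEncoder and LogisticRegression stand in for TargetEncoder and XGBoost so that the example needs only pandas and scikit-learn.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler


def add_features_sketch(X: pd.DataFrame) -> pd.DataFrame:
    """Trimmed, illustrative version of add_features()."""
    X = X.copy()
    X["tenure_bucket"] = pd.cut(
        X["tenure"],
        bins=[0, 12, 24, 48, 72],
        labels=["new", "developing", "mature", "loyal"],
        include_lowest=True,
    ).astype(str)
    X["charges_per_month"] = X["TotalCharges"] / (X["tenure"] + 1)
    return X


# Made-up raw rows with only three of the original columns
toy_X = pd.DataFrame({
    "tenure": [0, 3, 10, 20, 30, 50, 60, 72],
    "TotalCharges": [0.0, 210.0, 800.0, 1500.0, 2400.0, 4000.0, 5100.0, 6500.0],
    "Contract": ["Month-to-month"] * 4 + ["One year", "Two year", "Two year", "Two year"],
})
toy_y = [1, 1, 1, 0, 0, 0, 0, 0]

# Column lists include the engineered columns, so nothing is dropped silently.
num_cols_fe = ["tenure", "TotalCharges", "charges_per_month"]
cat_cols_fe = ["Contract", "tenure_bucket"]

pipeline_fe = Pipeline([
    ("features", FunctionTransformer(add_features_sketch)),
    ("preprocess", ColumnTransformer([
        ("num", StandardScaler(), num_cols_fe),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols_fe),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline_fe.fit(toy_X, toy_y)

# At inference time the caller passes raw columns only.
raw_request = pd.DataFrame(
    [{"tenure": 2, "TotalCharges": 150.0, "Contract": "Month-to-month"}]
)
proba = float(pipeline_fe.predict_proba(raw_request)[0, 1])
print(round(proba, 3))
```

Because the feature step lives inside the Pipeline, the API can pass raw request fields straight to predict_proba. When saving such a pipeline with joblib, keep the function in an importable module (for example src/features/engineering.py), since pickling stores only a reference to it.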

6

Step 4: Model training and comparison

Train three models and compare them with 5-fold cross-validation on the training set. Do not touch the test set at this stage; it is used exactly once, in the final evaluation step.

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,     # keep the churn ratio the same in both splits
    random_state=42,
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "Random Forest": RandomForestClassifier(
        n_estimators=200,
        class_weight="balanced",
        random_state=42,
    ),
    "XGBoost": xgb.XGBClassifier(
        n_estimators=200,
        learning_rate=0.05,
        scale_pos_weight=y_train.value_counts()[0] / y_train.value_counts()[1],
        random_state=42,
        eval_metric="logloss",
    ),
}

results = {}
for name, clf in models.items():
    pipeline = Pipeline([("preprocess", preprocessor), ("clf", clf)])
    scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="roc_auc")
    results[name] = {"mean": scores.mean(), "std": scores.std()}
    print(f"{name}: AUC={scores.mean():.4f} (+/- {scores.std():.4f})")

Reference results on the original dataset (no custom feature engineering). Treat them as rough figures; exact values depend on the split and library versions:

Logistic Regression: AUC ~0.84
Random Forest: AUC ~0.86
XGBoost: AUC ~0.87

Note on imbalanced data: use class_weight="balanced" for Logistic Regression and Random Forest. For XGBoost, set scale_pos_weight = negative_count / positive_count. Do not apply SMOTE to the data before cross-validation, because the synthetic samples then leak into the validation folds.
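Both weighting schemes come down to simple ratios. The sketch below assumes the class counts of the full Telco dataset (5,174 No, 1,869 Yes); in the project, compute them from y_train instead.

```python
# Assumed class counts for the full dataset; use y_train counts in practice.
n_neg, n_pos = 5174, 1869
n_samples = n_neg + n_pos
n_classes = 2

# XGBoost: weight applied to the positive (churn) class.
scale_pos_weight = n_neg / n_pos

# sklearn class_weight="balanced": n_samples / (n_classes * class_count)
balanced_weights = {
    0: n_samples / (n_classes * n_neg),
    1: n_samples / (n_classes * n_pos),
}

print(f"scale_pos_weight = {scale_pos_weight:.3f}")
print(f"balanced weights = {balanced_weights}")
```

The two schemes agree on the relative weight of the classes: the ratio of the balanced weights equals scale_pos_weight.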

7

Step 5: Hyperparameter tuning

Use RandomizedSearchCV instead of GridSearchCV: it costs far less compute, and with a large enough n_iter the results are usually comparable.

from sklearn.model_selection import RandomizedSearchCV

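pipeline_xgb = Pipeline([
    ("preprocess", preprocessor),
    ("clf", xgb.XGBClassifier(
        eval_metric="logloss",
        # Keep the imbalance handling used in the comparison step
        scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
        random_state=42,
    )),
])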

param_dist = {
    "clf__n_estimators": [100, 200, 300, 500],
    "clf__learning_rate": [0.01, 0.03, 0.05, 0.1],
    "clf__max_depth": [3, 5, 7, 9],
    "clf__min_child_weight": [1, 3, 5],
    "clf__subsample": [0.8, 0.9, 1.0],
    "clf__colsample_bytree": [0.7, 0.8, 1.0],
}

search = RandomizedSearchCV(
    pipeline_xgb,
    param_dist,
    n_iter=30,
    cv=5,
    scoring="roc_auc",
    n_jobs=-1,
    random_state=42,
    verbose=1,
)
search.fit(X_train, y_train)

print(f"Best AUC (CV): {search.best_score_:.4f}")
print(f"Best params: {search.best_params_}")

best_model = search.best_estimator_

30 iterations times 5 folds is 150 fits, plus one final refit on the full training set. On a typical CPU this takes roughly 3 to 5 minutes.
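The arithmetic behind that estimate, using the param_dist from this step: the sketch counts the full grid, the fits RandomizedSearchCV runs, and the fraction of the grid that gets sampled.

```python
from math import prod

# Same search space as param_dist in the tuning code
param_dist = {
    "clf__n_estimators": [100, 200, 300, 500],
    "clf__learning_rate": [0.01, 0.03, 0.05, 0.1],
    "clf__max_depth": [3, 5, 7, 9],
    "clf__min_child_weight": [1, 3, 5],
    "clf__subsample": [0.8, 0.9, 1.0],
    "clf__colsample_bytree": [0.7, 0.8, 1.0],
}

n_iter, cv = 30, 5

grid_size = prod(len(values) for values in param_dist.values())
cv_fits = n_iter * cv            # fits during the search
total_fits = cv_fits + 1         # plus the final refit (refit=True by default)
grid_search_fits = grid_size * cv
coverage = n_iter / grid_size

print(f"grid size: {grid_size}, sampled: {n_iter} ({coverage:.1%})")
print(f"randomized search fits: {total_fits}, grid search fits: {grid_search_fits}")
```

An exhaustive GridSearchCV over the same space would need thousands of fits, which is the compute saving the step refers to.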

8

Step 6: Evaluation

Run the evaluation on the test set exactly once. Record the results in the README and the model metadata.

from sklearn.metrics import (
    roc_auc_score,
    classification_report,
    confusion_matrix,
    RocCurveDisplay,
    PrecisionRecallDisplay,
)
import matplotlib.pyplot as plt

y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, y_proba)
print(f"Test AUC: {auc:.4f}")
print(classification_report(y_test, y_pred, target_names=["No Churn", "Churn"]))
print(confusion_matrix(y_test, y_pred))

# Plot ROC + PR curve
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
RocCurveDisplay.from_predictions(y_test, y_proba, ax=axes[0], name="XGBoost")
axes[0].set_title(f"ROC Curve (AUC={auc:.4f})")
PrecisionRecallDisplay.from_predictions(y_test, y_proba, ax=axes[1], name="XGBoost")
axes[1].set_title("Precision-Recall Curve")
plt.tight_layout()
plt.savefig("notebooks/figures/model_evaluation.png", dpi=120)

Which metrics to report

AUC-ROC: the primary metric. It measures how well the model separates churn from no-churn across all thresholds.
Precision / Recall / F1 at threshold 0.5: show the trade-off between false positives (sending an unnecessary offer) and false negatives (missing a customer who is about to leave).
Confusion matrix: an intuitive view for recruiters and stakeholders who are not used to these metrics.

In churn problems a false negative usually costs more than a false positive (losing a customer outweighs sending one unneeded offer). If the business prefers higher recall, lower the threshold, for example to 0.4. Note that predict() always cuts at 0.5, so apply a custom threshold to y_proba instead.
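A minimal sketch of the threshold trade-off, using made-up labels and probabilities rather than model output. `precision_recall_at` is a hypothetical helper; in the project you would pass y_test and y_proba from the evaluation code.

```python
def precision_recall_at(y_true, y_proba, threshold):
    """Precision and recall of the positive class at a given threshold."""
    y_pred = [int(p >= threshold) for p in y_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up example: 4 churners, 6 non-churners
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_proba = [0.90, 0.60, 0.45, 0.30, 0.55, 0.42, 0.20, 0.10, 0.05, 0.35]

p_050, r_050 = precision_recall_at(y_true, y_proba, 0.5)
p_040, r_040 = precision_recall_at(y_true, y_proba, 0.4)

print(f"threshold 0.5: precision={p_050:.2f}, recall={r_050:.2f}")
print(f"threshold 0.4: precision={p_040:.2f}, recall={r_040:.2f}")
```

On this toy data, lowering the threshold catches one more churner at the price of one more unnecessary offer, which is the trade-off to discuss with the business.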

9

Step 7: Feature importance and SHAP

SHAP (SHapley Additive exPlanations) explains why the model produced a specific prediction. That matters both for the portfolio and in production, where it helps debug false positives and false negatives.

import matplotlib.pyplot as plt
import shap

# Feature names after the transform.
# ColumnTransformer outputs the numeric block first, then the categorical block,
# and TargetEncoder yields one column per input column, so the order is:
feature_names = num_cols + cat_cols

# Transform the test set with the fitted preprocessor
X_test_transformed = best_model.named_steps["preprocess"].transform(X_test)

# SHAP TreeExplainer for XGBoost
explainer = shap.TreeExplainer(best_model.named_steps["clf"])
shap_values = explainer.shap_values(X_test_transformed)

# Summary plot: top features by |SHAP value|
shap.summary_plot(
    shap_values,
    X_test_transformed,
    feature_names=feature_names,
    show=False,
)
plt.savefig("notebooks/figures/shap_summary.png", dpi=120, bbox_inches="tight")
plt.close()

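Typical findings on the Telco dataset:

tenure: negative SHAP when tenure is high (long-standing customers churn less)
Contract: strongly positive SHAP for Month-to-month customers
MonthlyCharges: positive SHAP when charges are high
TechSupport: positive SHAP for customers without tech support (no support, more churn)

Because TargetEncoder keeps one column per categorical feature, the plot shows Contract and TechSupport as single features, not one-hot columns such as Contract_Month-to-month. A high encoded value corresponds to a category with a high churn rate in the training data.

Include shap_summary.png in the README. Recruiters often ask about this part in interviews: "Why does the model predict that customer X will churn?"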
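To answer that interview question for a single customer, one option is to rank the features of one row by absolute SHAP value. The sketch below uses a small made-up array in place of the real shap_values (the numbers are illustrative only, not results from the model) and a hypothetical helper, `top_contributors`.

```python
import numpy as np

feature_names = ["tenure", "Contract", "MonthlyCharges", "TechSupport"]

# Made-up SHAP values, shape (n_customers, n_features).
# Positive pushes the prediction toward churn, negative away from it.
shap_values = np.array([
    [0.80, 0.60, 0.30, 0.20],
    [-1.10, -0.70, 0.10, -0.20],
    [0.20, 0.50, -0.30, 0.10],
])

# Global view: mean absolute SHAP value per feature (what the summary plot ranks by).
global_importance = np.abs(shap_values).mean(axis=0)
global_ranking = [feature_names[i] for i in np.argsort(global_importance)[::-1]]


def top_contributors(row, names, k=2):
    """Return the k features with the largest absolute contribution for one row."""
    order = np.argsort(np.abs(row))[::-1][:k]
    return [(names[i], float(row[i])) for i in order]


customer_0 = top_contributors(shap_values[0], feature_names)
customer_1 = top_contributors(shap_values[1], feature_names)

print("global ranking:", global_ranking)
print("customer 0:", customer_0)
print("customer 1:", customer_1)
```

With the real model, pass one row of explainer.shap_values(X_test_transformed) and the feature name list built in this step.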

10

Step 8: Save the model and metadata

Save the entire Pipeline (preprocessor plus model) to a single file. Do not save the preprocessor and the model separately; that invites mismatches when they are loaded again.

import json
from datetime import date

import joblib
import sklearn
import xgboost

# Save the full pipeline
joblib.dump(best_model, "models/churn_model_v1.pkl")

# Metadata for traceability
metadata = {
    "version": "1.0.0",
    "trained_at": str(date.today()),
    "auc_test": round(float(auc), 4),
    "features": list(X.columns),
    "training_size": int(len(X_train)),
    "test_size": int(len(X_test)),
    "best_params": search.best_params_,
    # Record the versions actually installed instead of hardcoding them
    "sklearn_version": sklearn.__version__,
    "xgboost_version": xgboost.__version__,
}

with open("models/metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)

print("Model saved to models/churn_model_v1.pkl")
print(f"Metadata: {json.dumps(metadata, indent=2)}")

Versioning: name the file churn_model_v1.pkl rather than churn_model.pkl. When you retrain later, create v2 instead of overwriting. Keep the metadata JSON alongside it so you know which model was trained when and with which parameters.
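A minimal sketch of the no-overwrite rule, assuming files follow the churn_model_vN.pkl naming used here. `next_model_path` is a hypothetical helper, and the demo runs in a temporary directory so it touches no real model files.

```python
import re
import tempfile
from pathlib import Path


def next_model_path(model_dir, stem="churn_model"):
    """Return the path for the next unused version, e.g. churn_model_v3.pkl."""
    model_dir = Path(model_dir)
    pattern = re.compile(rf"^{re.escape(stem)}_v(\d+)\.pkl$")
    versions = [
        int(match.group(1))
        for path in model_dir.glob(f"{stem}_v*.pkl")
        if (match := pattern.match(path.name))
    ]
    # With no existing versions, start at v1.
    return model_dir / f"{stem}_v{max(versions, default=0) + 1}.pkl"


with tempfile.TemporaryDirectory() as tmp:
    first = next_model_path(tmp).name
    # Simulate two earlier training runs
    (Path(tmp) / "churn_model_v1.pkl").touch()
    (Path(tmp) / "churn_model_v2.pkl").touch()
    third = next_model_path(tmp).name

print(first, third)
```

The training script could then call joblib.dump(best_model, next_model_path("models")) and write the matching metadata file next to it.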

.gitignore: .pkl files can be large. A small model (a few MB) can be committed. For a large model, use Git LFS or store it on S3 / Hugging Face Hub and record the URL in the metadata.

11

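Step 9: FastAPI inference API

The request schema lives in app/schemas.py and matches the raw feature set used in training. If the saved pipeline also expects the engineered columns from Step 3, apply add_features to the request DataFrame in the API before calling predict_proba, or embed that step in the Pipeline. The schema: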

# app/schemas.py
from pydantic import BaseModel, Field
from typing import Literal

class CustomerData(BaseModel):
    gender: Literal["Male", "Female"]
    SeniorCitizen: Literal[0, 1]
    Partner: Literal["Yes", "No"]
    Dependents: Literal["Yes", "No"]
    tenure: int = Field(ge=0, le=72)
    PhoneService: Literal["Yes", "No"]
    MultipleLines: Literal["Yes", "No", "No phone service"]
    InternetService: Literal["DSL", "Fiber optic", "No"]
    OnlineSecurity: Literal["Yes", "No", "No internet service"]
    OnlineBackup: Literal["Yes", "No", "No internet service"]
    DeviceProtection: Literal["Yes", "No", "No internet service"]
    TechSupport: Literal["Yes", "No", "No internet service"]
    StreamingTV: Literal["Yes", "No", "No internet service"]
    StreamingMovies: Literal["Yes", "No", "No internet service"]
    Contract: Literal["Month-to-month", "One year", "Two year"]
    PaperlessBilling: Literal["Yes", "No"]
    PaymentMethod: Literal[
        "Electronic check",
        "Mailed check",
        "Bank transfer (automatic)",
        "Credit card (automatic)",
    ]
    MonthlyCharges: float = Field(ge=0)
    TotalCharges: float = Field(ge=0)

File app/main.py:

# app/main.py
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
import joblib
import pandas as pd

from app.schemas import CustomerData

MODEL_PATH = "models/churn_model_v1.pkl"
model = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    global model
    model = joblib.load(MODEL_PATH)
    yield
    model = None

app = FastAPI(title="Churn Prediction API", version="1.0.0", lifespan=lifespan)

@app.post("/predict")
def predict(customer: CustomerData):
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    df = pd.DataFrame([customer.model_dump()])
    proba = float(model.predict_proba(df)[0, 1])

    return {
        "churn_probability": round(proba, 4),
        "churn_prediction": int(proba >= 0.5),
        "risk_level": (
            "high" if proba >= 0.7
            else "medium" if proba >= 0.4
            else "low"
        ),
    }

@app.get("/health")
def health():
    return {"status": "ok", "model_loaded": model is not None}

Test locally:

uvicorn app.main:app --reload
# Swagger UI: http://localhost:8000/docs

Note on lifespan: FastAPI 0.93+ deprecates @app.on_event("startup") in favor of lifespan handlers. Use the lifespan context manager as shown above so the model is loaded once at startup rather than on every request.

12

Step 10: Dockerize

# Dockerfile
FROM python:3.11-slim

WORKDIR /app

# Copy requirements first to take advantage of the layer cache
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy source
COPY app/ ./app/
COPY models/ ./models/

EXPOSE 8000

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

File requirements.txt, with pinned versions for reproducibility:

fastapi==0.115.0
uvicorn[standard]==0.30.6
pydantic==2.8.2
scikit-learn==1.5.2
xgboost==2.1.1
category-encoders==2.6.3
pandas==2.2.2
joblib==1.4.2
numpy==1.26.4

Build and test locally:

docker build -t churn-api:latest .
docker run -p 8000:8000 churn-api:latest

# Test endpoint
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"tenure": 2, "Contract": "Month-to-month", "MonthlyCharges": 75.0, ...}'

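The schema requires all 19 fields, so a partial payload returns a 422 validation error. If the API returns the expected JSON, the image is ready to deploy.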
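Since every field in CustomerData is required, it can help to keep one complete sample request around. The sketch below builds one with the standard library; the customer values are hypothetical, and `post_prediction` is an illustrative helper that assumes the container started above is listening on localhost:8000.

```python
import json
import urllib.request

# Hypothetical customer; field names and allowed values follow app/schemas.py.
payload = {
    "gender": "Female",
    "SeniorCitizen": 0,
    "Partner": "No",
    "Dependents": "No",
    "tenure": 2,
    "PhoneService": "Yes",
    "MultipleLines": "No",
    "InternetService": "Fiber optic",
    "OnlineSecurity": "No",
    "OnlineBackup": "No",
    "DeviceProtection": "No",
    "TechSupport": "No",
    "StreamingTV": "Yes",
    "StreamingMovies": "No",
    "Contract": "Month-to-month",
    "PaperlessBilling": "Yes",
    "PaymentMethod": "Electronic check",
    "MonthlyCharges": 75.0,
    "TotalCharges": 150.0,
}


def post_prediction(url, data):
    """POST the payload as JSON and return the decoded response body."""
    request = urllib.request.Request(
        url,
        data=json.dumps(data).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Serialized body, usable with curl -d or a saved sample_request.json
body = json.dumps(payload, indent=2)
print(body)

# With the container running:
# print(post_prediction("http://localhost:8000/predict", payload))
```

The same dictionary can double as the fixture for tests/test_predict.py.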

13

Step 11: Deploy to Render

Push the code to GitHub (including the Dockerfile, app/, and models/).
Go to render.com → New → Web Service → Connect repository.
Choose Runtime: Docker.
Plan: Free (the service sleeps after 15 minutes of inactivity) or Starter ($7/month, always on).
Add environment variables if needed (for example MODEL_VERSION=1.0.0).
Deploy. Render builds the Docker image from the repo automatically.

Once the deploy finishes, the URL looks like https://churn-api-xxxx.onrender.com/docs. Put this URL in the README.

Note on the Free plan: the service cold-starts (about 30 to 60 seconds) when it has received no request for 15 minutes. For a portfolio the Free plan is enough. State it clearly in the README: "Free tier: may need 30s to warm up on the first request."

Alternative, Railway: Railway offers a similar UX, with $5 of credit per month on its entry tier. Connecting GitHub and deploying a Docker image works much like on Render. Plans and prices on both platforms change, so check the current pricing pages.

14

Step 12: README and demo

The README is the first thing a recruiter reads. A minimal structure:

# Customer Churn Prediction

Predict whether a telecom customer will churn in the next 30 days.
Binary classification on tabular data using XGBoost with sklearn Pipeline.

## Demo

API live: https://churn-api-xxxx.onrender.com/docs
*(Free tier — may take ~30s cold start)*

![Swagger UI screenshot](docs/swagger_screenshot.png)

## Results

| Metric | Value |
|--------|-------|
| AUC-ROC (test) | 0.872 |
| Precision (churn) | 0.65 |
| Recall (churn) | 0.79 |
| F1 (churn) | 0.71 |

## Top Features (SHAP)

![SHAP Summary](notebooks/figures/shap_summary.png)

Features with highest impact: tenure, Contract type, MonthlyCharges, TechSupport.

## Architecture

```
CSV → pandas → sklearn Pipeline (TargetEncoder + StandardScaler)
→ XGBoost → FastAPI → Docker → Render
```

## Tech Stack

- Python 3.11, scikit-learn 1.5, XGBoost 2.1, FastAPI 0.115
- Docker, Render

## Quick Start

```bash
git clone https://github.com/yourusername/churn-prediction
cd churn-prediction
pip install -r requirements.txt

# Train
python scripts/train.py

# Serve local
uvicorn app.main:app --reload
```

## Engineering Decisions

- **Pipeline over manual transform**: Prevents data leakage in cross-validation.
- **TargetEncoder over OneHotEncoder**: One column per categorical feature; also scales to high-cardinality categoricals without feature explosion.
- **RandomizedSearchCV**: Faster than GridSearch, sufficient with n_iter=30.
- **scale_pos_weight**: Handles class imbalance natively in XGBoost.

Swagger UI screenshot: open /docs, expand the /predict endpoint, fill in a sample request, and capture the response. It is evidence that the API is working.

15

Final directory structure

churn-prediction/
├── app/
│   ├── main.py             # FastAPI app with lifespan loader
│   └── schemas.py          # full Pydantic schema
├── models/
│   ├── churn_model_v1.pkl  # sklearn Pipeline (preprocessor + XGBoost)
│   └── metadata.json       # version, AUC, features, training date
├── notebooks/
│   ├── 01-eda.ipynb
│   ├── 02-feature-engineering.ipynb
│   └── 03-model-comparison.ipynb
├── src/
│   ├── data/
│   │   └── load.py         # load_raw(), validate_schema()
│   ├── features/
│   │   └── engineering.py  # add_features()
│   └── models/
│       └── train.py        # build_pipeline(), train()
├── scripts/
│   ├── train.py            # entry point: python scripts/train.py
│   └── evaluate.py         # print metrics, save figures
├── tests/
│   └── test_predict.py     # test API response schema
├── data/
│   └── .gitignore          # data/raw/ is not committed
├── docs/
│   └── swagger_screenshot.png
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── pyproject.toml
└── README.md

This structure applies the layout from lesson 7 (standard directory structure), the refactoring pattern from lesson 8, and the docstring and type-hint conventions from lesson 9. The capstone project is where all of those skills come together.

16

Common pitfalls

1. Data leakage through the scaler

Mistake: fitting StandardScaler on the whole dataset before splitting. The scaler then knows the mean and std of the test set, so test metrics look better than they should and do not reflect real-world performance.

Fix: always use a Pipeline. The scaler is fit on X_train only and merely applied to X_test.
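The effect can be seen with plain arithmetic. This is a minimal sketch with made-up values for one numeric column, using the population standard deviation, which is what StandardScaler computes.

```python
from statistics import mean, pstdev

# Made-up values of one numeric column
train = [20.0, 30.0, 40.0, 50.0]
test = [110.0, 120.0]

# Leaky: statistics computed on train + test together
leaky_mean, leaky_std = mean(train + test), pstdev(train + test)

# Correct: statistics computed on the training rows only
mu, sigma = mean(train), pstdev(train)

scaled_leaky = [(x - leaky_mean) / leaky_std for x in test]
scaled_clean = [(x - mu) / sigma for x in test]

print(f"train-only mean={mu:.1f}, leaky mean={leaky_mean:.1f}")
print("test scaled with leaky stats:", [round(v, 2) for v in scaled_leaky])
print("test scaled with train stats:", [round(v, 2) for v in scaled_clean])
```

With leaky statistics the unusual test rows look almost ordinary, because their own values were used to define what "ordinary" means.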

2. Ignoring class imbalance

Mistake: optimizing accuracy. With 26.5% churn, a model that predicts "No churn" for everyone reaches 73.5% accuracy and is useless.

Fix: use AUC-ROC as the primary metric. Add class_weight or scale_pos_weight.
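The 73.5% figure can be checked directly. The sketch assumes the class counts of the full Telco dataset (5,174 No, 1,869 Yes) and scores a model that always predicts "No churn".

```python
# Assumed class counts of the full dataset
n_no, n_yes = 5174, 1869

y_true = [0] * n_no + [1] * n_yes
y_pred = [0] * len(y_true)  # majority-class "model": nobody churns

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall_churn = caught / n_yes

print(f"accuracy = {accuracy:.3f}, recall on churners = {recall_churn:.1f}")
```

The accuracy looks respectable while the model identifies no churners at all, which is why recall, F1, and AUC belong in the report.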

3. Saving the model and preprocessor separately

Mistake:

# WRONG: saved separately
joblib.dump(scaler, "scaler.pkl")
joblib.dump(xgb_model, "model.pkl")

# Loaded again at prediction time
scaler = joblib.load("scaler.pkl")
model = joblib.load("model.pkl")
X_scaled = scaler.transform(X_new)  # A scaler from a different run or version fails silently
y_pred = model.predict(X_scaled)
y_pred = model.predict(X_scaled)

Fix: save the whole Pipeline. After loading, call pipeline.predict(X_new) and preprocessing runs automatically.

4. Hardcoding the feature schema in the API

Mistake: the feature list in the Pydantic schema differs from the feature list used in training. When a feature is added or removed, the API fails with no clear reason.

Fix: store the feature list in metadata.json (the features key written in Step 8). When the model is loaded, validate that the input schema matches that list.
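A minimal sketch of that check, assuming metadata.json stores the training columns under a features key as in Step 8. `check_payload_columns` is a hypothetical helper, and the metadata here is a shortened stand-in for the real file.

```python
import json


def check_payload_columns(payload, metadata):
    """Compare request fields with the feature list recorded at training time."""
    expected = set(metadata["features"])
    received = set(payload)
    return {
        "missing": sorted(expected - received),
        "unexpected": sorted(received - expected),
    }


# Shortened stand-in for models/metadata.json
metadata = json.loads(
    '{"version": "1.0.0", "features": ["tenure", "Contract", "MonthlyCharges"]}'
)

ok = check_payload_columns(
    {"tenure": 2, "Contract": "Month-to-month", "MonthlyCharges": 75.0}, metadata
)
bad = check_payload_columns({"tenure": 2, "contract": "Month-to-month"}, metadata)

print(ok)
print(bad)
```

Running the same comparison once at startup against the fields of the Pydantic model (CustomerData.model_fields in pydantic v2) would surface a schema mismatch before the first request arrives.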

5. Not Dockerizing

Mistake: deploying by pushing code to a server and running pip install by hand. A different Python or package version breaks the service.

Fix: Docker keeps the runtime environment identical to the image you tested locally. Pin versions in requirements.txt.

6. Using SMOTE incorrectly with cross-validation

Mistake: applying SMOTE before cross_val_score. SMOTE then builds synthetic samples from all the data, including what later becomes the validation fold, which is leakage.

Fix: if you use SMOTE, use imblearn.pipeline.Pipeline (not the sklearn Pipeline) so that SMOTE is applied only inside each training fold.

17

Bonus: Monitoring drift

Optional, but it sets the project apart from a "notebook portfolio". Add this section to the README or implement a small part of it to show production thinking.

A simple pattern

Log every prediction request (input features plus output probability) to a JSON file or a database.
Run a weekly batch job that compares the distribution of input features with the training distribution.
If the drift score exceeds a threshold, raise an alert to retrain.

# Compute data drift with evidently
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
import pandas as pd

# reference = training data, current = one week of production data
reference_data = pd.read_csv("data/train_reference.csv")
current_data = pd.read_csv("data/production_week_1.csv")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_data, current_data=current_data)
report.save_html("reports/drift_week_1.html")

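Library: evidently (pip install evidently). The imports above follow the 0.4.x API; later releases reorganized the package, so pin the version you develop against. No server setup is needed: the report is a static HTML file that can be committed to the repo or published on GitHub Pages.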
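For a dependency-free first version of the weekly comparison, one common choice is the Population Stability Index (PSI). This is a different technique from what evidently's preset runs, shown here only as a sketch: the `psi` helper, the tenure bin edges, and the sample values are all assumptions, and the 0.25 alert level is a widely used rule of thumb rather than a fixed standard.

```python
import math


def psi(reference, current, edges, eps=1e-6):
    """Population Stability Index of one numeric feature over fixed bins.

    Values are assumed to lie within [edges[0], edges[-1]].
    """
    def shares(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(edges) - 1):
                is_last = i == len(edges) - 2
                # Bins are [low, high), except the last one, which includes its upper edge.
                if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                    counts[i] += 1
                    break
        # eps avoids log(0) for empty bins
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


tenure_edges = [0, 12, 24, 48, 72]

# Made-up samples: training-time tenure vs. a week dominated by new customers
reference_tenure = [1, 5, 10, 15, 20, 30, 40, 50, 60, 70]
current_tenure = [1, 2, 3, 4, 5, 6, 8, 10, 15, 30]

psi_same = psi(reference_tenure, reference_tenure, tenure_edges)
psi_shifted = psi(reference_tenure, current_tenure, tenure_edges)

ALERT_THRESHOLD = 0.25  # rule-of-thumb level for a significant shift
print(f"PSI (same data) = {psi_same:.3f}")
print(f"PSI (shifted)   = {psi_shifted:.3f}, alert = {psi_shifted > ALERT_THRESHOLD}")
```

A weekly job could compute this per numeric feature from the logged requests and raise the retraining alert described in the pattern.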

18

Completion checklist

Before adding the project to your portfolio, check each item:

Code

Pipeline uses sklearn.pipeline.Pipeline, with no scaling before the split
Primary metric is AUC, not accuracy
SHAP summary plot generated and saved to notebooks/figures/
Model saved as a full Pipeline (joblib.dump)
metadata.json contains version, AUC, feature list, and training date

API

Pydantic schema matches the feature set used in training
Model is loaded through lifespan, not on every request
/health endpoint reports the model status
Swagger UI (/docs) works and can be tested directly

Deploy

Docker image builds successfully locally
Deploy URL works (test with curl or Swagger UI)
P95 latency under 100ms once the service is warm (test with ab or k6)

README

Demo URL, with a cold-start note if using the free tier
Results table: AUC, Precision, Recall, F1
SHAP summary image
Quick start: clone → install → train → serve
Engineering decisions explaining why each technique was chosen
