Lesson 16: A standard workflow for a project: notebook prototype → script → API → deploy

1

Learning objectives

After this lesson, you will understand:

The 5-phase workflow for an AI project, from idea to public URL.
What each phase involves: deliverables, outputs, and practical tips.
A checklist for deciding whether a project is actually "done".
How to adapt the workflow to the project type (ML, DL, LLM/RAG, Agent).
The anti-patterns that leave a project stuck or never finished.

This lesson covers the overall methodology and does not depend on any particular domain or framework. It applies to all 5 capstone projects from lessons 11–15.

2

Overview of the 5 phases

A complete capstone project goes through 5 phases, in order:

Phase	Name	Estimated time	Main output
1	Define + Research	1–2 weeks	Problem statement, dataset, tech stack
2	Notebook prototype	1 week	EDA notebook, baseline model, metric
3	Refactor to scripts	1 week	Modular code, tests, YAML config
4	API + UI	1 week	FastAPI endpoint, Streamlit/Gradio demo
5	Deploy + Document	3–5 days	Public URL, README, CI/CD

Total: roughly 4–6 weeks per capstone project. This estimate assumes part-time study (2–3 hours/day). Working full-time, it can shrink to 2–3 weeks.

Every phase has a concrete deliverable. Without a clearly defined deliverable, you cannot tell when the phase is over.

3

Phase 1 — Define + Research

This is the most important phase. Skipping it is the most common reason projects get abandoned.

Step 1.1 — Problem statement

Write one sentence that describes the problem precisely, with no ambiguity:

# Good: specific
"Predict customer churn within 30 days using account behavior data."

# Too vague
"Build an AI model for customers."

Alongside it, write down:

User persona: who will use this app? (an internal data analyst, an end user, a developer via the API?)
Business value: what does it save, and which concrete problem does it solve? Exact numbers are not required; an estimate is fine.

Step 1.2 — Preliminary data exploration

Find a dataset: Kaggle, Hugging Face Datasets, Papers with Code Datasets, UCI ML Repository.
Check its size (number of rows and features), quality (missing-value ratio), and license (does it allow only non-commercial use, or commercial use as well?).
Establish a naive baseline: what accuracy would you get with no ML at all? For example, if always predicting the majority class already yields 85% accuracy, the model has to beat that number.
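
The naive baseline from the last bullet can be computed in a few lines. This is a minimal sketch using only the standard library; the function name and the 85/15 label split are illustrative assumptions, not taken from a real dataset.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_class_accuracy(labels: Sequence[Hashable]) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    if not labels:
        raise ValueError("labels must not be empty")
    # most_common(1) returns [(label, count)] for the most frequent label.
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)


# Hypothetical target column: 85 customers stayed (0), 15 churned (1).
labels = [0] * 85 + [1] * 15
baseline = majority_class_accuracy(labels)
print(f"Majority-class baseline accuracy: {baseline:.2%}")
```

On an imbalanced target like this one, accuracy alone says little, which is one reason to pick a metric such as AUC or F1 in Step 1.3.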

Step 1.3 — Approach research

Search for "churn prediction site:paperswithcode.com" or "churn prediction benchmark".
Read 2–3 recent papers or technical blog posts (2022–2025).
Identify what the SOTA uses, what the simplest baseline is, and which metric is standard for the domain (AUC, F1, BLEU, ROUGE…).

Step 1.4 — Scope MVP

Use the MoSCoW technique (covered in detail in lesson 2):

Must: features required before the project can be called "done".
Should: important, but can slip to V1.1.
Could: nice-to-have, only if time remains.
Won't: explicitly decided not to do in V1.

Attach a deadline to each phase right here, before you start coding.

Step 1.5 — Tech stack decision

Choose an option for each layer and record the reason for each choice:

Layer	Options	When to use
ML framework	sklearn, PyTorch, TensorFlow, LangChain	Tabular → sklearn; DL → PyTorch; LLM app → LangChain/LlamaIndex
Storage	SQLite, Postgres, S3, ChromaDB	Local dev → SQLite; RAG → vector DB
API	FastAPI	Default for most projects
UI	Streamlit, Gradio	Gradio suits ML demos; Streamlit suits dashboards
Deploy	Render, Railway, HF Spaces	HF Spaces is free for ML demos; Render works well for FastAPI

Record the reasoning in docs/decisions.md or the README. Recruiters sometimes ask why you chose X over Y.

Deliverable Phase 1: a docs/project-plan.md file or a Notion page containing the problem statement, dataset info, tech stack, MoSCoW list, and timeline.

4

Phase 2 — Notebook prototype

The goal of this phase is to show that the approach is feasible, not to write clean code.

Step 2.1 — EDA (Exploratory Data Analysis)

File: notebooks/01-eda.ipynb

Summary statistics: df.describe(), df.info(), df.isnull().sum().
Visualize the distribution of the target and the key features.
Identify problems: missing data (what percentage?), outliers, class imbalance (what ratio?).

Final output of the notebook: a short "Summary Findings" text listing the 3–5 most important findings.

Step 2.2 — Baseline model

File: notebooks/02-baseline.ipynb

Goal: an end-to-end pipeline that runs, with no tuning.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

baseline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(X_train, y_train)

auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])
print(f"Baseline AUC: {auc:.4f}")

Record the baseline metric. Every later iteration has to improve on this number.

Step 2.3 — Iterate and compare

Each notebook tests one hypothesis:

notebooks/03-feature-engineering.ipynb: do the new features improve the metric?
notebooks/04-model-comparison.ipynb: XGBoost, LightGBM, and RandomForest against the baseline.

Track each experiment's metric directly in the notebook. Once the number of experiments grows, switch to MLflow or Weights & Biases.
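
Before reaching for MLflow or Weights & Biases, in-notebook tracking can be as small as a list of records. The sketch below is one possible shape; the Experiment class, the helper names, and the AUC values are placeholders for illustration, not real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    name: str
    auc: float
    note: str = ""


def best_experiment(log: list[Experiment]) -> Experiment:
    """Return the experiment with the highest AUC."""
    if not log:
        raise ValueError("experiment log is empty")
    return max(log, key=lambda e: e.auc)


def format_log(log: list[Experiment]) -> str:
    """Render the log as aligned text, best result first."""
    rows = sorted(log, key=lambda e: e.auc, reverse=True)
    return "\n".join(f"{e.name:<28}{e.auc:.4f}  {e.note}" for e in rows)


# Placeholder numbers for illustration only, not real results.
log = [
    Experiment("02-baseline-logreg", 0.78, "StandardScaler + LogisticRegression"),
    Experiment("03-feature-engineering", 0.80, "added tenure buckets"),
    Experiment("04-xgboost", 0.81, "default hyperparameters"),
]
print(format_log(log))
print("Best:", best_experiment(log).name)
```

Keeping a free-text note next to each number makes it easier to reconstruct later why an experiment helped.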

Step 2.4 — Validation at the end of Phase 2

Cross-validation (5-fold) on the training set.
Evaluate on the holdout test set (use it only once, never for tuning).
Report the metric clearly: AUC, F1, Precision/Recall; pick the one that fits the problem.
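
The validation steps above can be sketched with scikit-learn as follows. The data is synthetic (make_classification stands in for the real dataset, an assumption made so the example runs on its own), and the pipeline simply reuses the baseline from Step 2.2.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (assumption: binary target, numeric features).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 5-fold CV on the training set only. For a classifier with an integer cv,
# scikit-learn uses stratified folds.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
print(f"CV AUC: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

# Fit on the full training set, then touch the holdout set exactly once.
model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Holdout AUC: {test_auc:.4f}")
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so no statistics leak from the validation fold.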

Tip: a prototype notebook is allowed to be messy. Out-of-order cells and commented-out experiments are normal. This phase is about exploration, not clean code; Phase 3 does the refactoring.

Deliverable Phase 2: 3–4 notebooks in notebooks/, the final metric recorded, and the best model artifact saved.

5

Phase 3 — Refactor to scripts

This phase is covered in detail in lesson 8 (refactoring notebook → script) and lesson 9 (type hints + docstrings). Here we only recap the steps in the context of the overall workflow.

Step 3.1 — Identify components in the notebook

Reread the notebooks and tag each cell by the function it performs:

load_data        → src/data/loader.py
clean_data       → src/data/preprocessor.py
feature_engineer → src/features/builder.py
train_model      → src/models/trainer.py
evaluate         → src/evaluation/metrics.py
save_model       → src/models/serializer.py
predict          → src/models/predictor.py

Step 3.2 — Directory structure after the refactor

project/
├── configs/
│   └── train.yaml          # hyperparameters, paths
├── notebooks/              # keep as-is, do not delete
├── src/
│   ├── data/
│   ├── features/
│   ├── models/
│   └── evaluation/
├── scripts/
│   ├── train.py
│   └── predict.py
├── tests/
│   ├── unit/
│   └── integration/
├── .env.example
└── requirements.txt

Step 3.3 — Type hints + docstring

import pandas as pd
from pathlib import Path

def load_data(path: Path) -> pd.DataFrame:
    """Load a CSV file from path and return it as a DataFrame.

    Args:
        path: Path to the CSV file.

    Returns:
        DataFrame with its index reset.

    Raises:
        FileNotFoundError: If the file does not exist.
    """
    return pd.read_csv(path).reset_index(drop=True)

Step 3.4 — Tests

# tests/unit/test_preprocessor.py
import pandas as pd
from src.data.preprocessor import clean_data

def test_clean_data_removes_nulls():
    df = pd.DataFrame({"age": [25, None, 30], "label": [1, 0, 1]})
    result = clean_data(df)
    assert result.isnull().sum().sum() == 0

def test_clean_data_preserves_shape():
    df = pd.DataFrame({"age": [25, 30], "label": [1, 0]})
    result = clean_data(df)
    assert result.shape[1] == df.shape[1]

Target: coverage ≥ 70%. Run: pytest tests/ --cov=src --cov-report=term (the --cov options require the pytest-cov plugin).
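
The tests above import clean_data, which this lesson does not show. As an illustration only, here is one hypothetical implementation that would satisfy both tests: it imputes numeric columns with the median and drops any rows that still contain missing values. The real preprocessing rules depend on the dataset.

```python
import pandas as pd


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with no missing values and the same columns.

    Numeric columns are imputed with their median; rows that still contain
    a missing value afterwards (e.g. in text columns) are dropped.
    """
    result = df.copy()
    numeric_cols = result.select_dtypes(include="number").columns
    if len(numeric_cols) > 0:
        medians = result[numeric_cols].median()
        result[numeric_cols] = result[numeric_cols].fillna(medians)
    return result.dropna().reset_index(drop=True)


# Same input as test_clean_data_removes_nulls.
df = pd.DataFrame({"age": [25, None, 30], "label": [1, 0, 1]})
cleaned = clean_data(df)
print(cleaned)
```

Imputing instead of dropping keeps all three rows here; whether that is appropriate is a modelling decision to record alongside the EDA findings.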

Step 3.5 — Config instead of hardcoding

# configs/train.yaml
data:
  path: data/raw/telco-churn.csv
  test_size: 0.2
  random_state: 42

model:
  type: xgboost
  params:
    n_estimators: 300
    max_depth: 6
    learning_rate: 0.05

output:
  model_path: models/churn_model.pkl
  metrics_path: reports/metrics.json
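
On the Python side, the parsed config can be validated once and passed around as a typed object. This is a sketch under assumptions: TrainConfig and parse_train_config are hypothetical names, and the input is the dict a YAML parser would return for the file above.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass(frozen=True)
class TrainConfig:
    data_path: Path
    test_size: float
    random_state: int
    model_type: str
    model_params: dict[str, Any] = field(default_factory=dict)
    model_path: Path = Path("models/churn_model.pkl")
    metrics_path: Path = Path("reports/metrics.json")


def parse_train_config(raw: dict[str, Any]) -> TrainConfig:
    """Validate an already-parsed config mapping and convert it to TrainConfig."""
    data, model, output = raw["data"], raw["model"], raw.get("output", {})
    test_size = float(data["test_size"])
    if not 0.0 < test_size < 1.0:
        raise ValueError(f"test_size must be in (0, 1), got {test_size}")
    return TrainConfig(
        data_path=Path(data["path"]),
        test_size=test_size,
        random_state=int(data.get("random_state", 42)),
        model_type=str(model["type"]),
        model_params=dict(model.get("params", {})),
        model_path=Path(output.get("model_path", "models/churn_model.pkl")),
        metrics_path=Path(output.get("metrics_path", "reports/metrics.json")),
    )


# Same content as configs/train.yaml, written as the dict a YAML parser would return.
raw = {
    "data": {"path": "data/raw/telco-churn.csv", "test_size": 0.2, "random_state": 42},
    "model": {
        "type": "xgboost",
        "params": {"n_estimators": 300, "max_depth": 6, "learning_rate": 0.05},
    },
    "output": {"model_path": "models/churn_model.pkl", "metrics_path": "reports/metrics.json"},
}
cfg = parse_train_config(raw)
print(cfg.model_type, cfg.model_params["n_estimators"])
```

With PyYAML installed, raw would typically come from yaml.safe_load(Path("configs/train.yaml").read_text()).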

Step 3.6 — Logging instead of print

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s — %(message)s",
)
logger = logging.getLogger(__name__)

# Instead of print("Training done"), use:
logger.info("Training complete. AUC=%.4f", auc)

See lesson 10 for logging details.

Deliverable Phase 3: python scripts/train.py --config configs/train.yaml runs successfully. Tests pass. The coverage report shows ≥ 70%.
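
The entry point behind that command can start as a thin argparse wrapper. The sketch below shows only the CLI layer; parse_args and main are assumed names, and the actual training call is left as a comment because it depends on the modules in src/.

```python
import argparse
from pathlib import Path
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Train the model from a YAML config.")
    parser.add_argument(
        "--config", type=Path, required=True, help="Path to the YAML config file."
    )
    return parser.parse_args(argv)


def main(argv: Optional[Sequence[str]] = None) -> int:
    args = parse_args(argv)
    # In the real script: load the config, then call the functions from src/
    # (load_data, train_model, evaluate, save_model).
    print(f"Would train using config: {args.config}")
    return 0


# Equivalent to: python scripts/train.py --config configs/train.yaml
exit_code = main(["--config", "configs/train.yaml"])
```

In scripts/train.py itself, the last line would usually be replaced by the standard guard that calls main() with no arguments, so that the values come from the command line.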

6

Phase 4 — API + UI

Goal: recruiters and users can try the model without reading any code.

Step 4.1 — FastAPI endpoint

# app/main.py
from contextlib import asynccontextmanager
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import pandas as pd

class PredictRequest(BaseModel):
    tenure: int
    monthly_charges: float
    total_charges: float
    contract: str  # "Month-to-month" | "One year" | "Two year"

class PredictResponse(BaseModel):
    churn_probability: float
    churn_label: bool

model = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    global model
    model = joblib.load("models/churn_model.pkl")
    yield

app = FastAPI(lifespan=lifespan)

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    df = pd.DataFrame([req.model_dump()])
    prob = float(model.predict_proba(df)[0, 1])
    return PredictResponse(churn_probability=prob, churn_label=prob >= 0.5)

Test locally:

uvicorn app.main:app --reload

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"tenure": 12, "monthly_charges": 65.0, "total_charges": 780.0, "contract": "Month-to-month"}'

Step 4.2 — UI demo

Pick Streamlit or Gradio; either is good enough for a portfolio demo.

# app/ui.py (Streamlit)
import os

import requests
import streamlit as st

# API_URL is an assumed environment variable: it defaults to the local dev server
# and can point at the compose service (e.g. http://api:8000) inside Docker.
API_URL = os.environ.get("API_URL", "http://localhost:8000")

st.title("Customer Churn Predictor")

with st.form("predict_form"):
    tenure = st.slider("Tenure (months)", 1, 72, 12)
    monthly_charges = st.number_input("Monthly Charges ($)", 20.0, 120.0, 65.0)
    total_charges = st.number_input("Total Charges ($)", 0.0, 9000.0, monthly_charges * tenure)
    contract = st.selectbox("Contract", ["Month-to-month", "One year", "Two year"])
    submitted = st.form_submit_button("Predict")

if submitted:
    resp = requests.post(
        f"{API_URL}/predict",
        json={
            "tenure": tenure,
            "monthly_charges": monthly_charges,
            "total_charges": total_charges,
            "contract": contract,
        },
        timeout=10,  # do not hang the UI if the API is unreachable
    )
    resp.raise_for_status()  # fail loudly on 4xx/5xx instead of on a missing key
    result = resp.json()
    prob = result["churn_probability"]
    st.metric("Churn Probability", f"{prob:.1%}")
    if result["churn_label"]:
        st.warning("High risk of churn.")
    else:
        st.success("Low risk of churn.")

Add ready-made sample data (a "Load example" button) so that users do not have to type values by hand during a demo.

Step 4.3 — Error handling

import logging

from fastapi import HTTPException

logger = logging.getLogger(__name__)

# Replaces the /predict handler from Step 4.1; app, model, pd and the
# request/response schemas are the ones defined there.
@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    try:
        df = pd.DataFrame([req.model_dump()])
        prob = float(model.predict_proba(df)[0, 1])
        return PredictResponse(churn_probability=prob, churn_label=prob >= 0.5)
    except Exception as exc:
        logger.error("Prediction error: %s", exc, exc_info=True)
        raise HTTPException(status_code=500, detail="Prediction failed.") from exc

FastAPI, through Pydantic validation, automatically returns 422 for input that violates the schema. 500 errors need full logging so they can be debugged after deployment.
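
The automatic 422 only catches what the schema expresses. As a sketch of a stricter schema, the model below restricts contract to the three values listed in the comment of Step 4.1 and rejects negative numbers; StrictPredictRequest and validation_errors are hypothetical names, and the non-negativity constraints are an assumption about the data.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class StrictPredictRequest(BaseModel):
    tenure: int = Field(ge=0)
    monthly_charges: float = Field(ge=0)
    total_charges: float = Field(ge=0)
    contract: Literal["Month-to-month", "One year", "Two year"]


def validation_errors(payload: dict) -> list[str]:
    """Return the names of the fields that fail validation (empty if valid)."""
    try:
        StrictPredictRequest(**payload)
    except ValidationError as exc:
        return [str(err["loc"][0]) for err in exc.errors()]
    return []


good = {
    "tenure": 12,
    "monthly_charges": 65.0,
    "total_charges": 780.0,
    "contract": "Month-to-month",
}
bad = {**good, "tenure": -3, "contract": "Weekly"}
print(validation_errors(good))
print(validation_errors(bad))
```

Used as the request model of the endpoint, a payload like bad would be rejected with a 422 before the handler runs, so it never reaches the model.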

Deliverable Phase 4: GET /health returns 200, POST /predict returns a correct result for the sample input, and the UI runs in a browser.

7

Phase 5 — Deploy + Document

A project only counts as finished once it has a public URL. Running well locally is not enough.

Step 5.1 — Containerize

# Dockerfile (multi-stage)
FROM python:3.11-slim AS base
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

FROM base AS runtime
COPY . .
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

# docker-compose.yml
services:
  api:
    build: .
    ports:
      - "8000:8000"
    env_file: .env
  ui:
    build:
      context: .
      dockerfile: Dockerfile.ui
    ports:
      - "8501:8501"
    depends_on:
      - api

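Test locally first: docker compose up --build. If it passes locally, deployment brings fewer surprises. Note that the compose file above expects a Dockerfile.ui for the Streamlit app, which is not shown here.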

Step 5.2 — Deploy production

Three common options for a portfolio project:

Render: connect the GitHub repo, choose Web Service, set the environment variables, and auto-deploy from the main branch. The free tier is enough for a demo.
Railway: similar to Render, with a slightly nicer UI. Its free allowance has changed over time, so check the current pricing before relying on it.
Hugging Face Spaces: the best fit for Gradio/Streamlit apps, no Dockerfile required.

Set environment variables in the platform dashboard; never commit .env to Git.

Step 5.3 — CI/CD (optional but recommended)

# .github/workflows/ci.yml
name: CI

on:
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pip install pytest pytest-cov ruff
      - run: ruff check src/ scripts/
      - run: pytest tests/ --cov=src --cov-fail-under=70

Step 5.4 — Minimal monitoring

A GET /health endpoint that an uptime monitor can ping.
Structured JSON logs (see lesson 10) for tracing errors.
Sentry free tier: a few lines added to main.py give you error tracking.
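
For the structured JSON logs, the standard library is enough to get started. This is a minimal sketch, not the setup from lesson 10: JsonFormatter and the logger name churn_api are assumptions, and the field names in the payload are one possible choice.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Include the traceback so 500 errors can be debugged from the log.
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, ensure_ascii=False)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("churn_api")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("Prediction served. prob=%.3f", 0.42)
```

One JSON object per line is easy to filter in a hosting platform's log viewer.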

Step 5.5 — README

Write the README in parallel with the deployment instead of leaving it to the very end. The README structure is covered in detail in lesson 17.

Deliverable Phase 5: a working public URL, a README with the demo link, and a CI badge on GitHub.

8

Project completion checklist

Use this checklist to decide whether a project is ready for your portfolio:

Phase 1

The problem statement fits in one clear sentence
Dataset acquired, license OK
Non-ML baseline identified
Tech stack chosen, with reasons
MoSCoW list + timeline written down

Phase 2

EDA notebook exists and includes summary findings
Baseline model runs end-to-end
Baseline metric clearly recorded
Best model's test-set result recorded with a concrete metric

Phase 3

Code lives in src/, split into modules by function
Type hints + docstrings on public functions
Tests pass, coverage ≥ 70%
Hyperparameters and paths are configured through YAML
No print() calls in src/, only the logger
This runs: python scripts/train.py --config configs/train.yaml

Phase 4

GET /health returns 200
POST /predict (or an equivalent endpoint) returns a correct result
Invalid input returns 422 with a clear message
UI demo runs and includes sample data

Phase 5

Dockerfile builds successfully
docker compose up runs locally
App deployed to a public URL
README covers: project description, tech stack, how to run locally, demo link
CI workflow runs (optional)

9

Anti-patterns when running the workflow

Skipping Phase 1 and jumping straight into code

Consequence: after 2–3 weeks of coding you discover that the dataset is not good enough, or the scope is too broad, or someone has already done the same approach far better. As a rough estimate, about 50% of projects abandoned midway trace back to this.

Fix: spend at least 3–5 working days on Phase 1, even for a simple project.

Stuck in Phase 2 forever

Symptom: the notebook keeps growing, with ever more features and ever more models, but never gets refactored. The portfolio contains only .ipynb files and no deployed URL.

Fix: set a hard deadline for Phase 2 (1 week). When it expires, move to Phase 3 with the current model, even if the metric is not yet optimal.

Over-engineering Phase 3

Symptom: a small project ends up with 20 modules, abstract classes, a factory pattern, and a dependency injection container. It is hard to navigate and costs weeks of refactoring.

Fix: refactor until the code "can be tested and can be run from the CLI", then stop. Do not apply enterprise patterns to a one-person portfolio project.

Deploying without Docker

Symptom: the app only runs on your own machine via python app.py. Recruiters cannot try it.

Fix: start Phase 5 with the Dockerfile, not with a push to the platform.

Writing the README last

Symptom: the README is written only after deployment, by which time the design decisions, metric values, and reasons for the tech stack have been forgotten.

Fix: create README.md in Phase 1 and update it continuously through every phase.

10

Adapting to the project type

The 5-phase workflow is a common frame, but the focus and duration of each phase vary by project type:

Classical ML (e.g. churn prediction, fraud detection)

Phase 2 runs longer: EDA and feature engineering take most of the time.
Phase 4 is simple: usually a single /predict endpoint is enough.
Phase 5 is standard: Render + a Dockerfile is sufficient.

DL / CV (e.g. image classification)

The Phase 2 notebook is shorter (a minimal training loop), but Phase 3 takes longer because training needs a GPU.
Phase 4 has to handle file uploads (multipart form) rather than JSON only.
Phase 5: consider HF Spaces. The free tier runs on CPU; GPU hardware is available, but its cost and quota change, so check the current terms before counting on it for a demo.

LLM / RAG

Phase 1 also needs a data ingestion strategy: how to chunk, which embedding model, which vector DB.
Phase 2 is not training: it is prompt engineering and retrieval tuning (top-k, reranker, threshold).
Phase 3 refactors the LangChain chain / LlamaIndex pipeline into its own module.
Phase 4 needs streaming responses (StreamingResponse in FastAPI).
Phase 5: cost monitoring matters. Log the number of tokens used per request to track OpenAI/Anthropic cost.
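
The per-request cost tracking from the last bullet comes down to simple arithmetic on token counts. In the sketch below, the prices are hypothetical placeholders and TokenUsage is an assumed record type; real token counts would come from the provider's response metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    prompt_tokens: int
    completion_tokens: int


# Hypothetical prices in USD per 1,000 tokens; replace with the provider's current rates.
PRICE_PER_1K_PROMPT = 0.001
PRICE_PER_1K_COMPLETION = 0.002


def request_cost(usage: TokenUsage) -> float:
    """Cost of one request: prompt and completion tokens are priced separately."""
    return (
        usage.prompt_tokens / 1000 * PRICE_PER_1K_PROMPT
        + usage.completion_tokens / 1000 * PRICE_PER_1K_COMPLETION
    )


def total_cost(usages: list[TokenUsage]) -> float:
    return sum(request_cost(u) for u in usages)


usages = [TokenUsage(1200, 300), TokenUsage(800, 150)]
for i, u in enumerate(usages, start=1):
    print(
        f"request {i}: prompt={u.prompt_tokens} "
        f"completion={u.completion_tokens} cost=${request_cost(u):.6f}"
    )
print(f"total: ${total_cost(usages):.6f}")
```

Writing these numbers through the logger rather than print lets them land in the same structured logs as the errors.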

Agent

Phase 1 needs workflow design: draw the agent loop diagram, define the tool set, and mark the HITL (human-in-the-loop) points.
The Phase 4 UI needs HITL: at least one step where the user confirms before the agent executes an action.
Phase 5 needs a checkpointer (persistent state) so the agent does not lose context on restart.
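
The HITL confirmation step can be modelled as a gate between the action the agent proposes and its execution. This is a framework-agnostic sketch: ProposedAction, execute_with_confirmation, and the send_email tool are all hypothetical, and an agent framework would supply its own mechanism for pausing and resuming.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProposedAction:
    tool: str
    argument: str


def execute_with_confirmation(
    action: ProposedAction,
    confirm: Callable[[ProposedAction], bool],
    tools: dict[str, Callable[[str], str]],
) -> str:
    """Run the action only if the tool is known and the human approves it."""
    if action.tool not in tools:
        return f"rejected: unknown tool {action.tool!r}"
    if not confirm(action):
        return "skipped: user declined"
    return tools[action.tool](action.argument)


# Hypothetical tool set: a single tool that pretends to send an email.
tools = {"send_email": lambda to: f"email sent to {to}"}
action = ProposedAction(tool="send_email", argument="customer@example.com")

# In a UI, confirm would be a button click; here it is a plain function.
print(execute_with_confirmation(action, lambda a: True, tools))
print(execute_with_confirmation(action, lambda a: False, tools))
```

Passing confirm in as a function keeps the gate testable without a UI.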

11

Realistic effort estimate

Share of total project time per phase:

Phase	% of time	Notes
Phase 1 — Define + Research	10–20%	Easy to underestimate
Phase 2 — Notebook prototype	20–30%	Easy to overrun without a deadline
Phase 3 — Refactor to scripts	25–35%	Usually costs more than expected
Phase 4 — API + UI	15–20%	FastAPI setup is quick; a complex UI adds time
Phase 5 — Deploy + Document	10–15%	Debugging the deploy environment often adds 1–2 days

A healthy split puts about 40% of the time into Phases 1–2 (exploration) and leaves about 60% for Phases 3–5. Beginners tend to invert this: they pour 70–80% of the time into Phase 2 (an ever-growing notebook) and have nothing left for deployment.
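
To turn the percentages into a schedule, take the midpoint of each range in the table above and multiply by the total budget. The 30 working days below are a hypothetical budget, and allocate_days is an illustrative helper.

```python
# Midpoints of the percentage ranges in the table above; they sum to 100.
PHASE_SHARE = {
    "Phase 1": 15.0,   # 10–20%
    "Phase 2": 25.0,   # 20–30%
    "Phase 3": 30.0,   # 25–35%
    "Phase 4": 17.5,   # 15–20%
    "Phase 5": 12.5,   # 10–15%
}


def allocate_days(total_days: float) -> dict[str, float]:
    """Split a total time budget across phases according to PHASE_SHARE."""
    return {phase: total_days * pct / 100 for phase, pct in PHASE_SHARE.items()}


budget = allocate_days(30)  # hypothetical budget: 30 working days
for phase, days in budget.items():
    print(f"{phase}: {days:.2f} days")
```

With these midpoints, Phases 1–2 together receive 40% of the budget, which matches the split described above.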

12

Best practices for the process

Commit to Git often, with clear messages (e.g. feat: add XGBoost baseline, AUC=0.81).
Create a branch for each new feature (feat/feature-engineering, feat/fastapi-endpoint). Do not work directly on main.
Self-review each PR before merging: reread the diff, check the tests, and make sure no .env file or large model artifact is committed.
Use a pre-commit hook to run ruff + pytest before every commit.
Update the README together with the code, not afterwards.

Tools for the workflow

Purpose	Suggested tool
Notes + planning	Notion, Obsidian, or `docs/` in the repo
Task tracking	GitHub Issues (enough for a solo project)
Experiment tracking	MLflow (self-hosted) or Weights & Biases (cloud free tier)
Error tracking	Sentry free tier
Time tracking (optional)	Toggl Track

13

After the project is done

"Done" by the Phase 5 checklist does not mean the project is abandoned.

Document what you learned: write a blog post explaining the approach, the results, and the lessons (lesson 19). It helps recruiters see the depth of your work.
Iterate toward V1.1: keep the V1.1 feature list in GitHub Issues. After applying to a few jobs, come back to improve the metric or add an important feature.
Maintain: a demo on the Render free tier spins down after 15 minutes of inactivity, so the first request afterwards is slow. Check the URL 1–2 times a month, and update requirements.txt when a dependency ships a breaking change.
Showcase: pin the repo on your GitHub profile and add it to the LinkedIn Featured section.

14

Next lesson

Lesson 17: Writing a professional README.md: structure and checklist
