Lesson 8: Refactoring from notebook (.ipynb) to script (.py)

1

Lesson Objectives

After this lesson, you will:

Understand why notebooks are a poor fit for production and how recruiters judge that.
Be able to carry out the 6-step process for turning a notebook into a structured script.
Apply the basic patterns: separating config, dependency injection, logging.
Know how to use nbconvert and jupytext to support the conversion.
Write minimal unit tests after refactoring to confirm that results have not changed.

2

Notebook vs Script: When to Use Which

Notebooks and scripts solve two different problems. Trouble starts when a notebook is used outside the scope it suits.

Aspect	Notebook (.ipynb)	Script (.py)
Suitable use case	EDA, prototype, demo	Production, automation, CI/CD
Reproducibility	Depends on cell execution order	Fixed top-to-bottom flow
Testing	Hard (no separate functions)	Easy with pytest
Git diff	Hard (nested JSON, embedded output)	Plain text, clear diffs
Modularity	Everything in one file	Split into multiple modules
Long-running jobs	Hard (session depends on the kernel)	Native, can be scheduled
Deploy	Not directly	Direct (python script.py)

The Hidden State Problem in Notebooks

A notebook keeps state between cell runs. Cells can run out of order, variables can survive from an earlier run, and the result depends on the history of your actions, not only on the code currently on screen.

# Cell 1 (first run)
import pandas as pd

df = pd.read_csv("data.csv")
df = df[df["status"] == "active"]

# Cell 2 (run after cell 1 was edited but not re-run)
# df is still the DataFrame filtered by the old version of cell 1,
# so the result is wrong and no error is raised
result = df.groupby("category").agg({"value": "sum"})

A script forces the code to run from top to bottom on every run, which eliminates this problem.
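The hidden-state problem can be reproduced without Jupyter. The sketch below treats each cell as a string of code executed against one shared namespace, which is roughly how a kernel behaves. The cell contents and the `run_cell` helper are illustrative assumptions, not part of any notebook API.

```python
# A shared dict stands in for the kernel's global namespace.
kernel_state: dict = {}


def run_cell(source: str) -> None:
    """Execute one 'cell' against the shared namespace, like a kernel does."""
    exec(source, kernel_state)


cell_1 = "rows = [r for r in [1, 2, 3, 4] if r > 2]"  # the filter keeps 3 and 4
cell_2 = "total = sum(rows)"

run_cell(cell_1)
run_cell(cell_2)
stale_total = kernel_state["total"]

# Cell 1 is edited to keep every row, but it is never re-run.
cell_1 = "rows = [1, 2, 3, 4]"
run_cell(cell_2)  # still sees the old, filtered rows
after_edit_total = kernel_state["total"]

# A script is equivalent to a fresh namespace plus top-to-bottom execution.
fresh_state: dict = {}
exec(cell_1 + "\n" + cell_2, fresh_state)
script_total = fresh_state["total"]
```

Here `after_edit_total` still reflects the old filter, while `script_total` reflects the code as it is currently written.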

CV and Portfolio

Recruiters look at your GitHub repositories. If they see only .ipynb files and no .py, the signal is: this candidate can prototype but has not yet shown production skills. For an AI Engineer switching from another field, this is a concrete and avoidable weakness.

3

The 6-Step Refactoring Process

Step 1: Clean Up the Notebook

Before converting, the notebook must run cleanly from start to finish:

Delete all debug cells, trial prints, and temporary cells.
Remove duplicated code: if you rewrote the same processing step 3 times, keep the final version.
Kernel → Restart & Run All: make sure it runs linearly from top to bottom without errors.
If Restart & Run All fails, fix it until it passes.

This is the most important step. If the notebook does not run cleanly, the refactor carries its bugs into the script.
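One rough way to check whether a saved notebook was last executed with Restart & Run All is to read the execution counters stored in the .ipynb file, which is plain JSON. The sketch below assumes the nbformat v4 layout (a top-level `cells` list whose code cells carry `execution_count`). The function names are hypothetical, and the check is a heuristic: it shows the cells ran in order, not that they ran without errors.

```python
import json
from pathlib import Path


def executed_in_order(notebook: dict) -> bool:
    """Return True if non-empty code cells carry execution counts 1, 2, 3, ... in order."""
    counts = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        # "source" may be a string or a list of strings; join handles both
        if cell.get("cell_type") == "code" and "".join(cell.get("source", "")).strip()
    ]
    return counts == list(range(1, len(counts) + 1))


def check_notebook_file(path: str) -> bool:
    """Load an .ipynb file and apply the check above."""
    return executed_in_order(json.loads(Path(path).read_text(encoding="utf-8")))
```

A notebook whose code cells show counts such as 7, 3, 12 was run out of order at some point and should be re-run from a fresh kernel before you start extracting code.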

Step 2: Identify the Components

Read through the notebook again and label each group of cells by function:

Data loading: reading files, connecting to a database
Preprocessing: handling missing values, type casting, outliers
Feature engineering: creating new features, encoding, scaling
Model training: fitting the model, cross-validation
Evaluation: computing metrics, plotting
Inference: loading the trained model, predicting on new data

Each component becomes one function or one separate module.

Step 3: Extract Functions

Turn each group of cells into a function with a clear signature. The goal is pure functions (input → output, no side effects):

# Notebook: flat code in a cell
df = pd.read_csv("data/train.csv")
df = df.dropna()
df["age_bucket"] = pd.cut(df["age"], bins=5)
X = df.drop("target", axis=1)
y = df["target"]

# Refactored: split into separate functions
# (the built-in tuple[...] annotation needs Python 3.9+; no typing import is required)
import pandas as pd


def load_data(path: str) -> pd.DataFrame:
    return pd.read_csv(path)


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna()


def feature_engineer(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()  # do not mutate the input
    df["age_bucket"] = pd.cut(df["age"], bins=5)
    return df


def split_x_y(df: pd.DataFrame, target_col: str) -> tuple[pd.DataFrame, pd.Series]:
    return df.drop(target_col, axis=1), df[target_col]

Note the df.copy() in feature_engineer: it avoids mutating the original DataFrame, a common silent bug when refactoring.

Step 4: Move Functions Into Modules

Group the functions into .py files under the src/ directory:

src/
├── __init__.py
├── data/
│   ├── __init__.py
│   ├── load.py        # load_data
│   └── preprocess.py  # clean_data, feature_engineer, split_x_y
├── models/
│   ├── __init__.py
│   ├── train.py       # train_model
│   └── predict.py     # predict
└── evaluation/
    ├── __init__.py
    └── metrics.py     # compute_metrics

Create an __init__.py file in each folder, even if it is empty, so that Python treats the folder as a package.

Step 5: Entry Script

Create scripts/train.py, the entry point that ties the pipeline together:

# scripts/train.py
import sys
import yaml

from src.data.load import load_data
from src.data.preprocess import clean_data, feature_engineer, split_x_y
from src.models.train import train_model, save_model
from src.evaluation.metrics import compute_metrics


def main(config_path: str) -> None:
    with open(config_path) as f:
        config = yaml.safe_load(f)

    df = load_data(config["data_path"])
    df = clean_data(df)
    df = feature_engineer(df)
    X, y = split_x_y(df, config["target_col"])

    model = train_model(X, y, **config["model_params"])
    metrics = compute_metrics(model, X, y)

    print(metrics)
    save_model(model, config["output_path"])


if __name__ == "__main__":
    config_file = sys.argv[1] if len(sys.argv) > 1 else "configs/train.yaml"
    main(config_file)

Step 6: Verify

Run the script and compare the result with the notebook:

python scripts/train.py configs/train.yaml

The test-set metrics must match those from the notebook. With the same random seed, only tiny floating-point differences are acceptable (for example from a different library version or machine). If they do not match, debug each pipeline step to find where the two diverge.
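The comparison itself can be automated with a tolerance instead of checking numbers by eye. The sketch below is a minimal version using only the standard library. The `diff_metrics` helper, the tolerance, and the metric values are all hypothetical, chosen purely for illustration.

```python
import math


def diff_metrics(expected: dict, actual: dict, rel_tol: float = 1e-6) -> list:
    """Return readable mismatches between two metric dicts; an empty list means they agree."""
    problems = []
    for name in sorted(set(expected) | set(actual)):
        if name not in expected or name not in actual:
            problems.append(f"{name}: present in only one run")
        elif not math.isclose(expected[name], actual[name], rel_tol=rel_tol):
            problems.append(f"{name}: notebook={expected[name]} script={actual[name]}")
    return problems


# Hypothetical numbers, as if copied from a notebook run and a script run.
notebook_metrics = {"accuracy": 0.8731, "f1": 0.8412}
script_metrics = {"accuracy": 0.8731000001, "f1": 0.8390}

mismatches = diff_metrics(notebook_metrics, script_metrics)
```

In this made-up case the accuracy values agree within tolerance, while f1 is reported as a mismatch that would need investigation.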

4

Pattern: Separate Config From Code

Notebooks often hardcode hyperparameters and paths directly in cells. When refactoring to a script, move these values into a separate config file.

# configs/train.yaml
data_path: data/processed/train.csv
target_col: target
model_params:
  n_estimators: 100
  max_depth: 10
  random_state: 42
output_path: models/model.pkl

Load the config in the entry script:

import yaml

with open("configs/train.yaml") as f:
    config = yaml.safe_load(f)

model = train_model(X, y, **config["model_params"])

Concrete benefits:

Changing a hyperparameter does not require changing code.
Experiment tracking: each config file is one experiment, committed together with the code.
CI/CD can point the script at a different config file, or override values through environment variables if the entry script reads them, without modifying the code.

Use relative paths (data/processed/train.csv) in the config, not absolute paths (/Users/username/project/data/...). Absolute paths fail when the code runs on another machine or inside a Docker container.
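Environment-variable overrides do not happen automatically; the entry script has to implement them. The sketch below shows one possible convention: a variable named `TRAIN_<KEY>` replaces a top-level string value in the config. The prefix, both helper names, and the sample values are assumptions made for this example. The second helper anchors a relative path at the project root so the script does not depend on the current working directory.

```python
from pathlib import Path


def apply_env_overrides(config: dict, environ: dict, prefix: str = "TRAIN_") -> dict:
    """Return a copy of config where TRAIN_<KEY> variables replace top-level string values."""
    merged = dict(config)
    for key, value in config.items():
        env_name = prefix + key.upper()
        if env_name in environ and isinstance(value, str):
            merged[key] = environ[env_name]
    return merged


def resolve_path(relative: str, project_root: Path) -> Path:
    """Anchor a relative config path at the project root instead of the current directory."""
    return (project_root / relative).resolve()


config = {"data_path": "data/processed/train.csv", "target_col": "target"}
fake_env = {"TRAIN_DATA_PATH": "data/processed/ci_sample.csv"}  # stand-in for os.environ
overridden = apply_env_overrides(config, fake_env)
```

In a real entry script you would pass `os.environ` instead of `fake_env`. Nested values such as `model_params` are deliberately left alone here to keep the sketch short.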

5

Pattern: Dependency Injection

A function should not load data or create its dependencies internally. It receives everything through arguments:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier


# Bad: the function hardcodes the path, so it cannot be tested in isolation
def train():
    df = pd.read_csv("data/train.csv")
    X = df.drop("target", axis=1)
    y = df["target"]
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X, y)
    return model


# Good: receives data and params through arguments
def train_model(X: pd.DataFrame, y: pd.Series, **model_params) -> RandomForestClassifier:
    model = RandomForestClassifier(**model_params)
    model.fit(X, y)
    return model

The "good" version can be called in a unit test with fake data, without needing a real file:

def test_train_model():
    X = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    y = pd.Series([0, 1, 0])
    model = train_model(X, y, n_estimators=5, random_state=0)
    assert hasattr(model, "predict")

The "bad" version cannot be tested this way because it will try to read a real file from the hardcoded path.

6

Logging Instead of Print

print() is fine for debugging in a notebook. In a production script, use the logging module from the Python standard library:

import logging

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

logger = logging.getLogger(__name__)


def train_model(X: pd.DataFrame, y: pd.Series, **model_params) -> RandomForestClassifier:
    # %-style arguments are only formatted if the record is actually emitted
    logger.info("Training on %d samples, %d features", len(X), X.shape[1])
    model = RandomForestClassifier(**model_params)
    model.fit(X, y)
    logger.info("Training complete")
    return model

Configure logging in the entry script:

# scripts/train.py
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

Compared with print:

Logging has levels: DEBUG, INFO, WARNING, ERROR. You can turn off DEBUG in production without changing code.
Logging can record a timestamp and the module name, which makes tracing easier when a long pipeline fails.
Logging can write to a file and stdout at the same time.
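Writing to a file and stdout at the same time means attaching two handlers to the root logger. The sketch below is a minimal version that reuses the format string from the entry script above. The `configure_logging` name and the idea of calling it once at startup are assumptions for this example.

```python
import logging
import sys


def configure_logging(log_file: str, level: int = logging.INFO) -> None:
    """Send every record at or above `level` to both stdout and a log file."""
    formatter = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")

    stream_handler = logging.StreamHandler(sys.stdout)
    file_handler = logging.FileHandler(log_file, encoding="utf-8")

    root = logging.getLogger()
    root.setLevel(level)
    for handler in (stream_handler, file_handler):
        handler.setFormatter(formatter)
        root.addHandler(handler)
```

Call it once in the entry script, for example `configure_logging("train.log")`. Calling it repeatedly would attach duplicate handlers and print every message more than once.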

Lesson 10 goes into more detail on configuring logging with multiple handlers and log rotation.

7

Basic Error Handling

In a notebook, an exception makes the cell fail and you fix it on the spot. In a script that runs unattended (scheduled job, CI pipeline), errors need to be handled and logged with a clear cause:

def load_data(path: str) -> pd.DataFrame:
    try:
        df = pd.read_csv(path)
        logger.info(f"Loaded {len(df)} rows from {path}")
        return df
    except FileNotFoundError:
        logger.error(f"Data file not found: {path}")
        raise
    except pd.errors.EmptyDataError:
        logger.error(f"Data file is empty: {path}")
        raise

The general pattern: log the error with specific context (path, relevant params), then re-raise so the caller decides what to do next. Do not swallow exceptions with except Exception: pass; it is a common mistake that hides important failure signals.
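The "caller decides" part usually ends at the entry point, where an unattended job has to turn a failure into a log record and a non-zero exit code. The sketch below shows one way to do that. `run_safely` and `failing_step` are hypothetical names used only for this demonstration. Unlike except Exception: pass, the exception is logged with its traceback and reported to the caller through the return value.

```python
import logging

logger = logging.getLogger(__name__)


def run_safely(step, *args) -> int:
    """Run a pipeline step; return 0 on success, 1 after logging any failure."""
    try:
        step(*args)
    except Exception:
        # logger.exception records the message plus the full traceback
        logger.exception("Pipeline step %s failed with args %r", step.__name__, args)
        return 1
    return 0


def failing_step(path: str) -> None:
    """Hypothetical step that always fails, used only to demonstrate the wrapper."""
    raise FileNotFoundError(path)


exit_code = run_safely(failing_step, "data/missing.csv")
# An entry script would typically finish with sys.exit(exit_code)
# so that a scheduler or CI system sees the failure.
```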

8

Type Hints and Docstrings

Type hints (Python 3.5+) and docstrings are the minimum for AI/ML code to be considered professional:

import math

import pandas as pd


def feature_engineer(df: pd.DataFrame) -> pd.DataFrame:
    """Create features from the raw DataFrame.

    Args:
        df: DataFrame with columns: age (int), income (float), category (str).

    Returns:
        The original DataFrame plus the columns: age_bucket (category), income_log (float).
    """
    df = df.copy()
    df["age_bucket"] = pd.cut(df["age"], bins=5)
    # log is undefined for non-positive income, so those values are left unchanged
    df["income_log"] = df["income"].apply(lambda x: math.log(x) if x > 0 else x)
    return df

Type hints are not enforced at runtime, but:

IDEs (VS Code, PyCharm) use them for autocomplete and type-mismatch warnings.
mypy uses them to statically check the whole codebase.
Readers understand the input and output immediately without reading the function body.

Lesson 9 goes deeper into advanced type hints and docstring conventions (Google style vs NumPy style).

9

CLI Arguments

Instead of hardcoding paths in code, take input through CLI arguments so the script can be reused with different configs.

Argparse (Standard Library)

import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Train ML model from config")
    parser.add_argument("--config", required=True, help="Path to YAML config file")
    parser.add_argument("--output", default="models/model.pkl", help="Output model path")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    # assumes main() has been extended to main(config_path, output_path)
    main(args.config, args.output)

python scripts/train.py --config configs/train.yaml --output models/v2.pkl

Typer (Third-party, Cleaner Syntax)

typer builds the CLI from a function's type hints, with less boilerplate than argparse:

import typer


def train(
    config: str = typer.Option("configs/train.yaml", help="Path to YAML config"),
    output: str = typer.Option("models/model.pkl", help="Output model path"),
) -> None:
    main(config, output)


if __name__ == "__main__":
    typer.run(train)

Typer generates --help, type validation, and error messages automatically. It suits scripts with many arguments.
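Once an output path can come from both the YAML config and the command line, the script needs a rule for which one wins. The sketch below assumes the CLI value takes precedence and the config's `output_path` is the fallback. `build_parser` and `resolve_output` are hypothetical helpers. Passing `argv` explicitly also makes the parser easy to unit test.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Train ML model from config")
    parser.add_argument("--config", default="configs/train.yaml", help="Path to YAML config file")
    parser.add_argument("--output", default=None, help="Override output_path from the config")
    return parser


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    # argv=None makes argparse read sys.argv[1:]; tests can pass a list instead
    return build_parser().parse_args(argv)


def resolve_output(config: dict, cli_output: Optional[str]) -> str:
    """The CLI value wins; otherwise fall back to the config file's output_path."""
    return cli_output if cli_output is not None else config["output_path"]


args = parse_args(["--output", "models/v2.pkl"])
output_path = resolve_output({"output_path": "models/model.pkl"}, args.output)
```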

10

Automatic Conversion: Jupyter nbconvert

nbconvert is installed along with Jupyter and converts .ipynb to .py automatically:

jupyter nbconvert --to script notebooks/train_model.ipynb
# Produces: notebooks/train_model.py

The nbconvert output is flat code: the cells are concatenated in order, each preceded by a marker comment:

# In[1]:

import pandas as pd
df = pd.read_csv("data/train.csv")

# In[2]:

df = df.dropna()

This file is a starting point, not the final result. Manual refactoring is still needed:

Remove magics such as %matplotlib inline, %load_ext, %%time: they are not valid Python, and nbconvert rewrites them as get_ipython() calls that fail outside IPython.
Group the flat code into functions.
Move imports to the top of the file.
Remove any leftover output or comment residue that --to script kept from the notebook.
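The mechanical part of this cleanup can be scripted. The sketch below assumes cell markers of the form `# In[n]:` (what the script exporter writes) or `# %%` (the percent format), and magics that appear as single-line `get_ipython()` calls. The `clean_converted_script` helper is hypothetical and line-based, so it would need care if a magic were the only statement inside an indented block.

```python
import re

# Matches "# In[3]:", "# In[ ]:" and percent-format markers such as "# %%"
CELL_MARKER = re.compile(r"^#\s*(In\[[ \d]*\]:|%%.*)\s*$")


def clean_converted_script(source: str) -> str:
    """Drop cell markers and get_ipython() calls, then collapse runs of blank lines."""
    kept = []
    for line in source.splitlines():
        if CELL_MARKER.match(line):
            continue
        if line.lstrip().startswith("get_ipython()"):
            continue
        kept.append(line)
    cleaned = "\n".join(kept)
    # Removing lines leaves gaps; squeeze 3+ newlines down to one blank line
    return re.sub(r"\n{3,}", "\n\n", cleaned).strip() + "\n"
```

Grouping code into functions and reordering imports still has to be done by hand, because both depend on understanding what the code does.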

11

Hybrid Pattern: Notebook + src/

A notebook still has value as a visual document for a portfolio: recruiters can view it directly on GitHub without running anything. The hybrid pattern keeps the notebook as a "showcase" while the real logic lives in src/:

# notebooks/01-demo.ipynb
# Import from src instead of redefining the code
from src.data.load import load_data
from src.data.preprocess import clean_data, feature_engineer, split_x_y
from src.models.train import train_model
from src.evaluation.metrics import compute_metrics

# The demo only calls functions; it contains no logic
df = load_data("data/sample.csv")
df = clean_data(df)
df = feature_engineer(df)
X, y = split_x_y(df, "target")

model = train_model(X, y, n_estimators=100, random_state=42)
metrics = compute_metrics(model, X, y)
print(metrics)

The resulting directory structure:

project/
├── notebooks/
│   └── 01-demo.ipynb    # showcase, imports from src
├── src/                 # real logic
│   ├── data/
│   ├── models/
│   └── evaluation/
├── scripts/
│   └── train.py         # production entry point
├── configs/
│   └── train.yaml
└── tests/
    └── test_preprocess.py

This pattern also lets CI/CD run python scripts/train.py without a Jupyter kernel.
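One practical detail of this layout: a notebook started inside notebooks/ usually cannot run `from src...` unless the project root is on the import path. The sketch below is one workaround for the first notebook cell. It assumes the root is the nearest parent directory that contains src/, and both helper names are hypothetical. Installing the project as a package is a common alternative that avoids touching `sys.path`.

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found."""
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}' directory at or above {start}")


def add_project_root_to_path(start: Path) -> Path:
    """Make `import src...` work from notebooks/ by putting the root on sys.path."""
    root = find_project_root(start.resolve())
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root
```

In the notebook this would be called as `add_project_root_to_path(Path.cwd())` before the `from src...` imports.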

12

Testing After the Refactor

After refactoring, write unit tests for the important functions. The initial goal is not 100% coverage but confirming that behavior has not changed compared with the notebook:

# tests/test_preprocess.py
import pandas as pd
import pytest
from src.data.preprocess import clean_data, feature_engineer, split_x_y


def test_clean_data_removes_na():
    df = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", "z"]})
    result = clean_data(df)
    assert len(result) == 2
    assert result["a"].isna().sum() == 0


def test_clean_data_returns_new_dataframe():
    df = pd.DataFrame({"a": [1.0, None]})
    result = clean_data(df)
    # clean_data should not mutate the original df
    assert len(df) == 2  # original unchanged
    assert len(result) == 1


def test_split_x_y_shape():
    df = pd.DataFrame({"feat1": [1, 2], "feat2": [3, 4], "label": [0, 1]})
    X, y = split_x_y(df, "label")
    assert "label" not in X.columns
    assert len(X) == len(y) == 2

Run the tests:

pytest tests/ -v

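A test that fails after the refactor is a concrete signal for locating a difference. Passing tests do not guarantee that model metrics are identical (that still requires the verification in step 6 of the process), but they do confirm that the covered data-pipeline functions behave as expected.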

13

Jupytext: Syncing Notebook and Script

Jupytext is a library that keeps an .ipynb and a .py file in sync. It is useful when you want to track a notebook in git without the noise of JSON and embedded output.

Installation

pip install jupytext

Pair the Notebook With a Script

# Pair notebook.ipynb with notebook.py (percent format)
jupytext --set-formats ipynb,py:percent notebooks/train_model.ipynb

# After that, you can sync in either direction
jupytext --sync notebooks/train_model.ipynb

The py:percent format uses # %% to separate cells, which is compatible with the VS Code Interactive Window and Spyder.

Clean Git Diffs

Commit the .py file instead of the .ipynb. The .py file is plain text with clear diffs. The .ipynb can be added to .gitignore or committed separately.

Limitations

The paired .py file holds code and cell metadata, not outputs (plots, DataFrame displays). If you need to keep outputs in the notebook, you still have to commit the .ipynb. Jupytext's main use case is a team that wants to code-review notebooks through text diffs, not to review outputs.

14

Common Pitfalls

Half-Finished Refactor

Part of the logic lives in src/ and is imported back into the notebook, while the rest still sits in notebook cells. As a result, it is no longer clear where a given piece of logic actually runs. Fix: finish the refactor completely, or leave the notebook as it is and do not refactor yet; do not leave it in a halfway state.

Functions That Are Too Large After Refactoring

A 100-line cell gets extracted as a single function. That function is still hard to test and hard to reuse. If a function is longer than 30-40 lines and mixes several distinct pieces of logic, break it down further into smaller functions.

Hardcoded Absolute Paths

# Wrong: fails on another machine, in Docker, or on CI
df = pd.read_csv("/Users/username/Desktop/project/data/train.csv")

# Right: relative path, loaded from config
df = pd.read_csv(config["data_path"])  # "data/train.csv" in the YAML

Global Variables From the Notebook

Notebooks use global variables (df, model, X_train) freely across cells. When refactoring into functions, these variables must be passed as arguments; a forgotten argument, which typically surfaces as a NameError, is the most common error.
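The difference is easy to see in isolation. In the sketch below, the first function relies on a name that only exists if some earlier cell happened to create it, while the second receives the same data explicitly. Both functions and the sample values are made up for illustration.

```python
def summarize_from_global() -> float:
    """Relies on a module-level `values` that only exists if some cell created it."""
    return sum(values) / len(values)  # noqa: F821 (undefined on purpose for the demo)


def summarize(values: list) -> float:
    """Same logic, but the dependency is an explicit argument."""
    return sum(values) / len(values)


try:
    summarize_from_global()
    outcome = "ran"
except NameError as exc:
    outcome = f"NameError: {exc}"

mean = summarize([2.0, 4.0, 6.0])
```

The first call fails as soon as the function runs in a fresh process, which is exactly what happens when notebook code is moved into a module without adding the missing parameter.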

Circular Import

This happens when src/data/preprocess.py imports from src/models/train.py and vice versa. Symptom: ImportError: cannot import name 'X' from partially initialized module. Fix: move the shared logic into a third module (src/utils/), or restructure so that dependencies point in one direction only.

Metrics Do Not Match After Refactoring

Common causes:

Feature engineering differs at one step (for example, dropna before vs after creating features).
The random seed is not set in the right place.
The train/test split in the script uses a different order than the notebook.

Debug by comparing the shape and a sample of the DataFrame after each step between the notebook and the script.
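The seed-placement cause deserves a concrete illustration. If a split relies on a global random state, any extra random call made earlier (for instance in a notebook cell that the script does not have) shifts the result. The sketch below uses only the standard library and a hypothetical `split_indices` helper to show the safer arrangement, where the seed is applied inside the function through a private generator.

```python
import random


def split_indices(n_rows: int, test_fraction: float, seed: int) -> tuple:
    """Shuffle row indices with a private, seeded generator and split them."""
    rng = random.Random(seed)  # local generator: unaffected by other random calls
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


train_a, test_a = split_indices(10, 0.2, seed=42)

random.random()  # unrelated call, e.g. left over from an extra notebook cell
train_b, test_b = split_indices(10, 0.2, seed=42)

same_split = (train_a, test_a) == (train_b, test_b)
```

Because the generator is created inside the function, `same_split` holds regardless of what other code did with the global random state beforehand.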

15

Summary

Notebooks suit EDA and prototyping. Scripts are needed to deploy, schedule, test, and review in CI.

The 6-step process: clean up the notebook → identify components → extract functions → move them into modules → write the entry script → verify the results.

Patterns to master: separating config (YAML), dependency injection (no hardcoding inside functions), logging instead of print, error handling that logs context.

The hybrid pattern keeps the notebook as a showcase and the real logic in src/; the notebook only imports and calls functions.

Write unit tests after refactoring to confirm that behavior has not changed. High coverage is not needed right away; covering the most important transformations is enough.

16

Next Lesson

Lesson 9: Type hints and docstrings in Python AI code. It goes deeper into advanced type annotations, Generic types, Protocol, and docstring conventions (Google style vs NumPy style), with practical examples from ML codebases.
