Bài 10: Logging và error handling chuyên nghiệp

1

Mục Tiêu Bài Học

Sau khi hoàn thành bài này, bạn sẽ:

✅ Thay thế print() bằng logging module đúng cách
✅ Chọn log level phù hợp cho từng tình huống
✅ Cấu hình logging tập trung cho cả project
✅ Viết try/except bắt exception cụ thể, không generic
✅ Tạo custom exception mô tả lỗi domain
✅ Dùng tenacity để retry transient error

2

Vấn Đề Với print() Và except: pass

Code notebook thường có dạng:

print("Training started...")
try:
    model.fit(X, y)
except Exception as e:
    print(f"Error: {e}")
    pass
print("Done")

Những vấn đề cụ thể với pattern này:

print() không có level — không phân biệt info, warning, error.
print() không có timestamp — không biết sự kiện xảy ra lúc nào.
print() không có nguồn (module nào in ra) — khó trace khi project nhiều file.
except Exception as e: print(e) mất traceback gốc — debug chậm hơn nhiều.
except: pass (silent error) — lỗi xảy ra nhưng code tiếp tục chạy, kết quả sai mà không biết tại sao.

Khi người review code (technical interview, code review) thấy print() rải đầy file và except: pass, họ sẽ đặt câu hỏi về khả năng debug và bảo trì code của bạn. Một codebase có logging và exception handling đúng cách thể hiện bạn đã làm việc với code production, không chỉ ở notebook.

3

Python logging Module — Cơ Bản

logging là module có sẵn trong standard library của Python, không cần cài thêm.

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)

logger = logging.getLogger(__name__)

logger.debug("Detailed info for debug")
logger.info("General info")
logger.warning("Warning, not fatal")
logger.error("Error, may continue")
logger.critical("Critical, may exit")

Output ví dụ:

2026-05-27 10:00:01 [INFO] __main__: General info
2026-05-27 10:00:01 [WARNING] __main__: Warning, not fatal
2026-05-27 10:00:01 [ERROR] __main__: Error, may continue
2026-05-27 10:00:01 [CRITICAL] __main__: Critical, may exit

Dòng DEBUG không xuất hiện vì level=logging.INFO — chỉ log từ INFO trở lên.

Format placeholders quan trọng:

%(asctime)s — thời gian log
%(levelname)s — tên level (INFO, WARNING, ...)
%(name)s — tên logger, thường là tên module
%(message)s — nội dung log
%(filename)s:%(lineno)d — file và dòng code (thêm vào format nếu cần trace nhanh)

4

Log Level Và Khi Dùng

5 level theo thứ tự tăng dần về độ nghiêm trọng:

Level	Giá trị số	Dùng khi
`DEBUG`	10	Chi tiết để debug: giá trị biến, shape tensor, thứ tự function gọi. Tắt trong production.
`INFO`	20	Tiến trình bình thường: bắt đầu train, save model, load data xong, inference thành công.
`WARNING`	30	Bất thường nhưng không fail: file config thiếu nên dùng default, retry lần 1, feature bị deprecate.
`ERROR`	40	Lỗi cụ thể xảy ra nhưng có thể tiếp tục xử lý: 1 batch fail, 1 sample invalid.
`CRITICAL`	50	Lỗi nghiêm trọng, hệ thống có thể phải dừng: database mất kết nối, model không load được.

Quy tắc đơn giản: đặt level=logging.DEBUG khi chạy local, level=logging.INFO hoặc WARNING khi deploy production.

5

Logger Per Module

Best practice: mỗi module tạo 1 logger riêng với tên = __name__.

# src/models/train.py
import logging

logger = logging.getLogger(__name__)
# __name__ = "src.models.train" khi import từ bên ngoài

def train_epoch(model, dataloader, optimizer):
    logger.debug(f"Starting epoch, batches={len(dataloader)}")
    total_loss = 0.0
    for batch_idx, (X, y) in enumerate(dataloader):
        loss = model.train_step(X, y, optimizer)
        total_loss += loss
        if batch_idx % 100 == 0:
            logger.debug(f"Batch {batch_idx}/{len(dataloader)}, loss={loss:.4f}")
    avg_loss = total_loss / len(dataloader)
    logger.info(f"Epoch done. avg_loss={avg_loss:.4f}")
    return avg_loss

Vì tên logger là src.models.train, hệ thống logging của Python tự động tổ chức theo cấu trúc cây: src → src.models → src.models.train. Điều này cho phép tắt toàn bộ log của src.models chỉ bằng 1 dòng mà không ảnh hưởng các module khác:

logging.getLogger("src.models").setLevel(logging.WARNING)

Không dùng:

logging.getLogger("myapp") — tên hard-code, mất hierachy khi refactor.
logging.getLogger() (root logger) — ảnh hưởng toàn bộ logging của process, kể cả library bên ngoài.

6

Cấu Hình Logging Cho Cả Project

Tập trung config vào 1 file, gọi 1 lần duy nhất ở entry point.

# src/core/logging_config.py
import logging
import sys
from pathlib import Path


def setup_logging(level: str = "INFO", log_file: str | None = None) -> None:
    """Configure root logger for the project.

    Args:
        level: Log level string. One of DEBUG, INFO, WARNING, ERROR, CRITICAL.
        log_file: Optional path to log file. If None, logs only to stdout.
    """
    fmt = "%(asctime)s [%(levelname)s] %(name)s: %(message)s"
    datefmt = "%Y-%m-%d %H:%M:%S"

    handlers: list[logging.Handler] = [
        logging.StreamHandler(sys.stdout),
    ]

    if log_file is not None:
        Path(log_file).parent.mkdir(parents=True, exist_ok=True)
        handlers.append(logging.FileHandler(log_file, encoding="utf-8"))

    logging.basicConfig(
        level=level,
        format=fmt,
        datefmt=datefmt,
        handlers=handlers,
        force=True,  # Xóa handlers cũ nếu basicConfig đã gọi trước đó
    )

    # Suppress noisy library logs
    logging.getLogger("urllib3").setLevel(logging.WARNING)
    logging.getLogger("httpx").setLevel(logging.WARNING)
    logging.getLogger("matplotlib").setLevel(logging.WARNING)
    logging.getLogger("PIL").setLevel(logging.WARNING)

Gọi ở entry point:

# train.py hoặc main.py
from src.core.logging_config import setup_logging

setup_logging(level="INFO", log_file="logs/train.log")

# Sau đó các module khác chỉ cần:
# logger = logging.getLogger(__name__)
# và sẽ tự inherit format, level đã config

Tham số force=True (Python 3.8+) quan trọng khi chạy trong notebook hoặc test — tránh bị ignore vì basicConfig chỉ có tác dụng lần đầu gọi.

7

Logging Cho ML/AI

Các milestone nên log trong ML workflow:

import time
import logging

logger = logging.getLogger(__name__)


def train_model(
    X,
    y,
    model_class,
    **params,
):
    logger.info(
        f"Training start: n_samples={len(X)}, n_features={X.shape[1]}, "
        f"model={model_class.__name__}, params={params}"
    )

    start = time.time()
    model = model_class(**params)
    model.fit(X, y)
    elapsed = time.time() - start

    train_acc = model.score(X, y)
    logger.info(
        f"Training done in {elapsed:.1f}s. "
        f"train_accuracy={train_acc:.4f}"
    )

    return model


def load_data(path: str):
    logger.info(f"Loading data from {path}")
    import pandas as pd
    df = pd.read_csv(path)
    logger.info(f"Data loaded: shape={df.shape}, columns={list(df.columns)}")
    return df


def save_model(model, path: str) -> None:
    import joblib
    logger.info(f"Saving model to {path}")
    joblib.dump(model, path)
    logger.info(f"Model saved successfully")

Những gì nên log trong ML:

Shape của dataset khi load (phát hiện data mismatch sớm).
Hyperparameter đang dùng (trace lại experiment sau).
Thời gian training (phát hiện regression về performance).
Metric sau mỗi epoch hoặc sau train xong.
Path của model save/load (tránh nhầm model cũ/mới).

Không nên log trong ML:

Toàn bộ DataFrame hoặc array lớn — chỉ log shape và thống kê cơ bản.
Loss từng batch trong production — chỉ log từng epoch hoặc mỗi N batch.

8

Structured Logging — JSON Format

Log dạng text thuần phù hợp cho development. Log dạng JSON phù hợp hơn khi log được đẩy lên log aggregator (ELK, Datadog, Loki) vì có thể filter và query theo field.

Cần cài thêm: pip install python-json-logger

import logging
from pythonjsonlogger import jsonlogger


def setup_json_logging(level: str = "INFO") -> None:
    handler = logging.StreamHandler()
    formatter = jsonlogger.JsonFormatter(
        "%(asctime)s %(levelname)s %(name)s %(message)s"
    )
    handler.setFormatter(formatter)

    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(level)


logger = logging.getLogger(__name__)

# Thêm context vào log entry qua extra={}
logger.info(
    "Training complete",
    extra={
        "model": "random_forest",
        "accuracy": 0.923,
        "training_time_sec": 47.3,
        "n_estimators": 200,
    },
)

Output JSON:

{"asctime": "2026-05-27 10:00:01", "levelname": "INFO", "name": "src.models.train", "message": "Training complete", "model": "random_forest", "accuracy": 0.923, "training_time_sec": 47.3, "n_estimators": 200}

Lưu ý: python-json-logger (pythonjsonlogger) — package tên cũ là python-json-logger, hiện là python-json-logger>=2.0. Không nhầm với jsonlogger khác trên PyPI.

9

Tránh Log Sensitive Data

Các loại dữ liệu không được log:

API key, secret token, password.
PII (Personally Identifiable Information): tên, email, CMND, số thẻ.
Full prompt và response của LLM trong production — tốn quota log và có thể chứa PII của user.

Nếu cần log một phần để debug, truncate hoặc mask:

def mask_api_key(key: str) -> str:
    if len(key) <= 8:
        return "***"
    return key[:4] + "..." + key[-4:]


def truncate(text: str, max_len: int = 100) -> str:
    if len(text) <= max_len:
        return text
    return text[:max_len] + f"...[truncated, total={len(text)}]"


# Sử dụng
api_key = "sk-abcdefghijklmnopqrstuvwxyz1234567890"
logger.debug(f"Using API key: {mask_api_key(api_key)}")
# → Using API key: sk-a...7890

user_prompt = "...very long user input..."
logger.info(f"Received prompt: {truncate(user_prompt, 100)}")

10

Error Handling — try/except Đúng Cách

Anti-pattern:

# Bắt tất cả exception nhưng chỉ print, mất traceback
try:
    result = process_data(df)
except Exception as e:
    print(f"Error: {e}")
    return None

# Silent error — bug ẩn
try:
    config = load_config("config.yaml")
except:
    pass

Pattern đúng — bắt exception cụ thể, xử lý phù hợp:

import logging

logger = logging.getLogger(__name__)


def load_config(path: str) -> dict:
    try:
        with open(path, encoding="utf-8") as f:
            import yaml
            return yaml.safe_load(f)
    except FileNotFoundError:
        logger.warning(f"Config file not found at {path}, using defaults")
        return {}
    except yaml.YAMLError as e:
        logger.error(f"Config file malformed: {e}")
        raise  # Re-raise — caller cần biết config không dùng được


def process_batch(batch: list) -> list:
    results = []
    for item in batch:
        try:
            results.append(transform(item))
        except ValueError as e:
            logger.warning(f"Skipping invalid item: {e}. item_id={item.get('id')}")
            # Tiếp tục xử lý batch, skip item lỗi
        except Exception:
            logger.exception(f"Unexpected error processing item_id={item.get('id')}")
            raise  # Lỗi không expected → raise để không tiếp tục sai
    return results

Quy tắc khi nào raise, khi nào skip:

Lỗi expected và có thể phục hồi (file thiếu dùng default, 1 sample lỗi skip): xử lý và tiếp tục.
Lỗi không expected (Exception generic): log đầy đủ rồi raise — đừng nuốt lỗi.
Lỗi liên quan logic domain (data sai hoàn toàn, model không load): raise custom exception.

11

Custom Exception Cho Domain

Custom exception giúp caller biết chính xác loại lỗi mà không cần đọc message string.

# src/exceptions.py


class AppBaseError(Exception):
    """Base exception for this project."""


class ModelLoadError(AppBaseError):
    """Raised when model loading fails."""


class InferenceError(AppBaseError):
    """Raised when inference fails."""


class DataValidationError(AppBaseError):
    """Raised when input data fails validation."""

Sử dụng với raise ... from e:

import os
import joblib
import logging
from src.exceptions import ModelLoadError

logger = logging.getLogger(__name__)


def load_model(path: str):
    if not os.path.exists(path):
        raise ModelLoadError(f"Model file not found: {path}")
    try:
        model = joblib.load(path)
        logger.info(f"Model loaded from {path}")
        return model
    except Exception as e:
        raise ModelLoadError(f"Failed to load model from {path}: {e}") from e
        # `from e` giữ nguyên original exception trong __cause__
        # traceback sẽ hiển thị cả 2 exception, giúp debug nhanh hơn

Caller bắt custom exception:

from src.exceptions import ModelLoadError


def run_inference(model_path: str, features: list):
    try:
        model = load_model(model_path)
        return model.predict(features)
    except ModelLoadError as e:
        logger.critical(f"Cannot run inference: {e}")
        raise

Pattern raise X from e (exception chaining) quan trọng: nó gắn original_exception.__cause__ = e, Python hiển thị cả 2 khi in traceback, giúp phân biệt "lỗi tôi định nghĩa" và "lỗi gốc từ thư viện".

12

logger.exception vs logger.error

Hai method gần giống nhau nhưng có 1 khác biệt quan trọng:

logger.error("msg") — log message ở level ERROR, không có traceback.
logger.exception("msg") — log message ở level ERROR, kèm full traceback. Chỉ dùng bên trong except block.

try:
    result = risky_computation(data)
except ZeroDivisionError as e:
    # Dùng .error khi bạn muốn log lỗi nhưng đã có đủ context
    logger.error(f"Division by zero with input: {data['denominator']}")
    return default_value

try:
    model = load_model(path)
except Exception:
    # Dùng .exception khi lỗi là unexpected — cần full traceback để debug
    logger.exception("Unexpected error loading model")
    raise

Output của logger.exception:

2026-05-27 10:00:01 [ERROR] src.inference: Unexpected error loading model
Traceback (most recent call last):
  File "src/inference.py", line 42, in load_model
    model = pickle.load(f)
  ...
EOFError: Ran out of input

Với logger.error thì chỉ có dòng đầu, không có traceback block phía dưới. Trong portfolio code, dùng logger.exception cho unexpected errors để cho thấy bạn biết tầm quan trọng của traceback khi debug production.

13

Retry Pattern Với tenacity

Trong code AI, gọi external service (OpenAI API, HuggingFace API, database) hay gặp transient error: timeout, rate limit, network hiccup. Retry thủ công dễ viết sai (không có backoff, retry vô hạn).

Cài: pip install tenacity (tenacity 8.x)

import logging
import requests
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type,
    before_sleep_log,
)

logger = logging.getLogger(__name__)


@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    retry=retry_if_exception_type((ConnectionError, TimeoutError)),
    before_sleep=before_sleep_log(logger, logging.WARNING),
    reraise=True,  # Raise exception gốc sau khi hết lần retry
)
def call_external_api(url: str, payload: dict) -> dict:
    """Call external API with retry on transient errors."""
    response = requests.post(url, json=payload, timeout=5)
    response.raise_for_status()
    return response.json()

Giải thích các tham số:

stop_after_attempt(3) — thử tối đa 3 lần, sau đó raise.
wait_exponential(multiplier=1, min=2, max=10) — chờ 2s, 4s, 8s (capped 10s) giữa các lần retry. Tránh flood server khi đang bị lỗi.
retry_if_exception_type(...) — chỉ retry nếu exception là loại transient. Không retry ValueError hay AuthError.
before_sleep_log(logger, logging.WARNING) — tự động log warning trước mỗi lần sleep, tiện hơn lambda.
reraise=True — sau khi hết lần retry, raise exception gốc thay vì RetryError wrapper.

Retry cho OpenAI API (rate limit):

from openai import RateLimitError, APIConnectionError
from tenacity import retry_if_exception_type, wait_exponential, stop_after_attempt, retry


@retry(
    retry=retry_if_exception_type((RateLimitError, APIConnectionError)),
    wait=wait_exponential(multiplier=2, min=5, max=60),
    stop=stop_after_attempt(5),
    before_sleep=before_sleep_log(logger, logging.WARNING),
)
def chat_completion(client, messages: list) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    return response.choices[0].message.content

14

Context Manager log_duration

Một pattern hay cho portfolio: dùng context manager để log thời gian của bất kỳ block code nào, tái sử dụng không cần copy-paste start = time.time() mọi nơi.

import time
import logging
from contextlib import contextmanager
from typing import Generator

logger = logging.getLogger(__name__)


@contextmanager
def log_duration(operation: str, level: int = logging.INFO) -> Generator[None, None, None]:
    """Context manager that logs start, end, and elapsed time of an operation.

    Args:
        operation: Human-readable description of the operation.
        level: Log level to use for the messages.

    Example:
        with log_duration("data loading"):
            df = pd.read_csv("large_file.csv")
    """
    logger.log(level, f"Starting: {operation}")
    start = time.perf_counter()
    try:
        yield
    except Exception:
        elapsed = time.perf_counter() - start
        logger.log(level, f"Failed: {operation} after {elapsed:.2f}s")
        raise
    else:
        elapsed = time.perf_counter() - start
        logger.log(level, f"Completed: {operation} in {elapsed:.2f}s")

Sử dụng:

with log_duration("model training"):
    model = train_model(X_train, y_train)

with log_duration("embedding generation", level=logging.DEBUG):
    embeddings = encoder.encode(documents)

with log_duration("database query"):
    results = db.execute(query).fetchall()

Output:

2026-05-27 10:00:01 [INFO] src.pipeline: Starting: model training
2026-05-27 10:00:48 [INFO] src.pipeline: Completed: model training in 47.23s

15

Logging Trong FastAPI

FastAPI dùng uvicorn làm ASGI server, uvicorn đã có logger riêng (uvicorn.access, uvicorn.error). Không cần ghi đè, chỉ cần tạo logger cho code của mình.

import logging
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from src.core.logging_config import setup_logging
from src.exceptions import DataValidationError, InferenceError

setup_logging(level="INFO")

app = FastAPI()
logger = logging.getLogger(__name__)


class PredictRequest(BaseModel):
    features: list[float]


@app.post("/predict")
async def predict(req: PredictRequest):
    logger.info(f"Predict request received: n_features={len(req.features)}")

    try:
        result = model.predict([req.features])[0]
    except DataValidationError as e:
        logger.warning(f"Invalid input: {e}")
        raise HTTPException(status_code=422, detail=str(e))
    except InferenceError as e:
        logger.error(f"Inference failed: {e}")
        raise HTTPException(status_code=500, detail="Inference error")
    except Exception:
        logger.exception("Unexpected error in /predict")
        raise HTTPException(status_code=500, detail="Internal error")

    logger.info(f"Predict success: result={result:.4f}")
    return {"prediction": result}

Điểm quan trọng: trả về detail="Internal error" cho client thay vì message lỗi thật — message lỗi đi vào server log, không lộ ra ngoài. Đây là security practice cơ bản khi build API.

16

Log File Rotation

Nếu chạy inference service dài hạn mà dùng FileHandler thẳng, log file sẽ tăng kích thước vô hạn. Dùng RotatingFileHandler để tự cắt file theo size, hoặc TimedRotatingFileHandler để cắt theo thời gian.

import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def setup_logging_with_rotation(
    log_file: str = "logs/app.log",
    level: str = "INFO",
    max_bytes: int = 10 * 1024 * 1024,  # 10MB mỗi file
    backup_count: int = 5,              # Giữ 5 file backup
) -> None:
    Path(log_file).parent.mkdir(parents=True, exist_ok=True)

    fmt = "%(asctime)s [%(levelname)s] %(name)s: %(message)s"
    formatter = logging.Formatter(fmt, datefmt="%Y-%m-%d %H:%M:%S")

    file_handler = RotatingFileHandler(
        log_file,
        maxBytes=max_bytes,
        backupCount=backup_count,
        encoding="utf-8",
    )
    file_handler.setFormatter(formatter)

    stream_handler = logging.StreamHandler()
    stream_handler.setFormatter(formatter)

    logging.basicConfig(level=level, handlers=[file_handler, stream_handler], force=True)

Với maxBytes=10MB và backupCount=5, hệ thống giữ tối đa 55MB log (5 file backup + 1 current). File rotate được đặt tên app.log.1, app.log.2, ..., app.log.5. File cũ nhất sẽ bị xóa khi rotate.

17

Production Tools

Python logging module xử lý tốt output log ra stdout/file. Việc aggregate, search, alert là nhiệm vụ của các tool bên ngoài:

Tool	Dùng cho	Ghi chú
Sentry	Error tracking, grouping, alert	Tích hợp với Python qua `sentry-sdk`; free tier cho cá nhân
Loki + Grafana	Log aggregation + visualization	Open source, self-host được; phù hợp portfolio
Datadog	Log, trace, metric tập trung	Paid; phổ biến ở startup/enterprise
AWS CloudWatch	Log khi deploy trên AWS	Tự động collect log từ ECS, Lambda
Google Cloud Logging	Log khi deploy trên GCP	Tương tự CloudWatch

Cho portfolio cá nhân: tích hợp Sentry (free) vào FastAPI app là đủ để chứng minh bạn biết error tracking. Chỉ cần thêm:

import sentry_sdk

sentry_sdk.init(
    dsn="https://<your-dsn>@sentry.io/<project-id>",
    traces_sample_rate=0.1,  # 10% request trace
)

18

Common Pitfalls

Các lỗi hay gặp và cách tránh:

Pitfall	Vấn đề	Cách sửa
`print()` thay `logger`	Không có level, timestamp, module name	Dùng `logger.info()`
Log trong mỗi iteration	Log spam khi dataset lớn	Log mỗi N iteration: `if i % 100 == 0`
`except: pass`	Bug ẩn, không biết lỗi xảy ra	Bắt exception cụ thể, log ít nhất `warning`
`except Exception as e: print(e)`	Mất traceback gốc	`logger.exception("...")` bên trong except
Log toàn bộ DataFrame	Log file bloat, chậm	Chỉ log `df.shape` và `df.dtypes`
Gọi `basicConfig` mỗi module	Format xung đột, handler bị duplicate	Gọi `setup_logging()` 1 lần ở entry point
Bật DEBUG trong production	Log volume tăng gấp 10-100x, lộ internal state	Đặt level theo env: `os.getenv("LOG_LEVEL", "INFO")`

Đặt log level từ environment variable:

import os
from src.core.logging_config import setup_logging

log_level = os.getenv("LOG_LEVEL", "INFO").upper()
setup_logging(level=log_level)

Khi chạy local: LOG_LEVEL=DEBUG python train.py. Khi deploy: để mặc định INFO hoặc WARNING.

19

Tóm Tắt

✅ Dùng logging module thay print() — có level, timestamp, module name.
✅ Mỗi module tạo 1 logger: logger = logging.getLogger(__name__).
✅ Config logging 1 lần ở entry point qua setup_logging(); suppress noisy library logs.
✅ Bắt exception cụ thể, không dùng except Exception nếu có thể tránh được.
✅ Dùng logger.exception() cho unexpected error — kèm full traceback.
✅ Custom exception mô tả lỗi domain (ModelLoadError, DataValidationError).
✅ raise X from e để preserve exception chain.
✅ tenacity 8.x cho retry transient error với exponential backoff.
✅ Không log sensitive data (API key, PII); mask hoặc truncate nếu cần debug.
✅ Đặt log level từ environment variable, không hard-code DEBUG trong production.

20

Bài Tiếp Theo

Bài 11: Project 1 — Customer Churn Prediction (ML cổ điển end-to-end)

Danh sách bài viết