Lesson 9: Type hints and docstrings in Python AI code

1

Learning objectives

After this lesson, you will be able to:

✅ Write type hints for functions, data structures, and AI/ML types
✅ Distinguish Optional, Union, Literal, and TypedDict
✅ Use NDArray, pd.DataFrame, and torch.Tensor in annotations
✅ Write docstrings in Google style and NumPy style
✅ Run mypy to detect type mismatches
✅ Recognize common anti-patterns and pitfalls

2

Why type hints and docstrings matter

What type hints give you

Python does not enforce types at runtime; this is a design choice of the language, not a bug. A type hint is an annotation read by IDEs and static checkers. The interpreter stores it but never checks it, so it has no practical effect on execution speed.

Concrete benefits:

More accurate IDE autocomplete: VS Code and PyCharm read type annotations to suggest methods. Without type hints, the IDE has to guess through inference, which often fails for generic dicts and lists.
Bugs caught before runtime: mypy or pyright reports a type mismatch while you write the code, instead of waiting until it actually runs.
Documentation built into the signature: the reader sees features: list[float] at a glance instead of reading the whole function to learn what features contains.

What docstrings give you

Type hints say what. Docstrings say why and how. The two complement each other; neither replaces the other.

Explain behavior, not just parameter types
State preconditions, exceptions, and edge cases
Auto-generate API documentation with Sphinx, MkDocs, or pdoc

A signal in your portfolio

Code with consistent type hints and docstrings shows that the author is used to team workflows and code review. This skill is assessed in technical screens: reviewers look not only at whether the code runs but at whether it is maintainable.

3

Basic type hints (Python 3.9+)

Syntax: param: Type for a parameter, -> ReturnType for the return value.

def add(a: int, b: int) -> int:
    return a + b

def greet(name: str = "World") -> str:
    return f"Hello, {name}"

def get_usernames() -> list[str]:
    return ["alice", "bob"]

def is_valid(score: float) -> bool:
    return 0.0 <= score <= 1.0

Python does not raise an error at runtime when a type is wrong. The following snippet runs without complaint:

def add(a: int, b: int) -> int:
    return a + b

add("hello", "world")  # Fine at runtime: returns "helloworld"
                        # But mypy/pyright reports an error

This is why a static checker is needed: Python does not perform this check itself.
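To make the point concrete, the following minimal sketch reuses the add function from above and inspects where the annotations actually live. It only illustrates that annotations are metadata on the function object; it is not a substitute for running a checker.

```python
def add(a: int, b: int) -> int:
    return a + b


# Annotations are stored on the function object as plain metadata.
print(add.__annotations__)

# The interpreter never consults them when the function is called,
# so passing strings simply concatenates them.
result = add("hello", "world")
print(result)
```

Only a tool that reads the annotations, such as mypy or pyright, turns them into an actual check.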

4

Types for data structures

Python 3.9+ lets you use the built-in list, dict, tuple, and set directly in annotations, with no import from typing.

def process_features(features: list[float]) -> dict[str, float]:
    mean = sum(features) / len(features)
    variance = sum((x - mean) ** 2 for x in features) / len(features)
    return {"mean": mean, "variance": variance}

def fetch_records() -> list[dict[str, str]]:
    return [{"id": "1", "name": "Alice"}, {"id": "2", "name": "Bob"}]

# tuple: fixed size, one type per position
def split_ratio() -> tuple[float, float, float]:
    return (0.7, 0.15, 0.15)

# set
def unique_labels(labels: list[str]) -> set[str]:
    return set(labels)

Python 3.8 (legacy): use from typing import List, Dict, Tuple, Set. These aliases have been deprecated since 3.9; they still work, and no removal date is currently scheduled.

# Python 3.8 — legacy style
from typing import List, Dict

def process(features: List[float]) -> Dict[str, float]:
    ...

# Python 3.9+ — preferred
def process(features: list[float]) -> dict[str, float]:
    ...

5

Optional and Union

Optional[X] is shorthand for Union[X, None]. Use it when a value, such as a function's return value, may be None.

from typing import Optional

# Optional[dict] = dict | None
def find_user(user_id: int) -> Optional[dict]:
    """Return user dict if found, None otherwise."""
    ...

# Python 3.10+: the shorter | syntax
def find_user(user_id: int) -> dict | None:
    ...

Use Union (or | from Python 3.10+) when a parameter accepts several types:

# Union of several types
def parse_input(x: int | str | float) -> float:
    return float(x)

# A parameter that may be None (optional parameter)
def train(
    X: list[list[float]],
    y: list[int],
    sample_weight: list[float] | None = None,
) -> None:
    ...

In practice, the X | None and X | Y syntax requires Python 3.10+. If the codebase must support 3.8/3.9, use Optional and Union from typing.
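A union is only useful if the function body handles each member. The sketch below uses a hypothetical to_seconds helper (not part of the lesson's code) and the typing.Union spelling so that it also runs on 3.9; it shows how an isinstance check lets the checker narrow the type inside each branch.

```python
from typing import Union


def to_seconds(timeout: Union[int, float, str]) -> float:
    """Convert a timeout such as 5, 2.5, or "30s" into seconds."""
    if isinstance(timeout, str):
        # Inside this branch the checker narrows `timeout` to str,
        # so string methods are allowed.
        return float(timeout.removesuffix("s"))
    # Only int or float can reach this line.
    return float(timeout)
```

Calling to_seconds("30s") and to_seconds(30) both give a float, and a checker would reject to_seconds(None) because None is not part of the union.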

6

Literal and TypedDict

Literal: restricting to specific values

Literal constrains a parameter to a fixed set of values, similar to an enum but without defining a class.

from typing import Literal

def load_model(version: Literal["v1", "v2", "v3"]) -> object:
    ...

load_model("v1")   # OK
load_model("v4")   # mypy error: Argument 1 to "load_model" has
                   # incompatible type "Literal['v4']"; expected
                   # "Literal['v1', 'v2', 'v3']"

ModelSize = Literal["small", "medium", "large"]

def get_embedding_dim(size: ModelSize) -> int:
    dims = {"small": 384, "medium": 768, "large": 1536}
    return dims[size]
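Literal is checked only statically, so a value arriving from a config file or CLI still needs a runtime check. One possible approach, sketched here with a hypothetical parse_model_size helper, is to read the allowed values back from the alias with typing.get_args so the list is not duplicated.

```python
from typing import Literal, cast, get_args

ModelSize = Literal["small", "medium", "large"]


def parse_model_size(raw: str) -> ModelSize:
    """Validate an untrusted string against the values allowed by ModelSize."""
    allowed = get_args(ModelSize)  # the tuple of literal values
    if raw not in allowed:
        raise ValueError(f"Unknown model size {raw!r}; expected one of {allowed}")
    # Membership was checked above, so the cast is safe.
    return cast(ModelSize, raw)
```

The cast tells the checker what the membership test already guarantees at runtime.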

TypedDict: a dict with a fixed schema

TypedDict defines the structure of a dict. It is useful when the code still returns a dict (rather than a dataclass or Pydantic model) but you want the type checker to know which keys exist.

from typing import TypedDict

class PredictionResult(TypedDict):
    label: str
    confidence: float
    model_version: str

def classify(text: str) -> PredictionResult:
    return {
        "label": "positive",
        "confidence": 0.92,
        "model_version": "v2",
    }

result = classify("Great product!")
# IDE autocomplete: result["label"], result["confidence"]
# mypy reports an error if you access a key that does not exist

TypedDict does not validate at runtime; it only tells the static checker about the schema. If you need runtime validation, use Pydantic (next section).
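The lack of runtime validation is easy to confirm. This minimal sketch redefines PredictionResult from above and builds a deliberately invalid value; the type: ignore comment is there only so a checker does not reject the demonstration.

```python
from typing import TypedDict


class PredictionResult(TypedDict):
    label: str
    confidence: float
    model_version: str


# A static checker rejects this value (wrong type for "confidence",
# missing "model_version"), but at runtime nothing is checked.
bad: PredictionResult = {"label": "positive", "confidence": "high"}  # type: ignore

# A TypedDict instance is an ordinary dict at runtime.
print(type(bad) is dict)
print("model_version" in bad)
```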

7

Pydantic BaseModel

Pydantic (v2+) validates types at runtime, not just statically. It suits API requests and responses, config loading, and data contracts between components.

from pydantic import BaseModel, Field

class PredictRequest(BaseModel):
    features: list[float]
    model_version: str = "v1"
    top_k: int = Field(default=5, ge=1, le=100)

class PredictResponse(BaseModel):
    label: str
    confidence: float
    latency_ms: float

def predict(req: PredictRequest) -> PredictResponse:
    # req.features has already been validated as list[float]
    ...
    return PredictResponse(label="cat", confidence=0.95, latency_ms=12.3)

# Runtime validation:
req = PredictRequest(features=[1.0, 2.0, 3.0], top_k=150)
# Raises ValidationError: top_k must be less than or equal to 100

Pydantic v2 (pydantic>=2.0) uses a Rust core and is considerably faster than v1. The API changed in a few places: model_validate replaces parse_obj, and model_dump replaces dict().

# Pydantic v2
data = {"features": [1.0, 2.5], "model_version": "v2"}
req = PredictRequest.model_validate(data)
req_dict = req.model_dump()

8

Types for AI/ML: NDArray, DataFrame, Tensor

NumPy: NDArray

NumPy 1.20 introduced the numpy.typing module; NDArray, which takes a generic dtype parameter, has been available since NumPy 1.21.

import numpy as np
from numpy.typing import NDArray

def normalize(arr: NDArray[np.float64]) -> NDArray[np.float64]:
    """Normalize array to zero mean, unit variance."""
    return (arr - arr.mean()) / arr.std()

def one_hot(labels: NDArray[np.int32], n_classes: int) -> NDArray[np.float32]:
    result = np.zeros((len(labels), n_classes), dtype=np.float32)
    result[np.arange(len(labels)), labels] = 1.0
    return result

If you do not need a strict dtype, plain np.ndarray is simpler:

def batch_predict(X: np.ndarray) -> np.ndarray:
    ...

Pandas: DataFrame and Series

import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop NaN, convert dtypes, return clean dataframe."""
    return df.dropna().reset_index(drop=True)

def train_model(
    X: pd.DataFrame,
    y: pd.Series,
) -> object:  # or sklearn.base.BaseEstimator
    ...

Pandas does not currently support generic column types (such as DataFrame[float]) in standard typing. The pandera library provides DataFrameModel for more detailed schema validation if you need it.

PyTorch: Tensor and Module

import torch
import torch.nn as nn

def forward_pass(
    model: nn.Module,
    x: torch.Tensor,
) -> torch.Tensor:
    model.eval()
    with torch.no_grad():
        return model(x)

def compute_loss(
    logits: torch.Tensor,
    labels: torch.Tensor,
) -> torch.Tensor:
    return nn.CrossEntropyLoss()(logits, labels)

PyTorch does not yet ship a built-in generic Tensor type (such as Tensor[float32, (B, C, H, W)]). The jaxtyping library provides dtype and shape annotations if you need them.

9

Generic and TypeVar

TypeVar lets you write generic functions, where the return type depends on the input type.

from typing import TypeVar

T = TypeVar("T")

def first_element(items: list[T]) -> T:
    if not items:
        raise ValueError("Empty list")
    return items[0]

first_element([1, 2, 3])       # inferred: int
first_element(["a", "b"])      # inferred: str
first_element([1.0, 2.0])      # inferred: float

TypeVar with constraints:

from typing import TypeVar
import numpy as np
import pandas as pd

# ArrayLike can only be np.ndarray or pd.DataFrame
ArrayLike = TypeVar("ArrayLike", np.ndarray, pd.DataFrame)

def validate_shape(data: ArrayLike, expected_cols: int) -> ArrayLike:
    if data.shape[1] != expected_cols:
        raise ValueError(f"Expected {expected_cols} columns, got {data.shape[1]}")
    return data

Python 3.12 introduced a more compact type parameter syntax (def fn[T](x: list[T]) -> T), but the TypeVar syntax still works and is currently more common.
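A TypeVar can also tie several parameters together. The sketch below defines a hypothetical best_by helper (the name and the checkpoint strings are illustrative only) in which the scoring callback must accept the same T that the sequence holds, and the result is again a T.

```python
from collections.abc import Callable, Sequence
from typing import TypeVar

T = TypeVar("T")


def best_by(items: Sequence[T], score: Callable[[T], float]) -> T:
    """Return the item with the highest score; the result type follows the input."""
    if not items:
        raise ValueError("Empty sequence")
    return max(items, key=score)


checkpoints = ["epoch-1", "epoch-12", "epoch-3"]
# T is inferred as str, so `latest` is a str as well.
latest = best_by(checkpoints, lambda name: float(name.split("-")[1]))
print(latest)
```

Passing a callback written for a different element type would be reported by the checker, because both uses of T must agree.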

10

Callable

Callable[[ArgTypes...], ReturnType] is used when a parameter is a function.

from typing import Callable

def apply(fn: Callable[[int, int], int], a: int, b: int) -> int:
    return fn(a, b)

apply(lambda x, y: x + y, 3, 4)   # = 7
apply(lambda x, y: x * y, 3, 4)   # = 12

# In ML practice: a custom metric function
MetricFn = Callable[[list[int], list[int]], float]

def evaluate(
    y_true: list[int],
    y_pred: list[int],
    metric: MetricFn,
) -> float:
    return metric(y_true, y_pred)

If the signature is more complex (keyword arguments, *args), use Protocol from typing to define the interface more precisely than Callable can.

from typing import Protocol

class Scorer(Protocol):
    def __call__(
        self,
        y_true: list[int],
        y_pred: list[int],
        *,
        normalize: bool = True,
    ) -> float: ...
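As a usage sketch for the Scorer protocol above, the following defines a hypothetical accuracy function whose signature matches the protocol and passes it to an evaluate function. No inheritance is involved; the match is purely structural.

```python
from typing import Protocol


class Scorer(Protocol):
    def __call__(
        self,
        y_true: list[int],
        y_pred: list[int],
        *,
        normalize: bool = True,
    ) -> float: ...


def accuracy(y_true: list[int], y_pred: list[int], *, normalize: bool = True) -> float:
    """Fraction of matching labels, or the raw count when normalize is False."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true) if normalize else float(correct)


def evaluate(y_true: list[int], y_pred: list[int], scorer: Scorer) -> float:
    # The keyword-only argument is part of the contract,
    # which a plain Callable[...] annotation cannot express.
    return scorer(y_true, y_pred, normalize=True)


score = evaluate([1, 0, 1, 1], [1, 0, 0, 1], accuracy)
print(score)
```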

11

Docstrings: 3 common styles

There are three common docstring conventions in Python: Google style, NumPy style, and Sphinx (reST) style.

a) Google style: recommended for readability

Used by Google, Keras, and many large ML projects. Easy to read as plain text.

import pandas as pd
import sklearn.base

def train_model(
    X: pd.DataFrame,
    y: pd.Series,
    n_estimators: int = 100,
) -> sklearn.base.BaseEstimator:
    """Train random forest classifier.

    Args:
        X: Feature dataframe with shape (n_samples, n_features).
        y: Target series with shape (n_samples,).
        n_estimators: Number of trees in the forest. Default 100.

    Returns:
        Trained RandomForestClassifier instance.

    Raises:
        ValueError: If X and y have different number of samples.

    Example:
        >>> X, y = load_data("train.csv")
        >>> model = train_model(X, y, n_estimators=200)
        >>> predictions = model.predict(X_test)
    """
    if len(X) != len(y):
        raise ValueError(
            f"X has {len(X)} samples but y has {len(y)} samples"
        )
    from sklearn.ensemble import RandomForestClassifier

    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    return clf.fit(X, y)

b) NumPy style: common in scientific Python

Used by NumPy, SciPy, and scikit-learn. The format is longer, but types and descriptions are clearly separated.

def train_model(X, y, n_estimators=100):
    """Train random forest classifier.

    Parameters
    ----------
    X : pd.DataFrame
        Feature dataframe with shape (n_samples, n_features).
    y : pd.Series
        Target with shape (n_samples,).
    n_estimators : int, optional
        Number of trees (default 100).

    Returns
    -------
    sklearn.base.BaseEstimator
        Trained RandomForestClassifier.

    Raises
    ------
    ValueError
        If X and y have different number of samples.

    Examples
    --------
    >>> model = train_model(X_train, y_train)
    >>> model.predict(X_test)
    """
    ...

c) Sphinx (reST) style: legacy

An older format, less common in new code. It still appears in projects such as Django and in the Python stdlib.

def train_model(X, y, n_estimators=100):
    """Train random forest classifier.

    :param X: Feature dataframe.
    :type X: pd.DataFrame
    :param y: Target series.
    :type y: pd.Series
    :param n_estimators: Number of trees.
    :type n_estimators: int
    :returns: Trained model.
    :rtype: sklearn.base.BaseEstimator
    :raises ValueError: If X and y have different lengths.
    """
    ...

Which style should you choose? What matters most is consistency within the codebase. If there is no specific requirement, Google style is easier to read when viewing raw source code.
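Whichever style you pick, the >>> lines in an Example section can be executed by the standard doctest module, which keeps examples from going stale. The sketch below uses a hypothetical clip_probability function with a Google-style docstring and runs its examples programmatically.

```python
import doctest


def clip_probability(p: float) -> float:
    """Clamp a score into the [0.0, 1.0] range.

    Args:
        p: Raw score that may fall outside the valid range.

    Returns:
        The score limited to [0.0, 1.0].

    Example:
        >>> clip_probability(1.7)
        1.0
        >>> clip_probability(-0.2)
        0.0
    """
    return max(0.0, min(1.0, p))


# Collect the docstring examples and run them as tests.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for case in finder.find(clip_probability, globs={"clip_probability": clip_probability}):
    runner.run(case)
results = runner.summarize(verbose=False)
print(results)
```

In a real project the same check is usually run from the command line with python -m doctest on the file, or through the test runner.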

12

When a docstring is needed

Not every function needs a full docstring. Over-documenting is a problem too.

Needs a docstring

Public API functions, called from other modules or by users of the library
Functions with complex logic or side effects that the name does not reveal
Functions with non-obvious preconditions or exceptions
Important classes with many methods

# Needs a docstring: the logic is not obvious
def cosine_similarity_batch(
    query: np.ndarray,
    corpus: np.ndarray,
    top_k: int = 10,
) -> tuple[np.ndarray, np.ndarray]:
    """Compute cosine similarity between query and corpus vectors.

    Args:
        query: Query vector with shape (embedding_dim,).
        corpus: Corpus matrix with shape (n_docs, embedding_dim).
        top_k: Number of top results to return.

    Returns:
        Tuple of (indices, scores) sorted by descending score.
        indices: shape (top_k,), dtype int64.
        scores: shape (top_k,), dtype float32.
    """
    ...

No docstring needed

# Trivial helper: the name says enough
def _to_lowercase(text: str) -> str:
    return text.lower()

# Simple property getter
@property
def num_classes(self) -> int:
    return len(self._classes)

Rule of thumb: if you must read the function body to understand what it does, add a docstring. If the name and type hints already say enough, skip it.

Module-level docstring

Important .py files (public modules, main entry points) should have a short docstring at the top of the file:

"""Feature engineering pipeline for churn prediction.

Transforms raw customer data into model-ready features:
- Temporal features (recency, frequency, tenure)
- Aggregated behavioral metrics
- Encoded categorical variables
"""
import pandas as pd
...

13

Static type checking with mypy

pip install mypy

# Check the whole src directory
mypy src/

# Check a specific file
mypy src/model.py

Example of mypy catching an error:

# model.py
def predict(model: object, x: list[float]) -> int:
    return model.predict(x)   # mypy: "object" has no attribute "predict"

def load_weights(path: str) -> dict[str, float]:
    import json
    with open(path) as f:  # close the file when done
        return json.load(f)  # accepted: json.load returns Any (flagged by --warn-return-any)

$ mypy src/model.py
src/model.py:3: error: "object" has no attribute "predict"  [attr-defined]
Found 1 error in 1 file (checked 1 source file)

Configuring mypy in pyproject.toml

[tool.mypy]
python_version = "3.11"
strict = true
ignore_missing_imports = true

strict = true enables mypy's bundle of strict flags, including:

--disallow-untyped-defs: functions must have annotations
--no-implicit-optional: a None default does not implicitly make the parameter Optional
--warn-return-any: warns when a function returns Any

For a new codebase: enable strict from the start. For an older codebase without type hints: start with strict = false and tighten gradually.

ignore_missing_imports

Many ML libraries (torch, sklearn, transformers) ship incomplete type stubs. ignore_missing_imports = true suppresses the errors mypy would otherwise report for imports it cannot resolve.
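Setting the flag globally also hides a misspelled or uninstalled package. A narrower alternative, sketched here with an illustrative module list (the package names are assumptions to adapt to your own dependencies), is mypy's per-module overrides table in pyproject.toml:

```toml
[tool.mypy]
python_version = "3.11"
strict = true

# Relax the missing-import check only for the listed packages.
[[tool.mypy.overrides]]
module = ["sklearn.*", "transformers.*"]
ignore_missing_imports = true
```

With this layout, an unresolved import outside the listed packages is still reported.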

14

Related tools

Tool	Purpose	Notes
`mypy`	Static type checker	Reference implementation, many plugins
`pyright`	Static type checker (Microsoft)	Faster than mypy, used by VS Code Pylance
`ruff`	Linter + formatter	Has the `ANN` (annotation) and `D` (docstring) rule groups
`pydocstyle`	Docstring style checker	Standalone or via ruff's `D` rules

Configuring ruff to check annotations and docstrings in pyproject.toml:

[tool.ruff.lint]
select = [
    "E",    # pycodestyle errors
    "W",    # pycodestyle warnings
    "F",    # pyflakes
    "ANN",  # flake8-annotations (type hints)
    "D",    # pydocstyle (docstrings)
]

[tool.ruff.lint.pydocstyle]
convention = "google"  # or "numpy"

Run the checks for the whole project:

ruff check src/
mypy src/

15

Auto-generating docs from docstrings

When a project has many public APIs, its docstrings can be compiled into a documentation site automatically.

Tool	Setup	Use case
pdoc	Minimal, zero config	Small internal tools
MkDocs + mkdocstrings	YAML config, Markdown-based	Modern projects, GitHub Pages
Sphinx	Complex setup, many extensions	Large libraries; the standard for the Python stdlib docs

Example with pdoc, the quickest way to view docs locally:

pip install pdoc
pdoc src/model.py          # serves the docs locally and opens a browser
pdoc --output-directory docs src/  # export HTML

With MkDocs:

pip install mkdocs "mkdocstrings[python]"
mkdocs new .
mkdocs serve   # http://localhost:8000

For a portfolio, writing good docstrings and generating a site with pdoc or MkDocs makes the project look professional for little effort.

16

Common anti-patterns

Type hints that contradict the logic

# WRONG: the annotation says int but the function returns str
def get_count(items: list) -> int:
    return f"Total: {len(items)}"  # returns str, not int

# WRONG: the annotation says str but the lookup may yield None
def get_name(user_id: int) -> str:
    user = db.find(user_id)
    return user.name  # if user is None, this raises AttributeError
    # Correct: -> str | None, and handle the None case
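One possible corrected version of get_name is sketched below. The User class and the in-memory _USERS dict are hypothetical stand-ins for the db object in the snippet above, and Optional is used so the example also runs on 3.9.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    name: str


# Hypothetical stand-in for the `db` object; a real lookup would query storage.
_USERS: dict[int, User] = {1: User(name="Alice")}


def find_user(user_id: int) -> Optional[User]:
    return _USERS.get(user_id)


def get_name(user_id: int) -> Optional[str]:
    """Return the user's name, or None when the user does not exist."""
    user = find_user(user_id)
    if user is None:
        # The annotation now admits this outcome, so callers must handle it.
        return None
    return user.name
```

Raising a dedicated exception instead of returning None is an equally valid design; the point is that the annotation and the behavior agree.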

Overusing Any

from typing import Any

# Meaningless: mypy skips all checks here
def process(data: Any) -> Any:
    ...

# Use Any only when truly needed (e.g. a JSON blob with an unknown schema)
def parse_json(raw: str) -> dict[str, Any]:
    import json
    return json.loads(raw)  # OK: values can be of any type
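When Any is unavoidable at a boundary, it helps to narrow it right away so the rest of the code works with concrete types. A minimal sketch, assuming a hypothetical "threshold" field in the JSON payload:

```python
import json
from typing import Any


def read_threshold(raw: str, default: float = 0.5) -> float:
    """Extract a numeric "threshold" field from an untrusted JSON string."""
    payload: Any = json.loads(raw)
    if not isinstance(payload, dict):
        return default
    value = payload.get("threshold")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    return default
```

Any stays confined to the first line of the body; the return type the caller sees is a plain float.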

Docstring placeholder

# WRONG: meaningless placeholder
def embed_text(text: str) -> list[float]:
    """TODO: write docstring"""
    ...

# WRONG: the docstring went stale after the code changed
def normalize(arr: np.ndarray) -> np.ndarray:
    """Normalize array to [0, 1] range."""  # the code now computes a z-score
    return (arr - arr.mean()) / arr.std()   # but the docstring was never updated

# WRONG: the docstring only repeats the function name
def get_user(user_id: int) -> dict:
    """Get user by user_id."""  # adds no information
    ...

Mixed docstring style

# WRONG: the file mixes Google and NumPy styles
def fn_a(x: int) -> int:
    """Do something.

    Args:           # Google style
        x: Input.
    """
    ...

def fn_b(x: int) -> int:
    """Do something.

    Parameters     # NumPy style — inconsistent
    ----------
    x : int
    """
    ...

Over-typing complex generics

# Not worth the time to type-hint at this level
# if mypy does not support it well or few people will read it
from typing import TypeVar, Generic, Iterator

T_co = TypeVar("T_co", covariant=True)

class DataStream(Generic[T_co]):
    def __iter__(self) -> Iterator[T_co]: ...

# Simpler and sufficient:
class DataStream:
    def __iter__(self) -> Iterator[dict]: ...

Unresolved forward reference

# WRONG: MyModel is not yet defined when the annotation is evaluated
def clone(model: MyModel) -> MyModel:  # NameError
    ...

class MyModel:
    pass

# Correct: use a string as a forward reference
def clone(model: "MyModel") -> "MyModel":
    ...

class MyModel:
    pass

# Or, on Python 3.7+: from __future__ import annotations
from __future__ import annotations

def clone(model: MyModel) -> MyModel:   # OK — lazy evaluation
    ...

class MyModel:
    pass
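A frequent variant of the same problem is a method that returns its own class: inside the class body the class name is not bound yet. A minimal sketch, reusing the MyModel name from above with a hypothetical name attribute:

```python
class MyModel:
    def __init__(self, name: str) -> None:
        self.name = name

    # "MyModel" is quoted because the class is still being defined here.
    def clone(self) -> "MyModel":
        return MyModel(self.name)


original = MyModel("baseline")
copy = original.clone()

# The annotation is stored as a string until a tool resolves it.
print(MyModel.clone.__annotations__["return"])
```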

17

Hands-on exercises

Exercise 1: Add type hints

Add full type annotations to the following functions (without changing the logic):

def split_data(features, labels, ratio=0.8):
    n = int(len(features) * ratio)
    return features[:n], features[n:], labels[:n], labels[n:]

def compute_accuracy(y_true, y_pred):
    correct = sum(a == b for a, b in zip(y_true, y_pred))
    return correct / len(y_true)

def load_config(path):
    import json
    with open(path) as f:
        return json.load(f)

Exercise 2: Write a docstring

Write a Google-style docstring for the following function, with complete Args, Returns, and Raises sections:

def batch_embed(
    texts: list[str],
    model: object,
    batch_size: int = 32,
) -> np.ndarray:
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        emb = model.encode(batch)
        results.append(emb)
    return np.vstack(results)

Exercise 3: Run mypy

Create a file check_types.py with the following content, run mypy check_types.py, read the output, and fix the errors:

def add(a: int, b: int) -> str:
    return a + b

def get_label(score: float) -> str:
    if score > 0.5:
        return "positive"
    # missing return for the remaining case

def process(items: list[int]) -> list[int]:
    return [str(x) for x in items]  # wrong type

Exercise 4: TypedDict vs Pydantic

Define the schema for a RAG query result in two ways: as a TypedDict and as a Pydantic BaseModel. The schema needs: query (str), answer (str), sources (list of source URLs), confidence (float between 0 and 1), latency_ms (float, optional).

18

Summary

✅ Type hints: param: Type, -> ReturnType. Python 3.9+ uses the built-in list and dict, with no import from typing
✅ X | None (3.10+) or Optional[X] for nullable values. X | Y for a union of several types
✅ Literal restricts to specific values. TypedDict for dicts with a schema. Pydantic when you need runtime validation
✅ NDArray[np.float64] for NumPy, pd.DataFrame/pd.Series for Pandas, torch.Tensor for PyTorch
✅ Google style: Args:, Returns:, Raises:. NumPy style: underlined sections. Pick one and stay consistent
✅ mypy --strict to catch type errors before runtime. ruff D rules to enforce docstring style
✅ Write docstrings for public functions, complex logic, or functions that raise. Skip trivial helpers
✅ Avoid overusing Any, placeholder docstrings, stale docstrings, and mixed styles

19

Next lesson

Lesson 10: Professional logging and error handling
