Lesson 7: A Standard Directory Structure for AI / ML Projects

1

Learning Objectives

After completing this lesson, you will:

✅ Understand why directory structure affects project quality
✅ Know three common structures: classic ML, FastAPI service, RAG app
✅ Distinguish src layout from flat layout and know when to use each
✅ Know how to use pyproject.toml and a Makefile
✅ Avoid common pitfalls when setting up a project

2

Why Directory Structure Matters

Directory structure is the first thing recruiters and teammates see when they open a repo. A well-structured project signals that its author understands separation of concerns and knows how to organize code for teamwork.

Finding code quickly

When a reviewer or teammate needs the preprocessing logic, they look in src/data/ or app/ml/preprocessor.py instead of digging through 10 notebooks. Conventions reduce onboarding time.

Modularity

Code that lives in separate modules is easy to test, import, and replace. Logic that lives in a notebook cell, by contrast, cannot be reused elsewhere without copy-pasting.

Separation of concerns

The data pipeline, model training, the serving API, and deployment config are different concerns. Putting each in its own folder lets it change independently: for example, you can swap the model without touching the API layer.

Convention → fast onboarding

A project that follows Cookiecutter Data Science or FastAPI conventions lets a newcomer read README.md and start contributing within a few hours, without having to ask "where is the training script?"

3

Structure 1: Cookiecutter Data Science

The original template comes from drivendata/cookiecutter-data-science. It suits classic ML, EDA-heavy projects, and Kaggle-style work.

project/
├── data/
│   ├── raw/           # immutable original data; never edit in place
│   ├── interim/       # intermediate, transformed data
│   ├── processed/     # final dataset used to train the model
│   └── external/      # data from third-party sources
├── docs/
├── models/            # trained model artifact (.pkl, .pt, .onnx)
├── notebooks/         # exploration only
├── references/        # data dictionaries, manuals
├── reports/
│   └── figures/       # charts exported from notebooks
├── requirements.txt
├── setup.py
├── src/
│   ├── __init__.py
│   ├── data/          # scripts to download and preprocess data
│   ├── features/      # feature engineering
│   ├── models/        # train, predict, evaluate
│   └── visualization/ # plotting functions
├── tests/
├── .gitignore
└── README.md

Strengths

Clear for classic ML: EDA → feature engineering → train → evaluate.
Followed by many tutorials and Kaggle notebooks, so reference material is easy to find.
Separating data/raw (immutable) from data/processed helps reproduce results.
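
The raw/processed split can be sketched as a small script that only ever reads from data/raw/ and writes to data/processed/. This is a minimal illustration, not part of the template: the function name `build_processed`, the file name customers.csv, and the "drop rows with empty fields" rule are all assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path


def build_processed(raw_csv: Path, processed_csv: Path) -> int:
    """Read a raw CSV, drop rows with empty fields, and write the result elsewhere.

    Returns the number of rows kept. Assumes well-formed rows (no extra columns).
    """
    if raw_csv.resolve() == processed_csv.resolve():
        raise ValueError("refusing to overwrite raw data")

    with raw_csv.open(newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        rows = [row for row in reader if all(v and v.strip() for v in row.values())]

    processed_csv.parent.mkdir(parents=True, exist_ok=True)
    with processed_csv.open("w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Demo in a temporary directory so no real data is touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    raw = root / "data" / "raw" / "customers.csv"
    raw.parent.mkdir(parents=True)
    raw.write_text("id,plan\n1,basic\n2,\n3,pro\n")
    raw_before = raw.read_text()

    kept = build_processed(raw, root / "data" / "processed" / "customers.csv")
    raw_unchanged = raw.read_text() == raw_before
```

Because the raw file is never modified, rerunning the script from a clean checkout should regenerate the same processed dataset.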

Weaknesses

No app/ or api/ folder, so it is a poor fit when the model must be served through an API.
No Dockerfile in the original template; add one yourself if you deploy.
Overkill for RAG/LLM projects: src/features/ and src/visualization/ usually go unused.

4

Structure 2: FastAPI ML Service

Use this when the project is an ML-backed API service: train offline, serve online through FastAPI.

project/
├── app/
│   ├── __init__.py
│   ├── main.py            # FastAPI app entry point
│   ├── api/
│   │   ├── __init__.py
│   │   ├── routes/
│   │   │   ├── predict.py # POST /predict endpoint
│   │   │   └── health.py  # GET /health endpoint
│   │   └── schemas.py     # Pydantic request/response models
│   ├── core/
│   │   ├── config.py      # settings via BaseSettings (pydantic-settings package in Pydantic v2)
│   │   └── logging.py     # logging setup
│   ├── ml/
│   │   ├── model.py       # load model, run inference
│   │   └── preprocessor.py
│   └── deps.py            # FastAPI dependency injection
├── tests/
│   ├── unit/
│   └── integration/
├── notebooks/             # exploration only, never shipped to production
├── data/
├── models/
├── scripts/
│   └── train.py
├── Dockerfile
├── docker-compose.yml
├── pyproject.toml
├── requirements.txt
└── README.md

Strengths

Clearly separates app/ (production code) from notebooks/ (exploration) and scripts/ (offline jobs).
app/ml/ holds the inference logic, so swapping the model does not touch the API routes.
Ships with a Dockerfile and docker-compose.yml, so it is deploy-ready from the start.
Unit tests and integration tests live in separate folders.

When to use

Projects that train a model offline (sklearn, LightGBM, PyTorch), export a model artifact, and serve predictions through a REST API. Examples: a churn prediction service, an image classification API, a fraud detection endpoint.
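
To make "swap the model without touching the routes" concrete, here is a rough sketch of what app/ml/model.py could expose. It uses only the standard library so it runs without FastAPI. The names `Model`, `ThresholdModel`, `MeanModel`, and `PredictionService` are hypothetical stand-ins, not part of any framework.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


class Model(Protocol):
    """Anything with a predict method can sit behind the service."""

    def predict(self, features: Sequence[float]) -> float: ...


@dataclass
class ThresholdModel:
    """Stand-in for a trained artifact loaded from models/ (hypothetical)."""

    threshold: float

    def predict(self, features: Sequence[float]) -> float:
        return 1.0 if sum(features) > self.threshold else 0.0


class MeanModel:
    """A second stand-in, used to show that the service does not care which model it wraps."""

    def predict(self, features: Sequence[float]) -> float:
        return sum(features) / len(features)


class PredictionService:
    """The only object the API routes would need to know about."""

    def __init__(self, model: Model) -> None:
        self._model = model

    def predict(self, payload: dict[str, float]) -> dict[str, float]:
        # Sort keys so the feature order does not depend on the request's key order.
        features = [payload[name] for name in sorted(payload)]
        return {"score": self._model.predict(features)}


service = PredictionService(ThresholdModel(threshold=1.0))
result = service.predict({"a": 0.7, "b": 0.6})
```

A route handler would call `service.predict(...)` and never import a concrete model class, so replacing `ThresholdModel` with another implementation is a one-line change where the service is constructed.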

5

Structure 3: RAG / LLM App

Use this for Retrieval-Augmented Generation (RAG) projects, LLM chatbots, or AI agents. It differs from the ML service in that there is no model training step; instead there is an ingest pipeline, a vector DB, and prompt management.

project/
├── app/
│   ├── api/             # FastAPI endpoints
│   ├── chains/          # LangChain / LangGraph chains
│   ├── prompts/         # prompt templates (.yaml or .txt)
│   ├── retrieval/       # vector DB clients (Chroma, Qdrant, Pinecone)
│   ├── tools/           # agent tools
│   ├── evaluation/      # RAG evaluation logic
│   └── config.py
├── data/
│   ├── raw/             # source documents (PDF, HTML, DOCX)
│   └── processed/       # cached chunks and embeddings
├── scripts/
│   ├── ingest.py        # load documents → chunk → embed → upsert
│   └── evaluate.py      # run evaluation on the golden set
├── eval/
│   ├── golden_set.json  # questions + expected answers for evaluation
│   └── results/
├── tests/
├── notebooks/           # prototype chain mới, visualize retrieval
├── Dockerfile
├── docker-compose.yml
├── pyproject.toml
└── README.md
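
The "load documents → chunk → embed → upsert" flow of scripts/ingest.py might look roughly like the sketch below. Everything here is an assumption for illustration: the character-window chunking, the `fake_embed` placeholder (a hash, not a real embedding model), and the in-memory dict standing in for a vector DB client.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    index: int
    text: str


def chunk_text(doc_id: str, text: str, size: int = 200, overlap: int = 50) -> list[Chunk]:
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: list[Chunk] = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append(Chunk(doc_id, index, text[start:start + size]))
        if start + size >= len(text):
            break  # the window already reached the end of the text
    return chunks


def fake_embed(text: str, dim: int = 8) -> list[float]:
    """Placeholder for a real embedding model: hash bytes scaled to [0, 1]."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dim]]


def ingest(docs: dict[str, str], store: dict[str, list[float]]) -> int:
    """Chunk, embed, and upsert every document; returns the number of chunks written."""
    count = 0
    for doc_id, text in docs.items():
        for chunk in chunk_text(doc_id, text):
            # Deterministic ids make the write an upsert: re-ingesting overwrites.
            store[f"{doc_id}:{chunk.index}"] = fake_embed(chunk.text)
            count += 1
    return count


store: dict[str, list[float]] = {}
written = ingest({"handbook": "a" * 500}, store)
```

Keying each vector by document id plus chunk index is one simple way to keep re-ingestion idempotent; a real vector DB client would take the place of the dict.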

Distinctive folders

app/chains/: LangChain/LangGraph pipeline logic, kept separate so it can be tested independently of the API layer.
app/prompts/: prompt templates as .yaml or .txt files, versioned with the code and easy to A/B test.
app/evaluation/: RAG-specific metrics (faithfulness, answer relevance, context recall).
eval/golden_set.json: a Q&A dataset for regression tests whenever the model or chunking strategy changes.
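
A golden-set regression check could be sketched as follows. The lesson does not show the schema of golden_set.json, so the `question`/`expected` keys, the sample entries, and the substring-match rule are assumptions. `stub_app` stands in for the real chain.

```python
import json

# Hypothetical contents of eval/golden_set.json (illustrative data only).
GOLDEN_SET_JSON = """
[
  {"question": "What is the refund window?", "expected": "30 days"},
  {"question": "Which plan includes SSO?", "expected": "Enterprise"}
]
"""


def answer_matches(answer: str, expected: str) -> bool:
    """Loose check: the expected phrase appears in the answer, case-insensitively."""
    return expected.lower() in answer.lower()


def run_regression(golden: list[dict], answer_fn, min_pass_rate: float = 1.0) -> tuple[float, bool]:
    """Return (pass_rate, passed) for answer_fn evaluated on the golden set."""
    hits = [answer_matches(answer_fn(item["question"]), item["expected"]) for item in golden]
    pass_rate = sum(hits) / len(hits)
    return pass_rate, pass_rate >= min_pass_rate


def stub_app(question: str) -> str:
    """Stand-in for the real RAG chain."""
    canned = {
        "What is the refund window?": "Refunds are accepted within 30 days of purchase.",
        "Which plan includes SSO?": "SSO is part of the Enterprise plan.",
    }
    return canned.get(question, "I don't know.")


golden = json.loads(GOLDEN_SET_JSON)
pass_rate, passed = run_regression(golden, stub_app)
```

Substring matching is deliberately crude. The metrics listed for app/evaluation/ (faithfulness, answer relevance, context recall) would replace `answer_matches` in a real evaluation.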

Note

data/processed/ holds cached embeddings, which are usually large: add it to .gitignore and keep it in object storage (S3, GCS). The same goes for data/raw/: internal PDFs should not be committed to GitHub.

6

Src Layout vs Flat Layout

Src layout

Production code lives in src/<package_name>/.

project/
├── src/
│   └── mypackage/
│       ├── __init__.py
│       ├── model.py
│       └── utils.py
├── tests/
└── pyproject.toml

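Advantage: Python does not pick the package up from the project root automatically; you have to run pip install -e . before you can import it. That forces a proper install and prevents tests from silently importing the working-tree copy instead of the installed package. It is the more standard choice for a production library.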

Flat layout

Code lives directly in <package_name>/ at the root.

project/
├── mypackage/
│   ├── __init__.py
│   ├── model.py
│   └── utils.py
├── tests/
└── pyproject.toml

Advantage: Simpler. The package is importable without installation whenever Python is started from the root (for example with python -m).

Recommendation

Flat layout: personal portfolio projects and apps (FastAPI service, RAG app). The main folder is app/, so no src/ wrapper is needed.
Src layout: when you are building a real Python package, something other people will pip install. Example: a shared ML utils library within a team.
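
The import behaviour behind the two layouts can be checked with a small experiment. The sketch below builds both layouts in a temporary directory and asks Python whether each package resolves when only the project root is on sys.path. The package names `flatpkg_demo` and `srcpkg_demo` are made up for the demo.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path


def make_package(parent: Path, name: str) -> None:
    """Create an empty package directory with an __init__.py."""
    pkg = parent / name
    pkg.mkdir(parents=True)
    (pkg / "__init__.py").write_text("")


def importable_from(path_entry: Path, name: str) -> bool:
    """Check whether `name` resolves when `path_entry` is temporarily added to sys.path."""
    sys.path.insert(0, str(path_entry))
    try:
        importlib.invalidate_caches()
        return importlib.util.find_spec(name) is not None
    finally:
        sys.path.remove(str(path_entry))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    make_package(root, "flatpkg_demo")         # flat layout: <root>/flatpkg_demo/
    make_package(root / "src", "srcpkg_demo")  # src layout:  <root>/src/srcpkg_demo/

    # Only the project root is on the path, as when Python is started from the root.
    flat_found = importable_from(root, "flatpkg_demo")
    src_found = importable_from(root, "srcpkg_demo")

    # Roughly what an editable install arranges: src/ itself becomes importable.
    src_found_after_install = importable_from(root / "src", "srcpkg_demo")
```

The flat package resolves straight from the root, while the src package stays invisible until src/ is on the path. That is why a src layout forces an install step.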

7

Folder Explanations

| Folder | Purpose | Git? |
| --- | --- | --- |
| `data/raw/` | Original, immutable data. Never edit in place. | .gitignore; store with DVC or on S3 |
| `data/processed/` | Cleaned dataset, ready for training or embedding. | .gitignore |
| `models/` | Trained model artifacts (.pkl, .pt, .onnx, .bin). | .gitignore; store in MLflow, HuggingFace Hub, or S3 |
| `notebooks/` | EDA, prototypes, visualization. No production logic. | Commit, but clear outputs before pushing |
| `src/` or `app/` | Production code: model, pipeline, API. | Commit |
| `tests/` | Unit and integration tests. | Commit |
| `scripts/` | Standalone scripts: train, eval, ingest, migrate. | Commit |
| `configs/` | YAML/JSON config: hyperparameters, model name, paths. | Commit (except files containing secrets) |
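
Clearing notebook outputs before a push can be done with the standard library alone, because an .ipynb file is JSON. The sketch below assumes the nbformat 4 layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`). The sample notebook dict is made up for the demo.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook dict with code-cell outputs cleared."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy of JSON-compatible data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# EDA"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 3,
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                }
            ],
        },
    ],
}

cleaned = strip_outputs(notebook)
```

In practice an existing tool such as nbstripout, or `jupyter nbconvert --clear-output --inplace`, is usually a better choice than a hand-rolled script. The sketch only shows what those tools change in the file.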

8

Standard Files at the Root

The project root should contain the following files:

| File | Role |
| --- | --- |
| `README.md` | Main documentation: project description, installation, how to run. |
| `pyproject.toml` | Dependency management + tool config (ruff, pytest, mypy). Modern alternative to `setup.py` + `requirements.txt`. |
| `requirements.txt` | Pinned dependencies for reproducibility. Used alongside or instead of `pyproject.toml`, depending on context. |
| `Dockerfile` | Containerizes the app. |
| `docker-compose.yml` | Multi-service setup (app + vector DB + cache). |
| `.gitignore` | Excludes data, models, `__pycache__`, `.env`, `.venv`. |
| `.env.example` | Template for env vars (API keys, DB URL). Commit this file, never the real `.env`. |
| `LICENSE` | License file. MIT is a common choice for a public portfolio. |
| `Makefile` | Command shortcuts: install, test, lint, train, serve. |

9

pyproject.toml — Modern Python

pyproject.toml (PEP 517/518/621) replaces the separate setup.py and setup.cfg files and can also declare the dependencies that used to live in requirements.txt. All package and tool configuration sits in one file.

[project]
name = "my-ai-project"
version = "0.1.0"
description = "RAG chatbot on internal docs"
requires-python = ">=3.11"
dependencies = [
    "fastapi>=0.115",
    "uvicorn[standard]>=0.30",
    "langchain>=0.3",
    "openai>=1.40",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0",
    "ruff>=0.5",
    "mypy>=1.10",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.ruff]
line-length = 100

[tool.ruff.lint]
select = ["E", "F", "I"]

[tool.mypy]
python_version = "3.11"
strict = true

[tool.pytest.ini_options]
testpaths = ["tests"]

Installing with optional deps

# Install production deps
pip install .

# Install production + dev deps
pip install ".[dev]"

# Editable mode (code changes take effect without reinstalling)
pip install -e ".[dev]"

When requirements.txt is still useful

When deploying to an environment that does not use a build backend (for example, some simple CI pipelines), or when you need exact version pins for reproducibility:

# Export pinned versions from the current environment
pip freeze > requirements.txt

10

Makefile — Automation Command

A Makefile defines shortcuts for frequently used commands. Instead of remembering and retyping a full command each time, run make test or make serve.

.PHONY: install test lint train serve docker-build

install:
	pip install -e ".[dev]"

test:
	pytest tests/ -v

lint:
	ruff check .
	mypy app/

train:
	python scripts/train.py --config configs/train.yaml

serve:
	uvicorn app.main:app --reload --port 8000

docker-build:
	docker build -t my-ai-app:latest .

Note: Makefile recipes must be indented with a tab, not spaces. If a copy-pasted Makefile fails with a "missing separator" error, check the indentation characters.
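
A quick way to find the offending lines is to look for ones that begin with a space. The helper below is a rough heuristic, not a Makefile parser: leading spaces are legal in some places (for example before a variable assignment), so treat the result as a list of suspects. The function name is made up for the example.

```python
def find_space_indented_lines(makefile_text: str) -> list[int]:
    """Return 1-based numbers of non-empty lines that start with a space instead of a tab."""
    suspects = []
    for lineno, line in enumerate(makefile_text.splitlines(), start=1):
        if line.startswith(" ") and line.strip():
            suspects.append(lineno)
    return suspects


good = "test:\n\tpytest tests/ -v\n"
bad = "test:\n    pytest tests/ -v\n"

good_suspects = find_space_indented_lines(good)
bad_suspects = find_space_indented_lines(bad)
```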

Alternative: justfile

just is a more modern alternative to make: simpler syntax and no tab quirks. Which one to use is a matter of team preference.

11

YAML Config for Hyperparameters

Hard-coding hyperparameters in a script is an anti-pattern: changing one value means editing code, and the change is hard to track across commits. Moving config into a YAML file lets you try many configurations without touching the code.

# configs/train.yaml
model:
  name: bert-base-uncased
  num_labels: 5

training:
  batch_size: 32
  learning_rate: 2.0e-5
  epochs: 3
  warmup_steps: 100

data:
  train_path: data/processed/train.parquet
  val_path: data/processed/val.parquet
  max_length: 512

Loading config with PyYAML

import yaml

# safe_load parses plain data only and never constructs arbitrary Python objects
with open("configs/train.yaml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

# Access values
lr = cfg["training"]["learning_rate"]
model_name = cfg["model"]["name"]
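
A plain dict gives no protection against typos or wrong types, so one option is to convert the `training` section into a typed object right after loading. The `TrainingConfig` dataclass and its validation rules below are assumptions for illustration. The input dict mimics what `yaml.safe_load` would return, so the sketch runs without PyYAML installed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    batch_size: int
    learning_rate: float
    epochs: int
    warmup_steps: int

    @classmethod
    def from_dict(cls, raw: dict) -> "TrainingConfig":
        """Coerce types explicitly and fail fast on missing keys or invalid values."""
        cfg = cls(
            batch_size=int(raw["batch_size"]),
            learning_rate=float(raw["learning_rate"]),
            epochs=int(raw["epochs"]),
            warmup_steps=int(raw["warmup_steps"]),
        )
        if cfg.batch_size <= 0 or cfg.epochs <= 0:
            raise ValueError("batch_size and epochs must be positive")
        if not 0 < cfg.learning_rate < 1:
            raise ValueError("learning_rate must be between 0 and 1")
        return cfg


# Shape of cfg["training"] after loading. The learning rate is a string on purpose:
# PyYAML resolves "2e-5" (no decimal point) as a string, unlike "2.0e-5".
loaded = {"batch_size": 32, "learning_rate": "2e-5", "epochs": 3, "warmup_steps": 100}
training = TrainingConfig.from_dict(loaded)

try:
    TrainingConfig.from_dict({**loaded, "batch_size": 0})
    rejected = False
except ValueError:
    rejected = True
```

The explicit `float(...)` call is what guards against the scientific-notation quirk. Without it, a learning rate written as 2e-5 would reach the optimizer as a string.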

Loading config with OmegaConf (recommended for complex projects)

from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/train.yaml")

# Dot access instead of dict lookups
lr = cfg.training.learning_rate
model_name = cfg.model.name

# CLI overrides such as training.learning_rate=3e-5 can be merged in:
# cfg = OmegaConf.merge(cfg, OmegaConf.from_cli())

Hydra (a framework built on top of OmegaConf) is useful when you need to compose multiple config files or run sweeps. For a portfolio project, though, OmegaConf is enough without adding another abstraction.

12

Notebooks: Use Them in the Right Place

Notebooks are a good fit for EDA, visualizing distributions, prototyping a new chain, and writing reports with charts. They are a poor fit for production logic and for code that will be called from several places.

Naming convention

notebooks/
├── 01-eda.ipynb
├── 02-feature-engineering.ipynb
├── 03-model-baseline.ipynb
├── 04-model-tuning.ipynb
└── 05-error-analysis.ipynb

Numeric prefixes sort the notebooks in workflow order, which makes them easy to find when you need to review them later.

Workflow after the POC is done

Logic verified in the notebook → extract it into a function in src/ or app/.
Write tests for that function.
The notebook keeps the visualization and narrative; delete the debug cells.

Lesson 8 covers the notebook → script refactoring workflow in detail.

13

Tests Folder

tests/
├── conftest.py          # shared fixtures
├── unit/
│   ├── test_preprocessor.py
│   ├── test_model.py
│   └── test_utils.py
└── integration/
    ├── test_api.py      # test HTTP endpoint
    └── test_pipeline.py # test end-to-end pipeline

Naming convention

File: test_<module>.py
Function: test_<feature>_<scenario>

# tests/unit/test_preprocessor.py
import pytest
from app.ml.preprocessor import clean_text

def test_clean_text_removes_html_tags():
    result = clean_text("<b>Hello</b> World")
    assert result == "Hello World"

def test_clean_text_handles_empty_string():
    result = clean_text("")
    assert result == ""
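
The tests above import `clean_text` from app/ml/preprocessor.py, which the lesson does not show. A minimal implementation that would satisfy both tests might look like this. The regex-based tag stripping is an assumption and is not robust HTML parsing.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")
_WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Strip HTML tags, decode entities, and collapse runs of whitespace."""
    without_tags = _TAG_RE.sub(" ", text)      # a space keeps adjacent blocks apart
    unescaped = html.unescape(without_tags)    # "&amp;" becomes "&"
    return _WHITESPACE_RE.sub(" ", unescaped).strip()
```

Replacing each tag with a space keeps "<p>a</p><p>b</p>" from collapsing into "ab", at the cost of splitting words that contain inline tags. For untrusted or complex HTML, a real parser would be the safer choice.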

conftest.py — shared fixtures

# tests/conftest.py
import pytest
from fastapi.testclient import TestClient
from app.main import app

@pytest.fixture
def client():
    return TestClient(app)

@pytest.fixture
def sample_text():
    return "This is a sample document for testing."

A project without a tests/ folder signals low code quality; recruiters and tech leads notice this when reviewing a portfolio.

14

Scripts vs CLI Tool

Simple script

If a script takes only one or two parameters, argparse is enough:

# scripts/train.py
import argparse
import yaml

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default="configs/train.yaml")
    args = parser.parse_args()

    with open(args.config) as f:
        cfg = yaml.safe_load(f)

    # ... training logic

if __name__ == "__main__":
    main()
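
One small variation makes such a script easier to test: let the parsing function accept an explicit argument list instead of always reading sys.argv. The sketch below is an assumed refinement of the script above. The `--dry-run` flag is added only for the illustration.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse CLI arguments; argv=None falls back to sys.argv, a list allows testing."""
    parser = argparse.ArgumentParser(description="Train a model from a YAML config.")
    parser.add_argument("--config", default="configs/train.yaml", help="Path to the config file")
    parser.add_argument("--dry-run", action="store_true", help="Parse arguments and exit")
    return parser.parse_args(argv)


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Entry point that returns an exit code instead of calling sys.exit itself."""
    args = parse_args(argv)
    if args.dry_run:
        print(f"Would train with {args.config}")
        return 0
    # ... training logic would go here
    return 0


defaults = parse_args([])
custom = parse_args(["--config", "configs/small.yaml", "--dry-run"])
```

In the script itself, the `if __name__ == "__main__":` guard would then call `raise SystemExit(main())`, and a unit test can call `main([...])` directly without spawning a subprocess.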

CLI tool with Typer (for multiple sub-commands)

# scripts/cli.py
import typer

app = typer.Typer()

@app.command()
def train(
    config: str = typer.Option("configs/train.yaml", help="Path to config file"),
    dry_run: bool = typer.Option(False, help="Validate config without training"),
):
    """Train model from config."""
    typer.echo(f"Loading config: {config}")
    if dry_run:
        typer.echo("Dry run — skipping training.")
        return
    # ... training logic

@app.command()
def evaluate(model_path: str, data_path: str):
    """Evaluate trained model on test set."""
    ...

if __name__ == "__main__":
    app()

# Usage
python scripts/cli.py train --config configs/train.yaml
python scripts/cli.py evaluate models/best_model.pt data/processed/test.parquet
python scripts/cli.py --help

Typer generates --help from docstrings and type hints, so there is no need to write it by hand.

15

Which Structure to Use When

| Use case | Suitable structure |
| --- | --- |
| Single Kaggle / tutorial notebook | Just `notebook.ipynb` + `requirements.txt` + `README.md` |
| Small CLI tool or utility script | `<package>/` + `tests/` + `pyproject.toml` |
| ML research / EDA project | Cookiecutter Data Science |
| Model-backed REST API | FastAPI ML Service |
| RAG chatbot / LLM app | RAG / LLM App structure |
| Reusable Python library | Src layout + `pyproject.toml` |

Principle: choose a structure that matches the scope. Do not over-engineer a small project, and do not under-engineer a production one.

16

Creating a Skeleton Quickly with Cookiecutter

pip install cookiecutter

# Create an ML project from the Cookiecutter Data Science template
cookiecutter https://github.com/drivendata/cookiecutter-data-science

After you run it, cookiecutter asks for the project name, author, and license, then generates the whole skeleton. Use it as a starting point and delete the folders you do not need. Newer releases of the template are driven by the project's own ccds command, so check its README if the command above prints a deprecation notice or produces a different layout.

Other templates

cookiecutter-fastapi: a FastAPI project with Docker and a CI/CD skeleton.
cookiecutter-pypackage: a standard Python package with CI, docs, and tests.
Or create your own team template and host it on GitHub.
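
Before investing in a full cookiecutter template, a team template can start as a short script. The sketch below creates a skeleton loosely based on the FastAPI ML Service layout. The `SKELETON` contents and the `scaffold` function are hypothetical, and the demo runs in a temporary directory.

```python
import tempfile
from pathlib import Path

# Hypothetical team template, loosely based on the FastAPI ML Service layout.
SKELETON = {
    "dirs": [
        "app/api/routes", "app/core", "app/ml",
        "tests/unit", "tests/integration", "notebooks", "scripts",
    ],
    "files": [
        "app/__init__.py", "app/main.py", "README.md",
        "pyproject.toml", ".gitignore", ".env.example",
    ],
}


def scaffold(root: Path) -> list[Path]:
    """Create missing folders and empty files under root; return the files created."""
    for rel in SKELETON["dirs"]:
        (root / rel).mkdir(parents=True, exist_ok=True)
    created = []
    for rel in SKELETON["files"]:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():  # never overwrite existing work
            path.touch()
            created.append(path)
    return created


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first_run = [p.relative_to(root).as_posix() for p in scaffold(root)]
    second_run = scaffold(root)
    has_ml_dir = (root / "app" / "ml").is_dir()
```

Running it a second time creates nothing, so the script is safe to rerun on an existing project.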

17

Common Pitfalls

1. The notebook is the source of truth

Important logic lives in notebook cells and is never refactored into a module. The result: it cannot be reproduced, tested, or reviewed through diffs.

2. Mixing raw and processed data

All data sits in a single data/ folder with no subfolders. After a few months nobody remembers which files are originals and which have been transformed.

3. Scattered sys.path.append calls

# Anti-pattern, common in notebooks
import sys
sys.path.append("../../src")
from models import predict

Instead, install the package properly with pip install -e . from the start.

4. No tests/ folder

A project with no tests at all is an obvious red flag in code review. Even a few simple unit tests are better than none.

5. Committing data and models to GitHub

A 200 MB CSV or a 500 MB .pkl model in the repo slows down git clone, and GitHub rejects pushes containing files larger than 100 MB. Use .gitignore plus DVC or cloud storage.
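
A small pre-commit check can catch oversized files before they reach the remote. The sketch below walks a directory tree and reports files above a size limit. The function name and the demo files are assumptions, and the demo uses a tiny limit so it runs instantly.

```python
import tempfile
from pathlib import Path

# GitHub documents its per-file limit as 100 MiB.
GITHUB_FILE_LIMIT = 100 * 1024 * 1024


def find_large_files(root: Path, limit_bytes: int = GITHUB_FILE_LIMIT) -> list[Path]:
    """Return paths (relative to root) of files larger than limit_bytes, skipping .git/."""
    large = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        if ".git" in relative.parts:
            continue
        if path.is_file() and path.stat().st_size > limit_bytes:
            large.append(relative)
    return sorted(large)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    (root / "data" / "big.csv").write_bytes(b"x" * 2000)
    (root / "README.md").write_bytes(b"x" * 10)

    found = find_large_files(root, limit_bytes=1000)
```

This only inspects the working tree. A file that was already committed stays in the Git history even after it is deleted, so it is best to run the check before the first commit.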

6. Committing a .env file with API keys

Committing .env is a serious security mistake. Commit only .env.example with placeholders, and add .env to .gitignore from the very first commit.
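
On the code side, secrets are then read from the environment rather than from source files. The helper below is a minimal sketch that fails fast when a required variable is missing. The names `require_env` and `MissingSettingError`, and the variable names in the demo, are made up for the example.

```python
import os
from typing import Mapping


class MissingSettingError(RuntimeError):
    """Raised when a required environment variable is absent or empty."""


def require_env(name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return the value of a required variable, or raise with a hint on how to fix it."""
    value = environ.get(name, "").strip()
    if not value:
        raise MissingSettingError(
            f"{name} is not set; copy .env.example to .env and fill it in"
        )
    return value


# A fake environment keeps the demo independent of the real machine.
fake_env = {"OPENAI_API_KEY": "sk-placeholder"}
api_key = require_env("OPENAI_API_KEY", fake_env)

try:
    require_env("VECTOR_DB_URL", fake_env)
    missing_detected = False
except MissingSettingError:
    missing_detected = True
```

Python does not read a .env file on its own. Something has to load it into the environment first, for example a loader such as python-dotenv or the env_file option in docker-compose.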

7. Over-engineering a small project

Applying the full Cookiecutter Data Science template to a single demo notebook leaves you with 20 empty folders. Structure should match scope: a small project gets a small structure.

18

Summary

✅ Three common structures: Cookiecutter DS (classic ML), FastAPI ML Service (API), RAG/LLM App (chatbot/agent).
✅ Flat layout for apps and portfolios; src layout for reusable libraries.
✅ pyproject.toml replaces the separate setup.py + requirements.txt.
✅ A Makefile or justfile cuts down on retyping repeated commands.
✅ YAML config for hyperparameters; do not hard-code them in scripts.
✅ notebooks/ is for exploration only; production logic goes into src/ or app/.
✅ tests/ is mandatory: unit + integration.
✅ data/ and models/ go into .gitignore; store them with DVC, S3, or MLflow.
✅ Never commit .env; commit only .env.example.

19

Next Lesson

Lesson 8: Refactoring from notebook (.ipynb) to script (.py), the workflow for turning notebook-cell logic into Python modules that can be tested and reused.
