Lesson 12: Project 2, Image Classification + Gradio Demo

1

Project overview

This project builds a complete image classifier, from raw data to a live demo. Unlike Lesson 11, which used classical ML on tabular data, Lesson 12 focuses on computer vision with deep learning.

Project specs

Domain: image classification. Pick one of: medical images (X-ray), retail products, bird species, Vietnamese dishes.
Scale: 5–10 classes, 1k–5k images.
Tech stack: Python 3.11, PyTorch 2.x, torchvision 0.18+, Gradio 4.x, Hugging Face Spaces.
Estimated timeline: 2–3 weeks (data preparation, training, demo, deployment).

Why this project suits a portfolio

It demonstrates Deep Learning + Computer Vision skills, an area many jobs ask for.
Transfer learning keeps it practical: with a frozen backbone, training is fast enough on a laptop CPU or Apple MPS, so no powerful GPU is needed.
Anyone can try the Gradio demo in a browser, so a recruiter does not need to install anything.
HF Spaces deployment is free and gives a public URL that is easy to link from the README and CV.

Final repo structure

image-classifier/
├── data/
│   ├── train/
│   ├── val/
│   └── test/
├── scripts/
│   ├── split_dataset.py
│   └── train.py
├── models/
│   └── classifier_v1.pth
├── examples/
│   ├── pho_sample.jpg
│   └── banh_mi_sample.jpg
├── app.py              # Gradio demo
├── requirements.txt
└── README.md

2

Choosing a dataset

Option A: Public dataset

Good for getting started quickly or for benchmarking on a standard dataset.

Cats vs Dogs (Kaggle): beginner level, 2 classes, ~25k images. Very common, so it earns little credit for originality.
Stanford Dogs: 120 breeds, ~20k images. Harder, but still a familiar tutorial dataset.
Food-101: 101 dishes, 1,000 images per class. More realistic, but slow to train on the full set.
Chest X-Ray Pneumonia (Kaggle): medical domain, 2 classes (NORMAL / PNEUMONIA), ~5.8k images. A plus because the domain is meaningful.

Option B: Collect your own (recommended for originality)

Your portfolio stands out more if you built the dataset yourself. Example: 8 Vietnamese dishes (phở, bún chả, bánh mì, bún bò, cơm tấm, bánh xèo, gỏi cuốn, chả giò).

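# Install icrawler
pip install icrawler

# Crawl images from Google
python - <<'EOF'
from icrawler.builtin import GoogleImageCrawler

# Map each search keyword to its own folder name so that classes sharing a
# first word ("bun cha", "bun bo") do not end up in the same directory.
classes = {
    "pho vietnam": "pho",
    "bun cha vietnam": "bun_cha",
    "banh mi vietnam": "banh_mi",
    "bun bo hue": "bun_bo",
}
for keyword, folder in classes.items():
    crawler = GoogleImageCrawler(storage={"root_dir": f"data/raw/{folder}"})
    crawler.crawl(keyword=keyword, max_num=300)
EOF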

After crawling, manually remove noisy images (~5–10 minutes per class) and keep 200–400 clean images per class. For 8 classes that is roughly 1,600–3,200 images in total, which is enough to train well with transfer learning.

Notes before you choose

Check class balance: a gap of more than 3× between the largest and smallest class needs extra handling (weighted loss or oversampling).
Check the license if you publish the demo. Images from Google Images are not public domain.
Public datasets (Kaggle, HF Datasets) usually have clearer licenses.

3

Step 1: Set up the directory structure

torchvision's ImageFolder reads labels from directory names, so the data must be laid out like this:

data/
├── train/
│   ├── pho/        # *.jpg
│   ├── banh_mi/
│   └── bun_bo/
├── val/
│   ├── pho/
│   ├── banh_mi/
│   └── bun_bo/
└── test/
    ├── pho/
    ├── banh_mi/
    └── bun_bo/

The script scripts/split_dataset.py splits the raw data automatically:

# scripts/split_dataset.py
import os
import shutil
import random
from pathlib import Path

RAW_DIR = Path("data/raw")
OUT_DIR = Path("data")
SPLITS = {"train": 0.70, "val": 0.15, "test": 0.15}
SEED = 42

random.seed(SEED)

for cls_dir in RAW_DIR.iterdir():
    if not cls_dir.is_dir():
        continue
    # Collect the common image extensions; crawled files may also be saved as .jpeg
    images = [p for ext in ("*.jpg", "*.jpeg", "*.png") for p in cls_dir.glob(ext)]
    random.shuffle(images)

    n = len(images)
    n_train = int(n * SPLITS["train"])
    n_val = int(n * SPLITS["val"])

    buckets = {
        "train": images[:n_train],
        "val": images[n_train : n_train + n_val],
        "test": images[n_train + n_val :],
    }

    for split, files in buckets.items():
        dest = OUT_DIR / split / cls_dir.name
        dest.mkdir(parents=True, exist_ok=True)
        for f in files:
            shutil.copy(f, dest / f.name)

    # Report the actual size of each split for this class
    print(f"{cls_dir.name}: train={n_train} val={n_val} test={n - n_train - n_val}")
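
The split arithmetic in the script can be checked in isolation. The sketch below is a minimal illustration, assuming the same 70/15/15 ratios; `split_counts` is a hypothetical helper, not part of the original script. It shows that `int()` truncates the train and val counts, so any remainder lands in the test split.

```python
def split_counts(n: int, train: float = 0.70, val: float = 0.15) -> tuple[int, int, int]:
    """Return (n_train, n_val, n_test) the way the split script slices the list."""
    n_train = int(n * train)        # truncated, never rounded up
    n_val = int(n * val)            # truncated, never rounded up
    n_test = n - n_train - n_val    # test receives whatever is left
    return n_train, n_val, n_test


for n in (120, 333, 1000):
    n_train, n_val, n_test = split_counts(n)
    print(f"n={n}: train={n_train} val={n_val} test={n_test}")
```

With very small classes the truncation can leave the val split noticeably smaller than intended, which is one more reason to keep at least a couple of hundred images per class.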

4

Step 2: Data loading and augmentation

The mean [0.485, 0.456, 0.406] and std [0.229, 0.224, 0.225] are the ImageNet normalization statistics. Use them whenever the model loads ImageNet-pretrained weights.

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_tfm = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

val_tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

train_ds = datasets.ImageFolder("data/train", transform=train_tfm)
val_ds   = datasets.ImageFolder("data/val",   transform=val_tfm)
test_ds  = datasets.ImageFolder("data/test",  transform=val_tfm)

train_loader = DataLoader(train_ds, batch_size=32, shuffle=True,  num_workers=4)
val_loader   = DataLoader(val_ds,   batch_size=32, shuffle=False, num_workers=4)
test_loader  = DataLoader(test_ds,  batch_size=32, shuffle=False, num_workers=4)

print(train_ds.class_to_idx)
# {"banh_mi": 0, "bun_bo": 1, "pho": 2, ...}
print(f"Train: {len(train_ds)} | Val: {len(val_ds)} | Test: {len(test_ds)}")

Notes on augmentation

RandomHorizontalFlip: suitable for food, product, and animal photos. Do not use it on text, characters, or X-rays with a fixed orientation.
ColorJitter: makes the model more robust to different lighting conditions.
Do not augment the val/test sets. Only resize and normalize them so the evaluation stays honest.
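
To make the Normalize step concrete, here is a minimal sketch in plain Python of what it does to a single pixel, assuming the pixel has already been scaled to [0, 1] by ToTensor. The helper names `normalize_pixel` and `denormalize_pixel` are illustrative and not part of torchvision.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Apply (x - mean) / std per channel to one RGB pixel in [0, 1]."""
    return [(x - m) / s for x, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)]


def denormalize_pixel(rgb_norm):
    """Invert the normalization: x * std + mean per channel."""
    return [x * s + m for x, m, s in zip(rgb_norm, IMAGENET_MEAN, IMAGENET_STD)]


mid_gray = [0.5, 0.5, 0.5]
normalized = normalize_pixel(mid_gray)
restored = denormalize_pixel(normalized)
print("normalized:", normalized)
print("restored:  ", restored)
```

The inverse mapping is what any visualization code needs when it wants to display a tensor that was normalized for the model.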

Checking class imbalance

from collections import Counter

labels = [label for _, label in train_ds.samples]
counts = Counter(labels)
for cls_name, idx in train_ds.class_to_idx.items():
    print(f"{cls_name:20s}: {counts[idx]} images")

If the largest class is more than 3× the size of the smallest, consider WeightedRandomSampler or the weight argument of nn.CrossEntropyLoss.
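
One common way to derive per-class loss weights is inverse frequency. The sketch below is a minimal illustration with made-up counts; `inverse_frequency_weights` is a hypothetical helper, and other weighting schemes are equally valid.

```python
def inverse_frequency_weights(counts):
    """counts: per-class image counts, ordered by class index.

    Returns total / (num_classes * count) per class, so rare classes
    get weights above 1 and frequent classes get weights below 1.
    """
    total = sum(counts)
    num_classes = len(counts)
    return [total / (num_classes * c) for c in counts]


counts = [400, 300, 100]                       # illustrative numbers only
imbalance_ratio = max(counts) / min(counts)    # 4.0 here, above the 3x threshold
weights = inverse_frequency_weights(counts)

print("imbalance ratio:", imbalance_ratio)
print("class weights:  ", weights)
# With PyTorch, this list could then be passed as
# nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32))
```

Weighting the loss leaves the data loader unchanged, whereas WeightedRandomSampler changes how often each image is drawn. Either approach addresses the same imbalance.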

5

Step 3: Transfer learning with EfficientNet B0

Transfer learning: take a backbone pretrained on ImageNet (~1.2M images, 1,000 classes), freeze the whole backbone, and train only a new classifier head for the new task. This works well even on small datasets (a few hundred images per class).

Why EfficientNet B0?

Small (~5.3M params) with fast CPU inference, which matters when deploying to a CPU-only HF Space.
A better accuracy-to-size trade-off than ResNet-50 on several benchmarks (Tan & Le, 2019, arXiv:1905.11946).
Available in torchvision 0.13+ as EfficientNet_B0_Weights.IMAGENET1K_V1.

import torch
import torch.nn as nn
from torchvision import models

device = (
    torch.device("cuda")  if torch.cuda.is_available()
    else torch.device("mps") if torch.backends.mps.is_available()
    else torch.device("cpu")
)
print(f"Using device: {device}")

# Load pretrained weights
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)

# Freeze backbone
for param in model.features.parameters():
    param.requires_grad = False

# Replace the classifier head
num_classes = len(train_ds.classes)
in_features = model.classifier[1].in_features  # 1280 for B0
model.classifier[1] = nn.Linear(in_features, num_classes)

model.to(device)

# Count trainable params
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total     = sum(p.numel() for p in model.parameters())
print(f"Trainable params: {trainable:,} / {total:,}")
# With 8 classes: Trainable params: 10,248 / 4,017,796  (only the head is trained)
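
The trainable count of 10,248 follows directly from the shape of the new Linear layer: one weight per input-output pair plus one bias per output. A minimal check, assuming the 1280 input features of B0 and 8 classes (`linear_param_count` is an illustrative helper):

```python
def linear_param_count(in_features: int, out_features: int) -> int:
    """Parameters of a fully connected layer: weight matrix plus bias vector."""
    return in_features * out_features + out_features


for num_classes in (5, 8, 10):
    print(num_classes, "classes ->", linear_param_count(1280, num_classes), "trainable params")
```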

If you prefer ResNet-50

# Replace the block above with:
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)
model.to(device)

ResNet-50 (~25M params) is a widely used baseline, but it is roughly five times larger than EfficientNet B0 and slower at CPU inference, so it fits best if you deploy on a GPU Space.

6

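Step 4: Training loop

import os

import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR

criterion = nn.CrossEntropyLoss()
# Only pass the parameters that are still trainable (the new head)
optimizer = optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=1e-3,
    weight_decay=1e-4,
)
scheduler = CosineAnnealingLR(optimizer, T_max=10)

# The checkpoint directory must exist before torch.save is called
os.makedirs("models", exist_ok=True)
best_val_acc = 0.0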

for epoch in range(10):
    # --- Train ---
    model.train()
    train_loss = 0.0
    train_correct = 0

    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        logits = model(X)
        loss = criterion(logits, y)
        loss.backward()
        optimizer.step()
        train_loss    += loss.item() * X.size(0)
        train_correct += (logits.argmax(1) == y).sum().item()

    scheduler.step()

    # --- Validate ---
    model.eval()
    val_correct = 0
    with torch.no_grad():
        for X, y in val_loader:
            X, y = X.to(device), y.to(device)
            val_correct += (model(X).argmax(1) == y).sum().item()

    train_acc = train_correct / len(train_ds)
    val_acc   = val_correct   / len(val_ds)
    avg_loss  = train_loss    / len(train_ds)

    print(f"Epoch {epoch+1:2d}: loss={avg_loss:.4f}  train_acc={train_acc:.4f}  val_acc={val_acc:.4f}")

    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), "models/best_head.pth")

print(f"\nBest val_acc: {best_val_acc:.4f}")

Why these choices

AdamW with weight_decay=1e-4: AdamW (Loshchilov & Hutter, 2017, arXiv:1711.05101) decouples weight decay from the gradient update, which usually works better than Adam with plain L2 regularization.
CosineAnnealingLR: lowers the learning rate along a cosine curve, which tends to converge more smoothly than step decay and suits short runs (10–15 epochs).
Saving the checkpoint with the best val accuracy, rather than the last epoch, avoids keeping a model that has started to overfit the training set.
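
The cosine schedule can be written in closed form as lr(t) = eta_min + 0.5 * (base_lr - eta_min) * (1 + cos(pi * t / T_max)). The sketch below evaluates it in plain Python for the settings used in the loop (base_lr=1e-3, T_max=10), assuming eta_min=0; `cosine_lr` is an illustrative helper, not a PyTorch function.

```python
import math


def cosine_lr(epoch: int, base_lr: float = 1e-3, t_max: int = 10, eta_min: float = 0.0) -> float:
    """Closed-form cosine annealing: starts at base_lr, reaches eta_min at t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


for epoch in range(11):
    print(f"epoch {epoch:2d}: lr = {cosine_lr(epoch):.6f}")
```

Because the loop calls scheduler.step() at the end of each epoch, the ten training epochs use the values for t = 0 through 9, so the learning rate gets close to zero but never reaches it during training.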

7

Step 5: Fine-tuning (optional)

Once the head has stabilized (~10 epochs), unfreeze the backbone and keep training with a learning rate about 100× lower to fine-tune the whole network.

# Unfreeze backbone
for param in model.features.parameters():
    param.requires_grad = True

# Use a very small LR so the pretrained weights are not destroyed
optimizer = optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=5)

for epoch in range(5):
    # Same train/validate loop as above, for 5 more epochs
    ...

When to fine-tune

Your dataset is far from ImageNet (medical images, satellite images, microscopy).
You have more than 1,000 images per class. Fine-tuning on a small dataset tends to overfit.
Head-only accuracy is already above 80%, so fine-tuning starts from a stable head.

Note

If the dataset domain is close to ImageNet (food, animals, products), head-only training usually gives good results already (85–92% val accuracy) without fine-tuning.

8

Step 6: Evaluation

Use the test set, which has stayed untouched during the entire training process, for the final evaluation.

from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Load best checkpoint
model.load_state_dict(torch.load("models/best_head.pth", map_location=device))
model.eval()

all_preds  = []
all_labels = []

with torch.no_grad():
    for X, y in test_loader:
        X = X.to(device)
        preds = model(X).argmax(1).cpu()
        all_preds.extend(preds.numpy())
        all_labels.extend(y.numpy())

print(classification_report(
    all_labels, all_preds,
    target_names=train_ds.classes,
    digits=3,
))

# Confusion matrix
cm = confusion_matrix(all_labels, all_preds)
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(
    cm, annot=True, fmt="d",
    xticklabels=train_ds.classes,
    yticklabels=train_ds.classes,
    ax=ax,
)
ax.set_xlabel("Predicted")
ax.set_ylabel("True")
plt.tight_layout()
plt.savefig("confusion_matrix.png", dpi=150)

Reading classification_report

precision: of the images the model predicted as class X, what percentage is correct.
recall: of the images that really are class X, what percentage the model caught.
f1-score: the harmonic mean of precision and recall, a good summary metric when classes are imbalanced.
Add the confusion matrix to the README to visualize which classes get confused with each other.
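
The three metrics can be recomputed by hand from a confusion matrix, which is a useful sanity check on the report. The sketch below is a minimal pure-Python version using an invented 3-class matrix (rows are true classes, columns are predictions, the same convention as sklearn's confusion_matrix); `per_class_metrics` is an illustrative helper.

```python
def per_class_metrics(cm, class_names):
    """cm[i][j] = number of images with true class i predicted as class j."""
    report = {}
    for k, name in enumerate(class_names):
        tp = cm[k][k]
        predicted_k = sum(row[k] for row in cm)   # column sum: everything predicted as k
        actual_k = sum(cm[k])                     # row sum: everything that really is k
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[name] = (precision, recall, f1)
    return report


# Invented numbers for illustration only
cm = [
    [40, 5, 0],    # true banh_mi
    [10, 30, 5],   # true bun_bo
    [0, 5, 45],    # true pho
]
names = ["banh_mi", "bun_bo", "pho"]
metrics = per_class_metrics(cm, names)

for name, (p, r, f1) in metrics.items():
    print(f"{name:10s} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```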

9

Step 7: Save the checkpoint

Save the metadata needed to reload the model without reading the training code:

import os

os.makedirs("models", exist_ok=True)

torch.save({
    "model_state_dict": model.state_dict(),
    "class_names": train_ds.classes,   # list; order must match the class indices
    "model_arch": "efficientnet_b0",
    "num_classes": len(train_ds.classes),
    "val_acc": best_val_acc,
}, "models/classifier_v1.pth")

print("Saved models/classifier_v1.pth")

Why class_names must be saved

When the model is loaded elsewhere (Gradio app, production server), class_names must be in the exact index order. If you save only the state_dict, you have to remember or guess the order, which easily goes wrong if the data folders are renamed or sorted differently.

ImageFolder sorts classes alphabetically: ['banh_mi', 'bun_bo', 'com_tam', 'pho'] → index 0, 1, 2, 3. If this order is not saved, the app can attach the wrong label to a prediction.
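
The ordering issue is easy to reproduce without any model. The sketch below is a minimal illustration with the four folder names from the example above; `wrong_names` stands for a list typed from memory and is purely hypothetical.

```python
# Folder names in the order someone might list them by hand
folder_names = ["pho", "banh_mi", "com_tam", "bun_bo"]

# Sorting the names reproduces the alphabetical index assignment
class_names = sorted(folder_names)
class_to_idx = {name: i for i, name in enumerate(class_names)}
print(class_names)
print(class_to_idx)

# The model only outputs an index; the label depends on which list you look it up in
predicted_index = 3
wrong_names = folder_names
print("with saved class_names:", class_names[predicted_index])
print("with a guessed order:  ", wrong_names[predicted_index])
```

The same index maps to 'pho' in the saved list and to 'bun_bo' in the guessed one, which is exactly the silent mislabeling the checkpoint metadata prevents.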

10

Step 8: Gradio demo

File app.py: run it locally with python app.py or deploy it to HF Spaces:

# app.py
import gradio as gr
import torch
from torchvision import transforms, models
from PIL import Image

# --- Load model ---
ckpt = torch.load("models/classifier_v1.pth", map_location="cpu")
class_names = ckpt["class_names"]

model = models.efficientnet_b0()
model.classifier[1] = torch.nn.Linear(
    model.classifier[1].in_features,
    len(class_names),
)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# --- Transform (must match val_tfm from training) ---
tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def predict(image: Image.Image) -> dict:
    x = tfm(image.convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)[0].numpy()
    return {class_names[i]: float(probs[i]) for i in range(len(class_names))}

demo = gr.Interface(
    fn=predict,
    inputs=gr.Image(type="pil"),
    outputs=gr.Label(num_top_classes=3),
    title="Vietnamese Food Classifier",
    description="Classifies 8 Vietnamese dishes. Upload an image to get a prediction.",
    examples=[
        ["examples/pho_sample.jpg"],
        ["examples/banh_mi_sample.jpg"],
    ],
)

if __name__ == "__main__":
    demo.launch()
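
The predict function returns a dictionary of class name to probability, and gr.Label(num_top_classes=3) displays the three highest entries. The sketch below reproduces that logic in plain Python to show what happens to the raw logits; the logit values are invented and `top_k` is an illustrative helper, not a Gradio function.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(class_names, logits, k=3):
    """Return the k most probable classes as {name: probability}, highest first."""
    probs = softmax(logits)
    ranked = sorted(zip(class_names, probs), key=lambda pair: pair[1], reverse=True)
    return dict(ranked[:k])


names = ["banh_mi", "bun_bo", "com_tam", "pho"]
logits = [0.2, 1.5, -0.3, 3.1]   # invented model outputs for one image
result = top_k(names, logits)

for name, prob in result.items():
    print(f"{name:10s} {prob:.3f}")
```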

Run locally

pip install gradio==4.44.0
python app.py
# → Running on local URL: http://127.0.0.1:7860

Common pitfall: transform mismatch

The transform in app.py must be identical to val_tfm from training: same resize, same mean/std. A mismatch is one of the most common reasons for demo accuracy falling below test-set accuracy.

11

Step 9: Deploy to Hugging Face Spaces

Create the Space

Go to huggingface.co/new-space, enter a name, and choose SDK: Gradio, visibility: Public.

Clone the Space repo to your machine:

git clone https://huggingface.co/spaces/<username>/<space-name>
cd <space-name>

Copy the files in, keeping the models/ folder so the path used in app.py still resolves:

cp ../app.py .
mkdir -p models examples
cp ../models/classifier_v1.pth models/
cp ../examples/*.jpg examples/

requirements.txt

gradio==4.44.0
torch==2.3.0+cpu
torchvision==0.18.0+cpu
--extra-index-url https://download.pytorch.org/whl/cpu

Use the +cpu builds to keep the install small (~700MB instead of ~2.5GB with CUDA). The HF Spaces free CPU tier has 16GB of RAM, which is enough for EfficientNet B0.

README.md (the YAML frontmatter is required by HF)

---
title: Vietnamese Food Classifier
emoji: 🍜
colorFrom: orange
colorTo: yellow
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
---

Push to HF

git add app.py models/classifier_v1.pth requirements.txt README.md examples/
git commit -m "init: vietnamese food classifier"
git push

After a 2–5 minute build, the Space is live at https://huggingface.co/spaces/<username>/<space-name>.

If the .pth file is larger than 10MB

HF rejects regular git pushes of files over 10MB. An EfficientNet B0 checkpoint is around 16MB, so this usually applies. Track the checkpoint with git-lfs before committing it:

git lfs install
git lfs track "*.pth"
git add .gitattributes
git add models/classifier_v1.pth
git commit -m "add model via lfs"
git push
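
Before pushing, it can help to list which files in the Space repo exceed the size limit. The sketch below is a minimal helper, assuming the 10MB figure quoted above and treating it as 10 * 1024 * 1024 bytes; `files_needing_lfs` is a hypothetical name and not part of any HF tooling.

```python
from pathlib import Path

LFS_THRESHOLD_BYTES = 10 * 1024 * 1024  # assumed interpretation of the 10MB limit


def files_needing_lfs(root, threshold=LFS_THRESHOLD_BYTES):
    """Return relative paths of files under root that are larger than threshold."""
    root = Path(root)
    large = []
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        if ".git" in rel.parts:      # skip git's own object store
            continue
        if path.is_file() and path.stat().st_size > threshold:
            large.append(str(rel))
    return sorted(large)


if __name__ == "__main__":
    for name in files_needing_lfs("."):
        print("track with git-lfs:", name)
```

Run it from the root of the cloned Space. An empty output means a regular push should go through without LFS.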

12

Step 10: README for GitHub

The README is the first thing a recruiter sees when opening the repo. It needs the following 5 sections:

Minimal README structure

# Vietnamese Food Classifier

Demo live: https://huggingface.co/spaces/<user>/vietnamese-food-classifier

![demo](assets/demo.gif)

## Dataset
- 8 classes: phở, bún chả, bánh mì, bún bò, cơm tấm, bánh xèo, gỏi cuốn, chả giò
- 2,400 images after filtering (~300/class), split 70/15/15

## Architecture
ImageFolder → EfficientNet B0 (pretrained ImageNet) → Linear(1280, 8) → Softmax

## Results

| Metric        | Score  |
|---------------|--------|
| Val accuracy  | 89.2%  |
| Test accuracy | 87.8%  |
| Macro F1      | 0.876  |

![confusion matrix](assets/confusion_matrix.png)

## Tech stack
- PyTorch 2.3, torchvision 0.18
- Gradio 4.44
- Hugging Face Spaces (CPU free tier)

## Quick start
```bash
pip install -r requirements.txt
python app.py
```

Demo GIF

Record 10–15s of the local Gradio app (http://127.0.0.1:7860) while uploading a few images, then convert the recording to GIF:

macOS: QuickTime → record screen → convert with ffmpeg -i demo.mov -r 10 demo.gif.
Windows: ScreenToGif (free, exports .gif directly).

A GIF under 5MB at 10fps is enough. Keep it under 15s so viewers stay focused.

13

Bonus: Grad-CAM visualization

Grad-CAM (Selvaraju et al., 2017, arXiv:1610.02391) produces a heatmap showing which image regions most influence the model's decision. It is an explainability technique that adds technical depth to a portfolio.

pip install grad-cam

from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
import numpy as np

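# Grad-CAM needs gradients at the target layer, so re-enable them in case
# the backbone is still frozen from the head-only stage
for param in model.parameters():
    param.requires_grad = True
model.eval()

# Target layer: the last block of the backbone
target_layers = [model.features[-1]]

cam = GradCAM(model=model, target_layers=target_layers)

# Take one image from test_loader
X, y = next(iter(test_loader))
input_tensor = X[0:1].to(device)  # shape (1, 3, 224, 224)

# Build the heatmap for the predicted class
targets = [ClassifierOutputTarget(model(input_tensor).argmax().item())]
grayscale_cam = cam(input_tensor=input_tensor, targets=targets)[0]

# Undo the ImageNet normalization so the heatmap overlays the original image
rgb_img = X[0].permute(1, 2, 0).numpy()
rgb_img = (rgb_img * np.array([0.229, 0.224, 0.225]) + np.array([0.485, 0.456, 0.406]))
rgb_img = np.clip(rgb_img, 0, 1).astype(np.float32)

visualization = show_cam_on_image(rgb_img, grayscale_cam, use_rgb=True)

import os
import matplotlib.pyplot as plt
os.makedirs("assets", exist_ok=True)  # the output folder may not exist yet
plt.imsave("assets/gradcam_sample.png", visualization)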

Add a few Grad-CAM images to the README as evidence that the model attends to the right features (the phở broth, the shape of the bánh mì, ...) rather than the background.
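
For intuition about what the library computes, the core of Grad-CAM fits in a few lines: each channel of the target layer is weighted by the average of its gradient, the weighted channels are summed, and negative values are clipped. The sketch below is a simplified pure-Python version on tiny invented arrays. It omits the upsampling and rescaling that the grad-cam package also performs, and `grad_cam` here is an illustrative function, not the library's API.

```python
def grad_cam(activations, gradients):
    """activations, gradients: nested lists shaped [channel][row][col].

    Returns a [row][col] heatmap: ReLU(sum_k alpha_k * A_k), where alpha_k is
    the spatial mean of the gradient of the class score w.r.t. channel k.
    """
    rows, cols = len(activations[0]), len(activations[0][0])
    heatmap = [[0.0] * cols for _ in range(rows)]
    for act, grad in zip(activations, gradients):
        alpha = sum(sum(row) for row in grad) / (rows * cols)  # channel weight
        for i in range(rows):
            for j in range(cols):
                heatmap[i][j] += alpha * act[i][j]
    return [[max(v, 0.0) for v in row] for row in heatmap]      # ReLU


# Two channels of 2x2 feature maps with invented values
activations = [
    [[1.0, 2.0], [0.0, 1.0]],
    [[0.0, 1.0], [3.0, 0.0]],
]
gradients = [
    [[0.5, 0.5], [0.5, 0.5]],       # channel 0 supports the class
    [[-1.0, -1.0], [-1.0, -1.0]],   # channel 1 argues against it
]
heatmap = grad_cam(activations, gradients)
for row in heatmap:
    print(row)
```

Locations where the negatively weighted channel dominates end up at zero after the ReLU, which is why the heatmap only highlights evidence in favor of the chosen class.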

14

Target results and pitfalls

Target results

Val accuracy: 85–92% (with ~300 images/class, EfficientNet B0 head-only).
Test accuracy: within 3 percentage points of val accuracy.
HF Spaces demo works, with latency under 1s on the free CPU tier.
README includes a demo GIF, a confusion matrix, and an accuracy table.

Common pitfalls

| Problem | Symptom | Fix |
|---------|---------|-----|
| Unhandled dataset imbalance | Model predicts the majority class; recall on minority classes is near 0 | `WeightedRandomSampler` or the `weight` argument of `CrossEntropyLoss` |
| Test set leak | Unusually high test accuracy, lower accuracy in real use | Check for duplicate images with a hash or perceptual hash before splitting |
| Transform mismatch between train and serve | Demo accuracy noticeably lower than on the test set | Share one `val_tfm` definition between the training code and app.py |
| class_names not saved | The model predicts an index but the class name is unknown | Always save `class_names` in the checkpoint dict |
| Hard-coded .cuda() calls | App crashes when deployed to a CPU-only Space | Use the `device` variable and `map_location=device` |
| Unsuitable augmentation | Low accuracy even with enough data | Do not flip images with a fixed orientation (X-ray, handwriting) |
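
For the test set leak row, exact duplicates can be found with a content hash before running the split script. The sketch below is a minimal version using SHA-256 from the standard library, assuming the raw images live under data/raw; `find_exact_duplicates` is a hypothetical helper. It only catches byte-identical files. Resized or re-encoded copies would need a perceptual hash from an extra library, which is not shown here.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(root):
    """Group files under root by SHA-256 digest; return groups with 2+ files."""
    by_hash = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]


if __name__ == "__main__":
    for group in find_exact_duplicates("data/raw"):
        # Keep the first file of each group and review the rest by hand
        print("duplicates:", [str(p) for p in group])
```

Deduplicating before the split matters because a copy that lands in train and another in test inflates the test accuracy without the model generalizing any better.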

Ways to extend (when you want more difficulty)

Move to object detection with YOLOv8: detect and classify multiple objects in one image.
Multi-label classification: one image belongs to several classes at once (for example, a photo with both phở and bánh mì).
Knowledge distillation: train a smaller model with EfficientNet B0 as the teacher.
Active learning loop: the model picks which images to label next to raise accuracy fastest.

15

Next lesson

Lesson 13: Project 3, RAG Chatbot on internal documents
