Bài 35: AI System Design — kiến trúc 1 RAG production

1

Mục Tiêu Bài Học

Sau bài này bạn sẽ:

✅ Biết format và thời lượng của system design interview vòng AI
✅ Nắm framework 6 bước để trả lời có cấu trúc
✅ Walk-through được toàn bộ kiến trúc RAG production với số liệu cụ thể
✅ Biết cách justify trade-off cho từng component
✅ Chuẩn bị được 13 follow-up question phổ biến

2

Format Của System Design Interview

Thời Lượng Và Cấu Trúc

Thông thường 45–60 phút với phân bổ thời gian như sau:

Giai đoạn	Thời gian	Hoạt động
Requirements clarify	5–10 phút	Hỏi functional + non-functional
Capacity estimate	5 phút	Ước lượng storage, QPS, cost
High-level design	10–15 phút	Vẽ diagram, giải thích component
Deep dive	15–20 phút	Chi tiết component critical
Scaling + failure	5–10 phút	Trả lời follow-up

Prompt Điển Hình

Interviewer thường bắt đầu bằng câu rất mở:

"Design a RAG system for a company with 10,000 documents
and 500 internal users."

"Design a document Q&A chatbot that can handle 100 QPS."

"How would you build a knowledge base assistant for customer support?"

Công Cụ Vẽ Diagram

Virtual interview: Excalidraw (hand-drawn style, dễ thao tác nhanh), Miro (collaborative, interviewer xem được real-time), hoặc shared Google Slides.
Onsite: Whiteboard vật lý — viết chữ to, rõ ràng, cấu trúc box + arrow.
Không dùng: PowerPoint có template sẵn — interviewer biết bạn đang present, không think live.

3

Interviewer Đánh Giá Gì

Tiêu chí	Biểu hiện tốt	Biểu hiện kém
Requirement gathering	Hỏi clarify trước khi design — scale, latency, budget	Bắt đầu vẽ ngay khi chưa rõ requirement
Architecture thinking	Identify đủ component, luồng dữ liệu rõ	Thiếu component, diagram mơ hồ
Trade-off analysis	Nêu được ưu/nhược của từng lựa chọn, giải thích tại sao chọn	Chỉ nêu tool name, không giải thích why
Scaling	Biết khi nào bottleneck, cách horizontal/vertical scale	"Chỉ cần add more server" không chi tiết
Failure handling	Nêu failure scenario cụ thể + fallback/alert	Bỏ qua phần monitoring hoàn toàn
Communication	Nói to suy nghĩ, hỏi xác nhận hướng đi	Im lặng vẽ 10 phút rồi mới giải thích

Interviewer không tìm kiếm perfect solution — họ đánh giá structured thinking và communication. Thừa nhận unknowns là tốt: "Tôi chưa làm ở scale này nhưng approach của tôi sẽ là..."

4

Framework 6 Bước

Áp dụng nhất quán cho bất kỳ AI system design nào:

Requirements — Functional + non-functional.
Capacity estimate — Storage, QPS, cost.
High-level architecture — Component + data flow.
Deep dive — Critical component.
Scaling — Handle 10x growth.
Failure handling — Monitor + alert + fallback.

Phần còn lại của bài này walk-through từng bước với example cụ thể:

Prompt: "Design a RAG chatbot for internal company knowledge base."

5

Bước 1 — Requirements Gathering

Dành 5–10 phút đầu để clarify. Đừng bỏ qua bước này — nó cho thấy bạn không assume blindly.

Câu Hỏi Functional

User type: End-user (customer)? Internal team? Developer qua API?
Document type: PDF? Markdown? HTML? Code repository?
Document size: Bao nhiêu file? Kích thước trung bình?
Update frequency: Static (upload một lần)? Daily ingest? Real-time?
Query pattern: Factual Q&A? Multi-turn conversation? Multi-language?
Citation: Phải cite source document không?
Latency target: < 2s? < 5s? Có streaming không?

Câu Hỏi Non-Functional

Scale: Bao nhiêu concurrent user? Bao nhiêu QPS peak?
Availability: 99%? 99.9%? Downtime budget?
Cost: Budget tháng là bao nhiêu?
Security: Có PII không? Auth cơ chế gì?
Compliance: GDPR? Data residency requirement?

Example — Sau Khi Clarify

Context sau khi hỏi:
- 500 nhân viên, peak ~50 query/ngày, ~5 QPS
- 10,000 documents (50 trang trung bình), Markdown + PDF
- Updated daily bởi content team
- Cần cite source (tên doc + section)
- Latency < 3s end-to-end
- Uptime 99% (downtime < 7.2h/tháng OK)
- Budget $500/tháng

Với requirement đã rõ, bạn bắt đầu design — không phải trước đó.

6

Bước 2 — Capacity Estimate

Ước lượng nhanh để chọn tech stack phù hợp — không cần chính xác tuyệt đối, cần order of magnitude đúng.

Storage

Raw text:
  10,000 docs × 50 trang × 500 token/trang × ~4 bytes/token ≈ 1GB

Chunks sau khi split (500 token/chunk, overlap 50):
  10,000 × 50 trang ≈ 500,000 chunks

Embeddings (text-embedding-3-small, 1536 dim, float32):
  500,000 × 1,536 × 4 bytes ≈ 3GB

Metadata JSON (source, date, section, dept):
  500,000 × ~1KB ≈ 500MB

Tổng vector DB: ~3.5GB → Single node Qdrant đủ dùng

Compute — Latency Per Request

Embed query (OpenAI API):  ~100ms
Cache check (Redis):        ~5ms
Vector search (Qdrant):     ~50ms (HNSW, top-k=5)
LLM generate (gpt-4o-mini): ~1,000–2,000ms
Streaming TTFT:             ~300–500ms

P95 total: ~1.5–2.5s (within <3s target)

Cost Estimate — Per Month

Component	Chi tiết	Ước lượng
Embedding (ingestion, 1 lần)	500k chunks × ~250 token × $0.00002/1k token	~$2.50 one-time
Embedding (query)	50 query/ngày × 30 ngày × 100 token	~$0.003/tháng
LLM (gpt-4o-mini)	1,500 query/tháng × 4k token avg × $0.15/1M input	~$30/tháng
Vector DB (Qdrant self-host)	VPS 2vCPU 4GB RAM	~$20/tháng
App server (FastAPI)	2vCPU 4GB	~$20/tháng
Redis cache	Upstash free tier → paid nếu cần	~$5/tháng
Object storage (S3-compatible)	1GB raw doc	~$1/tháng

Tổng: ~$76/tháng — nằm trong budget $500. Có nhiều headroom để scale hay thêm feature.

7

Bước 3 — High-Level Architecture

Diagram Tổng Quan

┌──────────┐
│   User   │
└────┬─────┘
     │ HTTPS query
     ↓
┌─────────────┐
│ API Gateway │  (Auth JWT/API-key, Rate limit 100 req/min/user)
└──────┬──────┘
       ↓
┌─────────────┐     ┌──────────┐
│  FastAPI    │────▶│  Redis   │  (Cache: exact-match query → response, TTL 24h)
│  Service    │     └──────────┘
└──────┬──────┘
       │
       ├──▶ ┌───────────────┐
       │    │ Embed Service │  (OpenAI text-embedding-3-small)
       │    └───────────────┘
       │
       ├──▶ ┌───────────────┐
       │    │  Vector DB    │  (Qdrant, HNSW, top-k=5)
       │    └───────────────┘
       │
       └──▶ ┌───────────────┐
            │  LLM Service  │  (OpenAI gpt-4o-mini, streaming)
            └───────────────┘

─── Ingestion path (offline, cron daily) ───────────────────

┌───────────────┐
│  Ingestion    │  (Cron job / Airflow DAG)
│  Pipeline     │
└──────┬────────┘
       │
       ├──▶ Document Store (S3 / MinIO)
       ├──▶ Embed Service (batch)
       └──▶ Vector DB (upsert)

Hai Luồng Chính

Query path (online):

User gửi query qua HTTPS.
API Gateway verify auth, rate limit.
FastAPI check Redis cache — nếu hit, trả về ngay.
Nếu miss: embed query → vector search → lấy top-k chunks → build prompt → call LLM → stream response về user.
Lưu response vào Redis cache.

Ingestion path (offline):

Cron job chạy hàng ngày (hoặc trigger manual).
Load doc mới từ source (Google Drive, SharePoint, upload API).
Lưu raw doc vào Document Store.
Split → embed (batch) → upsert vào Vector DB.
Chunk ID = hash(content) → idempotent, re-run an toàn.

8

Bước 4 — Deep Dive Component

a) Chunking Strategy

Quyết định quan trọng nhất ảnh hưởng chất lượng retrieval:

Recursive character splitter: 500 token, overlap 50. Đây là default tốt cho text thường.
Markdown header-aware: Giữ nguyên section heading để context không bị mất. LangChain MarkdownHeaderTextSplitter handle được.
Code-aware: Không split giữa chừng một function — dùng Language.PYTHON splitter.
PDF: Extract text layer trước (pdfminer.six hoặc pymupdf), rồi mới chunk.

Mỗi chunk lưu kèm metadata: source_doc, page, section, department, updated_at.

b) Retrieval

Vector search: Cosine similarity, top-k = 5 (tăng lên 8 nếu docs dài).
Hybrid search: Kết hợp vector + BM25 (keyword) với Reciprocal Rank Fusion — tốt hơn pure vector khi query ngắn hoặc có tên riêng.
Metadata filter: Lọc theo department hoặc date trước khi vector search giảm search space.
Re-rank (optional): Cohere Rerank hoặc cross-encoder model sau khi lấy top-k → cải thiện precision, thêm latency ~100–200ms và cost.

c) Prompt Engineering

System: You answer questions based on the provided context only.
For each claim, cite the source as [doc_name, section].
If the answer is not in the context, respond:
"I don't have information on this in the knowledge base."
Do not speculate beyond what is stated.

Context:
[1] (source: Engineering/onboarding.md, Section 3)
...chunk text...

[2] (source: HR/benefits-2025.pdf, Page 12)
...chunk text...

User: {query}

Temperature = 0 để giảm hallucination. Không thêm kiến thức ngoài context.

d) Streaming Response

FastAPI dùng StreamingResponse với Server-Sent Events (SSE).
Token được trả về từng chunk ngay khi LLM generate — TTFT (time to first token) < 500ms.
User thấy text xuất hiện dần thay vì chờ 2 giây trước khi nhận toàn bộ response.

e) Ingestion — Chi Tiết Idempotency

import hashlib

def generate_chunk_id(content: str, source: str) -> str:
    """Stable ID — re-ingesting same content không tạo duplicate."""
    return hashlib.sha256(f"{source}::{content}".encode()).hexdigest()[:16]

# Khi upsert vào Qdrant:
# - Nếu chunk_id đã tồn tại → skip (idempotent)
# - Nếu doc bị xóa → delete by filter(source_doc=...)

9

Bước 5 — Scaling

10x User (5,000 nhân viên, 50 QPS)

Component	Hiện tại (5 QPS)	Sau 10x (50 QPS)
FastAPI	1 instance	Horizontal scale (2–4 instance), load balancer
Redis	Single node	Redis Cluster hoặc tăng memory instance
Qdrant	Single node	Qdrant cluster (replication factor 2)
OpenAI LLM	Tier 1	Upgrade tier, tăng RPM limit
Cost	~$76/tháng	~$300–400/tháng (vẫn trong budget $500)

10x Document (100,000 docs)

Chunks tăng lên 5M → Qdrant cluster 3 node vẫn handle được.
Ingestion pipeline cần parallel worker — Celery + Redis broker, 4–8 worker.
Embedding batch: $0.00002/1k token × 5M chunks × 250 token ≈ $25 one-time.
HNSW index rebuild không realtime — Qdrant optimize index ngoài giờ thấp.

Multi-Region (nếu global)

Replicate Vector DB sang region gần user — latency vector search giảm 50–100ms.
LLM call: route sang region gần nhất (OpenAI hỗ trợ US/EU endpoint).
Read replica cho Vector DB, write về primary.
Cache layer tại mỗi region — reduce cross-region traffic.

10

Bước 6 — Failure Handling

Failure Scenarios Và Xử Lý

Failure	Impact	Xử lý
LLM API down (OpenAI outage)	Toàn bộ query fail	Fallback sang provider phụ (Anthropic Claude Haiku) — circuit breaker pattern
Vector DB slow / timeout	Latency tăng, timeout	Aggressive cache, timeout 2s → return cached response hoặc error graceful
Embedding API rate limit	Query không embed được	Retry với exponential backoff (0.5s, 1s, 2s), queue query
Ingestion pipeline fail	Doc mới không được index	Dead Letter Queue (DLQ) + Slack alert, retry hàng ngày
Redis cache miss rate cao	Tốn token LLM hơn dự kiến	Tăng TTL, điều tra cache key strategy

Monitoring Metrics

Latency: P50, P95, P99 per endpoint (target P95 < 3s).
Error rate: 5xx / total request — alert khi > 1%.
LLM cost: Token/ngày — alert khi spike > 50% so với baseline.
Cache hit rate: Target > 30% để tiết kiệm cost.
Retrieval score: Average cosine similarity của top-1 chunk — nếu giảm, embedding drift hoặc doc quality vấn đề.
Ingestion lag: Số doc chờ xử lý — alert nếu queue > 100.

Alerting

LLM API error rate > 5% liên tục 5 phút → trigger failover sang backup provider.
Daily cost spike > 50% so với 7-day average → page on-call.
P95 latency > 5s trong 10 phút → alert.
Ingestion job fail 3 lần liên tiếp → Slack alert đến content team.

11

Quality Monitoring — RAG Specific

Ngoài system metric, RAG cần theo dõi quality metric riêng:

Logging Cho Evaluation

Sample 5% query production → log: query, retrieved chunks, full prompt, LLM response.
Không log toàn bộ (cost + storage) — 5% là đủ để detect pattern.
Mask PII trước khi log nếu có requirement.

Eval Set — Golden Dataset

Xây dựng 100–200 cặp (question, ground-truth answer) từ document thực.
Chạy weekly eval tự động trên eval set → track metric qua thời gian.
Alert nếu Faithfulness hoặc Answer Relevancy giảm > 10% so với baseline.

Ragas Metric

Metric	Đo gì	Threshold tham chiếu
Faithfulness	Response có dựa trên context không? (không hallucinate)	> 0.85
Answer Relevancy	Response có trả lời đúng câu hỏi không?	> 0.80
Context Recall	Retrieval có lấy đủ context cần thiết không?	> 0.75

Ragas (github.com/explodinggradients/ragas) cung cấp sẵn pipeline đánh giá các metric này. Tham khảo Ragas docs v0.1.x cho cách integrate với LangChain.

12

Trade-off Discussion

Khi interviewer hỏi "Tại sao chọn X mà không phải Y?", cần justify rõ ràng:

Embedding Model

Option	Ưu điểm	Nhược điểm
OpenAI text-embedding-3-small	Chất lượng tốt, maintenance zero	Ongoing cost, vendor lock-in
bge-large-en (self-host)	Không có per-call cost sau khi deploy	Cần GPU/CPU infra, slightly lower MTEB score
multilingual-e5-large	Hỗ trợ multi-language tốt hơn	1024 dim, cần test benchmark với corpus thực

Với budget $500 và 500 user: OpenAI API rẻ hơn self-host ở scale này. Self-host cost-effective khi embed volume > 10M token/tháng.

LLM

Model	Cost (input)	Khi nào dùng
gpt-4o-mini	$0.15/1M token	Default — tốt cho factual Q&A, nhanh
gpt-4o	$2.50/1M token (~17x đắt hơn)	Khi cần reasoning phức tạp hoặc synthesis nhiều chunk
Claude Haiku (Anthropic)	$0.25/1M token	Backup provider, hỗ trợ Vietnamese tốt

Vector DB

Option	Chi phí	Ops overhead
Pinecone (managed)	$50–70/tháng starter	Gần zero — scale tự động
Qdrant (self-host)	~$20/tháng VPS	Cần manage backup, upgrade, monitoring
Weaviate Cloud	Free sandbox, $25+ paid	Ít hơn self-host

Chunk Size

Size	Ưu điểm	Nhược điểm
200 token	Retrieval precise	Mất context xung quanh câu trả lời
500 token	Balance giữa precision và context	Sweet spot cho hầu hết use case
1,000 token	Nhiều context hơn	Dilute embedding, tốn LLM context window hơn

13

Follow-up Questions Thường Gặp

Chuẩn bị trước các câu hỏi follow-up điển hình:

Câu hỏi	Hướng trả lời
"How do you handle multi-language?"	Dùng multilingual embedding model (e.g., multilingual-e5-large); prompt instruct LLM trả lời theo ngôn ngữ của query.
"How do you reduce hallucination?"	Temperature = 0, prompt force cite source, faithfulness eval hàng tuần, track drop > 10%.
"How do you cite source?"	Metadata (source_doc, section) đính kèm mỗi chunk; prompt instruct format [doc, section]; response parser extract citations.
"How do you handle PII in documents?"	Pre-process strip PII trước khi ingest (presidio hoặc regex pattern); mask trong log.
"How do you A/B test a new chunking strategy?"	Route 10% traffic sang variant mới; compare faithfulness + answer relevancy trên eval set; rollback nếu metric giảm.
"What if a document is updated during a query?"	Snapshot pattern — query đọc từ stable snapshot; ingestion cập nhật version mới sau. Eventual consistency là OK cho internal knowledge base.
"How do you handle confidential documents with access control?"	Tag metadata per-chunk với permission level; filter vector search theo permission của user trước khi retrieve.

14

Pitfalls Phổ Biến

Lỗi	Hệ quả	Cách tránh
Bắt đầu vẽ diagram ngay khi chưa clarify requirement	Design cho sai use case, phải làm lại	Luôn hỏi clarify 5–10 phút đầu
Bỏ qua non-functional requirement (scale, cost)	Chọn tech stack không phù hợp	Hỏi QPS, budget, availability trước khi design
Over-engineering	Design cho 1M user khi chỉ cần 500	Design cho requirement thực tế, note cách scale khi cần
Bỏ qua monitoring và failure discussion	Điểm trừ lớn trong rubric	Luôn mention monitoring ở bước cuối dù có ít thời gian
Vague khi bị hỏi detail	"We'd use a cache" mà không nói Redis, TTL, invalidation	Cụ thể hóa: tech, số, cơ chế
Dùng sai thuật ngữ	Nhầm "embedding" với "vector", "chunk" với "token"	Ôn lại định nghĩa — bài 33 Module 7 có danh sách

15

Practice Strategy

Tài Liệu Tham Khảo

"Designing Data-Intensive Applications" (Martin Kleppmann, O'Reilly 2017) — nền tảng hệ thống phân tán: replication, partitioning, consistency. Không về AI nhưng là nền tảng cần thiết.
ByteByteGo (bytebytego.com) — newsletter + YouTube channel. Alex Xu có nhiều AI system design walkthrough trong newsletter 2024.
Ragas paper (arXiv:2309.15217, 2023) — framework đánh giá RAG pipeline.
HNSW paper (arXiv:1603.09320, Malkov & Yashunin 2016) — thuật toán indexing cho approximate nearest neighbor.

Practice Plan

Build 1 RAG project end-to-end (bài 13 series này) — có experience thực tế trước khi design trên giấy.
Chạy mock system design 5–10 lần với bạn hoặc mentor — mỗi lần 45 phút, record lại để review.
Sau mỗi mock: note lại phần nào còn vague, bổ sung vào prep notes.
Thực hành thiết kế nhiều variant: RAG cho code search, RAG cho legal doc, RAG cho customer support — mỗi case có khác biệt nhỏ về requirement.

16

Bài Tiếp Theo

Bài 36: AI System Design — kiến trúc 1 ML inference pipeline

Danh sách bài viết