Bài 33: Câu hỏi technical về LLM / RAG / Agent

1

Mục Tiêu Bài Học

Sau bài này bạn sẽ:

✅ Trả lời được 30 câu hỏi LLM/RAG/Agent ở mức technical interview
✅ Phân biệt đúng các khái niệm hay bị nhầm: encoder vs decoder, RAG vs fine-tune, LoRA vs full FT
✅ Biết cách gắn câu trả lời vào project thực tế của mình
✅ Nắm được những topic mới 2024–2025 dễ bị hỏi thêm

2

Cấu Trúc 30 Câu

Câu hỏi LLM/RAG/Agent trong phỏng vấn AI Engineer đo hai thứ: hiểu architecture ở mức cơ học và kinh nghiệm thực tế (đã build gì, gặp trade-off nào). Bài này tổ chức theo 6 nhóm:

Category	Câu	Nội dung
LLM Foundations	Q1–Q5	Transformer, BERT vs GPT, attention, tokenization
Prompting & Generation	Q6–Q10	Temperature, sampling, CoT, function calling
RAG	Q11–Q15	Pipeline, chunking, embedding, evaluation
Vector Database	Q16–Q20	HNSW, hybrid search, metadata filtering
Fine-tuning	Q21–Q25	LoRA, QLoRA, RLHF, DPO, catastrophic forgetting
Agent + Production	Q26–Q30	ReAct, multi-agent, hallucination, cost, production challenges

Mức độ được hỏi phụ thuộc role: role LLM Engineer hỏi sâu hơn category 1–3; role MLOps hỏi nhiều hơn category 6. RAG (Q11–Q15) là phần xuất hiện phổ biến nhất trong phỏng vấn 2024–2025 do hầu hết team đều đang build RAG application.

3

Category 1 — LLM Foundations (Q1–Q5)

Q1: Explain Transformer architecture.

Transformer (Vaswani et al., 2017 — "Attention Is All You Need") có hai dạng chính:

Encoder-decoder: Encoder đọc input, decoder sinh output tuần tự (T5, BART). Dùng cho translation, summarization.
Decoder-only: Không có encoder, chỉ decoder sinh token từ trái sang phải (GPT, LLaMA, Gemini). Dùng cho text generation.

Mỗi block transformer gồm:

Multi-head attention — tìm mối quan hệ giữa các token.
Feed-forward network (FFN) — 2 linear layer, activation ở giữa (GELU, SiLU).
Residual connection + Layer Norm — tránh vanishing gradient, ổn định training.

Positional encoding:

Sinusoidal (bản gốc): công thức sin/cos theo position và dimension.
Learned: embedding được học cùng model (BERT).
RoPE (Rotary Position Embedding): xoay vector Q và K theo position. LLaMA, Gemma, Mistral dùng RoPE — cho phép extrapolate ra context dài hơn lúc train.

Điểm hay bị hỏi thêm: "Tại sao cần position encoding?" — vì attention không phân biệt thứ tự token; nếu không có PE, "cat sat on mat" và "mat on sat cat" cho cùng output.

Q2: BERT vs GPT — khác biệt chính là gì?

Tiêu chí	BERT	GPT
Architecture	Encoder-only	Decoder-only
Attention direction	Bidirectional — thấy token cả hai phía	Causal — chỉ thấy token bên trái
Pre-training task	Masked Language Modeling (điền vào chỗ trống)	Next token prediction (đoán token tiếp theo)
Ứng dụng	Classification, NER, embedding, semantic search	Text generation, chatbot, completion

Nguyên tắc thực tế: Khi cần embedding để tìm kiếm hay classify text → dùng encoder (BERT, RoBERTa, sentence-transformers). Khi cần sinh text → dùng decoder (GPT, LLaMA).

Q3: Self-attention — cơ chế tính toán?

Self-attention tính output cho mỗi token bằng công thức:

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

Trong đó:

Q (Query), K (Key), V (Value): 3 linear projection của input embedding.
d_k: dimension của key vector.
sqrt(d_k): scale factor — nếu không scale, dot product lớn → softmax bão hòa → gradient nhỏ.

Output tại mỗi position là weighted sum của tất cả V, với weight phụ thuộc vào độ tương đồng giữa Q của token đó và K của toàn bộ sequence.

Ví dụ: Trong câu "The cat sat on it", khi tính attention cho "it", Q của "it" sẽ có weight cao nhất với K của "cat" — model học được "it" refer đến "cat".

Độ phức tạp: O(n²·d) với n là sequence length. Đây là lý do context dài tốn memory — bài 10 mục "Bonus" đề cập giải pháp.

Q4: Multi-head attention — vì sao multi-head?

Thay vì dùng 1 attention function trên full dimension, multi-head attention chia thành h head, mỗi head học một loại quan hệ khác nhau:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_O

head_i = Attention(Q*W_Q_i, K*W_K_i, V*W_V_i)

Lý do:

Mỗi head có thể chú ý đến quan hệ khác nhau: syntactic, semantic, coreference, positional.
Parallel computation — không tăng compute so với 1 head cùng dimension vì split dimension: h head × (d_model/h) = d_model.
Richer representation từ nhiều "góc nhìn" trên cùng input.

Q5: Tokenization — BPE là gì?

BPE (Byte Pair Encoding) là sub-word tokenization: không tách theo từ cũng không tách theo ký tự, mà học các đơn vị sub-word phổ biến.

Thuật toán:

Bắt đầu từ character-level vocabulary.
Đếm tần suất mọi cặp ký tự liền nhau trong corpus.
Merge cặp phổ biến nhất thành 1 token mới.
Lặp lại cho đến khi đạt vocab size target.

Ví dụ: "unaffable" → ["un", "##aff", "##able"] (tùy vocab). Từ lạ vẫn tokenize được, không bị unknown.

Các variant phổ biến:

WordPiece (BERT, DistilBERT): merge theo max likelihood thay vì frequency.
SentencePiece (T5, LLaMA): train trực tiếp trên raw text, không cần pre-tokenize bằng space.
tiktoken (OpenAI GPT-3.5/4): BPE trên byte level — không bị lỗi với Unicode hiếm.

Điểm hay bị hỏi thêm: "Tokenizer ảnh hưởng gì đến cost?" — nhiều token hơn = nhiều token input/output = chi phí cao hơn. Tiếng Việt và tiếng Trung thường bị tokenize kém hiệu quả hơn tiếng Anh với cùng BPE vocab.

4

Category 2 — Prompting & Generation (Q6–Q10)

Q6: Temperature trong sampling ảnh hưởng thế nào?

Temperature T scale logit trước softmax:

P(token_i) = softmax(logits / T)

T → 0: Distribution nhọn về token có xác suất cao nhất — deterministic (greedy decoding).
T = 1: Distribution gốc của model, không thay đổi.
T > 1: Distribution phẳng hơn — đa dạng nhưng dễ mất coherence.

Thực tế:

Factual QA, code generation, data extraction: T=0 hoặc T≤0.2.
Chatbot thông thường: T=0.7.
Creative writing, brainstorming: T=0.9–1.2.

Temperature không ảnh hưởng đến tốc độ inference — chỉ thay đổi bước sampling cuối cùng.

Q7: Top-K vs Top-P (nucleus) sampling?

Method	Cơ chế	Hạn chế
Greedy	Chọn token max probability	Repetitive, thiếu đa dạng
Top-K	Sample trong K token có prob cao nhất	K cứng — không adapt khi distribution thay đổi
Top-P (nucleus)	Sample trong tập nhỏ nhất có cumulative prob ≥ P	P phụ thuộc domain — cần tune

Top-P linh hoạt hơn Top-K: khi model confident (distribution nhọn), tập sample nhỏ; khi model uncertain (distribution phẳng), tập sample lớn. Nhiều API cho phép kết hợp cả hai.

# OpenAI API
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    temperature=0.7,
    top_p=0.9,          # nucleus sampling
    # top_k không có trong OpenAI API; có trong Anthropic, Cohere
)

Q8: Chain-of-Thought (CoT) là gì?

CoT là kỹ thuật prompt khuyến khích model sinh intermediate reasoning steps trước khi đưa ra answer. Kết quả là accuracy cao hơn trên reasoning task (Wei et al., 2022).

Zero-shot CoT: Thêm "Let's think step by step." vào cuối câu hỏi. Đơn giản, hiệu quả với model lớn (GPT-4, Claude).

Few-shot CoT: Cung cấp 2–5 example có reasoning chain đầy đủ. Tốt hơn zero-shot cho task phức tạp nhưng tốn token.

Giới hạn:

CoT không giúp nhiều với model nhỏ (<7B) — model không đủ capacity để reason đúng.
CoT steps có thể sai nhưng answer vẫn đúng (chain không nhất thiết là chain thực tế của model).
Tốn thêm token → tốn thêm chi phí và latency.

Q9: Zero-shot vs few-shot vs fine-tuning — khi nào dùng?

Thứ tự thử theo chi phí tăng dần:

Zero-shot: Chỉ mô tả task trong system prompt. Free, không cần data. Thử trước tiên.
Few-shot: Thêm 2–8 example vào prompt. Không cần train, nhưng tốn token per request.
RAG: Inject relevant context vào prompt. Tốt cho knowledge-intensive task, không cần retrain.
Fine-tuning: Train thêm trên task-specific data. Cần data, compute, thời gian. Dùng khi ba bước trên không đủ hoặc cần cải thiện format/style.

Fine-tuning không tự động tốt hơn prompting — đặc biệt với proprietary model (GPT-4), fine-tuning GPT-3.5 không nhất thiết vượt GPT-4 zero-shot.

Q10: Function calling / Tool use hoạt động thế nào?

Model được cung cấp schema của một hoặc nhiều function (JSON). Khi model quyết định cần gọi tool, nó trả về JSON chứa tên function và arguments thay vì text thông thường. Application execute function → trả kết quả → model tiếp tục generate.

# OpenAI function calling (v1 SDK)
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_database",
            "description": "Search product database by keyword",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "limit": {"type": "integer", "default": 5}
                },
                "required": ["query"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
    tool_choice="auto"
)

# Nếu model gọi tool
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)
    result = search_database(**args)          # application thực thi
    # Append result vào messages, gọi lại model

Use case: Agent gọi search engine, calculator, database query, API ngoài. Function calling là cơ chế nền tảng để build agent — model không tự "thực thi" code, chỉ output instruction.

5

Category 3 — RAG (Q11–Q15)

Q11: RAG là gì? Vì sao cần?

Retrieval-Augmented Generation (Lewis et al., 2020) kết hợp retrieval component với LLM generation:

LLM có knowledge cutoff — không biết thông tin sau ngày train.
Context window có giới hạn — không nhét toàn bộ document base vào prompt.
RAG giải quyết cả hai bằng cách retrieve chỉ những đoạn văn bản liên quan, inject vào prompt runtime.

So với fine-tuning: RAG không thay đổi weight của model. Knowledge nằm trong document store bên ngoài — dễ update, dễ trace nguồn (citation), không cần retrain khi data thay đổi.

Giới hạn của RAG:

Retrieval sai → answer sai (garbage in, garbage out).
Thêm latency cho bước embedding + vector search.
Không giúp model học pattern hay style mới.

Q12: Mô tả RAG pipeline từ đầu đến cuối.

Giai đoạn Ingest (offline):

Documents (PDF, HTML, Markdown)
  → Load & parse
  → Split thành chunks
  → Embed mỗi chunk → vector
  → Store vector + metadata vào vector DB

Giai đoạn Query (online):

User question
  → Embed question → query vector
  → Similarity search → top-K chunks
  → Format prompt: [system] + [context chunks] + [question]
  → LLM generate answer
  → (Optional) re-rank, cite sources

Điểm hay bị hỏi thêm: "Query và document có cần dùng cùng embedding model không?" — Có, bắt buộc. Nếu dùng khác model, vector nằm ở hai không gian khác nhau, cosine similarity không có nghĩa gì.

Q13: Chọn chunk size thế nào?

Chunk size ảnh hưởng đến precision và recall của retrieval:

Chunk size	Ưu điểm	Nhược điểm
Nhỏ (100–200 token)	Precision cao — retrieved chunk rất liên quan	Thiếu context, câu answer bị incomplete
Vừa (400–600 token)	Balance tốt cho hầu hết use case	Vẫn cần eval
Lớn (1000+ token)	Context đầy đủ	Dilute relevance, tốn context window, slow LLM

Thực tế: Không có con số tối ưu áp dụng cho mọi domain. Cần eval với golden set câu hỏi + expected answer. LangChain RecursiveCharacterTextSplitter với chunk_size=500, chunk_overlap=50 là điểm khởi đầu hợp lý cho văn bản tiếng Anh/Việt thông thường.

Kỹ thuật nâng cao:

Sentence-level chunking: Giữ nguyên câu hoàn chỉnh, không cắt giữa câu.
Parent-child chunking: Embed chunk nhỏ, nhưng trả về chunk lớn hơn cho LLM context.
Semantic chunking: Cắt theo thay đổi chủ đề (embedding distance threshold).

Q14: Chọn embedding model thế nào?

Các lựa chọn chính:

Model	Type	Dimension	Ghi chú
text-embedding-3-small	Closed (OpenAI)	1536 (mặc định)	Rẻ, nhanh, đủ tốt cho general use
text-embedding-3-large	Closed (OpenAI)	3072	Tốt hơn cho domain phức tạp
embed-english-v3.0	Closed (Cohere)	1024	Tốt cho retrieval task
bge-large-en-v1.5	Open (BAAI)	1024	Self-host, top MTEB leaderboard
all-MiniLM-L6-v2	Open (sentence-transformers)	384	Nhẹ, nhanh, prototype
nomic-embed-text-v1.5	Open (Nomic)	768	Matryoshka — có thể giảm dimension

Tiêu chí chọn:

Multilingual? Nếu có tiếng Việt: multilingual-e5-large hoặc paraphrase-multilingual-mpnet-base-v2.
Self-host hay managed? Open model = self-host, không trả token cost; closed = trả theo usage.
Eval trên domain của bạn — benchmark MTEB không phản ánh chính xác domain-specific performance. Luôn test trên 100–200 câu hỏi golden set của domain thực tế.

Q15: Đánh giá RAG — dùng metric nào?

Framework đánh giá RAG phổ biến nhất là RAGAS (Es et al., 2023):

Metric	Đo gì	Input cần
Faithfulness	Answer có grounded trong retrieved context không — không hallucinate	Question, answer, context
Answer Relevancy	Answer có liên quan đến question không	Question, answer
Context Precision	Top-K chunks có relevant không — retrieval precision	Question, context, ground truth
Context Recall	Chunks có chứa đủ thông tin để trả lời không — retrieval recall	Context, ground truth

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

# Cần có ground truth để eval
data = {
    "question": [...],
    "answer": [...],        # LLM answer
    "contexts": [...],      # retrieved chunks (list of list)
    "ground_truth": [...]   # expected answer
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(result)

Thực tế: Faithfulness là metric quan trọng nhất — answer thấp điểm faithfulness nghĩa là model đang hallucinate ngoài context. Context Precision thấp cho biết chunking hoặc embedding model không tốt.

6

Category 4 — Vector Database (Q16–Q20)

Q16: Vector DB khác relational DB thế nào?

Relational DB (PostgreSQL, MySQL) và vector DB giải quyết bài toán khác nhau, không thay thế nhau:

Tiêu chí	Relational DB	Vector DB
Storage unit	Row + column (structured)	High-dimensional vector (float array)
Query type	Exact match, range, join	Approximate nearest neighbor (ANN)
Distance metric	N/A	Cosine similarity, L2 (Euclidean), dot product
Use case	OLTP, transactional, structured reporting	Semantic search, recommendation, RAG

Nhiều hệ thống dùng cả hai: relational DB cho structured data (user profile, order), vector DB cho semantic search (document retrieval).

Q17: HNSW index hoạt động thế nào?

HNSW (Hierarchical Navigable Small World) là index structure phổ biến nhất trong vector DB (Malkov & Yashunin, 2016):

Xây dựng multi-layer graph: layer trên thưa (ít node, long-range link), layer dưới dày (nhiều node, short-range link).
Search bắt đầu từ layer trên cùng, greedy navigate xuống gần entry point của query, rồi refine ở layer dưới.
Độ phức tạp: O(log N) — so với brute force O(N).

Trade-off:

Memory: graph structure tốn bộ nhớ — khoảng 50–100 byte/vector ngoài vector data. 1M vector × 1536-dim float32 ≈ 6GB vector + overhead graph.
Build time: chậm hơn flat index nhưng query nhanh hơn nhiều.
Tham số quan trọng: ef_construction (build quality), M (số edge mỗi node), ef_search (query recall vs speed).

Q18: Khi nào dùng ChromaDB vs Pinecone vs Qdrant?

Vector DB	Dùng khi	Không dùng khi
ChromaDB	Local prototype, < vài triệu vector, đội <= 1 người	Production, nhiều user concurrent
Pinecone	Cần managed cloud, không muốn tự operate, scale lớn	Budget tight (đắt hơn open-source), cần self-host vì compliance
Qdrant	Self-host production, cần payload filtering mạnh, open-source	Không có ops bandwidth để tự quản lý
Weaviate	Hybrid search (vector + BM25) built-in, knowledge graph	Pure vector search — overhead quá lớn
pgvector	Đã dùng PostgreSQL, vector volume nhỏ–vừa, muốn ít infra thêm	Hàng chục triệu vector, cần ANN tốc độ cao

Q19: Hybrid search là gì?

Hybrid search kết hợp:

Dense search: Embedding similarity (cosine). Tốt cho semantic query ("nói về machine learning").
Sparse search (BM25): Keyword matching dựa trên TF-IDF. Tốt cho exact term ("GPT-4o release date", "python IndexError").

Kết quả từ hai nguồn được merge bằng Reciprocal Rank Fusion (RRF):

score(doc) = sum over each ranker: 1 / (k + rank(doc))

k thường = 60 (constant để giảm ảnh hưởng của rank cực cao).

Khi nào hybrid tốt hơn pure dense:

Query chứa tên riêng, mã sản phẩm, từ kỹ thuật hiếm.
Code search — BM25 catch exact function name, dense catch semantic intent.
Legal/medical domain với terminology strict.

Q20: Metadata filtering trong vector search?

Metadata filtering cho phép kết hợp ANN search với filter trên structured attribute:

# Qdrant example: search vector + filter by date and category
results = client.search(
    collection_name="docs",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[
            FieldCondition(key="date", range=DatetimeRange(gte="2024-01-01")),
            FieldCondition(key="category", match=MatchValue(value="technical"))
        ]
    ),
    limit=10
)

Hai chiến lược:

Pre-filter: Apply metadata filter trước ANN search — tập nhỏ hơn, ANN chính xác hơn, nhưng có thể miss nếu filter quá hẹp.
Post-filter: ANN trước, filter sau — dễ implement nhưng top-K thực tế nhỏ hơn K.

Index metadata field riêng để filter nhanh — không để metadata là nested JSON không indexed.

7

Category 5 — Fine-tuning (Q21–Q25)

Q21: Khi nào nên fine-tune vs dùng RAG?

Quyết định dựa trên loại vấn đề cần giải quyết:

Vấn đề	Giải pháp phù hợp
Model không biết thông tin mới / nội bộ	RAG — knowledge nằm ngoài model
Model cần output đúng format cứng (JSON schema, CSV)	Fine-tuning — học format pattern
Model cần tone/style nhất quán của brand	Fine-tuning — học style từ example
Model cần perform tốt hơn trên task domain cụ thể	Fine-tuning (nếu RAG không đủ)
Knowledge thay đổi thường xuyên	RAG — update document store, không retrain
Cần cite nguồn	RAG — có retrieved chunks để reference

Default: Thử RAG trước. Fine-tuning có chi phí cao hơn (data collection, training compute, eval iteration) và không tự động update knowledge.

Q22: Full fine-tuning vs PEFT (LoRA) — khác biệt?

Full fine-tuning: Update toàn bộ weight của model. Với LLM 7B–70B, tốn nhiều GPU VRAM (hàng chục đến hàng trăm GB), có risk catastrophic forgetting.

LoRA (Low-Rank Adaptation — Hu et al., 2021):

Freeze toàn bộ pre-trained weight.
Thêm 2 low-rank matrix A, B vào mỗi linear layer được chọn: ΔW = A × B, rank r << d.
Chỉ train A và B — khoảng 0.1–1% số parameter so với full FT.
Inference: merge ΔW vào W gốc — không tăng latency.

QLoRA (Dettmers et al., 2023): Quantize base model xuống 4-bit (NF4), sau đó train LoRA adapter ở precision cao hơn. Cho phép fine-tune LLaMA-13B trên 1× A100 (24GB VRAM) hoặc 2× RTX 3090.

# QLoRA với HuggingFace + bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16"
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~13M || all params: ~13B || trainable%: 0.10%

Q23: Dataset cho instruction tuning — cần gì?

Format chuẩn cho instruction fine-tuning:

{
  "instruction": "Classify the sentiment of this review.",
  "input": "The product broke after 2 days. Terrible quality.",
  "output": "Negative"
}

Nguyên tắc quality:

Quality quan trọng hơn quantity. 1,000 example chất lượng cao thường tốt hơn 100,000 example nhiễu.
Diverse instruction — không chỉ 1 loại task.
Output phải nhất quán và đúng — label noise gây hại trực tiếp đến model.

Nguồn dataset public:

Alpaca (52K, Stanford)
Dolly (15K, Databricks)
OpenHermes-2.5 (1M, filtered từ nhiều source)
Orca-2 (Microsoft, synthetic reasoning data)

Q24: RLHF vs DPO — khác nhau thế nào?

Cả hai dùng để align model với human preference (tạo ra output human prefer hơn).

RLHF (Reinforcement Learning from Human Feedback):

Thu thập comparison data: human rank output A vs B.
Train reward model từ comparison data.
Fine-tune LLM bằng PPO (Proximal Policy Optimization) để maximize reward.

Phức tạp, nhiều hyperparameter, reward model có thể bị exploit (reward hacking).

DPO (Direct Preference Optimization — Rafailov et al., 2023):

Không cần reward model riêng.
Optimize trực tiếp LLM trên preference pairs (chosen, rejected) bằng binary cross-entropy.
Đơn giản hơn, stable hơn, kết quả tương đương hoặc tốt hơn RLHF trong nhiều benchmark.

Hầu hết open-source fine-tuning pipeline 2024–2025 (LLaMA-3, Mistral) dùng DPO hoặc variant (SimPO, ORPO) thay vì RLHF đầy đủ.

Q25: Catastrophic forgetting là gì? Giảm thiểu thế nào?

Fine-tuning trên task cụ thể có thể làm model quên kiến thức general từ pre-training. Biểu hiện: model giỏi task mới nhưng tệ hơn trên task cũ.

Cách giảm thiểu:

Low learning rate: 1e-4 đến 1e-5 (lower than pre-training). Ít update lớn hơn → ít overwrite hơn.
Fewer epoch: 1–3 epoch cho instruction tuning, tránh overfit.
Data mixing: Mix task-specific data với general data (tỷ lệ 20–30% general).
LoRA: Freeze base model weight — không có catastrophic forgetting về định nghĩa vì weight gốc không thay đổi.
Elastic Weight Consolidation (EWC): Penalize thay đổi của weight quan trọng với task cũ.

8

Category 6 — Agent + Production (Q26–Q30)

Q26: Mô tả ReAct agent loop.

ReAct (Reason + Act — Yao et al., 2022) là pattern agent phổ biến nhất:

Thought: Tôi cần tìm thông tin về giá Bitcoin hôm nay.
Action: search("bitcoin price today")
Observation: Bitcoin is trading at $67,240 as of 2024-11-01.

Thought: Đã có dữ liệu. Người dùng hỏi so sánh với tháng trước.
Action: search("bitcoin price october 2024")
Observation: Bitcoin averaged ~$62,000 in October 2024.

Thought: Đủ dữ liệu để trả lời.
Answer: Bitcoin hiện ở $67,240, tăng khoảng 8% so với tháng trước.

Mỗi bước: model reason về bước tiếp theo → gọi action → nhận observation → tiếp tục đến khi có đủ thông tin.

Giới hạn:

Có thể loop vô hạn nếu không có stopping condition.
Latency tăng tuyến tính với số tool call.
Cost tăng theo số token trong conversation history.

Q27: Multi-agent pattern — có những dạng nào?

Khi task quá phức tạp cho 1 agent:

Supervisor + Workers: Một orchestrator agent phân công task cho specialist agent (search agent, code agent, summarizer). Đơn giản, dễ debug.
Peer collaboration: Agent giao tiếp ngang hàng, không có supervisor. Phức tạp, khó trace.
Hierarchical (team of teams): Nhiều layer supervisor. Dùng cho workflow rất lớn — ít phổ biến trong sản phẩm thực tế.

Framework: LangGraph (LangChain), AutoGen (Microsoft), CrewAI. Mỗi framework có trade-off về flexibility vs abstraction.

Trade-off thực tế: Multi-agent phức tạp hơn nhiều để debug và monitor. Chỉ dùng khi 1 agent không đủ — đừng over-engineer.

Q28: Hallucination — detect và mitigate thế nào?

Detect:

Faithfulness evaluation: Dùng LLM khác (GPT-4, Claude) làm judge — kiểm tra từng claim trong answer có grounded trong retrieved context không.
Self-consistency: Sample nhiều answer với temperature cao, đếm consensus — câu trả lời ổn định hơn thường chính xác hơn.
Citation check: Model trích dẫn source → verify source tồn tại và nội dung khớp.

Mitigate:

RAG: Cung cấp ground truth context — giảm hallucination rõ rệt so với model without context.
Temperature thấp: T=0 cho factual task.
Structured output + schema validation: JSON mode, Pydantic validation — buộc model output đúng format, dễ parse và verify.
System prompt strict: "Only answer based on the provided context. If you don't know, say so."
Smaller context scope: Đừng nhét quá nhiều context không liên quan — model dễ confuse.

Q29: Tối ưu cost LLM production thế nào?

Caching:

Exact match cache: Hash prompt → cache response. Effective cho FAQ, static query.
Semantic cache: Embed query → vector search → nếu tìm thấy câu tương tự đã cached, trả cached response. GPTCache, Langchain cache.

Model routing:

Dùng model nhỏ (GPT-4o-mini, Claude Haiku, Gemini Flash) cho task đơn giản.
Route lên model lớn chỉ khi cần — tỷ lệ phụ thuộc distribution task của product.

Batch API: OpenAI Batch API, Anthropic Message Batches giảm 50% cost cho non-realtime workload (report generation, data labeling).

Token budget:

Trim conversation history — giữ lại N message gần nhất, summarize phần cũ.
Shorter system prompt — mỗi request đều tốn token system prompt.
Prompt compression — LLMLingua compress prompt lên 3–20× với minimal quality loss.

Q30: Production challenges với LLM — những gì cần chuẩn bị?

Latency:

TTFT (Time to First Token) — user cảm nhận trực tiếp. Streaming reduce perceived latency.
Throughput — token/sec. Ảnh hưởng capacity planning.
Giải pháp: speculative decoding, KV cache, batching (vLLM).

Cost monitoring:

Track token usage per request, per user, per feature.
Budget alert khi cost spike bất thường.

Quality monitoring:

Drift detection — quality model thay đổi khi provider update model (GPT-4 turbo → GPT-4o có behavior differences).
A/B test prompt version trước khi rollout.
Human feedback loop — thumbs up/down trong UI, sample review hàng tuần.

Safety:

Prompt injection: User nhét instruction vào context để override system behavior. Mitigate: sanitize input, validate output, privilege separation (user input ≠ system instruction).
Content filter: OpenAI Moderation API, Llama Guard, custom classifier.

Compliance:

PII handling — không gửi sensitive data đến third-party API nếu không có DPA.
Data residency — một số enterprise yêu cầu data không rời EU/VN.
Logging — cân bằng giữa audit trail và privacy.

9

Tips Trả Lời LLM Technical

1. Reference paper khi có thể

Interviewer đánh giá cao khi bạn mention paper gốc. Không cần thuộc đầy đủ — "Attention Is All You Need (2017)", "LoRA (Hu et al., 2021)", "RAG (Lewis et al., 2020)", "DPO (Rafailov et al., 2023)" đủ để thể hiện bạn có nền tảng lý thuyết.

2. Gắn vào project thực tế

Câu trả lời có ví dụ cụ thể từ project của bạn thuyết phục hơn câu trả lời abstract. Ví dụ: "Trong RAG project của tôi, tôi dùng bge-large-en-v1.5 vì corpus tiếng Anh chuyên ngành, chunk size 450 token với overlap 50, và faithfulness score đạt 0.87 trên golden set 150 câu."

3. Nêu số liệu cụ thể khi có

Cost, latency, metric score — số liệu cho thấy bạn đã thực sự đo đạc, không chỉ đọc documentation. Ví dụ: "Sau khi thêm semantic cache, tỷ lệ cache hit khoảng 35% và giảm cost API xuống ~$0.04/request."

4. Honest về giới hạn

Nếu không biết: "Tôi chưa làm phần này trong thực tế, nhưng theo hiểu biết của tôi thì..." hoặc "Cái này tôi cần check lại." Đây là câu trả lời tốt hơn confident-wrong.

5. Phân biệt RAG vs fine-tuning rõ ràng

Đây là câu hay bị nhầm nhất. Interviewer muốn thấy bạn biết khi nào dùng cái gì và tại sao — không phải "RAG luôn tốt hơn" hay "fine-tuning luôn tốt hơn".

10

Bonus — Topic 2024–2025

Những topic mới xuất hiện trong interview 2024–2025 — thường hỏi ở mức "aware" chứ không deep-dive:

Long context models

Gemini 1.5 Pro (1M token context), Claude 3 (200K token). Giảm phụ thuộc vào chunking cho một số use case. Nhưng: longer context không tự nhiên tốt hơn — "lost in the middle" problem (Liu et al., 2023) cho thấy LLM retrieve tệ hơn với thông tin ở giữa context dài.

Multimodal

GPT-4o, Claude 3 Sonnet/Opus, Gemini — nhận image + text input. Use case: document understanding (PDF với bảng/biểu đồ), visual QA, screenshot-to-code. Không phải tất cả task đều cần — text-only model rẻ hơn đáng kể.

Reasoning models (o1, o3, DeepSeek-R1)

Test-time compute scaling: model dùng thêm compute khi inference (chain-of-thought ẩn dài hơn) thay vì chỉ pre-training compute. Tốt cho math, coding, scientific reasoning. Trade-off: chậm hơn và đắt hơn per token nhiều lần so với standard model.

MoE (Mixture of Experts)

Mixtral-8x7B, GPT-4 (tin đồn), Deepseek-MoE: model có nhiều FFN "expert", mỗi token chỉ activate 1–2 expert qua router. Số active parameter ít hơn total parameter → inference nhanh/rẻ hơn tương đương dense model. Trade-off: cần nhiều VRAM hơn để load toàn bộ model.

Constitutional AI (Anthropic)

Alignment technique dùng tập nguyên tắc (constitution) để hướng dẫn model self-critique và revise response — thay vì chỉ dùng human feedback. Claude series được train với phương pháp này.

11

Common Pitfalls

Những lỗi thường gặp khi trả lời câu hỏi LLM technical:

Pitfall	Vấn đề	Cách sửa
Nhầm retrieval với generation	"RAG generate document từ database" — sai	RAG retrieve existing document, LLM mới generate answer
Claim "RAG always better"	Oversimplify — interviewer biết exception	Nêu trade-off: khi nào RAG không đủ, khi nào cần fine-tune
Buzzword stuffing	"Agentic AI, AGI, autonomous" không giải thích	Giải thích cụ thể: agent = LLM + tool + loop, không hơn
Skip practical detail	Nói về RAG nhưng không biết metric eval	Chuẩn bị RAGAS, faithfulness, context precision
Reference outdated version	"LangChain 0.0.x", "OpenAI v0" — outdated API	Update với LangChain v0.3+, OpenAI SDK v1+
Claim hiểu fine-tuning nhưng chưa làm	Interviewer hỏi sâu về LoRA rank, learning rate → lộ	Honest: "Tôi đã đọc paper và thử QLoRA, nhưng chưa deploy production"

12

Bài Tiếp Theo

Bài 34: Coding interview cho AI Engineer — gì sẽ bị hỏi

Danh sách bài viết