Bài 34: Coding interview cho AI Engineer — gì sẽ bị hỏi

1

Mục Tiêu Bài Học

Sau bài này bạn sẽ:

Biết coding interview AI Engineer khác gì so với Software Engineer thuần túy.
Nắm được 6 topic chính và tỉ lệ xuất hiện thực tế.
Đọc + hiểu solution mẫu cho 8 dạng câu hỏi phổ biến.
Biết cách xử lý live coding session 30–45 phút theo từng bước.
Có kế hoạch luyện tập 4–8 tuần có cấu trúc.

2

Coding Interview AI Engineer — Khác Software Engineer Thế Nào

Coding round cho AI Engineer vẫn dùng Leetcode hoặc HackerRank shared screen, nhưng có một số khác biệt rõ về phân bổ topic:

Tiêu chí	Software Engineer	AI Engineer
Thời gian / problem	30–45 phút	30–45 phút
DSA nặng (graph, tree)	Cao — BFS/DFS/DP bắt buộc	Thấp hơn — chủ yếu array, hashmap, easy DP
Data manipulation	Hiếm	Thường xuyên — Pandas, SQL
ML-specific coding	Không có	Có — implement K-means, cosine similarity, metrics
Pass bar	Medium trong 30 phút + clean code	Medium trong 30 phút + clean code

Format phổ biến: 1 problem, live coding trên shared IDE, interviewer hỏi follow-up sau khi bạn giải xong (complexity, edge case, alternative approach).

3

Topic Phổ Biến Và Tỉ Lệ Xuất Hiện

Ước tính dựa trên pattern thực tế từ các báo cáo cộng đồng (Glassdoor, Blind, LeetCode Discuss):

Topic	Tỉ lệ	Kỹ thuật cốt lõi
Array / String	~40%	Two pointer, sliding window, hashmap, binary search
Hashmap / Set	~20%	Counter, frequency, group by key, lookup acceleration
Data Manipulation	~15%	Pandas groupby, SQL aggregation, window function
ML-Specific Coding	~10%	K-means, cosine similarity, metrics, train/test split
Easy DP	~10%	1D DP linear — climbing stairs, house robber, LIS
OOP / System	~5%	LRU Cache, rate limiter, decorator

Lưu ý: tỉ lệ trên thay đổi theo loại công ty. Phần mục 13 phân tích chi tiết hơn.

4

Example Questions — Array / String

Q1 — Two Sum (Hashmap, O(N))

Problem: Cho array nums và target, trả về indices của 2 phần tử có tổng bằng target.

Input:  nums=[2,7,11,15], target=9
Output: [0, 1]  # nums[0]+nums[1] = 2+7 = 9

def two_sum(nums: list[int], target: int) -> list[int]:
    seen: dict[int, int] = {}
    for i, n in enumerate(nums):
        complement = target - n
        if complement in seen:
            return [seen[complement], i]
        seen[n] = i
    return []

Complexity: Time O(N), Space O(N). Brute force O(N²) không pass large input.

Q2 — Longest Substring Without Repeating (Sliding Window)

Problem: Tìm độ dài substring dài nhất không có ký tự lặp.

Input:  "abcabcbb"
Output: 3  # "abc"

def length_longest(s: str) -> int:
    seen: dict[str, int] = {}
    left = 0
    max_len = 0
    for right, c in enumerate(s):
        if c in seen and seen[c] >= left:
            left = seen[c] + 1
        seen[c] = right
        max_len = max(max_len, right - left + 1)
    return max_len

Key insight: seen[c] >= left đảm bảo character trùng thực sự nằm trong window hiện tại, không phải window cũ.

Complexity: Time O(N), Space O(min(N, charset)).

5

Example Questions — Hashmap / Set

Q3 — Top K Frequent Elements

Problem: Tìm k phần tử xuất hiện nhiều nhất trong array.

Input:  nums=[1,1,1,2,2,3], k=2
Output: [1, 2]

from collections import Counter
import heapq

def top_k_frequent(nums: list[int], k: int) -> list[int]:
    count = Counter(nums)
    # nlargest: O(N log k) — tốt hơn sort O(N log N) khi k nhỏ
    return heapq.nlargest(k, count.keys(), key=count.get)

Complexity: Time O(N log k), Space O(N).

Follow-up interviewer hay hỏi: "Nếu k gần bằng N thì dùng sort thay vì heap có hợp lý hơn không?" — Câu trả lời: khi k ≈ N, sort O(N log N) và heap O(N log N) tương đương, nhưng heap có constant factor cao hơn nên sort thực tế nhanh hơn.

6

Example Questions — Data Manipulation (Pandas + SQL)

Q4 — Pandas: Top 3 Customer Per Category

Problem: Cho DataFrame có cột category, customer, revenue. Tìm top 3 customer theo tổng revenue trong mỗi category.

import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "A"],
    "customer": ["x", "y", "x", "z", "y"],
    "revenue": [100, 200, 150, 300, 50],
})

top3_per_cat = (
    df.groupby(["category", "customer"])["revenue"]
    .sum()
    .reset_index()
    .sort_values(["category", "revenue"], ascending=[True, False])
    .groupby("category")
    .head(3)
)

Output:

  category customer  revenue
1        A        y      250
0        A        x      100
3        B        z      300
2        B        x      150

Lưu ý: Dùng sort_values trước groupby().head() để đảm bảo thứ tự giảm dần trước khi lấy top N.

Q5 — SQL: Top 10 Customers By Revenue

SELECT
    customer_id,
    SUM(revenue) AS total
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id
ORDER BY total DESC
LIMIT 10;

Follow-up: "Nếu cần top 10 theo từng tháng?" — Dùng window function:

SELECT
    DATE_TRUNC('month', order_date) AS month,
    customer_id,
    SUM(revenue) AS total,
    RANK() OVER (
        PARTITION BY DATE_TRUNC('month', order_date)
        ORDER BY SUM(revenue) DESC
    ) AS rank
FROM orders
GROUP BY 1, 2
QUALIFY rank <= 10;

7

Example Questions — ML-Specific Coding

Q6 — Cosine Similarity

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    dot = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return float(dot / (norm_a * norm_b))

Edge case quan trọng: Xử lý zero-vector trước khi chia — tránh ZeroDivisionError và nan.

Q7 — K-Means From Scratch

import numpy as np

def kmeans(
    X: np.ndarray,
    k: int,
    n_iter: int = 100,
) -> tuple[np.ndarray, np.ndarray]:
    """
    Returns:
        labels: (N,) cluster assignment
        centroids: (k, D) final centroid positions
    """
    # Init: chọn ngẫu nhiên k điểm từ data
    indices = np.random.choice(len(X), k, replace=False)
    centroids = X[indices].copy()

    for _ in range(n_iter):
        # Assign: tính khoảng cách từ mỗi điểm tới mỗi centroid
        # X[:, None] shape (N, 1, D) broadcast với centroids (k, D)
        distances = np.linalg.norm(X[:, None] - centroids, axis=2)  # (N, k)
        labels = distances.argmin(axis=1)  # (N,)

        # Update: tính lại centroid
        new_centroids = np.array([
            X[labels == i].mean(axis=0) for i in range(k)
        ])

        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids

    return labels, centroids

Follow-up thường gặp:

"Điều gì xảy ra nếu 1 cluster không có điểm nào?" — X[labels == i] sẽ rỗng, .mean() trả về nan. Cần xử lý: reinitialize centroid đó hoặc skip update.
"K-means++ khác gì K-means thường?" — Khởi tạo centroid có trọng số theo khoảng cách, giảm trường hợp xấu do random init.
"Complexity?" — O(N × k × n_iter × D) với D là số chiều.

Q8 — Compute F1 Score

def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    if tp == 0:
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

8

Example Questions — OOP / System

Q9 — LRU Cache

Problem: Implement LRU (Least Recently Used) Cache với capacity cố định. get(key) O(1), put(key, value) O(1).

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int) -> None:
        self.cache: OrderedDict[int, int] = OrderedDict()
        self.capacity = capacity

    def get(self, key: int) -> int:
        if key not in self.cache:
            return -1
        self.cache.move_to_end(key)  # mark as recently used
        return self.cache[key]

    def put(self, key: int, value: int) -> None:
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used

Tại sao OrderedDict? Duy trì thứ tự insertion, move_to_end() và popitem(last=False) đều O(1). Nếu interviewer yêu cầu không dùng OrderedDict, implement bằng doubly linked list + hashmap.

9

Workflow Live Coding 30–45 Phút

Step 1 — Understand (5 phút)

Đọc kỹ problem statement.
Hỏi clarification: input range, edge case, expected output format.
Walk through example input/output bằng lời để confirm hiểu đúng.

Ví dụ: "I see nums is an integer array. Can it contain duplicates? Can target be negative?"

Step 2 — Plan (5 phút)

Nêu high-level approach trước khi code.
Nói ra trade-off: "I think hashmap gives O(N) vs brute force O(N²)".
Nếu interviewer có suggestion, lắng nghe — đây không phải trick, thường là hint thực sự.

Step 3 — Code (20–25 phút)

Viết clean, readable — dùng tên biến rõ nghĩa (user_count thay vì c).
Nói ra quyết định khi đang gõ: "I'm using a dictionary here to store the index of each number".
Test mentally với example khi code xong function signature.
Dùng type hint Python 3.10+ (lowercase): list[int], dict[str, int].

Step 4 — Test + Improve (5–10 phút)

Chạy với example đề bài cho.
Test edge case: empty input, 1 phần tử, duplicate, giá trị âm.
Nêu complexity: Time O(?), Space O(?).
Nếu còn thời gian, đề xuất optimization hoặc alternative approach.

10

Common Mistakes Và Best Practice

Những lỗi hay gặp

Jump straight to code: Bỏ qua Step 1–2, code ngay — interviewer không biết bạn đang nghĩ gì.
Code trong im lặng: Giải đúng nhưng không giao tiếp — kết quả thường fail vòng phỏng vấn vì thiếu communication signal.
Skip edge case: Không check empty input, null — bug xuất hiện lúc test.
Không test: Nộp code chưa chạy thử với example — mất điểm nặng.
Over-engineer: Giải O(N log N) khi O(N²) đã đủ và đơn giản hơn ở context medium problem.
Dùng library function cho task đang được test: Ví dụ dùng sorted() khi bài hỏi implement sort algorithm.

Best practice

Tên biến rõ nghĩa: seen_indices thay vì d, max_window thay vì m.
Decompose function: Nếu function > 30 dòng, split thành helper function.
Comment tối giản: Code nên tự đọc được — chỉ comment khi logic không rõ ràng.
Type hint: Cho thấy tư duy rõ ràng về input/output contract.
Explicit handle empty input: if not nums: return [] ở đầu function.

Về IDE autocomplete

Một số interview platform không có autocomplete (CoderPad plain mode, HackerRank). Luyện tập với plain text editor định kỳ để không bị phụ thuộc.

11

Complexity Analysis

Sau khi code xong, interviewer gần như chắc chắn hỏi complexity. Cần nêu cả 2:

Time complexity: O(?) — cái gì chiếm đa số thời gian (vòng lặp ngoài cùng, nested loop, sort).
Space complexity: O(?) — bộ nhớ phụ trợ (không tính input).

Pattern	Time	Space
Brute force nested loop	O(N²)	O(1)
Hashmap lookup	O(N)	O(N)
Sliding window	O(N)	O(1) hoặc O(k)
Sort	O(N log N)	O(1) in-place hoặc O(N)
Heap nlargest(k)	O(N log k)	O(k)
1D DP	O(N)	O(N) hoặc O(1) nếu rolling

Optimization signal: Sau khi nêu complexity, hỏi "Is this acceptable or should I optimize further?" — cho thấy bạn biết có trade-off và không tự ý over-engineer.

12

Khi Stuck — Phải Làm Gì

Các bước khi bị block

Verbalize thinking: "I'm comparing whether a two-pointer approach or a hashmap would work better here." — giúp interviewer thấy quá trình suy nghĩ ngay cả khi chưa ra đáp án.
Brute force trước: Giải O(N²) và nêu rõ "This is a brute force, let me think about optimizing." — working solution luôn tốt hơn không có gì.
Simplify: Giải version đơn giản hơn của problem trước (input nhỏ, bỏ constraint phụ).
Hỏi hint: "Could you give me a hint on the data structure?" — không phải dấu hiệu yếu; từ chối hint mới là vấn đề.

Điều không nên làm khi stuck

Im lặng hoàn toàn quá 3 phút mà không nói gì.
Tỏ ra căng thẳng hoặc xin lỗi nhiều lần — ảnh hưởng tới signal.
Từ chối hint từ interviewer — trông cứng đầu hơn là giỏi.

What NOT to do (absolute)

Dùng AI / Google trong live coding: Instant reject ở tất cả công ty có integrity policy.
Học thuộc lòng solution mà không hiểu: Interviewer hỏi follow-up sẽ lộ ngay.

13

Phân Biệt Theo Loại Công Ty

Loại công ty	DSA focus	Practical focus	Ghi chú
Big Tech / FAANG-style	Cao — medium/hard DSA	Thấp	Cần luyện graph, tree, hard DP
Startup (AI-focused)	Thấp	Cao — Pandas, SQL, ML script	Hay hỏi take-home hoặc implement model
Vietnam local	Medium	Medium	Array/hashmap + Python practical
Consulting / outsource	Easy DSA	Case study analysis	Leetcode easy + problem-solving mindset

Trước khi interview, research glassdoor reviews hoặc hỏi recruiter: "What does the coding round typically look like?" — recruiter thường sẵn sàng cho biết format.

14

Practice Strategy 4–8 Tuần

Tuần 1–2

30 easy problem Leetcode.
Topic: Array, String, Hashmap.
Mục tiêu: quen cú pháp, quen pattern cơ bản, không bị đơ với problem đơn giản.

Tuần 3–4

30 medium problem.
Topic: Sliding window, 2-pointer, easy DP, greedy.
Sau mỗi problem: viết lại complexity, test edge case bằng tay.

Tuần 5–6

10 medium + 5 hard.
Topic: Tree BFS/DFS, linked list (nếu target Big Tech).
Mock interview Pramp 2–3 sessions — lấy feedback về communication.

Tuần 7–8

Review lại topic yếu.
Pandas + SQL: 10–15 problem trên StrataScratch.
ML-specific: 5–10 problem implement từ đầu (K-means, KNN, metrics).
1–2 mock interview nữa để giảm lo lắng.

Pitfall cần tránh

Luyện quá nhiều hard problem — entry-level interview chủ yếu hỏi medium.
Bỏ qua Pandas/SQL — dễ fail data manipulation question.
Học thuộc solution thay vì hiểu pattern — không derive được khi interviewer thay đổi constraint.
Không chạy code thực sự — bug slip qua mà không biết.

15

Resources

Resource	Dùng cho	Chi phí
Leetcode	DSA practice — free tier đủ dùng	Free / Premium
NeetCode.io	Roadmap 75/150 problems có giải thích video	Free
StrataScratch	SQL + Pandas — data science interview focus	Free / Premium
Pramp	Peer mock interview — real-time feedback	Free
interviewing.io	Mock interview với ex-FAANG interviewer	Có phí
HackerRank	Alternative platform — dùng khi company test trên đây	Free

Danh sách bài viết