Lesson 81: Prometheus Metrics Middleware in Depth

Table of Contents

Lesson Objectives
The 4 Prometheus Metric Types
Installing the metrics Crate + Prometheus Exporter
HTTP Request Metrics Middleware
Refactoring the B57 /metrics Endpoint
Custom Business Metrics: Service Layer Pattern
Cardinality Pitfall + Label Design Rules
Prometheus Scrape Config + Grafana Dashboard Preview
Summary
Review Exercises
Next Lesson

Lesson Objectives

After this lesson, you will be able to:

Understand the 4 Prometheus metric types (Counter, Gauge, Histogram, Summary), with the pros, cons, and use case of each.
Refactor the B57 /metrics endpoint from a hand-written format to the metrics crate + metrics-exporter-prometheus.
Implement HTTP request metrics middleware: a request_count counter + a response_time histogram + an error_count counter.
Apply the label pattern (key=value) for dimensional metrics: method, status, route.
Understand the label cardinality pitfall: high cardinality makes memory blow up on the Prometheus server.
Set up a Prometheus scrape config + a Grafana dashboard preview with 4 standard PromQL queries.
Apply the service layer pattern for business metrics: orders_created_total, payment_succeeded_total, webhook_received_total.
Extract the route template with MatchedPath (MANDATORY) instead of the raw URI to prevent a cardinality blow-up.

The 4 Prometheus Metric Types

Prometheus defines 4 metric types (see the spec at prometheus.io/docs/concepts/metric_types). Choosing the right type for each Shop API dimension directly affects the quality of the dashboards and alerts downstream.

Counter: only increases (monotonic), resets when the process restarts:

Use case: request_count_total, error_count_total, bytes_sent_total.
Only operator: += (increment). It NEVER decreases.
Naming convention: the _total suffix (http_requests_total, orders_created_total).
Query: PromQL rate(http_requests_total[5m]) = average requests per second over the last 5 minutes.
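
To make the counter semantics concrete, here is a minimal sketch of how a per-second rate can be derived from scraped counter samples, including the reset that happens on a process restart. It is a simplification: the real PromQL rate() also extrapolates to the window boundaries, which this sketch skips. The names Sample and simple_rate are made up for this illustration.

```rust
/// One scraped sample of a counter: (unix time in seconds, counter value).
type Sample = (f64, f64);

/// Simplified per-second rate over a window of samples.
/// A drop in value is treated as a counter reset: the counter restarted
/// from zero, so the post-reset value counts as the increase.
/// Unlike Prometheus, this sketch does no boundary extrapolation.
fn simple_rate(samples: &[Sample]) -> Option<f64> {
    let (first, last) = (samples.first()?, samples.last()?);
    let elapsed = last.0 - first.0;
    if samples.len() < 2 || elapsed <= 0.0 {
        return None;
    }
    let mut increase = 0.0;
    for pair in samples.windows(2) {
        let (prev, cur) = (pair[0].1, pair[1].1);
        increase += if cur >= prev { cur - prev } else { cur };
    }
    Some(increase / elapsed)
}
```

With samples (0s, 100), (15s, 130), (30s, 160) the sketch reports 2 requests per second. Because a reset is only detectable per series, rates are computed per series first and summed afterwards.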

Gauge: moves up and down freely, a point-in-time value:

Use case: db_pool_active, memory_usage_bytes, temperature_celsius, active_connections.
Operators: = (set), += (increment), -= (decrement).
The B57 pool metrics already use this pattern: shop_db_pool_size, shop_db_pool_active, shop_db_pool_utilization.
Query it directly in PromQL: shop_db_pool_utilization returns the value from the most recent scrape.

Histogram: a bucketed distribution (count + sum + bucket array):

Use case: request_duration_seconds_bucket, response_size_bytes_bucket.
Output is 3 kinds of series per histogram: <name>_bucket{le="0.1"} (one per bucket), <name>_sum, <name>_count.
Compute p50, p95, p99 with PromQL: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])).
Fixed buckets for HTTP latency: the low end needs fine granularity (web API latency is commonly < 100ms), so we fix 12 buckets from 1ms to 10s.
Can be aggregated across instances (sum the _bucket series, then compute the quantile).
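
The quantile estimation and the cross-instance aggregation described above can be sketched in a few lines. This is an illustrative approximation of what histogram_quantile does with cumulative buckets (linear interpolation inside the bucket that contains the target rank), not the Prometheus source; the names Bucket, estimate_quantile, and merge are assumptions of this sketch.

```rust
/// One cumulative bucket: (upper bound `le`, cumulative count).
/// The last bucket must be the +Inf bucket.
type Bucket = (f64, f64);

/// Estimates quantile `q` (0.0..=1.0) from cumulative buckets sorted by
/// upper bound, using linear interpolation inside the bucket.
fn estimate_quantile(q: f64, buckets: &[Bucket]) -> Option<f64> {
    if buckets.len() < 2 || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let total = buckets.last()?.1;
    if total <= 0.0 {
        return None;
    }
    let rank = q * total;
    // First bucket whose cumulative count reaches the rank.
    let idx = buckets.iter().position(|b| b.1 >= rank)?;
    if idx == buckets.len() - 1 {
        // Rank falls in +Inf: the best answer is the largest finite bound.
        return Some(buckets[buckets.len() - 2].0);
    }
    let (lower_bound, lower_count) = if idx == 0 {
        (0.0, 0.0)
    } else {
        buckets[idx - 1]
    };
    let (upper_bound, upper_count) = buckets[idx];
    let in_bucket = upper_count - lower_count;
    if in_bucket <= 0.0 {
        return Some(upper_bound);
    }
    Some(lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket)
}

/// Sums cumulative counts bucket by bucket; both inputs must share bounds.
fn merge(a: &[Bucket], b: &[Bucket]) -> Vec<Bucket> {
    a.iter().zip(b).map(|(x, y)| (x.0, x.1 + y.1)).collect()
}
```

Merging the buckets of two pods with merge and then calling estimate_quantile mirrors the "sum the buckets, then compute the quantile" order. When the target rank lands in the +Inf bucket, the sketch returns the largest finite bound, which is why the top bucket must sit above the latencies you care about.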

Summary: quantiles pre-computed on the client side:

Use case: rarely used (Histogram is the better choice in about 95% of real-world cases).
Pros: no PromQL computation needed, so queries are fast.
Cons: CANNOT be aggregated across instances (the mean of per-instance medians is NOT the median of the whole); quantiles are limited to what the client was configured with.
Shop API SKIPS Summary for the whole series.
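
A small numeric sketch shows why per-instance quantiles cannot be averaged. The two pods and their latencies (in milliseconds) are invented for illustration, and median uses the upper median for even lengths to keep the code short.

```rust
/// Median of a non-empty sample set (upper median for even lengths).
fn median(values: &[f64]) -> f64 {
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    sorted[sorted.len() / 2]
}

/// What a dashboard would show if it averaged two per-instance medians.
fn mean_of_medians(a: &[f64], b: &[f64]) -> f64 {
    (median(a) + median(b)) / 2.0
}
```

For a quiet pod with 3 requests at 10ms and a busy pod with 7 requests at 100ms, mean_of_medians gives 55ms, while the median over all 10 requests is 100ms. Averaging ignores how much traffic each instance served, which bucket sums do not.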

Metric type decisions for Shop API:

Counter: http_requests_total, http_errors_total, orders_created_total, payment_succeeded_total.
Gauge: shop_db_pool_active, shop_db_pool_idle (carried over from B57; the pattern differs from the automatic middleware metrics because these are point-in-time values).
Histogram: http_request_duration_seconds, cart_checkout_duration_seconds.
Summary: SKIPPED entirely.

Installing the metrics Crate + Prometheus Exporter

Update the workspace dependencies in shop/Cargo.toml:

[workspace.dependencies]
metrics = "0.23"
metrics-exporter-prometheus = "0.15"

The metrics crate provides a facade (the counter!, gauge!, and histogram! macros), similar to the tracing crate; metrics-exporter-prometheus is the backend recorder that renders the Prometheus exposition format. The rule: emit metrics everywhere through the facade macros, so the backend (Datadog / StatsD / Prometheus / OTel) is chosen once at startup and can be swapped without touching call sites.

Create a new file crates/shop-api/src/metrics.rs:

// File: crates/shop-api/src/metrics.rs
use metrics_exporter_prometheus::{Matcher, PrometheusBuilder, PrometheusHandle};

/// HTTP latency buckets: 12 buckets from 1ms to 10s.
/// Fine-grained at the low end (most web API requests take < 100ms),
/// coarse at the high end (tail latency > 1s must stay visible).
const HTTP_LATENCY_BUCKETS: &[f64] = &[
    0.001, 0.005, 0.01, 0.025, 0.05, 0.1,
    0.25, 0.5, 1.0, 2.5, 5.0, 10.0,
];

pub fn install_recorder() -> PrometheusHandle {
    PrometheusBuilder::new()
        // A per-metric bucket override is MANDATORY for histograms:
        // without configured buckets, this exporter renders histogram!
        // data as a summary (quantiles) instead of _bucket series.
        .set_buckets_for_metric(
            Matcher::Full("http_request_duration_seconds".into()),
            HTTP_LATENCY_BUCKETS,
        )
        .unwrap()
        .set_buckets_for_metric(
            Matcher::Full("cart_checkout_duration_seconds".into()),
            HTTP_LATENCY_BUCKETS,
        )
        .unwrap()
        .install_recorder()
        .expect("install metrics recorder")
}

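install_recorder() returns a PrometheusHandle; its .render() method produces the Prometheus exposition text for every metric emitted so far. The handle is stored in AppState so the /metrics handler can render it whenever a scrape arrives. Depending on the exporter version, install_recorder() may leave periodic upkeep (PrometheusHandle::run_upkeep) to the caller, so check the crate documentation for the version you pin.

Update crates/shop-api/src/state.rs to extend AppState: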

// File: crates/shop-api/src/state.rs
use metrics_exporter_prometheus::PrometheusHandle;

#[derive(Clone)]
pub struct AppState {
    pub config: AppConfig,
    pub db: PgPool,
    // ... the 6 Arc<dyn ...> services fixed in B72
    pub product_service: Arc<dyn ProductService>,
    pub order_service: Arc<dyn OrderService>,
    // ...

    /// Prometheus handle that renders the metrics text when /metrics is requested.
    pub metrics_handle: PrometheusHandle,
}

Update crates/shop-api/src/main.rs to install the recorder BEFORE building AppState (so every metric emitted from the middleware and the service layer is routed to the right recorder):

// File: crates/shop-api/src/main.rs
fn main() -> anyhow::Result<()> {
    dotenvy::dotenv().ok();
    let config = AppConfig::from_env()?;
    init_tracing(config.env);  // B80 lock

    // Install the metrics recorder BEFORE building AppState.
    // The recorder is a global singleton: the counter!/histogram! macros
    // route to it from anywhere in the process.
    let metrics_handle = shop_api::metrics::install_recorder();

    // Build the Tokio runtime afterwards (the recorder must be ready before any async task is spawned).
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()?;

    runtime.block_on(async {
        let pool = create_pool(&PoolConfig::from_config(&config)).await?;

        let state = AppState {
            config: config.clone(),
            db: pool,
            // ... wire 6 service
            metrics_handle,
        };

        // ... build router + serve
        Ok::<_, anyhow::Error>(())
    })
}

The 12 fixed buckets from 1ms to 10s follow a roughly logarithmic spread: dense at the low end (1ms, 5ms, 10ms, 25ms, 50ms, 100ms, which suits a healthy web API whose P95 is usually < 100ms) and sparse at the high end (250ms, 500ms, 1s, 2.5s, 5s, 10s, enough to spot tail-latency outliers). A histogram with 12 configured buckets exposes 12 _bucket series plus the implicit le="+Inf" bucket, _sum, and _count, so about 15 time series per label combination. That cardinality is manageable.
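
As a conceptual sketch of what happens on each histogram! record call, the following shows cumulative bucket counting with the 12 bounds from this lesson. It is not how metrics-exporter-prometheus stores data internally; the Histogram struct and its methods are invented here to illustrate the exposition model.

```rust
const HTTP_LATENCY_BUCKETS: [f64; 12] = [
    0.001, 0.005, 0.01, 0.025, 0.05, 0.1,
    0.25, 0.5, 1.0, 2.5, 5.0, 10.0,
];

/// Cumulative histogram state for one label combination.
struct Histogram {
    bucket_counts: [u64; 12], // cumulative: bucket i counts values <= bound i
    count: u64,               // also the value of the implicit le="+Inf" bucket
    sum: f64,
}

impl Histogram {
    fn new() -> Self {
        Histogram { bucket_counts: [0; 12], count: 0, sum: 0.0 }
    }

    /// Records one observation in seconds.
    fn observe(&mut self, seconds: f64) {
        for (i, bound) in HTTP_LATENCY_BUCKETS.iter().enumerate() {
            if seconds <= *bound {
                self.bucket_counts[i] += 1;
            }
        }
        self.count += 1;
        self.sum += seconds;
    }

    /// Series exposed: one per finite bucket + le="+Inf" + _sum + _count.
    fn series_count(&self) -> usize {
        self.bucket_counts.len() + 3
    }
}
```

An observation of 12s increments only count and sum, so it is visible solely through the +Inf bucket. That is the outlier case the upper buckets are meant to keep rare.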

HTTP Request Metrics Middleware

Create a new file crates/shop-api/src/middleware/metrics_layer.rs:

// File: crates/shop-api/src/middleware/metrics_layer.rs
use axum::{
    extract::{MatchedPath, Request},
    middleware::Next,
    response::Response,
};
use std::time::Instant;

pub async fn metrics_middleware(req: Request, next: Next) -> Response {
    let start = Instant::now();
    let method = req.method().clone();

    // Use MatchedPath to get the route template instead of the raw URI,
    // which prevents a cardinality blow-up:
    //   raw URI: /products/iphone-15, /products/samsung-s24, ...
    //   matched: /products/{slug}  (a single label value)
    let path = req
        .extensions()
        .get::<MatchedPath>()
        .map(|p| p.as_str().to_string())
        .unwrap_or_else(|| "<unmatched>".to_string());

    // Forward the request down the stack.
    let response = next.run(req).await;

    let status = response.status().as_u16();
    let latency = start.elapsed().as_secs_f64();

    // Counter: total requests, with the 3 standard dimensions (method + path + status).
    metrics::counter!(
        "http_requests_total",
        "method" => method.as_str().to_string(),
        "path" => path.clone(),
        "status" => status.to_string(),
    )
    .increment(1);

    // Histogram: request duration, with 2 dimensions (method + path).
    // status is left out to avoid multiplying cardinality by ~50 status codes.
    metrics::histogram!(
        "http_request_duration_seconds",
        "method" => method.as_str().to_string(),
        "path" => path.clone(),
    )
    .record(latency);

    // Counter: errors, emitted only for 4xx/5xx to keep cardinality down
    // (nothing is emitted for 2xx, because http_requests_total already tracks the total).
    if status >= 400 {
        metrics::counter!(
            "http_errors_total",
            "method" => method.as_str().to_string(),
            "path" => path,
            "status" => status.to_string(),
        )
        .increment(1);
    }

    response
}

Fixed patterns for Shop API:

Using MatchedPath instead of the raw URI is MANDATORY:
- Bad: /products/iphone-15, /products/samsung-s24 → every slug is a different label value, so 1M slugs = 1M time series.
- Good: /products/{slug} → a single label value for every product slug.
Labels are limited to the 3 standard dimensions: method (~10 HTTP methods) + path (~100 Shop API route templates) + status (~50 common status codes).
Do NOT add user_id / request_id labels: they are high cardinality (10K users = 10K series for each metric; 100K = blow-up).
The histogram drops the status label: a latency distribution by route + method is meaningful enough, and multiplying cardinality by ~50 statuses adds no analytical value.

Update crates/shop-api/src/middleware/mod.rs:

// File: crates/shop-api/src/middleware/mod.rs
pub mod trace_layer;       // B80
pub mod metrics_layer;     // B81 NEW

pub use trace_layer::custom_trace_layer;
pub use metrics_layer::metrics_middleware;

Wire it into the router (crates/shop-api/src/router.rs). Place it INSIDE trace_layer so the trace span already exists when the metrics are emitted (log events and metrics then correlate on the same request_id):

// File: crates/shop-api/src/router.rs
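use axum::{middleware as axum_mw, routing::get, Router};
use tower_http::limit::RequestBodyLimitLayer;
use crate::middleware::{custom_trace_layer, metrics_middleware};
use crate::{routes, state::AppState};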

pub fn build_router(state: AppState) -> Router {
    Router::new()
        .merge(routes::health::routes())
        .route("/metrics", get(routes::metrics::metrics))
        .nest("/api/v1", api_v1())
        // INNER: metrics_layer needs MatchedPath to be set already.
        .layer(axum_mw::from_fn(metrics_middleware))  // B81 NEW
        // Further OUT: trace_layer wraps every event in a span.
        .layer(custom_trace_layer())                  // B80
        .layer(RequestBodyLimitLayer::new(2 * 1024 * 1024))  // B79 OUTERMOST
        .with_state(state)
}

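A note on ordering: axum inserts MatchedPath into the request extensions only AFTER the router has matched a route, so middleware that reads MatchedPath must be added with .layer() AFTER the .route()/.nest() calls in the builder chain. Each .layer() call wraps everything added before it, which is why the last layer added is the outermost. The stack now has 9 layers (the 8 from B80 + the new metrics_layer from B81).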
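
The "last layer added is the outermost" rule is easy to get backwards, so here is a toy onion model in plain Rust. It does not use axum or tower; Service, layer, and build_stack are hypothetical names that only mimic the wrapping order of the builder chain in build_router.

```rust
/// A handler or wrapped handler: appends what it does to a shared log.
type Service = Box<dyn Fn(&mut Vec<String>)>;

/// Wraps `inner` in a named layer, like one `.layer(...)` call.
fn layer(name: &'static str, inner: Service) -> Service {
    Box::new(move |log: &mut Vec<String>| {
        log.push(format!("{name}: before"));
        inner(&mut *log);
        log.push(format!("{name}: after"));
    })
}

/// Builds the toy stack in the same order as the builder chain.
fn build_stack() -> Service {
    let handler: Service = Box::new(|log: &mut Vec<String>| log.push("handler".to_string()));
    // Each call wraps everything built before it.
    let stack = layer("metrics", handler);
    let stack = layer("trace", stack);
    layer("body_limit", stack)
}
```

Running the stack logs body_limit first and last, with metrics closest to the handler. In the real router this means the metrics middleware measures only the handler and the layers inside it, not the time spent in the outer layers.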

Refactoring the B57 /metrics Endpoint

The /metrics endpoint now has to combine 2 sources: (a) the automatic HTTP metrics from the middleware, rendered by PrometheusHandle::render(), and (b) the B57 pool gauge metrics in their hand-written format (pool stats do not go through the metrics macros; they are read directly from the pool, e.g. PgPool::size()).

Refactor crates/shop-api/src/routes/metrics.rs:

// File: crates/shop-api/src/routes/metrics.rs
use axum::{extract::State, http::header, response::IntoResponse};
use crate::state::AppState;

pub async fn metrics(State(state): State<AppState>) -> impl IntoResponse {
    // 1. Automatic HTTP metrics from the middleware (counters + histogram).
    let http_metrics = state.metrics_handle.render();

    // 2. Pool gauge metrics: hand-written format, continued from B57.
    // size() and num_idle() are read at slightly different moments, so
    // saturating_sub guards against a transient underflow.
    let pool_active = state.db.size().saturating_sub(state.db.num_idle() as u32);
    let pool_utilization =
        pool_active as f64 / state.config.pool_max_connections as f64;

    let pool_metrics = format!(
        "\n# HELP shop_db_pool_size Total connections in pool\n\
         # TYPE shop_db_pool_size gauge\n\
         shop_db_pool_size {}\n\
         # HELP shop_db_pool_idle Idle connections\n\
         # TYPE shop_db_pool_idle gauge\n\
         shop_db_pool_idle {}\n\
         # HELP shop_db_pool_active Active connections\n\
         # TYPE shop_db_pool_active gauge\n\
         shop_db_pool_active {}\n\
         # HELP shop_db_pool_max Max connections allowed\n\
         # TYPE shop_db_pool_max gauge\n\
         shop_db_pool_max {}\n\
         # HELP shop_db_pool_utilization Pool utilization ratio\n\
         # TYPE shop_db_pool_utilization gauge\n\
         shop_db_pool_utilization {:.4}\n",
        state.db.size(),
        state.db.num_idle(),
        pool_active,
        state.config.pool_max_connections,
        pool_utilization,
    );

    (
        [(
            header::CONTENT_TYPE,
            "text/plain; version=0.0.4; charset=utf-8",
        )],
        format!("{}{}", http_metrics, pool_metrics),
    )
}

The endpoint returns a body that combines both parts. Example output after 142 GET /api/v1/products requests + 5 validation errors (422) on POST /api/v1/products:

# HTTP metrics (automatic, from the middleware)
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api/v1/products",status="200"} 142
http_requests_total{method="POST",path="/api/v1/products",status="422"} 5

# HELP http_request_duration_seconds HTTP request duration
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="GET",path="/api/v1/products",le="0.005"} 89
http_request_duration_seconds_bucket{method="GET",path="/api/v1/products",le="0.025"} 130
http_request_duration_seconds_bucket{method="GET",path="/api/v1/products",le="0.1"} 142
http_request_duration_seconds_bucket{method="GET",path="/api/v1/products",le="+Inf"} 142
http_request_duration_seconds_count{method="GET",path="/api/v1/products"} 142
http_request_duration_seconds_sum{method="GET",path="/api/v1/products"} 3.45

# HELP http_errors_total Total number of HTTP errors (4xx/5xx)
# TYPE http_errors_total counter
http_errors_total{method="POST",path="/api/v1/products",status="422"} 5

# Pool metrics (hand-written format, continued from B57; HELP/TYPE lines omitted here)
shop_db_pool_size 5
shop_db_pool_idle 3
shop_db_pool_active 2
shop_db_pool_max 20
shop_db_pool_utilization 0.1000

The B81 pattern: the /metrics endpoint is a composite of the automatic HTTP metrics from the middleware (the bulk of the output) + the hand-written pool gauges (low overhead, read directly from the sqlx pool API). G15 will migrate the pool gauges to the gauge! macro + a background task that refreshes them every 5s, which unifies the pipeline.
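
One way to keep the hand-written half testable is to pull the formatting into a pure function that takes plain numbers instead of the pool. The sketch below assumes that shape; render_pool_metrics is a hypothetical helper, not part of the lesson's code, and the parameter types mirror the u32 and usize values used in the handler.

```rust
/// Renders the pool gauges in the Prometheus exposition format.
fn render_pool_metrics(size: u32, idle: usize, max: u32) -> String {
    // Saturating: size and idle may come from slightly different moments.
    let active = size.saturating_sub(idle as u32);
    let utilization = if max == 0 {
        0.0
    } else {
        f64::from(active) / f64::from(max)
    };
    let gauges = [
        ("shop_db_pool_size", "Total connections in pool", size.to_string()),
        ("shop_db_pool_idle", "Idle connections", idle.to_string()),
        ("shop_db_pool_active", "Active connections", active.to_string()),
        ("shop_db_pool_max", "Max connections allowed", max.to_string()),
        ("shop_db_pool_utilization", "Pool utilization ratio", format!("{utilization:.4}")),
    ];
    let mut out = String::new();
    for (name, help, value) in gauges {
        out.push_str(&format!("# HELP {name} {help}\n# TYPE {name} gauge\n{name} {value}\n"));
    }
    out
}
```

The handler would then call it with the pool's current size, idle count, and configured maximum, and append the result to the rendered HTTP metrics. The max == 0 guard avoids emitting NaN, which would make the gauge useless on a dashboard.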

Custom Business Metrics: Service Layer Pattern

The HTTP middleware tracks infrastructure metrics (requests, response latency, errors). Business metrics (orders, transactions, domain events) are emitted in the service layer: that is where the business event actually happens, independent of the HTTP transport.

The pattern, emitting metrics in the service layer:

// File: crates/shop-core/src/orders.rs (PgOrderService)
use std::time::Instant;

impl OrderService for PgOrderService {
    async fn create_order(
        &self,
        dto: CreateOrderDto,
        actor_user_id: i64,
        request_id: &str,
    ) -> Result<OrderResponseDto, OrderError> {
        let start = Instant::now();

        let result = shop_db::orders::create_order_atomic(
            &self.pool,
            actor_user_id,
            &dto.items,
            &dto.payment_method,
            request_id,
        )
        .await;

        let elapsed = start.elapsed().as_secs_f64();

        match &result {
            Ok(order) => {
                metrics::counter!(
                    "orders_created_total",
                    "payment_type" => dto.payment_method.label(),  // cod | stripe | bank_transfer
                )
                .increment(1);

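                // NOTE: install_recorder() registers bucket overrides only for
                // http_request_duration_seconds and cart_checkout_duration_seconds;
                // add a Matcher for this metric too if it should use the same buckets.
                metrics::histogram!("order_creation_duration_seconds")
                    .record(elapsed);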

                tracing::info!(
                    order_id = order.id,
                    total = %order.total,
                    "order created"
                );
            }
            Err(e) => {
                metrics::counter!(
                    "orders_failed_total",
                    "reason" => e.label(),  // insufficient_stock | product_not_found | ...
                )
                .increment(1);
            }
        }

        result.map(OrderResponseDto::from)
    }
}

Fixed business metrics for Shop API:

orders_created_total{payment_type} counter: split by payment method (3 values: cod, stripe, bank_transfer).
orders_failed_total{reason} counter: classifies business failures (insufficient_stock / product_not_found / payment_declined).
payment_succeeded_total{type} counter: emitted after the Stripe payment_intent.succeeded webhook from B71.
webhook_received_total{event_type} counter: every Stripe webhook event type from B71 (payment_intent.succeeded / payment_intent.payment_failed / charge.refunded).
cart_checkout_duration_seconds histogram: measures the end-to-end checkout flow.

Rule: 1 metric per critical service operation (create_order, checkout, payment_webhook, password_reset). Do NOT emit one for every function, which only adds noise. Each metric gets a doc comment explaining its semantics + a sample dashboard query so new team members can get up to speed quickly.
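
The service code above calls .label() on the payment method and on the error, but their definitions are not shown in this lesson. A minimal sketch of what such methods could look like follows; the enum variants, the Database variant, and the "internal" label are assumptions made for illustration. The point is that every label value is a &'static str chosen from a closed set, so the cardinality is bounded by the enum itself.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PaymentMethod {
    Cod,
    Stripe,
    BankTransfer,
}

impl PaymentMethod {
    /// Label value for orders_created_total{payment_type}.
    fn label(self) -> &'static str {
        match self {
            PaymentMethod::Cod => "cod",
            PaymentMethod::Stripe => "stripe",
            PaymentMethod::BankTransfer => "bank_transfer",
        }
    }
}

#[derive(Debug)]
enum OrderError {
    InsufficientStock,
    ProductNotFound,
    PaymentDeclined,
    /// Hypothetical catch-all; the message stays out of the label.
    Database(String),
}

impl OrderError {
    /// Label value for orders_failed_total{reason}.
    fn label(&self) -> &'static str {
        match self {
            OrderError::InsufficientStock => "insufficient_stock",
            OrderError::ProductNotFound => "product_not_found",
            OrderError::PaymentDeclined => "payment_declined",
            OrderError::Database(_) => "internal",
        }
    }
}
```

Mapping free-form error details (such as a database message) to one fixed label keeps the reason dimension small, while the full message can still go to the log event.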

Cardinality Pitfall + Label Design Rules

Cardinality = the number of time series Prometheus has to store for 1 metric. Its upper bound is the Cartesian product of the number of unique values of each label. Every unique label combination is a separate time series, and each series costs roughly 3KB of memory on the Prometheus server (a rough planning figure).

The high-cardinality pitfall:

A user_id label with 100K users → 100K time series for each metric.
3 HTTP metrics × 100K users × 10 methods × 50 statuses = 150 million time series in the worst case → Prometheus crashes with OOM.
A reported real-world incident: a 2019 Shopify outage of about 3 hours, said to have been caused by one developer adding an order_id label to a counter.
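
The worst-case arithmetic is just a product, which makes it easy to check before adding a label. The helper names below are invented for this sketch, and 3,000 bytes per series stands in for the rough "~3KB" planning figure.

```rust
/// Worst-case series count: metrics × the product of label value counts.
fn worst_case_series(metrics: u64, label_value_counts: &[u64]) -> u64 {
    label_value_counts
        .iter()
        .fold(metrics, |acc, n| acc.saturating_mul(*n))
}

/// Rough memory estimate at a fixed cost per series.
fn estimated_memory_bytes(series: u64, bytes_per_series: u64) -> u64 {
    series.saturating_mul(bytes_per_series)
}
```

For 3 metrics with user_id (100K values), method (10), and status (50), worst_case_series returns 150,000,000, or about 450GB at 3,000 bytes each. Swapping user_id for ~100 route templates drops the bound to 150,000 series.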

Label design rules for Shop API:

Cardinality cap: at most 1000 unique values per label.
OK labels (low cardinality, stable):
- method: ~10 values (GET/POST/PUT/PATCH/DELETE/HEAD/OPTIONS).
- status: ~50 values (200/201/204/301/302/400/401/403/404/409/422/429/500/502/503/504/...).
- path: ~100 Shop API route templates (bounded by the route definitions).
- payment_type: 3 values (cod / stripe / bank_transfer).
- tier: 3 values (vip / regular / new), aggregating by tier instead of user_id.
NOT OK labels (high cardinality, unbounded):
- user_id, order_id, product_id, session_id, request_id.
- timestamp (a new value every second, so the series count grows without bound).
- IP address (~4 billion IPv4 + 2^128 IPv6).
- raw user_agent string.
- raw URL query string.

Anti-pattern:

// DON'T do this: high-cardinality blow-up
metrics::counter!(
    "user_request_total",
    "user_id" => user.id.to_string(),  // 100K users = 100K series
    "request_id" => request_id.to_string(),  // unbounded
)
.increment(1);

The correct pattern, aggregating into buckets before labeling:

// Aggregate by tier: 3 series instead of 100K series
let tier = if user.total_orders > 100 {
    "vip"
} else if user.total_orders > 10 {
    "regular"
} else {
    "new"
};

metrics::counter!(
    "user_request_total",
    "tier" => tier,
)
.increment(1);

Shop API decision: aggregate before labeling. Turn high-cardinality identifiers into low-cardinality buckets (tier / category / region / device_class). Raw identifiers belong in log events (fixed in B80) for debugging one specific request; metric labels are only for aggregated dashboards.

Defensive check before deploying: curl -s http://localhost:3000/metrics | grep -v '^#' | grep -c . counts the exposed series (sample lines, excluding # HELP / # TYPE comments and blank lines). The Shop API target is < 10K series in total (typical for a web API). Above 100K is a red flag; above 1M, Prometheus is at serious risk of running out of memory.
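
The same check can run inside a test or a CI step against the rendered text. The sketch below counts sample lines the way the shell pipeline does; count_series and budget_status are hypothetical helpers, and the "investigate" band between 10K and 100K is an assumption added here, since the text only names the target and the red-flag threshold.

```rust
/// Counts sample lines in exposition text, skipping comments and blanks.
fn count_series(exposition: &str) -> usize {
    exposition
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .count()
}

/// Classifies a series count against the Shop API budget.
fn budget_status(series: usize) -> &'static str {
    if series < 10_000 {
        "ok"
    } else if series <= 100_000 {
        "investigate" // assumed middle band, not from the lesson
    } else {
        "red flag"
    }
}
```

A test could render the metrics once after a warm-up request and assert that budget_status returns "ok", which catches an accidental high-cardinality label before it reaches production.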

Prometheus Scrape Config + Grafana Dashboard Preview

A preview of the G15 deployment stack, which uses the Prometheus pull pattern: the Prometheus server scrapes the Shop API /metrics endpoint every 15 seconds. Create docker-compose.observability.yml in the project root:

# File: docker-compose.observability.yml
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: shop_prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: shop_grafana
    ports:
      - "3001:3000"
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: "Admin"
    volumes:
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

The Prometheus config file prometheus.yml:

# File: prometheus.yml
global:
  scrape_interval: 15s     # Scrape every 15s (fixed for Shop API)
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'shop_api'
    static_configs:
      - targets: ['host.docker.internal:3000']
        labels:
          service: 'shop-api'
          env: 'local'
    metrics_path: /metrics
    scrape_timeout: 10s

Start the stack with docker compose -f docker-compose.observability.yml up -d → Prometheus UI at http://localhost:9090 → Grafana at http://localhost:3001, then add a Prometheus datasource with the URL http://prometheus:9090.

4 standard PromQL queries for the Grafana dashboard:

# 1. Request rate (req/s, last 5 min)
sum(rate(http_requests_total[5m]))

# 2. Error rate (ratio 4xx+5xx / total)
sum(rate(http_errors_total[5m]))
  /
sum(rate(http_requests_total[5m]))

# 3. P95 latency (seconds)
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# 4. Top 5 slow routes by P95 latency
topk(
  5,
  histogram_quantile(
    0.95,
    sum by (path, le) (rate(http_request_duration_seconds_bucket[5m]))
  )
)

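These 4 queries follow the RED method (Rate / Errors / Duration) for microservices, a pattern popularized by Tom Wilkie (a former Google SRE) and widely adopted across the industry. Queries 1 to 3 map to Rate, Errors, and Duration; query 4 breaks Duration down per route. Each Grafana panel is wired to 1 query and refreshes every 15s to match scrape_interval.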

The fixed dashboard pattern for Shop API: these 4 standard panels for every current and future Shop API service (shop-api / shop-worker / future microservices). G18 adds alert rules: error_rate > 0.005 for 5m triggers a PagerDuty P2, and p95_latency > 1s for 5m sends a non-paging Slack warning.
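
The "for 5m" part of an alert rule means the condition has to hold continuously before the alert fires; a single dip resets the clock. Here is a simplified sketch of that behavior, assuming one (time, value) pair per rule evaluation. The function name and signature are invented for this illustration and leave out details of real Prometheus alert states.

```rust
/// Returns the evaluation time (seconds) at which the alert starts firing.
/// `samples` holds (time in seconds, value) for each rule evaluation.
fn first_firing_time(samples: &[(u64, f64)], threshold: f64, for_secs: u64) -> Option<u64> {
    let mut pending_since: Option<u64> = None;
    for &(t, value) in samples {
        if value > threshold {
            // Start the pending clock on the first breach, keep it otherwise.
            let since = *pending_since.get_or_insert(t);
            if t - since >= for_secs {
                return Some(t);
            }
        } else {
            // Any evaluation below the threshold resets the pending state.
            pending_since = None;
        }
    }
    None
}
```

With evaluations every 15s and an error rate stuck at 0.01, error_rate > 0.005 for 5m fires at t = 300s. If one evaluation at t = 150s drops below the threshold, the clock restarts at 165s and the alert fires at 465s instead.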

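Verify the pipeline end to end: in terminal 1 run cargo run -p shop-api; in terminal 2 run oha -z 30s -c 50 http://localhost:3000/api/v1/products; in terminal 3 open Grafana, where the "Request rate" panel should show a spike (around 4K req/s in this example), "P95 latency" around 25ms, and "Top 5 slow routes" the slowest endpoints. These figures depend on your machine. To test the error path, run curl -X POST -H 'Content-Type: application/json' -d '{}' http://localhost:3000/api/v1/products 5 times to produce five 422 responses; the "Error rate" panel should rise above zero (how far depends on the traffic in the same window).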

Summary

4 Prometheus metric types: Counter (monotonic), Gauge (free-moving), Histogram (bucketed distribution), Summary (SKIPPED; Histogram is the better fit).
The metrics v0.23 + metrics-exporter-prometheus v0.15 crates are pinned as workspace dependencies.
PrometheusBuilder::install_recorder() runs in main.rs BEFORE AppState is built.
PrometheusHandle lives in AppState and renders the Prometheus text when /metrics is requested.
The HTTP middleware emits 3 metrics: the http_requests_total counter + the http_request_duration_seconds histogram + the http_errors_total counter.
MatchedPath extracts the route template instead of the raw URI: MANDATORY to prevent a cardinality blow-up.
Fixed latency buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0], i.e. 12 buckets from 1ms to 10s.
3 standard dimensions: method + path + status (the histogram drops status to avoid a ~50x cardinality increase).
Cardinality cap: at most 1000 unique values per label.
OK labels: method/status/path/payment_type/tier; NOT OK: user_id/order_id/IP/timestamp/request_id.
Business metrics in the service layer: orders_created_total{payment_type} + orders_failed_total{reason} + payment_succeeded_total{type} + webhook_received_total{event_type} + the cart_checkout_duration_seconds histogram.
The /metrics refactor combines the automatic middleware output (HTTP metrics) + the hand-written B57 pool output (gauges).
Prometheus scrapes at a 15s interval + Grafana uses 4 standard PromQL queries (RED method).
Anti-pattern: a user_id label with 100K series → fix: aggregate by tier for 3 series.
File paths: NEW crates/shop-api/src/metrics.rs + NEW crates/shop-api/src/middleware/metrics_layer.rs; refactored crates/shop-api/src/routes/metrics.rs.
The stack now has 9 layers (the 8 from B80 + the new metrics_layer from B81).
This is the foundation for B82 (per-route timeouts), G15 OpenTelemetry distributed tracing, and the G18 production observability stack.

Review Exercises

Answer on your own first; the answers are at the end:

The 4 Prometheus metric types: analyze the pros and cons of Counter/Gauge/Histogram/Summary. Why does Histogram beat Summary in about 95% of real-world cases? Give an example scenario that aggregates across instances.
MatchedPath vs raw uri.path(): analyze the cardinality pitfall. For /products/{slug} with 1M slugs, how does the number of time series differ between the raw URI and the template? How do you estimate the memory blow-up on Prometheus?
The 12 fixed buckets from 1ms to 10s: what is the rule for choosing a bucket distribution for a web API? How does PromQL histogram_quantile(0.95, ...) compute p95 from the bucket array? How do badly chosen buckets distort the result?
The label cardinality cap of 1000: the anti-pattern is a user_id label with 100K users. How does aggregating by tier reduce the series count? For the real Shop API, which dimensions can be aggregated and which can NOT?
The 4 Grafana PromQL queries: rate(http_requests_total[5m]) vs irate(http_requests_total[5m]). When do you use rate, and when irate? What is the irate pitfall with a sparse scrape interval?

Answers

The 4 Prometheus metric types, pros and cons:
Counter is monotonic: it only increases and resets on restart. Pros: simple, with a single += operator; rate() gives correct requests/s across instances as long as you compute rate() per series first and sum afterwards. Cons: it does NOT expose a current absolute value (you need the delta between 2 scrapes); rate() detects a reset and compensates per series, but summing counters before rate() hides the reset and produces a false spike, and the increments between the last scrape and the restart are lost. Use case: event totals (requests, errors, bytes sent).
Gauge moves freely up and down as a point-in-time value. Pros: reflects current state directly (pool size, memory, temperature) with no computation. Cons: aggregating across instances is more involved (sum/avg/max depending on the semantics), and events between 2 scrapes are missed (a gauge that spikes and drops back within the 15s scrape interval is never seen). Use case: capacity gauges (pool, memory, CPU).
Histogram is a bucketed distribution: count + sum + bucket array. Pros: can be aggregated across instances (sum the _bucket rates, then histogram_quantile); no quantile needs to be configured on the client; with fixed buckets the client library only increments counters; tail latency is visible in the upper buckets. Cons: cardinality grows with the number of buckets (12 buckets × the number of label combinations, plus +Inf, _sum, and _count); badly chosen buckets give wrong quantiles (if the true p99 is 200ms but the largest finite bucket is 100ms, histogram_quantile(0.99) returns 100ms). Use case: latency distributions, response size distributions.
Summary pre-computes quantiles on the client. Pros: fast queries with no PromQL computation; quantiles are accurate on the client. Cons: CANNOT be aggregated across instances (the mean of medians ≠ the median of the whole, a fundamental mathematical fact); the configured quantiles are fixed and inflexible; each instance keeps a window of data, which costs client memory. Use case: rarely used, only for a single instance with fixed quantiles known in advance.
Why Histogram beats Summary in about 95% of cases: (a) with 10-100 microservice instances, a Histogram sums buckets across instances and 1 query estimates the overall quantile; a Summary cannot be aggregated, and per-instance dashboards say little about the whole; (b) quantiles are flexible at query time: a Histogram yields p50/p90/p95/p99 ad hoc from 1 bucket array, while a Summary fixes its quantiles in client configuration, so changing them means a redeploy; (c) client memory stays low: a Histogram keeps only a counter per bucket, while a Summary keeps a window of N samples (quantile-tracking algorithms cost memory). Shop API SKIPS Summary for the whole series.
Cross-instance aggregation scenario: 10 Shop API pods on K8s, each scraped separately by Prometheus. histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{service="shop-api"}[5m]))) computes the overall p95 latency across the 10 pods (sum the buckets of all 10 pods, then take the quantile, which reflects the whole population). With Summary you would compute 10 separate quantiles and then avg/max them, which is mathematically wrong and does not reflect the overall distribution.
MatchedPath vs raw URI, the cardinality pitfall: the raw URI from request.uri().path() is the full path with parameter values embedded (/api/v1/products/iphone-15, /api/v1/products/samsung-s24, /api/v1/orders/12345, /api/v1/orders/67890). MatchedPath returns the route template that axum matched (/api/v1/products/{slug}, /api/v1/orders/{id}); the number of templates is fixed by the number of route definitions (~100 for the complete Shop API).
Scenario with 1M product slugs: the Shop API catalog has 1M products with distinct slugs, and users request all 1M unique slugs over the course of 1 day. With the raw URI as a label: http_requests_total{path="/api/v1/products/iphone-15"}, ...path="/api/v1/products/samsung-s24"... → 1M time series for this counter alone. Add http_request_duration_seconds_bucket: 12 buckets × 1M paths × 10 methods = 120M time series in the worst case. Prometheus memory: at ~3KB per series, 120M × 3KB = 360GB of RAM → Prometheus runs out of memory long before it can ingest that.
With the MatchedPath template: 1 path value, /api/v1/products/{slug}, no matter how many distinct slugs clients send → the counter has only 50 series (10 methods × 5 common statuses) and the histogram 12 buckets × 10 methods = 120 series, so fewer than 200 series for the products route. Across the whole Shop API, ~100 routes × 200 series/route = ~20K series as a worst-case upper bound; most routes accept only 1 or 2 methods, so the real count is far lower, and a Prometheus server with 8GB of RAM handles it comfortably.
Memory estimate (rough planning figures): ~3KB of active memory per series, so 100K series = ~300MB; 1M series = ~3GB; 10M series = ~30GB, which is around the practical limit assumed here for a dual-instance HA Prometheus setup. Rule of thumb: target < 1M series for the whole cluster and alert when usage passes 80%. The Shop API goal is < 10K series per service.
The 1ms to 10s buckets + histogram_quantile: the rule for choosing a bucket distribution for a web API is fine-grained at the low end (a healthy web API usually has a P95 of 10-100ms, so dense buckets at 1ms/5ms/10ms/25ms/50ms/100ms keep the quantile accurate) and sparse at the high end (tail latency > 1s is rare, so 250ms/500ms/1s/2.5s/5s/10s is enough to detect outliers). The spread is roughly logarithmic: each bucket is about 2x-5x its neighbor. 12 buckets balance quantile accuracy against cardinality cost.
How histogram_quantile works: the input is an array of cumulative bucket counts. Taking the 4 buckets shown in the sample output above: {le="0.005"} 89, {le="0.025"} 130, {le="0.1"} 142, {le="+Inf"} 142. The algorithm: (a) total count = 142 (the +Inf bucket); (b) the target rank for quantile 0.95 is 142 × 0.95 = 134.9; (c) find the first bucket whose cumulative count is ≥ 134.9: le="0.025" has 130 (not enough), le="0.1" has 142 (enough); (d) interpolate linearly between le="0.025" (130) and le="0.1" (142): (134.9 - 130) / (142 - 130) = 0.408 → value = 0.025 + 0.408 × (0.1 - 0.025) = 0.025 + 0.031 = 0.056s = 56ms. So p95 ≈ 56ms. With all 12 buckets exported, the interpolation runs inside whichever narrower bucket contains rank 134.9.
How wrong buckets distort the result: if the low end is sparse (only 3 buckets: 0.1, 1, 10), a true p95 of 50ms is interpolated over the range 0 → 0.1 under a linear assumption, so the error can be large (the result may come out near 50ms, but it may also be 95ms when the real distribution is skewed). If the largest finite bucket is below the true p99 (e.g. the largest bucket is 1s but the true p99 is a 2s outlier), histogram_quantile returns 1s, the upper bound of the largest finite bucket, because observations in the +Inf bucket cannot be interpolated.
The Shop API pattern: 12 buckets from 1ms to 10s cover the web API happy path (1ms-100ms) + the tail (1s-10s) + outliers (> 10s lands in the +Inf bucket; detect them with rate(_bucket{le="+Inf"}) - rate(_bucket{le="10"})).
Migrating buckets: do NOT change buckets on a live metric, since that breaks queries over history (old and new buckets cannot be summed). To change buckets, create a new _v2 metric that runs in parallel, move the queries over within 14 days, then drop the old metric.
The cardinality cap of 1000 + aggregating by tier: the anti-pattern is a user_id label with 100K users. Each user hitting 1 endpoint creates its own time series, e.g. http_requests_total{method="GET",path="/api/v1/cart",user_id="42"}; 100K users × 10 cart endpoints = 1M series for this metric alone, and it scales with user growth, i.e. unbounded.
The fix is to aggregate before labeling: turn user_id (high cardinality, unbounded) into tier (3 fixed values). Aggregation logic: let tier = match user.total_orders { 0..=10 => "new", 11..=100 => "regular", _ => "vip" }; → 3 series instead of 100K, a reduction of about 33,333x. The aggregated dimension is still useful for analysis: does the "vip" tier convert better? Does P95 latency differ between "new" and "regular"? Those business questions stay answerable without the memory cost.
Shop API dimensions that can be aggregated: user_tier (3 values: vip/regular/new), product_category (~20 values: electronics/clothing/books/...), order_payment_method (3 values: cod/stripe/bank_transfer), request_region (~10 values, clusters of Vietnamese provinces), device_class (3 values: mobile/tablet/desktop, parsed from User-Agent), auth_type (3 values: anonymous/authenticated/admin).
Dimensions that can NOT be used as labels: raw user_id (kept in the B80 logs for debugging a single request), raw order_id, the request_id trace identifier (kept in the B80 tracing span), session_id, IP address (privacy concerns + a cardinality of ~4 billion).
Defensive enforcement: a custom clippy lint previewed for G19, shop_clippy::no_id_in_metric_label, which matches the regex metrics::(counter|gauge|histogram)!.*(user_id|order_id|request_id|session_id) and warns at compile time; plus a mandatory code review checklist item for every PR that touches metrics. One bad production deploy can mean hours of downtime, as in the reported Shopify 2019 incident.
rate vs irate: rate(metric[5m]) computes the average per-second rate of increase over a 5-minute window using ALL samples in the window (a smooth curve with little noise, suited to long-term dashboards and sustained alerts). irate(metric[5m]) computes an instantaneous rate from the LAST 2 samples in the window (it reacts immediately and captures spikes quickly, but it is noisy).
When to use rate: (a) dashboards for request rate / error rate / latency trends, where rate stays smooth instead of jittery; (b) alert rules on sustained conditions: rate(error_total[5m]) > 0.1 for 5m fires only when the average rate stays > 0.1 for 5 minutes (filtering out 10s false-positive spikes); (c) capacity planning: avg_over_time(rate(http_requests_total[5m])[1d:5m]) gives the average request rate over 24h for headroom estimates.
When to use irate: (a) debugging an incident in real time, to see exactly which second a spike occurred; (b) ad hoc queries in the Prometheus UI to explore behavior; (c) volatile metrics where detail matters (CPU usage / GC pauses).
The irate pitfall with a sparse scrape interval: irate needs ≥ 2 samples in the window. With scrape interval = 15s and window = 1m there are 4 samples, and irate takes the last 2, which is fine. With window = 30s there are only about 2 samples: irate still works, but a single late or missed scrape leaves it with 1 sample, and any spike between 2 scrapes is invisible. With a window < 2 × scrape_interval, irate returns an empty result because it never sees 2 samples.
Rule of thumb: the query window MUST be ≥ 4 × scrape_interval for a reliable rate (15s × 4 = 60s minimum), and MUST be ≥ 2 × scrape_interval for irate at the very least, with 4 × recommended for stability. Shop API fixes scrape_interval = 15s in B81 → dashboard query windows ≥ 1m (rate(...[5m]) as the standard, irate(...[1m]) for debugging). If G18 scales to 100+ instances, scrape_interval can be raised to 30s (scraping less often) to save Prometheus resources, with the minimum query window raised to 2m accordingly.
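The 4 fixed queries and the RED method (Rate / Errors / Duration): (1) Rate = total requests/s; (2) Errors = the error ratio; (3) Duration = P95 latency; (4) the top 5 slowest routes by P95, a per-route breakdown of Duration. Saturation (resource utilization: pool / memory / CPU) is not part of RED; it belongs to the USE method and the Four Golden Signals, and on this dashboard it is covered by the pool gauges such as shop_db_pool_utilization. RED was popularized by Tom Wilkie, and these 4 panels are the standard for every Shop API service dashboard.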
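
For the rate versus irate answer above, a short sketch makes the "last 2 samples" behavior and the window rule tangible. It is a simplified model, not PromQL itself; irate here only looks at the final pair of samples in the window, and samples_in_window is a back-of-the-envelope helper invented for this example.

```rust
/// One scraped sample of a counter: (unix time in seconds, counter value).
type Sample = (f64, f64);

/// Instantaneous rate from the last 2 samples in the window.
/// Returns None when the window holds fewer than 2 samples.
fn irate(samples: &[Sample]) -> Option<f64> {
    let n = samples.len();
    if n < 2 {
        return None;
    }
    let (t0, v0) = samples[n - 2];
    let (t1, v1) = samples[n - 1];
    let dt = t1 - t0;
    if dt <= 0.0 {
        return None;
    }
    // A drop means the counter reset; count the post-reset value.
    let increase = if v1 >= v0 { v1 - v0 } else { v1 };
    Some(increase / dt)
}

/// Roughly how many scrapes fit in a query window.
fn samples_in_window(window_secs: u64, scrape_interval_secs: u64) -> u64 {
    window_secs / scrape_interval_secs
}
```

For samples (0s, 100), (15s, 115), (30s, 130), (45s, 430), irate reports 20 per second because only the last jump counts, while the average over the whole window is about 7.3 per second. A 20s window at a 15s scrape interval holds a single sample, so irate has nothing to compute.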

Next Lesson

Lesson 82: Per-Route Timeouts with tower-http TimeoutLayer: per-route timeout config (5s default + 30s import + 60s upload), a graceful 504 Gateway Timeout response, timeout pitfalls + background task cleanup.
