Table of Contents
- Lesson Objectives
- The 4 Prometheus Metric Types
- Installing the metrics Crate + Prometheus Exporter
- HTTP Request Metrics Middleware
- Refactoring the B57 /metrics Endpoint
- Custom Business Metrics: Service Layer Pattern
- Cardinality Pitfall + Label Design Rules
- Prometheus Scrape Config + Grafana Dashboard Preview
- Summary
- Review Exercises
- Next Lesson
Lesson Objectives
After this lesson, you will be able to:
- Explain the 4 Prometheus metric types (Counter, Gauge, Histogram, Summary), with the pros, cons, and use cases of each.
- Refactor the B57 /metrics endpoint from the manual format to the metrics + metrics-exporter-prometheus crates.
- Implement HTTP request metrics middleware: a request_count counter, a response_time histogram, and an error_count counter.
- Use labels (key=value) for dimensional metrics: method, status, route.
- Recognize the label cardinality pitfall: high cardinality blows up memory on the Prometheus server.
- Set up a Prometheus scrape config and a Grafana dashboard preview with 4 standard PromQL queries.
- Apply the service layer pattern for business metrics: orders_created_total, payment_succeeded_total, webhook_received_total.
- Use MatchedPath to extract the route template instead of the raw URI; this is MANDATORY to prevent cardinality blow-up.
The 4 Prometheus Metric Types
Prometheus defines 4 metric types (see the spec at prometheus.io/docs/concepts/metric_types). Choosing the type for each Shop API dimension directly affects the quality of the downstream dashboards and alerts.
Counter: only increases (monotonic) and resets when the process restarts:
- Use case: request_count_total, error_count_total, bytes_sent_total.
- Only operator: += (increase). It NEVER decreases.
- Naming convention: _total suffix (http_requests_total, orders_created_total).
- Query: PromQL rate(http_requests_total[5m]) = average requests per second over the last 5 minutes.
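To make the rate() semantics concrete, here is a minimal std-only sketch of how a per-second rate can be derived from scraped counter samples, including compensation for a counter reset. It is a simplified model, not Prometheus's implementation: the real rate() also extrapolates to the window boundaries, which this sketch ignores. The names Sample and simple_rate are illustrative assumptions.

```rust
/// One scraped counter sample: (time in seconds, counter value).
#[derive(Clone, Copy)]
struct Sample {
    t: f64,
    v: f64,
}

/// Simplified per-second rate over a window of counter samples.
/// A drop in value is treated as a process restart (counter reset),
/// so the post-reset value is counted as a fresh increase from zero.
fn simple_rate(samples: &[Sample]) -> Option<f64> {
    if samples.len() < 2 {
        return None; // at least two samples are needed to form a rate
    }
    let mut increase = 0.0;
    for pair in samples.windows(2) {
        let (prev, cur) = (pair[0], pair[1]);
        increase += if cur.v >= prev.v { cur.v - prev.v } else { cur.v };
    }
    let elapsed = samples[samples.len() - 1].t - samples[0].t;
    if elapsed <= 0.0 {
        return None;
    }
    Some(increase / elapsed)
}

fn main() {
    // Scraped every 15s; the process restarts between the 2nd and 3rd scrape.
    let samples = [
        Sample { t: 0.0, v: 100.0 },
        Sample { t: 15.0, v: 160.0 },
        Sample { t: 30.0, v: 30.0 }, // reset: the counter restarted from 0
        Sample { t: 45.0, v: 90.0 },
    ];
    // Increase = 60 + 30 + 60 = 150 over 45 seconds.
    println!("rate = {:?} req/s", simple_rate(&samples));
}
```

Because the reset is absorbed as a fresh increase, a restart does not produce a negative or inflated rate in this model.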
Gauge: goes up and down freely, a point-in-time value:
- Use case: db_pool_active, memory_usage_bytes, temperature_celsius, active_connections.
- Operators: = (set), += (increase), -= (decrease).
- The B57 pool metrics already use this pattern: shop_db_pool_size, shop_db_pool_active, shop_db_pool_utilization.
- Query directly in PromQL: shop_db_pool_utilization returns the value from the most recent scrape.
Histogram: bucket distribution (count + sum + bucket array):
- Use case: request_duration_seconds_bucket, response_size_bytes_bucket.
- Emits 3 kinds of series per histogram: <name>_bucket{le="0.1"}, <name>_sum, <name>_count.
- Compute p50, p95, p99 with PromQL: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])).
- Fixed buckets for HTTP latency: fine-grained at the low end (web API latency under 100ms is common), 12 buckets from 1ms to 10s.
- Can be aggregated across instances (sum the _bucket series, then compute the quantile).
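The cumulative bucket semantics above can be sketched with a small std-only model. This is an illustration of the data layout, assuming a hypothetical Histogram type; it is not how the metrics crate stores observations internally.

```rust
/// Minimal cumulative histogram: counts[i] holds the number of observations
/// with value <= bounds[i]. The implicit +Inf bucket equals `count`.
struct Histogram {
    bounds: Vec<f64>,
    counts: Vec<u64>,
    sum: f64,
    count: u64,
}

impl Histogram {
    fn new(bounds: &[f64]) -> Self {
        Self {
            bounds: bounds.to_vec(),
            counts: vec![0; bounds.len()],
            sum: 0.0,
            count: 0,
        }
    }

    /// Record one observation. Buckets are cumulative, so every bucket
    /// whose upper bound is >= value is incremented.
    fn observe(&mut self, value: f64) {
        for (bound, slot) in self.bounds.iter().zip(self.counts.iter_mut()) {
            if value <= *bound {
                *slot += 1;
            }
        }
        self.sum += value;
        self.count += 1;
    }

    /// Render the three kinds of series: _bucket, _sum, _count.
    fn render(&self, name: &str) -> String {
        let mut out = String::new();
        for (bound, n) in self.bounds.iter().zip(&self.counts) {
            out.push_str(&format!("{name}_bucket{{le=\"{bound}\"}} {n}\n"));
        }
        out.push_str(&format!("{name}_bucket{{le=\"+Inf\"}} {}\n", self.count));
        out.push_str(&format!("{name}_sum {}\n", self.sum));
        out.push_str(&format!("{name}_count {}\n", self.count));
        out
    }
}

fn main() {
    let mut h = Histogram::new(&[0.01, 0.1, 1.0]);
    for latency in [0.003, 0.04, 2.0] {
        h.observe(latency);
    }
    print!("{}", h.render("http_request_duration_seconds"));
}
```

Since each instance only keeps per-bucket counters, summing the same bucket across instances is well defined, which is what makes cross-instance quantiles possible.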
Summary: quantiles pre-computed on the client:
- Use case: rarely used (the lesson's rule of thumb is that Histogram is the better choice in roughly 95% of real cases).
- Pros: no PromQL computation needed, so queries are fast.
- Cons: CANNOT be aggregated across instances (the mean of per-instance medians is not the overall median); quantiles are limited to what the client was configured with.
- The Shop API SKIPS Summary for the whole lesson series.
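The aggregation problem with Summary can be checked with a few lines of arithmetic. The sketch below uses made-up latencies on two hypothetical instances and a simple nearest-rank median helper; all names and numbers are illustrative assumptions.

```rust
/// Nearest-rank (lower) median of a slice; illustrative helper.
fn median(values: &[f64]) -> f64 {
    let mut v = values.to_vec();
    v.sort_by(|a, b| a.partial_cmp(b).unwrap());
    v[(v.len() - 1) / 2]
}

fn main() {
    // Hypothetical latencies in ms: instance A served 9 fast requests,
    // instance B served a single slow one.
    let a = [10.0; 9];
    let b = [500.0];

    // What averaging two pre-computed per-instance medians would report.
    let avg_of_medians = (median(&a) + median(&b)) / 2.0;

    // The true median over all 10 requests.
    let all: Vec<f64> = a.iter().chain(b.iter()).copied().collect();
    let global = median(&all);

    println!("average of medians = {avg_of_medians} ms, global median = {global} ms");
}
```

The averaged value (255 ms) is far from the true median (10 ms) because the per-instance quantiles carry no information about how much traffic each instance served.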
Metric type decisions for the Shop API:
- Counter: http_requests_total, http_errors_total, orders_created_total, payment_succeeded_total.
- Gauge: shop_db_pool_active, shop_db_pool_idle (carried over from B57; handled differently from the automatic middleware metrics because they are point-in-time reads).
- Histogram: http_request_duration_seconds, cart_checkout_duration_seconds.
- Summary: SKIPPED entirely.
Installing the metrics Crate + Prometheus Exporter
Update the workspace dependencies in shop/Cargo.toml:
[workspace.dependencies]
metrics = "0.23"
metrics-exporter-prometheus = "0.15"
The metrics crate provides a facade (the counter!, gauge!, histogram! macros), similar to the tracing crate; metrics-exporter-prometheus is the backend recorder that renders the Prometheus exposition format. The convention: emit metrics everywhere through the facade macros, and keep the backend swappable (Datadog / StatsD / Prometheus / OTel).
Create a new file crates/shop-api/src/metrics.rs:
// File: crates/shop-api/src/metrics.rs
use metrics_exporter_prometheus::{Matcher, PrometheusBuilder, PrometheusHandle};

/// HTTP latency buckets: 12 buckets from 1ms to 10s.
/// Fine-grained at the low end (most web API requests finish in < 100ms),
/// coarse at the high end (tail latency > 1s still needs to be visible).
const HTTP_LATENCY_BUCKETS: &[f64] = &[
    0.001, 0.005, 0.01, 0.025, 0.05, 0.1,
    0.25, 0.5, 1.0, 2.5, 5.0, 10.0,
];

pub fn install_recorder() -> PrometheusHandle {
    PrometheusBuilder::new()
        // Per-metric bucket override is MANDATORY for histograms:
        // without configured buckets the exporter renders these metrics as
        // summaries, which histogram_quantile cannot aggregate.
        .set_buckets_for_metric(
            Matcher::Full("http_request_duration_seconds".into()),
            HTTP_LATENCY_BUCKETS,
        )
        .unwrap()
        .set_buckets_for_metric(
            Matcher::Full("cart_checkout_duration_seconds".into()),
            HTTP_LATENCY_BUCKETS,
        )
        .unwrap()
        .install_recorder()
        .expect("install metrics recorder")
}
install_recorder() returns a PrometheusHandle, whose .render() method produces the Prometheus exposition text for every metric emitted so far. The handle is stored in AppState so the /metrics handler can render it when a scrape arrives.
Update crates/shop-api/src/state.rs to extend AppState:
// File: crates/shop-api/src/state.rs
use metrics_exporter_prometheus::PrometheusHandle;

#[derive(Clone)]
pub struct AppState {
    pub config: AppConfig,
    pub db: PgPool,
    // ... the 6 Arc<dyn> services fixed in B72
    pub product_service: Arc<dyn ProductService>,
    pub order_service: Arc<dyn OrderService>,
    // ...
    /// Prometheus handle that renders the metric text on a /metrics request.
    pub metrics_handle: PrometheusHandle,
}
Update crates/shop-api/src/main.rs to install the recorder BEFORE building AppState, so that every metric emitted from the middleware and the service layer is routed to the right recorder:
// File: crates/shop-api/src/main.rs
fn main() -> anyhow::Result<()> {
    dotenvy::dotenv().ok();
    let config = AppConfig::from_env()?;
    init_tracing(config.env); // fixed in B80

    // Install the metrics recorder BEFORE building AppState.
    // The recorder is a global singleton: the counter!/histogram! macros
    // route to this recorder from anywhere in the process.
    let metrics_handle = shop_api::metrics::install_recorder();

    // Build the Tokio runtime afterwards (the recorder must be ready
    // before any async task is spawned).
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()?;

    runtime.block_on(async {
        let pool = create_pool(&PoolConfig::from_config(&config)).await?;
        let state = AppState {
            config: config.clone(),
            db: pool,
            // ... wire the 6 services
            metrics_handle,
        };
        // ... build router + serve
        Ok::<_, anyhow::Error>(())
    })
}
The 12 buckets from 1ms to 10s are spaced roughly logarithmically: dense at the low end (1ms, 5ms, 10ms, 25ms, 50ms, 100ms, matching a healthy web API whose P95 is usually < 100ms) and sparse at the high end (250ms, 500ms, 1s, 2.5s, 5s, 10s, enough to detect tail-latency outliers). 12 configured buckets mean 12 _bucket series per label combination, plus the +Inf bucket, _sum, and _count (15 series in total), which keeps cardinality manageable.
HTTP Request Metrics Middleware
Create a new file crates/shop-api/src/middleware/metrics_layer.rs:
// File: crates/shop-api/src/middleware/metrics_layer.rs
use axum::{
    extract::{MatchedPath, Request},
    middleware::Next,
    response::Response,
};
use std::time::Instant;

pub async fn metrics_middleware(req: Request, next: Next) -> Response {
    let start = Instant::now();
    let method = req.method().clone();

    // Use MatchedPath to get the route template instead of the raw URI,
    // which prevents cardinality blow-up:
    //   raw URI: /products/iphone-15, /products/samsung-s24, ...
    //   matched: /products/{slug} (a single label value)
    let path = req
        .extensions()
        .get::<MatchedPath>()
        .map(|p| p.as_str().to_string())
        .unwrap_or_else(|| "<unmatched>".to_string());

    // Forward the request down the stack.
    let response = next.run(req).await;
    let status = response.status().as_u16();
    let latency = start.elapsed().as_secs_f64();

    // Counter: total requests, 3 standard dimensions (method + path + status).
    metrics::counter!(
        "http_requests_total",
        "method" => method.as_str().to_string(),
        "path" => path.clone(),
        "status" => status.to_string(),
    )
    .increment(1);

    // Histogram: request duration, 2 dimensions (method + path).
    // No status label, to avoid multiplying cardinality by ~50 status codes.
    metrics::histogram!(
        "http_request_duration_seconds",
        "method" => method.as_str().to_string(),
        "path" => path.clone(),
    )
    .record(latency);

    // Counter: errors, emitted only for 4xx/5xx to reduce cardinality
    // (not emitted for 2xx, since http_requests_total already tracks the total).
    if status >= 400 {
        metrics::counter!(
            "http_errors_total",
            "method" => method.as_str().to_string(),
            "path" => path,
            "status" => status.to_string(),
        )
        .increment(1);
    }

    response
}
Shop API conventions:
- Using MatchedPath instead of the raw URI is MANDATORY:
  - Bad: /products/iphone-15, /products/samsung-s24 → each slug is a different label value; 1M slugs = 1M time series.
  - Good: /products/{slug} → a single label value for every product slug.
- Limit labels to 3 standard dimensions: method (~10 HTTP method values) + path (~100 Shop API route templates) + status (~50 common status codes).
- Do NOT add user_id / request_id labels: they are high cardinality (10K users = 10K series per metric; 100K = blow-up).
- The histogram drops the status label: the latency distribution by route + method is meaningful enough, and adding status multiplies cardinality by up to 50 with no analytical benefit.
Update crates/shop-api/src/middleware/mod.rs:
// File: crates/shop-api/src/middleware/mod.rs
pub mod trace_layer; // B80
pub mod metrics_layer; // B81 NEW
pub use trace_layer::custom_trace_layer;
pub use metrics_layer::metrics_middleware;
Wire it into the router (crates/shop-api/src/router.rs). Place it INSIDE trace_layer so the trace span already exists when the metrics are emitted (log events and metrics then correlate through the same request_id):
// File: crates/shop-api/src/router.rs
use axum::{middleware as axum_mw, routing::get, Router};
use tower_http::limit::RequestBodyLimitLayer;
use crate::middleware::{custom_trace_layer, metrics_middleware};

pub fn build_router(state: AppState) -> Router {
    Router::new()
        .merge(routes::health::routes())
        .route("/metrics", get(routes::metrics::metrics))
        .nest("/api/v1", api_v1())
        // INNER: metrics_layer needs MatchedPath to be set already.
        .layer(axum_mw::from_fn(metrics_middleware)) // B81 NEW
        // Further OUT: trace_layer wraps a span around every event.
        .layer(custom_trace_layer()) // B80
        .layer(RequestBodyLimitLayer::new(2 * 1024 * 1024)) // B79 OUTERMOST
        .with_state(state)
}
A note on ordering: axum populates MatchedPath into the request extensions only AFTER the router has matched a route, so middleware that reads MatchedPath must be added AFTER .route()/.nest() in the builder chain (the lesson relies on axum 0.8 applying a from_fn layer to the routes registered before it). The stack now has 9 layers (the 8 from B80 plus the new metrics_layer from B81).
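The "later .layer() call wraps everything before it" rule can be modelled without axum. The sketch below is a toy std-only model of layered services, assuming a hypothetical Service type alias; it only illustrates call order and is not axum's or tower's actual machinery.

```rust
/// Toy model: a "service" takes the list of layers seen so far and returns it.
type Service = Box<dyn Fn(Vec<&'static str>) -> Vec<&'static str>>;

/// Wrap `inner` in a named layer that records when the request enters it.
fn layer(name: &'static str, inner: Service) -> Service {
    Box::new(move |mut seen: Vec<&'static str>| {
        seen.push(name);
        inner(seen)
    })
}

/// Build the stack in the same order as the builder chain and report
/// the order in which a request passes through the layers.
fn call_order() -> Vec<&'static str> {
    let handler: Service = Box::new(|mut seen: Vec<&'static str>| {
        seen.push("handler");
        seen
    });
    // Each later call wraps the service built so far, so it ends up further out.
    let svc = layer("metrics", handler);
    let svc = layer("trace", svc);
    let svc = layer("body_limit", svc);
    svc(Vec::new())
}

fn main() {
    println!("{:?}", call_order());
}
```

In this model a request enters body_limit first, then trace, then metrics, and reaches the handler last, which mirrors the OUTERMOST/INNER comments in the router.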
Refactoring the B57 /metrics Endpoint
The /metrics endpoint now has to combine 2 sources: (a) the automatic HTTP metrics from the middleware, rendered by PrometheusHandle::render(), and (b) the B57 pool gauge metrics in manual format (pool stats are not emitted through the metrics macros; they are read directly from the pool, for example PgPool::size()).
Refactor crates/shop-api/src/routes/metrics.rs:
// File: crates/shop-api/src/routes/metrics.rs
use axum::{extract::State, http::header, response::IntoResponse};
use crate::state::AppState;

pub async fn metrics(State(state): State<AppState>) -> impl IntoResponse {
    // 1. Automatic HTTP metrics from the middleware (counter + histogram).
    let http_metrics = state.metrics_handle.render();

    // 2. Pool gauge metrics in the manual format carried over from B57.
    // size() and num_idle() are two separate reads, so saturating_sub guards
    // against a transient state where idle momentarily exceeds size.
    let pool_active = state.db.size().saturating_sub(state.db.num_idle() as u32);
    let pool_utilization =
        pool_active as f64 / state.config.pool_max_connections as f64;
    let pool_metrics = format!(
        "\n# HELP shop_db_pool_size Total connections in pool\n\
         # TYPE shop_db_pool_size gauge\n\
         shop_db_pool_size {}\n\
         # HELP shop_db_pool_idle Idle connections\n\
         # TYPE shop_db_pool_idle gauge\n\
         shop_db_pool_idle {}\n\
         # HELP shop_db_pool_active Active connections\n\
         # TYPE shop_db_pool_active gauge\n\
         shop_db_pool_active {}\n\
         # HELP shop_db_pool_max Max connections allowed\n\
         # TYPE shop_db_pool_max gauge\n\
         shop_db_pool_max {}\n\
         # HELP shop_db_pool_utilization Pool utilization ratio\n\
         # TYPE shop_db_pool_utilization gauge\n\
         shop_db_pool_utilization {:.4}\n",
        state.db.size(),
        state.db.num_idle(),
        pool_active,
        state.config.pool_max_connections,
        pool_utilization,
    );

    (
        [(
            header::CONTENT_TYPE,
            "text/plain; version=0.0.4; charset=utf-8",
        )],
        format!("{}{}", http_metrics, pool_metrics),
    )
}
The endpoint returns a body that combines both parts. Example output (abbreviated) after 142 GET /api/v1/products requests plus 5 validation errors (422):
# HTTP metrics (automatic, from the middleware)
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api/v1/products",status="200"} 142
http_requests_total{method="POST",path="/api/v1/products",status="422"} 5
# HELP http_request_duration_seconds HTTP request duration
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="GET",path="/api/v1/products",le="0.005"} 89
http_request_duration_seconds_bucket{method="GET",path="/api/v1/products",le="0.025"} 130
http_request_duration_seconds_bucket{method="GET",path="/api/v1/products",le="0.1"} 142
http_request_duration_seconds_bucket{method="GET",path="/api/v1/products",le="+Inf"} 142
http_request_duration_seconds_count{method="GET",path="/api/v1/products"} 142
http_request_duration_seconds_sum{method="GET",path="/api/v1/products"} 3.45
# HELP http_errors_total Total number of HTTP errors (4xx/5xx)
# TYPE http_errors_total counter
http_errors_total{method="POST",path="/api/v1/products",status="422"} 5
# Pool metrics (manual format continued from B57; HELP/TYPE lines omitted here)
shop_db_pool_size 5
shop_db_pool_idle 3
shop_db_pool_active 2
shop_db_pool_max 20
shop_db_pool_utilization 0.1000
B81 convention: the /metrics endpoint is a composite of the automatic HTTP metrics from the middleware (the bulk of the output) and the manually rendered pool gauges (low overhead, read directly from the sqlx pool API). G15 will migrate the pool gauges to the gauge! macro plus a background task that updates them every 5s, to unify the pipeline.
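The manual gauge block repeats the same three-line shape for every metric. A small helper makes that shape explicit and keeps the pool arithmetic testable without a database. This is a sketch under assumptions: render_gauge and pool_stats are hypothetical helper names, not part of the lesson's code or of sqlx.

```rust
/// Render one unlabelled gauge in the Prometheus text exposition format.
fn render_gauge(name: &str, help: &str, value: f64) -> String {
    format!("# HELP {name} {help}\n# TYPE {name} gauge\n{name} {value}\n")
}

/// Derive active connections and utilization from raw pool numbers.
/// saturating_sub avoids underflow if idle is momentarily larger than size,
/// and a zero max yields 0.0 instead of a division by zero.
fn pool_stats(size: u32, idle: u32, max: u32) -> (u32, f64) {
    let active = size.saturating_sub(idle);
    let utilization = if max == 0 {
        0.0
    } else {
        f64::from(active) / f64::from(max)
    };
    (active, utilization)
}

fn main() {
    // Same numbers as the sample output: size 5, idle 3, max 20.
    let (active, utilization) = pool_stats(5, 3, 20);
    let mut body = String::new();
    body.push_str(&render_gauge("shop_db_pool_size", "Total connections in pool", 5.0));
    body.push_str(&render_gauge("shop_db_pool_active", "Active connections", f64::from(active)));
    body.push_str(&render_gauge("shop_db_pool_utilization", "Pool utilization ratio", utilization));
    print!("{body}");
}
```

Keeping the arithmetic in a pure function also makes the edge cases (idle larger than size, max of zero) easy to cover in unit tests.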
Custom Business Metrics: Service Layer Pattern
The HTTP middleware tracks infrastructure metrics (request/response latency and errors). Business metrics (orders, transactions, domain events) are emitted in the service layer, which is where the business event actually happens and is independent of the HTTP transport.
Convention: emit the metric inside the service layer:
// File: crates/shop-core/src/orders.rs (PgOrderService)
use std::time::Instant;

impl OrderService for PgOrderService {
    async fn create_order(
        &self,
        dto: CreateOrderDto,
        actor_user_id: i64,
        request_id: &str,
    ) -> Result<OrderResponseDto, OrderError> {
        let start = Instant::now();
        let result = shop_db::orders::create_order_atomic(
            &self.pool,
            actor_user_id,
            &dto.items,
            &dto.payment_method,
            request_id,
        )
        .await;
        let elapsed = start.elapsed().as_secs_f64();

        match &result {
            Ok(order) => {
                metrics::counter!(
                    "orders_created_total",
                    "payment_type" => dto.payment_method.label(), // cod | stripe | bank_transfer
                )
                .increment(1);
                metrics::histogram!("order_creation_duration_seconds")
                    .record(elapsed);
                tracing::info!(
                    order_id = order.id,
                    total = %order.total,
                    "order created"
                );
            }
            Err(e) => {
                metrics::counter!(
                    "orders_failed_total",
                    "reason" => e.label(), // insufficient_stock | product_not_found | ...
                )
                .increment(1);
            }
        }

        result.map(OrderResponseDto::from)
    }
}
Business metric conventions for the Shop API:
- orders_created_total{payment_type} counter: split by payment method (3 values: cod, stripe, bank_transfer).
- orders_failed_total{reason} counter: classifies business errors (insufficient_stock / product_not_found / payment_declined).
- payment_succeeded_total{type} counter: emitted after the Stripe webhook payment_intent.succeeded from B71.
- webhook_received_total{event_type} counter: every Stripe webhook event type from B71 (payment_intent.succeeded / payment_intent.payment_failed / charge.refunded).
- cart_checkout_duration_seconds histogram: measures the end-to-end checkout flow.
Rule: one metric per critical service operation (create_order, checkout, payment_webhook, password_reset). Do NOT emit a metric for every function, since that only adds noise. Each metric gets a doc comment explaining its semantics plus a sample dashboard query, so new team members can onboard quickly.
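The service code calls .label() on the payment method and on the error, but the lesson does not show those methods. One plausible shape, sketched here with hypothetical enum definitions (the real PaymentMethod and OrderError types may differ), is a match that returns a fixed &'static str per variant, so the set of label values is bounded by the type itself.

```rust
/// Hypothetical payment method enum; only the .label() call appears in the lesson.
#[derive(Clone, Copy, Debug)]
#[allow(dead_code)]
enum PaymentMethod {
    Cod,
    Stripe,
    BankTransfer,
}

impl PaymentMethod {
    /// Fixed, low-cardinality label value: exactly one string per variant.
    fn label(self) -> &'static str {
        match self {
            PaymentMethod::Cod => "cod",
            PaymentMethod::Stripe => "stripe",
            PaymentMethod::BankTransfer => "bank_transfer",
        }
    }
}

/// Hypothetical subset of order errors carrying high-cardinality details.
#[derive(Debug)]
#[allow(dead_code)]
enum OrderError {
    InsufficientStock { product_id: i64 },
    ProductNotFound { product_id: i64 },
    PaymentDeclined,
}

impl OrderError {
    /// Maps the variant to a label and deliberately drops the product id,
    /// which belongs in logs rather than in metric labels.
    fn label(&self) -> &'static str {
        match self {
            OrderError::InsufficientStock { .. } => "insufficient_stock",
            OrderError::ProductNotFound { .. } => "product_not_found",
            OrderError::PaymentDeclined => "payment_declined",
        }
    }
}

fn main() {
    println!("{}", PaymentMethod::Stripe.label());
    println!("{}", OrderError::ProductNotFound { product_id: 7 }.label());
}
```

Because the match is exhaustive, adding a new variant forces a decision about its label at compile time, instead of letting an unbounded string leak into the metric.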
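Cardinality Pitfall + Label Design Rules
Cardinality is the number of time series Prometheus must store for one metric. In the worst case it equals the product of the number of unique values of each label (the size of the Cartesian product of the label value sets). Each unique label combination is its own time series, and each series costs roughly 3KB of memory on the Prometheus server (the working estimate used in this lesson).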
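The worst-case arithmetic is just a product, which a few lines can make explicit. The sketch below assumes the ~3KB-per-series estimate from the text (taken as 3,000 bytes); series_count and estimated_bytes are illustrative helper names.

```rust
/// Worst-case number of series for one metric: the product of the number
/// of distinct values of each label. No labels means a single series.
fn series_count(label_cardinalities: &[u64]) -> u64 {
    label_cardinalities.iter().product()
}

/// Rough memory estimate using the ~3KB per series working figure.
fn estimated_bytes(series: u64) -> u64 {
    series * 3_000
}

fn main() {
    // method (10) x path (100) x status (50): bounded labels.
    let bounded = series_count(&[10, 100, 50]);
    // The same metric with a user_id label for 100K users added.
    let with_user_id = series_count(&[10, 100, 50, 100_000]);

    println!("bounded: {bounded} series, ~{} MB", estimated_bytes(bounded) / 1_000_000);
    println!(
        "with user_id: {with_user_id} series, ~{} TB",
        estimated_bytes(with_user_id) / 1_000_000_000_000
    );
}
```

Real cardinality is usually below the worst case because not every label combination occurs, but one unbounded label is enough to multiply whatever the realistic number is.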
High-cardinality pitfalls:
- A user_id label with 100K users → 100K time series for each metric.
- 3 HTTP metrics × 100K users × 10 methods × 50 statuses = 150 million time series → Prometheus crashes with OOM.
- Reported real-world incident: a 3-hour Shopify outage in 2019, attributed to one developer adding an order_id label to a counter.
Label design rules for the Shop API:
- Cardinality cap: max 1000 unique values per label.
- OK labels (low cardinality, stable):
  - method: ~10 values (GET/POST/PUT/PATCH/DELETE/HEAD/OPTIONS).
  - status: ~50 values (200/201/204/301/302/400/401/403/404/409/422/429/500/502/503/504/...).
  - path: ~100 Shop API route templates (bounded by the route definitions).
  - payment_type: 3 values (cod / stripe / bank_transfer).
  - tier: 3 values (vip / regular / new); aggregate by tier instead of user_id.
- NOT OK labels (high cardinality, unbounded):
  - user_id, order_id, product_id, session_id, request_id.
  - timestamp (unique every second, so the number of series grows without bound over time).
  - IP address (~4 billion IPv4 + 2^128 IPv6).
  - Raw user_agent string.
  - Raw URL query string.
Anti-pattern:
// DO NOT do this: high-cardinality blow-up
metrics::counter!(
    "user_request_total",
    "user_id" => user.id.to_string(), // 100K users = 100K series
    "request_id" => request_id.to_string(), // unbounded
)
.increment(1);
Correct pattern: aggregate into a bucket before labelling:
// Aggregate by tier: 3 series instead of 100K series
let tier = if user.total_orders > 100 {
    "vip"
} else if user.total_orders > 10 {
    "regular"
} else {
    "new"
};
metrics::counter!(
    "user_request_total",
    "tier" => tier,
)
.increment(1);
Shop API decision: aggregate before labelling. Turn each high-cardinality identifier into a low-cardinality bucket (tier / category / region / device_class). Raw identifiers stay in log events (the B80 convention) for debugging one specific request; metric labels are only for aggregate dashboards.
Defensive check before deploying: curl -s http://localhost:3000/metrics | grep -vc '^#' counts the sample lines (one per series, with HELP/TYPE comments excluded). The Shop API target is < 10K series in total (typical for a web API). Above 100K is a red flag, and above 1M Prometheus is likely to crash.
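The same check can run inside an integration test or a CI step. The sketch below counts sample lines in a Prometheus text exposition body; count_series is a hypothetical helper name, and the budget in main is only an example value.

```rust
/// Count sample lines (one per series) in Prometheus text exposition output,
/// skipping blank lines and `# HELP` / `# TYPE` comment lines.
fn count_series(exposition: &str) -> usize {
    exposition
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .count()
}

fn main() {
    let body = "\
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method=\"GET\",path=\"/api/v1/products\",status=\"200\"} 142
http_requests_total{method=\"POST\",path=\"/api/v1/products\",status=\"422\"} 5

shop_db_pool_size 5
";
    let series = count_series(body);
    let budget = 10_000; // example budget, matching the < 10K target above
    println!("{series} series (budget {budget})");
    assert!(series < budget, "series budget exceeded");
}
```

A test like this fails the build as soon as a new label pushes the series count past the budget, before the change reaches a Prometheus server.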
Prometheus Scrape Config + Grafana Dashboard Preview
This is a preview of the stack deployed in G15. Prometheus uses a pull model: the Prometheus server scrapes the Shop API /metrics endpoint every 15 seconds. Create docker-compose.observability.yml in the project root:
# File: docker-compose.observability.yml
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: shop_prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    container_name: shop_grafana
    ports:
      - "3001:3000"
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: "Admin"
    volumes:
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus
    restart: unless-stopped
volumes:
  prometheus_data:
  grafana_data:
The Prometheus config file prometheus.yml:
# File: prometheus.yml
global:
  scrape_interval: 15s  # Scrape every 15s (Shop API convention)
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'shop_api'
    static_configs:
      # On Linux, host.docker.internal may need an extra_hosts entry
      # ("host.docker.internal:host-gateway") on the prometheus service.
      - targets: ['host.docker.internal:3000']
        labels:
          service: 'shop-api'
          env: 'local'
    metrics_path: /metrics
    scrape_timeout: 10s
Start the stack with docker compose -f docker-compose.observability.yml up -d, open the Prometheus UI at http://localhost:9090, then open Grafana at http://localhost:3001 and add a Prometheus datasource with the URL http://prometheus:9090.
4 standard PromQL queries for the Grafana dashboard:
# 1. Request rate (req/s, last 5 min)
sum(rate(http_requests_total[5m]))

# 2. Error rate (ratio 4xx+5xx / total)
sum(rate(http_errors_total[5m]))
  /
sum(rate(http_requests_total[5m]))

# 3. P95 latency (seconds)
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# 4. Top 5 slow routes by P95 latency
topk(
  5,
  histogram_quantile(
    0.95,
    sum by (path, le) (rate(http_request_duration_seconds_bucket[5m]))
  )
)
These 4 queries follow the RED method (Rate / Errors / Duration) for microservices, popularized by Tom Wilkie and derived from the "golden signals" in Google's SRE practice; it is widely adopted across the industry. Each Grafana panel is wired to 1 query, with a 15s refresh to align with scrape_interval.
Dashboard convention for the Shop API: these 4 panels are the standard for every current and future service (shop-api / shop-worker / future microservices). G18 adds alert rules: error_rate > 0.005 for 5m triggers a PagerDuty P2, and p95_latency > 1s for 5m sends a non-paging Slack warning.
Verify the pipeline end to end: in terminal 1 run cargo run -p shop-api; in terminal 2 run oha -z 30s -c 50 http://localhost:3000/api/v1/products; in terminal 3 open Grafana, where the "Request rate" panel should show a spike (around 4K req/s in the lesson's run), "P95 latency" around 25ms, and "Top 5 slow routes" listing the slowest endpoints. Also test the error path: run curl -X POST -H 'Content-Type: application/json' -d '{}' http://localhost:3000/api/v1/products 5 times to produce 422 responses, and the "Error rate" panel jumps (to roughly 5% in the lesson's run).
Summary
- 4 Prometheus metric types: Counter (monotonic), Gauge (moves freely), Histogram (bucket distribution), Summary (SKIPPED; Histogram is the better fit).
- The metrics v0.23 + metrics-exporter-prometheus v0.15 crates are pinned as workspace dependencies.
- PrometheusBuilder::install_recorder() runs in main.rs BEFORE AppState is built.
- PrometheusHandle lives in AppState and renders the Prometheus text on a /metrics request.
- The HTTP middleware emits 3 metrics: http_requests_total counter + http_request_duration_seconds histogram + http_errors_total counter.
- MatchedPath extracts the route template instead of the raw URI: MANDATORY to prevent cardinality blow-up.
- Fixed latency buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0] (12 buckets, 1ms → 10s).
- 3 standard dimensions: method + path + status (the histogram drops status to avoid a ×50 cardinality multiplier).
- Cardinality cap: max 1000 unique values per label.
- OK labels: method / status / path / payment_type / tier; NOT OK: user_id / order_id / IP / timestamp / request_id.
- Business metrics in the service layer: orders_created_total{payment_type} + orders_failed_total{reason} + payment_succeeded_total{type} + webhook_received_total{event_type} + cart_checkout_duration_seconds histogram.
- Refactored /metrics: combines the automatic middleware output (HTTP metrics) with the manual B57 pool output (gauges).
- Prometheus scrapes every 15s + 4 standard Grafana PromQL queries (RED method).
- Anti-pattern: a user_id label (100K series) → fix: aggregate by tier (3 series).
- File paths: NEW crates/shop-api/src/metrics.rs + NEW crates/shop-api/src/middleware/metrics_layer.rs; refactored crates/shop-api/src/routes/metrics.rs.
- The stack now has 9 layers (the 8 from B80 + the new metrics_layer from B81).
- Foundation for B82 (per-route timeouts), G15 OpenTelemetry distributed tracing, and the G18 production observability stack.
Review Exercises
Answer these on your own first; the answers follow below:
- The 4 Prometheus metric types: analyze the pros and cons of Counter, Gauge, Histogram, and Summary. Why is Histogram preferable to Summary in roughly 95% of real cases? Give an example scenario of cross-instance aggregation.
- MatchedPath vs raw uri.path(): analyze the cardinality pitfall. For /products/{slug} with 1M slugs, how does the number of time series differ between the raw URI and the template? How do you estimate the Prometheus memory blow-up?
- The fixed 12 buckets from 1ms → 10s: what is the rule for choosing a bucket distribution for a web API? How does PromQL histogram_quantile(0.95, ...) compute p95 from the bucket array? How do badly chosen buckets distort the result?
- The label cardinality cap of 1000: consider the anti-pattern of a user_id label with 100K users. How does aggregating by tier reduce the series count? In the real Shop API, which dimensions can be aggregated and which cannot?
- The 4 Grafana PromQL queries: rate(http_requests_total[5m]) vs irate(http_requests_total[5m]). When do you use rate, and when irate? What is the pitfall of irate with a sparse scrape interval?
Answers
- The 4 Prometheus metric types, pros and cons:
  - Counter: monotonic, only increases, resets on restart. Pros: simple (a single += operator); rate() yields requests/s and aggregates correctly across instances (apply rate() per series, then sum). Cons: it does not expose a current absolute value (you need the delta between two scrapes), and a restart drops the raw value to zero (rate() compensates for resets, but raw graphs show the drop). Use case: event totals (requests, errors, bytes sent).
  - Gauge: moves up and down freely, point-in-time. Pros: reflects the current state directly (pool size, memory, temperature) with no computation. Cons: cross-instance aggregation is more involved (sum/avg/max depending on the semantics), and events between two scrapes are missed (a gauge that spikes and drops again within a 15s scrape interval is never seen). Use case: capacity gauges (pool, memory, CPU).
  - Histogram: bucket distribution with count + sum + bucket array. Pros: aggregates across instances (sum the _bucket series, then apply histogram_quantile), needs no quantile configuration on the client, the client library only increments fixed bucket counters, and tail latency is visible in the high buckets. Cons: cardinality is multiplied by the bucket count (12 buckets × number of label combinations), and badly chosen buckets give wrong quantiles (if the true p99 is 200ms but the largest finite bucket is 100ms, histogram_quantile(0.99) returns 100ms). Use case: latency distribution, response size distribution.
  - Summary: quantiles pre-computed on the client. Pros: fast queries with no PromQL computation, and quantiles are accurate on the client. Cons: CANNOT be aggregated across instances (the mean of medians is not the overall median, which is a mathematical fact), the quantiles are fixed in client config, and every instance holds a window of data, which costs client memory. Use case: rare, only for a single instance with quantiles known in advance.
  - Why Histogram beats Summary in roughly 95% of cases: (a) with 10 to 100 microservice instances, Histogram buckets can be summed across instances and one query computes the overall quantile, while Summary cannot be aggregated and per-instance dashboards are of little use; (b) quantiles are flexible at query time: Histogram computes p50/p90/p95/p99 ad hoc from one bucket array, while Summary quantiles are fixed in configuration and changing them requires a redeploy; (c) client memory is low: Histogram keeps only one counter per bucket, while Summary keeps a window of N samples (quantile tracking algorithms cost memory). The Shop API SKIPS Summary for the whole lesson series.
  - Cross-instance aggregation scenario: 10 Shop API pods on K8s, each scraped separately by Prometheus. histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{service="shop-api"}[5m]))) computes the overall p95 latency across the 10 pods (sum the buckets of all 10 pods, then take the quantile, which is correct for the whole population). With Summary you would compute 10 separate quantiles and then avg/max them, which is mathematically wrong and does not reflect the overall distribution.
- MatchedPath vs raw URI cardinality pitfall: the raw URI request.uri().path() returns the full path with parameter values embedded (/api/v1/products/iphone-15, /api/v1/products/samsung-s24, /api/v1/orders/12345, /api/v1/orders/67890). MatchedPath returns the route template that axum matched (/api/v1/products/{slug}, /api/v1/orders/{id}); the number of templates is fixed by the route definitions (~100 for the complete Shop API). Scenario with 1M product slugs: the catalog has 1M products with distinct slugs, and users request all 1M unique slugs over one day. With the raw URI as a label, http_requests_total{path="/api/v1/products/iphone-15"}, {path="/api/v1/products/samsung-s24"}, and so on produce 1M time series for this counter alone. Adding http_request_duration_seconds_bucket gives 12 buckets × 1M paths × 10 methods = 120M time series. At ~3KB per series that is 120M × 3KB = 360GB of RAM, so Prometheus runs out of memory almost immediately. With the MatchedPath template there is one path value, /api/v1/products/{slug}, regardless of how many slugs clients send: the counter has about 50 series (10 methods × 5 common statuses) and the histogram 12 buckets × 10 methods = 120 series, which is under 200 series for the products route. Across the Shop API, ~100 routes × 200 series/route gives a worst-case ceiling of ~20K series, which a Prometheus with 8GB of RAM handles comfortably; in practice most routes accept only one or two methods, so the real number is lower. Memory estimate used in this lesson: a default chunk of ~120 samples per series plus 15 days of retention works out to ~3KB of active memory per series; 100K series = ~300MB; 1M series = ~3GB; 10M series = ~30GB, so a dual-instance HA Prometheus cluster tops out at around 10M series. Industry rule of thumb: target < 1M series across the cluster and alert at 80% of that. The Shop API target is < 10K series per service.
- The fixed 1ms → 10s buckets + histogram_quantile: the rule for web API buckets is fine granularity at the low end (a healthy web API usually has a P95 of 10 to 100ms, so dense buckets at 1ms/5ms/10ms/25ms/50ms/100ms keep the quantile accurate) and sparse buckets at the high end (tail latency > 1s is rare, so 250ms/500ms/1s/2.5s/5s/10s is enough to detect outliers). The distribution is roughly logarithmic: each bucket is ~2x to 5x its neighbor. 12 buckets balance quantile accuracy against cardinality cost. How histogram_quantile works: the input is an array of cumulative bucket counts, for example {le="0.005"} 89, {le="0.025"} 130, {le="0.1"} 142, {le="+Inf"} 142 (the abbreviated sample output, treating these buckets as adjacent). The algorithm: (a) take the total count = 142 (the +Inf bucket); (b) the target rank for quantile 0.95 is 142 × 0.95 = 134.9; (c) find the first bucket whose cumulative count is ≥ 134.9: le="0.025" has 130 (not enough), le="0.1" has 142 (enough); (d) interpolate linearly between le="0.025" (130) and le="0.1" (142): (134.9 - 130) / (142 - 130) = 0.408, so the value is 0.025 + 0.408 × (0.1 - 0.025) = 0.025 + 0.031 = 0.056s = 56ms. So p95 ≈ 56ms. How bad buckets distort the result: if the buckets are sparse at the low end (0.1, 1, 10: only 3 buckets), a true p95 of 50ms is interpolated within 0 → 0.1 under a linear assumption, so the error can be large (it may happen to return ~50ms, but it can just as well return ~95ms if the real distribution is skewed). If the largest finite bucket is below the true p99 (for example the largest bucket is 1s but the true p99 is a 2s outlier), histogram_quantile returns the upper bound of the largest finite bucket (1s), because observations above it land in the +Inf bucket, where no interpolation is possible. Shop API pattern: 12 buckets from 1ms to 10s cover the happy path (1ms to 100ms), the tail (1s to 10s), and outliers (> 10s fall into the +Inf bucket; detect them with rate(_bucket{le="+Inf"}) - rate(_bucket{le="10"})). Migrating buckets: do NOT change buckets on a live metric, since that breaks historical PromQL queries (old and new buckets cannot be summed). To change buckets, create a new _v2 metric that runs in parallel, drain the old queries over 14 days, then drop the old metric.
- Cardinality cap of 1000 + aggregation by tier: with the user_id anti-pattern and 100K users, every user hitting an endpoint creates its own time series, such as http_requests_total{method="GET",path="/api/v1/cart",user_id="42"}; 100K users × 10 cart endpoints = 1M series for this metric alone, and it grows without bound as the user base grows. Solution: aggregate before labelling, turning user_id (high cardinality, unbounded) into tier (3 fixed values). Aggregation logic: let tier = match user.total_orders { 0..=10 => "new", 11..=100 => "regular", _ => "vip" }; → 3 series instead of 100K series, a 33,333x reduction. The aggregated dimension is still useful for analysis: does the "vip" tier convert better? Is P95 latency for "new" different from "regular"? Those business questions can still be answered without the memory cost. Shop API dimensions that can be aggregated: user_tier (3 values: vip/regular/new), product_category (~20 values: electronics/clothing/books/...), order_payment_method (3 values: cod/stripe/bank_transfer), request_region (~10 values, clusters of Vietnamese provinces), device_class (3 values: mobile/tablet/desktop, parsed from the User-Agent), auth_type (3 values: anonymous/authenticated/admin). Dimensions that must NOT become labels: raw user_id (kept in B80 logs for debugging a single request), raw order_id, request_id (kept in the B80 tracing span), session_id, IP address (privacy concerns plus a cardinality of ~4 billion). Defensive enforcement: a custom clippy lint previewed for G19, shop_clippy::no_id_in_metric_label, which matches the regex metrics::(counter|gauge|histogram)!.*(user_id|order_id|request_id|session_id) and warns at compile time, plus a mandatory code review checklist item for every PR that touches metrics. One bad production deploy can mean hours of downtime, as in the reported Shopify 2019 incident.
- rate vs irate: rate(metric[5m]) computes the average per-second rate of increase over the 5-minute window using ALL samples in the window (a smooth curve with little noise, suited to long-term dashboards and sustained alerts). irate(metric[5m]) computes an instant rate from the LAST 2 samples in the window (it responds immediately and captures spikes quickly, but it is noisy). When to use rate: (a) dashboards for request rate / error rate / latency trends over the long term, where rate is smooth rather than jittery; (b) alert rules on sustained conditions, for example rate(error_total[5m]) > 0.1 for 5m fires only when the average rate stays above 0.1 for 5 minutes (filtering out a 10s false-positive spike); (c) capacity planning, for example avg_over_time(rate(http_requests_total[5m])[1d:5m]) computes the average request rate over 24h to size headroom. When to use irate: (a) debugging an incident in real time, to see exactly which second a spike happened; (b) ad-hoc exploration in the Prometheus UI; (c) volatile metrics where detail matters (CPU usage / GC pauses). Pitfall of irate with a sparse scrape interval: irate needs ≥ 2 samples in the window. With a 15s scrape interval and a 1m window there are 4 samples and irate uses the last 2, which is fine. With a 30s window there are only 2 samples: irate still uses exactly those 2, but it easily misses spikes between scrapes. With a window smaller than 2 × scrape_interval, irate returns an empty result because there are not 2 samples. Rule of thumb: the query window MUST be ≥ 4 × scrape_interval for rate to be reliable (15s × 4 = 60s minimum), and MUST be ≥ 2 × scrape_interval for irate, with 4 × recommended for stability. The Shop API scrape_interval is fixed at 15s in B81, so dashboard query windows are ≥ 1m (rate(...[5m]) as the standard, irate(...[1m]) for debugging). If the G18 deployment scales to 100+ instances, scrape_interval may be relaxed to 30s (scraping less often) to save Prometheus resources, with the minimum query window raised to 2m accordingly. The 4 dashboard queries follow the RED method (Rate / Errors / Duration): (1) Rate = total requests/s; (2) Errors = error ratio; (3) Duration = P95 latency; (4) the top 5 slowest routes, a per-route breakdown of Duration. Saturation (pool / memory / CPU utilization) belongs to the related golden signals from Google's SRE practice and is covered by the pool gauges rather than by these 4 queries. These 4 panels are the standard for every Shop API service dashboard.
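The interpolation steps in the histogram_quantile answer can be sketched in a few lines. This is a simplified std-only model under assumptions: it takes cumulative counts directly (no rate(), no label handling), and the function name and tuple layout are illustrative rather than Prometheus internals.

```rust
/// Simplified histogram_quantile. `buckets` holds (upper_bound, cumulative_count)
/// pairs sorted by bound; the last entry must be (f64::INFINITY, total_count).
fn histogram_quantile(q: f64, buckets: &[(f64, f64)]) -> Option<f64> {
    let total = buckets.last()?.1;
    if total <= 0.0 || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let rank = q * total;
    // First bucket whose cumulative count reaches the target rank.
    let idx = buckets.iter().position(|&(_, c)| c >= rank)?;
    let (upper, count) = buckets[idx];
    if upper.is_infinite() {
        // The quantile falls into +Inf: no interpolation is possible,
        // so report the highest finite upper bound.
        return if idx == 0 { None } else { Some(buckets[idx - 1].0) };
    }
    // The lowest bucket is assumed to start at 0.
    let (lower, below) = if idx == 0 { (0.0, 0.0) } else { buckets[idx - 1] };
    if count <= below {
        return Some(lower);
    }
    // Linear interpolation inside the selected bucket.
    Some(lower + (upper - lower) * (rank - below) / (count - below))
}

fn main() {
    // Cumulative counts from the abbreviated sample output.
    let buckets = [
        (0.005, 89.0),
        (0.025, 130.0),
        (0.1, 142.0),
        (f64::INFINITY, 142.0),
    ];
    if let Some(p95) = histogram_quantile(0.95, &buckets) {
        println!("p95 = {:.1} ms", p95 * 1000.0);
    }
}
```

With the sample counts this yields about 55.6 ms, matching the hand calculation (the 56ms figure comes from rounding the interpolation factor to 0.408).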
Next Lesson
Lesson 82: Per-Route Timeouts with tower-http TimeoutLayer: per-route timeout config (5s default + 30s import + 60s upload), a graceful 504 Gateway Timeout response, timeout pitfalls, and background task cleanup.
