-
Notifications
You must be signed in to change notification settings - Fork 1
Open
Labels
Description
The current metrics implementation provides basic observability but lacks the granularity needed for performance optimization and bottleneck analysis across different loaders and workloads.
Current State
We have these metrics instrumented in src/amp/metrics.py:
records_processed(Counter) - loader, table, connectionbytes_processed(Counter) - loader, tableprocessing_latency(Histogram) - loader, operationbatch_sizes(Histogram) - loader, tableerrors(Counter) - loader, error_type, tableretry_attempts(Counter) - loader, operation, reasonactive_connections(Gauge) - loader, targetreorg_events(Counter) - loader, network, table
Gaps
1. Flight SQL Client Metrics
Current: Zero instrumentation on the data fetching side
Proposed:
flight_fetch_latency_seconds(Histogram) - query latency from Flight SQL serverflight_bytes_received_total(Counter) - bytes received from serverflight_batches_received_total(Counter) - Arrow batches received
2. Parallel Streaming Metrics
Current: Zero instrumentation in src/amp/streaming/
Proposed:
streaming_worker_utilization(Gauge) - active workers / total workersstreaming_queue_depth(Gauge) - batches waiting per workerstreaming_batch_wait_seconds(Histogram) - time batches spend in queuestreaming_worker_processing_seconds(Histogram) - per-worker processing time
3. Phase-Level Latency Breakdown
Current: Only load_batch operation tracked
Proposed: Separate histograms or labels for:
fetch- time to receive data from Flight SQLtransform- time for any data transformationwrite- time to write to target system
Use Cases This Enables
# Compare loader throughput
rate(amp_records_processed_total[5m]) by (loader)
# Identify bottleneck: fetch vs load
histogram_quantile(0.95, rate(amp_phase_latency_seconds_bucket{phase="fetch"}[5m]))
histogram_quantile(0.95, rate(amp_phase_latency_seconds_bucket{phase="write"}[5m]))
# Parallel efficiency
amp_streaming_worker_utilization
# Queue pressure
rate(amp_streaming_batch_wait_seconds_sum[5m]) / rate(amp_streaming_batch_wait_seconds_count[5m])
Implementation Notes
- All new metrics should use the existing
_get_or_create_metric()helper for test isolation - Consider adding a
workloadordatasetlabel for A/B comparisons - Memory metrics could be added via
psutilif needed for correlation analysis