VoIP Monitoring with Prometheus and Grafana
VoIP systems fail in ways that don't show up in generic infrastructure monitoring. CPU at 20%, memory healthy, network up — but calls are dropping because the SIP registrar is rejecting REGISTERs due to a certificate expiry, or RTP packet loss is hitting 8% on a specific carrier route, or the T.38 fax relay is silently failing. This post covers building a VoIP-specific observability stack that surfaces the metrics that matter: call quality, registration health, SIP error rates, and carrier route performance.
What to Monitor in VoIP Infrastructure
Before instrumenting anything, define the four signal types for VoIP:
| Signal | Examples | Tool |
|---|---|---|
| SIP signaling | INVITE rate, 4xx/5xx rate, REGISTER failures | kamailio_exporter, snmp_exporter |
| Media quality | Packet loss, jitter, MOS score | rtpengine metrics, Homer SIPcapture |
| Infrastructure | CPU, memory, network I/O, disk | node_exporter |
| Business | ASR (answer-seizure ratio), ACD (avg call duration), NER (network effectiveness ratio) | CDR database queries |
A complete monitoring setup scrapes all four. Infrastructure metrics alone give you uptime; the other three give you quality.
Kamailio Metrics with kamailio_exporter
Kamailio exposes internal statistics via the statistics module. The kamailio_exporter translates these to Prometheus metrics.
Install the exporter:
# Run kamailio_exporter as a sidecar. With host networking the
# exporter's port 9494 is exposed directly, so no -p mapping is needed
# (docker ignores -p when --network host is set).
docker run -d \
  --name kamailio-exporter \
  --network host \
  -e KAMAILIO_HOST=127.0.0.1 \
  -e KAMAILIO_PORT=5060 \
  hunterlong/kamailio-exporter
Or configure Kamailio to serve a stats endpoint over HTTP itself. (On Kamailio 5.3+ the xhttp_prom module does this natively in Prometheus format; the manual route below works on older versions.) Note that $stat() reads one statistic at a time, so you list the counters you want to expose:
loadmodule "xhttp.so"
loadmodule "statistics.so"

event_route[xhttp:request] {
    if ($hu =~ "^/metrics") {
        xhttp_reply("200", "OK", "text/plain",
            "kamailio_rcv_requests $stat(rcv_requests)\nkamailio_rcv_replies $stat(rcv_replies)\n");
        exit;
    }
    xhttp_reply("404", "Not Found", "text/plain", "not found\n");
}
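A third option is a small sidecar that polls `kamcmd` and rewrites its output into Prometheus text exposition format, e.g. for node_exporter's textfile collector. A minimal sketch; it assumes the `group:name = value` line format that `kamcmd stats.get_statistics all` prints, and the metric name `kamailio_stat` is illustrative:

```python
# kamailio_stats_textfile.py
# Usage: pipe the output of `kamcmd stats.get_statistics all`
# into to_prometheus() and write the result to the textfile
# collector directory.
import re

# Matches lines like "core:rcv_requests = 12345"
STAT_LINE = re.compile(r'^(?P<group>[\w-]+):(?P<name>[\w-]+)\s*=\s*(?P<value>\d+)')

def to_prometheus(raw: str) -> str:
    """Rewrite kamcmd statistics output as Prometheus gauge lines."""
    out = []
    for line in raw.splitlines():
        m = STAT_LINE.match(line.strip())
        if m:
            out.append('kamailio_stat{group="%s",name="%s"} %s'
                       % (m['group'], m['name'], m['value']))
    return '\n'.join(out) + '\n'
```

This keeps Kamailio's config untouched, at the cost of an extra polling loop.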
Key Kamailio metrics to track:
# prometheus/recording_rules.yml
groups:
  - name: kamailio_derived
    rules:
      # Status codes live on replies, not requests
      # (exact label names depend on your exporter)
      - record: kamailio:sip_4xx_rate
        expr: rate(kamailio_core_rcv_replies_total{method="INVITE",status=~"4.."}[5m])
      # 401 challenges are part of normal digest auth (every REGISTER
      # is challenged once), so count 403s as failures, not 401s
      - record: kamailio:register_failure_rate
        expr: |
          sum by (instance) (rate(kamailio_core_rcv_replies_total{method="REGISTER",status="403"}[5m]))
            / sum by (instance) (rate(kamailio_core_rcv_requests_total{method="REGISTER"}[5m]))
      - record: kamailio:active_dialogs
        expr: kamailio_dialog_active
      - record: kamailio:invite_per_second
        expr: rate(kamailio_core_rcv_requests_total{method="INVITE"}[1m])
Asterisk Metrics
Asterisk does not natively expose Prometheus metrics. Use one of two approaches:
Option 1: asterisk_exporter (AMI-based)
# /etc/asterisk_exporter/config.yml
ami:
  host: 127.0.0.1
  port: 5038
  username: prometheus
  password: secret
metrics:
  - active_channels
  - active_calls
  - active_agents
  - queue_waiting
  - queue_completed
# /etc/asterisk/manager.conf (requires enabled=yes in [general])
[prometheus]
secret=secret
deny=0.0.0.0/0.0.0.0
permit=127.0.0.1/255.255.255.255
read=system,call,agent,user,config,dtmf,reporting,cdr,dialplan
write=
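Under the hood, any AMI-based exporter just speaks the line-oriented AMI protocol over TCP: `Key: Value` header lines terminated by a blank line. A minimal sketch of that framing; the host, port, credentials, and the `CoreShowChannels` poll mirror the config above, but this is an illustration, not the exporter's actual code:

```python
# ami_probe.py - minimal AMI client framing
import socket

def build_action(action: str, **headers: str) -> bytes:
    """Serialize an AMI action: Key: Value lines ending in a blank line."""
    lines = ['Action: %s' % action]
    lines += ['%s: %s' % (k, v) for k, v in headers.items()]
    return ('\r\n'.join(lines) + '\r\n\r\n').encode()

def probe(host: str = '127.0.0.1', port: int = 5038) -> str:
    """Log in and ask Asterisk for its active channels."""
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall(build_action('Login', Username='prometheus', Secret='secret'))
        s.sendall(build_action('CoreShowChannels'))
        return s.recv(65536).decode(errors='replace')
```

An exporter loops this on a timer and parses the event responses into gauges.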
Option 2: CDR-based business metrics via a custom exporter
Write CDR/CEL events to a PostgreSQL table and expose them via a custom exporter. This approach gives you business metrics (ASR, ACD, call volumes by trunk) that the AMI exporter cannot provide:
# asterisk_business_exporter.py
from prometheus_client import Gauge, start_http_server
import psycopg2
import time

asr_gauge = Gauge('asterisk_asr_ratio', 'Answer-Seizure Ratio', ['trunk'])
acd_gauge = Gauge('asterisk_acd_seconds', 'Average Call Duration', ['trunk'])
# CDRs only cover completed calls; populate this one from AMI or a
# live-channels query if you need it
calls_gauge = Gauge('asterisk_active_calls', 'Active calls', ['direction'])

def collect_metrics():
    conn = psycopg2.connect("host=localhost dbname=asterisk_cdr user=monitor")
    try:
        with conn.cursor() as cur:
            # ASR and ACD per trunk (last 5 minutes)
            cur.execute("""
                SELECT
                    accountcode AS trunk,
                    ROUND(AVG(CASE WHEN disposition='ANSWERED' THEN 1.0 ELSE 0.0 END), 3) AS asr,
                    AVG(CASE WHEN disposition='ANSWERED' THEN billsec ELSE NULL END) AS acd
                FROM cdr
                WHERE calldate > NOW() - INTERVAL '5 minutes'
                  AND accountcode IS NOT NULL
                GROUP BY accountcode
            """)
            for trunk, asr, acd in cur.fetchall():
                asr_gauge.labels(trunk=trunk).set(asr or 0)
                acd_gauge.labels(trunk=trunk).set(acd or 0)
    finally:
        conn.close()

if __name__ == '__main__':
    start_http_server(9200)
    while True:
        collect_metrics()
        time.sleep(30)
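NER (from the signal table earlier) can be derived from the same CDR data. Unlike ASR, NER excludes user behavior from the failure count: busy and no-answer outcomes still mean the network delivered the call, so only network and switch failures count against it. A sketch using the same disposition values as the exporter's schema:

```python
# NER: calls that reached the destination (answered, busy, or rang
# out) divided by all attempts. Only network failures lower it.
NETWORK_OK = {'ANSWERED', 'BUSY', 'NO ANSWER'}

def ner(dispositions):
    """Compute NER from a list of CDR disposition strings."""
    if not dispositions:
        return 0.0
    ok = sum(1 for d in dispositions if d in NETWORK_OK)
    return round(ok / len(dispositions), 3)
```

A trunk with low ASR but high NER is getting calls through to subscribers who simply aren't answering; low NER points at the carrier.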
rtpengine Metrics
rtpengine exposes Prometheus metrics natively (mr9.0 and later) via its HTTP interface: enable the HTTP listener and scrape the /metrics path:
# /etc/rtpengine/rtpengine.conf
[rtpengine]
listen-http = 127.0.0.1:9900
Key media quality metrics from rtpengine:
| Metric | Alert threshold | Description |
|---|---|---|
| rtpengine_packet_loss_ratio | > 0.03 | Packet loss > 3% |
| rtpengine_jitter_ms | > 50 | Jitter > 50ms |
| rtpengine_mos_score | < 3.5 | MOS below acceptable |
| rtpengine_active_sessions | > 80% capacity | Approaching session limit |
| rtpengine_transcoded_sessions | Rate spike | Unexpected transcoding |
MOS (Mean Opinion Score) ranges from 1 (unusable) to 5 (excellent). A score above 4.0 is toll-quality; 3.5–4.0 is acceptable; below 3.5 users notice degradation. Set your alert at 3.5.
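If your media layer reports loss and latency but not MOS directly, you can approximate MOS with a simplified ITU-T G.107 E-model: compute an R-factor from the impairments, then map R to MOS. The R-to-MOS mapping below is the standard formula; the impairment terms are a rough sketch tuned for G.711, not a calibrated implementation:

```python
# Simplified E-model: R-factor from one-way latency and packet loss,
# then the standard R -> MOS mapping (ITU-T G.107).
def r_factor(latency_ms, loss_pct):
    r = 93.2                    # default R for G.711, no impairments
    d = latency_ms + 10         # rough one-way delay incl. jitter buffer
    r -= 0.024 * d              # delay impairment (linear region)
    if d > 160:
        r -= 0.11 * (d - 160)   # extra penalty past 160 ms one-way
    r -= 2.5 * loss_pct         # crude loss impairment for G.711
    return max(0.0, r)

def mos(r):
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)
```

With these coefficients, a clean 20 ms path scores in the high 4.3s, while the same path at 10% loss drops under the 3.5 alert threshold, which matches the thresholds in the table above.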
Prometheus Alerting Rules
# prometheus/alerts/voip.yml
groups:
  - name: voip_sip
    rules:
      - alert: HighSIP5xxRate
        # sum by (instance) so the per-status series on each side
        # actually match for the division
        expr: |
          sum by (instance) (rate(kamailio_core_rcv_replies_total{status=~"5.."}[5m]))
            / sum by (instance) (rate(kamailio_core_rcv_replies_total[5m])) > 0.05
        for: 3m
        labels:
          severity: critical
          team: voip
        annotations:
          summary: "SIP 5xx rate {{ $value | humanizePercentage }} on {{ $labels.instance }}"
          runbook: "https://wiki.example.com/runbooks/sip-5xx"
      - alert: KamailioDialogsHigh
        expr: kamailio:active_dialogs > 8000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Active dialogs approaching capacity: {{ $value }}"
      - alert: RegistrationFailureSpike
        expr: kamailio:register_failure_rate > 0.2
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "20%+ of SIP registrations failing — possible auth issue or attack"
  - name: voip_media
    rules:
      - alert: MediaQualityDegraded
        expr: rtpengine_mos_score < 3.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MOS score {{ $value }} below 3.5 on {{ $labels.instance }}"
      - alert: MediaPacketLossHigh
        expr: rtpengine_packet_loss_ratio > 0.03
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "RTP packet loss {{ $value | humanizePercentage }} — calls impacted"
      - alert: rtpengineCapacityHigh
        expr: rtpengine_active_sessions / rtpengine_max_sessions > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "rtpengine at {{ $value | humanizePercentage }} capacity"
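These rules can be exercised offline with `promtool test rules` before they ever reach production. A sketch of a unit test for the MOS alert; the file names and the synthetic series are illustrative:

```yaml
# prometheus/alerts/voip_test.yml
# Run with: promtool test rules voip_test.yml
rule_files:
  - voip.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # flat MOS of 3.2 for 11 minutes
      - series: 'rtpengine_mos_score{instance="rtp-1"}'
        values: '3.2+0x10'
    alert_rule_test:
      # alert has for: 5m, so it should be firing by minute 7
      - eval_time: 7m
        alertname: MediaQualityDegraded
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: rtp-1
            exp_annotations:
              summary: "MOS score 3.2 below 3.5 on rtp-1"
```

Wiring this into CI catches broken PromQL and label mismatches at review time instead of during an outage.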
Grafana Dashboard Layout
Structure your Grafana dashboard in four rows:
Row 1: SIP Signaling Health
- INVITE rate (calls/sec) — line graph, 1h window
- SIP 4xx/5xx rate — stat panel with threshold coloring
- Active dialogs — gauge panel
- Registration success rate — stat panel
Row 2: Media Quality
- MOS score distribution by trunk — heatmap
- Packet loss % by carrier — time series
- Jitter ms — time series with threshold line at 50ms
- Active RTP sessions — gauge
Row 3: Infrastructure
- CPU per VoIP node — multi-series line
- Network I/O (bytes/sec) — time series
- Memory usage — time series
Row 4: Business Metrics
- ASR by trunk — bar gauge
- ACD (average call duration) — stat panel
- Total calls in last 24h — stat panel
- Calls by outcome (Answered/No Answer/Busy) — pie chart
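To keep the dashboard reproducible, provision it from disk instead of building it by hand in the UI. A sketch of Grafana's file-based dashboard provisioning; the paths and provider name are illustrative:

```yaml
# /etc/grafana/provisioning/dashboards/voip.yml
apiVersion: 1
providers:
  - name: 'voip'
    folder: 'VoIP'
    type: file
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards/voip
```

Grafana loads any dashboard JSON dropped into that path, so the dashboard can live in version control next to the alert rules.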
Prometheus Scrape Configuration
# prometheus.yml
scrape_configs:
  - job_name: 'kamailio'
    static_configs:
      - targets: ['kamailio-1:9494', 'kamailio-2:9494']
    scrape_interval: 10s
  - job_name: 'asterisk'
    static_configs:
      - targets: ['asterisk-1:9200', 'asterisk-2:9200']
    scrape_interval: 30s
  - job_name: 'rtpengine'
    static_configs:
      - targets: ['rtpengine-1:9900', 'rtpengine-2:9900']
    scrape_interval: 10s
  - job_name: 'coturn'
    static_configs:
      - targets: ['turn-1:9641']
    scrape_interval: 30s
  - job_name: 'node'
    static_configs:
      - targets: ['kamailio-1:9100', 'asterisk-1:9100', 'rtpengine-1:9100']
    scrape_interval: 15s
Storage Sizing for VoIP Metrics
VoIP monitoring generates high-cardinality metrics — per-call, per-trunk, per-carrier labels multiply metric series. Calculate your Prometheus storage requirements:
- Samples per scrape: ~500 (typical VoIP stack)
- Scrape interval: 10s → 6 scrapes/minute
- Samples/minute: 3,000
- Samples/day: 4,320,000
- Prometheus bytes per sample: ~1.5 bytes (compressed)
- Storage/day: ~6 MB
- 90-day retention: ~540 MB
This fits comfortably on any VPS. For longer retention or higher cardinality (1,000+ trunks), use Thanos or Mimir to offload to object storage and query across retention windows.
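The arithmetic above generalizes; a small helper makes it easy to re-run the estimate as trunk counts grow (1.5 bytes/sample is the usual rule of thumb for Prometheus TSDB compression, not a guarantee):

```python
# Estimate Prometheus TSDB disk usage for a given series count,
# scrape interval, and retention window.
def storage_bytes(series, scrape_interval_s, retention_days,
                  bytes_per_sample=1.5):
    samples_per_day = series * (86400 / scrape_interval_s)
    return samples_per_day * retention_days * bytes_per_sample

# The numbers from the text: 500 series at 10s scrapes is about
# 6.5 MB/day, so 90 days lands around 0.58 GB.
per_day = storage_bytes(500, 10, 1)
ninety_days = storage_bytes(500, 10, 90)
```

High-cardinality labels dominate this quickly: per-trunk labels on five metrics across 1,000 trunks adds 5,000 series on their own, so re-estimate whenever you add a labeled dimension.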




