VoIP Monitoring with Prometheus and Grafana
VoIP systems fail in ways that don't show up in generic infrastructure monitoring. CPU at 20%, memory healthy, network up — but calls are dropping because the SIP registrar is rejecting REGISTERs due to a certificate expiry, or RTP packet loss is hitting 8% on a specific carrier route, or the T.38 fax relay is silently failing. This post covers building a VoIP-specific observability stack that surfaces the metrics that matter: call quality, registration health, SIP error rates, and carrier route performance.
What to Monitor in VoIP Infrastructure
Before instrumenting anything, define the four signal types for VoIP:
| Signal | Examples | Tool |
|---|---|---|
| SIP signaling | INVITE rate, 4xx/5xx rate, REGISTER failures | kamailio_exporter, snmp_exporter |
| Media quality | Packet loss, jitter, MOS score | rtpengine metrics, Homer SIPcapture |
| Infrastructure | CPU, memory, network I/O, disk | node_exporter |
| Business | ASR (answer-seizure ratio), ACD (avg call duration), NER (network effectiveness ratio) | CDR database queries |
A complete monitoring setup scrapes all four. Infrastructure metrics alone give you uptime; the other three give you quality.
Kamailio Metrics with kamailio_exporter
Kamailio exposes internal statistics via the statistics module. The kamailio_exporter translates these to Prometheus metrics.
Install the exporter:
# Run kamailio_exporter as a sidecar. With host networking the
# exporter's port 9494 is exposed directly, so no -p mapping is needed
# (docker ignores -p when --network host is set).
docker run -d \
  --name kamailio-exporter \
  --network host \
  -e KAMAILIO_HOST=127.0.0.1 \
  -e KAMAILIO_PORT=5060 \
  hunterlong/kamailio-exporter
Or configure Kamailio to serve a stats endpoint over HTTP itself. (On Kamailio 5.3+ the xhttp_prom module does this natively in Prometheus format; the manual route below works on older versions.) Note that $stat() reads one statistic at a time, so you list the counters you want to expose:
loadmodule "xhttp.so"
loadmodule "statistics.so"

event_route[xhttp:request] {
    if ($hu =~ "^/metrics") {
        xhttp_reply("200", "OK", "text/plain",
            "kamailio_rcv_requests $stat(rcv_requests)\nkamailio_rcv_replies $stat(rcv_replies)\n");
        exit;
    }
    xhttp_reply("404", "Not Found", "text/plain", "not found\n");
}
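A third option is a small sidecar that polls `kamcmd` and rewrites its output into Prometheus text exposition format, e.g. for node_exporter's textfile collector. A minimal sketch; it assumes the `group:name = value` line format that `kamcmd stats.get_statistics all` prints, and the metric name `kamailio_stat` is illustrative:

```python
# kamailio_stats_textfile.py
# Usage: pipe the output of `kamcmd stats.get_statistics all`
# into to_prometheus() and write the result to the textfile
# collector directory.
import re

# Matches lines like "core:rcv_requests = 12345"
STAT_LINE = re.compile(r'^(?P<group>[\w-]+):(?P<name>[\w-]+)\s*=\s*(?P<value>\d+)')

def to_prometheus(raw: str) -> str:
    """Rewrite kamcmd statistics output as Prometheus gauge lines."""
    out = []
    for line in raw.splitlines():
        m = STAT_LINE.match(line.strip())
        if m:
            out.append('kamailio_stat{group="%s",name="%s"} %s'
                       % (m['group'], m['name'], m['value']))
    return '\n'.join(out) + '\n'
```

This keeps Kamailio's config untouched, at the cost of an extra polling loop.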
Key Kamailio metrics to track:
# prometheus/recording_rules.yml
groups:
  - name: kamailio_derived
    rules:
      # Status codes live on replies, not requests
      # (exact label names depend on your exporter)
      - record: kamailio:sip_4xx_rate
        expr: rate(kamailio_core_rcv_replies_total{method="INVITE",status=~"4.."}[5m])
      # 401 challenges are part of normal digest auth (every REGISTER
      # is challenged once), so count 403s as failures, not 401s
      - record: kamailio:register_failure_rate
        expr: |
          sum by (instance) (rate(kamailio_core_rcv_replies_total{method="REGISTER",status="403"}[5m]))
            / sum by (instance) (rate(kamailio_core_rcv_requests_total{method="REGISTER"}[5m]))
      - record: kamailio:active_dialogs
        expr: kamailio_dialog_active
      - record: kamailio:invite_per_second
        expr: rate(kamailio_core_rcv_requests_total{method="INVITE"}[1m])
Asterisk Metrics
Asterisk does not natively expose Prometheus metrics. Use one of two approaches:
Option 1: asterisk_exporter (AMI-based)
# /etc/asterisk_exporter/config.yml
ami:
  host: 127.0.0.1
  port: 5038
  username: prometheus
  password: secret
metrics:
  - active_channels
  - active_calls
  - active_agents
  - queue_waiting
  - queue_completed
# /etc/asterisk/manager.conf (requires enabled=yes in [general])
[prometheus]
secret=secret
deny=0.0.0.0/0.0.0.0
permit=127.0.0.1/255.255.255.255
read=system,call,agent,user,config,dtmf,reporting,cdr,dialplan
write=
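Under the hood, any AMI-based exporter just speaks the line-oriented AMI protocol over TCP: `Key: Value` header lines terminated by a blank line. A minimal sketch of that framing; the host, port, credentials, and the `CoreShowChannels` poll mirror the config above, but this is an illustration, not the exporter's actual code:

```python
# ami_probe.py - minimal AMI client framing
import socket

def build_action(action: str, **headers: str) -> bytes:
    """Serialize an AMI action: Key: Value lines ending in a blank line."""
    lines = ['Action: %s' % action]
    lines += ['%s: %s' % (k, v) for k, v in headers.items()]
    return ('\r\n'.join(lines) + '\r\n\r\n').encode()

def probe(host: str = '127.0.0.1', port: int = 5038) -> str:
    """Log in and ask Asterisk for its active channels."""
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall(build_action('Login', Username='prometheus', Secret='secret'))
        s.sendall(build_action('CoreShowChannels'))
        return s.recv(65536).decode(errors='replace')
```

An exporter loops this on a timer and parses the event responses into gauges.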
Option 2: CDR-based business metrics via a custom exporter
Write CDR/CEL events to a PostgreSQL table and expose them via a custom exporter. This approach gives you business metrics (ASR, ACD, call volumes by trunk) that the AMI exporter cannot provide:
# asterisk_business_exporter.py
from prometheus_client import Gauge, start_http_server
import psycopg2
import time

asr_gauge = Gauge('asterisk_asr_ratio', 'Answer-Seizure Ratio', ['trunk'])
acd_gauge = Gauge('asterisk_acd_seconds', 'Average Call Duration', ['trunk'])
# CDRs only cover completed calls; populate this one from AMI or a
# live-channels query if you need it
calls_gauge = Gauge('asterisk_active_calls', 'Active calls', ['direction'])

def collect_metrics():
    conn = psycopg2.connect("host=localhost dbname=asterisk_cdr user=monitor")
    try:
        with conn.cursor() as cur:
            # ASR and ACD per trunk (last 5 minutes)
            cur.execute("""
                SELECT
                    accountcode AS trunk,
                    ROUND(AVG(CASE WHEN disposition='ANSWERED' THEN 1.0 ELSE 0.0 END), 3) AS asr,
                    AVG(CASE WHEN disposition='ANSWERED' THEN billsec ELSE NULL END) AS acd
                FROM cdr
                WHERE calldate > NOW() - INTERVAL '5 minutes'
                  AND accountcode IS NOT NULL
                GROUP BY accountcode
            """)
            for trunk, asr, acd in cur.fetchall():
                asr_gauge.labels(trunk=trunk).set(asr or 0)
                acd_gauge.labels(trunk=trunk).set(acd or 0)
    finally:
        conn.close()

if __name__ == '__main__':
    start_http_server(9200)
    while True:
        collect_metrics()
        time.sleep(30)
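NER (from the signal table earlier) can be derived from the same CDR data. Unlike ASR, NER excludes user behavior from the failure count: busy and no-answer outcomes still mean the network delivered the call, so only network and switch failures count against it. A sketch using the same disposition values as the exporter's schema:

```python
# NER: calls that reached the destination (answered, busy, or rang
# out) divided by all attempts. Only network failures lower it.
NETWORK_OK = {'ANSWERED', 'BUSY', 'NO ANSWER'}

def ner(dispositions):
    """Compute NER from a list of CDR disposition strings."""
    if not dispositions:
        return 0.0
    ok = sum(1 for d in dispositions if d in NETWORK_OK)
    return round(ok / len(dispositions), 3)
```

A trunk with low ASR but high NER is getting calls through to subscribers who simply aren't answering; low NER points at the carrier.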
rtpengine Metrics
rtpengine exposes Prometheus metrics natively (mr9.0 and later) via its HTTP interface: enable the HTTP listener and scrape the /metrics path:
# /etc/rtpengine/rtpengine.conf
[rtpengine]
listen-http = 127.0.0.1:9900
Key media quality metrics from rtpengine:
| Metric | Alert threshold | Description |
|---|---|---|
| rtpengine_packet_loss_ratio | > 0.03 | Packet loss > 3% |
| rtpengine_jitter_ms | > 50 | Jitter > 50ms |
| rtpengine_mos_score | < 3.5 | MOS below acceptable |
| rtpengine_active_sessions | > 80% capacity | Approaching session limit |
| rtpengine_transcoded_sessions | Rate spike | Unexpected transcoding |
MOS (Mean Opinion Score) ranges from 1 (unusable) to 5 (excellent). A score above 4.0 is toll-quality; 3.5–4.0 is acceptable; below 3.5 users notice degradation. Set your alert at 3.5.
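If your media layer reports loss and latency but not MOS directly, you can approximate MOS with a simplified ITU-T G.107 E-model: compute an R-factor from the impairments, then map R to MOS. The R-to-MOS mapping below is the standard formula; the impairment terms are a rough sketch tuned for G.711, not a calibrated implementation:

```python
# Simplified E-model: R-factor from one-way latency and packet loss,
# then the standard R -> MOS mapping (ITU-T G.107).
def r_factor(latency_ms, loss_pct):
    r = 93.2                    # default R for G.711, no impairments
    d = latency_ms + 10         # rough one-way delay incl. jitter buffer
    r -= 0.024 * d              # delay impairment (linear region)
    if d > 160:
        r -= 0.11 * (d - 160)   # extra penalty past 160 ms one-way
    r -= 2.5 * loss_pct         # crude loss impairment for G.711
    return max(0.0, r)

def mos(r):
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)
```

With these coefficients, a clean 20 ms path scores in the high 4.3s, while the same path at 10% loss drops under the 3.5 alert threshold, which matches the thresholds in the table above.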
Prometheus Alerting Rules
# prometheus/alerts/voip.yml
groups:
  - name: voip_sip
    rules:
      - alert: HighSIP5xxRate
        # sum by (instance) so the per-status series on each side
        # actually match for the division
        expr: |
          sum by (instance) (rate(kamailio_core_rcv_replies_total{status=~"5.."}[5m]))
            / sum by (instance) (rate(kamailio_core_rcv_replies_total[5m])) > 0.05
        for: 3m
        labels:
          severity: critical
          team: voip
        annotations:
          summary: "SIP 5xx rate {{ $value | humanizePercentage }} on {{ $labels.instance }}"
          runbook: "https://wiki.example.com/runbooks/sip-5xx"
      - alert: KamailioDialogsHigh
        expr: kamailio:active_dialogs > 8000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Active dialogs approaching capacity: {{ $value }}"
      - alert: RegistrationFailureSpike
        expr: kamailio:register_failure_rate > 0.2
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "20%+ of SIP registrations failing — possible auth issue or attack"
  - name: voip_media
    rules:
      - alert: MediaQualityDegraded
        expr: rtpengine_mos_score < 3.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MOS score {{ $value }} below 3.5 on {{ $labels.instance }}"
      - alert: MediaPacketLossHigh
        expr: rtpengine_packet_loss_ratio > 0.03
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "RTP packet loss {{ $value | humanizePercentage }} — calls impacted"
      - alert: rtpengineCapacityHigh
        expr: rtpengine_active_sessions / rtpengine_max_sessions > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "rtpengine at {{ $value | humanizePercentage }} capacity"
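These rules can be exercised offline with `promtool test rules` before they ever reach production. A sketch of a unit test for the MOS alert; the file names and the synthetic series are illustrative:

```yaml
# prometheus/alerts/voip_test.yml
# Run with: promtool test rules voip_test.yml
rule_files:
  - voip.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # flat MOS of 3.2 for 11 minutes
      - series: 'rtpengine_mos_score{instance="rtp-1"}'
        values: '3.2+0x10'
    alert_rule_test:
      # alert has for: 5m, so it should be firing by minute 7
      - eval_time: 7m
        alertname: MediaQualityDegraded
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: rtp-1
            exp_annotations:
              summary: "MOS score 3.2 below 3.5 on rtp-1"
```

Wiring this into CI catches broken PromQL and label mismatches at review time instead of during an outage.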
Grafana Dashboard Layout
Structure your Grafana dashboard in four rows:
Row 1: SIP Signaling Health
- INVITE rate (calls/sec) — line graph, 1h window
- SIP 4xx/5xx rate — stat panel with threshold coloring
- Active dialogs — gauge panel
- Registration success rate — stat panel
Row 2: Media Quality
- MOS score distribution by trunk — heatmap
- Packet loss % by carrier — time series
- Jitter ms — time series with threshold line at 50ms
- Active RTP sessions — gauge
Row 3: Infrastructure
- CPU per VoIP node — multi-series line
- Network I/O (bytes/sec) — time series
- Memory usage — time series
Row 4: Business Metrics
- ASR by trunk — bar gauge
- ACD (average call duration) — stat panel
- Total calls in last 24h — stat panel
- Calls by outcome (Answered/No Answer/Busy) — pie chart
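To keep the dashboard reproducible, provision it from disk instead of building it by hand in the UI. A sketch of Grafana's file-based dashboard provisioning; the paths and provider name are illustrative:

```yaml
# /etc/grafana/provisioning/dashboards/voip.yml
apiVersion: 1
providers:
  - name: 'voip'
    folder: 'VoIP'
    type: file
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards/voip
```

Grafana loads any dashboard JSON dropped into that path, so the dashboard can live in version control next to the alert rules.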
Prometheus Scrape Configuration
# prometheus.yml
scrape_configs:
  - job_name: 'kamailio'
    static_configs:
      - targets: ['kamailio-1:9494', 'kamailio-2:9494']
    scrape_interval: 10s
  - job_name: 'asterisk'
    static_configs:
      - targets: ['asterisk-1:9200', 'asterisk-2:9200']
    scrape_interval: 30s
  - job_name: 'rtpengine'
    static_configs:
      - targets: ['rtpengine-1:9900', 'rtpengine-2:9900']
    scrape_interval: 10s
  - job_name: 'coturn'
    static_configs:
      - targets: ['turn-1:9641']
    scrape_interval: 30s
  - job_name: 'node'
    static_configs:
      - targets: ['kamailio-1:9100', 'asterisk-1:9100', 'rtpengine-1:9100']
    scrape_interval: 15s
Storage Sizing for VoIP Metrics
VoIP monitoring generates high-cardinality metrics — per-call, per-trunk, per-carrier labels multiply metric series. Calculate your Prometheus storage requirements:
- Samples per scrape: ~500 (typical VoIP stack)
- Scrape interval: 10s → 6 scrapes/minute
- Samples/minute: 3,000
- Samples/day: 4,320,000
- Prometheus bytes per sample: ~1.5 bytes (compressed)
- Storage/day: ~6 MB
- 90-day retention: ~540 MB
This fits comfortably on any VPS. For longer retention or higher cardinality (1,000+ trunks), use Thanos or Mimir to offload to object storage and query across retention windows.
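The arithmetic above generalizes; a small helper makes it easy to re-run the estimate as trunk counts grow (1.5 bytes/sample is the usual rule of thumb for Prometheus TSDB compression, not a guarantee):

```python
# Estimate Prometheus TSDB disk usage for a given series count,
# scrape interval, and retention window.
def storage_bytes(series, scrape_interval_s, retention_days,
                  bytes_per_sample=1.5):
    samples_per_day = series * (86400 / scrape_interval_s)
    return samples_per_day * retention_days * bytes_per_sample

# The numbers from the text: 500 series at 10s scrapes is about
# 6.5 MB/day, so 90 days lands around 0.58 GB.
per_day = storage_bytes(500, 10, 1)
ninety_days = storage_bytes(500, 10, 90)
```

High-cardinality labels dominate this quickly: per-trunk labels on five metrics across 1,000 trunks adds 5,000 series on their own, so re-estimate whenever you add a labeled dimension.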




