Monitor VICIdial Server Health and Capacity

A practical, top-to-bottom guide to watching your VICIdial server's load, channels, disk, latency, and health so problems surface before they hurt your floor.

A VICIdial server is a busy machine. It places and answers calls, mixes audio, writes recordings to disk, runs a database, and serves the agent screen to everyone logged in. When all of that fits comfortably on the box, your floor hums. When it doesn't, calls drop, agents see their screens freeze, and you find out from angry agents instead of from a graph. This guide is the map: the handful of numbers worth watching, what each one is really telling you, and the report on your dialer that shows it.

If you are new to this, don't worry about memorizing everything at once. Health monitoring is just a small set of habits: glance at a few graphs once a day, know the two or three numbers that mean trouble, and act before a yellow turns red. The rest of this article walks through each signal in order and links to a focused deep dive on each one.

The big picture: metric to meaning to action

Every health signal follows the same shape. You read a number, you translate it into a plain-English statement about the server, and you take one of three actions: do nothing, watch it more closely, or add capacity. Holding that loop in your head keeps monitoring from feeling like staring at a wall of numbers.

flowchart TD
  A[Read a metric] --> B{Inside normal range}
  B -- Yes --> C[No action needed]
  B -- No --> D{Trending up over days}
  D -- No --> E[Watch more closely]
  D -- Yes --> F[Plan more capacity]
  E --> A
  F --> G[Resize or split load]
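
The loop above can be sketched in a few lines of Python. This is only an illustration of the decision flow, not anything VICIdial ships: the function name, the `normal_max` ceiling, and the "every day higher than the last" definition of a trend are all assumptions made for the example.

```python
def next_action(value, normal_max, recent_daily_values):
    """Return the flowchart action for one metric reading.

    value: the reading you just took.
    normal_max: the top of what you consider normal (an assumption you choose).
    recent_daily_values: one reading per day, oldest first.
    """
    if value <= normal_max:
        return "no action"

    # Out of range: decide whether this is a blip or a trend.
    # Simplistic trend test (assumption): at least three days, each higher
    # than the day before.
    pairs = zip(recent_daily_values, recent_daily_values[1:])
    rising = len(recent_daily_values) >= 3 and all(b > a for a, b in pairs)

    return "plan more capacity" if rising else "watch more closely"


print(next_action(5.0, 4.0, [3.0, 4.0, 5.0]))
```

A real setup would use a gentler trend test than strict day-over-day increases, but the three outcomes stay the same.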

The Server Performance Report: your single most useful screen

The Server Performance Report samples the box every five seconds and rolls it up for any window you pick. It deliberately ignores campaigns and agents and looks only at the physical server. The summary at the top hands you the most important numbers in one row: how many calls the server handled, the total off-hook minutes, the average number of Asterisk channels in use, the average system load, the peak system load, and the USER, SYSTEM, and IDLE processor split. Below that, graphs draw the same signals over time so you can see the shape of your day. Learning to read this one screen is the foundation, and we break it down field by field in how to read the server performance report.

This report is not on by default. An administrator has to flip the System Performance setting to Y on the server record before any of it gets collected, which is also covered in how to enable system performance logging. Note: this is one of the few VICIdial reports you cannot download, so screenshot the graphs if you need to keep a record.

Reading the graphs

The graph portion plots number of processes, system load, channel count, and the USER versus SYSTEM processor percentages across your chosen window. The value here is shape, not single points. A load line that climbs steadily through the shift and never recovers tells a different story than one that spikes at the top of the hour and settles. If graphs are new to you, start with reading VICIdial server graphs for beginners, then go deeper with how to read the performance graphs.

System load: the one number to learn first

System load is a rough measure of how many tasks are running on or waiting for the processor. On Linux the figure also counts tasks blocked on disk I/O, so a storage stall can raise load even when the CPU itself has room. As a starting rule, compare load against the number of CPU cores: a load equal to your core count means the box is fully busy, and a load well above it means work is piling up. That is why a load of 4 is fine on an eight-core machine and alarming on a two-core one. The Server Performance Report gives you both the average and the peak, and the gap between them matters as much as either number alone. We unpack what the figure means in VICIdial system load explained, and what a high reading really signals in what high system load means.

Peak load deserves its own attention. A healthy average with a brutal peak means something briefly overwhelms the box, often at list reset or when a dial burst lands. If your peak is regularly double your average, that is your warning sign long before the average looks bad. There is a guide to interpreting that spike in what peak system load tells you, and a discussion of where to draw the line in safe CPU load thresholds. Tracking load over time is one of the clearest KPI signals you have for a dialer.
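
To make the two rules of thumb concrete, here is a small Python sketch that compares load to core count and checks the peak-to-average gap. The function names are made up for this example, and the numbers you feed in would come from the Server Performance Report summary row. `os.getloadavg()` and `os.cpu_count()` are standard-library calls that return the live values on a Unix box.

```python
import os


def load_per_core(load, cores):
    """Load divided by core count; 1.0 means the box is fully busy."""
    return load / cores


def describe_load(avg_load, peak_load, cores):
    """Apply the two starting rules from the text to one report window."""
    notes = []
    if load_per_core(avg_load, cores) >= 1.0:
        notes.append("average load at or above core count")
    if avg_load > 0 and peak_load / avg_load >= 2.0:
        notes.append("peak is at least double the average")
    return notes or ["within the starting rules"]


def live_load_per_core():
    """Read the one-minute load average from the OS (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return load_per_core(one_minute, os.cpu_count() or 1)


print(load_per_core(4, 8))   # the eight-core case from the text
print(load_per_core(4, 2))   # the two-core case from the text
print(describe_load(2.0, 5.0, 8))
```

The same load of 4 gives 0.5 per core on eight cores and 2.0 per core on two, which is the whole point of dividing by core count before you judge the number.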

USER, SYSTEM, and IDLE

The processor breakdown tells you where the work is going. USER time is your application, SYSTEM time is the kernel handling network and disk, and IDLE is headroom. When IDLE collapses toward zero, you are out of room. A SYSTEM percentage that creeps up while call volume stays flat often points at a network or storage bottleneck rather than the dialer itself. The full read on this split is in VICIdial CPU user, system, and idle explained.
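
If you want to see where a USER, SYSTEM, and IDLE split comes from, the sketch below turns two snapshots of cumulative CPU counters into percentages. It assumes you already have the counters as plain dictionaries (on Linux they are available in /proc/stat); how VICIdial itself gathers the figures is not covered here, and the function name is hypothetical.

```python
def cpu_split(before, after):
    """Percent of time spent in user, system, and idle between two snapshots.

    before/after: dicts of cumulative tick counts with the keys
    'user', 'system', and 'idle'. Other CPU states are ignored in this
    simplified sketch.
    """
    deltas = {key: after[key] - before[key] for key in ("user", "system", "idle")}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("counters did not advance between snapshots")
    return {key: round(100.0 * value / total, 1) for key, value in deltas.items()}


earlier = {"user": 100, "system": 50, "idle": 850}
later = {"user": 160, "system": 80, "idle": 860}
print(cpu_split(earlier, later))
```

In that sample window only 10 percent of the time was idle, which is the "out of room" pattern described above.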

Channels and off-hook minutes: how busy the phone side is

A channel is one active leg of audio: a ringing line, a connected call, an agent's phone. Channel count is the most direct measure of how hard the telephony side is working, because every concurrent leg consumes processor and bandwidth to mix. The Server Performance Report gives you the average channels in use, and the day-by-day view shows your busiest moments. Watching this number is covered in monitor the Asterisk channel count, with the averaging nuance in average channels in use explained.

Off-hook minutes is the total time the server spent on live calls during a shift. It is a clean stand-in for raw telephony work: more off-hook minutes means more audio mixing, more recording writes, and more carrier traffic. It is also a great early indicator of capacity pressure, because it grows with both call volume and call length. What to read into it is the subject of what off-hook minutes tells you.
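
The arithmetic behind that last point is simple enough to show. In this sketch, off-hook minutes are treated as the sum of call durations, which is an assumption for illustration rather than the report's exact definition. The call counts and lengths are invented.

```python
def off_hook_minutes(call_seconds):
    """Total minutes on live calls, given each call's length in seconds."""
    return sum(call_seconds) / 60.0


baseline = [180] * 100       # 100 calls of three minutes each
more_calls = [180] * 200     # twice the volume, same length
longer_calls = [360] * 100   # same volume, twice the length

print(off_hook_minutes(baseline))
print(off_hook_minutes(more_calls))
print(off_hook_minutes(longer_calls))
```

Doubling the volume and doubling the call length both double the total, so a rising off-hook figure tells you the server is working harder without telling you which of the two changed. Check call counts alongside it.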

Maximum System Stats: your 30-day trend at a glance

Where the performance report zooms into a single window, Maximum System Stats pulls back to thirty days. For each of the last thirty days it draws a bar showing total call count in and out, inbound and outbound counts separately, the most concurrent calls the system reached, and the most concurrent agents. This is the screen that answers the question a busy floor manager actually asks: is this getting heavier over time? Reading the bars is covered in how to read maximum system stats, with focused looks at tracking the most concurrent calls and tracking the most concurrent agents.

The concurrency peaks matter more than the totals for capacity planning. A server can comfortably handle a high daily total spread across the day and still buckle if everything lands in one fifteen-minute window. Watch the peak bars, not just the height of the totals, and use the 30-day call volume trend to catch slow growth before it catches you.
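
Here is a short Python sketch of why the same daily total can produce very different peaks. It counts the most calls in progress at once from a list of start and end times. The call data is invented, and the function is an illustration, not how Maximum System Stats computes its bars.

```python
def peak_concurrency(calls):
    """Return the highest number of calls in progress at the same moment.

    calls: list of (start, end) pairs in seconds.
    """
    events = []
    for start, end in calls:
        events.append((start, 1))    # a call begins
        events.append((end, -1))     # a call ends

    # Tuples sort by time, then by delta, so at the same timestamp an end
    # (-1) is processed before a start (+1) and back-to-back calls do not
    # count as overlapping.
    events.sort()

    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


spread_out = [(0, 60), (100, 160), (200, 260)]   # three calls, no overlap
one_burst = [(0, 60), (10, 70), (20, 80)]        # three calls, all overlapping

print(peak_concurrency(spread_out))
print(peak_concurrency(one_burst))
```

Both lists hold three calls of the same length, yet the burst needs three times the concurrent capacity.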

Running this much monitoring yourself is a real job. On a managed VPS from VICIfast, system performance logging is on from the first boot, the box is a clean single-tenant machine with no noisy neighbors stealing cycles, and resizing to more cores takes a few clicks. Your dialer is live in under 40 seconds and you start with the graphs already collecting.

Server Versions and disk space

The Server Versions page is a one-line-per-server roll call. It lists each server's name, IP, current load, channel count, free disk space, time, software version, and whether it is active. On a single box it is a quick pulse check; on a cluster it is the fastest way to spot the one server drifting out of line. Reading the columns is covered in how to read the server versions page, and keeping versions in sync across boxes is in check the VICIdial version across servers.

Disk space is the silent killer. Recordings, logs, and database files grow every day, and a full disk takes a dialer down hard, often refusing new calls or corrupting writes. The Server Versions page surfaces free space per box so you can catch it early; the discipline of watching it is in monitor disk space from server versions. Recording storage in particular deserves its own watch because it fills faster than anything else.
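
If you have shell access to the box, a few lines of Python can give the same early warning from the operating system side. The sketch below uses the standard-library `shutil.disk_usage` call; the daily growth figure is something you would measure yourself, and the helper names are made up for this example.

```python
import shutil


def free_percent(path="/"):
    """Percent of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def days_until_full(free_gb, daily_growth_gb):
    """Rough runway, assuming storage keeps growing at a steady daily rate."""
    if daily_growth_gb <= 0:
        return None   # not growing, so no projected fill date
    return free_gb / daily_growth_gb


print(round(free_percent("."), 1))
print(days_until_full(120, 8))   # 120 GB free, recordings adding 8 GB a day
```

A straight-line projection like this is crude, but it turns "free space looks a bit low" into a number of days you can plan around.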

Backend processes and the keepalive scripts

VICIdial leans on a set of behind-the-scenes scripts to place calls, move agents between sessions, and keep the dialer running. The keepalive routine restarts these if they die, but you should still know whether they are healthy. The Internal Process Logs page covers the last seven days across every server and shows debug data for each backend process: how many times it was launched, when, and how long it has been running. If a process is restarting constantly, that is a problem worth chasing. Reading these is covered in what the internal process logs show.
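
"Restarting constantly" is easier to spot with a count. The sketch below tallies launches per process per day and flags anything over a limit you pick. The input format, the process names, and the limit are all hypothetical; they do not reflect the actual layout of the Internal Process Logs page.

```python
from collections import Counter


def noisy_processes(launches, max_launches_per_day):
    """Names of processes launched more than the limit on any single day.

    launches: iterable of (process_name, day) pairs, one pair per launch.
    """
    per_day = Counter(launches)
    flagged = {name for (name, _day), count in per_day.items()
               if count > max_launches_per_day}
    return sorted(flagged)


# Invented sample: proc_a started twice, proc_b started twelve times.
sample = [("proc_a", "2024-01-08")] * 2 + [("proc_b", "2024-01-08")] * 12
print(noisy_processes(sample, max_launches_per_day=5))
```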

Latency and lag: the agent-facing health signal

Server load tells you the box is struggling. Agent latency tells you the agents are feeling it. The Agent Latency Report measures the web-connection delay between the agent screen and the server for everyone currently or recently logged in, and charts it across the day. Rising latency is often the first thing agents notice, long before a graph turns red, which is why it works so well as an early capacity signal. The case for treating it that way is in agent latency as a capacity signal, and the report walkthrough is in how to read the agent latency report.

A related screen, the Latency Gaps Report, flags missing stretches of latency logging while an agent was supposed to be connected. Those gaps frequently explain why an agent reports their screen froze or dropped a call. Then there is the Agent LAGGED Report, which records the actual lag events, the dialer that experienced them, and the agent, campaign, and status at the moment of each one. Together these turn vague agent complaints into specific, server-side evidence. We cover them in how to read the agent LAGGED report and trace the root causes in what causes latency gaps.
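
The idea behind a latency gap is easy to show in code: if samples are supposed to arrive at a steady interval, any stretch much longer than that interval is a gap. This is a sketch of the concept only. The sample interval, the tolerance, and the function name are assumptions, not the report's actual logic.

```python
def find_gaps(timestamps, expected_interval, tolerance=2.0):
    """Return (last_seen, next_seen) pairs where logging went quiet.

    timestamps: times of latency samples, in seconds.
    expected_interval: how often a sample should arrive (assumed value).
    tolerance: how many intervals of silence count as a gap.
    """
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > expected_interval * tolerance:
            gaps.append((previous, current))
    return gaps


# Samples every 10 seconds, with nothing logged between 20 and 80.
print(find_gaps([0, 10, 20, 80, 90], expected_interval=10))
```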

When lag is intermittent, the LAGGED Summary Report is the screen to reach for. It charts lag events on a timeline over a date range so clusters of trouble jump out visually, which makes it far easier to correlate a bad stretch with a list reset, a backup window, or a carrier hiccup. Spotting those clusters is the subject of spotting lagged clusters on a timeline.
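
If you ever export lag event times and want to look for clusters yourself, bucketing them by hour is usually enough to make a bad stretch stand out. The sketch below does that with the standard library. The event times are invented, and this is not how the LAGGED Summary Report draws its timeline.

```python
from collections import Counter
from datetime import datetime


def lag_events_per_hour(event_times):
    """Count lag events in each clock hour."""
    return Counter(
        moment.replace(minute=0, second=0, microsecond=0)
        for moment in event_times
    )


events = [
    datetime(2024, 1, 8, 9, 5),
    datetime(2024, 1, 8, 9, 40),
    datetime(2024, 1, 8, 9, 59),
    datetime(2024, 1, 8, 14, 10),
]

busiest_hour, count = lag_events_per_hour(events).most_common(1)[0]
print(busiest_hour, count)
```

Once you know the busiest hour, line it up against your list resets, backup windows, and carrier notices for that day.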

Debug pages when something is actually wrong

When a metric goes bad and you need to find out why, VICIdial gives level-9 users a set of debug and compare tools. The Asterisk Debug Page dumps the SIP peer list and registry plus the last lines of Asterisk CLI output, so you can confirm your carrier registration is healthy. Checking that registry is covered in check SIP peers and registry. The Settings Compare Utility lets you diff two campaigns or users to find a stray setting, and the DB Schema Compare Utility checks that your database schemas line up, which matters on multi-server setups, as covered in the DB schema compare utility explained.

Health status: healthy, degraded, unreachable

All of the above feeds a simpler question: is the server healthy right now? It helps to think in three states. Healthy means everything responds and the numbers are in range. Degraded means the box is up but something is wrong: high load, climbing latency, or low disk. Unreachable means you cannot talk to it at all. A good monitoring setup notices the slide from healthy to degraded so you act during the warning, not the outage. What degraded actually means is in what degraded health means, and the more serious state is in what unreachable health means.

stateDiagram-v2
  [*] --> Healthy
  Healthy --> Degraded: load or latency rising
  Degraded --> Healthy: capacity added
  Degraded --> Unreachable: server stops responding
  Unreachable --> Degraded: connection restored
  Unreachable --> Healthy: fully recovered
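
The three states can be expressed as one small function. Treat this as a sketch of the idea: the threshold values below are placeholders invented for the example, not recommended limits, and you would replace them with ranges that fit your own server.

```python
def health_state(reachable, load_per_core, latency_ms, free_disk_pct,
                 max_load_per_core=1.0, max_latency_ms=500, min_free_disk_pct=15):
    """Classify a server as healthy, degraded, or unreachable.

    The default thresholds are hypothetical placeholders.
    """
    if not reachable:
        return "unreachable"

    out_of_range = (
        load_per_core > max_load_per_core
        or latency_ms > max_latency_ms
        or free_disk_pct < min_free_disk_pct
    )
    return "degraded" if out_of_range else "healthy"


print(health_state(True, 0.6, 120, 40))
print(health_state(True, 1.4, 120, 40))
print(health_state(False, 0.0, 0, 0))
```

Reachability is checked first on purpose: if you cannot talk to the box, none of the other readings can be trusted.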

One thing that trips up newcomers is confusing health with billing. A server can be paid and active while still being unhealthy, and the two are tracked separately for exactly that reason. The distinction is worth understanding, and we lay it out in health status versus billing status. To get ahead of outages entirely, set up external uptime checks as described in how to set up uptime monitoring.

Sizing and capacity: when to add more

Monitoring exists to answer one practical question: do you need a bigger box, or another one? The honest answer depends on your agent count, your channels per agent, your call length, and whether you record everything. There is no universal number, but there are sane ranges, and the clearest signal is your own peak concurrency and load trend over thirty days. We work through the sizing math in sizing a VICIdial server by agents and channels, and call out the warning signs in when a server needs more capacity.

The best operators catch capacity problems before they hit the floor by watching the trend, not the crisis. A load average that has crept up ten percent a week, a peak channel count edging toward your ceiling, and free disk shrinking on a predictable slope are all things you can act on calmly with a week of lead time. The mindset is in spot capacity problems before they hit.
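
A steady weekly growth rate can be turned into a rough deadline. The sketch below assumes growth compounds at the same percentage each week, which is a simplification, and the current and ceiling values are invented for the example.

```python
import math


def weeks_until_ceiling(current, ceiling, weekly_growth):
    """Weeks until `current` reaches `ceiling` at a compounding weekly rate.

    weekly_growth: 0.10 means ten percent a week.
    Returns 0.0 if already at the ceiling, None if there is no growth.
    """
    if current >= ceiling:
        return 0.0
    if weekly_growth <= 0:
        return None
    return math.log(ceiling / current) / math.log(1 + weekly_growth)


# Average load of 2.0 on a four-core box, creeping up ten percent a week.
print(round(weeks_until_ceiling(2.0, 4.0, 0.10), 1))
```

In that example the load reaches the core count in a little over seven weeks, which is plenty of time to plan a resize if you notice the trend early.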

A simple daily routine

You do not need to live inside these reports. A five-minute morning check covers most of it: glance at the Server Versions page for load, channels, and free disk; open yesterday's Server Performance Report and eyeball the load and channel graphs for anything unusual; and skim the latency or LAGGED reports if any agent complained. That short loop, done consistently, catches the great majority of problems while they are still small. A complete checklist lives in what to watch on a VICIdial server daily.

For real-time work, the Real-Time report (the Time On VDAD screen) shows who is logged in, who is on a call, and how long each call has run, all on one server. It is also where you find the session IDs used for live call monitoring, and every manager monitoring session it powers is itself logged for audit. Day to day, the Real-Time report is the screen you keep open during a shift while the others are the ones you read after it.

Putting it together

Healthy VICIdial operations come down to a short list watched consistently: load and its peak, channels and off-hook minutes, free disk, backend process health, and agent latency, rolled up into a simple healthy-degraded-unreachable view and checked against a thirty-day trend. None of it is hard once the logging is on and the habit is in place.

If you would rather skip the setup and start with a dialer that has all of this monitoring switched on from the first second, spin one up with VICIfast. You get a dedicated, single-tenant server with performance logging, disk and load visibility, and health monitoring ready out of the box, live in under 40 seconds, so you can spend your time reading the graphs instead of building them.