Monitor VICIdial back-end processes for restarts and hangs

Use the Internal Process Logs page to spot a crashing VICIdial process: a high launch count with short run times means the keepalive is restarting it repeatedly, which points to a real fault.

When a VICIdial back-end process crashes and the keepalive restarts it, the Internal Process Logs page records the evidence — high launch counts combined with short run durations are the clearest signal that a process is failing repeatedly and needs attention.

Understanding the keepalive restart loop

VICIdial's back-end Perl processes (the AST_* scripts, VDAD, listen, and fastlog) run inside screen sessions on the server. A separate keepalive script watches each one. When a process exits for any reason, the keepalive starts it again within seconds. This is intentional: a transient crash should not take the dialer down for more than a moment. However, when a process has an underlying fault, such as a corrupt configuration, a database error it cannot recover from, or a memory leak that triggers an OS kill, it will crash and be restarted over and over. The Internal Process Logs page captures every one of those restart events.
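To make the restart loop concrete, here is a minimal Python sketch that turns a list of launch timestamps for one process into run durations. The input format and the `run_durations` helper are hypothetical and do not reflect how the Internal Process Logs page stores its data. The sketch also assumes each run ends at the moment the next launch begins, which ignores the short gap before the keepalive reacts.

```python
from datetime import datetime, timedelta


def run_durations(launch_times, now):
    """Return how long each run lasted, oldest run first.

    Assumption: a run ends when the next launch begins. The final run is
    measured up to `now`, because that process is still running.
    """
    ordered = sorted(launch_times)
    ends = ordered[1:] + [now]
    return [end - start for start, end in zip(ordered, ends)]


# Hypothetical data: a process that is relaunched every 40 seconds.
first_launch = datetime(2024, 1, 1, 9, 0, 0)
launches = [first_launch + timedelta(seconds=40 * i) for i in range(5)]
durations = run_durations(launches, now=first_launch + timedelta(seconds=200))

print("launch count:", len(launches))
print("run seconds:", [d.total_seconds() for d in durations])
```

Many launches paired with uniformly short durations is the signature of a restart loop, as opposed to a single launch with one long duration.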

Reading the crash pattern

Two numbers tell you almost everything: launch count and run duration. Here is how to read them together:

Launch count of 1, run duration of several days: healthy. The process started after the last planned restart and has been running continuously.
Launch count of 2 or 3, run duration of hours: probably fine. This may indicate a planned restart during a maintenance window or a brief power issue. Check the launch timestamps against your change log.
Launch count of 10 or more, run duration of minutes or seconds: active fault. The keepalive is restarting the process repeatedly because it keeps crashing. The timestamps in the launch column will show the launches clustered in a short window.
Process missing from the list entirely: either the process was never launched within the page's seven-day window, or the keepalive itself is not running on that server. This is a different kind of problem, because nothing is watching that process.

flowchart TD
  A["Internal Process Logs"]
  A --> B{"Launch count for process?"}
  B -->|"1 or few"| C{"Run duration = days?"}
  C -->|Yes| D["Process is healthy"]
  C -->|No| E["Check launch time vs change log"]
  B -->|"High count"| F{"Run duration = minutes or less?"}
  F -->|Yes| G["Process is crashing in a loop"]
  G --> H["Check Asterisk logs for error"]
  H --> I["Fix root cause and restart"]
  F -->|No| J["Investigate restart trigger"]
  B -->|"Missing"| K["Keepalive may be down on that server"]
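
The decision flow above can be sketched as a small Python function. The numeric thresholds are illustrative assumptions: the text names 10 or more launches as high, but it does not give exact cut-offs for "days" or "minutes", and it does not say how to treat counts between 4 and 9, which this sketch groups with "few". The function name and return labels are hypothetical.

```python
from typing import Optional

# Illustrative thresholds (assumptions, not VICIdial settings).
HIGH_LAUNCH_COUNT = 10           # "10 or more" launches counts as high
SHORT_RUN_SECONDS = 60 * 60      # under an hour is treated as "minutes or less"
LONG_RUN_SECONDS = 24 * 60 * 60  # a day or more is treated as "days"


def classify(launch_count: Optional[int], run_seconds: float = 0.0) -> str:
    """Map a launch count and current run duration to a triage label."""
    if not launch_count:
        # Process absent from the list: the keepalive may be down.
        return "missing"
    if launch_count >= HIGH_LAUNCH_COUNT:
        if run_seconds < SHORT_RUN_SECONDS:
            return "crash_loop"
        return "investigate_restart_trigger"
    if run_seconds >= LONG_RUN_SECONDS:
        return "healthy"
    return "check_change_log"


print(classify(1, 3 * 24 * 60 * 60))
print(classify(12, 45))
print(classify(None))
```

Tune the thresholds to your own environment; a server that is restarted nightly, for example, would never reach a multi-day run duration.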

Which processes to watch most closely

Not all back-end processes have the same impact when they crash. Prioritise these:

VDAD — the auto-dial engine. When this crashes, the Hopper stops being filled and outbound Predictive dialing stops placing calls. Agents go idle within a minute or two.
AST_VDauto_dial_FILL — the process that actually triggers the dialing via Asterisk. A crash here means no calls are originated even if VDAD is running and the hopper has leads.
fastlog — when this crashes, call event data stops being written to the database. Your Real-time report goes stale and post-campaign reports will have gaps for the period it was down.
listen — controls the agent browser connection and the real-time display. A crash here does not stop calls but breaks what agents and supervisors see on screen.

Investigating a crashing process

Once you have identified the crashing process and its restart timestamps from Internal Process Logs, the next step is to check the Asterisk log and the VICIdial error log on the affected server. The restart times tell you exactly where to look in the error output: scan the two or three minutes surrounding each restart event. Common culprits are a database connection error, a missing configuration value, or a SIP trunk that returns unexpected responses and causes the process to die on an unhandled code path.
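
As a rough illustration of scanning the minutes around a restart, the Python sketch below filters log lines to those within a few minutes of a given restart time. It assumes every relevant line begins with a `YYYY-MM-DD HH:MM:SS` timestamp, which may not match the date format configured on your server, and the sample lines are invented for the example. The `lines_near` helper is hypothetical.

```python
from datetime import datetime, timedelta

# Assumed timestamp prefix; adjust to the format your logs actually use.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
TIMESTAMP_LENGTH = 19  # length of "YYYY-MM-DD HH:MM:SS"


def lines_near(log_lines, restart_time, window_minutes=3):
    """Return the log lines stamped within `window_minutes` of `restart_time`."""
    window = timedelta(minutes=window_minutes)
    matches = []
    for line in log_lines:
        try:
            stamp = datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if abs(stamp - restart_time) <= window:
            matches.append(line)
    return matches


# Invented sample lines, for illustration only.
sample = [
    "2024-01-01 08:50:00 routine message",
    "2024-01-01 09:14:10 database connection error",
    "2024-01-01 09:15:05 process exited",
    "continuation line without a timestamp",
    "2024-01-01 09:30:00 routine message",
]
hits = lines_near(sample, datetime(2024, 1, 1, 9, 15, 0))
for line in hits:
    print(line)
```

Running the filter once per restart timestamp and comparing the results is a quick way to see whether the same error precedes every crash.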

The Server Versions page gives you the current snapshot of server state — load, channels, disk, and active status — which is useful context alongside a process fault. For the full time-series picture, the VICIdial system load explained guide shows how load spikes correlate with process restarts.

Process monitoring is one layer of a broader server-health approach. For the full framework, see the guide to monitoring VICIdial server health and capacity.

If chasing keepalive restarts on a self-managed box is not the best use of your time, start a VICIfast trial and get a fully managed VICIdial server online in under 40 seconds — back-end processes monitored and recoverable from the start.