Transcription Options for VICIdial Call Recordings

VICIdial does not ship a built-in transcription engine, but its audio files work with any external STT platform. Here's how transcription fits into the recording pipeline.

VICIdial does not ship a built-in call transcription engine. What it does ship is a reliable audio file for every recorded call. Those files are the input that any external speech-to-text platform needs. Getting transcription working is mostly a pipeline question: how do the audio files get from the VICIdial server to the transcription service, and how do the resulting transcripts get back to somewhere useful?

How VICIdial produces the audio files

Native call recording in VICIdial is handled by Asterisk's MixMonitor application. MixMonitor writes audio files to the recording directory on the server, typically under the path configured in server settings. Files are written in either WAV or GSM format depending on campaign configuration. WAV files are larger but are accepted by most transcription services without conversion. GSM files are compact but may need to be converted before some services accept them.
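As a rough illustration of the GSM conversion step, the sketch below builds an ffmpeg command that decodes a recording to 16-bit PCM WAV. It assumes ffmpeg is installed on the machine doing the conversion and that your build can read the GSM files your server produces; the function names and the example path are hypothetical, not part of VICIdial.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_convert_command(src: Path, dst: Optional[Path] = None) -> List[str]:
    """Build an ffmpeg command that decodes one recording to 16-bit PCM WAV."""
    # Default output: same directory and base name, with a .wav extension.
    target = dst if dst is not None else src.with_suffix(".wav")
    return [
        "ffmpeg",
        "-n",                  # do not overwrite an existing output file
        "-i", str(src),        # input recording (format detected by ffmpeg)
        "-c:a", "pcm_s16le",   # uncompressed 16-bit little-endian PCM
        str(target),
    ]


def convert_to_wav(src: Path) -> Path:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    target = src.with_suffix(".wav")
    subprocess.run(build_convert_command(src, target), check=True)
    return target
```

Keeping the command builder separate from the call that runs it makes it easy to log or dry-run the exact command before touching real recordings.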

Each file is named using a pattern that encodes call metadata: campaign, agent, date, and the call's unique ID. That naming convention matters for transcription pipelines because it determines how you match a transcript back to the original call record in the CRM or in VICIdial's own recording log.
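To make the matching step concrete, here is a minimal parsing sketch. The filename layout it expects (CAMPAIGN_AGENT_YYYYMMDD-HHMMSS_UNIQUEID.wav) is a hypothetical example chosen for illustration; check the pattern your own campaigns actually produce and adjust the regular expression to match.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Hypothetical layout: CAMPAIGN_AGENT_YYYYMMDD-HHMMSS_UNIQUEID.wav
# The unique ID may itself contain a dot, so it is matched greedily
# up to the final file extension.
_PATTERN = re.compile(
    r"^(?P<campaign>[^_]+)_(?P<agent>[^_]+)_"
    r"(?P<stamp>\d{8}-\d{6})_(?P<call_id>.+)\.(?:wav|gsm)$",
    re.IGNORECASE,
)


class RecordingInfo(NamedTuple):
    campaign: str
    agent: str
    started: datetime
    call_id: str


def parse_recording_name(filename: str) -> Optional[RecordingInfo]:
    """Return the metadata encoded in a filename, or None if it does not fit."""
    match = _PATTERN.match(filename)
    if match is None:
        return None
    try:
        started = datetime.strptime(match["stamp"], "%Y%m%d-%H%M%S")
    except ValueError:
        # Digits in the right shape but not a real date or time.
        return None
    return RecordingInfo(match["campaign"], match["agent"], started, match["call_id"])
```

Returning None for anything that does not fit lets a pickup script skip unexpected files instead of attaching a transcript to the wrong call.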

Stereo recordings and transcription accuracy


VICIdial supports stereo recording, where the agent's audio and the customer's audio are recorded on separate channels of a single WAV file. Most transcription services can consume a stereo file and produce speaker-separated transcripts, labeling lines as "Agent" and "Customer" automatically. This is a significant quality improvement over transcribing a mono mix where both voices share one channel. If transcription accuracy is important to your QA process, enable stereo recording before you connect a transcription service.
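If a transcription service only accepts mono input, the two channels can be split into separate files first. The sketch below uses only the Python standard library and assumes 16-bit PCM stereo WAV input; which channel carries the agent and which carries the customer is something to verify against your own recordings, so the outputs are simply called left and right.

```python
import wave
from array import array


def split_stereo(src: str, left_path: str, right_path: str) -> None:
    """Write each channel of a 16-bit stereo WAV to its own mono WAV file."""
    with wave.open(src, "rb") as wf:
        if wf.getnchannels() != 2 or wf.getsampwidth() != 2:
            raise ValueError("expected a 16-bit stereo PCM WAV file")
        rate = wf.getframerate()
        samples = array("h")  # 2-byte items; values are moved, not interpreted
        samples.frombytes(wf.readframes(wf.getnframes()))

    # Stereo frames are interleaved: left, right, left, right, ...
    channels = ((left_path, samples[0::2]), (right_path, samples[1::2]))
    for path, channel in channels:
        with wave.open(path, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(2)
            out.setframerate(rate)
            out.writeframes(channel.tobytes())
```

This reads the whole file into memory, which is fine for typical call lengths; very long recordings would be better processed in chunks.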

Integration approaches

flowchart TD
  A[Call ends - MixMonitor writes recording file] --> B{File format}
  B -->|WAV| C[Send directly to STT service]
  B -->|GSM| D[Convert to WAV first]
  D --> C
  C --> E[Transcript returned as text]
  E --> F{Where does transcript go?}
  F -->|Stored locally| G[Database or file alongside recording]
  F -->|Analytics platform| H[Keyword search and scoring dashboard]
  F -->|CRM field| I[Attached to lead or contact record]

Post-call batch processing: a cron job or script picks up completed recording files at regular intervals and submits them to a cloud speech-to-text API (application programming interface). Transcripts come back as text and are stored alongside the recording metadata. This approach adds latency but keeps infrastructure simple.
Real-time streaming: some platforms accept a live audio stream during the call and return a transcript before the call ends. This requires more integration work on the Asterisk side and a stable, low-latency connection to the transcription service. Real-time transcription is the foundation for live agent prompting and live compliance alerts.
Third-party analytics platforms with built-in transcription: tools like CallMiner handle both transcription and analysis in one layer. VICIdial feeds them audio files; the platform handles the rest. For operations that want analytics beyond plain text, this is often the simpler path than assembling separate STT and analysis tools.
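The post-call batch approach described above can be sketched as a small pickup script. Everything here is illustrative: the transcribe callable is a placeholder for whatever STT client you use, transcripts are stored as a .txt file next to each recording, and a file is only treated as finished once it has gone unmodified for a minimum age (an assumption, since the script cannot otherwise tell whether a call is still being recorded).

```python
import time
from pathlib import Path
from typing import Callable, List, Optional


def find_pending(recording_dir: Path, min_age_seconds: float = 60.0,
                 now: Optional[float] = None) -> List[Path]:
    """List WAV files that have no transcript yet and look fully written."""
    current = time.time() if now is None else now
    pending = []
    for wav in sorted(recording_dir.glob("*.wav")):
        if wav.with_suffix(".txt").exists():
            continue  # already transcribed on an earlier run
        if current - wav.stat().st_mtime < min_age_seconds:
            continue  # modified too recently; may still be recording
        pending.append(wav)
    return pending


def process_batch(recording_dir: Path, transcribe: Callable[[Path], str],
                  min_age_seconds: float = 60.0) -> int:
    """Transcribe each pending file and store the text beside the audio."""
    done = 0
    for wav in find_pending(recording_dir, min_age_seconds):
        text = transcribe(wav)  # placeholder for a real STT client call
        wav.with_suffix(".txt").write_text(text, encoding="utf-8")
        done += 1
    return done
```

Run from cron at whatever interval suits your QA turnaround. Because the transcript file doubles as the "done" marker, rerunning the script after a failure picks up where it left off.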

Cepstral — text-to-speech, not transcription

VICIdial does include a native integration with Cepstral, but Cepstral is a text-to-speech (TTS) engine: it converts text scripts into spoken audio for IVR prompts and campaign messages. It is not a transcription tool and cannot convert call recordings back into text. The two are often confused because both involve speech and recordings, so it is worth being clear: if you want transcription, Cepstral is not the answer.

What to confirm before connecting a transcription service

Recording format: WAV passes to most services without conversion. GSM may need an intermediate step.
Stereo vs mono: decide before enabling recording whether you want speaker-separated channels.
File naming: your pipeline needs to parse the recording filename to link transcripts back to call records.
Storage location: confirm the recording directory path in server settings before writing any file-pickup scripts.
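A quick way to check the first two items on a sample recording is to read its WAV header. The sketch below uses Python's standard wave module; the function name is hypothetical. wave.open raises wave.Error for files it cannot parse, which can include compressed WAV variants, so an error here is itself a hint that conversion may be needed.

```python
import wave
from typing import Dict, Union


def describe_wav(path: str) -> Dict[str, Union[int, float]]:
    """Report channel count, sample width, sample rate and duration."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),          # 2 means stereo
            "sample_width_bytes": wf.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_seconds": frames / rate if rate else 0.0,
        }
```

Running this against a handful of recent recordings before building the pipeline confirms whether the campaign is actually producing the format and channel layout you expect.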

The full recording pipeline that feeds any transcription workflow is covered in VICIdial call recording explained. For the OREKA/CallMiner integration path that bundles recording and analysis together, see the ORECX OREKA integration explained.

Running a hosted VICIdial setup where recording is already on and files are accessible? See VICIfast pricing for managed options.