
Implementation of SIP‑Based DTMF Signal Capture for Intelligent Voice Robots

This article explains how an intelligent voice robot leverages TTS and SIP to convert server alerts into spoken notifications, detailing the end‑to‑end workflow, DTMF transmission methods, SIP detection techniques, SDP media negotiation, and RTP‑based DTMF parsing to enable reliable key‑press handling.

58 Tech

With the widespread adoption of internet technologies, online alerts have evolved from SMS/email to WeChat and voice calls. Because SMS/email suffer from latency, many organizations now convert server alerts into synthesized speech using Text‑to‑Speech (TTS) and let users respond via keypad.

The intelligent voice robot combines TTS and SIP to provide voice playback and keypad recognition, applied to 58.com operations alerts and recruitment invitations, reducing manual effort and improving efficiency (see Figure 1).

Overall Process : In a SIP call, the user (Agent A) and the robot (Agent B) maintain a continuous media stream. The robot distinguishes between voice streams and DTMF signals, routing voice to a decoder and DTMF to a recognition module, then processes the resulting data.

How DTMF is Transmitted : Traditional telephone/keypad devices encode digits using Dual‑Tone Multi‑Frequency (DTMF), which combines one high‑frequency and one low‑frequency tone per symbol (0‑9, *, #, A‑D) as illustrated in Figure 3.
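The tone pairing described above can be written down as a small lookup. This is a minimal sketch using the standard DTMF frequency grid; the function name `dtmf_frequencies` is hypothetical and is not taken from the robot's code.

```python
# Standard DTMF grid in Hz: each keypad row shares a low tone,
# each keypad column shares a high tone.
LOW_FREQS = (697, 770, 852, 941)
HIGH_FREQS = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")


def dtmf_frequencies(symbol: str) -> tuple[int, int]:
    """Return the (low, high) tone pair in Hz for one DTMF symbol."""
    if len(symbol) != 1:
        raise ValueError("expected exactly one DTMF symbol")
    for row, keys in enumerate(KEYPAD):
        col = keys.find(symbol)
        if col != -1:
            return LOW_FREQS[row], HIGH_FREQS[col]
    raise ValueError(f"not a DTMF symbol: {symbol!r}")
```

For example, `dtmf_frequencies("5")` yields the pair 770 Hz and 1336 Hz.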

How SIP Detects DTMF : Three methods are commonly used:

SIP INFO : DTMF is carried out of band in the body of SIP INFO requests. The RTP stream stays untouched, but because signaling and media travel separately, a key press may arrive out of sync with the audio.

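RFC2833 : The telephone‑event payload format is negotiated via SDP; each key press is then sent as RTP packets with a dedicated dynamic payload type (e.g., PT = 126) instead of audio samples. RFC 2833 has since been superseded by RFC 4733, which keeps the same packet layout.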

INBAND : DTMF tones are carried as ordinary audio in the RTP stream, and the receiver recovers them by spectral analysis of the decoded samples.
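For the INBAND case, one common form of spectral analysis is the Goertzel algorithm, which measures the signal power at each of the eight DTMF frequencies. The sketch below assumes decoded mono samples at 8000 Hz and an arbitrary power threshold (`min_power`); the function names are hypothetical, and the article does not say which detector the robot uses.

```python
import math

LOW_FREQS = (697, 770, 852, 941)
HIGH_FREQS = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")


def goertzel_power(samples, freq, sample_rate):
    """Return the squared magnitude of `samples` at `freq` (Goertzel recurrence)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev * s_prev + s_prev2 * s_prev2 - coeff * s_prev * s_prev2


def detect_dtmf(samples, sample_rate=8000, min_power=1.0):
    """Return the DTMF symbol in one audio frame, or None if no tone pair is found.

    min_power is an assumed threshold; a production detector would also
    check tone duration and the balance between the two tones.
    """
    low_power = {f: goertzel_power(samples, f, sample_rate) for f in LOW_FREQS}
    high_power = {f: goertzel_power(samples, f, sample_rate) for f in HIGH_FREQS}
    low = max(low_power, key=low_power.get)
    high = max(high_power, key=high_power.get)
    if low_power[low] < min_power or high_power[high] < min_power:
        return None
    return KEYPAD[LOW_FREQS.index(low)][HIGH_FREQS.index(high)]
```

A frame of roughly 50 ms (400 samples at 8000 Hz) gives enough frequency resolution to separate adjacent DTMF tones in this sketch.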

Session Establishment : The caller (Agent A) and the callee (Agent B) set up the call by exchanging an INVITE request, 100 Trying and 180 Ringing provisional responses, a 200 OK final response, and an ACK (see Figure 5).

Media Negotiation : SDP describes media parameters such as connection information (c=), media descriptions (m=), and attributes (a=). The two parties exchange SDP as an offer and an answer, here carried in the INVITE and the 180 Ringing response (many deployments return the answer in the 200 OK instead), to agree on IP addresses, ports, codecs, and DTMF event handling (see Figure 6).
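To know which RTP payload type carries DTMF, the receiver has to read it from the negotiated SDP. The following sketch looks for the `a=rtpmap` line that maps a payload type to `telephone-event`. The SDP text, its addresses and ports, and the function name `find_telephone_event_pt` are illustrative assumptions, not taken from the robot's traffic.

```python
import re

# Matches lines such as "a=rtpmap:126 telephone-event/8000".
RTPMAP = re.compile(r"^a=rtpmap:(\d+)\s+([^/\s]+)/(\d+)", re.IGNORECASE)

# Made-up SDP body for illustration only.
EXAMPLE_SDP = """\
v=0
o=- 0 0 IN IP4 192.0.2.10
s=-
c=IN IP4 192.0.2.10
t=0 0
m=audio 40000 RTP/AVP 8 126
a=rtpmap:8 PCMA/8000
a=rtpmap:126 telephone-event/8000
a=fmtp:126 0-15
"""


def find_telephone_event_pt(sdp: str):
    """Return the payload type mapped to telephone-event, or None if absent."""
    for line in sdp.splitlines():
        match = RTPMAP.match(line.strip())
        if match and match.group(2).lower() == "telephone-event":
            return int(match.group(1))
    return None
```

With the example body above, `find_telephone_event_pt(EXAMPLE_SDP)` returns 126. If it returns `None`, the peer did not offer RFC2833 events, and the receiver would have to fall back to SIP INFO or in-band detection.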

DTMF Parsing (RFC2833) :

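When an RTP packet with the negotiated telephone‑event payload type (PT = 126 in this example) arrives, the first byte of its payload is the event code that identifies the DTMF digit.

A single key press is sent as several packets that all share the same RTP timestamp, so duplicates are removed by comparing timestamps.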

The extracted digit is then processed by the voice robot’s logic.
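The parsing steps above can be sketched as follows. The code reads the fixed 12-byte RTP header, skips any CSRC entries and header extension, checks the payload type, and decodes the 4-byte telephone-event payload (event code, end flag and volume, duration). The class name `DtmfCollector` and the default PT of 126 are assumptions for illustration; RTP padding and packet reordering are not handled.

```python
import struct

# Event codes 0-15 from the telephone-event payload format.
EVENT_TO_DIGIT = "0123456789*#ABCD"


class DtmfCollector:
    """Turn a stream of RTP packets into de-duplicated DTMF digits."""

    def __init__(self, dtmf_pt=126):
        self.dtmf_pt = dtmf_pt      # payload type negotiated in SDP
        self._last_timestamp = None

    def feed(self, packet: bytes):
        """Return a digit for the first packet of each key press, else None."""
        if len(packet) < 12:
            return None
        b0, b1, _seq, timestamp, _ssrc = struct.unpack("!BBHII", packet[:12])
        if b0 >> 6 != 2:                     # RTP version must be 2
            return None
        if b1 & 0x7F != self.dtmf_pt:        # audio or other payload
            return None
        offset = 12 + 4 * (b0 & 0x0F)        # skip CSRC identifiers
        if b0 & 0x10:                        # skip header extension
            if len(packet) < offset + 4:
                return None
            (ext_words,) = struct.unpack("!H", packet[offset + 2:offset + 4])
            offset += 4 + 4 * ext_words
        payload = packet[offset:]
        if len(payload) < 4:
            return None
        event, _flags_volume, _duration = struct.unpack("!BBH", payload[:4])
        if event >= len(EVENT_TO_DIGIT):
            return None
        if timestamp == self._last_timestamp:
            return None                      # repeat of the same key press
        self._last_timestamp = timestamp
        return EVENT_TO_DIGIT[event]
```

Because every packet of one key press carries the timestamp of the moment the key went down, pressing the same key twice still yields two digits: the second press has a new timestamp.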

By following RFC2833, the 58.com TEG AI Lab built a SIP‑based voice robot that reliably captures user keypad input, providing a reference implementation for similar systems.

Conclusion : Technical documentation on SIP‑based DTMF capture is scarce; this article summarizes the RFC2833 specification and the robot’s implementation to guide readers in building comparable solutions.
