Overview

When you open joinvoxa.com and dial a number, a lot happens in roughly 800 milliseconds. A browser API captures your voice, a signaling protocol establishes the session, media packets traverse the internet, and a carrier bridges the gap to the public switched telephone network (PSTN) — the global infrastructure behind every landline and mobile number on earth.

This article walks through every layer of that path, with enough technical detail to be useful for developers who want to understand or evaluate the stack.

Layer 1: The Browser — getUserMedia and WebRTC

Everything starts with a single browser API call:

```javascript
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
```

This triggers the browser's permission prompt. Once granted, the browser opens a handle to the system's audio input device and delivers a `MediaStream` object containing an audio track, typically sampled at 48 kHz (the actual rate depends on the device and browser).

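The browser's WebRTC stack then handles the rest of the media pipeline automatically. The first three stages below run on the captured track; encoding happens once the track is added to an `RTCPeerConnection`: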

  • Acoustic echo cancellation (AEC): Removes the speaker output from the microphone signal so the far end doesn't hear their own voice reflected back.
  • Noise suppression: Applies a DSP filter to attenuate broadband background noise.
  • Automatic gain control (AGC): Normalises microphone volume so quiet speakers and loud environments are both handled cleanly.
  • Encoding: Compresses the audio using the Opus codec at 16–32 kbps for voice (compared to 64 kbps for the older G.711 used on legacy PSTN lines).
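The processing stages above can be requested explicitly through standard `getUserMedia` constraints. The following is a minimal sketch, not Voxa's actual client code: the helper names `buildVoiceConstraints` and `captureVoice` are hypothetical, and browsers are free to ignore any constraint they cannot honour.

```javascript
// Build getUserMedia constraints for a voice call. All three processing
// stages are requested by default; a browser may ignore any of them.
function buildVoiceConstraints(overrides = {}) {
  return {
    audio: {
      echoCancellation: true,
      noiseSuppression: true,
      autoGainControl: true,
      channelCount: 1, // mono is sufficient for telephony
      ...overrides,
    },
    video: false,
  };
}

// Browser-only: capture the microphone and report which settings applied.
async function captureVoice(overrides) {
  const stream = await navigator.mediaDevices.getUserMedia(
    buildVoiceConstraints(overrides)
  );
  const [track] = stream.getAudioTracks();
  // getSettings() reflects what the browser actually chose.
  return { stream, settings: track.getSettings() };
}
```

Reading the result of `getSettings()` is the reliable way to find out whether, for example, echo cancellation is really active on a given device.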

Layer 2: Signaling — SDP and the Offer/Answer Model

WebRTC is a media transport standard, not a signaling protocol. Something external has to negotiate the session parameters before media flows. Voxa uses a WebSocket-based signaling channel over TLS.

The exchange follows the SDP offer/answer model:

  1. Offer: The browser generates an SDP (Session Description Protocol) blob describing its capabilities — supported codecs, ICE candidates, DTLS fingerprint.
  2. Server receives the offer and forwards it (transformed) toward the SIP layer.
  3. Answer: The SIP infrastructure returns its own SDP, which the browser sets as the remote description.
  4. ICE candidates are trickled over the WebSocket as they are discovered on both sides.
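The four steps above map onto a small amount of browser code. This sketch assumes a hypothetical JSON message format (`offer`, `answer`, `candidate`) on the WebSocket; Voxa's real wire format is not documented here, so treat the message shapes and the function name `negotiate` as illustrative only.

```javascript
// pc is an RTCPeerConnection with the microphone track already added;
// socket is an open WebSocket. Message shapes are hypothetical.
async function negotiate(pc, socket) {
  // Step 4 (local side): trickle ICE candidates as they are discovered.
  pc.onicecandidate = (event) => {
    if (event.candidate) {
      socket.send(JSON.stringify({ type: "candidate", candidate: event.candidate }));
    }
  };

  // Steps 3 and 4 (remote side): apply the answer and remote candidates.
  socket.onmessage = async (event) => {
    const msg = JSON.parse(event.data);
    if (msg.type === "answer") {
      await pc.setRemoteDescription({ type: "answer", sdp: msg.sdp });
    } else if (msg.type === "candidate") {
      await pc.addIceCandidate(msg.candidate);
    }
  };

  // Steps 1 and 2: create the offer, apply it locally, send it to the server.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  socket.send(JSON.stringify({ type: "offer", sdp: offer.sdp }));
}
```

A production client would also queue remote candidates that arrive before the answer has been applied, and handle errors and renegotiation; those are left out to keep the flow visible.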

A condensed example of what an SDP offer looks like:

```
v=0
o=- 4611733 2 IN IP4 127.0.0.1
s=-
t=0 0
m=audio 9 UDP/TLS/RTP/SAVPF 111 103 9 0 8
a=rtpmap:111 opus/48000/2
a=fmtp:111 minptime=10;useinbandfec=1
a=rtpmap:103 ISAC/16000
a=ice-ufrag:Xj3m
a=ice-pwd:aRx9k2...
a=fingerprint:sha-256 A3:1F:...
```

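The `useinbandfec=1` parameter in the Opus fmtp line is significant: it signals support for Opus in-band Forward Error Correction, in which a packet can carry a lower-bitrate copy of the previous frame. When an isolated packet is lost, the decoder rebuilds it from the next one, so packet loss of up to ~10% can be concealed with few audible artefacts.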
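To check what a given SDP actually advertises for Opus, the `rtpmap` and `fmtp` lines can be parsed directly. This is a minimal sketch for inspection and debugging; `parseOpusParams` is a hypothetical helper, and it only looks at the first Opus payload type it finds.

```javascript
// Extract the Opus payload type, clock rate, channel count and fmtp
// parameters from an SDP string. Returns null if Opus is not offered.
function parseOpusParams(sdp) {
  const lines = sdp.split(/\r?\n/);
  const rtpmap = lines
    .map((line) => /^a=rtpmap:(\d+) opus\/(\d+)(?:\/(\d+))?/i.exec(line))
    .find(Boolean);
  if (!rtpmap) return null;

  const payloadType = Number(rtpmap[1]);
  const fmtpLine = lines.find((line) => line.startsWith(`a=fmtp:${payloadType} `));
  const params = {};
  if (fmtpLine) {
    // Everything after the first space is a ";"-separated key=value list.
    for (const pair of fmtpLine.slice(fmtpLine.indexOf(" ") + 1).split(";")) {
      const [key, value] = pair.trim().split("=");
      if (key) params[key] = value;
    }
  }
  return {
    payloadType,
    clockRate: Number(rtpmap[2]),
    channels: Number(rtpmap[3] || 1),
    params,
  };
}

const opus = parseOpusParams(
  "m=audio 9 UDP/TLS/RTP/SAVPF 111\r\n" +
  "a=rtpmap:111 opus/48000/2\r\n" +
  "a=fmtp:111 minptime=10;useinbandfec=1\r\n"
);
console.log(opus.params.useinbandfec === "1"); // true when FEC is advertised
```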

Layer 3: ICE — Finding a Network Path

ICE (Interactive Connectivity Establishment) is the protocol that figures out how packets should actually travel between your browser and Voxa's media servers.

It works by gathering "candidates" — potential network addresses — on both sides:

  • Host candidates: The device's local IP addresses (e.g., 192.168.1.42)
  • Server-reflexive candidates: The public IP:port that a STUN server on the internet sees for the device, which reveals the address the NAT has mapped it to
  • Relay candidates: Addresses on a TURN relay server, used when direct paths are blocked by firewalls
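Each candidate travels as a single text line whose `typ` field is `host`, `srflx` (server-reflexive) or `relay`. The sketch below parses the leading fields of such a line; `parseCandidate` is a hypothetical helper and the sample address comes from a documentation range.

```javascript
// Parse the fixed leading fields of an ICE candidate line, with or
// without the "a=" SDP prefix. Returns null if the line does not match.
function parseCandidate(line) {
  const m = /^(?:a=)?candidate:(\S+) (\d+) (\S+) (\d+) (\S+) (\d+) typ (\S+)/.exec(line);
  if (!m) return null;
  return {
    foundation: m[1],
    component: Number(m[2]), // 1 = RTP
    transport: m[3].toLowerCase(),
    priority: Number(m[4]),
    address: m[5],
    port: Number(m[6]),
    type: m[7], // "host", "srflx", "prflx" or "relay"
  };
}

const sample = parseCandidate(
  "candidate:842163049 1 udp 1677729535 203.0.113.7 46154 typ srflx raddr 192.168.1.42 rport 46154"
);
console.log(sample.type, sample.address, sample.port); // srflx 203.0.113.7 46154
```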

Voxa operates STUN and TURN infrastructure in multiple regions. ICE runs connectivity checks on candidate pairs and nominates the highest-priority pair that succeeds; priorities favour direct paths over relayed ones, which usually also means lower latency. For most users on home broadband, a server-reflexive path works. Networks with strict UDP egress filtering, such as many corporate firewalls, fall back to a TURN relay over TCP port 443, which passes through almost every firewall configuration.
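On the browser side, the STUN and TURN servers are supplied through the `iceServers` field of the `RTCPeerConnection` configuration. In this sketch the hostnames under `example.com` and the helper name `buildIceConfig` are placeholders, not Voxa's real endpoints, and the credentials are assumed to be issued by the signaling server.

```javascript
// Build an RTCConfiguration with STUN, TURN over UDP, and TURN over
// TLS on port 443 as the firewall-friendly fallback.
function buildIceConfig({ username, credential, relayOnly = false }) {
  return {
    iceServers: [
      { urls: "stun:stun.example.com:3478" },
      {
        urls: [
          "turn:turn.example.com:3478?transport=udp",
          "turns:turn.example.com:443?transport=tcp",
        ],
        username,
        credential,
      },
    ],
    // "relay" forces all media through TURN, which is useful for testing
    // the fallback path; "all" lets ICE try direct paths first.
    iceTransportPolicy: relayOnly ? "relay" : "all",
  };
}

// Browser usage:
// const pc = new RTCPeerConnection(buildIceConfig({ username, credential }));
```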

Layer 4: DTLS-SRTP — Encryption

Once ICE nominates a path, the browser and server perform a DTLS (Datagram TLS) handshake over UDP. This establishes keying material used to encrypt all subsequent media using SRTP (Secure Real-time Transport Protocol).

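The DTLS fingerprint in the SDP (the `a=fingerprint` line above) ties the encryption keys to the signaling session: each side checks that the certificate presented in the DTLS handshake matches the fingerprint it received over signaling. This prevents man-in-the-middle attacks on the media path, provided the signaling channel itself is authenticated and intact, which is why signaling runs over TLS.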
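Browsers perform the fingerprint check internally, but it can be useful to extract the value when debugging or logging a session. The following sketch pulls the first `a=fingerprint` line out of an SDP and compares two fingerprints case-insensitively; both helper names are hypothetical.

```javascript
// Return { algorithm, value } from the first a=fingerprint line, or null.
function extractFingerprint(sdp) {
  const match = /^a=fingerprint:(\S+) ([0-9A-Fa-f:]+)\s*$/m.exec(sdp);
  if (!match) return null;
  return {
    algorithm: match[1].toLowerCase(),
    value: match[2].toUpperCase(), // hex digits are case-insensitive
  };
}

// Two fingerprints match only if both algorithm and digest agree.
function fingerprintsMatch(a, b) {
  return a !== null && b !== null &&
    a.algorithm === b.algorithm && a.value === b.value;
}
```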

All Voxa calls are encrypted between the browser and the media server. The media server is the point where packets are decrypted for PSTN bridging; this is architecturally unavoidable when connecting to the telephone network, which does not support end-to-end encryption.

Layer 5: Voxa's Media Server

Voxa's media servers run on a cluster of bare-metal instances distributed across multiple cloud regions. Their job:

  1. Receive RTP packets from the browser over the ICE-selected path
  2. Transcode from Opus (what browsers send) to G.711 μ-law or G.711 A-law (what most PSTN carriers expect)
  3. Manage jitter buffering — absorbing variance in packet arrival times and delivering a smooth audio stream downstream
  4. Apply SIPREC call-recording hooks, if enabled for the account
  5. Forward the G.711 stream to the SIP trunk via the SIP/RTP path
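Jitter buffering (step 3) is the least intuitive item in this list, so here is a deliberately simplified sketch of the idea. It is not Voxa's implementation: it uses a fixed depth rather than adapting to network conditions, and it ignores the 16-bit wraparound of real RTP sequence numbers.

```javascript
// Fixed-depth jitter buffer: hold a few packets, then release them in
// sequence order at a steady rate, marking gaps for loss concealment.
class JitterBuffer {
  constructor(depth = 3) {
    this.depth = depth;       // packets to collect before playout starts
    this.pending = new Map(); // sequence number -> payload
    this.nextSeq = null;      // next sequence number due for playout
    this.started = false;
  }

  // Called whenever a packet arrives, in any order.
  push(seq, payload) {
    if (this.nextSeq === null) this.nextSeq = seq;
    if (seq < this.nextSeq) return false; // too late, its slot was played
    this.pending.set(seq, payload);
    if (this.pending.size >= this.depth) this.started = true;
    return true;
  }

  // Called once per playout tick (for example every 20 ms).
  // Returns undefined while prefilling, null for a missing packet.
  pop() {
    if (!this.started) return undefined;
    const payload = this.pending.get(this.nextSeq);
    this.pending.delete(this.nextSeq);
    this.nextSeq += 1;
    return payload === undefined ? null : payload;
  }
}
```

The trade-off is visible in the `depth` parameter: a deeper buffer absorbs more arrival-time variance but adds that many packet intervals of delay to every call.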

Transcoding from Opus to G.711 is a lossy step: G.711 is a narrowband codec (8 kHz sampling), so even at 64 kbps it has lower fidelity than wideband Opus at 32 kbps. It is nonetheless unavoidable for PSTN compatibility. On calls to destinations that support G.722 wideband (some modern carriers and enterprise PBXes), Voxa negotiates a wideband codec for better quality.
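To make the G.711 side concrete, the sketch below shows the μ-law companding step on its own, using the widely circulated 16-bit PCM formulation. A real transcoder also has to decode Opus and resample from 48 kHz to 8 kHz before this step; neither is shown, and the function names are illustrative.

```javascript
const BIAS = 0x84;  // 132, added before the segment search
const CLIP = 32635; // largest magnitude that still fits after biasing

// Compress one signed 16-bit PCM sample into an 8-bit mu-law byte.
function linearToMulaw(sample) {
  const sign = (sample >> 8) & 0x80;
  if (sign) sample = -sample;
  if (sample > CLIP) sample = CLIP;
  sample += BIAS;

  // Find the segment (exponent): position of the highest set bit.
  let exponent = 7;
  for (let mask = 0x4000; (sample & mask) === 0 && exponent > 0; mask >>= 1) {
    exponent -= 1;
  }
  const mantissa = (sample >> (exponent + 3)) & 0x0f;
  // Mu-law bytes are transmitted bit-inverted.
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}

// Expand an 8-bit mu-law byte back to a signed 16-bit PCM sample.
function mulawToLinear(byte) {
  const inverted = ~byte & 0xff;
  const exponent = (inverted >> 4) & 0x07;
  const mantissa = inverted & 0x0f;
  const magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS;
  return inverted & 0x80 ? -magnitude : magnitude;
}

console.log(linearToMulaw(0).toString(16)); // ff (digital silence)
```

Each sample becomes one byte, so 8,000 samples per second yields the 64 kbps figure quoted for G.711. The round trip is not exact: quiet samples are preserved closely while loud ones are quantised coarsely.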

Layer 6: SIP Trunking

SIP (Session Initiation Protocol) is the signaling protocol of the modern telephone network. Voxa's platform maintains persistent SIP trunk registrations with multiple tier-1 carrier interconnects.

When a call is initiated:

  1. Voxa's call controller sends a SIP `INVITE` to the carrier with the E.164-formatted destination number in the Request-URI and `To:` header, Voxa's signaling address in the `Contact:` header, and an SDP body that names Voxa's media server as the RTP endpoint
  2. The carrier responds with `100 Trying`, then `180 Ringing` once the call is alerting at the destination
  3. The destination answers — carrier sends `200 OK` with its own SDP media description
  4. Voxa's controller sends `ACK` — the call is established and RTP media flows directly between Voxa's media server and the carrier's media gateway
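For readers who have not seen SIP on the wire, the sketch below assembles a minimal `INVITE` of the kind described in step 1. The hostnames, identifiers and the `buildInvite` helper are all hypothetical, and a real trunk would add authentication and carrier-specific headers.

```javascript
// E.164: "+" followed by up to 15 digits, no leading zero.
const E164 = /^\+[1-9]\d{1,14}$/;

// Assemble a minimal SIP INVITE with an SDP body.
function buildInvite({ to, from, localHost, carrierHost, callId, branch, tag, sdp }) {
  if (!E164.test(to) || !E164.test(from)) {
    throw new Error("numbers must be in E.164 format");
  }
  // Content-Length counts bytes, not characters.
  const bodyLength = new TextEncoder().encode(sdp).length;
  const lines = [
    `INVITE sip:${to}@${carrierHost} SIP/2.0`,
    `Via: SIP/2.0/UDP ${localHost}:5060;branch=z9hG4bK${branch}`,
    "Max-Forwards: 70",
    `From: <sip:${from}@${localHost}>;tag=${tag}`,
    `To: <sip:${to}@${carrierHost}>`,
    `Call-ID: ${callId}@${localHost}`,
    "CSeq: 1 INVITE",
    `Contact: <sip:${localHost}:5060>`,
    "Content-Type: application/sdp",
    `Content-Length: ${bodyLength}`,
    "", // blank line separates headers from the body
    sdp,
  ];
  return lines.join("\r\n"); // SIP requires CRLF line endings
}
```

Note that the media address lives in the SDP body, not in the SIP headers, which is why signaling and media can terminate on different servers.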

The SIP flow in simplified form:

```
Browser ──WebRTC──► Voxa Media Server ──RTP/G.711──► Carrier Gateway ──PSTN──► Phone
Browser ──WS/SDP──► Voxa Signaling ──SIP──────► Carrier SIP Proxy
```

Layer 7: PSTN Interconnect — The Last Mile

At the carrier's media gateway, the G.711 RTP stream is converted to a TDM (Time Division Multiplexing) signal or handed off to the destination carrier via SS7 signalling, which routes the call to the correct local exchange and ultimately rings the destination handset.

This final segment is invisible to Voxa — once the call is handed to the carrier, Voxa has no visibility into how the carrier routes it domestically. This is why call quality to some destinations (particularly those with aging PSTN infrastructure) can vary: the last-mile connection is outside VoIP providers' control.

Latency Budget

For a call from a browser in London to a landline in New York, a typical latency budget:

| Segment | Typical Latency |
|---|---|
| Browser audio capture + Opus encoding | 10–20 ms |
| ICE path (London → Voxa EU media server) | 5–15 ms |
| Voxa media server processing + transcode | 2–5 ms |
| SIP trunk (Voxa EU → US carrier gateway) | 80–110 ms |
| Carrier PSTN routing (US domestic) | 5–20 ms |
| Total one-way | ~100–170 ms |

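Round-trip delay is double this: ~200–340 ms. ITU-T G.114 recommends keeping one-way delay under 150 ms for comfortable conversation and treats delays between 150 and 400 ms as tolerable with some loss of interactivity. Most transatlantic calls therefore land at or slightly above the 150 ms mark, well inside the acceptable range.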
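The totals can be reproduced by summing the per-segment ranges from the table. The snippet below is just that arithmetic; the segment names are abbreviated and the 150 ms threshold is the G.114 one-way figure quoted above.

```javascript
// Per-segment one-way latency ranges from the table, in milliseconds.
const budget = [
  { segment: "capture + Opus encoding", min: 10, max: 20 },
  { segment: "ICE path to EU media server", min: 5, max: 15 },
  { segment: "media server + transcode", min: 2, max: 5 },
  { segment: "SIP trunk EU to US", min: 80, max: 110 },
  { segment: "US carrier PSTN routing", min: 5, max: 20 },
];

function totalLatency(segments) {
  return segments.reduce(
    (total, s) => ({ min: total.min + s.min, max: total.max + s.max }),
    { min: 0, max: 0 }
  );
}

const oneWay = totalLatency(budget);
const roundTrip = { min: oneWay.min * 2, max: oneWay.max * 2 };
const G114_ONE_WAY_MS = 150;

console.log(oneWay);    // { min: 102, max: 170 }
console.log(roundTrip); // { min: 204, max: 340 }
console.log(oneWay.min <= G114_ONE_WAY_MS, oneWay.max <= G114_ONE_WAY_MS); // true false
```

The transatlantic trunk segment dominates the sum, so the choice of media server region matters far more than codec or processing time.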

What This Means for Developers Evaluating Voxa

A few architectural implications worth understanding:

Firewall requirements: For optimal performance, allow outbound UDP to port 3478 (STUN and TURN) and to the ephemeral UDP ports that carry media; port 5349 is the standard port for TURN over TLS. If your environment blocks all UDP, TCP TURN fallback on port 443 ensures calls still work, with slightly higher latency.
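To find out which path a call actually took in a given network, the statistics returned by `pc.getStats()` can be inspected. This sketch assumes the standard stats types (`transport`, `candidate-pair`, `local-candidate`); some browsers omit fields, so `describeSelectedPath`, a hypothetical helper, falls back to the nominated pair.

```javascript
// Given an RTCStatsReport (or any Map of stats objects keyed by id),
// report the type and protocol of the selected local candidate.
function describeSelectedPath(report) {
  const stats = [...report.values()];
  const byId = new Map(stats.map((s) => [s.id, s]));

  const transport = stats.find(
    (s) => s.type === "transport" && s.selectedCandidatePairId
  );
  const pair = transport
    ? byId.get(transport.selectedCandidatePairId)
    : stats.find(
        (s) => s.type === "candidate-pair" && s.nominated && s.state === "succeeded"
      );
  if (!pair) return null;

  const local = byId.get(pair.localCandidateId);
  if (!local) return null;

  const relayed = local.candidateType === "relay";
  return {
    candidateType: local.candidateType, // "host", "srflx", "prflx" or "relay"
    // For relayed paths, relayProtocol is the transport used to reach TURN.
    protocol: relayed ? (local.relayProtocol || local.protocol) : local.protocol,
    relayed,
  };
}

// Browser usage:
// console.log(describeSelectedPath(await pc.getStats()));
```

A result of `relay` with protocol `tcp` or `tls` is a strong hint that UDP egress is blocked on that network.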

Browser compatibility: WebRTC is supported in all modern browsers (Chrome, Firefox, Safari 11+, Edge). Safari has historically had quirks with certain Opus fmtp parameters — Voxa's signaling layer normalises these differences automatically.
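A page can check for the two required APIs before offering a call button. This is a minimal feature-detection sketch with a hypothetical helper name; note that `navigator.mediaDevices` is only exposed in secure contexts (HTTPS or localhost), so the check also fails on plain HTTP pages.

```javascript
// Return true if the environment exposes both RTCPeerConnection and
// getUserMedia. The scope parameter exists so the check can be tested.
function supportsWebRtcCalling(scope = globalThis) {
  const hasPeerConnection = typeof scope.RTCPeerConnection === "function";
  const mediaDevices = scope.navigator && scope.navigator.mediaDevices;
  const hasCapture = Boolean(
    mediaDevices && typeof mediaDevices.getUserMedia === "function"
  );
  return hasPeerConnection && hasCapture;
}
```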

No SDK required: Voxa's browser client is plain WebRTC — no proprietary SDK, no native app, no plugin. This is a deliberate architectural choice: the browser's built-in WebRTC implementation is audited, maintained by browser vendors, and works across platforms without any code distribution.

Call legs and billing: Voxa bills on the B-leg duration — the time from when the destination phone is answered (`200 OK`) to when the call is terminated (`BYE`). Ring time is not billed.
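The billing rule reduces to a subtraction between two signaling timestamps. The sketch below assumes a hypothetical event log in which `answer` marks the `200 OK` to the `INVITE` and `bye` marks the `BYE`; any rounding or minimum-charge rules are not covered by the description above and are left out.

```javascript
// Compute billable seconds from a list of { kind, at } events, where
// "at" is a timestamp in milliseconds. Unanswered or still-active
// calls yield 0 in this simplified version.
function billableSeconds(events) {
  const answer = events.find((e) => e.kind === "answer"); // 200 OK to INVITE
  const bye = events.find((e) => e.kind === "bye");       // BYE
  if (!answer || !bye) return 0;
  return Math.max(0, (bye.at - answer.at) / 1000);
}

const callLog = [
  { kind: "invite", at: 0 },
  { kind: "ringing", at: 4000 },  // 180 Ringing: not billed
  { kind: "answer", at: 9000 },
  { kind: "bye", at: 69000 },
];
console.log(billableSeconds(callLog)); // 60
```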

The call path from the browser to the carrier hand-off involves no black boxes: every component is a well-documented open standard. That's the architectural case for WebRTC-to-PSTN calling as the foundation for modern communication infrastructure.