
Solving Browser WebSocket Authentication Challenges: A Practical Proxy Pattern for Doubao Speech Recognition


The browser WebSocket API does not support custom HTTP headers, which creates a challenge for speech recognition services that require authentication data in headers. This article shares how we solved that problem in the HagiCode project with a backend proxy pattern, and how the approach evolved from playground experiments to production use.

When we started building speech recognition for the HagiCode project, we confidently chose ByteDance’s Doubao speech recognition service. The initial design was straightforward: let the frontend connect directly to Doubao’s WebSocket service. How hard could that be? Just open a connection and send some data, right?

Then came the surprise: Doubao’s API requires authentication information to be passed through HTTP headers, including things like accessToken and secretKey. That immediately became awkward, because the browser WebSocket API simply does not support setting custom headers.

So what do you do when the browser will not let you send them?

At the time, we weighed two options:

  1. Put the credentials into URL query parameters - simple and blunt
  2. Add a proxy layer on the backend - more work at first glance

The first option exposes credentials directly in frontend code and local storage. Is that really safe? I was not comfortable with it. On top of that, some APIs require header-based verification, so this approach is not even viable.

In the end, we chose the second option: implement a WebSocket proxy on the backend. We first validated the pattern in our playground environment, and only after confirming its stability did we move it into production. After all, nobody wants production to double as a lab experiment.

The approach shared in this article comes from our practical experience in the HagiCode project.

HagiCode is an AI coding assistant project with voice interaction support. Because we needed to call a speech recognition service from the frontend, we ran straight into this WebSocket authentication problem, which led us to the solution described here. Sometimes these technical roadblocks are frustrating, but they also force you to learn patterns that turn out to be useful later.

The standard WebSocket API looks wonderfully simple:

const ws = new WebSocket('wss://example.com/ws');

But that simplicity is exactly where the problem lies - it only passes parameters in the URL, and it cannot set headers the way an HTTP request can:

// This is not supported in the WebSocket API
const ws = new WebSocket('wss://example.com/ws', {
  headers: {
    'Authorization': 'Bearer token'
  }
});

And that is the core issue. For services like Doubao speech recognition that depend on header-based authentication, this limitation is a hard blocker.

Once you accept that constraint, the architecture has to change.

When designing the solution, we compared the trade-offs carefully.

Decision 1: Choosing the proxy pattern

We compared two approaches:

| Option | Pros | Cons | Decision |
| --- | --- | --- | --- |
| Native WebSocket | Lightweight, simple, direct forwarding | Connection management must be handled manually | Chosen |
| SignalR | Automatic reconnection, strong typing | Overly complex, extra dependencies | Rejected |

We ultimately chose native WebSocket. To be honest, it was the lightest option and a better fit for simple bidirectional binary stream forwarding. Pulling in SignalR would have felt like overengineering for this use case, and it could add extra latency.

Decision 2: Connection management strategy

We adopted a “one connection, one session” model - each frontend WebSocket connection maps to its own independent Doubao backend connection.

The benefits are straightforward:

  • Simple to implement and aligned with the common usage pattern
  • Easier to debug and troubleshoot
  • Good resource isolation, preventing interference between sessions

Put simply, the direct solution is sometimes the best one. Complexity does not automatically make a design better.

Decision 3: Storing authentication data

Credentials are stored in backend configuration files (appsettings.yml or environment variables) and loaded through dependency injection:

  • Simple configuration model that matches existing backend conventions
  • Sensitive data never reaches the frontend
  • Supports multi-environment setup for development, testing, and production

That level of separation matters. No one wants credentials floating around in places they should not be.
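As a sketch, the backend configuration for this could look like the following. The key names here are illustrative, not the actual HagiCode schema; the `${...}` placeholders stand for values injected from environment variables.

```yaml
# appsettings.yml (illustrative key names, values from environment)
Doubao:
  AppId: "${DOUBAO_APP_ID}"
  AccessToken: "${DOUBAO_ACCESS_TOKEN}"
  ServiceUrl: "wss://openspeech.bytedance.com/"
  SampleRate: 16000
```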

The overall data flow looks like this:

Frontend (browser)
    │  ws://backend/api/voice/ws
    │  WebSocket (binary)
    ▼
Backend (proxy)
    │  wss://openspeech.bytedance.com/
    │  (with authentication headers)
    ▼
Doubao API

The flow itself is not complicated:

  1. The frontend connects to the backend proxy through WebSocket
  2. The backend proxy receives audio data and connects to the Doubao API with authenticated headers
  3. The Doubao API returns recognition results, and the proxy forwards them to the frontend
  4. The whole process remains fully asynchronous with bidirectional streaming

Once the responsibilities are split clearly, the design becomes quite natural.
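To make the split concrete, here is a minimal sketch of the one thing the proxy can do that the browser cannot: attach credentials as HTTP headers before dialing upstream. The header names below are placeholders, not Doubao's real ones, which come from its API documentation.

```javascript
// Sketch: the backend proxy builds upstream connection options,
// attaching credentials as HTTP headers the browser cannot send.
// Header names here are hypothetical placeholders.
function buildUpstreamOptions(config) {
  if (!config.appId || !config.accessToken) {
    throw new Error('appId and accessToken are required');
  }
  return {
    url: config.serviceUrl,
    headers: {
      'X-Api-App-Key': config.appId,
      'Authorization': `Bearer ${config.accessToken}`
    }
  };
}
```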

app.Map("/ws", async context =>
{
    if (context.WebSockets.IsWebSocketRequest)
    {
        // Read configuration from query parameters
        var appId = context.Request.Query["appId"];
        var accessToken = context.Request.Query["accessToken"];

        // Validate required parameters
        if (string.IsNullOrEmpty(appId) || string.IsNullOrEmpty(accessToken))
        {
            context.Response.StatusCode = 400;
            return;
        }

        // Accept the WebSocket connection
        using var webSocket = await context.WebSockets.AcceptWebSocketAsync();

        // Message handling loop
        var buffer = new byte[4096];
        while (!webSocket.CloseStatus.HasValue)
        {
            var result = await webSocket.ReceiveAsync(
                new ArraySegment<byte>(buffer), CancellationToken.None);
            if (result.MessageType == WebSocketMessageType.Close)
            {
                await webSocket.CloseAsync(
                    result.CloseStatus.Value,
                    result.CloseStatusDescription,
                    CancellationToken.None);
                break;
            }
            // Process audio data
            await HandleAudioDataAsync(buffer, result.Count);
        }
    }
});
public class DoubaoSessionManager : IDoubaoSessionManager
{
    private readonly ConcurrentDictionary<string, DoubaoSession> _sessions = new();

    public DoubaoSession CreateSession(string connectionId)
    {
        var session = new DoubaoSession(connectionId);
        _sessions[connectionId] = session;
        return session;
    }

    public async Task SendAudioAsync(string connectionId, byte[] audioData)
    {
        if (_sessions.TryGetValue(connectionId, out var session))
        {
            await session.SendAudioAsync(audioData);
        }
    }

    public void RemoveSession(string connectionId)
    {
        if (_sessions.TryRemove(connectionId, out var session))
        {
            session.Dispose();
        }
    }
}

Using ConcurrentDictionary for session management means thread safety is largely handled for us. Each incoming connection gets its own session, and cleanup happens automatically on disconnect.

public class ClientConfigDto
{
    public string AppId { get; set; } = null!;
    public string AccessToken { get; set; } = null!;
    public string? ServiceUrl { get; set; }
    public string? ResourceId { get; set; }
    public int? SampleRate { get; set; }
    public int? BitsPerSample { get; set; }
    public int? Channels { get; set; }

    public void Validate()
    {
        if (string.IsNullOrWhiteSpace(AppId))
            throw new ArgumentException("AppId is required");
        if (string.IsNullOrWhiteSpace(AccessToken))
            throw new ArgumentException("AccessToken is required");
    }
}

Configuration validation helps surface problems during startup instead of letting them fail later at runtime. That safeguard is worth having.

The frontend and backend use JSON text messages for control, while binary messages carry audio data.

Example control message:

{
  "type": "control",
  "messageId": "msg_123",
  "timestamp": "2026-03-03T10:00:00Z",
  "payload": {
    "command": "StartRecognition",
    "parameters": {
      "hotwordId": "hotword1",
      "boosting_table_id": "table123"
    }
  }
}

Example recognition result:

{
  "type": "result",
  "timestamp": "2026-03-03T10:00:03Z",
  "payload": {
    "text": "Hello world",
    "confidence": 0.95,
    "duration": 1500,
    "isFinal": true,
    "utterances": [
      {
        "text": "Hello",
        "startTime": 0,
        "endTime": 800,
        "definite": true
      }
    ]
  }
}

This design separates control signals from audio payloads, which makes the system easier to reason about and implement. Splitting responsibilities cleanly is often the right call.
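One way to express that separation on the receiving side, as a sketch: route binary frames to the audio path and parse everything else as a JSON envelope. The handler names here are hypothetical, not from the HagiCode codebase.

```javascript
// Sketch: dispatch an incoming WebSocket frame by payload kind.
// Binary frames carry audio; text frames carry JSON envelopes.
function dispatchFrame(data, handlers) {
  if (data instanceof ArrayBuffer || ArrayBuffer.isView(data)) {
    handlers.onAudio(data);          // raw PCM bytes
    return 'binary';
  }
  const message = JSON.parse(data);  // JSON control/result envelope
  handlers.onMessage(message);
  return message.type;
}
```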

class DoubaoVoiceClient {
  constructor(config) {
    this.config = config;
    this.ws = null;
  }

  async connect() {
    const url = new URL(this.config.wsUrl);
    // Add query parameters
    Object.entries(this.config.params).forEach(([key, value]) => {
      url.searchParams.set(key, value);
    });
    this.ws = new WebSocket(url);
    return new Promise((resolve, reject) => {
      this.ws.onopen = () => {
        console.log('[DoubaoVoice] Connected');
        resolve();
      };
      this.ws.onmessage = (event) => {
        this._handleMessage(JSON.parse(event.data));
      };
      this.ws.onerror = reject;
    });
  }

  _handleMessage(message) {
    switch (message.type) {
      case 'status':
        this._handleStatus(message.payload);
        break;
      case 'result':
        this.onResult?.(message.payload);
        break;
      case 'error':
        console.error('[DoubaoVoice] Error:', message.payload);
        break;
    }
  }
}

// Usage example
const client = new DoubaoVoiceClient({
  wsUrl: 'ws://localhost:5000/ws',
  params: {
    appId: 'your-app-id',
    accessToken: 'your-access-token',
    sampleRate: 16000,
    bitsPerSample: 16,
    channels: 1
  }
});

Using AudioWorkletNode for audio processing gives better performance:

// audio-worklet.js
class AudioProcessorWorklet extends AudioWorkletProcessor {
  process(inputs, outputs, parameters) {
    const input = inputs[0]?.[0];
    if (!input) return true;
    // Convert to 16-bit PCM
    const pcm = new Int16Array(input.length);
    for (let i = 0; i < input.length; i++) {
      pcm[i] = Math.max(-32768, Math.min(32767, Math.round(input[i] * 32767)));
    }
    this.port.postMessage({
      type: 'audioData',
      data: pcm.buffer
    }, [pcm.buffer]);
    return true;
  }
}
registerProcessor('audio-processor', AudioProcessorWorklet);

// Main thread code ("ws" is the WebSocket connection to the proxy,
// created elsewhere)
async function startAudioRecording() {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: {
      echoCancellation: true,
      noiseSuppression: true,
      autoGainControl: true,
      sampleRate: 48000
    }
  });
  const audioContext = new AudioContext();
  const audioSource = audioContext.createMediaStreamSource(stream);
  await audioContext.audioWorklet.addModule('/audio-worklet.js');
  const audioWorkletNode = new AudioWorkletNode(audioContext, 'audio-processor');
  audioWorkletNode.port.onmessage = (event) => {
    if (event.data.type === 'audioData' && ws?.readyState === WebSocket.OPEN) {
      ws.send(event.data.data); // Send binary data directly
    }
  };
  audioSource.connect(audioWorkletNode);
}

AudioWorklet performs far better than ScriptProcessorNode and avoids the audio stutter problems that older processing paths often introduce.

{
  "Serilog": {
    "MinimumLevel": {
      "Default": "Information",
      "Override": {
        "Microsoft": "Warning",
        "System": "Warning"
      }
    },
    "WriteTo": [
      { "Name": "Console" },
      {
        "Name": "File",
        "Args": { "path": "logs/log-.txt", "rollingInterval": "Day" }
      }
    ]
  },
  "Kestrel": {
    "Urls": "http://0.0.0.0:5000"
  }
}

Logging configuration matters because it makes problems much easier to trace. Serilog’s file sink can roll logs daily, which keeps individual log files at a manageable size.

  • Regularly log session state to trace the full connection lifecycle
  • Monitor the number and duration of audio segments to detect abnormal connections
  • Record connection status and reconnection behavior for the Doubao service

These are basic operational practices, but they make a real difference in production.

  • Capture and log all WebSocket exceptions
  • Use IAsyncDisposable to ensure resources are cleaned up
  • Implement graceful connection shutdown and timeout handling

In short, favor robustness.

  • Sample rate: 16000 Hz (recommended) or 8000 Hz
  • Bit depth: 16-bit
  • Channels: mono
  • Encoding: PCM (raw)

If the format is wrong, recognition may fail or quality may degrade significantly. These requirements matter.
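Since browsers typically capture at 48 kHz while the service expects 16 kHz, a resampling step is needed somewhere in the pipeline. As a sketch only, here is naive decimation by picking every third sample; a real implementation should apply a low-pass filter first to avoid aliasing.

```javascript
// Sketch: naive 48 kHz -> 16 kHz downsampling of Float32 samples
// into 16-bit PCM. Decimates by 3 with no anti-aliasing filter,
// which is acceptable only as an illustration.
function downsampleTo16k(float32Samples, factor = 3) {
  const out = new Int16Array(Math.floor(float32Samples.length / factor));
  for (let i = 0; i < out.length; i++) {
    const s = float32Samples[i * factor];
    out[i] = Math.max(-32768, Math.min(32767, Math.round(s * 32767)));
  }
  return out;
}
```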

  • Keep sensitive credentials only in backend configuration
  • Enforce connection limits to prevent resource exhaustion
  • Use HTTPS/WSS in production

Security is never a minor concern.
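For the connection-limit point, the core mechanism is just a counter checked before accepting a new socket. A minimal sketch (the limit value and method names are illustrative, not from the HagiCode codebase):

```javascript
// Sketch: cap concurrent proxy sessions so clients cannot
// exhaust backend resources. Names and limits are illustrative.
class ConnectionLimiter {
  constructor(maxConnections) {
    this.max = maxConnections;
    this.active = 0;
  }
  tryAcquire() {            // call before accepting a WebSocket
    if (this.active >= this.max) return false;
    this.active++;
    return true;
  }
  release() {               // call when the session closes
    this.active = Math.max(0, this.active - 1);
  }
}
```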

  • Use asynchronous operations to avoid blocking
  • Tune buffer sizes as needed (default: 4096 bytes)
  • Consider connection pooling and reuse strategies

Apply these optimizations where they make sense for your workload.

  1. Docker deployment: Package the proxy service as a container for easier scaling and management
  2. Load balancing: Use Nginx or Envoy as a reverse proxy for WebSocket traffic
  3. Health checks: Implement heartbeat-based checks to monitor service availability
  4. Log aggregation: Send logs to a centralized logging system such as ELK or Loki
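For the load-balancing item, the key detail with Nginx is that WebSocket upgrades must be forwarded explicitly. A minimal fragment as a sketch (the location path, upstream name, and timeout are placeholders):

```nginx
# Illustrative Nginx fragment: forward the WebSocket upgrade
# to the backend proxy. Names and paths are placeholders.
location /api/voice/ws {
    proxy_pass http://backend:5000;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 300s;   # keep long-lived streams alive
}
```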

Deployment can be simple or complex depending on the team and environment, so adjust accordingly.

The WebSocket proxy pattern solves the core problem that the browser WebSocket API does not support custom headers. In the HagiCode project, this pattern proved both feasible and stable as it moved from playground validation into production deployment.

Key takeaways:

  • A backend proxy can pass authentication information securely
  • Native WebSocket is lightweight and efficient for simple scenarios
  • The “one connection, one session” model simplifies both implementation and debugging
  • The frontend-backend protocol should separate control signals from audio data

If you are building a feature that also depends on WebSocket authentication, I hope this pattern gives you a useful starting point.

If you have questions, feel free to discuss them with us. Technical progress happens faster when people compare notes.


Thank you for reading. If you found this article helpful, please click the like button below 👍 so more people can discover it.

This content was created with AI-assisted collaboration, reviewed by me, and reflects my own views and position.