
Solving Browser WebSocket Authentication Challenges: A Practical Proxy Pattern for Doubao Speech Recognition


The browser WebSocket API does not support custom HTTP headers, which creates a challenge for speech recognition services that require authentication data in headers. This article shares how we solved that problem in the HagiCode project with a backend proxy pattern, and how the approach evolved from playground experiments to production use.

When we started building speech recognition for the HagiCode project, we confidently chose ByteDance’s Doubao speech recognition service. The initial design was straightforward: let the frontend connect directly to Doubao’s WebSocket service. How hard could that be? Just open a connection and send some data, right?

Then came the surprise: Doubao’s API requires authentication information to be passed through HTTP headers, including things like accessToken and secretKey. That immediately became awkward, because the browser WebSocket API simply does not support setting custom headers.

So what do you do when the browser will not let you send them?

At the time, we weighed two options:

  1. Put the credentials into URL query parameters - simple and blunt
  2. Add a proxy layer on the backend - more work at first glance

The first option exposes credentials directly in frontend code and local storage. Is that really safe? I was not comfortable with it. On top of that, some APIs require header-based verification, so this approach is not even viable.

In the end, we chose the second option: implement a WebSocket proxy on the backend. We first validated the pattern in our playground environment, and only after confirming its stability did we move it into production. After all, nobody wants production to double as a lab experiment.

The approach shared in this article comes from our practical experience in the HagiCode project.

HagiCode is an AI coding assistant project with voice interaction support. Because we needed to call a speech recognition service from the frontend, we ran straight into this WebSocket authentication problem, which led us to the solution described here. Sometimes these technical roadblocks are frustrating, but they also force you to learn patterns that turn out to be useful later.

The standard WebSocket API looks wonderfully simple:

const ws = new WebSocket('wss://example.com/ws');

But that simplicity is exactly where the problem lies - it only passes parameters in the URL, and it cannot set headers the way an HTTP request can:

// This is not supported in the WebSocket API
const ws = new WebSocket('wss://example.com/ws', {
  headers: {
    'Authorization': 'Bearer token'
  }
});

And that is the core issue. For services like Doubao speech recognition that depend on header-based authentication, this limitation is a hard blocker.

Once you accept that constraint, the architecture has to change.

When designing the solution, we compared the trade-offs carefully.

Decision 1: Choosing the proxy pattern

We compared two approaches:

| Option | Pros | Cons | Decision |
| --- | --- | --- | --- |
| Native WebSocket | Lightweight, simple, direct forwarding | Connection management must be handled manually | Chosen |
| SignalR | Automatic reconnection, strong typing | Overly complex, extra dependencies | Rejected |

We ultimately chose native WebSocket. To be honest, it was the lightest option and a better fit for simple bidirectional binary stream forwarding. Pulling in SignalR would have felt like overengineering for this use case, and it could add extra latency.

Decision 2: Connection management strategy

We adopted a “one connection, one session” model - each frontend WebSocket connection maps to its own independent Doubao backend connection.

The benefits are straightforward:

  • Simple to implement and aligned with the common usage pattern
  • Easier to debug and troubleshoot
  • Good resource isolation, preventing interference between sessions

Put simply, the direct solution is sometimes the best one. Complexity does not automatically make a design better.

Decision 3: Storing authentication data

Credentials are stored in backend configuration files (appsettings.yml or environment variables) and loaded through dependency injection:

  • Simple configuration model that matches existing backend conventions
  • Sensitive data never reaches the frontend
  • Supports multi-environment setup for development, testing, and production

That level of separation matters. No one wants credentials floating around in places they should not be.
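As a sketch, the backend configuration for this could look like the following. The key names here are illustrative, not the actual HagiCode schema; the `${...}` placeholders stand for values injected from environment variables.

```yaml
# appsettings.yml (illustrative key names, values from environment)
Doubao:
  AppId: "${DOUBAO_APP_ID}"
  AccessToken: "${DOUBAO_ACCESS_TOKEN}"
  ServiceUrl: "wss://openspeech.bytedance.com/"
  SampleRate: 16000
```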

The overall data flow looks like this:

Frontend (browser)
    │  ws://backend/api/voice/ws
    │  WebSocket (binary)
    ▼
Backend (proxy)
    │  wss://openspeech.bytedance.com/
    │  (with authentication headers)
    ▼
Doubao API

The flow itself is not complicated:

  1. The frontend connects to the backend proxy through WebSocket
  2. The backend proxy receives audio data and connects to the Doubao API with authenticated headers
  3. The Doubao API returns recognition results, and the proxy forwards them to the frontend
  4. The whole process remains fully asynchronous with bidirectional streaming

Once the responsibilities are split clearly, the design becomes quite natural.
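To make the split concrete, here is a minimal sketch of the one thing the proxy can do that the browser cannot: attach credentials as HTTP headers before dialing upstream. The header names below are placeholders, not Doubao's real ones, which come from its API documentation.

```javascript
// Sketch: the backend proxy builds upstream connection options,
// attaching credentials as HTTP headers the browser cannot send.
// Header names here are hypothetical placeholders.
function buildUpstreamOptions(config) {
  if (!config.appId || !config.accessToken) {
    throw new Error('appId and accessToken are required');
  }
  return {
    url: config.serviceUrl,
    headers: {
      'X-Api-App-Key': config.appId,
      'Authorization': `Bearer ${config.accessToken}`
    }
  };
}
```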

app.Map("/ws", async context =>
{
    if (context.WebSockets.IsWebSocketRequest)
    {
        // Read configuration from query parameters
        var appId = context.Request.Query["appId"];
        var accessToken = context.Request.Query["accessToken"];

        // Validate required parameters
        if (string.IsNullOrEmpty(appId) || string.IsNullOrEmpty(accessToken))
        {
            context.Response.StatusCode = 400;
            return;
        }

        // Accept the WebSocket connection
        using var webSocket = await context.WebSockets.AcceptWebSocketAsync();

        // Message handling loop
        var buffer = new byte[4096];
        while (!webSocket.CloseStatus.HasValue)
        {
            var result = await webSocket.ReceiveAsync(
                new ArraySegment<byte>(buffer), CancellationToken.None);
            if (result.MessageType == WebSocketMessageType.Close)
            {
                await webSocket.CloseAsync(
                    result.CloseStatus.Value,
                    result.CloseStatusDescription,
                    CancellationToken.None);
                break;
            }
            // Process audio data
            await HandleAudioDataAsync(buffer, result.Count);
        }
    }
});
public class DoubaoSessionManager : IDoubaoSessionManager
{
    private readonly ConcurrentDictionary<string, DoubaoSession> _sessions = new();

    public DoubaoSession CreateSession(string connectionId)
    {
        var session = new DoubaoSession(connectionId);
        _sessions[connectionId] = session;
        return session;
    }

    public async Task SendAudioAsync(string connectionId, byte[] audioData)
    {
        if (_sessions.TryGetValue(connectionId, out var session))
        {
            await session.SendAudioAsync(audioData);
        }
    }

    public void RemoveSession(string connectionId)
    {
        if (_sessions.TryRemove(connectionId, out var session))
        {
            session.Dispose();
        }
    }
}

Using ConcurrentDictionary for session management means thread safety is largely handled for us. Each incoming connection gets its own session, and cleanup happens automatically on disconnect.

public class ClientConfigDto
{
    public string AppId { get; set; } = null!;
    public string AccessToken { get; set; } = null!;
    public string? ServiceUrl { get; set; }
    public string? ResourceId { get; set; }
    public int? SampleRate { get; set; }
    public int? BitsPerSample { get; set; }
    public int? Channels { get; set; }

    public void Validate()
    {
        if (string.IsNullOrWhiteSpace(AppId))
            throw new ArgumentException("AppId is required");
        if (string.IsNullOrWhiteSpace(AccessToken))
            throw new ArgumentException("AccessToken is required");
    }
}

Configuration validation helps surface problems during startup instead of letting them fail later at runtime. That safeguard is worth having.

The frontend and backend use JSON text messages for control, while binary messages carry audio data.

Example control message:

{
  "type": "control",
  "messageId": "msg_123",
  "timestamp": "2026-03-03T10:00:00Z",
  "payload": {
    "command": "StartRecognition",
    "parameters": {
      "hotwordId": "hotword1",
      "boosting_table_id": "table123"
    }
  }
}

Example recognition result:

{
  "type": "result",
  "timestamp": "2026-03-03T10:00:03Z",
  "payload": {
    "text": "Hello world",
    "confidence": 0.95,
    "duration": 1500,
    "isFinal": true,
    "utterances": [
      {
        "text": "Hello",
        "startTime": 0,
        "endTime": 800,
        "definite": true
      }
    ]
  }
}

This design separates control signals from audio payloads, which makes the system easier to reason about and implement. Splitting responsibilities cleanly is often the right call.
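One way to express that separation on the receiving side, as a sketch: route binary frames to the audio path and parse everything else as a JSON envelope. The handler names here are hypothetical, not from the HagiCode codebase.

```javascript
// Sketch: dispatch an incoming WebSocket frame by payload kind.
// Binary frames carry audio; text frames carry JSON envelopes.
function dispatchFrame(data, handlers) {
  if (data instanceof ArrayBuffer || ArrayBuffer.isView(data)) {
    handlers.onAudio(data);          // raw PCM bytes
    return 'binary';
  }
  const message = JSON.parse(data);  // JSON control/result envelope
  handlers.onMessage(message);
  return message.type;
}
```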

class DoubaoVoiceClient {
  constructor(config) {
    this.config = config;
    this.ws = null;
  }

  async connect() {
    const url = new URL(this.config.wsUrl);
    // Add query parameters
    Object.entries(this.config.params).forEach(([key, value]) => {
      url.searchParams.set(key, value);
    });
    this.ws = new WebSocket(url);
    return new Promise((resolve, reject) => {
      this.ws.onopen = () => {
        console.log('[DoubaoVoice] Connected');
        resolve();
      };
      this.ws.onmessage = (event) => {
        this._handleMessage(JSON.parse(event.data));
      };
      this.ws.onerror = reject;
    });
  }

  _handleMessage(message) {
    switch (message.type) {
      case 'status':
        this._handleStatus(message.payload);
        break;
      case 'result':
        this.onResult?.(message.payload);
        break;
      case 'error':
        console.error('[DoubaoVoice] Error:', message.payload);
        break;
    }
  }
}

// Usage example
const client = new DoubaoVoiceClient({
  wsUrl: 'ws://localhost:5000/ws',
  params: {
    appId: 'your-app-id',
    accessToken: 'your-access-token',
    sampleRate: 16000,
    bitsPerSample: 16,
    channels: 1
  }
});

Using AudioWorkletNode for audio processing gives better performance:

// audio-worklet.js
class AudioProcessorWorklet extends AudioWorkletProcessor {
  process(inputs, outputs, parameters) {
    const input = inputs[0]?.[0];
    if (!input) return true;
    // Convert to 16-bit PCM
    const pcm = new Int16Array(input.length);
    for (let i = 0; i < input.length; i++) {
      pcm[i] = Math.max(-32768, Math.min(32767, Math.round(input[i] * 32767)));
    }
    this.port.postMessage({
      type: 'audioData',
      data: pcm.buffer
    }, [pcm.buffer]);
    return true;
  }
}
registerProcessor('audio-processor', AudioProcessorWorklet);

// Main thread code ("ws" is the WebSocket connection to the proxy,
// created elsewhere)
async function startAudioRecording() {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: {
      echoCancellation: true,
      noiseSuppression: true,
      autoGainControl: true,
      sampleRate: 48000
    }
  });
  const audioContext = new AudioContext();
  const audioSource = audioContext.createMediaStreamSource(stream);
  await audioContext.audioWorklet.addModule('/audio-worklet.js');
  const audioWorkletNode = new AudioWorkletNode(audioContext, 'audio-processor');
  audioWorkletNode.port.onmessage = (event) => {
    if (event.data.type === 'audioData' && ws?.readyState === WebSocket.OPEN) {
      ws.send(event.data.data); // Send binary data directly
    }
  };
  audioSource.connect(audioWorkletNode);
}

AudioWorklet performs far better than ScriptProcessorNode and avoids the audio stutter problems that older processing paths often introduce.

{
  "Serilog": {
    "MinimumLevel": {
      "Default": "Information",
      "Override": {
        "Microsoft": "Warning",
        "System": "Warning"
      }
    },
    "WriteTo": [
      { "Name": "Console" },
      {
        "Name": "File",
        "Args": { "path": "logs/log-.txt", "rollingInterval": "Day" }
      }
    ]
  },
  "Kestrel": {
    "Urls": "http://0.0.0.0:5000"
  }
}

Logging configuration matters because it makes problems much easier to trace. Serilog’s file sink can roll logs daily, which keeps individual log files at a manageable size.

  • Regularly log session state to trace the full connection lifecycle
  • Monitor the number and duration of audio segments to detect abnormal connections
  • Record connection status and reconnection behavior for the Doubao service

These are basic operational practices, but they make a real difference in production.

  • Capture and log all WebSocket exceptions
  • Use IAsyncDisposable to ensure resources are cleaned up
  • Implement graceful connection shutdown and timeout handling

In short, favor robustness.

  • Sample rate: 16000 Hz (recommended) or 8000 Hz
  • Bit depth: 16-bit
  • Channels: mono
  • Encoding: PCM (raw)

If the format is wrong, recognition may fail or quality may degrade significantly. These requirements matter.
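Since browsers typically capture at 48 kHz while the service expects 16 kHz, a resampling step is needed somewhere in the pipeline. As a sketch only, here is naive decimation by picking every third sample; a real implementation should apply a low-pass filter first to avoid aliasing.

```javascript
// Sketch: naive 48 kHz -> 16 kHz downsampling of Float32 samples
// into 16-bit PCM. Decimates by 3 with no anti-aliasing filter,
// which is acceptable only as an illustration.
function downsampleTo16k(float32Samples, factor = 3) {
  const out = new Int16Array(Math.floor(float32Samples.length / factor));
  for (let i = 0; i < out.length; i++) {
    const s = float32Samples[i * factor];
    out[i] = Math.max(-32768, Math.min(32767, Math.round(s * 32767)));
  }
  return out;
}
```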

  • Keep sensitive credentials only in backend configuration
  • Enforce connection limits to prevent resource exhaustion
  • Use HTTPS/WSS in production

Security is never a minor concern.
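For the connection-limit point, the core mechanism is just a counter checked before accepting a new socket. A minimal sketch (the limit value and method names are illustrative, not from the HagiCode codebase):

```javascript
// Sketch: cap concurrent proxy sessions so clients cannot
// exhaust backend resources. Names and limits are illustrative.
class ConnectionLimiter {
  constructor(maxConnections) {
    this.max = maxConnections;
    this.active = 0;
  }
  tryAcquire() {            // call before accepting a WebSocket
    if (this.active >= this.max) return false;
    this.active++;
    return true;
  }
  release() {               // call when the session closes
    this.active = Math.max(0, this.active - 1);
  }
}
```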

  • Use asynchronous operations to avoid blocking
  • Tune buffer sizes as needed (default: 4096 bytes)
  • Consider connection pooling and reuse strategies

Apply these optimizations where they make sense for your workload.

  1. Docker deployment: Package the proxy service as a container for easier scaling and management
  2. Load balancing: Use Nginx or Envoy as a reverse proxy for WebSocket traffic
  3. Health checks: Implement heartbeat-based checks to monitor service availability
  4. Log aggregation: Send logs to a centralized logging system such as ELK or Loki
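For the load-balancing item, the key detail with Nginx is that WebSocket upgrades must be forwarded explicitly. A minimal fragment as a sketch (the location path, upstream name, and timeout are placeholders):

```nginx
# Illustrative Nginx fragment: forward the WebSocket upgrade
# to the backend proxy. Names and paths are placeholders.
location /api/voice/ws {
    proxy_pass http://backend:5000;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 300s;   # keep long-lived streams alive
}
```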

Deployment can be simple or complex depending on the team and environment, so adjust accordingly.

The WebSocket proxy pattern solves the core problem that the browser WebSocket API does not support custom headers. In the HagiCode project, this pattern proved both feasible and stable as it moved from playground validation into production deployment.

Key takeaways:

  • A backend proxy can pass authentication information securely
  • Native WebSocket is lightweight and efficient for simple scenarios
  • The “one connection, one session” model simplifies both implementation and debugging
  • The frontend-backend protocol should separate control signals from audio data

If you are building a feature that also depends on WebSocket authentication, I hope this pattern gives you a useful starting point.

If you have questions, feel free to discuss them with us. Technical progress happens faster when people compare notes.


Thank you for reading. If you found this article helpful, please click the like button below 👍 so more people can discover it.

This content was created with AI-assisted collaboration, reviewed by me, and reflects my own views and position.