Solving Browser WebSocket Authentication Challenges: A Practical Proxy Pattern for Doubao Speech Recognition
The browser WebSocket API does not support custom HTTP headers, which creates a challenge for speech recognition services that require authentication data in headers. This article shares how we solved that problem in the HagiCode project with a backend proxy pattern, and how the approach evolved from playground experiments to production use.
Background
When we started building speech recognition for the HagiCode project, we confidently chose ByteDance’s Doubao speech recognition service. The initial design was straightforward: let the frontend connect directly to Doubao’s WebSocket service. How hard could that be? Just open a connection and send some data, right?
Then came the surprise: Doubao’s API requires authentication information to be passed through HTTP headers, including things like accessToken and secretKey. That immediately became awkward, because the browser WebSocket API simply does not support setting custom headers.
So what do you do when the browser will not let you send them?
At the time, we weighed two options:
- Put the credentials into URL query parameters - simple and blunt
- Add a proxy layer on the backend - more work at first glance
The first option exposes credentials directly in frontend code and local storage, and I was not comfortable with that. On top of that, some APIs only accept header-based authentication, so query parameters are not even viable for them.
In the end, we chose the second option: implement a WebSocket proxy on the backend. We first validated the pattern in our playground environment and moved it into production only after confirming it was stable. After all, nobody wants production to double as a lab experiment.
About HagiCode
The approach shared in this article comes from our practical experience in the HagiCode project.
HagiCode is an AI coding assistant project with voice interaction support. Because we needed to call a speech recognition service from the frontend, we ran straight into this WebSocket authentication problem, which led us to the solution described here. Sometimes these technical roadblocks are frustrating, but they also force you to learn patterns that turn out to be useful later.
Technical Challenge Analysis
Browser WebSocket limitations
The standard WebSocket API looks wonderfully simple:

```javascript
const ws = new WebSocket('wss://example.com/ws');
```

But that simplicity is exactly where the problem lies: the constructor accepts only a URL and an optional list of subprotocols, so it cannot set headers the way an HTTP request can:

```javascript
// This is not supported in the WebSocket API
const ws = new WebSocket('wss://example.com/ws', {
  headers: { 'Authorization': 'Bearer token' }
});
```

And that is the core issue. For services like Doubao speech recognition that depend on header-based authentication, this limitation is a hard blocker.
Once you accept that constraint, the architecture has to change.
Architectural design decisions
When designing the solution, we compared the trade-offs carefully.
Decision 1: Choosing the proxy pattern
We compared two approaches:
| Option | Pros | Cons | Decision |
|---|---|---|---|
| Native WebSocket | Lightweight, simple, direct forwarding | Connection management must be handled manually | Chosen |
| SignalR | Automatic reconnection, strong typing | Overly complex, extra dependencies | Rejected |
We ultimately chose native WebSocket. To be honest, it was the lightest option and a better fit for simple bidirectional binary stream forwarding. Pulling in SignalR would have felt like overengineering for this use case, and it could add extra latency.
Decision 2: Connection management strategy
We adopted a “one connection, one session” model - each frontend WebSocket connection maps to its own independent Doubao backend connection.
The benefits are straightforward:
- Simple to implement and aligned with the common usage pattern
- Easier to debug and troubleshoot
- Good resource isolation, preventing interference between sessions
Put simply, the direct solution is sometimes the best one. Complexity does not automatically make a design better.
Decision 3: Storing authentication data
Credentials are stored in backend configuration (appsettings.yml or environment variables) and loaded through dependency injection. The benefits:
- Simple configuration model that matches existing backend conventions
- Sensitive data never reaches the frontend
- Supports multi-environment setup for development, testing, and production
That level of separation matters. No one wants credentials floating around in places they should not be.
Data flow design
The overall data flow looks like this:

```text
Frontend (browser)
    │
    │ ws://backend/api/voice/ws
    │ WebSocket (binary)
    ▼
Backend (proxy)
    │
    │ wss://openspeech.bytedance.com/
    │ (with authentication headers)
    ▼
Doubao API
```

The flow itself is not complicated:
- The frontend connects to the backend proxy through WebSocket
- The backend proxy receives audio data and connects to the Doubao API with authenticated headers
- The Doubao API returns recognition results, and the proxy forwards them to the frontend
- The whole process remains fully asynchronous with bidirectional streaming
Once the responsibilities are split clearly, the design becomes quite natural.
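The forwarding steps above can be sketched in a few lines. This is a minimal illustration written in JavaScript rather than the C# of the actual proxy; `createRelay`, `fakeSocket`, and the `send`/`close` socket shape are hypothetical stand-ins for this sketch, not part of HagiCode or the Doubao API.

```javascript
// Minimal relay sketch: one frontend socket is paired with one upstream socket.
// Both arguments are assumed to expose send(data) and close(); real WebSocket
// objects would be wired to these handlers through their message/close events.
function createRelay(frontend, upstream) {
  let closed = false;

  function closeBoth() {
    if (closed) return;
    closed = true;
    frontend.close();
    upstream.close();
  }

  return {
    // Audio or control frames from the browser go upstream unchanged.
    onFrontendMessage(data) {
      if (!closed) upstream.send(data);
    },
    // Recognition results from upstream go back to the browser unchanged.
    onUpstreamMessage(data) {
      if (!closed) frontend.send(data);
    },
    // Either side closing tears down the pair, so no session outlives its peer.
    onClose: closeBoth,
    get closed() {
      return closed;
    },
  };
}

// Tiny in-memory stand-in for a socket, used only for demonstration.
function fakeSocket() {
  return {
    sent: [],
    closeCount: 0,
    send(data) { this.sent.push(data); },
    close() { this.closeCount += 1; },
  };
}
```

Because the relay never inspects the payload, the same two handlers cover both binary audio frames and JSON text frames; authentication headers would be attached once, when the upstream connection is opened.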
Core component implementation
1. WebSocket endpoint configuration

```csharp
using System.Net.WebSockets;

// Requires app.UseWebSockets() earlier in the pipeline
app.Map("/ws", async context =>
{
    if (!context.WebSockets.IsWebSocketRequest)
    {
        // Reject plain HTTP requests to the WebSocket endpoint
        context.Response.StatusCode = 400;
        return;
    }

    // Read configuration from query parameters
    var appId = context.Request.Query["appId"];
    var accessToken = context.Request.Query["accessToken"];

    // Validate required parameters
    if (string.IsNullOrEmpty(appId) || string.IsNullOrEmpty(accessToken))
    {
        context.Response.StatusCode = 400;
        return;
    }

    // Accept the WebSocket connection
    using var webSocket = await context.WebSockets.AcceptWebSocketAsync();

    // Message handling loop
    var buffer = new byte[4096];
    while (!webSocket.CloseStatus.HasValue)
    {
        var result = await webSocket.ReceiveAsync(
            new ArraySegment<byte>(buffer), CancellationToken.None);

        if (result.MessageType == WebSocketMessageType.Close)
        {
            await webSocket.CloseAsync(
                result.CloseStatus ?? WebSocketCloseStatus.NormalClosure,
                result.CloseStatusDescription,
                CancellationToken.None);
            break;
        }

        // Process audio data (HandleAudioDataAsync is application code not shown here)
        await HandleAudioDataAsync(buffer, result.Count);
    }
});
```

2. Session management
```csharp
using System.Collections.Concurrent;

public class DoubaoSessionManager : IDoubaoSessionManager
{
    private readonly ConcurrentDictionary<string, DoubaoSession> _sessions = new();

    public DoubaoSession CreateSession(string connectionId)
    {
        var session = new DoubaoSession(connectionId);
        _sessions[connectionId] = session;
        return session;
    }

    public async Task SendAudioAsync(string connectionId, byte[] audioData)
    {
        if (_sessions.TryGetValue(connectionId, out var session))
        {
            await session.SendAudioAsync(audioData);
        }
    }

    public void RemoveSession(string connectionId)
    {
        if (_sessions.TryRemove(connectionId, out var session))
        {
            session.Dispose();
        }
    }
}
```

Using ConcurrentDictionary for session management means thread safety is largely handled for us. Each incoming connection gets its own session, and cleanup happens on disconnect through RemoveSession, which disposes the session.
3. Configuration validation
```csharp
public class ClientConfigDto
{
    public string AppId { get; set; } = null!;
    public string AccessToken { get; set; } = null!;
    public string? ServiceUrl { get; set; }
    public string? ResourceId { get; set; }
    public int? SampleRate { get; set; }
    public int? BitsPerSample { get; set; }
    public int? Channels { get; set; }

    public void Validate()
    {
        if (string.IsNullOrWhiteSpace(AppId))
            throw new ArgumentException("AppId is required");
        if (string.IsNullOrWhiteSpace(AccessToken))
            throw new ArgumentException("AccessToken is required");
    }
}
```

Configuration validation surfaces missing settings up front instead of letting them fail later in the recognition flow. That safeguard is worth having.
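The same check can also run in the browser before a connection is even attempted. The sketch below is a hypothetical client-side mirror of the DTO, not code from HagiCode: `normalizeClientConfig` and `AUDIO_DEFAULTS` are names invented here, and the default values are assumptions taken from the audio format recommended in this article.

```javascript
// Assumed defaults for the optional audio fields (16 kHz, 16-bit, mono).
const AUDIO_DEFAULTS = { sampleRate: 16000, bitsPerSample: 16, channels: 1 };

// Throws if a required field is missing or blank; otherwise returns the
// params with defaults filled in for any audio field the caller omitted.
function normalizeClientConfig(params) {
  const errors = [];
  for (const key of ['appId', 'accessToken']) {
    const value = params[key];
    if (typeof value !== 'string' || value.trim() === '') {
      errors.push(`${key} is required`);
    }
  }
  if (errors.length > 0) {
    throw new Error(errors.join('; '));
  }
  return { ...AUDIO_DEFAULTS, ...params };
}
```

Collecting every problem before throwing means a caller sees all missing fields in one error instead of fixing them one at a time.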
Message protocol design
The frontend and backend use JSON text messages for control, while binary messages carry audio data.
Example control message:

```json
{
  "type": "control",
  "messageId": "msg_123",
  "timestamp": "2026-03-03T10:00:00Z",
  "payload": {
    "command": "StartRecognition",
    "parameters": {
      "hotwordId": "hotword1",
      "boosting_table_id": "table123"
    }
  }
}
```

Example recognition result:

```json
{
  "type": "result",
  "timestamp": "2026-03-03T10:00:03Z",
  "payload": {
    "text": "Hello world",
    "confidence": 0.95,
    "duration": 1500,
    "isFinal": true,
    "utterances": [
      {
        "text": "Hello",
        "startTime": 0,
        "endTime": 800,
        "definite": true
      }
    ]
  }
}
```

This design separates control signals from audio payloads, which makes the system easier to reason about and implement. Splitting responsibilities cleanly is often the right call.
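One way to apply that split on the receiving side is to branch on the payload type before doing anything else. The following is a sketch under assumptions: `classifyFrame` and `buildControlMessage` are hypothetical helper names, the `messageId` format is a placeholder, and binary frames are assumed to arrive as ArrayBuffer (in a browser that means setting `binaryType = 'arraybuffer'` on the socket).

```javascript
// Classify an incoming WebSocket payload under the protocol described above:
// strings carry JSON control/result messages, binary payloads carry audio.
function classifyFrame(data) {
  if (typeof data === 'string') {
    let message;
    try {
      message = JSON.parse(data);
    } catch {
      return { kind: 'invalid', reason: 'malformed JSON' };
    }
    if (message === null || typeof message !== 'object' || typeof message.type !== 'string') {
      return { kind: 'invalid', reason: 'missing type field' };
    }
    return { kind: 'message', message };
  }
  if (data instanceof ArrayBuffer || ArrayBuffer.isView(data)) {
    return { kind: 'audio', byteLength: data.byteLength };
  }
  return { kind: 'invalid', reason: 'unsupported payload' };
}

// Build a control message in the shape shown above. The messageId format is
// a placeholder chosen for this sketch.
function buildControlMessage(command, parameters = {}, now = new Date()) {
  return JSON.stringify({
    type: 'control',
    messageId: `msg_${now.getTime()}`,
    timestamp: now.toISOString(),
    payload: { command, parameters },
  });
}
```

Returning an `invalid` result instead of throwing keeps one malformed frame from tearing down the whole message loop; the caller can log it and carry on.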
Frontend integration in practice
WebSocket connection

```javascript
class DoubaoVoiceClient {
  constructor(config) {
    this.config = config;
    this.ws = null;
  }

  async connect() {
    const url = new URL(this.config.wsUrl);
    // Add query parameters
    Object.entries(this.config.params).forEach(([key, value]) => {
      url.searchParams.set(key, value);
    });

    this.ws = new WebSocket(url);

    return new Promise((resolve, reject) => {
      this.ws.onopen = () => {
        console.log('[DoubaoVoice] Connected');
        resolve();
      };

      this.ws.onmessage = (event) => {
        this._handleMessage(JSON.parse(event.data));
      };

      this.ws.onerror = reject;
    });
  }

  _handleMessage(message) {
    switch (message.type) {
      case 'status':
        // Optional callback, set by the caller in the same way as onResult
        this.onStatus?.(message.payload);
        break;
      case 'result':
        this.onResult?.(message.payload);
        break;
      case 'error':
        console.error('[DoubaoVoice] Error:', message.payload);
        break;
    }
  }
}

// Usage example
const client = new DoubaoVoiceClient({
  wsUrl: 'ws://localhost:5000/ws',
  params: {
    appId: 'your-app-id',
    accessToken: 'your-access-token',
    sampleRate: 16000,
    bitsPerSample: 16,
    channels: 1
  }
});
```

Audio capture and transmission
Using AudioWorkletNode for audio processing gives better performance:

```javascript
// audio-worklet.js
class AudioProcessorWorklet extends AudioWorkletProcessor {
  process(inputs, outputs, parameters) {
    const input = inputs[0]?.[0];
    if (!input) return true;

    // Convert to 16-bit PCM
    const pcm = new Int16Array(input.length);
    for (let i = 0; i < input.length; i++) {
      pcm[i] = Math.max(-32768, Math.min(32767, input[i] * 32767));
    }

    this.port.postMessage({ type: 'audioData', data: pcm.buffer }, [pcm.buffer]);

    return true;
  }
}

registerProcessor('audio-processor', AudioProcessorWorklet);
```

```javascript
// Main thread code; ws is the open WebSocket connection to the proxy
async function startAudioRecording(ws) {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: {
      echoCancellation: true,
      noiseSuppression: true,
      autoGainControl: true,
      sampleRate: 48000
    }
  });

  const audioContext = new AudioContext();
  const audioSource = audioContext.createMediaStreamSource(stream);

  await audioContext.audioWorklet.addModule('/audio-worklet.js');
  const audioWorkletNode = new AudioWorkletNode(audioContext, 'audio-processor');

  audioWorkletNode.port.onmessage = (event) => {
    if (event.data.type === 'audioData' && ws?.readyState === WebSocket.OPEN) {
      ws.send(event.data.data); // Send binary data directly
    }
  };

  audioSource.connect(audioWorkletNode);
}
```

AudioWorklet runs on the audio rendering thread, so it avoids the stutter that the deprecated, main-thread ScriptProcessorNode often introduces.
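The capture code requests 48000 Hz, while the format requirements later in this article recommend 16000 Hz, and the article does not show where that conversion happens. Assuming it is done in the browser, a minimal sketch could look like the following. `downsampleToPcm16` is a hypothetical helper, and the averaging used here is a crude stand-in for a proper low-pass filter, so treat it as an illustration rather than production resampling.

```javascript
// Downsample Float32 samples by an integer factor by averaging each group of
// `factor` samples (a crude low-pass), then convert to 16-bit signed PCM.
function downsampleToPcm16(input, inputRate, targetRate) {
  const factor = inputRate / targetRate;
  if (!Number.isInteger(factor) || factor < 1) {
    throw new Error(`Unsupported rate conversion: ${inputRate} -> ${targetRate}`);
  }

  // Samples left over at the end (fewer than `factor`) are dropped.
  const outLength = Math.floor(input.length / factor);
  const pcm = new Int16Array(outLength);

  for (let i = 0; i < outLength; i++) {
    let sum = 0;
    for (let j = 0; j < factor; j++) {
      sum += input[i * factor + j];
    }
    // Clamp to the valid float range before scaling.
    const sample = Math.max(-1, Math.min(1, sum / factor));
    // Scale asymmetrically so -1 maps to -32768 and +1 maps to 32767.
    pcm[i] = sample < 0 ? Math.round(sample * 32768) : Math.round(sample * 32767);
  }

  return pcm;
}
```

Because leftover samples are dropped, callers should pass chunks whose length is a multiple of the factor. A 128-frame render quantum is not a multiple of 3, so a 48000 to 16000 conversion would need some buffering in front of this function.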
Backend configuration
appsettings.json example

```json
{
  "Serilog": {
    "MinimumLevel": {
      "Default": "Information",
      "Override": {
        "Microsoft": "Warning",
        "System": "Warning"
      }
    },
    "WriteTo": [
      { "Name": "Console" },
      {
        "Name": "File",
        "Args": {
          "path": "logs/log-.txt",
          "rollingInterval": "Day"
        }
      }
    ]
  },
  "Kestrel": {
    "Urls": "http://0.0.0.0:5000"
  }
}
```

Logging configuration matters because it makes problems much easier to trace. Serilog’s file sink can roll logs daily, which keeps individual log files at a manageable size.
Notes and best practices
1. Connection monitoring
- Regularly log session state to trace the full connection lifecycle
- Monitor the number and duration of audio segments to detect abnormal connections
- Record connection status and reconnection behavior for the Doubao service
These are basic operational practices, but they make a real difference in production.
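As a rough illustration of the first two points, per-session counters are enough to log lifecycle snapshots and to flag connections that have stopped sending audio. `SessionStats` and its fields are hypothetical names for this sketch, written in JavaScript rather than the backend's C#, and the idle threshold is something each deployment would have to choose.

```javascript
// Per-session counters for logging and for spotting abnormal connections.
class SessionStats {
  constructor(connectionId, startedAtMs = Date.now()) {
    this.connectionId = connectionId;
    this.startedAtMs = startedAtMs;
    this.lastSegmentAtMs = null;
    this.segments = 0;
    this.bytes = 0;
  }

  // Call once for every audio segment received from the frontend.
  recordSegment(byteLength, nowMs = Date.now()) {
    this.segments += 1;
    this.bytes += byteLength;
    this.lastSegmentAtMs = nowMs;
  }

  // True when no audio has arrived for longer than maxIdleMs.
  isStalled(maxIdleMs, nowMs = Date.now()) {
    const last = this.lastSegmentAtMs ?? this.startedAtMs;
    return nowMs - last > maxIdleMs;
  }

  // Plain object suitable for a structured log entry.
  snapshot(nowMs = Date.now()) {
    return {
      connectionId: this.connectionId,
      segments: this.segments,
      bytes: this.bytes,
      openForMs: nowMs - this.startedAtMs,
    };
  }
}
```

Passing the clock in as a parameter keeps the class easy to test and lets a periodic job reuse one timestamp across every session it inspects.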
2. Error handling
- Capture and log all WebSocket exceptions
- Use IAsyncDisposable to ensure resources are cleaned up
- Implement graceful connection shutdown and timeout handling
In short, favor robustness.
3. Audio format requirements
- Sample rate: 16000 Hz (recommended) or 8000 Hz
- Bit depth: 16-bit
- Channels: mono
- Encoding: PCM (raw)
If the format is wrong, recognition may fail or quality may degrade significantly. These requirements matter.
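These numbers also determine how much audio each buffer carries, which is useful when tuning chunk sizes. The helper names below are made up for this sketch; the arithmetic is just sample rate times bytes per sample times channels.

```javascript
// Bytes per second of raw PCM for a given format.
function pcmBytesPerSecond({ sampleRate, bitsPerSample, channels }) {
  return sampleRate * (bitsPerSample / 8) * channels;
}

// How many milliseconds of audio fit in a buffer of the given size.
function bufferDurationMs(bufferBytes, format) {
  return (bufferBytes * 1000) / pcmBytesPerSecond(format);
}

const recommended = { sampleRate: 16000, bitsPerSample: 16, channels: 1 };

console.log(pcmBytesPerSecond(recommended));      // 32000
console.log(bufferDurationMs(4096, recommended)); // 128
```

In other words, at the recommended format a 4096-byte buffer holds 128 ms of audio, and the same buffer holds twice as long at 8000 Hz.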
4. Security considerations
- Keep sensitive credentials only in backend configuration
- Enforce connection limits to prevent resource exhaustion
- Use HTTPS/WSS in production
Security is never a minor concern.
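A connection limit does not need much machinery. The sketch below shows one possible shape, again in JavaScript for illustration only; `ConnectionLimiter` is a hypothetical class, and the right cap depends on how many upstream sessions the deployment can afford.

```javascript
// Caps the number of concurrent proxy sessions.
class ConnectionLimiter {
  constructor(maxConnections) {
    this.maxConnections = maxConnections;
    this.active = new Set();
  }

  // Returns true if the connection was admitted, false if the cap is reached.
  tryAcquire(connectionId) {
    if (this.active.has(connectionId)) return true;
    if (this.active.size >= this.maxConnections) return false;
    this.active.add(connectionId);
    return true;
  }

  // Must be called when the connection closes so the slot is freed.
  release(connectionId) {
    this.active.delete(connectionId);
  }
}
```

The check belongs before the WebSocket upgrade is accepted, so a rejected client never causes an upstream connection to be opened.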
5. Performance optimization
- Use asynchronous operations to avoid blocking
- Tune buffer sizes as needed (default: 4096 bytes)
- Consider connection pooling and reuse strategies
Apply these optimizations where they make sense for your workload.
Deployment recommendations
- Docker deployment: Package the proxy service as a container for easier scaling and management
- Load balancing: Use Nginx or Envoy as a reverse proxy for WebSocket traffic
- Health checks: Implement heartbeat-based checks to monitor service availability
- Log aggregation: Send logs to a centralized logging system such as ELK or Loki
Deployment can be simple or complex depending on the team and environment, so adjust accordingly.
Summary
The WebSocket proxy pattern solves the core problem that the browser WebSocket API does not support custom headers. In the HagiCode project, this pattern proved both feasible and stable as it moved from playground validation into production deployment.
Key takeaways:
- A backend proxy can pass authentication information securely
- Native WebSocket is lightweight and efficient for simple scenarios
- The “one connection, one session” model simplifies both implementation and debugging
- The frontend-backend protocol should separate control signals from audio data
If you are building a feature that also depends on WebSocket authentication, I hope this pattern gives you a useful starting point.
If you have questions, feel free to discuss them with us. Technical progress happens faster when people compare notes.
This content was created with AI-assisted collaboration, reviewed by me, and reflects my own views and position.
- Author: newbe36524
- Article link: https://docs.hagicode.com/blog/2026-03-05-websocket-proxy-for-doubao-speech-recognition/
- Copyright notice: Unless otherwise stated, all articles in this blog are licensed under BY-NC-SA. Please include the source when reprinting.