The browser WebSocket API does not support custom HTTP headers, which creates a challenge for speech recognition services that require authentication data in headers. This article shares how we solved that problem in the HagiCode project with a backend proxy pattern, and how the approach evolved from playground experiments to production use.
When we started building speech recognition for the HagiCode project, we confidently chose ByteDance’s Doubao speech recognition service. The initial design was straightforward: let the frontend connect directly to Doubao’s WebSocket service. How hard could that be? Just open a connection and send some data, right?
Then came the surprise: Doubao’s API requires authentication information to be passed through HTTP headers, including things like accessToken and secretKey. That immediately became awkward, because the browser WebSocket API simply does not support setting custom headers.
So what do you do when the browser will not let you send them?
At the time, we weighed two options:
1. Put the credentials into URL query parameters - simple and blunt
2. Add a proxy layer on the backend - more work at first glance
The first option exposes credentials directly in frontend code and local storage. Is that really safe? I was not comfortable with it. On top of that, some APIs only accept header-based authentication, so for those APIs this approach is not even viable.
In the end, we chose the second option: implement a WebSocket proxy on the backend. We first validated this pattern in our playground environment, and only after we confirmed its stability did we move it into production. After all, nobody wants production to double as a lab experiment.
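To make the second option concrete, here is a minimal sketch of the one thing the backend can do that the browser cannot: attach custom headers before opening the upstream WebSocket. It relies on .NET's `ClientWebSocket.Options.SetRequestHeader`. The header names and the helper names `BuildAuthHeaders` and `CreateUpstreamSocket` are placeholders for illustration, not Doubao's documented header names and not HagiCode's actual code.

```csharp
using System;
using System.Collections.Generic;
using System.Net.WebSockets;

// Placeholder header names; check the provider's documentation for the real ones.
static Dictionary<string, string> BuildAuthHeaders(string accessToken, string secretKey)
{
    return new Dictionary<string, string>
    {
        ["X-Example-Access-Token"] = accessToken,
        ["X-Example-Secret-Key"] = secretKey,
    };
}

// Creates the upstream socket with credentials attached.
// The caller is responsible for ConnectAsync and for disposing the socket.
static ClientWebSocket CreateUpstreamSocket(IReadOnlyDictionary<string, string> authHeaders)
{
    var socket = new ClientWebSocket();
    foreach (var (name, value) in authHeaders)
    {
        // Server-side code may set arbitrary request headers on the handshake.
        socket.Options.SetRequestHeader(name, value);
    }
    return socket;
}

var headers = BuildAuthHeaders("token-from-config", "secret-from-config");
using var upstream = CreateUpstreamSocket(headers);
Console.WriteLine($"Prepared upstream socket with {headers.Count} auth headers.");
```

The browser only ever talks to the proxy, so the credentials stay on the server and never appear in frontend code.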
The approach shared in this article comes from our practical experience in the HagiCode project.
HagiCode is an AI coding assistant project with voice interaction support. Because we needed to call a speech recognition service from the frontend, we ran straight into this WebSocket authentication problem, which led us to the solution described here. Sometimes these technical roadblocks are frustrating, but they also force you to learn patterns that turn out to be useful later.
When designing the solution, we compared the trade-offs carefully.
Decision 1: Choosing the proxy pattern
We compared two approaches:
| Option | Pros | Cons | Decision |
| --- | --- | --- | --- |
| Native WebSocket | Lightweight, simple, direct forwarding | Connection management must be handled manually | Chosen |
| SignalR | Automatic reconnection, strong typing | Overly complex, extra dependencies | Rejected |
We ultimately chose native WebSocket. To be honest, it was the lightest option and a better fit for simple bidirectional binary stream forwarding. Pulling in SignalR would have felt like overengineering for this use case, and it could add extra latency.
Decision 2: Connection management strategy
We adopted a “one connection, one session” model - each frontend WebSocket connection maps to its own independent Doubao backend connection.
The benefits are straightforward:
- Simple to implement and aligned with the common usage pattern
- Easier to debug and troubleshoot
- Good resource isolation, preventing interference between sessions
Put simply, the direct solution is sometimes the best one. Complexity does not automatically make a design better.
Decision 3: Storing authentication data
Credentials are stored in backend configuration (appsettings.yml or environment variables) and loaded through dependency injection. This gives us:
- A simple configuration model that matches existing backend conventions
- Sensitive data that never reaches the frontend
- Support for multi-environment setups covering development, testing, and production
That level of separation matters. No one wants credentials floating around in places they should not be.
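As a simplified sketch of the same idea, the snippet below reads both credentials on the server and fails fast when one is missing. It reads environment variables directly rather than going through dependency injection, and the variable names `DOUBAO_ACCESS_TOKEN` and `DOUBAO_SECRET_KEY` and the helper `LoadDoubaoCredentials` are hypothetical names chosen for this example.

```csharp
using System;

// Reads both credentials through the supplied lookup and fails fast if one is missing.
// Passing the lookup in keeps the function usable with any configuration source.
static (string AccessToken, string SecretKey) LoadDoubaoCredentials(Func<string, string?> readSetting)
{
    string Require(string key)
    {
        var value = readSetting(key);
        if (string.IsNullOrWhiteSpace(value))
        {
            throw new InvalidOperationException($"Missing required setting '{key}'.");
        }
        return value;
    }

    // Hypothetical setting names, used only for this sketch.
    return (Require("DOUBAO_ACCESS_TOKEN"), Require("DOUBAO_SECRET_KEY"));
}

// Demo values only; in a real deployment these come from the environment itself.
Environment.SetEnvironmentVariable("DOUBAO_ACCESS_TOKEN", "demo-token");
Environment.SetEnvironmentVariable("DOUBAO_SECRET_KEY", "demo-secret");

var credentials = LoadDoubaoCredentials(Environment.GetEnvironmentVariable);
Console.WriteLine("Credentials loaded on the server side.");
```

Failing at startup when a credential is missing is a deliberate choice here: a misconfigured environment shows up immediately instead of as a rejected upstream connection later.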
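```csharp
using System.Collections.Concurrent;

// The start of this class is missing from the source. The class name, the DoubaoSession
// type name, the field, and the first method's signature are assumptions.
public class DoubaoSessionManager
{
    private readonly ConcurrentDictionary<string, DoubaoSession> _sessions = new();

    public async Task SendAudioAsync(string connectionId, byte[] audioData)
    {
        if (_sessions.TryGetValue(connectionId, out var session))
        {
            await session.SendAudioAsync(audioData);
        }
    }

    public void RemoveSession(string connectionId)
    {
        // TryRemove is atomic, so a session is disposed at most once.
        if (_sessions.TryRemove(connectionId, out var session))
        {
            session.Dispose();
        }
    }
}
```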
Using ConcurrentDictionary for session management means thread safety is largely handled for us. Each incoming connection gets its own session, and calling RemoveSession when the connection closes removes that session and disposes it.
The frontend and backend use JSON text messages for control, while binary messages carry audio data.
Example control message:
```json
{
  "type": "control",
  "messageId": "msg_123",
  "timestamp": "2026-03-03T10:00:00Z",
  "payload": {
    "command": "StartRecognition",
    "parameters": {
      "hotwordId": "hotword1",
      "boosting_table_id": "table123"
    }
  }
}
```
Example recognition result:
```json
{
  "type": "result",
  "timestamp": "2026-03-03T10:00:03Z",
  "payload": {
    "text": "Hello world",
    "confidence": 0.95,
    "duration": 1500,
    "isFinal": true,
    "utterances": [
      {
        "text": "Hello",
        "startTime": 0,
        "endTime": 800,
        "definite": true
      }
    ]
  }
}
```
This design separates control signals from audio payloads, which makes the system easier to reason about and implement. Splitting responsibilities cleanly is often the right call.
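The split between text and binary frames can be sketched as a small routing function on the backend. This is an illustration only: `RouteFrame` and `ReadControlCommand` are hypothetical helper names, the returned strings stand in for real handlers, and the parsing assumes messages shaped like the control example above.

```csharp
using System;
using System.Net.WebSockets;
using System.Text;
using System.Text.Json;

// Returns the command of a control message, or null for any other message type.
static string? ReadControlCommand(string json)
{
    using var document = JsonDocument.Parse(json);
    var root = document.RootElement;
    if (!root.TryGetProperty("type", out var type) || type.GetString() != "control")
    {
        return null;
    }
    if (!root.TryGetProperty("payload", out var payload) ||
        !payload.TryGetProperty("command", out var command))
    {
        return null;
    }
    return command.GetString();
}

// Text frames carry JSON control messages; binary frames carry raw audio bytes.
static string RouteFrame(WebSocketMessageType frameType, byte[] data)
{
    switch (frameType)
    {
        case WebSocketMessageType.Binary:
            return "audio:" + data.Length;
        case WebSocketMessageType.Text:
            var controlCommand = ReadControlCommand(Encoding.UTF8.GetString(data));
            return "control:" + (controlCommand ?? "ignored");
        default:
            return "close";
    }
}

var sample = "{\"type\":\"control\",\"payload\":{\"command\":\"StartRecognition\"}}";
Console.WriteLine(RouteFrame(WebSocketMessageType.Text, Encoding.UTF8.GetBytes(sample)));   // control:StartRecognition
Console.WriteLine(RouteFrame(WebSocketMessageType.Binary, new byte[3200]));                 // audio:3200
```

Because the frame type alone decides the path, audio bytes never need to be wrapped in JSON or inspected before they are forwarded.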
Logging configuration matters because it makes problems much easier to trace. Serilog’s file sink can roll logs daily, which keeps individual log files at a manageable size.
The WebSocket proxy pattern solves the core problem that the browser WebSocket API does not support custom headers. In the HagiCode project, this pattern proved both feasible and stable as it moved from playground validation into production deployment.
Key takeaways:
- A backend proxy can pass authentication information securely
- Native WebSocket is lightweight and efficient for simple scenarios
- The “one connection, one session” model simplifies both implementation and debugging
- The frontend-backend protocol should separate control signals from audio data
If you are building a feature that also depends on WebSocket authentication, I hope this pattern gives you a useful starting point.
If you have questions, feel free to discuss them with us. Technical progress happens faster when people compare notes.