
WebSocket

4 posts with the tag “WebSocket”

Typing Is Slower Than Talking, and Talking Is Slower Than a Screenshot - Multimodal Input Practices for AI Coding Assistants


Writing code has a speed limit no matter how fast you type. Sometimes something you could say in one sentence takes forever to type out; sometimes one screenshot explains everything, yet you still have to describe it with a pile of text. This article talks about what we ran into while building HagiCode, from speech recognition to image uploads. In the end, we just wanted to make an AI coding assistant a little easier to use.

While building HagiCode, we noticed a problem - or rather, a problem that naturally surfaced once people started using it heavily: relying on typing alone can be tiring.

Think about it: interaction between users and the Agent is a core scenario. But if every exchange requires nonstop typing at the keyboard, the efficiency is not great:

  1. Typing is too slow: For complicated issues, like error messages or UI problems, typing everything out can take half a minute, while saying it aloud might take ten seconds. That gap is real.

  2. Images are more direct: Sometimes the UI throws an error, sometimes you want to compare a design draft, and sometimes you need to show a code structure. “A picture is worth a thousand words” may be an old saying, but it still holds up. Letting AI directly “see” the problem is much clearer than describing it for ages.

  3. Interaction should feel natural: Modern AI assistants should support text, voice, and images. Users should be able to choose whichever input method feels most natural.

So we decided to add speech recognition and image upload support to HagiCode to make Agent interactions more convenient. If users can type a little less, that is already a win.

The solutions shared in this article come from our hands-on work in the HagiCode project - or, more accurately, from lessons learned while stumbling through quite a few pitfalls.

HagiCode is an open-source AI coding assistant project with a simple goal: use AI to improve development efficiency. As we kept building, it became clear that users strongly wanted multimodal input. Sometimes speaking one sentence is faster than typing a long paragraph, and sometimes a screenshot is far clearer than a long explanation.

Those needs pushed us forward, and that is how features like speech recognition and image uploads eventually took shape. Users can now interact with AI in the most natural way available to them, and that feels good.

Technical Challenges in Speech Recognition


When building speech recognition, we ran into a tricky issue: the browser WebSocket API does not support custom HTTP headers.

The speech recognition service we chose was ByteDance’s Doubao Speech Recognition API. Unfortunately, this API requires authentication information such as accessToken and secretKey to be passed through HTTP headers. That created an immediate technical conflict:

```javascript
// The browser WebSocket API does not support this approach
const ws = new WebSocket('wss://api.com/ws', {
  headers: {
    'Authorization': 'Bearer token' // Not supported
  }
});
```

We basically had two options:

  1. URL query parameter approach: put the authentication info in the URL

    • The advantage is that it is simple to implement
    • The downside is that credentials are exposed to the frontend, which is insecure; some APIs also require header-based authentication
  2. Backend proxy approach: implement a WebSocket proxy on the backend

    • The advantage is that credentials remain securely stored on the backend and the solution is fully compatible with API requirements
    • The downside is that implementation is a bit more complex
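
For reference, the query-parameter approach (option 1) would look roughly like the sketch below. `buildAuthUrl` is a hypothetical helper, not part of any real API, and the token ends up visible in URLs, logs, and browser dev tools, which illustrates the security downside:

```typescript
// Hypothetical helper: appends credentials as query parameters.
// The token is exposed in the URL, server logs, and dev tools,
// which is exactly why we did not choose this approach.
function buildAuthUrl(baseUrl: string, token: string): string {
  const url = new URL(baseUrl);
  url.searchParams.set('access_token', token);
  return url.toString();
}

// Usage (browser side):
// const ws = new WebSocket(buildAuthUrl('wss://api.com/ws', token));
```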

In the end, we chose the backend proxy approach. Security is not something you compromise on.

Our requirements for image uploads were actually pretty straightforward:

  1. Multiple upload methods: click to select a file, drag and drop, and paste from the clipboard
  2. File validation: type restrictions (PNG, JPG, WebP, GIF) and size limits (5-10 MB) are basic requirements
  3. User experience: upload progress, previews, and error messages so users always know what is happening
  4. Security: server-side validation and protection against malicious file uploads are essential

Speech Recognition: WebSocket Proxy Architecture


We designed a three-layer architecture for speech recognition and found a path that worked:

```
Browser WebSocket
    |
    |  ws://backend/api/voice/ws
    |  (binary audio)
    v
Backend Proxy
    |
    |  wss://openspeech.bytedance.com/  (with auth header)
    v
Doubao API
```

Core component implementations:

  1. Frontend AudioWorklet processor:

```javascript
class AudioProcessorWorklet extends AudioWorkletProcessor {
  process(inputs, outputs, parameters) {
    const input = inputs[0]?.[0];
    if (!input) return true;
    // Resample to 16 kHz (required by the Doubao API)
    const samples = this.resampleAudio(input, 48000, 16000);
    // Accumulate samples into 500 ms chunks (16,000 samples/s * 0.5 s = 8,000 samples)
    this.accumulatedSamples.push(...samples);
    if (this.accumulatedSamples.length >= 8000) {
      // Convert to 16-bit PCM and send
      const pcm = this.floatToPcm16(this.accumulatedSamples);
      this.port.postMessage({ type: 'audioData', data: pcm.buffer }, [pcm.buffer]);
      this.accumulatedSamples = [];
    }
    return true;
  }
}
```
  2. Backend WebSocket handler (C#):

```csharp
[HttpGet("ws")]
public async Task GetWebSocket()
{
    if (HttpContext.WebSockets.IsWebSocketRequest)
    {
        await _webSocketHandler.HandleAsync(HttpContext);
    }
    else
    {
        // Reject plain HTTP requests that are not WebSocket upgrades
        HttpContext.Response.StatusCode = StatusCodes.Status400BadRequest;
    }
}
```
  1. Frontend VoiceTextArea component:
export const VoiceTextArea = forwardRef<HTMLTextAreaElement, VoiceTextAreaProps>(
({ value, onChange, onTextRecognized, maxDuration }, ref) => {
const { isRecording, interimText, volume, duration, startRecording, stopRecording } =
useVoiceRecording({ onTextRecognized, maxDuration });
return (
<div className="flex gap-2">
{/* Voice button */}
<button onClick={handleButtonClick}>
{isRecording ? <VolumeWaveform volume={volume} /> : <Mic />}
</button>
{/* Text input area */}
<textarea value={displayValue} onChange={handleChange} />
</div>
);
}
);
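
The resampleAudio and floatToPcm16 calls inside the worklet are helper methods not shown above. The standalone sketch below shows one plausible implementation; it uses naive decimation without an anti-aliasing filter, so treat it as an illustration of the idea rather than HagiCode's exact code:

```typescript
// Naive downsampling by index mapping (48 kHz -> 16 kHz is a 3:1 ratio).
// A production resampler would apply a low-pass filter first to avoid aliasing.
function resampleAudio(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  const ratio = fromRate / toRate;
  const outLength = Math.floor(input.length / ratio);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    out[i] = input[Math.floor(i * ratio)];
  }
  return out;
}

// Convert float samples in [-1, 1] to 16-bit PCM, clamping out-of-range values.
function floatToPcm16(samples: number[] | Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}
```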

Image Uploads: Multi-Method Upload Component


We built a full-featured image upload component with support for all three upload methods, covering the most common scenarios users run into.

Core features:

  1. Three upload methods:
```typescript
// Click to upload
const handleClick = () => fileInputRef.current?.click();

// Drag-and-drop upload
const handleDrop = (e: React.DragEvent) => {
  const file = e.dataTransfer.files?.[0];
  if (file) uploadFile(file);
};

// Clipboard paste
const handlePaste = (e: ClipboardEvent) => {
  for (const item of Array.from(e.clipboardData?.items || [])) {
    if (item.type.startsWith('image/')) {
      const file = item.getAsFile();
      if (file) uploadFile(file);
    }
  }
};
```
  2. Frontend validation:

```typescript
const validateFile = (file: File): { valid: boolean; error?: string } => {
  if (!acceptedTypes.includes(file.type)) {
    return { valid: false, error: 'Only PNG, JPG, JPEG, WebP, and GIF images are allowed' };
  }
  if (file.size > maxSize) {
    return { valid: false, error: `Maximum file size is ${(maxSize / 1024 / 1024).toFixed(1)}MB` };
  }
  return { valid: true };
};
```
  3. Backend upload handler (TypeScript):

```typescript
export const Route = createFileRoute('/api/upload')({
  server: {
    handlers: {
      POST: async ({ request }) => {
        const formData = await request.formData();
        const file = formData.get('file') as File;
        // Validation
        const validation = validateFile(file);
        if (!validation.valid) {
          return Response.json({ error: validation.error }, { status: 400 });
        }
        // Save file (uploadDir, extension, and today are derived elsewhere)
        const buffer = Buffer.from(await file.arrayBuffer());
        const uuid = uuidv4();
        const filePath = join(uploadDir, `${uuid}${extension}`);
        await writeFile(filePath, buffer);
        return Response.json({ url: `/uploaded/${today}/${uuid}${extension}` });
      }
    }
  }
});
```
  1. Configure the speech recognition service:

    • Open the speech recognition settings page
    • Configure the Doubao Speech AppId and AccessToken
    • Optionally configure hotwords to improve recognition accuracy for domain-specific terms
  2. Use it in the input box:

    • Click the microphone icon on the left side of the input box
    • Start speaking after the waveform animation appears
    • Click the icon again to stop recording
    • The recognized text is automatically inserted at the cursor position
  3. Hotword configuration example (one term per line):

```text
TypeScript
React
useState
useEffect
```
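
The "inserted at the cursor position" step above is, at its core, a small string operation. Here is a sketch with a hypothetical helper name (not HagiCode's actual implementation):

```typescript
// Insert recognized text at the caret and return the new value and caret position.
function insertAtCursor(
  value: string,
  inserted: string,
  caret: number
): { value: string; caret: number } {
  const next = value.slice(0, caret) + inserted + value.slice(caret);
  return { value: next, caret: caret + inserted.length };
}
```

In a React textarea this pairs with `selectionStart` to read the caret and `setSelectionRange` to restore it after the update.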
  1. Upload methods:

    • Click the upload button to choose a file
    • Drag an image directly into the upload area
    • Use Ctrl+V to paste a screenshot from the clipboard
  2. Supported formats: PNG, JPG, JPEG, WebP, GIF

  3. Size limit: 5 MB by default (configurable)

  1. Speech recognition:

    • Microphone permission is required
    • Use in a quiet environment when possible
    • The maximum supported recording duration is 300 seconds by default (configurable)
  2. Image uploads:

    • Only common image formats are supported
    • Pay attention to file size limits
    • Uploaded images automatically receive a preview URL
  3. Security considerations:

    • Speech recognition credentials are stored on the backend
    • Image uploads go through strict server-side validation
    • HTTPS/WSS is recommended in production environments

After adding speech recognition and image uploads, the HagiCode user experience improved noticeably. Users can now interact with AI in a more natural way - speaking instead of typing, and sharing screenshots instead of describing everything manually. It feels like finally finding a more comfortable way to communicate.

While building this feature, we ran into the problem that browser WebSocket APIs do not support custom headers. In the end, we solved it with a backend proxy approach. That solution not only preserved security, but also laid the groundwork for integrating other authenticated WebSocket services later on.

The image upload component also benefits from supporting multiple upload methods, letting users choose whatever is most convenient in the moment. Clicking, dragging, or pasting all work, and each path gets the job done quickly.

“Typing is slower than talking, and talking is slower than a screenshot” fits the theme here quite well. If you are building a similar AI assistant product, I hope these experiences help, even if only a little.


If this article helped you:

Thank you for reading. If you found this article useful, feel free to like, bookmark, and share it. This content was created with AI-assisted collaboration, and the final version was reviewed and confirmed by the author.

Practical Multi-AI Provider Architecture in the HagiCode Platform


This article shares the technical approach we used under the Orleans Grain architecture to integrate two AI tools, iflow and OpenCode, through a unified IAIProvider interface, and compares the implementation differences between WebSocket and HTTP communication in detail.

There is nothing especially mysterious about it. While building HagiCode, we ran into a very practical problem: users wanted to work with different AI tools. That is hardly surprising, since everyone has their own habits. Some prefer Claude Code, some love GitHub Copilot, and some teams use tools they developed themselves.

Our initial solution was simple and direct: write dedicated integration code for each AI tool. But the drawbacks showed up quickly. The codebase filled up with if-else branches, every change required testing in multiple places, and every new tool meant writing another pile of logic from scratch.

Later, we realized it would be better to create a unified IAIProvider interface and abstract the capabilities shared by all AI providers. That way, no matter which tool is used underneath, the upper layers can call it in the same way.

Recently, the project needed to integrate two new tools: iflow and OpenCode. Both support the ACP protocol, but their communication styles are different. iflow uses WebSocket, while OpenCode uses an HTTP API. That became a useful architectural test: adapt two different transport modes behind one unified interface.

The approach shared in this article comes from our practical experience in the HagiCode project. HagiCode is an AI-assisted development platform built on the Orleans Grain architecture. It integrates with different AI providers through a unified IAIProvider interface, allowing users to flexibly choose the AI tools they prefer.

First, we defined the IAIProvider interface and abstracted the capabilities that every AI provider needs to implement:

```csharp
public interface IAIProvider
{
    string Name { get; }
    bool SupportsStreaming { get; }
    ProviderCapabilities Capabilities { get; }
    Task<AIResponse> ExecuteAsync(AIRequest request, CancellationToken cancellationToken = default);
    IAsyncEnumerable<AIStreamingChunk> StreamAsync(AIRequest request, CancellationToken cancellationToken = default);
    Task<ProviderTestResult> PingAsync(CancellationToken cancellationToken = default);
    IAsyncEnumerable<AIStreamingChunk> SendMessageAsync(AIRequest request, string? embeddedCommandPrompt = null, CancellationToken cancellationToken = default);
}
```

This interface includes several key methods:

  • ExecuteAsync: execute a one-shot AI request
  • StreamAsync: get streaming responses for real-time display
  • PingAsync: perform a health check to verify whether the provider is available
  • SendMessageAsync: send a message with support for embedded commands

IFlowCliProvider: A WebSocket-Based Implementation


iflow uses WebSocket for ACP communication. The overall architecture looks like this:

```
IFlowCliProvider → ACPSessionManager → WebSocketAcpTransport → iflow CLI
                   (dynamic port allocation + process management)
```

The core flow is also fairly straightforward:

  1. ACPSessionManager creates and manages ACP sessions.
  2. WebSocketAcpTransport handles WebSocket communication.
  3. A port is allocated dynamically, and the iflow process is started with iflow --experimental-acp --port.
  4. IAIRequestToAcpMapper and IAcpToAIResponseMapper convert requests and responses.

Here is the core code:

```csharp
private async IAsyncEnumerable<AIStreamingChunk> StreamCoreAsync(
    AIRequest request,
    string? embeddedCommandPrompt,
    [EnumeratorCancellation] CancellationToken cancellationToken)
{
    // Resolve working directory
    var resolvedWorkingDirectory = ResolveWorkingDirectory(request);
    var effectiveRequest = ApplyEmbeddedCommandPrompt(request, embeddedCommandPrompt);

    // Create ACP session
    await using var session = await _sessionManager.CreateSessionAsync(
        Name,
        resolvedWorkingDirectory,
        cancellationToken,
        request.SessionId);

    // Send prompt
    var prompt = _requestMapper.ToPromptString(effectiveRequest);
    var promptResponse = await session.SendPromptAsync(prompt, cancellationToken);

    // Receive streaming response
    await foreach (var notification in session.ReceiveUpdatesAsync(cancellationToken))
    {
        if (_responseMapper.TryConvertToStreamingChunk(notification, out var chunk))
        {
            if (chunk.Type == StreamingChunkType.Metadata && chunk.IsComplete)
            {
                yield return chunk;
                yield break;
            }
            yield return chunk;
        }
    }
}
```

There are a few design points worth calling out here:

  • Use await using to ensure the session is released correctly and avoid resource leaks.
  • Return streaming responses through IAsyncEnumerable, which naturally supports async streams.
  • Use Metadata chunks to determine completion and ensure the full response has been received.
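
The completion check in the last bullet is a general pattern worth seeing on its own: consume chunks until a completing metadata chunk arrives. Below is an illustrative TypeScript version under assumed chunk shapes (not HagiCode's actual types):

```typescript
// Assumed chunk shape, loosely mirroring AIStreamingChunk in the C# code.
interface StreamingChunk {
  type: 'contentDelta' | 'metadata';
  content?: string;
  isComplete?: boolean;
}

// Concatenate content deltas until a completing metadata chunk is seen,
// mirroring the yield / yield break flow in StreamCoreAsync above.
async function collectStream(chunks: AsyncIterable<StreamingChunk>): Promise<string> {
  let text = '';
  for await (const chunk of chunks) {
    if (chunk.type === 'metadata' && chunk.isComplete) break;
    if (chunk.type === 'contentDelta') text += chunk.content ?? '';
  }
  return text;
}
```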

OpenCodeCliProvider: An HTTP API-Based Implementation


OpenCode provides its service through an HTTP API, so the architecture is slightly different:

```
OpenCodeCliProvider → OpenCodeRuntimeManager → OpenCodeClient → OpenCode HTTP API
                      OpenCodeProcessManager → opencode process management
```

A notable feature of OpenCode is that it uses an SQLite database to persist session bindings. That makes session recovery and prompt-response recovery possible:

```csharp
private async Task<OpenCodePromptExecutionResult> ExecutePromptAsync(
    AIRequest request,
    string? embeddedCommandPrompt,
    CancellationToken cancellationToken)
{
    var prompt = BuildPrompt(request, embeddedCommandPrompt);
    var resolvedWorkingDirectory = ResolveWorkingDirectory(request.WorkingDirectory);
    var client = await _runtimeManager.GetClientAsync(resolvedWorkingDirectory, cancellationToken);
    var bindingSessionId = request.SessionId;
    var boundSession = TryGetBinding(bindingSessionId, resolvedWorkingDirectory);

    // Try to use the already bound session
    if (boundSession is not null)
    {
        try
        {
            return await PromptSessionAsync(
                client,
                boundSession,
                BuildPromptRequest(request, prompt, CreatePromptMessageId()),
                request.Model ?? _settings.Model,
                cancellationToken);
        }
        catch (OpenCodeApiException ex) when (IsStaleBinding(ex))
        {
            // The session has expired, remove the binding
            RemoveBinding(bindingSessionId);
        }
    }

    // Create a new session
    var session = await client.Session.CreateAsync(new OpenCodeSessionCreateRequest
    {
        Title = BuildSessionTitle(request)
    }, cancellationToken);
    BindSession(bindingSessionId, session.Id, resolvedWorkingDirectory);
    return await PromptSessionAsync(client, session.Id, ...);
}
```

This implementation has several interesting highlights:

  • Session binding mechanism: the same SessionId reuses the same OpenCode session, avoiding repeated session creation.
  • Expiration handling: when a session is found to be expired, the binding is automatically cleaned up.
  • Database persistence: bindings are stored in SQLite and remain effective after restart.
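
The binding mechanism in these bullets reduces to a lookup table with stale-entry cleanup. A simplified in-memory model (HagiCode persists bindings in SQLite; this sketch only shows the logic):

```typescript
// Simplified session-binding cache: maps our SessionId to the
// provider-side session ID, with cleanup for stale bindings.
class SessionBindingCache {
  private bindings = new Map<string, string>();

  // Bind our SessionId to the provider session.
  bind(sessionId: string, providerSessionId: string): void {
    this.bindings.set(sessionId, providerSessionId);
  }

  // Reuse an existing binding if one exists.
  tryGet(sessionId: string): string | undefined {
    return this.bindings.get(sessionId);
  }

  // Called when the provider reports the bound session no longer exists.
  removeStale(sessionId: string): void {
    this.bindings.delete(sessionId);
  }
}
```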

| Aspect | IFlowCliProvider | OpenCodeCliProvider |
| --- | --- | --- |
| Communication | WebSocket (ACP) | HTTP API |
| Process management | ACPSessionManager | OpenCodeProcessManager |
| Port allocation | Dynamic port | No port (uses HTTP) |
| Session management | ACPSession | OpenCodeSession |
| Persistence | In-memory cache | SQLite database |
| Startup command | iflow --experimental-acp --port | opencode |
| Latency | Lower (long-lived connection) | Relatively higher (HTTP requests) |

Which approach you choose depends mainly on your needs. WebSocket is better for scenarios with high real-time requirements, while an HTTP API is simpler and easier to debug.

First, enable the two providers in the configuration file:

```yaml
AI:
  Providers:
    IFlowCli:
      Type: "IFlowCli"
      Enabled: true
      ExecutablePath: "iflow"
      Model: null
      WorkingDirectory: null
    OpenCodeCli:
      Type: "OpenCodeCli"
      Enabled: true
      ExecutablePath: "opencode"
      Model: "anthropic/claude-sonnet-4"
      WorkingDirectory: null
      OpenCode:
        Enabled: true
        BaseUrl: "http://localhost:38376"
        ExecutablePath: "opencode"
        StartupTimeoutSeconds: 30
        RequestTimeoutSeconds: 120
```

Then obtain a provider through the factory and execute requests. For iflow:

```csharp
// Get provider through the factory
var provider = await _providerFactory.GetProviderAsync(AIProviderType.IFlowCli);

// Execute an AI request
var request = new AIRequest
{
    Prompt = "Please help me refactor this function",
    WorkingDirectory = "/path/to/project",
    Model = "claude-sonnet-4"
};

// Get the complete response
var response = await provider.ExecuteAsync(request, cancellationToken);
Console.WriteLine(response.Content);

// Or use streaming responses
await foreach (var chunk in provider.StreamAsync(request, cancellationToken))
{
    if (chunk.Type == StreamingChunkType.ContentDelta)
    {
        Console.Write(chunk.Content);
    }
}
```

OpenCode is called the same way:

```csharp
// Get provider through the factory
var provider = await _providerFactory.GetProviderAsync(AIProviderType.OpenCodeCli);
var request = new AIRequest
{
    Prompt = "Please help me analyze this error",
    WorkingDirectory = "/path/to/project",
    Model = "anthropic/claude-sonnet-4"
};
var response = await provider.ExecuteAsync(request, cancellationToken);
Console.WriteLine(response.Content);
```

Before startup or before use, you can check whether the provider is available:

```csharp
var iflowResult = await iflowProvider.PingAsync(cancellationToken);
if (!iflowResult.Success)
{
    Console.WriteLine($"IFlow is unavailable: {iflowResult.ErrorMessage}");
    return;
}

var openCodeResult = await openCodeProvider.PingAsync(cancellationToken);
if (!openCodeResult.Success)
{
    Console.WriteLine($"OpenCode is unavailable: {openCodeResult.ErrorMessage}");
    return;
}
```

Both providers support embedded commands, such as /file:xxx:

```csharp
var request = new AIRequest
{
    Prompt = "Analyze the problems in this file",
    SystemMessage = "You are a code analysis expert"
};
await foreach (var chunk in provider.SendMessageAsync(
    request,
    embeddedCommandPrompt: "/file:src/main.cs",
    cancellationToken))
{
    Console.Write(chunk.Content);
}
```

IFlow uses long-lived WebSocket connections, so resource management deserves special attention:

  • Use await using to ensure sessions are released properly.
  • Cancellation triggers process cleanup.
  • ACPSessionManager supports a maximum session count limit.

OpenCode process management is relatively simpler, and OpenCodeRuntimeManager handles it automatically.

Both providers have complete error handling:

  • IFlow errors are propagated through ACP session updates.
  • OpenCode errors are thrown through OpenCodeApiException.
  • It is recommended that the caller catch and handle these exceptions.

A few performance notes:

  • IFlow WebSocket communication has lower latency than HTTP.
  • OpenCode session reuse can reduce the overhead of HTTP requests.
  • The factory cache mechanism avoids repeatedly creating providers.
  • In high-concurrency scenarios, pay close attention to the limits on process count and connection count.

The executable path is validated at startup, but runtime issues can still happen. PingAsync is a useful tool for verifying whether the configuration is correct:

```csharp
// Check at startup
var provider = await _providerFactory.GetProviderAsync(providerType);
var result = await provider.PingAsync(cancellationToken);
if (!result.Success)
{
    _logger.LogError("Provider {ProviderType} is unavailable: {Error}", providerType, result.ErrorMessage);
}
```

This article shares the technical approach used by the HagiCode platform when integrating the two AI tools iflow and OpenCode. Through a unified IAIProvider interface, we adapted different communication styles, WebSocket and HTTP, while keeping the upper-layer calling pattern consistent.

The core idea is actually quite simple:

  1. Define a unified interface abstraction.
  2. Build adapter layers for different implementations.
  3. Manage everything uniformly through the factory pattern.
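
The three steps above can be sketched as a small TypeScript model: one interface, concrete adapters behind it, and a caching factory. This is illustrative only, not the actual Orleans-based implementation:

```typescript
// Step 1: a unified interface abstraction (simplified).
interface AIProvider {
  name: string;
  execute(prompt: string): Promise<string>;
}

// Step 3: a factory that caches provider instances per type; step 2
// (the concrete adapters) is what each registered creator would build.
class ProviderFactory {
  private cache = new Map<string, AIProvider>();
  constructor(private registry: Map<string, () => AIProvider>) {}

  get(type: string): AIProvider {
    let provider = this.cache.get(type);
    if (!provider) {
      const create = this.registry.get(type);
      if (!create) throw new Error(`Unknown provider type: ${type}`);
      provider = create();
      this.cache.set(type, provider);
    }
    return provider;
  }
}
```

Adding a new tool then means registering one more creator function; callers keep using `factory.get(type)` unchanged.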

That gives the system good extensibility. When a new AI tool needs to be integrated later, all we need to do is implement the IAIProvider interface without changing too much existing code.

If you are also working on multi-AI-tool integration, I hope this article is helpful.



Guide to Implementing Hotword Support for Doubao Speech Recognition


This article explains in detail how to implement hotword support for Doubao speech recognition in the HagiCode project. By using both custom hotwords and platform hotword tables, you can significantly improve recognition accuracy for domain-specific vocabulary.

Speech recognition technology has developed for many years, yet one problem has consistently bothered developers. General-purpose speech recognition models can cover everyday language, but they often fall short when it comes to professional terminology, product names, and personal names. Think about it: a voice assistant in the medical field needs to accurately recognize terms like “hypertension,” “diabetes,” and “coronary heart disease”; a legal system needs to precisely capture terms such as “cause of action,” “defense,” and “burden of proof.” In these scenarios, a general-purpose model is trying its best, but that is often not enough.

We ran into the same challenge in the HagiCode project. As a multifunctional AI coding assistant, HagiCode needs to handle speech recognition for a wide range of technical terminology. However, the Doubao speech recognition API, in its default configuration, could not fully meet our accuracy requirements for specialized terms. It is not that Doubao is not good enough; rather, every domain has its own terminology system. After some research and technical exploration, we found that the Doubao speech recognition API actually provides hotword support. With a straightforward configuration, it can significantly improve the recognition accuracy of specific vocabulary. In a sense, once you tell it which words to pay attention to, it listens for them more carefully.

What this article shares is the complete solution we used in the HagiCode project to implement Doubao speech recognition hotwords. Both modes, custom hotwords and platform hotword tables, are available, and they can also be combined. With this solution, developers can flexibly configure hotwords based on business scenarios so the speech recognition system can better “recognize” professional, uncommon, yet critical vocabulary.

The solution shared in this article comes from our practical experience in the HagiCode project. HagiCode is an open-source AI coding assistant project with a modern technology stack, designed to provide developers with an intelligent programming assistance experience. As a complex multilingual, multi-platform project, HagiCode needs to handle speech recognition scenarios involving many technical terms, which in turn drove our research into and implementation of the hotword feature.

If you are interested in HagiCode’s technical implementation, you can visit the GitHub repository for more details, or check out our official documentation for the complete installation and usage guide.

The Doubao speech recognition API provides two ways to configure hotwords, and each one has its own ideal use cases and advantages.

Custom hotword mode lets us pass hotword text directly through the corpus.context field. This approach is especially suitable for scenarios where you need to quickly configure a small number of hotwords, such as temporarily recognizing a product name or a person’s name. In HagiCode’s implementation, we parse the multi-line hotword text entered by the user into a list of strings, then format it into the context_data array required by the Doubao API. This approach is very direct: you simply tell the system which words to pay attention to, and it does exactly that.

Platform hotword table mode uses the corpus.boosting_table_id field to reference a preconfigured hotword table in the Doubao self-learning platform. This approach is suitable for scenarios where you need to manage a large number of hotwords. We can create and maintain hotword tables on the Doubao self-learning platform, then reference them by ID. For a project like HagiCode, where specialized terms need to be continuously updated and maintained, this mode offers much better manageability. Once the number of hotwords grows, having a centralized place to manage them is far better than entering them manually every time.

Interestingly, these two modes can also be used together. The Doubao API supports including both custom hotwords and a platform hotword table ID in the same request, with the combination strategy controlled by the combine_mode parameter. This flexibility allows HagiCode to handle a wide range of complex professional terminology recognition needs. Sometimes, combining multiple approaches produces better results.
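
Putting the two modes together, the corpus section of a request might be assembled as follows. The field names (context, context_data, boosting_table_id, combine_mode) follow the description above, but treat the exact JSON layout and the combine_mode value as assumptions to verify against the Doubao API reference:

```typescript
// Assemble the corpus section of a recognition request from hotword settings.
// Field layout is an assumption based on the text above; check the Doubao docs.
function buildCorpus(
  contextText: string,
  boostingTableId: string,
  combineMode: boolean
): Record<string, string | number> {
  const corpus: Record<string, string | number> = {};
  const words = contextText
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  if (words.length > 0) {
    // Custom hotword mode: hotwords go into a context_data array
    corpus.context = JSON.stringify({ context_data: words });
  }
  if (boostingTableId.trim().length > 0) {
    // Platform hotword table mode
    corpus.boosting_table_id = boostingTableId.trim();
  }
  if (combineMode && words.length > 0 && boostingTableId.trim().length > 0) {
    corpus.combine_mode = 1; // assumed value; check the API reference
  }
  return corpus;
}
```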

In HagiCode’s frontend implementation, we defined a complete set of hotword configuration types and validation logic. The first part is the type definition:

```typescript
export interface HotwordConfig {
  contextText: string;      // Multi-line hotword text
  boostingTableId: string;  // Doubao platform hotword table ID
  combineMode: boolean;     // Whether to use both together
}
```

This simple interface contains all configuration items for the hotword feature. Among them, contextText is the part users interact with most directly: we allow users to enter one hotword phrase per line, which is very intuitive. Asking users to enter one term per line is much easier than making them understand a complicated configuration format.

Next comes the validation function. Based on the Doubao API limitations, we defined strict validation rules: at most 100 lines of hotword text, up to 50 characters per line, and no more than 5000 characters in total; boosting_table_id can be at most 200 characters and may contain only letters, numbers, underscores, and hyphens. These limits are not arbitrary; they come directly from the official Doubao documentation. API limits are API limits, and we have to follow them.

```typescript
export function validateContextText(contextText: string): HotwordValidationResult {
  if (!contextText || contextText.trim().length === 0) {
    return { isValid: true, errors: [] };
  }
  const lines = contextText.split('\n').filter(line => line.trim().length > 0);
  const errors: string[] = [];
  if (lines.length > 100) {
    errors.push(`Hotword line count cannot exceed 100 lines; current count is ${lines.length}`);
  }
  const totalChars = contextText.length;
  if (totalChars > 5000) {
    errors.push(`Total hotword character count cannot exceed 5000; current count is ${totalChars}`);
  }
  for (let i = 0; i < lines.length; i++) {
    if (lines[i].length > 50) {
      errors.push(`Hotword on line ${i + 1} exceeds the 50-character limit`);
    }
  }
  return { isValid: errors.length === 0, errors };
}

export function validateBoostingTableId(boostingTableId: string): HotwordValidationResult {
  if (!boostingTableId || boostingTableId.trim().length === 0) {
    return { isValid: true, errors: [] };
  }
  const errors: string[] = [];
  if (boostingTableId.length > 200) {
    errors.push(`boosting_table_id cannot exceed 200 characters; current count is ${boostingTableId.length}`);
  }
  if (!/^[a-zA-Z0-9_-]+$/.test(boostingTableId)) {
    errors.push('boosting_table_id can contain only letters, numbers, underscores, and hyphens');
  }
  return { isValid: errors.length === 0, errors };
}
```

These validation functions run immediately when the user configures hotwords, ensuring that problems are caught as early as possible. From a user experience perspective, this kind of instant feedback is very important. It is always better for users to know what is wrong while they are typing rather than after they submit.

In HagiCode’s frontend implementation, we chose to use the browser’s localStorage to store hotword configuration. There were several considerations behind this design decision. First, hotword configuration is highly personalized, and different users may have different domain-specific needs. Second, this approach simplifies the backend implementation because it does not require extra database tables or API endpoints. Finally, after users configure it once in the browser, the settings can be loaded automatically on subsequent uses, which is very convenient. Put simply, it is the easiest approach.

```typescript
const HOTWORD_STORAGE_KEYS = {
  contextText: 'hotword-context-text',
  boostingTableId: 'hotword-boosting-table-id',
  combineMode: 'hotword-combine-mode',
} as const;

export const DEFAULT_HOTWORD_CONFIG: HotwordConfig = {
  contextText: '',
  boostingTableId: '',
  combineMode: false,
};

// Load hotword configuration
export function loadHotwordConfig(): HotwordConfig {
  const contextText = localStorage.getItem(HOTWORD_STORAGE_KEYS.contextText) || '';
  const boostingTableId = localStorage.getItem(HOTWORD_STORAGE_KEYS.boostingTableId) || '';
  const combineMode = localStorage.getItem(HOTWORD_STORAGE_KEYS.combineMode) === 'true';
  return { contextText, boostingTableId, combineMode };
}

// Save hotword configuration
export function saveHotwordConfig(config: HotwordConfig): void {
  localStorage.setItem(HOTWORD_STORAGE_KEYS.contextText, config.contextText);
  localStorage.setItem(HOTWORD_STORAGE_KEYS.boostingTableId, config.boostingTableId);
  localStorage.setItem(HOTWORD_STORAGE_KEYS.combineMode, String(config.combineMode));
}
```

The logic in this code is straightforward and clear. We read from localStorage when loading configuration, and write to localStorage when saving it. We also provide a default configuration so the system can still work properly when no configuration exists yet. There has to be a sensible default, after all.

In HagiCode’s backend implementation, we needed to add hotword-related properties to the SDK configuration class. Taking C# language characteristics and usage patterns into account, we used List<string> to store custom hotword contexts:

public class DoubaoVoiceConfig
{
    /// <summary>
    /// App ID
    /// </summary>
    public string AppId { get; set; } = string.Empty;

    /// <summary>
    /// Access token
    /// </summary>
    public string AccessToken { get; set; } = string.Empty;

    /// <summary>
    /// Service URL
    /// </summary>
    public string ServiceUrl { get; set; } = string.Empty;

    /// <summary>
    /// Custom hotword context list
    /// </summary>
    public List<string>? HotwordContexts { get; set; }

    /// <summary>
    /// Doubao platform hotword table ID
    /// </summary>
    public string? BoostingTableId { get; set; }
}

The design of this configuration class follows HagiCode’s usual concise style. HotwordContexts is a nullable list type, and BoostingTableId is a nullable string, so when there is no hotword configuration, these properties have no effect on the request at all. If you are not using the feature, it should stay out of the way.

Payload construction is the core of the entire hotword feature. Once we have hotword configuration, we need to format it into the JSON structure required by the Doubao API. This process happens before the SDK sends the request:

private void AddCorpusToRequest(Dictionary<string, object> request)
{
    var corpus = new Dictionary<string, object>();

    // Add custom hotwords
    if (Config.HotwordContexts != null && Config.HotwordContexts.Count > 0)
    {
        corpus["context"] = new Dictionary<string, object>
        {
            ["context_type"] = "dialog_ctx",
            ["context_data"] = Config.HotwordContexts
                .Select(text => new Dictionary<string, object> { ["text"] = text })
                .ToList()
        };
    }

    // Add platform hotword table ID
    if (!string.IsNullOrEmpty(Config.BoostingTableId))
    {
        corpus["boosting_table_id"] = Config.BoostingTableId;
    }

    // Add corpus to the request only when it is not empty
    if (corpus.Count > 0)
    {
        request["corpus"] = corpus;
    }
}

This code shows how to dynamically construct the corpus field based on configuration. The key point is that we add the corpus field only when hotword configuration actually exists. This design ensures backward compatibility: when no hotwords are configured, the request structure remains exactly the same as before. Backward compatibility matters; adding a feature should not disrupt existing logic.
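For illustration, if two custom hotwords and a table ID were both configured, the corpus fragment this method attaches to the request would look roughly like the following. The field names mirror the code above; the exact wire format is defined by the Doubao API documentation:

```json
{
  "corpus": {
    "context": {
      "context_type": "dialog_ctx",
      "context_data": [
        { "text": "高血压" },
        { "text": "糖尿病" }
      ]
    },
    "boosting_table_id": "medical_table_v1"
  }
}
```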

Between the frontend and backend, hotword parameters are passed through WebSocket control messages. HagiCode is designed so that when the frontend starts recording, it loads the hotword configuration from localStorage and sends it to the backend through a WebSocket message.

const controlMessage = {
  type: 'control',
  payload: {
    command: 'StartRecognition',
    contextText: '高血压\n糖尿病\n冠心病',
    boosting_table_id: 'medical_table',
    combineMode: false
  }
};

There is one detail to note here: the frontend passes multi-line text separated by newline characters, and the backend needs to parse it. The backend WebSocket handler parses these parameters and passes them to the SDK:

private async Task HandleControlMessageAsync(
    string connectionId,
    DoubaoSession session,
    ControlMessage message)
{
    if (message.Payload is SessionControlRequest controlRequest)
    {
        // Parse hotword parameters
        string? contextText = controlRequest.ContextText;
        string? boostingTableId = controlRequest.BoostingTableId;
        bool? combineMode = controlRequest.CombineMode;

        // Parse multi-line text into a hotword list
        if (!string.IsNullOrEmpty(contextText))
        {
            var hotwords = contextText
                .Split('\n', StringSplitOptions.RemoveEmptyEntries)
                .Select(s => s.Trim())
                .Where(s => s.Length > 0)
                .ToList();
            session.HotwordContexts = hotwords;
        }

        session.BoostingTableId = boostingTableId;
    }
}

With this design, passing hotword configuration from frontend to backend becomes clear and efficient. There is nothing especially mysterious about it; the data is simply passed through layer by layer.

In real usage, configuring custom hotwords is very simple. Open the speech recognition settings page in HagiCode and find the “Hotword Configuration” section. In the “Custom Hotword Text” input box, enter one hotword phrase per line.

For example, if you are developing a medical-related application, you could configure it like this:

高血压
糖尿病
冠心病
心绞痛
心肌梗死
心力衰竭

After you save the configuration, these hotwords are automatically passed to the Doubao API every time speech recognition starts. In our tests, recognition accuracy for the configured professional terms improved noticeably once the hotwords were in place.

If you need to manage a large number of hotwords, or if the hotwords need frequent updates, the platform hotword table mode is a better fit. First, create a hotword table on the Doubao self-learning platform and obtain the generated boosting_table_id, then enter this ID on the HagiCode settings page.

The Doubao self-learning platform provides capabilities such as bulk import and categorized management for hotwords, which is very practical for teams that need to manage large sets of specialized terminology. By managing hotwords on the platform, you can maintain them centrally and roll out updates consistently. Once the hotword list becomes large, having a single place to manage it is much more practical than manual entry every time.

In some complex scenarios, you may need to use both custom hotwords and a platform hotword table at the same time. In that case, simply configure both in HagiCode and enable the “Combination Mode” switch.

In combination mode, the Doubao API considers both hotword sources at the same time, so recognition accuracy is usually higher than using either source alone. However, it is worth noting that combination mode increases request complexity, so it is best to decide whether to enable it after practical testing. More complexity is only worth it if the real-world results justify it.
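Putting it together, a StartRecognition control message with combination mode enabled would carry both sources at once. The field names follow the earlier control-message example; the values are illustrative:

```json
{
  "type": "control",
  "payload": {
    "command": "StartRecognition",
    "contextText": "高血压\n糖尿病",
    "boosting_table_id": "medical_table_v1",
    "combineMode": true
  }
}
```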

Integrating the hotword feature into the HagiCode project is very straightforward. Here are some commonly used code snippets:

import {
  loadHotwordConfig,
  saveHotwordConfig,
  validateHotwordConfig,
  parseContextText,
  getEffectiveHotwordMode,
  type HotwordConfig
} from '@/types/hotword';

// Load and validate configuration
const config = loadHotwordConfig();
const validation = validateHotwordConfig(config);
if (!validation.isValid) {
  console.error('Hotword configuration validation failed:', validation.errors);
  return;
}

// Parse hotword text
const hotwords = parseContextText(config.contextText);
console.log('Parsed hotwords:', hotwords);

// Get effective hotword mode
const mode = getEffectiveHotwordMode(config);
console.log('Current hotword mode:', mode);

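The `parseContextText` and `getEffectiveHotwordMode` helpers imported above are not shown in this article. Based on how they are used here, a minimal sketch of what they might look like could be:

```typescript
// Sketch only: the real helpers live in '@/types/hotword'.
interface HotwordConfig {
  contextText: string;
  boostingTableId: string;
  combineMode: boolean;
}

// Split multi-line hotword text into a clean list,
// mirroring the backend's newline-splitting logic.
function parseContextText(contextText: string): string[] {
  return contextText
    .split('\n')
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Which hotword mechanism is effective for the current config.
// The mode names and the precedence rules are assumptions here.
type HotwordMode = 'none' | 'context' | 'table' | 'combined';

function getEffectiveHotwordMode(config: HotwordConfig): HotwordMode {
  const hasContext = parseContextText(config.contextText).length > 0;
  const hasTable = config.boostingTableId.trim().length > 0;
  if (hasContext && hasTable && config.combineMode) return 'combined';
  if (hasContext) return 'context';
  if (hasTable) return 'table';
  return 'none';
}
```

Note that this sketch deliberately reuses the same trim-and-filter logic as the C# handler above, so the frontend and backend agree on what counts as an empty line.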
Backend usage is similarly concise:

var config = new DoubaoVoiceConfig
{
    AppId = "your_app_id",
    AccessToken = "your_access_token",
    ServiceUrl = "wss://openspeech.bytedance.com/api/v3/sauc/bigmodel_async",
    // Configure custom hotwords
    HotwordContexts = new List<string>
    {
        "高血压",
        "糖尿病",
        "冠心病"
    },
    // Configure platform hotword table
    BoostingTableId = "medical_table_v1"
};

var client = new DoubaoVoiceClient(config, logger);
await client.ConnectAsync();
await client.SendFullClientRequest();

There are several points that deserve special attention when implementing and using the hotword feature.

First is the character limit. The Doubao API has strict restrictions on hotwords, including line count, characters per line, and total character count. If any limit is exceeded, the API returns an error. In HagiCode’s frontend implementation, we check these constraints during user input through validation functions, which prevents invalid configurations from being sent to the backend. Catching problems early is always better than waiting for the API to fail.

Second is the format of boosting_table_id. This field allows only letters, numbers, underscores, and hyphens, and it cannot contain spaces or other special characters. When creating a hotword table on the Doubao self-learning platform, be sure to follow the naming rules. That kind of strict format validation is common for APIs.

Third is backward compatibility. Hotword parameters are entirely optional. If no hotwords are configured, the system behaves exactly as it did before. This design ensures that existing users are not affected in any way, and it also makes gradual migration and upgrades easier. Adding a feature should not disrupt the previous logic.

Finally, there is error handling. When hotword configuration is invalid, the Doubao API returns corresponding error messages. HagiCode’s implementation records detailed logs to help developers troubleshoot issues. At the same time, the frontend displays validation errors in the UI to help users correct the configuration. Good error handling naturally leads to a better user experience.

Through this article, we have provided a detailed introduction to the complete solution for implementing Doubao speech recognition hotwords in the HagiCode project. This solution covers the entire process from requirement analysis and technical selection to code implementation, giving developers a practical example they can use for reference.

The key points can be summarized as follows. First, the Doubao API supports both custom hotwords and platform hotword tables, and they can be used independently or in combination. Second, the frontend uses localStorage to store configuration in a simple and efficient way. Third, the backend passes hotword parameters by dynamically constructing the corpus field, preserving strong backward compatibility. Fourth, comprehensive validation logic ensures configuration correctness and avoids invalid requests. Overall, the solution is not complicated; it simply follows the API requirements carefully.

Implementing the hotword feature further strengthens HagiCode’s capabilities in the speech recognition domain. By flexibly configuring business-related professional terms, developers can help the speech recognition system better understand content from specific domains and therefore provide more accurate services. Ultimately, technology should serve real business needs, and solving practical problems is what matters most.

If you found this article helpful, feel free to give HagiCode a Star on GitHub. Your support motivates us to keep sharing technical practice and experience. In the end, writing and sharing technical content that helps others is a pleasure in itself.


Thank you for reading. If you found this article useful, click the like button below 👍 so more people can discover it.

This content was created with AI-assisted collaboration, reviewed by me, and reflects my own views and positions.

Solving Browser WebSocket Authentication Challenges: A Practical Proxy Pattern for Doubao Speech Recognition

Solving Browser WebSocket Authentication Challenges: A Practical Proxy Pattern for Doubao Speech Recognition

Section titled “Solving Browser WebSocket Authentication Challenges: A Practical Proxy Pattern for Doubao Speech Recognition”

The browser WebSocket API does not support custom HTTP headers, which creates a challenge for speech recognition services that require authentication data in headers. This article shares how we solved that problem in the HagiCode project with a backend proxy pattern, and how the approach evolved from playground experiments to production use.

When we started building speech recognition for the HagiCode project, we confidently chose ByteDance’s Doubao speech recognition service. The initial design was straightforward: let the frontend connect directly to Doubao’s WebSocket service. How hard could that be? Just open a connection and send some data, right?

Then came the surprise: Doubao’s API requires authentication information to be passed through HTTP headers, including things like accessToken and secretKey. That immediately became awkward, because the browser WebSocket API simply does not support setting custom headers.

So what do you do when the browser will not let you send them?

At the time, we weighed two options:

  1. Put the credentials into URL query parameters - simple and blunt
  2. Add a proxy layer on the backend - more work at first glance

The first option exposes credentials directly in frontend code and local storage. Is that really safe? I was not comfortable with it. On top of that, some APIs require header-based verification, so this approach is not even viable.

In the end, we chose the second option: implement a WebSocket proxy on the backend. This pattern was first validated in our playground environment, and only after we confirmed its stability did we move it into production. After all, nobody wants production to double as a lab experiment.

The approach shared in this article comes from our practical experience in the HagiCode project.

HagiCode is an AI coding assistant project with voice interaction support. Because we needed to call a speech recognition service from the frontend, we ran straight into this WebSocket authentication problem, which led us to the solution described here. Sometimes these technical roadblocks are frustrating, but they also force you to learn patterns that turn out to be useful later.

The standard WebSocket API looks wonderfully simple:

const ws = new WebSocket('wss://example.com/ws');

But that simplicity is exactly where the problem lies - it only passes parameters in the URL, and it cannot set headers the way an HTTP request can:

// This is not supported in the browser WebSocket API
const ws = new WebSocket('wss://example.com/ws', {
  headers: {
    'Authorization': 'Bearer token'
  }
});

And that is the core issue. For services like Doubao speech recognition that depend on header-based authentication, this limitation is a hard blocker.

Once you accept that constraint, the architecture has to change.
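For contrast, a server-side WebSocket client is not bound by this limitation. In Node, for example, the widely used `ws` package accepts custom headers when opening a connection. The sketch below assumes that package; the header names are illustrative placeholders, not Doubao's actual authentication scheme:

```typescript
// Build the headers a server-side proxy would attach when dialing
// the upstream speech service. Header names here are hypothetical
// placeholders, not Doubao's real header scheme.
function buildAuthHeaders(appId: string, accessToken: string): Record<string, string> {
  return {
    'X-Api-App-Id': appId,
    'Authorization': `Bearer ${accessToken}`,
  };
}

// With the `ws` package (server-side only), these headers can be
// passed directly, which the browser constructor cannot do:
//
//   import WebSocket from 'ws';
//   const upstream = new WebSocket(serviceUrl, {
//     headers: buildAuthHeaders(appId, accessToken),
//   });
```

This capability gap between browser and server clients is exactly what makes the proxy pattern work.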

When designing the solution, we compared the trade-offs carefully.

Decision 1: Choosing the proxy pattern

We compared two approaches:

Option           | Pros                                   | Cons                                           | Decision
Native WebSocket | Lightweight, simple, direct forwarding | Connection management must be handled manually | Chosen
SignalR          | Automatic reconnection, strong typing  | Overly complex, extra dependencies             | Rejected

We ultimately chose native WebSocket. To be honest, it was the lightest option and a better fit for simple bidirectional binary stream forwarding. Pulling in SignalR would have felt like overengineering for this use case, and it could add extra latency.

Decision 2: Connection management strategy

We adopted a “one connection, one session” model - each frontend WebSocket connection maps to its own independent Doubao backend connection.

The benefits are straightforward:

  • Simple to implement and aligned with the common usage pattern
  • Easier to debug and troubleshoot
  • Good resource isolation, preventing interference between sessions

Put simply, the direct solution is sometimes the best one. Complexity does not automatically make a design better.

Decision 3: Storing authentication data

Credentials are stored in backend configuration files (appsettings.yml or environment variables) and loaded through dependency injection:

  • Simple configuration model that matches existing backend conventions
  • Sensitive data never reaches the frontend
  • Supports multi-environment setup for development, testing, and production

That level of separation matters. No one wants credentials floating around in places they should not be.
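As a rough sketch of that approach, a proxy written in TypeScript might load and validate the same credentials from environment variables at startup. The variable names below are hypothetical; HagiCode's backend uses appsettings.yml and dependency injection rather than this exact shape:

```typescript
// Hypothetical environment-based credential loading for a proxy.
// Variable names are illustrative, not HagiCode's actual config keys.
interface ProxyCredentials {
  appId: string;
  accessToken: string;
}

function loadCredentials(env: Record<string, string | undefined>): ProxyCredentials {
  const appId = env.DOUBAO_APP_ID ?? '';
  const accessToken = env.DOUBAO_ACCESS_TOKEN ?? '';
  // Fail fast at startup rather than on the first client connection.
  if (!appId || !accessToken) {
    throw new Error('DOUBAO_APP_ID and DOUBAO_ACCESS_TOKEN must be set');
  }
  return { appId, accessToken };
}

// At startup: const creds = loadCredentials(process.env);
```

Failing fast at startup mirrors the validation the article describes: configuration problems surface immediately instead of manifesting as mysterious connection errors later.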

The overall data flow looks like this:

Frontend (browser)
    │  ws://backend/api/voice/ws  (WebSocket, binary)
    ▼
Backend (proxy)
    │  wss://openspeech.bytedance.com/  (with authentication headers)
    ▼
Doubao API

The flow itself is not complicated:

  1. The frontend connects to the backend proxy through WebSocket
  2. The backend proxy receives audio data and connects to the Doubao API with authenticated headers
  3. The Doubao API returns recognition results, and the proxy forwards them to the frontend
  4. The whole process remains fully asynchronous with bidirectional streaming

Once the responsibilities are split clearly, the design becomes quite natural.

app.Map("/ws", async context =>
{
    if (context.WebSockets.IsWebSocketRequest)
    {
        // Read configuration from query parameters
        var appId = context.Request.Query["appId"];
        var accessToken = context.Request.Query["accessToken"];

        // Validate required parameters
        if (string.IsNullOrEmpty(appId) || string.IsNullOrEmpty(accessToken))
        {
            context.Response.StatusCode = 400;
            return;
        }

        // Accept the WebSocket connection
        using var webSocket = await context.WebSockets.AcceptWebSocketAsync();

        // Message handling loop
        var buffer = new byte[4096];
        while (!webSocket.CloseStatus.HasValue)
        {
            var result = await webSocket.ReceiveAsync(buffer, CancellationToken.None);
            if (result.MessageType == WebSocketMessageType.Close)
            {
                await webSocket.CloseAsync(
                    result.CloseStatus.Value,
                    result.CloseStatusDescription,
                    CancellationToken.None);
                break;
            }
            // Process audio data
            await HandleAudioDataAsync(buffer, result.Count);
        }
    }
});

public class DoubaoSessionManager : IDoubaoSessionManager
{
    private readonly ConcurrentDictionary<string, DoubaoSession> _sessions = new();

    public DoubaoSession CreateSession(string connectionId)
    {
        var session = new DoubaoSession(connectionId);
        _sessions[connectionId] = session;
        return session;
    }

    public async Task SendAudioAsync(string connectionId, byte[] audioData)
    {
        if (_sessions.TryGetValue(connectionId, out var session))
        {
            await session.SendAudioAsync(audioData);
        }
    }

    public void RemoveSession(string connectionId)
    {
        if (_sessions.TryRemove(connectionId, out var session))
        {
            session.Dispose();
        }
    }
}

Using ConcurrentDictionary for session management means thread safety is largely handled for us. Each incoming connection gets its own session, and cleanup happens automatically on disconnect.
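For comparison, the same one-connection-one-session registry on a single-threaded runtime like Node needs no concurrent collection; a plain Map is enough. The `Session` interface below is a stand-in for the real session type, trimmed to what the registry needs:

```typescript
// Stand-in for the real session type; only what the registry uses.
interface Session {
  sendAudio(data: Uint8Array): Promise<void>;
  dispose(): void;
}

class SessionRegistry {
  private sessions = new Map<string, Session>();

  add(connectionId: string, session: Session): void {
    this.sessions.set(connectionId, session);
  }

  async sendAudio(connectionId: string, data: Uint8Array): Promise<void> {
    await this.sessions.get(connectionId)?.sendAudio(data);
  }

  // Remove and dispose in one step so a dropped connection
  // cannot leak its upstream Doubao socket.
  remove(connectionId: string): void {
    const session = this.sessions.get(connectionId);
    if (session) {
      this.sessions.delete(connectionId);
      session.dispose();
    }
  }
}
```

The important invariant is the same in both languages: removal always disposes, so the upstream connection's lifetime never outlives the frontend connection that owns it.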

public class ClientConfigDto
{
    public string AppId { get; set; } = null!;
    public string AccessToken { get; set; } = null!;
    public string? ServiceUrl { get; set; }
    public string? ResourceId { get; set; }
    public int? SampleRate { get; set; }
    public int? BitsPerSample { get; set; }
    public int? Channels { get; set; }

    public void Validate()
    {
        if (string.IsNullOrWhiteSpace(AppId))
            throw new ArgumentException("AppId is required");
        if (string.IsNullOrWhiteSpace(AccessToken))
            throw new ArgumentException("AccessToken is required");
    }
}

Configuration validation helps surface problems during startup instead of letting them fail later at runtime. That safeguard is worth having.

The frontend and backend use JSON text messages for control, while binary messages carry audio data.

Example control message:

{
  "type": "control",
  "messageId": "msg_123",
  "timestamp": "2026-03-03T10:00:00Z",
  "payload": {
    "command": "StartRecognition",
    "parameters": {
      "hotwordId": "hotword1",
      "boosting_table_id": "table123"
    }
  }
}

Example recognition result:

{
  "type": "result",
  "timestamp": "2026-03-03T10:00:03Z",
  "payload": {
    "text": "Hello world",
    "confidence": 0.95,
    "duration": 1500,
    "isFinal": true,
    "utterances": [
      {
        "text": "Hello",
        "startTime": 0,
        "endTime": 800,
        "definite": true
      }
    ]
  }
}

This design separates control signals from audio payloads, which makes the system easier to reason about and implement. Splitting responsibilities cleanly is often the right call.
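In TypeScript, the JSON side of this split maps naturally onto a discriminated union, which lets the compiler check that every message type is handled. The shapes below follow the example messages above, trimmed to the fields the dispatch needs; the `status` payload shape is an assumption, since the article does not show one:

```typescript
// Discriminated union over the JSON control channel. Field shapes
// follow the example messages above; the 'status' payload is assumed.
type ServerMessage =
  | { type: 'status'; payload: { state: string } }
  | { type: 'result'; payload: { text: string; isFinal: boolean } }
  | { type: 'error'; payload: { message: string } };

function describeMessage(msg: ServerMessage): string {
  switch (msg.type) {
    case 'status':
      return `status: ${msg.payload.state}`;
    case 'result':
      // Partial vs final results arrive on the same channel.
      return msg.payload.isFinal
        ? `final result: ${msg.payload.text}`
        : `partial result: ${msg.payload.text}`;
    case 'error':
      return `error: ${msg.payload.message}`;
  }
}
```

Because the switch is exhaustive over the union, adding a new message type later produces a compile error until every handler accounts for it.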

class DoubaoVoiceClient {
  constructor(config) {
    this.config = config;
    this.ws = null;
  }

  async connect() {
    const url = new URL(this.config.wsUrl);
    // Add query parameters
    Object.entries(this.config.params).forEach(([key, value]) => {
      url.searchParams.set(key, value);
    });
    this.ws = new WebSocket(url);
    return new Promise((resolve, reject) => {
      this.ws.onopen = () => {
        console.log('[DoubaoVoice] Connected');
        resolve();
      };
      this.ws.onmessage = (event) => {
        this._handleMessage(JSON.parse(event.data));
      };
      this.ws.onerror = reject;
    });
  }

  _handleMessage(message) {
    switch (message.type) {
      case 'status':
        this._handleStatus(message.payload);
        break;
      case 'result':
        this.onResult?.(message.payload);
        break;
      case 'error':
        console.error('[DoubaoVoice] Error:', message.payload);
        break;
    }
  }
}

// Usage example
const client = new DoubaoVoiceClient({
  wsUrl: 'ws://localhost:5000/ws',
  params: {
    appId: 'your-app-id',
    accessToken: 'your-access-token',
    sampleRate: 16000,
    bitsPerSample: 16,
    channels: 1
  }
});

Using AudioWorkletNode for audio processing gives better performance:

// audio-worklet.js
class AudioProcessorWorklet extends AudioWorkletProcessor {
  process(inputs, outputs, parameters) {
    const input = inputs[0]?.[0];
    if (!input) return true;

    // Convert to 16-bit PCM
    const pcm = new Int16Array(input.length);
    for (let i = 0; i < input.length; i++) {
      pcm[i] = Math.max(-32768, Math.min(32767, input[i] * 32767));
    }

    this.port.postMessage({
      type: 'audioData',
      data: pcm.buffer
    }, [pcm.buffer]);
    return true;
  }
}
registerProcessor('audio-processor', AudioProcessorWorklet);

// Main thread code
async function startAudioRecording() {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: {
      echoCancellation: true,
      noiseSuppression: true,
      autoGainControl: true,
      sampleRate: 48000
    }
  });
  const audioContext = new AudioContext();
  const audioSource = audioContext.createMediaStreamSource(stream);
  await audioContext.audioWorklet.addModule('/audio-worklet.js');
  const audioWorkletNode = new AudioWorkletNode(audioContext, 'audio-processor');
  audioWorkletNode.port.onmessage = (event) => {
    if (event.data.type === 'audioData' && ws?.readyState === WebSocket.OPEN) {
      ws.send(event.data.data); // Send binary data directly
    }
  };
  audioSource.connect(audioWorkletNode);
}

AudioWorklet performs far better than ScriptProcessorNode and avoids the audio stutter problems that older processing paths often introduce.

{
  "Serilog": {
    "MinimumLevel": {
      "Default": "Information",
      "Override": {
        "Microsoft": "Warning",
        "System": "Warning"
      }
    },
    "WriteTo": [
      { "Name": "Console" },
      {
        "Name": "File",
        "Args": { "path": "logs/log-.txt", "rollingInterval": "Day" }
      }
    ]
  },
  "Kestrel": {
    "Urls": "http://0.0.0.0:5000"
  }
}

Logging configuration matters because it makes problems much easier to trace. Serilog’s file sink can roll logs daily, which keeps individual log files at a manageable size.

  • Regularly log session state to trace the full connection lifecycle
  • Monitor the number and duration of audio segments to detect abnormal connections
  • Record connection status and reconnection behavior for the Doubao service

These are basic operational practices, but they make a real difference in production.

  • Capture and log all WebSocket exceptions
  • Use IAsyncDisposable to ensure resources are cleaned up
  • Implement graceful connection shutdown and timeout handling

In short, favor robustness.

  • Sample rate: 16000 Hz (recommended) or 8000 Hz
  • Bit depth: 16-bit
  • Channels: mono
  • Encoding: PCM (raw)

If the format is wrong, recognition may fail or quality may degrade significantly. These requirements matter.
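Note that the capture example earlier opens the microphone at 48000 Hz while the service expects 16000 Hz, so the audio has to be resampled somewhere along the way. A naive decimation sketch, taking every third sample (a production implementation should low-pass filter before decimating to avoid aliasing):

```typescript
// Downsample 48 kHz float samples to 16 kHz 16-bit PCM by taking
// every third sample. Naive on purpose: real code should low-pass
// filter first to avoid aliasing artifacts.
function downsample48kTo16k(input: Float32Array): Int16Array {
  const out = new Int16Array(Math.floor(input.length / 3));
  for (let i = 0; i < out.length; i++) {
    const s = input[i * 3];
    // Clamp and scale the [-1, 1] float sample to 16-bit signed range.
    out[i] = Math.max(-32768, Math.min(32767, Math.round(s * 32767)));
  }
  return out;
}
```

Whether this conversion belongs in the AudioWorklet, the main thread, or the backend proxy is a deployment choice; what matters is that the bytes reaching the Doubao API match the format listed above.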

  • Keep sensitive credentials only in backend configuration
  • Enforce connection limits to prevent resource exhaustion
  • Use HTTPS/WSS in production

Security is never a minor concern.

  • Use asynchronous operations to avoid blocking
  • Tune buffer sizes as needed (default: 4096 bytes)
  • Consider connection pooling and reuse strategies

Apply these optimizations where they make sense for your workload.

  1. Docker deployment: Package the proxy service as a container for easier scaling and management
  2. Load balancing: Use Nginx or Envoy as a reverse proxy for WebSocket traffic
  3. Health checks: Implement heartbeat-based checks to monitor service availability
  4. Log aggregation: Send logs to a centralized logging system such as ELK or Loki

Deployment can be simple or complex depending on the team and environment, so adjust accordingly.

The WebSocket proxy pattern solves the core problem that the browser WebSocket API does not support custom headers. In the HagiCode project, this pattern proved both feasible and stable as it moved from playground validation into production deployment.

Key takeaways:

  • A backend proxy can pass authentication information securely
  • Native WebSocket is lightweight and efficient for simple scenarios
  • The “one connection, one session” model simplifies both implementation and debugging
  • The frontend-backend protocol should separate control signals from audio data

If you are building a feature that also depends on WebSocket authentication, I hope this pattern gives you a useful starting point.

If you have questions, feel free to discuss them with us. Technical progress happens faster when people compare notes.


Thank you for reading. If you found this article helpful, please click the like button below 👍 so more people can discover it.

This content was created with AI-assisted collaboration, reviewed by me, and reflects my own views and position.