
Typing Is Slower Than Talking, and Talking Is Slower Than a Screenshot - Multimodal Input Practices for AI Coding Assistants


Writing code has a speed limit no matter how fast you type. Sometimes something you could say in one sentence takes forever to type out; sometimes one screenshot explains everything, yet you still have to describe it with a pile of text. This article talks about what we ran into while building HagiCode, from speech recognition to image uploads. In the end, we just wanted to make an AI coding assistant a little easier to use.

While building HagiCode, we noticed a problem - or rather, one that surfaced naturally once people started using it heavily: relying on typing alone gets tiring.

Think about it: interaction between users and the Agent is a core scenario. But if every exchange requires nonstop typing at the keyboard, the efficiency is not great:

  1. Typing is too slow: For complicated issues, like error messages or UI problems, typing everything out can take half a minute, while saying it aloud might take ten seconds. That gap is real.

  2. Images are more direct: Sometimes the UI throws an error, sometimes you want to compare a design draft, and sometimes you need to show a code structure. “A picture is worth a thousand words” may be an old saying, but it still holds up. Letting AI directly “see” the problem is much clearer than describing it for ages.

  3. Interaction should feel natural: Modern AI assistants should support text, voice, and images. Users should be able to choose whichever input method feels most natural.

So we decided to add speech recognition and image upload support to HagiCode to make Agent interactions more convenient. If users can type a little less, that is already a win.

The solutions shared in this article come from our hands-on work in the HagiCode project - or, more accurately, from lessons learned while stumbling through quite a few pitfalls.

HagiCode is an open-source AI coding assistant project with a simple goal: use AI to improve development efficiency. As we kept building, it became clear that users strongly wanted multimodal input. Sometimes speaking one sentence is faster than typing a long paragraph, and sometimes a screenshot is far clearer than a long explanation.

Those needs pushed us forward, and that is how features like speech recognition and image uploads eventually took shape. Users can now interact with AI in the most natural way available to them, and that feels good.

Technical Challenges in Speech Recognition


When building speech recognition, we ran into a tricky issue: the browser WebSocket API does not support custom HTTP headers.

The speech recognition service we chose was ByteDance’s Doubao Speech Recognition API. Unfortunately, this API requires authentication information such as accessToken and secretKey to be passed through HTTP headers. That created an immediate technical conflict:

// The browser WebSocket constructor does not accept custom headers
const ws = new WebSocket('wss://api.com/ws', {
  headers: {
    'Authorization': 'Bearer token' // Not supported in browsers
  }
});

We basically had two options:

  1. URL query parameter approach: put the authentication info in the URL

    • The advantage is that it is simple to implement
    • The downside is that credentials are exposed to the frontend, which is insecure; some APIs also require header-based authentication
  2. Backend proxy approach: implement a WebSocket proxy on the backend

    • The advantage is that credentials remain securely stored on the backend and the solution is fully compatible with API requirements
    • The downside is that implementation is a bit more complex

In the end, we chose the backend proxy approach. Security is not something you compromise on.
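For the record, option 1 would have looked something like the sketch below (the function name and query parameter are illustrative, not from the project). The problem is visible immediately: the credential becomes part of a URL that browser history, intermediate proxies, and server access logs can all record.

```typescript
// Option 1, sketched only to show why we rejected it: the credential is
// embedded in the URL itself, so anything that logs URLs can see it.
function buildQueryAuthUrl(base: string, token: string): string {
  const url = new URL(base);
  url.searchParams.set('access_token', token);
  return url.toString();
}
```

With the backend proxy, by contrast, the token never appears in anything the browser sends.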

Image Upload Requirements

Our requirements for image uploads were actually pretty straightforward:

  1. Multiple upload methods: click to select a file, drag and drop, and paste from the clipboard
  2. File validation: type restrictions (PNG, JPG, WebP, GIF) and size limits (5-10 MB) are basic requirements
  3. User experience: upload progress, previews, and error messages so users always know what is happening
  4. Security: server-side validation and protection against malicious file uploads are essential

Speech Recognition: WebSocket Proxy Architecture


We designed a three-layer architecture for speech recognition and found a path that worked:

Browser WebSocket
      |
      |  ws://backend/api/voice/ws  (binary audio)
      v
Backend Proxy
      |
      |  wss://openspeech.bytedance.com/  (with auth header)
      v
Doubao API
Core component implementations:

  1. Frontend AudioWorklet processor:

     class AudioProcessorWorklet extends AudioWorkletProcessor {
       constructor() {
         super();
         this.accumulatedSamples = []; // buffer until a full 500 ms chunk is ready
       }

       process(inputs, outputs, parameters) {
         const input = inputs[0]?.[0];
         if (!input) return true;
         // Resample to 16 kHz (required by the Doubao API)
         const samples = this.resampleAudio(input, 48000, 16000);
         // Accumulate samples into 500 ms chunks (8000 samples at 16 kHz)
         this.accumulatedSamples.push(...samples);
         if (this.accumulatedSamples.length >= 8000) {
           // Convert to 16-bit PCM and transfer the buffer to the main thread
           const pcm = this.floatToPcm16(this.accumulatedSamples);
           this.port.postMessage({ type: 'audioData', data: pcm.buffer }, [pcm.buffer]);
           this.accumulatedSamples = [];
         }
         return true;
       }
     }
  2. Backend WebSocket handler (C#):

     [HttpGet("ws")]
     public async Task GetWebSocket()
     {
         if (HttpContext.WebSockets.IsWebSocketRequest)
         {
             await _webSocketHandler.HandleAsync(HttpContext);
         }
         else
         {
             // Plain HTTP requests to the WebSocket endpoint are rejected
             HttpContext.Response.StatusCode = StatusCodes.Status400BadRequest;
         }
     }
  3. Frontend VoiceTextArea component (excerpt; event handlers omitted for brevity):

     export const VoiceTextArea = forwardRef<HTMLTextAreaElement, VoiceTextAreaProps>(
       ({ value, onChange, onTextRecognized, maxDuration }, ref) => {
         const { isRecording, interimText, volume, duration, startRecording, stopRecording } =
           useVoiceRecording({ onTextRecognized, maxDuration });
         return (
           <div className="flex gap-2">
             {/* Voice button */}
             <button onClick={handleButtonClick}>
               {isRecording ? <VolumeWaveform volume={volume} /> : <Mic />}
             </button>
             {/* Text input area */}
             <textarea value={displayValue} onChange={handleChange} />
           </div>
         );
       }
     );
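The worklet in item 1 calls `resampleAudio` and `floatToPcm16` without showing them. Here is one plausible shape for those helpers, assuming simple linear-interpolation resampling (the project's actual implementation may differ):

```typescript
// Linear-interpolation downsampling, e.g. from the browser's 48 kHz capture
// rate to the 16 kHz the recognition API expects. An assumption for
// illustration; production code might use a proper low-pass filter first.
function resampleAudio(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  const ratio = fromRate / toRate;
  const outLength = Math.floor(input.length / ratio);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    // Blend the two nearest source samples
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}

// Convert [-1, 1] float samples to signed 16-bit PCM, clamping out-of-range values.
function floatToPcm16(samples: number[] | Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}
```

The clamping step matters: microphone input occasionally exceeds the nominal [-1, 1] range, and without it the integer conversion would wrap around and produce loud clicks in the stream.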

Image Uploads: Multi-Method Upload Component


We built a full-featured image upload component with support for all three upload methods, covering the most common scenarios users run into.

Core features:

  1. Three upload methods:

     // Click to upload
     const handleClick = () => fileInputRef.current?.click();

     // Drag-and-drop upload
     const handleDrop = (e: React.DragEvent) => {
       const file = e.dataTransfer.files?.[0];
       if (file) uploadFile(file);
     };

     // Clipboard paste
     const handlePaste = (e: ClipboardEvent) => {
       for (const item of Array.from(e.clipboardData?.items || [])) {
         if (item.type.startsWith('image/')) {
           const file = item.getAsFile();
           if (file) uploadFile(file);
         }
       }
     };
  2. Frontend validation:

     const validateFile = (file: File): { valid: boolean; error?: string } => {
       if (!acceptedTypes.includes(file.type)) {
         return { valid: false, error: 'Only PNG, JPG, JPEG, WebP, and GIF images are allowed' };
       }
       if (file.size > maxSize) {
         return { valid: false, error: `Maximum file size is ${(maxSize / 1024 / 1024).toFixed(1)} MB` };
       }
       return { valid: true };
     };
  3. Backend upload handler (TypeScript):

     export const Route = createFileRoute('/api/upload')({
       server: {
         handlers: {
           POST: async ({ request }) => {
             const formData = await request.formData();
             const file = formData.get('file') as File;
             // Server-side validation, independent of the frontend check
             const validation = validateFile(file);
             if (!validation.valid) {
               return Response.json({ error: validation.error }, { status: 400 });
             }
             // Save the file under a date folder with a UUID name
             const buffer = Buffer.from(await file.arrayBuffer());
             const uuid = uuidv4();
             const filePath = join(uploadDir, `${uuid}${extension}`);
             await writeFile(filePath, buffer);
             return Response.json({ url: `/uploaded/${today}/${uuid}${extension}` });
           }
         }
       }
     });
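The handler references `uploadDir`, `extension`, and `today` without defining them. One plausible way to derive the date-based path they imply (the folder layout and extension handling here are our assumptions, not the project's code):

```typescript
// Hypothetical helper for the handler's `${today}/${uuid}${extension}` URL.
// Lowercasing the extension and using an ISO date folder are assumptions.
function buildUploadPath(uploadRoot: string, originalName: string, uuid: string, now: Date): string {
  const dot = originalName.lastIndexOf('.');
  // Keep the original extension if present; ignore dotfiles like '.env'
  const extension = dot > 0 ? originalName.slice(dot).toLowerCase() : '';
  const today = now.toISOString().slice(0, 10); // e.g. '2025-01-31'
  return `${uploadRoot}/${today}/${uuid}${extension}`;
}
```

Grouping uploads by date keeps any single directory from growing without bound and makes retention cleanup (delete folders older than N days) trivial.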
How to Use Speech Recognition

  1. Configure the speech recognition service:

    • Open the speech recognition settings page
    • Configure the Doubao Speech AppId and AccessToken
    • Optionally configure hotwords to improve recognition accuracy for domain-specific terms
  2. Use it in the input box:

    • Click the microphone icon on the left side of the input box
    • Start speaking after the waveform animation appears
    • Click the icon again to stop recording
    • The recognized text is automatically inserted at the cursor position
  3. Hotword configuration example:

    • TypeScript
    • React
    • useState
    • useEffect
How to Use Image Uploads

  1. Upload methods:

    • Click the upload button to choose a file
    • Drag an image directly into the upload area
    • Use Ctrl+V to paste a screenshot from the clipboard
  2. Supported formats: PNG, JPG, JPEG, WebP, GIF

  3. Size limit: 5 MB by default (configurable)

Usage Notes

  1. Speech recognition:

    • Microphone permission is required
    • Use in a quiet environment when possible
    • The maximum supported recording duration is 300 seconds by default (configurable)
  2. Image uploads:

    • Only common image formats are supported
    • Pay attention to file size limits
    • Uploaded images automatically receive a preview URL
  3. Security considerations:

    • Speech recognition credentials are stored on the backend
    • Image uploads go through strict server-side validation
    • HTTPS/WSS is recommended in production environments

After adding speech recognition and image uploads, the HagiCode user experience improved noticeably. Users can now interact with AI in a more natural way - speaking instead of typing, and sharing screenshots instead of describing everything manually. It feels like finally finding a more comfortable way to communicate.

While building this feature, we ran into the problem that browser WebSocket APIs do not support custom headers. In the end, we solved it with a backend proxy approach. That solution not only preserved security, but also laid the groundwork for integrating other authenticated WebSocket services later on.

The image upload component also benefits from supporting multiple upload methods, letting users choose whatever is most convenient in the moment. Clicking, dragging, or pasting all work, and each path gets the job done quickly.

“Typing is slower than talking, and talking is slower than a screenshot” fits the theme here quite well. If you are building a similar AI assistant product, I hope these experiences help, even if only a little.


If this article helped you:

Thank you for reading. If you found this article useful, feel free to like, bookmark, and share it. This content was created with AI-assisted collaboration, and the final version was reviewed and confirmed by the author.