
Typing Is Slower Than Talking, and Talking Is Slower Than a Screenshot - Multimodal Input Practices for AI Coding Assistants


Writing code has a speed limit no matter how fast you type. Sometimes something you could say in one sentence takes forever to type out; sometimes one screenshot explains everything, yet you still have to describe it with a pile of text. This article talks about what we ran into while building HagiCode, from speech recognition to image uploads. In the end, we just wanted to make an AI coding assistant a little easier to use.

While building HagiCode, we noticed a problem - or rather, one that surfaced naturally once people started using it heavily: relying on typing alone gets tiring.

Think about it: interaction between users and the Agent is a core scenario. But if every exchange requires nonstop typing at the keyboard, the efficiency is not great:

  1. Typing is too slow: For complicated issues, like error messages or UI problems, typing everything out can take half a minute, while saying it aloud might take ten seconds. That gap is real.

  2. Images are more direct: Sometimes the UI throws an error, sometimes you want to compare a design draft, and sometimes you need to show a code structure. “A picture is worth a thousand words” may be an old saying, but it still holds up. Letting AI directly “see” the problem is much clearer than describing it for ages.

  3. Interaction should feel natural: Modern AI assistants should support text, voice, and images. Users should be able to choose whichever input method feels most natural.

So we decided to add speech recognition and image upload support to HagiCode to make Agent interactions more convenient. If users can type a little less, that is already a win.

The solutions shared in this article come from our hands-on work in the HagiCode project - or, more accurately, from lessons learned while stumbling through quite a few pitfalls.

HagiCode is an open-source AI coding assistant project with a simple goal: use AI to improve development efficiency. As we kept building, it became clear that users strongly wanted multimodal input. Sometimes speaking one sentence is faster than typing a long paragraph, and sometimes a screenshot is far clearer than a long explanation.

Those needs pushed us forward, and that is how features like speech recognition and image uploads eventually took shape. Users can now interact with AI in the most natural way available to them, and that feels good.

Technical Challenges in Speech Recognition


When building speech recognition, we ran into a tricky issue: the browser WebSocket API does not support custom HTTP headers.

The speech recognition service we chose was ByteDance’s Doubao Speech Recognition API. Unfortunately, this API requires authentication information such as accessToken and secretKey to be passed through HTTP headers. That created an immediate technical conflict:

// The browser WebSocket constructor does not accept custom headers
const ws = new WebSocket('wss://api.com/ws', {
  headers: {
    'Authorization': 'Bearer token' // Not supported in browsers
  }
});

We basically had two options:

  1. URL query parameter approach: put the authentication info in the URL

    • The advantage is that it is simple to implement
    • The downside is that credentials are exposed to the frontend, which is insecure; some APIs also require header-based authentication
  2. Backend proxy approach: implement a WebSocket proxy on the backend

    • The advantage is that credentials remain securely stored on the backend and the solution is fully compatible with API requirements
    • The downside is that implementation is a bit more complex

In the end, we chose the backend proxy approach. Security is not something you compromise on.
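For the record, option 1 would have looked something like the sketch below (the function name and query parameter are illustrative, not from the project). The problem is visible immediately: the credential becomes part of a URL that browser history, intermediate proxies, and server access logs can all record.

```typescript
// Option 1, sketched only to show why we rejected it: the credential is
// embedded in the URL itself, so anything that logs URLs can see it.
function buildQueryAuthUrl(base: string, token: string): string {
  const url = new URL(base);
  url.searchParams.set('access_token', token);
  return url.toString();
}
```

With the backend proxy, by contrast, the token never appears in anything the browser sends.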

Image Upload Requirements

Our requirements for image uploads were actually pretty straightforward:

  1. Multiple upload methods: click to select a file, drag and drop, and paste from the clipboard
  2. File validation: type restrictions (PNG, JPG, WebP, GIF) and size limits (5-10 MB) are basic requirements
  3. User experience: upload progress, previews, and error messages so users always know what is happening
  4. Security: server-side validation and protection against malicious file uploads are essential

Speech Recognition: WebSocket Proxy Architecture


We designed a three-layer architecture for speech recognition and found a path that worked:

Browser WebSocket
      |
      |  ws://backend/api/voice/ws  (binary audio)
      v
Backend Proxy
      |
      |  wss://openspeech.bytedance.com/  (with auth header)
      v
Doubao API
Core component implementations:

  1. Frontend AudioWorklet processor:

     class AudioProcessorWorklet extends AudioWorkletProcessor {
       constructor() {
         super();
         this.accumulatedSamples = []; // buffer until a full 500 ms chunk is ready
       }

       process(inputs, outputs, parameters) {
         const input = inputs[0]?.[0];
         if (!input) return true;
         // Resample to 16 kHz (required by the Doubao API)
         const samples = this.resampleAudio(input, 48000, 16000);
         // Accumulate samples into 500 ms chunks (8000 samples at 16 kHz)
         this.accumulatedSamples.push(...samples);
         if (this.accumulatedSamples.length >= 8000) {
           // Convert to 16-bit PCM and transfer the buffer to the main thread
           const pcm = this.floatToPcm16(this.accumulatedSamples);
           this.port.postMessage({ type: 'audioData', data: pcm.buffer }, [pcm.buffer]);
           this.accumulatedSamples = [];
         }
         return true;
       }
     }
  2. Backend WebSocket handler (C#):

     [HttpGet("ws")]
     public async Task GetWebSocket()
     {
         if (HttpContext.WebSockets.IsWebSocketRequest)
         {
             await _webSocketHandler.HandleAsync(HttpContext);
         }
         else
         {
             // Plain HTTP requests to the WebSocket endpoint are rejected
             HttpContext.Response.StatusCode = StatusCodes.Status400BadRequest;
         }
     }
  3. Frontend VoiceTextArea component (excerpt; event handlers omitted for brevity):

     export const VoiceTextArea = forwardRef<HTMLTextAreaElement, VoiceTextAreaProps>(
       ({ value, onChange, onTextRecognized, maxDuration }, ref) => {
         const { isRecording, interimText, volume, duration, startRecording, stopRecording } =
           useVoiceRecording({ onTextRecognized, maxDuration });
         return (
           <div className="flex gap-2">
             {/* Voice button */}
             <button onClick={handleButtonClick}>
               {isRecording ? <VolumeWaveform volume={volume} /> : <Mic />}
             </button>
             {/* Text input area */}
             <textarea value={displayValue} onChange={handleChange} />
           </div>
         );
       }
     );
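The worklet in item 1 calls `resampleAudio` and `floatToPcm16` without showing them. Here is one plausible shape for those helpers, assuming simple linear-interpolation resampling (the project's actual implementation may differ):

```typescript
// Linear-interpolation downsampling, e.g. from the browser's 48 kHz capture
// rate to the 16 kHz the recognition API expects. An assumption for
// illustration; production code might use a proper low-pass filter first.
function resampleAudio(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  const ratio = fromRate / toRate;
  const outLength = Math.floor(input.length / ratio);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    // Blend the two nearest source samples
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}

// Convert [-1, 1] float samples to signed 16-bit PCM, clamping out-of-range values.
function floatToPcm16(samples: number[] | Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}
```

The clamping step matters: microphone input occasionally exceeds the nominal [-1, 1] range, and without it the integer conversion would wrap around and produce loud clicks in the stream.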

Image Uploads: Multi-Method Upload Component


We built a full-featured image upload component with support for all three upload methods, covering the most common scenarios users run into.

Core features:

  1. Three upload methods:

     // Click to upload
     const handleClick = () => fileInputRef.current?.click();

     // Drag-and-drop upload
     const handleDrop = (e: React.DragEvent) => {
       const file = e.dataTransfer.files?.[0];
       if (file) uploadFile(file);
     };

     // Clipboard paste
     const handlePaste = (e: ClipboardEvent) => {
       for (const item of Array.from(e.clipboardData?.items || [])) {
         if (item.type.startsWith('image/')) {
           const file = item.getAsFile();
           if (file) uploadFile(file);
         }
       }
     };
  2. Frontend validation:

     const validateFile = (file: File): { valid: boolean; error?: string } => {
       if (!acceptedTypes.includes(file.type)) {
         return { valid: false, error: 'Only PNG, JPG, JPEG, WebP, and GIF images are allowed' };
       }
       if (file.size > maxSize) {
         return { valid: false, error: `Maximum file size is ${(maxSize / 1024 / 1024).toFixed(1)} MB` };
       }
       return { valid: true };
     };
  3. Backend upload handler (TypeScript):

     export const Route = createFileRoute('/api/upload')({
       server: {
         handlers: {
           POST: async ({ request }) => {
             const formData = await request.formData();
             const file = formData.get('file') as File;
             // Server-side validation, independent of the frontend check
             const validation = validateFile(file);
             if (!validation.valid) {
               return Response.json({ error: validation.error }, { status: 400 });
             }
             // Save the file under a date folder with a UUID name
             const buffer = Buffer.from(await file.arrayBuffer());
             const uuid = uuidv4();
             const filePath = join(uploadDir, `${uuid}${extension}`);
             await writeFile(filePath, buffer);
             return Response.json({ url: `/uploaded/${today}/${uuid}${extension}` });
           }
         }
       }
     });
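The handler references `uploadDir`, `extension`, and `today` without defining them. One plausible way to derive the date-based path they imply (the folder layout and extension handling here are our assumptions, not the project's code):

```typescript
// Hypothetical helper for the handler's `${today}/${uuid}${extension}` URL.
// Lowercasing the extension and using an ISO date folder are assumptions.
function buildUploadPath(uploadRoot: string, originalName: string, uuid: string, now: Date): string {
  const dot = originalName.lastIndexOf('.');
  // Keep the original extension if present; ignore dotfiles like '.env'
  const extension = dot > 0 ? originalName.slice(dot).toLowerCase() : '';
  const today = now.toISOString().slice(0, 10); // e.g. '2025-01-31'
  return `${uploadRoot}/${today}/${uuid}${extension}`;
}
```

Grouping uploads by date keeps any single directory from growing without bound and makes retention cleanup (delete folders older than N days) trivial.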
How to Use Speech Recognition

  1. Configure the speech recognition service:

    • Open the speech recognition settings page
    • Configure the Doubao Speech AppId and AccessToken
    • Optionally configure hotwords to improve recognition accuracy for domain-specific terms
  2. Use it in the input box:

    • Click the microphone icon on the left side of the input box
    • Start speaking after the waveform animation appears
    • Click the icon again to stop recording
    • The recognized text is automatically inserted at the cursor position
  3. Hotword configuration example:

    • TypeScript
    • React
    • useState
    • useEffect
How to Use Image Uploads

  1. Upload methods:

    • Click the upload button to choose a file
    • Drag an image directly into the upload area
    • Use Ctrl+V to paste a screenshot from the clipboard
  2. Supported formats: PNG, JPG, JPEG, WebP, GIF

  3. Size limit: 5 MB by default (configurable)

Usage Notes

  1. Speech recognition:

    • Microphone permission is required
    • Use in a quiet environment when possible
    • The maximum supported recording duration is 300 seconds by default (configurable)
  2. Image uploads:

    • Only common image formats are supported
    • Pay attention to file size limits
    • Uploaded images automatically receive a preview URL
  3. Security considerations:

    • Speech recognition credentials are stored on the backend
    • Image uploads go through strict server-side validation
    • HTTPS/WSS is recommended in production environments

After adding speech recognition and image uploads, the HagiCode user experience improved noticeably. Users can now interact with AI in a more natural way - speaking instead of typing, and sharing screenshots instead of describing everything manually. It feels like finally finding a more comfortable way to communicate.

While building this feature, we ran into the problem that browser WebSocket APIs do not support custom headers. In the end, we solved it with a backend proxy approach. That solution not only preserved security, but also laid the groundwork for integrating other authenticated WebSocket services later on.

The image upload component also benefits from supporting multiple upload methods, letting users choose whatever is most convenient in the moment. Clicking, dragging, or pasting all work, and each path gets the job done quickly.

“Typing is slower than talking, and talking is slower than a screenshot” fits the theme here quite well. If you are building a similar AI assistant product, I hope these experiences help, even if only a little.


If this article helped you:

Thank you for reading. If you found this article useful, feel free to like, bookmark, and share it. This content was created with AI-assisted collaboration, and the final version was reviewed and confirmed by the author.