## How Speech-to-Text Works
### The Pipeline
- Audio capture: Microphone input or file upload
- Pre-processing: Noise reduction, normalization, format conversion
- Transcription: Convert speech to text using AI models
- Post-processing: Punctuation, capitalization, formatting
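
To make the pre- and post-processing stages concrete, here is a minimal sketch of one step from each. The helper names `normalizePeak` and `tidyTranscript` are hypothetical and not part of any SDK. Real pipelines typically rely on dedicated audio tooling, and many models already return punctuated text.

```typescript
// Hypothetical helpers for illustration only; not part of any speech-to-text SDK.

/** Pre-processing: scale samples so the loudest peak reaches `targetPeak`. */
function normalizePeak(samples: Float32Array, targetPeak = 0.95): Float32Array {
  let peak = 0;
  for (let i = 0; i < samples.length; i++) {
    peak = Math.max(peak, Math.abs(samples[i]));
  }
  if (peak === 0) return samples.slice(); // pure silence: nothing to scale
  const gain = targetPeak / peak;
  return samples.map((s) => s * gain);
}

/** Post-processing: collapse whitespace, capitalize sentence starts, ensure final punctuation. */
function tidyTranscript(raw: string): string {
  const text = raw.trim().replace(/\s+/g, " ");
  if (text === "") return "";
  const capitalized = text.replace(
    /(^|[.!?]\s+)([a-z])/g,
    (_match: string, lead: string, ch: string) => lead + ch.toUpperCase()
  );
  return /[.!?]$/.test(capitalized) ? capitalized : capitalized + ".";
}
```

For example, `tidyTranscript("hello world.  how are you")` yields `"Hello world. How are you."`.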
### API Options
| Service | Strengths | Latency | Cost |
|---------|-----------|---------|------|
| OpenAI Whisper API | Accuracy, multilingual | Batch | $ |
| Whisper (local) | Privacy, no cost | Varies | Free |
| Deepgram | Speed, real-time | <300ms | $$ |
| AssemblyAI | Features, accuracy | ~1s | $$ |
| Google Speech-to-Text | Integration, languages | ~500ms | $$ |
### OpenAI Whisper API
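```typescript
import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const transcription = await openai.audio.transcriptions.create({
  file: fs.createReadStream("recording.mp3"),
  model: "whisper-1",
  language: "en", // ISO 639-1 code of the spoken language
  response_format: "verbose_json", // needed to get timestamps back
  timestamp_granularities: ["word", "segment"],
});

console.log(transcription.text);

// Segment-level timestamps (seconds from the start of the audio)
for (const segment of transcription.segments ?? []) {
  console.log(`[${segment.start}s] ${segment.text}`);
}

// Word-level timestamps
for (const word of transcription.words ?? []) {
  console.log(`[${word.start}s] ${word.word}`);
}
```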
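
The raw `start` values are plain seconds, which are awkward to read for longer recordings. The following sketch formats segments as `mm:ss.mmm` ranges. The `Segment` interface and both function names are hypothetical. They assume only the `start`, `end`, and `text` fields (times in seconds) that the `verbose_json` response exposes.

```typescript
// Hypothetical shape: only the fields this sketch relies on (times in seconds).
interface Segment {
  start: number;
  end: number;
  text: string;
}

/** Left-pad a non-negative integer with zeros to at least `width` digits. */
function zeroPad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

/** Convert seconds to "mm:ss.mmm", rounding to the nearest millisecond. */
function formatTimestamp(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  return `${zeroPad(Math.floor(totalSec / 60), 2)}:${zeroPad(totalSec % 60, 2)}.${zeroPad(ms, 3)}`;
}

/** Render one line per segment, e.g. "[00:00.000 - 00:02.500] Hello." */
function formatSegments(segments: Segment[]): string {
  return segments
    .map((seg) => `[${formatTimestamp(seg.start)} - ${formatTimestamp(seg.end)}] ${seg.text.trim()}`)
    .join("\n");
}
```

Assuming the segment objects match this shape, the result of `formatSegments(transcription.segments ?? [])` can be logged or written to a file.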
### Audio Formats
- Supported (Whisper API): mp3, mp4, mpeg, mpga, m4a, wav, webm
- Best quality: WAV (uncompressed) or FLAC (lossless compression)
- Smallest files: MP3 or Opus (lossy compression)
- Max file size: 25 MB per request (Whisper API)
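
Checking these constraints before uploading avoids a wasted request. The sketch below is a minimal pre-flight check. The `checkUpload` name and the result shape are hypothetical, the extension list is copied from the bullets above, and treating 25 MB as 25,000,000 bytes is a conservative assumption rather than a documented threshold.

```typescript
// Extensions taken from the list above.
const SUPPORTED_EXTENSIONS = ["mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"];

// Assumption: 25 MB is interpreted conservatively as 25,000,000 bytes.
const MAX_UPLOAD_BYTES = 25 * 1000 * 1000;

interface UploadCheck {
  ok: boolean;
  reason?: string;
}

/** Hypothetical pre-flight check: validates the extension and size of a file. */
function checkUpload(fileName: string, sizeBytes: number): UploadCheck {
  const dot = fileName.lastIndexOf(".");
  const ext = dot === -1 ? "" : fileName.slice(dot + 1).toLowerCase();

  if (SUPPORTED_EXTENSIONS.indexOf(ext) === -1) {
    return { ok: false, reason: `Unsupported format: "${ext || "none"}"` };
  }
  if (sizeBytes > MAX_UPLOAD_BYTES) {
    return { ok: false, reason: "File exceeds the 25 MB limit; compress or split it" };
  }
  return { ok: true };
}
```

A file that fails the size check can usually be re-encoded to MP3 or Opus, or split into shorter chunks that are transcribed separately.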