AI Audio and Speech
How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain. Learn more about how BloomWiki works.
AI for audio and speech encompasses the use of machine learning to process, understand, generate, and transform audio signals — including spoken language, music, environmental sounds, and biological signals. Speech recognition converts spoken words to text; text-to-speech synthesis creates natural-sounding voices; speaker identification recognizes who is speaking; music generation creates novel musical compositions; and sound classification identifies events from environmental audio. These technologies are embedded in virtual assistants, accessibility tools, music platforms, hearing aids, and security systems.
Remembering
- Automatic Speech Recognition (ASR) — The conversion of spoken audio to text; also called speech-to-text (STT).
- Text-to-Speech (TTS) — Synthesizing natural-sounding speech from written text.
- Speaker diarization — Segmenting an audio recording by speaker identity: "who spoke when."
- Speaker verification — Determining whether a voice sample belongs to a claimed identity.
- Speaker identification — Identifying which person from a known set is speaking.
- Waveform — The raw audio signal as a 1D array of amplitude values over time.
- Spectrogram — A 2D time-frequency representation of audio; shows which frequencies are present at each moment.
- Mel spectrogram — A spectrogram with frequency axis mapped to the mel scale, which approximates human auditory perception.
- MFCC (Mel-Frequency Cepstral Coefficients) — Features computed from the mel spectrogram widely used in traditional speech processing.
- End-to-end ASR — Models that directly map audio to text without separate acoustic and language model stages (Whisper, Wav2Vec).
- Wav2Vec 2.0 — A self-supervised pre-training approach for speech that learns speech representations from raw audio, enabling powerful ASR with limited labeled data.
- Whisper — OpenAI's multilingual, multitask speech recognition model trained on 680k hours of web audio; state-of-the-art open ASR.
- WER (Word Error Rate) — The primary metric for ASR: (Substitutions + Deletions + Insertions) / Total Reference Words. A worked example follows this list.
- Vocoder — A model that converts acoustic features (mel spectrograms) to raw audio waveforms (WaveNet, HiFi-GAN).
- Voice cloning — Synthesizing a new speaker's voice from a short sample; enabling personalized TTS.
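To make the WER formula concrete, here is a minimal sketch that computes it with a word-level edit-distance alignment. It is illustrative only: production evaluation code typically uses a package such as jiwer, and the example sentences are invented.
<syntaxhighlight lang="python">
def wer(reference: str, hypothesis: str) -> float:
    """WER = (Substitutions + Deletions + Insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                          # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                          # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six reference words -> WER = 1/6 ≈ 0.17
print(wer("the cat sat on the mat", "the cat sat on mat"))
</syntaxhighlight>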
Understanding
Audio is a continuous 1D waveform, typically sampled at 16,000–44,100 Hz. A 10-second speech clip at 16kHz = 160,000 samples — a long sequence. Direct learning from raw waveforms is possible (WaveNet, SoundStream) but computationally expensive. The dominant approach: convert to 2D time-frequency representations (spectrograms), then apply computer vision or sequence models.
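A sketch of that front end, assuming the librosa library and a hypothetical file speech.wav: it loads a 10-second, 16 kHz clip and converts it to an 80-band log-mel spectrogram with 25 ms windows every 10 ms.
<syntaxhighlight lang="python">
import librosa

# Load 10 seconds of mono audio at 16 kHz -> 160,000 samples (1D waveform)
y, sr = librosa.load("speech.wav", sr=16000, duration=10.0)
print(y.shape)  # (160000,)

# 2D time-frequency representation: 80-band mel spectrogram, log-compressed
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80)
log_mel = librosa.power_to_db(mel)
print(log_mel.shape)  # (80, 1001): 80 mel bands x one frame every 10 ms
</syntaxhighlight>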
ASR evolution: Early ASR combined separate acoustic models (HMMs with Gaussian mixtures) and language models. Deep learning replaced the acoustic model with neural networks but kept the pipeline. Modern end-to-end models (CTC-based or attention-based encoder-decoders) jointly learn the acoustic-to-linguistic mapping. Whisper uses a transformer encoder-decoder, treating ASR as a sequence-to-sequence translation problem, and achieves remarkable robustness across accents, noise, and languages.
Self-supervised audio: Wav2Vec 2.0 pre-trains on unlabeled audio by learning to identify the correct quantized speech representation for a masked segment among distractors. This contrastive objective learns rich speech representations without transcriptions. Fine-tuning on even 10 minutes of labeled audio achieves competitive ASR — transformative for low-resource languages.
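A minimal inference sketch with a Wav2Vec 2.0 checkpoint that has already been CTC-fine-tuned for English ASR, using the Hugging Face transformers API; the audio file name and checkpoint choice are illustrative.
<syntaxhighlight lang="python">
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Self-supervised pre-training + CTC fine-tuning on 960 h of labeled LibriSpeech
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

speech, sr = librosa.load("audio.wav", sr=16000)  # model expects 16 kHz mono
inputs = processor(speech, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits               # (batch, frames, vocab)

pred_ids = torch.argmax(logits, dim=-1)           # greedy CTC decoding
print(processor.batch_decode(pred_ids)[0])
</syntaxhighlight>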
TTS pipeline: Text → phonemes → acoustic features (mel spectrogram) → waveform. Tacotron 2 generates mel spectrograms from phoneme sequences. HiFi-GAN (vocoder) converts mel spectrograms to high-fidelity audio. Modern neural TTS is nearly indistinguishable from human speech in quality evaluations.
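A sketch of that pipeline through the Coqui TTS API, which bundles an acoustic model and a neural vocoder behind one call; the model name and output path are illustrative assumptions.
<syntaxhighlight lang="python">
from TTS.api import TTS

# Tacotron 2 acoustic model -> mel spectrogram -> neural vocoder -> waveform
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(text="Neural text to speech in one call.", file_path="output.wav")
</syntaxhighlight>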
The voice cloning challenge: Modern TTS systems like VALL-E can clone a voice from 3 seconds of audio, generating speech in that voice from any text. This simultaneously enables transformative accessibility applications and creates severe deepfake risks.
Applying
Speech recognition with Whisper: <syntaxhighlight lang="python">
import whisper

# Load model (tiny/base/small/medium/large)
model = whisper.load_model("medium")

# Transcribe audio file (handles any language automatically)
result = model.transcribe(
    "interview.mp3",
    language="en",          # Auto-detect if None
    task="transcribe",      # Or "translate" to English
    word_timestamps=True,   # Return per-word timing
    fp16=True               # Use FP16 for speed
)
print(result["text"])
print(result["segments"])   # [{start, end, text, words}]

# Streaming ASR with faster-whisper (CTranslate2 optimized)
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="int8")
segments, info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")
</syntaxhighlight>
Speaker diarization with pyannote: <syntaxhighlight lang="python">
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_TOKEN"
)

diarization = pipeline("meeting.wav", num_speakers=3)
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"[{turn.start:.1f}s → {turn.end:.1f}s] {speaker}")
</syntaxhighlight>
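Producing a speaker-attributed transcript from the two blocks above is not built into either library; a common approach, sketched here under that assumption, assigns each Whisper segment to the diarization speaker it overlaps most in time.
<syntaxhighlight lang="python">
def assign_speakers(whisper_segments, diarization):
    """Label each Whisper segment with the diarization speaker it overlaps most."""
    labeled = []
    for seg in whisper_segments:  # dicts with "start", "end", "text"
        best_speaker, best_overlap = "unknown", 0.0
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            overlap = min(seg["end"], turn.end) - max(seg["start"], turn.start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, seg["text"]))
    return labeled

# Usage with the outputs shown above:
# for speaker, text in assign_speakers(result["segments"], diarization):
#     print(f"{speaker}: {text}")
</syntaxhighlight>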
- Audio AI application map
- Transcription → Whisper (offline), AssemblyAI, Rev.ai (API)
- Real-time ASR → faster-whisper, DeepSpeech, Vosk (edge)
- TTS → ElevenLabs (realistic), Coqui TTS (open), Kokoro (lightweight)
- Music generation → MusicGen (Meta), Suno, Udio
- Sound classification → YAMNet, PANNs, BirdNET (birds); see the YAMNet sketch after this list
- Speaker diarization → pyannote.audio, NVIDIA NeMo
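As one example from the map above, a minimal sound-classification sketch with YAMNet via TensorFlow Hub; the file name is hypothetical, and mapping the predicted index to a human-readable label uses the class-map CSV distributed with the model.
<syntaxhighlight lang="python">
import numpy as np
import tensorflow_hub as hub
import librosa

# YAMNet: 521-class AudioSet event classifier over 16 kHz mono audio
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

waveform, _ = librosa.load("street.wav", sr=16000)   # float32 samples in [-1, 1]
scores, embeddings, spectrogram = yamnet(waveform)   # scores: (frames, 521)

mean_scores = np.mean(scores.numpy(), axis=0)        # average over time frames
top_class = int(np.argmax(mean_scores))              # index into the AudioSet class map
print(top_class, mean_scores[top_class])
</syntaxhighlight>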
Analyzing
| System | WER (LibriSpeech clean) | Languages | Latency | License |
|---|---|---|---|---|
| Whisper large-v3 | 2.7% | 99 | Batch | Open (MIT) |
| faster-whisper | Same (optimized) | 99 | Near-real-time | Open |
| Google Cloud STT | ~3% | 130+ | Real-time | Proprietary/paid |
| AssemblyAI | ~3% | 20 | Real-time API | Proprietary/paid |
| Wav2Vec 2.0 | ~2.2% (fine-tuned) | English (multilingual via XLSR) | Batch | Open (Apache) |
Failure modes: ASR performance degrades severely for accented speech, noisy environments, overlapping speakers, and domain-specific vocabulary (medical, legal, technical). WER on standard benchmarks (LibriSpeech) is much lower than real-world deployment WER. TTS systems can produce unnatural prosody for unusual names, numbers, and non-standard text. Voice cloning enables deepfake audio — voice authentication systems must be robust to synthetic voices.
Evaluating
ASR evaluation:
- WER on diverse benchmarks: LibriSpeech (read speech), Switchboard (telephone conversational), CHiME-4 (noisy), AMI (meeting), and target domain.
- Latency: Real-time factor (RTF): if RTF = 0.1, the model processes 10 seconds of audio per second of compute — production systems typically require RTF < 1.
- Fairness: WER disaggregated by accent, dialect, gender, and age — ASR systems frequently exhibit higher WER for underrepresented groups. A measurement sketch follows this list.
TTS evaluation: MOS (Mean Opinion Score) from human listeners, with naturalness and intelligibility rated separately.
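A minimal sketch of the ASR side of this evaluation, assuming the jiwer package, a faster-whisper model, and hypothetical clips with human reference transcripts and accent labels: it measures per-clip WER and RTF, then disaggregates WER by group.
<syntaxhighlight lang="python">
import time
import jiwer
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="int8")

# Hypothetical evaluation set: (audio path, human reference transcript, accent label)
eval_set = [
    ("clip_001.wav", "turn the volume down please", "US"),
    ("clip_002.wav", "turn the volume down please", "Indian English"),
]

by_accent = {}
for path, reference, accent in eval_set:
    start = time.perf_counter()
    segments, info = model.transcribe(path)
    hypothesis = " ".join(s.text for s in segments).strip()  # decoding happens here
    elapsed = time.perf_counter() - start

    rtf = elapsed / info.duration                # real-time factor: compute time / audio time
    error = jiwer.wer(reference, hypothesis)     # WER for this clip
    by_accent.setdefault(accent, []).append(error)
    print(f"{path}: WER={error:.2%}, RTF={rtf:.2f}")

# Disaggregate: flag groups whose average WER is notably worse
for accent, errors in by_accent.items():
    print(f"{accent}: mean WER = {sum(errors) / len(errors):.2%}")
</syntaxhighlight>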
Creating
Designing a production speech AI pipeline (a sketch of the VAD, ASR, and post-processing steps follows the list):
- ASR: deploy faster-whisper with CTranslate2 on GPU for batch; Vosk or Sherpa-ONNX for on-device/edge.
- VAD (Voice Activity Detection): silero-VAD to filter silent segments before ASR — reduces cost and latency.
- Diarization: pyannote.audio for speaker labeling.
- Post-processing: domain-specific word correction (medical terminology, product names) using a custom language model or dictionary lookup.
- TTS: ElevenLabs API or Coqui TTS for synthesis; cache common phrases.
- Quality monitoring: compute WER on human-transcribed sample weekly; alert if WER increases >20% relative.
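A minimal sketch of the first, second, and fourth steps above, assuming silero-vad loaded via torch.hub, faster-whisper, and hypothetical file, product, and correction names; diarization and TTS would slot in as shown in earlier sections.
<syntaxhighlight lang="python">
import torch
from faster_whisper import WhisperModel

AUDIO = "support_call.wav"  # hypothetical input; 16 kHz mono assumed

# 1) VAD: Silero VAD flags speech regions so silence is never sent to the ASR stage
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils
wav = read_audio(AUDIO, sampling_rate=16000)
speech_regions = get_speech_timestamps(wav, vad_model, sampling_rate=16000)
print(f"{len(speech_regions)} speech regions detected")

# 2) ASR: faster-whisper (CTranslate2) with int8 weights; vad_filter applies Silero
#    VAD internally so silent stretches are skipped during decoding
asr = WhisperModel("large-v3", device="cuda", compute_type="int8")
segments, info = asr.transcribe(AUDIO, beam_size=5, vad_filter=True)

# 3) Post-processing: dictionary lookup for domain-specific corrections
CORRECTIONS = {"acme widget pro": "AcmeWidget Pro"}  # hypothetical product name
for seg in segments:
    text = seg.text
    for wrong, right in CORRECTIONS.items():
        text = text.replace(wrong, right)
    print(f"[{seg.start:.1f}s] {text}")
</syntaxhighlight>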