Ai Audio Speech

From BloomWiki

How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain. Learn more about how BloomWiki works.

AI for audio and speech encompasses the use of machine learning to process, understand, generate, and transform audio signals — including spoken language, music, environmental sounds, and biological signals. Speech recognition converts spoken words to text; text-to-speech synthesis creates natural-sounding voices; speaker identification recognizes who is speaking; music generation creates novel musical compositions; and sound classification identifies events from environmental audio. These technologies are embedded in virtual assistants, accessibility tools, music platforms, hearing aids, and security systems.

Remembering[edit]

  • Automatic Speech Recognition (ASR) — The conversion of spoken audio to text; also called speech-to-text (STT).
  • Text-to-Speech (TTS) — Synthesizing natural-sounding speech from written text.
  • Speaker diarization — Segmenting an audio recording by speaker identity: "who spoke when."
  • Speaker verification — Determining whether a voice sample belongs to a claimed identity.
  • Speaker identification — Identifying which person from a known set is speaking.
  • Waveform — The raw audio signal as a 1D array of amplitude values over time.
  • Spectrogram — A 2D time-frequency representation of audio; shows which frequencies are present at each moment.
  • Mel spectrogram — A spectrogram with frequency axis mapped to the mel scale, which approximates human auditory perception.
  • MFCC (Mel-Frequency Cepstral Coefficients) — Features computed from the mel spectrogram widely used in traditional speech processing.
  • End-to-end ASR — Models that directly map audio to text without separate acoustic and language model stages (Whisper, Wav2Vec).
  • Wav2Vec 2.0 — A self-supervised pre-training approach for speech that learns speech representations from raw audio, enabling powerful ASR with limited labeled data.
  • Whisper — OpenAI's multilingual, multitask speech recognition model trained on 680k hours of web audio; state-of-the-art open ASR.
  • WER (Word Error Rate) — The primary metric for ASR: (Substitutions + Deletions + Insertions) / Total Reference Words; a worked sketch follows this list.
  • Vocoder — A model that converts acoustic features (mel spectrograms) to raw audio waveforms (WaveNet, HiFi-GAN).
  • Voice cloning — Synthesizing a new speaker's voice from a short sample, enabling personalized TTS.
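
The WER definition above can be computed directly with an edit-distance library. A minimal sketch, assuming the third-party jiwer package is installed; the sentences are illustrative. <syntaxhighlight lang="python">
import jiwer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / reference word count
print(jiwer.wer(reference, hypothesis))  # 2 substitutions / 9 words ≈ 0.22

# Per-operation counts (jiwer 3.x API)
out = jiwer.process_words(reference, hypothesis)
print(out.substitutions, out.deletions, out.insertions)
</syntaxhighlight>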

Understanding[edit]

Audio is a continuous 1D waveform, typically sampled at 16,000–44,100 Hz. A 10-second speech clip at 16kHz = 160,000 samples — a long sequence. Direct learning from raw waveforms is possible (WaveNet, SoundStream) but computationally expensive. The dominant approach: convert to 2D time-frequency representations (spectrograms), then apply computer vision or sequence models.
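
A minimal sketch of this waveform-to-spectrogram step, assuming the librosa library and an illustrative local file "speech.wav": <syntaxhighlight lang="python">
import librosa

# Load as a mono 16 kHz waveform (a 1D array of amplitudes)
y, sr = librosa.load("speech.wav", sr=16000)
print(y.shape)  # (num_samples,), e.g. 160000 for a 10-second clip

# 80-band mel spectrogram with a 25 ms window and 10 ms hop (common ASR settings)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80)
log_mel = librosa.power_to_db(mel)  # log compression, as most models expect
print(log_mel.shape)  # (80, num_frames): a 2D time-frequency "image"
</syntaxhighlight>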

ASR evolution: Early ASR combined separate acoustic models (HMMs + Gaussian mixtures) and language models. Deep learning replaced the acoustic model with neural networks but kept the pipeline. Modern end-to-end models (CTC-based models, attention-based encoder-decoders) jointly learn the acoustic-to-linguistic mapping. Whisper uses a transformer encoder-decoder, treating ASR as a sequence-to-sequence translation problem, and achieves remarkable robustness across accents, noise, and languages.

Self-supervised audio: Wav2Vec 2.0 pre-trains on unlabeled audio by learning to identify the correct quantized speech representation for a masked segment among distractors. This contrastive objective learns rich speech representations without transcriptions. Fine-tuning on even 10 minutes of labeled audio achieves competitive ASR — transformative for low-resource languages.
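
A minimal inference sketch with a fine-tuned Wav2Vec 2.0 checkpoint from Hugging Face Transformers (the checkpoint name and audio file are illustrative assumptions): <syntaxhighlight lang="python">
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

speech, sr = librosa.load("speech.wav", sr=16000)
inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits  # (batch, frames, vocab)

# Greedy CTC decoding: argmax per frame, then collapse repeats and blanks
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
</syntaxhighlight>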

TTS pipeline: Text → phonemes → acoustic features (mel spectrogram) → waveform. Tacotron 2 generates mel spectrograms from phoneme sequences. HiFi-GAN (vocoder) converts mel spectrograms to high-fidelity audio. Modern neural TTS is nearly indistinguishable from human speech in quality evaluations.
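
A minimal sketch of this pipeline using Coqui TTS, an open-source library that pairs a Tacotron 2 acoustic model with a neural vocoder (the model name and output path are assumptions): <syntaxhighlight lang="python">
from TTS.api import TTS

# Text -> mel spectrogram (Tacotron 2) -> waveform (vocoder), handled internally
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
tts.tts_to_file(
    text="Neural text to speech is close to human quality.",
    file_path="output.wav",
)
</syntaxhighlight>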

The voice cloning challenge: Modern TTS systems like VALL-E can clone a voice from 3 seconds of audio, generating speech in that voice from any text. This enables transformative accessibility applications and severe deepfake risks simultaneously.
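
VALL-E itself is not openly released, but open models offer comparable zero-shot cloning. A hedged sketch using Coqui's XTTS v2 (assuming the TTS package; the reference clip "reference_voice.wav" is illustrative): <syntaxhighlight lang="python">
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="This sentence is spoken in the cloned voice.",
    speaker_wav="reference_voice.wav",  # a few seconds of the target speaker
    language="en",
    file_path="cloned.wav",
)
</syntaxhighlight>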

Applying[edit]

Speech recognition with Whisper: <syntaxhighlight lang="python">
import whisper

# Load model (tiny/base/small/medium/large)
model = whisper.load_model("medium")

# Transcribe an audio file (handles any language automatically)
result = model.transcribe(
    "interview.mp3",
    language="en",          # Auto-detect if None
    task="transcribe",      # Or "translate" to English
    word_timestamps=True,   # Return per-word timing
    fp16=True,              # Use FP16 for speed
)
print(result["text"])
print(result["segments"])  # [{start, end, text, words}, ...]

# Streaming-style ASR with faster-whisper (CTranslate2-optimized)
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="int8")
segments, info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")
</syntaxhighlight>

Speaker diarization with pyannote: <syntaxhighlight lang="python">
from pyannote.audio import Pipeline

# Pretrained diarization pipeline (requires a Hugging Face access token)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_TOKEN",
)

diarization = pipeline("meeting.wav", num_speakers=3)
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"[{turn.start:.1f}s → {turn.end:.1f}s] {speaker}")
</syntaxhighlight>

Audio AI application map:

  • Transcription → Whisper (offline); AssemblyAI, Rev.ai (API)
  • Real-time ASR → faster-whisper, DeepSpeech, Vosk (edge)
  • TTS → ElevenLabs (realistic), Coqui TTS (open), Kokoro (lightweight)
  • Music generation → MusicGen (Meta), Suno, Udio
  • Sound classification → YAMNet, PANNs, BirdNET (birds); see the sketch below
  • Speaker diarization → pyannote.audio, NVIDIA NeMo
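
A minimal sound-classification sketch with YAMNet from TensorFlow Hub (assumes tensorflow and tensorflow_hub are installed; the waveform here is a silent placeholder): <syntaxhighlight lang="python">
import csv
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

model = hub.load("https://tfhub.dev/google/yamnet/1")

# YAMNet expects mono float32 audio at 16 kHz in [-1, 1]
waveform = np.zeros(16000, dtype=np.float32)  # placeholder: 1 second of silence
scores, embeddings, spectrogram = model(waveform)

# Map class indices to human-readable labels from the bundled class map
class_map = model.class_map_path().numpy().decode("utf-8")
with tf.io.gfile.GFile(class_map) as f:
    class_names = [row["display_name"] for row in csv.DictReader(f)]

mean_scores = scores.numpy().mean(axis=0)
top5 = np.argsort(mean_scores)[::-1][:5]
print([class_names[i] for i in top5])
</syntaxhighlight>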

Analyzing[edit]

ASR System Comparison

  System           | WER (LibriSpeech clean) | Languages | Latency        | License
  Whisper large-v3 | 2.7%                    | 99        | Batch          | Open (MIT)
  faster-whisper   | Same (optimized)        | 99        | Near-real-time | Open
  Google Cloud STT | ~3%                     | 130+      | Real-time      | Proprietary/paid
  AssemblyAI       | ~3%                     | 20        | Real-time API  | Proprietary/paid
  Wav2Vec 2.0      | ~2.2% (fine-tuned)      | -         | Batch          | Open (Apache)

Failure modes: ASR performance degrades severely for accented speech, noisy environments, overlapping speakers, and domain-specific vocabulary (medical, legal, technical). WER on standard benchmarks (LibriSpeech) is much lower than real-world deployment WER. TTS systems can produce unnatural prosody for unusual names, numbers, and non-standard text. Voice cloning enables deepfake audio — voice authentication systems must be robust to synthetic voices.

Evaluating[edit]

ASR evaluation:

  1. WER on diverse benchmarks: LibriSpeech (read speech), Switchboard (conversational telephone speech), CHiME-4 (noisy), AMI (meetings), and the target domain.
  2. Latency: real-time factor (RTF) — if RTF=0.1, the model processes 10 seconds of audio per second of compute; production systems typically require RTF<1 (a measurement sketch follows below).
  3. Fairness: WER disaggregated by accent, dialect, gender, and age — ASR systems frequently exhibit higher WER for underrepresented groups.

TTS evaluation: MOS (Mean Opinion Score) from human listeners, with naturalness and intelligibility scored separately.
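
A minimal sketch of measuring RTF (the transcribe callable stands in for whatever ASR engine is deployed; the audio file is illustrative): <syntaxhighlight lang="python">
import time
import librosa

def measure_rtf(transcribe, audio_path):
    """RTF = compute time / audio duration; RTF < 1 means faster than real time."""
    y, sr = librosa.load(audio_path, sr=None)
    duration = len(y) / sr

    start = time.perf_counter()
    transcribe(audio_path)
    elapsed = time.perf_counter() - start
    return elapsed / duration

# Example with the Whisper model from the Applying section (assumed already loaded):
# rtf = measure_rtf(lambda p: model.transcribe(p), "sample.wav")
# print(f"RTF = {rtf:.2f}")
</syntaxhighlight>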

Creating[edit]

Designing a production speech AI pipeline:

  1. ASR: deploy faster-whisper with CTranslate2 on GPU for batch; Vosk or Sherpa-ONNX for on-device/edge.
  2. VAD (Voice Activity Detection): Silero VAD to filter silent segments before ASR — reduces cost and latency (see the sketch after this list).
  3. Diarization: pyannote.audio for speaker labeling.
  4. Post-processing: domain-specific word correction (medical terminology, product names) using a custom language model or dictionary lookup.
  5. TTS: ElevenLabs API or Coqui TTS for synthesis; cache common phrases.
  6. Quality monitoring: compute WER on human-transcribed sample weekly; alert if WER increases >20% relative.
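
A minimal sketch of the VAD step in item 2, loading Silero VAD via torch.hub as described in the snakers4/silero-vad README (the audio file is illustrative): <syntaxhighlight lang="python">
import torch

# Load the pretrained VAD model and its helper utilities
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("call.wav", sampling_rate=16000)
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)

# e.g. [{'start': 4800, 'end': 61440}, ...] (sample offsets); pass only these
# spans to the ASR stage to cut cost and latency
print(speech_timestamps)
</syntaxhighlight>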