
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Automatic Speech Recognition (ASR)''' — The conversion of spoken audio to text; also called speech-to-text (STT).
* '''Text-to-Speech (TTS)''' — Synthesizing natural-sounding speech from written text.
* '''Speaker diarization''' — Segmenting an audio recording by speaker identity: "who spoke when."
* '''Speaker verification''' — Determining whether a voice sample belongs to a claimed identity.
* '''Speaker identification''' — Identifying which person from a known set is speaking.
* '''Waveform''' — The raw audio signal as a 1D array of amplitude values over time.
* '''Spectrogram''' — A 2D time-frequency representation of audio; shows which frequencies are present at each moment.
* '''Mel spectrogram''' — A spectrogram whose frequency axis is mapped to the mel scale, which approximates human auditory perception.
* '''MFCC (Mel-Frequency Cepstral Coefficients)''' — Compact features derived from the log mel spectrogram via a discrete cosine transform; widely used in traditional speech processing.
* '''End-to-end ASR''' — Models that map audio directly to text without separate acoustic and language model stages (e.g., Whisper, or Wav2Vec 2.0 fine-tuned for transcription).
* '''Wav2Vec 2.0''' — A self-supervised pre-training approach that learns speech representations from raw, unlabeled audio, enabling accurate ASR after fine-tuning on limited labeled data.
* '''Whisper''' — OpenAI's multilingual, multitask speech recognition model trained on about 680,000 hours of audio collected from the web; a widely used open-weight ASR model.
* '''WER (Word Error Rate)''' — The primary metric for ASR: (Substitutions + Deletions + Insertions) / Total Reference Words.
* '''Vocoder''' — A model that converts acoustic features such as mel spectrograms into raw audio waveforms (e.g., WaveNet, HiFi-GAN).
* '''Voice cloning''' — Synthesizing speech in a target speaker's voice from a short audio sample, enabling personalized TTS.
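
The WER formula in the list above can be made concrete with a short sketch. It assumes whitespace tokenization and no text normalization (real evaluations usually lowercase and strip punctuation first), and the function name <code>word_error_rate</code> is a hypothetical helper, not part of any library. The combined count of substitutions, deletions, and insertions is taken as the word-level edit distance between the reference and the hypothesis.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (S + D + I) / N, where N is the number of reference words.

    Assumption: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors while the denominator is the reference length only, WER can exceed 1.0 when the hypothesis is much longer than the reference.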
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">