== <span style="color: #FFFFFF;">Understanding</span> ==
Audio is a continuous one-dimensional waveform, typically sampled at 16,000–44,100 Hz. A 10-second speech clip at 16 kHz contains 160,000 samples, which is a long sequence for a model to process. Learning directly from raw waveforms is possible (WaveNet, SoundStream) but computationally expensive. The dominant approach is to convert the audio to a 2D time-frequency representation (a spectrogram) and then apply computer vision or sequence models.

'''ASR evolution''': Early ASR combined separate acoustic models (HMMs + Gaussian mixtures) and language models. Deep learning replaced the acoustic model with neural networks but kept the pipeline. Modern end-to-end models (CTC-based or attention-based) jointly learn the acoustic-to-linguistic mapping. Whisper uses a transformer encoder-decoder, treating ASR as a sequence-to-sequence translation problem, and is robust across accents, noise, and languages.

'''Self-supervised audio''': Wav2Vec 2.0 pre-trains on unlabeled audio by learning to identify the correct quantized speech representation for a masked segment among distractors. This contrastive objective learns rich speech representations without transcriptions. Fine-tuning on as little as 10 minutes of labeled audio achieves competitive ASR, which is transformative for low-resource languages.

'''TTS pipeline''': Text → phonemes → acoustic features (mel spectrogram) → waveform. Tacotron 2 generates mel spectrograms from character or phoneme sequences. HiFi-GAN (a vocoder) converts mel spectrograms to high-fidelity audio. Modern neural TTS is nearly indistinguishable from human speech in quality evaluations.

'''The voice cloning challenge''': Modern TTS systems like VALL-E can clone a voice from 3 seconds of audio, generating speech in that voice from any text. This enables transformative accessibility applications and, at the same time, creates severe deepfake risks.
</div> <div style="background-color: #8B0000; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">
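The sequence-length argument above can be made concrete with a little arithmetic. The sketch below counts raw samples for a clip and the number of spectrogram frames it would yield. The 25 ms window and 10 ms hop are assumed, illustrative framing values rather than settings stated in this article, and the helper names are hypothetical.

```python
def num_samples(duration_s: float, sample_rate_hz: int) -> int:
    """Number of raw waveform samples in a clip."""
    return int(round(duration_s * sample_rate_hz))


def num_frames(n_samples: int, win_length: int, hop_length: int) -> int:
    """Number of full analysis windows that fit, with no padding."""
    if n_samples < win_length:
        return 0
    return 1 + (n_samples - win_length) // hop_length


# Assumed framing (illustrative): 25 ms window, 10 ms hop at 16 kHz.
SAMPLE_RATE = 16_000
WIN = 400  # 25 ms * 16 kHz
HOP = 160  # 10 ms * 16 kHz

samples = num_samples(10.0, SAMPLE_RATE)
frames = num_frames(samples, WIN, HOP)
print(samples, frames)  # 160000 998
```

Under these assumptions the model sees roughly a thousand frames instead of 160,000 samples, which is one reason spectrogram inputs are cheaper to process than raw waveforms.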
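The contrastive objective described for Wav2Vec 2.0 can be sketched as a softmax over similarity scores: the context vector at a masked position should score higher against the true quantized target than against the distractors. This is a minimal pure-Python sketch of that one term only; the function names, the toy 2D vectors, and the temperature of 0.1 are assumptions for illustration, and the full training loss has additional terms not shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(context, positive, distractors, temperature=0.1):
    """Negative log-softmax score of the true target among all candidates."""
    logits = [cosine(context, positive) / temperature]
    logits += [cosine(context, d) / temperature for d in distractors]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


# Toy vectors (hypothetical): the target is close to the first context.
target = [0.9, 0.1]
distractors = [[0.0, 1.0], [-1.0, 0.0]]
aligned = contrastive_loss([1.0, 0.0], target, distractors)
misaligned = contrastive_loss([0.0, 1.0], target, distractors)
print(aligned < misaligned)  # True
```

The loss is small when the context vector points toward the true target and grows when it points toward a distractor, which is the signal that shapes the learned representations without any transcriptions.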