Speech Recognition
Speech recognition (also known as automatic speech recognition or ASR) converts spoken language into written text.
Hands-on Example: Transcribing Speech
from transformers import pipeline
import librosa
import matplotlib.pyplot as plt
import numpy as np
import requests
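# Initialize the automatic speech recognition pipeline.
# The checkpoint is named explicitly so results are reproducible; this English
# wav2vec2 model is an assumption, swap in any ASR checkpoint you prefer.
transcriber = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")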
# Download an audio sample
audio_url = "https://github.com/librosa/librosa/raw/main/tests/data/choice.wav"
response = requests.get(audio_url, timeout=30)
response.raise_for_status()  # Fail early if the download did not succeed
with open("speech_sample.wav", "wb") as f:
    f.write(response.content)
# Load the audio
audio, sr = librosa.load("speech_sample.wav", sr=16000)
# Visualize the waveform
plt.figure(figsize=(10, 4))
plt.plot(np.arange(len(audio)) / sr, audio)  # Time of sample i is i / sr
plt.title("Audio Waveform")
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.tight_layout()
plt.show()
# Transcribe audio
result = transcriber("speech_sample.wav")
print(f"Transcription: {result['text']}")
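The speech recognition pipeline decodes the audio file, runs a pre-trained model over it, and returns a dictionary whose "text" field holds the transcription. The underlying models are pre-trained on large speech datasets and then fine-tuned for transcription.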
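The pipeline can also take a waveform that is already in memory instead of a file path. The sketch below packages a NumPy array in the dictionary form the pipeline accepts ("raw" plus "sampling_rate"). The helper names to_pipeline_input and transcribe_array are hypothetical, introduced here for illustration, and the down-mixing step assumes multi-channel audio is laid out as (channels, samples), as librosa.load(mono=False) returns it.

```python
import numpy as np


def to_pipeline_input(audio, sr):
    """Package an in-memory waveform as a dict for the ASR pipeline."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        # Assumption: shape is (channels, samples); average channels to mono.
        audio = audio.mean(axis=0)
    return {"raw": audio, "sampling_rate": sr}


def transcribe_array(transcriber, audio, sr):
    """Run a transcriber callable on a waveform and return only the text."""
    return transcriber(to_pipeline_input(audio, sr))["text"]


# Demo on a synthetic one-second stereo tone (no model is loaded here).
sr = 16000
t = np.arange(sr) / sr
stereo = np.stack([np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 220 * t)])
payload = to_pipeline_input(stereo, sr)
print(payload["raw"].shape, payload["sampling_rate"])
```

With the objects from the example above, transcribe_array(transcriber, audio, sr) would transcribe the array loaded by librosa. The sampling rate passed in should be the one the model expects, which is 16 kHz for the wav2vec2 checkpoints.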
Try It Yourself:
- Record your own voice using a tool like Audacity and transcribe it.
- Try transcribing audio in other languages using a multilingual model such as openai/whisper-small. Note that facebook/wav2vec2-large-960h-lv60-self is trained on English speech only.
- Experiment with audio that has background noise or multiple speakers to test model robustness.
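To put a number on robustness in the last exercise, one common measure is the word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal pure-Python sketch. The function name word_error_rate and the example transcripts are hypothetical placeholders, not real model output.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    # Lowercase both sides so casing differences are not counted as errors.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Placeholder transcripts; replace them with what you said and what the
# model returned for the clean and the noisy recording.
reference = "the quick brown fox jumps over the lazy dog"
clean = "the quick brown fox jumps over the lazy dog"
noisy = "the quick brown box jumps over lazy dog"
print(f"WER (clean): {word_error_rate(reference, clean):.2f}")
print(f"WER (noisy): {word_error_rate(reference, noisy):.2f}")
```

Lowercasing matters here because some checkpoints emit text in a single case. Punctuation is not stripped in this sketch, so remove it from both transcripts first if the model produces any.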