Project Description

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)


Project Title

whisperX: Advanced Speech Recognition with Word-level Timestamps and Speaker Diarization

Overview

WhisperX is an open-source Python project that provides fast automatic speech recognition (ASR) with word-level timestamps and speaker diarization. It builds on OpenAI's Whisper model, adding forced phoneme alignment for accurate word timing and voice-activity-based batched inference for speed. The project stands out for its faster-than-real-time transcription (up to 70x real-time with large-v2) and its ability to handle multispeaker audio.
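
A minimal sketch of this transcription and alignment flow, based on the usage shown in the project's README; function names such as whisperx.load_model, load_audio, load_align_model, and align follow that example and may differ between whisperX versions:

    import whisperx

    device = "cuda"           # or "cpu"
    audio_file = "audio.mp3"  # placeholder path
    batch_size = 16           # reduce this if GPU memory is tight

    # 1. Transcribe with the batched faster-whisper backend
    model = whisperx.load_model("large-v2", device, compute_type="float16")
    audio = whisperx.load_audio(audio_file)
    result = model.transcribe(audio, batch_size=batch_size)
    print(result["segments"])  # segment-level timestamps only at this point

    # 2. Refine to word-level timestamps with a wav2vec2 alignment model
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    result = whisperx.align(result["segments"], align_model, metadata, audio, device)
    print(result["segments"])  # each segment now carries per-word start/end times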

Key Features

  • Batched inference for 70x real-time transcription using Whisper large-v2
  • Utilizes the faster-whisper backend, requiring less than 8GB GPU memory
  • Accurate word-level timestamps via wav2vec2 alignment
  • Multispeaker ASR with speaker diarization from pyannote-audio (see the pipeline sketch after this list)
  • Voice Activity Detection (VAD) preprocessing to reduce hallucination and enable efficient batching
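
As referenced in the diarization item above, a minimal sketch of a full multispeaker pipeline, again following the README's example; the DiarizationPipeline and assign_word_speakers names are taken from that example (a Hugging Face access token is needed for the gated pyannote models) and may vary across versions:

    import whisperx

    device = "cuda"
    hf_token = "YOUR_HF_TOKEN"  # placeholder; the pyannote diarization models are gated on Hugging Face

    audio = whisperx.load_audio("audio.mp3")

    # Transcribe and align (condensed from the Overview sketch)
    model = whisperx.load_model("large-v2", device, compute_type="float16")
    result = model.transcribe(audio, batch_size=16)
    align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
    result = whisperx.align(result["segments"], align_model, metadata, audio, device)

    # Diarize, then attach a speaker label to each aligned segment and word
    diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
    diarize_segments = diarize_model(audio)  # min_speakers/max_speakers can be passed if known
    result = whisperx.assign_word_speakers(diarize_segments, result)

    for segment in result["segments"]:
        print(segment.get("speaker"), segment["start"], segment["end"], segment["text"])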

Use Cases

  • Researchers and developers needing high-speed, accurate transcriptions for audio analysis
  • Call-center analytics combining fast transcription with speaker identification
  • Use in multimedia content creation for automated captioning and subtitling

Advantages

  • Transcription at up to 70x real-time speed, significantly faster than standard ASR tools
  • Improved timestamp accuracy at the word level, enhancing the usability of transcriptions
  • Open-source and community-driven, allowing for continuous improvement and customization

Limitations / Considerations

  • Requires a GPU with at least 8GB of memory for optimal performance with the large-v2 model; memory use can be reduced as shown in the sketch after this list
  • The project's complexity might pose a steep learning curve for new users
  • May have limitations in handling extremely noisy environments or non-native speaker accents
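
For the memory constraint noted above, the project's README suggests trading some accuracy and throughput for a smaller footprint, e.g. a smaller checkpoint, int8 computation, or a reduced batch size. A hedged sketch of such a configuration (option names may differ between versions):

    import whisperx

    device = "cuda"

    # Lower-memory configuration: smaller checkpoint, int8 quantization, smaller batch size
    model = whisperx.load_model("medium", device, compute_type="int8")
    audio = whisperx.load_audio("audio.mp3")
    result = model.transcribe(audio, batch_size=4)
    print(result["segments"])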

Similar / Related Projects

  • Mozilla DeepSpeech: An open-source speech-to-text engine focused on offline, on-device use, taking a different approach to ASR that does not rely on cloud services.
  • Kaldi: A widely-used open-source speech recognition toolkit that offers a range of features but may not match whisperX's speed.
  • wav2vec 2.0: A self-supervised speech representation model; whisperX uses wav2vec2-based phoneme models for forced alignment, but on its own it is not an end-to-end transcription solution like whisperX.

Basic Information


📊 Project Information

  • Project Name: whisperX
  • GitHub URL: https://github.com/m-bain/whisperX
  • Programming Language: Python
  • โญ Stars: 17,103
  • ๐Ÿด Forks: 1,808
  • ๐Ÿ“… Created: 2022-12-09
  • ๐Ÿ”„ Last Updated: 2025-08-04

๐Ÿท๏ธ Project Topics

Topics: asr, speech, speech-recognition, speech-to-text, whisper


📚 Documentation


This article is automatically generated by AI based on GitHub project information and README content analysis

