Whisper Web API Reference
This section provides a comprehensive reference for all modules and classes in the Whisper Web real-time transcription system.
Core Components
Server & API
- class whisper_web.server.ClientSession(session_id: str, model_config: ModelConfig)[source]
Bases: object
Represents an isolated client transcription session with dedicated resources.
Each ClientSession encapsulates a complete transcription pipeline for a single client, including its own event bus, transcription manager, and Whisper model instance. This design enables concurrent multi-client support with isolated state and configurations.
Session Components:
Event Bus: Dedicated event system for inter-component communication
Transcription Manager: Handles audio queuing and transcription state
Whisper Model: Configured ASR model instance for speech recognition
Inference Task: Async task managing the transcription processing loop
Download Tracking: Monitors model download progress
Key Features:
Isolation: Each session operates independently with separate state
Lifecycle Management: Handles session startup, operation, and cleanup
Event-Driven Architecture: Uses publish-subscribe pattern for loose coupling
Async Processing: Non-blocking inference execution with proper task management
Model Download Handling: Tracks and responds to model download events
- Parameters:
session_id (str) – Unique identifier for this client session
model_config (ModelConfig) – Configuration for the Whisper model used in this session
- Variables:
session_id – Unique session identifier
model_config – Whisper model configuration
event_bus – Session-specific event bus for component communication
manager – Transcription manager handling audio processing
model – Whisper model instance for speech recognition
inference_task – Async task running the inference loop
is_downloading – Flag indicating if model is currently downloading
- async handle_model_download(event: DownloadModel)[source]
Handle model download progress events for this session.
Updates the session’s download status based on model download events, allowing the session to track when models are being loaded and when they become available for inference.
- Parameters:
event (DownloadModel) – Model download event containing URL and completion status
Behavior:
Sets is_downloading to True when download starts
Sets is_downloading to False when download completes
Logs download status changes for monitoring
- async start_inference()[source]
Start the transcription inference task for this session.
Creates and starts an async task that runs the main inference loop, processing audio chunks through the Whisper model. If an inference task is already running, this method has no effect.
Behavior:
Creates new inference task if none exists or previous task completed
Task runs indefinitely until manually stopped or session cleanup
Uses the session’s transcription manager and model for processing
Handles audio chunks from the session’s event bus
Note
The inference task runs in the background and must be explicitly stopped using stop_inference() for proper cleanup.
- async stop_inference()[source]
Stop the transcription inference task for this session.
Gracefully cancels the running inference task and waits for proper cleanup. Handles cancellation exceptions to ensure clean shutdown without propagating cancellation errors.
Behavior:
Cancels the inference task if currently running
Waits for task cancellation to complete
Suppresses CancelledError exceptions from task cleanup
Safe to call multiple times or when no task is running
Note
This method should be called during session cleanup to prevent resource leaks and ensure proper task termination.
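The lifecycle described above can be exercised directly. A minimal sketch, assuming whisper_web is installed and using a small CPU model to keep startup light; in a real deployment audio would arrive through the session's event bus::

    import asyncio

    from whisper_web.server import ClientSession
    from whisper_web.whisper_model import ModelConfig

    async def main():
        # Create an isolated session with its own event bus, manager, and model.
        session = ClientSession("demo-client", ModelConfig(model_size="small", device="cpu"))

        await session.start_inference()   # spawn the background inference task
        await asyncio.sleep(5.0)          # audio would normally arrive via the session's event bus
        await session.stop_inference()    # cancel the task and clean up

    asyncio.run(main())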
- pydantic model whisper_web.server.CreateSessionRequest[source]
Bases: BaseModel
Request schema for creating a new session.
- field model_configuration: ModelConfig | None = None
Model configuration for the session
- field session_id: str | None = None
Optional custom session ID
- pydantic model whisper_web.server.CurrentTranscriptionResponse[source]
Bases: BaseModel
Response schema for current transcription.
- field current_transcription: str [Required]
Current transcription text
- field session_id: str [Required]
Session identifier
- pydantic model whisper_web.server.FinalTranscriptionResponse[source]
Bases: BaseModel
Response schema for final transcription.
- field final_transcription: str [Required]
Final transcription text
- field session_id: str [Required]
Session identifier
- pydantic model whisper_web.server.InstalledModelsResponse[source]
Bases: BaseModel
Response schema for getting all installed models.
- field installed_models: List[str] [Required]
List of all installed models
- pydantic model whisper_web.server.MessageResponse[source]
Bases: BaseModel
Generic response schema for operations with messages.
- field message: str [Required]
Operation result message
- field session_id: str | None = None
Session identifier
- pydantic model whisper_web.server.QueueProcessedResponse[source]
Bases: BaseModel
Response schema for processed queue items.
- field audio_queue_processed: int [Required]
Number of audio chunks processed
- field session_id: str [Required]
Session identifier
- pydantic model whisper_web.server.QueueSizeResponse[source]
Bases: BaseModel
Response schema for audio queue size.
- field audio_queue_size: int [Required]
Number of audio chunks in queue
- field session_id: str [Required]
Session identifier
- pydantic model whisper_web.server.SessionInfo[source]
Bases: BaseModel
Schema for session information in list responses.
- field audio_queue_size: int [Required]
Number of audio chunks in queue
- field current_transcription: str [Required]
Current transcription text
- field inference_running: bool [Required]
Whether inference is currently running
- field model_configuration: Dict [Required]
Model configuration
- field session_id: str [Required]
Unique session identifier
- field transcription_count: int [Required]
Number of completed transcriptions
- pydantic model whisper_web.server.SessionListResponse[source]
Bases: BaseModel
Response schema for listing sessions.
- field sessions: List[SessionInfo] [Required]
List of active sessions
- field total_sessions: int [Required]
Total number of active sessions
- pydantic model whisper_web.server.SessionOperationResponse[source]
Bases: BaseModel
Response schema for session operations like restart.
- field inference_running: bool | None = None
Whether inference is running after operation
- field message: str [Required]
Operation result message
- field session_id: str [Required]
Session identifier
- pydantic model whisper_web.server.SessionResponse[source]
Bases: BaseModel
Response schema for session creation.
- field message: str | None = None
Optional message (e.g., for existing sessions)
- field model_configuration: Dict [Required]
Model configuration used for the session
- field session_id: str [Required]
Unique session identifier
- pydantic model whisper_web.server.SessionStatusResponse[source]
Bases: BaseModel
Response schema for session status.
- field audio_queue_processed: int [Required]
Number of audio chunks processed
- field audio_queue_size: int [Required]
Number of audio chunks in queue
- field inference_running: bool [Required]
Whether inference is currently running
- field is_downloading: bool [Required]
Whether model is currently downloading
- field model_configuration: Dict [Required]
Model configuration
- field session_id: str [Required]
Session identifier
- field transcription_count: int [Required]
Number of completed transcriptions
- class whisper_web.server.TranscriptionServer(default_model_config: ModelConfig | None = None, host: str = '0.0.0.0', port: int = 8000)[source]
Bases: object
A real-time speech transcription API server built with FastAPI.
This service provides a full audio transcription pipeline via HTTP and WebSocket endpoints, including:
Audio stream ingestion per client.
Real-time automatic speech recognition (ASR) using configurable Whisper models.
Voice Activity Detection (VAD) to detect speech regions.
Status monitoring of transcription, generation, and voice activity per client.
Retrieval and posting of transcription data per client.
Multi-client support with isolated sessions.
- Parameters:
default_model_config (ModelConfig, optional) – Default configuration for the ASR model (e.g., Whisper), defaults to ModelConfig()
host (str, optional) – Hostname for the FastAPI server, defaults to "0.0.0.0"
port (int, optional) – Port for the FastAPI server, defaults to 8000
Note
Each client connection creates its own isolated session with a dedicated TranscriptionManager and WhisperModel. This allows multiple clients to use different model configurations and have separate transcription states.
The server exposes endpoints for creating sessions, managing transcriptions per session, and retrieving processing status per client.
- _setup_api_routes()[source]
Configure all HTTP API routes for the FastAPI application.
This method organizes and sets up the complete REST API interface for the transcription server. Routes are grouped by functionality to provide a comprehensive API for session management and transcription access.
Route Categories:
Session Management: Create, delete, list, and manage sessions
Transcription Access: Retrieve current, final, and historical transcriptions
Queue Monitoring: Monitor audio processing queues and statistics
API Design:
RESTful design with resource-based URLs
Consistent response schemas across endpoints
Proper HTTP status codes and error handling
Session-scoped operations for multi-client support
Note
Routes are automatically registered with the FastAPI application and include OpenAPI documentation for interactive API exploration.
- _setup_session_management_routes()[source]
Configure HTTP routes for session lifecycle management.
Sets up endpoints that handle the complete session lifecycle including creation, deletion, status monitoring, and operational controls. These routes provide the foundation for multi-client session management.
Endpoints Configured:
GET /installed_models - Get list of installed models on the server
POST /sessions - Create new session with optional configuration
POST /sessions/{session_id} - Create session with specific ID
DELETE /sessions/{session_id} - Remove session and cleanup resources
GET /sessions - List all active sessions with statistics
GET /sessions/{session_id}/status - Get detailed session status
POST /sessions/{session_id}/clear - Clear session transcription history
POST /sessions/{session_id}/restart - Restart session inference
Features:
Model configuration per session
Session existence validation
Graceful resource cleanup
Comprehensive status reporting
Operational controls for session management
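As an illustration of these endpoints, a hedged client-side sketch using the requests library; field names follow the request and response schemas documented above, and the server is assumed to run locally on port 8000::

    import requests

    BASE = "http://localhost:8000"

    # Create a session with an explicit ID and the server's default model config.
    created = requests.post(f"{BASE}/sessions", json={"session_id": "demo-client"}).json()
    session_id = created["session_id"]

    # Inspect its status (queue size, inference state, download flag, ...).
    status = requests.get(f"{BASE}/sessions/{session_id}/status").json()
    print(status["inference_running"], status["audio_queue_size"])

    # Remove the session and free its resources when done.
    requests.delete(f"{BASE}/sessions/{session_id}")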
- _setup_transcription_routes()[source]
Configure HTTP routes for transcription data access and queue management.
Sets up endpoints that provide access to transcription results, real-time status monitoring, and audio queue management. These routes enable clients to retrieve transcription data and monitor processing progress.
Transcription Data Endpoints:
GET /sessions/{session_id}/transcriptions - Get all completed transcriptions
GET /sessions/{session_id}/transcription/current - Get current active transcription
GET /sessions/{session_id}/transcription/final - Get complete final transcription
Queue Management Endpoints:
GET /sessions/{session_id}/queue/size - Get current audio queue size
GET /sessions/{session_id}/queue/processed - Get processed chunk count
POST /sessions/{session_id}/queue/clear - Clear pending audio chunks
Features:
Real-time transcription access
Processing progress monitoring
Queue management and clearing
Session-scoped data isolation
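Retrieving results and monitoring the queue for a session might look like the following sketch, under the same assumptions as above::

    import requests

    BASE = "http://localhost:8000"
    session_id = "demo-client"

    # Current (partial) hypothesis and the aggregated final transcription.
    current = requests.get(f"{BASE}/sessions/{session_id}/transcription/current").json()
    final = requests.get(f"{BASE}/sessions/{session_id}/transcription/final").json()
    print(current["current_transcription"], final["final_transcription"])

    # Queue statistics and clearing pending audio.
    size = requests.get(f"{BASE}/sessions/{session_id}/queue/size").json()["audio_queue_size"]
    requests.post(f"{BASE}/sessions/{session_id}/queue/clear")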
- _setup_ws_route()[source]
Configure WebSocket route for real-time audio streaming and transcription.
Sets up the primary WebSocket endpoint that handles real-time audio streaming from clients. This endpoint manages the complete audio-to-transcription pipeline including session management, audio processing, and connection lifecycle.
WebSocket Protocol:
Endpoint: /ws/transcribe/{session_id}
Binary Protocol: First byte indicates finality flag, remaining bytes contain WAV audio data
Session Management: Automatically creates or retrieves existing sessions
Real-time Processing: Streams audio directly to transcription pipeline
Connection Lifecycle:
Connection: Accept WebSocket connection for specified session
Session Setup: Get or create session with default configuration
Inference Start: Begin transcription processing for the session
Audio Streaming: Continuously receive and process audio chunks
Cleanup: Handle disconnections and errors gracefully
Audio Processing:
Receives binary audio data in WAV format
Extracts finality flag from protocol header
Converts audio to tensor format for processing
Publishes audio events to session event bus
Maintains real-time processing capabilities
Error Handling:
Graceful WebSocket disconnection handling
Exception logging without service interruption
Automatic session cleanup on connection loss
Note
This WebSocket endpoint is the primary interface for real-time transcription and supports the binary protocol expected by clients.
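A minimal streaming client following this binary protocol might look like the sketch below, using the third-party websockets package and assuming the server is reachable on localhost:8000; the WAV file path is illustrative::

    import asyncio
    import websockets  # third-party WebSocket client library

    async def stream_wav(session_id: str, wav_path: str) -> None:
        uri = f"ws://localhost:8000/ws/transcribe/{session_id}"
        async with websockets.connect(uri) as ws:
            with open(wav_path, "rb") as f:
                wav_bytes = f.read()
            # First byte: finality flag (1 = final chunk), remaining bytes: WAV audio data.
            await ws.send(bytes([1]) + wav_bytes)

    asyncio.run(stream_wav("demo-client", "sample.wav"))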
- async cleanup_inactive_sessions()[source]
Remove sessions with failed or completed inference tasks.
Performs maintenance by identifying and removing sessions whose inference tasks have finished or failed. This prevents accumulation of dead sessions and ensures system resources are properly reclaimed.
Behavior:
Scans all active sessions for completed inference tasks
Identifies sessions with failed tasks (exceptions)
Marks failed/completed sessions for removal
Removes inactive sessions and performs cleanup
Logs cleanup activities for monitoring
Use Cases:
Periodic maintenance to clean up dead sessions
Error recovery after inference failures
Resource management in long-running deployments
Note
This method should be called periodically or after detecting inference failures to maintain system health.
- get_or_create_session(session_id: str | None = None, model_config: ModelConfig | None = None) ClientSession[source]
Retrieve an existing session or create a new one with specified configuration.
This method implements the session management logic, ensuring each client gets a dedicated session with isolated resources. If a session doesn’t exist, it creates a new one with the provided or default model configuration.
- Parameters:
session_id (Optional[str]) – Unique identifier for the session. If None, generates a UUID
model_config (Optional[ModelConfig]) – Model configuration for new sessions. Uses default if None
- Returns:
The existing or newly created client session
- Return type:
ClientSession
Behavior:
Generates UUID if no session_id provided
Returns existing session if session_id already exists
Creates new session with provided or default model configuration
New sessions are immediately ready for inference
- async remove_session(session_id: str)[source]
Remove a client session and perform complete cleanup.
Safely removes a session by stopping its inference task, cleaning up resources, and removing it from the active sessions dictionary. This method ensures proper resource cleanup to prevent memory leaks.
- Parameters:
session_id (str) – Unique identifier of the session to remove
Behavior:
Stops the session’s inference task gracefully
Removes session from active sessions dictionary
Performs cleanup of session resources
Logs removal for monitoring purposes
Safe to call for non-existent sessions (no-op)
- run()[source]
Starts the FastAPI application in a separate thread.
This method runs the FastAPI server using Uvicorn, which handles the HTTP requests for the transcription service. The server is launched in a separate thread, allowing the application to run concurrently with other tasks. It uses the host and port parameters defined in the class to bind the server.
The server operates as a daemon thread, meaning it will not block the main program from exiting.
Note
The application listens for HTTP requests at the specified host and port.
Ensure that the necessary configurations for the host and port are provided when calling this method.
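A minimal launch sketch; the keyword names mirror the constructor signature above, and the idle loop simply keeps the main thread alive while the daemon thread serves requests::

    import time

    from whisper_web.server import TranscriptionServer
    from whisper_web.whisper_model import ModelConfig

    server = TranscriptionServer(
        default_model_config=ModelConfig(model_size="large-v3", device="cuda"),
        host="0.0.0.0",
        port=8000,
    )
    server.run()  # Uvicorn runs in a separate daemon thread

    while True:   # keep the main thread alive so the daemon thread keeps serving
        time.sleep(1)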
- pydantic model whisper_web.server.TranscriptionsResponse[source]
Bases: BaseModel
Response schema for getting all transcriptions.
- field session_id: str [Required]
Session identifier
- field transcriptions: List[str] [Required]
List of all transcriptions
The main FastAPI server that provides RESTful endpoints and WebSocket connections for real-time transcription services. Handles session management, audio streaming, and transcription delivery.
Transcription and Audio Management
- class whisper_web.management.AudioManager(event_bus: EventBus)[source]
Bases: object
Manages audio chunk lifecycle and distribution in the transcription pipeline.
The AudioManager handles the flow of raw audio chunks from generation to consumption, providing a buffered interface between audio input sources and transcription models. It operates through event-driven architecture to ensure loose coupling and scalability.
Core Responsibilities:
Audio Chunk Buffering: Maintains a thread-safe queue of processed audio chunks
Event-Driven Processing: Subscribes to audio generation events and manages chunk flow
Chunk Metadata Tracking: Monitors chunk counts and processing statistics
Async Chunk Distribution: Provides non-blocking access to queued audio data
Error Handling: Gracefully handles audio processing errors and timeouts
Event Subscriptions:
AudioChunkGenerated: Receives and queues newly generated audio chunks
AudioChunkNum: Updates total expected chunk count for progress tracking
Key Features:
Thread-Safe Operations: All queue operations are async-safe for concurrent access
Timeout Handling: Non-blocking audio retrieval with configurable timeouts
Progress Tracking: Monitors processing progress against expected chunk counts
Error Resilience: Continues operation despite individual chunk processing errors
- Parameters:
event_bus (EventBus) – Event bus instance for inter-component communication
- Variables:
audio_chunk_queue – Thread-safe queue for audio chunk storage
processed_chunks – Counter for successfully processed audio chunks
num_chunks – Total expected number of chunks for current session
- async _handle_generated_audio_chunk(event: AudioChunkGenerated) None[source]
Process generated audio chunk events and queue valid audio data.
Validates and queues incoming audio chunks, ensuring only chunks with valid audio data are added to the processing queue. Includes error handling to maintain system stability during audio processing issues.
- Parameters:
event (AudioChunkGenerated) – Audio chunk generated event containing audio data and finality flag
Behavior:
Validates audio chunk contains non-empty tensor data
Increments processed chunk counter for progress tracking
Queues valid audio chunks with finality flag for downstream processing
Logs errors without interrupting processing flow
- async _handle_num_chunks_updated(event: AudioChunkNum) None[source]
Update the total expected chunk count for progress tracking.
Receives chunk count updates to maintain accurate progress tracking during file processing or streaming sessions.
- Parameters:
event (AudioChunkNum) – Audio chunk number event containing total expected chunks
- clear_audio_queue() None[source]
Remove all pending audio chunks from the processing queue.
Empties the audio chunk queue by discarding all queued audio data. Useful for session cleanup or resetting audio processing state.
Warning
This operation discards all pending audio chunks and cannot be undone. Use with caution during active audio processing sessions.
- async get_next_audio_chunk() Tuple[AudioChunk, bool] | None[source]
Retrieve the next available audio chunk from the processing queue.
Provides non-blocking access to queued audio chunks with timeout handling. Returns None when no audio is available within the timeout period, allowing callers to handle empty queue conditions gracefully.
- Returns:
Tuple of (AudioChunk, is_final) if available, None if timeout/error
- Return type:
Tuple[AudioChunk, bool]|None
Behavior:
Waits up to 1 second for audio chunk availability
Returns tuple containing audio chunk and finality flag
Returns None on timeout or processing errors
Logs errors without raising exceptions
- property queue_size: int
Get the current number of audio chunks waiting in the processing queue.
- Returns:
Number of queued audio chunks
- Return type:
int
- property stats: dict
Get comprehensive statistics about audio chunk processing.
Provides key metrics for monitoring audio processing performance, including queue utilization and chunk processing progress.
- Returns:
Dictionary containing audio processing statistics
- Return type:
dict
Returns:
queue_size: Number of audio chunks currently in processing queue
processed_chunks: Total number of audio chunks processed
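A consumer loop built on this interface might look like the following sketch; in a real pipeline an audio source such as the InputStreamGenerator would publish chunks on the same event bus, so running this alone would simply time out repeatedly::

    import asyncio

    from whisper_web.events import EventBus
    from whisper_web.management import AudioManager

    async def consume(manager: AudioManager) -> None:
        while True:
            item = await manager.get_next_audio_chunk()
            if item is None:
                continue  # queue was empty within the timeout window
            chunk, is_final = item
            print(f"chunk at {chunk.timestamp}, final={is_final}, queued={manager.queue_size}")
            if is_final:
                break

    asyncio.run(consume(AudioManager(EventBus())))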
- class whisper_web.management.TranscriptionManager(event_bus: EventBus)[source]
Bases: object
Manages the complete transcription pipeline from audio input to text output.
The TranscriptionManager serves as the central coordinator for the real-time transcription process, handling audio chunk queuing, transcription state management, and result aggregation. It operates through an event-driven architecture, responding to audio events and publishing transcription updates.
Core Responsibilities:
Audio Queue Management: Maintains a thread-safe queue of incoming audio chunks
Transcription State Tracking: Manages both current and historical transcription results
Event Coordination: Subscribes to audio events and publishes transcription updates
Inference Loop: Provides async interface for model inference execution
Result Aggregation: Combines partial and final transcriptions into complete text
Event Subscriptions:
AudioChunkReceived: Queues audio data for processing
TranscriptionCompleted: Updates transcription state and publishes results
Event Publications:
TranscriptionUpdated: Notifies subscribers of new transcription results
- Parameters:
event_bus (EventBus) – Event bus instance for inter-component communication
- Variables:
transcriptions – List of completed transcription segments
current_transcription – Current active transcription text
audio_queue – Thread-safe queue for audio processing
processed_chunks – Counter for processed audio chunks
num_chunks – Total expected number of chunks
- async _handle_audio_chunk(event: AudioChunkReceived) None[source]
Process incoming audio chunk events and queue valid audio data.
Validates incoming audio chunks and adds them to the processing queue if they contain valid audio data (non-empty tensors).
- Parameters:
event (AudioChunkReceived) – Audio chunk received event containing audio data and finality flag
- async _handle_transcription_completed(event: TranscriptionCompleted) None[source]
Process completed transcription events and update internal state.
Updates the current transcription text, increments processed chunk counter, and manages the transcription history. Publishes transcription update events to notify other components of new results.
- Parameters:
event (TranscriptionCompleted) – Transcription completed event with result text and finality flag
Behavior:
Updates current transcription text from event
Appends to transcription history if final or first result
Publishes TranscriptionUpdated event with current and full text
- clear_audio_queue() None[source]
Remove all pending audio chunks from the processing queue.
Empties the audio queue by discarding all queued audio data. Useful for resetting the transcription state or handling session cleanup.
Warning
This operation discards all pending audio data and cannot be undone. Use with caution during active transcription sessions.
- property full_transcription: str
Get the complete transcription text including all segments.
Combines all completed transcription segments with the current active transcription to provide the full transcribed text.
- Returns:
Complete transcription text with all segments joined
- Return type:
str
- property queue_size: int
Get the current number of audio chunks waiting in the processing queue.
- Returns:
Number of queued audio chunks
- Return type:
int
- async run_batched_inference(model, batch_size: int = 1, batch_timeout_s: float = 0.1) None[source]
Execute the main batched inference loop for continuous audio processing.
Continuously retrieves audio data from the queue and processes it in batches for improved efficiency. Collects multiple audio samples up to the specified batch size before passing them to the model for transcription. Handles timeouts gracefully to prevent blocking and ensures timely processing even with partial batches.
- Parameters:
model (Callable[[Tuple[list[torch.Tensor], list[bool]]], Awaitable[None]]) – Async callable model that processes batched audio tensors
batch_size (int) – Maximum number of audio samples to collect per batch
batch_timeout_s (float) – Timeout in seconds for collecting additional batch items
Batching Strategy:
Waits up to 1 second for the first audio sample to avoid busy waiting
Collects additional samples with shorter timeout (batch_timeout_s) for responsiveness
Processes partial batches when timeout is reached or batch_size is filled
Maintains separate lists for audio tensors and finality flags
Processing Flow:
Initial Wait: Blocks up to 1s for first audio sample
Batch Collection: Gathers additional samples with short timeout
Batch Processing: Sends complete batch to model as tuple of lists
Continuous Loop: Repeats indefinitely for real-time processing
Timeout Behavior:
Long timeout (1s) for first item prevents CPU spinning when queue is empty
Short timeout (batch_timeout_s) for additional items ensures low latency
Graceful handling of timeouts without error propagation
Processes partial batches immediately when collection timeout occurs
Batch Format:
Model receives: (list[torch.Tensor], list[bool])
First list contains audio tensors for batch processing
Second list contains corresponding finality flags
Maintains order correspondence between tensors and flags
Performance Benefits:
Reduced model invocation overhead through batching
Improved GPU utilization with parallel processing
Lower per-sample processing latency at scale
Configurable batch size for memory/latency trade-offs
Note
This method runs indefinitely and should be executed in a separate task or thread. The batch_size parameter affects both memory usage and processing efficiency - larger batches improve throughput but increase latency.
Warning
Large batch sizes may cause GPU memory issues. Monitor memory consumption and adjust batch_size accordingly. The method will block if the audio queue is consistently empty.
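A sketch of driving this loop with a stand-in model callable; the batch format matches the description above, and dummy_model is purely illustrative::

    import asyncio

    from whisper_web.events import EventBus
    from whisper_web.management import TranscriptionManager

    async def dummy_model(batch):
        tensors, final_flags = batch  # (list[torch.Tensor], list[bool])
        print(f"processing {len(tensors)} chunk(s), finality flags: {final_flags}")

    async def main():
        manager = TranscriptionManager(EventBus())
        task = asyncio.create_task(
            manager.run_batched_inference(dummy_model, batch_size=4, batch_timeout_s=0.1)
        )
        await asyncio.sleep(3.0)  # let the loop run briefly; audio would arrive via events
        task.cancel()

    asyncio.run(main())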
- property stats: dict
Get comprehensive statistics about the transcription process.
Provides key metrics for monitoring transcription performance and state, including queue utilization and processing progress.
- Returns:
Dictionary containing transcription statistics
- Return type:
dict
Returns:
queue_size: Number of audio chunks in processing queue
processed_chunks: Total number of processed audio chunks
num_transcriptions: Number of completed transcription segments
Event-driven transcription manager that coordinates the entire transcription pipeline. Manages audio queues, processes chunks, and delivers completed transcriptions through the event system.
Speech Recognition
Whisper Model
- pydantic model whisper_web.whisper_model.ModelConfig[source]
Bases: BaseModel
Configuration for creating and loading the Whisper ASR model.
This class contains the configuration options required for initializing the Whisper model, including the model size, device type, and other parameters related to the inference process.
- field batch_size: int = 1
The batch size to be used for inference. This is the number of audio chunks processed in parallel.
- field batch_timeout_s: float = 0.1
The timeout in seconds for batch processing. If the batch is not filled within this time, it will be processed anyway.
- field continuous: bool = True
Whether to generate audio data continuously or not.
- field device: str = 'cuda'
The device to be used for inference. choices=[‘cpu’, ‘cuda’, ‘mps’]
- field model_id: str | None = None
The model id to be used for loading the model.
- field model_size: str = 'large-v3'
The size of the model to be used for inference. choices=[‘small’, ‘medium’, ‘large-v3’]
- field samplerate: int = 16000
The sample rate of the generated audio.
- field use_vad: bool = False
Whether to use VAD (Voice Activity Detection) or not.
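For example, a CPU-friendly configuration could be built like this; the field names are those documented above::

    from whisper_web.whisper_model import ModelConfig

    config = ModelConfig(
        model_size="small",   # resolves to a distil-whisper checkpoint (see load_model below)
        device="cpu",         # 'cuda' or 'mps' enable GPU acceleration when available
        batch_size=2,
        batch_timeout_s=0.1,
        use_vad=False,
    )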
- class whisper_web.whisper_model.WhisperModel(model_args: ModelConfig, event_bus: EventBus)[source]
Bases: object
Event-driven Whisper ASR model wrapper with optimized inference capabilities.
This class provides a high-level interface to OpenAI’s Whisper models (via Transformers) with event-driven architecture, device optimization, and asynchronous processing. It handles model loading, configuration, and transcription with automatic device selection and performance optimizations.
Core Features:
Event-Driven Architecture: Publishes transcription results via event bus
Device Optimization: Automatic device selection with CUDA, MPS, and CPU support
Async Processing: Non-blocking transcription using thread pools
Model Flexibility: Supports various Whisper model sizes and custom model IDs
Performance Optimizations: Includes dtype optimization and CUDA acceleration
Distil-Whisper Integration: Optimized distilled models for faster inference
Supported Models:
small: distil-whisper/distil-small.en (English only, fastest)
medium: distil-whisper/distil-medium.en (English only, balanced)
large-v3: distil-whisper/distil-large-v3 (Multilingual, most accurate)
Custom: Any HuggingFace Whisper-compatible model ID
Device Support:
CUDA: GPU acceleration with float16 precision
MPS: Apple Silicon GPU acceleration
CPU: Fallback with float32 precision
- Parameters:
model_args (ModelConfig) – Configuration object specifying model and device settings
event_bus (EventBus) – Event bus for publishing transcription results
- Variables:
device – Torch device used for model inference
samplerate – Audio sample rate for processing
torch_dtype – Data type used for model computations
speech_model – Loaded Whisper model for conditional generation
processor – Whisper processor for audio preprocessing
Note
The model automatically handles device placement, dtype conversion, and publishes results through the event system for loose coupling.
- load_model(model_size: str, model_id: str | None) None[source]
Load and initialize the Whisper ASR model with optimized configuration.
This method handles the complete model loading process including model ID resolution, cache management, device placement, and performance optimizations. It supports both predefined model sizes and custom model IDs from HuggingFace.
- Parameters:
model_size (str) – Predefined model size ('small', 'medium', 'large-v3') or custom size
model_id (Optional[str]) – Optional custom HuggingFace model ID. Overrides model_size if provided
Model ID Resolution:
If model_id is provided, uses it directly
Otherwise, maps model_size to appropriate distil-whisper model:
‘small’ → ‘distil-whisper/distil-small.en’
‘medium’ → ‘distil-whisper/distil-medium.en’
‘large-v3’ → ‘distil-whisper/distil-large-v3’
‘large’ → ‘distil-whisper/distil-large-v3’ (legacy mapping)
Optimizations Applied:
Automatic dtype selection (float16 for CUDA, float32 for CPU/MPS)
Low CPU memory usage during loading
SafeTensors format for improved security and performance
Configurable cache directory via HF_HOME environment variable
Cache Management:
Uses HF_HOME environment variable if set
Falls back to ‘./.models’ directory for local caching
Enables offline usage after initial download
Note
Model loading may take time on first use due to download requirements. Subsequent loads use cached models for faster initialization.
Warning
Ensure sufficient disk space for model caching. Large models can require several GB of storage space.
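An instantiation sketch showing both resolution paths; the custom model ID is only an illustrative example of a HuggingFace Whisper-compatible checkpoint, and both calls will download weights on first use::

    from whisper_web.events import EventBus
    from whisper_web.whisper_model import ModelConfig, WhisperModel

    bus = EventBus()

    # Predefined size: 'medium' resolves to distil-whisper/distil-medium.en.
    by_size = WhisperModel(ModelConfig(model_size="medium", device="cpu"), bus)

    # A custom HuggingFace ID overrides the size mapping.
    by_id = WhisperModel(ModelConfig(model_id="openai/whisper-tiny", device="cpu"), bus)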
Wrapper for OpenAI’s Whisper models with configuration management and device optimization. Provides the core speech-to-text functionality with support for various model sizes and configurations.
Audio Processing
Audio Input Stream Generator
- pydantic model whisper_web.inputstream_generator.GeneratorConfig[source]
Bases: BaseModel
Configuration model for controlling audio input generation behavior.
This configuration class is used to define how audio should be captured, processed, and segmented before being sent to a speech recognition system.
- field adjustment_time: int = 5
The time in seconds used to calibrate the silence threshold.
- field blocksize: int = 6000
The size of each individual audio chunk.
- field continuous: bool = True
Whether to generate audio data continuously or not.
- field from_file: str = ''
The path to the audio file to be used for inference.
- field max_length_s: int = 25
The maximum length of the audio data in seconds.
- field min_chunks: int = 3
The minimum number of chunks to be generated before feeding them into the ASR model.
- field phrase_delta: float = 1.0
The expected pause between two phrases in seconds.
- field samplerate: int = 16000
The specified samplerate of the audio data.
- class whisper_web.inputstream_generator.InputStreamGenerator(generator_config: GeneratorConfig, event_bus: EventBus)[source]
Bases: object
Handles real-time or file-based audio input for speech processing and transcription.
This class manages the lifecycle of audio input—from capturing or loading audio data to detecting speech segments and dispatching them for transcription. It supports both live microphone streams and pre-recorded audio files, and includes configurable voice activity detection (VAD) heuristics and silence detection.
Core Features:
Real-Time Audio Input: Captures audio using a microphone input stream.
File-Based Input: Reads and processes audio from a file if specified.
Silence Threshold Calibration: Dynamically computes the silence threshold based on environmental noise.
Voice Activity Detection (VAD): Supports heuristic-based VAD.
Phrase Segmentation: Aggregates audio buffers into speech phrases based on silence duration and loudness.
Asynchronous Processing: Fully asynchronous design suitable for non-blocking audio pipelines.
- Parameters:
generator_config (GeneratorConfig) – Configuration object with audio processing settings
event_bus (EventBus) – Instance of the EventBus to handle events
- Variables:
samplerate – Sample rate for audio processing
blocksize – Size of each audio block
adjustment_time – Time in seconds for adjusting silence threshold
min_chunks – Minimum number of chunks to process
continuous – Flag for continuous processing
event_bus – Event bus for handling events
global_ndarray – Global buffer for audio data
phrase_delta_blocks – Maximum number of blocks allowed between phrases
silence_threshold – Threshold for silence detection
max_blocksize – Maximum size of audio block in samples
max_chunks – Maximum number of chunks
from_file – Path to the audio file if specified
Note
Instantiate this class with a GeneratorConfig and EventBus, then call process_audio() to start listening or processing input.
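Following that note, a minimal sketch of microphone-driven generation; a working microphone and the sounddevice backend are assumed::

    import asyncio

    from whisper_web.events import EventBus
    from whisper_web.inputstream_generator import GeneratorConfig, InputStreamGenerator

    async def main():
        bus = EventBus()
        generator = InputStreamGenerator(GeneratorConfig(samplerate=16000, blocksize=6000), bus)
        # Calibrates the silence threshold, then publishes AudioChunkGenerated events on the bus.
        await generator.process_audio()

    asyncio.run(main())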
- async generate() AsyncGenerator[source]
Asynchronously generates audio chunks for processing from a live input stream.
This method acts as a unified audio generator, yielding blocks of audio data for downstream processing.
Behavior:
Opens an audio input stream using sounddevice.InputStream.
Captures audio in blocks of self.blocksize, configured for mono 16-bit input.
Uses a thread-safe callback to push incoming audio data into an asyncio.Queue.
Yields (in_data, status) tuples from the queue as they become available.
- Returns:
A tuple containing the raw audio block and its status.
- Return type:
Iterator[Tuple[np.ndarray, CallbackFlags]]
- async generate_from_file(file_path: str) None[source]
Processes audio data from a file and simulates streaming for transcription.
This method reads audio from the given file path, optionally resamples and converts it to mono, and then splits the audio into chunks that simulate live microphone input. Each chunk is passed to the transcription manager after waiting for the current transcription to complete.
Behavior:
Reads audio from the specified file using soundfile.
Supports multi-channel audio, which is converted to mono by selecting the first channel.
If the audio file's sample rate differs from the expected rate (self.samplerate), the data is resampled to match.
Audio is divided into blocks of self.max_blocksize samples.
The final chunk is zero-padded if it is shorter than the expected size.
Each chunk is set as the current buffer and dispatched for transcription using _send_audio().
Waits for the transcription manager’s signal (transcription_status.wait()) before continuing.
Logs the total time taken to process the file.
- Parameters:
file_path (str) – Path to the audio file to be processed
- async process_audio() None[source]
Entry point for audio processing based on the selected VAD configuration.
Determines if the input is from a file or a live stream, sets up the silence threshold, and processes audio input accordingly.
Note
If from_file is set, it processes the audio from the specified file. If from_file is not set, it sets the silence threshold and processes audio using heuristics.
- async process_with_heuristic() None[source]
Continuously processes audio input, detects significant speech segments, and dispatches them for transcription.
This method operates in an asynchronous loop, consuming real-time audio buffers from generate(), aggregating meaningful speech segments while filtering out silence or noise based on a calculated silence threshold.
Behavior:
Buffers with low average volume (below self.silence_threshold) are considered silent.
Incoming buffers are accumulated in self.global_ndarray.
If the accumulated audio exceeds self.max_chunks, it is dispatched for transcription.
If self.global_ndarray is non-empty and the average volume of the incoming buffer is below the silence threshold, the silent-block counter is incremented; once it exceeds self.phrase_delta_blocks, the buffered audio is dispatched.
If a buffer does not start or end with silence, the accumulated audio is dispatched.
In continuous mode (self.continuous = True), the method loops indefinitely to process ongoing audio.
Otherwise, it exits after the first valid speech phrase is processed.
- async send_audio(is_final: bool = False) None[source]
Dispatches the collected audio buffer for transcription after normalization.
This method converts the internal audio buffer (self.global_ndarray) from 16-bit PCM format to a normalized float32 waveform in the range [-1.0, 1.0]. It then creates an AudioChunk instance with the normalized data and publishes it as an AudioChunkGenerated event to the event bus.
- Parameters:
is_final (bool) – Indicates if the audio chunk is complete and ready for final processing.
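The normalization described here can be summarized with the following sketch; it is an assumed-equivalent transformation for illustration, not the exact implementation::

    import numpy as np
    import torch

    pcm_int16 = np.zeros(16000, dtype=np.int16)  # one second of 16-bit PCM at 16 kHz
    # Scale 16-bit integer samples into a float32 waveform in [-1.0, 1.0].
    waveform = torch.from_numpy(pcm_int16.astype(np.float32) / 32768.0)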
- async set_silence_threshold() None[source]
Dynamically determines and sets the silence threshold based on initial audio input.
This method analyzes the average loudness of incoming audio blocks during a short calibration phase to determine an appropriate silence threshold. The threshold helps distinguish between background noise and meaningful speech during audio processing.
Behavior:
Processes audio blocks for a predefined duration (_adjustment_time in seconds).
For each block, computes the mean absolute loudness and stores it.
After enough blocks are collected, calculates the average loudness across all blocks.
Sets self.silence_threshold to this value, treating it as the baseline for silence.
Note
This method is skipped if audio is being read from a file (self.from_file is set). Intended to run once before audio processing begins, helping tailor silence detection to the environment.
Handles real-time audio capture and processing. Converts audio streams into chunks suitable for transcription, with configurable sample rates, chunk sizes, and audio preprocessing options.
Event System
Events & Event Bus
- class whisper_web.events.AudioChunkGenerated(chunk: whisper_web.types.AudioChunk, is_final: bool)[source]
Bases: Event
- chunk: AudioChunk
- is_final: bool
- class whisper_web.events.AudioChunkReceived(chunk: whisper_web.types.AudioChunk, is_final: bool)[source]
Bases: Event
- chunk: AudioChunk
- is_final: bool
- class whisper_web.events.DownloadModel(model_url: str, is_finished: bool = False)[source]
Bases: Event
- is_finished: bool = False
- model_url: str
- class whisper_web.events.EventBus[source]
Bases: object
Asynchronous event bus implementation for decoupled component communication.
The EventBus provides a publish-subscribe pattern that enables loose coupling between different components of the whisper-web transcription system. Components can subscribe to specific event types and publish events without direct knowledge of other components.
Key Features:
Type-Safe Subscriptions: Events are registered by their concrete type
Async/Sync Handler Support: Automatically detects and handles both coroutine and regular functions
Multiple Subscribers: Multiple handlers can subscribe to the same event type
Decoupled Architecture: Publishers don’t need to know about subscribers
- Variables:
_subscribers – Internal mapping of event types to their handler lists
- async publish(event: Event) None[source]
Publish an event to all registered subscribers of its type.
This method delivers the event to all handlers that have subscribed to the event’s specific type. Both synchronous and asynchronous handlers are supported and will be called appropriately.
- Parameters:
event (Event) – The event instance to publish to subscribers
- subscribe(event_type: type, handler: Callable[[Any], None]) None[source]
Register a handler function to receive events of a specific type.
When an event of the specified type is published, the handler will be called with the event instance as its argument. Handlers can be either synchronous functions or async coroutines.
- Parameters:
event_type (type) – The class of events this handler should receive
handler (Callable[[Any], None]) – Function or coroutine to call when events are published
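A publish-subscribe sketch using the event types defined below; the handler is illustrative::

    import asyncio

    from whisper_web.events import EventBus, TranscriptionUpdated

    async def main():
        bus = EventBus()

        async def on_update(event: TranscriptionUpdated) -> None:
            print("current:", event.current_text, "| full:", event.full_text)

        bus.subscribe(TranscriptionUpdated, on_update)
        await bus.publish(TranscriptionUpdated(current_text="hello", full_text="hello"))

    asyncio.run(main())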
- class whisper_web.events.TranscriptionCompleted(transcription: whisper_web.types.Transcription, is_final: bool)[source]
Bases: Event
- is_final: bool
- transcription: Transcription
- class whisper_web.events.TranscriptionUpdated(current_text: str, full_text: str)[source]
Bases: Event
- current_text: str
- full_text: str
Asynchronous event system that coordinates communication between components. Defines all event types used throughout the transcription pipeline for loose coupling and extensibility.
Data Types & Utilities
Core Data Types
- class whisper_web.types.AudioChunk(data: torch.Tensor, timestamp: datetime.date)[source]
Bases: object
- data: Tensor
- timestamp: date
- class whisper_web.types.Transcription(text: str, timestamp: datetime.date)[source]
Bases: object
- text: str
- timestamp: date
Fundamental data structures used throughout the system, including audio chunks and transcription objects with proper typing and validation.
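Constructing them directly is straightforward, as in this sketch; datetime.now() is used for the timestamps, which satisfies the datetime.date annotation::

    from datetime import datetime

    import torch
    from whisper_web.types import AudioChunk, Transcription

    chunk = AudioChunk(data=torch.zeros(16000), timestamp=datetime.now())
    result = Transcription(text="hello world", timestamp=datetime.now())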
Utility Functions
- whisper_web.utils.get_installed_models()[source]
Scan the .models folder and return formatted model names.
- whisper_web.utils.process_transcription_timestamps(transcriptions: list[str], last_timestamp: float) tuple[list[str], float][source]
Process transcriptions to maintain timestamp continuity across batches.
- Parameters:
transcriptions – List of transcription strings with timestamps
last_timestamp – The last timestamp value from the previous batch
- Returns:
Tuple of the transcriptions with adjusted timestamps and the updated last timestamp
Helper functions for device management, configuration, and other common operations used across the transcription system.