whisper_web package

Submodules

whisper_web.events module

class whisper_web.events.AudioChunkGenerated(chunk: whisper_web.types.AudioChunk, is_final: bool)[source]

Bases: Event

chunk: AudioChunk
is_final: bool
class whisper_web.events.AudioChunkNum(num_chunks: int)[source]

Bases: Event

num_chunks: int
class whisper_web.events.AudioChunkReceived(chunk: whisper_web.types.AudioChunk, is_final: bool)[source]

Bases: Event

chunk: AudioChunk
is_final: bool
class whisper_web.events.DownloadModel(model_url: str, is_finished: bool = False)[source]

Bases: Event

is_finished: bool = False
model_url: str
class whisper_web.events.Event[source]

Bases: ABC

Base class for all events.

class whisper_web.events.EventBus[source]

Bases: object

Asynchronous event bus implementation for decoupled component communication.

The EventBus provides a publish-subscribe pattern that enables loose coupling between different components of the whisper-web transcription system. Components can subscribe to specific event types and publish events without direct knowledge of other components.

Key Features:

  • Type-Safe Subscriptions: Events are registered by their concrete type

  • Async/Sync Handler Support: Automatically detects and handles both coroutine and regular functions

  • Multiple Subscribers: Multiple handlers can subscribe to the same event type

  • Decoupled Architecture: Publishers don’t need to know about subscribers

Variables:

_subscribers – Internal mapping of event types to their handler lists

async publish(event: Event) None[source]

Publish an event to all registered subscribers of its type.

This method delivers the event to all handlers that have subscribed to the event’s specific type. Both synchronous and asynchronous handlers are supported and will be called appropriately.

Parameters:

event (Event) – The event instance to publish to subscribers

subscribe(event_type: type, handler: Callable[[Any], None]) None[source]

Register a handler function to receive events of a specific type.

When an event of the specified type is published, the handler will be called with the event instance as its argument. Handlers can be either synchronous functions or async coroutines.

Parameters:
  • event_type (type) – The class type of events this handler should receive

  • handler (Callable[[Any], None]) – Function or coroutine to call when events are published
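
Example (a minimal usage sketch based on the signatures above; the handler logic is illustrative):

    import asyncio

    from whisper_web.events import EventBus, TranscriptionUpdated

    async def main():
        bus = EventBus()

        # Synchronous handler: invoked directly when a matching event is published.
        def on_update(event: TranscriptionUpdated) -> None:
            print(f"current: {event.current_text!r}, full: {event.full_text!r}")

        # Async handler: awaited by the bus when a matching event is published.
        async def log_update(event: TranscriptionUpdated) -> None:
            print("logged update")

        bus.subscribe(TranscriptionUpdated, on_update)
        bus.subscribe(TranscriptionUpdated, log_update)

        await bus.publish(TranscriptionUpdated(current_text="hello", full_text="hello"))

    asyncio.run(main())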

class whisper_web.events.TranscriptionCompleted(transcription: whisper_web.types.Transcription, is_final: bool)[source]

Bases: Event

is_final: bool
transcription: Transcription
class whisper_web.events.TranscriptionUpdated(current_text: str, full_text: str)[source]

Bases: Event

current_text: str
full_text: str

whisper_web.inputstream_generator module

pydantic model whisper_web.inputstream_generator.GeneratorConfig[source]

Bases: BaseModel

Configuration model for controlling audio input generation behavior.

This configuration class is used to define how audio should be captured, processed, and segmented before being sent to a speech recognition system.

field adjustment_time: int = 5

The time in seconds used to calibrate the silence threshold.

field blocksize: int = 6000

The size of each individual audio chunk.

field continuous: bool = True

Whether to generate audio data continuously or not.

field from_file: str = ''

The path to the audio file to be used for inference.

field max_length_s: int = 25

The maximum length of the audio data in seconds.

field min_chunks: int = 3

The minimum number of chunks to generate before feeding them into the ASR model.

field phrase_delta: float = 1.0

The expected pause between two phrases in seconds.

field samplerate: int = 16000

The specified samplerate of the audio data.
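
Example (an illustrative configuration; the field values shown are arbitrary, and unspecified fields keep the defaults above):

    from whisper_web.inputstream_generator import GeneratorConfig

    config = GeneratorConfig(
        samplerate=16000,   # expected sample rate of the captured audio
        blocksize=4000,     # smaller blocks mean lower latency but more callbacks
        min_chunks=2,       # dispatch to the ASR model after fewer chunks
        phrase_delta=0.8,   # pauses longer than 0.8 s close the current phrase
        continuous=True,    # keep generating audio until explicitly stopped
    )
    print(config)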

class whisper_web.inputstream_generator.InputStreamGenerator(generator_config: GeneratorConfig, event_bus: EventBus)[source]

Bases: object

Handles real-time or file-based audio input for speech processing and transcription.

This class manages the lifecycle of audio input—from capturing or loading audio data to detecting speech segments and dispatching them for transcription. It supports both live microphone streams and pre-recorded audio files, and includes configurable voice activity detection (VAD) heuristics and silence detection.

Core Features:

  • Real-Time Audio Input: Captures audio using a microphone input stream.

  • File-Based Input: Reads and processes audio from a file if specified.

  • Silence Threshold Calibration: Dynamically computes the silence threshold based on environmental noise.

  • Voice Activity Detection (VAD): Supports heuristic-based VAD.

  • Phrase Segmentation: Aggregates audio buffers into speech phrases based on silence duration and loudness.

  • Asynchronous Processing: Fully asynchronous design suitable for non-blocking audio pipelines.

Parameters:
  • generator_config (GeneratorConfig) – Configuration object with audio processing settings

  • event_bus (EventBus) – Instance of the EventBus to handle events

Variables:
  • samplerate – Sample rate for audio processing

  • blocksize – Size of each audio block

  • adjustment_time – Time in seconds for adjusting silence threshold

  • min_chunks – Minimum number of chunks to process

  • continuous – Flag for continuous processing

  • event_bus – Event bus for handling events

  • global_ndarray – Global buffer for audio data

  • phrase_delta_blocks – Maximum number of silent blocks allowed between phrases

  • silence_threshold – Threshold for silence detection

  • max_blocksize – Maximum size of an audio block in samples

  • max_chunks – Maximum number of chunks

  • from_file – Path to the audio file if specified

Note

Instantiate this class with a GeneratorConfig and EventBus, then call process_audio() to start listening or processing input.
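
For example, a minimal wiring sketch (live-microphone case, default configuration):

    import asyncio

    from whisper_web.events import EventBus
    from whisper_web.inputstream_generator import GeneratorConfig, InputStreamGenerator

    async def main():
        event_bus = EventBus()
        generator = InputStreamGenerator(GeneratorConfig(), event_bus)

        # Calibrates the silence threshold for live input, then captures and
        # segments audio; AudioChunkGenerated events are published on the bus.
        await generator.process_audio()

    asyncio.run(main())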

async generate() AsyncGenerator[source]

Asynchronously generates audio chunks for processing from a live input stream.

This method acts as a unified audio generator, yielding blocks of audio data for downstream processing.

Behavior:

  • Opens an audio input stream using sounddevice.InputStream.

  • Captures audio in blocks of self.blocksize, configured for mono 16-bit input.

  • Uses a thread-safe callback to push incoming audio data into an asyncio.Queue.

  • Yields (in_data, status) tuples from the queue as they become available.

Returns:

A tuple containing the raw audio block and its status.

Return type:

AsyncGenerator[Tuple[np.ndarray, CallbackFlags], None]

async generate_from_file(file_path: str) None[source]

Processes audio data from a file and simulates streaming for transcription.

This method reads audio from the given file path, optionally resamples and converts it to mono, and then splits the audio into chunks that simulate live microphone input. Each chunk is passed to the transcription manager after waiting for the current transcription to complete.

Behavior:

  • Reads audio from the specified file using soundfile.

  • Supports multi-channel audio, which is converted to mono by selecting the first channel.

  • If the audio file’s sample rate differs from the expected rate (self.samplerate), the data is resampled to match.

  • Audio is divided into blocks of self.max_blocksize samples.

  • The final chunk is zero-padded if it is shorter than the expected size.

  • Each chunk is set as the current buffer and dispatched for transcription using _send_audio().

  • Waits for the transcription manager’s signal (transcription_status.wait()) before continuing.

  • Logs the total time taken to process the file.

Parameters:

file_path (str) – Path to the audio file to be processed

async process_audio() None[source]

Entry point for audio processing based on the selected VAD configuration.

Determines if the input is from a file or a live stream, sets up the silence threshold, and processes audio input accordingly.

Note

If from_file is set, it processes the audio from the specified file. If from_file is not set, it sets the silence threshold and processes audio using heuristics.

async process_with_heuristic() None[source]

Continuously processes audio input, detects significant speech segments, and dispatches them for transcription.

This method operates in an asynchronous loop, consuming real-time audio buffers from generate(), aggregating meaningful speech segments while filtering out silence or noise based on a calculated silence threshold.

Behavior:

  • Buffers with low average volume (below self.silence_threshold) are considered silent.

  • Incoming buffers are accumulated in self.global_ndarray.

  • If the accumulated audio exceeds self.max_chunks, it is dispatched for transcription.

  • If self.global_ndarray already holds audio and the incoming buffer’s average volume is below the silence threshold, the silent-block counter is incremented; once it exceeds self.phrase_delta_blocks, the accumulated buffer is dispatched.

  • If a buffer does not start or end with speech, or is entirely silent, the accumulated audio is dispatched.

  • In continuous mode (self.continuous = True), the method loops indefinitely to process ongoing audio.

  • Otherwise, it exits after the first valid speech phrase is processed.

async send_audio(is_final: bool = False) None[source]

Dispatches the collected audio buffer for transcription after normalization.

This method converts the internal audio buffer (self.global_ndarray) from 16-bit PCM format to a normalized float32 waveform in the range [-1.0, 1.0]. It then creates an AudioChunk instance with the normalized data and publishes it as an AudioChunkGenerated event to the event bus.

Parameters:

is_final (bool) – Indicates if the audio chunk is complete and ready for final processing.
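
The PCM normalization described above amounts to the following (an illustrative numpy snippet, not the method’s actual code):

    import numpy as np

    pcm16 = np.array([0, 16384, -32768, 32767], dtype=np.int16)
    waveform = pcm16.astype(np.float32) / 32768.0  # scale int16 range to [-1.0, 1.0]
    print(waveform)  # [ 0.    0.5  -1.    0.99996948]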

async set_silence_threshold() None[source]

Dynamically determines and sets the silence threshold based on initial audio input.

This method analyzes the average loudness of incoming audio blocks during a short calibration phase to determine an appropriate silence threshold. The threshold helps distinguish between background noise and meaningful speech during audio processing.

Behavior:

  • Processes audio blocks for a predefined duration (_adjustment_time in seconds).

  • For each block, computes the mean absolute loudness and stores it.

  • After enough blocks are collected, calculates the average loudness across all blocks.

  • Sets self.silence_threshold to this value, treating it as the baseline for silence.

Note

This method is skipped if audio is being read from a file (self.from_file is set). Intended to run once before audio processing begins, helping tailor silence detection to the environment.
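
The calibration logic can be sketched as follows (illustrative only; the real method consumes live blocks from generate()):

    import numpy as np

    # Pretend these are the 16-bit blocks captured during the adjustment window.
    calibration_blocks = [
        np.random.randint(-500, 500, size=6000, dtype=np.int16) for _ in range(10)
    ]

    # Mean absolute loudness per block, then averaged across the window.
    loudness = [np.abs(block).mean() for block in calibration_blocks]
    silence_threshold = float(np.mean(loudness))
    print(f"silence threshold ~ {silence_threshold:.1f}")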

whisper_web.management module

class whisper_web.management.AudioManager(event_bus: EventBus)[source]

Bases: object

Manages audio chunk lifecycle and distribution in the transcription pipeline.

The AudioManager handles the flow of raw audio chunks from generation to consumption, providing a buffered interface between audio input sources and transcription models. It operates through event-driven architecture to ensure loose coupling and scalability.

Core Responsibilities:

  • Audio Chunk Buffering: Maintains a thread-safe queue of processed audio chunks

  • Event-Driven Processing: Subscribes to audio generation events and manages chunk flow

  • Chunk Metadata Tracking: Monitors chunk counts and processing statistics

  • Async Chunk Distribution: Provides non-blocking access to queued audio data

  • Error Handling: Gracefully handles audio processing errors and timeouts

Event Subscriptions:

  • AudioChunkGenerated: Receives and queues newly generated audio chunks

  • AudioChunkNum: Updates total expected chunk count for progress tracking

Key Features:

  • Thread-Safe Operations: All queue operations are async-safe for concurrent access

  • Timeout Handling: Non-blocking audio retrieval with configurable timeouts

  • Progress Tracking: Monitors processing progress against expected chunk counts

  • Error Resilience: Continues operation despite individual chunk processing errors

Parameters:

event_bus (EventBus) – Event bus instance for inter-component communication

Variables:
  • audio_chunk_queue – Thread-safe queue for audio chunk storage

  • processed_chunks – Counter for successfully processed audio chunks

  • num_chunks – Total expected number of chunks for current session

clear_audio_queue() None[source]

Remove all pending audio chunks from the processing queue.

Empties the audio chunk queue by discarding all queued audio data. Useful for session cleanup or resetting audio processing state.

Warning

This operation discards all pending audio chunks and cannot be undone. Use with caution during active audio processing sessions.

async get_next_audio_chunk() Tuple[AudioChunk, bool] | None[source]

Retrieve the next available audio chunk from the processing queue.

Provides non-blocking access to queued audio chunks with timeout handling. Returns None when no audio is available within the timeout period, allowing callers to handle empty queue conditions gracefully.

Returns:

Tuple of (AudioChunk, is_final) if available, None if timeout/error

Return type:

Tuple[AudioChunk, bool] | None

Behavior:

  • Waits up to 1 second for audio chunk availability

  • Returns tuple containing audio chunk and finality flag

  • Returns None on timeout or processing errors

  • Logs errors without raising exceptions
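
A typical consumer loop might look like this (a sketch; the surrounding setup that fills the queue is omitted):

    import asyncio

    from whisper_web.events import EventBus
    from whisper_web.management import AudioManager

    async def consume(manager: AudioManager) -> None:
        # Poll until a final chunk arrives; None means the queue stayed empty
        # for the one-second timeout window.
        while True:
            item = await manager.get_next_audio_chunk()
            if item is None:
                continue
            chunk, is_final = item
            print(f"chunk from {chunk.timestamp}, final={is_final}")
            if is_final:
                break

    # manager = AudioManager(EventBus())
    # asyncio.run(consume(manager))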

property queue_size: int

Get the current number of audio chunks waiting in the processing queue.

Returns:

Number of queued audio chunks

Return type:

int

property stats: dict

Get comprehensive statistics about audio chunk processing.

Provides key metrics for monitoring audio processing performance, including queue utilization and chunk processing progress.

Returns:

Dictionary containing audio processing statistics

Return type:

dict

Dictionary keys:

  • queue_size: Number of audio chunks currently in processing queue

  • processed_chunks: Total number of audio chunks processed

class whisper_web.management.TranscriptionManager(event_bus: EventBus)[source]

Bases: object

Manages the complete transcription pipeline from audio input to text output.

The TranscriptionManager serves as the central coordinator for the real-time transcription process, handling audio chunk queuing, transcription state management, and result aggregation. It operates through an event-driven architecture, responding to audio events and publishing transcription updates.

Core Responsibilities:

  • Audio Queue Management: Maintains a thread-safe queue of incoming audio chunks

  • Transcription State Tracking: Manages both current and historical transcription results

  • Event Coordination: Subscribes to audio events and publishes transcription updates

  • Inference Loop: Provides async interface for model inference execution

  • Result Aggregation: Combines partial and final transcriptions into complete text

Event Subscriptions:

  • AudioChunkReceived: Queues audio data for processing

  • TranscriptionCompleted: Updates transcription state and publishes results

Event Publications:

  • TranscriptionUpdated: Notifies subscribers of new transcription results

Parameters:

event_bus (EventBus) – Event bus instance for inter-component communication

Variables:
  • transcriptions – List of completed transcription segments

  • current_transcription – Current active transcription text

  • audio_queue – Thread-safe queue for audio processing

  • processed_chunks – Counter for processed audio chunks

  • num_chunks – Total expected number of chunks

clear_audio_queue() None[source]

Remove all pending audio chunks from the processing queue.

Empties the audio queue by discarding all queued audio data. Useful for resetting the transcription state or handling session cleanup.

Warning

This operation discards all pending audio data and cannot be undone. Use with caution during active transcription sessions.

property full_transcription: str

Get the complete transcription text including all segments.

Combines all completed transcription segments with the current active transcription to provide the full transcribed text.

Returns:

Complete transcription text with all segments joined

Return type:

str

property queue_size: int

Get the current number of audio chunks waiting in the processing queue.

Returns:

Number of queued audio chunks

Return type:

int

async run_batched_inference(model, batch_size: int = 1, batch_timeout_s: float = 0.1) None[source]

Execute the main batched inference loop for continuous audio processing.

Continuously retrieves audio data from the queue and processes it in batches for improved efficiency. Collects multiple audio samples up to the specified batch size before passing them to the model for transcription. Handles timeouts gracefully to prevent blocking and ensures timely processing even with partial batches.

Parameters:
  • model (Callable[[Tuple[list[torch.Tensor], list[bool]]], Awaitable[None]]) – Async callable model that processes batched audio tensors

  • batch_size (int) – Maximum number of audio samples to collect per batch

  • batch_timeout_s (float) – Timeout in seconds for collecting additional batch items

Batching Strategy:

  • Waits up to 1 second for the first audio sample to avoid busy waiting

  • Collects additional samples with shorter timeout (batch_timeout_s) for responsiveness

  • Processes partial batches when timeout is reached or batch_size is filled

  • Maintains separate lists for audio tensors and finality flags

Processing Flow:

  1. Initial Wait: Blocks up to 1s for first audio sample

  2. Batch Collection: Gathers additional samples with short timeout

  3. Batch Processing: Sends complete batch to model as tuple of lists

  4. Continuous Loop: Repeats indefinitely for real-time processing

Timeout Behavior:

  • Long timeout (1s) for first item prevents CPU spinning when queue is empty

  • Short timeout (batch_timeout_s) for additional items ensures low latency

  • Graceful handling of timeouts without error propagation

  • Processes partial batches immediately when collection timeout occurs

Batch Format:

  • Model receives: (list[torch.Tensor], list[bool])

  • First list contains audio tensors for batch processing

  • Second list contains corresponding finality flags

  • Maintains order correspondence between tensors and flags

Performance Benefits:

  • Reduced model invocation overhead through batching

  • Improved GPU utilization with parallel processing

  • Lower per-sample processing latency at scale

  • Configurable batch size for memory/latency trade-offs

Note

This method runs indefinitely and should be executed in a separate task or thread. The batch_size parameter affects both memory usage and processing efficiency - larger batches improve throughput but increase latency.

Warning

Large batch sizes may cause GPU memory issues. Monitor memory consumption and adjust batch_size accordingly. The method will block if the audio queue is consistently empty.
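
A stub that satisfies the model contract described above (illustrative; a real model would run Whisper inference on the batch):

    from typing import List, Tuple

    import torch

    async def dummy_model(batch: Tuple[List[torch.Tensor], List[bool]]) -> None:
        tensors, finals = batch
        # A real implementation would transcribe the batch and publish
        # TranscriptionCompleted events; here we only report its shape.
        print(f"received batch of {len(tensors)} tensors, finality flags: {finals}")

    # Sketch: await manager.run_batched_inference(dummy_model, batch_size=4, batch_timeout_s=0.1)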

property stats: dict

Get comprehensive statistics about the transcription process.

Provides key metrics for monitoring transcription performance and state, including queue utilization and processing progress.

Returns:

Dictionary containing transcription statistics

Return type:

dict

Dictionary keys:

  • queue_size: Number of audio chunks in processing queue

  • processed_chunks: Total number of processed audio chunks

  • num_transcriptions: Number of completed transcription segments

whisper_web.server module

class whisper_web.server.ClientSession(session_id: str, model_config: ModelConfig)[source]

Bases: object

Represents an isolated client transcription session with dedicated resources.

Each ClientSession encapsulates a complete transcription pipeline for a single client, including its own event bus, transcription manager, and Whisper model instance. This design enables concurrent multi-client support with isolated state and configurations.

Session Components:

  • Event Bus: Dedicated event system for inter-component communication

  • Transcription Manager: Handles audio queuing and transcription state

  • Whisper Model: Configured ASR model instance for speech recognition

  • Inference Task: Async task managing the transcription processing loop

  • Download Tracking: Monitors model download progress

Key Features:

  • Isolation: Each session operates independently with separate state

  • Lifecycle Management: Handles session startup, operation, and cleanup

  • Event-Driven Architecture: Uses publish-subscribe pattern for loose coupling

  • Async Processing: Non-blocking inference execution with proper task management

  • Model Download Handling: Tracks and responds to model download events

Parameters:
  • session_id (str) – Unique identifier for this client session

  • model_config (ModelConfig) – Configuration for the Whisper model used in this session

Variables:
  • session_id – Unique session identifier

  • model_config – Whisper model configuration

  • event_bus – Session-specific event bus for component communication

  • manager – Transcription manager handling audio processing

  • model – Whisper model instance for speech recognition

  • inference_task – Async task running the inference loop

  • is_downloading – Flag indicating if model is currently downloading

async handle_model_download(event: DownloadModel)[source]

Handle model download progress events for this session.

Updates the session’s download status based on model download events, allowing the session to track when models are being loaded and when they become available for inference.

Parameters:

event (DownloadModel) – Model download event containing URL and completion status

Behavior:

  • Sets is_downloading to True when download starts

  • Sets is_downloading to False when download completes

  • Logs download status changes for monitoring

async start_inference()[source]

Start the transcription inference task for this session.

Creates and starts an async task that runs the main inference loop, processing audio chunks through the Whisper model. If an inference task is already running, this method has no effect.

Behavior:

  • Creates new inference task if none exists or previous task completed

  • Task runs indefinitely until manually stopped or session cleanup

  • Uses the session’s transcription manager and model for processing

  • Handles audio chunks from the session’s event bus

Note

The inference task runs in the background and must be explicitly stopped using stop_inference() for proper cleanup.

async stop_inference()[source]

Stop the transcription inference task for this session.

Gracefully cancels the running inference task and waits for proper cleanup. Handles cancellation exceptions to ensure clean shutdown without propagating cancellation errors.

Behavior:

  • Cancels the inference task if currently running

  • Waits for task cancellation to complete

  • Suppresses CancelledError exceptions from task cleanup

  • Safe to call multiple times or when no task is running

Note

This method should be called during session cleanup to prevent resource leaks and ensure proper task termination.
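
A session lifecycle sketch (constructing the session initializes the configured Whisper model, which may trigger a download on first use):

    import asyncio

    from whisper_web.server import ClientSession
    from whisper_web.whisper_model import ModelConfig

    async def main():
        session = ClientSession("client-1", ModelConfig(model_size="small", device="cpu"))
        await session.start_inference()   # spawn the background inference task
        # ... feed audio through the session's event bus ...
        await session.stop_inference()    # cancel the task and clean up

    asyncio.run(main())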

pydantic model whisper_web.server.CreateSessionRequest[source]

Bases: BaseModel

Request schema for creating a new session.

field model_configuration: ModelConfig | None = None

Model configuration for the session

field session_id: str | None = None

Optional custom session ID

pydantic model whisper_web.server.CurrentTranscriptionResponse[source]

Bases: BaseModel

Response schema for current transcription.

field current_transcription: str [Required]

Current transcription text

field session_id: str [Required]

Session identifier

pydantic model whisper_web.server.FinalTranscriptionResponse[source]

Bases: BaseModel

Response schema for final transcription.

field final_transcription: str [Required]

Final transcription text

field session_id: str [Required]

Session identifier

pydantic model whisper_web.server.InstalledModelsResponse[source]

Bases: BaseModel

Response schema for getting all installed models.

field installed_models: List[str] [Required]

List of all installed models

pydantic model whisper_web.server.MessageResponse[source]

Bases: BaseModel

Generic response schema for operations with messages.

field message: str [Required]

Operation result message

field session_id: str | None = None

Session identifier

pydantic model whisper_web.server.QueueProcessedResponse[source]

Bases: BaseModel

Response schema for processed queue items.

field audio_queue_processed: int [Required]

Number of audio chunks processed

field session_id: str [Required]

Session identifier

pydantic model whisper_web.server.QueueSizeResponse[source]

Bases: BaseModel

Response schema for audio queue size.

field audio_queue_size: int [Required]

Number of audio chunks in queue

field session_id: str [Required]

Session identifier

pydantic model whisper_web.server.SessionInfo[source]

Bases: BaseModel

Schema for session information in list responses.

field audio_queue_size: int [Required]

Number of audio chunks in queue

field current_transcription: str [Required]

Current transcription text

field inference_running: bool [Required]

Whether inference is currently running

field model_configuration: Dict [Required]

Model configuration

field session_id: str [Required]

Unique session identifier

field transcription_count: int [Required]

Number of completed transcriptions

pydantic model whisper_web.server.SessionListResponse[source]

Bases: BaseModel

Response schema for listing sessions.

field sessions: List[SessionInfo] [Required]

List of active sessions

field total_sessions: int [Required]

Total number of active sessions

pydantic model whisper_web.server.SessionOperationResponse[source]

Bases: BaseModel

Response schema for session operations like restart.

field inference_running: bool | None = None

Whether inference is running after operation

field message: str [Required]

Operation result message

field session_id: str [Required]

Session identifier

pydantic model whisper_web.server.SessionResponse[source]

Bases: BaseModel

Response schema for session creation.

field message: str | None = None

Optional message (e.g., for existing sessions)

field model_configuration: Dict [Required]

Model configuration used for the session

field session_id: str [Required]

Unique session identifier

pydantic model whisper_web.server.SessionStatusResponse[source]

Bases: BaseModel

Response schema for session status.

field audio_queue_processed: int [Required]

Number of audio chunks processed

field audio_queue_size: int [Required]

Number of audio chunks in queue

field inference_running: bool [Required]

Whether inference is currently running

field is_downloading: bool [Required]

Whether model is currently downloading

field model_configuration: Dict [Required]

Model configuration

field session_id: str [Required]

Session identifier

field transcription_count: int [Required]

Number of completed transcriptions

class whisper_web.server.TranscriptionServer(default_model_config: ModelConfig | None = None, host: str = '0.0.0.0', port: int = 8000)[source]

Bases: object

A real-time speech transcription API server built with FastAPI.

This service provides a full audio transcription pipeline via HTTP and WebSocket endpoints, including:

  • Audio stream ingestion per client.

  • Real-time automatic speech recognition (ASR) using configurable Whisper models.

  • Voice Activity Detection (VAD) to detect speech regions.

  • Status monitoring of transcription, generation, and voice activity per client.

  • Retrieval and posting of transcription data per client.

  • Multi-client support with isolated sessions.

Parameters:
  • default_model_config (ModelConfig, optional) – Default configuration for the ASR model (e.g., Whisper), defaults to ModelConfig()

  • host (str, optional) – Hostname for the FastAPI server, defaults to “0.0.0.0”

  • port (int, optional) – Port for the FastAPI server, defaults to 8000

Note

Each client connection creates its own isolated session with a dedicated TranscriptionManager and WhisperModel. This allows multiple clients to use different model configurations and have separate transcription states.

The server exposes endpoints for creating sessions, managing transcriptions per session, and retrieving processing status per client.
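
Starting the server is a short script (the configuration values shown are illustrative):

    from whisper_web.server import TranscriptionServer
    from whisper_web.whisper_model import ModelConfig

    server = TranscriptionServer(
        default_model_config=ModelConfig(model_size="large-v3", device="cuda"),
        host="0.0.0.0",
        port=8000,
    )
    server.run()  # launches Uvicorn in a daemon thread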

async cleanup_inactive_sessions()[source]

Remove sessions with failed or completed inference tasks.

Performs maintenance by identifying and removing sessions whose inference tasks have finished or failed. This prevents accumulation of dead sessions and ensures system resources are properly reclaimed.

Behavior:

  • Scans all active sessions for completed inference tasks

  • Identifies sessions with failed tasks (exceptions)

  • Marks failed/completed sessions for removal

  • Removes inactive sessions and performs cleanup

  • Logs cleanup activities for monitoring

Use Cases:

  • Periodic maintenance to clean up dead sessions

  • Error recovery after inference failures

  • Resource management in long-running deployments

Note

This method should be called periodically or after detecting inference failures to maintain system health.

get_or_create_session(session_id: str | None = None, model_config: ModelConfig | None = None) ClientSession[source]

Retrieve an existing session or create a new one with specified configuration.

This method implements the session management logic, ensuring each client gets a dedicated session with isolated resources. If a session doesn’t exist, it creates a new one with the provided or default model configuration.

Parameters:
  • session_id (Optional[str]) – Unique identifier for the session. If None, generates UUID

  • model_config (Optional[ModelConfig]) – Model configuration for new sessions. Uses default if None

Returns:

The existing or newly created client session

Return type:

ClientSession

Behavior:

  • Generates UUID if no session_id provided

  • Returns existing session if session_id already exists

  • Creates new session with provided or default model configuration

  • New sessions are immediately ready for inference

async remove_session(session_id: str)[source]

Remove a client session and perform complete cleanup.

Safely removes a session by stopping its inference task, cleaning up resources, and removing it from the active sessions dictionary. This method ensures proper resource cleanup to prevent memory leaks.

Parameters:

session_id (str) – Unique identifier of the session to remove

Behavior:

  • Stops the session’s inference task gracefully

  • Removes session from active sessions dictionary

  • Performs cleanup of session resources

  • Logs removal for monitoring purposes

  • Safe to call for non-existent sessions (no-op)

run()[source]

Starts the FastAPI application in a separate thread.

This method runs the FastAPI server using Uvicorn, which handles the HTTP requests for the transcription service. The server is launched in a separate thread, allowing the application to run concurrently with other tasks. It uses the host and port parameters defined in the class to bind the server.

The server operates as a daemon thread, meaning it will not block the main program from exiting.

Note

  • The application listens for HTTP requests at the specified host and port.

  • Ensure that the necessary configurations for the host and port are provided when calling this method.

pydantic model whisper_web.server.TranscriptionsResponse[source]

Bases: BaseModel

Response schema for getting all transcriptions.

field session_id: str [Required]

Session identifier

field transcriptions: List[str] [Required]

List of all transcriptions

whisper_web.types module

class whisper_web.types.AudioChunk(data: torch.Tensor, timestamp: datetime.date)[source]

Bases: object

data: Tensor
timestamp: date
class whisper_web.types.Transcription(text: str, timestamp: datetime.date)[source]

Bases: object

text: str
timestamp: date
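
Both types are plain data containers; a construction sketch:

    from datetime import date

    import torch

    from whisper_web.types import AudioChunk, Transcription

    chunk = AudioChunk(data=torch.zeros(16000), timestamp=date.today())
    result = Transcription(text="hello world", timestamp=date.today())
    print(chunk.data.shape, result.text)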

whisper_web.utils module

whisper_web.utils.get_installed_models()[source]

Scan the .models folder and return formatted model names.

whisper_web.utils.process_transcription_timestamps(transcriptions: list[str], last_timestamp: float) tuple[list[str], float][source]

Process transcriptions to maintain timestamp continuity across batches.

Parameters:

  • transcriptions (list[str]) – List of transcription strings with timestamps

  • last_timestamp (float) – Final timestamp from the previous batch, used to keep timestamps continuous across batches

Returns:

Tuple of the transcriptions with adjusted timestamps and the updated last timestamp

whisper_web.utils.set_device(device) device[source]

whisper_web.whisper_model module

pydantic model whisper_web.whisper_model.ModelConfig[source]

Bases: BaseModel

Configuration for creating and loading the Whisper ASR model.

This class contains the configuration options required for initializing the Whisper model, including the model size, device type, and other parameters related to the inference process.

field batch_size: int = 1

The batch size to be used for inference. This is the number of audio chunks processed in parallel.

field batch_timeout_s: float = 0.1

The timeout in seconds for batch processing. If the batch is not filled within this time, it will be processed anyway.

field continuous: bool = True

Whether to generate audio data continuously or not.

field device: str = 'cuda'

The device to be used for inference. Choices: ‘cpu’, ‘cuda’, ‘mps’.

field model_id: str | None = None

The model id to be used for loading the model.

field model_size: str = 'large-v3'

The size of the model to be used for inference. Choices: ‘small’, ‘medium’, ‘large-v3’.

field samplerate: int = 16000

The sample rate of the generated audio.

field use_vad: bool = False

Whether to use VAD (Voice Activity Detection) or not.
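
Example (an illustrative CPU configuration; unspecified fields keep the defaults above):

    from whisper_web.whisper_model import ModelConfig

    config = ModelConfig(
        model_size="small",   # maps to distil-whisper/distil-small.en
        device="cpu",         # float32 inference on CPU
        batch_size=2,
        use_vad=False,
    )
    print(config)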

class whisper_web.whisper_model.WhisperModel(model_args: ModelConfig, event_bus: EventBus)[source]

Bases: object

Event-driven Whisper ASR model wrapper with optimized inference capabilities.

This class provides a high-level interface to OpenAI’s Whisper models (via Transformers) with event-driven architecture, device optimization, and asynchronous processing. It handles model loading, configuration, and transcription with automatic device selection and performance optimizations.

Core Features:

  • Event-Driven Architecture: Publishes transcription results via event bus

  • Device Optimization: Automatic device selection with CUDA, MPS, and CPU support

  • Async Processing: Non-blocking transcription using thread pools

  • Model Flexibility: Supports various Whisper model sizes and custom model IDs

  • Performance Optimizations: Includes dtype optimization and CUDA acceleration

  • Distil-Whisper Integration: Optimized distilled models for faster inference

Supported Models:

  • small: distil-whisper/distil-small.en (English only, fastest)

  • medium: distil-whisper/distil-medium.en (English only, balanced)

  • large-v3: distil-whisper/distil-large-v3 (Multilingual, most accurate)

  • Custom: Any HuggingFace Whisper-compatible model ID

Device Support:

  • CUDA: GPU acceleration with float16 precision

  • MPS: Apple Silicon GPU acceleration

  • CPU: Fallback with float32 precision

Parameters:
  • model_args (ModelConfig) – Configuration object specifying model and device settings

  • event_bus (EventBus) – Event bus for publishing transcription results

Variables:
  • device – Torch device used for model inference

  • samplerate – Audio sample rate for processing

  • torch_dtype – Data type used for model computations

  • speech_model – Loaded Whisper model for conditional generation

  • processor – Whisper processor for audio preprocessing

Note

The model automatically handles device placement, dtype conversion, and publishes results through the event system for loose coupling.
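
A construction sketch (model loading follows the load_model() description below; the first use may download weights into the cache directory):

    from whisper_web.events import EventBus
    from whisper_web.whisper_model import ModelConfig, WhisperModel

    event_bus = EventBus()
    model = WhisperModel(ModelConfig(model_size="small", device="cpu"), event_bus)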

load_model(model_size: str, model_id: str | None) None[source]

Load and initialize the Whisper ASR model with optimized configuration.

This method handles the complete model loading process including model ID resolution, cache management, device placement, and performance optimizations. It supports both predefined model sizes and custom model IDs from HuggingFace.

Parameters:
  • model_size (str) – Predefined model size (‘small’, ‘medium’, ‘large-v3’) or custom size

  • model_id (Optional[str]) – Optional custom HuggingFace model ID. Overrides model_size if provided

Model ID Resolution:

  • If model_id is provided, uses it directly

  • Otherwise, maps model_size to appropriate distil-whisper model:

    • ‘small’ → ‘distil-whisper/distil-small.en’

    • ‘medium’ → ‘distil-whisper/distil-medium.en’

    • ‘large-v3’ → ‘distil-whisper/distil-large-v3’

    • ‘large’ → ‘distil-whisper/distil-large-v3’ (legacy mapping)

Optimizations Applied:

  • Automatic dtype selection (float16 for CUDA, float32 for CPU/MPS)

  • Low CPU memory usage during loading

  • SafeTensors format for improved security and performance

  • Configurable cache directory via HF_HOME environment variable

Cache Management:

  • Uses HF_HOME environment variable if set

  • Falls back to ‘./.models’ directory for local caching

  • Enables offline usage after initial download

Note

Model loading may take time on first use due to download requirements. Subsequent loads use cached models for faster initialization.

Warning

Ensure sufficient disk space for model caching. Large models can require several GB of storage space.

Module contents