Architecture advice: Real-time pipeline for YouTube Audio -> Whisper -> LLM -> SSE (Sub-10s latency) [D]

A British AI enthusiast is building a backend that analyzes long YouTube videos. The current pipeline downloads the full audio track, transcribes it with Whisper, and then passes the transcript to an LLM for analysis. Because each stage waits for the previous one to finish, the user sees no output until the entire video has been processed.

They want to replace this with a streaming pipeline in which audio chunks are transcribed and analyzed as they become available and results are pushed to the client over SSE, with sub-10-second latency as the target.

They are asking for advice on two points: how to chunk the YouTube audio stream without cutting sentences in half, and what queueing strategy to use for the overlapping tasks in the pipeline. They would also welcome recommendations for libraries or architectural patterns suited to this kind of real-time processing.
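One common way to approach the "don't cut sentences" question is to transcribe overlapping windows and then drop the words that were transcribed twice. The sketch below is only an illustration of that idea, not the poster's implementation: `plan_chunks` and `merge_overlap` are hypothetical helpers, the 8-second window and 2-second overlap are arbitrary values, and the exact word matching assumes the transcriber produces identical text for the overlapping audio, which Whisper does not guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    index: int
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def plan_chunks(duration: float, chunk_len: float = 8.0,
                overlap: float = 2.0) -> list[Chunk]:
    """Split [0, duration) into windows that overlap their predecessor."""
    if chunk_len <= 0 or not 0 <= overlap < chunk_len:
        raise ValueError("need chunk_len > 0 and 0 <= overlap < chunk_len")
    chunks: list[Chunk] = []
    step = chunk_len - overlap  # how far each new window advances
    start, index = 0.0, 0
    while start < duration:
        end = min(start + chunk_len, duration)
        chunks.append(Chunk(index, start, end))
        if end >= duration:  # last window reached the end of the audio
            break
        start += step
        index += 1
    return chunks


def merge_overlap(previous: list[str], current: list[str],
                  max_words: int = 20) -> list[str]:
    """Return the words of `current` that are not a repeat of the tail of `previous`.

    Assumption: the overlap was transcribed identically in both chunks.
    """
    limit = min(max_words, len(previous), len(current))
    for n in range(limit, 0, -1):  # try the longest candidate overlap first
        if previous[-n:] == current[:n]:
            return current[n:]
    return current  # no repeated words found, keep everything


if __name__ == "__main__":
    for chunk in plan_chunks(20.0):
        print(chunk)
    print(merge_overlap("the cat sat on the mat".split(),
                        "on the mat and slept".split()))
```

In practice the overlap text often differs slightly between chunks, so a more robust variant would align on word timestamps or cut at detected silence instead of comparing strings.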

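For the queueing question, one pattern that may fit is a chain of stages connected by bounded `asyncio.Queue` objects, so a slow stage applies backpressure to the one before it instead of letting work pile up. The following is a minimal sketch under stated assumptions: `transcribe` and `analyze` are placeholders standing in for the Whisper and LLM calls, the payload field names are made up, and only the `data: ...` line followed by a blank line is taken from the SSE wire format.

```python
import asyncio
import json

SENTINEL = None  # placed on a queue to signal "no more items"


async def transcribe(chunk: bytes) -> str:
    # Placeholder for a Whisper call (assumption, not a real API).
    await asyncio.sleep(0)
    return f"transcript of {len(chunk)} bytes"


async def analyze(text: str) -> str:
    # Placeholder for an LLM call (assumption, not a real API).
    await asyncio.sleep(0)
    return text.upper()


def sse_event(payload: dict) -> str:
    # An SSE message is one or more "data:" lines ended by a blank line.
    return f"data: {json.dumps(payload)}\n\n"


async def producer(chunks, audio_q: asyncio.Queue) -> None:
    for chunk in chunks:
        await audio_q.put(chunk)  # waits while the queue is full
    await audio_q.put(SENTINEL)


async def stage(in_q: asyncio.Queue, out_q: asyncio.Queue, work) -> None:
    # Generic stage: take an item, process it, pass the result downstream.
    while True:
        item = await in_q.get()
        if item is SENTINEL:
            await out_q.put(SENTINEL)  # propagate shutdown downstream
            return
        await out_q.put(await work(item))


async def stream_events(chunks, maxsize: int = 4):
    """Async generator yielding one SSE-formatted string per processed chunk."""
    audio_q = asyncio.Queue(maxsize=maxsize)
    text_q = asyncio.Queue(maxsize=maxsize)
    result_q = asyncio.Queue(maxsize=maxsize)
    tasks = [
        asyncio.create_task(producer(chunks, audio_q)),
        asyncio.create_task(stage(audio_q, text_q, transcribe)),
        asyncio.create_task(stage(text_q, result_q, analyze)),
    ]
    seq = 0
    try:
        while True:
            result = await result_q.get()
            if result is SENTINEL:
                break
            yield sse_event({"seq": seq, "analysis": result})
            seq += 1
    finally:
        for task in tasks:  # stop the stages if the client disconnects early
            task.cancel()


async def main() -> None:
    async for event in stream_events([b"aa", b"bbbb"]):
        print(event, end="")


if __name__ == "__main__":
    asyncio.run(main())
```

Because every queue is bounded, the downloader cannot run far ahead of transcription, which keeps memory use and end-to-end latency predictable. A web framework's streaming response could consume `stream_events` directly with the `text/event-stream` content type.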