Floor for Local Meeting Summarization on a 6GB GPU
Disclosure: I made this. Open-source, MIT, Windows + Linux. Not affiliated with voiceflow.com.
Why This Exists
- I wanted local-only dictation and meeting transcription because audio shouldn’t have to leave the machine just to become text.
- I had a 6GB GPU sitting there doing nothing most of the day. So I built it:
- A hotkey allows you to transcribe locally, with the text pasting at the cursor.
- The v1.6.0 release now includes a new feature: a ‘meetings recorder’ that combines mic and system audio into one stereo file, which is then transcribed locally before being sent to any endpoint (e.g., Ollama, llama.cpp).
- The only network call in the whole product is for the optional summary. You pick where it goes.
Mini Models on Real Workloads
- I tried the latest small Qwen first: qwen3.5:0.8b (873M, Q8_0). Test rig: RTX 3060 Laptop with ~4.3GB free after Whisper loads, Ollama 0.23, Arch.
- Input: a real 4-minute meeting (~2900 chars).
- The model works, but there’s one caveat: Ollama’s VRAM-aware default num_ctx is set to 4096 tokens, which gets eaten before the user-visible tokens land. A simple fix was made:
- This fixed the issue and allowed it to stream a 1562-char structured summary in 57 seconds at 2.2GB of VRAM. TL;DR, decisions, action items, open questions, all there.
- Better than I’d expect from sub-1B honestly. For the “but you didn’t go small enough” counter: I sanity-checked Granite 4.0 350M on the same workload. Speed-wise it crushed (0.6 to 2.8 seconds per summary vs 57s for the Qwen model) and structure came back clean, sections all in the right places.
- Granite returned “Anthropic‘s acquisition by Anthropic” as a discussion topic and invented Binance as another one. A different 4-minute meeting came back as a Star Trek bridge log (“Starship Cassiopeia”, “Tao City F”, colony vessel Andromeda).
FROM qwen3.5:0.8b PARAMETER num_ctx 16384
For People Who Don’t Want to Run Local
- Groq’s free tier on llama-3.3-70b has been solid. ~2 seconds per summary, output is tighter than the local 0.8B, and the only thing that broke it for me was a 4-hour meeting transcript that blew past their context window.
- For anything under that, it’s a real free option.
The Actual Question I’d Like Answers On
- Long-context structured summarization on low VRAM. The 0.8B Qwen handles a 4-minute meeting comfortably at 16K context. For 1-2 hour transcripts (~30K-60K tokens) on a 6-8GB GPU, what’s working? Pushing context wider and eating the VRAM, chunked map-reduce, or a different small model that doesn’t fall apart on long inputs.
- Looking for something that holds structure (TL;DR + sections + bullets) when the input gets long, without needing 24GB of VRAM to do it.
App Details
- One .exe on Windows, one .AppImage on Linux.
- Pyrold + React + faster-whisper + SQLite, CUDA auto-detect with CPU fallback.
- The model and mic plus hotkey are done in onboarding in about a minute. Claude was the pair-programming assistant for a lot of boilerplate and the Qt threading gnarliness; architecture and the hard bugs are mine, git history is honest about it.
Repo + 1.6.0
Key Takeaways
- The qwen3.5:0.8b model works well for real meeting transcripts, handling 4-minute meetings comfortably at 16K context.
- Groq’s free tier on llama-3.3-70b provides a good alternative for longer transcripts without needing to run local models.
- For long-context structured summarization, different approaches like chunked map-reduce or a smaller model might be more suitable than relying solely on larger models like qwen3.5:0.8b.
Source Read original →
Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.




