Floor for local meeting summarization on a 6GB GPU: qwen3.5:0.8b works at 57s, Granite 4 350M hallucinates


By AI Maestro May 19, 2026 3 min read

Disclosure: I made this. Open-source, MIT, Windows + Linux. Not affiliated with voiceflow.com.

Why This Exists

  • I wanted local-only dictation and meeting transcription because audio shouldn’t have to leave the machine just to become text.
  • I had a 6GB GPU sitting there doing nothing most of the day. So I built it:
    • A hotkey triggers local transcription and pastes the text at the cursor.
    • New in v1.6.0: a ‘meetings recorder’ that combines mic and system audio into one stereo file and transcribes it locally. The transcript can then be sent to an endpoint of your choice (e.g., Ollama, llama.cpp) for an optional summary.
  • The only network call in the whole product is for the optional summary. You pick where it goes.
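To make the "mic and system audio into one stereo file" idea concrete, here is a minimal sketch using only the Python standard library. It is not the app's actual recorder code: the function name `write_stereo_wav`, the mic-left/system-right channel layout, the 16 kHz sample rate, and the 16-bit PCM format are all assumptions made for illustration.

```python
import array
import sys
import wave


def write_stereo_wav(path, mic, system, sample_rate=16000):
    """Interleave two mono 16-bit PCM tracks into one stereo WAV file.

    Assumed layout (hypothetical): mic on the left channel, system audio
    on the right. Returns the number of stereo frames written.
    """
    n = max(len(mic), len(system))
    # Pad the shorter track with silence so both channels have equal length.
    left = list(mic) + [0] * (n - len(mic))
    right = list(system) + [0] * (n - len(system))

    # Stereo PCM stores samples as L, R, L, R, ...
    frames = array.array("h")
    for l_sample, r_sample in zip(left, right):
        frames.append(l_sample)
        frames.append(r_sample)

    # WAV data is little-endian; swap bytes on big-endian machines.
    if sys.byteorder == "big":
        frames.byteswap()

    with wave.open(path, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(frames.tobytes())
    return n
```

Keeping the two sources on separate channels, rather than mixing them down to mono, leaves the option of telling "me" from "everyone else" in the transcript later.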

Mini Models on Real Workloads

  • I tried the latest small Qwen first: qwen3.5:0.8b (873M, Q8_0). Test rig: RTX 3060 Laptop with ~4.3GB free after Whisper loads, Ollama 0.23, Arch.
  • Input: a real 4-minute meeting (~2900 chars).
  • The model works, with one caveat: Ollama’s VRAM-aware default num_ctx came out at 4096 tokens, and that budget was used up before any user-visible tokens landed. The fix was a two-line Modelfile that raises the context window:
  • FROM qwen3.5:0.8b
    PARAMETER num_ctx 16384
  • This fixed the issue and allowed it to stream a 1562-char structured summary in 57 seconds at 2.2GB of VRAM. TL;DR, decisions, action items, open questions, all there.
  • Better than I’d expect from sub-1B honestly. For the “but you didn’t go small enough” counter: I sanity-checked Granite 4.0 350M on the same workload. Speed-wise it crushed (0.6 to 2.8 seconds per summary vs 57s for the Qwen model) and structure came back clean, sections all in the right places.
  • The content was another story. Granite returned “Anthropic’s acquisition by Anthropic” as a discussion topic and invented Binance as another one. A different 4-minute meeting came back as a Star Trek bridge log (“Starship Cassiopeia”, “Tao City F”, colony vessel Andromeda).
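The same context-window setting can also be passed per request instead of baking it into a Modelfile. Below is a minimal sketch against Ollama's `/api/generate` endpoint with `options.num_ctx`. The prompt wording and the helper names `build_summary_request` and `summarize` are assumptions for illustration, not what the app actually sends.

```python
import json
import urllib.request

# Ollama's default local address; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_summary_request(transcript, model="qwen3.5:0.8b", num_ctx=16384):
    """Build the JSON body for a non-streaming summary request."""
    # Hypothetical prompt: the section names mirror the summary described above.
    prompt = (
        "Summarize this meeting transcript with these sections: "
        "TL;DR, Decisions, Action items, Open questions.\n\n" + transcript
    )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Per-request override of the context window, same effect as
        # PARAMETER num_ctx in a Modelfile.
        "options": {"num_ctx": num_ctx},
    }


def summarize(transcript):
    """Send the request to a running Ollama server and return the summary text."""
    body = json.dumps(build_summary_request(transcript)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

`summarize()` needs a local Ollama server with the model already pulled. `build_summary_request()` has no side effects, so the payload can be inspected or logged without touching the network.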

For People Who Don’t Want to Run Local

  • Groq’s free tier on llama-3.3-70b has been solid. ~2 seconds per summary, output is tighter than the local 0.8B, and the only thing that broke it for me was a 4-hour meeting transcript that blew past their context window.
  • For anything under that, it’s a real free option.

The Actual Question I’d Like Answers On

  • Long-context structured summarization on low VRAM. The 0.8B Qwen handles a 4-minute meeting comfortably at 16K context. For 1-2 hour transcripts (~30K-60K tokens) on a 6-8GB GPU, what’s working: pushing context wider and eating the VRAM, chunked map-reduce, or a different small model that doesn’t fall apart on long inputs?
  • Looking for something that holds structure (TL;DR + sections + bullets) when the input gets long, without needing 24GB of VRAM to do it.
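For reference, this is roughly what the chunked map-reduce option looks like. It is a sketch under stated assumptions, not a tested answer to the question: the 12,000-character chunk size, the 500-character overlap, both prompt strings, and the function names are hypothetical, and `summarize` is any callable that takes a prompt string and returns text (for example, a wrapper around a local Ollama call).

```python
MAP_PROMPT = "Summarize this part of a meeting transcript as short bullets:\n\n"
REDUCE_PROMPT = (
    "Merge these partial summaries into one summary with sections "
    "TL;DR, Decisions, Action items, Open questions:\n\n"
)


def split_into_chunks(text, max_chars=12000, overlap=500):
    """Split text into overlapping chunks, preferring to cut at a line break."""
    if not 0 <= overlap <= max_chars // 2:
        raise ValueError("overlap must be between 0 and max_chars // 2")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Look for a newline in the back half of the window so a
            # speaker turn is less likely to be split mid-sentence.
            cut = text.rfind("\n", start + max_chars // 2, end)
            if cut != -1:
                end = cut + 1
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back by `overlap` so context carries across the boundary.
        start = end - overlap
    return chunks


def map_reduce_summary(transcript, summarize, max_chars=12000, overlap=500):
    """Map: summarize each chunk. Reduce: merge the partial summaries."""
    chunks = split_into_chunks(transcript, max_chars, overlap)
    partials = [summarize(MAP_PROMPT + chunk) for chunk in chunks]
    return summarize(REDUCE_PROMPT + "\n\n".join(partials))
```

The trade-off is that each map call fits in a small context window, so VRAM stays flat regardless of meeting length, but the reduce step only sees the partial summaries. Anything a weak model drops or invents in the map step carries through to the final output.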

App Details

  • One .exe on Windows, one .AppImage on Linux.
  • Pyloid + React + faster-whisper + SQLite, CUDA auto-detect with CPU fallback.
  • Onboarding covers model, mic, and hotkey setup in about a minute. Claude was the pair-programming assistant for a lot of boilerplate and the Qt threading gnarliness; architecture and the hard bugs are mine, git history is honest about it.

Repo + 1.6.0

Key Takeaways

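  • qwen3.5:0.8b works for real meeting transcripts: with num_ctx raised to 16384, it summarized a 4-minute meeting in 57 seconds at 2.2GB of VRAM.
  • Granite 4.0 350M was far faster (0.6 to 2.8 seconds per summary) and kept the structure, but it hallucinated the content.
  • Groq’s free tier on llama-3.3-70b is a fast alternative (~2 seconds per summary) if you don’t want to run a model locally; it only failed on a 4-hour transcript that exceeded its context window.
  • Long-context structured summarization on a 6-8GB GPU is still an open question: the candidates are a wider context window, chunked map-reduce, or a different small model that holds up on long inputs.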
