Floor for local meeting summarization on a 6GB GPU: qwen3.5:0.8b works at 57s, Granite 4 350M hallucinates

Disclosure: I made this. Open-source, MIT, Windows + Linux. Not affiliated with voiceflow.com.

Why This Exists

I wanted local-only dictation and meeting transcription because audio shouldn’t have to leave the machine just to become text.
I had a 6GB GPU sitting there doing nothing most of the day. So I built it:

A hotkey transcribes speech locally and pastes the text at the cursor.
New in v1.6.0: a meetings recorder that captures mic and system audio into one stereo file, transcribes it locally, and then sends the transcript to whichever endpoint you choose (e.g., Ollama, llama.cpp) for an optional summary.
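To make the "one stereo file" part concrete, here is a minimal sketch of interleaving two mono tracks into a stereo WAV using only the Python standard library. It is not the app's actual recorder code: the function names, the mic-left/system-right channel layout, the 16-bit sample format, and the 16 kHz sample rate are all assumptions chosen for illustration.

```python
import array
import sys
import wave


def interleave_stereo(mic: array.array, system: array.array) -> array.array:
    """Interleave two mono 16-bit tracks as L=mic, R=system (assumed layout).

    The shorter track is padded with silence so both channels stay aligned.
    """
    n = max(len(mic), len(system))
    stereo = array.array("h")
    for i in range(n):
        stereo.append(mic[i] if i < len(mic) else 0)
        stereo.append(system[i] if i < len(system) else 0)
    return stereo


def write_stereo_wav(path: str, mic: array.array, system: array.array,
                     sample_rate: int = 16000) -> None:
    """Write the two mono tracks to a single 2-channel, 16-bit PCM WAV file."""
    frames = interleave_stereo(mic, system)
    if sys.byteorder == "big":
        frames.byteswap()  # WAV stores samples little-endian
    with wave.open(path, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(frames.tobytes())
```

Keeping the two sources on separate channels, rather than mixing them down to mono, means the local speaker and the remote audio can still be told apart after recording.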

The only network call in the whole product is for the optional summary. You pick where it goes.

Mini Models on Real Workloads

I tried the latest small Qwen first: qwen3.5:0.8b (873M, Q8_0). Test rig: RTX 3060 Laptop with ~4.3GB free after Whisper loads, Ollama 0.23, Arch.
Input: a real 4-minute meeting (~2900 chars).
The model works, with one caveat: Ollama’s VRAM-aware default num_ctx is 4096 tokens, and that budget gets eaten before any user-visible tokens land. The fix is a two-line Modelfile:

FROM qwen3.5:0.8b
PARAMETER num_ctx 16384

With that in place, it streamed a 1562-char structured summary in 57 seconds at 2.2GB of VRAM: TL;DR, decisions, action items, open questions, all there.
Better than I’d expect from sub-1B, honestly. For the “but you didn’t go small enough” counter: I sanity-checked Granite 4.0 350M on the same workload. Speed-wise it crushed (0.6 to 2.8 seconds per summary vs 57s for the Qwen model) and structure came back clean, sections all in the right places.
The content was another story. Granite returned “Anthropic’s acquisition by Anthropic” as a discussion topic and invented Binance as another one. A different 4-minute meeting came back as a Star Trek bridge log (“Starship Cassiopeia”, “Tao City F”, colony vessel Andromeda).
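On the Qwen context fix: besides baking num_ctx into a Modelfile, Ollama's /api/generate endpoint also accepts it per request through the options field. The sketch below shows that route with the standard library only. The prompt wording and the function names are hypothetical, not what the app sends, and the URL assumes Ollama's default local port.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_summary_request(transcript: str, model: str = "qwen3.5:0.8b",
                          num_ctx: int = 16384) -> dict:
    """Build an /api/generate payload that overrides num_ctx for this request."""
    # Hypothetical prompt; the section names mirror the summary described above.
    prompt = (
        "Summarize this meeting transcript with a TL;DR, decisions, "
        "action items, and open questions.\n\n" + transcript
    )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # one JSON object instead of a stream
        "options": {"num_ctx": num_ctx},  # same knob as PARAMETER num_ctx
    }


def summarize(transcript: str, **kwargs) -> str:
    """POST the payload to a running Ollama server and return the summary text."""
    body = json.dumps(build_summary_request(transcript, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

The Modelfile route keeps the setting attached to the model name, while the per-request route makes it easy to try different context sizes without creating a new model each time.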

For People Who Don’t Want to Run Local

Groq’s free tier on llama-3.3-70b has been solid. ~2 seconds per summary, output is tighter than the local 0.8B, and the only thing that broke it for me was a 4-hour meeting transcript that blew past their context window.
For anything under that, it’s a real free option.

The Actual Question I’d Like Answers On

Long-context structured summarization on low VRAM. The 0.8B Qwen handles a 4-minute meeting comfortably at 16K context. For 1-2 hour transcripts (~30K-60K tokens) on a 6-8GB GPU, what’s working: pushing context wider and eating the VRAM, chunked map-reduce, or a different small model that doesn’t fall apart on long inputs?
Looking for something that holds structure (TL;DR + sections + bullets) when the input gets long, without needing 24GB of VRAM to do it.
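For reference, "chunked map-reduce" here would mean roughly the following: split the transcript into pieces that fit the context window, summarize each piece, then summarize the summaries. This is a minimal sketch and has not been run on the rig above. The names chunk_transcript and map_reduce_summary, the 8000-character budget, and the summarize callable (any function that sends text to a model and returns a string) are assumptions.

```python
from typing import Callable, List


def chunk_transcript(text: str, max_chars: int = 8000) -> List[str]:
    """Split on line boundaries so chunks stay within max_chars.

    A single line longer than max_chars is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for line in text.splitlines(keepends=True):
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks


def map_reduce_summary(text: str, summarize: Callable[[str], str],
                       max_chars: int = 8000) -> str:
    """Summarize each chunk (map), then summarize the partials (reduce)."""
    chunks = chunk_transcript(text, max_chars)
    if len(chunks) <= 1:
        return summarize(text)  # short input: a single pass is enough
    partials = [summarize(chunk) for chunk in chunks]
    combined = "\n\n".join(
        f"Part {i}:\n{partial}" for i, partial in enumerate(partials, 1)
    )
    # Note: for very long inputs the combined partials may themselves exceed
    # the budget; a fuller version would apply the reduce step recursively.
    return summarize(combined)
```

The trade-off is that each chunk is summarized without seeing the rest of the meeting, so decisions that span chunks can get lost or duplicated in the reduce step.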

App Details

One .exe on Windows, one .AppImage on Linux.
Pyloid + React + faster-whisper + SQLite, CUDA auto-detect with CPU fallback.
Onboarding covers model, mic, and hotkey setup in about a minute. Claude was the pair-programming assistant for a lot of boilerplate and the Qt threading gnarliness; architecture and the hard bugs are mine, git history is honest about it.

Repo + 1.6.0

Key Takeaways

qwen3.5:0.8b works for real meeting transcripts once num_ctx is raised to 16K: a 4-minute meeting summarizes in 57 seconds at 2.2GB of VRAM.
Granite 4.0 350M is far faster (0.6 to 2.8 seconds per summary) and keeps the structure, but hallucinates the content.
Groq’s free tier on llama-3.3-70b is a solid option for people who don’t want to run local models, as long as the transcript fits its context window.
Long-context structured summarization (1-2 hour transcripts) on a 6-8GB GPU is still an open question: wider context, chunked map-reduce, or a different small model.
