I’ve done it!!! FINALLY I have become a (quasi-local) summoner!!! AMA [imtiredboss.jpg]


By AI Maestro | May 22, 2026 | 3 min read

Hi friends! After 2.5 years of a LOT of hard work, starting from the GPT-3.5 bottom, I’ve finally got my personal 1.0 local-ish AI playground whipped into shape. This is for everyone out there with mid-tier equipment who relies on Big Tech/Big AI for their AI needs, knows they have something useful, and isn’t sure how to piece it together. Hopefully this gives some inspiration!!

**DISCLAIMER:** I say local-ish because while I do have nine local endpoints, only a handful (if that) are useful to me, since I do not have the compute to support long-context, extended/semi-agentic inference. I am of the firm belief that as of May 2026 and beyond, the “free ride” for AI is over, and unless you have equipment worth thousands and thousands of dollars, you WILL be paying some piper somewhere if you want to be remotely competitive.

Granted, I realize that’s an area for healthy debate… but that’s just me and it’s what drove the philosophy behind my stack. I do feature local endpoints in my screenshots and will say more about them below.

To be clear: I’m not claiming my local box beats frontier models on raw intelligence, because it doesn’t at ALL (seriously, for the HuggingFace people out there… I’m at 25.3 TFLOPS soooo there’s that). What I mean is that this workflow is better for *me* than any single hosted SOTA chat product because I control the routing, context, tooling, model mix, observability, and failure handling.

What I’ve got stitched together:

  • Msty Studio as the front-end cockpit
  • Hybrid local + cloud inference
  • LiteLLM proxy layer
  • Dockerized observability stack
  • Actual operational guardrails
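For readers trying to picture the "hybrid local + cloud" part, here is a minimal sketch of one possible routing rule: prefer a local endpoint when the prompt fits, and fall back to a hosted model for long context. The endpoint names, context limits, and the `pick_endpoint` helper are all hypothetical, made up for illustration; this is not the actual routing config from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str         # hypothetical alias, not a real config entry
    is_local: bool
    max_context: int  # assumed usable context window, in tokens


def pick_endpoint(prompt_tokens: int, endpoints: list[Endpoint]) -> Endpoint:
    """Prefer a local endpoint that fits the prompt; otherwise fall back to cloud."""
    fitting = [e for e in endpoints if e.max_context >= prompt_tokens]
    if not fitting:
        raise ValueError(f"no endpoint can hold {prompt_tokens} tokens")
    # Sort key: local endpoints first, then the smallest window that still fits.
    return min(fitting, key=lambda e: (not e.is_local, e.max_context))


# Illustrative values only.
ENDPOINTS = [
    Endpoint("local-small", True, 8_000),
    Endpoint("local-titan", True, 16_000),
    Endpoint("cloud-frontier", False, 200_000),
]

print(pick_endpoint(2_000, ENDPOINTS).name)
print(pick_endpoint(50_000, ENDPOINTS).name)
```

In a real setup this decision would more likely live in the proxy layer than in client code; the sketch only shows the shape of the rule.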

The screenshots are not meant to be polished SaaS screenshots. They are more like proof that I finally have the bones of a real personal inference platform running: model control, budget visibility, telemetry, local models, remote models, tool workflows, and enough dashboards to tell when something is lying, slow, down, expensive, or looping.
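As a rough idea of what "expensive" and "looping" checks could look like in code, here is a small sketch of a guardrail that tracks spend against a budget and flags identical responses repeated back to back. The `Guardrail` class, its thresholds, and the alert names are assumptions for illustration only; the post does not describe how its guardrails are actually implemented.

```python
from collections import deque


class Guardrail:
    """Hypothetical guardrail: budget tracking plus naive loop detection."""

    def __init__(self, budget_usd: float, repeat_limit: int = 3):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0
        self.repeat_limit = repeat_limit
        # Keep only the last `repeat_limit` responses.
        self._recent = deque(maxlen=repeat_limit)

    def record(self, response_text: str, cost_usd: float) -> list[str]:
        """Record one completed call and return any alerts it triggers."""
        alerts = []
        self.spent_usd += cost_usd
        if self.spent_usd > self.budget_usd:
            alerts.append("over_budget")
        self._recent.append(response_text.strip())
        # A full window of identical responses suggests the model is stuck.
        if len(self._recent) == self.repeat_limit and len(set(self._recent)) == 1:
            alerts.append("possible_loop")
        return alerts


guard = Guardrail(budget_usd=1.0)
for _ in range(3):
    alerts = guard.record("I will now call the tool again.", cost_usd=0.4)
print(alerts)
```

Exact-match comparison is the crudest possible loop check; a dashboard-driven setup would probably look at tool-call patterns or token counts instead.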

Some underrated Msty pieces that clicked for me:

  • Model Hub — makes a messy provider/model universe manageable
  • OpenAI-compatible providers — plug in my own LiteLLM gateway instead of being locked into one vendor
  • Workspaces/projects — keeping contexts from turning into a junk drawer
  • Toolsets/MCP-style workflows — an actual workbench, not just a textbox
  • Turnstiles — reusable workflow pipelines for repeated tasks
  • Personas — lets me keep specialized operating modes without rewriting giant prompts every time

The best part is that it feels like the system is now compounding. Every new model, provider, tool, prompt, workflow, and dashboard slot can plug into the same cockpit instead of becoming another disconnected toy. I know a lot of people here already run much more serious local stacks because holy GOD it’s impressive what this community puts out… so I’m not pretending this is some final boss. But as a solo-builder “quasi-local summoner” setup, this is the first time my local AI environment feels like an actual platform instead of a pile of experiments.

AMA. Happy to explain the architecture, Msty setup, LiteLLM routing, Docker stack, local model choices, what failed, what I’d rebuild, and what’s still duct-taped together.

ALL LOCAL MODELS EMPLOYED:

  • Unsloth’s Gemma3-1B-IT (Q4_K_M, GGUF, llama.cpp-provided)
  • Google’s Gemma4-E2B-IT (4-bit, MLX)
  • IBM’s Granite3.3-2B-IT (4-bit, MLX)
  • NVIDIA’s Nemotron 3 Nano 4B (Q8_0, GGUF, LM Studio/LM Link)
  • Mistral’s Ministral-8B-IT-2512 (Q4_K_M, GGUF, llama.cpp-provided)
  • mlx-community’s Jan-v2-VL-High 8B (4-bit, MLX, LM Studio/LM Link)
  • HauHauCS’s Qwen3.5-9B-Uncensored-Aggressive (Q4_K_M, GGUF, LM Studio/LM Link)
  • OpenAI’s gpt-oss-20B (MXFP4, GGUF, LM Studio/LM Link)
  • HauHauCS’s Qwen3.6-35B-A3B-Uncensored-Aggressive (Q5_K_P, GGUF, LM Studio/LM Link)

For those curious about my beefiest model (that I call “titan”), it’s… let’s say not fast lmao. I’m probably rocking anywhere from 5-9 tokens per sec; it can get up to 15 sometimes but never really faster. Otherwise, I’m not really a tps demon per se… so long as it’s usable for what I’m using the model for, it works just fine for me (5-9 is my slowest, 150+ is my fastest as far as local endpoints).
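To put those speeds in perspective, here is a quick back-of-the-envelope calculation of how long a response takes at each rate. The 500-token response length is an assumed example, and the estimate covers generation only, ignoring prompt processing time.

```python
def seconds_to_generate(tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock seconds to generate `tokens` at a steady decode rate."""
    return tokens / tokens_per_sec


# Rates taken from the figures quoted above; 500 tokens is an assumed response size.
for tps in (5, 9, 15, 150):
    print(f"{tps:>3} tok/s -> {seconds_to_generate(500, tps):.1f} s for 500 tokens")
```

At 5 tok/s that works out to 100 seconds, versus a little over 3 seconds at 150 tok/s, which is why the slow model is usable for batch-style tasks but painful for interactive chat.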

Key Takeaways

  • The author has built a personal AI playground with nine local endpoints.
  • The author finds this setup better than any single hosted SOTA chat product because it gives control over routing, context, tooling, and model mix.
  • The system is designed so that each new model, tool, and dashboard plugs into one integrated platform with observability built in.
