How I built a Go proxy that keeps your LLM conversation alive when cloud quota runs out

Introduction

If you've ever been mid-conversation with Claude or GPT, hit a quota limit, and switched to a local Ollama model, you know the pain. The local model has zero context. It's like walking into a meeting 45 minutes late and nobody catches you up.

I got frustrated enough to build something about it. That something is Trooper.

What is Trooper

Trooper is a lightweight Go proxy (~850 lines, two files) that sits between your application and your LLM providers. When a cloud provider returns a quota error (429, 402, 529), Trooper automatically falls back to a local Ollama instance without dropping the conversation context. (A sketch of the fallback path is at the end of this post.)

Single binary. Zero dependencies. Small enough to audit, which matters for something that sits in front of your API keys.

The real problem: context loss on fallback

Most fallback proxies solve the routing problem but ignore the context problem. They either pass the raw message history as-is (which blows up the local model's context window) or they truncate the oldest turns (which kills continuity). Neither works well in practice.

The solution: three-layer context compaction

Trooper uses a structured compaction strategy before handing off to Ollama (see the compaction sketch at the end of this post):

Anchor: The first two turns of the conversation are always preserved. These establish the original intent and set the tone.

SITREP: The middle turns get compressed into a structured summary called a SITREP. It extracts intent, entities, open loops, recent actions, and resolved items. The local model gets situational awareness, not raw history.

Tail: The most recent turns are preserved within a configurable token budget.

A real SITREP looks like this in the logs:

    📦 Context compaction triggered — 538 tokens exceeds 500 budget
    📦 Context compaction complete
       Total turns  : 7
       Anchor turns : 2 (~43 tokens)
       Middle turns : 2 → SITREP (~71 tokens)
       Recent turns : 3 (~323 tokens)
       Tokens used  : 437 / 500
       SITREP       : intent="trooper" stage=unclear confidence=0.60 open=1 actions=0 resolved=0

The local model knows what you were working on, what's broken, what's been resolved, and what the last few exchanges were. That's enough to keep the conversation coherent.

Why Go

Single-binary distribution was the main reason. No runtime, no dependencies: drop it anywhere and it runs. The codebase being ~850 lines also means anyone can read the whole thing in an afternoon, which matters for something that proxies API keys.

Provider support

Trooper currently supports Claude, Gemini, and OpenAI as cloud providers, with automatic fallback to Ollama. The provider chain is configurable via environment variables (there's a configuration sketch after the code examples below).

What's next

V3.0 is focused on foundation hardening: concurrency fixes and improved error handling. V3.1 will improve SITREP extraction quality on longer conversations, which is where intent detection starts to degrade today.

Try it

github.com/shouvik12/trooper

Would love feedback on the context compaction approach, especially from anyone running larger local models. What's your cold-start latency on fallback?
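Sketch: the fallback path

For readers who want something concrete, here's a minimal sketch of the fallback behavior described above. This is simplified and illustrative rather than lifted from the codebase: `forwardWithFallback`, `cloudURL`, and `ollamaURL` are placeholder names, and real provider APIs differ in request and response format, which this sketch ignores.

```go
package proxy

import (
	"bytes"
	"io"
	"net/http"
)

// isQuotaError reports whether a cloud response signals exhausted
// quota: 429 (rate limited), 402 (payment required), or 529
// (overloaded), the three codes that trigger fallback.
func isQuotaError(status int) bool {
	switch status {
	case http.StatusTooManyRequests, http.StatusPaymentRequired, 529:
		return true
	}
	return false
}

// forwardWithFallback posts the request body to the cloud provider
// first and, on a quota error (or transport failure), replays the
// same body against a local Ollama endpoint.
func forwardWithFallback(cloudURL, ollamaURL string, body []byte) (*http.Response, error) {
	resp, err := http.Post(cloudURL, "application/json", bytes.NewReader(body))
	if err == nil && !isQuotaError(resp.StatusCode) {
		return resp, nil // cloud answered normally
	}
	if resp != nil {
		io.Copy(io.Discard, resp.Body) // drain so the connection is reusable
		resp.Body.Close()
	}
	return http.Post(ollamaURL, "application/json", bytes.NewReader(body))
}
```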
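Sketch: three-layer compaction

And here's a rough sketch of the compaction strategy itself, again illustrative rather than the production code. The `Turn` type, the ~4-characters-per-token estimate, and the `summarize` callback (which stands in for the real SITREP extraction) are all simplifications, and the SITREP turn's own token cost is ignored here for brevity.

```go
package proxy

// Turn is one message in the conversation history.
type Turn struct {
	Role    string // "user", "assistant", or "system"
	Content string
}

// estimateTokens is a crude stand-in for a real tokenizer; roughly
// four characters per token is a common rule of thumb.
func estimateTokens(s string) int { return len(s)/4 + 1 }

// compact applies the three layers: keep the first two turns
// (anchor), keep as many recent turns as the budget allows (tail),
// and collapse everything in between into one SITREP turn. The
// summarize callback stands in for the extraction of intent,
// entities, open loops, recent actions, and resolved items.
func compact(turns []Turn, budget int, summarize func([]Turn) string) []Turn {
	if len(turns) <= 4 {
		return turns // too short to be worth compacting
	}

	anchor := turns[:2]
	used := 0
	for _, t := range anchor {
		used += estimateTokens(t.Content)
	}

	// Walk backwards from the newest turn, admitting turns into the
	// tail while they still fit under the budget.
	tailStart := len(turns)
	for i := len(turns) - 1; i >= 2; i-- {
		cost := estimateTokens(turns[i].Content)
		if used+cost > budget {
			break
		}
		used += cost
		tailStart = i
	}

	out := append([]Turn{}, anchor...)
	if middle := turns[2:tailStart]; len(middle) > 0 {
		out = append(out, Turn{Role: "system", Content: "SITREP: " + summarize(middle)})
	}
	return append(out, turns[tailStart:]...)
}
```

Run against the seven-turn conversation from the log above with a 500-token budget, this shape produces the same split: 2 anchor turns, 2 middle turns collapsed into one SITREP, and 3 recent turns kept intact.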
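Sketch: configuring the provider chain

Finally, a sketch of what env-var-driven chain configuration can look like. The variable name `TROOPER_PROVIDERS` and the default order below are made up for illustration; check the repo's README for the real configuration.

```go
package proxy

import (
	"os"
	"strings"
)

// providerChain reads an ordered, comma-separated provider list from
// the environment. TROOPER_PROVIDERS is a hypothetical variable name.
func providerChain() []string {
	raw := os.Getenv("TROOPER_PROVIDERS")
	if raw == "" {
		// Illustrative default: cloud providers first, Ollama as the local fallback.
		return []string{"claude", "gemini", "openai", "ollama"}
	}
	var chain []string
	for _, p := range strings.Split(raw, ",") {
		if p = strings.TrimSpace(p); p != "" {
			chain = append(chain, p)
		}
	}
	return chain
}
```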

