Built an open-source memory layer for local LLMs — single-shot calls, auto-extracted constraints, no context degradation

Been running Llama 3.3 70B via Groq for coding tasks and kept losing architectural decisions across sessions. "We use PostgreSQL"? Forgotten. "Auth is JWT"? Re-debated. Every new chat starts from zero.

So I built steerhead. It sits between you and any OpenAI-compatible API and manages context via SQLite instead of chat history. The trick: every message is a single-shot API call. Steerhead assembles the system prompt from stored constraints plus file history, fires one clean call, then auto-extracts any decisions the model made (via a second LLM pass) and stores them for next time. Result: 146 tokens of surgical context instead of 80K tokens of degrading conversation history. New session? The model still knows your entire project's decisions.

Works with:
- Groq (free tier, tested with Llama 3.3 70B)
- Ollama (local)
- OpenRouter (free models)
- Any OpenAI-compatible endpoint

What's there: project-scoped DBs, session persistence, auto constraint extraction, React UI

What's next: git diff capture, drift detection, memory classification (inspired by Cloudflare's Agent Memory announcement)

Stack: FastAPI + SQLite + React. Fully local, MIT licensed.

Looking for contributors, especially around constraint extraction accuracy and drift detection.

GitHub: https://github.com/josephmjustin/steerhead
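To make the single-shot flow concrete, here is a minimal sketch of the loop described above: constraints live in SQLite, the system prompt is assembled fresh each call, and a second pass extracts new decisions. The function names, one-table schema, and the `llm`/`extractor` callables are my assumptions for illustration, not steerhead's actual API.

```python
import sqlite3


def init_db(path=":memory:"):
    # Project-scoped constraint store; the real project may use a richer schema.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS constraints (id INTEGER PRIMARY KEY, text TEXT)"
    )
    return db


def assemble_system_prompt(db):
    # Build a compact system prompt from stored constraints instead of
    # replaying tens of thousands of tokens of chat history.
    rows = db.execute("SELECT text FROM constraints ORDER BY id").fetchall()
    if not rows:
        return "No constraints yet."
    return "Project constraints:\n" + "\n".join(f"- {r[0]}" for r in rows)


def single_shot(db, user_msg, llm, extractor):
    # One clean API call: assembled system prompt + the current message only.
    # `llm` and `extractor` stand in for calls to an OpenAI-compatible endpoint.
    reply = llm(assemble_system_prompt(db), user_msg)
    # Second LLM pass: pull out any decisions the model just made and
    # persist them so the next session (or next message) starts informed.
    for decision in extractor(reply):
        db.execute("INSERT INTO constraints (text) VALUES (?)", (decision,))
    db.commit()
    return reply
```

Each turn is stateless from the API's point of view; continuity comes entirely from what the extractor writes back into SQLite, which is why a fresh session can still see every prior decision.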
