Armchair Arena - Test Ollama Models for YOUR use case
A tiny self-hosted LLM arena for picking the Ollama model that best fits my own use case.
Up until recently, I was a HUGE Claude user, bouncing between Max and Max 20 depending on the project and the month. I mostly saw AI as a way to write deterministic code. I watched OpenClaw show up, get hacked, then evolve, and I said, "I don't need that." Between Perplexity, Gemini, and Claude Code, all my non-coding "life assistant" needs were covered.
Then I got curious about Hermes, which seemed like a "smarter" OpenClaw. I set it up against my ChatGPT subscription and it worked fine, but it made me want to try running open models. So I grabbed an Ollama subscription and was immediately spoiled for choice: 20+ models, each with a pile of cryptic numbers after its name. Oh my. I picked them basically at random, wired up cron jobs and webhooks, and had them give me advice and run little errands. Sometimes they were great, sometimes they weren't, and I genuinely couldn't tell which. Classic analysis paralysis: I had no idea which model deserved to be my "go-to," my "set and forget" pick.
So, in classic Steve fashion, I built my own LM arena: Armchair Arena, where I judge a model's performance by completely subjective voting on the answer and on the web sources it chose to cite.
What Armchair Arena actually does
It's a tiny, self-hosted arena for evaluating LLMs on real-world, non-coding tasks (web research, summarization, advice, recipes, general knowledge), and it judges them on the three things that actually matter day to day: was it fast, was it token-efficient, and was the answer any good?
You pick 3 models, ask one question, and see the answers side-by-side, complete with live metrics (tokens, tokens/sec, wall-clock time) and the actual source URLs each model pulled in. Then you crown the best one with a single click. Every model gets the same web-research tools (a self-hosted Firecrawl), so it's a fair fight.
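For the curious, here's roughly how a tokens/sec figure can be derived from an Ollama response. This is a minimal sketch, not the project's actual code: it relies on the `eval_count`, `eval_duration`, and `total_duration` fields that Ollama's API reports (durations are in nanoseconds), the sample numbers are made up, and using the server-reported `total_duration` as "wall-clock" is my assumption; a client-side timer would also include network time.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response's timing fields."""
    eval_count = response["eval_count"]            # tokens generated
    eval_duration_ns = response["eval_duration"]   # generation time, in nanoseconds
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / eval_duration_ns * 1e9


# Made-up numbers, just to show the arithmetic.
sample = {
    "eval_count": 480,
    "eval_duration": 6_000_000_000,
    "total_duration": 7_500_000_000,
}
speed = tokens_per_second(sample)
wall_clock_s = sample["total_duration"] / 1e9  # server-reported total, in seconds
print(f"{sample['eval_count']} tokens, {speed:.1f} tok/s, {wall_clock_s:.1f}s")
```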

One decisive judgment per round beats fuzzy 1-to-5 star ratings, because over many rounds it turns into something you can actually quantify.
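How many rounds is "many"? One way to put a number on it is a confidence interval around the win rate. Here's a minimal sketch using the Wilson score interval. That choice is an assumption for illustration (it's a standard formula, but treat this as a sketch rather than the arena's actual code), and the win counts are made up.

```python
import math


def wilson_interval(wins: int, runs: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a win rate; z=1.96 gives roughly 95%."""
    if runs == 0:
        return (0.0, 1.0)  # no data: the win rate could be anything
    p = wins / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    margin = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return (max(0.0, center - margin), min(1.0, center + margin))


# Hypothetical tallies: a tiny perfect record vs. a larger 50/50 record.
for wins, runs in [(3, 3), (13, 26)]:
    low, high = wilson_interval(wins, runs)
    print(f"{wins}/{runs} wins: {wins / runs:.0%} (95% CI {low:.0%} to {high:.0%})")
```

A perfect 3-for-3 record still leaves a lower bound of roughly 44%, which is why a handful of rounds isn't enough to crown anything.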
The data is where it gets interesting
Every run is saved to SQLite, and an analytics page turns all of it into an opponent-aware strength rating: a Bradley-Terry / Elo-style score where beating a strong model counts for more than beating a weak one. You get a win rate with a 95% confidence interval (so a lucky 3-run sample can't masquerade as the champ), a Pareto efficiency frontier that flags the models nothing else beats on strength, cost, and speed at once, plus strength-vs-cost and speed leaderboards. And CSV export, because of course.
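To make "opponent-aware" concrete, here's a minimal Elo-style sketch. It is not the arena's actual rating code: treating a 3-way round as the winner beating each of the other two head-to-head, the starting rating of 1500, and K=32 are all assumptions, and the model names and votes are placeholders.

```python
from collections import defaultdict

K = 32           # update step size (assumed)
START = 1500.0   # starting rating for a new model (assumed)


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def apply_round(ratings: dict, winner: str, losers: list) -> None:
    """Treat one 3-way round as the winner beating each loser head-to-head."""
    # Compute every delta from the pre-round ratings, then apply them together.
    deltas = defaultdict(float)
    for loser in losers:
        gain = K * (1.0 - expected_score(ratings[winner], ratings[loser]))
        deltas[winner] += gain   # bigger gain when the loser was rated higher
        deltas[loser] -= gain
    for model, delta in deltas.items():
        ratings[model] += delta


ratings = defaultdict(lambda: START)
rounds = [  # (winner, the two models it beat): hypothetical votes
    ("model-a", ["model-b", "model-c"]),
    ("model-a", ["model-b", "model-c"]),
    ("model-b", ["model-a", "model-c"]),
]
for winner, losers in rounds:
    apply_round(ratings, winner, losers)

for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```

In the last round, model-b picks up more points for beating the higher-rated model-a than for beating model-c, which is the whole point of an opponent-aware score.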

Here's my honest snapshot. By raw strength, deepseek-v4-pro (~1743) and qwen3.5 (~1738) are out front, and my pick nemotron-3-nano:30b-cloud sits around third (~1469 strength, ~31% win rate over 26 runs). It's not on the efficiency frontier either. So why is it my "set and forget"?
Because for my questions, it's fast, it's light, its answers are consistently good enough, and it tends to cite sources I'd actually click. The heavyweights win more head-to-heads, but nemotron-nano is the one I'm happy to leave running unattended on a cron job.
Set it up to curate your own roster: local, remote, or cloud
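If "efficiency frontier" sounds fancier than it is: a model is on the Pareto frontier when no other model is at least as good on every axis and strictly better on at least one. Here's a minimal sketch with hypothetical models and numbers. Measuring cost in tokens per answer is my assumption, and none of this is the arena's actual code.

```python
def dominates(a: dict, b: dict) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    at_least_as_good = (
        a["strength"] >= b["strength"]
        and a["cost"] <= b["cost"]
        and a["speed"] >= b["speed"]
    )
    strictly_better = (
        a["strength"] > b["strength"]
        or a["cost"] < b["cost"]
        or a["speed"] > b["speed"]
    )
    return at_least_as_good and strictly_better


def pareto_frontier(models: list) -> list:
    """Names of the models that no other model dominates."""
    return [
        m["name"]
        for m in models
        if not any(dominates(other, m) for other in models if other is not m)
    ]


# Hypothetical numbers: strength (higher is better), cost in tokens per answer
# (lower is better), speed in tokens/sec (higher is better).
models = [
    {"name": "big-model", "strength": 1700, "cost": 4000, "speed": 40},
    {"name": "small-model", "strength": 1450, "cost": 1200, "speed": 90},
    {"name": "middling-model", "strength": 1440, "cost": 1500, "speed": 70},
]
print(pareto_frontier(models))
```

Here middling-model drops off the frontier because small-model matches or beats it on all three axes, while big-model and small-model each win somewhere the other doesn't.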
On first run, an onboarding screen lets you curate the roster of models you want in the arena. Mix Ollama Cloud with any number of local or remote Ollama servers you add by URL (a workstation, a box on your tailnet, whatever); each model is automatically routed to its own backend. Your picks are saved server-side, so they persist across sessions and browsers, and you can edit the roster anytime from Models.
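Conceptually, per-model routing is just a lookup from model name to the server that hosts it. Here's a minimal sketch of that idea. The `RosterEntry` type, the helper names, and every model name and URL below are hypothetical placeholders rather than the project's real schema or defaults; only the `/api/chat` path and the default port 11434 come from Ollama itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RosterEntry:
    model: str      # model name as the backend knows it
    base_url: str   # Ollama server that hosts it


# Hypothetical roster: one cloud model, one local, one on another machine.
ROSTER = [
    RosterEntry("nemotron-3-nano:30b-cloud", "https://ollama.com"),
    RosterEntry("local-model:latest", "http://localhost:11434"),
    RosterEntry("tailnet-model:latest", "http://workstation.example:11434/"),
]


def backend_for(model: str, roster: list = ROSTER) -> str:
    """Look up which server a model's requests should go to."""
    for entry in roster:
        if entry.model == model:
            return entry.base_url
    raise KeyError(f"{model!r} is not in the roster")


def chat_url(model: str) -> str:
    """Full URL of the Ollama chat endpoint on the model's own backend."""
    return backend_for(model).rstrip("/") + "/api/chat"


for entry in ROSTER:
    print(entry.model, "->", chat_url(entry.model))
```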

How you can use it
If you're curious what the best model is for YOU, try it. The whole project is on GitHub:
https://github.com/drkpxl/armchair-arena
You can set it up yourself (Python + uv, a self-hosted Firecrawl for the web tools, and an Ollama backend), or just point an AI agent like Hermes/OpenClaw at the included AGENTS.md: "clone this repo, follow AGENTS.md, here's my Ollama API key," and let it install, configure, and confirm everything's healthy for you.
Future plans
If there's interest, I'd add frontier models via their own API keys; I'm just not ready for that yet. If you want to help shape that, or you just want to argue about which model should be everyone's set-and-forget, let me know. And if you run it, I'd genuinely love to hear which model ends up being your armchair champion.