All posts

Cohere Open-Sources a Coding Agent That Runs on One H100. The Catch? It Talks Too Much.

Cohere's North Mini Code is a 30B open-source model for agentic coding. It runs on a single H100, but tests show it generates three times the output tokens of rivals. That verbosity is the hidden cost of self-hosting

June 10, 20263 min read
Heavy black chisel-marker illustration of a simplified mechanical coding agent connected to one large server block, flooding the scene with an oversized stream of printed code and话

Cohere wants you to own your coding agent

Cohere released North Mini Code on Tuesday, an open-source coding agent that fits on a single H100 GPU. It is a 30 billion parameter mixture-of-experts model with only 3 billion parameters active per token, and Cohere built it specifically for agentic software engineering tasks like sub-agent orchestration and architecture generation. That sounds like a dream for teams who want to self-host their own coding assistant without renting a fleet of servers.

The timing is deliberate. Anthropic just opened up its Mythos-class Fable 5 model to the public, and OpenAI keeps pushing its managed coding pipelines. Cohere is offering a different deal. You download the weights, you run the model on hardware you control, and your proprietary code never leaves your network. For enterprises with strict data rules or developers who simply hate vendor lock-in, that is a genuine alternative.

There is a catch, and it is measured in tokens. Independent testing showed that North Mini Code generated three times the output tokens of comparable models. That is not a rounding error. In production, every extra token costs latency, compute, and context window space. A coding agent that produces three paragraphs when a single line would do is not efficient. It is expensive in ways that do not show up on the price tag of the GPU.

The self-hosting fantasy meets production reality

Self-hosting feels like freedom. You own the stack, you skip the API rate limits, and you avoid the surprise pricing changes that managed providers drop without warning. But freedom is a full-time job. Once you decide to run your own model, you inherit the work of keeping drivers updated, managing CUDA versions, monitoring GPU memory, and debugging why your inference server crashed at two in the morning.

Managed models shift that burden to someone else. You pay per token, but you do not have to act as your own infrastructure team. The total cost of ownership moves from engineering hours to direct cash. For a small team trying to ship a product, cash is usually easier to find than spare time. North Mini Code changes the math only if you have the scale to amortize the hardware and the headcount to babysit it.

The real cost of North Mini Code is verbosity. The cost goes beyond the token bill. Bloated outputs mean slower iteration loops, more noise for downstream tools, and harder debugging sessions. When a sub-agent sends three times more text than necessary to its parent orchestrator, the entire pipeline gets sluggish. A self-hosted model that runs fast but thinks out loud is still a bottleneck.

What builders should actually do

If you are an indie hacker or a small startup, keep using managed APIs. The economics of self-hosting only start to make sense at serious scale, or when you face compliance requirements that forbid sending data to external clouds. North Mini Code is a win for tinkerers and large enterprises. It is not a shortcut for teams that need to ship this week.

Still, this release is good for everyone. It puts pressure on managed providers to keep prices low and to justify their premiums with real value rather than lock-in. When open-source models get this capable, the entire market has to compete harder. That is how builders win, whether they rent their compute or buy it.

The broader trend is clear. Builders are rejecting black boxes, whether in foundation models or in the platforms they use to ship apps. Botflow follows the same logic: the entire stack is open source, so you can self-host, inspect what is running, and ship full-stack web or mobile apps without waiting on cloud builds. You do not need to manage a GPU to own your backend. But the principle is identical. Control and speed matter more than ever.

Cohere just proved that a serious coding agent can live on a single H100. That is a technical milestone. The next milestone is making it concise. Until then, the token meter is the fine print you cannot afford to ignore.