precomputed-ai

Precomputed AI: Reason Ahead of Time, Serve Instantly

A design pattern for moving LLM reasoning into artifacts produced ahead of time, with live inference as opt-in escalation.


A few weeks ago I wrote about Token Consumption Anxiety — that creeping unease when you watch your AI-powered app burn through tokens, and every user request makes the bill a little bigger.

In that post, I suggested three ways to deal with it. I’ve been sitting with one of those ideas for a while now. It’s bigger than a footnote in an anxiety post. It’s the pattern behind every AI tool I’ve shipped this year.

So I’m giving it a proper name and a proper frame.

Call it Precomputed AI, or PAI for short.


What it is

Precomputed AI relocates LLM reasoning into artifacts produced ahead of time, with live inference reserved as opt-in escalation.

Three words: precompute, serve, escalate.

If you’ve used Next.js static export, Jekyll, or Hugo, the shape will feel familiar. Those tools generate pages at build time and serve them instantly at request time, falling back to dynamic rendering only when they have to. PAI applies the same idea to LLM reasoning.

The artifact doesn’t have to be a web page. It can be a decision ruleset. A lookup table. Generated code. A precomputed explanation. A cached policy surface. What matters is where the reasoning lives — not on the user’s request, but ahead of it.


What it is not

Three things PAI gets confused with, and shouldn’t:

It’s not caching. A cache stores identical outputs keyed on identical inputs. PAI precomputes reasoning across an input space — a decision surface that covers many cases, not a lookup of past answers.

It’s not batch inference. Netflix precomputing your movie recommendations overnight is a decades-old pattern built on classical prediction models. PAI is about LLM reasoning specifically. The word “precomputed” is shared, but the thing being precomputed is different.

It’s not Compiled AI. A recent paper from XY.AI Labs, Stanford, Cornell, and Harvard defines Compiled AI as LLM-generated code artifacts that run deterministically without further model invocation. That’s one valid technique inside PAI, but PAI is broader. It admits non-code artifacts, and it keeps runtime escalation as a first-class option.
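To make the caching distinction concrete, here is a minimal sketch. All names, rules, and thresholds are invented for illustration; the point is only the shape: a cache answers inputs it has already seen, while a precomputed decision surface covers a whole input space.

```python
# Cache: exact-match lookup of past answers, keyed on identical inputs.
cache = {("summarize", 500): "small-model"}

def cache_lookup(task, tokens):
    # Misses on any input it has never seen before.
    return cache.get((task, tokens))

# Precomputed decision surface: rules generated ahead of time
# that cover ranges of inputs, not just past ones.
rules = [
    (lambda task, tokens: tokens <= 1_000, "small-model"),
    (lambda task, tokens: tokens <= 50_000, "mid-model"),
]

def surface_lookup(task, tokens):
    for predicate, answer in rules:
        if predicate(task, tokens):
            return answer
    return None  # outside the surface: a candidate for escalation

# The cache misses on a near-identical input; the surface still decides.
assert cache_lookup("summarize", 501) is None
assert surface_lookup("summarize", 501) == "small-model"
```

A cache gets cheaper as traffic repeats; a decision surface is cheap on the first request too, because the reasoning already happened.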

Also, I recognize that this is not a blanket frame that applies to every situation. It is an excellent fit where rules-based systems or more deterministic paths can get to the best outcome, and a poor fit where they can’t.


Mechanics

PAI has three key parts.

1. Escalation. The precomputed artifact handles the common case. When it can’t decide — novel input, ambiguous request, edge case — the system escalates to a live LLM. Escalation is explicit, opt-in, and its cost is shown to the user. The user can also pre-approve escalation once and let it run automatically after that.

2. Regeneration. The artifact has a refresh cadence — hourly, daily, weekly, or triggered by an external event. This answers the obvious objection, “Won’t your ruleset go stale?” Yes, unless you regenerate it. Name the cadence, show the last-refreshed date, and define the window.

3. Artifacts. Rulesets, tables, generated code, precomputed text, cached policies — all qualify. PAI asks where reasoning lives, not what form it takes. That flexibility is what makes PAI broader than any single technique.
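The three parts fit together in a small amount of code. This is a hedged sketch, not a reference implementation: the class, the weekly cadence, the token estimate, and the `live_llm` stand-in are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta, timezone

REFRESH_WINDOW = timedelta(days=7)  # the named cadence: weekly

class Artifact:
    """A precomputed ruleset plus the metadata PAI asks you to surface."""

    def __init__(self, rules, generated_at):
        self.rules = rules            # (predicate, decision) pairs
        self.generated_at = generated_at

    def is_stale(self, now):
        return now - self.generated_at > REFRESH_WINDOW

    def decide(self, request):
        for predicate, decision in self.rules:
            if predicate(request):
                return decision
        return None  # can't decide: the caller may escalate

def live_llm(request):
    # Stand-in for a real model call.
    return "escalated-answer"

def handle(artifact, request, now, escalate_approved=False):
    decision = artifact.decide(request)
    if decision is not None:
        # Common case: the artifact decides, zero tokens spent.
        return {"decision": decision, "tokens_spent": 0,
                "stale": artifact.is_stale(now)}
    if not escalate_approved:
        # Escalation is explicit and opt-in; disclose cost before committing.
        return {"decision": None, "offer": "deep analysis (~2,000 tokens)"}
    return {"decision": live_llm(request), "tokens_spent": 2_000}
```

A usage pass: a request the ruleset covers returns instantly at zero tokens; a request it can’t cover returns an offer, and only runs the live model once the user (or a standing pre-approval) says yes.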


Patterns


Two tools I’ve shipped under this frame

rightmodel.dev is a model picker for coding tasks. It uses a precomputed ruleset (open-sourced as “ruleset v2”) to recommend the cheapest model tier that fits the job. Prices are refreshed on a schedule. Explanations are precomputed per recommendation.

A request to rightmodel costs zero tokens. The ruleset decides; the page renders.

When the ruleset can’t decide, the user can opt into a “deep analysis” that escalates to a live LLM. The cost is disclosed before the user commits. That’s the escalation contract in production.

cloudestimate.dev sizes self-managed workloads across AWS, GCP, and Azure. Published vendor reference architectures are mapped to cloud instance shapes, then priced against cached regional snapshots. The “Pricing data last refreshed” footer shows the staleness window directly to the user.

This is Scheduled Generation in its cleanest form. LLM involvement in the regeneration pipeline is on the roadmap; the served artifact is static between refreshes.
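The staleness footer is worth sketching, because it is the user-facing half of the regeneration contract. Everything here is hypothetical — the snapshot shape, the price, the dates — but it shows the mechanic: the served artifact is a static snapshot, and its age is displayed rather than hidden.

```python
from datetime import datetime, timezone

# A cached regional pricing snapshot, regenerated on a schedule.
snapshot = {
    "generated_at": datetime(2026, 1, 5, tzinfo=timezone.utc),
    "prices": {
        # (provider, region, shape) -> $/hour, illustrative numbers only
        ("aws", "us-east-1", "m5.xlarge"): 0.192,
    },
}

def price(provider, region, shape):
    # Served straight from the snapshot: zero live calls per request.
    return snapshot["prices"][(provider, region, shape)]

def footer(now):
    # The "Pricing data last refreshed" line shown to the user.
    age_days = (now - snapshot["generated_at"]).days
    return f"Pricing data last refreshed {age_days} day(s) ago"
```

Calling `footer` three days after the snapshot was generated yields “Pricing data last refreshed 3 day(s) ago” — the staleness window stated plainly, right next to the numbers it qualifies.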


Why this matters now

Token costs compound at scale. Cheap per call, expensive per million calls. Teams shipping at production volume are running into this wall right now.

Latency budgets are tight. Users don’t tolerate synchronous LLM calls in interfaces that used to feel instant. Every feature shipped with a live LLM call is a latency regression unless the reasoning lives somewhere else.

Auditability is becoming a buyer requirement. Enterprise and regulated industries do not accept “the model decided” as an audit trail. Precomputed artifacts are inspectable, versionable, and testable. Live inference is not.

LLM providers themselves are converging on this pattern — prompt caching, batch APIs, cached responses — without naming it as a design frame. PAI is the name for what a lot of us are already doing under different labels.


An invitation

If you’re shipping an LLM-powered tool and you feel that Token Consumption Anxiety, try asking one question about your design:

Which parts of this reasoning could live in an artifact instead of in the request?

If you build under this approach, I’d like to hear about it. Pattern contributions, worked examples, and honest critiques of the definitions above are all welcome.

Hopefully some of this helps.


Canonical home: precomputedai.com

Patterns and spec: github.com/PrecomputedAI/precomputed-ai

Worked examples: rightmodel.dev · cloudestimate.dev

Licensed CC BY 4.0. Cite as: Raquedan, R. (2026). Precomputed AI: Reason Ahead of Time, Serve Instantly. https://precomputedai.com