The public doesn’t know how to waste money; we are the experts.

Most people don’t burn billions of tokens. A long debugging session full of motivated reasoning uses millions, but that won’t scale to billions.

It takes the organizational synergies of an $ENTERPRISE: AI FOMO disguised as hollow “vision statements” (usually issued after layoffs) and recurring data pipelines.

The organizational synergies manifest predictably: cron-driven and interactive text generation (none solicited, all pure evil) versus knowledge distillation for specialized models, which is defensible, though it may incentivize more layoffs. Don’t worry: if you are reading this, it’s not your fault.

$ENTERPRISE will burn tokens regardless. If you are forced to participate, burn them meaningfully. Here are a few lessons from 100B SOTA tokens – first o1-preview, then o1 and o3.

Evaluation Dataset and Guidance

This is the limiting factor. It requires domain experts and tooling support.

Before rushing into import openai, consider these invariants:

  • task definition: a shared understanding of the problem articulated in natural language, guiding human actions and reducing disputes.
  • prompt effectiveness: can it steer the model to produce desired outputs?
  • model performance difference: is a cheaper model good enough? Should you upgrade on model releases?

A dataset built on such guidance, with minimal subjective judgements, helps all three and enables automated prompt tuning and large-scale distillation, i.e., producing training labels.
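A minimal sketch of what one record can look like, assuming a classification-style task; the field names, labels, and file path are illustrative, not a prescription:

```python
from pydantic import BaseModel, Field


class EvalExample(BaseModel):
    """One row of the evaluation dataset; every field name here is illustrative."""

    input_text: str = Field(description="The raw input handed to the model.")
    expected_label: str = Field(description="The label domain experts agreed on.")
    guidance_ref: str = Field(description="Pointer to the written guidance that justifies the label.")


# JSONL keeps the same file usable for prompt tuning and for distillation labels.
examples = [
    EvalExample(
        input_text="Customer reports intermittent 504s after the gateway upgrade.",
        expected_label="infrastructure",
        guidance_ref="guidance.md#routing-errors",
    ),
]

with open("eval.jsonl", "w") as f:
    for ex in examples:
        f.write(ex.model_dump_json() + "\n")
```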

Lower Expectations: Gaps

Real-life tasks contain vagueness, and experts disagree for good reasons.

Models respond differently to prompts; you may never find the phrase that causes undesired results. The most expensive model may not improve precision and recall; it depends on task complexity. In my case, gpt-5-pro equals o3, while the mini series scores 0.05~0.15 worse and should be avoided.
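A rough sketch of how to settle the model-comparison question on the same labeled set; `run_model` is a hypothetical wrapper around whatever client you use, and the rows are assumed to come from the eval.jsonl above:

```python
import json

from sklearn.metrics import precision_score, recall_score


def run_model(model: str, text: str) -> str:
    """Hypothetical wrapper around your API client; returns a predicted label."""
    raise NotImplementedError


def compare(models: list[str], path: str = "eval.jsonl") -> None:
    """Score every candidate model on the same labeled rows."""
    rows = [json.loads(line) for line in open(path)]
    y_true = [row["expected_label"] for row in rows]
    for model in models:
        y_pred = [run_model(model, row["input_text"]) for row in rows]
        prec = precision_score(y_true, y_pred, average="macro", zero_division=0)
        rec = recall_score(y_true, y_pred, average="macro", zero_division=0)
        print(f"{model}: precision={prec:.2f} recall={rec:.2f}")


# compare(["o3", "o4-mini"])  # decide whether the price difference buys anything
```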

Models have inherent biases: GPT’s chat bait nature, Claude’s “You are absolutely right!”, language style, punctuation choice, assumed “tech” context, and more.

Prompt engineering cannot fix model biases; focus on what we can control.

Structured Output

Claude now officially supports structured output, so you no longer need to write tool definitions for it. (Corollary: stay away from Bedrock; it doesn’t even support the web search tool.)

Unstructured text is largely useless for training, yet many $ENTERPRISE teams still ask gpt-3.5-turbo for JSON output, strip the markdown fences, and silently ignore JSON parse failures.

Read the docs, write pydantic classes, add field descriptions as if writing a prompt.
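A minimal sketch of that advice, assuming a recent openai Python SDK with the structured-output parse helper; the schema, model name, and prompt are placeholders:

```python
from openai import OpenAI
from pydantic import BaseModel, Field


class Ticket(BaseModel):
    """Output schema; the field descriptions are sent to the model, so write them like a prompt."""

    category: str = Field(description="One of: infrastructure, billing, product.")
    summary: str = Field(description="One sentence, in the customer's own words.")
    confidence: float = Field(description="How certain you are, from 0.0 to 1.0.")


client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",  # placeholder; any model with structured-output support
    messages=[{"role": "user", "content": "Classify: intermittent 504s after the gateway upgrade."}],
    response_format=Ticket,
)
ticket = completion.choices[0].message.parsed  # a validated Ticket instance
```

The SDK validates against the schema, so there is nothing to strip and no parse failure to ignore silently.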

RAG or finetune?

RAG outperforms for my workload. My tasks don’t involve injecting static proprietary knowledge, there’s no pipelined finetuning recipe, and unseen data would require further finetuning. Most importantly, I’m lazy, so I chose RAG.
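For completeness, a minimal sketch of the retrieval half, assuming OpenAI embeddings and plain cosine similarity; a real deployment would use a vector store, but the shape is the same:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of passages (or a single query) into one matrix."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])


def retrieve(query: str, docs: list[str], doc_vecs: np.ndarray, k: int = 3) -> list[str]:
    """Return the k passages most similar to the query."""
    q = embed([query])[0]
    # Cosine similarity against every stored passage; argsort descending, keep top-k.
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]


# docs = load_passages(); doc_vecs = embed(docs)             # index once (load_passages is yours)
# context = "\n\n".join(retrieve(question, docs, doc_vecs))  # stuff into the prompt per query
```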

Prompt Engineering

Claude’s docs recommend simple, direct, clear prompts. They are language-agnostic and may resist context rot.

Tools like GEPA and proofofthought bridge the gap from natural language to prompts.

However, I prefer to use my own brain first.

Domain Experts

Vagueness and complexity require domain experts to iterate in the feedback loop: test ideas, gather feedback, find correlations, fix the dataset, improve the prompt, and so on. Build tools to support this loop.
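One such tool is a disagreement report: group the cases where the model and the experts diverge so the experts can decide whether to fix the dataset or the prompt. A rough, self-contained sketch; the pair format is an assumption:

```python
from collections import Counter


def disagreement_report(pairs: list[tuple[str, str]]) -> None:
    """pairs: (expert_label, model_label) per example; print confusion counts, worst first."""
    confusions = Counter(p for p in pairs if p[0] != p[1])
    for (expert, model), count in confusions.most_common():
        print(f"{count:4d}  expert={expert}  model={model}")
```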

Some decision makers hallucinate that TRENDING-AI-IDEs magically reduce every problem to a man-month timeline. Don’t trust them. AI IDEs are useful when:

  • you know but are too lazy to implement
  • you don’t know but you can verify

Knowledge distillation falls outside these: you can’t verify what you don’t know; only domain experts can, and most have already been laid off.

If I had run this from the beginning, I would have hired all the domain experts, so competitors would find none on the market after the FOMO.