Encoder models are often overlooked despite their superior performance and efficiency. Training one at an $ENTERPRISE frees you from “why don’t you just” professional managers, and it’s a rewarding process.

In hindsight, the training doesn’t require much background or technical expertise, and each stage gives some positive feedback to keep going. It’s also rewarding to see better evaluation results on downstream tasks than with LLMs that presumably have 10x more params.

The training can be largely divided into 3 stages: dataset preparation, tokenizer training (encouraged but optional), and training the model itself.

Dataset

Most papers recommend ordinary datasets for pretraining and higher-quality datasets for later training stages. There are plenty of open corpora on HuggingFace, or you may need a proprietary one from your $ENTERPRISE storage; either way, no peaceful internet neighborhood shall be disturbed.

I did 2 preprocessing steps for a proprietary dataset.

Deduplication with N-gram MinHashLSH

Near-deduplication improves the model’s downstream performance with a much smaller dataset

… as well as training time. There are plenty of articles on the topic, but most are at best MinHash 101 and Spark 101, some cost too much to run, and some are content-farm gibberish. The only useful one from my browser history is Large-scale Near-deduplication Behind BigCode; it offers the motivation above, plus:

  • trade-off analysis: “If you are familiar with Datasketch, you might ask, why do we bother to strip all the nice high-level functions the library provides”
  • cost as a reference: 1.4TB data / 4 hours / $15 per hour. My naive parquet-streaming version using Datasketch costs about the same dollars per compressed TB (see the sketch after this list).
  • some 101 material and code
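
For a concrete picture, here is a minimal sketch of that naive parquet-streaming pass with Datasketch. The file path, the "text" column name, and the n-gram / num_perm / threshold values are illustrative assumptions, not my production setup.

```python
# Minimal near-dedup sketch with Datasketch MinHashLSH over word n-grams.
# "corpus.parquet", the "text" column, and NGRAM/NUM_PERM/THRESHOLD are placeholders.
from datasketch import MinHash, MinHashLSH
import pyarrow.parquet as pq

NGRAM, NUM_PERM, THRESHOLD = 5, 128, 0.8
lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
kept = []

def signature(text: str) -> MinHash:
    # hash word-level n-grams into a MinHash signature
    words = text.split()
    m = MinHash(num_perm=NUM_PERM)
    for i in range(max(len(words) - NGRAM + 1, 1)):
        m.update(" ".join(words[i:i + NGRAM]).encode("utf-8"))
    return m

for batch in pq.ParquetFile("corpus.parquet").iter_batches(columns=["text"]):
    for text in batch.to_pydict()["text"]:
        m = signature(text)
        if lsh.query(m):               # near-duplicate of a document we already kept
            continue
        lsh.insert(str(len(kept)), m)
        kept.append(text)
# `kept` now holds the near-deduplicated documents, ready to be written back to parquet.
```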

If a fresh, domain-tailored tokenizer is needed, train one on the deduped dataset, then pretokenize the deduped dataset for training. See the Tokenizer section below, then come back.

Pretokenize the dataset

Pretokenizing increases training speed (tokens/second; see issue with code examples) by 25%~400%, depending on model size and whether sequence packing is on. For my dataset, it takes ~9 hours to pretokenize but saves 50+ hours in each base-size pretraining run, a worthwhile upfront cost.
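
A minimal pretokenization sketch with HuggingFace datasets; the paths, the "text" column, and num_proc are assumptions standing in for my actual setup.

```python
# Tokenize the deduped corpus once, up front, and save it to disk for training.
# "tokenizer_dir", "deduped/*.parquet", "pretokenized", and num_proc are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tokenizer_dir")
ds = load_dataset("parquet", data_files="deduped/*.parquet", split="train")

def tokenize(batch):
    # keep only input_ids; attention masks can be rebuilt cheaply at load time
    return {"input_ids": tokenizer(batch["text"], truncation=False)["input_ids"]}

ds = ds.map(tokenize, batched=True, remove_columns=ds.column_names, num_proc=16)
ds.save_to_disk("pretokenized")   # later: datasets.load_from_disk("pretokenized")
```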

Sequence packing affects convergence. My best guess is that batches become noisier and the number of backpropagation steps shrinks by roughly a max-seq/avg-seq factor – good luck finding a proper combination of hyperparams.

Tokenizer

Training a tokenizer is encouraged but completely optional: any English/multilingual BPE tokenizer from a modern LLM should just work, at the cost of wasting more token embeddings, while only minimally hurting downstream tasks – I verified that accidentally, thanks to a bug.

But it’s fun and easy: choose normalizers, add a template, build an iterator that yields from the deduped dataset, train from the iterator, and save.
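
Roughly, with the HuggingFace tokenizers library – a sketch, where the normalizer choice, vocab size, special tokens, and paths are illustrative rather than my exact settings:

```python
# Train a BPE tokenizer from the deduped corpus; all concrete values are illustrative.
from datasets import load_dataset
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

deduped_dataset = load_dataset("parquet", data_files="deduped/*.parquet", split="train")

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

def corpus_iterator():
    # stream text instead of materializing the whole corpus in memory
    for doc in deduped_dataset:
        yield doc["text"]

trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    special_tokens=["[CLS]", "[SEP]", "[PAD]", "[UNK]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus_iterator(), trainer=trainer)
tokenizer.save("tokenizer.json")
```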

I decided to train one because: a) conceptually it’s a good idea – more of the token embeddings get used, and needing fewer tokens to represent domain text also reduces inference cost; b) papers like SciVocab proved its effectiveness; c) it’s cheap. Skip this perhaps only when your domain is close to general web text?

Examining the longest tokens reveals a few confirming and fun facts: they are domain terms that I’d expect to see in the dataset, so naturally fewer token embeddings will be wasted during finetuning; German compound words can be very long; and some data sources repeatedly contributed the same phrases.
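
The inspection itself is a one-liner against the saved tokenizer (assuming the tokenizer.json file from the sketch above):

```python
# Print the 20 longest tokens in the learned vocabulary.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")
print(sorted(tok.get_vocab(), key=len, reverse=True)[:20])
```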

Post Processor Template

BOS/CLS and EOS/SEP tokens are important for benchmarks and real use cases, especially CLS, which always appears at the beginning of the text and has bidirectional information encoded into it. Add a post-processor template so these special tokens are included during pretraining.
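
With the tokenizers library, that looks roughly like this; the BERT-style [CLS]/[SEP] names follow the special tokens assumed in the training sketch above.

```python
# Attach a post-processor so every encoded sequence gets [CLS] ... [SEP] around it.
from tokenizers import Tokenizer, processors

tok = Tokenizer.from_file("tokenizer.json")
tok.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[("[CLS]", tok.token_to_id("[CLS]")), ("[SEP]", tok.token_to_id("[SEP]"))],
)
tok.save("tokenizer.json")
```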

Train

The ModernBERT paper presents a training recipe of 3 stages – pretrain, context extension, lr decay – plus a few yaml files in this branch; the recipe has been largely replicated by other papers with promising results.

When you don’t have access to H100s, decreasing microbatch_size proportionally to your VRAM should work. The learning rate can be a bit higher or lower; I didn’t see any meaningful differences in convergence in a few ablations.

I found that combining stages 2 and 3 saves some time and still gives a very decent model that performs reasonably well on downstream tasks. However, combining all 3 stages into one pass didn’t work; its performance was much worse.

Epilogue

I chose ModernBERT simply because its Torch support was ready when I started training, while other recent variants’ wasn’t. Its codebase isn’t the most actively maintained, but at least the branch provides usable code, and the authors are quite helpful despite being super busy.

Before committing to training, I validated classification quality, benchmarked latency, and estimated cost; all numbers showed clear potential.

It’s now serving online requests in $ENTERPRISE as the backbone model, delivering better results at <1% of GPT’s cost and latency. It’s not a fair comparison – a domain-specific encoder vs a general-purpose decoder – yet it’s still fulfilling to see something you trained from scratch actually work.