
Talkie-1930: An LLM That Has Never Heard of Computers Just Learned to Code

By Francesco Di Donato
May 2, 2026
11-minute read
[Image: a vintage typewriter beside a glowing modern circuit, conceptual still life.]

When Claude or GPT-4 writes a working SQL query, you cannot tell whether the model reasoned its way to the answer or just remembered a similar one from 100 million GitHub repos. The training corpus is too big and too messy to rule out memorization for any specific output. That ambiguity is the quiet engine of the “stochastic parrot” debate, and it never resolves, because both sides are arguing about the same evidence.

Nick Levine, David Duvenaud, and Alec Radford built a way out. Their model, Talkie-1930, is a 13B-parameter LLM trained on 260 billion tokens of English text published before 31 December 1930. No code. No internet. No World War II. Then they handed it HumanEval (OpenAI’s standard 164-problem Python coding benchmark) with a few Python examples in the prompt and asked it to write code.

It can.

Key Takeaways

  • Talkie-1930 is a 13B LLM trained on 260B tokens of pre-1931 books, newspapers, periodicals, scientific journals, patents, and case law. Funded by Coefficient Giving and Anthropic.
  • With Python examples in-context, it solves HumanEval problems at non-zero pass@100 (a problem counts as solved if any of 100 sampled completions passes the tests), despite never having seen code or knowing what a computer is.
  • Surprisingness on 4,990 NYT “On This Day” events climbs after 1930, peaking in the 1950s and 1960s, then plateauing. Bits-per-byte (the number of bits the model needs to encode each byte of text, where higher means “more surprised”) tracks the model’s actual ignorance horizon.
  • Time leakage is brutal. The 7B leaked the New Deal and named specific 1933 to 1935 statutes. The 13B leaked WWII and the United Nations.
  • OCR errors cut effective learning to 30% of human-transcribed quality. Regex cleanup recovered it to 70%.
  • The team is targeting a trillion-token corpus and a GPT-3-level Talkie by summer 2026.

The Cleanest Experiment We Have for “Reasoning vs Memorization”

The argument splits like this. Pro-reasoning camp: LLMs solve novel problems, therefore they are doing something more than retrieval. Anti-reasoning camp: the training set is so vast that any “novel” problem has near-neighbors the model can paraphrase, and you cannot prove the model didn’t lean on them.

Both sides are right within the frame they choose. The frame is the problem.

Talkie-1930 is the cleanest natural experiment for breaking the frame. The training cutoff is hard, principled, and verifiable: works published on or before 31 December 1930 have entered the public domain in the United States. The corpus is composed of materials whose original publication dates predate any modern technical concept the test will probe. If Talkie can write Python, it cannot have memorized Python. The capability has to come from somewhere else.

The “somewhere else” is the part that matters.

What’s in 260 Billion Tokens of Pre-1931 English

The corpus pulls from books, newspapers, periodicals, scientific journals, patents, and case law. These sources are not interchangeable for the question at hand. Case law and patents are the two heaviest carriers of formal logical structure in the historical written record. A patent abstract describes a mechanism in terms of components, relations, conditions, and outputs. A judicial opinion threads premises through rules of inference toward a conclusion that has to survive review. Both genres trained generations of human readers in the kind of step-by-step reasoning that programming later inherited.

The team doesn’t publish a per-source token breakdown, but the behavior suggests these structurally rich genres punch above their weight.

The HumanEval Result You Should Care About

The model “dramatically underperforms” a modern-twin model trained on FineWeb (a large, modern web-scraped corpus from HuggingFace, used as a stand-in for “everything a contemporary LLM gets to read”) when both are evaluated on HumanEval pass@100 with Python examples in-context. Pass@100 means: sample the model 100 times for each problem, and count it as solved if any of those 100 completions runs correctly. That comparison is the headline most readers fixate on, and it’s the wrong one.
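
For concreteness, this is the standard unbiased pass@k estimator from the original HumanEval paper, which is how a number like pass@100 gets computed from n sampled completions per problem. A minimal sketch; I’m assuming the Talkie evaluation uses the same estimator, since the project writeup doesn’t publish its harness:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Probability that at least one of k completions drawn (without
        # replacement) from n samples is correct, given that c of the n passed.
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # e.g. 100 samples per problem, 3 of them passing:
    # pass_at_k(100, 3, 100) == 1.0, pass_at_k(100, 3, 1) == 0.03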

The right comparison is Talkie-1930 vs zero. The model has never been told what a computer is. It has not seen a def or a for loop in 260 billion tokens. Researchers paste a handful of working Python examples into the prompt, and the model produces additional working Python. Pass@100 is well above floor. The capability exists.

That single fact rules out one explanation of in-context learning as a phenomenon. Whatever in-context learning is doing in modern LLMs, it is not solely “remembering similar code from training and pattern-matching the prompt to it”. A model that has never seen code can still extract syntax rules from examples and apply them. The inductive substrate has to predate the specific domain.

If you’ve spent any time prompting current models, this should reorganize how you think about few-shot prompting. The few-shot examples are not retrieval triggers. They are something closer to an instruction set the model assembles on the fly out of more general capacity.
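
Concretely, the setup is the ordinary few-shot recipe: worked examples, then the unsolved problem, and the model continues the text. A hypothetical prompt of that shape, since the exact examples the team pasted in are not published:

    # Hypothetical few-shot prompt: a couple of worked Python examples,
    # followed by a HumanEval-style problem stub for the model to complete.
    FEW_SHOT_EXAMPLES = '''def add(a, b):
        """Return the sum of a and b."""
        return a + b

    def is_even(n):
        """Return True if n is divisible by two."""
        return n % 2 == 0
    '''

    def build_prompt(problem_stub: str) -> str:
        return FEW_SHOT_EXAMPLES + "\n" + problem_stub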

Forecasting the Future With Bits Per Byte

The reasoning result is the eye-catcher. The forecasting result is the methodology contribution that quietly does the most work.

The team scored Talkie-1930’s surprise at 4,990 historical event descriptions sourced from the New York Times “On This Day” column. The metric is bits per byte: how many bits the model needs to encode a passage, normalized by passage length. Lower numbers mean the model found the content predictable; higher numbers mean the content sat outside what the model could anticipate. It’s a close cousin of perplexity, just measured per byte of raw text instead of per token, which makes it cleaner to compare across models with different tokenizers.
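
If you want the metric as arithmetic rather than prose: take the model’s total negative log-likelihood for the passage and divide by the passage’s byte length. A minimal sketch, assuming you already have per-token log-probabilities (in nats) from whatever model you are scoring with:

    import math

    def bits_per_byte(token_logprobs_nats: list[float], text: str) -> float:
        # Total surprise in bits (nats divided by ln 2), normalized by the
        # UTF-8 byte length of the passage. Lower = more predictable.
        total_bits = -sum(token_logprobs_nats) / math.log(2)
        return total_bits / len(text.encode("utf-8"))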

Run that across the 20th century and you get a curve. Quoting the project directly:

“We can see an increase after the knowledge cutoff, particularly pronounced in the 1950s and 1960s, followed by a plateau.”

The 1950s-60s peak is where the Cold War, the space race, and the early computing era arrive in the news cycle. These are the concepts most semantically distant from anything in the pre-1931 corpus. The plateau after that point is the model bottoming out: at some level of unfamiliarity, additional novelty stops registering as additional surprise.

This is a clean operationalization of “what does the model genuinely not know”. It also gives you an objective signal for AI forecasting research that doesn’t require you to ask the model to predict anything. You just measure its surprise at history.
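
The curve itself is just bucketing. A sketch under an assumed data shape (the project’s event format isn’t published), where each event is a (year, bits-per-byte) pair:

    from collections import defaultdict
    from statistics import mean

    def surprise_by_decade(events):
        # events: iterable of (year, bits_per_byte) pairs, one per scored
        # "On This Day" passage. Returns {decade: mean bits per byte}.
        buckets = defaultdict(list)
        for year, bpb in events:
            buckets[(year // 10) * 10].append(bpb)
        return {decade: mean(values) for decade, values in sorted(buckets.items())}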

Time Leakage: The Part Where the Experiment Almost Breaks

The hard part of training a model with a clean knowledge cutoff is keeping the cutoff clean. Two specific failures show up in the project’s own materials.

The 7B model leaked the New Deal. Asked who the US President was in 1936, an earlier 7B Talkie correctly named Franklin D. Roosevelt and cited the National Recovery Act of 1933, the Agricultural Adjustment Act of 1935, and the Emergency Banking Act of 1935. Each of those is more than two years past the cutoff. The leak almost certainly traveled through editorial introductions, prefaces, footnotes, and library cataloging metadata embedded in scanned books. A 1928 economics text reissued in 1962 carries a 1962 introduction.

The 13B model leaked WWII. Despite tighter filtering, the larger model retained knowledge of the United Nations and the postwar division of Germany. Same root cause, different surfaces.

What this proves is mundane and important. A “knowledge cutoff” is not a physical wall. It is a continuous filtering problem with adversarial inputs (modern editorial content piggybacking on old documents). Anyone who has ever audited a “model trained only on data before X” claim now has a concrete case study showing how leakage actually happens.
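
To make the failure mode concrete, here is the kind of crude screen you might reach for first, and exactly why it is not enough. This is an illustration, not the team’s actual filtering pipeline:

    import re

    CUTOFF_YEAR = 1930
    # Matches years 1930-2099; the explicit comparison below excludes 1930 itself.
    YEAR_PATTERN = re.compile(r"\b(19[3-9][0-9]|20[0-9][0-9])\b")
    # A few obviously anachronistic phrases for a pre-1931 corpus.
    ANACHRONISMS = re.compile(r"United Nations|New Deal|World War II")

    def looks_leaky(doc: str) -> bool:
        years = (int(y) for y in YEAR_PATTERN.findall(doc))
        return any(y > CUTOFF_YEAR for y in years) or bool(ANACHRONISMS.search(doc))

    # The hard cases slip straight past this: a 1962 preface to a 1928 text
    # can color the corpus without ever naming a post-cutoff year or event.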

OCR Is the Other Reason This Project Was Hard

Pre-1931 text is not natively digital. It exists as scanned images of physical pages, processed through optical character recognition. OCR on historical print is famously noisy.

The team measured the cost. Training on raw machine-transcribed text yielded 30% of the learning efficiency that human-transcribed versions achieved. Regex cleanup brought that up to 70%. The remaining 30% gap is what the team has to close to get to GPT-3-level capability on a fully historical corpus.

Their illustrative example is the opening of L. Frank Baum’s The Wonderful Wizard of Oz (1900) as it landed in the corpus before cleanup:

J)ecause of the great Lion... C 'This must be the Land of Oz," said^Dor(|tliy...

Multiply by 260 billion tokens and you understand why “just train on the public domain” turned out to be a multi-year engineering project rather than a weekend hack.
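
The “regex cleanup” is exactly what it sounds like: rule-based substitutions targeting recurring transcription failures. The real pipeline isn’t published, so the patterns below are illustrative of the genre rather than the team’s actual rules:

    import re

    # Each rule targets a recurring OCR failure in historical print:
    # broken letterforms, stray marks from smudged type, ragged whitespace.
    OCR_FIXES = [
        (re.compile(r"J\)"), "B"),       # "J)ecause" -> "Because"
        (re.compile(r"\^|\(\|"), ""),    # stray carets and "(|" fragments
        (re.compile(r"\s{2,}"), " "),    # collapse runs of whitespace
    ]

    def clean(line: str) -> str:
        for pattern, replacement in OCR_FIXES:
            line = pattern.sub(replacement, line)
        return line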

Vintage Post-Training: How to Make a 1930s Chatbot

A pretrained Talkie can complete pre-1931 prose. Making it follow instructions without injecting modern preferences required a custom post-training pipeline. The team generated instruction-response pairs from period-appropriate reference material:

  • An etiquette manual (Beadle, 1859)
  • A practical knowledge book (Henley, 1914)
  • A parlor guide (Sandison, c. 1895)
  • A letter-writing manual (Chambers, 1900)

They also pulled from cookbooks, dictionaries, encyclopedias, and poetry and fable collections. These genres carry implicit instruction-following in their structure: a recipe is a procedure with a goal, a dictionary is a Q&A about words, a letter-writing manual is a templated request-response corpus.

DPO (Direct Preference Optimization) ran on the model’s own rollouts. DPO is a fine-tuning method that nudges a model toward “preferred” answers and away from “rejected” ones using a dataset of paired comparisons, without the separate reward model that older RLHF (Reinforcement Learning from Human Feedback) pipelines required. The Talkie team used Claude Sonnet 4.6 as the AI judge that decided which of two candidate responses was better. Instruction-following ratings climbed from 2.0 to 3.4 on a five-point scale.
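
The DPO objective itself is compact enough to show. This is the standard loss from the original DPO paper, not necessarily the team’s exact training code; beta controls how hard the policy is pushed away from the frozen reference model:

    import torch
    import torch.nn.functional as F

    def dpo_loss(pi_chosen: torch.Tensor, pi_rejected: torch.Tensor,
                 ref_chosen: torch.Tensor, ref_rejected: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
        # Each argument holds summed log-probabilities of full responses under
        # the policy (pi_*) or the frozen reference model (ref_*), for the
        # judge-preferred ("chosen") and judge-rejected responses.
        margin = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
        return -F.logsigmoid(beta * margin).mean()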

There is a subtle point here. Using a modern model as a judge introduces a small contamination channel. Claude was trained on post-1930 material, including modern conventions about what a “good answer” looks like. A 1930 reader might prefer different rhetorical structures than a 2026 Anthropic preference dataset rewards. The team flags this honestly. It’s the kind of trade-off that makes the project feel like real research instead of a gimmick.

What Talkie Tells Us That Modern LLMs Cannot

Modern frontier models confound three sources of capability: pretraining scale, RLHF alignment (the post-training step that teaches models to follow instructions and refuse harmful requests), and direct exposure to the test domain. You cannot tease them apart by ablating one factor on GPT-4 (ablating, in ML, just means removing one ingredient and re-running the experiment to see what changes). You’d have to retrain from scratch.

Talkie-1930 removes the third factor by construction. Whatever it can do, it does on top of pretraining structure plus vintage post-training. No exposure to code, modern instruction-following styles, or contemporary discourse.

Three findings survive that removal:

  1. In-context learning generalizes across temporal-semantic gaps. Few-shot Python works with zero pretraining exposure to programming.
  2. Surprise is measurable and tracks ignorance. Bits-per-byte on dated content gives you an objective handle on what the model genuinely does not know.
  3. Instruction following is portable. The model learned to follow instructions from period reference works that were not designed as instruction data.

The complementary finding from the BitNet line of research is similar in spirit: capabilities we attribute to scale or precision are sometimes more robust than the framing suggests. (See BitNet b1.58 for the 1-bit version of this argument and running BitNet on M1 for the practical side.) Stripping a confound usually preserves more than you expect.

The Hassabis Test

The team frames the larger ambition with a question Demis Hassabis has been posing for a few years:

“As Demis Hassabis has asked, could a model trained up to 1911 independently discover General Relativity, as Einstein did in 1915?”

That question stops being a thought experiment the moment Talkie-1930 ships. We now have the instrument. A 1911 cutoff is mostly a different filtering pass over the same source materials, and the experimental design (in-context examples, surprisingness curves, post-training from period reference works) carries over.

We are not at General Relativity. We are nowhere near General Relativity. The relevant point is that the question has stopped being rhetorical. You can run the experiment.

What’s Next

The team is targeting a trillion-token historical corpus and a GPT-3-level Talkie for release by summer 2026. At that scale, the comparison stops being “tiny vintage model vs frontier model trained on the modern web” and starts being “comparable-capacity model with vs without modern data”. That’s the experiment whose results actually let us isolate what modern data buys us, beyond raw scale.

If you want to play with the current model, the chat demo is at talkie-lm.com/chat and the code lives at github.com/talkie-lm/talkie. The project’s own writeup, Introducing talkie: a 13B vintage language model from 1930, is the canonical source for the numbers and methodology.

The thing I keep coming back to: a model that has never heard of a computer can be taught to write code from a handful of examples. Whatever debate you were having about reasoning vs memorization, you are now having a different one.