Theory is interesting. Validation is what matters. This month I ran Tessera through her first formal test: the holdout experiment.
The Method
I held out one full year of decisions from the training corpus. Tessera was given only what I had at the time each decision was made: the email chain, the context, the constraints that were visible at the moment of decision. She was not given the outcome, my journal reflections, or any subsequent information.
For each held-out decision, I asked Tessera: given this situation, what would you recommend? Then I compared her recommendation against what I actually chose.
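The setup above amounts to stripping each archived decision down to what was knowable at the moment of choice. A minimal sketch, with field names and the example record entirely my own assumptions (the post does not describe the corpus schema):

```python
from dataclasses import dataclass

# Hypothetical record layout for one archived decision; the field
# names are illustrative, not taken from the original post.
@dataclass
class DecisionRecord:
    decision_id: str
    context: str      # email chain, constraints visible at the time
    choice: str       # what was actually chosen (held back from the model)
    outcome: str      # what happened afterwards (held back)
    reflections: str  # journal notes written later (held back)

@dataclass
class HoldoutCase:
    decision_id: str
    prompt: str       # only information available at decision time

def make_holdout_case(record: DecisionRecord) -> HoldoutCase:
    """Strip everything the decider could not have known at the time."""
    prompt = ("Given this situation, what would you recommend?\n\n"
              + record.context)
    return HoldoutCase(decision_id=record.decision_id, prompt=prompt)

record = DecisionRecord(
    decision_id="2023-041",
    context="Vendor missed a second deadline; renewal is in 30 days.",
    choice="escalate",
    outcome="vendor replaced",
    reflections="Should have escalated a month earlier.",
)
case = make_holdout_case(record)
```

The point of routing every case through one constructor is that outcome leakage becomes a structural impossibility rather than a per-case discipline.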
The Scoring
I scored two things separately. Decision-class match: did Tessera recommend the same type of action? Not the exact same words, but the same strategic move. Escalate versus contain. Invest versus defer. Confront versus accommodate. Hold the line versus compromise.
Reasoning alignment: did Tessera surface the same concerns, identify the same risks, and frame the tradeoffs the same way, even when the final recommendation differed?
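The two scores can be sketched as follows. The taxonomy of decision classes comes from the examples above; treating reasoning alignment as Jaccard overlap of surfaced concerns is my assumption, since the post does not say how alignment was quantified:

```python
# Decision classes drawn from the examples in the post; the set is
# illustrative, not exhaustive.
DECISION_CLASSES = {"escalate", "contain", "invest", "defer",
                    "confront", "accommodate", "hold", "compromise"}

def decision_class_match(model_class: str, actual_class: str) -> bool:
    """Same strategic move, not the same words: exact label equality."""
    assert model_class in DECISION_CLASSES
    assert actual_class in DECISION_CLASSES
    return model_class == actual_class

def reasoning_alignment(model_concerns: set[str],
                        actual_concerns: set[str]) -> float:
    """Assumed metric: Jaccard overlap of the concerns each side raised,
    |intersection| / |union|, so it rewards surfacing the same risks
    even when the final recommendation differs."""
    if not model_concerns and not actual_concerns:
        return 1.0
    union = model_concerns | actual_concerns
    return len(model_concerns & actual_concerns) / len(union)
```

Scoring the two separately is what lets a case land as "different call, same reasoning", which the next section relies on.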
The Results
In my core operational domains, decision-class match exceeded 85%: Tessera recommended the same type of move I chose in more than five of every six cases. Reasoning alignment was higher still, above 90%. Even when her specific recommendation differed, she was weighing the same factors I was.
In less-documented domains, the numbers dropped to 60-70% for decision-class match, but Tessera’s behavior was appropriate: she became conservative, flagged uncertainty, and recommended further investigation rather than committing. That is exactly what I would want.
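Per-domain rates like these fall out of a simple aggregation over scored cases. A sketch, where the domain names and counts are invented to mirror the reported numbers, not real data from the experiment:

```python
from collections import defaultdict

def match_rate_by_domain(results: list[tuple[str, bool]]) -> dict[str, float]:
    """results: (domain, matched) pairs -> fraction matched per domain."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for domain, matched in results:
        counts[domain][0] += int(matched)  # hits
        counts[domain][1] += 1             # total
    return {d: hits / total for d, (hits, total) in counts.items()}

# Invented counts chosen only to illustrate the shape of the result:
# a well-documented domain near 85%, a thin one near 65%.
results = ([("core-ops", True)] * 17 + [("core-ops", False)] * 3
           + [("side-projects", True)] * 13 + [("side-projects", False)] * 7)
rates = match_rate_by_domain(results)
```

Splitting by domain is what makes the failure mode legible: a single pooled number would hide that the misses cluster where the training corpus is thinnest.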
This is not a marketing number. It is a defensible, reproducible result from a holdout evaluation against my own historical decisions, with all outcome information withheld from the model. Tessera is real.