Testing Against Real Crises: What Happened When I Used Tessera in Production

Theory is comfortable. Production is not. I have been running Tessera in parallel with my normal workflow for six weeks, using it during actual client situations and comparing its output to what I would have done without it. The results are encouraging, humbling, and instructive.

The Exchange Migration Incident

A client’s Exchange migration to M365 stalled mid-cutover. Mail flow was partially disrupted. Users were split between on-premises and cloud. The client’s CEO was calling every thirty minutes.

I queried Tessera: “Incident Replay for Exchange hybrid coexistence failures.” It found three prior incidents with similar characteristics. Two had successful resolutions. One had required a full rollback. The replay showed that in all three cases, the root cause was DNS propagation, but the resolution approaches differed based on the organization’s MX record configuration.

Tessera’s output: a three-paragraph briefing with the three precedents, the relevant DNS configurations from each, and a note that the current client’s configuration most closely matched Precedent Two, which was resolved by forcing a specific mail routing connector. Time from query to usable briefing: forty-seven seconds.
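To make "most closely matched" concrete, here is a minimal sketch of one way a precedent could be ranked against the current situation. It assumes each incident's configuration can be reduced to a set of feature strings, and it uses Jaccard similarity as a stand-in for whatever matching Tessera actually performs. The function names, precedent names, and feature values are all hypothetical, invented for illustration.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Share of features the two configurations have in common."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def closest_precedent(current: set[str], precedents: dict[str, set[str]]) -> tuple[str, float]:
    """Return the name and score of the precedent most similar to the current config."""
    scored = {name: jaccard(current, features) for name, features in precedents.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Hypothetical feature sets; not taken from any real incident.
precedents = {
    "precedent_one": {"mx:on_prem", "spf:strict", "connector:default"},
    "precedent_two": {"mx:cloud", "spf:soft", "connector:custom"},
    "precedent_three": {"mx:split", "spf:strict", "connector:default"},
}
current = {"mx:cloud", "spf:soft", "connector:default"}

best_name, best_score = closest_precedent(current, precedents)
print(best_name, best_score)
```

With these made-up features the second precedent wins because it shares two of four distinct features with the current configuration. A set-overlap score is crude, but it is easy to explain to a client on the phone, which counts for something mid-incident.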

Without Tessera, I would have reached the same conclusion, but it would have taken me fifteen to twenty minutes of email searching, ticket history review, and memory reconstruction. The forty-seven seconds mattered because the CEO was on the phone.

The Backup Failure

A managed services client reported that their backup chain had been broken for three weeks without alerting. The monitoring should have caught it. It did not. The question was not just how to fix the backups but how the monitoring failed and what other clients might be affected.

Tessera’s response to “what monitoring gaps have we identified in the last twelve months?” was less useful. It found four prior monitoring failures but could not establish a pattern because the enrichment pipeline had classified them as separate incidents rather than a systemic issue. The graph connections existed but the salience scores were too low for the pattern to surface.

This is a genuine failure mode. Tessera is only as good as the enrichment quality, and systemic patterns that span multiple low-salience incidents are the hardest to detect. I corrected the classification, boosted the salience, and the pattern now surfaces correctly. But I had to find it myself first. The system learned, but it did not discover.
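One possible mitigation, sketched under assumptions: instead of surfacing incidents only when an individual salience score clears a threshold, also count how many below-threshold incidents share a label. This is not how Tessera currently works. The Incident fields, the category labels, and both thresholds are hypothetical, and the approach only helps when the incidents already share some label or graph connection.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Incident:
    incident_id: str
    category: str    # hypothetical enrichment label
    salience: float  # hypothetical score in [0, 1]


def systemic_candidates(incidents, salience_threshold=0.5, min_count=3):
    """Flag categories with several incidents that each sit below the surfacing threshold."""
    low_by_category = defaultdict(list)
    for inc in incidents:
        if inc.salience < salience_threshold:
            low_by_category[inc.category].append(inc.incident_id)
    # Keep only categories where the quiet incidents add up to a possible pattern.
    return {cat: ids for cat, ids in low_by_category.items() if len(ids) >= min_count}


# Invented sample data: four quiet monitoring failures plus unrelated noise.
incidents = [
    Incident("INC-1", "monitoring_gap", 0.20),
    Incident("INC-2", "monitoring_gap", 0.15),
    Incident("INC-3", "monitoring_gap", 0.30),
    Incident("INC-4", "monitoring_gap", 0.25),
    Incident("INC-5", "mail_flow", 0.90),
    Incident("INC-6", "printer", 0.10),
]

flagged = systemic_candidates(incidents)
print(flagged)
```

Here the four monitoring failures are flagged together even though none would surface alone, while the single high-salience mail incident and the lone printer ticket are left out. The trade-off is noise: a low min_count will flag categories that are merely common, not systemic.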

What I Learned

Tessera excels at single-incident remediation support, where there is a clear precedent to match against. It struggles with systemic pattern detection across multiple low-profile incidents. It is faster than manual research by a factor of five to twenty, depending on the complexity. And the verification layer caught two instances where the language model hallucinated specific technical details that were plausible but wrong.

The verification layer earning its keep twice in six weeks is reason enough to keep it. In both cases, I would have caught the error myself eventually, but “eventually” during a crisis is too late.
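For a sense of what such a check can look like, here is a deliberately simple sketch: pull hostname-like tokens out of a generated briefing and report any that appear in none of the source records. This is an illustration of the general idea, not Tessera's verification layer. The regular expression, the function name, and the sample text are assumptions, and a real check would need to cover far more than dotted names.

```python
import re

# Hypothetical pattern: tokens that look like hostnames or dotted identifiers.
DETAIL_PATTERN = re.compile(r"\b[a-z0-9-]+(?:\.[a-z0-9-]+)+\b", re.IGNORECASE)


def unsupported_details(briefing: str, sources: list[str]) -> list[str]:
    """Return dotted details in the briefing that appear in none of the source records."""
    corpus = " ".join(sources).lower()
    details = {m.group(0).lower() for m in DETAIL_PATTERN.finditer(briefing)}
    return sorted(d for d in details if d not in corpus)


# Invented example text using reserved example domains.
briefing = "Point the MX record at mail.example.com and route via relay.example.net."
sources = ["Ticket notes: MX record changed to mail.example.com after cutover."]

suspect = unsupported_details(briefing, sources)
print(suspect)
```

In this example the relay hostname is flagged because no source record mentions it, which is exactly the plausible-but-unsupported kind of detail worth a second look before acting on it. A flag means "unverified", not "wrong"; the human still decides.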

Production use has shifted my development priorities. Less time on generation quality, more time on enrichment accuracy and systemic pattern detection. The architecture is right. The data quality needs work. That is a solvable problem.