The prompt engineering for Tessera is unlike any other prompt work I have done. I am not designing prompts for a general-purpose assistant. I am designing prompts for a system that needs to think like me, respond in patterns I find useful, and maintain a level of rigor that matches my expectations.
The system prompt is a living document. It has been rewritten fourteen times and will be rewritten again. Each rewrite reflects something I learned about how the local model interprets instructions, how retrieval results should be integrated into generation, and what kind of responses are actually useful to me under pressure.
The Persona Problem
Tessera does not have a persona. It is not pretending to be a friendly assistant, a sarcastic colleague, or an authoritative expert. It is a system that presents information, surfaces patterns, and identifies gaps. The tone is clinical and direct. I do not need warmth from a tool. I need accuracy and speed.
This was a deliberate rejection of the industry trend toward anthropomorphized AI. Tessera’s responses begin with the answer, not with an acknowledgment of the question. They end with a confidence indicator, not with a pleasantry. Every word in the output is there because it carries information. Nothing is decorative.
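The response contract described above, answer first, confidence indicator last, nothing decorative, can be sketched as a simple validator. The function name, the constant, and the `CONFIDENCE:` line format are my own assumptions for illustration, not Tessera's actual output format.

```python
# Sketch of a response-contract check (names and format are assumptions,
# not Tessera's real conventions): a valid response ends with a bare
# confidence indicator line and contains no trailing pleasantry.
CONFIDENCE_LEVELS = ("HIGH", "MEDIUM", "LOW")

def is_valid_response(text: str) -> bool:
    """Check that a response ends with a confidence indicator line."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines:
        return False
    last = lines[-1]
    return (last.startswith("CONFIDENCE:")
            and last.split(":", 1)[1].strip() in CONFIDENCE_LEVELS)
```

A check like this is cheap to run on every generation, which matters when the model occasionally drifts back toward conversational filler.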
Retrieval-Augmented Prompting
The prompt structure changes based on query type. For Factual Lookup, the prompt instructs the model to answer strictly from retrieved sources with no elaboration. For Pattern Matching, the prompt instructs the model to identify commonalities across retrieved artifacts and present them as observations. For Action Planning, the prompt provides a structured template that the model fills with retrieved information.
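The dispatch by query type might look like the following sketch. The template wording paraphrases the instructions described above; none of it is Tessera's actual prompt text.

```python
# Hypothetical per-query-type prompt dispatch. Template wording is my
# paraphrase of the described behavior, not the system's real prompts.
TEMPLATES = {
    "factual_lookup": (
        "Answer strictly from the sources below. Do not elaborate.\n"
        "SOURCES:\n{sources}\n\nQUERY: {query}"
    ),
    "pattern_matching": (
        "Identify commonalities across the artifacts below and present "
        "them as observations.\nARTIFACTS:\n{sources}\n\nQUERY: {query}"
    ),
    "action_planning": (
        "Fill the plan template below using only the retrieved information.\n"
        "PLAN:\n1. Objective:\n2. Steps:\n3. Risks:\n\n"
        "SOURCES:\n{sources}\n\nQUERY: {query}"
    ),
}

def build_prompt(query_type: str, sources: str, query: str) -> str:
    """Select the template for a classified query type and fill it."""
    return TEMPLATES[query_type].format(sources=sources, query=query)
```

Keeping the templates in one table makes the per-type differences easy to audit when a rewrite of the system prompt changes one query type but not the others.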
The most complex prompt is for Incident Replay. It instructs the model to reconstruct a chronological narrative from the retrieved decision chain, identify the key inflection points, note what worked and what did not, and draw parallels to the current situation. The prompt is nearly two thousand tokens long, which is expensive in a local model’s limited context window, but the quality of the output justifies the cost.
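A guess at the Incident Replay prompt's shape, with the four tasks above made explicit, plus the kind of crude token estimate that shows why it dominates the budget. The wording is mine, not the real two-thousand-token prompt, and the 1.3 tokens-per-word ratio is a rough rule of thumb, not a measurement of any particular tokenizer.

```python
# A sketch of the Incident Replay prompt's structure (wording is my
# paraphrase, not the actual prompt). The numbered instructions mirror
# the four tasks: chronology, inflection points, lessons, parallels.
INCIDENT_REPLAY = """\
You are reconstructing a past incident from the decision chain below.
1. Rebuild the chronological narrative of decisions and outcomes.
2. Identify the key inflection points where the outcome could have changed.
3. Note what worked and what did not.
4. Draw parallels to the current situation described in the query.
DECISION CHAIN:
{chain}
CURRENT SITUATION:
{query}
"""

def rough_token_count(text: str) -> int:
    # Crude heuristic: roughly 1.3 tokens per whitespace-separated word.
    # A real implementation would use the model's own tokenizer.
    return int(len(text.split()) * 1.3)
```

Measuring the filled prompt with the model's actual tokenizer, rather than a heuristic, is what makes a two-thousand-token cost a known quantity instead of a surprise at inference time.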
Context Budget Management
Local models have smaller context windows than cloud models. The model I am currently running supports eight thousand tokens. After the system prompt, the retrieval results, and the query, there are roughly four thousand tokens available for generation. At a typical English tokenization of about 0.75 words per token, that is roughly three thousand words as a hard ceiling, not a comfortable allowance.
This constraint forces discipline. The retrieval system must be precise because there is no room for marginally relevant context. The prompt must be efficient because every wasted token in the prompt is a token stolen from the response. The response itself must be dense because there is no room for filler.
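The budget discipline described above can be sketched as greedy packing: spend the fixed allocations first, then keep only the highest-ranked retrieved chunks that fit in what remains. The constants and names are illustrative (the 2000-token system prompt figure comes from the Incident Replay case; the query allowance is my guess), not values from Tessera.

```python
# Sketch of context-budget enforcement (constants are assumptions, not
# Tessera's actual configuration). Chunks are packed greedily in rank
# order; anything that would overflow the retrieval budget is dropped.
CONTEXT_WINDOW = 8000
SYSTEM_PROMPT_TOKENS = 2000   # worst case: the Incident Replay prompt
QUERY_TOKENS = 200            # generous allowance for the user's query
GENERATION_RESERVE = 4000     # tokens kept free for the response

RETRIEVAL_BUDGET = (CONTEXT_WINDOW - SYSTEM_PROMPT_TOKENS
                    - QUERY_TOKENS - GENERATION_RESERVE)

def pack_chunks(chunks, budget=RETRIEVAL_BUDGET,
                tokens=lambda s: int(len(s.split()) * 1.3)):
    """Keep the highest-ranked chunks that fit the budget; drop the rest."""
    kept, used = [], 0
    for chunk in chunks:          # chunks assumed pre-sorted by relevance
        cost = tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept
```

Dropping chunks outright, rather than truncating them mid-sentence, is the design choice implied by "no room for marginally relevant context": a partial chunk costs tokens without reliably carrying its information.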
I have come to see the context limitation as a design feature rather than a defect. It forces the system to be concise, which is what I want from an assistant that I consult during time-critical situations. A thousand-word essay on the history of Exchange Server failures is less useful than a three-paragraph briefing that tells me what I need to know right now.