Raw artifacts are useless to Tessera. An email is a sequence of characters. A PDF is a layout specification. A meeting note is unstructured text. The enrichment pipeline transforms these raw inputs into structured knowledge that the retrieval system can actually work with.
Enrichment is the most computationally expensive part of the system, and it is where the quality of everything downstream is determined. Bad enrichment produces bad retrieval, which produces bad results. There is no shortcut.
The Five Enrichment Stages
Entity Recognition. Identify people, organizations, technologies, dates, locations, and concepts mentioned in the artifact. This uses a combination of named entity recognition and domain-specific dictionaries built from my own corpus. The NER model knows my clients, my team, the products I work with, and the terminology I use.
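In sketch form, the overlay works something like this. I am using spaCy here as a stand-in for the actual model, and the dictionary entries are illustrative, not real pipeline code:

```python
# Minimal sketch: generic NER plus a domain-dictionary overlay.
# Assumes spaCy and its small English model are installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")

# Hypothetical dictionary entries: shorthand -> canonical entity.
DOMAIN_TERMS = {
    "CW": "ConnectWise",
    "the Meridian situation": "Meridian client incident",
}
CANONICAL = {k.lower(): v for k, v in DOMAIN_TERMS.items()}

# The PhraseMatcher catches domain terms the generic model misses.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("DOMAIN", [nlp.make_doc(t) for t in DOMAIN_TERMS])

def extract_entities(text: str) -> list[tuple[str, str]]:
    doc = nlp(text)
    # Start with whatever the generic model finds.
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    # Overlay dictionary hits, resolved to their canonical names.
    for _, start, end in matcher(doc):
        span = doc[start:end]
        entities.append((CANONICAL.get(span.text.lower(), span.text), "DOMAIN"))
    return entities

print(extract_entities("CW tickets spiked after the Meridian situation."))
```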
Relationship Extraction. Determine how the identified entities relate to each other within the artifact. Did Person A assign a task to Person B? Did Organization X deploy Technology Y? Did I decide to change an approach? Relationship extraction creates the edges in the knowledge graph.
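A toy version of the idea, reduced to subject-verb-object triples over a dependency parse. The real extractor handles far more relation types; this only shows the shape of an edge:

```python
# Sketch: relations as subject-verb-object triples from a dependency
# parse. Assumes the spaCy model loaded above; the Edge shape is
# illustrative, not Tessera's actual graph schema.
from dataclasses import dataclass
import spacy

nlp = spacy.load("en_core_web_sm")

@dataclass
class Edge:
    source: str    # e.g. a person or organization
    relation: str  # the connecting verb, lemmatized
    target: str    # e.g. a task, technology, or person

def extract_edges(text: str) -> list[Edge]:
    edges = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ != "VERB":
                continue
            subjects = [c for c in token.children
                        if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children
                       if c.dep_ in ("dobj", "attr", "dative")]
            for s in subjects:
                for o in objects:
                    edges.append(Edge(s.text, token.lemma_, o.text))
    return edges

print(extract_edges("Alice assigned the migration to Bob."))
```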
Decision Identification. The most critical stage. Scan the artifact for evidence of decisions: actions taken, directions given, options chosen, approaches rejected. Each identified decision becomes a node in the graph with edges to the people, organizations, technologies, and outcomes it involves.
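The simplest possible version is a cue-phrase scan. The production pipeline uses a trained model rather than a regex, so treat the cue list below as purely illustrative:

```python
# Sketch: decision identification via cue phrases. The cue list and
# DecisionNode shape are hypothetical; the real stage is model-based.
import re
from dataclasses import dataclass, field

DECISION_CUES = re.compile(
    r"\b(decided to|agreed to|chose|went with|rejected|ruled out)\b",
    re.IGNORECASE,
)

@dataclass
class DecisionNode:
    text: str  # the sentence containing the decision
    cue: str   # which cue phrase fired
    entities: list[str] = field(default_factory=list)  # graph edges out

def find_decisions(sentences: list[str]) -> list[DecisionNode]:
    decisions = []
    for sentence in sentences:
        match = DECISION_CUES.search(sentence)
        if match:
            decisions.append(DecisionNode(sentence, match.group(1).lower()))
    return decisions

notes = ["We decided to move the backup jobs off CW.", "Lunch was fine."]
print(find_decisions(notes))
```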
Domain Classification. Assign the artifact to one or more life domains: professional, personal, health, financial, technical, strategic. Domain classification determines which context windows the artifact updates and which retrieval queries it is eligible to answer.
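Conceptually this is multi-label classification: one artifact can land in several domains at once. A keyword-scoring sketch, with made-up keyword sets standing in for the trained classifier:

```python
# Sketch: multi-label domain classification. Keyword sets are
# hypothetical stand-ins for a classifier trained on my corpus.
DOMAIN_KEYWORDS = {
    "professional": {"client", "ticket", "deadline", "meeting"},
    "health": {"doctor", "sleep", "workout"},
    "financial": {"invoice", "budget", "payment"},
    "technical": {"deploy", "server", "backup", "migration"},
}

def classify_domains(text: str, threshold: int = 1) -> list[str]:
    words = set(text.lower().split())
    # An artifact may belong to more than one domain.
    return [
        domain for domain, keywords in DOMAIN_KEYWORDS.items()
        if len(words & keywords) >= threshold
    ]

print(classify_domains("Client meeting about the server migration budget"))
```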
Salience Scoring. Assign initial salience scores based on the consequence, recurrence, and emotional weight signals present in the artifact. A crisis escalation email gets high initial salience. A routine status update gets low salience. The scores are refined over time as the artifact is referenced or ignored.
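At its core the initial score is a weighted combination of those three signals. The weights below are illustrative placeholders, not the tuned values:

```python
# Sketch: initial salience as a weighted sum of three signals.
# Weights are hypothetical; the real scores are later refined by
# how often the artifact is actually referenced or ignored.
from dataclasses import dataclass

@dataclass
class SalienceSignals:
    consequence: float       # 0..1, how much hangs on this artifact
    recurrence: float        # 0..1, how often the topic resurfaces
    emotional_weight: float  # 0..1, urgency and affect cues in the text

WEIGHTS = {"consequence": 0.5, "recurrence": 0.3, "emotional_weight": 0.2}

def initial_salience(s: SalienceSignals) -> float:
    return (
        WEIGHTS["consequence"] * s.consequence
        + WEIGHTS["recurrence"] * s.recurrence
        + WEIGHTS["emotional_weight"] * s.emotional_weight
    )

crisis = SalienceSignals(consequence=0.9, recurrence=0.4, emotional_weight=0.8)
status = SalienceSignals(consequence=0.1, recurrence=0.2, emotional_weight=0.05)
print(initial_salience(crisis), initial_salience(status))  # high vs. low
```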
The Domain Dictionary
Generic NER models do not know that “CW” means ConnectWise in my context, or that “the Meridian situation” refers to a specific client incident. The domain dictionary is a continuously growing reference that maps my shorthand, abbreviations, and contextual references to their full meaning.
Building the initial dictionary was manual. I reviewed the first thousand enriched artifacts, corrected the entity recognition errors, and fed the corrections back as dictionary entries. After the first thousand, the accuracy improved enough that corrections became less frequent. The dictionary now has about three thousand entries and grows by a few dozen per week.
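Mechanically, the feedback loop is simple: every correction from review becomes a permanent mapping. A sketch of that loop, assuming a flat JSON file as the store (the real entries carry more metadata than a single string):

```python
# Sketch: folding NER corrections back into the domain dictionary.
# The JSON file path and single-string entries are simplifications.
import json
from pathlib import Path

DICT_PATH = Path("domain_dictionary.json")  # hypothetical storage

def load_dictionary() -> dict[str, str]:
    if DICT_PATH.exists():
        return json.loads(DICT_PATH.read_text())
    return {}

def add_correction(shorthand: str, canonical: str) -> None:
    """A reviewed correction becomes a dictionary entry."""
    dictionary = load_dictionary()
    dictionary[shorthand] = canonical
    DICT_PATH.write_text(json.dumps(dictionary, indent=2, sort_keys=True))

add_correction("CW", "ConnectWise")
add_correction("the Meridian situation", "Meridian client incident")
print(load_dictionary())
```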
This is one of the advantages of a personal system. The dictionary is mine. It encodes my language, my references, my world. No generic model has this, and no amount of fine-tuning on public data would produce it. The enrichment pipeline’s accuracy on my corpus is higher than any general-purpose system could achieve because it has been trained on my specific domain.
Quality Control
Enrichment is not fire-and-forget. I run periodic quality audits on randomly sampled enriched artifacts. Are the entities correct? Are the relationships accurate? Are the decisions properly identified? The audit results feed back into the enrichment models and the domain dictionary.
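The audit itself is unglamorous: sample, judge each stage by hand, tally. Something like the sketch below, where the sample size and pass/fail recording format are my own illustrative choices:

```python
# Sketch: random-sample quality audit with per-stage accuracy.
# Sample size and the pass/fail dict format are illustrative.
import random

STAGES = ["entities", "relationships", "decisions", "domains", "salience"]

def sample_for_audit(artifact_ids: list[str], k: int = 20) -> list[str]:
    """Pick a random sample of enriched artifacts for manual review."""
    return random.sample(artifact_ids, min(k, len(artifact_ids)))

def record_audit(results: list[dict[str, bool]]) -> dict[str, float]:
    """Per-stage accuracy from manual pass/fail judgments."""
    return {
        stage: sum(r[stage] for r in results) / len(results)
        for stage in STAGES
    }

# Two hand-reviewed artifacts: True means that stage's output was correct.
reviews = [
    {"entities": True, "relationships": True, "decisions": False,
     "domains": True, "salience": True},
    {"entities": True, "relationships": False, "decisions": True,
     "domains": True, "salience": False},
]
print(record_audit(reviews))
```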
Current accuracy by stage: Entity Recognition 94%, Relationship Extraction 82%, Decision Identification 78%, Domain Classification 96%, Salience Scoring 71%. The weakest stages, Decision Identification and Salience Scoring, are the most subjective. They will improve with more training data, but they may never reach the accuracy of the more objective stages. That is acceptable as long as the verification layer catches the errors at query time.