The retrieval architecture is only as good as the data behind it. For Tessera to work, it needs to ingest, parse, classify, and index twenty-three years of professional and personal artifacts. The volume is significant. The variety is worse.
Emails in multiple formats. Documents in Word, PDF, plain text, and markdown. Meeting notes in three different apps across two decades. Code repositories. Ticket system exports. CRM records. Personal journals. Strategic plans. Technical diagrams described in text because the originals are lost. The corpus is a geological record of a career, and it is messy in ways that no clean dataset has ever been.
The Ingestion Pipeline
The pipeline has five stages: acquisition, normalization, extraction, enrichment, and indexing. Acquisition pulls artifacts from source systems. Normalization converts everything to a common format. Extraction pulls structured information from unstructured content. Enrichment adds metadata, relationships, and classifications. Indexing pushes the enriched artifacts into the three retrieval stores: vector, graph, and lexical.
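The five stages can be sketched as a chain of small functions. This is a minimal illustration, not Tessera's actual implementation: the `Artifact` class, the function names, and the list-backed stores are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    source_id: str           # identifier from the source system
    raw: bytes               # bytes as acquired
    text: str = ""           # normalized plain-text form
    metadata: dict = field(default_factory=dict)

def normalize(a: Artifact) -> Artifact:
    # Convert the raw bytes into a common plain-text representation.
    a.text = a.raw.decode("utf-8", errors="replace")
    return a

def extract(a: Artifact) -> Artifact:
    # Pull structured information from the unstructured content
    # (naively here: treat the first line as a title).
    a.metadata["title"] = a.text.splitlines()[0] if a.text else ""
    return a

def enrich(a: Artifact) -> Artifact:
    # Add metadata, relationships, and classifications (stubbed).
    a.metadata["domain"] = "unclassified"
    return a

def index(a: Artifact, stores: dict) -> None:
    # Push the enriched artifact into all three retrieval stores.
    for store in stores.values():
        store.append(a)

def ingest(raw_items, stores):
    # Acquisition is modeled as the raw_items iterable itself.
    for source_id, raw in raw_items:
        index(enrich(extract(normalize(Artifact(source_id, raw)))), stores)

stores = {"vector": [], "graph": [], "lexical": []}
ingest([("email-001", b"Quarterly plan\nBody text...")], stores)
```

The point of the shape is that each stage takes and returns the same artifact type, so stages can fail, retry, or be re-run independently.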
Each stage has its own failure modes. Acquisition fails when source systems have changed formats, lost data, or require authentication that expired years ago. Normalization fails on edge cases: emails with nested attachments, PDFs with scanned images, documents with mixed encodings. Extraction fails when the content is ambiguous, referential, or context-dependent.
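The mixed-encodings case, for instance, admits a best-effort fallback chain. A sketch of one such approach, assuming nothing about how Tessera actually handles it:

```python
def decode_best_effort(raw: bytes) -> str:
    # Try UTF-8 first, then Windows-1252 (common in old Office exports).
    for enc in ("utf-8", "cp1252"):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every possible byte, so it is a lossless last resort:
    # the result may look wrong, but no data is silently dropped.
    return raw.decode("latin-1")

decode_best_effort(b"caf\xe9")  # a cp1252-encoded "café"
```

The ordering matters: UTF-8 is strict enough that a successful decode is strong evidence the guess was right, so it has to come before the permissive encodings.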
The enrichment stage is where the real work happens. This is where Tessera identifies decisions, maps relationships, classifies domains, and assigns confidence scores. A well-enriched artifact is useful across all three retrieval modes. A poorly enriched artifact is noise.
The Decision Extraction Problem
The hardest part of enrichment is decision extraction. Most artifacts do not explicitly state “I decided X because of Y.” The decision is implicit in the action taken, the email sent, the configuration changed, the proposal accepted. Teaching Tessera to infer decisions from actions is the core natural language understanding challenge.
I am using a multi-pass approach. First pass: identify action verbs and their objects. Second pass: identify the context and constraints that surrounded the action. Third pass: classify the action as a decision, a directive, an observation, or noise. Fourth pass: link the decision to related decisions in the graph.
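The first three passes can be caricatured with keyword heuristics. The verb lists and regex below are illustrative placeholders, far cruder than a real extraction model; the fourth pass (graph linking) is omitted.

```python
import re

# Hypothetical vocabularies; a real extractor would learn these.
ACTION_VERBS = {"chose", "selected", "approved", "migrated", "rejected"}
DIRECTIVE_VERBS = {"must", "should", "will"}

def first_pass_actions(sentence: str):
    # Pass 1: find an action verb and its object (naively, the rest
    # of the clause after the verb).
    words = sentence.lower().split()
    for i, w in enumerate(words):
        if w in ACTION_VERBS:
            return w, " ".join(words[i + 1:])
    return None

def second_pass_context(sentence: str):
    # Pass 2: capture the constraint that surrounded the action
    # ("because ...", "due to ...", "given ...").
    m = re.search(r"(?:because|due to|given)\s+(.*)", sentence, re.IGNORECASE)
    return m.group(1) if m else None

def third_pass_classify(sentence: str) -> str:
    # Pass 3: label the sentence as decision, directive,
    # observation, or noise.
    if first_pass_actions(sentence):
        return "decision"
    if any(v in sentence.lower().split() for v in DIRECTIVE_VERBS):
        return "directive"
    if sentence.strip():
        return "observation"
    return "noise"

third_pass_classify("We migrated to Postgres because the license changed.")
```

Even this caricature shows why the passes are ordered: the context pass only has something to attach to once the action pass has found a verb, and classification depends on both.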
The accuracy is about seventy percent on the first automated pass, which means thirty percent of decisions are either missed or misclassified. That is not acceptable for production, but it is acceptable for building a training set. I am reviewing the failures manually, correcting the classifications, and feeding the corrections back into the extraction model. Each iteration improves.
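The review loop itself is simple bookkeeping. A sketch, assuming predictions and human-reviewed gold labels keyed by artifact id (the retraining step is out of scope here):

```python
def accuracy(predictions: dict, gold: dict) -> float:
    # Fraction of reviewed items the extractor got right.
    hits = sum(1 for k, label in gold.items() if predictions.get(k) == label)
    return hits / len(gold)

def corrections(predictions: dict, gold: dict) -> list:
    # Misclassified items paired with the corrected label,
    # ready to feed back into the next training iteration.
    return [(k, label) for k, label in gold.items() if predictions.get(k) != label]

preds = {"a1": "decision", "a2": "noise", "a3": "decision"}
gold = {"a1": "decision", "a2": "directive", "a3": "decision"}
```

Here `accuracy(preds, gold)` is two out of three, and `corrections` yields the single misclassified item for the next training round.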
Scale and Performance
The full corpus is approximately two hundred thousand artifacts. At current processing speed, initial ingestion takes about seventy-two hours. That is a one-time cost. Incremental ingestion of new artifacts takes seconds.
The graph currently has about four hundred thousand nodes and over a million edges. Query performance is acceptable: graph traversal completes in under two hundred milliseconds for most queries. Vector search is faster. The fusion layer adds about three hundred milliseconds. Total retrieval time is under a second for most queries, which is well within the target for interactive use.
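One common way to build such a fusion layer is reciprocal rank fusion, which combines ranked lists without needing comparable scores across stores. This is an assumption, not a claim about Tessera's actual method; the constant k=60 comes from the original RRF paper.

```python
from collections import defaultdict

def fuse(ranked_lists, k: int = 60):
    # Reciprocal rank fusion: each list contributes 1/(k + rank)
    # per document, so documents ranked well by several stores win.
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector = ["d3", "d1", "d2"]
graph = ["d1", "d4"]
lexical = ["d1", "d3"]
fused = fuse([vector, graph, lexical])
```

The appeal for a three-store setup is that vector similarities, graph path lengths, and lexical scores never have to be normalized against each other; only ranks matter.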
Memory is the constraint. The full graph plus vector index plus lexical index requires about sixteen gigabytes of RAM. That is fine for a dedicated machine. It is a problem for the air-gapped deployment I want, where the target hardware is a standard laptop. Optimization work is coming.