A deep dive into the matching pipeline, architecture, and technical implementation
The matching engine orchestrates a sophisticated pipeline that combines multiple signals for robust account matching. Each stage builds on the previous one to progressively refine and improve match quality.
Generate dense vector representations of each account using state-of-the-art embedding models. The system composes a rich text representation from the account's hierarchy path, name, code, and metadata hints before generating embeddings.
```
Assets > Current Assets > Cash > Cash EUR | Code: 1-100-100-30-10 | Hints: isBank, hasIBAN, currency: EUR
```

For each account in the left dataset (e.g., Talix), find the top-K most similar accounts from the right dataset (e.g., CustomCOA1Csv) using cosine similarity on the embedding vectors. This efficiently narrows the candidate space.
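The top-K retrieval step can be sketched as plain cosine similarity over the embedding vectors. This is a minimal illustration, not the production retrieval code; the function names are hypothetical.

```typescript
// Cosine similarity between two embedding vectors (illustrative sketch).
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// For one left-side embedding, return indices of the top-K most similar
// right-side embeddings.
function topK(left: number[], right: number[][], k: number): number[] {
  return right
    .map((vec, idx) => ({ idx, sim: cosine(left, vec) }))
    .sort((x, y) => y.sim - x.sim)
    .slice(0, k)
    .map((c) => c.idx);
}
```

A brute-force scan like this is fine for charts of accounts (hundreds to thousands of rows); an approximate nearest-neighbor index only pays off at much larger scale.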
Apply rule-based scoring to each candidate pair. Heuristics capture domain-specific knowledge that pure ML approaches might miss, such as code structure patterns and account type indicators.
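A heuristic scorer of this kind might look as follows. The specific rules and weights below are hypothetical examples of the signals described above (code structure, hierarchy depth, hints), not the engine's actual rule set.

```typescript
// Illustrative heuristic scorer; rules and weights are hypothetical examples.
interface AccountLike {
  CodeParts: number[];
  PathNames: string[];
  Hints: { isBank?: boolean; currencies?: string[] };
}

function heuristicScore(left: AccountLike, right: AccountLike): number {
  let score = 0;
  // The leading code segment often encodes the account class (assets, liabilities, ...).
  if (left.CodeParts[0] === right.CodeParts[0]) score += 0.4;
  // Similar hierarchy depth suggests comparable granularity.
  if (Math.abs(left.PathNames.length - right.PathNames.length) <= 1) score += 0.3;
  // Matching hints (e.g., both flagged as bank accounts) are strong signals.
  if (left.Hints.isBank === right.Hints.isBank) score += 0.2;
  if (left.Hints.currencies?.some((c) => right.Hints.currencies?.includes(c))) score += 0.1;
  return Math.min(score, 1);
}
```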
Leverage large language models to provide semantic understanding and contextual reranking of candidates. The LLM sees the full account context and provides both a score and an explanation for each match.
Candidates are batched (up to rerankMaxPerLeft per account) to minimize API calls. Structured JSON output ensures consistent parsing.
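The batching logic can be sketched as below. `callLlm` is a hypothetical stand-in for one provider request; the real prompt format and API call are not shown.

```typescript
// Sketch of candidate batching for LLM reranking (illustrative; `callLlm` is a stub).
interface RerankResult { rightCode: string; score: number; why: string }

function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) batches.push(items.slice(i, i + size));
  return batches;
}

function rerankBatched(
  candidateCodes: string[],
  rerankMaxPerLeft: number,
  callLlm: (batch: string[]) => string, // one API request per batch, returns JSON text
): RerankResult[] {
  const results: RerankResult[] = [];
  for (const batch of chunk(candidateCodes, rerankMaxPerLeft)) {
    // Structured JSON output keeps parsing deterministic across batches.
    results.push(...(JSON.parse(callLlm(batch)) as RerankResult[]));
  }
  return results;
}
```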
Combine the three signals (embeddings, heuristics, LLM) using configurable weights. Additionally, apply a mutuality boost to bidirectional strong matches (when A→B and B→A are both high-confidence).
Apply a global confidence threshold to filter out low-quality matches, then cap each left account to its top-M matches by combined score. This ensures manageable output while preserving many-to-many relationships.
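The filtering step reduces to a filter, sort, and cap per left account. A minimal sketch, with names mirroring the configuration keys:

```typescript
// Final filtering: drop matches below the confidence threshold,
// then keep only the top-M per left account (illustrative sketch).
interface ScoredMatch { rightCode: string; confidence: number }

function finalize(matches: ScoredMatch[], threshold: number, topM: number): ScoredMatch[] {
  return matches
    .filter((m) => m.confidence >= threshold)     // global confidence floor
    .sort((a, b) => b.confidence - a.confidence)  // best first
    .slice(0, topM);                              // cap per left account
}
```

Because the cap applies per left account rather than globally, many-to-many relationships survive: several left accounts may keep the same right account among their matches.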
The system supports multiple AI providers for both embeddings and LLM inference. You can configure credentials for one or both, and the system will auto-select or you can specify your preference.
Configure AWS credentials with access to Bedrock and Translate services.
```shell
export AWS_REGION=eu-central-1
export BEDROCK_EMBEDDINGS_MODEL_ID=amazon.titan-embed-text-v2:0
export BEDROCK_CLAUDE_MODEL_ID=anthropic.claude-3-5-sonnet-20240620-v1:0
```

Note: Bedrock is the default provider. If credentials are available, it will be used unless explicitly overridden.
Configure OpenAI API credentials for embeddings and chat completion.
```shell
export OPENAI_API_KEY=sk-...
export OPENAI_ORGANIZATION=org-...  # optional
```

All data sources are normalized to a unified account schema before matching. This abstraction allows the core matching engine to work with any chart of accounts format through plugin adapters.
```typescript
interface UnifiedAccount {
  Code: string          // Hierarchical account code (e.g., "1-100-100-30-10")
  Name: string          // English display name (auto-translated if needed)
  PathNames: string[]   // Full hierarchy from root to leaf ["Assets", "Current Assets", ...]
  CodeParts: number[]   // Parsed numeric segments [1, 100, 100, 30, 10]
  Hints: {
    isBank?: boolean      // Bank account indicator
    hasIBAN?: boolean     // Contains IBAN pattern
    currencies?: string[] // Currency codes (EUR, USD, etc.)
  }
  // ... extensible metadata fields
}
```

- A single representation ensures consistency across all inputs
- Custom fields can be added for domain-specific needs
- TypeScript interfaces ensure correctness
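A plugin adapter's job is then a pure mapping from a source row into this schema. The sketch below is hypothetical: the raw CSV field names and the IBAN check are illustrative, not the real adapter format.

```typescript
// Hypothetical adapter sketch: normalize one raw CSV row into the unified schema.
// The raw field names (code, name, parents) are illustrative only.
interface RawCsvRow { code: string; name: string; parents: string }

function toUnifiedAccount(row: RawCsvRow) {
  return {
    Code: row.code,
    Name: row.name,
    // Hierarchy path from root to leaf, ending with the account's own name.
    PathNames: [...row.parents.split(">").map((p) => p.trim()), row.name],
    // Parse "1-100-100-30-10" into numeric segments.
    CodeParts: row.code.split("-").map(Number),
    Hints: {
      // A simple structural check; real adapters may use richer detection.
      hasIBAN: /\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b/.test(row.name),
    },
  };
}
```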
The final confidence score for each match is a weighted combination of multiple signals. This multi-signal approach provides robustness and handles edge cases that single-method approaches might miss.
```
combined = (0.65 × embed) + (0.25 × llm) + (0.10 × heur) + mutualBoost
```

- embed (0.65): Cosine similarity between embeddings. Captures semantic similarity based on hierarchy, naming, and context.
- llm (0.25): Semantic reranking from Claude/GPT. Provides contextual understanding and reasoning about account relationships.
- heur (0.10): Rule-based scoring for code patterns, depth, and hints. Captures domain expertise and structural patterns.
- mutualBoost: Bonus applied when both A→B and B→A are strong matches. Increases confidence in bidirectional relationships.
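In code, the combination is a direct weighted sum, shown here with the default weights (embed 0.65, llm 0.25, heur 0.10, mutualBoost 0.05); the cap at 1 is an assumption of this sketch:

```typescript
// Weighted combination of the three signals plus the mutuality boost
// (sketch using the default weights; capping at 1 is an assumption).
interface Signals { embed: number; llm: number; heur: number; mutual: boolean }

function combinedScore(s: Signals, mutualBoost = 0.05): number {
  const base = 0.65 * s.embed + 0.25 * s.llm + 0.10 * s.heur;
  // Boost only when the A→B / B→A relationship is strong in both directions.
  return Math.min(base + (s.mutual ? mutualBoost : 0), 1);
}
```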
All matching behavior can be tuned without code changes through configuration files and environment variables.
```jsonc
{
  "threshold": 0.75,          // Minimum confidence to include
  "topK": 12,                 // Candidates per account (KNN)
  "topM": 5,                  // Max output matches per account
  "rerankMaxPerLeft": 8,      // Max candidates for LLM rerank
  "scoreWeights": {
    "embed": 0.65,
    "llm": 0.25,
    "heur": 0.10,
    "mutualBoost": 0.05
  },
  "enableTranslate": true,    // Auto-translate names
  "maskIbanInPrompts": true,  // Mask IBAN patterns
  "bedrock": {
    "region": "eu-central-1",
    "embeddingsModelId": "amazon.titan-embed-text-v2:0",
    "claudeModelId": "anthropic.claude-3-5-sonnet-20240620-v1:0"
  }
}
```

Tip: Create a local talixmap.config.json file to override defaults without modifying version-controlled settings.
The system produces a single unified JSON structure with complete match information. When run with --explain, it includes detailed score breakdowns and AI rationale embedded within each match.
All matching results in a single hierarchical structure
```jsonc
{
  "LeftSideAccounts": {
    "1-100-100-30-10": {
      "LeftAccount": {
        "Code": "1-100-100-30-10",
        "OriginalCode": "1-100-100-30-10",
        "Name": "Cash EUR",
        "PathNames": ["Assets", "Current Assets", "Cash", "Cash EUR"],
        "Hints": { "isBank": true, "hasIBAN": true, "currencies": ["EUR"] }
      },
      "Matches": [
        {
          "Confidence": 0.8734,
          "1000": {
            "Code": "1000",
            "OriginalCode": "1000",
            "Name": "Kasse",
            "PathNames": ["Aktiva", "Umlaufvermögen", "Kasse"],
            "Hints": { "isBank": true, "currencies": ["EUR"] }
          },
          "Explanation": null // or { scores: {...}, llmWhy: "..." } with --explain
        }
      ]
    }
  },
  "UnmatchedRightAccounts": [
    {
      "Code": "8888",
      "Name": "Sonstiges",
      // ... full UnifiedAccount object
    }
  ]
}
```

With --explain, detailed score breakdowns and LLM reasoning are embedded within each match:
```json
{
  "Confidence": 0.8734,
  "1000": {
    "Code": "1000",
    "OriginalCode": "1000",
    "Name": "Kasse",
    "PathNames": ["Aktiva", "Umlaufvermögen", "Kasse"],
    "Hints": { "isBank": true, "currencies": ["EUR"] }
  },
  "Explanation": {
    "scores": {
      "embed": 0.89,
      "heur": 0.75,
      "llm": 0.92,
      "combined": 0.8734
    },
    "llmWhy": "Both represent cash/liquid assets with EUR currency hint"
  }
}
```

- All matches organized by left account code
- Full UnifiedAccount objects for both sides
- Score breakdowns embedded in matches
- Each left account can have multiple matches
The system has been optimized for both speed and cost-effectiveness through parallel batch processing and two-level caching, resulting in a 6.5x speedup and 60-80% cost reduction on repeat operations.
All expensive API operations (embeddings, translations, LLM calls) now process in parallel batches instead of sequentially.
- 10x faster (100s → 10s)
- 30x faster (30s → <1s)
- 5x faster (200s → 40s)

Overall impact: 5.5 min → 51 sec, 6.5x faster for typical matching runs.
L1 (in-memory) + L2 (PostgreSQL database) caching eliminates redundant API calls across sessions for significant cost savings.
- Embedding cache: eliminates re-embedding of the same datasets across sessions
- Translation cache: persists German→English translations indefinitely
- Rerank cache: stores LLM rerank scores for account pairs
- Preprocessing cache: caches fully preprocessed datasets
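The L1/L2 lookup order can be sketched as follows: check the in-memory map first, then the persistent store, then compute and back-fill both layers. The `PersistentStore` interface here is a hypothetical stand-in for the PostgreSQL layer.

```typescript
// Two-level cache lookup (sketch): L1 in-memory map, L2 persistent store.
interface PersistentStore {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
}

function cachedLookup(
  key: string,
  l1: Map<string, string>,
  l2: PersistentStore,
  compute: (key: string) => string, // the expensive API call
): string {
  const hit = l1.get(key);
  if (hit !== undefined) return hit;  // L1 hit: no I/O at all
  const stored = l2.get(key);
  if (stored !== undefined) {
    l1.set(key, stored);              // warm L1 from L2
    return stored;
  }
  const value = compute(key);         // miss: pay the API cost once
  l1.set(key, value);
  l2.set(key, value);
  return value;
}
```

Because L2 persists across sessions, a rerun over the same datasets hits the cache for embeddings, translations, and rerank scores instead of repeating the API calls.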
Cost savings: 60-80% on repeat operations with cached data.