
How It Works

A deep dive into the matching pipeline, architecture, and technical implementation

The 6-Stage Matching Pipeline

The matching engine orchestrates a sophisticated pipeline that combines multiple signals for robust account matching. Each stage builds on the previous one to progressively refine and improve match quality.

1. Semantic Embeddings Generation

Generate dense vector representations of each account using state-of-the-art embedding models. The system composes a rich text representation from the account's hierarchy path, name, code, and metadata hints before generating embeddings.

Supported Providers

  • AWS Bedrock Titan Embeddings v2 (default)
  • OpenAI text-embedding-3-large (3072 dimensions)

Example Embedding Input

Assets > Current Assets > Cash > Cash EUR | Code: 1-100-100-30-10 | Hints: isBank, hasIBAN, currency: EUR
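The composition step above can be sketched as follows. This is an illustrative helper, not the engine's actual code; `buildEmbeddingInput` and its parameter type are hypothetical names, with fields taken from the UnifiedAccount schema described later.

```typescript
// Hypothetical sketch: compose the embedding input text from an account's
// hierarchy path, code, and metadata hints, as in the example above.
interface AccountForEmbedding {
  Code: string;
  PathNames: string[];
  Hints: { isBank?: boolean; hasIBAN?: boolean; currencies?: string[] };
}

function buildEmbeddingInput(a: AccountForEmbedding): string {
  const hints: string[] = [];
  if (a.Hints.isBank) hints.push("isBank");
  if (a.Hints.hasIBAN) hints.push("hasIBAN");
  for (const c of a.Hints.currencies ?? []) hints.push(`currency: ${c}`);
  const parts = [a.PathNames.join(" > "), `Code: ${a.Code}`];
  if (hints.length) parts.push(`Hints: ${hints.join(", ")}`);
  return parts.join(" | ");
}
```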

2. KNN (K-Nearest Neighbors) Search

For each account in the left dataset (e.g., Talix), find the top-K most similar accounts from the right dataset (e.g., CustomCOA1Csv) using cosine similarity on the embedding vectors. This efficiently narrows down the candidate space.

Key Parameters

  • topK: Number of candidates per left account (default: 12)
  • Uses an in-memory cosine similarity matrix for speed
  • Batched processing for efficiency
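A minimal sketch of this stage, assuming plain number arrays for embedding vectors (the function names are illustrative, and topK mirrors the documented default):

```typescript
// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// For one left-side vector, return the indices of the top-K most
// similar right-side vectors, narrowing the candidate space.
function topKCandidates(left: number[], right: number[][], topK = 12): number[] {
  return right
    .map((vec, idx) => ({ idx, sim: cosine(left, vec) }))
    .sort((x, y) => y.sim - x.sim)
    .slice(0, topK)
    .map((c) => c.idx);
}
```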

3. Heuristic Scoring

Apply rule-based scoring to each candidate pair. Heuristics capture domain-specific knowledge that pure ML approaches might miss, such as code structure patterns and account type indicators.

Heuristic Rules

  • Code structure similarity (e.g., "1-100-*" matching)
  • Hierarchy depth alignment
  • Bank account detection (isBank, hasIBAN hints)
  • Currency code matching
  • Extensible: add domain-specific rules via plugins
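The rules above could be combined roughly as follows. This is a sketch under assumed rule weights; the actual rule set and weighting in the engine may differ.

```typescript
// Illustrative heuristic scorer for a candidate pair. The individual
// weights (0.4 / 0.2 / 0.2 / 0.2) are assumptions for this sketch.
interface HeurAccount {
  CodeParts: number[];
  PathNames: string[];
  Hints: { isBank?: boolean; currencies?: string[] };
}

function heuristicScore(left: HeurAccount, right: HeurAccount): number {
  let score = 0;
  // Code structure similarity: count shared leading segments (e.g. "1-100-*").
  let shared = 0;
  while (
    shared < Math.min(left.CodeParts.length, right.CodeParts.length) &&
    left.CodeParts[shared] === right.CodeParts[shared]
  ) shared++;
  score += 0.4 * (shared / Math.max(left.CodeParts.length, right.CodeParts.length));
  // Hierarchy depth alignment.
  if (left.PathNames.length === right.PathNames.length) score += 0.2;
  // Bank account detection via hints.
  if (left.Hints.isBank && right.Hints.isBank) score += 0.2;
  // Currency code matching.
  const lc = new Set(left.Hints.currencies ?? []);
  if ((right.Hints.currencies ?? []).some((c) => lc.has(c))) score += 0.2;
  return score;
}
```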

4. LLM Reranking

Leverage large language models to provide semantic understanding and contextual reranking of candidates. The LLM sees the full account context and provides both a score and an explanation for each match.

Supported Models

  • Claude 3.5 Sonnet via AWS Bedrock (default)
  • GPT-5 via OpenAI

Batched Inference

Candidates are batched (up to rerankMaxPerLeft per account) to minimize API calls. Structured JSON output ensures consistent parsing.
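The structured output could look like the following. The field names (`rightCode`, `score`, `why`) are assumptions for this sketch, not the system's actual response schema; the point is that a fixed JSON shape makes parsing deterministic.

```typescript
// Assumed shape of one structured rerank result from the LLM.
interface RerankResult { rightCode: string; score: number; why: string }

// Parse a structured JSON response and clamp scores defensively,
// since LLM output occasionally drifts out of the [0, 1] range.
function parseRerankResponse(json: string): RerankResult[] {
  const parsed = JSON.parse(json) as { results: RerankResult[] };
  return parsed.results.map((r) => ({
    ...r,
    score: Math.min(1, Math.max(0, r.score)),
  }));
}
```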

5. Score Combination & Mutuality Boost

Combine the three signals (embeddings, heuristics, LLM) using configurable weights. Additionally, apply a mutuality boost to bidirectional strong matches (when A→B and B→A are both high-confidence).

Default Weights

  • Embedding Score: 65%
  • LLM Score: 25%
  • Heuristic Score: 10%
  • Mutuality Boost: +5%
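The combination step can be sketched with the documented default weights. How the system decides that a pair is "mutually strong" is not specified here, so that check is passed in as a flag; the cap at 1.0 is also an assumption of this sketch.

```typescript
// Weighted combination of the three signals plus an optional mutuality
// boost, using the documented default weights.
interface Signals { embed: number; llm: number; heur: number }

const weights = { embed: 0.65, llm: 0.25, heur: 0.10, mutualBoost: 0.05 };

function combine(s: Signals, isMutualStrong: boolean): number {
  const base = weights.embed * s.embed + weights.llm * s.llm + weights.heur * s.heur;
  // Bidirectional strong matches (A→B and B→A) receive the boost; capping
  // at 1.0 keeps confidence from exceeding certainty (an assumption here).
  return Math.min(1, base + (isMutualStrong ? weights.mutualBoost : 0));
}
```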

6. Threshold Filtering & Soft Capping

Apply a global confidence threshold to filter out low-quality matches, then cap each left account to its top-M matches by combined score. This ensures manageable output while preserving many-to-many relationships.

Default Settings

  • threshold: 0.75 (matches below this are filtered)
  • topM: 5 (max matches per left account)
  • Both values are configurable via config or API
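Filtering and capping with the documented defaults can be sketched in a few lines (the `Match` shape here is illustrative):

```typescript
// Drop matches below the confidence threshold, then keep only the
// top-M matches per left account by combined score.
interface Match { rightCode: string; confidence: number }

function filterAndCap(matches: Match[], threshold = 0.75, topM = 5): Match[] {
  return matches
    .filter((m) => m.confidence >= threshold)
    .sort((a, b) => b.confidence - a.confidence)
    .slice(0, topM);
}
```

Because the cap is applied per left account rather than globally, many-to-many relationships survive: several left accounts can still point at the same right account.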

AI Provider Setup

The system supports multiple AI providers for both embeddings and LLM inference. You can configure credentials for one or both; the system will auto-select a provider, or you can specify one explicitly.

AWS Bedrock

Configure AWS credentials with access to Bedrock and Translate services.

Environment Variables

export AWS_REGION=eu-central-1
export BEDROCK_EMBEDDINGS_MODEL_ID=amazon.titan-embed-text-v2:0
export BEDROCK_CLAUDE_MODEL_ID=anthropic.claude-3-5-sonnet-20240620-v1:0

Note: Bedrock is the default provider. If credentials are available, it will be used unless explicitly overridden.

OpenAI

Configure OpenAI API credentials for embeddings and chat completion.

Environment Variables

export OPENAI_API_KEY=sk-...
export OPENAI_ORGANIZATION=org-...  # optional

Fixed Models

  • Embeddings: text-embedding-3-large (3072 dimensions)
  • Chat: gpt-5

Data Model & Unified Schema

All data sources are normalized to a unified account schema before matching. This abstraction allows the core matching engine to work with any chart of accounts format through plugin adapters.

UnifiedAccount Schema

interface UnifiedAccount {
  Code: string              // Hierarchical account code (e.g., "1-100-100-30-10")
  Name: string              // English display name (auto-translated if needed)
  PathNames: string[]       // Full hierarchy from root to leaf ["Assets", "Current Assets", ...]
  CodeParts: number[]       // Parsed numeric segments [1, 100, 100, 30, 10]
  Hints: {
    isBank?: boolean        // Bank account indicator
    hasIBAN?: boolean       // Contains IBAN pattern
    currencies?: string[]   // Currency codes (EUR, USD, etc.)
  }
  // ... extensible metadata fields
}

Key Benefits

Normalization

Single representation ensures consistency across all inputs

Extensibility

Add custom fields for domain-specific needs

Type Safety

TypeScript interfaces ensure correctness

Scoring Methodology

The final confidence score for each match is a weighted combination of multiple signals. This multi-signal approach provides robustness and handles edge cases that single-method approaches might miss.

Score Formula

combined = (0.65 × embed) + (0.25 × llm) + (0.10 × heur) + (mutualBoost)

Embedding Score (65%)

Cosine similarity between embeddings. Captures semantic similarity based on hierarchy, naming, and context.

LLM Score (25%)

Semantic reranking from Claude/GPT. Provides contextual understanding and reasoning about account relationships.

Heuristic Score (10%)

Rule-based scoring for code patterns, depth, and hints. Captures domain expertise and structural patterns.

Mutuality Boost (+5%)

Bonus applied when both A→B and B→A are strong matches. Increases confidence in bidirectional relationships.

Configuration Options

All matching behavior can be tuned without code changes through configuration files and environment variables.

config/default.config.json

{
  "threshold": 0.75,          // Minimum confidence to include
  "topK": 12,                 // Candidates per account (KNN)
  "topM": 5,                  // Max output matches per account
  "rerankMaxPerLeft": 8,      // Max candidates for LLM rerank
  "scoreWeights": {
    "embed": 0.65,
    "llm": 0.25,
    "heur": 0.10,
    "mutualBoost": 0.05
  },
  "enableTranslate": true,    // Auto-translate names
  "maskIbanInPrompts": true,  // Mask IBAN patterns
  "bedrock": {
    "region": "eu-central-1",
    "embeddingsModelId": "amazon.titan-embed-text-v2:0",
    "claudeModelId": "anthropic.claude-3-5-sonnet-20240620-v1:0"
  }
}

Tip: Create a local talixmap.config.json file to override defaults without modifying version-controlled settings.
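The override behavior could be implemented with a simple merge, sketched below. The shallow-merge strategy is an assumption of this sketch; nested objects such as scoreWeights may be merged differently in the real system.

```typescript
// Illustrative config loading: a local talixmap.config.json shallowly
// overrides the version-controlled defaults.
function mergeConfig<T extends object>(defaults: T, local: Partial<T>): T {
  return { ...defaults, ...local };
}

// Example: override only the threshold, keeping all other defaults.
const effective = mergeConfig(
  { threshold: 0.75, topK: 12, topM: 5 },
  { threshold: 0.8 },
);
```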

Output Format & Observability

The system produces a single unified JSON structure with complete match information. When run with --explain, it includes detailed score breakdowns and AI rationale embedded within each match.

Unified Output Structure (out/mappings.json)

All matching results in a single hierarchical structure

{
  "LeftSideAccounts": {
    "1-100-100-30-10": {
      "LeftAccount": {
        "Code": "1-100-100-30-10",
        "OriginalCode": "1-100-100-30-10",
        "Name": "Cash EUR",
        "PathNames": ["Assets", "Current Assets", "Cash", "Cash EUR"],
        "Hints": { "isBank": true, "hasIBAN": true, "currencies": ["EUR"] }
      },
      "Matches": [
        {
          "Confidence": 0.8734,
          "1000": {
            "Code": "1000",
            "OriginalCode": "1000",
            "Name": "Kasse",
            "PathNames": ["Aktiva", "Umlaufvermögen", "Kasse"],
            "Hints": { "isBank": true, "currencies": ["EUR"] }
          },
          "Explanation": null  // or { scores: {...}, llmWhy: "..." } with --explain
        }
      ]
    }
  },
  "UnmatchedRightAccounts": [
    {
      "Code": "8888",
      "Name": "Sonstiges",
      // ... full UnifiedAccount object
    }
  ]
}

Match Entry with Explanations (--explain enabled)

Detailed score breakdowns and LLM reasoning embedded within each match

{
  "Confidence": 0.8734,
  "1000": {
    "Code": "1000",
    "OriginalCode": "1000",
    "Name": "Kasse",
    "PathNames": ["Aktiva", "Umlaufvermögen", "Kasse"],
    "Hints": { "isBank": true, "currencies": ["EUR"] }
  },
  "Explanation": {
    "scores": {
      "embed": 0.89,
      "heur": 0.75,
      "llm": 0.92,
      "combined": 0.8734
    },
    "llmWhy": "Both represent cash/liquid assets with EUR currency hint"
  }
}

Hierarchical Grouping

All matches organized by left account code

Complete Context

Full UnifiedAccount objects for both sides

Inline Explanations

Score breakdowns embedded in matches

Many-to-Many

Each left account can have multiple matches

Performance Optimizations

The system has been optimized for both speed and cost-effectiveness through parallel batch processing and two-level caching, resulting in a 6.5x speedup and 60-80% cost reduction on repeat operations.

Parallel Batch Processing

All expensive API operations (embeddings, translations, LLM calls) now process in parallel batches instead of sequentially.
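A minimal sketch of that pattern, assuming a generic async worker per item (the function name and batch size are illustrative):

```typescript
// Split items into fixed-size batches; calls within a batch run in
// parallel via Promise.all, while batches run sequentially so provider
// rate limits are not exceeded.
async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    results.push(...(await Promise.all(batch.map(fn))));
  }
  return results;
}
```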

  • Bedrock Titan Embeddings: 10x faster (100s → 10s)
  • AWS Translate: 30x faster (30s → <1s)
  • LLM Reranking: 5x faster (200s → 40s)

Overall Impact

5.5 min → 51 sec: 6.5x faster for typical matching runs

Two-Level Caching System

L1 (in-memory) + L2 (PostgreSQL database) caching eliminates redundant API calls across sessions for significant cost savings.

  • Embedding Cache (30-day TTL): eliminates re-embedding of the same datasets across sessions
  • Translation Cache (90-day TTL): persists German→English translations across sessions
  • Rerank Cache (7-day TTL): stores LLM rerank scores for account pairs
  • Prep Cache (30-day TTL): caches fully preprocessed datasets
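The lookup path can be sketched as follows. A Map stands in for the PostgreSQL L2 store, and the class name and the omission of TTL handling are simplifications of this sketch, not the real implementation.

```typescript
// Two-level cache lookup: check L1 (in-memory), then L2 (persistent
// store; PostgreSQL in the real system), and only call the API on a miss.
class TwoLevelCache<V> {
  private l1 = new Map<string, V>();
  constructor(private l2: Map<string, V>) {}

  async get(key: string, compute: () => Promise<V>): Promise<V> {
    const hit1 = this.l1.get(key);
    if (hit1 !== undefined) return hit1;      // L1 hit: no I/O at all
    const hit2 = this.l2.get(key);
    if (hit2 !== undefined) {
      this.l1.set(key, hit2);                 // promote to L1
      return hit2;
    }
    const value = await compute();            // miss: make the API call once
    this.l1.set(key, value);
    this.l2.set(key, value);                  // persist across sessions
    return value;
  }
}
```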

Cost Savings

60-80% on repeat operations with cached data

Ready to Try It?

Experience the matching pipeline with your own data