
Semantic Search and Embedding Strategies — Whitepaper for CTOs and IT Decision-Makers

Ádám Zsolt & AIMY

Executive Summary

Semantic search is the critical infrastructure layer of AI-based business applications. While traditional keyword search looks for exact matches, semantic search understands the meaning — enabling an AI assistant to respond to business questions with truly relevant context.

This whitepaper presents the architecture of a real, production multi-tenant SaaS system that:

  • Handles 9 different data types (email, calendar, CRM, invoicing)
  • Uses pgvector instead of a dedicated vector database
  • Enriches vector search with a knowledge graph
  • Optimizes LLM context with a greedy token budget system
  • Manages high-volume embedding generation with a BullMQ async pipeline

The whitepaper walks the reader through 15 chapters — from embedding model selection to production monitoring.


1. The Problem — When Search Doesn't Understand What We're Looking For

The Semantic Gap

Imagine: a beauty salon's AI assistant is asked: "When was Kiss Anna's last visit?"

Keyword search result: Nothing. The word "last" doesn't appear in any calendar entry.

Semantic search result: Finds Kiss Anna's most recent calendar entry (March 15, 2026 — haircut + coloring), because it understands that "last visit" = most recent appointment.

But this isn't enough. What if the question is: "When was Kiss Anna's last visit, and what did we do?"

This requires the knowledge graph too: alongside the calendar event, it loads the client profile (VIP, allergic to certain dyes) and the note written during the previous appointment.

Search Approach               What Does It Find?            Can It Answer the Question?
──────────────────────────────────────────────────────────────────────────────────────────
Keyword (LIKE, tsvector)      Exact word matches            ❌ No
Vector (embedding)            Semantically similar          ⚠️ Partially
Vector + graph                Similar + related             ✅ Yes, with context

This whitepaper presents how we got from keyword search to a vector + knowledge graph architecture — and what decisions we had to make along the way.


2. Embedding Model Selection

2.1 What Is an Embedding?

An embedding transforms text (in this case business data: email, calendar event, client profile) into a numerical vector — typically in a 256-3072 dimensional space. The semantic similarity between such vectors can be measured with cosine similarity.

2.2 The Main Selection Criteria

Model                           Dimensions  Price (1M tokens)  MTEB Avg  Multilingual
─────────────────────────────────────────────────────────────────────────────────────
OpenAI text-embedding-3-small   1536        $0.02              62.3      Good (multilingual)
OpenAI text-embedding-3-large   3072        $0.13              64.6      Good
Cohere embed-v4                 1024        $0.10              66.1      Moderate
Google text-embedding-005       768         $0.025             63.8      Good (multilingual)
BAAI/bge-m3 (local)             1024        $0 (GPU)           62.0      Good
E5-mistral-7b-instruct (local)  4096        $0 (GPU)           66.6      Moderate

2.3 Our Choice: OpenAI text-embedding-3-small (1536d)

Why?

  1. Multilingual: Handles Hungarian business text (email, CRM notes) well, without needing a separate Hungarian model
  2. API simplicity: We already use the OpenAI API for the LLM — single vendor, single API key
  3. Cost-effective: $0.02/1M tokens — a complete knowledge graph of 10,000 nodes costs ~$0.50 to embed
  4. Dimensionality: 1536 is a good balance between accuracy and storage/search cost

2.4 Language-Specific Considerations

Hungarian is an agglutinative language — "futottam" (I ran), "futottál" (you ran), "futottunk" (we ran) are all different tokens but semantically close. OpenAI's BPE tokenization handles this partially, but very rare compound words (e.g., "munkavállalói-érdekképviseleti-tanácsadó" — employee-representation-advisor) may fragment.

Practical experience: text-embedding-3-small works surprisingly well on Hungarian business text (email, CRM, calendar) — cosine similarity values are consistent with semantic similarity. For rare domain-specific terms (e.g., specific cosmetic procedures), contextualized embedding (see Chapter 14.2) can help.

2.5 Decision Tree for Model Selection

Does data need to stay within the EU?
├── Yes → Local model (bge-m3, E5-mistral) or EU-region API
│          ├── Have GPU infrastructure? → Local
│          └── No → EU-region OpenAI / Cohere
└── No → API-based
           ├── Cost-sensitive? → OpenAI text-embedding-3-small ($0.02)
           ├── Maximum accuracy? → Cohere embed-v4 or E5-mistral
           └── Already have OpenAI integration? → text-embedding-3-small (simplicity)

3. Vector Storage — pgvector vs. Dedicated Vector Databases

3.1 The Decision: PostgreSQL + pgvector

Criterion          pgvector                     Pinecone                Qdrant                  Weaviate
────────────────────────────────────────────────────────────────────────────────────────────────────────
Operations         Existing PostgreSQL          Managed SaaS            Self-hosted / Cloud     Self-hosted / Cloud
Cost               $0 (extension)               $70+/mo                 $0-65/mo                $25+/mo
Scalability        Excellent up to ~5M vectors  Unlimited               100M+                   100M+
SQL integration    Native (JOIN, WHERE)         None                    None                    GraphQL
Multi-tenant       WHERE provider_id=           Namespace               Collection / payload    Tenant
ACID transactions  Yes                          No                      No                      No
Knowledge graph    Same DB                      Separate system needed  Separate system needed  Built-in (partial)

The deciding argument: Knowledge graph nodes and edges are in the same database as the vectors. A single SQL query can perform vector search + graph traversal + tenant filtering — no network latency between two systems.

3.2 The knowledge_nodes Table

CREATE TABLE knowledge_nodes (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    provider_id UUID NOT NULL REFERENCES providers(id),
    type VARCHAR(50) NOT NULL,              -- 'email', 'calendar_event', 'client', etc.
    source VARCHAR(50),                      -- 'gmail', 'google_calendar', 'crm', etc.
    external_id VARCHAR(255),                -- original system ID
    label VARCHAR(500),                      -- human-readable title
    content TEXT,                             -- full text content
    properties JSONB DEFAULT '{}',           -- type-specific metadata
    embedding vector(1536),                  -- OpenAI text-embedding-3-small
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),

    CONSTRAINT unique_external UNIQUE (provider_id, source, external_id)
);

-- Vector search index
CREATE INDEX idx_knowledge_nodes_embedding
    ON knowledge_nodes USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

-- Tenant filtering index
CREATE INDEX idx_knowledge_nodes_provider
    ON knowledge_nodes (provider_id);

3.3 Indexing: IVFFlat vs. HNSW

Criterion          IVFFlat              HNSW              Notes
──────────────────────────────────────────────────────────────────────────────────────────────
Build time         Fast                 Slow (10-100x)    HNSW can take hours with large data
Search speed       Good (~5ms)          Excellent (~1ms)  HNSW faster, but IVFFlat is sufficient
Recall@10          95-98%               99%+              IVFFlat tunable with probes parameter
Memory             Low                  High (2-3x)       HNSW keeps graph structure in memory
Incremental index  No (rebuild needed)  Yes               HNSW indexes new vectors immediately

Our choice: IVFFlat (lists=100). In a multi-tenant SaaS, tenants typically hold 100-10,000 nodes — IVFFlat is more than sufficient at this scale, and index rebuilds can be integrated into the BullMQ pipeline.


4. Async Embedding Pipeline — BullMQ Architecture

4.1 Why Async?

Embedding generation cannot run synchronously: an email sync can bring in 500 emails, each requiring a separate API call. Done synchronously, a Gmail sync would take 500 × 200ms = 100 seconds — unacceptable.

4.2 The Pipeline Architecture

Gmail/Calendar Connector
        │
        ▼
   EventWorker (concurrency: 5)
        │  Node creation/update
        │  Edge management (BELONGS_TO, EMAILED, etc.)
        ▼
   EmbeddingQueue.add({ nodeId, content })
        │
        ▼
   EmbeddingWorker (concurrency: 3, rate: 50/min)
        │  OpenAI API call
        │  Vector save → knowledge_nodes.embedding
        ▼
   Ready for search ✓

4.3 Text Preparation

The quality of the embedding depends on the quality of the input text. The pipeline:

  1. Content assembly by type:

    • Email: Subject: ${subject}\nFrom: ${from}\n${body}
    • Calendar: ${summary} - ${start} - ${location}\n${description}
    • Client: ${name} - ${email} - ${notes}
  2. Cleanup: HTML tag removal, whitespace normalization

  3. Truncate: Max 8000 characters (text-embedding-3-small has an 8191 token limit, and the character/token ratio is ~4:1 for Hungarian text)

function prepareTextForEmbedding(node) {
    let text = '';
    switch (node.type) {
        case 'email':
            text = `Email subject: ${node.label}\n${node.content}`;
            break;
        case 'calendar_event':
            text = `Event: ${node.label}\n${node.properties?.location || ''}\n${node.content}`;
            break;
        case 'client':
            text = `Client: ${node.label}\n${node.content}`;
            break;
        default:
            text = `${node.label}\n${node.content}`;
    }
    return text.replace(/\s+/g, ' ').trim().substring(0, 8000);
}

4.4 Rate Limiting and Error Handling

The OpenAI embedding API rate limit (Tier 3): ~5000 RPM. But for safe operation, we limit to 50 jobs/minute:

const embeddingWorker = new Worker('embedding-queue', processEmbedding, {
    connection: redis,
    concurrency: 3,
    limiter: {
        max: 50,
        duration: 60000  // 50 jobs/minute
    }
});

429 (rate limit) handling: If OpenAI returns 429, the worker takes a 1-hour pause (based on OpenAI's rate limit reset window), then resumes. This is more aggressive than exponential backoff, but more reliable — after a 429, exponential backoff often "oscillates."

Error isolation: A failed embedding doesn't block the processing of other nodes. BullMQ's attempts: 3 + backoff: exponential configuration automatically retries, and after 3 failed attempts, the node stays with embedding = NULL, meaning search won't find it — but the system works.


5. Chunking Strategies — Or Rather: Why We DON'T Chunk

5.1 The Chunking Myth

Most RAG tutorials teach "chunking" first: split documents into 500-1000 token pieces, embed each separately. This is excellent for document-based systems (e.g., processing a 200-page manual).

But business data is different:

  • An email averages 200-500 tokens — a natural unit, no need to split
  • A calendar event is 50-150 tokens
  • A client profile is 100-300 tokens
  • A CRM note is 50-200 tokens

These data are "natural chunks" — if we split them, we lose context.

5.2 The Entity-Based Approach

Our system is entity-based, not document-based:

Document-based:                  Entity-based:
─────────────────                ─────────────────
  PDF → 50 chunks → 50 vectors    Email → 1 node + edges
  No connection between chunks     Calendar → 1 node + edges
  Much redundancy                  Client → 1 node + edges
                                   Natural graph structure

Every business entity (email, event, client, invoice) is one node in the knowledge graph, with one embedding. Context isn't provided by chunk overlaps but by graph edges — an email node is connected to the sender (client), the thread, and the events mentioned in it.

5.3 Exception: Long Texts

If there are long texts (e.g., a 10,000-word product catalog), we can choose from three strategies:

  1. Fixed-size chunking: 500-token pieces, with 50-token overlap. Simple but loses context.
  2. Recursive character text splitting: Splits by paragraphs, then sentences. Better context preservation.
  3. Semantic chunking: Uses embedding-based similarity to decide where to split. Best quality but more expensive.

In our system, long texts are rare (95% of business data is < 1000 tokens), so simple truncation (8000 characters) is sufficient.
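For completeness, strategy 1 (fixed-size chunking with overlap) is simple enough to sketch. This counts characters rather than tokens for readability; the function name and defaults are illustrative, not part of the production pipeline:

```javascript
// Fixed-size chunking with overlap: each chunk shares `overlap` characters
// with the previous one so that sentences cut at a boundary keep some context.
function chunkText(text, chunkSize = 500, overlap = 50) {
    const chunks = [];
    const step = chunkSize - overlap; // advance less than chunkSize => overlap
    for (let start = 0; start < text.length; start += step) {
        chunks.push(text.slice(start, start + chunkSize));
        if (start + chunkSize >= text.length) break; // last chunk reached the end
    }
    return chunks;
}
```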


6. Search Tuning — Cosine Similarity, Threshold, Top-K

6.1 Cosine Similarity in Brief

The cosine similarity of two vectors (A, B): cos(θ) = (A · B) / (||A|| × ||B||). Its value ranges from -1 to 1; 1 = perfectly similar, 0 = orthogonal (no relationship).

In practice, embedding models produce values in the 0.3-0.9 range — values above 0.60 typically indicate "relevant."

6.2 The Threshold Paradox

It seems intuitive: set the threshold high (e.g., 0.80), and only return very relevant results. But:

Threshold  Precision  Recall    Experience
────────────────────────────────────────────────────────────────────────
0.80       Excellent  Very low  Barely finds anything — "no data" experience
0.70       Good       Medium    Relevant items sometimes missed
0.60       Good       Good      Optimal balance with OpenAI model
0.50       Low        High      Many irrelevant results — "noisy" context

0.60 is the optimal threshold with the text-embedding-3-small model, tested on business data. This is model-specific — a different model may have a different optimal value.

6.3 Top-K: Why 8?

Top-K (how many results to request) relates to the token budget:

  • Token budget: 3000 tokens (~12,000 characters)
  • Average node content: 300-400 characters (75-100 tokens)
  • Formatting overhead: ~20 tokens/node (Markdown headers, separators)
  • Graph neighbors: The top-3 results' neighbors are also included

→ 8 direct results + ~5 graph neighbors = ~13 nodes × ~100 tokens = ~1300 tokens, which fits well within the 3000 budget, leaving room for formatting and a "safety margin."

6.4 Multi-Tenant Search Isolation

Search is always tenant-filtered:

SELECT id, label, content, type, source, properties,
       1 - (embedding <=> $1::vector) AS similarity
FROM knowledge_nodes
WHERE provider_id = $2
  AND embedding IS NOT NULL
ORDER BY embedding <=> $1::vector
LIMIT $3;

The provider_id = $2 condition ensures that a tenant never sees another tenant's data — not even accidentally. This isn't just GDPR compliance — it's business-critical: one beauty salon shouldn't see another salon's client data.


7. Context Enrichment — The Power of the Graph

7.1 The Problem: Isolated Vector Results

Vector search returns isolated nodes. But business questions require context:

  • "When is Kiss Anna coming next?" → Need the calendar event + the client profile (e.g., allergy information)
  • "What was Saturday's email about?" → Need the email + the full thread + the affected client
  • "How much revenue was there in March?" → Need invoices + the affected clients

7.2 Loading 1-Hop Neighbors

The solution: load the graph neighbors of the top vector results. 1-hop = direct neighbors (nodes 1 edge away).

Vector result: event_calendar_77 (sim: 0.82)
                    │
                    ├── BOOKED ──▶ client_15 "Kiss Anna" (VIP, allergy: X dye)
                    ├── BELONGS_TO ──▶ calendar_google_main
                    └── MENTIONS ──▶ note_23 "Need to mention balayage next time"

client_15 and note_23 were not in the vector search's top-K results (their embeddings differ), but they provide critical context for the answer.

7.3 Relevance Inheritance (Decay Factor)

Neighbors' relevance scores are inherited from the parent's similarity, reduced by a decay factor:

neighbor_score = parent_similarity × decay_factor

In our system, decay_factor = 0.8:

  • If the parent similarity = 0.82 → the neighbor score = 0.82 × 0.8 = 0.656
  • This is still above the threshold (0.60) → included in results
  • A 2-hop neighbor: 0.82 × 0.8 × 0.8 = 0.525 → below threshold → excluded

The decay factor automatically regulates graph traversal depth: the further a neighbor, the lower its score, and it naturally falls out.
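The inheritance rule is small enough to sketch directly. The decay factor and threshold values come from the text; the function shape is illustrative, not the production code:

```javascript
const DECAY_FACTOR = 0.8;
const SIMILARITY_THRESHOLD = 0.60;

// Neighbors at `depth` hops inherit parent_similarity * decay^depth;
// anything below the threshold falls out of the result set.
function scoreNeighbors(parentSimilarity, neighbors, depth = 1) {
    const score = parentSimilarity * Math.pow(DECAY_FACTOR, depth);
    return score >= SIMILARITY_THRESHOLD
        ? neighbors.map(n => ({ ...n, similarity: score }))
        : [];
}
```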

7.4 Why Does This Approach Work?

The relevance inheritance has three important properties:

  1. Direct results are always ranked higher in the results
  2. Neighbors don't "push out" direct results from the token budget
  3. Multiply-referenced entities (e.g., a client node reachable from two emails) receive higher relevance scores

7.5 Recursive CTE Graph Traversal

Neighbor queries use efficient SQL with a recursive CTE:

WITH RECURSIVE related AS (
    -- Starting point: the start node
    SELECT kn.id, kn.type, kn.label, kn.content, kn.properties,
           0 AS depth, ARRAY[kn.id] AS path
    FROM knowledge_nodes kn
    WHERE kn.id = $1

    UNION ALL

    -- Recursion: traverse edges in both directions
    SELECT kn2.id, kn2.type, kn2.label, kn2.content, kn2.properties,
           r.depth + 1, r.path || kn2.id
    FROM related r
    JOIN knowledge_edges ke ON ke.from_node_id = r.id OR ke.to_node_id = r.id
    JOIN knowledge_nodes kn2 ON kn2.id = CASE
        WHEN ke.from_node_id = r.id THEN ke.to_node_id
        ELSE ke.from_node_id
    END
    WHERE r.depth < $2           -- max depth: 1 (or 2 in special cases)
      AND NOT kn2.id = ANY(r.path)  -- cycle prevention
)
SELECT DISTINCT ON (id) * FROM related WHERE depth > 0;

Cycle prevention: The path array contains the nodes visited so far. If a node already appears in the path, recursion doesn't continue on that branch. This prevents infinite loops in mutual references (e.g., email → thread → email).
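The same traversal can be expressed in memory: breadth-first, 1 hop by default, with a visited set playing the role of the CTE's path array. This is an illustrative sketch (edges as plain { from, to } objects), not the production query:

```javascript
// Bidirectional 1..maxDepth-hop traversal with cycle prevention.
function getNeighbors(startId, edges, maxDepth = 1) {
    const visited = new Set([startId]);
    let frontier = [startId];
    const result = [];
    for (let depth = 1; depth <= maxDepth; depth++) {
        const next = [];
        for (const id of frontier) {
            for (const e of edges) {
                // Traverse the edge in either direction.
                const other = e.from === id ? e.to : (e.to === id ? e.from : null);
                if (other && !visited.has(other)) {
                    visited.add(other); // cycle prevention
                    result.push({ id: other, depth });
                    next.push(other);
                }
            }
        }
        frontier = next;
    }
    return result;
}
```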

7.6 Configurable Parameters

Parameter Value Effect
graphMaxDepth 1 How many hops into the graph (1 = direct neighbors)
graphMaxNeighbors 5 Maximum neighbors per starting node
graphMaxEnriched 3 How many of the top-K vector results get enriched with graph neighbors
decayFactor 0.8 Relevance inheritance factor for neighbors

Why 1 hop and not 2? 2-hop neighbors exponentially increase the result set (5 neighbors × 5 = 25 second-hop nodes), and relevance drops to decay² = 0.64 — which is already near the threshold (0.60). Therefore, 1-hop is the optimal balance between context richness and cost-effectiveness.


8. The Complete RAG Pipeline — In 5 Steps

8.1 Architectural Overview

User Question
      │
      ▼
  ┌─────────────────────────────────────────────────────────┐
  │  retrieveRAGContext(providerId, message)                  │
  │                                                          │
  │  1. Guard: message.length >= 3?                          │
  │     └─ No → return emptyResult()                         │
  │                                                          │
  │  2. Vector Search                                        │
  │     generateEmbedding(message)                           │
  │     → searchByEmbedding(embedding, providerId, topK=8)   │
  │     → filter: similarity > 0.60                          │
  │                                                          │
  │  3. Graph Enrichment                                     │
  │     Top-3 results → getNeighbors() per node              │
  │     → neighbors decay = 0.8                              │
  │     → max 5 neighbors per node                           │
  │                                                          │
  │  4. Dedup + Rank + Token Budget                          │
  │     → deduplication by ID (higher score wins)            │
  │     → sort by similarity desc                            │
  │     → greedy packing: 3000 token budget                  │
  │                                                          │
  │  5. Format + Inject                                      │
  │     → Markdown context grouped by type                   │
  │     → Source objects for the frontend                     │
  │     → As system message for the LLM                      │
  └─────────────────────────────────────────────────────────┘
      │
      ▼
  LLM response generation based on context

8.2 Step 1: Guard — Input Validation

if (!message || message.trim().length < RAG_CONFIG.minQueryLength) {
    return emptyResult();
}

Why is this needed? A 1-2 character message (e.g., "hi", "ok") carries no semantic content — its embedding is "average," which would return irrelevant results. The 3-character minimum filters these out.

8.3 Step 2: Vector Search

The user message is embedded with the same model as the data (text-embedding-3-small). This is critical: if the query embedding and stored embeddings come from different models, cosine similarity is meaningless.

The search uses pgvector's cosine distance operator, with provider_id tenant filtering.

8.4 Step 3: Graph Enrichment

For the top-3 vector results (not all 8 — performance-conscious), we load their 1-hop neighbors. Neighbors receive their parent's similarity × 0.8 decay.

8.5 Step 4: Deduplication, Ranking, Token Budget

All nodes (vector + graph) go into a single list:

  1. Deduplication: If a node appears multiple times (e.g., a client is a neighbor of two emails), the higher score remains
  2. Ranking: Descending by similarity
  3. Token budget packing: Greedy algorithm — following the ranking, we add nodes until the 3000-token budget (estimated with the character/4 heuristic) is exhausted. Each node gets max 400 characters of content.

let usedTokens = 0;
const selected = [];

for (const node of rankedNodes) {
    const content = (node.content || '').substring(0, RAG_CONFIG.maxContentLength);
    const estimatedTokens = Math.ceil(content.length / 4) + 20; // +20 for formatting
    if (usedTokens + estimatedTokens > RAG_CONFIG.maxContextTokens) break;
    usedTokens += estimatedTokens;
    selected.push({ ...node, content });
}
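Deduplication (step 1 above) can be sketched the same way; this is an illustrative helper, not the production code:

```javascript
// Merge vector results and graph neighbors: when the same node id appears
// twice, the higher similarity score wins; output is ranked descending.
function dedupeById(nodes) {
    const best = new Map();
    for (const node of nodes) {
        const existing = best.get(node.id);
        if (!existing || node.similarity > existing.similarity) {
            best.set(node.id, node);
        }
    }
    return [...best.values()].sort((a, b) => b.similarity - a.similarity);
}
```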

8.6 Step 5: Markdown Formatting and Injection

The selected nodes are transformed into type-grouped Markdown:

📧 **Emails:**
- **Appointment change** (2026-03-10)
  Dear Salon! I'd like to modify my Friday appointment...
  _Source: Gmail_

📅 **Calendar Events:**
- **Haircut + coloring** (2026-03-15 14:00)
  Kiss Anna - 90 minutes
  _Source: Google Calendar_

👤 **Clients:**
- **Kiss Anna**
  Phone: +36-30-123-4567, VIP client, allergy: certain dyes
  _Source: CRM_

This Markdown is inserted as a system message into the LLM context — separated from the main system prompt:

LLM message array:
  [0] system  → Main system prompt (personality, rules, available tools)
  [1] system  → RAG context (the Markdown above)
  [2-N] user/assistant → Previous conversation (max 50 messages)
  [N+1] user  → Current question

8.7 Source Attribution for the Frontend

The RAG pipeline doesn't just provide context to the LLM — it also sends source references back to the frontend:

const sources = selectedNodes.map(node => ({
    type: node.type,
    label: node.label,
    snippet: node.content?.substring(0, 150),
    source: SOURCE_LABELS[node.source],
    icon: SOURCE_ICONS[node.source],
    similarity: node.similarity,
    nodeId: node.id
}));

This allows the frontend to display a "Sources" section below the response — showing what data the LLM based its answer on, alongside the LLM's response.


9. Hybrid Search — The Next Step

Purely vector-based search isn't perfect:

  • Exact name search ("Kiss Anna"): the vector searches by "meaning" — "Kiss Anna" is semantically similar to "Nagy Éva" (both are person names)
  • Number identifiers ("#INV-2024-0042"): the vector has no concept of exact number patterns
  • Rare domain term ("balayage"): if the model doesn't know it, the vector doesn't represent it well
  • Short, exact query ("email address"): too generic, many false positives

9.2 Hybrid: Vector + BM25

The solution: combine semantic search with keyword-based search (BM25 or PostgreSQL full-text search):

User Question
      │
      ├─── Semantic Search (pgvector cosine) → top-K list + score
      │
      └─── Keyword Search (tsvector/BM25) → top-K list + score
                │
                ▼
        Reciprocal Rank Fusion (RRF)
                │
                ▼
        Combined, ranked result

Reciprocal Rank Fusion (RRF) formula:

RRF(d) = Σ 1 / (k + rank_r(d))   where k is typically 60

The advantage of RRF is that no score normalization is needed — ranking position matters, not absolute values.
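A minimal in-memory sketch of RRF over two ranked id lists (k = 60 as above; the + 1 converts zero-based array indices into one-based ranks):

```javascript
// Fuse two best-first id lists with Reciprocal Rank Fusion:
// score(d) = sum over lists of 1 / (k + rank(d)).
function rrfFuse(vectorIds, textIds, k = 60) {
    const scores = new Map();
    for (const [index, id] of vectorIds.entries()) {
        scores.set(id, (scores.get(id) || 0) + 1 / (k + index + 1));
    }
    for (const [index, id] of textIds.entries()) {
        scores.set(id, (scores.get(id) || 0) + 1 / (k + index + 1));
    }
    return [...scores.entries()]
        .sort((a, b) => b[1] - a[1])
        .map(([id]) => id);
}
```

A document ranked in both lists (like 'b' below) beats a document ranked slightly higher in only one list, which is exactly the behavior hybrid search wants.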

9.3 PostgreSQL-Native Implementation

The pgvector + pg_trgm + tsvector combination enables hybrid search in a single database:

-- Full-text search index (if not already present)
CREATE INDEX idx_knowledge_nodes_fts
    ON knowledge_nodes USING gin (to_tsvector('hungarian', label || ' ' || content));

-- Hybrid query: vector + full-text, RRF
WITH vector_results AS (
    SELECT id, label, content,
           ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS v_rank
    FROM knowledge_nodes
    WHERE provider_id = $2 AND embedding IS NOT NULL
    LIMIT 20
),
text_results AS (
    SELECT id, label, content,
           ROW_NUMBER() OVER (ORDER BY ts_rank_cd(
               to_tsvector('hungarian', label || ' ' || content),
               plainto_tsquery('hungarian', $3)
           ) DESC) AS t_rank
    FROM knowledge_nodes
    WHERE provider_id = $2
      AND to_tsvector('hungarian', label || ' ' || content)
          @@ plainto_tsquery('hungarian', $3)
    LIMIT 20
)
SELECT COALESCE(v.id, t.id) AS id,
       COALESCE(v.label, t.label) AS label,
       1.0 / (60 + COALESCE(v.v_rank, 1000))
         + 1.0 / (60 + COALESCE(t.t_rank, 1000)) AS rrf_score
FROM vector_results v
FULL OUTER JOIN text_results t ON v.id = t.id
ORDER BY rrf_score DESC
LIMIT 8;

9.4 When to Switch to Hybrid?

Signal                                           Indication
─────────────────────────────────────────────────────────────────────────────────────
Users frequently search by name or identifier    Strong indication
Precision is low (many irrelevant results)       Moderate indication
Recall is low (relevant data is being missed)    Moderate indication
The system works well with vector search         No indication — don't optimize without reason

10. Re-ranking — The Final Mile of Quality

10.1 The Problem

The bi-encoder (embedding model) is fast and efficient, but measures similarity by comparing separately computed vectors. It doesn't "read" the query and document together.

The cross-encoder reads them together — therefore more accurate, but slower:

Bi-encoder (embedding):
  Query → [vector_q]    Document → [vector_d]    cosine(q, d) → score
  Speed: ~1000 doc/sec    Accuracy: ★★★☆☆

Cross-encoder (re-ranker):
  [Query + Document] → score
  Speed: ~50 doc/sec     Accuracy: ★★★★★

10.2 The Two-Stage Solution

The solution: the bi-encoder (pgvector) filters (top-K), the cross-encoder ranks:

Question → pgvector cosine (1ms, top-20 from 50K documents)
             │
             ▼
           Cross-encoder re-rank (200ms, on 20 documents)
             │
             ▼
           Top-8 truly relevant results

10.3 Available Re-ranker Solutions

Solution                               Type       Latency        Accuracy   Price
───────────────────────────────────────────────────────────────────────────────────────────
Cohere Rerank v3                       API        ~200ms/20 doc  Excellent  $2/1000 search
Jina Reranker v2                       API/local  ~150ms/20 doc  Good       $1/1000 search
cross-encoder/ms-marco-MiniLM-L-12-v2  Local      ~300ms/20 doc  Good       $0 (GPU)
Voyage Reranker                        API        ~180ms/20 doc  Good       $0.05/1M tokens
OpenAI                                 —          —              —          No native re-ranker

10.4 When to Use a Re-ranker?

Situation                        Recommendation
──────────────────────────────────────────────────────────────────────────────────
< 5K nodes per tenant            Not necessary — top-8 cosine is good enough
5K – 50K nodes                   Worth considering — hybrid + re-rank can improve
50K+ nodes                       Strongly recommended — precision improves significantly
Threshold tuning doesn't help    Re-ranking may solve it

11. Evaluation Framework — How Do You Know It's Working?

11.1 The RAG Evaluation Problem

Evaluating semantic search and RAG pipelines is harder than evaluating a traditional search engine because:

  1. There's no clear "correct answer" — relevance is subjective
  2. Response quality is the combined performance of search + LLM
  3. Evaluation is expensive (human annotation or LLM-based scoring)

11.2 RAGAS Metrics

The RAGAS framework is the industry standard for RAG evaluation:

  • Context Precision — Is the context relevant to the question? Measured by how many of the top-K results are relevant (top-1 weighted most heavily).
  • Context Recall — Does the context contain all information needed for the answer? An LLM compares the ground truth with the context.
  • Faithfulness — Is the answer faithful to the context, with no hallucination? An LLM checks whether the answer's claims come from the context.
  • Answer Relevancy — Does the answer actually respond to the question? An LLM generates questions from the answer and compares them with the original.

11.3 Practical Evaluation Method

Build a golden dataset with 50-100 real questions:

{
    "question": "When was Kiss Anna's last visit?",
    "expected_context": ["event_77", "client_15"],
    "expected_answer_contains": ["2026-03-15", "haircut"],
    "category": "appointment_lookup"
}

Automated evaluation cycle:

  1. Run the 50 questions through the RAG pipeline
  2. Measure: Context Precision, Context Recall, Faithfulness
  3. Vary the parameters (threshold: 0.55/0.60/0.65, topK: 5/8/12)
  4. Visualize: precision-recall curves for different configurations
  5. Choose the best balance
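Step 2 of this cycle can be partially automated. A minimal sketch of context precision for a single golden-dataset entry, using exact node-id matching (RAGAS itself uses LLM-judged relevance, so this is a simplification):

```javascript
// What fraction of the retrieved node ids were expected for this question?
function contextPrecision(retrievedIds, expectedIds) {
    if (retrievedIds.length === 0) return 0;
    const expected = new Set(expectedIds);
    const hits = retrievedIds.filter(id => expected.has(id)).length;
    return hits / retrievedIds.length;
}
```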

11.4 A/B Testing in Production

If there's sufficient traffic, A/B testing is the ground truth:

Group        Configuration                             Measured KPI
──────────────────────────────────────────────────────────────────────────────────
A (control)  threshold=0.60, topK=8, no re-rank        User satisfaction, back-question rate
B (test)     threshold=0.55, topK=12, Cohere re-rank   User satisfaction, back-question rate

The "back-question rate" (how often users ask follow-up questions because they didn't get a good answer) is one of the best proxy metrics for RAG quality.


12. Knowledge Graph + RAG: The GraphRAG Approach

12.1 What Does the Graph Add to RAG?

Traditional RAG is "flat" — it searches documents. GraphRAG adds structure:

Aspect                        Traditional RAG                GraphRAG
──────────────────────────────────────────────────────────────────────────────────────────
Search                        Vector similarity              Vector + graph traversal
Context                       Isolated documents             Connected entity network
"Who did it?" type questions  Weak (the name is just text)   Strong (has client entity + edges)
Multi-hop questions           Cannot answer                  2-hop: client → assigned → calendar
Deduplication                 None (duplicate chunks)        Natural (one entity = one node)

12.2 Our GraphRAG Implementation

Our system handles 9 node types and 8 edge types:

Node types:

Type            Source                  Typical Content
──────────────────────────────────────────────────────────────────────
email           Gmail connector         Email body, subject, date
email_thread    Gmail connector         Full thread summaries
calendar_event  Google Calendar         Time, participants, location
client          CRM module              Name, contact info, preferences
deal            CRM module              Sales opportunity, amount, status
task            CRM module              Task description, deadline, assignee
appointment     Booking system          Time, service, client
note            CRM module              Free-text note
invoice         Számlázz.hu / Billingo  Item, amount, date, status

Edge types (bidirectional traversal):

EMAILED     — email send/receive relationship
BOOKED      — booking relationship (client → appointment)
PAID        — payment relationship (client → invoice)
MENTIONS    — reference (any node → any node)
TAGGED      — tagging
ASSIGNED    — assignment (task → user)
BELONGS_TO  — grouping (email → thread, deal → client)
SENT_TO     — targeted sending

12.3 Graph Advantages with Real Questions

Question: "How much did Kiss Anna spend in the last 3 months?"

Traditional RAG result:
  → Finds an email mentioning payment
  → LLM estimates

GraphRAG result:
  1. Vector search → deal_15 "Kiss Anna package" (sim: 0.75)
  2. Graph enrichment:
     deal_15 ──BELONGS_TO──▶ client_15 "Kiss Anna"
     client_15 ──PAID──▶ invoice_23 "45,000 HUF 2026-02"
     client_15 ──PAID──▶ invoice_31 "38,000 HUF 2026-01"
     client_15 ──BOOKED──▶ appointment_44 "2026-03-15"
  3. LLM gives a precise answer: "Kiss Anna spent 83,000 HUF
     in the last 3 months"

13. Production Operations — Monitoring, Drift, Re-indexing

13.1 What to Monitor?

Metric                      How?                           Alert if...
─────────────────────────────────────────────────────────────────────────────────────────────────
Embedding queue depth       BullMQ getJobCounts()          Waiting > 1000 (backlog)
Embedding queue error rate  Failed jobs / total            > 5% (API issue)
Average search latency      pgvector query time            > 500ms (index issue)
Average similarity score    RAG pipeline log               Average < 0.55 (drift or bad data)
Empty RAG result rate       RAG emptyResult() ratio        > 40% (threshold too high, or insufficient data)
429 rate limit events       Embedding worker log           > 3/day (rate limit too aggressive)
Node count per tenant       COUNT(*) GROUP BY provider_id  < 50 (tenant not using the system)
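The first two rows of the table translate naturally into code. A sketch of an alert check over BullMQ job counts (the input shape mirrors what queue.getJobCounts() returns; the function itself and its alert strings are illustrative, with thresholds taken from the table):

```javascript
// Evaluate queue-health alerts from a BullMQ-style job-count snapshot.
function evaluateQueueAlerts(counts) {
    const alerts = [];
    if (counts.waiting > 1000) alerts.push('embedding backlog');
    const total = counts.completed + counts.failed;
    if (total > 0 && counts.failed / total > 0.05) alerts.push('high error rate');
    return alerts;
}
```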

13.2 Embedding Drift

Models change. If OpenAI updates text-embedding-3-small (they haven't yet, but they marked the older text-embedding-ada-002 as deprecated), old and new vectors become incompatible. This is "drift" — search quality gradually degrades.

Prevention:

  1. Model version logging: Store which model version generated each embedding
  2. Full re-embedding capability: Have a script that re-embeds all nodes
  3. Canary tests: Regularly run the golden dataset — if precision drops, suspect drift

13.3 Re-indexing Strategy

When to re-index?

Reason                                           Frequency
─────────────────────────────────────────────────────────────────────
pgvector index type switch (IVFFlat → HNSW)      One-time
IVFFlat lists parameter increase (data growth)   Check quarterly
Embedding model change                           One-time, full
Significant data volume change (+100%)           Recalculate lists parameter

Zero-downtime re-indexing:

-- 1. Build new index CONCURRENTLY (doesn't lock the table)
CREATE INDEX CONCURRENTLY idx_knowledge_nodes_embedding_new
    ON knowledge_nodes USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);

-- 2. Drop old index (CONCURRENTLY avoids blocking concurrent queries;
--    note it cannot run inside a transaction block)
DROP INDEX CONCURRENTLY idx_knowledge_nodes_embedding;

-- 3. Rename
ALTER INDEX idx_knowledge_nodes_embedding_new
    RENAME TO idx_knowledge_nodes_embedding;

13.4 The Resilience Matrix

Every component in the full pipeline is fail-safe:

Component               On Failure                          Impact on User
──────────────────────────────────────────────────────────────────────────────────────
RAG pipeline (rag.js)   emptyResult() return                LLM responds, but without context
Graph enrichment        Per-node try/catch, skip            Less context, but works
Embedding worker        Queue pause + retry                 New data temporarily unsearchable
Event worker            Per-entity try/catch                Partial processing, rest continues
Connector sync          Writes to SyncLog: PARTIAL/FAILED   Visible on admin dashboard
Context loading         try/catch → null                    System prompt works without knowledge context
Tool execution          Per-tool try/catch, error → LLM     LLM tries a new strategy
Token budget            Hard cap 3000 tokens                Never overflows

The principle: The AI assistant always responds. If there's no context, it responds without the knowledge graph. If there are no tools, it responds without them. Degradation is gradual, never total.
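The pattern behind the matrix is a small wrapper that converts failures into degraded-but-usable values. A minimal sketch — the function name and usage are illustrative, not the actual rag.js API:

```javascript
// Wraps a pipeline stage so a failure degrades to a fallback value
// instead of blocking the AI response. Names are illustrative.
async function degradeTo(fallback, stage) {
  try {
    return await stage();
  } catch (err) {
    console.warn(`stage failed, degrading: ${err.message}`);
    return fallback;
  }
}

// Usage: context loading degrades to null, enrichment to the bare hits, etc.
// const context = await degradeTo(null, () => loadKnowledgeContext(question));
```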


14. Fine-tuning Embeddings — When Is It Worth It?

14.1 The Promise and Reality of Fine-tuning

Fine-tuning an embedding model adapts it to domain-specific vocabulary. But:

Aspect               Advantage                                      Disadvantage
────────────────────────────────────────────────────────────────────────────────────────────
Domain vocabulary    Better similarity for domain-specific terms    Data collection required (1000+ pairs)
Language specifics   Better agglutination handling                  Expensive (API: ~$50-200/training)
Accuracy             +3-8% MTEB improvement within domain           General knowledge may degrade
Maintenance          (none)                                         Must redo with every model change

14.2 Alternative: Prompt-Level Embedding Improvement

Before fine-tuning, try improving the input text:

// Instead of:
generateEmbedding(email.body)

// Contextualize:
generateEmbedding(`Email subject: ${email.subject}\nFrom: ${email.from}\n${email.body}`)

This "contextualized embedding" can improve results surprisingly well — the model knows more about the text's context, without any fine-tuning.
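A per-type contextualizer generalizes the idea to all node types. A sketch with illustrative field names, not the production schema:

```javascript
// Builds the text to embed by prepending structural context per node type.
// Field names are illustrative, not the production schema.
function buildEmbeddingInput(node) {
  switch (node.type) {
    case "email":
      return `Email subject: ${node.subject}\nFrom: ${node.from}\n${node.body}`;
    case "appointment":
      return `Appointment: ${node.title}\nDate: ${node.date}\nClient: ${node.client}`;
    case "invoice":
      return `Invoice for ${node.client}: ${node.amount} (${node.date})`;
    default:
      return node.text ?? "";
  }
}

const input = buildEmbeddingInput({
  type: "email",
  subject: "Appointment change",
  from: "anna.kiss@example.com",
  body: "Can we move Friday to 3pm?",
});
// → "Email subject: Appointment change\nFrom: anna.kiss@example.com\nCan we move Friday to 3pm?"
```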

14.3 When to Fine-tune?

Fine-tune only if all of the following are true:

  1. You have 1000+ relevant (query, document, label) triplets
  2. Current precision on the golden dataset is below 70%
  3. Prompt-level improvement didn't help

All three true? → Fine-tune. Otherwise → Don't fine-tune.

15. Summary and Decision Matrix

15.1 Architecture Layers

┌──────────────────────────────────────────────────────────────┐
│                       User Question                           │
├──────────────────────────────────────────────────────────────┤
│  RAG Pipeline                                                 │
│  ┌─────────────┐  ┌──────────────┐  ┌─────────────────────┐ │
│  │ Vector Search│→ │ Graph Enrich │→ │ Dedup + Token Pack  │ │
│  │ (pgvector)   │  │ (1-hop CTE)  │  │ (3000 token budget) │ │
│  └─────────────┘  └──────────────┘  └─────────────────────┘ │
├──────────────────────────────────────────────────────────────┤
│  Knowledge Graph (PostgreSQL)                                 │
│  ┌──────────────────┐  ┌──────────────────┐                  │
│  │  KnowledgeNode    │──│  KnowledgeEdge   │                  │
│  │  (9 types, 1536d) │  │  (8 edge types)  │                  │
│  └──────────────────┘  └──────────────────┘                  │
├──────────────────────────────────────────────────────────────┤
│  Embedding Pipeline (BullMQ)                                  │
│  ┌─────────────┐  ┌──────────────┐  ┌─────────────────────┐ │
│  │ Event Worker │→ │ Embedding Q  │→ │ OpenAI API          │ │
│  │ (conc: 5)   │  │ (50/min)     │  │ (text-emb-3-small)  │ │
│  └─────────────┘  └──────────────┘  └─────────────────────┘ │
├──────────────────────────────────────────────────────────────┤
│  Connectors                                                   │
│  Gmail │ Google Calendar │ CRM │ Invoicing │ Billing          │
└──────────────────────────────────────────────────────────────┘
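The "Dedup + Token Pack" stage in the diagram can be sketched as greedy packing under a budget. A minimal version, using a rough ~4 characters/token estimate for brevity — the real system should count tokens with a proper tokenizer:

```javascript
// Greedy token-budget packing: take results in similarity order until the
// budget is exhausted. The ~4 chars/token estimate is a simplification.
const estimateTokens = (text) => Math.ceil(text.length / 4);

function packContext(results, budget = 3000) {
  const packed = [];
  const seen = new Set();
  let used = 0;
  for (const r of [...results].sort((a, b) => b.sim - a.sim)) {
    if (seen.has(r.id)) continue; // dedup: graph enrichment can revisit nodes
    const cost = estimateTokens(r.text);
    if (used + cost > budget) continue; // over budget: skip, try smaller items
    seen.add(r.id);
    packed.push(r);
    used += cost;
  }
  return packed;
}

const packed = packContext(
  [
    { id: "a", sim: 0.9, text: "x".repeat(40) },   // ~10 tokens, fits
    { id: "a", sim: 0.9, text: "x".repeat(40) },   // duplicate, dropped
    { id: "b", sim: 0.8, text: "x".repeat(4000) }, // ~1000 tokens, over budget
  ],
  100
);
// → only "a" fits
```

Skipping an oversized item and continuing (rather than stopping) lets smaller lower-ranked items still fill the remaining budget.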

15.2 CTO Decision Checklist

Question                         If yes...                                If no...
──────────────────────────────────────────────────────────────────────────────────────────────
Have PostgreSQL?                 pgvector (free, simple)                  Managed vector DB (Pinecone)
< 5M vectors?                    pgvector is more than enough             Consider Qdrant/Milvus
Multi-tenant?                    Provider-level filtering is mandatory!   Simpler architecture
Structured business data?        Knowledge graph + entity-based           Chunking + document-based
Graph-like relationships?        GraphRAG enrichment                      Pure vector RAG
Precision is critical?           Hybrid search + re-ranker                Pure vector search
Non-English primary language?    OpenAI text-embedding-3-small            Benchmark models on your own text

15.3 The Most Important Lessons

  1. Start simple: pgvector + text-embedding-3-small + cosine search. This works in 30 minutes and is sufficient for most SME use cases.

  2. Don't chunk what's a natural unit: Emails, events, client data are better kept whole. The entity-based knowledge graph handles the granularity question.

  3. Graph enrichment is the real differentiator: For "Who?" "When?" "How much?" type questions, vector search alone is weak — loading neighbors brings dramatic quality improvement.

  4. Resilience is not optional: In a production system, the RAG pipeline must not block the AI response. If anything breaks, graceful degradation: we respond with less context, but we respond.

  5. Measure with 50 questions before changing anything: Threshold, top-K, token budget are all tunable — but with data-driven decisions, not intuition.
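Lesson 5 needs only one metric to start: precision over a golden dataset. A sketch — `search` stands in for the retrieval step and the golden-set shape is an assumption, not the project's actual test harness:

```javascript
// Average precision@K over a golden dataset: for each question, what
// fraction of the top-K retrieved ids are in the expected set.
// `search` is an assumed stand-in for the RAG retrieval step.
async function precisionAtK(goldenSet, search, k = 5) {
  let total = 0;
  for (const { question, expectedIds } of goldenSet) {
    const hits = (await search(question)).slice(0, k);
    const relevant = hits.filter((h) => expectedIds.includes(h.id)).length;
    total += relevant / Math.min(k, hits.length || 1);
  }
  return total / goldenSet.length;
}

// Usage: run before and after every threshold / top-K / budget change,
// and only keep the change if the number improves.
// const p = await precisionAtK(goldenQuestions, (q) => vectorSearch(q));
```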


Want to implement semantic search in your own system? The Atlosz Interactive team has production experience with pgvector, knowledge graph, and RAG pipeline architecture. Get in touch for a free technical consultation.