
Semantic Search and Embedding Strategies — Whitepaper for CTOs and IT Decision-Makers

Ádám Zsolt & AIMY

Executive Summary

Semantic search is the critical infrastructure layer of AI-based business applications. While traditional keyword search looks for exact matches, semantic search understands the meaning — enabling an AI assistant to respond to business questions with truly relevant context.

This whitepaper presents the architecture of a real, production multi-tenant SaaS system that:

  • Handles 9 different data types (email, calendar, CRM, invoicing)
  • Uses pgvector instead of a dedicated vector database
  • Enriches vector search with a knowledge graph
  • Optimizes LLM context with a greedy token budget system
  • Manages high-volume embedding generation with a BullMQ async pipeline

The whitepaper walks the reader through 15 chapters — from embedding model selection to production monitoring.


1. The Problem — When Search Doesn't Understand What We're Looking For

The Semantic Gap

Imagine: a beauty salon's AI assistant is asked: "When was Kiss Anna's last visit?"

Keyword search result: Nothing. The word "last" doesn't appear in any calendar entry.

Semantic search result: Finds Kiss Anna's most recent calendar entry (March 15, 2026 — haircut + coloring), because it understands that "last visit" = most recent appointment.

But this isn't enough. What if the question is: "When was Kiss Anna's last visit, and what did we do?"

This requires the knowledge graph too: alongside the calendar event, it loads the client profile (VIP, allergic to certain dyes) and the note written during the previous appointment.

Search Approach               What Does It Find?            Can It Answer the Question?
──────────────────────────────────────────────────────────────────────────────────────────
Keyword (LIKE, tsvector)      Exact word matches            ❌ No
Vector (embedding)            Semantically similar          ⚠️ Partially
Vector + graph                Similar + related             ✅ Yes, with context

This whitepaper presents how we got from keyword search to a vector + knowledge graph architecture — and what decisions we had to make along the way.


2. Embedding Model Selection

2.1 What Is an Embedding?

An embedding transforms text (in this case business data: email, calendar event, client profile) into a numerical vector — typically in a 256-3072 dimensional space. The semantic similarity between such vectors can be measured with cosine similarity.

2.2 The Main Selection Criteria

Model                           Dimensions  Price (1M tokens)  MTEB Avg  Multilingual
─────────────────────────────────────────────────────────────────────────────────────
OpenAI text-embedding-3-small   1536        $0.02              62.3      Good (multilingual)
OpenAI text-embedding-3-large   3072        $0.13              64.6      Good
Cohere embed-v4                 1024        $0.10              66.1      Moderate
Google text-embedding-005       768         $0.025             63.8      Good (multilingual)
BAAI/bge-m3 (local)             1024        $0 (GPU)           62.0      Good
E5-mistral-7b-instruct (local)  4096        $0 (GPU)           66.6      Moderate

2.3 Our Choice: OpenAI text-embedding-3-small (1536d)

Why?

  1. Multilingual: Handles Hungarian business text (email, CRM notes) well, without needing a separate Hungarian model
  2. API simplicity: We already use the OpenAI API for the LLM — single vendor, single API key
  3. Cost-effective: $0.02/1M tokens — a complete knowledge graph of 10,000 nodes costs ~$0.50 to embed
  4. Dimensionality: 1536 is a good balance between accuracy and storage/search cost

2.4 Language-Specific Considerations

Hungarian is an agglutinative language — "futottam" (I ran), "futottál" (you ran), "futottunk" (we ran) are all different tokens but semantically close. OpenAI's BPE tokenization handles this partially, but very rare compound words (e.g., "munkavállalói-érdekképviseleti-tanácsadó" — employee-representation-advisor) may fragment.

Practical experience: text-embedding-3-small works surprisingly well on Hungarian business text (email, CRM, calendar) — cosine similarity values are consistent with semantic similarity. For rare domain-specific terms (e.g., specific cosmetic procedures), contextualized embedding (see Chapter 14.2) can help.

2.5 Decision Tree for Model Selection

Does data need to stay within the EU?
├── Yes → Local model (bge-m3, E5-mistral) or EU-region API
│          ├── Have GPU infrastructure? → Local
│          └── No → EU-region OpenAI / Cohere
└── No → API-based
           ├── Cost-sensitive? → OpenAI text-embedding-3-small ($0.02)
           ├── Maximum accuracy? → Cohere embed-v4 or E5-mistral
           └── Already have OpenAI integration? → text-embedding-3-small (simplicity)

3. Vector Storage — pgvector vs. Dedicated Vector Databases

3.1 The Decision: PostgreSQL + pgvector

Criterion          pgvector                     Pinecone                Qdrant                  Weaviate
────────────────────────────────────────────────────────────────────────────────────────────────────────
Operations         Existing PostgreSQL          Managed SaaS            Self-hosted / Cloud     Self-hosted / Cloud
Cost               $0 (extension)               $70+/mo                 $0-65/mo                $25+/mo
Scalability        Excellent up to ~5M vectors  Unlimited               100M+                   100M+
SQL integration    Native (JOIN, WHERE)         None                    None                    GraphQL
Multi-tenant       WHERE provider_id=           Namespace               Collection / payload    Tenant
ACID transactions  Yes                          No                      No                      No
Knowledge graph    Same DB                      Separate system needed  Separate system needed  Built-in (partial)

The deciding argument: Knowledge graph nodes and edges are in the same database as the vectors. A single SQL query can perform vector search + graph traversal + tenant filtering — no network latency between two systems.

3.2 The knowledge_nodes Table

CREATE TABLE knowledge_nodes (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    provider_id UUID NOT NULL REFERENCES providers(id),
    type VARCHAR(50) NOT NULL,              -- 'email', 'calendar_event', 'client', etc.
    source VARCHAR(50),                      -- 'gmail', 'google_calendar', 'crm', etc.
    external_id VARCHAR(255),                -- original system ID
    label VARCHAR(500),                      -- human-readable title
    content TEXT,                             -- full text content
    properties JSONB DEFAULT '{}',           -- type-specific metadata
    embedding vector(1536),                  -- OpenAI text-embedding-3-small
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),

    CONSTRAINT unique_external UNIQUE (provider_id, source, external_id)
);

-- Vector search index
CREATE INDEX idx_knowledge_nodes_embedding
    ON knowledge_nodes USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

-- Tenant filtering index
CREATE INDEX idx_knowledge_nodes_provider
    ON knowledge_nodes (provider_id);

3.3 Indexing: IVFFlat vs. HNSW

Criterion          IVFFlat              HNSW              Notes
──────────────────────────────────────────────────────────────────────────────────────────────
Build time         Fast                 Slow (10-100x)    HNSW can take hours with large data
Search speed       Good (~5ms)          Excellent (~1ms)  HNSW faster, but IVFFlat is sufficient
Recall@10          95-98%               99%+              IVFFlat tunable with probes parameter
Memory             Low                  High (2-3x)       HNSW keeps graph structure in memory
Incremental index  No (rebuild needed)  Yes               HNSW indexes new vectors immediately

Our choice: IVFFlat (lists=100). In a multi-tenant SaaS, tenants typically hold 100-10,000 nodes — IVFFlat is more than sufficient at this scale, and index rebuilds can be integrated into the BullMQ pipeline.


4. Async Embedding Pipeline — BullMQ Architecture

4.1 Why Async?

Embedding generation cannot run synchronously: an email sync can bring in 500 emails, each requiring a separate API call. Done synchronously, a Gmail sync would take 500 × 200ms = 100 seconds — unacceptable.

4.2 The Pipeline Architecture

Gmail/Calendar Connector
        │
        ▼
   EventWorker (concurrency: 5)
        │  Node creation/update
        │  Edge management (BELONGS_TO, EMAILED, etc.)
        ▼
   EmbeddingQueue.add({ nodeId, content })
        │
        ▼
   EmbeddingWorker (concurrency: 3, rate: 50/min)
        │  OpenAI API call
        │  Vector save → knowledge_nodes.embedding
        ▼
   Ready for search ✓

4.3 Text Preparation

The quality of the embedding depends on the quality of the input text. The pipeline:

  1. Content assembly by type:

    • Email: Subject: ${subject}\nFrom: ${from}\n${body}
    • Calendar: ${summary} - ${start} - ${location}\n${description}
    • Client: ${name} - ${email} - ${notes}
  2. Cleanup: HTML tag removal, whitespace normalization

  3. Truncate: Max 8000 characters (text-embedding-3-small has an 8191 token limit, and the character/token ratio is ~4:1 for Hungarian text)

function prepareTextForEmbedding(node) {
    let text = '';
    switch (node.type) {
        case 'email':
            text = `Email subject: ${node.label}\n${node.content}`;
            break;
        case 'calendar_event':
            text = `Event: ${node.label}\n${node.properties?.location || ''}\n${node.content}`;
            break;
        case 'client':
            text = `Client: ${node.label}\n${node.content}`;
            break;
        default:
            text = `${node.label}\n${node.content}`;
    }
    return text.replace(/\s+/g, ' ').trim().substring(0, 8000);
}

4.4 Rate Limiting and Error Handling

The OpenAI embedding API rate limit (Tier 3): ~5000 RPM. But for safe operation, we limit to 50 jobs/minute:

const embeddingWorker = new Worker('embedding-queue', processEmbedding, {
    connection: redis,
    concurrency: 3,
    limiter: {
        max: 50,
        duration: 60000  // 50 jobs/minute
    }
});

429 (rate limit) handling: If OpenAI returns 429, the worker takes a 1-hour pause (based on OpenAI's rate limit reset window), then resumes. This is more aggressive than exponential backoff, but more reliable — after a 429, exponential backoff often "oscillates."

Error isolation: A failed embedding doesn't block the processing of other nodes. BullMQ's attempts: 3 + backoff: exponential configuration automatically retries, and after 3 failed attempts, the node stays with embedding = NULL, meaning search won't find it — but the system works.


5. Chunking Strategies — Or Rather: Why We DON'T Chunk

5.1 The Chunking Myth

Most RAG tutorials teach "chunking" first: split documents into 500-1000 token pieces, embed each separately. This is excellent for document-based systems (e.g., processing a 200-page manual).

But business data is different:

  • An email averages 200-500 tokens — a natural unit, no need to split
  • A calendar event is 50-150 tokens
  • A client profile is 100-300 tokens
  • A CRM note is 50-200 tokens

These data are "natural chunks" — if we split them, we lose context.

5.2 The Entity-Based Approach

Our system is entity-based, not document-based:

Document-based:                  Entity-based:
─────────────────                ─────────────────
  PDF → 50 chunks → 50 vectors    Email → 1 node + edges
  No connection between chunks     Calendar → 1 node + edges
  Much redundancy                  Client → 1 node + edges
                                   Natural graph structure

Every business entity (email, event, client, invoice) is one node in the knowledge graph, with one embedding. Context isn't provided by chunk overlaps but by graph edges — an email node is connected to the sender (client), the thread, and the events mentioned in it.

5.3 Exception: Long Texts

If there are long texts (e.g., a 10,000-word product catalog), we can choose from three strategies:

  1. Fixed-size chunking: 500-token pieces, with 50-token overlap. Simple but loses context.
  2. Recursive character text splitting: Splits by paragraphs, then sentences. Better context preservation.
  3. Semantic chunking: Uses embedding-based similarity to decide where to split. Best quality but more expensive.

In our system, long texts are rare (95% of business data is < 1000 tokens), so simple truncation (8000 characters) is sufficient.
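For completeness, strategy 1 (fixed-size chunking with overlap) is simple enough to sketch. This counts characters rather than tokens for readability; the function name and defaults are illustrative, not part of the production pipeline:

```javascript
// Fixed-size chunking with overlap: each chunk shares `overlap` characters
// with the previous one so that sentences cut at a boundary keep some context.
function chunkText(text, chunkSize = 500, overlap = 50) {
    const chunks = [];
    const step = chunkSize - overlap; // advance less than chunkSize => overlap
    for (let start = 0; start < text.length; start += step) {
        chunks.push(text.slice(start, start + chunkSize));
        if (start + chunkSize >= text.length) break; // last chunk reached the end
    }
    return chunks;
}
```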


6. Search Tuning — Cosine Similarity, Threshold, Top-K

6.1 Cosine Similarity in Brief

The cosine similarity of two vectors (A, B): cos(θ) = (A · B) / (||A|| × ||B||). Its value ranges from -1 to 1; 1 = perfectly similar, 0 = orthogonal (no relationship).

In practice, embedding models produce values in the 0.3-0.9 range — values above 0.60 typically indicate "relevant."

6.2 The Threshold Paradox

It seems intuitive: set the threshold high (e.g., 0.80), and only return very relevant results. But:

Threshold  Precision  Recall    Experience
────────────────────────────────────────────────────────────────────────
0.80       Excellent  Very low  Barely finds anything — "no data" experience
0.70       Good       Medium    Relevant items sometimes missed
0.60       Good       Good      Optimal balance with OpenAI model
0.50       Low        High      Many irrelevant results — "noisy" context

0.60 is the optimal threshold with the text-embedding-3-small model, tested on business data. This is model-specific — a different model may have a different optimal value.

6.3 Top-K: Why 8?

Top-K (how many results to request) relates to the token budget:

  • Token budget: 3000 tokens (~12,000 characters)
  • Average node content: 300-400 characters (75-100 tokens)
  • Formatting overhead: ~20 tokens/node (Markdown headers, separators)
  • Graph neighbors: The top-3 results' neighbors are also included

→ 8 direct results + ~5 graph neighbors = ~13 nodes × ~100 tokens = ~1300 tokens, which fits well within the 3000 budget, leaving room for formatting and a "safety margin."

6.4 Multi-Tenant Search Isolation

Search is always tenant-filtered:

SELECT id, label, content, type, source, properties,
       1 - (embedding <=> $1::vector) AS similarity
FROM knowledge_nodes
WHERE provider_id = $2
  AND embedding IS NOT NULL
ORDER BY embedding <=> $1::vector
LIMIT $3;

The provider_id = $2 condition ensures that a tenant never sees another tenant's data — not even accidentally. This isn't just GDPR compliance — it's business-critical: one beauty salon shouldn't see another salon's client data.


7. Context Enrichment — The Power of the Graph

7.1 The Problem: Isolated Vector Results

Vector search returns isolated nodes. But business questions require context:

  • "When is Kiss Anna coming next?" → Need the calendar event + the client profile (e.g., allergy information)
  • "What was Saturday's email about?" → Need the email + the full thread + the affected client
  • "How much revenue was there in March?" → Need invoices + the affected clients

7.2 Loading 1-Hop Neighbors

The solution: load the graph neighbors of the top vector results. 1-hop = direct neighbors (nodes 1 edge away).

Vector result: event_calendar_77 (sim: 0.82)
                    │
                    ├── BOOKED ──▶ client_15 "Kiss Anna" (VIP, allergy: X dye)
                    ├── BELONGS_TO ──▶ calendar_google_main
                    └── MENTIONS ──▶ note_23 "Need to mention balayage next time"

client_15 and note_23 were not in the vector search's top-K results (their embeddings differ), but they provide critical context for the answer.

7.3 Relevance Inheritance (Decay Factor)

Neighbors' relevance scores are inherited from the parent's similarity, reduced by a decay factor:

neighbor_score = parent_similarity × decay_factor

In our system, decay_factor = 0.8:

  • If the parent similarity = 0.82 → the neighbor score = 0.82 × 0.8 = 0.656
  • This is still above the threshold (0.60) → included in results
  • A 2-hop neighbor: 0.82 × 0.8 × 0.8 = 0.525 → below threshold → excluded

The decay factor automatically regulates graph traversal depth: the further a neighbor, the lower its score, and it naturally falls out.
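The inheritance rule is small enough to sketch directly. The decay factor and threshold values come from the text; the function shape is illustrative, not the production code:

```javascript
const DECAY_FACTOR = 0.8;
const SIMILARITY_THRESHOLD = 0.60;

// Neighbors at `depth` hops inherit parent_similarity * decay^depth;
// anything below the threshold falls out of the result set.
function scoreNeighbors(parentSimilarity, neighbors, depth = 1) {
    const score = parentSimilarity * Math.pow(DECAY_FACTOR, depth);
    return score >= SIMILARITY_THRESHOLD
        ? neighbors.map(n => ({ ...n, similarity: score }))
        : [];
}
```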

7.4 Why Does This Approach Work?

The relevance inheritance has three important properties:

  1. Direct results are always ranked higher in the results
  2. Neighbors don't "push out" direct results from the token budget
  3. Multiply-referenced entities (e.g., a client node reachable from two emails) receive higher relevance scores

7.5 Recursive CTE Graph Traversal

Neighbor queries use efficient SQL with a recursive CTE:

WITH RECURSIVE related AS (
    -- Starting point: the start node
    SELECT kn.id, kn.type, kn.label, kn.content, kn.properties,
           0 AS depth, ARRAY[kn.id] AS path
    FROM knowledge_nodes kn
    WHERE kn.id = $1

    UNION ALL

    -- Recursion: traverse edges in both directions
    SELECT kn2.id, kn2.type, kn2.label, kn2.content, kn2.properties,
           r.depth + 1, r.path || kn2.id
    FROM related r
    JOIN knowledge_edges ke ON ke.from_node_id = r.id OR ke.to_node_id = r.id
    JOIN knowledge_nodes kn2 ON kn2.id = CASE
        WHEN ke.from_node_id = r.id THEN ke.to_node_id
        ELSE ke.from_node_id
    END
    WHERE r.depth < $2           -- max depth: 1 (or 2 in special cases)
      AND NOT kn2.id = ANY(r.path)  -- cycle prevention
)
SELECT DISTINCT ON (id) * FROM related WHERE depth > 0;

Cycle prevention: The path array contains the nodes visited so far. If a node already appears in the path, recursion doesn't continue on that branch. This prevents infinite loops in mutual references (e.g., email → thread → email).
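The same traversal can be expressed in memory: breadth-first, 1 hop by default, with a visited set playing the role of the CTE's path array. This is an illustrative sketch (edges as plain { from, to } objects), not the production query:

```javascript
// Bidirectional 1..maxDepth-hop traversal with cycle prevention.
function getNeighbors(startId, edges, maxDepth = 1) {
    const visited = new Set([startId]);
    let frontier = [startId];
    const result = [];
    for (let depth = 1; depth <= maxDepth; depth++) {
        const next = [];
        for (const id of frontier) {
            for (const e of edges) {
                // Traverse the edge in either direction.
                const other = e.from === id ? e.to : (e.to === id ? e.from : null);
                if (other && !visited.has(other)) {
                    visited.add(other); // cycle prevention
                    result.push({ id: other, depth });
                    next.push(other);
                }
            }
        }
        frontier = next;
    }
    return result;
}
```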

7.6 Configurable Parameters

Parameter Value Effect
graphMaxDepth 1 How many hops into the graph (1 = direct neighbors)
graphMaxNeighbors 5 Maximum neighbors per starting node
graphMaxEnriched 3 How many of the top-K vector results get enriched with graph neighbors
decayFactor 0.8 Relevance inheritance factor for neighbors

Why 1 hop and not 2? 2-hop neighbors exponentially increase the result set (5 neighbors × 5 = 25 second-hop nodes), and relevance drops to decay² = 0.64 — which is already near the threshold (0.60). Therefore, 1-hop is the optimal balance between context richness and cost-effectiveness.


8. The Complete RAG Pipeline — In 5 Steps

8.1 Architectural Overview

User Question
      │
      ▼
  ┌─────────────────────────────────────────────────────────┐
  │  retrieveRAGContext(providerId, message)                  │
  │                                                          │
  │  1. Guard: message.length >= 3?                          │
  │     └─ No → return emptyResult()                         │
  │                                                          │
  │  2. Vector Search                                        │
  │     generateEmbedding(message)                           │
  │     → searchByEmbedding(embedding, providerId, topK=8)   │
  │     → filter: similarity > 0.60                          │
  │                                                          │
  │  3. Graph Enrichment                                     │
  │     Top-3 results → getNeighbors() per node              │
  │     → neighbors decay = 0.8                              │
  │     → max 5 neighbors per node                           │
  │                                                          │
  │  4. Dedup + Rank + Token Budget                          │
  │     → deduplication by ID (higher score wins)            │
  │     → sort by similarity desc                            │
  │     → greedy packing: 3000 token budget                  │
  │                                                          │
  │  5. Format + Inject                                      │
  │     → Markdown context grouped by type                   │
  │     → Source objects for the frontend                     │
  │     → As system message for the LLM                      │
  └─────────────────────────────────────────────────────────┘
      │
      ▼
  LLM response generation based on context

8.2 Step 1: Guard — Input Validation

if (!message || message.trim().length < RAG_CONFIG.minQueryLength) {
    return emptyResult();
}

Why is this needed? A 1-2 character message (e.g., "hi", "ok") carries no semantic content — its embedding is "average," which would return irrelevant results. The 3-character minimum filters these out.

8.3 Step 2: Vector Search

The user message is embedded with the same model as the data (text-embedding-3-small). This is critical: if the query embedding and stored embeddings come from different models, cosine similarity is meaningless.

The search uses pgvector's cosine distance operator, with provider_id tenant filtering.

8.4 Step 3: Graph Enrichment

For the top-3 vector results (not all 8 — performance-conscious), we load their 1-hop neighbors. Neighbors receive their parent's similarity × 0.8 decay.

8.5 Step 4: Deduplication, Ranking, Token Budget

All nodes (vector + graph) go into a single list:

  1. Deduplication: If a node appears multiple times (e.g., a client is a neighbor of two emails), the higher score remains
  2. Ranking: Descending by similarity
  3. Token budget packing: Greedy algorithm — following the ranking, we add nodes until the 3000-token budget (estimated with the character/4 heuristic) is exhausted. Each node gets max 400 characters of content.

let usedTokens = 0;
const selected = [];

for (const node of rankedNodes) {
    const content = (node.content || '').substring(0, RAG_CONFIG.maxContentLength);
    const estimatedTokens = Math.ceil(content.length / 4) + 20; // +20 for formatting
    if (usedTokens + estimatedTokens > RAG_CONFIG.maxContextTokens) break;
    usedTokens += estimatedTokens;
    selected.push({ ...node, content });
}
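Deduplication (step 1 above) can be sketched the same way; this is an illustrative helper, not the production code:

```javascript
// Merge vector results and graph neighbors: when the same node id appears
// twice, the higher similarity score wins; output is ranked descending.
function dedupeById(nodes) {
    const best = new Map();
    for (const node of nodes) {
        const existing = best.get(node.id);
        if (!existing || node.similarity > existing.similarity) {
            best.set(node.id, node);
        }
    }
    return [...best.values()].sort((a, b) => b.similarity - a.similarity);
}
```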

8.6 Step 5: Markdown Formatting and Injection

The selected nodes are transformed into type-grouped Markdown:

📧 **Emails:**
- **Appointment change** (2026-03-10)
  Dear Salon! I'd like to modify my Friday appointment...
  _Source: Gmail_

📅 **Calendar Events:**
- **Haircut + coloring** (2026-03-15 14:00)
  Kiss Anna - 90 minutes
  _Source: Google Calendar_

👤 **Clients:**
- **Kiss Anna**
  Phone: +36-30-123-4567, VIP client, allergy: certain dyes
  _Source: CRM_

This Markdown is inserted as a system message into the LLM context — separated from the main system prompt:

LLM message array:
  [0] system  → Main system prompt (personality, rules, available tools)
  [1] system  → RAG context (the Markdown above)
  [2-N] user/assistant → Previous conversation (max 50 messages)
  [N+1] user  → Current question

8.7 Source Attribution for the Frontend

The RAG pipeline doesn't just provide context to the LLM — it also sends source references back to the frontend:

const sources = selectedNodes.map(node => ({
    type: node.type,
    label: node.label,
    snippet: node.content?.substring(0, 150),
    source: SOURCE_LABELS[node.source],
    icon: SOURCE_ICONS[node.source],
    similarity: node.similarity,
    nodeId: node.id
}));

This allows the frontend to display a "Sources" section below the response — showing what data the LLM based its answer on, alongside the LLM's response.


9. Hybrid Search — The Next Step

Purely vector-based search isn't perfect:

  • Exact name search ("Kiss Anna"): the vector searches by "meaning" — "Kiss Anna" is semantically similar to "Nagy Éva" (both are person names)
  • Number identifiers ("#INV-2024-0042"): the vector has no concept of exact number patterns
  • Rare domain term ("balayage"): if the model doesn't know it, the vector doesn't represent it well
  • Short, exact query ("email address"): too generic, many false positives

9.2 Hybrid: Vector + BM25

The solution: combine semantic search with keyword-based search (BM25 or PostgreSQL full-text search):

User Question
      │
      ├─── Semantic Search (pgvector cosine) → top-K list + score
      │
      └─── Keyword Search (tsvector/BM25) → top-K list + score
                │
                ▼
        Reciprocal Rank Fusion (RRF)
                │
                ▼
        Combined, ranked result

Reciprocal Rank Fusion (RRF) formula:

RRF(d) = Σ 1 / (k + rank_r(d))   where k is typically 60

The advantage of RRF is that no score normalization is needed — ranking position matters, not absolute values.
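A minimal in-memory sketch of RRF over two ranked id lists (k = 60 as above; the + 1 converts zero-based array indices into one-based ranks):

```javascript
// Fuse two best-first id lists with Reciprocal Rank Fusion:
// score(d) = sum over lists of 1 / (k + rank(d)).
function rrfFuse(vectorIds, textIds, k = 60) {
    const scores = new Map();
    for (const [index, id] of vectorIds.entries()) {
        scores.set(id, (scores.get(id) || 0) + 1 / (k + index + 1));
    }
    for (const [index, id] of textIds.entries()) {
        scores.set(id, (scores.get(id) || 0) + 1 / (k + index + 1));
    }
    return [...scores.entries()]
        .sort((a, b) => b[1] - a[1])
        .map(([id]) => id);
}
```

A document ranked in both lists (like 'b' below) beats a document ranked slightly higher in only one list, which is exactly the behavior hybrid search wants.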

9.3 PostgreSQL-Native Implementation

The pgvector + pg_trgm + tsvector combination enables hybrid search in a single database:

-- Full-text search index (if not already present)
CREATE INDEX idx_knowledge_nodes_fts
    ON knowledge_nodes USING gin (to_tsvector('hungarian', label || ' ' || content));

-- Hybrid query: vector + full-text, RRF
WITH vector_results AS (
    SELECT id, label, content,
           ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS v_rank
    FROM knowledge_nodes
    WHERE provider_id = $2 AND embedding IS NOT NULL
    LIMIT 20
),
text_results AS (
    SELECT id, label, content,
           ROW_NUMBER() OVER (ORDER BY ts_rank_cd(
               to_tsvector('hungarian', label || ' ' || content),
               plainto_tsquery('hungarian', $3)
           ) DESC) AS t_rank
    FROM knowledge_nodes
    WHERE provider_id = $2
      AND to_tsvector('hungarian', label || ' ' || content)
          @@ plainto_tsquery('hungarian', $3)
    LIMIT 20
)
SELECT COALESCE(v.id, t.id) AS id,
       COALESCE(v.label, t.label) AS label,
       1.0 / (60 + COALESCE(v.v_rank, 1000))
         + 1.0 / (60 + COALESCE(t.t_rank, 1000)) AS rrf_score
FROM vector_results v
FULL OUTER JOIN text_results t ON v.id = t.id
ORDER BY rrf_score DESC
LIMIT 8;

9.4 When to Switch to Hybrid?

Signal                                           Indication
─────────────────────────────────────────────────────────────────────────────────────
Users frequently search by name or identifier    Strong indication
Precision is low (many irrelevant results)       Moderate indication
Recall is low (relevant data is being missed)    Moderate indication
The system works well with vector search         No indication — don't optimize without reason

10. Re-ranking — The Final Mile of Quality

10.1 The Problem

The bi-encoder (embedding model) is fast and efficient, but measures similarity by comparing separately computed vectors. It doesn't "read" the query and document together.

The cross-encoder reads them together — therefore more accurate, but slower:

Bi-encoder (embedding):
  Query → [vector_q]    Document → [vector_d]    cosine(q, d) → score
  Speed: ~1000 doc/sec    Accuracy: ★★★☆☆

Cross-encoder (re-ranker):
  [Query + Document] → score
  Speed: ~50 doc/sec     Accuracy: ★★★★★

10.2 The Two-Stage Solution

The solution: the bi-encoder (pgvector) filters (top-K), the cross-encoder ranks:

Question → pgvector cosine (1ms, top-20 from 50K documents)
             │
             ▼
           Cross-encoder re-rank (200ms, on 20 documents)
             │
             ▼
           Top-8 truly relevant results

10.3 Available Re-ranker Solutions

Solution                               Type       Latency        Accuracy   Price
───────────────────────────────────────────────────────────────────────────────────────────
Cohere Rerank v3                       API        ~200ms/20 doc  Excellent  $2/1000 search
Jina Reranker v2                       API/local  ~150ms/20 doc  Good       $1/1000 search
cross-encoder/ms-marco-MiniLM-L-12-v2  Local      ~300ms/20 doc  Good       $0 (GPU)
Voyage Reranker                        API        ~180ms/20 doc  Good       $0.05/1M tokens
OpenAI                                 —          —              —          No native re-ranker

10.4 When to Use a Re-ranker?

Situation                        Recommendation
──────────────────────────────────────────────────────────────────────────────────
< 5K nodes per tenant            Not necessary — top-8 cosine is good enough
5K – 50K nodes                   Worth considering — hybrid + re-rank can improve
50K+ nodes                       Strongly recommended — precision improves significantly
Threshold tuning doesn't help    Re-ranking may solve it

11. Evaluation Framework — How Do You Know It's Working?

11.1 The RAG Evaluation Problem

Evaluating semantic search and RAG pipelines is harder than evaluating a traditional search engine because:

  1. There's no clear "correct answer" — relevance is subjective
  2. Response quality is the combined performance of search + LLM
  3. Evaluation is expensive (human annotation or LLM-based scoring)

11.2 RAGAS Metrics

The RAGAS framework is the industry standard for RAG evaluation:

  • Context Precision — Is the context relevant to the question? Measured by how many of the top-K results are relevant (top-1 weighted most heavily).
  • Context Recall — Does the context contain all information needed for the answer? An LLM compares the ground truth with the context.
  • Faithfulness — Is the answer faithful to the context, with no hallucination? An LLM checks whether the answer's claims come from the context.
  • Answer Relevancy — Does the answer actually respond to the question? An LLM generates questions from the answer and compares them with the original.

11.3 Practical Evaluation Method

Build a golden dataset with 50-100 real questions:

{
    "question": "When was Kiss Anna's last visit?",
    "expected_context": ["event_77", "client_15"],
    "expected_answer_contains": ["2026-03-15", "haircut"],
    "category": "appointment_lookup"
}

Automated evaluation cycle:

  1. Run the 50 questions through the RAG pipeline
  2. Measure: Context Precision, Context Recall, Faithfulness
  3. Vary the parameters (threshold: 0.55/0.60/0.65, topK: 5/8/12)
  4. Visualize: precision-recall curves for different configurations
  5. Choose the best balance
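Step 2 of this cycle can be partially automated. A minimal sketch of context precision for a single golden-dataset entry, using exact node-id matching (RAGAS itself uses LLM-judged relevance, so this is a simplification):

```javascript
// What fraction of the retrieved node ids were expected for this question?
function contextPrecision(retrievedIds, expectedIds) {
    if (retrievedIds.length === 0) return 0;
    const expected = new Set(expectedIds);
    const hits = retrievedIds.filter(id => expected.has(id)).length;
    return hits / retrievedIds.length;
}
```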

11.4 A/B Testing in Production

If there's sufficient traffic, A/B testing is the ground truth:

Group        Configuration                             Measured KPI
──────────────────────────────────────────────────────────────────────────────────
A (control)  threshold=0.60, topK=8, no re-rank        User satisfaction, back-question rate
B (test)     threshold=0.55, topK=12, Cohere re-rank   User satisfaction, back-question rate

The "back-question rate" (how often users ask follow-up questions because they didn't get a good answer) is one of the best proxy metrics for RAG quality.


12. Knowledge Graph + RAG: The GraphRAG Approach

12.1 What Does the Graph Add to RAG?

Traditional RAG is "flat" — it searches documents. GraphRAG adds structure:

Aspect                        Traditional RAG                GraphRAG
──────────────────────────────────────────────────────────────────────────────────────────
Search                        Vector similarity              Vector + graph traversal
Context                       Isolated documents             Connected entity network
"Who did it?" type questions  Weak (the name is just text)   Strong (has client entity + edges)
Multi-hop questions           Cannot answer                  2-hop: client → assigned → calendar
Deduplication                 None (duplicate chunks)        Natural (one entity = one node)

12.2 Our GraphRAG Implementation

Our system handles 9 node types and 8 edge types:

Node types:

Type            Source                  Typical Content
──────────────────────────────────────────────────────────────────────
email           Gmail connector         Email body, subject, date
email_thread    Gmail connector         Full thread summaries
calendar_event  Google Calendar         Time, participants, location
client          CRM module              Name, contact info, preferences
deal            CRM module              Sales opportunity, amount, status
task            CRM module              Task description, deadline, assignee
appointment     Booking system          Time, service, client
note            CRM module              Free-text note
invoice         Számlázz.hu / Billingo  Item, amount, date, status

Edge types (bidirectional traversal):

EMAILED     — email send/receive relationship
BOOKED      — booking relationship (client → appointment)
PAID        — payment relationship (client → invoice)
MENTIONS    — reference (any node → any node)
TAGGED      — tagging
ASSIGNED    — assignment (task → user)
BELONGS_TO  — grouping (email → thread, deal → client)
SENT_TO     — targeted sending

12.3 Graph Advantages with Real Questions

Question: "How much did Kiss Anna spend in the last 3 months?"

Traditional RAG result:
  → Finds an email mentioning payment
  → LLM estimates

GraphRAG result:
  1. Vector search → deal_15 "Kiss Anna package" (sim: 0.75)
  2. Graph enrichment:
     deal_15 ──BELONGS_TO──▶ client_15 "Kiss Anna"
     client_15 ──PAID──▶ invoice_23 "45,000 HUF 2026-02"
     client_15 ──PAID──▶ invoice_31 "38,000 HUF 2026-01"
     client_15 ──BOOKED──▶ appointment_44 "2026-03-15"
  3. LLM gives a precise answer: "Kiss Anna spent 83,000 HUF
     in the last 3 months"

13. Production Operations — Monitoring, Drift, Re-indexing

13.1 What to Monitor?

Metric                      How?                           Alert if...
─────────────────────────────────────────────────────────────────────────────────────────────────
Embedding queue depth       BullMQ getJobCounts()          Waiting > 1000 (backlog)
Embedding queue error rate  Failed jobs / total            > 5% (API issue)
Average search latency      pgvector query time            > 500ms (index issue)
Average similarity score    RAG pipeline log               Average < 0.55 (drift or bad data)
Empty RAG result rate       RAG emptyResult() ratio        > 40% (threshold too high, or insufficient data)
429 rate limit events       Embedding worker log           > 3/day (rate limit too aggressive)
Node count per tenant       COUNT(*) GROUP BY provider_id  < 50 (tenant not using the system)
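The first two rows of the table translate naturally into code. A sketch of an alert check over BullMQ job counts (the input shape mirrors what queue.getJobCounts() returns; the function itself and its alert strings are illustrative, with thresholds taken from the table):

```javascript
// Evaluate queue-health alerts from a BullMQ-style job-count snapshot.
function evaluateQueueAlerts(counts) {
    const alerts = [];
    if (counts.waiting > 1000) alerts.push('embedding backlog');
    const total = counts.completed + counts.failed;
    if (total > 0 && counts.failed / total > 0.05) alerts.push('high error rate');
    return alerts;
}
```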

13.2 Embedding Drift

Models change. If OpenAI updates text-embedding-3-small (they haven't yet, but they marked the older text-embedding-ada-002 as deprecated), old and new vectors become incompatible. This is "drift" — search quality gradually degrades.

Prevention:

  1. Model version logging: Store which model version generated each embedding
  2. Full re-embedding capability: Have a script that re-embeds all nodes
  3. Canary tests: Regularly run the golden dataset — if precision drops, suspect drift

13.3 Re-indexing Strategy

When to re-index?

Reason                                           Frequency
─────────────────────────────────────────────────────────────────────
pgvector index type switch (IVFFlat → HNSW)      One-time
IVFFlat lists parameter increase (data growth)   Check quarterly
Embedding model change                           One-time, full
Significant data volume change (+100%)           Recalculate lists parameter

Zero-downtime re-indexing:

-- 1. Build new index CONCURRENTLY (doesn't lock the table)
CREATE INDEX CONCURRENTLY idx_knowledge_nodes_embedding_new
    ON knowledge_nodes USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);

-- 2. Drop old index (CONCURRENTLY avoids blocking concurrent queries;
--    note it cannot run inside a transaction block)
DROP INDEX CONCURRENTLY idx_knowledge_nodes_embedding;

-- 3. Rename
ALTER INDEX idx_knowledge_nodes_embedding_new
    RENAME TO idx_knowledge_nodes_embedding;

13.4 The Resilience Matrix

Every component in the full pipeline is fail-safe:

Component               On Failure                          Impact on User
──────────────────────────────────────────────────────────────────────────────────────
RAG pipeline (rag.js)   emptyResult() return                LLM responds, but without context
Graph enrichment        Per-node try/catch, skip            Less context, but works
Embedding worker        Queue pause + retry                 New data temporarily unsearchable
Event worker            Per-entity try/catch                Partial processing, rest continues
Connector sync          Writes to SyncLog: PARTIAL/FAILED   Visible on admin dashboard
Context loading         try/catch → null                    System prompt works without knowledge context
Tool execution          Per-tool try/catch, error → LLM     LLM tries a new strategy
Token budget            Hard cap 3000 tokens                Never overflows

The principle: The AI assistant always responds. If there's no context, it responds without the knowledge graph. If there are no tools, it responds without them. Degradation is gradual, never total.
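The pattern behind the matrix is a small wrapper that converts failures into degraded-but-usable values. A minimal sketch — the function name and usage are illustrative, not the actual rag.js API:

```javascript
// Wraps a pipeline stage so a failure degrades to a fallback value
// instead of blocking the AI response. Names are illustrative.
async function degradeTo(fallback, stage) {
  try {
    return await stage();
  } catch (err) {
    console.warn(`stage failed, degrading: ${err.message}`);
    return fallback;
  }
}

// Usage: context loading degrades to null, enrichment to the bare hits, etc.
// const context = await degradeTo(null, () => loadKnowledgeContext(question));
```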


14. Fine-tuning Embeddings — When Is It Worth It?

14.1 The Promise and Reality of Fine-tuning

Fine-tuning an embedding model adapts it to domain-specific vocabulary. But:

Aspect               Advantage                                      Disadvantage
────────────────────────────────────────────────────────────────────────────────────────────
Domain vocabulary    Better similarity for domain-specific terms    Data collection required (1000+ pairs)
Language specifics   Better agglutination handling                  Expensive (API: ~$50-200/training)
Accuracy             +3-8% MTEB improvement within domain           General knowledge may degrade
Maintenance          (none)                                         Must redo with every model change

14.2 Alternative: Prompt-Level Embedding Improvement

Before fine-tuning, try improving the input text:

// Instead of:
generateEmbedding(email.body)

// Contextualize:
generateEmbedding(`Email subject: ${email.subject}\nFrom: ${email.from}\n${email.body}`)

This "contextualized embedding" can improve results surprisingly well — the model knows more about the text's context, without any fine-tuning.
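A per-type contextualizer generalizes the idea to all node types. A sketch with illustrative field names, not the production schema:

```javascript
// Builds the text to embed by prepending structural context per node type.
// Field names are illustrative, not the production schema.
function buildEmbeddingInput(node) {
  switch (node.type) {
    case "email":
      return `Email subject: ${node.subject}\nFrom: ${node.from}\n${node.body}`;
    case "appointment":
      return `Appointment: ${node.title}\nDate: ${node.date}\nClient: ${node.client}`;
    case "invoice":
      return `Invoice for ${node.client}: ${node.amount} (${node.date})`;
    default:
      return node.text ?? "";
  }
}

const input = buildEmbeddingInput({
  type: "email",
  subject: "Appointment change",
  from: "anna.kiss@example.com",
  body: "Can we move Friday to 3pm?",
});
// → "Email subject: Appointment change\nFrom: anna.kiss@example.com\nCan we move Friday to 3pm?"
```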

14.3 When to Fine-tune?

Fine-tune only if all of the following are true:

  1. You have 1000+ relevant (query, document, label) triplets
  2. Current precision on the golden dataset is below 70%
  3. Prompt-level improvement didn't help

All three true? → Fine-tune. Otherwise → Don't fine-tune.

15. Summary and Decision Matrix

15.1 Architecture Layers

┌──────────────────────────────────────────────────────────────┐
│                       User Question                           │
├──────────────────────────────────────────────────────────────┤
│  RAG Pipeline                                                 │
│  ┌─────────────┐  ┌──────────────┐  ┌─────────────────────┐ │
│  │ Vector Search│→ │ Graph Enrich │→ │ Dedup + Token Pack  │ │
│  │ (pgvector)   │  │ (1-hop CTE)  │  │ (3000 token budget) │ │
│  └─────────────┘  └──────────────┘  └─────────────────────┘ │
├──────────────────────────────────────────────────────────────┤
│  Knowledge Graph (PostgreSQL)                                 │
│  ┌──────────────────┐  ┌──────────────────┐                  │
│  │  KnowledgeNode    │──│  KnowledgeEdge   │                  │
│  │  (9 types, 1536d) │  │  (8 edge types)  │                  │
│  └──────────────────┘  └──────────────────┘                  │
├──────────────────────────────────────────────────────────────┤
│  Embedding Pipeline (BullMQ)                                  │
│  ┌─────────────┐  ┌──────────────┐  ┌─────────────────────┐ │
│  │ Event Worker │→ │ Embedding Q  │→ │ OpenAI API          │ │
│  │ (conc: 5)   │  │ (50/min)     │  │ (text-emb-3-small)  │ │
│  └─────────────┘  └──────────────┘  └─────────────────────┘ │
├──────────────────────────────────────────────────────────────┤
│  Connectors                                                   │
│  Gmail │ Google Calendar │ CRM │ Invoicing │ Billing          │
└──────────────────────────────────────────────────────────────┘
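The "Dedup + Token Pack" stage in the diagram can be sketched as greedy packing under a budget. A minimal version, using a rough ~4 characters/token estimate for brevity — the real system should count tokens with a proper tokenizer:

```javascript
// Greedy token-budget packing: take results in similarity order until the
// budget is exhausted. The ~4 chars/token estimate is a simplification.
const estimateTokens = (text) => Math.ceil(text.length / 4);

function packContext(results, budget = 3000) {
  const packed = [];
  const seen = new Set();
  let used = 0;
  for (const r of [...results].sort((a, b) => b.sim - a.sim)) {
    if (seen.has(r.id)) continue; // dedup: graph enrichment can revisit nodes
    const cost = estimateTokens(r.text);
    if (used + cost > budget) continue; // over budget: skip, try smaller items
    seen.add(r.id);
    packed.push(r);
    used += cost;
  }
  return packed;
}

const packed = packContext(
  [
    { id: "a", sim: 0.9, text: "x".repeat(40) },   // ~10 tokens, fits
    { id: "a", sim: 0.9, text: "x".repeat(40) },   // duplicate, dropped
    { id: "b", sim: 0.8, text: "x".repeat(4000) }, // ~1000 tokens, over budget
  ],
  100
);
// → only "a" fits
```

Skipping an oversized item and continuing (rather than stopping) lets smaller lower-ranked items still fill the remaining budget.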

15.2 CTO Decision Checklist

Question                         If yes...                                If no...
──────────────────────────────────────────────────────────────────────────────────────────────
Have PostgreSQL?                 pgvector (free, simple)                  Managed vector DB (Pinecone)
< 5M vectors?                    pgvector is more than enough             Consider Qdrant/Milvus
Multi-tenant?                    Provider-level filtering is mandatory!   Simpler architecture
Structured business data?        Knowledge graph + entity-based           Chunking + document-based
Graph-like relationships?        GraphRAG enrichment                      Pure vector RAG
Precision is critical?           Hybrid search + re-ranker                Pure vector search
Non-English primary language?    OpenAI text-embedding-3-small            Benchmark models on your own text

15.3 The Most Important Lessons

  1. Start simple: pgvector + text-embedding-3-small + cosine search. This works in 30 minutes and is sufficient for most SME use cases.

  2. Don't chunk what's a natural unit: Emails, events, client data are better kept whole. The entity-based knowledge graph handles the granularity question.

  3. Graph enrichment is the real differentiator: For "Who?" "When?" "How much?" type questions, vector search alone is weak — loading neighbors brings dramatic quality improvement.

  4. Resilience is not optional: In a production system, the RAG pipeline must not block the AI response. If anything breaks, graceful degradation: we respond with less context, but we respond.

  5. Measure with 50 questions before changing anything: Threshold, top-K, token budget are all tunable — but with data-driven decisions, not intuition.
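Lesson 5 needs only one metric to start: precision over a golden dataset. A sketch — `search` stands in for the retrieval step and the golden-set shape is an assumption, not the project's actual test harness:

```javascript
// Average precision@K over a golden dataset: for each question, what
// fraction of the top-K retrieved ids are in the expected set.
// `search` is an assumed stand-in for the RAG retrieval step.
async function precisionAtK(goldenSet, search, k = 5) {
  let total = 0;
  for (const { question, expectedIds } of goldenSet) {
    const hits = (await search(question)).slice(0, k);
    const relevant = hits.filter((h) => expectedIds.includes(h.id)).length;
    total += relevant / Math.min(k, hits.length || 1);
  }
  return total / goldenSet.length;
}

// Usage: run before and after every threshold / top-K / budget change,
// and only keep the change if the number improves.
// const p = await precisionAtK(goldenQuestions, (q) => vectorSearch(q));
```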


Want to implement semantic search in your own system? The Atlosz Interactive team has production experience with pgvector, knowledge graph, and RAG pipeline architecture. Get in touch for a free technical consultation.