Executive Summary
Semantic search is the critical infrastructure layer of AI-based business applications. While traditional keyword search looks for exact matches, semantic search understands the meaning — enabling an AI assistant to respond to business questions with truly relevant context.
This whitepaper presents the architecture of a real, production multi-tenant SaaS system that:
- Handles 9 different data types (e.g., email, calendar, CRM, invoicing)
- Uses pgvector instead of a dedicated vector database
- Enriches vector search with a knowledge graph
- Optimizes LLM context with a greedy token budget system
- Manages high-volume embedding generation with a BullMQ async pipeline
The whitepaper walks the reader through 15 chapters — from embedding model selection to production monitoring.
1. The Problem — When Search Doesn't Understand What We're Looking For
The Semantic Gap
Imagine: a beauty salon's AI assistant is asked: "When was Kiss Anna's last visit?"
Keyword search result: Nothing. The word "last" doesn't appear in any calendar entry.
Semantic search result: Finds Kiss Anna's most recent calendar entry (March 15, 2026 — haircut + coloring), because it understands that "last visit" = most recent appointment.
But this isn't enough. What if the question is: "When was Kiss Anna's last visit, and what did we do?"
This requires the knowledge graph too: alongside the calendar event, it loads the client profile (VIP, allergic to certain dyes) and the note written during the previous appointment.
Search Approach           What Does It Find?      Can It Answer the Question?
──────────────────────────────────────────────────────────────────────────────
Keyword (LIKE, tsvector)  Exact word matches      ❌ No
Vector (embedding)        Semantically similar    ⚠️ Partially
Vector + graph            Similar + related       ✅ Yes, with context
This whitepaper presents how we got from keyword search to a vector + knowledge graph architecture — and what decisions we had to make along the way.
2. Embedding Models — The Engine of Semantic Search
2.1 What Is an Embedding?
An embedding transforms text (in this case business data: email, calendar event, client profile) into a numerical vector — typically in a 256-3072 dimensional space. The semantic similarity between such vectors can be measured with cosine similarity.
2.2 The Main Selection Criteria
2.3 Our Choice: OpenAI text-embedding-3-small (1536d)
Why?
- Multilingual: Handles Hungarian business text (email, CRM notes) well, without needing a separate Hungarian model
- API simplicity: We already use the OpenAI API for the LLM — single vendor, single API key
- Cost-effective: $0.02/1M tokens — a complete knowledge graph of 10,000 nodes costs ~$0.50 to embed
- Dimensionality: 1536 is a good balance between accuracy and storage/search cost
2.4 Language-Specific Considerations
Hungarian is an agglutinative language — "futottam" (I ran), "futottál" (you ran), "futottunk" (we ran) are all different tokens but semantically close. OpenAI's BPE tokenization handles this partially, but very rare compound words (e.g., "munkavállalói-érdekképviseleti-tanácsadó" — employee-representation-advisor) may fragment.
Practical experience: text-embedding-3-small works surprisingly well on Hungarian business text (email, CRM, calendar) — cosine similarity values are consistent with semantic similarity. For rare domain-specific terms (e.g., specific cosmetic procedures), contextualized embedding (see Chapter 14.2) can help.
2.5 Decision Tree for Model Selection
Does data need to stay within the EU?
├── Yes → Local model (bge-m3, E5-mistral) or EU-region API
│   ├── Have GPU infrastructure? → Local
│   └── No → EU-region OpenAI / Cohere
└── No → API-based
    ├── Cost-sensitive? → OpenAI text-embedding-3-small ($0.02)
    ├── Maximum accuracy? → Cohere embed-v4 or E5-mistral
    └── Already have OpenAI integration? → text-embedding-3-small (simplicity)
3. Vector Storage — pgvector vs. Dedicated Vector Databases
3.1 The Decision: PostgreSQL + pgvector
The deciding argument: Knowledge graph nodes and edges are in the same database as the vectors. A single SQL query can perform vector search + graph traversal + tenant filtering — no network latency between two systems.
3.2 The knowledge_nodes Table
CREATE TABLE knowledge_nodes (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  provider_id UUID NOT NULL REFERENCES providers(id),
  type        VARCHAR(50) NOT NULL,  -- 'email', 'calendar_event', 'client', etc.
  source      VARCHAR(50),           -- 'gmail', 'google_calendar', 'crm', etc.
  external_id VARCHAR(255),          -- original system ID
  label       VARCHAR(500),          -- human-readable title
  content     TEXT,                  -- full text content
  properties  JSONB DEFAULT '{}',    -- type-specific metadata
  embedding   vector(1536),          -- OpenAI text-embedding-3-small
  created_at  TIMESTAMPTZ DEFAULT NOW(),
  updated_at  TIMESTAMPTZ DEFAULT NOW(),
  CONSTRAINT unique_external UNIQUE (provider_id, source, external_id)
);

-- Vector search index
CREATE INDEX idx_knowledge_nodes_embedding
  ON knowledge_nodes USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

-- Tenant filtering index
CREATE INDEX idx_knowledge_nodes_provider
  ON knowledge_nodes (provider_id);
3.3 Indexing: IVFFlat vs. HNSW
Our choice: IVFFlat (lists=100). In a multi-tenant SaaS, tenants typically hold 100-10,000 nodes — IVFFlat is more than sufficient at this scale, and index rebuilds can be integrated into the BullMQ pipeline.
4. Async Embedding Pipeline — BullMQ Architecture
4.1 Why Async?
Embedding generation cannot be synchronous: an email sync can bring in 500 emails, each requiring its own API call. At ~200 ms per call, a synchronous Gmail sync would take 500 × 200 ms = 100 seconds, which is unacceptable.
4.2 The Pipeline Architecture
Gmail/Calendar Connector
│
▼
EventWorker (concurrency: 5)
│ Node creation/update
│ Edge management (BELONGS_TO, EMAILED, etc.)
▼
EmbeddingQueue.add({ nodeId, content })
│
▼
EmbeddingWorker (concurrency: 3, rate: 50/min)
│ OpenAI API call
│ Vector save → knowledge_nodes.embedding
▼
Ready for search ✓
4.3 Text Preparation
The quality of the embedding depends on the quality of the input text. The pipeline:
- Content assembly by type:
  - Email: Subject: ${subject}\nFrom: ${from}\n${body}
  - Calendar: ${summary} - ${start} - ${location}\n${description}
  - Client: ${name} - ${email} - ${notes}
- Cleanup: HTML tag removal, whitespace normalization
- Truncate: max 8000 characters (text-embedding-3-small has an 8191-token limit, and the character/token ratio is ~4:1 for Hungarian text)
function prepareTextForEmbedding(node) {
  let text = '';
  switch (node.type) {
    case 'email':
      text = `Email subject: ${node.label}\n${node.content}`;
      break;
    case 'calendar_event':
      text = `Event: ${node.label}\n${node.properties?.location || ''}\n${node.content}`;
      break;
    case 'client':
      text = `Client: ${node.label}\n${node.content}`;
      break;
    default:
      text = `${node.label}\n${node.content}`;
  }
  return text.replace(/\s+/g, ' ').trim().substring(0, 8000);
}
4.4 Rate Limiting and Error Handling
The OpenAI embedding API rate limit (Tier 3): ~5000 RPM. But for safe operation, we limit to 50 jobs/minute:
const embeddingWorker = new Worker('embedding-queue', processEmbedding, {
  connection: redis,
  concurrency: 3,
  limiter: {
    max: 50,
    duration: 60000 // 50 jobs/minute
  }
});
429 (rate limit) handling: If OpenAI returns 429, the worker takes a 1-hour pause (based on OpenAI's rate limit reset window), then resumes. This is more aggressive than exponential backoff, but more reliable — after a 429, exponential backoff often "oscillates."
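A sketch of this pause logic, with illustrative helper names (noteRateLimit and isPaused are not the production code; the OpenAI call is injected as a parameter so the snippet stays self-contained):

```javascript
const PAUSE_MS = 60 * 60 * 1000; // 1-hour pause, matching the reset window above
let pausedUntil = 0;             // shared across the worker's jobs

function noteRateLimit(now = Date.now()) {
  pausedUntil = now + PAUSE_MS;
}

function isPaused(now = Date.now()) {
  return now < pausedUntil;
}

// BullMQ processor shape: throwing re-queues the job via attempts/backoff
async function processEmbedding(job, generateEmbedding) {
  if (isPaused()) {
    throw new Error('embedding paused after 429'); // retried later by BullMQ
  }
  try {
    return await generateEmbedding(job.data.content);
  } catch (err) {
    if (err.status === 429) noteRateLimit(); // start the 1-hour pause
    throw err;
  }
}
```

In the real worker, every job that arrives during the pause fails fast without touching the API, and BullMQ's retry configuration brings it back once the pause has expired.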
Error isolation: A failed embedding doesn't block the processing of other nodes. BullMQ's attempts: 3 + backoff: exponential configuration automatically retries, and after 3 failed attempts, the node stays with embedding = NULL, meaning search won't find it — but the system works.
5. Chunking Strategies — Or Rather: Why We DON'T Chunk
5.1 The Chunking Myth
Most RAG tutorials teach "chunking" first: split documents into 500-1000 token pieces, embed each separately. This is excellent for document-based systems (e.g., processing a 200-page manual).
But business data is different:
- An email averages 200-500 tokens — a natural unit, no need to split
- A calendar event is 50-150 tokens
- A client profile is 100-300 tokens
- A CRM note is 50-200 tokens
These data are "natural chunks" — if we split them, we lose context.
5.2 The Entity-Based Approach
Our system is entity-based, not document-based:
Document-based:                     Entity-based:
─────────────────                   ─────────────────
PDF → 50 chunks → 50 vectors        Email    → 1 node + edges
No connection between chunks        Calendar → 1 node + edges
Much redundancy                     Client   → 1 node + edges
                                    Natural graph structure
Every business entity (email, event, client, invoice) is one node in the knowledge graph, with one embedding. Context isn't provided by chunk overlaps but by graph edges — an email node is connected to the sender (client), the thread, and the events mentioned in it.
5.3 Exception: Long Texts
If there are long texts (e.g., a 10,000-word product catalog), we can choose from three strategies:
- Fixed-size chunking: 500-token pieces, with 50-token overlap. Simple but loses context.
- Recursive character text splitting: Splits by paragraphs, then sentences. Better context preservation.
- Semantic chunking: Uses embedding-based similarity to decide where to split. Best quality but more expensive.
In our system, long texts are rare (95% of business data is < 1000 tokens), so simple truncation (8000 characters) is sufficient.
6. Search Tuning — Cosine Similarity, Threshold, Top-K
6.1 Cosine Similarity in Brief
The cosine similarity of two vectors (A, B): cos(θ) = (A · B) / (||A|| × ||B||). Its value ranges from -1 to 1; 1 = perfectly similar, 0 = orthogonal (no relationship).
In practice, embedding models produce values in the 0.3-0.9 range — values above 0.60 typically indicate "relevant."
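The formula translates directly into a few lines of code. A minimal illustrative sketch (in production, pgvector's <=> operator computes cosine distance server-side, so this function is never on the hot path):

```javascript
// cos(θ) = (A · B) / (||A|| × ||B||) for two equal-length vectors
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Orthogonal vectors score 0, parallel vectors score 1, regardless of vector length.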
6.2 The Threshold Paradox
It seems intuitive to set the threshold high (e.g., 0.80) so that only very relevant results come back. In practice, a high threshold also discards relevant results whose wording differs from the query, so recall collapses.
Tested on business data, 0.60 proved to be the optimal threshold with the text-embedding-3-small model. This is model-specific — a different model may have a different optimal value.
6.3 Top-K: Why 8?
Top-K (how many results to request) relates to the token budget:
- Token budget: 3000 tokens (~12,000 characters)
- Average node content: 300-400 characters (75-100 tokens)
- Formatting overhead: ~20 tokens/node (Markdown headers, separators)
- Graph neighbors: The top-3 results' neighbors are also included
→ 8 direct results + ~5 graph neighbors = ~13 nodes × ~100 tokens = ~1300 tokens, which fits well within the 3000 budget, leaving room for formatting and a "safety margin."
6.4 Multi-Tenant Search Isolation
Search is always tenant-filtered:
SELECT id, label, content, type, source, properties,
       1 - (embedding <=> $1::vector) AS similarity
FROM knowledge_nodes
WHERE provider_id = $2
  AND embedding IS NOT NULL
ORDER BY embedding <=> $1::vector
LIMIT $3;
The provider_id = $2 condition ensures that a tenant never sees another tenant's data — not even accidentally. This isn't just GDPR compliance — it's business-critical: one beauty salon shouldn't see another salon's client data.
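In application code, the same query can be wrapped in a parameterized call. A sketch following node-postgres conventions (the pool is injected to keep the snippet self-contained; pgvector accepts bracketed text literals such as '[0.1,0.2]' cast to ::vector):

```javascript
// pgvector parses a bracketed text literal when cast to ::vector
function toVectorLiteral(embedding) {
  return `[${embedding.join(',')}]`;
}

// Tenant-filtered vector search; `pool` is a node-postgres Pool (injected)
async function searchByEmbedding(pool, embedding, providerId, topK = 8) {
  const { rows } = await pool.query(
    `SELECT id, label, content, type, source, properties,
            1 - (embedding <=> $1::vector) AS similarity
       FROM knowledge_nodes
      WHERE provider_id = $2
        AND embedding IS NOT NULL
      ORDER BY embedding <=> $1::vector
      LIMIT $3`,
    [toVectorLiteral(embedding), providerId, topK]
  );
  return rows;
}
```

Parameterized values ($1, $2, $3) keep the tenant filter out of string concatenation, so a malformed query can never leak across tenants.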
7. Context Enrichment — The Power of the Graph
7.1 The Problem: Isolated Vector Results
Vector search returns isolated nodes. But business questions require context:
- "When is Kiss Anna coming next?" → Need the calendar event + the client profile (e.g., allergy information)
- "What was Saturday's email about?" → Need the email + the full thread + the affected client
- "How much revenue was there in March?" → Need invoices + the affected clients
7.2 Loading 1-Hop Neighbors
The solution: load the graph neighbors of the top vector results. 1-hop = direct neighbors (nodes 1 edge away).
Vector result: event_calendar_77 (sim: 0.82)
  │
  ├── BOOKED ──▶ client_15 "Kiss Anna" (VIP, allergy: X dye)
  ├── BELONGS_TO ──▶ calendar_google_main
  └── MENTIONS ──▶ note_23 "Need to mention balayage next time"
client_15 and note_23 were not in the vector search's top-K results (their embeddings differ), but they provide critical context for the answer.
7.3 Relevance Inheritance (Decay Factor)
Neighbors' relevance scores are inherited from the parent's similarity, reduced by a decay factor:
neighbor_score = parent_similarity × decay_factor
In our system, decay_factor = 0.8:
- If the parent similarity = 0.82 → the neighbor score = 0.82 × 0.8 = 0.656
- This is still above the threshold (0.60) → included in results
- A 2-hop neighbor: 0.82 × 0.8 × 0.8 = 0.525 → below threshold → excluded
The decay factor automatically regulates graph traversal depth: the further a neighbor, the lower its score, and it naturally falls out.
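The decay arithmetic as a short sketch (constant names are illustrative):

```javascript
const DECAY_FACTOR = 0.8;          // per-hop decay
const SIMILARITY_THRESHOLD = 0.6;  // same threshold as direct results

// A neighbor inherits the parent's similarity, reduced once per hop
function neighborScore(parentSimilarity, hops = 1) {
  return parentSimilarity * Math.pow(DECAY_FACTOR, hops);
}

function isRelevant(score) {
  return score > SIMILARITY_THRESHOLD;
}

// neighborScore(0.82, 1) ≈ 0.656 → kept
// neighborScore(0.82, 2) ≈ 0.525 → dropped
```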
7.4 Why Does This Approach Work?
The relevance inheritance has three important properties:
- Direct results are always ranked higher in the results
- Neighbors don't "push out" direct results from the token budget
- Multiply-referenced entities (e.g., a client node reachable from two emails) receive higher relevance scores
7.5 Recursive CTE Graph Traversal
Neighbor queries use efficient SQL with a recursive CTE:
WITH RECURSIVE related AS (
  -- Starting point: the start node
  SELECT kn.id, kn.type, kn.label, kn.content, kn.properties,
         0 AS depth, ARRAY[kn.id] AS path
  FROM knowledge_nodes kn
  WHERE kn.id = $1

  UNION ALL

  -- Recursion: traverse edges in both directions
  SELECT kn2.id, kn2.type, kn2.label, kn2.content, kn2.properties,
         r.depth + 1, r.path || kn2.id
  FROM related r
  JOIN knowledge_edges ke ON ke.from_node_id = r.id OR ke.to_node_id = r.id
  JOIN knowledge_nodes kn2 ON kn2.id = CASE
    WHEN ke.from_node_id = r.id THEN ke.to_node_id
    ELSE ke.from_node_id
  END
  WHERE r.depth < $2              -- max depth: 1 (or 2 in special cases)
    AND NOT kn2.id = ANY(r.path)  -- cycle prevention
)
SELECT DISTINCT ON (id) *
FROM related
WHERE depth > 0
ORDER BY id, depth;  -- DISTINCT ON requires this ORDER BY; keeps the shallowest occurrence per node
Cycle prevention: The path array contains the nodes visited so far. If a node already appears in the path, recursion doesn't continue on that branch. This prevents infinite loops in mutual references (e.g., email → thread → email).
7.6 Configurable Parameters
Why 1 hop and not 2? 2-hop neighbors exponentially increase the result set (5 neighbors × 5 = 25 second-hop nodes), and the inherited score shrinks by decay² = 0.64: even a strong 0.82 hit yields 0.82 × 0.64 ≈ 0.52, below the 0.60 threshold. Therefore, 1-hop is the optimal balance between context richness and cost-effectiveness.
8. The Complete RAG Pipeline — In 5 Steps
8.1 Architectural Overview
User Question
│
▼
┌─────────────────────────────────────────────────────────┐
│ retrieveRAGContext(providerId, message) │
│ │
│ 1. Guard: message.length >= 3? │
│ └─ No → return emptyResult() │
│ │
│ 2. Vector Search │
│ generateEmbedding(message) │
│ → searchByEmbedding(embedding, providerId, topK=8) │
│ → filter: similarity > 0.60 │
│ │
│ 3. Graph Enrichment │
│ Top-3 results → getNeighbors() per node │
│ → neighbors decay = 0.8 │
│ → max 5 neighbors per node │
│ │
│ 4. Dedup + Rank + Token Budget │
│ → deduplication by ID (higher score wins) │
│ → sort by similarity desc │
│ → greedy packing: 3000 token budget │
│ │
│ 5. Format + Inject │
│ → Markdown context grouped by type │
│ → Source objects for the frontend │
│ → As system message for the LLM │
└─────────────────────────────────────────────────────────┘
│
▼
LLM response generation based on context
8.2 Step 1: Guard — Input Validation
if (!message || message.trim().length < RAG_CONFIG.minQueryLength) {
return emptyResult();
}
Why is this needed? A 1-2 character message (e.g., "hi", "ok") carries no semantic content — its embedding is "average," which would return irrelevant results. The 3-character minimum filters these out.
8.3 Step 2: Vector Search
The user message is embedded with the same model as the data (text-embedding-3-small). This is critical: if the query embedding and stored embeddings come from different models, cosine similarity is meaningless.
The search uses pgvector's cosine distance operator, with provider_id tenant filtering.
8.4 Step 3: Graph Enrichment
For the top-3 vector results (not all 8 — performance-conscious), we load their 1-hop neighbors. Neighbors receive their parent's similarity × 0.8 decay.
8.5 Step 4: Deduplication, Ranking, Token Budget
All nodes (vector + graph) go into a single list:
- Deduplication: If a node appears multiple times (e.g., a client is a neighbor of two emails), the higher score remains
- Ranking: Descending by similarity
- Token budget packing: Greedy algorithm — following the ranking, we add nodes until the 3000-token budget (estimated with character/4 heuristic) is exhausted. Each node gets max 400 characters of content.
let usedTokens = 0;
const selected = [];
for (const node of rankedNodes) {
  const content = (node.content || '').substring(0, RAG_CONFIG.maxContentLength);
  const estimatedTokens = Math.ceil(content.length / 4) + 20; // +20 for formatting
  if (usedTokens + estimatedTokens > RAG_CONFIG.maxContextTokens) break;
  usedTokens += estimatedTokens;
  selected.push({ ...node, content });
}
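The deduplication step that precedes this packing loop ("higher score wins") can be sketched as:

```javascript
// Merge vector hits and graph neighbors: on duplicate IDs keep the
// entry with the higher similarity, then rank descending.
function dedupAndRank(nodes) {
  const byId = new Map();
  for (const node of nodes) {
    const existing = byId.get(node.id);
    if (!existing || node.similarity > existing.similarity) {
      byId.set(node.id, node);
    }
  }
  return [...byId.values()].sort((a, b) => b.similarity - a.similarity);
}
```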
8.6 Step 5: Markdown Formatting and Injection
The selected nodes are transformed into type-grouped Markdown:
📧 **Emails:**
- **Appointment change** (2026-03-10)
Dear Salon! I'd like to modify my Friday appointment...
_Source: Gmail_
📅 **Calendar Events:**
- **Haircut + coloring** (2026-03-15 14:00)
Kiss Anna - 90 minutes
_Source: Google Calendar_
👤 **Clients:**
- **Kiss Anna**
Phone: +36-30-123-4567, VIP client, allergy: certain dyes
_Source: CRM_
This Markdown is inserted as a system message into the LLM context — separated from the main system prompt:
LLM message array:
[0] system → Main system prompt (personality, rules, available tools)
[1] system → RAG context (the Markdown above)
[2-N] user/assistant → Previous conversation (max 50 messages)
[N+1] user → Current question
8.7 Source Attribution for the Frontend
The RAG pipeline doesn't just provide context to the LLM — it also sends source references back to the frontend:
const sources = selectedNodes.map(node => ({
  type: node.type,
  label: node.label,
  snippet: node.content?.substring(0, 150),
  source: SOURCE_LABELS[node.source],
  icon: SOURCE_ICONS[node.source],
  similarity: node.similarity,
  nodeId: node.id
}));
This allows the frontend to display a "Sources" section below the response, showing exactly what data the LLM based its answer on.
9. Hybrid Search — The Next Step
9.1 The Limitations of Semantic Search
Purely vector-based search isn't perfect: exact identifiers (invoice numbers, phone numbers, client names) and rare domain terms can score low semantically, even though an exact keyword match would find them immediately.
9.2 Hybrid: Vector + BM25
The solution: combine semantic search with keyword-based search (BM25 or PostgreSQL full-text search):
User Question
│
├─── Semantic Search (pgvector cosine) → top-K list + score
│
└─── Keyword Search (tsvector/BM25) → top-K list + score
│
▼
Reciprocal Rank Fusion (RRF)
│
▼
Combined, ranked result
Reciprocal Rank Fusion (RRF) formula:
RRF(d) = Σ_r 1 / (k + rank_r(d)), summed over each ranker r (here: the vector and keyword lists), where k is typically 60
The advantage of RRF is that no score normalization is needed — ranking position matters, not absolute values.
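A minimal sketch of RRF over ranked ID lists (illustrative; the SQL version in the next section does the same fusion inside one query):

```javascript
// RRF(d) = Σ 1 / (k + rank_r(d)), summed over every list d appears in
function rrfFuse(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) || 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}
```

A document that places reasonably high in both lists beats one that tops a single list, which is exactly the behavior hybrid search needs.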
9.3 PostgreSQL-Native Implementation
The pgvector + pg_trgm + tsvector combination enables hybrid search in a single database:
-- Full-text search index (if not already present);
-- COALESCE keeps a NULL content from nulling out the whole expression
CREATE INDEX idx_knowledge_nodes_fts
  ON knowledge_nodes USING gin (
    to_tsvector('hungarian', coalesce(label, '') || ' ' || coalesce(content, ''))
  );

-- Hybrid query: vector + full-text, RRF
WITH vector_results AS (
  SELECT id, label, content,
         ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS v_rank
  FROM knowledge_nodes
  WHERE provider_id = $2 AND embedding IS NOT NULL
  ORDER BY v_rank
  LIMIT 20
),
text_results AS (
  SELECT id, label, content,
         ROW_NUMBER() OVER (ORDER BY ts_rank_cd(
           to_tsvector('hungarian', coalesce(label, '') || ' ' || coalesce(content, '')),
           plainto_tsquery('hungarian', $3)
         ) DESC) AS t_rank
  FROM knowledge_nodes
  WHERE provider_id = $2
    AND to_tsvector('hungarian', coalesce(label, '') || ' ' || coalesce(content, ''))
        @@ plainto_tsquery('hungarian', $3)
  ORDER BY t_rank
  LIMIT 20
)
SELECT COALESCE(v.id, t.id) AS id,
       COALESCE(v.label, t.label) AS label,
       1.0 / (60 + COALESCE(v.v_rank, 1000))
     + 1.0 / (60 + COALESCE(t.t_rank, 1000)) AS rrf_score
FROM vector_results v
FULL OUTER JOIN text_results t ON v.id = t.id
ORDER BY rrf_score DESC
LIMIT 8;
9.4 When to Switch to Hybrid?
10. Re-ranking — The Final Mile of Quality
10.1 The Problem
The bi-encoder (embedding model) is fast and efficient, but measures similarity by comparing separately computed vectors. It doesn't "read" the query and document together.
The cross-encoder reads them together — therefore more accurate, but slower:
Bi-encoder (embedding):
Query → [vector_q] Document → [vector_d] cosine(q, d) → score
Speed: ~1000 doc/sec Accuracy: ★★★☆☆
Cross-encoder (re-ranker):
[Query + Document] → score
Speed: ~50 doc/sec Accuracy: ★★★★★
10.2 Two-Phase Search
The solution: the bi-encoder (pgvector) filters (top-K), the cross-encoder ranks:
Question → pgvector cosine (1ms, top-20 from 50K documents)
│
▼
Cross-encoder re-rank (200ms, on 20 documents)
│
▼
Top-8 truly relevant results
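The two-phase flow as a sketch; crossEncoderScore is a hypothetical async scorer standing in for whichever re-ranker is chosen (hosted API or local model):

```javascript
// Phase 1 (pgvector) supplies `candidates`; phase 2 re-scores each
// query+document pair together, then keeps the top-N.
async function rerank(query, candidates, crossEncoderScore, topN = 8) {
  const scored = await Promise.all(
    candidates.map(async (doc) => ({
      ...doc,
      rerankScore: await crossEncoderScore(query, doc.content),
    }))
  );
  return scored
    .sort((a, b) => b.rerankScore - a.rerankScore)
    .slice(0, topN);
}
```

Because the candidate set is small (top-20), even a ~50 doc/sec cross-encoder keeps the added latency in the low hundreds of milliseconds.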
10.3 Available Re-ranker Solutions
10.4 When to Use a Re-ranker?
11. Evaluation Framework — How Do You Know It's Working?
11.1 The RAG Evaluation Problem
Evaluating semantic search and RAG pipelines is harder than evaluating a traditional search engine because:
- There's no clear "correct answer" — relevance is subjective
- Response quality is the combined performance of search + LLM
- Evaluation is expensive (human annotation or LLM-based scoring)
11.2 RAGAS Metrics
The RAGAS framework is the industry standard for RAG evaluation:
11.3 Practical Evaluation Method
Build a golden dataset with 50-100 real questions:
{
  "question": "When was Kiss Anna's last visit?",
  "expected_context": ["event_77", "client_15"],
  "expected_answer_contains": ["2026-03-15", "haircut"],
  "category": "appointment_lookup"
}
Automated evaluation cycle:
- Run the 50 questions through the RAG pipeline
- Measure: Context Precision, Context Recall, Faithfulness
- Vary the parameters (threshold: 0.55/0.60/0.65, topK: 5/8/12)
- Visualize: precision-recall curves for different configurations
- Choose the best balance
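The precision/recall part of this cycle can be sketched against the golden dataset's expected_context field (Faithfulness needs LLM-based scoring and is omitted here):

```javascript
// Context precision/recall for one golden-dataset question.
// retrievedIds: node IDs the RAG pipeline actually returned
// expectedIds:  the question's expected_context list
function contextMetrics(retrievedIds, expectedIds) {
  const expected = new Set(expectedIds);
  const hits = retrievedIds.filter((id) => expected.has(id)).length;
  return {
    precision: retrievedIds.length ? hits / retrievedIds.length : 0,
    recall: expectedIds.length ? hits / expectedIds.length : 0,
  };
}
```

Averaging these two numbers over all 50 questions for each parameter combination yields the precision-recall points to plot.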
11.4 A/B Testing in Production
If there's sufficient traffic, A/B testing is the ground truth. The "back-question rate" (how often users ask follow-up questions because they didn't get a good answer) is one of the best proxy metrics for RAG quality.
12. Knowledge Graph + RAG: The GraphRAG Approach
12.1 What Does the Graph Add to RAG?
Traditional RAG is "flat" — it searches documents. GraphRAG adds structure:
12.2 Our GraphRAG Implementation
Our system handles 9 node types and 8 edge types:
Node types:
Edge types (bidirectional traversal):
EMAILED — email send/receive relationship
BOOKED — booking relationship (client → appointment)
PAID — payment relationship (client → invoice)
MENTIONS — reference (any node → any node)
TAGGED — tagging
ASSIGNED — assignment (task → user)
BELONGS_TO — grouping (email → thread, deal → client)
SENT_TO — targeted sending
12.3 Graph Advantages with Real Questions
Question: "How much did Kiss Anna spend in the last 3 months?"
Traditional RAG result:
→ Finds an email mentioning payment
→ LLM estimates
GraphRAG result:
1. Vector search → deal_15 "Kiss Anna package" (sim: 0.75)
2. Graph enrichment:
deal_15 ──BELONGS_TO──▶ client_15 "Kiss Anna"
client_15 ──PAID──▶ invoice_23 "45,000 HUF 2026-02"
client_15 ──PAID──▶ invoice_31 "38,000 HUF 2026-01"
client_15 ──BOOKED──▶ appointment_44 "2026-03-15"
3. LLM gives a precise answer: "Kiss Anna spent 83,000 HUF
in the last 3 months"
13. Production Operations — Monitoring, Drift, Re-indexing
13.1 What to Monitor?
13.2 Embedding Drift
Models change. If OpenAI updates text-embedding-3-small (they haven't yet, but they marked the older text-embedding-ada-002 as deprecated), old and new vectors become incompatible. This is "drift" — search quality gradually degrades.
Prevention:
- Model version logging: Store which model version generated each embedding
- Full re-embedding capability: Have a script that re-embeds all nodes
- Canary tests: Regularly run the golden dataset — if precision drops, suspect drift
13.3 Re-indexing Strategy
When to re-index?
Zero-downtime re-indexing:
-- 1. Build the new index CONCURRENTLY (doesn't lock the table)
CREATE INDEX CONCURRENTLY idx_knowledge_nodes_embedding_new
  ON knowledge_nodes USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 200);

-- 2. Drop the old index (also CONCURRENTLY, to avoid blocking queries)
DROP INDEX CONCURRENTLY idx_knowledge_nodes_embedding;

-- 3. Rename
ALTER INDEX idx_knowledge_nodes_embedding_new
  RENAME TO idx_knowledge_nodes_embedding;
13.4 The Resilience Matrix
Every component in the full pipeline is fail-safe:
The principle: The AI assistant always responds. If there's no context, it responds without the knowledge graph. If there are no tools, it responds without them. Degradation is gradual, never total.
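This principle reduces to a small wrapper around the Chapter 8 pipeline (names assumed; the real emptyResult() shape may differ):

```javascript
// Graceful degradation: a RAG failure must never block the LLM reply.
// `retrieveRAGContext` is injected; on any error we answer without context.
async function safeRetrieveRAGContext(providerId, message, retrieveRAGContext) {
  try {
    return await retrieveRAGContext(providerId, message);
  } catch (err) {
    console.error('RAG pipeline failed, answering without context:', err.message);
    return { contextMarkdown: '', sources: [] }; // assumed emptyResult() shape
  }
}
```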
14. Fine-tuning Embeddings — When Is It Worth It?
14.1 The Promise and Reality of Fine-tuning
Fine-tuning an embedding model can tune it to domain-specific vocabulary. But:
14.2 Alternative: Prompt-Level Embedding Improvement
Before fine-tuning, try improving the input text:
// Instead of:
generateEmbedding(email.body)
// Contextualize:
generateEmbedding(`Email subject: ${email.subject}\nFrom: ${email.from}\n${email.body}`)
This "contextualized embedding" can improve results surprisingly well — the model knows more about the text's context, without any fine-tuning.
14.3 When to Fine-tune?
15. Summary and Decision Matrix
15.1 Architecture Layers
┌──────────────────────────────────────────────────────────────┐
│ User Question │
├──────────────────────────────────────────────────────────────┤
│ RAG Pipeline │
│ ┌─────────────┐ ┌──────────────┐ ┌─────────────────────┐ │
│ │ Vector Search│→ │ Graph Enrich │→ │ Dedup + Token Pack │ │
│ │ (pgvector) │ │ (1-hop CTE) │ │ (3000 token budget) │ │
│ └─────────────┘ └──────────────┘ └─────────────────────┘ │
├──────────────────────────────────────────────────────────────┤
│ Knowledge Graph (PostgreSQL) │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ KnowledgeNode │──│ KnowledgeEdge │ │
│ │ (9 types, 1536d) │ │ (8 edge types) │ │
│ └──────────────────┘ └──────────────────┘ │
├──────────────────────────────────────────────────────────────┤
│ Embedding Pipeline (BullMQ) │
│ ┌─────────────┐ ┌──────────────┐ ┌─────────────────────┐ │
│ │ Event Worker │→ │ Embedding Q │→ │ OpenAI API │ │
│ │ (conc: 5) │ │ (50/min) │ │ (text-emb-3-small) │ │
│ └─────────────┘ └──────────────┘ └─────────────────────┘ │
├──────────────────────────────────────────────────────────────┤
│ Connectors │
│ Gmail │ Google Calendar │ CRM │ Invoicing │ Billing │
└──────────────────────────────────────────────────────────────┘
15.2 CTO Decision Checklist
15.3 The Most Important Lessons
- Start simple: pgvector + text-embedding-3-small + cosine search. This works in 30 minutes and is sufficient for most SME use cases.
- Don't chunk what's a natural unit: Emails, events, client data are better kept whole. The entity-based knowledge graph handles the granularity question.
- Graph enrichment is the real differentiator: For "Who?" "When?" "How much?" type questions, vector search alone is weak — loading neighbors brings dramatic quality improvement.
- Resilience is not optional: In a production system, the RAG pipeline must not block the AI response. If anything breaks, graceful degradation: we respond with less context, but we respond.
- Measure with 50 questions before changing anything: Threshold, top-K, token budget are all tunable — but with data-driven decisions, not intuition.
Want to implement semantic search in your own system? The Atlosz Interactive team has production experience with pgvector, knowledge graph, and RAG pipeline architecture. Get in touch for a free technical consultation.