The Problem — Why Are Most AI Integrations "Dumb"?
Most enterprise AI integrations today work like this: the LLM (large language model) receives the user's question, runs an SQL query, and returns the result. That's a better search engine — but it's not intelligence.
The real problem: data alone is not knowledge. Knowledge lies in the relationships between data.
Example: Anna Kovács is a contact in the CRM. But what do we actually know about her?
- She received a proposal last week (deal)
- She replied positively via email (Gmail thread)
- There's a consultation with her tomorrow (calendar event)
- Her boss is also attending the consultation (participant)
- We worked together on 3 projects last year (history)
A traditional CRM query might return 1-2 of these data points. A RAG system built on a Knowledge Graph returns all of them — because it understands the relationships.
Knowledge Graph — The Map of Business Relationships
What Is It Exactly?
A Knowledge Graph consists of two building blocks:
- Nodes: entities — contacts, emails, calendar events, deals, invoices, tasks
- Edges: relationships — WHO sent an email TO WHOM, WHO attended the EVENT, which DEAL belongs to which CONTACT
┌────────────┐      SENT        ┌────────────┐
│ Anna Kovács│─────────────────▶│   Email    │
│ (contact)  │                  │ "Proposal  │
│            │◀─────────────────│    OK!"    │
└─────┬──────┘   WAS_RECIPIENT  └────────────┘
      │
      │ ATTENDEE
      ▼
┌────────────┐    BELONGS_TO    ┌─────────────┐
│Consultation│                  │ WebShop Pro │
│  (event)   │─────────────────▶│   (deal)    │
│ 2025.11.04 │                  │   €1,300    │
└────────────┘                  └─────────────┘
The Data Model
A production-ready Knowledge Graph node carries, at minimum: an ID, a tenant identifier (providerId), an entity type, a title, the content, an embedding vector, and metadata with timestamps.
Edge types indicate the nature of the relationship: EMAILED, BOOKED, PAID, MENTIONS, ASSIGNED, ATTENDED_BY. Every edge is weighted (weight), enabling prioritization of relevant connections.
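The data model above can be sketched as PostgreSQL DDL. This is an illustrative schema, not the project's actual one: table and column names (kg_nodes, kg_edges) are assumptions, and the embedding dimension matches the text-embedding-3-small model mentioned later in the article.

```sql
-- Illustrative sketch; requires the pgvector extension (CREATE EXTENSION vector).
CREATE TABLE kg_nodes (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    provider_id UUID NOT NULL,            -- tenant isolation
    node_type   TEXT NOT NULL,            -- contact, email, event, deal, ...
    title       TEXT,
    content     TEXT,
    embedding   vector(1536),             -- text-embedding-3-small dimension
    metadata    JSONB DEFAULT '{}',
    created_at  TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE kg_edges (
    provider_id UUID NOT NULL,
    source_id   UUID REFERENCES kg_nodes(id) ON DELETE CASCADE,
    target_id   UUID REFERENCES kg_nodes(id) ON DELETE CASCADE,
    edge_type   TEXT NOT NULL,            -- EMAILED, BOOKED, PAID, MENTIONS, ...
    weight      REAL DEFAULT 1.0,         -- for prioritizing relevant connections
    PRIMARY KEY (source_id, target_id, edge_type)
);
```

The ON DELETE CASCADE on the edge table is what later makes GDPR erasure a single DELETE on the node.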
Why Not Neo4j?
Fair question. Dedicated graph databases (Neo4j, Amazon Neptune) seem like a natural choice. In practice, however, the PostgreSQL + pgvector combination works surprisingly well when:
- PostgreSQL is already in the stack (no new infrastructure needed)
- The graph size is a few tens of thousands of nodes per tenant (not millions)
- Vector search is also needed (pgvector supports it natively)
- You want to keep everything in one database (ops simplicity)
The trick: recursive CTEs (Common Table Expressions) in PostgreSQL can handle multi-step graph traversal, shortest-path queries, and cycle prevention. For breadth-first traversals, this covers most of what Neo4j's Cypher language provides.
The key is the abstraction layer: if the graph-service API is well designed, switching from PostgreSQL to Neo4j requires zero code changes on the calling side. It's worth starting with this design principle.
RAG — How Does the AI Get Relevant Context?
The Essence of RAG in 30 Seconds
RAG (Retrieval-Augmented Generation) addresses the biggest limitation of LLMs: they only know what was in their training data, and anything outside it they may hallucinate (make up).
RAG solves this by retrieving relevant information from the database before answering the question, and providing it as context to the LLM. The LLM doesn't answer from "memory" but based on the facts it's given.
Why Isn't Simple Search Enough?
Two reasons:
1. Semantic search > keyword search: If the user asks "what did we discuss about the proposal?", keyword search looks for the word "proposal." Semantic search understands that "quote", "pricing", "rate card", and even "I'll send the details" could be relevant.
2. Context linking: Vector search finds the relevant email. But the Knowledge Graph can add who sent it, which deal it belongs to, and what calendar event is connected. This combination makes the answer truly useful.
The 5-Step RAG Pipeline in Practice
Step 1 — Vectorization and Semantic Search
The user's question is converted into an embedding (OpenAI text-embedding-3-small, 1536 dimensions), then pgvector cosine similarity search finds the closest matching content.
Parameters (tested in production):
- Top-K: 8 results
- Threshold: 0.60 cosine similarity (below this, too much noise)
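With those parameters, the Step 1 query is a single pgvector lookup. The sketch below assumes the illustrative kg_nodes schema; `<=>` is pgvector's cosine distance operator, so similarity is `1 - distance`.

```sql
-- Sketch: top-K cosine similarity search with a relevance floor.
SELECT id, title,
       1 - (embedding <=> :query_embedding) AS similarity
FROM kg_nodes
WHERE provider_id = :provider_id                      -- tenant isolation
  AND 1 - (embedding <=> :query_embedding) >= 0.60    -- threshold
ORDER BY embedding <=> :query_embedding               -- ascending distance
LIMIT 8;                                              -- Top-K
```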
Step 2 — Graph Enrichment
The neighbors of the top 3 vector results are also pulled in — 1 step deep, max 5 neighbor nodes per result.
Neighbors receive an inherited relevance score: parent_similarity × 0.8. If an email came in with 92% relevance, the connected contact gets 73.6%. This ensures that context from the graph doesn't overwhelm direct matches.
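The inheritance rule is a one-liner; a minimal sketch, with the function name and depth generalization as my own additions (the article only specifies the 1-hop case and the 0.8 factor):

```typescript
// Sketch of the inherited-relevance rule: each hop away from a direct
// vector match multiplies the score by the decay factor.
const DECAY = 0.8;

function inheritedScore(parentSimilarity: number, depth: number): number {
  return parentSimilarity * Math.pow(DECAY, depth);
}

// An email matched at 92%: its 1-hop neighbor (the contact) inherits ≈ 73.6%.
console.log(inheritedScore(0.92, 1));
```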
Step 3 — Deduplication and Ranking
Merging vector and graph results (vector results get priority), sorted by descending relevance.
Critical element: token budget management. The LLM's context window is finite (and expensive). The pipeline uses a simple but effective estimation: (content_length + title_length + 50) / 4 tokens per node (Hungarian text ≈ 4 characters/token). When the cumulative token count reaches the 3,000 limit, it stops — no overflow, no unnecessary cost.
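The estimation and cut-off can be sketched as follows. The formula and the 3,000-token limit come from the article; the node shape and function names are illustrative assumptions.

```typescript
// Sketch of the Step 3 token-budget cut-off.
interface RankedNode { title: string; content: string; relevance: number; }

const TOKEN_BUDGET = 3000;

function estimateTokens(node: RankedNode): number {
  // ~4 characters per token (Hungarian text), +50 for formatting overhead.
  return Math.ceil((node.content.length + node.title.length + 50) / 4);
}

function selectWithinBudget(ranked: RankedNode[]): RankedNode[] {
  const selected: RankedNode[] = [];
  let used = 0;
  for (const node of ranked) {             // already sorted by relevance desc
    const cost = estimateTokens(node);
    if (used + cost > TOKEN_BUDGET) break; // stop before overflowing
    selected.push(node);
    used += cost;
  }
  return selected;
}
```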
Step 4 — Context Assembly
A structured Markdown context is prepared for the LLM from the selected nodes:
## Relevant Context
### Email
**"RE: Proposal details"** [source: Gmail, relevance: 92%]
> Hi, I reviewed the proposal, it's acceptable for us...
*From: anna.kovacs@company.com | Date: 2025-10-28*
Relationships: → SENT_EMAIL → Anna Kovács (client)
### Calendar
**"Consultation — Anna Kovács"** [source: Google Calendar, relevance: 74%]
*2025-11-04 10:00–11:00 | Attendees: Anna Kovács, Dr. Szabó*
Step 5 — Source Attribution
Detailed source objects are created for the frontend: icon, color, excerpt, relevance percentage, and the entry path (vector or graph). The user sees exactly where the information came from — this is not just UX, it's critical from a compliance perspective.
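A source object of the kind described above might look like this. Field names are illustrative assumptions, not the project's actual frontend API; the example values echo the email from Step 4.

```typescript
// Sketch of a Step 5 source-attribution object handed to the frontend.
type EntryPath = "vector" | "graph";

interface SourceAttribution {
  icon: string;        // e.g. "mail", "calendar"
  color: string;       // UI accent color for the source type
  excerpt: string;     // short quote shown to the user
  relevance: number;   // 0-100, displayed as a percentage
  path: EntryPath;     // how the node entered the context
}

const example: SourceAttribution = {
  icon: "mail",
  color: "#4285F4",
  excerpt: "Hi, I reviewed the proposal, it's acceptable for us...",
  relevance: 92,
  path: "vector",
};
```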
Concrete Use Cases
"What Do We Know About This Client?"
Traditional approach: Open CRM → search contact → review deals → switch to email client → search → open calendar → search. ~5-10 minutes.
Knowledge Graph + RAG: The AI summarizes the contact, last emails, open deals, and upcoming events in response to a single question — with source references. ~5 seconds.
"What Happened Last Week with the WebShop Pro Project?"
Semantic search finds the relevant emails and calendar events. Graph enrichment pulls in connected contacts and deal status. The AI provides a chronological summary with references.
Proactive Alerts
Based on graph statistics, the AI recognizes: "Anna Kovács has 5 incomplete deals and there has been no communication for 2 weeks — it might be worth reaching out." No question needed — the system analyzes the graph on a schedule.
Decision Points for CTOs and IT Leaders
Embedding Model Selection
Recommendation: Start with the text-embedding-3-small model — it's the best compromise between cost, quality, and multilingual accuracy. You can migrate to another model later, at the cost of re-generating the stored embeddings.
Vector Database Selection
Recommendation: Below 100,000 nodes, pgvector is perfectly sufficient and dramatically simplifies infrastructure. Above that, Qdrant or Pinecone are worth considering.
Async vs. Sync Embedding Generation
Embedding generation is expensive (an API call) and slow (100-500 ms per item). Two approaches:
- Synchronous: Generate the embedding immediately when the entity is created. Simple, but slows down writes.
- Asynchronous (recommended): The entity is created, the embedding goes into a queue (BullMQ, RabbitMQ, SQS), and is generated in the background. With rate limiting, parallel processing, and retry logic.
Our production configuration: 3 parallel workers, max 50 jobs/minute rate limit, Redis-based BullMQ queue. This ensures that OpenAI API rate limits never cause errors, while most embeddings are ready within 1 minute.
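That configuration maps directly onto BullMQ's worker options. The queue name, Redis host, and job handler below are illustrative assumptions; concurrency and limiter are real BullMQ WorkerOptions fields.

```typescript
// BullMQ worker options matching the production numbers above.
const embeddingWorkerOptions = {
  connection: { host: "localhost", port: 6379 }, // Redis-backed BullMQ
  concurrency: 3,                                // 3 parallel workers
  limiter: { max: 50, duration: 60_000 },        // max 50 jobs per minute
};
// Used as: new Worker("embedding-generation", processEmbeddingJob, embeddingWorkerOptions)
// where processEmbeddingJob is a hypothetical handler that calls the embedding
// API; retry/backoff is configured per job on queue.add (attempts, backoff).
```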
Security and Data Protection
Tenant Isolation
Every Knowledge Graph query is filtered by providerId. In a multi-tenant system, this is not optional — it's the baseline. Graph traversal, vector search, and RAG context all access only the given provider's data.
Data Minimization in RAG
The RAG pipeline doesn't send the entire Knowledge Graph to the LLM — only the relevant, ranked context, with a token limit. From a GDPR perspective, this is data minimization; from a cost perspective, it's efficiency.
Embeddings and Personal Data
Important to know: an embedding cannot be trivially reversed back to the original text, but the original content is stored alongside it. To comply with GDPR's right to erasure, deleting an entity must remove the node, its content, and the embedding together. Cascade delete (including edges) handles this automatically.
On-Premise Option
For the most sensitive data:
- Embedding: Ollama + nomic-embed-text locally, zero data leaving the network
- Vector search: pgvector on your own PostgreSQL
- Trade-off: lower embedding quality, but full data control
Summary — When Is It Worth Starting?
Knowledge Graph + RAG is worth it if:
- There are multiple data sources (CRM + email + calendar + invoicing)
- User questions are context-dependent ("what do we know about them?" type)
- Data silos are a clear pain point
- Source transparency and auditability matter
It's too early if:
- There's a single, well-structured data source — a simple SQL search suffices
- There aren't 500+ entities — the graph doesn't add value at small data volumes
- Semantic search isn't important — keyword search is sufficient
The Most Important Design Principles
- Abstraction layer above the graph database — PostgreSQL today, Neo4j tomorrow, zero code changes
- Async embedding pipeline — queue + rate limiting + retry
- Token budget management in RAG — we decide what the LLM receives, not the LLM
- Relevance inheritance in graph traversal — decay factor for 2nd-level results
- Source attribution in the UI — the user always knows where the data came from
This article is based on the AIMY project's Knowledge Graph implementation — PostgreSQL + pgvector, OpenAI embeddings, BullMQ pipeline.
If you're considering a similar solution, get in touch with us!