
Why Security Is the Most Important AI Question — Data Flow in an AI System

Ádám Zsolt & AIMY
4 min read

This article is part 1 of the AI Security and Data Protection in Enterprise Environments whitepaper series. Other parts: Six security pillars, GDPR, EU AI Act and attack surfaces, Cloud vs. on-premise and checklist.


Why Is Security the Most Important Question?

The biggest obstacle to enterprise AI adoption is not technology — it's trust.

According to IBM's 2025 survey, 68% of corporate decision-makers cite data protection concerns as the primary barrier to AI adoption. Not cost, not technical complexity, not employee resistance — but the question: is our customers' data safe?

This is a valid concern. An AI agent — as we've demonstrated in our previous articles — has access to the CRM, can read emails, manage calendars, and even send emails. This capability set is what makes AI truly useful, but it is also what creates the security risk.

The good news: risks are manageable. The question is not whether there is risk (there is — as with any IT system), but rather what framework we use to manage it.


The Three Questions Every Leader Asks

"Does our customer data leave the company?"

Short answer: it depends on how we build the system — but the good news is it can be kept under full control.

When the AI agent answers a question, the following happens:

  1. The user's message reaches the AI engine
  2. The AI engine retrieves relevant data from the database
  3. The data + the question are sent to the LLM (e.g., OpenAI GPT-4o or Anthropic Claude)
  4. The LLM responds
  5. The response reaches the user
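
The five-step flow above can be sketched in a few lines of Python. The in-memory record store, `retrieve_context`, and `call_llm` below are illustrative stand-ins, not a real vendor SDK:

```python
# Toy record store standing in for the CRM database (step 2's source).
RECORDS = {
    "acme": "Acme Corp: deal stage 'negotiation', contact jane@acme.com",
    "globex": "Globex: deal stage 'closed-won', contact sam@globex.com",
}

def retrieve_context(question: str) -> list[str]:
    """Step 2: pull only records relevant to the question."""
    q = question.lower()
    return [text for key, text in RECORDS.items() if key in q]

def call_llm(context: list[str], question: str) -> str:
    """Steps 3-4: stand-in for the external LLM API round trip."""
    return f"Based on {len(context)} record(s): answer to '{question}'"

def handle_message(question: str) -> str:
    context = retrieve_context(question)   # step 2: relevant data only
    answer = call_llm(context, question)   # steps 3-4: out and back
    return answer                          # step 5: response to the user
```

Note that `handle_message` never passes the whole `RECORDS` store to `call_llm`; only what retrieval selected crosses the boundary.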

The critical step is point 3: the data sent to the LLM leaves our infrastructure and ends up on an external provider's server.

But:

  • Business APIs (OpenAI API, Anthropic API) do not use data for model training — this is contractually guaranteed
  • Only the relevant context is sent out, not the entire database (the RAG pipeline ensures this)
  • On-premise alternatives exist: with a local model (Llama, Mistral), data never leaves the network
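
Because the LLM call sits behind a single function, the cloud-versus-on-premise choice stays a configuration detail. A minimal sketch, with assumed endpoint URLs and model names (a real integration would use the provider's SDK):

```python
# Illustrative endpoints: the on-prem URL and model names are assumptions.
ENDPOINTS = {
    "cloud": ("https://api.openai.com/v1/chat/completions", "gpt-4o"),
    "on_prem": ("http://llm.internal:8080/v1/chat/completions", "llama-3.1-70b"),
}

def build_request(deployment: str, context: str, question: str) -> dict:
    """Assemble the chat request; only the endpoint differs by deployment."""
    url, model = ENDPOINTS[deployment]
    return {
        "url": url,  # with "on_prem", data never leaves the local network
        "json": {
            "model": model,
            "messages": [
                {"role": "system", "content": context},
                {"role": "user", "content": question},
            ],
        },
    }
```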

"Who sees the customer data?"

In a well-designed multi-tenant system:

  • Every customer / company only sees their own data
  • The AI agent only accesses tools that the user has authorized
  • The administrator cannot access customer conversations (unless explicitly for audit purposes)
  • The LLM provider (OpenAI, Anthropic) does not read the data — it's automated processing with no human access
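
Tenant isolation of this kind is typically enforced in the data layer, not left to callers. A minimal sketch, assuming a simple `tenant_id` field; production systems would use row-level security or per-tenant schemas:

```python
# Toy data model: every row carries its owner's tenant_id.
CONTACTS = [
    {"tenant_id": "t1", "name": "Jane", "email": "jane@acme.com"},
    {"tenant_id": "t2", "name": "Sam",  "email": "sam@globex.com"},
]

def list_contacts(tenant_id: str) -> list[dict]:
    # The tenant filter is applied here, in the data layer, so no query
    # path can return another customer's rows.
    return [c for c in CONTACTS if c["tenant_id"] == tenant_id]
```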

"What happens if something goes wrong?"

The AI system is not infallible — but errors are manageable:

  • Audit log: Every AI action is logged — who requested it, what it did, what data it used
  • Approval gates: High-risk operations (email sending, invoice generation) require human approval before execution
  • Isolated scope: One agent's error doesn't spread to other areas
  • Fallback: If the AI is uncertain, it escalates to a human colleague
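
Two of the safeguards above can be sketched together: every tool call is appended to an audit log, and high-risk tools return control to a human instead of executing. The names (`run_tool`, `HIGH_RISK`) are illustrative:

```python
import datetime

AUDIT_LOG: list[dict] = []                      # append-only action trail
HIGH_RISK = {"send_email", "create_invoice"}    # tools that need sign-off

def run_tool(user: str, tool: str, args: dict, approved: bool = False) -> str:
    # Log first: who requested it, what it did, what data it used.
    AUDIT_LOG.append({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "user": user,
        "tool": tool,
        "args": args,
    })
    if tool in HIGH_RISK and not approved:
        return "pending_approval"   # escalate to a human before executing
    return "executed"
```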

How Does Data Flow Through an AI System?

To understand security, we first need to see the data's journey:

┌─────────────────────────────────────────────────────────────────┐
│                      OUR INFRASTRUCTURE                          │
│                                                                  │
│  User ──▶ API Gateway ──▶ AI Service                            │
│            (authentication)  │                                    │
│                              ├──▶ CRM Database (PostgreSQL)      │
│                              │    └─ Contacts, deals             │
│                              │                                    │
│                              ├──▶ Knowledge Graph                │
│                              │    └─ Emails, events              │
│                              │                                    │
│                              ├──▶ RAG Pipeline                   │
│                              │    └─ Relevant context            │
│                              │       selection (max 3000         │
│                              │       tokens)                     │
│                              │                                    │
│                              └──▶ Context Assembly               │
│                                   (system prompt +               │
│                                    relevant data +               │
│                                    user question)                │
│                                                                  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  DATA LEAVES OUR SYSTEM HERE:                            │  │
│  │                                                           │  │
│  │  Context (max ~3000 tokens) ────▶ LLM API (OpenAI /     │  │
│  │                                    Anthropic / Google)    │  │
│  │                                                           │  │
│  │  ◀── Response text ◀── LLM                              │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                  │
│  AI Service ──▶ Save Response ──▶ Response to User              │
└─────────────────────────────────────────────────────────────────┘

The key insight: The entire database is not sent to the LLM — only the relevant context fragment selected by the RAG pipeline. If the customer stores 10,000 contacts in the CRM, perhaps 2-3 contacts' data reaches the LLM, and even then only the parts relevant to the specific question.
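
That selection step can be sketched as follows: rank candidate snippets by relevance, then keep adding them until the token budget would be exceeded. The word-count token estimate is a rough stand-in for a real tokenizer:

```python
def select_context(snippets: list[tuple[float, str]],
                   max_tokens: int = 3000) -> list[str]:
    """Pick the most relevant snippets that fit within the token budget."""
    chosen, used = [], 0
    # Most relevant snippets first
    for score, text in sorted(snippets, key=lambda s: s[0], reverse=True):
        cost = len(text.split())          # crude token estimate
        if used + cost > max_tokens:
            continue                      # skip anything that would overflow
        chosen.append(text)
        used += cost
    return chosen
```

Everything not selected here stays inside our infrastructure; only `chosen` is assembled into the prompt sent to the LLM.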


Next part: The Six Security Pillars — from authentication to human-in-the-loop.