---
title: Private RAG
---

## What Is RAG?

Large language models are powerful, but they only know what was in their training data. When you need answers grounded in *your* documents — internal wikis, uploaded files, proprietary data — general knowledge isn't enough. You need **Retrieval-Augmented Generation (RAG)**.

RAG can take different forms. You can retrieve from the live web using our [search API](/api-reference/search/v1-search), or you can retrieve from your own private documents using embeddings and a vector store. This demo covers the latter — **private RAG** — where the LLM response is grounded in your organization's documents.

RAG solves this in three steps: *index* your documents, *retrieve* the relevant content, then *generate* a grounded answer from it. The model never needs to memorize your data — it just reads the right pieces at the right time.

***

## How RAG Works

RAG follows three steps: **Index**, **Retrieve**, and **Generate**.

```mermaid
flowchart LR
  subgraph Index
    A["Your Document(s)"] --> B["Chunk & Embed"]
    B --> C["Vector Store"]
  end
  subgraph Retrieve
    D["Your Question"] --> E["Embed Question"]
    E --> F["Similarity Search"]
    F --> G["Top Chunks (matching your question)"]
  end
  subgraph Generate
    H["Prompt (question + chunks)"] --> I["You.com Express Agent"]
    I --> J["Grounded Answer"]
  end
  C --> F
  G --> H
  D --> H
  style A fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#1e3a5f
  style D fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#1e3a5f
  style C fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#1e3a5f
  style H fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#1e3a5f
  style I fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#1e3a5f
  style J fill:#d1fae5,stroke:#10b981,stroke-width:2px,color:#1e3a5f
```

### Index

Your document is split into small, overlapping **chunks** of text. Each chunk is converted into a **vector embedding** — an array of numbers (e.g. `[0.21, -0.87, 0.54, ...]`) that encodes the *meaning* of that text.
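As a mental model, chunking can be sketched as a sliding window over the text. This toy `chunkText` helper is not part of the sample app (LlamaIndex does the real splitting later in this guide), but it shows the idea:

```ts
// Toy sliding-window chunker: illustrative only. Real splitters (like
// LlamaIndex's) respect sentence and token boundaries, not raw characters.
function chunkText(text: string, chunkSize = 200, overlap = 50): string[] {
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

Each chunk shares `overlap` characters with its neighbor, so a sentence that straddles a boundary still appears whole in at least one chunk.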
Chunks with similar meaning produce similar vectors, which is what makes semantic search possible. Those vectors are stored in a **vector store** — either in-memory for a simple demo like this one, or in a dedicated database like Pinecone, Weaviate, or pgvector for production.

### Retrieve

When a user asks a question, it's run through the same embedding model we used in the index step to produce its own vector. The vector store finds the closest matches — the passages most semantically related to the question. There are many ways to measure that similarity (cosine similarity, dot product, Euclidean distance); this demo uses **cosine similarity**, which is the default for the vector store we're using.

This is why a question like "What movies involve a kid and a father?" can surface content about *Interstellar* even if the word "father" never appears in the text.

### Generate

The retrieved chunks are injected into a prompt alongside the original question and passed to a language model. The model reads only that context and generates a precise, grounded answer — without hallucinating facts it doesn't have.

***

## RAG Strategies

There's no single way to build a RAG system. The right approach depends on your data size, latency requirements, and how often your documents change.

| Strategy                  | How It Works                                            | When to Use                                        |
| ------------------------- | ------------------------------------------------------- | -------------------------------------------------- |
| **In-memory RAG**         | Index is built per-request, lives only in RAM           | Demos, small files, stateless APIs                 |
| **Persistent vector DB**  | Embeddings stored in Pinecone, pgvector, etc.           | Large document sets, production apps               |
| **Hybrid search**         | Combines vector similarity with keyword (BM25) search   | When exact terms matter alongside semantic meaning |
| **Reranking**             | A second model re-scores retrieved chunks for relevance | When retrieval precision is critical               |
| **Multi-index / routing** | Multiple vector stores for different document types     | Large, heterogeneous knowledge bases               |

This demo uses the simplest strategy: **in-memory RAG**. The index is built from the uploaded file at request time and discarded afterwards — no database, no disk writes. It's a clean way to understand the fundamentals before adding persistence.

***

## Why You.com for Generation?

The retrieve step finds the right content. The generate step turns it into something useful. That's where the **You.com Express Agent** comes in. It's a fast, capable LLM endpoint designed for exactly this pattern: receive a context block, answer only from it, stream the response back.

Key advantages:

* **Streaming by default** — responses stream token-by-token, keeping latency low even for long answers
* **No hallucination pressure** — explicitly prompted to answer only from the provided context
* **No mandatory web search** — tools are opt-in, so the agent stays focused on your documents
* **Simple API** — one SDK call with the `express` agent type

```ts
// This is where the magic happens — the Express Agent reads the retrieved
// context and synthesizes a grounded answer, streamed back token-by-token.
const stream = await you.agentsRuns({
  agent: "express",
  input: `Context:\n${retrievedChunks}\n\nQuestion: ${query}`,
  stream: true,
});
```

The result is an agent that reads *your* documents and writes coherent, grounded answers — not general-knowledge guesses.
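Before diving into the full app, it helps to see the math behind the retrieve step in isolation. Cosine similarity and top-k selection fit in a few lines; the vectors below are toy stand-ins for real embeddings:

```ts
// Cosine similarity: 1 means same direction (same meaning), 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every stored chunk against the query vector and keep the best k.
function topK(
  query: number[],
  store: { text: string; vector: number[] }[],
  k = 3,
): string[] {
  return store
    .map((entry) => ({
      text: entry.text,
      score: cosineSimilarity(query, entry.vector),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.text);
}
```

A production vector store adds indexing structures so it doesn't have to scan every vector, but the ranking logic is the same.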
***

## Full Working Example

We've built a complete sample app you can clone and run locally:

* **GitHub:** [youdotcom-oss/ydc-private-rag-sample](https://github.com/youdotcom-oss/ydc-private-rag-sample)
* **Live demo:** [ydc-private-rag-sample.vercel.app](https://ydc-private-rag-sample.vercel.app/)

The app is a Next.js project. Upload a `.txt` file (or use the built-in example), ask a question, and get a streamed answer grounded in that document. Embeddings run entirely on-device via `BAAI/bge-small-en-v1.5` — no data leaves your machine during the indexing step.

```bash
git clone https://github.com/youdotcom-oss/ydc-private-rag-sample.git
cd ydc-private-rag-sample
npm install
npm run dev  # open localhost:3000
```

Enter your You.com API key in the UI, upload one or more files, and ask away.

### How the Code Works

The app has two files worth understanding: the API route that runs the RAG pipeline, and the UI that drives it.

#### `app/api/query/route.ts` — Index, Retrieve & Generate

This is where all three RAG steps happen server-side.

At the top of the file, the embedding model is configured globally via LlamaIndex's `Settings` object. The right model for your use case will depend on your latency requirements, accuracy needs, and whether you want to run embeddings locally or via an API — this demo uses a small, fast, on-device model, but production systems often swap in a hosted model like OpenAI's `text-embedding-3-small`.

```ts
Settings.embedModel = new HuggingFaceEmbedding({
  modelType: "BAAI/bge-small-en-v1.5",
  modelOptions: { cache_dir: process.cwd() + "/models" },
});
```

### Build the index

`buildIndex()` takes the uploaded file text, wraps each file in a LlamaIndex `Document`, and calls `VectorStoreIndex.fromDocuments()`. Because no `storageContext` is provided, the entire index lives in memory for the lifetime of the request and is gone afterwards.
In a production system you'd replace this with a persistent vector store — Pinecone, pgvector, Weaviate, and Chroma are all common choices — and build the index once rather than on every request.

```ts
async function buildIndex(files) {
  const documents = files.map(({ fileText, filename }) =>
    new Document({ text: fileText, metadata: { source: filename } })
  );
  // No storageContext → stays entirely in memory, gone after the request.
  return VectorStoreIndex.fromDocuments(documents);
}
```

LlamaIndex handles splitting each document into overlapping chunks and embedding them automatically.

### Retrieve the top chunks

`retrieve()` embeds the user's question using the same model, runs a similarity search against the in-memory vector store, and returns the top 3 most relevant chunks.

```ts
async function retrieve(query, index, topK = 3) {
  const retriever = index.asRetriever({ similarityTopK: topK });
  const results = await retriever.retrieve(query);
  return results.map((r) => ({
    text: r.node.text,
    source: r.node.metadata.source ?? "unknown",
  }));
}
```

### Build the prompt and stream the answer

The chunks are formatted into a numbered context block and injected into the prompt alongside the original question. The You.com Express Agent is explicitly instructed to answer only from the provided context, then streams its response token-by-token back to the browser.

```ts
const input = `Answer the following question using only the provided context.

Context:
${buildContext(chunks)}

Question: ${query}`;

const stream = await you.agentsRuns({ agent: "express", input, stream: true });

const readable = new ReadableStream({
  async start(controller) {
    for await (const chunk of stream) {
      if (chunk.data.type === "response.output_text.delta") {
        controller.enqueue(encoder.encode(chunk.data.response.delta));
      }
      if (chunk.data.type === "response.done") controller.close();
    }
  },
});

return new Response(readable, { headers: { "Content-Type": "text/plain" } });
```

#### `app/page.tsx` — The UI

The browser handles file selection, reading file contents as text, posting everything to `/api/query`, and streaming the response back into the UI.

**Reading files and sending the request**

Files are read as plain text in the browser using the File API and sent in the request body alongside the query and API key. No server-side file storage is needed.

```ts
const files = await Promise.all(
  attachedFiles.map(async (file) => ({
    fileText: await file.text(),
    filename: file.name,
  }))
);

const res = await fetch("/api/query", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ files, query, apiKey }),
});
```

**Reading the streamed response**

The response body is a `ReadableStream`. The UI reads it chunk-by-chunk and appends each decoded token to the displayed answer as it arrives.
```ts
const reader = res.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  setResponse((prev) => prev + decoder.decode(value, { stream: true }));
}
```

***

## Further Reading

If you want to go deeper on RAG concepts and production patterns:

* [**Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks**](https://arxiv.org/abs/2005.11401) — the original RAG paper from Meta AI
* [**LlamaIndex Docs: Building a RAG Pipeline**](https://docs.llamaindex.ai/en/stable/getting_started/concepts/) — the library used in this demo for chunking and retrieval
* [**Pinecone: What is RAG?**](https://www.pinecone.io/learn/retrieval-augmented-generation/) — a practical guide to RAG with production considerations
* [**You.com Express Agent Reference**](/api-reference/agents/express-agent/express-agent-runs) — full API docs for the agent used in the generate step
* [**Research API**](/api-reference/research/v1-research) — if your use case involves searching the web rather than private documents, the Research API handles retrieval *and* synthesis for you

***

## Building a RAG System at Scale?

If you're building a RAG system for your enterprise — whether that's over internal documents, proprietary data, or large knowledge bases — You.com offers solutions designed for production scale. [Talk to our team](https://you.com/enterprise-ai-solutions) about your use case.

***

## Resources

* [You.com Express Agent Reference](/api-reference/agents/express-agent/express-agent-runs)
* [TypeScript SDK](https://www.npmjs.com/package/@youdotcom-oss/sdk) (`npm install @youdotcom-oss/sdk`)
* [GitHub: ydc-private-rag-sample](https://github.com/youdotcom-oss/ydc-private-rag-sample)
* [Live Demo](https://ydc-private-rag-sample.vercel.app/)
* [Discord](https://discord.com/invite/youdotcom/)