Private RAG


What Is RAG?

Large language models are powerful, but they only know what was in their training data. When you need answers grounded in your documents — internal wikis, uploaded files, proprietary data — general knowledge isn’t enough. You need Retrieval-Augmented Generation (RAG).

RAG can take different forms. You can retrieve from the live web using our search API, or you can retrieve from your own private documents using embeddings and a vector store. This demo covers the latter — private RAG — where the LLM response is grounded in your organization’s documents.

RAG solves the grounding problem in three steps: index your documents, retrieve the relevant content, then generate a grounded answer from it. The model never needs to memorize your data — it just reads the right pieces at the right time.


How RAG Works

RAG follows three steps: Index, Retrieve, and Generate.

1. Index

Your document is split into small, overlapping chunks of text. Each chunk is converted into a vector embedding — an array of numbers (e.g. [0.21, -0.87, 0.54, ...]) that encodes the meaning of that text. Chunks with similar meaning produce similar vectors, which is what makes semantic search possible.

Those vectors are stored in a vector store — either in-memory for a simple demo like this one, or in a dedicated database like Pinecone, Weaviate, or pgvector for production.
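Stripped of library machinery, an in-memory vector store is just an array of text/vector pairs. A minimal sketch, using toy hand-written vectors in place of a real embedding model:

```typescript
// Sketch of the index step: each chunk becomes a { text, vector } record in a
// plain array — which is all an in-memory vector store really is. The vectors
// below are toy values; a real system gets them from an embedding model.
type IndexedChunk = { text: string; vector: number[] };

const store: IndexedChunk[] = [];

function indexChunk(text: string, vector: number[]): void {
  store.push({ text, vector });
}

indexChunk("Interstellar follows a father who leaves his kids behind.", [0.21, -0.87, 0.54]);
indexChunk("The recipe calls for two cups of flour.", [-0.63, 0.12, 0.4]);
```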

2. Retrieve

When a user asks a question, it’s run through the same embedding model we used in the index step to produce its own vector. The vector store finds the closest matches — the passages most semantically related to the question. There are many ways to measure that similarity (cosine similarity, dot product, Euclidean distance); this demo uses cosine similarity, which is the default for the vector store we’re using.

This is why a question like “What movies involve a kid and a father?” can surface content about Interstellar even if the word “father” never appears in the text.
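To make the similarity measure concrete, here is cosine similarity over plain number arrays, with hypothetical three-dimensional toy vectors standing in for real embeddings:

```typescript
// Cosine similarity between two embedding vectors — the measure this demo's
// vector store uses. Values near 1 mean "semantically close".
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors: the "father" question lands closer to the Interstellar chunk
// than to an unrelated one, even without a keyword match.
const query = [0.9, 0.1, 0.3];
const interstellar = [0.8, 0.2, 0.4];
const unrelated = [-0.5, 0.9, -0.2];
cosineSimilarity(query, interstellar) > cosineSimilarity(query, unrelated); // true
```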

3. Generate

The retrieved chunks are injected into a prompt alongside the original question and passed to a language model. The model reads only that context and generates a precise, grounded answer — without hallucinating facts it doesn’t have.


RAG Strategies

There’s no single way to build a RAG system. The right approach depends on your data size, latency requirements, and how often your documents change.

| Strategy | How It Works | When to Use |
| --- | --- | --- |
| In-memory RAG | Index is built per-request, lives only in RAM | Demos, small files, stateless APIs |
| Persistent vector DB | Embeddings stored in Pinecone, pgvector, etc. | Large document sets, production apps |
| Hybrid search | Combines vector similarity with keyword (BM25) search | When exact terms matter alongside semantic meaning |
| Reranking | A second model re-scores retrieved chunks for relevance | When retrieval precision is critical |
| Multi-index / routing | Multiple vector stores for different document types | Large, heterogeneous knowledge bases |

This demo uses the simplest strategy: in-memory RAG. The index is built from the uploaded file at request time and discarded afterwards — no database, no disk writes. It’s a clean way to understand the fundamentals before adding persistence.
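The hybrid-search strategy in the table boils down to blending two scores per chunk. A sketch, assuming both scores are already normalized to [0, 1] (production systems often use reciprocal rank fusion instead of a weighted sum):

```typescript
// Hypothetical hybrid scorer: alpha weights semantic similarity against a
// keyword (BM25-style) score. alpha = 1 is pure vector search, 0 pure keyword.
function hybridScore(vectorScore: number, keywordScore: number, alpha = 0.7): number {
  return alpha * vectorScore + (1 - alpha) * keywordScore;
}

// A chunk with an exact keyword hit can outrank a slightly-more-semantic one:
hybridScore(0.8, 0.1); // strong semantic match, no keyword hit → 0.59
hybridScore(0.7, 0.9); // decent semantic match plus keyword hit → 0.76
```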


Why You.com for Generation?

The retrieve step finds the right content. The generate step turns it into something useful. That’s where the You.com Express Agent comes in.

It’s a fast, capable LLM endpoint designed for exactly this pattern: receive a context block, answer only from it, stream the response back. Key advantages:

  • Streaming by default — responses stream token-by-token, keeping latency low even for long answers
  • No hallucination pressure — explicitly prompted to answer only from the provided context
  • No mandatory web search — tools are opt-in, so the agent stays focused on your documents
  • Simple API — one SDK call with the express agent type
// This is where the magic happens — the Express Agent reads the retrieved
// context and synthesizes a grounded answer, streamed back token-by-token.
const stream = await you.agentsRuns({
  agent: "express",
  input: `Context:\n${retrievedChunks}\n\nQuestion: ${query}`,
  stream: true,
});

The result is an agent that reads your documents and writes coherent, grounded answers — not general-knowledge guesses.


Full Working Example

We’ve built a complete sample app you can clone and run locally:

The app is a Next.js project. Upload a .txt file (or use the built-in example), ask a question, and get a streamed answer grounded in that document. Embeddings run entirely on-device via BAAI/bge-small-en-v1.5 — no data leaves your machine during the indexing step.

$ git clone https://github.com/youdotcom-oss/ydc-private-rag-sample.git
$ cd ydc-private-rag-sample
$ npm install
$ npm run dev   # open localhost:3000

Enter your You.com API key in the UI, upload one or more files, and ask away.

How the Code Works

The app has two files worth understanding: the API route that runs the RAG pipeline, and the UI that drives it.

app/api/query/route.ts — Index, Retrieve & Generate

This is where all three RAG steps happen server-side.

At the top of the file, the embedding model is configured globally via LlamaIndex’s Settings object. The right model for your use case will depend on your latency requirements, accuracy needs, and whether you want to run embeddings locally or via an API — this demo uses a small, fast, on-device model, but production systems often swap in a hosted model like OpenAI’s text-embedding-3-small.

Settings.embedModel = new HuggingFaceEmbedding({
  modelType: "BAAI/bge-small-en-v1.5",
  modelOptions: { cache_dir: process.cwd() + "/models" },
});

1. Build the index


buildIndex() takes the uploaded files, wraps each one's text in a LlamaIndex Document, and calls VectorStoreIndex.fromDocuments(). Because no storageContext is provided, the entire index lives in memory for the lifetime of the request and is gone afterwards. In a production system you'd replace this with a persistent vector store — Pinecone, pgvector, Weaviate, and Chroma are all common choices — and build the index once rather than on every request.

async function buildIndex(files) {
  const documents = files.map(({ fileText, filename }) =>
    new Document({ text: fileText, metadata: { source: filename } })
  );
  // No storageContext → stays entirely in memory, gone after the request.
  return VectorStoreIndex.fromDocuments(documents);
}

LlamaIndex handles splitting each document into overlapping chunks and embedding them automatically.
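For intuition, here is a simplified, character-based version of that splitting. LlamaIndex's real splitter works on tokens and respects sentence boundaries, so this is only illustrative; the function name and parameters are hypothetical:

```typescript
// Fixed-size chunks with overlap, so text that straddles a boundary appears
// in both neighboring chunks and can still be retrieved as a unit.
function splitIntoChunks(text: string, chunkSize = 512, overlap = 64): string[] {
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

splitIntoChunks("abcdefghij", 4, 2); // → ["abcd", "cdef", "efgh", "ghij"]
```

Note how the last two characters of each chunk reappear at the start of the next — that overlap is what keeps boundary-spanning sentences retrievable.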

2. Retrieve the top chunks

retrieve() embeds the user's query using the same model, runs a similarity search against the in-memory vector store, and returns the top 3 most relevant chunks for the user's question.

async function retrieve(query, index, topK = 3) {
  const retriever = index.asRetriever({ similarityTopK: topK });
  const results = await retriever.retrieve(query);
  return results.map((r) => ({
    text: r.node.text,
    source: r.node.metadata.source ?? "unknown",
  }));
}
3. Build the prompt and stream the answer

The chunks are formatted into a numbered context block and injected into the prompt alongside the original question. The You.com Express Agent is explicitly instructed to answer only from the provided context, then streams its response token-by-token back to the browser.

const input = `Answer the following question using only the provided context.

Context:
${buildContext(chunks)}

Question: ${query}`;

const stream = await you.agentsRuns({ agent: "express", input, stream: true });

const readable = new ReadableStream({
  async start(controller) {
    for await (const chunk of stream) {
      if (chunk.data.type === "response.output_text.delta") {
        controller.enqueue(encoder.encode(chunk.data.response.delta));
      }
      if (chunk.data.type === "response.done") controller.close();
    }
  },
});

return new Response(readable, { headers: { "Content-Type": "text/plain" } });
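The route calls a buildContext() helper that isn't shown above. A plausible sketch (the sample app's actual implementation may differ) numbers each chunk and tags its source file so the model can ground and attribute its answer:

```typescript
// Hypothetical buildContext(): turn retrieved chunks into the numbered
// context block the prompt template expects.
type RetrievedChunk = { text: string; source: string };

function buildContext(chunks: RetrievedChunk[]): string {
  return chunks
    .map((c, i) => `[${i + 1}] (source: ${c.source})\n${c.text}`)
    .join("\n\n");
}
```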

app/page.tsx — The UI

The browser handles file selection, reading file contents as text, posting everything to /api/query, and streaming the response back into the UI.

Reading files and sending the request

Files are read as plain text in the browser using the File API and sent in the request body alongside the query and API key. No server-side file storage is needed.

const files = await Promise.all(
  attachedFiles.map(async (file) => ({
    fileText: await file.text(),
    filename: file.name,
  }))
);

const res = await fetch("/api/query", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ files, query, apiKey }),
});

Reading the streamed response

The response body is a ReadableStream. The UI reads it chunk-by-chunk and appends each decoded token to the displayed answer as it arrives.

const reader = res.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  setResponse((prev) => prev + decoder.decode(value, { stream: true }));
}

Further Reading

If you want to go deeper on RAG concepts and production patterns:


Building a RAG System at Scale?

If you’re building a RAG system for your enterprise — whether that’s over internal documents, proprietary data, or large knowledge bases — You.com offers solutions designed for production scale. Talk to our team about your use case.

