
📄 Preparing Data & Generating Embeddings

In our previous post we explored how to access and interact with Large Language Models (LLMs) using providers like Amazon Bedrock. Now, we move on to the next step in building a RAG chatbot: preparing your data and generating vector embeddings.


🧠 Why Data Preparation Matters

LLMs are powerful: trained on vast internet-scale datasets, they excel at understanding and generating human language. However, they lack context about your specific domain or internal data. That's where RAG comes in.

To enable an LLM to reason over your data, you must first:

  1. Format it properly
  2. Convert it into machine-understandable vectors
  3. Store and retrieve it efficiently

This step, preparing your data pipeline, is crucial for any RAG-based application.


๐Ÿž Structured vs Unstructured Dataโ€‹

Today's data landscape is evolving. Traditional data warehouses are giving way to data lakes and lakehouses that handle structured (CSV, SQL tables) and unstructured (PDFs, documents, plain text) content alike.

In this series, we'll focus on unstructured data, such as PDFs and text files, and use LLMs to extract insights from them conversationally.

๐Ÿ” Use Case: Think of a student prepping for exams using an LLM that answers questions from a textbook. Thatโ€™s RAG in action.


🖼 RAG Architecture with Data Pipeline

Fig 1: RAG Architecture with Data Pipeline

📊 Data Processing Flow:

  1. Document Ingestion → PDF, Word, Text files
  2. Text Extraction → Parse and clean content
  3. Chunking Strategy → Split into manageable segments
  4. Embedding Generation → Convert text to vectors
  5. Vector Storage → Store in vector database
  6. Retrieval & Generation → Query matching and response generation

🔧 Understanding Embeddings

LLMs don't understand raw text. We must convert our unstructured data into embeddings: numerical representations that encode semantic meaning.

Embeddings allow LLMs to "reason" over text by comparing the vector similarity between user queries and stored knowledge.
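
To make vector similarity concrete, here is a minimal sketch with made-up three-dimensional vectors (real embedding models return hundreds or thousands of dimensions); the chunk whose vector points in roughly the same direction as the query scores highest.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 means semantically similar, near 0.0 unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings"; values are illustrative only
query_vec = np.array([0.9, 0.1, 0.3])
chunk_a = np.array([0.8, 0.2, 0.4])   # chunk about the same topic as the query
chunk_b = np.array([0.1, 0.9, 0.0])   # unrelated chunk

print(cosine_similarity(query_vec, chunk_a))  # high score -> retrieved
print(cosine_similarity(query_vec, chunk_b))  # low score  -> skipped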


๐Ÿ— Providers of Embedding Modelsโ€‹

Some popular embedding providers include:

  • Amazon Titan Embeddings
  • OpenAI Embeddings
  • Cohere
  • Google Vertex AI
  • HuggingFace models

For this tutorial, we'll use Amazon Titan via Bedrock.

🔗 Amazon Titan Embeddings on Bedrock
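
If you want to see what Titan returns before wiring it into LangChain, the sketch below calls the model directly with boto3. The model ID and request/response shape follow the Titan Embeddings G1 - Text documentation, but treat them as assumptions and verify against the current Bedrock docs for your region.

import json
import boto3

# Bedrock runtime client (region is an example; use one your account is enabled for)
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock_runtime.invoke_model(
    modelId="amazon.titan-embed-text-v1",   # Titan text embedding model ID
    contentType="application/json",
    body=json.dumps({"inputText": "What is retrieval-augmented generation?"}),
)

payload = json.loads(response["body"].read())
embedding = payload["embedding"]   # list of floats
print(f"Embedding dimension: {len(embedding)}")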


🧱 Building the Data Pipeline

Here's a high-level overview:

  1. Access raw document data
  2. Split documents into manageable chunks
  3. Convert chunks into embeddings
  4. Store embeddings in a vector database
  5. Search embeddings by query
  6. Feed relevant chunks into an LLM for a final answer

🛠 Tools Used

  • LangChain: For document splitting and LLM abstraction
  • FAISS: In-memory vector similarity library for fast retrieval
  • Amazon Bedrock: Serverless service for accessing different foundation models (FMs)

!!! note "Using LangChain for Rapid Development"

For this demo, we'll be using LangChain, an open-source framework for building LLM-based applications. LangChain hides much of the complexity of accessing LLMs behind convenient abstractions and wrappers, allowing developers to build powerful applications faster and reduce time to market.
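
Step 1 of the pipeline, accessing the raw document data, is not shown in the snippets below, so here is a minimal sketch using LangChain's PyPDFLoader. The file path is a hypothetical placeholder, and the import assumes the langchain-community and pypdf packages are installed.

from langchain_community.document_loaders import PyPDFLoader

# Hypothetical path to the source PDF
loader = PyPDFLoader("data/prompt_engineering_whitepaper.pdf")
documents = loader.load()   # one Document per page, with page-number metadata

print(f"Loaded {len(documents)} pages")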

โœ‚๏ธ Step 1: Splitting Text into Chunksโ€‹

def create_chunks(self, documents: List,
                  chunk_size: int = 1000,
                  chunk_overlap: int = 200) -> List:
    """
    Split documents into chunks with error handling

    Args:
        documents: List of documents to chunk
        chunk_size: Size of each chunk
        chunk_overlap: Overlap between chunks

    Returns:
        List of document chunks
    """
    try:
        if not documents:
            logger.warning("No documents provided for chunking")
            return []

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
            separators=["\n\n", "\n", " ", ""]
        )

        chunks = text_splitter.split_documents(documents)
        logger.info(f"Split {len(documents)} documents into {len(chunks)} chunks")

        return chunks

    except Exception as e:
        logger.error(f"Error creating chunks: {e}")
        return []

✅ Recommended:

  • chunk_size: 500–1000 characters
  • chunk_overlap: 10–20% of chunk size

This overlap ensures context is preserved across chunk boundaries.
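
For reference, a call with the recommended settings might look like this (pipeline is a hypothetical instance of the class that defines create_chunks):

chunks = pipeline.create_chunks(documents, chunk_size=1000, chunk_overlap=200)
print(f"{len(chunks)} chunks ready for embedding")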


🔢 Step 2: Generating Embeddings

def _get_bedrock_embeddings(self) -> Optional[BedrockEmbeddings]:
    """Initialize and return Bedrock embeddings with error handling"""
    try:
        # Get AWS credentials
        aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID")
        aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY")

        embeddings_kwargs = {
            "model_id": self.embedding_model_id,
            "region_name": self.region
        }

        # Add credentials if available
        if aws_access_key_id and aws_secret_access_key:
            embeddings_kwargs.update({
                "credentials_profile_name": None  # Use explicit credentials
            })

        embeddings = BedrockEmbeddings(**embeddings_kwargs)
        logger.info(f"Initialized Bedrock embeddings with model: {self.embedding_model_id}")
        return embeddings

    except Exception as e:
        logger.error(f"Failed to initialize Bedrock embeddings: {e}")
        return None
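
Once initialized, the LangChain embeddings object exposes embed_query for a single string and embed_documents for a batch of chunk texts. A hedged usage sketch, where pipeline and chunks are hypothetical objects from the earlier steps:

embeddings = pipeline._get_bedrock_embeddings()
if embeddings is not None:
    vector = embeddings.embed_query("What is a prompt?")
    print(f"Query embedding dimension: {len(vector)}")

    # Embed every chunk in one batch
    chunk_vectors = embeddings.embed_documents([chunk.page_content for chunk in chunks])
    print(f"{len(chunk_vectors)} chunk vectors generated")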


🗃 Step 3: Storing in Vector DB

# Create vector store
logger.info("Creating FAISS vector store...")
vector_store = FAISS.from_documents(
    documents=chunks,
    embedding=self.embeddings
)

# Save vector store
os.makedirs(os.path.dirname(vector_db_path) if os.path.dirname(vector_db_path) else '.', exist_ok=True)
vector_store.save_local(vector_db_path)
logger.info(f"FAISS vector database created and saved to {vector_db_path}")

🧠 Other vector DBs to explore:

  • ChromaDB
  • Weaviate
  • Pinecone
  • Amazon OpenSearch

๐Ÿ” Step 4: Querying the Vector DBโ€‹

# Load vector database if not provided
if vector_db is None:
    vector_db = self.load_vector_db(vector_db_path)
    if vector_db is None:
        return "Failed to load vector database. Please ensure the database exists and is accessible."

# Search for relevant documents
logger.info(f"Searching for relevant documents for question: {question[:100]}...")
docs = vector_db.similarity_search(question, k=k)
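
If you also want to see how strong each match is, FAISS exposes similarity_search_with_score, which returns a distance alongside each document (with the default index, lower means more similar). A hedged sketch:

# Inspect the retrieved chunks and their distances
results = vector_db.similarity_search_with_score(question, k=k)
for doc, score in results:
    source = doc.metadata.get("source", "unknown")
    print(f"distance={score:.4f}  source={source}")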

✅ Our Data is Now Searchable!

Now that our embeddings are stored and indexed, we can query them and pass the retrieved results to an LLM to generate a contextual answer.


🧪 Example: Querying Documents

To demonstrate the pipeline, I've selected Google's prompt engineering whitepaper as the input document, a detailed resource outlining advanced prompting strategies.

Document Source: Google's Prompt Engineering Whitepaper (PDF)

# Prepare context
context = "\n\n".join([doc.page_content for doc in docs])

# Create prompt
prompt = f"""You are a helpful assistant that answers questions based on the provided context from PDF documents.

Context:
{context}

Question: {question}

Instructions:
- Answer the question based only on the provided context
- If the context doesn't contain enough information, say "I don't have enough information to answer this question based on the provided documents"
- Be concise but comprehensive in your answer
- Cite relevant parts of the context when possible

Answer:"""

# Get answer from Claude
logger.info("Getting answer from Bedrock Claude...")
answer = self.call_bedrock_claude(prompt)

return answer
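
The call_bedrock_claude helper is not shown above; one hedged way to implement it is with the Bedrock Converse API, sketched below. The model ID is an example, so substitute a Claude model you have access to in your region, and note that boto3 must be imported at module level.

def call_bedrock_claude(self, prompt: str) -> str:
    """Send the prompt to a Claude model on Bedrock and return the text reply."""
    client = boto3.client("bedrock-runtime", region_name=self.region)
    response = client.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]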

📸 Output

Prompt:

"What is a prompt?"

Output:

Based on the provided context, a prompt is an input given to a Large Language Model (LLM) to generate a response or prediction. When you write a prompt, you are trying to guide the LLM to predict the right sequence of tokens. The context specifically states that "in the context of natural language processing and LLMs, a prompt is an input provided to the model to generate a response or prediction."

Prompts can take different forms, including:

  • Zero-shot prompts (which provide just a description of a task without examples)
  • Role prompts (where the AI is assigned a specific role)
  • Multimodal prompts (which can combine text, images, audio, code, or other formats)

The effectiveness of a prompt often requires tinkering and optimization of factors like length, writing style, and structure in relation to the task at hand.

🎯 Example RAG Query Response

Query: "What are the key benefits of using vector embeddings in RAG applications?"

Response: The RAG application successfully processes the query through the vector database, retrieves relevant context, and generates a comprehensive response highlighting the advantages of vector embeddings in semantic search and retrieval accuracy.

You can access the full code on GitHub here.


📌 Summary

In this post, we covered:

  • Why unstructured data matters in RAG
  • What embeddings are and how they power LLM search
  • How to split, embed, and store your documents
  • How to query the knowledge base using FAISS
  • How to retrieve relevant context and have an LLM answer user queries interactively

🔮 Next Up

Next, we'll explore prompt engineering techniques.

Subscribe or follow to continue your RAG journey 🚀