Preparing Data & Generating Embeddings
In our previous post we explored how to access and interact with Large Language Models (LLMs) using providers like Amazon Bedrock. Now, we move on to the next step in building a RAG chatbot: preparing your data and generating vector embeddings.
Why Data Preparation Matters
LLMs are powerful: trained on vast internet-scale datasets, they excel at understanding and generating human language. However, they lack context about your specific domain or internal data. That's where Retrieval-Augmented Generation (RAG) comes in.
To enable an LLM to reason over your data, you must first:
- Format it properly
- Convert it into machine-understandable vectors
- Store and retrieve it efficiently
This step, preparing your data pipeline, is crucial for any RAG-based application.
Structured vs. Unstructured Data
Today's data landscape is evolving. Traditional data warehouses are giving way to data lakes and lakehouses, which handle structured content (CSV files, SQL tables) and unstructured content (PDFs, documents, plain text) equally well.
In this series, we'll focus on unstructured data (PDFs, text files, and similar formats) and use LLMs to extract insights from it conversationally.
Use Case: Think of a student prepping for exams with an LLM that answers questions from their textbook. That's RAG in action.
RAG Architecture with Data Pipeline
Fig 1: RAG Architecture with Data Pipeline
Data Processing Flow:
- Document Ingestion: PDF, Word, and text files
- Text Extraction: parse and clean the content
- Chunking Strategy: split text into manageable segments
- Embedding Generation: convert text into vectors
- Vector Storage: store vectors in a vector database
- Retrieval & Generation: match the query and generate a response
Understanding Embeddings
LLMs can't search your raw documents directly. To make unstructured data retrievable, we must first convert it into embeddings: numerical representations that encode semantic meaning.
Embeddings allow the system to "reason" over your text by comparing the vector similarity between user queries and stored knowledge.
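To build some intuition, here is a minimal sketch of the similarity comparison itself, using NumPy and made-up three-dimensional vectors (real embedding models return hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 means semantically similar, near 0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embeddings (which typically have 1,000+ dimensions)
query_vec = np.array([0.12, 0.87, 0.33])
doc_vec_related = np.array([0.10, 0.80, 0.40])
doc_vec_unrelated = np.array([0.90, 0.05, 0.10])

print(cosine_similarity(query_vec, doc_vec_related))    # high score -> likely relevant
print(cosine_similarity(query_vec, doc_vec_unrelated))  # low score  -> likely irrelevant
```

Vector search in a RAG pipeline is essentially this comparison performed at scale across all stored chunks.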
Providers of Embedding Models
Some popular embedding providers include:
- Amazon Titan Embeddings
- OpenAI Embeddings
- Cohere
- Google Vertex AI
- HuggingFace models
For this tutorial, we'll use Amazon Titan via Bedrock.
Amazon Titan Embeddings on Bedrock
Building the Data Pipeline
Here's a high-level overview:
- Access raw document data
- Split documents into manageable chunks
- Convert chunks into embeddings
- Store embeddings in a vector database
- Search embeddings by query
- Feed relevant chunks into an LLM for a final answer
Tools Used
- LangChain: For document splitting and LLM abstraction
- FAISS: In-memory vector database for fast retrieval
- Bedrock: Amazon Bedrock, a serverless service for accessing different foundation models (FMs)
!!! note "Using LangChain for Rapid Development"
    For this demo, we'll be using LangChain, an open-source framework for building LLM-based applications. LangChain simplifies the complexity of accessing LLMs by providing convenient abstractions and wrappers, allowing developers to build powerful applications faster and reduce time to market.
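The first pipeline step, loading the raw documents, isn't shown in the snippets below, so here is a minimal sketch using LangChain's PyPDFLoader (the file path is an assumption; any LangChain document loader works the same way):

```python
from langchain_community.document_loaders import PyPDFLoader

# Hypothetical input file; swap in your own PDF path
pdf_path = "data/prompt-engineering-whitepaper.pdf"

# Each page becomes a LangChain Document with text content and metadata
loader = PyPDFLoader(pdf_path)
documents = loader.load()

print(f"Loaded {len(documents)} pages from {pdf_path}")
```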
Step 1: Splitting Text into Chunks
import logging
from typing import List

from langchain.text_splitter import RecursiveCharacterTextSplitter

logger = logging.getLogger(__name__)


# Method of the data-pipeline class (shown here as an excerpt)
def create_chunks(self, documents: List,
                  chunk_size: int = 1000,
                  chunk_overlap: int = 200) -> List:
    """
    Split documents into chunks with error handling

    Args:
        documents: List of documents to chunk
        chunk_size: Size of each chunk
        chunk_overlap: Overlap between chunks

    Returns:
        List of document chunks
    """
    try:
        if not documents:
            logger.warning("No documents provided for chunking")
            return []

        # Split on paragraph, line, word, then character boundaries
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
            separators=["\n\n", "\n", " ", ""]
        )
        chunks = text_splitter.split_documents(documents)
        logger.info(f"Split {len(documents)} documents into {len(chunks)} chunks")
        return chunks
    except Exception as e:
        logger.error(f"Error creating chunks: {e}")
        return []
Recommended settings:
- chunk_size: 500–1000 characters
- chunk_overlap: 10–20% of the chunk size
This ensures context is preserved across chunks.
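To see what the overlap actually does, here is a tiny standalone sketch with deliberately small sizes; the sample text and numbers are only for illustration:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

sample_text = (
    "Retrieval-Augmented Generation (RAG) combines an LLM with your own data. "
    "Documents are split into chunks, embedded, stored in a vector database, "
    "and retrieved at query time to ground the model's answers."
)

# Deliberately tiny sizes so the overlap between neighboring chunks is visible
splitter = RecursiveCharacterTextSplitter(chunk_size=80, chunk_overlap=16)
for i, chunk in enumerate(splitter.split_text(sample_text)):
    print(f"chunk {i}: {chunk!r}")
```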
Step 2: Generating Embeddings
import os
import logging
from typing import Optional

from langchain_aws import BedrockEmbeddings  # or: from langchain_community.embeddings import BedrockEmbeddings

logger = logging.getLogger(__name__)


def _get_bedrock_embeddings(self) -> Optional[BedrockEmbeddings]:
    """Initialize and return Bedrock embeddings with error handling"""
    try:
        # Get AWS credentials from the environment
        aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID")
        aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY")

        embeddings_kwargs = {
            "model_id": self.embedding_model_id,
            "region_name": self.region
        }

        # If explicit credentials are set, skip the named profile so boto3
        # falls back to the environment variables
        if aws_access_key_id and aws_secret_access_key:
            embeddings_kwargs.update({
                "credentials_profile_name": None
            })

        embeddings = BedrockEmbeddings(**embeddings_kwargs)
        logger.info(f"Initialized Bedrock embeddings with model: {self.embedding_model_id}")
        return embeddings
    except Exception as e:
        logger.error(f"Failed to initialize Bedrock embeddings: {e}")
        return None
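Once initialized, the embeddings object can convert text to a vector via the standard embed_query method. A quick standalone sketch (the model ID and region are assumptions; adjust them to whatever you've enabled in Bedrock):

```python
from langchain_aws import BedrockEmbeddings  # or: from langchain_community.embeddings import BedrockEmbeddings

# Standalone sketch; requires valid AWS credentials and Bedrock model access
embeddings = BedrockEmbeddings(
    model_id="amazon.titan-embed-text-v2:0",  # assumed model ID; v1 is amazon.titan-embed-text-v1
    region_name="us-east-1",                  # assumed region
)

vector = embeddings.embed_query("What is a prompt?")
print(f"Embedding dimension: {len(vector)}")  # 1024 for Titan v2 (default), 1536 for v1
```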
Step 3: Storing in a Vector DB
# Create vector store
logger.info("Creating FAISS vector store...")
vector_store = FAISS.from_documents(
documents=chunks,
embedding=self.embeddings
)
# Save vector store
os.makedirs(os.path.dirname(vector_db_path) or '.', exist_ok=True)
vector_store.save_local(vector_db_path)
logger.info(f"FAISS vector database created and saved to {vector_db_path}")
Other vector DBs to explore (a quick Chroma example follows the list):
- ChromaDB
- Weaviate
- Pinecone
- Amazon OpenSearch
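If you'd rather have a store that persists itself without an explicit save step, swapping the FAISS call above for Chroma is nearly a drop-in change. A hedged sketch (the persist_directory path is an assumption, and the chromadb package must be installed):

```python
from langchain_community.vectorstores import Chroma

# Drop-in replacement for FAISS.from_documents in the snippet above;
# Chroma persists the index to disk automatically
vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=self.embeddings,
    persist_directory="chroma_db",  # assumed local path
)
```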
Step 4: Querying the Vector DB
# Load vector database if not provided
if vector_db is None:
vector_db = self.load_vector_db(vector_db_path)
if vector_db is None:
return "Failed to load vector database. Please ensure the database exists and is accessible."
# Search for relevant documents
logger.info(f"Searching for relevant documents for question: {question[:100]}...")
docs = vector_db.similarity_search(question, k=k)
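The load_vector_db helper referenced above is worth sketching too. A minimal version, assuming the same pipeline class and logger as the earlier snippets, could look like this (FAISS.load_local needs allow_dangerous_deserialization=True for indexes you created yourself, because the document metadata is stored as a pickle):

```python
import logging
from typing import Optional

from langchain_community.vectorstores import FAISS

logger = logging.getLogger(__name__)


def load_vector_db(self, vector_db_path: str) -> Optional[FAISS]:
    """Load a previously saved FAISS index from disk."""
    try:
        vector_db = FAISS.load_local(
            vector_db_path,
            self.embeddings,
            # Required for locally created indexes: FAISS metadata is pickled
            allow_dangerous_deserialization=True,
        )
        logger.info(f"Loaded FAISS vector database from {vector_db_path}")
        return vector_db
    except Exception as e:
        logger.error(f"Failed to load vector database: {e}")
        return None
```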
Our Data Is Now Searchable!
Now that our embeddings are stored and indexed, we can query them and pass the results to an LLM to generate a contextual answer.
Example: Querying Documents
To demonstrate the pipeline, I've selected Google's prompt engineering whitepaper as the input document, a detailed resource outlining advanced prompting strategies.
Document Source: Google's Prompt Engineering Whitepaper (PDF)
# Prepare context
context = "\n\n".join([doc.page_content for doc in docs])
# Create prompt
prompt = f"""You are a helpful assistant that answers questions based on the provided context from PDF documents.
Context:
{context}
Question: {question}
Instructions:
- Answer the question based only on the provided context
- If the context doesn't contain enough information, say "I don't have enough information to answer this question based on the provided documents"
- Be concise but comprehensive in your answer
- Cite relevant parts of the context when possible
Answer:"""
# Get answer from Claude
logger.info("Getting answer from Bedrock Claude...")
answer = self.call_bedrock_claude(prompt)
return answer
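The call_bedrock_claude helper is not shown in this post. Here is a minimal sketch of how it might be implemented with boto3's bedrock-runtime client and the Anthropic messages format (the model ID and max_tokens value are assumptions; use whichever Claude model is enabled in your account):

```python
import json
import boto3


def call_bedrock_claude(self, prompt: str) -> str:
    """Send a prompt to a Claude model on Bedrock and return the text response."""
    client = boto3.client("bedrock-runtime", region_name=self.region)

    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,  # assumed limit for the generated answer
        "messages": [{"role": "user", "content": prompt}],
    }

    response = client.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # assumed model ID
        body=json.dumps(body),
    )

    # The response body is a stream containing the JSON payload
    payload = json.loads(response["body"].read())
    return payload["content"][0]["text"]
```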
Output
Prompt:
"What is a prompt?"
Output:
Based on the provided context, a prompt is an input given to a Large Language Model (LLM) to generate a response or prediction. When you write a prompt, you are trying to guide the LLM to predict the right sequence of tokens. The context specifically states that "in the context of natural language processing and LLMs, a prompt is an input provided to the model to generate a response or prediction."
Prompts can take different forms, including:
- Zero-shot prompts (which provide just a description of a task without examples)
- Role prompts (where the AI is assigned a specific role)
- Multimodal prompts (which can combine text, images, audio, code, or other formats)
The effectiveness of a prompt often requires tinkering and optimization of factors like length, writing style, and structure in relation to the task at hand.
Example RAG Query Response
Query: "What are the key benefits of using vector embeddings in RAG applications?"
Response: The RAG application successfully processes the query through the vector database, retrieves relevant context, and generates a comprehensive response highlighting the advantages of vector embeddings in semantic search and retrieval accuracy.
You can access the full code on GitHub here.
Summary
In this post, we covered:
- Why unstructured data matters in RAG
- What embeddings are and how they power LLM search
- How to split, embed, and store your documents
- How to query the knowledge base using FAISS
- How to retrieve relevant chunks from the vector database and feed them to an LLM to answer user queries interactively
Next Up
Next, we'll explore prompt engineering techniques.
Subscribe or follow to continue your RAG journey!