Prototype Notes
This page documents notes and observations from prototype work exploring a retrieval-augmented generation (RAG) implementation.
LLM Fundamentals
Large language models primarily generate text-based outputs.
For tasks such as image generation, speech synthesis, or code execution, additional specialized models or tools are typically used alongside the LLM.
During training:
- Text is broken down into tokens
- Each token corresponds to a unique textual unit
- Tokens can represent:
  - full words
  - partial words
  - punctuation or symbol combinations
Each token is assigned a token ID, and text becomes a sequence of these IDs.
The model’s objective is to predict the next token given the preceding tokens in the sequence.
Prediction is computed using:
- token embeddings
- weights learned during training
- contextual relationships between tokens (e.g., how frequently tokens appear together).
Tokenization
Tokenization determines how raw text is converted into tokens.
Common Methods
Word Tokenization
- Text split into individual words using delimiters such as spaces.
Character Tokenization
- Text split into individual characters.
Subword Tokenization
- Frequently used words remain intact.
- Rare words are broken into smaller components.
GPT-family models, for example, use subword tokenization (byte-pair encoding).
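As a toy illustration of subword tokenization and token IDs, the sketch below does a greedy longest-match lookup against a hand-made vocabulary. The vocabulary, the IDs, and the example word are all invented for demonstration; real tokenizers (such as GPT's BPE) learn their vocabularies from data and handle arbitrary input.

```python
# Invented toy vocabulary: maps subword pieces to token IDs.
VOCAB = {"un": 0, "break": 1, "able": 2, "the": 3, "cat": 4}

def tokenize(text, vocab):
    """Greedy longest-match subword tokenization (simplified)."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest substring first
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return tokens

pieces = tokenize("unbreakable", VOCAB)
ids = [VOCAB[p] for p in pieces]
print(pieces)  # ['un', 'break', 'able']
print(ids)     # [0, 1, 2]
```

The frequent pieces stay intact while the rare word "unbreakable" is broken into smaller components, which is the core idea behind subword methods.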
Context Window
LLMs have limits on how many tokens they can process.
This limit is called the context window.
The context window includes:
- input tokens
- output tokens generated by the model
If the limit is exceeded, earlier parts of the input may be truncated.
This constraint is one of the motivations for using retrieval-based approaches such as RAG.
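A minimal sketch of the truncation behavior described above, assuming a simple keep-the-most-recent-tokens policy (real systems may use other strategies, such as summarizing or retrieving only relevant context):

```python
def fit_context(input_ids, max_context, reserve_for_output):
    """Keep the most recent input tokens so that the input plus the
    generated output fit within the model's context window."""
    budget = max_context - reserve_for_output
    if len(input_ids) <= budget:
        return input_ids
    return input_ids[-budget:]  # earliest tokens are dropped

ids = list(range(10))  # pretend token IDs 0..9
print(fit_context(ids, max_context=8, reserve_for_output=2))
# → [4, 5, 6, 7, 8, 9]
```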
Embeddings
Embeddings represent text as numerical vectors capturing semantic meaning.
Different models produce different embedding representations.
Reference documentation:
https://ai.google.dev/gemini-api/docs/embeddings
Embeddings allow models to:
- measure similarity between texts
- cluster related documents
- retrieve relevant context for queries.
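Similarity between embeddings is often measured with cosine similarity. A stdlib-only sketch, using invented 3-dimensional vectors (real embedding models produce hundreds to thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings for illustration only.
emb = {
    "the cat sat":  [0.9, 0.1, 0.0],
    "a cat rested": [0.8, 0.2, 0.1],
    "stock prices": [0.0, 0.1, 0.9],
}
print(cosine_similarity(emb["the cat sat"], emb["a cat rested"]))  # high (~0.98)
print(cosine_similarity(emb["the cat sat"], emb["stock prices"]))  # low  (~0.01)
```

Semantically related texts land close together in the embedding space, which is what makes similarity search and clustering possible.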
Retrieval-Augmented Generation (RAG)
Main reference article:
https://en.wikipedia.org/wiki/Retrieval-augmented_generation
RAG combines information retrieval systems with language models.
Instead of relying solely on the knowledge encoded in the model weights, RAG retrieves relevant external documents.
General workflow:
- Source data is converted into embeddings
- Embeddings are stored in a vector database
- A user query is converted into an embedding
- A retriever identifies the most relevant documents
- Retrieved context is added to the prompt
- The LLM generates a response using both the query and retrieved information
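The steps above can be sketched end to end in a few lines. The keyword-count "embedding" and the list-based store below are crude stand-ins for a real embedding model and vector database, and the final LLM call is omitted; the documents and keywords are invented for illustration:

```python
# Invented keyword list standing in for a learned embedding space.
KEYWORDS = ["lammps", "diffusion", "alloy", "thermostat", "python"]

def embed(text):
    """Toy embedding: counts of a few fixed keywords."""
    words = text.lower().split()
    return [words.count(k) for k in KEYWORDS]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

DOCS = [
    "LAMMPS computes diffusion coefficients via mean squared displacement.",
    "A thermostat controls temperature in LAMMPS simulations.",
    "Python scripts can post-process simulation output.",
]

# Steps 1-2: convert source data to embeddings and store them.
store = [(doc, embed(doc)) for doc in DOCS]

# Step 3: convert the user query into an embedding.
query = "how to calculate diffusion in lammps"
q_vec = embed(query)

# Step 4: the retriever identifies the most relevant document.
best_doc, _ = max(store, key=lambda item: dot(q_vec, item[1]))

# Steps 5-6: retrieved context is added to the prompt for the LLM.
prompt = f"Context:\n{best_doc}\n\nQuestion: {query}"
print(prompt)
```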
Vector Representations
Text can be encoded as different types of vectors.
Sparse Vectors
Characteristics:
- encode explicit token identity
- dimension equal to the vocabulary size
- contain many zeros
Dense Vectors
Characteristics:
- encode semantic meaning
- compact representation
- fewer zero values
Dense vectors are commonly used in modern embedding models.
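The contrast can be made concrete with a bag-of-words sparse vector versus a dense embedding for the same sentence. The vocabulary is a made-up eight-word list, and the dense values are invented for illustration:

```python
# Invented vocabulary for the sparse representation.
VOCABULARY = ["atom", "bond", "cat", "dog", "energy", "force", "lattice", "run"]

def sparse_vector(text):
    """One dimension per vocabulary word; mostly zeros."""
    words = text.lower().split()
    return [words.count(w) for w in VOCABULARY]

sparse = sparse_vector("run energy run")
print(sparse)  # [0, 0, 0, 0, 1, 0, 0, 2] -- length = vocabulary size

# A dense embedding packs meaning into far fewer, mostly nonzero
# values (numbers below are made up; real models output them).
dense = [0.12, -0.48, 0.93, 0.05]
```

The sparse vector records which tokens appear; the dense vector encodes what the text means, in a space where nearby vectors are semantically related.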
Similarity Search
Vector databases retrieve relevant documents using similarity metrics.
Common methods include:
Dot Product
- measures similarity as the sum of element-wise products; for unit-length vectors it is equivalent to cosine similarity.
K-Nearest Neighbors (KNN)
- retrieves the closest vectors based on distance metrics.
Approximate Nearest Neighbors (ANN)
- improves efficiency for large vector datasets
- commonly used in production retrieval systems.
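Exact KNN can be sketched as a brute-force scan, which is what ANN methods (e.g. FAISS or HNSW-based stores) approximate to scale to millions of vectors. The example vectors are invented:

```python
import math

def knn(query, vectors, k=2):
    """Brute-force k-nearest-neighbors: rank every stored vector by
    Euclidean distance to the query and keep the k closest indices."""
    return sorted(range(len(vectors)),
                  key=lambda i: math.dist(query, vectors[i]))[:k]

vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
print(knn([0.9, 0.1], vectors))  # → [1, 0]
```

The O(n) scan per query is exact but slow at scale, which is why production retrieval systems trade a little recall for speed with ANN indexes.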
Retriever-Centric Improvements
Some research focuses on improving the retrieval stage of RAG rather than modifying the language model itself.
Goals include:
- improving document relevance
- reducing hallucinations
- increasing retrieval accuracy.
Chunking
Chunking refers to splitting documents into smaller units before generating embeddings.
This is necessary because:
- LLM context windows are limited
- embeddings typically work best on smaller text segments.
Chunking strategy significantly affects retrieval quality.
Common Chunking Strategies
Document-level chunking
- each document treated as one unit.
Page-based chunking
- chunks created based on page boundaries.
Text-block chunking
- chunks generated based on natural paragraph boundaries.
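A minimal sketch of text-block chunking: split on blank lines, then pack consecutive paragraphs into chunks under a size limit. The size limit and sample text are arbitrary; real pipelines often add overlap between chunks to preserve context across boundaries:

```python
def chunk_by_paragraphs(text, max_chars=200):
    """Text-block chunking: split on blank lines, then pack
    consecutive paragraphs into chunks under max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # +2 accounts for the blank-line separator when merging.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

doc = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
print(chunk_by_paragraphs(doc, max_chars=35))
```

Because splits only happen at paragraph boundaries, each chunk stays semantically coherent, which matters for retrieval quality.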
Chunking a Large PDF
The LAMMPS manual is a 3000+ page PDF.
To use it within a RAG system:
- Extract text from the PDF
- Separate metadata such as:
  - title
  - author
  - publication date
- Apply chunking
- Convert chunks into embeddings
- Store embeddings in a vector database.
Chunking must preserve semantic coherence so that the retriever can identify meaningful context.
Prototype Implementation
The current prototype follows a typical RAG pipeline.
Core Components
Three main elements are required:
- Chat model
  - used for generating responses
- Embedding model
  - converts text into vector representations
- Vector database
  - stores embeddings and supports similarity search.
Prototype Workflow
The workflow currently implemented is:
- Chat model initialization
- Embedding model initialization
- Vector store setup
- Document loading
- Text splitting
- Storing document embeddings
- RAG agent creation
- Prompting and query handling
Each stage includes configurable parameters that affect the quality of retrieval and generation.
Implementation Framework
The prototype currently uses LangChain.
Advantages:
- extensive third-party integrations
- modular components for LLMs, embeddings, and vector stores
- easier experimentation with different RAG architectures.
References
- https://arxiv.org/pdf/2312.10997 (Retrieval-Augmented Generation for Large Language Models: A Survey)
- https://arxiv.org/pdf/2005.11401 (Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks)
- https://docs.langchain.com/oss/python/integrations/chat/huggingface
- Build a semantic search engine with LangChain
TODO
- Questions
- Copilot-like Assistant
  - must work inside the file editor and either “generate” or “complete” the script from the current script context
  - code completion: the LLM predicts the next lines of the script from the current file
  - contextual generation: “write a script to calculate the diffusion coefficient of X in Y alloy via LAMMPS”
  - possible system components:
    - backend:
      - LAMMPS RAG using LangChain, with the LAMMPS documentation as the retrieval corpus
      - API endpoint
    - frontend as a VS Code extension?
      - reads editor context
      - sends request to backend
      - displays suggestion