Prototype Notes

This page documents notes and observations from prototype work exploring retrieval-augmented generation (RAG) implementation.


LLM Fundamentals

Large language models primarily generate text-based outputs.
For tasks such as image generation, speech synthesis, or code execution, additional specialized models or tools are typically used alongside the LLM.

During training:

  • Text is broken down into tokens
  • Each token corresponds to a unique textual unit
  • Tokens can represent:
    • full words
    • partial words
    • punctuation or symbol combinations

Each token is assigned a token ID, and text becomes a sequence of these IDs.

The model’s training objective is to predict the next token given the preceding tokens in the sequence.

Prediction is computed using:

  • token embeddings
  • weights learned during training
  • contextual relationships between tokens (e.g., how frequently tokens appear together).
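The next-token objective can be illustrated with a toy bigram model that predicts the next token purely from co-occurrence counts. This is only a sketch of the idea: real LLMs use learned embeddings, weights, and attention over the whole context, not raw counts.

```python
from collections import Counter, defaultdict

# Toy corpus, already split into tokens (real models operate on token IDs).
tokens = "the cat sat on the mat the cat ran".split()

# Count which token follows which.
following = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    following[prev][nxt] += 1

def predict_next(token):
    """Return the most frequent continuation seen in the corpus."""
    return following[token].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" twice, "mat" once
```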

Tokenization

Tokenization determines how raw text is converted into tokens.

Common Methods

Word Tokenization

  • Text split into individual words using delimiters such as spaces.

Character Tokenization

  • Text split into individual characters.

Subword Tokenization

  • Frequently used words remain intact.
  • Rare words are broken into smaller components.

Models such as GPT use subword tokenization (byte-pair encoding).
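The three methods can be compared on the same text. The subword vocabulary below is made up for illustration; real tokenizers learn theirs from data, and the greedy longest-match loop is only a simplification of algorithms like byte-pair encoding.

```python
text = "unbelievable results"

# Word tokenization: split on whitespace.
word_tokens = text.split()

# Character tokenization: every character is a token.
char_tokens = list(text)

# Subword tokenization: toy greedy longest-match against a tiny vocabulary.
vocab = {"un", "believ", "able", "results", " "}

def subword_tokenize(s, vocab):
    tokens, i = [], 0
    while i < len(s):
        # Take the longest vocabulary entry matching at position i;
        # fall back to a single character if nothing matches.
        match = max((v for v in vocab if s.startswith(v, i)),
                    key=len, default=s[i])
        tokens.append(match)
        i += len(match)
    return tokens

print(subword_tokenize(text, vocab))  # ['un', 'believ', 'able', ' ', 'results']
```

Note how the rare word "unbelievable" is broken into pieces while the common word "results" stays intact.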


Context Window

LLMs have limits on how many tokens they can process.

This limit is called the context window.

The context window includes:

  • input tokens
  • output tokens generated by the model

If the limit is exceeded, earlier parts of the input may be truncated.

This constraint is one of the motivations for using retrieval-based approaches such as RAG.
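A minimal sketch of the truncation behavior, with made-up sizes (an 8-token window is far smaller than any real model's):

```python
def fit_to_window(prompt_tokens, max_output_tokens, context_window):
    """Drop the oldest input tokens so the prompt plus reserved output fits.

    The context window counts both input and generated tokens, so room
    for the response must be reserved up front.
    """
    budget = context_window - max_output_tokens
    if len(prompt_tokens) <= budget:
        return prompt_tokens
    # Truncate from the front: earlier context is dropped first.
    return prompt_tokens[-budget:]

tokens = ["t1", "t2", "t3", "t4", "t5", "t6", "t7"]
print(fit_to_window(tokens, max_output_tokens=3, context_window=8))
# 5-token input budget, so the two oldest tokens are dropped
```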


Embeddings

Embeddings represent text as numerical vectors capturing semantic meaning.

Different models produce different embedding representations.

Reference documentation:
https://ai.google.dev/gemini-api/docs/embeddings

Embeddings allow models to:

  • measure similarity between texts
  • cluster related documents
  • retrieve relevant context for queries.
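Similarity between embeddings is commonly measured with cosine similarity. The 3-dimensional vectors below are invented for illustration; real embedding models produce hundreds to thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical embeddings: "cat" and "dog" point in similar directions,
# "stock" does not.
cat = [0.9, 0.1, 0.0]
dog = [0.8, 0.2, 0.1]
stock = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, dog))    # high: related concepts
print(cosine_similarity(cat, stock))  # low: unrelated concepts
```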

Retrieval-Augmented Generation (RAG)

Main reference article:
https://en.wikipedia.org/wiki/Retrieval-augmented_generation

RAG combines information retrieval systems with language models.

Instead of relying solely on the knowledge encoded in the model weights, RAG retrieves relevant external documents.

General workflow:

  1. Source data is converted into embeddings
  2. Embeddings are stored in a vector database
  3. A user query is converted into an embedding
  4. A retriever identifies the most relevant documents
  5. Retrieved context is added to the prompt
  6. The LLM generates a response using both the query and retrieved information
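The six steps above can be sketched end to end with a stand-in embedder. Here a bag-of-words counter and a plain list replace the embedding model and vector database; a real pipeline would call an embedding API and a proper vector store, and send the final prompt to the LLM.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedder: bag-of-words counts instead of a dense vector."""
    return Counter(text.lower().split())

def similarity(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Steps 1-2: embed source documents and store them.
documents = [
    "LAMMPS computes diffusion coefficients from mean squared displacement",
    "The fix nvt command performs constant-temperature dynamics",
]
store = [(doc, embed(doc)) for doc in documents]

# Steps 3-4: embed the query and retrieve the closest document.
query = "how to calculate a diffusion coefficient"
q_vec = embed(query)
best_doc, _ = max(store, key=lambda item: similarity(q_vec, item[1]))

# Steps 5-6: splice the retrieved context into the prompt for the LLM.
prompt = f"Context: {best_doc}\n\nQuestion: {query}"
print(prompt)
```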

Vector Representations

Text can be encoded as different types of vectors.

Sparse Vectors

Characteristics:

  • encode explicit token identity
  • length equal to the vocabulary (dictionary) size
  • contain many zeros

Dense Vectors

Characteristics:

  • encode semantic meaning
  • compact representation
  • fewer zero values

Dense vectors are commonly used in modern embedding models.
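The contrast can be shown with a toy dictionary of eight words (real vocabularies have tens of thousands of entries):

```python
# Toy dictionary; one vector slot per word.
vocabulary = ["atom", "bond", "cat", "dog", "energy", "force", "pair", "run"]

def sparse_vector(text):
    """One slot per dictionary word: explicit token identity, mostly zeros."""
    words = text.lower().split()
    return [words.count(w) for w in vocabulary]

print(sparse_vector("run run atom"))
# [1, 0, 0, 0, 0, 0, 0, 2] -- length equals the dictionary size

# A dense embedding of the same text would instead be a short list of
# learned floats, e.g. [0.12, -0.87, 0.33, ...], with few exact zeros.
```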


Retriever-Centric Improvements

Some research focuses on improving the retrieval stage of RAG rather than modifying the language model itself.

Goals include:

  • improving document relevance
  • reducing hallucinations
  • increasing retrieval accuracy.

Chunking

Chunking refers to splitting documents into smaller units before generating embeddings.

This is necessary because:

  • LLM context windows are limited
  • embeddings typically work best on smaller text segments.

Chunking strategy significantly affects retrieval quality.

Common Chunking Strategies

Document-level chunking

  • each document treated as one unit.

Page-based chunking

  • chunks created based on page boundaries.

Text-block chunking

  • chunks generated based on natural paragraph boundaries.
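Text-block chunking is straightforward when paragraphs are separated by blank lines; a minimal sketch:

```python
def text_block_chunks(document):
    """Split on blank lines so each chunk is a natural paragraph."""
    paragraphs = [p.strip() for p in document.split("\n\n")]
    return [p for p in paragraphs if p]

doc = """First paragraph about pair styles.

Second paragraph about thermostats.

Third paragraph about dump commands."""

print(text_block_chunks(doc))
```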

Chunking a Large PDF

The LAMMPS manual is a 3000+ page PDF.

To use it within a RAG system:

  • Extract text from the PDF
  • Separate metadata such as:
    • title
    • author
    • publication date
  • Apply chunking
  • Convert chunks into embeddings
  • Store embeddings in a vector database.

Chunking must preserve semantic coherence so that the retriever can identify meaningful context.
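Assuming the text has already been extracted from the PDF (one string per page, via whatever PDF library is used), the chunking and metadata steps might look like this sketch. The page strings and metadata values below are invented examples.

```python
def chunk_pages(pages, metadata):
    """Turn extracted page text into paragraph chunks, each carrying the
    document metadata plus its page number for retrieval-time citation."""
    chunks = []
    for page_num, text in enumerate(pages, start=1):
        for para in text.split("\n\n"):
            para = para.strip()
            if para:
                chunks.append({"text": para, "page": page_num, **metadata})
    return chunks

# Hypothetical extracted pages; a real pipeline gets these from the PDF.
pages = [
    "pair_style lj/cut sets a Lennard-Jones potential.\n\n"
    "fix nvt thermostats the system.",
    "compute msd tracks mean squared displacement.",
]
meta = {"title": "LAMMPS manual"}
chunks = chunk_pages(pages, meta)
print(len(chunks), chunks[0]["page"], chunks[-1]["page"])
```

The chunks would then go to the embedding model and vector database.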


Prototype Implementation

The current prototype follows a typical RAG pipeline.

Core Components

Three main elements are required:

  1. Chat model
    • used for generating responses
  2. Embedding model
    • converts text into vector representations
  3. Vector database
    • stores embeddings and supports similarity search.

Prototype Workflow

The workflow currently implemented is:

  1. Chat model initialization
  2. Embedding model initialization
  3. Vector store setup
  4. Document loading
  5. Text splitting
  6. Storing document embeddings
  7. RAG agent creation
  8. Prompting and query handling

Each stage includes configurable parameters that affect the quality of retrieval and generation.
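An illustrative configuration covering those parameters; every value here is a made-up example, not a recommendation:

```python
# Hypothetical knobs for the workflow stages above.
config = {
    "chat_model": {"temperature": 0.2, "max_output_tokens": 1024},
    "embedding_model": {"dimensions": 768},
    "vector_store": {"distance_metric": "cosine"},
    "text_splitting": {"chunk_size": 1000, "chunk_overlap": 200},
    "retrieval": {"top_k": 4},
}

print(config["text_splitting"]["chunk_size"])
```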


Implementation Framework

The prototype currently uses LangChain.

Advantages:

  • extensive third-party integrations
  • modular components for LLMs, embeddings, and vector stores
  • easier experimentation with different RAG architectures.


TODO

  • Questions
  • Copilot-like Assistant
    • it must work inside the file editor and either “generate” or “complete” a script from the current editing context
      • code completion: the LLM predicts the next lines of the script from the current file.
      • contextual generation: “write a script to calculate the diffusion coefficient of X in Y alloy via LAMMPS”
    • possible system components:
      • backend:
        • lammps RAG using langchain
        • using lammps documentation for retrieval
        • api endpoint
      • frontend as a VS Code extension?
        • reads editor context
        • sends request to backend
        • displays suggestion