Prototype Notes

This page documents notes and observations from prototype work exploring retrieval-augmented generation (RAG) implementation.


LLM Fundamentals

Large language models primarily generate text-based outputs.
For tasks such as image generation, speech synthesis, or code execution, additional specialized models or tools are typically used alongside the LLM.

During training:

  • Text is broken down into tokens
  • Each token corresponds to a unique textual unit
  • Tokens can represent:
    • full words
    • partial words
    • punctuation or symbol combinations

Each token is assigned a token ID, and text becomes a sequence of these IDs.

The model’s training objective is to predict the next token given the preceding tokens in the sequence.

Prediction is computed using:

  • token embeddings
  • weights learned during training
  • contextual relationships between tokens (e.g., how frequently tokens appear together).
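The next-token objective can be illustrated with a toy bigram model that predicts the next token purely from co-occurrence counts. This is only a sketch of the idea: real LLMs use learned embeddings, weights, and attention over the whole context, not raw counts.

```python
from collections import Counter, defaultdict

# Toy corpus, already split into tokens (real models operate on token IDs).
tokens = "the cat sat on the mat the cat ran".split()

# Count which token follows which.
following = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    following[prev][nxt] += 1

def predict_next(token):
    """Return the most frequent continuation seen in the corpus."""
    return following[token].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" twice, "mat" once
```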

Tokenization

Tokenization determines how raw text is converted into tokens.

Common Methods

Word Tokenization

  • Text split into individual words using delimiters such as spaces.

Character Tokenization

  • Text split into individual characters.

Subword Tokenization

  • Frequently used words remain intact.
  • Rare words are broken into smaller components.

Models such as GPT use subword tokenization (byte-pair encoding).
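The three methods can be compared on the same text. The subword vocabulary below is made up for illustration; real tokenizers learn theirs from data, and the greedy longest-match loop is only a simplification of algorithms like byte-pair encoding.

```python
text = "unbelievable results"

# Word tokenization: split on whitespace.
word_tokens = text.split()

# Character tokenization: every character is a token.
char_tokens = list(text)

# Subword tokenization: toy greedy longest-match against a tiny vocabulary.
vocab = {"un", "believ", "able", "results", " "}

def subword_tokenize(s, vocab):
    tokens, i = [], 0
    while i < len(s):
        # Take the longest vocabulary entry matching at position i;
        # fall back to a single character if nothing matches.
        match = max((v for v in vocab if s.startswith(v, i)),
                    key=len, default=s[i])
        tokens.append(match)
        i += len(match)
    return tokens

print(subword_tokenize(text, vocab))  # ['un', 'believ', 'able', ' ', 'results']
```

Note how the rare word "unbelievable" is broken into pieces while the common word "results" stays intact.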


Context Window

LLMs have limits on how many tokens they can process.

This limit is called the context window.

The context window includes:

  • input tokens
  • output tokens generated by the model

If the limit is exceeded, earlier parts of the input may be truncated.

This constraint is one of the motivations for using retrieval-based approaches such as RAG.
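A minimal sketch of the truncation behavior, with made-up sizes (an 8-token window is far smaller than any real model's):

```python
def fit_to_window(prompt_tokens, max_output_tokens, context_window):
    """Drop the oldest input tokens so the prompt plus reserved output fits.

    The context window counts both input and generated tokens, so room
    for the response must be reserved up front.
    """
    budget = context_window - max_output_tokens
    if len(prompt_tokens) <= budget:
        return prompt_tokens
    # Truncate from the front: earlier context is dropped first.
    return prompt_tokens[-budget:]

tokens = ["t1", "t2", "t3", "t4", "t5", "t6", "t7"]
print(fit_to_window(tokens, max_output_tokens=3, context_window=8))
# 5-token input budget, so the two oldest tokens are dropped
```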


Embeddings

Embeddings represent text as numerical vectors capturing semantic meaning.

Different models produce different embedding representations.

Reference documentation:
https://ai.google.dev/gemini-api/docs/embeddings

Embeddings allow models to:

  • measure similarity between texts
  • cluster related documents
  • retrieve relevant context for queries.
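Similarity between embeddings is commonly measured with cosine similarity. The 3-dimensional vectors below are invented for illustration; real embedding models produce hundreds to thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical embeddings: "cat" and "dog" point in similar directions,
# "stock" does not.
cat = [0.9, 0.1, 0.0]
dog = [0.8, 0.2, 0.1]
stock = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, dog))    # high: related concepts
print(cosine_similarity(cat, stock))  # low: unrelated concepts
```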

Retrieval-Augmented Generation (RAG)

Main reference article:
https://en.wikipedia.org/wiki/Retrieval-augmented_generation

RAG combines information retrieval systems with language models.

Instead of relying solely on the knowledge encoded in the model weights, RAG retrieves relevant external documents.

General workflow:

  1. Source data is converted into embeddings
  2. Embeddings are stored in a vector database
  3. A user query is converted into an embedding
  4. A retriever identifies the most relevant documents
  5. Retrieved context is added to the prompt
  6. The LLM generates a response using both the query and retrieved information
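The six steps above can be sketched end to end with a stand-in embedder. Here a bag-of-words counter and a plain list replace the embedding model and vector database; a real pipeline would call an embedding API and a proper vector store, and send the final prompt to the LLM.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedder: bag-of-words counts instead of a dense vector."""
    return Counter(text.lower().split())

def similarity(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Steps 1-2: embed source documents and store them.
documents = [
    "LAMMPS computes diffusion coefficients from mean squared displacement",
    "The fix nvt command performs constant-temperature dynamics",
]
store = [(doc, embed(doc)) for doc in documents]

# Steps 3-4: embed the query and retrieve the closest document.
query = "how to calculate a diffusion coefficient"
q_vec = embed(query)
best_doc, _ = max(store, key=lambda item: similarity(q_vec, item[1]))

# Steps 5-6: splice the retrieved context into the prompt for the LLM.
prompt = f"Context: {best_doc}\n\nQuestion: {query}"
print(prompt)
```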

Vector Representations

Text can be encoded as different types of vectors.

Sparse Vectors

Characteristics:

  • encode explicit token identity
  • length equal to the vocabulary (dictionary) size
  • contain many zeros

Dense Vectors

Characteristics:

  • encode semantic meaning
  • compact representation
  • fewer zero values

Dense vectors are commonly used in modern embedding models.
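The contrast can be shown with a toy dictionary of eight words (real vocabularies have tens of thousands of entries):

```python
# Toy dictionary; one vector slot per word.
vocabulary = ["atom", "bond", "cat", "dog", "energy", "force", "pair", "run"]

def sparse_vector(text):
    """One slot per dictionary word: explicit token identity, mostly zeros."""
    words = text.lower().split()
    return [words.count(w) for w in vocabulary]

print(sparse_vector("run run atom"))
# [1, 0, 0, 0, 0, 0, 0, 2] -- length equals the dictionary size

# A dense embedding of the same text would instead be a short list of
# learned floats, e.g. [0.12, -0.87, 0.33, ...], with few exact zeros.
```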


Retriever-Centric Improvements

Some research focuses on improving the retrieval stage of RAG rather than modifying the language model itself.

Goals include:

  • improving document relevance
  • reducing hallucinations
  • increasing retrieval accuracy.

Chunking

Chunking refers to splitting documents into smaller units before generating embeddings.

This is necessary because:

  • LLM context windows are limited
  • embeddings typically work best on smaller text segments.

Chunking strategy significantly affects retrieval quality.

Common Chunking Strategies

Document-level chunking

  • each document treated as one unit.

Page-based chunking

  • chunks created based on page boundaries.

Text-block chunking

  • chunks generated based on natural paragraph boundaries.
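Text-block chunking is straightforward when paragraphs are separated by blank lines; a minimal sketch:

```python
def text_block_chunks(document):
    """Split on blank lines so each chunk is a natural paragraph."""
    paragraphs = [p.strip() for p in document.split("\n\n")]
    return [p for p in paragraphs if p]

doc = """First paragraph about pair styles.

Second paragraph about thermostats.

Third paragraph about dump commands."""

print(text_block_chunks(doc))
```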

Chunking a Large PDF

The LAMMPS manual is a 3000+ page PDF.

To use it within a RAG system:

  • Extract text from the PDF
  • Separate metadata such as:
    • title
    • author
    • publication date
  • Apply chunking
  • Convert chunks into embeddings
  • Store embeddings in a vector database.

Chunking must preserve semantic coherence so that the retriever can identify meaningful context.
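Assuming the text has already been extracted from the PDF (one string per page, via whatever PDF library is used), the chunking and metadata steps might look like this sketch. The page strings and metadata values below are invented examples.

```python
def chunk_pages(pages, metadata):
    """Turn extracted page text into paragraph chunks, each carrying the
    document metadata plus its page number for retrieval-time citation."""
    chunks = []
    for page_num, text in enumerate(pages, start=1):
        for para in text.split("\n\n"):
            para = para.strip()
            if para:
                chunks.append({"text": para, "page": page_num, **metadata})
    return chunks

# Hypothetical extracted pages; a real pipeline gets these from the PDF.
pages = [
    "pair_style lj/cut sets a Lennard-Jones potential.\n\n"
    "fix nvt thermostats the system.",
    "compute msd tracks mean squared displacement.",
]
meta = {"title": "LAMMPS manual"}
chunks = chunk_pages(pages, meta)
print(len(chunks), chunks[0]["page"], chunks[-1]["page"])
```

The chunks would then go to the embedding model and vector database.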


Prototype Implementation

The current prototype follows a typical RAG pipeline.

Core Components

Three main elements are required:

  1. Chat model
    • used for generating responses
  2. Embedding model
    • converts text into vector representations
  3. Vector database
    • stores embeddings and supports similarity search.

Prototype Workflow

The workflow currently implemented is:

  1. Chat model initialization
  2. Embedding model initialization
  3. Vector store setup
  4. Document loading
  5. Text splitting
  6. Storing document embeddings
  7. RAG agent creation
  8. Prompting and query handling

Each stage includes configurable parameters that affect the quality of retrieval and generation.
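An illustrative configuration covering those parameters; every value here is a made-up example, not a recommendation:

```python
# Hypothetical knobs for the workflow stages above.
config = {
    "chat_model": {"temperature": 0.2, "max_output_tokens": 1024},
    "embedding_model": {"dimensions": 768},
    "vector_store": {"distance_metric": "cosine"},
    "text_splitting": {"chunk_size": 1000, "chunk_overlap": 200},
    "retrieval": {"top_k": 4},
}

print(config["text_splitting"]["chunk_size"])
```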


Implementation Framework

The prototype currently uses LangChain.

Advantages:

  • extensive third-party integrations
  • modular components for LLMs, embeddings, and vector stores
  • easier experimentation with different RAG architectures.


TODO

  • Questions
  • Copilot-like Assistant
    • it must work inside the file editor and either “generate” or “complete” a script from the current editing context
      • code completion: the LLM predicts the next lines of the script from the current file.
      • contextual generation: “write a script to calculate the diffusion coefficient of X in Y alloy via LAMMPS”
    • possible system components:
      • backend:
        • lammps RAG using langchain
        • using lammps documentation for retrieval
        • api endpoint
      • frontend as a VS Code extension?
        • reads editor context
        • sends request to backend
        • displays suggestion