AI fundamentals: a learner's glossary from LLMs to quantization — Visakh Vijayan
Notes from learning modern AI — LLMs, tokenization, vectors, RAG, MCP, reasoning models, and how models get smaller and faster without starting from scratch.
May 27, 2026
I have been learning how modern AI systems actually work — not just how to call an API, but what happens under the hood when text goes in and text comes out. These notes are one section per concept, written in plain language while I am still connecting the dots myself.
LLM (Large Language Model)
A large language model is a neural network trained to predict the next token in a sequence, given the tokens that came before it.
At each step the model does not output a single word immediately. It predicts probabilities for many possible next tokens, assigns a score to each, and the best candidate (or one sampled from the distribution) is passed forward. That process repeats: predict, pick, append, predict again — until the model decides the sequence is complete.
How does it know when to stop? During training, sequences include a special end-of-sequence token (often written as <EOS>). The model learns that after enough of a reply, the probability of emitting that end token rises. When the end token wins, generation stops.
Older models sometimes looped when users asked them not to stop: they never confidently chose the end token and kept generating until the max token limit cut the output off. Newer models tend to handle stopping more reliably because end-of-sequence behavior is baked into both training and decoding.
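The predict, pick, append loop can be sketched with a toy stand-in for the model. Everything here is an assumption for illustration: TOY_MODEL is a hand-written lookup table that only looks at the last token, whereas a real LLM conditions on the whole sequence and scores a vocabulary of many thousands of tokens.

```python
import random

EOS = "<EOS>"

# Hypothetical toy "model": last token -> probabilities for the next token.
TOY_MODEL = {
    "<BOS>": {"The": 1.0},
    "The": {"sun": 0.9, EOS: 0.1},
    "sun": {"is": 0.8, EOS: 0.2},
    "is": {"bright": 0.9, EOS: 0.1},
    "bright": {EOS: 1.0},
}

def next_token_probs(tokens):
    """Return a probability distribution over the next token."""
    return TOY_MODEL[tokens[-1]]

def generate(max_tokens=10, greedy=True, seed=0):
    rng = random.Random(seed)
    tokens = ["<BOS>"]
    for _ in range(max_tokens):          # hard limit in case EOS never wins
        probs = next_token_probs(tokens)  # predict
        if greedy:
            token = max(probs, key=probs.get)  # pick the best candidate
        else:
            token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if token == EOS:                  # the end token won: stop
            break
        tokens.append(token)              # append, then predict again
    return tokens[1:]                     # drop the <BOS> marker

print(generate())  # ['The', 'sun', 'is', 'bright']
```

The max_tokens argument plays the role of the max token limit mentioned above: if the end token never wins, the loop still terminates.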
Tokenization
Tokenization splits text into the smallest units the model processes. These are usually common fragments of words, not always whole words.
Example: glitters might become two tokens, glitt and ers, because the tokenizer has seen those pieces often in training data.
Why it matters: pricing, context limits, and speed are all counted in tokens, not characters. A “word” in English is often more than one token.
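To make the splitting concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary is made up, and real tokenizers (BPE and similar schemes) learn their pieces from data and follow more involved rules, so treat this as an illustration of the idea rather than how any production tokenizer works.

```python
# Hypothetical vocabulary; real tokenizers learn tens of thousands of pieces.
VOCAB = {"glitt", "ers", "gl", "it", "the", "sun"}

def tokenize(word, vocab=VOCAB):
    """Split a word into the longest known pieces, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining piece first; fall back to one character.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

print(tokenize("glitters"))  # ['glitt', 'ers']
print(tokenize("sunny"))     # ['sun', 'n', 'y']
```

Note how an unfamiliar word such as sunny costs three tokens here: rare words break into more pieces, which is why token counts and character counts drift apart.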
Vectors (embeddings)
A vector is a numerical representation of a token (or phrase) in a high-dimensional space. Similar meaning tends to land closer together in that space.
So “king” and “queen” might sit nearer to each other than to “banana,” because the model’s training pushed their representations to reflect shared context and usage patterns.
Vectors are the bridge between human language and the math the model actually runs on.
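"Closer together" is usually measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings have hundreds or thousands of dimensions and come out of a trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings, chosen by hand so the example is easy to follow.
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "banana": [0.1, 0.05, 0.95],
}

king_queen = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["queen"])
king_banana = cosine_similarity(EMBEDDINGS["king"], EMBEDDINGS["banana"])
print(f"king vs queen:  {king_queen:.2f}")
print(f"king vs banana: {king_banana:.2f}")
```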
Self-supervised learning
In self-supervised learning, the model learns from raw data without humans labeling every example.
For text: scrape large corpora from the internet, tokenize them, then repeatedly mask or truncate the sequence and ask the model to predict what comes next.
Example progression:
Input: The → predict next token
Input: The sun → predict next
Input: The sun is → predict next
…until the full sentence The sun is bright is reconstructed.
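Every prediction is compared with the token that actually came next: the further off the prediction, the larger the adjustment to the weights, so patterns that predict well are reinforced. The same idea extends to images and video (predict missing patches or frames), not only text.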
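The progression above can be generated mechanically, which is the point of self-supervision: the labels come from the text itself. This is a minimal sketch; the helper names are my own, and the loss shown (negative log of the probability given to the correct token) is the usual cross-entropy form, included here only to show how "wrong" is measured.

```python
import math

def next_token_examples(tokens):
    """Turn one sentence into (context, target) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def cross_entropy(prob_of_correct_token):
    """Loss is small when the model gave the right token high probability."""
    return -math.log(prob_of_correct_token)

for context, target in next_token_examples(["The", "sun", "is", "bright"]):
    print(context, "->", target)

# A confident correct prediction is penalized less than an unsure one.
print(cross_entropy(0.9) < cross_entropy(0.2))  # True
```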
Transformer
The transformer is the dominant architecture for turning input tokens into output predictions. It uses attention so each token can weigh the relevance of the other tokens in the sequence, both nearby and far away.
Most LLMs you hear about (GPT-style, Claude, etc.) are transformer-based. The name describes how the network is wired, not the training data or product brand.
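A stripped-down sketch of attention, under a big simplifying assumption: queries, keys, and values are all just the input vectors. A real transformer first passes the inputs through learned projection matrices and runs many attention heads in parallel, none of which is shown here.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product attention with queries = keys = values = vectors."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # How relevant is every token (k) to the current token (q)?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)
        # New representation: a weighted mix of all token vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, vectors))
                 for j in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up vectors: tokens 0 and 1 are similar, token 2 is different.
outputs, weights = self_attention([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print([round(w, 2) for w in weights[0]])
```

In the printed weights for token 0, the similar token gets more weight than the dissimilar one, which is the "weigh relevance" behavior in miniature.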
Fine-tuning
Fine-tuning updates a pre-trained model’s weights so it behaves better for a domain — medicine, finance, legal, your company’s tone, and so on.
It is powerful but expensive: you need data, compute, and careful evaluation so you do not break general ability (“catastrophic forgetting”).
Cheaper alternatives for many products:
System prompts — instructions and rules in the prompt
RAG — retrieve relevant documents at query time instead of baking everything into weights
Actual weight adjustment is reserved for when behavior must change at the model level, not just at prompt time.
Few-shot prompting
Few-shot prompting puts examples in the system or user message so the model infers the pattern you want.
You might show two sample Q&A pairs in your preferred format; the model then answers the third question in the same style without retraining.
This is not fine-tuning — no weight update — but it steers behavior at inference time.
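A few-shot prompt is just text assembled before the call. The sketch below builds one from example pairs; the Q:/A: layout and the function name are my own choices for illustration, not a required format.

```python
def build_few_shot_prompt(examples, question):
    """Lay out worked examples, then the new question in the same format."""
    lines = []
    for q, a in examples:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    lines.append(f"Q: {question}")
    lines.append("A:")  # left open for the model to complete
    return "\n".join(lines)

examples = [
    ("What is 2 + 2?", "4"),
    ("What is 3 + 5?", "8"),
]
prompt = build_few_shot_prompt(examples, "What is 6 + 1?")
print(prompt)
```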
RAG (Retrieval Augmented Generation)
RAG helps the LLM answer with grounded context it did not memorize in training.
Retrieve — find relevant documents (often via vector search)
Augment — attach those documents to the user query
Generate — the LLM writes an answer using that extra context
Use RAG when facts change often, are private, or are too large to fit in a one-time fine-tune.
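The three steps can be sketched end to end. Two loud assumptions: retrieval here is plain word overlap standing in for vector search, and generate is a placeholder where a real LLM call would go. The documents are invented.

```python
DOCS = [
    "Refunds are processed within 5 business days.",
    "Support is available Monday to Friday.",
    "Passwords must contain at least 12 characters.",
]

def words(text):
    """Lowercased word set with basic punctuation stripped."""
    return set(text.lower().replace(".", "").replace("?", "").split())

def retrieve(query, docs, k=1):
    """Retrieve: rank documents by how many words they share with the query."""
    ranked = sorted(docs, key=lambda d: len(words(query) & words(d)),
                    reverse=True)
    return ranked[:k]

def augment(query, context_docs):
    """Augment: attach the retrieved documents to the user query."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def generate(prompt):
    """Generate: placeholder for a real LLM call (hypothetical)."""
    return f"[LLM answer based on a prompt of {len(prompt)} characters]"

query = "How long do refunds take?"
prompt = augment(query, retrieve(query, DOCS))
print(prompt)
print(generate(prompt))
```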
Vector database
A vector database stores embeddings (vectors) and supports fast similarity search: “which chunks are closest to this query vector?”
That is the usual retrieval layer behind RAG. Traditional database indexes match exact values or keywords; vector DBs index meaning-ish neighborhoods in embedding space.
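At its core the similarity search is "compare the query vector with every stored vector and keep the closest". The class below is a hypothetical in-memory sketch that does exactly that by brute force; real vector databases use approximate nearest-neighbor indexes to avoid scanning everything, and the vectors here are made up.

```python
import math

class TinyVectorStore:
    """Brute-force similarity search over (text, vector) pairs."""

    def __init__(self):
        self.items = []

    def add(self, text, vector):
        self.items.append((text, vector))

    def search(self, query_vector, k=1):
        # Score every stored chunk, then return the k most similar texts.
        scored = [(self._cosine(query_vector, v), t) for t, v in self.items]
        scored.sort(reverse=True)
        return [t for _, t in scored[:k]]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm

store = TinyVectorStore()
store.add("refund policy", [0.9, 0.1])   # made-up embeddings
store.add("office hours", [0.1, 0.9])
print(store.search([0.8, 0.2]))  # ['refund policy']
```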
MCP (Model Context Protocol)
MCP defines how to give a model structured access to external context — tools, files, APIs, databases — in a standard way clients and servers can agree on.
Instead of every app inventing its own plugin shape, MCP is a protocol for “what context can the model see, and how does it call out?”
It sits alongside prompts and RAG as another lever in the context stack.
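As far as I understand the specification, MCP messages are JSON-RPC 2.0, and a tool invocation uses the method tools/call with a tool name and arguments. The sketch below builds such a request and answers it with a toy in-process handler. The get_weather tool and the handler are hypothetical, and this is not the official SDK, only the message shape as I read it.

```python
import json

def make_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request asking a server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# Hypothetical tool registry standing in for a real MCP server.
TOOLS = {"get_weather": lambda args: f"Sunny in {args['city']}"}

def handle(request):
    """Toy server side: look up the tool, run it, wrap the result as text."""
    params = request["params"]
    text = TOOLS[params["name"]](params["arguments"])
    return {
        "jsonrpc": "2.0",
        "id": request["id"],  # responses echo the request id
        "result": {"content": [{"type": "text", "text": text}]},
    }

request = make_tool_call(1, "get_weather", {"city": "Kochi"})
response = handle(request)
print(json.dumps(request, indent=2))
print(json.dumps(response, indent=2))
```

The value of a shared shape like this is that any client that speaks it can call any server's tools without a custom integration.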
Context engineering
Context engineering is the practice of assembling everything the model sees at inference time: system prompts, few-shot examples, retrieved chunks, tool results, MCP resources, memory, and guardrails.
Prompt engineering is one slice; context engineering is the wider job of designing the full input so the model can do useful work reliably.
Reinforcement learning (from human feedback)
In reinforcement learning setups for chat models, the system generates multiple candidate responses, humans (or models) score them, and training nudges the policy toward higher-scored paths.
Example: you ask for an OAuth explanation. One answer is clear, one is jargon-heavy. Downvoting the bad one teaches the model that humans prefer the clearer style — without you writing gradient math by hand.
RLHF (reinforcement learning from human feedback) is how many assistants get “helpful and harmless” preferences baked in after pre-training.
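One common way to turn "humans preferred answer A over answer B" into a training signal is a pairwise loss on a reward model's scores. This is a sketch of that one formulation, not a claim about how any particular assistant was trained; the reward values are invented.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: low when the preferred answer scores higher.

    Computes -log(sigmoid(reward_chosen - reward_rejected)).
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Made-up scores for the clear answer and the jargon-heavy one.
print(round(preference_loss(2.0, 0.0), 3))  # model agrees with humans: small loss
print(round(preference_loss(0.0, 2.0), 3))  # model disagrees: large loss
```

Minimizing this loss pushes the score of the preferred answer above the rejected one, and those scores then guide the policy update.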
Chain of thought (reasoning models)
An LLM is still a next-token predictor. On hard problems, jumping straight to the final token often fails.
Chain-of-thought prompting asks the model to reason step by step before the answer. Intermediate steps improve later token predictions, which improves the final result — especially on math, logic, and multi-step planning.
Models trained or tuned for this behavior are often called reasoning models; they expose or encourage visible “thinking” before the user-facing answer.
Multi-modal models
Multi-modal models accept more than text — images, audio, video — in one pipeline. The same core idea (tokens or patches in, predictions out) extends across modalities with shared or linked encoders.
Examples: describe an image, transcribe speech, or answer questions about a diagram in a screenshot.
SLM (Small Language Model)
A small language model has fewer parameters than frontier LLMs. It is often trained or distilled for one company or task — support bots, on-device assistants, internal search — where latency and cost matter more than encyclopedic knowledge.
Parameters
Parameters are the learned weights (and biases) in the network — the numbers adjusted during training. More parameters generally mean more capacity and more memory at inference time.
When people say “70B model,” they mean it has on the order of 70 billion such parameters.
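Counting parameters is plain arithmetic. For a fully connected layer, every input connects to every output (the weights) and each output has one bias. The layer sizes below are arbitrary, chosen only to keep the numbers small.

```python
def dense_layer_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output."""
    return n_in * n_out + n_out

def mlp_params(layer_sizes):
    """Total parameters for a stack of fully connected layers."""
    return sum(dense_layer_params(a, b)
               for a, b in zip(layer_sizes, layer_sizes[1:]))

# 4 inputs -> 8 hidden units -> 2 outputs
print(mlp_params([4, 8, 2]))  # (4*8 + 8) + (8*2 + 2) = 58
```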
Distillation
Distillation trains a smaller student model to mimic a larger teacher.
Both see the same inputs; wherever the student’s output diverges from the teacher’s, the student’s weights are adjusted to close the gap, until the student is “good enough” at a fraction of the size.
This is a common path from LLM → SLM without training the small model from scratch on the entire internet.
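"Diverges" needs a number before weights can be adjusted. A common choice is the KL divergence between the teacher's and the student's output distributions, often softened with a temperature. This sketch assumes that formulation, and the logits are made up.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; higher temperature = softer."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's distribution to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.5]          # made-up scores over three tokens
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 4.0, 1.0]
print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

Training the student means adjusting its weights to push this number toward zero across many inputs.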
Quantization
Quantization compresses weight precision so models run on smaller hardware with less RAM.
Many models store weights as 16- or 32-bit floats; quantization might use 8-bit (or lower) representations, trading a little accuracy for a much smaller footprint and faster inference.
That is why “run an LLM on a laptop” became realistic: fewer bits per weight → less memory to hold the model → feasible on consumer GPUs or CPUs.
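Both halves of the trade can be shown with a few lines: the memory arithmetic, and the small rounding error that pays for it. This is a minimal sketch of symmetric 8-bit quantization with a single scale for all weights; real schemes typically quantize per channel or per block, and the weights here are made up.

```python
def model_size_gb(n_params, bits_per_weight):
    """Memory needed just to hold the weights, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

for bits in (32, 8, 4):
    print(f"70B params at {bits}-bit: {model_size_gb(70e9, bits):.0f} GB")

weights = [0.42, -1.0, 0.07, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
print(quantized)
print([round(r, 4) for r in restored])
```

The restored values are close to the originals but not identical; that gap is the "little accuracy" being traded away.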
How these pieces fit together
Rough mental model:
Pre-train a transformer with self-supervised learning on tokens.
Optionally fine-tune or apply RLHF for behavior.
At serving time, use context engineering — prompts, few-shot examples, RAG over a vector DB, and MCP tools.
For hard questions, use chain of thought or a reasoning model.
To deploy cheaply, distill to an SLM and quantize for edge or laptop inference.
I will keep revising this post as I learn more; corrections welcome.