AI Model Glossary

A detailed glossary explaining the architecture, training, and benchmarks behind AI models like ChatGPT, Claude, and Gemini. Learn how large language models work — from parameters to transformer design.


Behind every breakthrough in AI — from ChatGPT to Gemini — are models built on complex architectures, training data, and algorithms. This glossary explains the technical terms and mechanisms that power large language models and generative AI systems.

New to AI?
Start with the AI Glossary for the fundamentals before diving into model-specific terminology.
Model Architecture: The structural design of an AI system — how its layers, parameters, and connections are organized to process information. GPT models use the transformer architecture.

Transformer: A neural network architecture introduced in 2017 that uses attention mechanisms to understand relationships between words or tokens in a sequence, revolutionizing natural language processing.

Attention Mechanism: A process that allows a model to focus on the most relevant parts of input data when generating an output. It’s what enables GPT models to understand context and relationships in text.

Self-Attention: A specific kind of attention where a model compares every word in a sequence to every other word, capturing meaning and context across entire sentences or documents.
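
A minimal NumPy sketch of scaled dot-product self-attention; the toy dimensions and random matrices stand in for a model's learned projections:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # project tokens to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # every token scored against every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: rows become attention weights
    return weights @ V                            # each output mixes all value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                       # 4 tokens, 8-dim embeddings (toy sizes)
Wq = rng.normal(size=(8, 8))
Wk = rng.normal(size=(8, 8))
Wv = rng.normal(size=(8, 8))
print(self_attention(X, Wq, Wk, Wv).shape)        # (4, 8): one context-aware vector per token
```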
Parameters: The learned numerical values that define how an AI model processes data. More parameters usually mean greater capacity to learn complex patterns, though not always better performance.

Weights: The actual numerical values assigned to the model’s parameters after training. They determine how much influence each input has on the model’s output.

Biases: Additional parameters that adjust how the model processes inputs, often used to shift activations during computation. They’re distinct from social or ethical bias.

Layer: A level within a neural network where computations occur. Transformers typically include alternating layers of attention and feed-forward networks.

Embedding: A representation of words, sentences, or other data in a numerical vector space. Embeddings allow models to understand similarity and meaning mathematically.
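
A toy illustration of what embeddings buy you: cosine similarity compares vectors, so related concepts score higher. The 4-dimensional vectors below are invented for the example; real embeddings have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity of two embedding vectors: 1.0 = same direction, ~0.0 = unrelated."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up 4-dimensional embeddings for three words.
cat = np.array([0.9, 0.8, 0.1, 0.0])
dog = np.array([0.8, 0.9, 0.2, 0.1])
car = np.array([0.1, 0.0, 0.9, 0.8])

print(cosine_similarity(cat, dog))  # ~0.99: semantically related
print(cosine_similarity(cat, car))  # ~0.12: unrelated
```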
Tokenization: The process of breaking text into smaller pieces (tokens) — words, subwords, or characters — that the model can process. Different tokenization schemes affect context handling and model efficiency.
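
A simplified sketch of subword tokenization via greedy longest-match against a toy vocabulary (real tokenizers such as BPE learn their vocabularies from data):

```python
def tokenize(text, vocab):
    """Greedy longest-match subword tokenization over a toy vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest piece first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])          # unknown character: emit it as-is
            i += 1
    return tokens

vocab = {"trans", "form", "er", "token", "ization", " "}
print(tokenize("transformer tokenization", vocab))
# ['trans', 'form', 'er', ' ', 'token', 'ization']
```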
Context Window: The maximum number of tokens a model can “see” at once, covering both the prompt and the generated response. A larger window allows for longer documents or conversations without losing context.

Training Data: The text, code, and other sources used to teach a model language and reasoning patterns during training. Quality and diversity of data strongly affect performance.

Pre-training: The initial phase where a model learns general language patterns from massive datasets before being fine-tuned for specific tasks.

Fine-tuning: The process of refining a pre-trained model on a smaller, targeted dataset to specialize its behavior — for example, improving accuracy for coding or conversation.

Instruction Tuning: A form of fine-tuning that teaches a model to follow explicit human instructions in prompts, making it more responsive and user-friendly.

RLHF (Reinforcement Learning from Human Feedback): A process that aligns model responses with human preferences. Human evaluators rank outputs, a reward model learns those rankings, and reinforcement learning then adjusts the model toward answers the reward model scores highly.

LoRA (Low-Rank Adaptation): A fine-tuning technique that adapts large models efficiently by training only a small set of added low-rank matrices instead of retraining the entire network.
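
A sketch of the core LoRA idea with illustrative sizes: the pre-trained weight matrix stays frozen, and only two small factor matrices are trained:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                        # full dimension vs. a small rank r << d
W = rng.normal(size=(d, d))          # frozen pre-trained weight matrix
A = rng.normal(size=(r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                 # zero-initialized, so training starts from W unchanged

W_adapted = W + B @ A                # effective weight; only A and B ever get gradients

full = d * d                         # parameters updated by ordinary fine-tuning
lora = r * d + d * r                 # parameters updated by LoRA
print(f"trainable params: {lora:,} vs {full:,} ({lora / full:.1%})")
# trainable params: 8,192 vs 262,144 (3.1%)
```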
Mixture of Experts (MoE): A model design that routes inputs to specialized sub-networks (“experts”) depending on the task, improving efficiency and scalability. Used by models like Mixtral.
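
A toy sketch of expert routing, assuming a simple linear router with top-2 gating (real MoE models differ in the details):

```python
import numpy as np

rng = np.random.default_rng(0)
# Four toy "experts", each just a random linear map in this sketch.
experts = [lambda x, W=rng.normal(size=(8, 8)): W @ x for _ in range(4)]
router_W = rng.normal(size=(4, 8))                 # scores each expert for a given input

def moe_forward(x, k=2):
    """Send x to its top-k experts and combine their outputs by gate weight."""
    logits = router_W @ x
    top = np.argsort(logits)[::-1][:k]             # indices of the k best experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                           # softmax over the chosen experts only
    return sum(g * experts[i](x) for g, i in zip(gates, top))

print(moe_forward(rng.normal(size=8)).shape)       # (8,): only 2 of the 4 experts ran
```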
Checkpoint: A saved state of a model’s parameters during or after training. Developers can resume training or deploy models from these points.

Loss Function: The mathematical formula that measures how far a model’s predictions are from the desired outputs during training. The goal is to minimize this loss.
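
For next-token prediction the standard loss function is cross-entropy; a worked example over a made-up 3-token vocabulary:

```python
import numpy as np

def cross_entropy(predicted_probs, true_index):
    """Cross-entropy for one prediction: penalizes low probability on the correct token."""
    return -np.log(predicted_probs[true_index])

probs = np.array([0.7, 0.2, 0.1])  # model's probabilities over a 3-token vocabulary
print(cross_entropy(probs, 0))     # ~0.36: confident and correct, low loss
print(cross_entropy(probs, 2))     # ~2.30: correct token got little probability, high loss
```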
Optimizer: The algorithm that adjusts model parameters during training to reduce the loss function — examples include Adam, SGD, and Adafactor.
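
At its simplest, an optimizer nudges each parameter against the gradient of the loss. A single illustrative SGD step with arbitrary numbers:

```python
# One step of stochastic gradient descent (SGD), the simplest optimizer:
# new_weight = weight - learning_rate * gradient_of_loss
weight, gradient, learning_rate = 0.5, 0.8, 0.01
weight -= learning_rate * gradient
print(weight)  # ~0.492: the parameter moves in the direction that lowers the loss
```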
Epoch: One complete pass of the training dataset through the model. Multiple epochs help the model learn more deeply but risk overfitting.

Overfitting: When a model learns the training data too precisely, losing the ability to generalize to new or unseen data.

Underfitting: When a model is too simple or insufficiently trained, failing to capture meaningful patterns in the data.

Perplexity: A metric that measures how well a language model predicts text. Lower perplexity means better performance and understanding of language structure.
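
Perplexity is the exponential of the average per-token cross-entropy loss; a quick calculation with invented token probabilities:

```python
import numpy as np

# A perplexity of k loosely means the model is as uncertain as choosing
# among k equally likely tokens at each step.
token_probs = np.array([0.5, 0.25, 0.8, 0.4])  # probability the model gave each true token
perplexity = np.exp(-np.mean(np.log(token_probs)))
print(perplexity)  # ~2.24
```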
FLOPs (Floating Point Operations): A measure of computational complexity — how many mathematical operations a model performs. Useful for estimating training cost and model efficiency.
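
A rough estimate often cited in the scaling-law literature puts training compute at about 6 × parameters × tokens; the model and dataset sizes below are hypothetical:

```python
# Rule of thumb: training FLOPs ≈ 6 * parameter_count * training_tokens
params = 7e9    # a hypothetical 7B-parameter model
tokens = 2e12   # trained on a hypothetical 2T tokens
print(f"{6 * params * tokens:.2e} FLOPs")  # 8.40e+22
```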
Latency: The time delay between sending a prompt and receiving a model’s response. Optimizing latency is key for user-facing AI applications.

Inference: The phase where a trained model generates outputs from new input data (as opposed to training). ChatGPT responses are examples of inference.

Throughput: The number of inferences (responses or tokens) a model can generate per second — an important performance metric in deployment.

Hallucination: When an AI model confidently outputs incorrect, fabricated, or irrelevant information. Reducing hallucinations remains an active research area.

Knowledge Cutoff: The date at which an AI model’s training data ends; the model has no built-in knowledge of later events. For example, GPT-5 may not have knowledge of events after mid-2024.

Quantization: A technique to reduce model size and speed up inference by representing parameters with lower-precision numbers.
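
A minimal sketch of symmetric int8 quantization: store one scale factor plus 1-byte integers in place of 4-byte floats:

```python
import numpy as np

def quantize_int8(weights):
    """Map float weights to int8 with a single shared scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

w = np.array([0.31, -0.92, 0.04, 0.55], dtype=np.float32)
q, scale = quantize_int8(w)
print(q)          # int8 values, 1 byte each instead of 4
print(q * scale)  # dequantized: close to the originals, small rounding error
```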
Distillation: A process where a smaller model (“student”) learns to replicate the behavior of a larger one (“teacher”) for efficiency.
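
A sketch of the classic soft-target distillation loss, where the student is trained to match the teacher's temperature-softened distribution (the logits here are invented):

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; higher T softens the distribution."""
    e = np.exp(np.asarray(z) / T - np.max(np.asarray(z) / T))
    return e / e.sum()

teacher_logits = np.array([4.0, 2.0, 0.5])
student_logits = np.array([3.0, 2.5, 0.2])
T = 2.0                                  # soften both distributions

p_teacher = softmax(teacher_logits, T)
p_student = softmax(student_logits, T)
# KL divergence from teacher to student: how far the student's
# distribution is from the teacher's.
kl = np.sum(p_teacher * np.log(p_teacher / p_student))
print(kl)  # ~0.05; training the student minimizes this
```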
Retrieval-Augmented Generation (RAG): A hybrid approach that lets a model query external databases or documents to provide more accurate, up-to-date responses.
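
A toy end-to-end RAG flow; the letter-frequency “embedding” and three-document corpus are stand-ins for a real embedding model and vector store:

```python
import numpy as np

documents = [
    "The Eiffel Tower is in Paris.",
    "Transformers were introduced in 2017.",
    "Perplexity measures language model quality.",
]

def embed(text):
    """Toy 'embedding': normalized letter-frequency vector (real systems use a learned model)."""
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1
    return v / np.linalg.norm(v)

def retrieve(query, k=1):
    """Return the k documents whose toy embeddings are most similar to the query's."""
    scores = [embed(doc) @ embed(query) for doc in documents]
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

query = "When were transformers introduced?"
context = retrieve(query)
prompt = f"Answer using only this context: {context}\n\nQuestion: {query}"
print(prompt)  # the retrieved passage is prepended before calling the LLM
```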
Benchmarking: The process of evaluating model performance using standardized datasets and tests (e.g., MMLU, HellaSwag, GSM8K).

MMLU (Massive Multitask Language Understanding): A benchmark that tests reasoning and knowledge across 57 subjects, commonly used to evaluate GPT and other LLMs.

Prompt Template: A reusable prompt pattern designed to guide AI output consistently, often used in development and testing.

Temperature: A setting that controls randomness in AI responses — lower values produce more deterministic answers; higher values make responses more creative.
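
A quick demonstration of how temperature reshapes the sampling distribution, using invented logits for a 4-token vocabulary:

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Divide logits by the temperature, softmax, then sample one token index."""
    scaled = np.asarray(logits) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.5, 0.1]  # invented scores for a 4-token vocabulary
for t in (0.2, 1.0, 2.0):
    picks = [sample_with_temperature(logits, t, rng) for _ in range(1000)]
    print(t, np.bincount(picks, minlength=4) / 1000)  # low t: almost always token 0
```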
Top-p (Nucleus Sampling): A sampling method that restricts the next-token choice to the smallest set of tokens whose cumulative probability reaches p. Like temperature, it controls the diversity of the output.
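
And a sketch of the nucleus-filtering step itself, applied before sampling (the probabilities are illustrative):

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]                 # most likely tokens first
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, p) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()                # renormalize over the kept "nucleus"

probs = np.array([0.5, 0.3, 0.15, 0.05])
print(top_p_filter(probs, p=0.9))  # ~[0.526 0.316 0.158 0.]: the 5% tail is dropped
```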
Memory System: A mechanism (in models like GPT-5) that allows retaining information across sessions to build long-term context or personalization.

Alignment: Ensuring a model’s actions and responses reflect human values, safety principles, and intended use. A major focus in responsible AI development.

Constitutional AI: An approach developed by Anthropic that trains models to follow a predefined “constitution” of ethical principles instead of relying entirely on human feedback.
🧠 Part of the Bold Outlook AI Learning Series
Explore more guides, insights, and explainers on artificial intelligence, machine learning, and emerging technologies at BoldOutlook.com/AI.