AI/LLM Terms

You don't need to understand how a car engine works to drive one, but knowing the difference between horsepower and torque helps when someone's trying to sell you something. This page is that, for AI.

These concepts come up constantly in articles, Reddit threads, and product announcements. Most explanations assume you're either a developer or completely new to computers. This one assumes you're neither.

Jump to: Tokens · Context Window · Parameters · Model Names · Reasoning & Thinking · Temperature · Embeddings · Tools · Fine-Tuning · Hallucination · RAG · Multimodal

Tokens

The most fundamental unit of AI text. A token isn't exactly a word and isn't exactly a letter. It's somewhere in between: roughly a syllable or a short word. "Unbelievable" might be three tokens. "Cat" is one. A space before a word often counts as part of the next token.

Why does this matter? Because models don't read text the way you do. They process it one token at a time, predicting what comes next based on everything before it. When a model has a "token limit," that's how many of these chunks it can hold in its head at once, counting both what you send and what it writes back.

Tokens also determine cost. Most commercial AI APIs charge per token, so a long back-and-forth conversation is more expensive than a short one.

A useful rule of thumb: 1,000 tokens is roughly 750 words.

Context Window

Think of the context window as the model's working memory for a conversation. Everything inside it, your messages, the model's replies, any documents you've pasted in, is what the model can actually "see" when forming a response. Anything outside the window might as well not exist.

Early models had small context windows, around 4,000 tokens, which meant long conversations would cause the model to "forget" things you said at the beginning. Modern models can handle hundreds of thousands of tokens, which is enough to load an entire novel.

A bigger context window isn't always better in practice. Models sometimes struggle to pay attention to things buried in the middle of a very long context, a phenomenon researchers call "lost in the middle." But for most everyday use, more context is a genuine improvement.

Parameters

When someone says a model has "7 billion parameters" or "70 billion parameters," they're describing the size of the model's internal knowledge structure. Parameters are the numerical weights stored in a neural network, adjusted during training until the model gets good at predicting text.

More parameters generally means more capability, but also more compute to run. A 70B model needs serious hardware. A 7B model can often run on a decent laptop or phone.

The relationship between parameters and intelligence isn't perfectly linear. Newer, more efficient training techniques let smaller models punch above their weight. A 2025 model with 8 billion parameters might outperform a 2023 model with 30 billion on many tasks.

Model Names: What Does "gemma4:9b" Actually Mean?

Model names look cryptic but follow a pattern once you know what to look for.

Take gemma4:9b as an example. "Gemma" is the model family, created by Google. "4" is the version number. "9b" means 9 billion parameters. So the full name tells you: fourth version of Google's Gemma model, with 9 billion parameters.

Another example: llama3.2:3b is Meta's Llama model, version 3.2, with 3 billion parameters.

Commercial models follow slightly different naming conventions. Anthropic uses names like claude-sonnet-4 where the word in the middle (Haiku, Sonnet, Opus) signals where the model sits on the speed/intelligence tradeoff. Haiku is fast and cheap. Opus is slower but more capable. Sonnet is the middle ground.

OpenAI uses gpt-4o where "o" stands for "omni," meaning the model handles text, images, and audio. The number is the generation.

When you see suffixes like -instruct or -chat, those indicate the model has been fine-tuned to follow instructions or hold conversations, as opposed to the raw "base" model that just predicts the next word without any guidance about being helpful.

Reasoning and Thinking

Standard AI models generate responses token by token, straight through. You ask something, and the model starts writing an answer immediately.

Reasoning models work differently. Before producing a final response, they generate a long internal "scratchpad" where they work through the problem, check their logic, and sometimes backtrack and try a different approach. You can often see this process exposed as a "thinking" section that appears before the answer.

The analogy is the difference between someone blurting out an answer and someone pausing to think it through on paper first. The second approach takes longer and uses more compute, but handles complex problems, especially math, logic, and multi-step planning, much more reliably.

Not every problem benefits from this. For a simple factual question, reasoning just adds latency. For "help me debug this tax situation" or "plan a trip across five countries," it can make a significant difference.

Temperature

Temperature controls how predictable or creative a model's outputs are. It's a number, typically between 0 and 1 (or sometimes 0 and 2 depending on the platform), that adjusts how the model weighs its options at each token.

At low temperatures, the model almost always picks the most probable next token. Outputs are consistent and conservative. Ask the same question twice and you'll get nearly identical answers.

At high temperatures, the model is more likely to pick surprising or less common tokens. Outputs get more varied and creative, but also more prone to errors and weird tangents.

For extracting factual information or writing code, lower temperatures work better. For brainstorming, creative writing, or generating diverse options, higher temperatures can help. Most chat products set a moderate temperature and don't expose the setting to users.

Embeddings

Embeddings are a way of turning meaning into coordinates.

Here's the core idea. Imagine a map, except instead of physical locations, every word, sentence, or document gets plotted based on what it means. Things with similar meanings end up close together on this map. "Furious" and "livid" are neighbors. "Furious" and "spreadsheet" are on opposite ends of the continent.

In practice this map has hundreds of dimensions instead of two (you can't visualize it, but the math works the same way), and each point on it is represented as a long list of numbers. That list of numbers is the embedding.

What makes this genuinely useful is that the relationships between concepts get preserved spatially. The famous demonstration of this: if you take the embedding for "king," subtract the embedding for "man," and add the embedding for "woman," you land almost exactly on the embedding for "queen." The model hasn't been told that kings and queens are gender counterparts. It learned the relationship purely from seeing how those words appear in context, and the geometry captures it.

You've already been using embeddings without knowing it. When Spotify's radio feature surfaces a song you've never heard but somehow fits your mood perfectly, that's embeddings. Spotify maps songs into a space based on tempo, key, energy, instrumentation, and hundreds of other properties. Your listening history creates a kind of gravitational center in that space, and the recommendation engine finds songs that orbit nearby. It's not matching "you liked this rock song, here's another rock song." It's matching the actual sonic fingerprint at a level below genre labels.

The same principle powers the "Find similar" button on Pinterest, the "More like this" feature on streaming platforms, and search tools that understand what you meant rather than just matching your exact words. Type "something to watch when you're sad and want to feel understood" into a good search and it returns relevant results not because those words appear in any movie description, but because the semantic space of that query is close to the semantic space of certain films.

For AI specifically, developers use embeddings to let models search through large collections of text, like a company's entire documentation or years of email archives, without having to load all of it into the context window at once. The model converts a query into an embedding, finds the documents with nearby embeddings, and pulls only those into context to answer the question. This is the mechanism behind most "chat with your documents" products.

Tools (Also Called Function Calling)

A base language model knows a lot, but it's stuck in the past. Its training data has a cutoff date, it can't browse the internet, and it can't take actions in the world.

Tools change that. When a model has access to tools, it can call out to external systems mid-conversation. Common examples: web search, running a calculator, reading a file, checking a calendar, or sending an email.

The model decides when to use a tool based on the conversation. If you ask it what the weather is today, and it has access to a weather API, it will call that API, get the current data, and weave the result into its reply. From the outside it looks seamless.

This is what makes AI "agents" possible. An agent is a model that can use tools repeatedly, in sequence, to accomplish a goal that requires multiple steps. Give it access to your browser, a code interpreter, and your file system and it can do things that would have seemed like science fiction a few years ago.

Fine-Tuning

Pre-training is when a model learns from a massive dataset of internet text, books, and code. Fine-tuning happens after that, using a much smaller, curated dataset to adjust the model's behavior for a specific purpose.

A general-purpose model fine-tuned on medical records becomes better at clinical terminology. One fine-tuned on customer service transcripts gets better at de-escalating complaints. One fine-tuned on your company's internal docs learns your vocabulary and product names.

Fine-tuning doesn't reprogram the model from scratch. It nudges the existing weights in a direction. Think of it like breaking in a new employee who already has general skills; the fine-tuning is the onboarding.

Hallucination

When a model confidently states something false, that's called a hallucination. It doesn't mean the model is lying or malfunctioning in the traditional sense. It's a consequence of how these models work: they generate plausible-sounding text, and sometimes plausible-sounding text happens to be wrong.

Hallucinations are more common when the model is operating near the edge of its training data, asked about obscure topics, or pushed to fill in details it doesn't actually have. Asking for specific citations, recent statistics, or niche technical facts is where you're most likely to see them.

The practical response: treat AI outputs the way you'd treat a confident but occasionally unreliable friend who reads a lot. Useful for direction, worth verifying before you act on anything important.

RAG (Retrieval-Augmented Generation)

RAG is a technique for giving a model access to specific information without fine-tuning it. Instead of baking knowledge into the model's weights, you retrieve relevant documents at query time and include them in the context window.

When you ask a question, the system first searches a database of documents (using embeddings to find relevant ones), grabs the most relevant chunks, and adds them to the prompt before sending it to the model. The model then answers based on both its training and those retrieved documents.

This is how most "chat with your documents" products work. It's also more up-to-date than fine-tuning, because you can update the document database without retraining the model.

Multimodal

A multimodal model can work with more than one type of input. Most early language models only handled text. A multimodal model might accept images, audio, video, or documents alongside text and reason across all of them at once.

When you paste a screenshot into Claude or ChatGPT and ask a question about it, you're using multimodal capabilities. When a model can generate images in addition to describing them, that's also multimodal, just in the output direction.

The category is expanding quickly. Models that can see, hear, and read simultaneously are starting to feel less like chatbots and more like a general-purpose assistant.