What Even Is a Parameter?
In the race to build ever more powerful artificial intelligence, one number dominates headlines: parameters. OpenAI no longer discloses counts for frontier models like o1, though outside estimates run to hundreds of billions. xAI’s Grok-2 is reported in the same range. Chinese startup DeepSeek discloses that its V3 model packs a staggering 671 billion, while rivals like Alibaba’s Qwen keep pushing upward. These figures signal scale, capability, and massive computational demands. Yet amid the excitement, a fundamental question lingers: what exactly is a parameter?
At its core, a parameter is a tunable value within a neural network, the mathematical architecture powering modern AI. Neural networks loosely mimic the brain, processing inputs through layers of interconnected nodes, or neurons. Each connection carries a weight, a number that scales the signal passing through it. Additional bias terms shift each neuron’s activation. Together, these weights and biases form the parameters, adjusted during training to minimize errors on vast datasets.
Consider a simple feedforward network: input data flows through hidden layers to output predictions. Each layer multiplies its inputs by a weight matrix and adds biases. For a layer with m inputs and n outputs, the weight matrix holds m times n parameters, plus n biases. Stack layers, and parameters multiply rapidly.
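To make the arithmetic concrete, here is a minimal sketch in Python; the layer sizes are invented for illustration:

```python
def dense_layer_params(m: int, n: int) -> int:
    """Weights (m x n) plus one bias per output unit."""
    return m * n + n

# A toy network: 784 inputs -> 256 -> 64 -> 10 outputs (sizes invented).
sizes = [784, 256, 64, 10]
total = sum(dense_layer_params(a, b) for a, b in zip(sizes, sizes[1:]))
print(total)  # 218058
```

Even this toy, three layers deep, already holds over 200,000 parameters, which hints at how quickly stacking layers inflates the count.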
Today’s flagships rely on transformers, introduced in the 2017 paper “Attention Is All You Need.” Transformers excel at language by weighing input relevance via self-attention. In attention, the query, key, and value projections each contribute parameters: for embedding dimension d, each projection is d by d (split across the heads), yielding 3 d-squared in total, plus another d-squared for the output projection. Sequence length, notably, adds no parameters; attention scores are computed on the fly, not stored as weights. Feedforward blocks double down, with even larger matrices. A single transformer layer might tally millions of parameters; stack dozens or hundreds, and billions emerge.
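A back-of-envelope count for a transformer follows the same pattern. The sketch below assumes a standard layout, with a 4x feedforward expansion and biases, layer norms, and position embeddings ignored; the GPT-2-small-like shape at the end is purely illustrative:

```python
def transformer_params(d: int, n_layers: int, vocab: int, ffn_mult: int = 4) -> int:
    attn = 4 * d * d               # Wq, Wk, Wv, Wo projections (biases ignored)
    ffn = 2 * ffn_mult * d * d     # up-projection and down-projection
    return n_layers * (attn + ffn) + vocab * d  # plus token embeddings

# A GPT-2-small-like shape: d=768, 12 layers, 50257-token vocabulary.
print(transformer_params(768, 12, 50257))  # 123532032
```

That lands near GPT-2 small’s published count of roughly 124 million, despite ignoring the smaller terms, which is why the d-squared matrices dominate any parameter budget.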
Counting parameters standardizes model comparison. An 8-billion-parameter model like Meta’s Llama 3 slots neatly into benchmarks. Tools like Hugging Face’s Transformers library report counts precisely, aiding researchers. Parameters correlate with prowess: larger models grasp nuance, generalize better, and follow instructions more reliably. Training them demands GPU clusters churning through astronomical numbers of floating-point operations, or FLOPs, a total that scales roughly linearly with both parameter count and training tokens.
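That linear scaling is often captured by a rule of thumb from the scaling-law literature: a forward plus backward pass costs roughly 6 FLOPs per parameter per token. A quick estimate, with illustrative numbers:

```python
def train_flops(params: float, tokens: float) -> float:
    """Scaling-law rule of thumb: forward + backward pass costs
    ~6 FLOPs per parameter per token."""
    return 6.0 * params * tokens

# A 7-billion-parameter model trained on 2 trillion tokens:
print(f"{train_flops(7e9, 2e12):.1e}")  # 8.4e+22
```

On the order of 10-to-the-22nd FLOPs, which is why training runs are measured in GPU-months, not GPU-hours.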
But parameters tell an incomplete story. Architecture evolves. Mixture-of-Experts (MoE) models, like Mistral’s Mixtral, activate only a subset of parameters per token, slashing inference compute while claiming the effective scale of the full model. DeepSeek-V3 uses MoE with 37 billion parameters active out of 671 billion total. Quantization compresses parameters, swapping 32-bit floats for 4-bit integers and shrinking models roughly 8-fold with minimal accuracy loss. Pruning zeros out redundant weights; distillation transfers knowledge to smaller student models.
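The arithmetic behind those two tricks is simple enough to sketch, using the figures above:

```python
def model_bytes(params: int, bits: int) -> int:
    """Raw weight storage only; ignores optimizer state and activations."""
    return params * bits // 8

dense_fp32 = model_bytes(7_000_000_000, 32)   # 28 GB in 32-bit floats
dense_int4 = model_bytes(7_000_000_000, 4)    # 3.5 GB in 4-bit integers
print(dense_fp32 / dense_int4)                # 8.0 -- the 8-fold shrink

# MoE: all 671B weights must sit in memory, but only ~37B fire per token.
active_fraction = 37 / 671
print(f"{active_fraction:.1%}")               # 5.5%
```

Note the asymmetry: quantization cuts memory, while MoE cuts per-token compute but still needs the full weights loaded.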
Training data quality trumps raw count. Garbage in, garbage out: models ingest trillions of tokens from web scrapes, books, code. Post-training tweaks, like reinforcement learning from human feedback (RLHF), refine alignment without altering parameters.
Inference, the stage where models are actually deployed, exposes the limits. A trillion-parameter behemoth strains memory; even with weights sharded across clusters, latency spikes. Edge devices crave efficiency, favoring compact models of a few billion parameters or fewer, like Google’s Gemma 2B.
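Why sharding is unavoidable at the trillion scale falls out of a back-of-envelope calculation. The helper below is hypothetical and deliberately crude: it counts only the weights, ignoring the KV cache and activation memory that real serving also needs.

```python
import math

def gpus_needed(params: int, bits: int, gpu_mem_gb: int, overhead: float = 1.2) -> int:
    """Minimum devices to hold the weights alone (hypothetical sketch;
    real serving also needs KV cache and activation memory).
    `overhead` pads for runtime buffers and fragmentation."""
    weight_bytes = params * bits / 8 * overhead
    return math.ceil(weight_bytes / (gpu_mem_gb * 1024**3))

# A 1-trillion-parameter model in 16-bit weights on 80 GB accelerators:
print(gpus_needed(10**12, 16, 80))  # 28
```

Dozens of accelerators before serving a single request, which is exactly the pressure that drives quantization and the small-model push at the edge.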
Experts debate parameter obsession. Ilya Sutskever, OpenAI’s former chief scientist, has quipped that parameters measure “model mass,” while intelligence arises from organization. Andrew Ng likens it to fixating on the Wright brothers’ wingspan: early metrics evolve. Emergent abilities, like the mathematical prowess of GPT-4, defy linear scaling.
Yet counts climb. Rumors swirl of OpenAI’s Orion topping 10 trillion, fueled by Nvidia’s Blackwell chips. China accelerates despite export curbs. Parameter proliferation promises breakthroughs in science, code, and multimodal AI, but risks homogenization if open source lags.
Defining parameters rigorously aids progress. In dense models, every weight counts in every forward pass. In MoE models, distinguish total from active parameters. Quantized? Note the bits per parameter. And context matters: the count quoted for pre-training may differ from what a fine-tuned, compressed deployment actually runs.
As AI permeates daily life, grasping parameters demystifies the black boxes. They are not magic, but engineered knobs tuning silicon brains toward generality. Future yardsticks may shift to FLOPs, data mix, or evaluations like ARC-AGI. For now, parameters reign as the yardstick of ambition in the intelligence race.
What are your thoughts on this? I’d love to hear about your own experiences in the comments below.