“`html
What exactly does word2vec learn, and how?
Understanding what word2vec learns is crucial for grasping representation learning in a simple yet informative language modeling task. Despite its widespread use as a precursor to modern language models, researchers lacked a quantitative and predictive theory describing its learning process until our recent paper. We prove that under certain realistic conditions, word2vec‘s learning problem reduces to unweighted least-squares matrix factorization. Our findings allow us to solve the gradient flow dynamics in closed form and derive the final learned representations via principal component analysis (PCA).
![]()
Learning dynamics of word2vec.
The Result
To understand this, let’s first motivate the problem. word2vec is a well-known algorithm for learning dense vector representations of words. These embeddings are trained using a contrastive approach; at the end of training, their semantic relation is captured by the angle between the corresponding vectors. Notably, these learned embeddings exhibit striking linear structure in their geometry: interpretable concepts like gender, verb tense, or dialect often align along specific directions in the latent space.
These linear representations have gained significant attention recently since large language models (LLMs) also exhibit similar behavior. This enables semantic inspection of internal representations and novel model steering techniques. In word2vec, this is precisely what allows the learned embeddings to complete analogies like “man : woman :: king : queen” via vector addition.
It’s not surprising, then, that word2vec can be framed as a minimal neural language model. Understanding it is thus essential for comprehending feature learning in more complex language modeling tasks.
What are the features?
The key insight from our theory is that during training, word2vec effectively learns one “concept” (an orthogonal linear subspace) at a time in a sequence of discrete learning steps. This can be visualized as the initial confusion and gradual clarity when diving into a new branch of math: initially, all concepts are intertwined; over iterations, they become more distinct.
In this context, each new learned concept effectively increments the rank of the embedding matrix, allowing words to better express themselves by occupying additional dimensions in the latent space. Since these linear subspaces do not rotate once established, they represent the model’s learned features. The theory allows us to compute these features a priori without needing to run PCA on the entire training corpus. Instead, we can construct and diagonalize the following matrix:
\[M^{\star}_{ij} = \frac{P(i,j) – P(i)P(j)}{\frac{1}{2}(P(i,j) + P(i)P(j))}\]
where $i$ and $j$ index words in the vocabulary, and $P(i)$ is the unigram probability. The top eigenvector of this matrix selects words associated with celebrity biographies, the second one with government and administrative terms, and so on.
The takeaway from our theory is that during training, word2vec finds a sequence of optimal low-rank approximations to $M^{\star}$. These features are simply the top eigenvectors of this matrix, which are defined solely in terms of measurable corpus statistics and algorithmic hyperparameters.
The following plots illustrate this behavior:
![]()
Learning dynamics comparison showing discrete, sequential learning steps.
On the left, we observe that word2vec, along with our mild approximations, learns in a sequence of essentially discrete steps. Each step increments the effective rank of the embeddings, resulting in a stepwise decrease in loss. On the right, we show three time slices of the latent embedding space, demonstrating how new orthogonal directions are added at each learning step.
By inspecting which words most strongly align with these singular directions, we observe that each discrete “piece of knowledge” corresponds to an interpretable topic-level concept. These learning dynamics can be solved in closed form, and our theory provides excellent agreement with numerical experiments.
Mild Approximations
Our theoretical result holds under several mild approximations: 1) a quartic approximation of the objective function around the origin; 2) a particular constraint on algorithmic hyperparameters; 3) sufficiently small initial embedding weights; and 4) vanishingly small gradient descent steps. These conditions are not overly restrictive, as they align closely with those described in the original word2vec paper.
A key strength of our theory is its distributional agnosticism: it makes no assumptions about the data distribution. As a result, we can precisely predict which features are learned based solely on corpus statistics and algorithmic hyperparameters. This is particularly useful because obtaining such detailed descriptions of learning dynamics in a non-distribution-agnostic setting is rare and challenging.
To demonstrate the usefulness of our theory, we apply it to study the emergence of abstract linear representations (binary concepts like masculine/feminine or past/future). We find that these representations are built by word2vec in a sequence of noisy learning steps. The geometry of these representations is well-described by a spiked random matrix model, with semantic signal dominating early in training and noise potentially beginning to dominate later.
To showcase the result’s practical utility, we compare it against other models like PPMI (a classical alternative). Our approximate setting achieves 66% accuracy on analogy completion benchmarks, while word2vec itself gets 68%, and PPMI only manages 51%. See our paper for detailed comparisons.
In summary, this result provides one of the first complete closed-form theories of feature learning in a minimal yet relevant natural language task. In this sense, we believe it is an important step forward in obtaining realistic analytical solutions describing the performance of practical machine learning algorithms.
Key Takeaways
- We provide a closed-form theory for feature learning in
word2vec. - The learned features are simple to compute and describe using corpus statistics and algorithmic hyperparameters.
- This theory is distributional agnostic, making it useful for understanding the emergence of abstract linear representations in language models.
- Our results align closely with empirical observations from
word2vec, as evidenced by comparisons against other methods like PPMI.
Learn more about our work: Link to full paper
This post originally appeared on Dhruva Karkada’s blog.
“`
Stay ahead of AI. Get the most important stories delivered to your inbox — no spam, no noise.




