Language & Transformers · Intermediate

πŸ“Embeddings & Word2Vec

How words become meaningful vectors

Take your time with this one. The interactive parts are here to help you test the idea, not rush through it.

25 min · Explore at your own pace

Before We Begin

What we are learning today

Embeddings solve a central problem in NLP: words must become numbers before a model can process them, but naïve numbering destroys meaning. Word2Vec and related methods learn vectors whose geometry reflects patterns of use, so words appearing in similar contexts end up near one another.
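To see why naïve numbering fails, here is a tiny sketch. The integer IDs and the 2-D vectors are made up purely for illustration, not trained values: ID distance reflects the accident of numbering, while embedding distance can be trained to reflect usage.

```python
import numpy as np

# Arbitrary integer IDs impose an arbitrary ordering.
ids = {"doctor": 0, "banana": 1, "nurse": 2}

# By ID distance, "doctor" looks closer to "banana" than to "nurse" --
# a pure accident of how we numbered the vocabulary.
assert abs(ids["doctor"] - ids["banana"]) < abs(ids["doctor"] - ids["nurse"])

# Embeddings replace IDs with dense vectors (made-up 2-D values here)
# whose geometry is learned to reflect how words are actually used.
emb = {
    "doctor": np.array([0.9, 0.1]),
    "nurse":  np.array([0.8, 0.2]),
    "banana": np.array([0.1, 0.9]),
}

def dist(a, b):
    """Euclidean distance between two word vectors."""
    return float(np.linalg.norm(emb[a] - emb[b]))

assert dist("doctor", "nurse") < dist("doctor", "banana")
```

With embeddings, the relation a reader expects (doctor near nurse, far from banana) is something the geometry can actually express.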

How this lesson fits

This module explains the modern language-model stack from the inside out. Students see how words become vectors, how attention lets models choose context dynamically, and how large-scale next-token training turns those ingredients into systems that can write, summarize, and answer questions.

The big question

How can a machine represent meaning, decide which context matters, and then generate fluent language one token at a time?

  • Explain how text is converted into numeric representations without losing the idea of meaning
  • Interpret attention as a selective context mechanism rather than as magic
  • Describe the workflow of large language models from pretraining through generation

Why You Should Care

Embeddings are the bridge between language and computation. Without them, text is just raw symbols; with them, similarity, context, and analogy become something a model can measure and manipulate.

Where this is used today

  • ✓ Semantic search systems that retrieve related concepts rather than exact keyword matches
  • ✓ Recommendation and ranking systems that embed users, items, or queries in a shared space
  • ✓ Translation and multilingual alignment tasks that compare meaning across languages

Think of it like this

Think of a map where each word gets an address. Words used in similar situations end up in the same neighborhood, while words playing very different roles are farther apart. Distance on the map becomes a proxy for related meaning.

Easy mistake to make

Embeddings are not perfect dictionaries of meaning. They reflect the patterns, associations, and biases present in the data they were trained on.

By the end, you should be able to say:

  • Explain why language models need numeric representations rather than raw strings
  • Describe what it means for words to be close in vector space and why context drives that closeness
  • Summarize the intuition behind skip-gram training without getting lost in implementation details

Think about this first

If a computer only sees text and never receives a dictionary, how could it still discover that 'doctor' and 'nurse' are more related than 'doctor' and 'banana'?

Words we will keep using

embedding · vector · context · similarity · skip-gram

Vector Space Semantics

Imagine if words were places on a map. "King" and "Queen" would live next door. "Apple" and "Banana" would be down the street. This is what embeddings do: they turn meaning into geometry.

Classic analogy
king − man + woman ≈ queen
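The analogy above can be checked mechanically: subtract and add vectors, then look for the nearest word to the result. The 2-D vectors below are hand-made toys (axis 0 loosely "royalty", axis 1 loosely "gender"), not real Word2Vec values.

```python
import numpy as np

# Hand-made toy embeddings for illustration only.
vecs = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.5]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman, then nearest neighbor excluding the query words.
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(vecs[w], target))
print(best)  # -> queen
```

Real pretrained vectors expose the same operation (e.g. gensim's `KeyedVectors.most_similar(positive=..., negative=...)`), just in hundreds of dimensions instead of two.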

In the explorer below, you are looking at Word2Vec embeddings that are originally much higher-dimensional. We project them down to 3D with PCA so you can move around them and watch language form neighborhoods.

[Interactive explorer: Word2Vec 10K vocabulary, projected to 3D with PCA]

How it works: Skip-gram

The training is surprisingly simple: Pick a word, and ask the model to guess its neighbors. Do this billions of times. Words that appear in similar contexts will naturally drift closer together in vector space.
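The "pick a word, guess its neighbors" step can be sketched as pair generation. Real Word2Vec adds subsampling and negative sampling on top; this only shows how (center, context) training pairs come out of a window.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs within a symmetric window."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-window, window + 1):
            # Skip the center word itself and positions off either end.
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue
            pairs.append((center, tokens[t + j]))
    return pairs

sentence = "the quick brown fox jumps".split()
for center, context in skipgram_pairs(sentence):
    print(center, "->", context)
```

Each pair becomes one training example: predict the context word from the center word. Repeated over a large corpus, words with similar contexts accumulate similar gradients and drift together.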

Training Objective

The formal goal says: given the center word \(w_t\), make the nearby context words \(w_{t+j}\) as predictable as possible.

J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \neq 0} \log p(w_{t+j} \mid w_t)

Where \(p(w_O \mid w_I)\) is defined by the softmax of the dot product:

p(w_O \mid w_I) = \frac{\exp({v'_{w_O}}^\top v_{w_I})}{\sum_{w=1}^{V} \exp({v'_w}^\top v_{w_I})}
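The softmax above can be computed directly. The vectors here are random placeholders standing in for the trained input (center) matrix \(v\) and output (context) matrix \(v'\); the vocabulary size and dimension are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 8                      # vocabulary size, embedding dimension
v_in = rng.normal(size=(V, d))   # input (center-word) vectors v_w
v_out = rng.normal(size=(V, d))  # output (context-word) vectors v'_w

def p_context_given_center(w_o, w_i):
    """Softmax of dot products: p(w_O | w_I)."""
    scores = v_out @ v_in[w_i]   # v'_w . v_{w_I} for every w in the vocab
    scores -= scores.max()       # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return float(probs[w_o])

probs = [p_context_given_center(o, 2) for o in range(V)]
print(f"{sum(probs):.4f}")  # -> 1.0000 (softmax normalizes over the vocab)
```

Note the denominator sums over the whole vocabulary, which is why practical Word2Vec replaces this exact softmax with cheaper approximations such as negative sampling.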

Cosine Similarity

If two arrows point in roughly the same direction (cosine near 1), the words are related. If they are at right angles (cosine near 0), they have little in common, and opposite directions give a negative similarity.

\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|}
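The formula is one line of code. The three example vectors are chosen to hit the three regimes exactly: same direction, orthogonal, and opposite.

```python
import numpy as np

def cosine(a, b):
    # cos(theta) = (A . B) / (||A|| ||B||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
print(cosine(a, np.array([2.0, 0.0])))   # same direction -> 1.0
print(cosine(a, np.array([0.0, 3.0])))   # orthogonal     -> 0.0
print(cosine(a, np.array([-1.0, 0.0])))  # opposite       -> -1.0
```

Because the norms divide out, cosine similarity compares only direction, not length, which is why it is the standard distance for embedding spaces.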

Try searching for “good” in the explorer above and inspect the nearest neighbors. That is where the abstract idea suddenly starts to feel real.