Embeddings & Word2Vec
How words become meaningful vectors
Take your time with this one. The interactive parts are here to help you test the idea, not rush through it.
Pause and experiment as you go.
Before We Begin
What we are learning today
Embeddings solve a central problem in NLP: words must become numbers before a model can process them, but naïve numbering destroys meaning. Word2Vec and related methods learn vectors whose geometry reflects patterns of use, so words appearing in similar contexts end up near one another.
How this lesson fits
This module explains the modern language-model stack from the inside out. You will see how words become vectors, how attention lets models choose context dynamically, and how large-scale next-token training turns those ingredients into systems that can write, summarize, and answer questions.
The big question
How can a machine represent meaning, decide which context matters, and then generate fluent language one token at a time?
Why You Should Care
Embeddings are the bridge between language and computation. Without them, text is just raw symbols; with them, similarity, context, and analogy become something a model can measure and manipulate.
Where this is used today
- Semantic search systems that retrieve related concepts rather than exact keyword matches
- Recommendation and ranking systems that embed users, items, or queries in a shared space
- Translation and multilingual alignment tasks that compare meaning across languages
Think of it like this
Think of a map where each word gets an address. Words used in similar situations end up in the same neighborhood, while words playing very different roles are farther apart. Distance on the map becomes a proxy for related meaning.
Easy mistake to make
Embeddings are not perfect dictionaries of meaning. They reflect the patterns, associations, and biases present in the data they were trained on.
By the end, you should be able to:
- Explain why language models need numeric representations rather than raw strings
- Describe what it means for words to be close in vector space and why context drives that closeness
- Summarize the intuition behind skip-gram training without getting lost in implementation details
Think about this first
If a computer only sees text and never receives a dictionary, how could it still discover that 'doctor' and 'nurse' are more related than 'doctor' and 'banana'?
Words we will keep using
Vector Space Semantics
Imagine if words were places on a map. "King" and "Queen" would live next door. "Apple" and "Banana" would be down the street. This is what embeddings do: they turn meaning into geometry.
king − man + woman ≈ queen
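To see that analogy arithmetic concretely, here is a minimal sketch with hand-made 3-dimensional vectors. The numbers are invented for illustration; real Word2Vec embeddings have hundreds of dimensions and are learned from data:

```python
import numpy as np

# Toy 3-dimensional "embeddings" -- values made up for illustration.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.1]),
    "woman": np.array([0.5, 0.2, 0.8]),
}

# king - man + woman should land near queen.
result = vectors["king"] - vectors["man"] + vectors["woman"]

def cosine(a, b):
    # Angle-based similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Find the word whose vector points most nearly in the same direction.
best = max(vectors, key=lambda w: cosine(result, vectors[w]))
print(best)  # queen
```

In real systems the input words ("king", "man", "woman") are usually excluded from the candidate answers; with these toy values the arithmetic lands on "queen" either way.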
In the explorer below, you are looking at Word2Vec embeddings that were originally far larger. We squash them down to 3D so you can move around them and notice that language begins to form neighborhoods.
[Interactive explorer: Word2Vec 10K vocabulary, 3D PCA projection]
How it works: Skip-gram
The training is surprisingly simple: Pick a word, and ask the model to guess its neighbors. Do this billions of times. Words that appear in similar contexts will naturally drift closer together in vector space.
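The first half of that loop, turning raw text into (center, context) training pairs, can be sketched in a few lines. The sentence and the window size of 2 are illustrative choices, not part of the lesson:

```python
# Generate skip-gram (center, context) pairs from a tiny corpus.
corpus = "the quick brown fox jumps over the lazy dog".split()
window = 2  # how many neighbors on each side count as "context"

pairs = []
for i, center in enumerate(corpus):
    # Every word within the window (except the center itself) is a target.
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((center, corpus[j]))

# Each pair asks the model: given the center word, predict this neighbor.
print(pairs[:4])
```

Repeating this over billions of sentences is what makes words with shared contexts drift together: "doctor" and "nurse" keep generating pairs with the same neighbors, so their vectors receive similar updates.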
Training Objective
The formal goal says: given the center word $w_t$, make the nearby context words as predictable as possible:

$$\max \; \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)$$

Where $p(w_O \mid w_I)$ is defined by the softmax of the dot product between the context (output) vector and the center (input) vector:

$$p(w_O \mid w_I) = \frac{\exp(u_{w_O}^{\top} v_{w_I})}{\sum_{w=1}^{W} \exp(u_{w}^{\top} v_{w_I})}$$
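Numerically, the softmax turns dot-product scores into a probability distribution over the vocabulary. Here is a minimal sketch with random toy vectors; the vocabulary size, dimension, and the split into separate input/output vector tables are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 4
V = rng.normal(size=(vocab_size, dim))  # input (center-word) vectors
U = rng.normal(size=(vocab_size, dim))  # output (context-word) vectors

center = 2                    # index of the center word
scores = U @ V[center]        # dot product with every word in the vocabulary
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary

# The scores become a valid probability distribution over context words.
print(round(probs.sum(), 6))  # 1.0
```

Training nudges the vectors so that the probability mass lands on the words actually observed in the context window.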
Cosine Similarity
If two arrows point in the same direction (cosine near 1), the words are related. If they point in very different directions (cosine near 0, or negative), they are not. Because cosine measures only the angle between vectors, the length of each vector does not affect the score.
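In code, cosine similarity is just the dot product of two vectors divided by the product of their lengths. A quick sketch with made-up vectors shows both properties at once:

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction, 0.0 = perpendicular, -1.0 = opposite.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = 10 * a                      # same direction, ten times the length
c = np.array([3.0, 0.0, -1.0])  # perpendicular to a (dot product is 0)

print(round(cosine_similarity(a, b), 6))  # 1.0 -- length does not matter
print(cosine_similarity(a, c))            # 0.0
```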
Try searching for "good" in the explorer above and inspect the nearest neighbors. That is where the abstract idea suddenly starts to feel real.