1. Big idea
A vocabulary is a coordinate system. A document is a vector in that coordinate system. Once text becomes vectors, linear algebra can compare, search, classify, compress, and discover patterns in language.
bag of words, cosine similarity, TF-IDF, SVD topics, embeddings
2. Build document vectors
Enter short documents. The page will tokenize them, build a vocabulary, and create a document-term matrix.
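The three steps named above (tokenize, build a vocabulary, fill a document-term matrix) can be sketched in a few lines of Python. This is a minimal illustration, not the page's actual implementation: the function names, the lowercase alphabetic tokenizer, and the sample documents are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(docs):
    """Return every distinct token, sorted; the order fixes the coordinate axes."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def document_term_matrix(docs, vocab):
    """Return one count vector per document, one column per vocabulary word."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[word] for word in vocab])
    return rows

# Hypothetical sample documents
docs = ["the cat sat", "the dog sat down", "cat and dog"]
vocab = build_vocabulary(docs)
matrix = document_term_matrix(docs, vocab)

print(vocab)      # ['and', 'cat', 'dog', 'down', 'sat', 'the']
for row in matrix:
    print(row)
```

Each row is a document and each column is a vocabulary word, so the first document becomes the vector with ones on the "cat", "sat", and "the" axes and zeros elsewhere.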
3. Search by cosine similarity
Type a query. It becomes a vector in the same vocabulary. Documents are ranked by cosine similarity.
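As a rough sketch of this search step, the query can be counted against the same vocabulary as the documents and then scored by cosine similarity, the dot product divided by the product of the two vector lengths. The helper names, the tokenizer, and the sample texts below are illustrative assumptions; query words that are missing from the vocabulary are simply ignored here.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())

def vectorize(text, vocab):
    """Count vector for `text`; words outside `vocab` have no axis and are dropped."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

def cosine(u, v):
    """Cosine of the angle between u and v; defined as 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (score, document) pairs sorted from most to least similar."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    q = vectorize(query, vocab)
    scored = [(cosine(q, vectorize(doc, vocab)), doc) for doc in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical sample documents and query
docs = ["the cat sat on the mat", "dogs chase cats", "stock markets fell today"]
results = rank("cat on a mat", docs)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

With these sample texts, the second document scores zero even though it mentions "cats", because this tokenizer treats "cat" and "cats" as different axes.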
4. Raw counts versus TF-IDF
TF-IDF stretches the axes of rare, informative words and shrinks the axes of common words. Compare the rankings produced by raw counts and by TF-IDF.
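One way to see the stretching and shrinking is to compute the weights by hand. The sketch below assumes the plain variant idf(w) = log(N / df(w)), where N is the number of documents and df(w) is the number of documents containing w; libraries often use smoothed variants, so the exact numbers may differ from what this page or any particular tool produces. The function name and sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_matrix(docs):
    """Return (vocab, idf, rows) using raw counts times log(N / df), unsmoothed."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {word: math.log(n_docs / df[word]) for word in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[word] * idf[word] for word in vocab])
    return vocab, idf, rows

# Hypothetical sample documents
docs = ["the cat sat", "the dog barked", "the bird sang"]
vocab, idf, rows = tfidf_matrix(docs)

for word in vocab:
    print(f"{word:>7}  idf = {idf[word]:.3f}")
```

Under this variant, "the" appears in every document, so its weight is log(3/3) = 0 and its axis collapses, while a word found in only one document gets the largest weight, log(3).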
5. Visualize document similarity
The heatmap below shows cosine similarity between all pairs of documents.
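The values behind such a heatmap form a square matrix whose entry (i, j) is the cosine similarity between document i and document j. The following is a minimal sketch with made-up three-dimensional count vectors; the function names are assumptions, and the page's own plotting code is not shown here.

```python
import math

def cosine(u, v):
    """Cosine of the angle between u and v; defined as 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Return the matrix of cosine similarities between every pair of vectors."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

# Made-up document vectors: the second is the first scaled by 2,
# and the third shares no words with the other two.
vectors = [[1, 1, 0], [2, 2, 0], [0, 0, 3]]
sim = similarity_matrix(vectors)

for row in sim:
    print("  ".join(f"{value:.2f}" for value in row))
```

The matrix is symmetric with ones on the diagonal. The first two vectors point in the same direction, so their similarity is 1 despite the different lengths, which is why cosine similarity does not penalize a document for simply being longer.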
6. Toy embeddings
Bag-of-words needs exact word overlap. Embeddings place related meanings close together, even when the exact words differ.
Notice that “car,” “automobile,” and “truck” cluster together, while “guitar,” “piano,” and “music” cluster together.
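The clustering can be imitated with hand-picked coordinates. The two-dimensional vectors below are invented for illustration (real embeddings are learned from data and have many more dimensions), and the `nearest` helper is a hypothetical name; the point is only that nearby vectors stand for related meanings even when the words share no letters.

```python
import math

# Invented 2-D coordinates: first axis loosely "vehicle", second loosely "music".
toy_embeddings = {
    "car": (0.90, 0.10),
    "automobile": (0.88, 0.12),
    "truck": (0.80, 0.20),
    "guitar": (0.10, 0.90),
    "piano": (0.15, 0.85),
    "music": (0.20, 0.95),
}

def cosine(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest(word, k=2):
    """Return the k other words whose vectors are most similar to `word`."""
    target = toy_embeddings[word]
    scored = [
        (cosine(target, vec), other)
        for other, vec in toy_embeddings.items()
        if other != word
    ]
    return [other for _, other in sorted(scored, reverse=True)[:k]]

print(nearest("car"))     # ['automobile', 'truck']
print(nearest("guitar"))  # ['piano', 'music']
```

A bag-of-words model would give "car" and "automobile" a similarity of zero, since they are different axes; here they are almost parallel.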
7. Sparse high-dimensional intuition
Real text vectors may have 50,000 dimensions, but one document uses only a few hundred distinct words, so almost every entry is zero. Move the sliders to estimate sparsity.
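The arithmetic behind the estimate is short enough to write out. The sketch below assumes a 50,000-word vocabulary and, as an illustrative figure for "a few hundred", a document with 300 distinct words; both function names are hypothetical. It also shows why sparse vectors are usually stored as index-to-count mappings instead of full lists.

```python
def sparsity(vocab_size, distinct_words):
    """Fraction of a document vector's entries that are zero."""
    return 1 - distinct_words / vocab_size

def to_sparse(dense):
    """Keep only the nonzero entries as an {index: value} mapping."""
    return {i: value for i, value in enumerate(dense) if value != 0}

# Assumed figures: 50,000 axes, 300 of them used by this document.
s = sparsity(50_000, 300)
print(f"{s:.1%} of the entries are zero")  # 99.4% of the entries are zero

# A tiny dense vector and its sparse form.
print(to_sparse([0, 2, 0, 0, 1]))  # {1: 2, 4: 1}
```

Storing 300 index-value pairs in place of 50,000 numbers is what keeps dot products and cosine similarities cheap at realistic vocabulary sizes.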
Reflection prompts
- What does a vocabulary do geometrically?
- Why is cosine similarity useful for text?
- What does TF-IDF change?
- Why can SVD reveal topics?
- What do embeddings add beyond word counts?