Lab 22: Text as Vectors

1. Big idea

A vocabulary is a coordinate system. A document is a vector in that coordinate system. Once text becomes vectors, linear algebra can compare, search, classify, compress, and discover patterns in language.

bag of words, cosine similarity, TF-IDF, SVD topics, embeddings

2. Build document vectors

Enter short documents. The page will tokenize them, build a vocabulary, and create a document-term matrix.
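The three steps above (tokenize, build a vocabulary, fill a document-term matrix) can be sketched in a few lines of Python. This is a minimal illustration, not the page's actual implementation: the function names `tokenize` and `build_matrix`, the regular expression, and the two sample documents are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())

def build_matrix(docs):
    """Return (vocab, matrix), where matrix[i][j] counts vocab[j] in docs[i]."""
    tokenized = [tokenize(d) for d in docs]
    # The vocabulary is the sorted set of all distinct tokens: one axis per word.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {word: j for j, word in enumerate(vocab)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for word, count in Counter(toks).items():
            row[index[word]] = count
        matrix.append(row)
    return vocab, matrix

# Hypothetical sample documents.
docs = ["the cat sat on the mat", "the dog chased the cat"]
vocab, matrix = build_matrix(docs)

print(vocab)
for row in matrix:
    print(row)
```

Each row is one document and each column is one vocabulary word, so the first document becomes a point with coordinate 2 on the "the" axis and 0 on the "dog" axis.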

3. Search by cosine similarity

Type a query. It becomes a vector in the same vocabulary. Documents are ranked by cosine similarity.
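Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them and ignores document length. The sketch below shows one way the ranking could work. It stores each vector as a word-to-count mapping instead of a full row, and it drops query words that are not in the document vocabulary. The names `cosine` and `search` and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())

def cosine(a, b):
    """Cosine of the angle between two sparse count vectors (dicts)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)

def search(query, docs):
    """Return (score, document) pairs, best match first."""
    doc_vecs = [Counter(tokenize(d)) for d in docs]
    vocab = set().union(*doc_vecs)
    # Keep only query words that have an axis in the document vocabulary.
    q = Counter(t for t in tokenize(query) if t in vocab)
    scored = [(cosine(q, vec), d) for vec, d in zip(doc_vecs, docs)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical sample documents.
docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock prices fell sharply today",
]
ranking = search("cat on a mat", docs)
for score, doc in ranking:
    print(f"{score:.3f}  {doc}")
```

In this example the second document scores zero even though it mentions "cats", because "cat" and "cats" are different axes in a plain count vocabulary.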

4. Raw counts versus TF-IDF

TF-IDF stretches the axes of rare, informative words and shrinks the axes of common words. Compare the rankings produced by raw counts and by TF-IDF.
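To see the reweighting change a ranking, the sketch below scores the same query twice, once with raw counts and once with TF-IDF weights. It assumes one common variant, weight = count × log(N / df), where N is the number of documents and df is the number of documents containing the word; the page may use a different formula. The helper names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def idf_table(doc_counts):
    """Assumed IDF variant: log(N / df) for every word in the collection."""
    n = len(doc_counts)
    df = Counter(w for counts in doc_counts for w in counts)
    return {w: math.log(n / df[w]) for w in df}

def reweight(counts, idf):
    """Scale each count by its word's IDF; unknown words are dropped."""
    return {w: c * idf[w] for w, c in counts.items() if w in idf}

def rank(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Hypothetical sample documents and query.
docs = [
    "the band played and the crowd sang the chorus",
    "a guitar solo",
    "the concert ended",
]
doc_counts = [Counter(tokenize(d)) for d in docs]
query_counts = Counter(tokenize("the guitar"))

idf = idf_table(doc_counts)
raw_ranking = rank(query_counts, doc_counts)
tfidf_ranking = rank(reweight(query_counts, idf),
                     [reweight(c, idf) for c in doc_counts])

print("raw counts:", raw_ranking)
print("TF-IDF:    ", tfidf_ranking)
```

With raw counts the first document wins because it repeats "the" three times. After reweighting, "the" appears in two of three documents and is shrunk, so the document containing the rarer word "guitar" moves to the top. Under this variant a word that appears in every document gets weight zero.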

5. Visualize document similarity

The heatmap below shows cosine similarity between all pairs of documents.
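The numbers behind such a heatmap form a square matrix whose entry (i, j) is the cosine similarity between documents i and j. The sketch below computes that matrix from raw counts and prints it as a text grid. The function name `similarity_matrix` and the sample documents are assumptions for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similarity_matrix(docs):
    """Return sim where sim[i][j] is the cosine between docs[i] and docs[j]."""
    vecs = [Counter(tokenize(d)) for d in docs]
    return [[cosine(a, b) for b in vecs] for a in vecs]

# Hypothetical sample documents.
docs = [
    "the cat sat on the mat",
    "the cat chased the dog",
    "stock prices fell today",
]
sim = similarity_matrix(docs)
for row in sim:
    print(" ".join(f"{value:.2f}" for value in row))
```

The matrix is symmetric with ones on the diagonal, since every document points in exactly its own direction. Documents that share no words score zero.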

6. Toy embeddings

Bag-of-words needs exact word overlap. Embeddings place related meanings close together, even when the exact words differ.

Notice that “car,” “automobile,” and “truck” cluster together, while “guitar,” “piano,” and “music” cluster together.
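The clustering can be imitated with a tiny hand-made table. The two-dimensional vectors below are hypothetical values chosen for illustration, not learned embeddings: the first axis loosely stands for "vehicle" and the second for "music". Real embeddings have hundreds of dimensions and are learned from data.

```python
import math

# Hypothetical hand-picked 2-D vectors (axis 0: vehicle, axis 1: music).
embeddings = {
    "car": (0.95, 0.05),
    "automobile": (0.90, 0.10),
    "truck": (0.85, 0.00),
    "guitar": (0.05, 0.90),
    "piano": (0.10, 0.95),
    "music": (0.00, 0.85),
}

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def nearest(word, k=2):
    """Return the k words whose vectors point most nearly the same way."""
    others = [(cosine(embeddings[word], vec), w)
              for w, vec in embeddings.items() if w != word]
    return [w for _, w in sorted(others, reverse=True)[:k]]

print(nearest("car"))
print(nearest("guitar"))
print(round(cosine(embeddings["car"], embeddings["automobile"]), 3))
print(round(cosine(embeddings["car"], embeddings["guitar"]), 3))
```

In a bag-of-words vocabulary, "car" and "automobile" are separate axes at right angles, so their similarity is exactly zero. In this toy table they point in almost the same direction.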

7. Sparse high-dimensional intuition

Real text vectors may have 50,000 dimensions, but a single document typically uses only a few hundred distinct words, so nearly all of its entries are zero. Move the sliders to estimate sparsity.

Sliders: Vocabulary size, Words used
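The slider estimate amounts to one subtraction: sparsity is the fraction of vector entries that are zero. The sketch below assumes "words used" means the number of distinct vocabulary words in the document. The function name `sparsity` is hypothetical.

```python
def sparsity(vocab_size, words_used):
    """Fraction of entries that are zero in one document vector."""
    if vocab_size <= 0 or not 0 <= words_used <= vocab_size:
        raise ValueError("need 0 <= words_used <= vocab_size and vocab_size > 0")
    return 1 - words_used / vocab_size

# The figures from the text: 50,000 dimensions, a few hundred words used.
value = sparsity(50_000, 300)
print(f"{value:.1%} of the entries are zero")
```

This is why text vectors are usually stored as sparse structures that keep only the nonzero entries, such as a word-to-count mapping, instead of a full row of 50,000 numbers.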

Reflection prompts

What does a vocabulary do geometrically?
Why is cosine similarity useful for text?
What does TF-IDF change?
Why can SVD reveal topics?
What do embeddings add beyond word counts?