---
title: "Chapter 22: Text as Vectors"
subtitle: "How language becomes geometry"
format:
  html:
    toc: true
    toc-depth: 3
    number-sections: true
    code-fold: true
    code-tools: true
jupyter: python3
---
## Opening story: when words enter the machine
A person reads a sentence and hears meaning.
A computer first sees characters.
To compute with language, the computer needs a translation:
> words, sentences, and documents must become numbers.
This is not merely a programming trick. It is a mathematical act. We choose a space, assign coordinates, and place pieces of language as points or directions in that space. Once text becomes vectors, the tools from previous chapters become available:
- length tells us how much of something is present;
- distance compares documents;
- angle compares direction and meaning;
- projection finds the best approximation;
- matrices organize collections of documents;
- SVD uncovers hidden topics;
- neural networks learn better coordinate systems.
The central message of this chapter is:
> Modern language AI begins by turning text into geometry.
This chapter starts with simple word counts and ends with the idea of embeddings, the vector representations used in search engines, recommendation systems, and large language models.
## Learning goals
By the end of this chapter, you should be able to:
1. Explain why text must be represented numerically before machine learning can use it.
2. Build a vocabulary and use it as a coordinate system.
3. Represent documents as word-count vectors.
4. Construct a document-term matrix.
5. Compare documents using dot product, distance, and cosine similarity.
6. Explain why raw word counts are often not enough.
7. Use TF-IDF to weight informative words more strongly.
8. Interpret sparse high-dimensional vectors.
9. Use SVD to discover hidden topics in a small text collection.
10. Explain the difference between count vectors and learned embeddings.
## 22.1 The key idea: a vocabulary is a coordinate system
In ordinary geometry, we choose coordinate axes such as $x$ and $y$.
For text, the axes are words.
Suppose our vocabulary is
$$
[\text{math},\ \text{data},\ \text{AI},\ \text{music}].
$$
This vocabulary defines a four-dimensional coordinate system. The sentence
> Math and data help AI.
can be represented by the vector
$$
\begin{bmatrix}
1\\
1\\
1\\
0
\end{bmatrix}.
$$
The coordinates mean:
- the word **math** appears once;
- the word **data** appears once;
- the word **AI** appears once;
- the word **music** does not appear.
This is the first bridge from language to linear algebra.
::: {.callout-tip}
## Main idea
A vocabulary turns words into axes. A document becomes a point in the space defined by those axes.
:::
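As a minimal sketch of this idea, the snippet below turns the example sentence into its count vector. The helper name `vectorize` and the simple tokenization (lowercase, keep runs of letters) are assumptions made for illustration only.

```python
import re
from collections import Counter

vocab = ["math", "data", "ai", "music"]

def vectorize(text, vocab):
    """Return the count of each vocabulary word in text (hypothetical helper)."""
    # Lowercase and keep runs of letters, so "AI." matches the axis "ai".
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Words without an axis (here "and" and "help") are simply dropped.
    return [counts[word] for word in vocab]

print(vectorize("Math and data help AI.", vocab))
```

The printed list is `[1, 1, 1, 0]`, the same coordinates as the column vector above. Note that words outside the vocabulary leave no trace in the vector.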
## 22.2 Documents as vectors
Consider three short documents:
$$
\begin{aligned}
D_1 &: \text{"math data AI"},\\
D_2 &: \text{"music data music"},\\
D_3 &: \text{"math AI AI data"}.
\end{aligned}
$$
Using the vocabulary
$$
[\text{math},\ \text{data},\ \text{AI},\ \text{music}],
$$
the document vectors are
$$
\mathbf{x}_1=\begin{bmatrix}1\\1\\1\\0\end{bmatrix},\qquad
\mathbf{x}_2=\begin{bmatrix}0\\1\\0\\2\end{bmatrix},\qquad
\mathbf{x}_3=\begin{bmatrix}1\\1\\2\\0\end{bmatrix}.
$$
Now text has become geometry.
Documents $D_1$ and $D_3$ point in similar directions because they use similar words. Document $D_2$ points in a different direction because it is more about music.
## 22.3 The bag-of-words model
The representation above is called a **bag-of-words** model.
It keeps word counts but ignores word order.
For example,
> dogs chase cats
and
> cats chase dogs
have the same bag-of-words vector, even though their meanings are different.
Bag-of-words is simple, useful, and limited.
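A quick way to see the loss of word order is to count the words of both sentences and compare the counts. This sketch uses only the standard library and splits on whitespace, which is an assumption that suits these tiny examples.

```python
from collections import Counter

bag_a = Counter("dogs chase cats".split())
bag_b = Counter("cats chase dogs".split())

# Counters compare by content, not by the order in which words were seen.
print(bag_a == bag_b)
```

The comparison prints `True`: once the sentences are reduced to counts, nothing distinguishes who is chasing whom.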
### What it keeps
It keeps information about which words appear and how often they appear.
### What it loses
It loses word order, grammar, sentence structure, negation, sarcasm, and many forms of meaning.
::: {.callout-warning}
## Important limitation
The bag-of-words model treats a document like an unordered pile of words. It is often useful for search and classification, but it is not a full model of meaning.
:::
## 22.4 A document-term matrix
A collection of document vectors can be stacked into a matrix.
Let rows be documents and columns be words:
$$
X=
\begin{bmatrix}
1 & 1 & 1 & 0\\
0 & 1 & 0 & 2\\
1 & 1 & 2 & 0
\end{bmatrix}.
$$
This is called a **document-term matrix**.
Each row is a document.
Each column is a word.
The entry $X_{ij}$ tells how often word $j$ appears in document $i$.
::: {.callout-note}
## Two views of the same matrix
- Row view: each document is a vector.
- Column view: each word has a pattern across documents.
:::
## 22.5 Python: building a document-term matrix
```{python}
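import numpy as np
import pandas as pd

vocab = ["math", "data", "ai", "music"]
docs = [
    "math data ai",
    "music data music",
    "math ai ai data",
]

# Count how often each vocabulary word occurs in each document,
# so X is built from docs rather than typed in by hand.
X = np.array([[doc.split().count(word) for word in vocab] for doc in docs])

pd.DataFrame(X, columns=vocab, index=["D1", "D2", "D3"])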
```
## 22.6 Dot products: counting shared emphasis
The dot product of two document vectors is
$$
\mathbf{x}\cdot \mathbf{y}=
\sum_{j=1}^n x_jy_j.
$$
For text vectors, this measures shared word emphasis.
If two documents use many of the same words with large counts, their dot product is large.
Example:
$$
\mathbf{x}_1\cdot \mathbf{x}_3
=1\cdot 1+1\cdot 1+1\cdot 2+0\cdot 0=4.
$$
The documents $D_1$ and $D_3$ are similar because they share math, data, and AI language.
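The same arithmetic can be checked numerically. The sketch below assumes only NumPy and the three count vectors from Section 22.2, and computes every pairwise dot product at once as $XX^T$.

```python
import numpy as np

X = np.array([
    [1, 1, 1, 0],  # D1
    [0, 1, 0, 2],  # D2
    [1, 1, 2, 0],  # D3
])

# Entry (i, j) of X X^T is the dot product of document i with document j.
pairwise_dots = X @ X.T
print(pairwise_dots)
```

The entry in row 1, column 3 is $4$, matching the hand computation, while $D_2$ has dot product $1$ with each of the other two documents because the only word it shares with them is "data".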
## 22.7 Why dot product alone can be misleading
A long document tends to have larger word counts than a short document.
So the dot product is influenced by document length.
For example, a long general document may have a large dot product with many documents simply because it contains many words.
This is why angle-based similarity is often better.
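To make the length effect concrete, the following sketch repeats a document ten times and compares it with a query. The vectors and the helper `cosine` are illustrative assumptions; the angle-based score it computes is the cosine similarity defined in Section 22.8.

```python
import numpy as np

short_doc = np.array([1.0, 1.0, 1.0, 0.0])  # "math data ai"
long_doc = 10 * short_doc                   # the same text repeated ten times
query = np.array([1.0, 0.0, 1.0, 0.0])      # "math ai"

def cosine(x, y):
    """Cosine of the angle between two nonzero vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# The dot product grows with document length; the cosine does not.
print("dot:   ", query @ short_doc, query @ long_doc)
print("cosine:", cosine(query, short_doc), cosine(query, long_doc))
```

Repeating the text multiplies the dot product by ten but leaves the direction of the vector, and therefore the cosine, unchanged.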
## 22.8 Cosine similarity
Cosine similarity compares direction rather than size:
$$
\cos(\theta)=
\frac{\mathbf{x}\cdot \mathbf{y}}{\|\mathbf{x}\|\|\mathbf{y}\|}.
$$
In text analysis:
- cosine similarity near $1$ means similar word-use direction;
- cosine similarity near $0$ means the documents share almost no vocabulary; because counts are never negative, the value cannot fall below $0$;
- cosine similarity is less sensitive to document length than raw dot product.
```{python}
def cosine_similarity(x, y):
    """Cosine of the angle between two nonzero vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Compare every pair of documents once.
for i in range(3):
    for j in range(i + 1, 3):
        print(f"cos(D{i+1}, D{j+1}) = {cosine_similarity(X[i], X[j]):.3f}")
```
::: {.callout-tip}
## Memory phrase
Distance asks: **How far apart are they?**
Cosine similarity asks: **Do they point in the same semantic direction?**
:::
## 22.9 Search as vector comparison
A search query can also become a vector.
Suppose the query is
> math AI
Using the same vocabulary, this query becomes
$$
\mathbf{q}=\begin{bmatrix}1\\0\\1\\0\end{bmatrix}.
$$
A search engine can compare $\mathbf{q}$ with every document vector and rank documents by similarity.
This is the linear algebra behind a simple search engine.
```{python}
q = np.array([1, 0, 1, 0])
scores = [cosine_similarity(q, X[i]) for i in range(X.shape[0])]
pd.DataFrame({"document": ["D1", "D2", "D3"], "cosine_score": scores}).sort_values("cosine_score", ascending=False)
```
## 22.10 Raw counts versus normalized frequencies
Raw counts depend on document length.
A longer document naturally has more words.
To reduce this effect, we can use **term frequency**.
For a word $t$ in document $d$,
$$
\operatorname{tf}(t,d)=
\frac{\text{number of times }t\text{ appears in }d}
{\text{total number of words in }d}.
$$
Term frequency turns counts into proportions.
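The definition can be sketched in a few lines. The function name `term_frequency` and the whitespace tokenization are assumptions for illustration.

```python
from collections import Counter

def term_frequency(doc):
    """Map each word in doc to its share of the document's tokens."""
    tokens = doc.split()
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

tf_d2 = term_frequency("music data music")
print(tf_d2)
```

For $D_2$ the word "music" gets $2/3$ and "data" gets $1/3$. The values always sum to $1$, so a document repeated twice has exactly the same term frequencies as the original.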
## 22.11 Common words are not always informative
Words such as **the**, **and**, **of**, and **is** appear frequently, but they usually do not tell us much about the topic.
Even within a specialized collection, words like **data** may appear everywhere.
A word that appears in every document is less useful for distinguishing documents.
A word that appears strongly in only a few documents is often more informative.
This leads to TF-IDF.
## 22.12 Inverse document frequency
Let $N$ be the number of documents. Let $\operatorname{df}(t)$ be the number of documents containing term $t$.
One common version of inverse document frequency is
$$
\operatorname{idf}(t)=\log\left(\frac{N+1}{\operatorname{df}(t)+1}\right)+1.
$$
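Here $\log$ is the natural logarithm. The $+1$ in the numerator and denominator smooths the ratio and avoids division by zero when a term appears in no document. The final $+1$ keeps the weight positive even for a term that appears in every document, whose logarithm would otherwise be $0$.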
Words that appear in many documents have smaller IDF.
Words that appear in fewer documents have larger IDF.
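As a small check of the formula, the sketch below evaluates it for the three documents $D_1$, $D_2$, $D_3$ from Section 22.2. The document frequencies are read off those documents; the function name `idf` is an assumption.

```python
import math

n_docs = 3
# Number of documents (out of D1, D2, D3) that contain each term.
doc_freq = {"math": 2, "data": 3, "ai": 2, "music": 1}

def idf(df, n_docs):
    """Smoothed inverse document frequency with the natural logarithm."""
    return math.log((n_docs + 1) / (df + 1)) + 1

idf_weights = {term: idf(df, n_docs) for term, df in doc_freq.items()}
for term, weight in idf_weights.items():
    print(f"{term:6s} idf = {weight:.3f}")
```

The word "data" appears in all three documents and receives the smallest weight, exactly $1$, while "music" appears in only one document and receives the largest.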
## 22.13 TF-IDF
TF-IDF combines term frequency and inverse document frequency:
$$
\operatorname{tfidf}(t,d)=
\operatorname{tf}(t,d)\operatorname{idf}(t).
$$
TF-IDF says:
> A word is important in a document if it appears often in that document but not everywhere in the whole collection.
```{python}
# Simple TF-IDF calculation for our tiny matrix
term_counts = X.astype(float)
row_sums = term_counts.sum(axis=1, keepdims=True)
tf = term_counts / row_sums
df = (term_counts > 0).sum(axis=0)
N = X.shape[0]
idf = np.log((N + 1) / (df + 1)) + 1
tfidf = tf * idf
pd.DataFrame(tfidf, columns=vocab, index=["D1", "D2", "D3"]).round(3)
```
## 22.14 Geometry changes when we change weights
Raw count vectors and TF-IDF vectors live in the same coordinate system, but the geometry changes.
TF-IDF stretches rare, informative word axes and shrinks common word axes.
This is similar to feature scaling in data analysis.
Changing weights changes distances, angles, nearest neighbors, and search rankings.
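One way to see the change is to compute all pairwise cosine similarities twice, once with raw counts and once with TF-IDF weights. This is a sketch that assumes the chapter's three documents and its TF and IDF formulas; the helper `cosine_matrix` is a name introduced here.

```python
import numpy as np

X = np.array([
    [1, 1, 1, 0],  # D1: math data ai
    [0, 1, 0, 2],  # D2: music data music
    [1, 1, 2, 0],  # D3: math ai ai data
], dtype=float)

# TF-IDF as defined in this chapter: row-normalized counts times smoothed IDF.
tf = X / X.sum(axis=1, keepdims=True)
df = (X > 0).sum(axis=0)
idf = np.log((X.shape[0] + 1) / (df + 1)) + 1
tfidf = tf * idf

def cosine_matrix(M):
    """Pairwise cosine similarities between the rows of M."""
    unit = M / np.linalg.norm(M, axis=1, keepdims=True)
    return unit @ unit.T

raw_cos = cosine_matrix(X)
tfidf_cos = cosine_matrix(tfidf)
print("raw counts:\n", np.round(raw_cos, 3))
print("TF-IDF:\n", np.round(tfidf_cos, 3))
```

The only word that $D_1$ and $D_2$ share is "data", which appears in every document. TF-IDF shrinks that axis and stretches the others, so the similarity between $D_1$ and $D_2$ is lower under TF-IDF than under raw counts.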
## 22.15 Sparse vectors and high-dimensional space
Real vocabularies can contain tens of thousands or millions of words.
A document usually uses only a small fraction of them.
So text vectors are often **high-dimensional** and **sparse**.
A sparse vector has many zeros.
For example, a document may live in a 50,000-dimensional vocabulary space but use only 200 distinct words, so 99.6% of its coordinates are zero.
This is one reason linear algebra for text requires careful computational methods.
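A common remedy is to store only the nonzero coordinates. The sketch below keeps each vector as a dictionary from index to count, which is one simple choice among several; the indices and counts are made up for illustration.

```python
# A sparse vector stored as {index: value}; every absent index is zero.
vocab_size = 50_000
doc_a = {12: 3, 407: 1, 9_981: 2}       # hypothetical word indices and counts
doc_b = {407: 4, 9_981: 1, 31_000: 5}

def sparse_dot(u, v):
    """Dot product that visits only the nonzero entries of the smaller vector."""
    if len(u) > len(v):
        u, v = v, u
    return sum(value * v.get(index, 0) for index, value in u.items())

print(sparse_dot(doc_a, doc_b))
print(f"fraction of zeros in doc_a: {1 - len(doc_a) / vocab_size:.5f}")
```

The dot product touches three entries instead of 50,000, because only indices present in both dictionaries can contribute.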
## 22.16 Stop words and preprocessing
Before building text vectors, we usually preprocess text.
Common steps include:
1. converting to lowercase;
2. removing punctuation;
3. splitting text into tokens;
4. removing stop words such as **the** and **and**;
5. sometimes reducing related forms such as **learning** and **learned** to a common root such as **learn** (stemming or lemmatization).
Preprocessing is not just technical housekeeping. It changes the vector representation.
::: {.callout-warning}
## Modeling choice
There is no single correct preprocessing pipeline. The right choice depends on the task.
:::
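The first four steps can be sketched with the standard library alone. The stop-word list here is a tiny hypothetical one, the function name `preprocess` is an assumption, and stemming is left out because it usually relies on a dedicated library.

```python
import re

# A tiny, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "and", "of", "is", "a"}

def preprocess(text, remove_stop_words=True):
    """Lowercase, strip punctuation, tokenize, and optionally drop stop words."""
    # Keeping only runs of letters and digits removes punctuation and splits tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

sentence = "The model is learning, and the data is growing!"
print(preprocess(sentence))
print(preprocess(sentence, remove_stop_words=False))
```

With stop words removed, nine tokens shrink to four. Toggling a single option changes which axes a document touches, which is why preprocessing is a modeling choice and not just cleanup.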
## 22.17 Mini example: classification by nearest prototype
Suppose we have two topic prototypes:
$$
\mathbf{p}_{\text{AI}}=\text{average vector of AI documents},
$$
and
$$
\mathbf{p}_{\text{music}}=\text{average vector of music documents}.
$$
A new document can be classified by comparing it to each prototype.
This is a simple example of classification by geometry.
```{python}
ai_proto = np.array([1, 1, 2, 0], dtype=float)
music_proto = np.array([0, 1, 0, 2], dtype=float)
new_doc = np.array([1, 1, 1, 0], dtype=float)
print("Similarity to AI prototype:", cosine_similarity(new_doc, ai_proto))
print("Similarity to music prototype:", cosine_similarity(new_doc, music_proto))
```
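The prototypes above are typed in by hand. The sketch below shows how they could be computed as averages of labeled documents and then used to classify new ones. The labeled count vectors are hypothetical, chosen only to illustrate the mechanics over the vocabulary [math, data, ai, music].

```python
import numpy as np

# Hypothetical labeled count vectors over [math, data, ai, music].
ai_docs = np.array([[1, 1, 1, 0], [1, 1, 2, 0], [2, 0, 1, 0]], dtype=float)
music_docs = np.array([[0, 1, 0, 2], [0, 0, 1, 3]], dtype=float)

# Each prototype is the average vector of its class.
prototypes = {
    "AI": ai_docs.mean(axis=0),
    "music": music_docs.mean(axis=0),
}

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def classify(doc):
    """Return the label whose prototype has the highest cosine similarity."""
    return max(prototypes, key=lambda label: cosine(doc, prototypes[label]))

print(classify(np.array([1.0, 0.0, 2.0, 0.0])))  # "math ai ai"
print(classify(np.array([0.0, 1.0, 0.0, 1.0])))  # "data music"
```

The first document is assigned to the AI prototype and the second to the music prototype, even though the second also contains the word "data", which occurs in both classes.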
## 22.18 The document-term matrix as data
The document-term matrix is a data matrix.
It has the same structure as the data matrices we have studied before:
$$
X=\begin{bmatrix}
- & \mathbf{x}_1^T & -\\
- & \mathbf{x}_2^T & -\\
& \vdots & \\
- & \mathbf{x}_m^T & -
\end{bmatrix}.
$$
Rows are documents.
Columns are features.
Each document is a point.
Each word is a coordinate.
This means PCA, SVD, clustering, classification, projection, and nearest-neighbor methods can all be applied to text data.
## 22.19 Hidden topics through SVD
Text collections often have hidden structure.
For example, some documents may be about AI, some about music, and some about sports.
SVD can discover low-dimensional directions that summarize major patterns.
If
$$
X=U\Sigma V^T,
$$
then:
- rows of $U\Sigma$ give document coordinates in a topic-like space;
- columns of $V$ describe word patterns associated with those directions;
- large singular values identify strong patterns.
This idea is related to **latent semantic analysis**.
## 22.20 Python: hidden topics with SVD
```{python}
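terms = ["ai", "model", "data", "neural", "song", "guitar", "music", "melody", "team", "score", "game", "player"]

# Rows are documents: three about AI, two about music, two about sports.
X_topic = np.array([
    [3,2,3,2,0,0,0,0,0,0,0,0],
    [2,3,2,3,0,0,0,0,0,0,0,0],
    [2,2,3,2,0,0,0,0,0,0,0,0],
    [0,0,0,0,3,2,3,2,0,0,0,0],
    [0,0,0,0,2,3,2,3,0,0,0,0],
    [0,0,0,0,0,0,0,0,3,2,3,2],
    [0,0,0,0,0,0,0,0,2,3,2,3],
], dtype=float)

U, S, Vt = np.linalg.svd(X_topic, full_matrices=False)
print("singular values:", np.round(S, 3))

# Each topic block uses four terms, so list the four heaviest terms per direction.
# The music and sports blocks contain identical counts, so their singular values
# tie and SVD may return a mixture of those two directions.
for k in range(3):
    top = np.argsort(np.abs(Vt[k]))[::-1][:4]
    print(f"Topic direction {k+1}:", [terms[i] for i in top])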
```
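The document coordinates $U\Sigma$ from Section 22.19 can be computed from the same decomposition. This sketch repeats the toy matrix so that it runs on its own; the choice $k=3$ and the variable names `doc_coords`, `X_k`, and `kept` are assumptions for illustration.

```python
import numpy as np

# Same toy matrix as above: 3 AI, 2 music, 2 sports documents over 12 terms.
X_topic = np.array([
    [3, 2, 3, 2, 0, 0, 0, 0, 0, 0, 0, 0],
    [2, 3, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0],
    [2, 2, 3, 2, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 3, 2, 3, 2, 0, 0, 0, 0],
    [0, 0, 0, 0, 2, 3, 2, 3, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 3, 2, 3, 2],
    [0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 2, 3],
], dtype=float)

U, S, Vt = np.linalg.svd(X_topic, full_matrices=False)

k = 3
doc_coords = U[:, :k] * S[:k]   # rows of U Sigma, truncated to k columns
X_k = doc_coords @ Vt[:k]       # rank-k approximation of X_topic

# Share of the squared singular values captured by the first k directions.
kept = float((S[:k] ** 2).sum() / (S ** 2).sum())
print("document coordinates:\n", np.round(doc_coords, 2))
print("share of squared singular values kept:", round(kept, 3))
```

Each document is now described by three numbers instead of twelve. Because the first three singular values dominate, the rank-3 matrix `X_k` stays close to the original counts.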
## 22.21 Count vectors versus embeddings
Bag-of-words vectors are usually:
- high-dimensional;
- sparse;
- based on counts;
- tied to a fixed vocabulary;
- insensitive to word order.
Embeddings are different.
A word embedding or sentence embedding is a learned vector designed to capture semantic relationships.
For example, embeddings try to place related words near one another:
$$
\text{vector}(\text{king}) \approx \text{vector}(\text{queen})
$$
in the sense that the two vectors are close, for example with a small angle between them.
Modern systems learn embeddings from massive text collections.
## 22.22 What embeddings add
Embeddings can capture relationships that count vectors miss.
For example, the words **car** and **automobile** may never overlap in a bag-of-words representation, but a good embedding model should place them close together.
A sentence embedding can place two similar sentences near each other even if they use different words.
This is the foundation of semantic search.
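The contrast can be illustrated with a toy example. The three-dimensional vectors below are hand-picked for illustration and do not come from any trained model; real embeddings are learned from data and usually have hundreds of dimensions.

```python
import numpy as np

# Hand-picked toy vectors, NOT learned embeddings.
toy_embedding = {
    "car":        np.array([0.9, 0.1, 0.0]),
    "automobile": np.array([0.8, 0.2, 0.1]),
    "melody":     np.array([0.0, 0.1, 0.9]),
}

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# In a bag-of-words space "car" and "automobile" are separate axes,
# so their one-hot vectors are orthogonal.
one_hot_car = np.array([1.0, 0.0, 0.0])
one_hot_automobile = np.array([0.0, 1.0, 0.0])

print("bag-of-words:  ", cosine(one_hot_car, one_hot_automobile))
print("toy embedding: ", cosine(toy_embedding["car"], toy_embedding["automobile"]))
print("car vs melody: ", cosine(toy_embedding["car"], toy_embedding["melody"]))
```

In the count-based space the two synonyms have similarity exactly $0$. In the toy embedding they are nearly parallel, while the unrelated word stays almost orthogonal to both.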
## 22.23 Matrices inside language models
Modern language models are much more complex than bag-of-words models, but linear algebra is everywhere.
They use:
- embedding matrices to turn tokens into vectors;
- matrix multiplication to transform representations;
- dot products to compare tokens;
- softmax functions to turn scores into probabilities;
- attention matrices to mix information across positions;
- high-dimensional vectors to represent context.
The story is still the same:
> text becomes vectors, and meaning is processed through matrix operations.
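To show how the pieces in the list fit together, here is a heavily simplified sketch of one attention step in NumPy. Everything in it is an assumption for illustration: the sizes are tiny, the weights are random where a real model would use learned ones, and there is a single head with no masking.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d_model = 10, 4
embedding = rng.normal(size=(vocab_size, d_model))  # one row per token id

token_ids = np.array([2, 7, 7, 1])                  # a made-up 4-token input
H = embedding[token_ids]                            # lookup: shape (4, d_model)

# Random projections stand in for learned weight matrices.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = H @ W_q, H @ W_k, H @ W_v

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)           # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

scores = Q @ K.T / np.sqrt(d_model)                 # dot products between positions
attention = softmax(scores)                         # each row sums to 1
output = attention @ V                              # mix value vectors across positions

print(np.round(attention, 3))
```

Each row of `attention` is a set of nonnegative weights that sum to $1$, and each row of `output` is the corresponding weighted average of the value vectors. Every step is an embedding lookup, a matrix product, or a softmax.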
## 22.24 Worked example: compare documents by hand
Let
$$
\mathbf{x}=\begin{bmatrix}2\\1\\0\end{bmatrix},
\qquad
\mathbf{y}=\begin{bmatrix}1\\1\\1\end{bmatrix}.
$$
Then
$$
\mathbf{x}\cdot \mathbf{y}=2\cdot 1+1\cdot 1+0\cdot 1=3.
$$
Also,
$$
\|\mathbf{x}\|=\sqrt{2^2+1^2+0^2}=\sqrt{5},
\qquad
\|\mathbf{y}\|=\sqrt{3}.
$$
So
$$
\cos(\theta)=\frac{3}{\sqrt{5}\sqrt{3}}=\frac{3}{\sqrt{15}}\approx 0.775.
$$
The documents point in fairly similar directions.
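The hand computation can be verified with a few lines of plain Python. This sketch uses only the standard library, and the variable names are chosen here for illustration.

```python
import math

x = [2, 1, 0]
y = [1, 1, 1]

dot = sum(a * b for a, b in zip(x, y))
norm_x = math.sqrt(sum(a * a for a in x))
norm_y = math.sqrt(sum(b * b for b in y))
cos_theta = dot / (norm_x * norm_y)
angle_deg = math.degrees(math.acos(cos_theta))

print(dot, round(cos_theta, 3))
print(f"angle: {angle_deg:.1f} degrees")
```

The first line prints `3 0.775`, in agreement with the values above. The second line converts the cosine into an angle, which some readers find easier to picture.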
## 22.25 Practice problems
### Problem 1
Use the vocabulary
$$
[\text{cat},\ \text{dog},\ \text{food},\ \text{music}]
$$
to vectorize the following documents:
1. "cat dog dog"
2. "music food music"
3. "cat food dog"
::: {.callout-note collapse="true"}
## Solution
The vectors are
$$
\begin{bmatrix}1\\2\\0\\0\end{bmatrix},\qquad
\begin{bmatrix}0\\0\\1\\2\end{bmatrix},\qquad
\begin{bmatrix}1\\1\\1\\0\end{bmatrix}.
$$
:::
### Problem 2
Compute the cosine similarity between
$$
\mathbf{x}=\begin{bmatrix}1\\2\\0\end{bmatrix}
\quad\text{and}\quad
\mathbf{y}=\begin{bmatrix}2\\1\\0\end{bmatrix}.
$$
::: {.callout-note collapse="true"}
## Solution
$$
\mathbf{x}\cdot\mathbf{y}=1\cdot2+2\cdot1+0\cdot0=4.
$$
$$
\|\mathbf{x}\|=\sqrt{5},\qquad \|\mathbf{y}\|=\sqrt{5}.
$$
Therefore
$$
\cos(\theta)=\frac{4}{5}=0.8.
$$
:::
### Problem 3
Explain in your own words why cosine similarity is often better than dot product for comparing documents of different lengths.
::: {.callout-note collapse="true"}
## Solution
The dot product grows when documents are longer because longer documents usually have larger counts. Cosine similarity divides by the vector lengths, so it focuses more on direction, or relative word-use pattern, rather than total document size.
:::
### Problem 4
In a collection of $1000$ documents, a word appears in $10$ documents. Another word appears in $900$ documents. Which word has larger IDF? Why?
::: {.callout-note collapse="true"}
## Solution
The word appearing in $10$ documents has larger IDF because it is rarer and therefore more informative for distinguishing documents.
:::
### Problem 5
Describe one kind of meaning that bag-of-words cannot capture.
::: {.callout-note collapse="true"}
## Solution
One example is word order. The sentences "dog bites person" and "person bites dog" have the same bag-of-words counts but very different meanings.
:::
## 22.26 Challenge problems
### Challenge 1: Build a tiny search engine
Create five short documents. Choose a vocabulary. Build document vectors. Then enter a query and rank the documents by cosine similarity.
### Challenge 2: Compare raw counts and TF-IDF
Use the same documents and compare search results using raw count vectors and TF-IDF vectors. Which words changed the ranking the most?
### Challenge 3: Hidden topics
Create a document-term matrix with at least three topics. Apply SVD and inspect the right singular vectors that belong to the largest singular values. Can you interpret the topics?
### Challenge 4: Sparse vectors
Construct a vocabulary with at least $1000$ possible words and simulate documents that use only $20$ words. Estimate what fraction of entries are zero.
## 22.27 AI companion activities
Use an AI assistant as a study partner, not as a replacement for your own thinking.
### Activity 1
Ask:
> Explain bag-of-words in the style of a story for a beginner in linear algebra.
Then improve the answer by adding one mathematical formula.
### Activity 2
Ask:
> Give me three examples where cosine similarity is better than Euclidean distance for text comparison.
Check whether the examples really depend on direction rather than length.
### Activity 3
Ask:
> Create a tiny document-term matrix with three hidden topics and explain how SVD finds them.
Then reproduce the example in Python.
### Activity 4
Ask:
> Explain the difference between TF-IDF vectors and embeddings.
Rewrite the answer in your own words.
## 22.28 Summary
In this chapter, we learned that text can be represented as vectors.
A vocabulary creates a coordinate system.
A document becomes a point or direction in that space.
A collection of documents becomes a matrix.
Dot products, distances, and cosine similarity compare documents.
TF-IDF changes the geometry by emphasizing informative words.
SVD can reveal hidden topic directions.
Embeddings go further by learning dense vector representations of meaning.
The mathematical lesson is simple and powerful:
> Text becomes computable when language becomes linear algebra.
In the next chapter, we move from text vectors to neural networks, where matrices become trainable machines that learn representations from data.