Chapter 22: Text as Vectors

How language becomes geometry

22.1 Opening story: when words enter the machine

A person reads a sentence and hears meaning.

A computer first sees characters.

To compute with language, the computer needs a translation:

words, sentences, and documents must become numbers.

This is not merely a programming trick. It is a mathematical act. We choose a space, assign coordinates, and place pieces of language as points or directions in that space. Once text becomes vectors, the tools from previous chapters become available:

length tells us how much of something is present;
distance compares documents;
angle compares direction and meaning;
projection finds the best approximation;
matrices organize collections of documents;
SVD uncovers hidden topics;
neural networks learn better coordinate systems.

The central message of this chapter is:

Modern language AI begins by turning text into geometry.

This chapter starts with simple word counts and ends with the idea of embeddings, the vector representations used in search engines, recommendation systems, and large language models.

22.2 Learning goals

By the end of this chapter, you should be able to:

Explain why text must be represented numerically before machine learning can use it.
Build a vocabulary and use it as a coordinate system.
Represent documents as word-count vectors.
Construct a document-term matrix.
Compare documents using dot product, distance, and cosine similarity.
Explain why raw word counts are often not enough.
Use TF-IDF to weight informative words more strongly.
Interpret sparse high-dimensional vectors.
Use SVD to discover hidden topics in a small text collection.
Explain the difference between count vectors and learned embeddings.

22.3 The key idea: a vocabulary is a coordinate system

In ordinary geometry, we choose coordinate axes such as $x$ and $y$.

For text, the axes are words.

Suppose our vocabulary is

\[ [\text{math},\ \text{data},\ \text{AI},\ \text{music}]. \]

This vocabulary defines a four-dimensional coordinate system. The sentence

Math and data help AI.

can be represented by the vector

\[ \begin{bmatrix} 1\\ 1\\ 1\\ 0 \end{bmatrix}. \]

The coordinates mean:

the word math appears once;
the word data appears once;
the word AI appears once;
the word music does not appear.

This is the first bridge from language to linear algebra.

Main idea

A vocabulary turns words into axes. A document becomes a point in the space defined by those axes.
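As a minimal sketch of this idea, the function below turns a sentence into a count vector over a fixed vocabulary. The helper name `vectorize` and the choice to lowercase the text and keep only runs of letters are assumptions made for illustration; they are not part of the chapter's own code.

```python
import re
from collections import Counter

def vectorize(text, vocab):
    """Return the word-count vector of `text` in the coordinate system `vocab`."""
    # Lowercase and keep runs of letters, so "AI." matches the axis "ai".
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Words outside the vocabulary ("and", "help") have no axis and are dropped.
    return [counts[word] for word in vocab]

vocab = ["math", "data", "ai", "music"]
print(vectorize("Math and data help AI.", vocab))  # [1, 1, 1, 0]
```

Note that the words and and help leave no trace in the vector: a word without an axis is invisible to this representation.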

22.4 Documents as vectors

Consider three short documents:

\[ \begin{aligned} D_1 &: \text{"math data AI"},\\ D_2 &: \text{"music data music"},\\ D_3 &: \text{"math AI AI data"}. \end{aligned} \]

Using the vocabulary

\[ [\text{math},\ \text{data},\ \text{AI},\ \text{music}], \]

the document vectors are

\[ \mathbf{x}_1=\begin{bmatrix}1\\1\\1\\0\end{bmatrix},\qquad \mathbf{x}_2=\begin{bmatrix}0\\1\\0\\2\end{bmatrix},\qquad \mathbf{x}_3=\begin{bmatrix}1\\1\\2\\0\end{bmatrix}. \]

Now text has become geometry.

Documents $D_1$ and $D_3$ point in similar directions because they use similar words. Document $D_2$ points in a different direction because it is more about music.

22.5 The bag-of-words model

The representation above is called a bag-of-words model.

It keeps word counts but ignores word order.

For example,

dogs chase cats

and

cats chase dogs

have the same bag-of-words vector, even though their meanings are different.

Bag-of-words is simple, useful, and limited.
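The loss of word order can be checked directly. The short sketch below uses Python's `collections.Counter` as a stand-in for a bag of words; splitting on whitespace is a simplifying assumption.

```python
from collections import Counter

sentence_a = "dogs chase cats"
sentence_b = "cats chase dogs"

bag_a = Counter(sentence_a.split())
bag_b = Counter(sentence_b.split())

# The two bags contain the same words with the same counts,
# so the model cannot tell the sentences apart.
print(bag_a == bag_b)            # True
print(sentence_a == sentence_b)  # False
```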

22.5.1 What it keeps

It keeps information about which words appear and how often they appear.

22.5.2 What it loses

It loses word order, grammar, sentence structure, negation, sarcasm, and many forms of meaning.

Important limitation

The bag-of-words model treats a document like an unordered pile of words. It is often useful for search and classification, but it is not a full model of meaning.

22.6 A document-term matrix

A collection of document vectors can be stacked into a matrix.

Let rows be documents and columns be words:

\[ X= \begin{bmatrix} 1 & 1 & 1 & 0\\ 0 & 1 & 0 & 2\\ 1 & 1 & 2 & 0 \end{bmatrix}. \]

This is called a document-term matrix.

Each row is a document.

Each column is a word.

The entry $X_{ij}$ tells how often word $j$ appears in document $i$.

Two views of the same matrix

Row view: each document is a vector.
Column view: each word has a pattern across documents.

22.7 Python: building a document-term matrix

Code

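import numpy as np
import pandas as pd

vocab = ["math", "data", "ai", "music"]
docs = [
    "math data ai",
    "music data music",
    "math ai ai data"
]

# Count how often each vocabulary word occurs in each document.
# Rows are documents; columns follow the order of vocab.
X = np.array([[doc.split().count(word) for word in vocab] for doc in docs])

pd.DataFrame(X, columns=vocab, index=["D1", "D2", "D3"])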

	math	data	ai	music
D1	1	1	1	0
D2	0	1	0	2
D3	1	1	2	0

22.8 Dot products: counting shared emphasis

The dot product of two document vectors is

\[ \mathbf{x}\cdot \mathbf{y}= \sum_{j=1}^n x_jy_j. \]

For text vectors, this measures shared word emphasis.

If two documents use many of the same words with large counts, their dot product is large.

Example:

\[ \mathbf{x}_1\cdot \mathbf{x}_3 =1\cdot 1+1\cdot 1+1\cdot 2+0\cdot 0=4. \]

The documents $D_1$ and $D_3$ are similar because they share math, data, and AI language.
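All pairwise dot products can be computed at once as the matrix product of the document-term matrix with its transpose. The sketch below repeats the matrix $X$ from the earlier section so that it runs on its own; the name `G` is an arbitrary choice.

```python
import numpy as np

X = np.array([
    [1, 1, 1, 0],  # D1
    [0, 1, 0, 2],  # D2
    [1, 1, 2, 0],  # D3
])

# Entry (i, j) of G is the dot product of document i with document j.
G = X @ X.T
print(G)
```

The entry in row 1, column 3 is the value $4$ computed by hand above, while $D_2$ has a dot product of only $1$ with each of the other two documents. The diagonal entries are the squared lengths of the document vectors.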

22.9 Why dot product alone can be misleading

A long document tends to have larger word counts than a short document.

So the dot product is influenced by document length.

For example, a long general document may have a large dot product with many documents simply because it contains many words.

This is why angle-based similarity is often better.
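A small numerical sketch makes the length effect concrete. The vectors below use the vocabulary [math, data, AI, music]; the document called `rambling` is hypothetical, invented here to stand for a long text that mentions every topic.

```python
import numpy as np

query = np.array([1, 0, 1, 0])      # "math ai"
focused = np.array([1, 0, 1, 0])    # a short document that is only about math and AI
rambling = np.array([2, 6, 2, 6])   # a long, hypothetical document that mentions everything

print(query @ focused)   # 2
print(query @ rambling)  # 4
```

The long document wins on raw dot product even though only a quarter of its words concern math or AI, simply because its counts are larger.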

22.10 Cosine similarity

Cosine similarity compares direction rather than size:

\[ \cos(\theta)= \frac{\mathbf{x}\cdot \mathbf{y}}{\|\mathbf{x}\|\|\mathbf{y}\|}. \]

In text analysis:

cosine similarity near $1$ means similar word-use direction;
cosine similarity near $0$ means nearly unrelated word-use direction;
cosine similarity is less sensitive to document length than raw dot product.

Code

def cosine_similarity(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

for i in range(3):
    for j in range(i+1, 3):
        print(f"cos(D{i+1}, D{j+1}) = {cosine_similarity(X[i], X[j]):.3f}")

cos(D1, D2) = 0.258
cos(D1, D3) = 0.943
cos(D2, D3) = 0.183

Memory phrase

Distance asks: How far apart are they?
Cosine similarity asks: Do they point in the same semantic direction?

22.11 Search as vector comparison

A search query can also become a vector.

Suppose the query is

math AI

Using the same vocabulary, this query becomes

\[ \mathbf{q}=\begin{bmatrix}1\\0\\1\\0\end{bmatrix}. \]

A search engine can compare $\mathbf{q}$ with every document vector and rank documents by similarity.

This is the linear algebra behind a simple search engine.

Code

q = np.array([1, 0, 1, 0])
scores = [cosine_similarity(q, X[i]) for i in range(X.shape[0])]
pd.DataFrame({"document": ["D1", "D2", "D3"], "cosine_score": scores}).sort_values("cosine_score", ascending=False)

	document	cosine_score
2	D3	0.866025
0	D1	0.816497
1	D2	0.000000
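One practical wrinkle follows from the cosine formula itself: a query whose words all lie outside the vocabulary becomes the zero vector, and the formula then divides by zero. The sketch below shows one possible way to handle this. The helper `safe_cosine` is hypothetical, and returning `0.0` for a zero-length vector is a design assumption, not a standard rule.

```python
import numpy as np

def safe_cosine(x, y):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom > 0 else 0.0

X = np.array([[1, 1, 1, 0], [0, 1, 0, 2], [1, 1, 2, 0]])
empty_query = np.zeros(4)  # e.g. the query "sports", which has no axis in this vocabulary

print([safe_cosine(empty_query, row) for row in X])  # [0.0, 0.0, 0.0]
```

With this convention, an out-of-vocabulary query matches nothing instead of producing `nan` scores.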

22.12 Raw counts versus normalized frequencies

Raw counts depend on document length.

A longer document naturally has more words.

To reduce this effect, we can use term frequency.

For a word $t$ in document $d$,

\[ \operatorname{tf}(t,d)= \frac{\text{number of times }t\text{ appears in }d} {\text{total number of words in }d}. \]

Term frequency turns counts into proportions.

22.13 Common words are not always informative

Words such as the, and, of, and is appear frequently, but they usually do not tell us much about the topic.

Even within a specialized collection, words like data may appear everywhere.

A word that appears in every document is less useful for distinguishing documents.

A word that appears strongly in only a few documents is often more informative.

This leads to TF-IDF.

22.14 Inverse document frequency

Let $N$ be the number of documents. Let $\operatorname{df}(t)$ be the number of documents containing term $t$.

One common version of inverse document frequency is

\[ \operatorname{idf}(t)=\log\left(\frac{N+1}{\operatorname{df}(t)+1}\right)+1. \]

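The two $+1$ terms inside the fraction avoid division by zero. The $+1$ added after the logarithm keeps every weight positive: a word that appears in all $N$ documents gets $\operatorname{idf}(t)=\log(1)+1=1$ rather than $0$.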

Words that appear in many documents have smaller IDF.

Words that appear in fewer documents have larger IDF.

22.15 TF-IDF

TF-IDF combines term frequency and inverse document frequency:

\[ \operatorname{tfidf}(t,d)= \operatorname{tf}(t,d)\operatorname{idf}(t). \]

TF-IDF says:

A word is important in a document if it appears often in that document but not everywhere in the whole collection.

Code

# Simple TF-IDF calculation for our tiny matrix
term_counts = X.astype(float)
row_sums = term_counts.sum(axis=1, keepdims=True)
tf = term_counts / row_sums

df = (term_counts > 0).sum(axis=0)
N = X.shape[0]
idf = np.log((N + 1) / (df + 1)) + 1

tfidf = tf * idf
pd.DataFrame(tfidf, columns=vocab, index=["D1", "D2", "D3"]).round(3)

	math	data	ai	music
D1	0.429	0.333	0.429	0.000
D2	0.000	0.333	0.000	1.129
D3	0.322	0.250	0.644	0.000

22.16 Geometry changes when we change weights

Raw count vectors and TF-IDF vectors live in the same coordinate system, but the geometry changes.

TF-IDF stretches rare, informative word axes and shrinks common word axes.

This is similar to feature scaling in data analysis.

Changing weights changes distances, angles, nearest neighbors, and search rankings.
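To see the geometry shift on the running example, the sketch below compares the cosine similarity of $D_1$ and $D_2$ under raw counts and under TF-IDF weights. It repeats the TF-IDF recipe from the previous section; the helper name `cosine` is an assumption for this snippet.

```python
import numpy as np

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

X = np.array([[1, 1, 1, 0], [0, 1, 0, 2], [1, 1, 2, 0]], dtype=float)

# Same TF-IDF recipe as in the TF-IDF section.
tf = X / X.sum(axis=1, keepdims=True)
df = (X > 0).sum(axis=0)
idf = np.log((X.shape[0] + 1) / (df + 1)) + 1
tfidf = tf * idf

# D1 and D2 share only the word "data", which appears in every document.
raw_score = cosine(X[0], X[1])
tfidf_score = cosine(tfidf[0], tfidf[1])
print(f"raw counts: {raw_score:.3f}")
print(f"TF-IDF:     {tfidf_score:.3f}")
```

The similarity drops from roughly 0.26 to roughly 0.14. The only word the two documents share is data, and TF-IDF shrinks that axis relative to the others, so the shared component counts for less.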

22.17 Sparse vectors and high-dimensional space

Real vocabularies can contain tens of thousands or millions of words.

A document usually uses only a small fraction of them.

So text vectors are often high-dimensional and sparse.

A sparse vector has many zeros.

For example, a document may live in a 50,000-dimensional vocabulary space but use only 200 words.

This is one reason linear algebra for text requires careful computational methods.
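One common computational idea is to store only the nonzero coordinates. The sketch below simulates the 50,000-dimensional example with a plain Python dictionary. The random counts, the fixed seed, and the helper `sparse_dot` are all assumptions made for illustration; production code would typically use a dedicated sparse-matrix library instead.

```python
import random

random.seed(0)                       # fixed seed so the sketch is repeatable
vocab_size = 50_000
words_used = 200

# Store only the nonzero coordinates: {axis index: count}.
indices = random.sample(range(vocab_size), words_used)
sparse_doc = {i: random.randint(1, 5) for i in indices}

def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    # Iterate over the smaller dict; axes missing from either vector contribute 0.
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b[i] for i, value in a.items() if i in b)

zero_fraction = 1 - len(sparse_doc) / vocab_size
print(f"stored entries: {len(sparse_doc)}, fraction of zeros: {zero_fraction:.3f}")
print(sparse_dot(sparse_doc, sparse_doc))  # squared length of the document vector
```

Here 99.6% of the coordinates are zero, and the dot product touches only the 200 stored entries rather than all 50,000 axes.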

22.18 Stop words and preprocessing

Before building text vectors, we usually preprocess text.

Common steps include:

converting to lowercase;
removing punctuation;
splitting text into tokens;
removing stop words such as the and and;
sometimes reducing words to a common root, so that learn, learning, and learned are all counted as learn.

Preprocessing is not just technical housekeeping. It changes the vector representation.

Modeling choice

There is no single correct preprocessing pipeline. The right choice depends on the task.
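The first four steps listed above can be sketched in a few lines. The function name `preprocess`, the regular expression, and the tiny stop-word set are assumptions chosen for this example; real stop-word lists are much longer, and the root-reduction step is left out because it usually relies on an external library.

```python
import re

STOP_WORDS = {"the", "and", "of", "is", "a", "to"}  # tiny illustrative list, not a standard one

def preprocess(text, remove_stop_words=True):
    """Lowercase, strip punctuation, tokenize, and optionally drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # replace punctuation with spaces
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

print(preprocess("The model is learning, and the data is clean!"))
# ['model', 'learning', 'data', 'clean']
```

Calling `preprocess(..., remove_stop_words=False)` keeps all nine tokens, which would produce a different vector for the same sentence. This is the sense in which preprocessing is a modeling choice.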

22.19 Mini example: classification by nearest prototype

Suppose we have two topic prototypes:

\[ \mathbf{p}_{\text{AI}}=\text{average vector of AI documents}, \]

and

\[ \mathbf{p}_{\text{music}}=\text{average vector of music documents}. \]

A new document can be classified by comparing it to each prototype.

This is a simple example of classification by geometry.

Code

ai_proto = np.array([1, 1, 2, 0], dtype=float)
music_proto = np.array([0, 1, 0, 2], dtype=float)
new_doc = np.array([1, 1, 1, 0], dtype=float)

print("Similarity to AI prototype:", cosine_similarity(new_doc, ai_proto))
print("Similarity to music prototype:", cosine_similarity(new_doc, music_proto))

Similarity to AI prototype: 0.9428090415820635
Similarity to music prototype: 0.2581988897471611
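The example above uses a single vector for each prototype. A sketch closer to the definition averages several documents per class and assigns the label of the most similar prototype. The labeled count vectors below are hypothetical, invented over the vocabulary [math, data, AI, music] for illustration only.

```python
import numpy as np

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical labeled count vectors over [math, data, ai, music].
ai_docs = np.array([[1, 1, 2, 0], [2, 1, 1, 0], [1, 2, 3, 0]], dtype=float)
music_docs = np.array([[0, 1, 0, 2], [0, 0, 1, 3]], dtype=float)

# Each prototype is the mean of its class's document vectors.
prototypes = {
    "AI": ai_docs.mean(axis=0),
    "music": music_docs.mean(axis=0),
}

new_doc = np.array([1, 1, 1, 0], dtype=float)
scores = {label: cosine(new_doc, p) for label, p in prototypes.items()}
predicted = max(scores, key=scores.get)
print(predicted)  # AI
```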

22.20 The document-term matrix as data

The document-term matrix is a data matrix.

It has the same structure as the data matrices we have studied before:

\[ X=\begin{bmatrix} - & \mathbf{x}_1^T & -\\ - & \mathbf{x}_2^T & -\\ & \vdots & \\ - & \mathbf{x}_m^T & - \end{bmatrix}. \]

Rows are documents.

Columns are features.

Each document is a point.

Each word is a coordinate.

This means PCA, SVD, clustering, classification, projection, and nearest-neighbor methods can all be applied to text data.

22.21 Hidden topics through SVD

Text collections often have hidden structure.

For example, some documents may be about AI, some about music, and some about sports.

SVD can discover low-dimensional directions that summarize major patterns.

If

\[ X=U\Sigma V^T, \]

then:

rows of $U\Sigma$ give document coordinates in a topic-like space;
columns of $V$ describe word patterns associated with those directions;
large singular values identify strong patterns.

This idea is related to latent semantic analysis.

22.22 Python: hidden topics with SVD

Code

terms = ["ai", "model", "data", "neural", "song", "guitar", "music", "melody", "team", "score", "game", "player"]
X_topic = np.array([
    [3,2,3,2,0,0,0,0,0,0,0,0],
    [2,3,2,3,0,0,0,0,0,0,0,0],
    [2,2,3,2,0,0,0,0,0,0,0,0],
    [0,0,0,0,3,2,3,2,0,0,0,0],
    [0,0,0,0,2,3,2,3,0,0,0,0],
    [0,0,0,0,0,0,0,0,3,2,3,2],
    [0,0,0,0,0,0,0,0,2,3,2,3],
], dtype=float)

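# Thin SVD: S holds the singular values in decreasing order,
# and row k of Vt is the k-th right singular vector (a direction in word space).
U, S, Vt = np.linalg.svd(X_topic, full_matrices=False)
print("singular values:", np.round(S, 3))

for k in range(3):
    # Indices of the five terms with the largest absolute weight in direction k.
    top = np.argsort(np.abs(Vt[k]))[::-1][:5]
    print(f"Topic direction {k+1}:", [terms[i] for i in top])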

singular values: [8.395 7.071 7.071 1.486 1.414 1.414 0.567]
Topic direction 1: ['data', 'ai', 'neural', 'model', 'player']
Topic direction 2: ['game', 'team', 'player', 'score', 'guitar']
Topic direction 3: ['music', 'song', 'melody', 'guitar', 'neural']
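The same decomposition also gives the document coordinates described in the previous section. The sketch below keeps the first three directions and measures how much of the matrix they account for; the variable names `doc_coords` and `energy`, and the choice $k=3$, are assumptions for this example.

```python
import numpy as np

X_topic = np.array([
    [3,2,3,2,0,0,0,0,0,0,0,0],
    [2,3,2,3,0,0,0,0,0,0,0,0],
    [2,2,3,2,0,0,0,0,0,0,0,0],
    [0,0,0,0,3,2,3,2,0,0,0,0],
    [0,0,0,0,2,3,2,3,0,0,0,0],
    [0,0,0,0,0,0,0,0,3,2,3,2],
    [0,0,0,0,0,0,0,0,2,3,2,3],
], dtype=float)

U, S, Vt = np.linalg.svd(X_topic, full_matrices=False)

k = 3
doc_coords = U[:, :k] * S[:k]   # rows of U*Sigma, truncated to k directions
energy = (S[:k] ** 2).sum() / (S ** 2).sum()

print(np.round(doc_coords, 2))
print(f"share of squared singular values in the top {k}: {energy:.3f}")
```

The top three directions carry about 96% of the squared singular values, so three coordinates per document summarize most of the matrix. Two cautions apply when reading the output. The sign of each direction is arbitrary. And because the second and third singular values are equal, those two directions are determined only up to a rotation within their shared plane, so another SVD routine could return mixtures of the music and sports patterns. Note also that each topic block has only four terms: the fifth term listed for direction 1 above, player, has weight zero and appears only because the loop always prints five terms.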

22.23 Count vectors versus embeddings

Bag-of-words vectors are usually:

high-dimensional;
sparse;
based on counts;
tied to a fixed vocabulary;
unable to capture word order.

Embeddings are different.

A word embedding or sentence embedding is a learned vector designed to capture semantic relationships.

For example, embeddings try to place related words near one another:

\[ \text{vector}(\text{king}) \approx \text{vector}(\text{queen}) \]

in a meaningful geometric sense.

Modern systems learn embeddings from massive text collections.

22.24 What embeddings add

Embeddings can capture relationships that count vectors miss.

For example, the words car and automobile may never overlap in a bag-of-words representation, but a good embedding model should place them close together.

A sentence embedding can place two similar sentences near each other even if they use different words.

This is the foundation of semantic search.
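The contrast can be sketched with toy numbers. In the snippet below, the three-dimensional "embeddings" are invented by hand purely for illustration; they do not come from any trained model, and real embeddings typically have hundreds of dimensions.

```python
import numpy as np

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Bag-of-words view: each word is its own axis, so distinct words are orthogonal.
vocab = ["car", "automobile", "melody"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Invented 3-dimensional "embeddings", chosen by hand for this illustration only.
toy_embedding = {
    "car":        np.array([0.9, 0.1, 0.0]),
    "automobile": np.array([0.8, 0.2, 0.1]),
    "melody":     np.array([0.0, 0.1, 0.9]),
}

print(cosine(one_hot["car"], one_hot["automobile"]))  # 0.0
print(round(cosine(toy_embedding["car"], toy_embedding["automobile"]), 3))
print(round(cosine(toy_embedding["car"], toy_embedding["melody"]), 3))
```

With one axis per word, car and automobile have similarity exactly zero. In the dense toy space they are nearly parallel, while car and melody remain almost orthogonal.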

22.25 Matrices inside language models

Modern language models are much more complex than bag-of-words models, but linear algebra is everywhere.

They use:

embedding matrices to turn tokens into vectors;
matrix multiplication to transform representations;
dot products to compare tokens;
softmax functions to turn scores into probabilities;
attention matrices to mix information across positions;
high-dimensional vectors to represent context.

The story is still the same:

text becomes vectors, and meaning is processed through matrix operations.
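Several items from the list above fit into a few lines of NumPy. The following is a deliberately simplified sketch, not a description of any particular model: the embedding matrix is random rather than learned, the token ids are made up, and the learned query, key, and value projections that real attention layers apply are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10, 4

# Embedding matrix: one row of d numbers per token id (random stand-in for learned values).
E = rng.normal(size=(vocab_size, d))
token_ids = np.array([2, 7, 7, 1])      # a made-up 4-token input
H = E[token_ids]                        # shape (4, d): one vector per position

scores = H @ H.T / np.sqrt(d)           # dot products compare every pair of positions
scores -= scores.max(axis=1, keepdims=True)   # stabilize the exponentials
exp_scores = np.exp(scores)
weights = exp_scores / exp_scores.sum(axis=1, keepdims=True)  # row-wise softmax

mixed = weights @ H                     # each position becomes a weighted mix of all positions
print(weights.round(2))
print(mixed.shape)  # (4, 4)
```

Each row of `weights` is a probability distribution over positions, and the final matrix product is the "mixing" step: every output vector is a weighted average of the input vectors.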

22.26 Worked example: compare documents by hand

Let

\[ \mathbf{x}=\begin{bmatrix}2\\1\\0\end{bmatrix}, \qquad \mathbf{y}=\begin{bmatrix}1\\1\\1\end{bmatrix}. \]

Then

\[ \mathbf{x}\cdot \mathbf{y}=2\cdot 1+1\cdot 1+0\cdot 1=3. \]

Also,

\[ \|\mathbf{x}\|=\sqrt{2^2+1^2+0^2}=\sqrt{5}, \qquad \|\mathbf{y}\|=\sqrt{3}. \]

So

\[ \cos(\theta)=\frac{3}{\sqrt{5}\sqrt{3}}=\frac{3}{\sqrt{15}}\approx 0.775. \]

The documents point in fairly similar directions.

22.27 Practice problems

22.27.1 Problem 1

Use the vocabulary

\[ [\text{cat},\ \text{dog},\ \text{food},\ \text{music}] \]

to vectorize the following documents:

“cat dog dog”
“music food music”
“cat food dog”

Solution

The vectors are

\[ \begin{bmatrix}1\\2\\0\\0\end{bmatrix},\qquad \begin{bmatrix}0\\0\\1\\2\end{bmatrix},\qquad \begin{bmatrix}1\\1\\1\\0\end{bmatrix}. \]

22.27.2 Problem 2

Compute the cosine similarity between

\[ \mathbf{x}=\begin{bmatrix}1\\2\\0\end{bmatrix} \quad\text{and}\quad \mathbf{y}=\begin{bmatrix}2\\1\\0\end{bmatrix}. \]

Solution

\[ \mathbf{x}\cdot\mathbf{y}=1\cdot2+2\cdot1+0\cdot0=4. \]

\[ \|\mathbf{x}\|=\sqrt{5},\qquad \|\mathbf{y}\|=\sqrt{5}. \]

Therefore

\[ \cos(\theta)=\frac{4}{5}=0.8. \]

22.27.3 Problem 3

Explain in your own words why cosine similarity is often better than dot product for comparing documents of different lengths.

Solution

The dot product grows when documents are longer because longer documents usually have larger counts. Cosine similarity divides by the vector lengths, so it focuses more on direction, or relative word-use pattern, rather than total document size.

22.27.4 Problem 4

In a collection of $1000$ documents, a word appears in $10$ documents. Another word appears in $900$ documents. Which word has larger IDF? Why?

Solution

The word appearing in $10$ documents has larger IDF because it is rarer and therefore more informative for distinguishing documents.
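To attach numbers to this answer, the sketch below evaluates the smoothed formula from the inverse document frequency section, using the natural logarithm as in the chapter's NumPy code. The function name `idf` is chosen here for illustration.

```python
import math

def idf(df, n_docs):
    """Smoothed inverse document frequency, as defined earlier in the chapter."""
    return math.log((n_docs + 1) / (df + 1)) + 1

rare = idf(10, 1000)
common = idf(900, 1000)
print(f"df = 10:  idf = {rare:.3f}")
print(f"df = 900: idf = {common:.3f}")
```

The rare word receives a weight of roughly 5.5, about five times the weight of roughly 1.1 given to the common word.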

22.27.5 Problem 5

Describe one kind of meaning that bag-of-words cannot capture.

Solution

One example is word order. The sentences “dog bites person” and “person bites dog” have the same bag-of-words counts but very different meanings.

22.28 Challenge problems

22.28.1 Challenge 1: Build a tiny search engine

Create five short documents. Choose a vocabulary. Build document vectors. Then enter a query and rank the documents by cosine similarity.

22.28.2 Challenge 2: Compare raw counts and TF-IDF

Use the same documents and compare search results using raw count vectors and TF-IDF vectors. Which words changed the ranking the most?

22.28.3 Challenge 3: Hidden topics

Create a document-term matrix with at least three topics. Apply SVD and inspect the right singular vectors that correspond to the largest singular values. Can you interpret the topics?

22.28.4 Challenge 4: Sparse vectors

Construct a vocabulary with at least $1000$ possible words and simulate documents that use only $20$ words. Estimate what fraction of entries are zero.

22.29 AI companion activities

Use an AI assistant as a study partner, not as a replacement for your own thinking.

22.29.1 Activity 1

Ask:

Explain bag-of-words in the style of a story for a beginner in linear algebra.

Then improve the answer by adding one mathematical formula.

22.29.2 Activity 2

Ask:

Give me three examples where cosine similarity is better than Euclidean distance for text comparison.

Check whether the examples really depend on direction rather than length.

22.29.3 Activity 3

Ask:

Create a tiny document-term matrix with three hidden topics and explain how SVD finds them.

Then reproduce the example in Python.

22.29.4 Activity 4

Ask:

Explain the difference between TF-IDF vectors and embeddings.

Rewrite the answer in your own words.

22.30 Summary

In this chapter, we learned that text can be represented as vectors.

A vocabulary creates a coordinate system.

A document becomes a point or direction in that space.

A collection of documents becomes a matrix.

Dot products, distances, and cosine similarity compare documents.

TF-IDF changes the geometry by emphasizing informative words.

SVD can reveal hidden topic directions.

Embeddings go further by learning dense vector representations of meaning.

The mathematical lesson is simple and powerful:

Text becomes computable when language becomes linear algebra.

In the next chapter, we move from text vectors to neural networks, where matrices become trainable machines that learn representations from data.

--- title: "Chapter 22: Text as Vectors" subtitle: "How language becomes geometry" format: html: toc: true toc-depth: 3 number-sections: true code-fold: true code-tools: true jupyter: python3 --- ## Opening story: when words enter the machine A person reads a sentence and hears meaning. A computer first sees characters. To compute with language, the computer needs a translation: > words, sentences, and documents must become numbers. This is not merely a programming trick. It is a mathematical act. We choose a space, assign coordinates, and place pieces of language as points or directions in that space. Once text becomes vectors, the tools from previous chapters become available: - length tells us how much of something is present; - distance compares documents; - angle compares direction and meaning; - projection finds the best approximation; - matrices organize collections of documents; - SVD uncovers hidden topics; - neural networks learn better coordinate systems. The central message of this chapter is: > Modern language AI begins by turning text into geometry. This chapter starts with simple word counts and ends with the idea of embeddings, the vector representations used in search engines, recommendation systems, and large language models. ## Learning goals By the end of this chapter, you should be able to: 1. Explain why text must be represented numerically before machine learning can use it. 2. Build a vocabulary and use it as a coordinate system. 3. Represent documents as word-count vectors. 4. Construct a document-term matrix. 5. Compare documents using dot product, distance, and cosine similarity. 6. Explain why raw word counts are often not enough. 7. Use TF-IDF to weight informative words more strongly. 8. Interpret sparse high-dimensional vectors. 9. Use SVD to discover hidden topics in a small text collection. 10. Explain the difference between count vectors and learned embeddings. 
## 22.1 The key idea: a vocabulary is a coordinate system In ordinary geometry, we choose coordinate axes such as $x$ and $y$. For text, the axes are words. Suppose our vocabulary is $$ [\text{math},\ \text{data},\ \text{AI},\ \text{music}]. $$ This vocabulary defines a four-dimensional coordinate system. The sentence > Math and data help AI. can be represented by the vector $$ \begin{bmatrix} 1\\ 1\\ 1\\ 0 \end{bmatrix}. $$ The coordinates mean: - the word **math** appears once; - the word **data** appears once; - the word **AI** appears once; - the word **music** does not appear. This is the first bridge from language to linear algebra. ::: {.callout-tip} ## Main idea A vocabulary turns words into axes. A document becomes a point in the space defined by those axes. ::: ## 22.2 Documents as vectors Consider three short documents: $$ \begin{aligned} D_1 &: \text{"math data AI"},\\ D_2 &: \text{"music data music"},\\ D_3 &: \text{"math AI AI data"}. \end{aligned} $$ Using the vocabulary $$ [\text{math},\ \text{data},\ \text{AI},\ \text{music}], $$ the document vectors are $$ \mathbf{x}_1=\begin{bmatrix}1\\1\\1\\0\end{bmatrix},\qquad \mathbf{x}_2=\begin{bmatrix}0\\1\\0\\2\end{bmatrix},\qquad \mathbf{x}_3=\begin{bmatrix}1\\1\\2\\0\end{bmatrix}. $$ Now text has become geometry. Documents $D_1$ and $D_3$ point in similar directions because they use similar words. Document $D_2$ points in a different direction because it is more about music. ## 22.3 The bag-of-words model The representation above is called a **bag-of-words** model. It keeps word counts but ignores word order. For example, > dogs chase cats and > cats chase dogs have the same bag-of-words vector, even though their meanings are different. Bag-of-words is simple, useful, and limited. ### What it keeps It keeps information about which words appear and how often they appear. ### What it loses It loses word order, grammar, sentence structure, negation, sarcasm, and many forms of meaning. 
::: {.callout-warning} ## Important limitation The bag-of-words model treats a document like an unordered pile of words. It is often useful for search and classification, but it is not a full model of meaning. ::: ## 22.4 A document-term matrix A collection of document vectors can be stacked into a matrix. Let rows be documents and columns be words: $$ X= \begin{bmatrix} 1 & 1 & 1 & 0\\ 0 & 1 & 0 & 2\\ 1 & 1 & 2 & 0 \end{bmatrix}. $$ This is called a **document-term matrix**. Each row is a document. Each column is a word. The entry $X_{ij}$ tells how often word $j$ appears in document $i$. ::: {.callout-note} ## Two views of the same matrix - Row view: each document is a vector. - Column view: each word has a pattern across documents. ::: ## 22.5 Python: building a document-term matrix ```{python} import numpy as np import pandas as pd vocab = ["math", "data", "ai", "music"] docs = [ "math data ai", "music data music", "math ai ai data" ] X = np.array([ [1, 1, 1, 0], [0, 1, 0, 2], [1, 1, 2, 0] ]) pd.DataFrame(X, columns=vocab, index=["D1", "D2", "D3"]) ``` ## 22.6 Dot products: counting shared emphasis The dot product of two document vectors is $$ \mathbf{x}\cdot \mathbf{y}= \sum_{j=1}^n x_jy_j. $$ For text vectors, this measures shared word emphasis. If two documents use many of the same words with large counts, their dot product is large. Example: $$ \mathbf{x}_1\cdot \mathbf{x}_3 =1\cdot 1+1\cdot 1+1\cdot 2+0\cdot 0=4. $$ The documents $D_1$ and $D_3$ are similar because they share math, data, and AI language. ## 22.7 Why dot product alone can be misleading A long document tends to have larger word counts than a short document. So the dot product is influenced by document length. For example, a long general document may have a large dot product with many documents simply because it contains many words. This is why angle-based similarity is often better. 
## 22.8 Cosine similarity Cosine similarity compares direction rather than size: $$ \cos(\theta)= \frac{\mathbf{x}\cdot \mathbf{y}}{\|\mathbf{x}\|\|\mathbf{y}\|}. $$ In text analysis: - cosine similarity near $1$ means similar word-use direction; - cosine similarity near $0$ means nearly unrelated word-use direction; - cosine similarity is less sensitive to document length than raw dot product. ```{python} def cosine_similarity(x, y): x = np.asarray(x, dtype=float) y = np.asarray(y, dtype=float) return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y))) for i in range(3): for j in range(i+1, 3): print(f"cos(D{i+1}, D{j+1}) = {cosine_similarity(X[i], X[j]):.3f}") ``` ::: {.callout-tip} ## Memory phrase Distance asks: **How far apart are they?** Cosine similarity asks: **Do they point in the same semantic direction?** ::: ## 22.9 Search as vector comparison A search query can also become a vector. Suppose the query is > math AI Using the same vocabulary, this query becomes $$ \mathbf{q}=\begin{bmatrix}1\\0\\1\\0\end{bmatrix}. $$ A search engine can compare $\mathbf{q}$ with every document vector and rank documents by similarity. This is the linear algebra behind a simple search engine. ```{python} q = np.array([1, 0, 1, 0]) scores = [cosine_similarity(q, X[i]) for i in range(X.shape[0])] pd.DataFrame({"document": ["D1", "D2", "D3"], "cosine_score": scores}).sort_values("cosine_score", ascending=False) ``` ## 22.10 Raw counts versus normalized frequencies Raw counts depend on document length. A longer document naturally has more words. To reduce this effect, we can use **term frequency**. For a word $t$ in document $d$, $$ \operatorname{tf}(t,d)= \frac{\text{number of times }t\text{ appears in }d} {\text{total number of words in }d}. $$ Term frequency turns counts into proportions. ## 22.11 Common words are not always informative Words such as **the**, **and**, **of**, and **is** appear frequently, but they usually do not tell us much about the topic. 
Even within a specialized collection, words like **data** may appear everywhere. A word that appears in every document is less useful for distinguishing documents. A word that appears strongly in only a few documents is often more informative. This leads to TF-IDF. ## 22.12 Inverse document frequency Let $N$ be the number of documents. Let $\operatorname{df}(t)$ be the number of documents containing term $t$. One common version of inverse document frequency is $$ \operatorname{idf}(t)=\log\left(\frac{N+1}{\operatorname{df}(t)+1}\right)+1. $$ The $+1$ terms avoid division by zero and keep weights positive. Words that appear in many documents have smaller IDF. Words that appear in fewer documents have larger IDF. ## 22.13 TF-IDF TF-IDF combines term frequency and inverse document frequency: $$ \operatorname{tfidf}(t,d)= \operatorname{tf}(t,d)\operatorname{idf}(t). $$ TF-IDF says: > A word is important in a document if it appears often in that document but not everywhere in the whole collection. ```{python} # Simple TF-IDF calculation for our tiny matrix term_counts = X.astype(float) row_sums = term_counts.sum(axis=1, keepdims=True) tf = term_counts / row_sums df = (term_counts > 0).sum(axis=0) N = X.shape[0] idf = np.log((N + 1) / (df + 1)) + 1 tfidf = tf * idf pd.DataFrame(tfidf, columns=vocab, index=["D1", "D2", "D3"]).round(3) ``` ## 22.14 Geometry changes when we change weights Raw count vectors and TF-IDF vectors live in the same coordinate system, but the geometry changes. TF-IDF stretches rare, informative word axes and shrinks common word axes. This is similar to feature scaling in data analysis. Changing weights changes distances, angles, nearest neighbors, and search rankings. ## 22.15 Sparse vectors and high-dimensional space Real vocabularies can contain tens of thousands or millions of words. A document usually uses only a small fraction of them. So text vectors are often **high-dimensional** and **sparse**. A sparse vector has many zeros. 
For example, a document may live in a 50,000-dimensional vocabulary space but use only 200 words. This is one reason linear algebra for text requires careful computational methods. ## 22.16 Stop words and preprocessing Before building text vectors, we usually preprocess text. Common steps include: 1. converting to lowercase; 2. removing punctuation; 3. splitting text into tokens; 4. removing stop words such as **the** and **and**; 5. sometimes reducing words to roots, such as **learn**, **learning**, and **learned**. Preprocessing is not just technical housekeeping. It changes the vector representation. ::: {.callout-warning} ## Modeling choice There is no single correct preprocessing pipeline. The right choice depends on the task. ::: ## 22.17 Mini example: classification by nearest prototype Suppose we have two topic prototypes: $$ \mathbf{p}_{\text{AI}}=\text{average vector of AI documents}, $$ and $$ \mathbf{p}_{\text{music}}=\text{average vector of music documents}. $$ A new document can be classified by comparing it to each prototype. This is a simple example of classification by geometry. ```{python} ai_proto = np.array([1, 1, 2, 0], dtype=float) music_proto = np.array([0, 1, 0, 2], dtype=float) new_doc = np.array([1, 1, 1, 0], dtype=float) print("Similarity to AI prototype:", cosine_similarity(new_doc, ai_proto)) print("Similarity to music prototype:", cosine_similarity(new_doc, music_proto)) ``` ## 22.18 The document-term matrix as data The document-term matrix is a data matrix. It has the same structure as the data matrices we have studied before: $$ X=\begin{bmatrix} - & \mathbf{x}_1^T & -\\ - & \mathbf{x}_2^T & -\\ & \vdots & \\ - & \mathbf{x}_m^T & - \end{bmatrix}. $$ Rows are documents. Columns are features. Each document is a point. Each word is a coordinate. This means PCA, SVD, clustering, classification, projection, and nearest-neighbor methods can all be applied to text data. 
## 22.19 Hidden topics through SVD Text collections often have hidden structure. For example, some documents may be about AI, some about music, and some about sports. SVD can discover low-dimensional directions that summarize major patterns. If $$ X=U\Sigma V^T, $$ then: - rows of $U\Sigma$ give document coordinates in a topic-like space; - columns of $V$ describe word patterns associated with those directions; - large singular values identify strong patterns. This idea is related to **latent semantic analysis**. ## 22.20 Python: hidden topics with SVD ```{python} terms = ["ai", "model", "data", "neural", "song", "guitar", "music", "melody", "team", "score", "game", "player"] X_topic = np.array([ [3,2,3,2,0,0,0,0,0,0,0,0], [2,3,2,3,0,0,0,0,0,0,0,0], [2,2,3,2,0,0,0,0,0,0,0,0], [0,0,0,0,3,2,3,2,0,0,0,0], [0,0,0,0,2,3,2,3,0,0,0,0], [0,0,0,0,0,0,0,0,3,2,3,2], [0,0,0,0,0,0,0,0,2,3,2,3], ], dtype=float) U, S, Vt = np.linalg.svd(X_topic, full_matrices=False) print("singular values:", np.round(S, 3)) for k in range(3): top = np.argsort(np.abs(Vt[k]))[::-1][:5] print(f"Topic direction {k+1}:", [terms[i] for i in top]) ``` ## 22.21 Count vectors versus embeddings Bag-of-words vectors are usually: - high-dimensional; - sparse; - based on counts; - tied to a fixed vocabulary; - unable to understand word order deeply. Embeddings are different. A word embedding or sentence embedding is a learned vector designed to capture semantic relationships. For example, embeddings try to place related words near one another: $$ \text{vector}(\text{king}) \approx \text{vector}(\text{queen}) $$ in a meaningful geometric sense. Modern systems learn embeddings from massive text collections. ## 22.22 What embeddings add Embeddings can capture relationships that count vectors miss. For example, the words **car** and **automobile** may never overlap in a bag-of-words representation, but a good embedding model should place them close together. 
A sentence embedding can place two similar sentences near each other even if they use different words.

This is the foundation of semantic search.

## 22.23 Matrices inside language models

Modern language models are much more complex than bag-of-words models, but linear algebra is everywhere.

They use:

- embedding matrices to turn tokens into vectors;
- matrix multiplication to transform representations;
- dot products to compare tokens;
- softmax functions to turn scores into probabilities;
- attention matrices to mix information across positions;
- high-dimensional vectors to represent context.

The story is still the same:

> text becomes vectors, and meaning is processed through matrix operations.

## 22.24 Worked example: compare documents by hand

Let

$$
\mathbf{x}=\begin{bmatrix}2\\1\\0\end{bmatrix},
\qquad
\mathbf{y}=\begin{bmatrix}1\\1\\1\end{bmatrix}.
$$

Then

$$
\mathbf{x}\cdot \mathbf{y}=2\cdot 1+1\cdot 1+0\cdot 1=3.
$$

Also,

$$
\|\mathbf{x}\|=\sqrt{2^2+1^2+0^2}=\sqrt{5},
\qquad
\|\mathbf{y}\|=\sqrt{3}.
$$

So

$$
\cos(\theta)=\frac{3}{\sqrt{5}\sqrt{3}}=\frac{3}{\sqrt{15}}\approx 0.775.
$$

The documents point in fairly similar directions.

## 22.25 Practice problems

### Problem 1

Use the vocabulary

$$
[\text{cat},\ \text{dog},\ \text{food},\ \text{music}]
$$

to vectorize the following documents:

1. "cat dog dog"
2. "music food music"
3. "cat food dog"

::: {.callout-note collapse="true"}
## Solution

The vectors are

$$
\begin{bmatrix}1\\2\\0\\0\end{bmatrix},\qquad
\begin{bmatrix}0\\0\\1\\2\end{bmatrix},\qquad
\begin{bmatrix}1\\1\\1\\0\end{bmatrix}.
$$
:::

### Problem 2

Compute the cosine similarity between

$$
\mathbf{x}=\begin{bmatrix}1\\2\\0\end{bmatrix}
\quad\text{and}\quad
\mathbf{y}=\begin{bmatrix}2\\1\\0\end{bmatrix}.
$$

::: {.callout-note collapse="true"}
## Solution

$$
\mathbf{x}\cdot\mathbf{y}=1\cdot2+2\cdot1+0\cdot0=4.
$$

$$
\|\mathbf{x}\|=\sqrt{5},\qquad \|\mathbf{y}\|=\sqrt{5}.
$$

Therefore

$$
\cos(\theta)=\frac{4}{5}=0.8.
$$
:::

### Problem 3

Explain in your own words why cosine similarity is often better than dot product for comparing documents of different lengths.

::: {.callout-note collapse="true"}
## Solution

The dot product grows when documents are longer because longer documents usually have larger counts. Cosine similarity divides by the vector lengths, so it focuses on direction, that is, the relative word-use pattern, rather than total document size.
:::

### Problem 4

In a collection of $1000$ documents, a word appears in $10$ documents. Another word appears in $900$ documents.

Which word has larger IDF? Why?

::: {.callout-note collapse="true"}
## Solution

The word appearing in $10$ documents has larger IDF because it is rarer and therefore more informative for distinguishing documents.
:::

### Problem 5

Describe one kind of meaning that bag-of-words cannot capture.

::: {.callout-note collapse="true"}
## Solution

One example is word order. The sentences "dog bites person" and "person bites dog" have the same bag-of-words counts but very different meanings.
:::

## 22.26 Challenge problems

### Challenge 1: Build a tiny search engine

Create five short documents. Choose a vocabulary. Build document vectors. Then enter a query and rank the documents by cosine similarity.

### Challenge 2: Compare raw counts and TF-IDF

Use the same documents and compare search results using raw count vectors and TF-IDF vectors.

Which words changed the ranking the most?

### Challenge 3: Hidden topics

Create a document-term matrix with at least three topics. Apply SVD and inspect the right singular vectors that correspond to the largest singular values.

Can you interpret the topics?

### Challenge 4: Sparse vectors

Construct a vocabulary with at least $1000$ possible words and simulate documents that use only $20$ words.

Estimate what fraction of entries are zero.

## 22.27 AI companion activities

Use an AI assistant as a study partner, not as a replacement for your own thinking.
### Activity 1

Ask:

> Explain bag-of-words in the style of a story for a beginner in linear algebra.

Then improve the answer by adding one mathematical formula.

### Activity 2

Ask:

> Give me three examples where cosine similarity is better than Euclidean distance for text comparison.

Check whether the examples really depend on direction rather than length.

### Activity 3

Ask:

> Create a tiny document-term matrix with three hidden topics and explain how SVD finds them.

Then reproduce the example in Python.

### Activity 4

Ask:

> Explain the difference between TF-IDF vectors and embeddings.

Rewrite the answer in your own words.

## 22.28 Summary

In this chapter, we learned that text can be represented as vectors.

A vocabulary creates a coordinate system.

A document becomes a point or direction in that space.

A collection of documents becomes a matrix.

Dot products, distances, and cosine similarity compare documents.

TF-IDF changes the geometry by emphasizing informative words.

SVD can reveal hidden topic directions.

Embeddings go further by learning dense vector representations of meaning.

The mathematical lesson is simple and powerful:

> Text becomes computable when language becomes linear algebra.

In the next chapter, we move from text vectors to neural networks, where matrices become trainable machines that learn representations from data.