Lab 04 — Data as Points

Feature spaces, data clouds, scaling, clusters, outliers, and high-dimensional intuition.

1. The Big Idea

A dataset is more than a table. After we choose features, each row becomes a point in a feature space. Many rows become a data cloud.

Main message: Data becomes geometry. Linear algebra gives us tools for studying the shape of that geometry.

Object

One student, house, song, image, document, or customer.

Vector

A numerical description of one object.

Dataset

Many vectors collected together.

Cloud

The geometric picture made by all points.

2. Rows, Columns, and the Data Matrix

In many data science settings, rows are objects and columns are features.

Student  Hours  Sleep  Score
A        2      7      70
B        5      6      85
C        1      8      65
D        7      5      92
X = [[2, 7, 70], [5, 6, 85], [1, 8, 65], [7, 5, 92]]
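Selected row vector: one row of X, which describes one student. For example, student A is (2, 7, 70).
Selected feature vector: one column of X, which lists one feature for every student. For example, Hours is (2, 5, 1, 7).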
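As a minimal sketch, the matrix X above can be stored as a list of rows in plain Python. The helper names row_vector and feature_vector are illustrative assumptions, not part of the lab page.

```python
# Data matrix from the table: one row per student, one column per feature.
students = ["A", "B", "C", "D"]
features = ["Hours", "Sleep", "Score"]
X = [[2, 7, 70], [5, 6, 85], [1, 8, 65], [7, 5, 92]]

def row_vector(X, i):
    """Return row i: every feature of one object (a point in R^3 here)."""
    return list(X[i])

def feature_vector(X, j):
    """Return column j: one feature measured across all objects."""
    return [row[j] for row in X]

print(students[1], row_vector(X, 1))      # all features of student B
print(features[0], feature_vector(X, 0))  # Hours for every student
```

Reading along a row answers "what does this student look like?", while reading down a column answers "how does this feature vary across students?".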

3. Interactive Data Cloud Builder

Use the sliders to change the slope, noise, and number of points. Watch the shape of the cloud change.
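The page does not show how its cloud is generated, so the following is only a sketch under an assumed model: each point has a random x and a y equal to slope times x plus Gaussian noise. The function names make_cloud and fitted_slope are hypothetical.

```python
import random

def make_cloud(slope, noise, n, seed=0):
    """Generate n points (x, y) with y = slope * x + Gaussian noise (assumed model)."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = slope * x + rng.gauss(0.0, noise)
        points.append((x, y))
    return points

def fitted_slope(points):
    """Least-squares slope of y on x, computed from centered sums."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

tight = make_cloud(slope=2.0, noise=0.05, n=200)  # thin, line-like cloud
loose = make_cloud(slope=2.0, noise=1.0, n=200)   # wide, blurry cloud
print(fitted_slope(tight), fitted_slope(loose))
```

With little noise the cloud hugs a line and the fitted slope stays close to the chosen one; with more noise the cloud thickens and the estimate becomes less stable.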

4. Scaling Changes Distance

The same houses can look geometrically different when price is measured in dollars, thousands of dollars, or standardized units.

Warning: A distance computation is only as meaningful as the feature representation behind it.
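To make the warning concrete, here is a small sketch with three made-up houses described by price and number of bedrooms. The numbers, the bedrooms feature, and the helper names are all assumptions chosen for illustration; standardization here means subtracting the mean and dividing by the population standard deviation of each feature.

```python
import math
import statistics

# Hypothetical houses (made-up numbers): (price in dollars, bedrooms).
houses = {"P": (300_000, 2), "Q": (301_000, 6), "R": (303_000, 2)}

def distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def in_thousands(data):
    """Same houses, with price measured in thousands of dollars."""
    return {k: (price / 1000, beds) for k, (price, beds) in data.items()}

def standardized(data):
    """Rescale each feature to mean 0 and population standard deviation 1."""
    columns = list(zip(*data.values()))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return {
        k: tuple((x - m) / s for x, m, s in zip(row, means, stds))
        for k, row in data.items()
    }

def nearest_to(data, target):
    """Name of the house closest to the target house."""
    others = [k for k in data if k != target]
    return min(others, key=lambda k: distance(data[target], data[k]))

for label, data in [
    ("dollars", houses),
    ("thousands", in_thousands(houses)),
    ("standardized", standardized(houses)),
]:
    print(label, "-> nearest to P:", nearest_to(data, "P"))
```

In dollars, price differences swamp the bedroom count, so the nearest house is decided almost entirely by price. Changing the unit changes which feature dominates, and with these numbers it also changes which house counts as the nearest neighbor of P.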

5. Clusters

Clusters are groups of nearby points. Change the separation and spread to see when clusters become clear or confusing.
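The effect of separation and spread can be sketched with two simulated clusters. This assumes Gaussian clusters centered on the horizontal axis, which may differ from what the page draws; the function names and the "own side" score are hypothetical choices for illustration.

```python
import random

def make_two_clusters(separation, spread, n_per_cluster=100, seed=0):
    """Two Gaussian clusters whose centers are `separation` apart (assumed model)."""
    rng = random.Random(seed)
    centers = [(-separation / 2, 0.0), (separation / 2, 0.0)]
    points = []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_cluster):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread), label))
    return centers, points

def fraction_on_own_side(centers, points):
    """Share of points that lie closer to their own center than to the other one."""
    correct = 0
    for x, y, label in points:
        d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers]
        if d[label] < d[1 - label]:
            correct += 1
    return correct / len(points)

for separation, spread in [(10.0, 0.5), (2.0, 1.0), (0.5, 3.0)]:
    centers, points = make_two_clusters(separation, spread)
    print(separation, spread, fraction_on_own_side(centers, points))
```

A score near 1 means the clusters are clearly separated; a score near 0.5 means the two groups overlap so much that position says little about membership.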

6. Outliers

An outlier is a point far from the main cloud. It may be an error, a special case, or an important discovery.
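One simple way to turn "far from the main cloud" into a rule is sketched below. It is an assumed heuristic, not the page's method: measure each point's distance from the coordinate-wise median, and flag points whose distance is more than a chosen factor times the median distance. The data and the factor of 3 are made up.

```python
import math
import statistics

def flag_outliers(points, factor=3.0):
    """Flag points much farther from the coordinate-wise median than is typical."""
    center = tuple(statistics.median(c) for c in zip(*points))
    dists = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(p, center)))
        for p in points
    ]
    typical = statistics.median(dists)  # robust notion of a "normal" distance
    return [p for p, d in zip(points, dists) if d > factor * typical]

# Hypothetical cloud: five nearby points and one far away.
cloud = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (0.9, 1.1), (1.2, 1.0), (6.0, 7.0)]
print(flag_outliers(cloud))
```

Medians are used instead of means so that the outlier itself does not drag the center toward it. Flagging a point is only the first step; deciding whether it is an error, a special case, or a discovery still requires looking at the object behind it.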

7. Nearest Neighbor Classification

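Move the new point. It is assigned to the class whose center is closer. Strictly speaking, this is a nearest-centroid rule; a nearest-neighbor classifier would instead compare the new point with individual training points.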
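The rule used here, comparing the new point with each class center, can be sketched as follows. The two classes and their points are made up, and the names centroid and classify are hypothetical.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def distance(u, v):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def classify(new_point, class_points):
    """Return the label of the class whose center is closest to new_point."""
    centers = {label: centroid(pts) for label, pts in class_points.items()}
    return min(centers, key=lambda label: distance(new_point, centers[label]))

# Hypothetical training data for two classes.
training = {
    "red": [(1, 1), (2, 1), (1, 2)],
    "blue": [(6, 6), (7, 6), (6, 7)],
}
print(classify((2, 2), training))
print(classify((5, 5), training))
```

Because only the centers matter, the boundary between the two classes is the set of points equally far from both centers, which in the plane is a straight line.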

8. Image as a Point

Draw a small 8 by 8 image. The image is also a vector in R^64.

Flattened vector

Each square is one coordinate. A full-size image can have thousands or millions of coordinates.
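Flattening can be sketched in a few lines. This assumes row-major order (rows read left to right, top to bottom), which is a common convention but not confirmed by the page; the diagonal test image is made up.

```python
def flatten(image):
    """Row-major flattening: concatenate the rows from top to bottom."""
    return [pixel for row in image for pixel in row]

def unflatten(vector, width):
    """Inverse of flatten: cut the vector back into rows of the given width."""
    return [vector[i:i + width] for i in range(0, len(vector), width)]

# Hypothetical 8 by 8 image: 1 on the diagonal, 0 elsewhere.
image = [[1 if r == c else 0 for c in range(8)] for r in range(8)]

v = flatten(image)
print(len(v))   # number of coordinates
print(v[:16])   # first two rows of the image, laid end to end
```

Under this convention the square in row r and column c (counting from 0) becomes coordinate 8 * r + c of the vector, so no information is lost and the image can be rebuilt from the vector.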

9. High-Dimensional Distance

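Random points tend to get farther apart as dimension increases. For points drawn uniformly from the unit cube, the typical distance grows roughly like the square root of the dimension, while the spread of the distances shrinks relative to their average, so most pairs end up at nearly the same distance. This is one reason high-dimensional intuition is different.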
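A small simulation can illustrate this. It assumes points drawn uniformly from the unit cube, which may differ from the page's setup; the function name and the sample size are arbitrary choices.

```python
import math
import random
import statistics

def pairwise_distance_stats(dim, n_points=60, seed=0):
    """Mean pairwise distance and relative spread for uniform random points."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = []
    for i in range(n_points):
        for j in range(i + 1, n_points):
            dists.append(
                math.sqrt(sum((a - b) ** 2 for a, b in zip(pts[i], pts[j])))
            )
    mean = statistics.mean(dists)
    relative_spread = statistics.pstdev(dists) / mean
    return mean, relative_spread

for dim in [2, 10, 100, 1000]:
    mean, rel = pairwise_distance_stats(dim)
    print(dim, round(mean, 3), round(rel, 3))
```

The mean distance rises with the dimension while the relative spread falls, so "near" and "far" neighbors become harder to tell apart. This matters for any method that relies on distance, including the clustering and nearest-center ideas earlier in the lab.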

10. Reflection

Write a short response after using the page.

feature space, data cloud, scaling, clusters, outliers, high dimension