Geometric and Topological Methods in ML
Notes and relevant papers / readings for geometric and topological methods in machine learning. IN PROGRESS.
2. Geometric graphs, construction, clustering
Basics: metric spaces / metrics, graphs. Geometric graphs have vertices as points in a metric space, with edges defined by low distances in the metric space.
kNN and ε-NN graphs. kNN graph construction is typically done via a k-d tree.
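As a quick illustration, here is a minimal sketch of kNN graph construction with a k-d tree via SciPy; the data matrix `X` and the value of `k` are placeholder choices, not anything prescribed by these notes.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_graph(X, k=10):
    """Return the indices and distances of each point's k nearest neighbors."""
    tree = cKDTree(X)                    # build the k-d tree over the data points
    dists, idx = tree.query(X, k=k + 1)  # k+1: each point is returned as its own nearest neighbor
    return idx[:, 1:], dists[:, 1:]      # drop the self-neighbor column

X = np.random.rand(100, 3)               # toy data
neighbors, distances = knn_graph(X, k=5)
```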
Shortest path methods: Dijkstra's (single source), Floyd-Warshall (all pairs).
Dijkstra: greedily settles the closest unvisited node; O((|V| + |E|) log |V|) with a binary heap.
Floyd-Warshall: dynamic programming over allowed intermediate vertices; O(|V|^3).
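For concreteness, a small sketch using SciPy's csgraph routines (an assumed tool choice), run on a toy weighted adjacency matrix:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

# toy weighted adjacency matrix of a small geometric graph (0 = no edge)
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.0, 2.0, 0.0]])

# single-source Dijkstra from node 0
d_from_0 = shortest_path(W, method='D', directed=False, indices=0)

# all-pairs Floyd-Warshall
d_all_pairs = shortest_path(W, method='FW', directed=False)
```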
We are usually interested in using shortest paths to mimic geodesic distances. Also, Alamgir and von Luxburg (ICML 2012) show that shortest paths in unweighted kNN graphs behave strangely: minimizing the number of hops makes paths cut through sparse regions where a single hop spans a large metric distance. With edge weights, the shortest-path distance instead converges to the geodesic distance between the two points.
Manifold learning based graph construction
Distances are converted to affinities via kernels (e.g. the Gaussian kernel):
$$K(x, y) = \exp\left( -\frac{\| x - y \|^2}{\epsilon} \right)$$
The process usually goes: data → compute distances → compute affinities.
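A minimal sketch of that pipeline, assuming Euclidean distances and a hypothetical bandwidth `epsilon`:

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_affinities(X, epsilon=1.0):
    """data -> pairwise distances -> affinities."""
    D = cdist(X, X)                     # pairwise Euclidean distances
    return np.exp(-(D ** 2) / epsilon)  # Gaussian kernel affinities

X = np.random.rand(50, 2)
K = gaussian_affinities(X, epsilon=0.5)
```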
Graph-based clustering / community detection
Hierarchical clustering:
- Link neighbors
- Link clusters
How do we measure distance between clusters? There are a few options: single, average, and complete linkage. Single linkage uses the distance between the two closest points of the two clusters, complete linkage uses the two furthest points, and average linkage uses the average pairwise distance. Single linkage is the most common (a minimal sketch follows the steps below):
- start with every point in its own cluster
- merge the two closest clusters
- recompute distances based on min distance to the newly formed cluster
- repeat
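A single-linkage sketch using SciPy's hierarchical clustering utilities (the library choice and the cut into 3 clusters are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(30, 2)

# 'single' = distance between the two closest points of two clusters;
# 'complete' and 'average' are the other linkages mentioned above
Z = linkage(X, method='single')

# cut the dendrogram into (at most) 3 flat clusters
labels = fcluster(Z, t=3, criterion='maxclust')
```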
Modularity
$A_{ij}$ is the actual affinity between nodes $i$ and $j$.
$P_{ij}$ is the expected probability of an edge existing between $i$ and $j$ under a random null model, calculated by:
- generating random edges while preserving the node degrees $k_i$ (the configuration model)
- then the probability is $P_{ij} = \frac{k_i k_j}{2m}$, where $m$ is the total number of edges in the graph.
Modularity is the sum over all pairs of nodes in the same community:
$$Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)$$
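As a sanity check on the formula, a small sketch computing $Q$ directly from an adjacency matrix (the toy graph and labels are made up):

```python
import numpy as np

def modularity(A, communities):
    """Q for an undirected (possibly weighted) adjacency matrix A and one community label per node."""
    k = A.sum(axis=1)        # (weighted) node degrees
    two_m = A.sum()          # 2m: every undirected edge is counted twice in A
    Q = 0.0
    for i in range(len(A)):
        for j in range(len(A)):
            if communities[i] == communities[j]:
                Q += A[i, j] - k[i] * k[j] / two_m
    return Q / two_m

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(modularity(A, [0, 0, 0, 1]))
```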
Modularity optimization: Louvain used to be standard, now most use Leiden. Louvain has two phases:
- Finding communities:
- Consider each node individually
- See if the node should be moved into a neighboring community: select the community with the greatest change in modularity. If the best move yields a positive change in modularity, make it.
- Check only neighbors of the node
- Coarse graining:
- Each community in phase 1 becomes a node in phase 2
- Repeat phase 1
Why do we care about modularity? Based on its definition, we can think of it as a measure of the strength of a graph's division into modules / groups / communities. Networks with high modularity have dense connections between nodes within modules but sparse connections between nodes in different modules.
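In practice we rarely hand-roll Louvain; here is a sketch using NetworkX's built-in implementation, assuming a recent NetworkX version (>= 2.8) that ships `louvain_communities`. Leiden lives in separate packages such as leidenalg.

```python
import networkx as nx

# toy graph with two obvious communities
G = nx.barbell_graph(5, 0)

communities = nx.community.louvain_communities(G, seed=0)
Q = nx.community.modularity(G, communities)
print(communities, Q)
```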
Recommended reading
- Alamgir et al., Shortest Path Distances in k-Nearest Neighbor Graphs, ICML 2012
- Lishan et al., Data-driven graph construction and graph learning: A review, Neurocomputing 2018
- Kd-trees (Wikipedia)
- Dijkstra’s algorithm (Wikipedia)
- Floyd Warshall algorithm (Wikipedia)
- Newman, Modularity and community structure in networks, PNAS 2006
- Blondel et al., Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment, 2008
3. Data diffusion, diffusion maps, spectral clustering
Representation Learning
We want to find the “best” representation of our data; i.e., we want to preserve as much information as possible while obeying some penalty or constraint that simplifies the representation.
What kind of information should be preserved, and how do we define “simple”?
- Low-dimensional
- Sparse
- Independent components
These criteria aren’t necessarily mutually exclusive. Simple representations are good for interpretability (visualization), reducing computational cost (compression), and performance on supervised tasks (can be easier to classify / regress from).
Isomap (Tenenbaum et al., 2000)
- Construct neighborhood graph
- Find all pairs shortest paths
- Apply MDS to compute low-dimensional coordinates that preserve distances
MDS (Multidimensional Scaling)
MDS finds an embedding of the data that preserves distances by optimizing point positions such that the “stress” is minimized:
$$\text{Stress}(x_1, \dots, x_n) = \left( \sum_{i < j} \left( d_{ij} - \| x_i - x_j \| \right)^2 \right)^{1/2}$$
where $d_{ij}$ is the input (e.g. shortest-path) distance between points $i$ and $j$, and the $x_i$ are the embedding coordinates.
This can be done with gradient descent and other related procedures.
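A minimal gradient-descent sketch of metric MDS on a precomputed distance matrix `D_target` (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def mds_gradient_descent(D_target, dim=2, lr=0.01, n_iter=500, seed=0):
    """Minimize the raw stress sum_{i<j} (d_ij - ||x_i - x_j||)^2 by gradient descent."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(D_target.shape[0], dim))   # random initial coordinates
    for _ in range(n_iter):
        diff = X[:, None, :] - X[None, :, :]        # pairwise coordinate differences
        dist = np.linalg.norm(diff, axis=-1)        # current embedding distances
        np.fill_diagonal(dist, 1.0)                 # avoid division by zero on the diagonal
        coef = (dist - D_target) / dist             # (d_ij - D_ij) / d_ij factor of the gradient
        np.fill_diagonal(coef, 0.0)
        grad = 2.0 * (coef[:, :, None] * diff).sum(axis=1)
        X -= lr * grad
    return X
```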
Isomap’s weakness: real data is often very noisy, and the shortest-path step turns that noise into noisy distances between points. In practice, UMAP or PHATE (geometry preserving, common in biology) are more commonly used.
Diffusion maps
Diffusion distance can be more robust to noise! Introduced by Coifman and Lafon. 3-step process:
- Compute affinities with Gaussian kernel
- Compute degree matrix
- Markovian diffusion operator gives us coordinates
- Recall kernel functions (discussed previously for converting distances to affinities). If $\mathcal{X}$ is a measurable space, a kernel function is a mapping $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ that is symmetric ($K(x, y) = K(y, x)$) and non-negative ($K(x, y) \ge 0$) for any $x, y \in \mathcal{X}$.
Kernels are generally nonlinear. Diffusion maps are specifically defined with respect to the Gaussian kernel:
$$K(x, y) = \exp\left( -\frac{\| x - y \|^2}{\epsilon} \right)$$
where $\epsilon$ is a bandwidth parameter. This kernel captures a notion of affinity/similarity between points in $\mathcal{X}$.
The Gaussian kernel emphasizes local connections and pushes non-local affinities toward 0 in a “softer” way than kNN. We can also threshold these outputs for computational purposes (so that the affinity graph is not fully connected).
- After computing affinities with the Gaussian kernel, we define the degree matrix. For a weighted graph, the degree of vertex $x_i$ is defined as the sum of its affinities to its neighbors:
$$d(x_i) = \sum_j K(x_i, x_j)$$
The degree matrix $D$ is a diagonal matrix where $D_{ii} = d(x_i)$.
- We can now define the “diffusion operator”
$$P = D^{-1} K,$$
a matrix of transition probabilities with row-normalized affinities. Therefore, this matrix is Markov (each row sums to 1) and describes a random walk over the data.
Since $P$ represents the 1-step transition probabilities, $P^t$ represents the $t$-step random walk probabilities.
The diffusion map representation is a spectral embedding; that is, $P$ has a spectral decomposition that we can use to assign coordinates in the embedding to each data point.
Consider the matrix $M$ where
$$M = D^{1/2} P D^{-1/2} = D^{-1/2} K D^{-1/2}.$$
Then $M$ is symmetric, so there exists an eigendecomposition. In fact, $M$ and $P$ are similar, so they must have the same eigenvalues and related eigenvectors.
Let the eigenvectors of $M$ be $v_i$, with eigenvalues $\lambda_i$.
- A left eigenvector of $P$, $\phi_i$, would be $\phi_i = D^{1/2} v_i$.
- A right eigenvector of $P$, $\psi_i$, would be $\psi_i = D^{-1/2} v_i$.
These eigenpairs can be used to compute the spectral decomposition of $P$:
$$P = \sum_i \lambda_i \psi_i \phi_i^\top$$
For a chosen diffusion time $t$ and number of components $k$, where each $\psi_i$ is a column (right) eigenvector of $P$, we can map a datapoint $x_j$ to its embedding coordinates as:
$$\Psi_t(x_j) = \left( \lambda_1^t \psi_1(x_j), \lambda_2^t \psi_2(x_j), \dots, \lambda_k^t \psi_k(x_j) \right)$$
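Putting the three steps together, a minimal numpy sketch of the diffusion map coordinates (the bandwidth `epsilon`, diffusion time `t`, and number of components are illustrative choices):

```python
import numpy as np
from scipy.spatial.distance import cdist

def diffusion_map(X, epsilon=1.0, n_components=2, t=1):
    """Diffusion map coordinates via the symmetric conjugate M = D^{-1/2} K D^{-1/2}."""
    K = np.exp(-cdist(X, X) ** 2 / epsilon)   # Gaussian affinities
    d = K.sum(axis=1)                         # degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    M = D_inv_sqrt @ K @ D_inv_sqrt           # symmetric; same eigenvalues as P = D^{-1} K
    eigvals, eigvecs = np.linalg.eigh(M)      # ascending eigenvalues
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    psi = D_inv_sqrt @ eigvecs                # right eigenvectors of P
    # drop the trivial constant coordinate (lambda_0 = 1), weight by lambda^t
    return psi[:, 1:n_components + 1] * eigvals[1:n_components + 1] ** t
```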
Properties of the spectrum
- Eigenvalues are in the range $[0, 1]$ (the Gaussian affinity matrix is positive semidefinite and $P$ is row-stochastic)
- $\lambda_0 = 1$ is always an eigenvalue
- The first right eigenvector $\psi_0$ is the constant vector of all 1’s; the corresponding left eigenvector $\phi_0$ is the steady state / stationary distribution
- Stationary distribution given as $\pi_i = \frac{d(x_i)}{\sum_j d(x_j)}$
- The Markov chain is reversible, i.e. $\pi_i P_{ij} = \pi_j P_{ji}$
Powers of the matrix make the lower eigenvalues even lower: the eigenvalues of $P^t$ are $\lambda_i^t$, so eigenvalues below 1 decay toward 0 as $t$ grows.
Hence powers of $P$ denoise in a low-pass-filtering manner. When we map to the new coordinates, the spectrum decay lets us use only the first $k$ eigenvalues/eigenvectors while still maintaining accuracy, which reduces our dimensionality.
Diffusion distance
Diffusion maps preserve the diffusion distance, a family of distances (indexed by the number of steps $t$) between the transition probability distributions of two points:
$$D_t(x_i, x_j)^2 = \sum_{l} \frac{\left( P^t_{il} - P^t_{jl} \right)^2}{\pi_l}$$
This is essentially a Euclidean distance between rows of the powered diffusion operator, weighted by the stationary distribution.
Diffusion distance can also be interpreted as the overall difference in walking probabilities after $t$ steps when starting at point $x_i$ versus point $x_j$.
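A short sketch of this quantity, reusing the diffusion operator `P` and stationary distribution `pi` (degrees normalized to sum to 1) computed as above:

```python
import numpy as np

def diffusion_distance(P, pi, t, i, j):
    """Weighted L2 distance between rows i and j of P^t, with weights 1/pi."""
    Pt = np.linalg.matrix_power(P, t)   # t-step transition probabilities
    return np.sqrt(np.sum((Pt[i] - Pt[j]) ** 2 / pi))
```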
Diffusion maps do a good job at cleaning data, but are not necessarily great at visualization. This is partially because diffusion maps split information into multiple orthogonal directions, e.g. plotting in 2-D will likely only show us 2-ish trajectories.
Partitioning data with K-means
In k-means, we want to find splits/partitions $S = \{S_1, \dots, S_k\}$ that minimize the variance within each cluster (a minimal sketch follows below):
$$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \left\| x - \mu_i \right\|^2$$
where $\mu_i$ is the mean of cluster $S_i$.
- initialize: pick random cluster centroids
- assignment: assign each datapoint to cluster that has the nearest mean
- update: recompute the mean of each cluster, and repeat the assignment/update steps until assignments stop changing
k-means converges because each reassignment and mean update can only lower (or keep) the objective, the within-cluster sum of squared errors:
$$\sum_{i=1}^{k} \sum_{x \in S_i} \left\| x - \mu_i \right\|^2$$
Therefore, this optimization will reach a local minimum. However, k-means only works well on data with certain characteristics. It assumes:
- k is chosen correctly
- data is distributed normally around the mean
- clusters all have equal variance
- clusters all have the same number of points
It is also sensitive to initialization and outliers.
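A minimal sketch of Lloyd's algorithm, the standard alternating scheme described above (the random initialization and iteration cap are arbitrary choices):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initialization
    for _ in range(n_iter):
        # assignment step: nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # update step: recompute each cluster mean (keep old centroid if a cluster empties)
        new_centroids = np.array([X[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
                                  for c in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```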
An alternative is spectral clustering. Spectral clustering can use eigenvectors of the diffusion operator or of the graph Laplacian. Advantages of using the spectrum for k-means:
- good for clusters of arbitrary shape
- good for graph data
- only need connectivity data
How it works (a minimal sketch follows these steps):
- construct a similarity graph
- compute diffusion operator or graph Laplacian
- using top k eigenvectors, run k-means
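A sketch of these steps with a Gaussian similarity graph and the top eigenvectors of the normalized affinity matrix fed to k-means; the row normalization follows the common Ng-Jordan-Weiss variant, and the bandwidth and library choices are assumptions:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, k, epsilon=1.0):
    K = np.exp(-cdist(X, X) ** 2 / epsilon)      # similarity graph (Gaussian affinities)
    d = K.sum(axis=1)
    M = K / np.sqrt(np.outer(d, d))              # symmetrically normalized affinity matrix
    _, eigvecs = np.linalg.eigh(M)
    top = eigvecs[:, -k:]                        # top-k eigenvectors (eigh sorts ascending)
    top = top / np.maximum(np.linalg.norm(top, axis=1, keepdims=True), 1e-12)  # row-normalize
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(top)
```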
Recommended reading
- Nadler et al., Diffusion maps, spectral clustering, and eigenfunctions of Fokker-Planck operators, NeurIPS 2005
- Coifman and Lafon, Diffusion Maps, Applied and computational harmonic analysis 2006
- Tenenbaum et al., A global geometric framework for non-linear dimensionality reduction, Science 2000 (Isomap)