Geometric and Topological Methods in ML

Notes and relevant papers / readings for geometric and topological methods in machine learning. IN PROGRESS.

2. Geometric graphs, construction, clustering

Basics: metric spaces / metrics, graphs. Geometric graphs have vertices as points in a metric space, with edges defined by low distances in the metric space.

kNN, $\epsilon$-NN graphs. kNN construction via KD tree.
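
A minimal sketch of both constructions using scikit-learn (which uses a KD tree or ball tree internally for low-dimensional data); the toy data, `n_neighbors`, and `radius` values are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph, radius_neighbors_graph

X = np.random.default_rng(0).normal(size=(200, 3))   # toy data: 200 points in R^3

# kNN graph: connect each point to its k nearest neighbors.
# mode="distance" stores edge lengths as weights.
knn_graph = kneighbors_graph(X, n_neighbors=5, mode="distance")

# epsilon-NN graph: connect all pairs within distance epsilon.
eps_graph = radius_neighbors_graph(X, radius=0.75, mode="distance")
```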

Shortest path methods: Dijkstra's (single source, $O(|E| + |V|\log |V|)$), Floyd Warshall (all pairs, $O(|V|^3)$).

Dijkstra: notes

Floyd Warshall: notes
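
A small sketch of computing these shortest-path distances on a distance-weighted kNN graph with SciPy, which implements both algorithms; the toy data and parameters are assumptions:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

X = np.random.default_rng(0).normal(size=(200, 3))        # toy data
G = kneighbors_graph(X, n_neighbors=5, mode="distance")   # distance-weighted kNN graph

d_dijkstra = shortest_path(G, method="D", directed=False)   # Dijkstra from every source
d_floyd = shortest_path(G, method="FW", directed=False)     # Floyd Warshall, all pairs
```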

We are usually interested in using shortest paths to mimic geodesic distances. Alamgir and von Luxburg (ICML 2012) show that unweighted kNN graphs lead to weird paths (the path wants to make the fewest hops, so it tends to cut through sparse regions where a single hop spans a large metric distance), while in distance-weighted graphs the shortest-path distance converges to the geodesic distance between the two points.

Manifold learning based graph construction

Distances are converted to affinities via kernels (e.g. the Gaussian kernel):

$$\text{affinity}_{i,j} = \exp \left( - \frac{\text{dist}(x_i, x_j)^2}{2 \sigma^2} \right).$$

The process usually goes data $\rightarrow$ calculate distances $\rightarrow$ calculate affinities.
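
A short sketch of that pipeline with the Gaussian kernel above; the median-distance bandwidth heuristic is an assumption, not a rule from these notes:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.default_rng(0).normal(size=(200, 3))   # toy data

D = squareform(pdist(X))                              # pairwise Euclidean distances
sigma = np.median(D)                                  # bandwidth heuristic (an assumption)
A = np.exp(-(D ** 2) / (2 * sigma ** 2))              # affinity_{ij} per the formula above
```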

Graph-based clustering / community detection

Hierarchical clustering:

How do we measure distance between clusters? There are a few options: single, average, and complete linkage. Single linkage uses the distance between the two closest points of the two clusters, complete linkage uses the two furthest points, and average linkage uses the mean pairwise distance. Single linkage is the most common.
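
A brief sketch of the three linkages with SciPy's hierarchical clustering; the toy data and the 4-cluster cut are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(200, 3))
d = pdist(X)                                   # condensed pairwise distances

Z_single = linkage(d, method="single")         # closest pair between clusters
Z_complete = linkage(d, method="complete")     # furthest pair between clusters
Z_average = linkage(d, method="average")       # mean pairwise distance

labels = fcluster(Z_single, t=4, criterion="maxclust")   # cut the dendrogram into 4 clusters
```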

Modularity

$A_{ij}$ is the actual affinity between nodes $i$ and $j$.

$P_{ij}$ is the probability of an edge existing between $i$ and $j$ under a null model; for the configuration model this is $P_{ij} = \frac{k_i k_j}{2m}$, where $k_i$ is the (weighted) degree of node $i$ and $m$ is the total edge weight.

Modularity is the sum of $A_{ij} - P_{ij}$ over all pairs of nodes in the same community:

$$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)$$

Modularity optimization: Louvain used to be standard, now most use Leiden. Louvain has two phases: (1) local moving, where each node is greedily moved to the neighboring community that most increases modularity, and (2) aggregation, where each community is collapsed into a single node and the process repeats on the coarsened graph.

Why do we care about modularity? Based on its definition, we can think of it as a measure of the strength of a division of a graph into modules / groups / communities. Networks with high modularity have dense connections between nodes within modules but sparse connections between nodes in different modules.
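
A minimal sketch of modularity-based community detection, assuming a reasonably recent networkx (≥ 2.8) for Louvain; Leiden lives in separate packages (e.g. leidenalg) and is not shown:

```python
import networkx as nx

G = nx.karate_club_graph()                              # small example graph

communities = nx.community.louvain_communities(G, seed=0)   # Louvain partition
Q = nx.community.modularity(G, communities)                 # modularity of that partition
print(len(communities), Q)
```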

3. Data diffusion, diffusion maps, spectral clustering

Representation Learning

We want to find the “best” representation of our data; i.e., we want to preserve as much information as possible while obeying some penalty that simplifies the representation.

What kind of information should be preserved, and how do we define “simple”?

These criteria aren’t necessarily mutually exclusive. Simple representations are good for interpretability (visualization), reducing computational cost (compression), and performance on supervised tasks (can be easier to classify / regress from).

Isomap (Tenenbaum, 2000)

MDS (Multidimensional Scaling)

MDS finds an embedding of the data that preserves distances by optimizing point positions such that “stress” is minimized:

$$\text{stress}_{D}(x_1, x_2, \ldots, x_N) = \sqrt{ \sum_{i \neq j = 1, \ldots, N} \left( d_{ij} - \|x_i - x_j\| \right)^2 }.$$

This can be done with gradient descent and other related procedures.
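
For illustration, scikit-learn's metric MDS minimizes stress with the SMACOF majorization procedure (an iterative scheme related to gradient descent); the parameters below are assumptions:

```python
import numpy as np
from sklearn.manifold import MDS

X = np.random.default_rng(0).normal(size=(200, 3))

mds = MDS(n_components=2, dissimilarity="euclidean", random_state=0)
X_embedded = mds.fit_transform(X)   # 2-D coordinates minimizing stress
print(mds.stress_)                  # final stress value
```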

Isomap’s weakness: real data is often very noisy, and the shortest-path computation then yields noisy distances between points. In practice, UMAP or PHATE (geometry-preserving, common in biology) are more commonly used.

Diffusion maps

Diffusion distance can be more robust to noise! Introduced by Coifman and Lafon, diffusion maps are built in a three-step process:

  1. Recall kernel functions (discussed previously for converting distances to affinities). If $(X, A, \mu)$ is a measurable space, a kernel function is a mapping $K : X \times X \rightarrow \mathbb{R}$ such that for any $x, y \in X$,

$$\begin{aligned} K(x,y) &= K(y,x), \\ K(x,y) &\geq 0. \end{aligned}$$

Kernels are generally nonlinear. Diffusion maps are specifically defined with respect to the Gaussian kernel:

$$k(x,y) = e^{-\|x-y\|^2 / \epsilon},$$

where $\epsilon$ is a bandwidth parameter. This kernel captures a notion of affinity/similarity between points in $X$.

The Gaussian kernel emphasizes local connections and pushes non-local connections to 00 in a “softer” way than kNN. We can also threshold these outputs for computational purposes (so that the affinity graph wouldn’t be fully connected).

  2. After computing affinities with the Gaussian kernel, we define the degree matrix. For a weighted graph, the degree of vertex $x$ is the sum of its similarities/affinities to its neighbors:

$$q(x) = \sum_{i=0}^{n-1} k(x, v_i).$$

The degree matrix is a diagonal matrix where $d_{i,i} = q(\text{vertex } i)$.

  3. We can now define the “diffusion operator”

$$P(x,y) = k(x,y) / q(x),$$

a matrix of transition probabilities obtained by row-normalizing the affinities. This matrix is therefore Markov (row-stochastic) and describes a random walk over the data.

Since PP represents the 1-step transition probabilities, PtP^t represents the tt-step random walk probabilities.
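
A minimal NumPy sketch of building $P$ and its powers from data; the bandwidth $\epsilon = 1$ and $t = 8$ are arbitrary choices for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.default_rng(0).normal(size=(200, 3))   # toy data

K = np.exp(-squareform(pdist(X)) ** 2 / 1.0)          # Gaussian kernel, eps = 1.0
q = K.sum(axis=1)                                     # degrees q(x)
P = K / q[:, None]                                    # P(x, y) = k(x, y) / q(x); rows sum to 1
P_t = np.linalg.matrix_power(P, 8)                    # 8-step random walk probabilities
```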

The diffusion map representation is a spectral embedding; that is, PP has a spectral decomposition that we can use to assign coordinates in the embedding to each data point.

Consider the matrix AA where

$$a(x,y) = \frac{k(x,y)}{\sqrt{q(x)} \sqrt{q(y)}} = q(x)^{1/2} \, p(x,y) \, q(y)^{-1/2}.$$

Then $A = Q^{1/2} P Q^{-1/2}$ and is symmetric, so there exists an eigendecomposition. In fact, $P$ and $A$ are similar, so they must have the same eigenvalues and related eigenvectors.

Let the eigenvectors of $A$ be $\phi_1, \phi_2, \ldots, \phi_n$.

We use these eigenpairs to compute the spectral decomposition of $P$:

For $1 = \lambda_0 \geq \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_\delta > 0$,

$$\begin{pmatrix} \phi_0 & \phi_1 & \phi_2 & \cdots & \phi_\delta \end{pmatrix},$$

where each $\phi_i$ is a column eigenvector of $A$, we can map a datapoint $x$ to its embedding coordinates as:

$$x \mapsto \phi^t(x) = \left[ \lambda_0^t \phi_0(x), \lambda_1^t \phi_1(x), \lambda_2^t \phi_2(x), \ldots, \lambda_\delta^t \phi_\delta(x) \right]^T.$$
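
A rough NumPy sketch of the whole construction: it eigendecomposes the symmetric matrix $A$ and converts back to eigenvectors of $P$ (one common convention). The bandwidth, $t$, and the choice to drop the trivial constant direction are assumptions made for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def diffusion_map(X, eps=1.0, n_components=2, t=1):
    K = np.exp(-squareform(pdist(X)) ** 2 / eps)    # Gaussian kernel affinities
    q = K.sum(axis=1)                               # degrees q(x)
    A = K / np.sqrt(np.outer(q, q))                 # symmetric conjugate of P
    vals, vecs = np.linalg.eigh(A)                  # real eigenpairs (A is symmetric)
    order = np.argsort(vals)[::-1]                  # sort by decreasing eigenvalue
    vals, vecs = vals[order], vecs[:, order]
    phi = vecs / np.sqrt(q)[:, None]                # corresponding eigenvectors of P
    # Drop the trivial constant direction (eigenvalue 1) and scale by lambda^t.
    return (vals[1:n_components + 1] ** t) * phi[:, 1:n_components + 1]

X = np.random.default_rng(0).normal(size=(200, 3))
Y = diffusion_map(X, eps=1.0, n_components=2, t=4)   # 2-D diffusion map coordinates
```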

Properties of the spectrum

Taking powers of the matrix $A$ pushes the smaller eigenvalues toward zero, since $\lambda^t$ shrinks faster the further $\lambda$ is below 1:

notes

Hence powers of $A$ denoise in a low-pass-filtering manner. When we map $x$ to its new coordinates, the decay of the spectrum means we can keep only the first $k$ eigenvalues/eigenvectors while still maintaining accuracy, which reduces our dimensionality.

Diffusion distance

Diffusion maps preserve diffusion distance, a family of $L^2$ distances between the transition probability distributions of two points:

$$D_t(x,y)^2 = \| p_t(x, \cdot) - p_t(y, \cdot) \|^2_{L^2(X, d\mu / \pi)}.$$

This is essentially the Euclidean distance between rows of the diffusion operator.

Diffusion distance can also be interpreted as the overall distance in walking probabilities after tt steps when starting out in point xx vs. point yy.
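
A tiny sketch of that interpretation: compare rows of $P^t$ directly. Note this drops the $d\mu/\pi$ weighting from the definition above, so it is a simplification rather than the exact quantity:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.default_rng(0).normal(size=(200, 3))
K = np.exp(-squareform(pdist(X)) ** 2 / 1.0)
P = K / K.sum(axis=1, keepdims=True)                  # diffusion operator

def diffusion_distance(P, t, i, j):
    P_t = np.linalg.matrix_power(P, t)                # t-step transition probabilities
    return np.linalg.norm(P_t[i] - P_t[j])            # distance between rows i and j

print(diffusion_distance(P, t=8, i=0, j=1))
```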

Diffusion maps do a good job at cleaning data, but are not necessarily great at visualization. This is partially because diffusion maps split information into multiple orthogonal directions, e.g. plotting in 2-D will likely only show us 2-ish trajectories.

notes

Partitioning data with K-means

In k-means, we want to find $k$ splits/partitions to minimize variance within each cluster:

k-means converges because each reassignment lowers the MSE:

$$\operatorname*{arg\,min}_S \sum_{i=1}^k \sum_{x \in S_i} \| x - \mu_i \|^2 = \operatorname*{arg\,min}_S \sum_{i=1}^k |S_i| \operatorname{var}(S_i).$$

Therefore, this optimization will reach a local minimum. However, k-means only works well on data with certain characteristics. It assumes:

It is also sensitive to initialization and outliers.
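
A minimal NumPy k-means loop showing the assignment/update steps whose monotone decrease of the objective gives the convergence argument above; the random initialization and empty-cluster handling here are simplifications:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # sensitive to this initialization
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):                # converged to a local minimum
            break
        centers = new_centers
    return labels, centers

labels, centers = kmeans(np.random.default_rng(1).normal(size=(300, 2)), k=3)
```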

An alternative is spectral clustering. Spectral clustering can use eigenvectors of the diffusion operator or the graph Laplacian. Advantages of using the spectrum for k-means:

How it works:
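
As a rough sketch, the usual recipe is: build an affinity graph, embed points with the leading eigenvectors of its Laplacian (or diffusion operator), then run k-means in that embedding. scikit-learn's SpectralClustering wraps these steps; the parameters below are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

X = np.random.default_rng(0).normal(size=(200, 3))

sc = SpectralClustering(n_clusters=3, affinity="rbf",
                        assign_labels="kmeans", random_state=0)
labels = sc.fit_predict(X)   # cluster labels from k-means in the spectral embedding
```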

4.