Geometric and Topological Methods in ML

Notes and relevant papers / readings for geometric and topological methods in machine learning. IN PROGRESS.


2. Geometric graphs, construction, clustering

Basics: metric spaces / metrics, graphs. Geometric graphs have vertices as points in a metric space, with edges defined by low distances in the metric space.

kNN and $\epsilon$-NN graphs. kNN construction via KD tree.

Shortest path methods: Dijkstra's (single source, $O(|E| + |V|\log |V|)$), Floyd-Warshall (all pairs, $O(|V|^3)$).

Dijkstra: notes

Floyd Warshall: notes

We are usually interested in using shortest paths to mimic geodesic distances. Also, Alamgir (ICML 2012) shows that unweighted kNN graphs lead to unintuitive paths (the shortest path minimizes the number of hops, so it sometimes cuts through sparse regions where individual edges span large distances in the metric), and that the weighted shortest-path distance converges to the length of the shortest path connecting two points.

Manifold learning based graph construction

Distance is converted to affinities via kernels (e.g. Gaussian kernel):

$$\text{affinity}_{i,j} = \exp \left( - \frac{\text{dist}(x_i, x_j)^2}{2 \sigma^2} \right).$$

The process usually goes: data → calculate distances → calculate affinities.
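As a concrete illustration of this pipeline, here is a minimal sketch (not code from the course) that builds a kNN graph with a KD tree and converts the distances to Gaussian affinities; the function name and parameter values are made up for illustration.

```python
# Minimal sketch: kNN graph construction and Gaussian affinities.
# Assumes a point cloud X of shape (n_points, n_dims); names are illustrative.
import numpy as np
from scipy.spatial import cKDTree

def knn_affinity_graph(X, k=10, sigma=1.0):
    """Build a kNN graph and convert edge distances to Gaussian affinities."""
    tree = cKDTree(X)
    # query returns the k+1 nearest neighbors (the first is the point itself)
    dists, idx = tree.query(X, k=k + 1)
    n = X.shape[0]
    A = np.zeros((n, n))
    for i in range(n):
        for d, j in zip(dists[i, 1:], idx[i, 1:]):
            a = np.exp(-d**2 / (2 * sigma**2))   # Gaussian kernel on the distance
            A[i, j] = max(A[i, j], a)
            A[j, i] = A[i, j]                    # symmetrize the kNN graph
    return A

X = np.random.rand(100, 3)
A = knn_affinity_graph(X, k=5, sigma=0.3)
```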

Graph-based clusters / community detection

Hierarchical clustering:

How do we measure distance between clusters? There are a few methods: single, average, and complete linkage. We can think of single linkage as linking the two closest points between two clusters, complete linkage as linking the two furthest, and average linkage as using the average pairwise distance. Single linkage is the most common:

Modularity

$A_{ij}$ is the actual affinity between $i, j$.

$P_{ij}$ is the probability of an edge existing between $i, j$ under a null model, calculated by $P_{ij} = \frac{k_i k_j}{2m}$, where $k_i$ is the degree of node $i$ and $m$ is the total number of edges (or total edge weight).

Modularity is the sum of $A_{ij} - P_{ij}$ over all pairs of nodes in the same community:

$$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)$$
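A small sketch of computing $Q$ for a given partition directly from this definition (illustrative code, not a reference implementation; the two-clique example is made up):

```python
# Sketch: modularity Q for a given node partition.
import numpy as np

def modularity(A, communities):
    """Q = (1/2m) * sum_ij (A_ij - k_i k_j / 2m) * delta(c_i, c_j)."""
    k = A.sum(axis=1)                 # (weighted) degrees
    two_m = A.sum()                   # 2m = total weight, counted in both directions
    delta = (communities[:, None] == communities[None, :]).astype(float)
    return ((A - np.outer(k, k) / two_m) * delta).sum() / two_m

# Example: two triangles joined by a single edge give a clearly positive Q.
A = np.zeros((6, 6))
A[:3, :3] = 1
A[3:, 3:] = 1
np.fill_diagonal(A, 0)
A[2, 3] = A[3, 2] = 1
labels = np.array([0, 0, 0, 1, 1, 1])
print(modularity(A, labels))
```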

Modularity optimization: Louvain used to be standard; now most use Leiden. Louvain has two phases: (1) greedily move individual nodes between communities to maximize the modularity gain, then (2) aggregate each community into a single node and repeat.

Why do we care about modularity? Based on its definition, we can think of it as a measure of the strength of a division of a graph into modules/groups/communities. Networks with high modularity have dense connections between nodes within modules but sparse connections between nodes in different modules.

3. Data diffusion, diffusion maps, spectral clustering

Representation Learning

We want to find the “best” representation of our data; i.e., we want to preserve as much information as possible while obeying some penalty that simplifies the representation.

What kind of information should be preserved, and how do we define “simple”?

These criteria aren’t necessarily mutually exclusive. Simple representations are good for interpretability (visualization), reducing computational cost (compression), and performance on supervised tasks (can be easier to classify / regress from).

Isomap (Tenenbaum, 2000)

MDS (Multidimensional Scaling)

MDS finds an embedding of the data that preserves distances by optimizing point positions such that “stress” is minimized:

$$\text{stress}_{D} (x_1, x_2, \ldots, x_N) = \sqrt{ \sum_{i \neq j = 1, \ldots, N} \left( d_{ij} - \|x_i - x_j\| \right)^2 }.$$

This can be done with gradient descent and other related procedures.

Isomap’s weakness: real data is often very noisy, and the shortest path algorithm results in noisy distances between points. More commonly, we might see UMAP being used or PHATE (geometry preserving, common in biology).

Diffusion maps

Diffusion distance can be more robust to noise! Introduced by Coifman and Lafon. 3 step process:

  1. Recall kernel functions (discussed previously for converting distances to affinities). If $(X, A, \mu)$ is a measurable space, a kernel function is a mapping $K : X \times X \rightarrow \R$ such that for any $x, y \in X$,

$$\begin{aligned} K(x,y) &= K(y,x) \\ K(x,y) &\geq 0. \end{aligned}$$

Kernels are generally nonlinear. Diffusion maps are specifically defined with respect to the Gaussian kernel:

$$k(x,y) = e^{-\|x-y\|^2 / \epsilon},$$

where $\epsilon$ is a bandwidth parameter. This kernel captures a notion of affinity/similarity between points in $X$.

The Gaussian kernel emphasizes local connections and pushes non-local connections to $0$ in a “softer” way than kNN. We can also threshold these outputs for computational purposes (so that the affinity graph wouldn’t be fully connected).

  2. After computing affinities with the Gaussian kernel, we define the degree matrix. For a weighted graph, the degree of vertex $x$ is defined as the sum of similarities/affinities between it and its neighbors:

$$q(x) = \sum_{i=1}^{n} k(x, v_i).$$

The degree matrix is a diagonal matrix where $d_{i,i} = q(\text{vertex } i)$.

  3. We can now define the “diffusion operator”

$$P(x,y) = k(x,y) / q(x),$$

a matrix of transition probabilities with row-normalized affinities. Therefore, this matrix is Markov & describes a random walk over the data.

Since $P$ represents the 1-step transition probabilities, $P^t$ represents the $t$-step random walk probabilities.

The diffusion map representation is a spectral embedding; that is, $P$ has a spectral decomposition that we can use to assign coordinates in the embedding to each data point.

Consider the matrix $A$ where

$$a(x,y) = \frac{k(x,y)}{\sqrt{q(x)} \sqrt{q(y)}} = q(x)^{1/2}\, p(x,y)\, q(y)^{-1/2}.$$

Then $A = Q^{1/2} P Q^{-1/2}$ is symmetric, so there exists an eigendecomposition. In fact, $P$ and $A$ are similar, so they must have the same eigenvalues and related eigenvectors.

Let the eigenvectors of $A$ be $\phi_1, \phi_2, \ldots, \phi_n$.

These eigenpairs are used to compute the spectral decomposition of $P$:

For $1 = \lambda_0 \geq \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_\delta > 0$ and

$$\begin{pmatrix} \phi_0 & \phi_1 & \phi_2 & \cdots & \phi_\delta \end{pmatrix}$$

where each $\phi_i$ is a column eigenvector of $A$, we can map a data point $x$ to its embedding coordinates as:

$$x \mapsto \phi^t(x) = \left[ \lambda_0^t \phi_0(x),\ \lambda_1^t \phi_1(x),\ \lambda_2^t \phi_2(x),\ \ldots,\ \lambda_\delta^t \phi_\delta(x) \right]^T.$$
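Putting the three steps together, a minimal sketch of a diffusion map embedding (assuming a plain Gaussian kernel and dense matrices; the name `diffusion_map` and the parameter values are illustrative, not a reference implementation):

```python
# Sketch of the three-step diffusion map construction above.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def diffusion_map(X, eps=1.0, t=1, n_components=2):
    D2 = squareform(pdist(X))**2
    K = np.exp(-D2 / eps)                      # Gaussian kernel affinities
    q = K.sum(axis=1)                          # degrees q(x)
    A = K / np.sqrt(np.outer(q, q))            # symmetric conjugate of P = K / q
    lam, phi = np.linalg.eigh(A)               # eigenpairs of the symmetric matrix A
    order = np.argsort(lam)[::-1]              # sort eigenvalues descending
    lam, phi = lam[order], phi[:, order]
    psi = phi / np.sqrt(q)[:, None]            # right eigenvectors of P
    # skip the trivial first coordinate; scale by lambda^t
    return (lam[1:n_components + 1]**t) * psi[:, 1:n_components + 1]

X = np.random.rand(200, 3)
emb = diffusion_map(X, eps=0.5, t=2, n_components=2)
```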

Properties of the spectrum

Taking powers of the matrix $A$ makes the smaller eigenvalues shrink faster (since $\lambda^t \to 0$ for $\lambda < 1$):

notes

Hence powers of $A$ denoise in a low-pass filtering manner. When we map $x$ to its new coordinates, we can use only the first $k$ eigenvalues/eigenvectors due to spectrum decay while still maintaining accuracy, which reduces our dimensionality.

Diffusion distance

Diffusion maps preserve diffusion distance, a family of $L^2$ distances between the transition probability distributions of two points:

$$D_t(x,y)^2 = \|p_t(x, \cdot) - p_t(y, \cdot) \|^2_{L^2(X, d\mu / \pi)}.$$

This is essentially the Euclidean distance between rows of the diffusion operator.

Diffusion distance can also be interpreted as the overall distance between the walking probabilities after $t$ steps when starting at point $x$ vs. point $y$.

Diffusion maps do a good job at cleaning data, but are not necessarily great at visualization. This is partially because diffusion maps split information into multiple orthogonal directions, e.g. plotting in 2-D will likely only show us 2-ish trajectories.

notes

Partitioning data with K-means

In k-means, we want to find $k$ splits/partitions to minimize variance within each cluster:

k-means converges because each reassignment lowers the MSE:

$$\argmin_S \sum_{i=1}^k \sum_{x \in S_i} \|x - \mu_i\|^2 = \argmin_S \sum_{i=1}^k |S_i|\,\text{var}(S_i).$$

Therefore, this optimization will reach a local minimum. However, k-means only works well on data with certain characteristics. It assumes:

It is also sensitive to initialization and outliers.

An alternative is spectral clustering. Spectral clustering can use eigenvectors of the diffusion operator or the Laplacian. Advantages of using the spectrum for k-means:

How it works:
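A minimal sketch of one common spectral clustering recipe (my own illustrative code, assuming a connected affinity graph `A`; a diffusion-map variant would swap the Laplacian eigenvectors for diffusion coordinates):

```python
# One common spectral clustering recipe (sketch, not a reference implementation).
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(A, n_clusters=3):
    d = A.sum(axis=1)                                    # assumes no isolated nodes
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt     # normalized Laplacian
    lam, U = np.linalg.eigh(L)                           # eigenvalues in ascending order
    Y = U[:, :n_clusters]                                # low-frequency eigenvectors
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)     # row-normalize the embedding
    _, labels = kmeans2(Y, n_clusters, minit='++')       # k-means in spectral coordinates
    return labels
```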

4. Data diffusion, continuous diffusion, PHATE

Previously (3), diffusion maps.

Connections between discrete and continuous diffusion

For finite bandwidth $\epsilon$ (and infinitely many data points), the Markov chain in discrete time and space converges to a Markov jump process in discrete time but continuous space.

As $\epsilon \rightarrow 0$, this jump process converges to a diffusion process on $\R^n$. In the continuum limit of infinitely many points and continuous time, this resembles a Fokker-Planck diffusion process.

Fokker-Planck Diffusion Process

Describes diffusion as a mix of random Brownian forces $W_t$ and drag forces from a potential $U(x)$:

$$dX_t = \mu (X_t, t)\, dt + \sigma (X_t, t)\, dW_t.$$

Given an initial condition at time $s < t$, the density $p$ follows the forward Fokker-Planck equation (FPE):

$$\frac{\partial p}{\partial t} = \nabla \cdot (\nabla p + p \nabla U(x)).$$

Here $\Delta$ refers to the Laplacian $\Delta u = \nabla \cdot (\nabla u)$, the divergence of the gradient (so the first term is $\nabla \cdot (\nabla p) = \Delta p$).

General solutions of the FPE are given by the eigenfunctions of the FPE operator. In our case, the potential is the negative log of the data density (so the drag is related to data density).

Data density

The degree $D(x_i) = q(x_i) = \sum_j A(x_i, x_j)$ is a proxy for the data density at point $x_i$.

In the setting of our data, the “potential” $U$ is the negative log of the data density:

$$U(x) = - \log (q(x)).$$

Essentially, the FPE is saying we have a diffusion process that has “drag” because of data density.

Anisotropic normalization

Instead of the diffusion kernel from before, we can define a different symmetric kernel $M_S = D^{-\alpha} A D^{-\alpha}$:

$$M_S(x,y) = \frac{k(x,y)}{q(x)^\alpha\, q(y)^\alpha}.$$

If $\alpha = 1$, this kernel has uniform degree (all points have density 1).
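A small sketch of this normalization applied to a precomputed kernel matrix `K` (illustrative only; the function name is made up):

```python
# Sketch: anisotropic (alpha) normalization of a Gaussian kernel matrix K.
import numpy as np

def anisotropic_operator(K, alpha=1.0):
    q = K.sum(axis=1)                          # density estimate q(x)
    M = K / np.outer(q**alpha, q**alpha)       # M_S = D^-alpha K D^-alpha
    d = M.sum(axis=1)                          # degrees of the renormalized kernel
    P = M / d[:, None]                         # row-normalize -> anisotropic diffusion operator
    return P

# alpha = 1 divides out the data density (Laplace-Beltrami limit, geometry only);
# alpha = 0 recovers the ordinary, density-dependent diffusion operator.
```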

Convergence to Laplace-Beltrami

If you set $\alpha = 1$, then $U = -\log(1) = 0$, and the second term in the FPE drops out:

$$\frac{\partial p}{\partial t} = \nabla \cdot (\nabla p + p \nabla U(x))$$

and we are left with

$$\frac{\partial p}{\partial t} = \Delta p$$

which is purely controlled by the Laplacian. This is a second derivative operator and purely geometric, hence separating density from geometry.

Summary: with infinitely many points and continuous time, the FPE represents the diffusion process. We need to reconcile the drag term $U$ in the FPE, which is given by the density of the data. We resolve this by dividing it out with the anisotropic normalization kernel, which converges to the Laplace-Beltrami operator, successfully separating density from geometry.

Application of diffusion maps to clustering / PHATE

Previously (3): k-means clustering algorithm, spectral clustering using diffusion map coordinates.

PHATE goal: Create a visualization that preserves data geometry in low dimensions.

notes

PHATE uses adaptive bandwidths/an alpha-decaying kernel:

$$K_{k, \alpha}(x,y) = \frac{1}{2} \exp \left( - \left( \frac{\|x-y\|_2}{\epsilon_k(x)} \right)^\alpha \right) + \frac{1}{2} \exp \left( - \left( \frac{\|x-y\|_2}{\epsilon_k(y)} \right)^\alpha \right),$$

where $\epsilon_k(x)$ is the distance from $x$ to its $k$-th nearest neighbor (an adaptive bandwidth) and $\alpha$ controls how quickly the kernel decays.

Plot of alpha decay kernel:

notes

Spectral entropy

notes

Potential distance in PHATE

notes

Diffusion components track one branch at a time, zeroing out everything else. Makes them good for clustering but not great for visualization. See image for example:

notes

Other remarks

5. tSNE and UMAP

Ways to compare probability distributions

Entropy

Represents the expected amount of uncertainty in a distribution:

$$H(P) = - \sum_i P(x_i)\log(P(x_i)).$$

Entropy measures the information you gain, on average, by learning the outcome.

Cross Entropy

Given two distributions $P, Q$: how many bits on average does it take to encode samples from $P$ using a code that is optimized for $Q$?

$$H(P,Q) = - \sum_i P(x_i) \log(Q(x_i)).$$

Example: encoding a distribution QQ.

If we have 3 outcomes $a, b, c$ and $q(a) = 1/2$, $q(b) = q(c) = 1/4$, we can encode $a \to 0$, $b \to 10$, $c \to 11$ (code lengths 1, 2, and 2 bits).

Then the average number of bits needed is

$$(1/2) \cdot 1 + (1/4) \cdot 2 + (1/4) \cdot 2 = 1.5.$$

Suppose we now use the code optimized for $Q$ on a distribution $P$, where $p(a) = p(b) = 1/4$ and $p(c) = 1/2$. Then the average number of bits needed is

$$(1/4) \cdot 1 + (1/4) \cdot 2 + (1/2) \cdot 2 = 1.75.$$

KL divergence

The expected number of extra bits needed when using samples from $P$ with a code optimized for $Q$. This is related to cross entropy and sometimes called relative entropy:

$$D_{KL}(P\|Q) = \sum_i P(x_i)\log \left( \frac{P(x_i)}{Q(x_i)} \right) = - \sum_i P(x_i)\log \left( \frac{Q(x_i)}{P(x_i)} \right).$$

Note that

$$D_{KL}(P\|Q) = H(P,Q) - H(P).$$

Important properties: $D_{KL}(P\|Q)$ is not a distance. It is not the same as $D_{KL}(Q\|P)$ and does not satisfy the triangle inequality.
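A quick numeric check of these identities on the worked example above (base-2 logs, so the units are bits):

```python
# Numeric check: entropy, cross entropy, and KL on the a/b/c example.
import numpy as np

def entropy(p):
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log2(q))          # bits to encode samples from p with a code for q

def kl(p, q):
    return np.sum(p * np.log2(p / q))

q = np.array([0.5, 0.25, 0.25])             # distribution the code was built for
p = np.array([0.25, 0.25, 0.5])             # distribution we actually sample from
print(cross_entropy(p, q))                  # 1.75 bits, matching the worked example
print(entropy(p) + kl(p, q))                # same number: H(P,Q) = H(P) + D_KL(P||Q)
```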

Jensen Shannon Distance

JS divergence is symmetric but still does not follow the triangle inequality. Can be made a proper distance by taking a square root of the Jensen Shannon divergence.

tSNE: t-distributed Stochastic Neighbor Embedding

Concept: Maintain neighborhoods, preserve the data’s shape both locally and globally. notes

Stochastic neighbors: tSNE creates a probability distribution that represents similarities between neighbors, i.e. the conditional probability that one point would pick another as its neighbor. The similarity for a point $x_i$ is defined to be proportional to a probability density under a Gaussian centered at $x_i$.

The variance (bandwidth parameter) of the Gaussian at each point $x_i$ is set to $\sigma_i$ such that the Shannon entropy of its neighbor distribution is a fixed value.

$$H(p(x_i, \cdot)) = - \sum_j p(x_i, x_j) \log(p(x_i, x_j))$$

$$\text{perplexity} = 2^{H(p(x_i, \cdot))} = \text{effective number of neighbors}$$

This fixed value is set through a parameter we choose, called the perplexity.
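As a sketch of how the bandwidths can be found in practice, here is a simple geometric binary search over $\sigma_i$ until the entropy matches the target perplexity; this is illustrative, not the reference tSNE implementation:

```python
# Sketch: find sigma_i so that point x_i has the requested perplexity.
import numpy as np

def row_probs(d2_row, sigma):
    """Conditional probabilities p_{j|i} from squared distances to x_i (self excluded)."""
    p = np.exp(-d2_row / (2 * sigma**2))
    return p / p.sum()

def sigma_for_perplexity(d2_row, target_perplexity=30.0, n_iter=50):
    lo, hi = 1e-10, 1e10
    for _ in range(n_iter):
        sigma = np.sqrt(lo * hi)                 # geometric binary search
        p = row_probs(d2_row, sigma)
        H = -np.sum(p * np.log2(p + 1e-12))      # Shannon entropy in bits
        if 2**H > target_perplexity:
            hi = sigma                           # too many effective neighbors -> shrink sigma
        else:
            lo = sigma                           # too few -> grow sigma
    return sigma
```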

Penalty: tSNE minimizes KL divergence, i.e., we want to match the probability distributions of high dimensional and low dimensional neighbors. Optimize with SGD.

Note on SNE vs. tSNE: high-dimensional similarities are measured as shown below and use an adaptive bandwidth. The low-dimensional similarities are computed differently depending on whether we use SNE or tSNE.

notes notes

In both SNE and tSNE, we are moving points around to minimize KL divergence. However, the problem with normal SNE is that there is a low penalty for moving points (in low dim) around if they are far away in high dimensions.

notes

To fix this, tSNE uses the Student-t distribution in low-dimensional space. The fatter tails help keep $p_{ij}$ a little bit higher, so points which are far away can still be reasonably penalized in their low-dimensional representations. This also helps with crowding.

notes

notes

UMAP

Uniform Manifold Approximation and Projection = UMAP.

Assumes:

Effectively uses a tSNE-like method to embed data.

$$\mu_{i \rightarrow j} = \begin{cases} \exp( - (d(x_i, x_j) - d(x_i, x_{i_1}))/ \sigma_i) & \text{ for } j \in \{i_1, \ldots, i_k\} \\ 0 & \text{ else.} \end{cases}$$

$$\mu_{ij} = \mu_{i \rightarrow j} + \mu_{j \rightarrow i} - \mu_{i \rightarrow j}\mu_{j \rightarrow i} \in [0,1].$$

Uses binary crossentropy loss, a penalty for modeling similar points as dissimilar and vice versa:

$$\mathcal{L} ( \{e_i \} \mid \{ \mu_{ij}\}) = -2 \sum_{1 \leq i < j \leq n} \left[ \mu_{ij} \log(v_{ij}) + (1 - \mu_{ij}) \log(1 - v_{ij}) \right].$$
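A sketch of the two graph-construction formulas above (directed memberships and the fuzzy union) plus the cross-entropy penalty; the full UMAP pipeline also fits $\sigma_i$ to a target neighbor count and optimizes the embedding with sampled attraction/repulsion, which is omitted here:

```python
# Sketch of UMAP's fuzzy graph construction and BCE penalty (illustrative only).
import numpy as np

def fuzzy_memberships(d, knn_idx, sigma):
    """d: (n, k) sorted kNN distances (self excluded), knn_idx: (n, k) neighbor indices."""
    n, k = d.shape
    rho = d[:, 0:1]                                # distance to nearest neighbor, d(x_i, x_{i_1})
    mu_dir = np.exp(-(d - rho) / sigma[:, None])   # directed memberships mu_{i -> j}
    M = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    M[rows, knn_idx.ravel()] = mu_dir.ravel()
    return M + M.T - M * M.T                       # fuzzy union: mu_ij in [0, 1]

def umap_bce(mu, v, eps=1e-6):
    """Binary cross-entropy between graph weights mu and embedding similarities v."""
    return -np.sum(mu * np.log(v + eps) + (1 - mu) * np.log(1 - v + eps))
```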

The embedding is initialized to a Laplacian eigenmap (Belkin and Niyogi), a geometric method. The “effective” loss function shows that UMAP effectively binarizes neighbors.

Scalability: neighbors and non-neighbors are just sampled from the data graph. Therefore, we don’t have to compute a whole affinity matrix or Markov matrix like in tSNE, hence it is much faster and more scalable.

Attraction-repulsion continuum: Unlike tSNE, one pair being close doesn’t force another pair to be far (an effect of tSNE’s row-stochastic normalization). Therefore, UMAP visualizations have a less “explosive” look:

notes

UMAP vs tSNE

6. Graph Laplacians and Graph Signal Processing

Laplacian

Laplacian = divergence of the gradient, sum of all the unmixed second derivatives at any point.

$$\Delta f = \nabla \cdot (\nabla f) = \nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} + \frac{\partial^2 f}{\partial z^2}.$$

In 1-D, the Laplacian is the second derivative, which measures the curvature of a function. In multiple dimensions, the Laplacian at a point measures how much the function’s value deviates from the average of its values in a small neighborhood around that point, related to mean curvature.

The heat equation uses the Laplacian:

$$\frac{\partial u}{\partial t} = \Delta u$$

where $\Delta u$ represents the rate at which the quantity $u$ diffuses at a given point. Other applications of the Laplacian include computer vision, quantum mechanics (Hamiltonian), and electromagnetism (Poisson equation).

The Laplacian is a linear operator:

$\cos(\omega x), \sin(\omega x)$ are eigenfunctions of the Laplacian in Euclidean space (e.g., the second derivative of each function is a scalar multiple of the original function).

Discrete and Graph Laplacian

A discrete Laplacian is a discrete analog of continuous Laplacians, often on a grid. In a graph, the Laplacian is a linear operator (a matrix) that represents how a function’s value at a vertex differs from its average over the neighboring vertices. Therefore, the graph Laplacian is defined as the degree matrix minus the adjacency matrix

$$L = D - A,$$

which measures how a vertex is different from its neighbors.

Quick note on finite difference approximation, where continuous second-order partial derivatives of the Laplacian are replaced with finite differences using Taylor series expansions, e.g.:

$$\frac{\partial^2 u}{\partial x^2} \approx \frac{u(x+h, y) - 2u(x,y) + u(x-h,y)}{h^2}.$$

notes

We can think of the graph Laplacian as a discrete version of finite difference approximation.

By the spectral theorem, $L$ has an eigendecomposition with an orthonormal eigenbasis, $L = U \Lambda U^T$, where $LU_0 = \lambda_0 U_0, LU_1 = \lambda_1 U_1, \ldots$

The eigenvalues of $L$ lie in the range $0 \leq \lambda_0 \leq \lambda_1 \leq \ldots \leq \lambda_n$.

We often take $L$ to be the normalized Laplacian $L = I - D^{-1/2}AD^{-1/2}$, which helps remove the effect of density.

$0$ is an eigenvalue, with eigenvector $(1, 1, 1, \cdots)$ (for $L = D - A$). The second smallest eigenvalue, called the “Fiedler value,” measures graph connectivity.
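A small sketch constructing both Laplacians and reading off the Fiedler value (the two-triangle example graph is made up for illustration):

```python
# Sketch: combinatorial and normalized graph Laplacians, and the Fiedler value.
import numpy as np

def laplacians(A):
    d = A.sum(axis=1)
    L = np.diag(d) - A                                      # L = D - A
    L_norm = np.eye(len(A)) - A / np.sqrt(np.outer(d, d))   # I - D^{-1/2} A D^{-1/2}
    return L, L_norm

# Two triangles joined by one edge: connected, but with a small Fiedler value.
A = np.zeros((6, 6))
A[0, 1] = A[1, 2] = A[2, 0] = 1
A[3, 4] = A[4, 5] = A[5, 3] = 1
A[2, 3] = 1
A = A + A.T
L, _ = laplacians(A)
lam = np.linalg.eigvalsh(L)
print(lam[0], lam[1])   # ~0 (graph is connected) and the Fiedler value (weak connectivity)
```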

notes

Laplacian Eigenmaps

Belkin & Niyogi: map each point to its Laplacian eigenvector coordinates,

$$x_i \mapsto \left( U_1(x_i), U_2(x_i), \ldots, U_k(x_i) \right)$$

where each $U_i$ is an eigenvector of $L = D - K$. ($x_i$ literally gets mapped to the $i$-th entry of each eigenvector, using as many eigenvectors as the dimension you are reducing to.)

We can recast this as a smoothness minimization problem: a desirable embedding may be one where nearby points are similar. We choose a Gaussian kernel $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / \epsilon)$.

$$C(y) = \frac{1}{2} \sum_{i,j} (y_i - y_j)^2 K(x_i, x_j) = y^T L y.$$

By Courant-Fischer, the (constrained) minimizer of this is $U_0$, and the next best minimizer is $U_1$.

notes

Graph Signal Processing

Fourier transform of a signal:

$$\hat{f}(k) = \int_{-\infty}^{\infty} f(x) e^{-2 \pi i k x}\,dx.$$

Usually, we don’t have the functional form available to compute the integral, so we have to use discrete samples and discrete wavelengths for the discrete Fourier transform.

notes

DFT is a matrix multiplication:

notes

We can interpret features as signals on a graph. Applying the graph Fourier transform, using $L = U \Lambda U^T$, we can decompose signals into graph Fourier coefficients (projections onto the eigenvectors $U$).

Spectral Graph Filtering

Introduced by Stankovic et al. 2018. Interested in finding which frequencies are noise, then altering the frequencies to retrieve the corrected signal.

A filter is generally constructed by having some function $H$, which changes the eigenvalues to $H(\lambda_i)$ (see below).

notes

Data signals tend to be low frequency while noise can be high frequency. We can remove noise by removing the high-frequency eigenvector components. A low-pass filter is one that lets low frequencies through and filters out high frequencies.

notes
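A minimal sketch of an ideal low-pass filter in the graph Fourier domain (illustrative; real pipelines usually avoid the full eigendecomposition, as discussed next):

```python
# Sketch: ideal low-pass spectral filtering of a signal x on a graph with Laplacian L.
import numpy as np

def low_pass_filter(L, x, cutoff):
    lam, U = np.linalg.eigh(L)             # graph Fourier basis = eigenvectors of L
    x_hat = U.T @ x                        # graph Fourier coefficients of the signal
    H = (lam <= cutoff).astype(float)      # ideal low-pass response H(lambda)
    return U @ (H * x_hat)                 # filtered signal back in the vertex domain
```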

Problems with frequency domain filters:

K-filter coefficients

Reduce number of coefficients in filter

Chebyshev Polynomials

Use a polynomial filter: a polynomial of $L$ acts on the eigenvalues while leaving the eigenvectors unchanged, so the filter can be applied without an explicit eigendecomposition.

notes

From Fourier to Wavelets

A solution: wavelets! Unlike a sine or cosine function, a wavelet is a localized oscillation. They can be scaled (to look at low vs. high frequency components) and translated (to look at different times and locations).

Discrete wavelet transform

notes

Diffusions can be used to create wavelets. Diffusion wavelets are formed by taking differences between two scales of diffusion.

7. Graph Neural Networks and Geometric Scattering

Rough neural network overview. Covered:

Design choices

Graph neural network connection

Graph notation: the adjacency matrix can be weighted, and vertex features are graph signals.

Message passing: aggregate information from neighbors, then update. There can be several message passing iterations. After $K$ iterations you get a latent embedding of node $u$: $z_u = h^{(K)}_u$.

Aggregation and update steps can involve trainable weight matrices.
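A minimal sketch of one message-passing layer with sum aggregation and a ReLU update; `W_self` and `W_neigh` stand in for the trainable weight matrices (illustrative, not a specific published architecture):

```python
# Sketch: one message-passing layer (sum aggregation + trainable update).
import numpy as np

def message_passing_layer(A, H, W_self, W_neigh):
    """A: (n, n) adjacency, H: (n, d) node features; returns updated (n, d_out) features."""
    messages = A @ H                                          # aggregate: sum of neighbor features
    return np.maximum(0, H @ W_self + messages @ W_neigh)     # update with a ReLU

n, d, d_out = 5, 8, 16
rng = np.random.default_rng(0)
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T                                # random undirected graph
H = rng.standard_normal((n, d))
H1 = message_passing_layer(A, H,
                           rng.standard_normal((d, d_out)),
                           rng.standard_normal((d, d_out)))
```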

notes

notes

We can train GNNs to do node, edge, graph, and signal level tasks. Common tasks were node/edge/graph tasks.

Node level tasks: node classification/regression. Map node embeddings directly via node head to prediction.

notes

Link prediction: predict a link that should be there based on neighboring links.

Graph level tasks: graph classification (does this graph have a 5-clique?). Graph regression (does the molecule this graph encodes bind to a virus?). Usually an additional layer of pooling. Also a pooling strategy called DiffPool.

notes

Signal level task: features on vertices constitute a signal on a graph. Some graphs have more signals than vertices / edges. Classifying signals is largely unrecognized but important. Classic example is classifying traffic patterns.

  1. Permutation equivariance: graph and node representations are the same regardless of how nodes are ordered in the adjacency matrix. Achieved via permutation-invariant aggregation operations like mean, sum, max.

  2. Inductive capability: sharing weights across nodes

These types of GNNs are limited and message passing can lead to oversmoothing.

Weisfeiler-Lehman Test of Isomorphism

Landmark result: message-passing GNNs are at most as powerful as the W-L test of isomorphism. notes

This test produces false positives (e.g., graphs that receive the same coloring but are not isomorphic).

Some solutions

GraphSAGE

Spectral Construction

Features are signals on the graph. These signals have an analogous graph Fourier transform, which uses the eigenvectors of the graph Laplacian as waveforms (see previous lectures). Some spectral methods don’t directly compute the GFT and instead implement the filter as a polynomial of the graph Laplacian.

See previous lec for Graph Laplacian review.

What does it mean to create filters?

notes

We want to learn these filters with a neural network, which is what graph spectral nns were designed to do.

notes

Problems

notes

Wavelet-based constructions

See previous lec on continuous and discrete wavelet transforms (don’t need to know). Waveform centered at every node performs message aggregation.

Graph wavelet networks are usually created with a diffusion operator. Diffusion wavelets differ from lazy random walks (plain diffusion) because they capture differences in diffusion between dyadic powers.

Geometric Scattering

Alternative to GNNs. Use a cascade of diffusion wavelets to create a “deep” wavelet transform of a signal on a graph (instead of message aggregation and updating). The absolute value serves as the pointwise non-linearity or “activation.” The cascade is parameterized by a scattering path that determines the scales of each wavelet.

Whole graph representation formed by taking an aggregation over all node level features.

8. Optimal Transport on Graphs

Recall KL and JS divergence: divergences have issues when distributions have disjoint support. Optimal transport asks how hard it is to push/transform one distribution into another (Wasserstein distance).

Wasserstein Distance & Optimal Transport

$$W_d(P,Q) = \inf_{\pi \in \Pi(P,Q)} \int d(x,y)\, \pi(dx,dy)$$

Ground “cost” or distance:

$$d(x,y) = \|x-y\|_2$$

(there are also different $L^p$ Wasserstein distances). The distance is proportional to the amount of work needed to turn one distribution into another.

Want to minimize

$$c = \sum_{i,j} P_{i,j} D_{i,j},$$

the transport cost, with constraints:

$$\sum_j P_{i,j} = r_i,$$

for $r_i$ the row marginal, and

$$\sum_i P_{i,j} = c_j,$$

with $c_j$ the column marginal. This is a linear program to optimize.

On Graphs

Distributions are normalized positive graph signals (if we have negative signal values, we can shift all signals to being positive).

Entropy Regularized Optimal Transport

$$\inf_{\pi \in \Pi(P,Q)} \int d(x,y)^p\, \pi(dx,dy) - \epsilon h(\pi)$$

$$h(\pi) := - \int \log \pi \, d\pi.$$

Suppose you wanted to maximally spread the probability mass. Then you might want to proportionally spread each bin of the “source” distribution $r(i)$ according to the needs of the sink distribution $c(j)$. This particular joint probability distribution is actually the independence table $rc^T$, the outer product of $r$ and $c$. Thus the entropy term can be seen as a penalty for $P$ deviating from $rc^T$.

Discrete problem formulation with entropic constraint, solvable with Lagrange multipliers.

Sinkhorn

Primal method of optimal transport.
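A sketch of the classic Sinkhorn iterations for the entropy-regularized problem (illustrative; libraries such as POT provide production implementations):

```python
# Sketch: Sinkhorn iterations for entropy-regularized OT between histograms r and c.
import numpy as np

def sinkhorn(r, c, D, eps=0.1, n_iter=200):
    """r, c: marginal histograms (sum to 1); D: cost matrix; eps: entropic regularization."""
    K = np.exp(-D / eps)                   # Gibbs kernel
    u = np.ones(len(r))
    v = np.ones(len(c))
    for _ in range(n_iter):
        u = r / (K @ v)                    # alternately rescale to match the row marginals...
        v = c / (K.T @ u)                  # ...and the column marginals
    P = np.diag(u) @ K @ np.diag(v)        # transport plan
    return np.sum(P * D), P                # regularized transport cost and plan
```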

notes

Dual form of optimal transport

Kantorovich-Rubinstein Duality

notes

Instead of finding ways to transport from one bin to another (primal), you try to find a signal whose likelihood is very different between two distributions:

notes

In the image: high likelihood under the top distribution and low likelihood under the bottom distribution. Use multiscale histograms/density estimates of the data:

This has also been done with trees (tree Wasserstein).

Dual form computations vs primal

Diffusion EMD

Uses diffusion wavelets at different scales to decompose a signal on the graph. Taking the $L^1$ distance between these multiscale difference signals gives diffusion EMD.

notes

Embeddings using diffusion EMD

Create a higher-level distance matrix using pairwise EMD. Visualize with PHATE to preserve manifold geometry.

Ground manifold distance and EMD: take ground distance as diffusion distance.

notes

9. Riemannian Geometry and ML Applications

Topological space

Let XX be a set and τ\tau a family of subsets. τ\tau is a topology on XX if:

Topological manifolds

A map $f: X \rightarrow Y$ between two topological spaces is a homeomorphism if:

Charts and Atlases

A neighborhood $U$ together with a homeomorphism $x: U \rightarrow \R^n$ is a chart. Manifolds which are not homeomorphic to $\R^n$ necessarily require more than one chart. The collection of these charts together is an atlas, and the charts allow us to create local coordinate systems for each neighborhood and create a tangent basis for it.

We use this idea in manifold learning by using Euclidean distances only to find local neighbors. Recall that we compute pairwise distances when computing kernels. The pointwise kernel function (e.g. Gaussian) suppresses non-local distances.

Importance of neighborhoods: In some manifold learning methods (UMAP, tSNE), neighborhoods are computed explicitly. The idea is to match local neighbors between the high- and low-dimensional embeddings; if the charts are matched, then manifold information is carried to low dimensions. This breaks if the manifold is not well sampled (cf. UMAP's assumption that the manifold is uniformly sampled).

Differentiable Manifolds

A differentiable manifold is a topological manifold with a differential structure. A topological manifold can be given a differential structure locally by using the homeomorphisms in its atlas and the standard differential structure on a vector space.

To include a global differential structure on the local coordinate systems induced by the homeomorphisms, their compositions on chart intersections in the atlas must be differentiable functions on the corresponding vector space. Where the domains of charts overlap, the coordinates defined by each chart are required to be differentiable with respect to the coordinates defined by every chart in the atlas. The maps that relate the coordinates defined by the various charts to one another are called transition maps.

notes

The tangent space of a manifold at a specific point is denoted $T_pM$. It is an $n$-dimensional vector space that locally approximates the manifold at that point.

notes

In ML methods where we look at how to move from point to point on a manifold, this assumes some kind of smoothness (differentiability). Therefore, we’re interested in Riemannian geometry.

Riemannian metric

Given a smooth topological manifold $M$, we introduce geometry to the manifold by prescribing an inner product:

notes

This allows us to define ideas of length and distance: notes

notes

notes

The metric is a local element of volume or length. It allows us to build up the real volume in small steps. This is exactly what we do when we compute the length of a path on a graph, albeit discretely. When we find a shortest path we are finding the infimum between these paths.

The volume can also be used for density estimation on the manifold: if you have many points in a nbhd, you can divide by the local volume to get local density.

Curvature

Intuition: one way to define curvature in low dimensions is with the osculating circle (a circle that “kisses” a curve at a point by being tangent to it and having the same curvature as the curve at that point).

In 2-D, there is Gaussian curvature. There are infinitely many curves that pass through any point on a 2-D surface. Locally at any point $p$ we can find a unit normal vector $N(p)$ that is perpendicular to the surface at $p$. This defines the Gauss map $N: M \rightarrow S^2$ (for a surface $M \subset \R^3$). We can measure curvature by measuring how fast $N(p)$ changes.

notes notes

In higher dimensions, one can look at how a higher-dimensional shape is deformed as it moves through the manifold. This requires covariant derivatives, the curvature tensor, the Ricci tensor, etc.

Gaussian vs Ricci Curvature

Curvature on Graphs

Ollivier-Ricci curvature compares the distance between two points with the optimal transport cost between their 1-hop neighborhoods.

There are other notions of curvatures on graphs, such as the diffusion curvature: notes

Bishop-Gromov gives us theorems about the volume of balls on Riemannian manifolds. notes

A comparable volume theorem for diffusion/graphs: notes

Forman-Ricci curvature: useful in GNNs, helps combat “oversquashing” of information or too-tight information bottlenecks. Negatively curved edges can be reinforced to pass additional information.

10. Geometric Manifold Learning

Connecting the discrete Laplacian to the continuous one. When do they converge?

Energy in the discrete setting:

$$f^T L f = \sum_{\text{edges}} w_{ij}(f_i - f_j)^2.$$

In the smooth setting, the corresponding energy is

$$E(f) = \langle \nabla f, \nabla f \rangle = |\nabla f|^2.$$

Claim that

$$\int_M \langle \nabla f, \nabla g \rangle \, d\mu = \int_M f(-\Delta g)\, d\mu.$$

Pf.

$$\operatorname{div}(f \nabla g) = \langle \nabla f, \nabla g \rangle + f\, \Delta g.$$

Integrating over manifold on both sides + Stokes’ Theorem,

$$\int_M \langle \nabla f, \nabla g \rangle\, d\mu + \int_M f \Delta g\, d\mu = \int_M \operatorname{div}(f \nabla g)\, d\mu = 0.$$

We also see a similarity in Rayleigh quotients. E.g., in the discrete setting, the Rayleigh quotient gives

$$\min_{\|f\| = 1,\ f \perp \mathbf{1}} f^T L f = \lambda_2.$$

In the smooth setting,

$$\min_{f \perp 1} \frac{\int_M |\nabla f|^2 \, d\mu}{\int_M f^2 \, d\mu} = \lambda_2.$$

Next, connecting diffusions. Let $L = D - A$ be the (discrete) Laplacian and $P = AD^{-1}$ the column-normalized random walk operator (the transpose of the row-normalized version $P = D^{-1}A$). If $f$ is the initial distribution (e.g. of heat) on the nodes of the graph, $Lf$ represents the rate of change of the diffusion process over the graph and $Pf$ the result of one step of the random walk. The discrete Laplacian $L$ is the infinitesimal generator of $P$.

Compare this to the heat equation in the smooth setting,

$$\partial_t u(t,x) = \Delta u(t,x).$$

The solution is given by the heat semigroup,

$$P_t f = e^{-t \Delta} f.$$

The Laplacian $\Delta$ is the infinitesimal generator of $P_t$:

$$\left. \frac{\partial}{\partial t} P_t f \,\right|_{t=0} = - \Delta f.$$

Lastly, convergence: when does the discrete Laplacian coincide with the smooth Laplacian?

11. Autoencoders and Intro to Generative Models

NN overview, unsupervised learning, representation learning

Autoencoders

Linear vs. non-linear AEs

PCA

PCA as encoder, decoder

$$\Sigma = \frac{1}{N} \sum_{i=1}^N (x_i - \bar x)(x_i - \bar x)^T = U \Lambda U^T$$

Two views of PCA: maximize variance, minimize error

Reconstruction:

$$E = \sum_{i=1}^n \|x_i - \tilde x_i\|^2$$

is minimized, where $\tilde x_i$ is the projection of $x_i$ onto the first $k$ components of $U$.
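A small sketch of PCA used as a linear encoder/decoder, matching the covariance eigendecomposition and reconstruction error above (illustrative code, not the course's reference implementation):

```python
# Sketch: PCA as encoder/decoder built from the covariance eigendecomposition.
import numpy as np

def pca_fit(X, k):
    x_bar = X.mean(axis=0)
    Sigma = np.cov(X - x_bar, rowvar=False)
    lam, U = np.linalg.eigh(Sigma)
    U_k = U[:, np.argsort(lam)[::-1][:k]]     # top-k principal directions
    return x_bar, U_k

def encode(X, x_bar, U_k):
    return (X - x_bar) @ U_k                  # project onto the principal subspace

def decode(Z, x_bar, U_k):
    return Z @ U_k.T + x_bar                  # reconstruct x_tilde

X = np.random.randn(500, 10) @ np.random.randn(10, 10)
x_bar, U_k = pca_fit(X, k=3)
X_tilde = decode(encode(X, x_bar, U_k), x_bar, U_k)
reconstruction_error = np.sum((X - X_tilde)**2)   # the error E that PCA minimizes
```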

Regularization

Any modification made to a learning algorithm that is intended to reduce its generalization error but not its training error

An example of a bias toward simple solutions would be adding the number of parameters to the loss function; this penalizes model complexity.

Denoising Autoencoders

The process:

Denoising autoencoders have been hypothesized to do manifold learning.

Generative models

Usually noise -> neural net -> spit out something like training sample.

Probabilistic interpretation: randomly sampled noise $X \sim N(\mu, \sigma)$ → neural net → sample from the training distribution $X \sim P(X)$.

Can we modify a regular autoencoder for generation?

See: generalized denoising autoencoder

Walkback training

This way of training trains the DAE to estimate a conditional probability distribution $P(X \mid \tilde X)$. The original paper (Bengio 2013) shows that a consistent estimator of $P(X)$ can be recovered by alternately sampling from the corruption process $C(\tilde X \mid X)$ and the denoising process $P(X \mid \tilde X)$.

However, there are issues with this approach: the data generation process is very slow (like taking a slow walk through the data), it may not span the space, and it is not explicitly penalized to generate the whole distribution.

Variational Autoencoders (VAEs)

Learning latent spaces that allow for sampling, i.e., we want to generate new examples just by sampling in the latent space.

VAEs force the latent vectors to have a roughly unit Gaussian distribution. Generate images by sampling a unit Gaussian and passing it into the decoder.

There exists a tradeoff between accuracy and the unit Gaussian approximation. Build this directly into the loss:
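A sketch of the resulting objective: a reconstruction term plus the KL of the approximate posterior $N(\mu, \sigma^2)$ from the unit Gaussian prior. The $\beta$ weight and the reparameterization helper are assumptions added for illustration, not part of the notes:

```python
# Sketch: VAE loss = reconstruction error + KL(N(mu, sigma^2) || N(0, I)).
import numpy as np

def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    recon = np.sum((x - x_recon)**2)                              # reconstruction accuracy
    kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))     # closed-form Gaussian KL
    return recon + beta * kl                                      # tradeoff weighted by beta

def reparameterize(mu, log_var, rng=np.random.default_rng()):
    # z = mu + sigma * eps keeps sampling differentiable in a real (autodiff) implementation
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
```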

12. Variational Calculus, Pullback Metrics, NeuralFIM

Attach slides

13. Probabilistic Methods for Robust and Transparent Machine Learning

Guest lecture

14.