Geometric and Topological Methods in ML

Notes and relevant papers / readings for geometric and topological methods in machine learning. IN PROGRESS.


2. Geometric graphs, construction, clustering

Basics: metric spaces / metrics, graphs. Geometric graphs have vertices as points in a metric space, with edges defined by low distances in the metric space.

kNN and $\epsilon$-NN graphs. kNN construction via KD tree.
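A minimal sketch of kNN graph construction backed by a tree-based neighbor index (assuming scikit-learn/SciPy are available; the toy data and parameter choices are illustrative, not from the notes):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.rand(200, 3)          # toy point cloud in a metric space (R^3, Euclidean)

# Sparse matrix whose entry (i, j) holds the distance to one of i's k nearest neighbors.
# kneighbors_graph uses a tree-based index (KD tree / ball tree) under the hood.
A = kneighbors_graph(X, n_neighbors=5, mode="distance", include_self=False)

# Symmetrize, since "j is a neighbor of i" is not automatically mutual.
A = A.maximum(A.T)
```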

Shortest path methods: Dijkstra's (single source, $O(|E| + |V|\log|V|)$), Floyd-Warshall (all pairs, $O(|V|^3)$).

Dijkstra: notes

Floyd Warshall: notes

We are usually interested in using shortest paths to mimic geodesic distances. Also, Alamgir (ICML 2012) shows that unweighted 1-NN graphs lead to undesirable shortest paths (the path minimizes the number of hops, so it sometimes cuts through sparse regions that span large metric distances), and that the weighted shortest-path distance converges to the length of the shortest path connecting two nodes.

Manifold learning based graph construction

Distance is converted to affinities via kernels (e.g. Gaussian kernel):

$$\text{affinity}_{i,j} = \exp\left(-\frac{\text{dist}(x_i, x_j)^2}{2\sigma^2}\right).$$

The process usually goes: data → calculate distances → calculate affinities.
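A small sketch of that pipeline (the bandwidth `sigma` and toy data are illustrative assumptions):

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_affinities(X, sigma=1.0):
    """Data -> pairwise distances -> affinities via a Gaussian kernel."""
    dists = cdist(X, X)                            # pairwise Euclidean distances
    return np.exp(-dists**2 / (2 * sigma**2))      # affinity_{ij}

X = np.random.rand(100, 2)
W = gaussian_affinities(X, sigma=0.5)
```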

Graph-based clusters / community detection

Hierarchical clustering:

How do we measure distance between clusters? There are a couple of methods: single, average, and complete linkage. We can think of single linkage as linking the two closest points between two clusters, complete as linking the furthest, and average as linking the average. Single linkage is most common:

Modularity

$A_{ij}$ is the actual affinity between $i$ and $j$.

$P_{ij}$ is the expected affinity between $i$ and $j$ under a null model, calculated as $P_{ij} = \frac{k_i k_j}{2m}$.

Modularity is the sum of $A_{ij} - P_{ij}$ over all pairs of nodes in the same community:

$$Q = \frac{1}{2m}\sum_{i,j}\left(A_{ij} - \frac{k_i k_j}{2m}\right)\delta(c_i, c_j)$$

Modularity optimization: Louvain used to be standard, now most use Leiden. Louvain has two phases: (1) local moving, where each node is greedily reassigned to the neighboring community that most increases modularity, and (2) aggregation, where each community is collapsed into a single node; the two phases repeat until modularity stops improving.

Why do we care about modularity? Based on its definition, we can think of it as a measure of the strength of a graph's division into modules / groups / communities. Networks with high modularity have dense connections between nodes within modules but sparse connections between nodes in different modules.
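A small sketch that evaluates $Q$ directly from the formula above on a toy two-community graph (the graph and labels are made up for illustration):

```python
import numpy as np

def modularity(A, communities):
    """Q = (1/2m) * sum_ij (A_ij - k_i k_j / 2m) * delta(c_i, c_j)
    for a symmetric (possibly weighted) adjacency/affinity matrix A."""
    k = A.sum(axis=1)                 # (weighted) degrees
    two_m = k.sum()                   # 2m = total edge weight (each edge counted twice)
    c = np.asarray(communities)
    delta = (c[:, None] == c[None, :]).astype(float)
    return ((A - np.outer(k, k) / two_m) * delta).sum() / two_m

# Toy example: two triangles joined by a single edge, split into their natural communities.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
print(modularity(A, [0, 0, 0, 1, 1, 1]))   # relatively high Q for this split
```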

3. Data diffusion, diffusion maps, spectral clustering

Representation Learning

We want to find the “best” representation of our data; i.e., we want to preserve as much information as possible while obeying some penalty that simplifies the representation.

What kind of information should be preserved, and how do we define “simple”?

These criteria aren’t necessarily mutually exclusive. Simple representations are good for interpretability (visualization), reducing computational cost (compression), and performance on supervised tasks (can be easier to classify / regress from).

Isomap (Tenenbaum et al., 2000)

MDS (Multidimensional Scaling)

MDS finds an embedding of the data that preserves distances by optimizing point positions such that “stress” is minimized:

$$\text{stress}_D(x_1, x_2, \ldots, x_N) = \sqrt{\sum_{i \neq j} \left(d_{ij} - \|x_i - x_j\|\right)^2}.$$

This can be done with gradient descent and other related procedures.
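A hedged sketch of stress minimization by plain gradient descent (learning rate, iteration count, and the toy data are illustrative choices, not a reference implementation):

```python
import numpy as np
from scipy.spatial.distance import cdist

def mds_gradient_descent(D, dim=2, lr=0.001, n_iter=500, seed=0):
    """Move low-dim points to reduce the squared stress
    sum_{i != j} (D_ij - ||y_i - y_j||)^2 with plain gradient descent."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    Y = rng.normal(size=(n, dim))
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]            # (n, n, dim) pairwise differences
        dist = np.linalg.norm(diff, axis=-1)            # low-dim distances
        np.fill_diagonal(dist, 1.0)                     # avoid division by zero
        coef = (dist - D) / dist
        np.fill_diagonal(coef, 0.0)
        grad = 4 * (coef[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad
    return Y

X = np.random.rand(50, 5)
Y = mds_gradient_descent(cdist(X, X))                   # 2-D embedding preserving distances
```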

Isomap’s weakness: real data is often very noisy, and the shortest path algorithm then yields noisy distances between points. More commonly, we might see UMAP or PHATE (geometry-preserving, common in biology) being used.

Diffusion maps

Diffusion distance can be more robust to noise! Introduced by Coifman and Lafon. 3 step process:

  1. Recall kernel functions (discussed previously for converting distances to affinities). If $(X, \mathcal{A}, \mu)$ is a measurable space, a kernel function is a mapping $K : X \times X \rightarrow \mathbb{R}$ such that for any $x, y \in X$,

$$K(x,y) = K(y,x), \qquad K(x,y) \geq 0.$$

Kernels are generally nonlinear. Diffusion maps are specifically defined with respect to the Gaussian kernel:

$$k(x,y) = e^{-\|x-y\|^2/\epsilon},$$

where $\epsilon$ is a bandwidth parameter. This kernel captures a notion of affinity/similarity between points in $X$.

The Gaussian kernel emphasizes local connections and pushes non-local connections to 00 in a “softer” way than kNN. We can also threshold these outputs for computational purposes (so that the affinity graph wouldn’t be fully connected).

  2. After computing affinities with the Gaussian kernel, we define the degree matrix. For a weighted graph, the degree of a vertex $x$ is the sum of its similarities/affinities to its neighbors:

$$q(x) = \sum_{i=0}^{n-1} k(x, v_i).$$

The degree matrix is a diagonal matrix where $d_{i,i} = q(\text{vertex } i)$.

  3. We can now define the “diffusion operator”

$$P(x,y) = k(x,y)/q(x),$$

a matrix of transition probabilities with row-normalized affinities. Therefore, this matrix is Markov & describes a random walk over the data.

Since $P$ represents the 1-step transition probabilities, $P^t$ represents the $t$-step random walk probabilities.

The diffusion map representation is a spectral embedding; that is, $P$ has a spectral decomposition that we can use to assign coordinates in the embedding to each data point.

Consider the matrix $A$ where

$$a(x,y) = \frac{k(x,y)}{\sqrt{q(x)}\sqrt{q(y)}} = q(x)^{1/2}\, p(x,y)\, q(y)^{-1/2}.$$

Then $A = Q^{1/2} P Q^{-1/2}$ is symmetric, so there exists an eigendecomposition. In fact, $P$ and $A$ are similar, so they must have the same eigenvalues and related eigenvectors.

Let the eigenvectors of $A$ be $\phi_0, \phi_1, \ldots, \phi_{n-1}$.

These eigenpairs are used to compute the spectral decomposition of $P$:

For $1 = \lambda_0 \geq \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_\delta > 0$ and

$$\begin{pmatrix} \phi_0 & \phi_1 & \phi_2 & \cdots & \phi_\delta \end{pmatrix}$$

where each $\phi_i$ is a column eigenvector from $A$, we can map a datapoint $x$ to its embedding coordinates as:

$$x \mapsto \phi^t(x) = \left[\lambda_0^t \phi_0(x),\ \lambda_1^t \phi_1(x),\ \lambda_2^t \phi_2(x),\ \ldots,\ \lambda_\delta^t \phi_\delta(x)\right]^T.$$
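Putting the three steps together, a minimal diffusion map sketch (bandwidth `eps`, diffusion time `t`, and the toy data are illustrative; here $A = Q^{-1/2} K Q^{-1/2}$, which is the same operator as $Q^{1/2} P Q^{-1/2}$ above):

```python
import numpy as np
from scipy.spatial.distance import cdist

def diffusion_map(X, eps=1.0, t=1, n_components=2):
    """Minimal diffusion map: Gaussian kernel -> degrees -> symmetric operator A
    -> eigenpairs -> lambda^t-weighted coordinates."""
    K = np.exp(-cdist(X, X) ** 2 / eps)          # step 1: affinities
    q = K.sum(axis=1)                            # step 2: degrees
    A = K / np.sqrt(np.outer(q, q))              # step 3: symmetric conjugate of P
    evals, evecs = np.linalg.eigh(A)             # eigh, since A is symmetric
    idx = np.argsort(evals)[::-1]                # decreasing eigenvalue order
    evals, evecs = evals[idx], evecs[:, idx]
    phi = evecs / np.sqrt(q)[:, None]            # recover right eigenvectors of P from A
    # Skip the trivial first coordinate and weight by lambda^t.
    return (evals[1:n_components + 1] ** t) * phi[:, 1:n_components + 1]

X = np.random.rand(300, 3)
emb = diffusion_map(X, eps=0.1, t=2)
```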

Properties of the spectrum

Powers of the matrix $A$ push the lower eigenvalues even lower (the map $\lambda \mapsto \lambda^t$ shrinks small eigenvalues fastest):

notes

Hence powers of $A$ denoise in a low-pass filtering manner. When we map $x$ to its new coordinates, we can use only the first $k$ eigenvalues/eigenvectors due to the spectrum decay while still maintaining accuracy, which reduces our dimensionality.

Diffusion distance

Diffusion maps preserve diffusion distance, a family of $L^2$ distances between the transition probability distributions of two points:

$$D_t(x,y)^2 = \|p_t(x,\cdot) - p_t(y,\cdot)\|^2_{L^2(X,\, d\mu/\pi)}.$$

This is essentially the Euclidean distance between rows of the diffusion operator.

Diffusion distance can also be interpreted as the overall difference in walking probabilities after $t$ steps when starting out at point $x$ vs. point $y$.

Diffusion maps do a good job at cleaning data, but are not necessarily great at visualization. This is partially because diffusion maps split information into multiple orthogonal directions, e.g. plotting in 2-D will likely only show us 2-ish trajectories.

notes

Partitioning data with K-means

In k-means, we want to find $k$ splits/partitions to minimize the variance within each cluster:

k-means converges because each reassignment lowers the MSE:

$$\operatorname*{arg\,min}_S \sum_{i=1}^k \sum_{x \in S_i} \|x - \mu_i\|^2 = \operatorname*{arg\,min}_S \sum_{i=1}^k |S_i|\operatorname{var}(S_i).$$

Therefore, this optimization will reach a local minimum. However, k-means only works well on data with certain characteristics. It assumes:

It is also sensitive to initialization and outliers.

An alternative is spectral clustering. Spectral clustering can use eigenvectors of the diffusion operator or Laplacian. Advantages of using the spectrum for k-means:

How it works:
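A hedged sketch of the usual recipe (affinities → normalized Laplacian → bottom eigenvectors → k-means on the spectral coordinates); the bandwidth, cluster count, and row-normalization choice are illustrative:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, n_clusters=2, eps=0.5):
    """Affinities -> normalized Laplacian -> bottom eigenvectors -> k-means."""
    W = np.exp(-cdist(X, X) ** 2 / eps)                 # Gaussian affinities
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt    # normalized Laplacian
    evals, evecs = np.linalg.eigh(L)
    U = evecs[:, :n_clusters]                           # eigenvectors of smallest eigenvalues
    U = U / np.linalg.norm(U, axis=1, keepdims=True)    # row-normalize the spectral coordinates
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)
```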

4. Data diffusion, continuous diffusion, PHATE

Previously (3), diffusion maps.

Connections between discrete and continuous diffusion

For finite bandwidth $\epsilon$, the Markov chain in discrete time and space converges to a Markov process in discrete time but continuous space.

As $\epsilon \rightarrow 0$, this jump process converges to a diffusion process on $\mathbb{R}^n$. In the continuum limit of infinitely many points and continuous time, this resembles a Fokker-Planck diffusion process.

Fokker-Planck Diffusion Process

Describes diffusion as a mix of random Brownian forces $W_t$ and drag forces from a potential $U(x)$:

$$dX_t = \mu(X_t, t)\,dt + \sigma(X_t, t)\,dW_t.$$

Given an initial condition at time $s < t$, this follows the forward Fokker-Planck equation (FPE):

$$\frac{\partial p}{\partial t} = \nabla \cdot (\nabla p + p\,\nabla U(x)).$$

Here $\Delta$ refers to the Laplacian $\Delta u = \nabla \cdot (\nabla u)$, the divergence of the gradient.

General solutions of the FPE are given by the eigenfunctions of the FPE operator. In our case, the potential is the log of the data density (so drag is related to data density).

Data density

The degree $D(x_i) = q(x_i) = \sum_j A(x_i, x_j)$ is a proxy for the data density at the point $x_i$.

In the setting of our data, the “potential” $U$ is the negative log of the data density:

$$U(x) = -\log(q(x)).$$

Essentially, the FPE is saying we have a diffusion process that has “drag” because of data density.

Anisotropic normalization

Instead of the diffusion kernel from before, we can define a different symmetric kernel $M_S = D^{-\alpha} A D^{-\alpha}$:

$$M_S(x,y) = \frac{k(x,y)}{q(x)^\alpha\, q(y)^\alpha}.$$

If $\alpha = 1$, this kernel has uniform degree (all points effectively have density 1).
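A minimal sketch of the $\alpha$-normalization (function and variable names are illustrative):

```python
import numpy as np

def anisotropic_operator(K, alpha=1.0):
    """Divide out the estimated density q(x)^alpha, then row-normalize
    to get a Markov (diffusion) operator."""
    q = K.sum(axis=1)                                   # density estimate at each point
    M = K / (np.outer(q, q) ** alpha)                   # M_S(x, y) = k(x, y) / (q(x)^a q(y)^a)
    return M / M.sum(axis=1, keepdims=True)             # row-stochastic diffusion operator
```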

Convergence to Laplace-Beltrami

If you set $\alpha = 1$, then $U = -\log(1) = 0$. Then the second term in the FPE drops out:

$$\frac{\partial p}{\partial t} = \nabla \cdot (\nabla p + p\,\nabla U(x))$$

and we are left with

$$\frac{\partial p}{\partial t} = \Delta p,$$

which is purely controlled by the Laplacian. This is a second derivative operator and purely geometric, hence separating density from geometry.

Summary: In the limit of infinitely many points and continuous time, the FPE represents the diffusion process. We need to reconcile this with the drag term $U$ in the FPE, which is given by the density of the data. We resolve this by dividing the density out with the anisotropically normalized kernel, which converges to the Laplace-Beltrami operator, successfully separating density from geometry.

Application of diffusion maps to clustering / PHATE

Previously (3): k-means clustering algorithm, spectral clustering using diffusion map coordinates.

PHATE goal: Create a visualization that preserves data geometry in low dimensions.

notes

PHATE uses adaptive bandwidths/an alpha-decaying kernel:

$$K_{k,\alpha}(x,y) = \frac{1}{2}\exp\left(-\left(\frac{\|x-y\|_2}{\epsilon_k(x)}\right)^\alpha\right) + \frac{1}{2}\exp\left(-\left(\frac{\|x-y\|_2}{\epsilon_k(y)}\right)^\alpha\right),$$

where $\epsilon_k(x)$ is the distance from $x$ to its $k$-th nearest neighbor (an adaptive bandwidth) and $\alpha$ controls how quickly the kernel decays.

Plot of alpha decay kernel:

notes

Spectral entropy

notes

Potential distance in PHATE

notes

Diffusion components track one branch at a time, zeroing out everything else. Makes them good for clustering but not great for visualization. See image for example:

notes

Other remarks

5. tSNE and UMAP

Ways to compare probability distributions

Entropy

Represents the expected amount of uncertainty in a distribution:

$$H(P) = -\sum_i P(x_i)\log(P(x_i)).$$

Entropy measures information you learn by knowing the solution.

Cross Entropy

Given 2 distributions $P, Q$, how many bits on average does it take to “encode $Q$” with a code that is optimized for $P$?

$$H(P,Q) = -\sum_i Q(x_i)\log(P(x_i)).$$

Example: encoding a distribution $Q$.

If we have 3 outcomes $a, b, c$ and $q(a) = 1/2$, $q(b) = q(c) = 1/4$, we can encode $a \to 0$, $b \to 10$, $c \to 11$ (code lengths 1, 2, 2).

Then the average number of bits needed is

$$(1/2)\cdot 1 + (1/4)\cdot 2 + (1/4)\cdot 2 = 1.5.$$

Suppose we use the code for $Q$ on a distribution $P$ now, where $p(a) = p(b) = 1/4$ and $p(c) = 1/2$. Then the average number of bits needed is

$$(1/4)\cdot 1 + (1/4)\cdot 2 + (1/2)\cdot 2 = 1.75.$$

KL divergence

The expected number of extra bits needed when using samples from $P$ with a code optimized for $Q$. This is related to cross entropy and sometimes called relative entropy:

$$D_{KL}(P\|Q) = \sum_i P(x_i)\log\left(\frac{P(x_i)}{Q(x_i)}\right) = -\sum_i P(x_i)\log\left(\frac{Q(x_i)}{P(x_i)}\right).$$

Note that

$$D_{KL}(P\|Q) = H(P,Q) - H(P).$$

Important properties: $D_{KL}(P\|Q)$ is not a distance. It is not the same as $D_{KL}(Q\|P)$ and does not satisfy the triangle inequality.
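A small numerical check of these definitions, reproducing the bit counts from the coding example above (base-2 logs; function names and argument order are illustrative assumptions):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(data_dist, code_dist):
    # Average bits to encode samples from data_dist using a code optimized for code_dist.
    data_dist, code_dist = np.asarray(data_dist, float), np.asarray(code_dist, float)
    return -np.sum(data_dist * np.log2(code_dist))

def kl_divergence(p, q):
    return cross_entropy(p, q) - entropy(p)

p = np.array([0.25, 0.25, 0.5])     # data distribution P
q = np.array([0.5, 0.25, 0.25])     # distribution Q the code was built for
print(entropy(q), cross_entropy(p, q), kl_divergence(p, q))   # 1.5, 1.75, 0.25
```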

Jensen Shannon Distance

JS divergence is symmetric but still does not follow the triangle inequality. Can be made a proper distance by taking a square root of the Jensen Shannon divergence.

tSNE: t-Stochastic Neighbor Embedding

Concept: Maintain neighborhoods, preserve the data’s shape both locally and globally. notes

Stochastic neighbors: tSNE creates a probability distribution that represents similarities between neighbors, i.e., the conditional probability that one point would pick another as its neighbor. The similarity for a point $x_i$ is defined to be proportional to a probability density under a Gaussian centered at $x_i$.

The variance (bandwidth parameter) of the Gaussian at each point $x_i$ is set to a $\sigma_i$ such that the Shannon entropy of its neighbor distribution is a fixed value.

$$H(p(x_i, \cdot)) = -\sum_j p(x_i, x_j)\log(p(x_i, x_j))$$

$$\text{perplexity} = 2^{H(p(x_i, \cdot))} = \text{effective number of neighbors}$$

This fixed value is set by a parameter we choose called the perplexity.

Penalty: tSNE minimizes KL divergence, i.e., we want to match the probability distributions of high dimensional and low dimensional neighbors. Optimize with SGD.
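A hedged sketch of how each bandwidth $\sigma_i$ can be calibrated to a target perplexity by binary search (this mirrors the idea above; the fixed-step search and names are illustrative, not the reference implementation):

```python
import numpy as np
from scipy.spatial.distance import cdist

def conditional_probs(dists_i, sigma):
    """Neighbor distribution p_{j|i} under a Gaussian centered at x_i (self already excluded)."""
    p = np.exp(-dists_i ** 2 / (2 * sigma ** 2))
    return p / p.sum()

def sigma_for_perplexity(dists_i, target_perplexity=30.0, n_steps=50):
    """Binary search for the sigma_i whose neighbor distribution has perplexity 2^H ~ target."""
    lo, hi = 1e-10, 1e4
    for _ in range(n_steps):
        sigma = (lo + hi) / 2
        p = conditional_probs(dists_i, sigma)
        H = -np.sum(p * np.log2(p + 1e-12))
        if 2 ** H > target_perplexity:      # too many effective neighbors -> shrink sigma
            hi = sigma
        else:
            lo = sigma
    return sigma

X = np.random.rand(200, 10)
D = cdist(X, X)
i = 0
dists_i = np.delete(D[i], i)                # distances from x_i to all other points
sigma_i = sigma_for_perplexity(dists_i, target_perplexity=30.0)
```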

Note on SNE vs tSNE: high-dimensional similarities are measured as shown below and use an adaptive bandwidth. The low-dimensional similarities are computed differently depending on whether we use SNE or tSNE.

notes notes

In both SNE and tSNE, we are moving points around to minimize KL divergence. However, the problem with normal SNE is that there is a low penalty for moving points (in low dim) around if they are far away in high dimensions.

notes

To fix this, tSNE uses the Student-t distribution in the low-dimensional space. The fatter tails keep the low-dimensional similarities a little higher for distant pairs, so points which are far away can still be reasonably penalized in their low-dimensional representations. This also helps with crowding.

notes

notes

UMAP

Uniform Manifold Approximation and Projection = UMAP.

Assumes:

Effectively uses a tSNE-like method to embed data.

$$\mu_{i \to j} = \begin{cases} \exp\big(-(d(x_i, x_j) - d(x_i, x_{i_1}))/\sigma_i\big) & \text{for } j \in \{i_1, \ldots, i_k\} \\ 0 & \text{else.} \end{cases}$$

$$\mu_{ij} = \mu_{i \to j} + \mu_{j \to i} - \mu_{i \to j}\,\mu_{j \to i} \in [0,1].$$
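A tiny sketch of this fuzzy-union symmetrization on a toy matrix of directed memberships (the matrix values are made up for illustration):

```python
import numpy as np

def fuzzy_union(mu_directed):
    """Symmetrize directed membership strengths with the probabilistic t-conorm:
    mu_ij = mu_{i->j} + mu_{j->i} - mu_{i->j} * mu_{j->i}."""
    return mu_directed + mu_directed.T - mu_directed * mu_directed.T

# Toy directed memberships in [0, 1]; in UMAP these come from the exponential
# kernel on distances beyond each point's nearest neighbor.
mu = np.array([[0.0, 0.9, 0.0],
               [0.2, 0.0, 0.7],
               [0.0, 0.1, 0.0]])
print(fuzzy_union(mu))
```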

Uses a binary cross-entropy loss, a penalty for modeling similar points as dissimilar and vice versa:

$$\mathcal{L}(\{e_i\} \mid \{\mu_{ij}\}) = -2\sum_{1 \leq i < j \leq n}\left[\mu_{ij}\log(v_{ij}) + (1 - \mu_{ij})\log(1 - v_{ij})\right].$$

Initialization to a Laplacian eigenmap (Belkin and Niyogi), a geometric method. The “effective” loss function says that UMAP effectively binarizes neighbors.

Scalability: neighbors and non-neighbors are just sampled from the data graph. Therefore, we don’t have to compute a whole affinity matrix or Markov matrix like in tSNE, hence it is much faster and more scalable.

Attraction-repulsion continuum: Unlike tSNE, something being close doesn't mean something else has to be far (an effect of tSNE's row-stochastic normalization). Therefore, UMAP visualizations have a less “explosive” look:

notes

UMAP vs tSNE

6. Graph Laplacians and Graph Signal Processing

Laplacian

Laplacian = divergence of the gradient, sum of all the unmixed second derivatives at any point.

$$\Delta f = \nabla \cdot (\nabla f) = \nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} + \frac{\partial^2 f}{\partial z^2}.$$

In 1-D, the Laplacian is the second derivative, which measures the curvature of a function. In multiple dimensions, the Laplacian at a point measures how much the function's value deviates from the average of its values in a small neighborhood around that point, which is related to mean curvature.

The heat equation uses the Laplacian,

$$\frac{\partial u}{\partial t} = \Delta u,$$

where $\frac{\partial u}{\partial t}$ represents the rate at which a quantity diffuses at a given point. Other applications of the Laplacian include computer vision, quantum mechanics (the Hamiltonian), and electromagnetism (the Poisson equation).

The Laplacian is a linear operator:

$\cos(\omega x)$ and $\sin(\omega x)$ are eigenfunctions of the Laplacian in Euclidean space (i.e., the second derivative of each function is a scalar multiple of the original function).

Discrete and Graph Laplacian

A discrete Laplacian is a discrete analog of the continuous Laplacian, often defined on a grid. On a graph, the Laplacian is a linear operator (a matrix) that represents how a function's value at a vertex differs from its average over neighboring vertices. Therefore, the graph Laplacian is defined as the degree matrix minus the adjacency matrix,

$$L = D - A,$$

which measures how a vertex is different from its neighbors.

A quick note on the finite difference approximation, where the continuous second-order partial derivatives of the Laplacian are replaced with finite differences using Taylor series expansions, e.g.:

$$\frac{\partial^2 u}{\partial x^2} \approx \frac{u(x+h, y) - 2u(x,y) + u(x-h,y)}{h^2}.$$

notes

We can think of the graph Laplacian as a discrete version of finite difference approximation.

By the spectral theorem, $L$ has an eigendecomposition into an orthonormal eigenbasis, $L = U\Lambda U^T$, where $LU_0 = \lambda_0 U_0, LU_1 = \lambda_1 U_1, \ldots$

Eigenvalues of $L$ lie in the range $0 \leq \lambda_0 \leq \lambda_1 \leq \cdots \leq \lambda_n$.

We often take $L$ to be the normalized Laplacian $L = I - D^{-1/2} A D^{-1/2}$, which helps remove the effect of density.

$0$ is an eigenvalue, with eigenvector $(1, 1, 1, \cdots)$ for the unnormalized Laplacian. The second smallest eigenvalue, called the “Fiedler value,” measures graph connectivity.
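A small numerical illustration: build $L = D - A$ for two triangles joined by one edge and inspect the Fiedler value and vector (toy example, NumPy only):

```python
import numpy as np

# Two triangles joined by one weak edge: the Fiedler vector should separate them.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

D = np.diag(A.sum(axis=1))
L = D - A                                   # combinatorial graph Laplacian

evals, evecs = np.linalg.eigh(L)            # eigenvalues in increasing order
print(np.round(evals[0], 8))                # ~0, with a constant eigenvector
print(evals[1])                             # Fiedler value: small for a weakly connected graph
print(np.sign(evecs[:, 1]))                 # signs of the Fiedler vector split the two triangles
```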

notes

Laplacian Eigenmaps

Belkin & Niyogi, maps each point to its Laplacian eigenvector coordinates.

$$x_i \mapsto (U_1(x_i), U_2(x_i), \ldots, U_k(x_i))$$

where each $U_i$ is an eigenvector of $L = D - K$. ($x_i$ literally gets mapped to the $i$-th entry of each eigenvector, using as many eigenvectors as the dimension you are reducing to.)

We can recast this as a smoothness minimization problem: a desirable embedding may be one where nearby points are similar. We choose a Gaussian kernel $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2/\epsilon)$.

$$C(Y) = \frac{1}{2}\sum_{i,j}(y_i - y_j)^2 K(x_i, x_j) = y^T L y.$$

Based on Courant-Fischer, the minimizer of this is $U_0$ (the trivial constant eigenvector), and the next best minimizer is $U_1$.

notes

Graph Signal Processing

Fourier transform of a signal:

$$\hat{f}(k) = \int_{-\infty}^{\infty} f(x)\, e^{-2\pi i k x}\,dx.$$

Usually, we don’t have the functional form available to compute the integral, so we have to use discrete samples and discrete wavelengths for the discrete Fourier transform.

notes

DFT is a matrix multiplication:

notes

We can interpret features as signals on a graph. Applying the graph Fourier transform (built from the eigendecomposition $L = U\Lambda U^T$), we can decompose signals into graph Fourier coefficients.

Spectral Graph Filtering

Introduced by Stankovic et al. 2018. Interested in finding which frequencies are noise, then altering the frequencies to retrieve the corrected signal.

A filter is generally constructed from some function $H$ that modifies the eigenvalues, $H(\lambda_i)$ (see below).

notes

Data signals tend to be low frequency while noise tends to be high frequency. We can remove noise by removing the high-frequency eigenvector components. A low-pass filter is one that lets low frequencies through and filters out high frequencies.
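A minimal sketch of an ideal low-pass graph filter applied in the spectral domain (the cutoff and names are illustrative):

```python
import numpy as np

def low_pass_filter(L, signal, cutoff):
    """Graph Fourier transform of a signal, zero out high-frequency components, transform back.
    'Frequency' here means the Laplacian eigenvalue associated with each eigenvector."""
    evals, U = np.linalg.eigh(L)            # graph Fourier basis
    coeffs = U.T @ signal                   # graph Fourier coefficients
    H = (evals <= cutoff).astype(float)     # ideal low-pass response H(lambda)
    return U @ (H * coeffs)                 # filtered signal back in the vertex domain
```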

notes

Problems with frequency domain filters:

K-filter coefficients

Reduce number of coefficients in filter

Chebyshev Polynomials

Use a polynomial filter; expressing the filter as a polynomial of the Laplacian applies it directly to the eigenvalues without explicitly computing the eigenvectors.

notes

From Fourier to Wavelets

A solution: wavelets! Unlike a sine or cosine function, a wavelet is a localized oscillation. They can be scaled (to look at low vs. high frequency components) and translated (to look at different times and locations).

Discrete wavelet transform

notes

Diffusions can be used to create wavelets. Diffusion wavelets are formed by taking differences between two scales of diffusion.

7. Graph Neural Networks and Geometric Scattering

Rough neural network overview. Covered:

Design choices

Graph neural network connection

Graph notation–adjacency matrix can be weighted, vertex features are graph signals.

Message passing: aggregate information from neighbors, then update. There can be several message passing iterations. After $K$ iterations you get a latent embedding of node $u$: $z_u = h_u^{(K)}$.

Aggregation and update steps can involve trainable weight matrices.
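A bare-bones sketch of one message passing iteration with sum aggregation (random weights stand in for trained ones; this is illustrative, not a specific published architecture):

```python
import numpy as np

def message_passing_layer(A, H, W_self, W_neigh):
    """One sum-aggregation message passing step:
    h_u^{(k+1)} = ReLU( W_self^T h_u^{(k)} + W_neigh^T * sum_{v in N(u)} h_v^{(k)} )."""
    messages = A @ H                               # aggregate neighbor features (sum)
    return np.maximum(0, H @ W_self + messages @ W_neigh)

# Toy graph with 4 nodes and 3-dimensional node features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
H = np.random.rand(4, 3)
W_self, W_neigh = np.random.rand(3, 8), np.random.rand(3, 8)
Z = message_passing_layer(A, H, W_self, W_neigh)   # node embeddings after one iteration
```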

notes

notes

We can train GNNs to do node, edge, graph, and signal level tasks. Common tasks were node/edge/graph tasks.

Node level tasks: node classification/regression. Map node embeddings directly via node head to prediction.

notes

Link prediction: predict a link that should be there based on neighboring links.

Graph level tasks: graph classification (does this graph have a 5-clique?). Graph regression (does the molecule this graph encodes bind to a virus?). Usually an additional layer of pooling. Also a pooling strategy called DiffPool.

notes

Signal level task: features on vertices constitute a signal on a graph. Some graphs have more signals than vertices / edges. Classifying signals is largely unrecognized but important. Classic example is classifying traffic patterns.

  1. Permutation equivariance: graph and node representations are the same regardless of how nodes are ordered in the adjacency matrix. Achieved via permutation equivariant aggregation operations like mean, sum, max.

  2. Inductive capability: sharing weights across nodes

These types of GNNs are limited and message passing can lead to oversmoothing.

Weisfeiler-Lehman Test of Isomorphism

Landmark result: GNNs are only as powerful as W-L test of isomorphism. notes

This test results in false positives (e.g. we can get graphs colored the same but aren’t isomorphic).

Some solutions

GraphSAGE

Spectral Construction

Features are signals on graph. These signals have an analogous Graph Fourier Transform, which involves using the eigenvectors of the graph Laplacian as waveforms (see previous lecs). Some spectral methods won’t directly compute GFT, will instead implement filter as polynomial of graph Laplacian.

See previous lec for Graph Laplacian review.

What does it mean to create filters?

notes

We want to learn these filters with a neural network, which is what graph spectral nns were designed to do.

notes

Problems

notes

Wavelet-based constructions

See previous lec on continuous and discrete wavelet transforms (don’t need to know). Waveform centered at every node performs message aggregation.

Graph wavelet networks are usually created with a diffusion operator. Diffusion wavelets differ from lazy random walks (plain diffusion) because they capture differences in diffusion between dyadic powers.

Geometric Scattering

Alternative to GNNs. Use a cascade of diffusion wavelets to create a “deep” wavelet transform of a signal on a graph (instead of message aggregation and updating). The absolute values are pointwise non-linearity or “activation”. Cascade is parameterized by a scattering path that determines the scales of each wavelet.

Whole graph representation formed by taking an aggregation over all node level features.

8. Optimal Transport on Graphs

Recall KL, JS divergence–divergences have issues when distributions have disjoint support. Optimal transport–how hard is it to push/transform one distribution into another (Wasserstein)?

Wasserstein Distance & Optimal Transport

$$W_d(P,Q) = \inf_{\pi \in \Pi(P,Q)} \int d(x,y)\, \pi(dx, dy)$$

Ground “cost” or distance:

$$d(x,y) = \|x - y\|_2$$

(there are also different $L^p$ Wasserstein distances). Proportional to the amount of work needed to turn one distribution into another.

Want to minimize

$$c = \sum_{i,j} P_{i,j} D_{i,j},$$

the transport cost, subject to the constraints

$$\sum_j P_{i,j} = r_i,$$

for $r_i$ the row marginal, and

$$\sum_i P_{i,j} = c_j,$$

with $c_j$ the column marginal. This is a linear program to optimize.

On Graphs

Distributions are normalized positive graph signals (if we have negative signal values, we can shift all signals to being positive).

Entropy Regularized Optimal Transport

$$\inf_{\pi \in \Pi(P,Q)} \int d(x,y)^p\, \pi(dx, dy) - \epsilon h(\pi),$$

$$h(\pi) := -\int \log \pi \, d\pi.$$

Suppose you wanted to maximally spread the probability mass. Then you might want to spread each bin of the “source” distribution $r(i)$ proportionally according to the needs of the sink distribution $c(j)$. This particular joint probability distribution is actually the independence table $rc^T$, the outer product of $r$ and $c$. Thus we can add a penalty for $P$ deviating from $rc^T$.

Discrete problem formulation with entropic constraint, solvable with Lagrange multipliers.

Sinkhorn

Primal method of optimal transport.
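A hedged sketch of the Sinkhorn scaling iterations for the entropy-regularized problem (regularization strength, iteration count, and the toy histograms are illustrative choices):

```python
import numpy as np

def sinkhorn(r, c, D, eps=0.1, n_iter=200):
    """Entropy-regularized OT via Sinkhorn scaling: find P = diag(u) K diag(v)
    with K = exp(-D / eps), matching row marginals r and column marginals c."""
    K = np.exp(-D / eps)
    u = np.ones_like(r)
    for _ in range(n_iter):
        v = c / (K.T @ u)                   # rescale to match column marginals
        u = r / (K @ v)                     # rescale to match row marginals
    P = np.diag(u) @ K @ np.diag(v)         # transport plan
    return P, np.sum(P * D)                 # plan and (unregularized part of the) cost

# Toy example: two histograms over 5 bins with a squared-distance ground cost.
x = np.arange(5, dtype=float)
D = (x[:, None] - x[None, :]) ** 2
r = np.array([0.5, 0.2, 0.1, 0.1, 0.1])
c = np.array([0.1, 0.1, 0.1, 0.2, 0.5])
P, cost = sinkhorn(r, c, D)
```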

notes

Dual form of optimal transport

Kantorovich Rubenstein Duality

notes

Instead of finding ways to transport from one bin to another (primal), you try to find a signal whose likelihood is very different between two distributions:

notes

In the image, the signal has high likelihood under the top distribution and low likelihood under the bottom distribution. Use multiscale histograms/density estimates of the data:

This has also been done with trees (tree Wasserstein).

Dual form computations vs primal

Diffusion EMD

Uses diffusion wavelets at different scales to decompose signals on graphs. Taking the $L^1$ difference between the resulting multiscale representations of two signals gives diffusion EMD.

notes

Embeddings using diffusion EMD

Create a higher-level distance matrix using pairwise EMD. Visualize with PHATE to preserve manifold geometry.

Ground manifold distance and EMD: take ground distance as diffusion distance.

notes

9. Riemannian Geometry and ML Applications

Topological space

Let $X$ be a set and $\tau$ a family of subsets. $\tau$ is a topology on $X$ if:

Topological manifolds

A map $f: X \rightarrow Y$ between two topological spaces is a homeomorphism if:

Charts and Atlases

A neighborhood $U$ together with a homeomorphism $x: U \rightarrow \mathbb{R}^n$ is a chart. Manifolds which are not homeomorphic to $\mathbb{R}^n$ necessarily require more than one chart. The collection of these charts together is an atlas, and the charts allow us to create local coordinate systems for each neighborhood and a tangent basis for it.

We use this idea in manifold learning by using Euclidean distances only to find local neighbors. Recall that we compute pairwise distances when computing kernels. The pointwise kernel function (e.g. Gaussian) eliminates non-local distances.

Importance of neighborhoods: In some manifold learning methods (UMAP, tSNE), there is an explicit computation of neighborhoods. The idea is to match local neighbors between the high- and low-dimensional embeddings: if the charts are matched, then manifold information is carried to low dimensions. This breaks if the manifold is not well sampled (UMAP's uniformly-sampled-manifold assumption).

Differentiable Manifolds

A differentiable manifold is a topological manifold with a differential structure. A topological manifold can be given a differential structure locally by using the homeomorphisms in its atlas and the standard differential structure on a vector space.

To include a global differential structure on the local coordinate systems induced by the homeomorphisms, their compositions on chart intersections in the atlas must be differentiable functions on the corresponding vector space. Where the domains of charts overlap, the coordinates defined by each chart are required to be differentiable with respect to the coordinates defined by every chart in the atlas. The maps that relate the coordinates defined by the various charts to one another are called transition maps.

notes

The tangent space of a manifold at a specific point is denoted $T_pM$. It is an $n$-dimensional vector space that locally approximates the manifold at that point.

notes

In ML methods where we look at how to move from point to point on a manifold, this assumes some kind of smoothness (differentiability). Therefore, we’re interested in Riemannian geometry.

Riemannian metric

Given a smooth topological manifold $M$, we introduce geometry to the manifold by prescribing an inner product:

notes

This allows us to define ideas of length and distance: notes

notes

notes

The metric is a local element of volume or length. It allows us to build up the real volume in small steps. This is exactly what we do when we compute the length of a path on a graph, albeit discretely. When we find a shortest path we are finding the infimum between these paths.

The volume can also be used for density estimation on the manifold: if you have many points in a nbhd, you can divide by the local volume to get local density.

Curvature

Intuition: one way to define curvature in low dimensions is with the osculating circle (a circle that kisses a curve at a point by being tangent to it and having the same curvature as the curve at that point).

In 2-D, there is Gaussian curvature. There are infinitely many curves that pass through any point on a 2-D surface. Locally at any point $p$ we can find a unit normal vector $N(p)$ that is perpendicular to the surface at $p$. This defines the Gauss map $N: S \rightarrow S^2$ from the surface to the unit sphere. We can measure curvature by measuring how fast $N(p)$ changes.

notes notes

In higher dimensions, one can look at how a higher dimensional shape is deformed as it moves through the manifold. This requires covariant derivatives, the curvature tensor, the Ricci tensor, etc.

Gaussian vs Ricci Curvature

Curvature on Graphs

Ollivier-Ricci curvature compares the distance between two points with the optimal transport cost of moving one point's 1-hop neighborhood onto the other's.

There are other notions of curvatures on graphs, such as the diffusion curvature: notes

Bishop-Gromov gives us theorems about the volume of balls on Riemannian manifolds. notes

A comparable volume theorem for diffusion/graphs: notes

Forman-Ricci curvature: useful in GNNs, helps combat “oversquashing” of information, i.e., too-tight information bottlenecks. Negatively curved edges can be reinforced to pass additional information.

10. Geometric Manifold Learning

Connecting the discrete Laplacian to the continuous one: when do they converge?

Energy in the discrete setting:

$$f^T L f = \sum_{(i,j) \in E} w_{ij}(f_i - f_j)^2.$$

For the energy in the smooth setting, we would want

$$E(f) = \langle \nabla f, \nabla f \rangle = |\nabla f|^2.$$

We claim that

$$\int_M \langle \nabla f, \nabla g \rangle \, d\mu = \int_M f(-\Delta g)\, d\mu.$$

Proof. By the product rule for the divergence,

$$\operatorname{div}(f\nabla g) = \langle \nabla f, \nabla g \rangle + f\,\Delta g.$$

Integrating over the manifold on both sides and applying Stokes' theorem,

$$\int_M \langle \nabla f, \nabla g \rangle \, d\mu + \int_M f\,\Delta g\, d\mu = \int_M \operatorname{div}(f\nabla g)\, d\mu = 0.$$

We also have a similarity in Rayleigh quotients. In the discrete setting, the Rayleigh quotient gives

$$\min_{\|f\| = 1,\ f \perp \mathbf{1}} f^T L f = \lambda_2.$$

In the smooth setting, the analogous quantity is

$$\min_f \frac{\int_M |\nabla f|^2\, d\mu}{\int_M f^2\, d\mu},$$

minimized over functions orthogonal to the constants.

Next, connecting diffusions. Let $L = D - A$ be the (discrete) Laplacian and $P = AD^{-1}$ the column-normalized random walk operator (the transpose of the row-normalized version $P = D^{-1}A$). If $f$ is an initial distribution (e.g., of heat) on the nodes of the graph, then $Lf$ represents the rate of change of the diffusion process over the graph and $Pf$ gives the distribution after one step of the random walk. The discrete Laplacian $L$ is the infinitesimal generator of $P$.

Compare this to the heat equation in the smooth setting,

$$\partial_t u(t,x) = \Delta u(t,x).$$

The solution is given by the heat semigroup,

$$P_t f = e^{-t\Delta} f.$$

The Laplacian $\Delta$ is the infinitesimal generator of $P_t$:

$$\frac{\partial}{\partial t} P_t f \,\Big|_{t=0} = -\Delta f.$$

Lastly, convergence: when does the discrete Laplacian coincide with the smooth Laplacian?

11. Autoencoders and Intro to Generative Models

NN overview, unsupervised learning, representation learning

Autoencoders

Linear vs. non-linear AEs

PCA

PCA as encoder, decoder

$$\Sigma = \frac{1}{N}\sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T = U\Lambda U^T$$

Two views of PCA: maximize variance, minimize error

Reconstruction:

$$E = \sum_{i=1}^n \|x_i - \tilde{x}_i\|^2$$

is minimized, where $\tilde{x}_i$ is the projection of $x_i$ onto the first $k$ components of $U$.
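A short sketch of PCA written as an encoder/decoder pair, with its reconstruction error (toy data; eigendecomposition of the empirical covariance):

```python
import numpy as np

def pca_encode_decode(X, k):
    """PCA viewed as a linear autoencoder: encode = project onto the top-k eigenvectors
    of the covariance; decode = map back and add the mean."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    cov = Xc.T @ Xc / len(X)
    evals, U = np.linalg.eigh(cov)
    Uk = U[:, np.argsort(evals)[::-1][:k]]          # top-k principal directions
    Z = Xc @ Uk                                     # encoder: latent coordinates
    X_tilde = Z @ Uk.T + x_bar                      # decoder: reconstruction
    return Z, X_tilde, np.sum((X - X_tilde) ** 2)   # reconstruction error E

X = np.random.rand(100, 10)
Z, X_tilde, E = pca_encode_decode(X, k=3)
```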

Regularization

Any modification made to a learning algorithm that is intended to reduce its generalization error but not its training error

An example of a bias toward simple solutions would be adding the number of parameters to the loss function; this penalizes model complexity.

Denoising Autoencoders

The process:

Denoising autoencoders have been hypothesized to do manifold learning.

Generative models

Usually noise -> neural net -> spit out something like training sample.

Probabilistic interpretation: randomly sampled noise $X \sim N(\mu, \sigma)$ -> neural net -> sample from the training distribution $X \sim P(X)$.

Can we modify a regular autoencoder for generation?

See: generalized denoising autoencoder

Walkback training

Walkback training trains the DAE to estimate a conditional probability distribution $P(X \mid \tilde X)$. The original paper (Bengio et al., 2013) shows that a consistent estimator of $P(X)$ can be recovered by alternating sampling from the corruption process $C(\tilde X, X)$ and the denoising process $P(X \mid \tilde X)$.

However, there are issues with this approach: the data generation process is very slow (like taking a slow walk through the data), it may not span the space, and there is no explicit penalty for matching the full data distribution.

Variational Autoencoders (VAEs)

Learning latent spaces that allow for sampling, i.e., we want to generate new examples just by sampling in the latent space.

VAEs force the latent vectors to have a roughly unit Gaussian distribution. Generate images by sampling a unit Gaussian and passing it into the decoder.

There exists a tradeoff between accuracy and the unit Gaussian approximation. Build this directly into the loss:
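A minimal sketch of such a loss, combining reconstruction error with a KL term pulling the approximate posterior toward the unit Gaussian (the inputs below are placeholders for encoder/decoder outputs, not part of any specific implementation):

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    """Per-example VAE objective sketch: reconstruction error plus the KL divergence
    between the approximate posterior N(mu, sigma^2) and the unit Gaussian N(0, I)."""
    recon = np.sum((x - x_recon) ** 2)                                   # accuracy term
    kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))          # closeness to N(0, I)
    return recon + kl

x = np.random.rand(784)
x_recon = np.random.rand(784)          # stand-in for a decoder output
mu, log_var = np.zeros(16), np.zeros(16)
print(vae_loss(x, x_recon, mu, log_var))
```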

12. Variational Calculus, Pullback Metrics, NeuralFIM

Attach slides; math lecture

13. Probabilistic Methods for Robust and Transparent Machine Learning

Guest lecture

14. Neural ODEs, Flows, and Flow Matching

First part: covering ODEs. An ODE involves functions of a single variable and their derivatives. Initial value problems.

Neural ODEs

Resnets are neural networks with skip connections. Neural ODEs paper relates ODEs to Resnets, though it isn’t a super useful analogy.

Neural ODE: a neural network that computes derivatives (usually a NN tries to approximate a function; a neural ODE approximates the derivative of that function). The derivative is “parameterized” by a neural network. To compute $F(x, t_1)$ given $F(x, t_0)$ we use an ODE solver.

ODE network layers are calls to ODE solvers. In order to train it, we would have to backpropagate through operations of the ODE solver, but this can be very time consuming even if every operation is differentiable.

To avoid backprop through the ODE solver, they approximate the gradient instead, using the “adjoint sensitivity” method.

Adjoint sensitivity method

Suppose we have a loss function whose output is the result of an ODE solver. To optimize $L$ we have to consider its gradient with respect to $\theta$, but the parameters in $\theta$ affect the loss via the intermediate variables output by the ODE solver, $z(t_0), z(t_1), \ldots$. We can imagine applying the chain rule to figure this out, which would be awkward because we would have to do it continuously for every value of $t$.

We need a continuous analogue of the chain rule called the sensitivity equation instead. Two DEs: forwards and backwards, used to compute adjoint sensitivity equation, which will approximate derivatives for ODE solver.

Continuous normalizing flows

Neural ODEs can be used to perform population dynamics and not just single entity dynamics. They can do this by using continuous normalizing flows, a continuous extension of normalizing flows. Still trained to maximize the likelihood of the data distribution.

Normalizing flows are a series of invertible simple functions that can be run forwards or backwards. Trained “backwards” from how they operate.

Deep normalizing flows decompose the function $f$ into a composition of invertible mappings, starting from a simple noise distribution (whose density must be easy to compute for this to work).

Maximum likelihood (MLE) training: train a NN to maximize the density it assigns to samples from the data. In the case of normalizing flows, we can compute it exactly using the density of the initial noise distribution, which is Gaussian.

OT Regularized Flows

Optimal transport regularized flows: restricting the CNF toward optimal transport is helpful for modeling biological and cellular systems. Constrain the derivative network to be smoother (penalize the $L^2$ norm); this gives rise to optimal transport paths.

Manifold Interpolating Flows

Flow is restricted to be smooth, don’t have to start from noise.

Architecture: autoencoder has PHATE regularization to enforce data embedded via manifold distances.

Train via an ODE solver. We no longer have continuous-time observations of each cell, so we penalize the flow to match the distribution at the next timepoint. This takes the form of a static optimal transport penalty, or you can use KL divergence. This gives us smooth paths from time point to time point.

Regularized flow has been shown to yield dynamic optimal transport.

Why would we do MIOFlow? With the low-dimensional PHATE representation, we can flow backwards and look at the decoded trajectories. We can also use Granger causality to infer things like gene regulatory networks (or other causal structure in the data).

Flow matching

Neural ODEs (simulation based training) require many integration steps, have high variance gradients. Can we approx OT without Neural ODE?

Use flow matching. If we only care about a source-to-target pair of points, we can assume a Euclidean latent space and then the path will be a straight line between the source and target points. So the vector field from the source always points toward the target. Flow matching regresses onto this vector field.
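A hedged sketch of the regression targets used in (conditional) flow matching with straight-line paths (batch size, dimensions, and distributions are illustrative assumptions):

```python
import numpy as np

def flow_matching_targets(x0, x1, t):
    """Straight (OT-like) conditional paths: sample x_t on the line between source x0
    and target x1; the regression target for the vector field at (x_t, t) is x1 - x0."""
    x_t = (1 - t)[:, None] * x0 + t[:, None] * x1
    target_v = x1 - x0
    return x_t, target_v

# One training "batch": pair source samples (e.g., noise) with target samples.
x0 = np.random.randn(64, 2)            # source points
x1 = np.random.randn(64, 2) + 5.0      # target points
t = np.random.rand(64)                 # random times in [0, 1]
x_t, target_v = flow_matching_targets(x0, x1, t)
# A network v_theta(x_t, t) would then be trained with the loss ||v_theta(x_t, t) - target_v||^2.
```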

Diffusion based methods can take circuitous paths to the data, whereas flow matching will go right towards it. We can interpret the OT paths as linearly interpolating between a very low variance distribution around a data point and target. OT has been shown to have fast convergence.

15. CNNs and Diffusion Models

Parts:

Classifier guidance: Some diffusion models take in text or other labels as a prompt, generally accommodated by conditioning on that label, $P(x_0 \mid y)$. There is also manifold guidance: correcting noise using a denoising autoencoder. The noise then “snaps” onto the data manifold and the diffusion occurs on the data manifold.

16. Intro to Topological Data Analysis

TDA is interested in the shape, structure, and connectivity of data. Especially suitable for datasets that are high-dimensional, noisy, or incomplete. The central idea is that the shape of data contains valuable information that traditional statistical methods might miss. One challenge with the extracted topological and geometric information is trying to incorporate it in a differentiable framework.

A core TDA concept is using individual data points to build a “continuous” shape that approximates the data. This follows a pipeline:

Datasets are usually finite sets of discrete observations, which do not directly reveal any topological information. A natural way to highlight some topological structure in this data is to “connect” points with neighbors that are close, quantified via a metric. TDA considers datasets as discrete metric spaces or samples from continuous metric spaces.

Simplices

Given a set $X = \{x_0, x_1, \ldots, x_k\} \subset \mathbb{R}^d$ of $k+1$ affinely independent points, the $k$-dimensional simplex $\sigma = [x_0, x_1, \ldots, x_k]$ is the convex hull of $X$. The points of $X$ are the vertices of the simplex. Simplices spanned by subsets $S \subset X$ are called the faces of $\sigma$.

A simplicial complex $K \subset \mathbb{R}^d$ is a collection of simplices such that:

Abstract vs Geometric Simplicial Complexes: $K$ is fully characterized by the combinatorial description of a collection of simplices satisfying some incidence rules. We can define simplicial complexes with no reference to geometry:

The combinatorial description of any geometric simplicial complex $K$ gives rise to an abstract simplicial complex. Conversely, we can associate to an abstract simplicial complex $\tilde K$ a topological space $|\tilde K|$ such that if $K$ is a geometric complex whose combinatorial description is the same as $\tilde K$, then the underlying space of $K$ is homeomorphic to $|\tilde K|$. Such a $K$ is called a geometric realization of $\tilde K$.

Abstract simplicial complexes can be seen as topological spaces and geometric complexes can be seen as geometric realizations of their underlying combinatorial structure.

Therefore, we can consider simplicial complexes simultaneously as combinatorial objects for effective computation and as topological spaces from which we can infer topological properties.

Building simplicial complexes from data

Given a dataset, there are many ways of building a simplicial complex. A common method is the Vietoris-Rips complex: $\text{Rips}_\alpha(X)$ is the set of simplices $[x_0, \ldots, x_k]$ such that $d(x_i, x_j) \leq \alpha$ for all pairs $(i, j)$.

It follows immediately from the definition that this is an abstract simplicial complex. However, in general, even when $X$ is a finite subset of $\mathbb{R}^d$, $\text{Rips}_\alpha(X)$ does not admit a geometric realization in $\mathbb{R}^d$; in particular, it can be of dimension higher than $d$.
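A brute-force sketch that enumerates Rips simplices directly from the definition (fine only for tiny point clouds; `alpha`, `max_dim`, and the data are illustrative):

```python
import numpy as np
from itertools import combinations
from scipy.spatial.distance import cdist

def rips_complex(X, alpha, max_dim=2):
    """Enumerate Vietoris-Rips simplices up to max_dim: a subset of points spans a
    simplex whenever all of its pairwise distances are at most alpha."""
    D = cdist(X, X)
    n = len(X)
    simplices = [(i,) for i in range(n)]                     # vertices
    for k in range(2, max_dim + 2):                          # edges, triangles, ...
        for subset in combinations(range(n), k):
            if all(D[i, j] <= alpha for i, j in combinations(subset, 2)):
                simplices.append(subset)
    return simplices

X = np.random.rand(20, 2)
print(len(rips_complex(X, alpha=0.3)))
```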

The Čech complex $\text{Cech}_\alpha(X)$ is the set of simplices $\sigma = [x_0, x_1, \ldots, x_k]$ such that the $k+1$ closed balls $B(x_i, \alpha)$ have a nonempty intersection:

$$\bigcap_{i=0}^k B(x_i, \alpha) \neq \emptyset.$$

This is different from the VR complex in that unless the balls centered at $x_1, x_2, x_3$ have a 3-way intersection, there is no triangle formed in a Čech complex.

Nerves

Given a cover of a dataset, its nerve provides a compact and global combinatorial description of the relationship between these sets through their intersection patterns.

Nerve: Given a cover $U = (U_i)_{i \in I}$ of $M$, the nerve of $U$ is the abstract simplicial complex $C(U)$ whose vertices are the $U_i$'s and such that $\sigma = [U_{i_0}, \ldots, U_{i_k}] \in C(U)$ if and only if $\bigcap_{j=0}^k U_{i_j} \neq \emptyset$.

Using the nerve of covers as a way to summarize, visualize, and explore data is a natural idea first proposed in Singh (2007), giving rise to the Mapper algorithm.

The Mapper algorithm is used to convert complex, high-dimensional data into a simplified and understandable graphical representation based on the idea of a nerve:

Choice of filter function will affect the results. Common ones include density estimates, PC coordinates, centrality measures, and distances from root.

Homology

17. TDA Guest Lecture

Computational topology: a subfield of algebraic topology that actually tries to compute things on a computer.

Think: why is a cube more similar to a sphere than a torus? The hole in the middle? How can we calculate this? Which qualities of the sphere make it different from the torus?

One idea: Betti numbers. $\beta_0, \beta_1, \beta_2$ measure connected components, tunnels, and voids, in that order.

Chain groups

Given a simplicial complex $K$, the $p$-th chain group $C_p$ of $K$ consists of all formal combinations of $p$-simplices in the complex.

Coefficients are in $\mathbb{Z}_2$, so all elements of $C_p$ are of the form $\sum_j \sigma_j$ for $\sigma_j \in K$. The group operation is addition with $\mathbb{Z}_2$ coefficients.

notes

Boundary homomorphism

Given a simplicial complex $K$, the $p$-th boundary homomorphism is the function that assigns each simplex $\sigma = \{v_0, \ldots, v_p\} \in K$ to its boundary:

$$\partial_p \sigma = \sum_i \{v_0, \ldots, \hat{v}_i, \ldots, v_p\},$$

where $\hat{v}_i$ indicates that the set does not contain the $i$-th vertex. The function $\partial_p : C_p \rightarrow C_{p-1}$ is thus a homomorphism between the chain groups.

For all $p$, we have $\partial_{p-1} \circ \partial_p = 0$: boundaries do not themselves have a boundary. This leads to the chain complex.

notes

The cycle group is $Z_p = \ker \partial_p$ and the boundary group is $B_p = \operatorname{im} \partial_{p+1}$.

$B_p$ is a subgroup of $Z_p$.

Homology groups and Betti numbers

The $p$-th homology group $H_p$ is a quotient group, defined by removing cycles that are boundaries from a higher dimension:

$$H_p = Z_p / B_p = \ker \partial_p / \operatorname{im} \partial_{p+1}.$$

The $p$-th Betti number is calculated as

$$\beta_p = \operatorname{rank} H_p.$$

Intuition: Calculate all boundaries, remove the boundaries that come from a higher-dim object, and count what is left.
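As a computational sanity check, a hedged sketch that computes Betti numbers over $\mathbb{Z}_2$ from boundary matrices by Gaussian elimination (the field-coefficient special case of the Smith normal form computation mentioned below; the toy complex is a hollow triangle):

```python
import numpy as np

def rank_mod2(M):
    """Rank of a binary matrix over the field Z_2 via Gaussian elimination."""
    M = M.copy() % 2
    rank, rows, cols = 0, M.shape[0], M.shape[1]
    for col in range(cols):
        pivot = next((r for r in range(rank, rows) if M[r, col]), None)
        if pivot is None:
            continue
        M[[rank, pivot]] = M[[pivot, rank]]
        for r in range(rows):
            if r != rank and M[r, col]:
                M[r] = (M[r] + M[rank]) % 2
        rank += 1
    return rank

def betti_numbers(boundary_matrices, n_vertices):
    """beta_p = dim ker d_p - rank d_{p+1}; d_0 is the zero map, so dim ker d_0 = #vertices."""
    ranks = [rank_mod2(B) for B in boundary_matrices]               # ranks of d_1, d_2, ...
    dims = [n_vertices] + [B.shape[1] for B in boundary_matrices]   # dim C_0, C_1, ...
    bettis = []
    for p in range(len(dims)):
        rank_dp = ranks[p - 1] if p >= 1 else 0
        rank_dp1 = ranks[p] if p < len(ranks) else 0
        bettis.append(dims[p] - rank_dp - rank_dp1)
    return bettis

# Hollow triangle (3 vertices, 3 edges, no 2-simplex): expect beta_0 = 1, beta_1 = 1.
d1 = np.array([[1, 1, 0],
               [1, 0, 1],
               [0, 1, 1]])     # rows = vertices, columns = edges
print(betti_numbers([d1], n_vertices=3))
```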

Homology calculations in practice use Smith normal form:

notes

Vietoris-Rips Complex

notes

Issues with this approach: how do we pick the right $\epsilon$? It may also be computationally inefficient: the matrix reduction has to be performed for every simplicial complex.

The fix: calculate topological features for all possible scales. How is this any better?

By going through all the scales and tracking topological features, we can see the “destruction” and “creation” of features. We will then be able to identify at what scales features appear and disappear.

This works because VR complex has nesting property.

notes

notes

notes

Persistent Homology Group

Given two indices $i \leq j$, the $p$-th persistent homology group $H_p^{i,j}$ is defined as

$$H_p^{i,j} := Z_p(K_i) / \big(B_p(K_j) \cap Z_p(K_i)\big),$$

which contains all the homology classes of $K_i$ that are still present in $K_j$.

We can calculate a new set of homology groups alongside the filtration and assign a “duration” to each topological feature.

notes

Standard filtrations

notes

notes

Given boundary matrix, we can reduce it. We can examine a reduced boundary matrix for topological information:

Tracking topological features

Given a topological feature with associated simplicial complexes $K_i$ (creation) and $K_j$ (destruction), store the point $(w(i), w(j))$ in a persistence diagram. If the feature is never destroyed, we assign it a value of $\infty$.

18. Homology in ML

Homotopy: a continuous deformation, weaker than a homeomorphism, under which you can “contract” things. It involves taking a product of the domain with a time parameter.

notes

Topological invariant: a property of a mathematical object that remains unchanged under homeomorphisms. An example is the Euler characteristic:

notes

Another invariant is homology:

notes

notes

notes

Thus to compute cavities:

We can repeat this in a lower dimension.

Chains

A $k$-simplex is represented by listing its vertices. We call a union of $k$-simplices a $k$-subcomplex of $X$. The group of $k$-chains, or the $k$-th chain group, is the group of all possible $k$-dimensional subcomplexes.

To identify $k$-dimensional holes in the simplicial complex, we have to identify $k$-subcomplexes that have no boundary. The boundary operator maps $k$-chains to their $(k-1)$-dimensional boundaries:

$$\delta_k: C_k(X) \rightarrow C_{k-1}(X).$$

The boundary operator is linear and can therefore be represented by a matrix.

From the chain group, we can get the $k$-th cycle group $Z_k(X)$, the kernel of the boundary operator. This includes all $k$-chains that map to $0$.

Of all the boundaryless $k$-subcomplexes we find, we need to eliminate the ones that bound a $(k+1)$-subcomplex. For this we look at the image of the boundary map on $C_{k+1}(X)$. The $k$-boundary group is

$$B_k(X) = \delta_{k+1} C_{k+1}(X) \subset C_k(X).$$

Homology group

$$H_k = Z_k(X) / B_k(X).$$

Persistent homology in ML constructs a nested sequence of simplicial complexes called a filtration. This tracks the birth and death of $k$-holes, and we can store this information in a persistence diagram.

Sublevel set filtration

Thresholding values on objects, e.g. imagine we have grayscale images:

notes

This filtration can be used when we already have a geometrically structured object (e.g. images). This uses cubical complexes as opposed to simplicial complexes.

In cubical complexes, vertices and edges are the same, but triangles are replaced by squares, and tetrahedra by cubes.

19. (20?) Various Filtrations for Graphs and Point Clouds

Filtrations:

Filtrations through node and edge functions:

Centrality: assigns a ranking to nodes based on their network position. Different types: betweenness centrality (how many times a node appears on shortest paths between pairs of other nodes), eigenvector centrality (measures a node's influence on other nodes).

Eigenvector centrality: Let $G = (V, E)$ be a graph with adjacency matrix $A(i,j)$. The relative centrality score of vertex $i$ is $c(i) = \frac{1}{\lambda}\sum_j A(i,j)\, c(j)$. Rewriting, we get $Ac = \lambda c$; this could be any eigenvector. With the additional constraint that $c(i) > 0$, it must be the leading eigenvector of $A$, by the Perron-Frobenius theorem.
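A small power-iteration sketch of eigenvector centrality on a toy graph (iteration count and the graph are illustrative):

```python
import numpy as np

def eigenvector_centrality(A, n_iter=200):
    """Power iteration: repeatedly apply A and renormalize; converges to the leading
    eigenvector, which is entrywise positive for a connected graph (Perron-Frobenius)."""
    c = np.ones(A.shape[0])
    for _ in range(n_iter):
        c = A @ c
        c = c / np.linalg.norm(c)
    return c

# Toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to node 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(eigenvector_centrality(A))   # node 2 (most connected) scores highest
```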

Graph filtrations

Graph filtrations based on node-edge functions create an ordering of nodes or edges. Based on the threshold of the function, subgraphs “appear” in a nested fashion.

Graph distance filtration: converts a graph to a point cloud by computing pairwise graph (geodesic) distances between points, then computes the VR complex.

Condensation filtration [ add image ]

Convergence

Application of a positive diffusion operator to a datapoint makes it a convex combination of the points in the dataset, and thus it goes into the interior of the current data convex hull. The convex hull shrinks at a rate proportional to \partial which is the smallest value in the kernel.

Multiscale PHATE: PHATE on frozen diffusion condensation clusters.

Multiparameter persistence

For original persistent homology, the term single persistence is used because we filter the data in just one direction, $K_1 \subset K_2 \subset \cdots \subset K_n$. But there are several ways to expand a single filtration into a multifiltration to get finer information on the topological patterns.

Multifiltration: multiparameter filtration refers to sweeping through two variables. The result is a bifiltration with columns and rows nested.

Multifiltration in a graph: Use two different node/edge functions $f, g$. These induce a multifiltration $F: V \rightarrow \mathbb{R}^2$ where $F(v) = (f(v), g(v))$.

Multifiltration in a point cloud: consider using a radius $r_i$ and a density estimate at each point $p_i$. One is a VR filtration, and the other is a level-set filtration. At each threshold of the density, we can construct a complete Rips filtration.

Vectorizing multifiltrations: birth and death are hard to compute; the two variables may only form a partial ordering. Multiparameter Betti numbers are an option, as well as sliced persistence diagrams (examine a series of slices as normal persistence diagrams).

Zigzag persistence: persistent homology for time-varying point clouds or graphs. Instead of going in the same direction for the filtration, it zigzags (goes back and forth) in directionality. It uses intersections to create the intermediate steps.

Ordinal partition networks:

Given a time series $x = [x_0, x_1, \ldots, x_n]$ at times $t = [t_0, t_1, \ldots, t_n]$, the ordinal partition network is defined by generating all permutations of length $m$ (of which there are $m!$) as nodes, then adding edges between permutations that chronologically follow one another.

20. Another guest lecture… topological deep learning

An objection to point cloud -> persistent homology -> featurized input to machine learning pipeline is that it degrades topology to just a set of handcrafted features.

A way to “reverse” this pipeline? How can we learn suitable representations?

Topological autoencoders, motivation: a classic AE struggles to find a representation which preserves multiple scales all at once. If you have data with structure at different scales (e.g., nested spheres), it's hard to craft a low-dimensional representation that preserves all that information.

TAE: input data + reconstruction gives reconstruction loss. Topological info of input data AND latent data is compared to give topological loss. The topological loss term serves as regularization.

Why this works: let $X$ be a point cloud of size $n$ and $X^m$ be a subsample of $X$ of size $m$ sampled without replacement. We can bound the probability that the persistence diagrams of $X$ and $X^m$ differ by more than a threshold in bottleneck distance as

$$P\big(W_\infty(D^X, D^{X^m}) > \epsilon\big) \leq P\big(\operatorname{dist}_H(X, X^m) > 2\epsilon\big),$$

where $\operatorname{dist}_H$ is the Hausdorff distance. So mini-batches are topologically close if the subsampling is not too coarse.

Gradient calculation: everything is based on distances. Every point in the persistence diagram can be mapped to one entry in the distance matrix. Each entry is a distance, so it can be changed during training.

Loss term is bidirectional: how much topo info is preserved going from input to latent + the reverse.

Topology-driven graph learning

Main issue: two graphs can have different numbers of vertices. So we need a vectorized representation $f: G \rightarrow \mathbb{R}^d$ of graphs. Such a representation $f$ needs to be permutation invariant.

Message passing is incapable of capturing large-scale cycles in a graph (among other issues, e.g. oversmoothing). In a topological graph NN, use a node map $\phi: \mathbb{R}^d \rightarrow \mathbb{R}^k$ to create $k$ different filtrations of the graph. Create persistence diagrams from these filtrations, and use a coordinatisation function $\psi$ to create compatible representations of the node attributes. Then aggregate for outputs.

Choosing $\phi, \psi$? The node map can be realized using a neural network. The coordinatisation function can be realized using any vectorization of persistence diagrams, though a differentiable coordinatisation function is most effective.

Sheaves

Let $G$ be a finite, undirected graph with a fixed orientation on each edge. A cellular sheaf on $G$ assigns a finite-dimensional real vector space, known as the stalk, $\mathcal{F}(v)$ to each vertex $v$ and $\mathcal{F}(e)$ to each edge $e$. For each incident vertex-edge pair $v \trianglelefteq e$, we have a linear restriction map $\mathcal{F}_{v \trianglelefteq e}: \mathcal{F}(v) \rightarrow \mathcal{F}(e)$.

Bodnar, et al.: Neural sheaf diffusion: a topological perspective on Heterophily and Oversmoothing in GNNs.

Overview

Topology-based methods can be powerful if higher-order information is relevant. There are different perspectives we can take: more interest in multi-scale assessment of data (persistent homology) vs. trying to capture the interaction of graph structure and features.

Sheaves generalize existing mechanisms to account for more complex interactions in the data.

21. Sheaf NNs