Geometric and Topological Methods in ML
Notes and relevant papers / readings for geometric and topological methods in machine learning. IN PROGRESS.
2. Geometric graphs, construction, clustering
Basics: metric spaces / metrics, graphs. Geometric graphs have vertices as points in a metric space, with edges defined by low distances in the metric space.
kNN and ε-NN graphs. kNN construction via KD tree.
Shortest-path methods: Dijkstra's (single source), Floyd–Warshall (all pairs).
Dijkstra:
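A minimal sketch of Dijkstra's algorithm (single-source shortest paths) using a binary heap. The adjacency-dict format `{node: [(neighbor, weight), ...]}` is an illustrative assumption, not from the notes.

```python
import heapq

def dijkstra(adj, source):
    """Single-source shortest paths on a graph with non-negative edge weights.
    adj: dict mapping node -> list of (neighbor, weight) pairs."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist  # unreachable nodes are simply absent

# Example: dijkstra({"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 1.5)], "c": []}, "a")
```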

Floyd Warshall:
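A minimal sketch of Floyd–Warshall (all-pairs shortest paths), vectorized with numpy; the dense distance-matrix input format is an assumption.

```python
import numpy as np

def floyd_warshall(D):
    """All-pairs shortest paths. D is an (n, n) matrix of direct edge lengths,
    with np.inf for missing edges and 0 on the diagonal."""
    D = np.array(D, dtype=float)
    n = D.shape[0]
    for k in range(n):
        # Allow paths that pass through intermediate node k.
        D = np.minimum(D, D[:, k:k+1] + D[k:k+1, :])
    return D
```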

We are usually interested in using shortest paths to mimic geodesic distances. Also, Alamgir et al. (ICML 2012) show that unweighted kNN graphs lead to strange paths (the path wants to make the least number of hops, so it sometimes tends toward sparse areas where single hops cover large distances in the metric), and that the weighted shortest-path distance converges to the length of the shortest path connecting two nodes.
Manifold learning based graph construction
Distances are converted to affinities via kernels (e.g. the Gaussian kernel): $K(x, y) = \exp\left(-\frac{\|x - y\|^2}{\epsilon}\right)$
The process usually goes: data → calculate distances → calculate affinities.
Graph-based clusters / community detection
Hierarchical clustering:
- Link neighbors
- Link clusters
How do we measure distance between clusters? There are a few methods: single, average, and complete linkage. Single linkage uses the distance between the two closest points of the two clusters, complete linkage uses the two furthest points, and average linkage uses the average pairwise distance. Single linkage is most common:
- start with every point in its own cluster
- merge the two closest clusters
- recompute distances based on min distance to the newly formed cluster
- repeat
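A minimal sketch of single-linkage hierarchical clustering using scipy (assumed available); the toy data and the choice of 5 clusters are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(100, 2)                        # toy data (assumed)
Z = linkage(X, method="single")                   # merge clusters by minimum pairwise distance
labels = fcluster(Z, t=5, criterion="maxclust")   # cut the dendrogram into 5 clusters
```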
Modularity
- $A_{ij}$ is the actual affinity between $i$ and $j$.
- $P_{ij}$ is the probability of an edge existing between $i$ and $j$ under a null model, calculated by:
- generate random edges with the same degrees as the original graph
- then the probability is $P_{ij} = \frac{k_i k_j}{2m}$, where $m$ is the total number of edges in the graph and $k_i$ is the degree of node $i$.
Modularity is the sum over all pairs in the same community:
$$Q = \frac{1}{2m}\sum_{ij}\left(A_{ij} - \frac{k_i k_j}{2m}\right)\delta(c_i, c_j)$$
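A minimal numpy computation of $Q$ directly from the formula above, given an adjacency matrix and community labels (function and variable names are illustrative).

```python
import numpy as np

def modularity(A, communities):
    """Q = (1/2m) * sum_ij [A_ij - k_i k_j / (2m)] * delta(c_i, c_j)."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)                   # (weighted) degrees
    two_m = A.sum()                     # 2m for an undirected adjacency matrix
    communities = np.asarray(communities)
    same = (communities[:, None] == communities[None, :]).astype(float)
    B = A - np.outer(k, k) / two_m      # modularity matrix
    return (B * same).sum() / two_m
```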
Modularity optimization: Louvain used to be standard, now most use Leiden. Louvain has two phases:
- Finding communities:
- Consider each node individually
- See if the node should be moved into a neighboring community: select the community with the greatest change in modularity. If moving it corresponds with a positive change in modularity, move it.
- Check only neighbors of the node
- Coarse graining:
- Each community in phase 1 becomes a node in phase 2
- Repeat phase 1
Why do we care about modularity? Based on its definition, we can think of it as a measure of the strength of the division of a graph into modules / groups / communities. Networks with high modularity have dense connections between nodes within modules but sparse connections between nodes in different modules.
Recommended reading
- Alamgir & von Luxburg, Shortest Path Distances in k-nearest Neighbor Graphs, ICML 2012
- Qiao et al., Data-driven graph construction and graph learning: A review, Neurocomputing 2018
- k-d trees (Wikipedia)
- Dijkstra's algorithm (Wikipedia)
- Floyd–Warshall algorithm (Wikipedia)
- Newman, Modularity and community structure in networks, PNAS 2006
- Blondel et al., Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment, 2008
3. Data diffusion, diffusion maps, spectral clustering
Representation Learning
We want to find the “best” representation of our data; i.e., we want to preserve as much information as possible while obeying some penalty that simplifies the representation.
What kind of information should be preserved, and how do we define “simple”?
- Low-dimensional
- Sparse
- Independent components
These criteria aren’t necessarily mutually exclusive. Simple representations are good for interpretability (visualization), reducing computational cost (compression), and performance on supervised tasks (can be easier to classify / regress from).
Isomap (Tenenbaum et al., 2000)
- Construct neighborhood graph
- Find all pairs shortest paths
- Apply MDS to compute low-dimensional coordinates that preserve distances
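A usage sketch of the three Isomap steps with scikit-learn (assumed available); the swiss-roll dataset and parameter values are illustrative.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, noise=0.05)
# Isomap: kNN graph -> all-pairs shortest paths -> MDS on the geodesic distances
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
```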
MDS (Multidimensional Scaling)
MDS finds an embedding of the data that preserves distances by optimizing point positions such that the “stress” is minimized:
$$\text{Stress}(y_1, \ldots, y_n) = \left(\sum_{i < j}\left(d_{ij} - \|y_i - y_j\|\right)^2\right)^{1/2}$$
This can be done with gradient descent and other related procedures.
Isomap’s weakness: real data is often very noisy, and the shortest-path algorithm then gives noisy distances between points. More commonly, we might see UMAP being used, or PHATE (geometry preserving, common in biology).
Diffusion maps
Diffusion distance can be more robust to noise! Introduced by Coifman and Lafon. 3 step process:
- Compute affinities with Gaussian kernel
- Compute degree matrix
- Markovian diffusion operator gives us coordinates
- Recall kernel functions (discussed previously for converting distances to affinities). If $\mathcal{X}$ is a measurable space, a kernel function is a mapping $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ that is symmetric ($K(x, y) = K(y, x)$) and non-negative ($K(x, y) \geq 0$) for any $x, y \in \mathcal{X}$.
Kernels are generally nonlinear. Diffusion maps are specifically defined with respect to the Gaussian kernel:
$$K(x, y) = \exp\left(-\frac{\|x - y\|^2}{\epsilon}\right)$$
where $\epsilon$ is a bandwidth parameter. This kernel captures a notion of affinity/similarity between points in $\mathcal{X}$.
The Gaussian kernel emphasizes local connections and pushes non-local connections to in a “softer” way than kNN. We can also threshold these outputs for computational purposes (so that the affinity graph wouldn’t be fully connected).
- After computing affinities with the Gaussian kernel, we define the degree matrix. For a weighted graph, the degree of vertex $x_i$ is defined as the sum of similarities/affinities to its neighbors: $d(x_i) = \sum_j K(x_i, x_j)$.
The degree matrix $D$ is a diagonal matrix where $D_{ii} = d(x_i)$.
- We can now define the “diffusion operator”
$$P = D^{-1}K,$$
a matrix of transition probabilities with row-normalized affinities. Therefore, this matrix is Markov and describes a random walk over the data.
Since $P$ represents the 1-step transition probabilities, $P^t$ represents the $t$-step random walk probabilities.
The diffusion map representation is a spectral embedding; that is, $P$ has a spectral decomposition that we can use to assign coordinates in the embedding to each data point.
Consider the matrix $M = D^{-1/2} K D^{-1/2}$.
Then $M = D^{1/2} P D^{-1/2}$ and $M$ is symmetric, so there exists an eigendecomposition. In fact, $P$ and $M$ are similar, so they must have the same eigenvalues and related eigenvectors.
Let the eigenvectors of $M$ be $v_i$ with eigenvalues $\lambda_i$.
- A left eigenvector of $P$, $\phi_i$, would be $\phi_i = D^{1/2} v_i$.
- A right eigenvector of $P$, $\psi_i$, would be $\psi_i = D^{-1/2} v_i$.
We can use these eigenpairs to compute the spectral decomposition of $P$: $P = \sum_i \lambda_i \psi_i \phi_i^T$.
For the eigenpairs $(\lambda_i, \psi_i)$,
where each $\psi_i$ is a column (right) eigenvector from the decomposition of $P$, we can map a datapoint $x_i$ to its embedding coordinates as:
$$\Psi_t(x_i) = \left(\lambda_1^t \psi_1(i),\ \lambda_2^t \psi_2(i),\ \ldots,\ \lambda_k^t \psi_k(i)\right)$$
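A minimal numpy diffusion map following the construction above (Gaussian kernel → degrees → symmetric conjugate → eigendecomposition); the bandwidth, diffusion time, and number of components are assumed values.

```python
import numpy as np
from scipy.spatial.distance import cdist

def diffusion_map(X, epsilon=1.0, n_components=2, t=1):
    K = np.exp(-cdist(X, X, "sqeuclidean") / epsilon)    # Gaussian affinities
    d = K.sum(axis=1)                                    # degrees
    M = K / np.sqrt(np.outer(d, d))                      # symmetric conjugate D^{-1/2} K D^{-1/2}
    vals, vecs = np.linalg.eigh(M)                       # ascending eigenvalues
    vals, vecs = vals[::-1], vecs[:, ::-1]               # sort descending
    psi = vecs / np.sqrt(d)[:, None]                     # right eigenvectors of P = D^{-1} K
    # Skip the trivial first eigenpair (lambda = 1, constant direction).
    return (vals[1:n_components + 1] ** t) * psi[:, 1:n_components + 1]
```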
Properties of the spectrum
- Eigenvalues are in the range $[-1, 1]$ (and in $[0, 1]$ for a positive semidefinite kernel like the Gaussian)
- 1 is an eigenvalue: $P\mathbf{1} = \mathbf{1}$
- The first right eigenvector is the trivial vector of all 1’s; the corresponding left eigenvector is the steady state / stationary distribution
- Stationary distribution given as $\pi_i = \frac{d(x_i)}{\sum_j d(x_j)}$
- The Markov chain is reversible, i.e., $\pi_i P_{ij} = \pi_j P_{ji}$
Powers of the matrix make the lower eigenvalues even lower: $\lambda_i^t \to 0$ as $t$ grows for $\lambda_i < 1$.
Hence powers of $P$ denoise in a low-pass filtering manner. When we map $x$ to its new coordinates, we can use only the first $k$ eigenvalues/eigenvectors due to spectrum decay while still maintaining accuracy, which reduces our dimensionality.
Diffusion distance
Diffusion maps preserve diffusion distance, a family of distances on the transition probability distributions between 2 points:
$$D_t(x_i, x_j)^2 = \sum_k \frac{\left(P^t_{ik} - P^t_{jk}\right)^2}{\pi_k}$$
This is essentially the (weighted) Euclidean distance between rows of the diffusion operator.
Diffusion distance can also be interpreted as the overall difference in walking probabilities after $t$ steps when starting out at point $x_i$ vs. point $x_j$.
Diffusion maps do a good job at cleaning data, but are not necessarily great at visualization. This is partially because diffusion maps split information into multiple orthogonal directions, e.g. plotting in 2-D will likely only show us 2-ish trajectories.

Partitioning data with K-means
In k-means, we want to find splits/partitions $S_1, \ldots, S_k$ to minimize variance within each cluster:
$$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \|x - \mu_i\|^2$$
- initialize: pick random cluster centroids
- assignment: assign each datapoint to cluster that has the nearest mean
- update: compute the means of clusters
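A minimal numpy version of the loop above; initialization by sampling data points and a fixed iteration count are assumptions, and empty clusters are not handled in this sketch.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # init: random data points
    for _ in range(n_iters):
        # assignment: nearest centroid for each point
        labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        # update: recompute each cluster mean (assumes no cluster becomes empty)
        centroids = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids
```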
k-means converges because neither the assignment step nor the update step increases the within-cluster sum of squares (MSE).
Therefore, this optimization will reach a local minimum. However, k-means only works well on data with certain characteristics. It assumes:
- k is chosen correctly
- data is distributed normally around the mean
- clusters all have equal variance
- clusters all have the same number of points
It is also sensitive to initialization and outliers.
An alternative is spectral clustering. Spectral clustering can use eigenvectors of the diffusion operator or Laplacian. Advantages of using the spectrum for k-means:
- good for clusters of arbitrary shape
- good for graph data
- only need connectivity data
How it works:
- construct a similarity graph
- compute diffusion operator or graph Laplacian
- using top k eigenvectors, run k-means
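A minimal from-scratch sketch of the three steps above (similarity graph → normalized Laplacian eigenvectors → k-means on the rows); the Gaussian kernel, bandwidth, and row normalization are assumed choices.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, k, epsilon=1.0):
    K = np.exp(-cdist(X, X, "sqeuclidean") / epsilon)        # similarity graph
    d = K.sum(axis=1)
    L_sym = np.eye(len(X)) - K / np.sqrt(np.outer(d, d))     # normalized Laplacian
    vals, vecs = np.linalg.eigh(L_sym)
    U = vecs[:, :k]                                          # k smallest-eigenvalue eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True)         # row-normalize the spectral embedding
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)    # k-means in spectral coordinates
```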
Recommended reading
- Nadler et al., Diffusion maps, spectral clustering, and eigenfunctions of Fokker-Planck operators, NeurIPS 2005
- Coifman and Lafon, Diffusion Maps, Applied and computational harmonic analysis 2006
- Tenenbaum et al., A global geometric framework for non-linear dimensionality reduction, Science 2000 (Isomap)
4. Data diffusion, continuous diffusion, PHATE
Previously (3), diffusion maps.
Connections between discrete and continuous diffusion
For finite bandwidth $\epsilon$ and infinitely many points, the Markov chain in discrete time and space converges to a Markov jump process in discrete time but continuous space.
As $\epsilon \to 0$, this jump process converges to a diffusion process on the underlying manifold. In the continuum limit of infinitely many points and continuous time, this resembles a Fokker-Planck diffusion process.
Fokker-Planck Diffusion Process
Describes diffusion as a mix of random Brownian forces $\dot{w}(t)$ and drag forces $-\nabla U(x)$:
$$\dot{x}(t) = -\nabla U(x(t)) + \sqrt{2}\,\dot{w}(t)$$
Given an initial condition $p(x, 0)$ at time $t = 0$, the density $p(x, t)$ follows the forward Fokker-Planck equation (FPE):
$$\frac{\partial p}{\partial t} = \Delta p + \nabla \cdot (p \nabla U)$$
$\Delta$ refers to the Laplacian $\nabla \cdot \nabla$, the divergence of the gradient.
General solutions of the FPE are given by the eigenfunctions of the FPE operator. In our case, the potential $U$ is the log of the data density (so drag is related to data density).
Data density
The empirical density of the data points is a proxy for the underlying density $q(x)$ at point $x$.
In the setting of our data, the “potential” $U(x)$ is the log of the data density $q(x)$.
Essentially, the FPE is saying we have a diffusion process that has “drag” because of data density.
Anisotropic normalization
Instead of the diffusion kernel from before, we can define a different, anisotropically normalized symmetric kernel:
$$K^{(\alpha)}(x, y) = \frac{K(x, y)}{q(x)^{\alpha}\, q(y)^{\alpha}}$$
If $\alpha = 1$, this kernel has uniform degree (all points have density of 1).
Convergence to Laplace-Beltrami
If you set $\alpha = 1$, then the density terms cancel. The second (drift) term in the FPE drops out:
$$\frac{\partial p}{\partial t} = \Delta p$$
and we are left with the heat equation,
which is purely controlled by the Laplacian. This is a second derivative operator and purely geometric, hence separating density from geometry.
Summary: with infinitely many points and continuous time, the FPE represents the diffusion process. We need to reconcile the drag term in the FPE, which is given by the density of the data. We resolve this by dividing the density out with the anisotropic normalization kernel, which converges to the Laplace–Beltrami operator, successfully separating density from geometry.
Application of diffusion maps to clustering / PHATE
Previously (3): k-means clustering algorithm, spectral clustering using diffusion map coordinates.
PHATE goal: Create a visualization that preserves data geometry in low dimensions.
- Denoise the data
- Capture geometry
- Compress into visualizable dimensions
- Allow for scalability to large modern datasets

PHATE uses adaptive bandwidths / an alpha-decaying kernel:
$$K_{k,\alpha}(x, y) = \frac{1}{2}\exp\left(-\left(\frac{\|x - y\|}{\epsilon_k(x)}\right)^{\alpha}\right) + \frac{1}{2}\exp\left(-\left(\frac{\|x - y\|}{\epsilon_k(y)}\right)^{\alpha}\right)$$
where
- $\alpha$ controls the rate of decay of the kernel
- $\epsilon_k(x)$ is the distance from $x$ to its $k$-th nearest neighbor
- $k$ is a parameter
- adaptive bandwidths adjust for density.
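A minimal numpy sketch of the alpha-decaying kernel written out above; the parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def alpha_decay_kernel(X, k=5, alpha=10):
    D = cdist(X, X)                                   # pairwise distances
    eps_k = np.sort(D, axis=1)[:, k]                  # distance to each point's k-th nearest neighbor
    K = 0.5 * (np.exp(-(D / eps_k[:, None]) ** alpha)       # bandwidth of x
               + np.exp(-(D / eps_k[None, :]) ** alpha))    # bandwidth of y (symmetrize)
    return K
```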
Plot of alpha decay kernel:

Spectral entropy
- The amount of spectral entropy, a proxy for information, decays as the diffusion time $t$ increases
- Choose a spot after the elbow
- This heuristically corresponds to a point after the noise decays and before the signal decays

Potential distance in PHATE
$$V_t(x_i, x_j)^2 = \sum_k \left(\log P^t_{ik} - \log P^t_{jk}\right)^2$$
- a triplet distance between $x_i$ and $x_j$ that sums over contributions from all other points $x_k$
- damps diffusion probabilities using a log factor, which increases the importance of far points in contextualizing $x_i$ and $x_j$
- accounts for the covariance between random walks arising from $x_i$ vs. those arising from $x_j$
Diffusion components track one branch at a time, zeroing out everything else. Makes them good for clustering but not great for visualization. See image for example:

Other remarks
- Ways to speed up PHATE: compressed diffusion
- DeMAP: denoised manifold affinity preservation
- PHATE preserves cluster and trajectory structure, denoises to reveal more local structure
Recommended reading
- Nadler et al., Diffusion maps, spectral clustering and reaction coordinates of dynamical systems, Applied and Computational Harmonic Analysis 2006
- Moon et al., Visualizing Structure and Transitions in High Dimensional Biological Data, Nature Biotechnology 2019
5. tSNE and UMAP
Ways to compare probability distributions
- cross entropy
- divergences
- KL divergence
- Jensen-Shannon divergence
- distances
- Wasserstein distance
- maximum mean discrepancy
- more in (6)
Entropy
Represents the expected amount of uncertainty in a distribution:
$$H(p) = -\sum_x p(x)\log p(x)$$
Entropy measures the expected amount of information you learn by observing the outcome.
Cross Entropy
Given 2 distributions $p$ and $q$: how many bits on average does it take to encode samples from $p$ with a code that is optimized for $q$?
$$H(p, q) = -\sum_x p(x)\log q(x)$$
Example: encoding a distribution $p$.
If we have 3 outcomes $\{a, b, c\}$, we can encode:
- a as 0
- b as 1
- c as 10
Then the average number of bits needed is $\sum_x p(x)\,\ell(x)$, where $\ell(x)$ is the length of the code for $x$.
Suppose we use the encoding optimized for $p$ on a different distribution $q$ now. Then the average number of bits needed is $\sum_x q(x)\,\ell(x)$, which is in general larger than what a code optimized for $q$ would need.
KL divergence
The expected number of extra bits needed if using samples from $p$ with a code optimized for $q$. This is related to cross entropy and is sometimes called relative entropy:
$$D_{KL}(p\,\|\,q) = \sum_x p(x)\log\frac{p(x)}{q(x)} = H(p, q) - H(p)$$
Note that $D_{KL}(p\,\|\,q) \geq 0$, with equality iff $p = q$.
Important properties: $D_{KL}$ is not a distance. $D_{KL}(p\,\|\,q)$ is not the same as $D_{KL}(q\,\|\,p)$, and it does not follow the triangle inequality.
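A quick numeric check of these definitions with two hypothetical distributions (the values are chosen purely for illustration).

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])      # hypothetical distribution p
q = np.array([0.8, 0.1, 0.1])        # hypothetical distribution q

entropy = -(p * np.log2(p)).sum()          # H(p) = 1.5 bits
cross_entropy = -(p * np.log2(q)).sum()    # H(p, q)
kl = (p * np.log2(p / q)).sum()            # D_KL(p || q) = H(p, q) - H(p) >= 0
print(entropy, cross_entropy, kl)
```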
Jensen Shannon Distance
JS divergence symmetrizes KL via the mixture $m = \frac{1}{2}(p + q)$:
$$JS(p, q) = \frac{1}{2}D_{KL}(p\,\|\,m) + \frac{1}{2}D_{KL}(q\,\|\,m)$$
It is symmetric but still does not follow the triangle inequality. It can be made a proper distance by taking the square root of the Jensen-Shannon divergence.
tSNE: t-Stochastic Neighbor Embedding
Concept: Maintain neighborhoods, preserve the data’s shape both locally and globally.

Stochastic neighbors: tSNE creates a probability distribution that represents similarities between neighbors, i.e., the conditional probability that one point would pick another as its neighbor. The similarity of $x_j$ to $x_i$ is defined to be proportional to a probability density under a Gaussian centered at $x_i$:
$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i}\exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$
The variance (bandwidth parameter) $\sigma_i$ of the Gaussian at each point is set such that the Shannon entropy of its neighbor distribution is a fixed value.
This fixed value is defined by a parameter we set called perplexity.
Penalty: tSNE minimizes KL divergence, i.e., we want to match the probability distributions of high dimensional and low dimensional neighbors. Optimize with SGD.
Note on SNE vs tSNE: high-dimensional similarities are measured as shown above, use an adaptive bandwidth, and are symmetrized:
$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$
The low-dimensional similarities $q_{ij}$ are computed differently depending on whether we use SNE or tSNE.
In both SNE and tSNE, we are moving points around to minimize KL divergence. However, the problem with normal SNE is that there is a low penalty for moving points (in low dim) around if they are far away in high dimensions.

To fix this, tSNE uses the Student-t distribution in low-dimensional space:
$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l}\left(1 + \|y_k - y_l\|^2\right)^{-1}}$$
The fatter tails help keep $q_{ij}$ a little bit higher, so points which are far away can still be reasonably penalized in their low-dimensional representations. This also helps with crowding.
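A typical usage sketch with scikit-learn (assumed available); the dataset and parameter values are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)
# perplexity sets the effective neighborhood size (fixed entropy of the conditionals)
Y = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(X)
```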


UMAP
Uniform Manifold Approximation and Projection = UMAP.
Assumes:
- the data is uniformly distributed on Riemannian manifold
- Riemannian metric is locally constant
- manifold is locally connected
Effectively uses a tSNE-like method to embed data.
- UMAP works with unnormalized similarities
- uses the same adaptive kernel as tSNE
- symmetrized high-dimensional similarities on the kNN graph: $p_{ij} = p_{j|i} + p_{i|j} - p_{j|i}\,p_{i|j}$
- low-dimensional similarities: $q_{ij} = \left(1 + a\|y_i - y_j\|^{2b}\right)^{-1}$
Uses a binary cross-entropy loss, a penalty for modeling similar points as dissimilar and vice versa:
$$C = \sum_{i \neq j}\left[p_{ij}\log\frac{p_{ij}}{q_{ij}} + (1 - p_{ij})\log\frac{1 - p_{ij}}{1 - q_{ij}}\right]$$
Initialization to a Laplacian eigenmap (Belkin and Niyogi), a geometric method. The “effective” loss function says that UMAP effectively binarizes neighbors.
Scalability: neighbors and non-neighbors are just sampled from the data graph. Therefore, we don’t have to compute a whole affinity matrix or Markov matrix like in tSNE, hence it is much faster and more scalable.
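A typical usage sketch with the umap-learn package (assumed available); the toy data and parameter values are illustrative.

```python
import numpy as np
import umap  # umap-learn package (assumed installed)

X = np.random.rand(500, 20)   # toy data
# n_neighbors controls the local neighborhood size; min_dist controls how tightly
# points are packed in the embedding (the attraction-repulsion trade-off).
Y = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2).fit_transform(X)
```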
Attraction-repulsion continuum: Unlike tSNE, something being close doesn’t mean something else has to be far (an effect of tSNE’s row-stochastic normalization). Therefore, UMAP visualizations have a less “explosive” look:

UMAP vs tSNE
- UMAP uses similarities, tSNE uses probabilities (stochastic nbrs)
- positive sampling of neighbors vs. negative sampling of non-neighbors
- Laplacian eigenmap initialization of UMAP is important and different
Recommended reading
- Hinton & Roweis, Stochastic Neighbor Embedding, NeurIPS 2002
- van der Maaten & Hinton, Visualizing Data using t-SNE, JMLR 9 (2008)
- Böhm et al., Attraction-Repulsion Spectrum in Neighbor Embeddings, JMLR 2022
- Damrich & Hamprecht, On UMAP's True Loss Function, NeurIPS 2021
6. Graph Laplacians and Graph Signal Processing
Laplacian
Laplacian = divergence of the gradient, sum of all the unmixed second derivatives at any point.
In 1-D, the Laplacian is the second derivative, which measures the curvature of a function. In multiple dimensions, the Laplacian at a point measures how much the function’s value deviates from the average of its values in a small neighborhood around that point, related to mean curvature.
The heat equation uses the Laplacian,
$$\frac{\partial u}{\partial t} = \Delta u,$$
to represent the rate at which a quantity diffuses at a given point. Other applications of the Laplacian include computer vision, quantum mechanics (Hamiltonian), electromagnetism (Poisson eq.).
The Laplacian is a linear operator:
- additivity: $\Delta(f + g) = \Delta f + \Delta g$
- homogeneity: $\Delta(cf) = c\,\Delta f$
Sines, cosines, and complex exponentials are eigenfunctions of the Laplacian in Euclidean space (i.e., the second derivative of each such function is a scalar multiple of the original function).
Discrete and Graph Laplacian
A discrete Laplacian is a discrete analog of continuous Laplacians, often on a grid. In a graph, the Laplacian operator is a linear operator, acting as a matrix, that represents how a function’s value at a vertex differs from its average over neighboring vertices. Therefore, the graph Laplacian is defined as the degree matrix minus the adjacency matrix,
$$L = D - A,$$
which measures how a vertex is different from its neighbors.
Quick note on finite difference approximation, where the continuous second-order partial derivatives of the Laplacian are replaced with finite differences using Taylor series expansions, e.g.:
- $\frac{\partial^2 f}{\partial x^2} \approx \frac{f(x + h) - 2f(x) + f(x - h)}{h^2}$
Unnormalized Laplacian
- $L = D - A$
Normalized Laplacian
- $L_{\text{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2}$
Random walk Laplacian
- $L_{\text{rw}} = D^{-1} L = I - D^{-1} A$, related to the Markov (random walk) matrix $P = D^{-1} A$.
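A minimal numpy construction of these Laplacians from an adjacency matrix; the small graph is a hypothetical example.

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # toy adjacency matrix

D = np.diag(A.sum(axis=1))                  # degree matrix
L = D - A                                   # unnormalized Laplacian
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L_sym = D_inv_sqrt @ L @ D_inv_sqrt         # normalized (symmetric) Laplacian
L_rw = np.linalg.inv(D) @ L                 # random-walk Laplacian, I - P

vals, vecs = np.linalg.eigh(L_sym)          # eigenvalues in [0, 2]; second smallest = Fiedler value
```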

We can think of the graph Laplacian as a discrete version of finite difference approximation.
By the spectral theorem, $L$ is eigendecomposable into an orthonormal eigenbasis, $L = \Psi \Lambda \Psi^T$, where $\Psi$ contains the eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues.
Eigenvalues of the normalized Laplacian are in the range $[0, 2]$.
We often take $L$ to be a normalized Laplacian, which helps remove the effect of density.
$0$ is an eigenvalue, with the constant vector as its eigenvector (for the unnormalized Laplacian). The second smallest eigenvalue, called the “Fiedler value,” measures graph connectivity.

Laplacian Eigenmaps
Belkin & Niyogi: map each point to its Laplacian eigenvector coordinates,
$$x_i \mapsto \left(\psi_1(i), \psi_2(i), \ldots, \psi_k(i)\right),$$
where each $\psi_j$ is an eigenvector of $L$. ($x_i$ literally gets mapped to the $i$-th coordinate of each eigenvector, using a number of eigenvectors depending on the dimension you’re reducing to.)
We can recast this as a smoothness minimization problem: a desirable embedding may be one where nearby points are mapped to similar values. We choose a Gaussian kernel for the weights $W_{ij}$ and minimize
$$\sum_{ij} W_{ij}\left(f(i) - f(j)\right)^2 = f^T L f.$$
Based on Courant–Fischer, the minimizer of this is the trivial constant eigenvector $\psi_0$, and the next best minimizer is the Fiedler vector $\psi_1$.

Graph Signal Processing
Fourier transform of a signal:
$$\hat{f}(\xi) = \int f(x)\, e^{-2\pi i x \xi}\, dx$$
Usually, we don’t have the functional form available to compute the integral, so we have to use discrete samples and discrete wavelengths for the discrete Fourier transform.

The DFT is a matrix multiplication: $\hat{f} = F f$, where $F_{jk} = e^{-2\pi i jk / N}$.

We can interpret features as signals on a graph. Applying the graph Fourier transform, $\hat{f} = \Psi^T f$ (where $\Psi$ contains the Laplacian eigenvectors), we can decompose signals into graph Fourier coefficients.
Spectral Graph Filtering
See Stankovic et al. 2018. We are interested in finding which frequencies are noise, then altering the frequencies to retrieve the corrected signal.
A filter is generally constructed from some function $h(\lambda)$, which reweights the eigenvalues (see below).

Data signals are low frequency and noise can be high frequency. We can take noise off by taking high frequency eigenvectors off. A low pass filter is when you let low frequencies through and filter out high frequencies.
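A minimal sketch of an ideal low-pass filter applied in the graph Fourier domain; the filter shape and cutoff are assumed choices.

```python
import numpy as np

def low_pass_filter(L_sym, x, cutoff=0.5):
    """Filter a graph signal x by zeroing out high-frequency (large-eigenvalue) components."""
    vals, vecs = np.linalg.eigh(L_sym)      # graph Fourier basis = Laplacian eigenvectors
    x_hat = vecs.T @ x                      # graph Fourier transform
    h = (vals <= cutoff).astype(float)      # ideal low-pass response h(lambda)
    return vecs @ (h * x_hat)               # filter and transform back
```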

Problems with frequency domain filters:
- Problem 1: The filters are not localized operations
- They are operating globally based on frequency characteristics
- Potential fix: Use wavelet filters instead of Fourier
- Wavelets are localized in time and space
- Problem 2: The Fourier transform is expensive, requiring an eigendecomposition of an NxN matrix each time
- Fix: Polynomial filters (will cover in GNN lecture)
- Fix: Diffusion based operations (also a polynomial filter)
K-filter coefficients
Reduce number of coefficients in filter
Chebyshev Polynomials
Use a polynomial filter (expressed in a Chebyshev basis), which automatically applies the filter to the eigenvalues without explicitly computing the eigenvectors.

From Fourier to Wavelets
- Fourier transforms capture global frequency information, frequencies that persist over the entire length of the signal
- What about small bursts of frequency changes?
- Uneven frequency distributions over time?
- These might benefit from a more localized frequency domain analysis
A solution: wavelets! Unlike a sine or cosine function, a wavelet is a localized oscillation. They can be scaled (to look at low vs. high frequency components) and translated (to look at different times and locations).
Discrete wavelet transform

Diffusions can be used to create wavelets. Diffusion wavelets $\Psi_j = P^{2^{j-1}} - P^{2^j}$ are formed by taking differences between two (dyadic) scales of diffusion.
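A minimal sketch of dyadic diffusion wavelets built from a diffusion operator $P$, following the difference-of-scales definition above; the number of scales is an assumed parameter.

```python
import numpy as np

def diffusion_wavelets(P, n_scales=4):
    """Return wavelet operators Psi_j = P^(2^(j-1)) - P^(2^j) for j = 1..n_scales."""
    wavelets = []
    P_prev = P.copy()                 # P^(2^0) = P
    for _ in range(n_scales):
        P_next = P_prev @ P_prev      # square to reach the next dyadic power
        wavelets.append(P_prev - P_next)
        P_prev = P_next
    return wavelets                   # apply to a signal x as W @ x
```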
Recommended reading
- Belkin and Niyogi Laplacian Eigenmaps for Dimensionality Reduction and Data Representation, Neural Computation 2003
- Shuman et al., The Emerging Field of Signal Processing on Graphs: Extending High-Dimensional Data Analysis to Networks and Other Irregular Domains, IEEE Signal Processing Magazine 2013
- Coifman & Maggioni, Diffusion Wavelets, Applied and Computational Harmonic Analysis 2006
- Gao et al. Geometric Scattering for Graph Data Analysis, ICML 2019
- Castro et al. Uncovering the Folding Landscape of RNA Secondary Structure with Deep Graph Embeddings, IEEE Big Data 2019
7. Graph Neural Networks and Geometric Scattering
Rough neural network overview. Covered:
- neurons as linear transformation + nonlinear activation function
- differentiability of neural networks allows us to learn relationship between input and desired output
- nodes arranged in layers to form deep networks
- depth creates a composition of functions to learn complicated relationships / increase abstraction
- learning algorithms automatically tune the weights and biases
- gradient descent
Design choices
- activation fn
- cost fn
- number and dimension of layers
- connection / dropout between layers
- regularization (layers, batching, etc)
- more
Graph neural network connection
- Spatial (vertex space) construction
- apply same operator and move over graph space in some fashion
- notion of locality based on neighbors of a node
- Spectral construction
- define a convolutional operation in graph spectral space
- use signal processing theory to define filters in the graph Fourier domain
- notion of locality based on global properties of the filter
Graph notation–adjacency matrix can be weighted, vertex features are graph signals.
Message passing: aggregate information from neighbors, then update. There can be several message passing iterations. After $K$ iterations you get a latent embedding $h_v^{(K)}$ of node $v$.
Aggregation and update steps can involve trainable weight matrices.
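A minimal numpy sketch of one mean-aggregation message passing layer (GraphSAGE-style); the weight shapes, ReLU activation, and function names are assumptions.

```python
import numpy as np

def message_passing_layer(A, H, W_self, W_neigh):
    """One round of message passing with mean aggregation.
    A: (n, n) adjacency matrix, H: (n, d) node features,
    W_self, W_neigh: (d, d_out) trainable weight matrices (given here for simplicity)."""
    deg = A.sum(axis=1, keepdims=True).clip(min=1)
    messages = (A @ H) / deg                      # aggregate: mean over neighbors
    H_new = H @ W_self + messages @ W_neigh       # update: combine self and neighbor information
    return np.maximum(H_new, 0)                   # ReLU nonlinearity
```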


We can train GNNs to do node-, edge-, graph-, and signal-level tasks. The most common are node, edge, and graph tasks.
Node level tasks: node classification/regression. Map node embeddings directly via node head to prediction.

Link prediction: predict a link that should be there based on neighboring links.
Graph level tasks: graph classification (does this graph have a 5-clique?). Graph regression (does the molecule this graph encodes bind to a virus?). Usually an additional layer of pooling. Also a pooling strategy called DiffPool.

Signal level task: features on vertices constitute a signal on a graph. Some graphs have more signals than vertices / edges. Classifying signals is largely unrecognized but important. Classic example is classifying traffic patterns.
- Permutation equivariance: graph and node representations are the same regardless of how nodes are ordered in the adjacency matrix. Achieved via permutation-equivariant aggregation operations like mean, sum, max.
- Inductive capability: sharing weights across nodes
These types of GNNs are limited and message passing can lead to oversmoothing.
Weisfeiler–Lehman Test of Isomorphism
Landmark result: GNNs are only as powerful as W-L test of isomorphism.

This test results in false positives (e.g., we can get graphs that are colored the same but aren’t isomorphic).
Some solutions
- Don’t just add layers to GNNs (eg message passing iterations): increase power with aggregation.
- Add skip connections–each layer does smoothing, so skip connections can be like a band pass.
- Feature augmentation / engineering
- add features to make GNNs more expressive, e.g. cycle count, node degree, edge degree, things from graph theory.
GraphSAGE
- inductive, message passing network
- mean aggregation (aggregates neighbors and concatenates self)
- LSTM aggregation
Spectral Construction
Features are signals on graph. These signals have an analogous Graph Fourier Transform, which involves using the eigenvectors of the graph Laplacian as waveforms (see previous lecs). Some spectral methods won’t directly compute GFT, will instead implement filter as polynomial of graph Laplacian.
See previous lec for Graph Laplacian review.
What does it mean to create filters?

We want to learn these filters with a neural network, which is what graph spectral nns were designed to do.

Problems
- the Fourier transform is expensive, requiring an eigendecomposition of an $N \times N$ matrix each time
- fix: Chebyshev approximation instead
- many GCNs use first order polynomials of graph Laplacian (we learn the polynomial coefficients)

Wavelet-based constructions
See previous lec on continuous and discrete wavelet transforms (don’t need to know). Waveform centered at every node performs message aggregation.
Graph wavelet networks are usually created with a diffusion operator. Diffusion wavelets differ from lazy random walks (plain diffusion) because they capture differences in diffusion between dyadic powers.
Geometric Scattering
Alternative to GNNs. Use a cascade of diffusion wavelets to create a “deep” wavelet transform of a signal on a graph (instead of message aggregation and updating). The absolute value acts as the pointwise non-linearity or “activation”. The cascade is parameterized by a scattering path that determines the scales of each wavelet.
Whole graph representation formed by taking an aggregation over all node level features.
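A minimal sketch of a geometric scattering transform using the dyadic diffusion wavelets from the GSP section; the depth (second order), scales, and choice of moments for the node aggregation are assumptions.

```python
import numpy as np

def geometric_scattering(P, x, n_scales=4, moments=(1, 2, 3, 4)):
    """Zeroth-, first-, and second-order scattering coefficients of a graph signal x."""
    # Dyadic diffusion wavelets Psi_j = P^(2^(j-1)) - P^(2^j)
    wavelets, P_prev = [], P.copy()
    for _ in range(n_scales):
        P_next = P_prev @ P_prev
        wavelets.append(P_prev - P_next)
        P_prev = P_next

    def agg(signal):  # whole-graph representation: statistical moments over nodes
        return [np.sum(np.abs(signal) ** q) for q in moments]

    coeffs = agg(x)                                    # zeroth order
    for j1, W1 in enumerate(wavelets):
        u1 = np.abs(W1 @ x)                            # |Psi_j1 x| (pointwise nonlinearity)
        coeffs += agg(u1)                              # first order
        for W2 in wavelets[j1 + 1:]:                   # scattering path with j2 > j1
            coeffs += agg(np.abs(W2 @ u1))             # second order
    return np.array(coeffs)
```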
8. Optimal Transport on Graphs
Recall KL, JS divergence–divergences have issues when distributions have disjoint support. Optimal transport–how hard is it to push/transform one distribution into another (Wasserstein)?
Wasserstein Distance & Optimal Transport
Ground “cost” or distance: $C_{ij} = d(x_i, x_j)$.
(There are also different Wasserstein distances, e.g. $W_p$ for different powers of the ground distance.) Proportional to the amount of work needed to turn one distribution into another.
We want to minimize the transport cost
$$W(\mu, \nu) = \min_{T \geq 0} \sum_{ij} T_{ij} C_{ij}$$
with constraints:
$$\sum_j T_{ij} = \mu_i$$
for the row marginal, and
$$\sum_i T_{ij} = \nu_j$$
for the column marginal. This is a linear program to optimize.
On Graphs
Distributions are normalized positive graph signals (if we have negative signal values, we can shift all signals to being positive).
Entropy Regularized Optimal Transport
Suppose you wanted to maximally spread the probability distribution. Then you might want to proportionally spread each bin of the “source” distribution $\mu$ according to the needs of the sink distribution $\nu$. This particular joint probability distribution is the independence table $\mu \nu^T$, the outer product of $\mu$ and $\nu$. Thus we can add a penalty for the transport plan $T$ to deviate from $\mu \nu^T$.
Discrete problem formulation with entropic constraint, solvable with Lagrange multipliers.
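A minimal numpy sketch of Sinkhorn iterations for entropy-regularized OT between two histograms with a cost matrix $C$; the regularization strength and iteration count are assumed values.

```python
import numpy as np

def sinkhorn(mu, nu, C, reg=0.1, n_iters=200):
    """Entropy-regularized OT: min_T <T, C> - reg * H(T) s.t. T 1 = mu, T^T 1 = nu."""
    K = np.exp(-C / reg)                   # Gibbs kernel
    u = np.ones_like(mu)
    v = np.ones_like(nu)
    for _ in range(n_iters):
        u = mu / (K @ v)                   # enforce row marginals
        v = nu / (K.T @ u)                 # enforce column marginals
    T = u[:, None] * K * v[None, :]        # transport plan
    return T, (T * C).sum()                # plan and transport cost
```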
Sinkhorn
Primal method of optimal transport.
- Sinkhorn iteration algorithm for optimal transport.
- Heat kernel method
- Heat kernel on a graph, given by the matrix exponential $H_t = e^{-tL}$ of the graph Laplacian, approximated via a Taylor series
- the graph heat kernel can be used to implement Sinkhorn on a graph
- Chebyshev polynomial approximation results in matrix-vector products, which are efficient for sparse matrices. This can be used to filter the signal and begin the Sinkhorn iterations.
- Geodesic Sinkhorn method: converts the Sinkhorn method to operate on a graph.

Dual form of optimal transport
Kantorovich–Rubinstein Duality

Instead of finding ways to transport mass from one bin to another (primal), you try to find a (1-Lipschitz) witness function $f$ whose expectation is very different between the two distributions:
$$W_1(\mu, \nu) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{y \sim \nu}[f(y)]$$

In the image, $f$ has high likelihood under the top distribution and low likelihood under the bottom distribution. Use multiscale histograms/density estimates of the data:
- Take the difference of two histograms at different scales
- WEMD uses a wavelet basis to represent the difference histogram
- Since wavelets are a rich basis, the wavelet transform gives an approximation of the witness function $f$
This has also been done with trees (tree Wasserstein).
- Suppose you have data from 2 distributions $\mu$ and $\nu$.
- combine the data and approximate it with a tree
Dual form computations vs primal
- Pros: faster, results from norms on data embeddings
- Cons: loss of accuracy and information. Limitations on ground distance (usually Euclidean), structural assumptions (image, tree embeddings).
- We want the best of both worlds, e.g. adapting to data structure, accurate, fast.
Diffusion EMD
Uses diffusion wavelets at different scales to decompose signals (distributions) on graphs. Taking differences between the diffused signals across scales gives diffusion EMD.

Embeddings using diffusion EMD
Create a higher-level distance matrix using pairwise EMD. Visualize with PHATE to preserve manifold geometry.
Ground manifold distance and EMD: take ground distance as diffusion distance.

Recommended Reading
- Peyré & Cuturi, Computational Optimal Transport, https://arxiv.org/abs/1803.00567
- Cuturi, Sinkhorn Distances: Lightspeed Computation of Optimal Transport, NeurIPS 2013
- Huguet et al., Geodesic Sinkhorn for fast and accurate optimal transport on manifolds, IEEE MLSP 2023
- Chen et al., Uncovering axes of variation among single-cell cancer specimens, Nature Methods 2020
- Tong et al., Diffusion Earth Mover's Distance and Distribution Embeddings, ICML 2021
- Zapatero, Tong et al., Trellis tree-based analysis reveals stromal regulation of patient-derived organoid drug responses, Cell 2023
9. Riemannian Geometry and ML Applications
Topological space
- Combines a set of points with a topology
- Topology: a collection of subsets (called open sets) that define the structure of neighborhoods and closeness within the set
- Enables formal analysis of continuity and convergence without relying on a notion of distance (unlike in geometry)
Let $X$ be a set and $\tau$ a family of subsets of $X$. $\tau$ is a topology on $X$ if:
- The empty set and $X$ are elements of $\tau$
- $\tau$ is closed under arbitrary unions and finite intersections
- $(X, \tau)$ is called a topological space
Topological manifolds
- A manifold is a topological space which is locally Euclidean
- This means each point has a neighborhood which is homeomorphic to an open subset of Euclidean space
A map $f: X \to Y$ between two topological spaces is a homeomorphism if:
- it is a bijection
- it is continuous
- its inverse is continuous (equivalently, $f$ maps open sets to open sets, i.e., it is an open mapping)
Chart and Atlases
A neighborhood together with its homeomorphism is a chart. Manifolds which are not homeomorphic to $\mathbb{R}^n$ necessarily require more than one chart. The collection of these charts together is an atlas, and the charts allow us to create local coordinate systems for each nbhd and create a tangent basis for it.
We use this idea in manifold learning by using Euclidean distances only to find local neighbors in manifold learning. Recall that we compute pairwise distances when computing kernels. The pointwise kernel function (e.g. Gaussian) eliminates non-local distances.
Importance of neighborhoods: In some manifold learning methods (UMAP, tSNE), there is an explicit computation of neighborhoods. The idea is to match local neighbors between the high- and low-dimensional embeddings: if the charts are matched, then manifold information is carried to low dimensions. This breaks if the manifold is not well sampled (cf. UMAP’s uniform-sampling assumption).
Differentiable Manifolds
A differentiable manifold is a topological manifold with a differential structure. A topological manifold can be given a differential structure locally by using the homeomorphisms in its atlas and the standard differential structure on a vector space.
To include a global differential structure on the local coordinate systems induced by the homeomorphisms, their compositions on chart intersections in the atlas must be differentiable functions on the corresponding vector space. Where the domains of charts overlap, the coordinates defined by each chart are required to be differentiable with respect to the coordinates defined by every chart in the atlas. The maps that relate the coordinates defined by the various charts to one another are called transition maps.

The tangent space of a manifold $\mathcal{M}$ at a specific point $p$ is denoted $T_p\mathcal{M}$. It is an $n$-dimensional vector space that locally approximates the manifold at that point.

In ML methods where we look at how to move from point to point on a manifold, this assumes some kind of smoothness (differentiability). Therefore, we’re interested in Riemannian geometry.
Riemannian metric
Given a smooth topological manifold $\mathcal{M}$, we introduce geometry to the manifold by prescribing an inner product on each tangent space:
$$g_p : T_p\mathcal{M} \times T_p\mathcal{M} \to \mathbb{R}$$

This allows us to define ideas of length and distance:
$$L(\gamma) = \int_a^b \sqrt{g_{\gamma(t)}\left(\dot{\gamma}(t), \dot{\gamma}(t)\right)}\, dt$$
$$d(p, q) = \inf_{\gamma:\ p \to q} L(\gamma)$$
The metric is a local element of volume or length. It allows us to build up the real volume in small steps. This is exactly what we do when we compute the length of a path on a graph, albeit discretely. When we find a shortest path we are finding the infimum between these paths.
The volume can also be used for density estimation on the manifold: if you have many points in a nbhd, you can divide by the local volume to get local density.
Curvature
Intuition: one way to define curvature in low dimensions is with the osculating circle (a circle that kisses a curve at a point by being tangent to it and having the same curvature as the curve at that point).
In 2-D, there is Gaussian curvature. There are infinitely many curves that pass through any point on a 2-D surface. Locally at any point $p$ we can find a unit normal vector $N(p)$ that is perpendicular to the surface at $p$. The map $p \mapsto N(p)$ is called the Gauss map. We can measure curvature by measuring how fast $N$ changes.

In higher dimensions, one can look at how a higher-dimensional shape is deformed as it moves through the manifold. This requires covariant derivatives, the curvature tensor, the Ricci tensor, etc.
- Intuitively, Ricci curvature measures how the volume of a small ball or the divergence of geodesics changes in a specific direction on a curved surface, indicating how much the local geometry deviates from flat Euclidean space.
- Positive Ricci curvature implies convergence of parallel paths (like on a sphere)
- Negative suggests divergence (like hyperbolic space)
- A directional average
Gaussian vs Ricci Curvature
- In 2D, the Ricci curvature is the Gaussian curvature
- In higher dimensions, the Gaussian curvature is the sectional curvature of a 2D slice of the manifold
- Ricci curvature is the average of the sectional curvatures of all possible 2D slices that include a given tangent vector.
Curvature on Graphs
Ollivier-Ricci curvature compares distance between points vs. optimal transport of a 1-hop neighborhood.
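A minimal sketch of Ollivier–Ricci curvature for one edge, using networkx for shortest-path distances and the POT package for the $W_1$ computation (both assumed available); putting a uniform measure on the 1-hop neighborhood is one common choice among several variants.

```python
import networkx as nx
import numpy as np
import ot  # POT: Python Optimal Transport (assumed installed)

def ollivier_ricci_edge(G, x, y):
    """kappa(x, y) = 1 - W1(m_x, m_y) / d(x, y), with m_z uniform on the neighbors of z."""
    nbrs_x, nbrs_y = list(G.neighbors(x)), list(G.neighbors(y))
    m_x = np.full(len(nbrs_x), 1.0 / len(nbrs_x))
    m_y = np.full(len(nbrs_y), 1.0 / len(nbrs_y))
    # Ground cost: shortest-path distances between the two neighborhoods
    M = np.array([[nx.shortest_path_length(G, u, v) for v in nbrs_y] for u in nbrs_x],
                 dtype=float)
    w1 = ot.emd2(m_x, m_y, M)              # exact 1-Wasserstein distance
    return 1.0 - w1 / nx.shortest_path_length(G, x, y)

# Example: ollivier_ricci_edge(nx.karate_club_graph(), 0, 1)
```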
There are other notions of curvatures on graphs, such as the diffusion curvature:

Bishop-Gromov gives us theorems about the volume of balls on Riemannian manifolds.

A comparable volume theorem for diffusion/graphs:

Forman Ricci curvature: useful in GNNs, helps combat “oversquashing” of information or too-tight information bottlenecks. Negative edges can be reinforced to pass additional information.
10. Geometric Manifold Learning
Connecting the discrete Laplacian to the continuous one. Ask: when do they converge?
Energy in the discrete space:
$$E(f) = f^T L f = \frac{1}{2}\sum_{ij} W_{ij}\left(f_i - f_j\right)^2$$
Energy in the smooth setting; we would want
$$E(f) = \int_M \|\nabla f\|^2 \, d\mu$$
Claim:
$$\int_M \langle \nabla f, \nabla g \rangle \, d\mu = -\int_M f\, \Delta g \, d\mu$$
Pf.
$$\operatorname{div}(f \nabla g) = \langle \nabla f, \nabla g \rangle + f\, \Delta g$$
Integrating over the manifold on both sides and applying Stokes’ theorem (the integral of a divergence over a closed manifold vanishes) gives the claim.
We also have similarity in Rayleigh quotients. E.g., in the discrete setting, the Rayleigh quotient is
$$R(f) = \frac{f^T L f}{f^T f}.$$
In the smooth setting, we minimize
$$\frac{\int_M \|\nabla f\|^2 \, d\mu}{\int_M f^2 \, d\mu}.$$
Next, connecting diffusions. Consider the discrete Laplacian $L$ and the column-normalized random walk operator $P = K D^{-1}$ (the same thing as the row-normalized transpose when we define $P^T = D^{-1} K$). If $p_0$ is the initial distribution (e.g. of heat) on the nodes of the graph, $-L p$ represents the rate of change of the diffusion process over the graph and $P$ gives the 1-step transition probabilities. The discrete Laplacian $L$ is (up to sign) the infinitesimal generator of this diffusion.
Compare this to the heat equation in the smooth setting,
$$\frac{\partial u}{\partial t} = \Delta u.$$
Solution: the heat semigroup,
$$u(\cdot, t) = H_t u_0, \quad H_t = e^{t\Delta}.$$
The Laplacian $\Delta$ is the infinitesimal generator of $H_t$:
$$\Delta = \lim_{t \to 0} \frac{H_t - I}{t}.$$
Lastly, convergence: when does the discrete (graph) Laplacian coincide with the smooth Laplacian?
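A small numerical sanity check, under assumed normalizations: the random-walk graph Laplacian built from a Gaussian kernel on points sampled from the unit circle should act, up to a scale factor, like $-\Delta$ on a smooth test function. The bandwidth and sample size are illustrative.

```python
import numpy as np

n, eps = 2000, 0.01
theta = np.sort(np.random.uniform(0, 2 * np.pi, n))
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)      # points on the unit circle

D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-D2 / eps)
P = K / K.sum(axis=1, keepdims=True)                      # random walk operator
L = np.eye(n) - P                                         # random-walk graph Laplacian

f = np.sin(theta)                                         # smooth test function; -f'' = sin(theta)
Lf = L @ f
# Compare directions only: the discrete operator matches -Delta f up to a scale factor.
print(np.corrcoef(Lf, np.sin(theta))[0, 1])               # close to 1 for small eps, large n
```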
11. Autoencoders and Intro to Generative Models
NN overview, unsupervised learning, representation learning
Autoencoders
- design the AE to it can’t learn to copy the input perfectly
- informational bottleneck
- new representation is lower dimension / simpler
- also helps with denoising. We call this an undercomplete autoencoder
- hope: training the undercomplete autoencoder will result in a representation with useful properties; the bottleneck forces the AE to capture the most salient features of the training data
- AE only approximately copies the input
Linear vs. non-linear AEs
- AEs with only linear layers with no activation function learn PCA
- nonlinear AEs learn a nonlinear dimensionality reduction of the data that can still be decoded
- e.g., diffusion maps, UMAP are hard to decode
- generalizable (adding new points is easy)
PCA
- finds the directions in input space that explain the most variation, re-represents data by projecting along those directions
PCA as encoder, decoder
- $n$ data vectors $x_i$ with dimension $d$, sample mean $\bar{x} = \frac{1}{n}\sum_i x_i$
- want to reduce dimensionality from $d$ to $k$
- sample covariance matrix: $\Sigma = \frac{1}{n}\sum_i (x_i - \bar{x})(x_i - \bar{x})^T$
- $U_k$ contains the first $k$ eigenvectors of $\Sigma$, giving the directions of most explained variance
- project the data points on this space:
- encoding: $z = U_k^T (x - \bar{x})$
- decoding with $\hat{x} = U_k z + \bar{x}$
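A minimal numpy version of the PCA encode/decode steps above; the function and variable names are illustrative.

```python
import numpy as np

def pca_fit(X, k):
    x_bar = X.mean(axis=0)
    cov = np.cov(X - x_bar, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)             # ascending eigenvalues
    U_k = vecs[:, ::-1][:, :k]                   # top-k directions of most explained variance
    return x_bar, U_k

def encode(X, x_bar, U_k):
    return (X - x_bar) @ U_k                     # z = U_k^T (x - x_bar)

def decode(Z, x_bar, U_k):
    return Z @ U_k.T + x_bar                     # x_hat = U_k z + x_bar
```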
Two views of PCA: maximize variance, minimize error
Reconstruction:
- assume data come from $d$-dimensional vectors where the $i$-th vector is $x_i$
- we can represent a point in $d$-dimensional Euclidean space in terms of any $d$ orthogonal basis vectors, call them $u_1, \ldots, u_d$
- the goal is to find the vectors such that
$$J = \frac{1}{n}\sum_i \|x_i - \tilde{x}_i\|^2$$
is minimized, where $\tilde{x}_i$ is the projection of $x_i$ onto the first $k$ components of the basis.
Regularization
Any modification made to a learning algorithm that is intended to reduce its generalization error but not its training error
- e.g., L1 and L2 regularization
- adding noise to inputs
- early stopping
- dropout
An example of a bias for simple solutions would be adding the number of parameters to the loss function; this penalizes model complexity.
Denoising Autoencoders
- a traditional AE minimizes $L\big(x, g(f(x))\big)$
- a denoising autoencoder minimizes $L\big(x, g(f(\tilde{x}))\big)$
- $\tilde{x}$ is a corrupted version of $x$
- denoising AEs learn to undo this corruption
- the corruption process is modeled by a conditional distribution $C(\tilde{x} \mid x)$
- the AE learns a reconstruction distribution $p_{\text{reconstruct}}(x \mid \tilde{x})$ from training pairs $(x, \tilde{x})$
The process:
- sample a training example $x$ from the training data
- sample a corrupted version $\tilde{x}$ from $C(\tilde{x} \mid x)$
- use $(x, \tilde{x})$ as a training example for estimating $p_{\text{reconstruct}}(x \mid \tilde{x}) = p_{\text{decoder}}(x \mid h)$
- the encoder output is $h = f(\tilde{x})$, followed by the decoder $g(h)$
- we can then perform SGD on the negative log likelihood $-\log p_{\text{decoder}}(x \mid h)$
- equivalent to doing SGD on $\mathbb{E}_{x \sim \hat{p}_{\text{data}}}\, \mathbb{E}_{\tilde{x} \sim C(\tilde{x} \mid x)}\left[-\log p_{\text{decoder}}\big(x \mid h = f(\tilde{x})\big)\right]$
- $\hat{p}_{\text{data}}$ is the training distribution
Denoising autoencoders have been hypothesized to do manifold learning.
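A minimal PyTorch sketch of a denoising autoencoder trained with Gaussian corruption; the architecture, corruption level, and placeholder data are all assumptions.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.rand(1024, 784)                       # placeholder data (assumed)
for epoch in range(10):
    x_tilde = X + 0.3 * torch.randn_like(X)     # corruption process C(x_tilde | x)
    x_hat = decoder(encoder(x_tilde))           # reconstruct the clean x from the corrupted input
    loss = loss_fn(x_hat, X)                    # L(x, g(f(x_tilde)))
    opt.zero_grad()
    loss.backward()
    opt.step()
```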
Generative models
Usually: noise -> neural net -> something that looks like a training sample.
Probabilistic interpretation: randomly sampled noise -> neural net -> a sample from the training distribution $p_{\text{data}}$.
Can we modify a regular autoencoder for generation?
- basic idea: add noise to samples to push them away from the data manifold and then have the network pull them back
- for each training sample $x$, define a corruption process that creates a corrupted sample $\tilde{x}$
- train a denoising autoencoder to reverse this by using $(\tilde{x}, x)$ as a training sample
See: generalized denoising autoencoder
Walkback training
- train the neural network to walk back from several steps away
- create longer-range corruption processes by chaining the original corruption process
- sample a second corrupted step from the corruption process
- add the resulting pair (second corruption, original sample) to the training set
This way of training trains the DAE to estimate a conditional probability distribution $p(x \mid \tilde{x})$. The original paper (Bengio et al. 2013) shows that a consistent estimator of the data distribution can be recovered by alternating sampling from the corruption process and the denoising process.
However, issues with this approach: data generation process is very slow (like taking slow walk through data), may not span the space, not penalized to generate the distribution.
Variational Autoencoders (VAEs)
Learning latent spaces that allow for sampling, i.e., we want to generate new examples just by sampling in the latent space.
VAEs force the latent vectors to have a roughly unit Gaussian distribution. Generate images by sampling a unit Gaussian and passing it into the decoder.
There exists a tradeoff between accuracy and the unit Gaussian approximation. Build this directly into the loss:
- include a reg term that penalizes the KL divergence between unit Gaussian and the latent vector distribution.
- note that the KL divergence penalty, which keeps latent-space distributions from straying too far from a standard normal, encourages every latent ball to look the same
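For a diagonal Gaussian encoder $q(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and a standard normal prior, the KL regularizer in the bullet above has the standard closed form:
$$D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2))\,\|\,\mathcal{N}(0, I)\big) = \frac{1}{2}\sum_{j=1}^{d}\left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$$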
Recommended reading
- Hinton and Salakhutdinov, Reducing the Dimensionality of Data with Neural Networks, Science 2006
- Alain & Bengio, What Regularized Auto-Encoders Learn from the Data-Generating Distribution, JMLR 2014
- Bengio et al., Generalized Denoising Auto-Encoders as Generative Models, NeurIPS 2013
- Kingma & Welling, Auto-Encoding Variational Bayes, https://arxiv.org/pdf/1312.6114.pdf
- Lopez et al., Deep generative modeling for single-cell transcriptomics, Nature Methods 2018 (scVI)
- van Dijk et al., Finding Archetypal Spaces using Neural Networks, IEEE Big Data 2019
- Venkat, Youlten et al., AAnet resolves a continuum of spatially-localized cell states to unveil intratumoral heterogeneity, Cancer Discovery 2025
12. Variational Calculus, Pullback Metrics, NeuralFIM
Attach slides
13. Probabilistic Methods for Robust and Transparent Machine Learning
Guest lecture