8 Sheaf Diffusion: Heat Equations and Convergence
Purpose. Runs the sheaf heat equation on a fixed sheaf, proves exponential convergence to the harmonic extension, and establishes the “smooth baseline” for the nonlinear ReLU case in Chs. 9–10.
8.1 Motivating example
Take the three-agent discourse sheaf from Ch. 4 with agents pinned at opposite ends of a path. Initialize the middle agent’s opinion randomly. The sheaf heat equation is just gradient descent on the total discrepancy \(\|\delta x\|^2\): at every step, each agent adjusts its opinion to reduce disagreement with its neighbors. The middle agent is pulled toward a compromise between the two pinned agents — the harmonic extension. How quickly does this happen? At what rate does the discrepancy decay?
This is exactly the question Ch. 9 will ask about the neural sheaf, except there the restriction maps switch as the dynamics evolve. Understanding the fixed-sheaf case first makes the switching case a clear extension.
8.2 Key concepts and results
- Sheaf heat equation \(\dot{x} = -\alpha\mathcal{L}_\mathcal{F} x\) and its pinned version.
- Exponential convergence rate = \(\alpha\lambda_{\min}(\mathcal{L}_\mathcal{F}[\Omega,\Omega])\).
- Discrete-time Euler stepping.
- Opinion-dynamics interpretation (Hansen–Ghrist 2021).
- Connections to sheaf neural networks (Bodnar et al. 2022, Hansen–Gebhart 2020).
8.3 Intuition
The sheaf heat equation is gradient descent on the total discrepancy \(E(\omega) = \tfrac{1}{2}\|\delta\omega\|^2\). Since \(E\) is a positive-definite quadratic form on the interior coordinates (by Theorem 5.4), this is the simplest possible optimization problem: a quadratic with a unique minimum, minimized by gradient descent. Exponential convergence follows immediately from the spectral bound that positive definiteness provides.
The new ingredient over the plain graph case (Ch. 2) is that the “gradient” now involves the restriction maps: vertex \(v\)’s update is a weighted sum of transformed neighbor values, not just a plain average. Richer restriction maps can encode richer consensus requirements — and in the neural sheaf, they encode the entire forward pass.
8.4 Formal development
The sheaf heat equation
Let \(\mathcal{F}\) be a fixed cellular sheaf on \(G\) with boundary set \(U\) and free set \(\Omega = V \setminus U\). Write \(\omega = x|_\Omega\) for the free coordinates and \(u = x|_U\) for the fixed boundary data. The pinned sheaf heat equation on the free coordinates is: \[\frac{d\omega}{dt} = -\alpha\bigl(\mathcal{L}_\mathcal{F}[\Omega,\Omega]\,\omega + \mathcal{L}_\mathcal{F}[\Omega,U]\,u\bigr), \qquad \alpha > 0. \tag{6.1}\]
This is exactly (9) of [1] (their Theorem 2.5). The total discrepancy \(E(\omega) = \tfrac{1}{2}\|\delta_\Omega\omega + \delta_U u\|^2\) satisfies \(\nabla_\omega E = \mathcal{L}_\mathcal{F}[\Omega,\Omega]\,\omega + \mathcal{L}_\mathcal{F}[\Omega,U]\,u\), so (6.1) is gradient descent on \(E\) with respect to \(\omega\).
The fixed point of (6.1) is the unique \(\omega^*\) satisfying \(\mathcal{L}_\mathcal{F}[\Omega,\Omega]\,\omega^* = -\mathcal{L}_\mathcal{F}[\Omega,U]\,u\), which is exactly the harmonic extension from Theorem 5.4.
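A minimal numerical sketch of this fixed-point characterization, using an illustrative two-vertex pinned sheaf (the vertex names, maps, and dimensions here are assumptions chosen for the example, not data from the text): \(v_0\) is pinned at \(u\), \(v_1\) is free, and the single edge carries restriction maps \(W\) and \(I\). Solving the harmonic-extension system recovers \(\omega^* = Wu\), and the drift of (6.1) vanishes there.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pinned sheaf on a two-vertex path: v0 pinned at u, v1 free.
# Restriction maps on the single edge e: F_{v0<=e} = W, F_{v1<=e} = I.
W = rng.standard_normal((2, 2))
u = rng.standard_normal(2)

# Coboundary delta = [W, -I] on (x_{v0}, x_{v1}); Laplacian L = delta^T delta.
L_OO = np.eye(2)   # free-free block: I^T I
L_OU = -W          # free-boundary block: -I^T W

# Harmonic extension: solve L[O,O] w* = -L[O,U] u; here w* = W u.
w_star = np.linalg.solve(L_OO, -L_OU @ u)
assert np.allclose(w_star, W @ u)

# The drift of (6.1) vanishes at w*, so the harmonic extension is the fixed point.
assert np.allclose(L_OO @ w_star + L_OU @ u, 0.0)
```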
Exponential convergence
Theorem 8.1 (Exponential convergence of sheaf diffusion; [2], Theorem 5.1). Assume \(H^0(G,U;\mathcal{F}) = 0\) (equivalently, \(\mathcal{L}_\mathcal{F}[\Omega,\Omega]\) is positive definite). Then the solution of (6.1) satisfies, for all \(t \geq 0\): \[\|\omega(t) - \omega^*\|^2 \leq e^{-2\alpha\lambda_{\min}(\mathcal{L}_\mathcal{F}[\Omega,\Omega])\,t}\,\|\omega(0) - \omega^*\|^2.\]
Proof. Let \(\tilde\omega = \omega - \omega^*\). Then \(\frac{d\tilde\omega}{dt} = -\alpha\mathcal{L}_\mathcal{F}[\Omega,\Omega]\,\tilde\omega\), so \(\tilde\omega(t) = e^{-\alpha\mathcal{L}_\mathcal{F}[\Omega,\Omega]\,t}\tilde\omega(0)\). Since \(\mathcal{L}_\mathcal{F}[\Omega,\Omega]\) is symmetric positive definite, the matrix exponential \(e^{-\alpha\mathcal{L}_\mathcal{F}[\Omega,\Omega]\,t}\) has spectral norm \(e^{-\alpha\lambda_{\min}\,t}\) (the smallest eigenvalue decays slowest); squaring gives the stated bound. \(\square\)
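The bound of Theorem 8.1 can be checked numerically. Below, a randomly generated symmetric positive definite matrix stands in for \(\mathcal{L}_\mathcal{F}[\Omega,\Omega]\) (an assumption for illustration, not a sheaf from the text); the exact solution \(\tilde\omega(t) = e^{-\alpha L t}\tilde\omega(0)\) is computed via the eigendecomposition of \(L\) and compared against the exponential envelope.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
L = A.T @ A + 0.1 * np.eye(4)   # stand-in symmetric PD restricted Laplacian
alpha = 1.0

# Exact solution of d(tilde_w)/dt = -alpha L tilde_w via the eigendecomposition of L.
evals, evecs = np.linalg.eigh(L)
lam_min = evals.min()

def propagate(w0, t):
    """Apply the matrix exponential exp(-alpha L t) to w0."""
    return evecs @ (np.exp(-alpha * evals * t) * (evecs.T @ w0))

w0 = rng.standard_normal(4)
for t in (0.1, 1.0, 5.0):
    wt = propagate(w0, t)
    bound = np.exp(-2.0 * alpha * lam_min * t) * (w0 @ w0)
    assert wt @ wt <= bound + 1e-12   # squared norm stays under the envelope
```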
The local update rule
The local action of \(\mathcal{L}_\mathcal{F}\) at vertex \(v\) gives the update rule explicitly: \[\frac{d x_v}{dt} = -\alpha \sum_{e:\, v \leq e} \mathcal{F}_{v \leq e}^T\bigl(\mathcal{F}_{v \leq e}\, x_v - \mathcal{F}_{w \leq e}\, x_w\bigr). \tag{6.2}\]
Each edge \(e = \{v,w\}\) contributes a term \(\mathcal{F}_{v \leq e}^T(\cdot)\) — the adjoint of the restriction map — times the discrepancy at that edge. The update at \(v\) depends only on the data at \(v\) and its immediate neighbors \(w\): the dynamics are local.
This locality is a key feature of the framework. In the training algorithm of Ch. 12, each weight update also depends only on adjacent stalks (equation (5.24) of [1]), with no backward pass through the entire network.
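The local rule (6.2) can be sketched as a loop over edges, with no global data structure beyond the edge list. The function and data layout below are illustrative assumptions (a dict of stalk vectors and tuples of restriction maps), not the book's lab API. Applied to three agents on a path with identity restriction maps, the middle agent sitting at the average of its neighbors has zero drift.

```python
import numpy as np

def heat_drift(x, edges, alpha=1.0):
    """Evaluate the local drift (6.2).
    x: dict vertex -> stalk vector.
    edges: list of (v, w, F_v, F_w), the two restriction maps into the edge stalk."""
    drift = {v: np.zeros_like(xv) for v, xv in x.items()}
    for v, w, F_v, F_w in edges:
        disc = F_v @ x[v] - F_w @ x[w]    # discrepancy over this edge
        drift[v] -= alpha * F_v.T @ disc  # v is pulled toward agreement...
        drift[w] += alpha * F_w.T @ disc  # ...and w symmetrically, with opposite sign
    return drift

# Three agents on a path, scalar stalks, identity restriction maps.
# The middle agent sits at the average of its neighbors, so its drift vanishes.
x = {0: np.array([0.0]), 1: np.array([0.5]), 2: np.array([1.0])}
I1 = np.eye(1)
edges = [(0, 1, I1, I1), (1, 2, I1, I1)]
d = heat_drift(x, edges)
assert np.allclose(d[1], 0.0)
```

Note that each vertex's drift accumulates only over its incident edges, mirroring the locality claim above.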
Discrete-time Euler stepping
For simulation (Ch. 13, and the experiments in [1]), the continuous dynamics are approximated by forward Euler: \[\omega^{(k+1)} = \omega^{(k)} - \alpha\,\mathrm{dt}\bigl(\mathcal{L}_\mathcal{F}[\Omega,\Omega]\,\omega^{(k)} + \mathcal{L}_\mathcal{F}[\Omega,U]\,u\bigr). \tag{6.3}\]
Convergence of the discrete-time system requires the step size \(\alpha\,\mathrm{dt} < 2/\lambda_{\max}(\mathcal{L}_\mathcal{F}[\Omega,\Omega])\) (standard condition for Euler descent on quadratics). The experiments in [1] use \(\alpha = 1.0\) and \(\mathrm{dt} \in \{0.005, 0.01\}\).
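A minimal sketch of the Euler iteration (6.3), again with a randomly generated stand-in for the restricted Laplacian (an assumption for illustration). The step size is set strictly inside the stability bound, and the iterate is checked against the directly solved harmonic extension.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
L_OO = A.T @ A + 0.1 * np.eye(4)   # stand-in symmetric PD restricted Laplacian
L_OU = rng.standard_normal((4, 2))
u = rng.standard_normal(2)

# Step size strictly inside the stability bound alpha*dt < 2 / lambda_max.
alpha = 1.0
dt = 0.9 * 2.0 / np.linalg.eigvalsh(L_OO).max()

w = rng.standard_normal(4)
w_star = np.linalg.solve(L_OO, -L_OU @ u)   # harmonic extension (Theorem 5.4)

for _ in range(5000):                       # forward Euler iteration (6.3)
    w = w - alpha * dt * (L_OO @ w + L_OU @ u)

assert np.linalg.norm(w - w_star) < 1e-6
```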
Opinion dynamics interpretation
In the Hansen–Ghrist framework [2], equation (6.2) models an agent updating its opinion in response to disagreement with neighbors. The term \(\mathcal{F}_{v \leq e}^T(\mathcal{F}_{v \leq e} x_v - \mathcal{F}_{w \leq e} x_w)\) measures how much agent \(v\)’s expressed opinion (as seen through edge \(e\)) differs from agent \(w\)’s expressed opinion, transformed back into \(v\)’s private opinion space via the adjoint \(\mathcal{F}_{v \leq e}^T\).
In the neural sheaf, the “opinion” at vertex \(v_{z^{(\ell)}}\) is the pre-activation vector \(z^{(\ell)}\), and the “disagreement” is the discrepancy between the predicted pre-activation \(\widetilde{W}^{(\ell)}\tilde{a}^{(\ell-1)}\) and the actual pre-activation. The heat equation forces agreement at every edge — which is exactly what the forward pass achieves.
Connections to sheaf neural networks
Sheaf neural networks ([3], [4]) use a discretized version of (6.3) as a message-passing layer in a graph neural network. In that setting, the sheaf structure (restriction maps) is learned as part of the architecture, and one or a few steps of (6.3) are applied to graph-structured data to propagate information. This is a different use of the same dynamics: sheaf diffusion as a learned transformation, rather than as an inference algorithm on a fixed network. See Ch. 14 for further discussion.
8.5 Worked example: the path graph revisited
We run the sheaf heat equation on the path sheaf of Example 4.5 (three vertices \(v_0, v_1, v_2\); \(v_0\) pinned to \(u \in \mathbb{R}^2\); restriction map \(W \in \mathbb{R}^{2\times2}\) on the weight edge; restriction map \(R \in \mathbb{R}^{2\times2}\) diagonal on the activation edge; \(v_2\) free).
Setup. Stalk dimensions: \(\mathcal{F}(v_0) = \mathcal{F}(v_1) = \mathbb{R}^2\), \(\mathcal{F}(v_2) = \mathbb{R}^2\). Restriction maps: weight edge \(e_1\): \(\mathcal{F}_{v_0 \leq e_1} = W\), \(\mathcal{F}_{v_1 \leq e_1} = I\); activation edge \(e_2\): \(\mathcal{F}_{v_1 \leq e_2} = R\), \(\mathcal{F}_{v_2 \leq e_2} = P\) (projection onto first component).
Restricted Laplacian. With \(v_0\) pinned, the free coordinates are \(\omega = (z^{(1)}, a^{(1)}) \in \mathbb{R}^4\). The restricted Laplacian (cf. equation (33) of [1]) is: \[\mathcal{L}_\mathcal{F}[\Omega,\Omega] = \begin{pmatrix} I + R^T R & -R \\ -R & I + W_2^T W_2 \end{pmatrix},\] where \(W_2\) is the weight matrix from \(v_1\) to \(v_2\) (the next layer). This is a \(4\times4\) positive definite matrix; its smallest eigenvalue governs the convergence rate.
Heat equation dynamics. Starting from \(\omega(0) = (z^{(1)}_0, a^{(1)}_0)\) random, the dynamics drive:
- \(z^{(1)}\) toward \(Wu\) (agreement with the weight edge) and toward \(Ra^{(1)}\) (agreement with the activation edge),
- \(a^{(1)}\) toward \(Rz^{(1)}\) (agreement with the activation edge) and toward the target value from the next layer.
At equilibrium, \(z^{(1)*} = Wu\) and \(a^{(1)*} = Rz^{(1)*} = RWu\), which is exactly the forward pass of a single-layer network with input \(u\), weight \(W\), and activation \(R\).
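A numerical check of this equilibrium. To keep the example self-contained, the terminal edge's outer restriction map is taken to be the identity here (rather than the projection \(P\) and next-layer weight \(W_2\) above) — a simplifying assumption, under which the harmonic extension can be verified directly against the forward pass.

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((2, 2))
R = np.diag(rng.uniform(0.5, 1.0, 2))   # diagonal activation map, as in the example
u = rng.standard_normal(2)

# Free coordinates omega = (z, a) for the two-edge path with v0 pinned at u;
# terminal edge map taken as the identity for this self-contained check.
I = np.eye(2)
L_OO = np.block([[I + R.T @ R, -R.T], [-R, I]])
L_OU = np.vstack([-W, np.zeros((2, 2))])   # only z couples to the pinned input

omega_star = np.linalg.solve(L_OO, -L_OU @ u)
z_star, a_star = omega_star[:2], omega_star[2:]

# Equilibrium reproduces the forward pass: z* = W u, a* = R W u.
assert np.allclose(z_star, W @ u)
assert np.allclose(a_star, R @ W @ u)
```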
8.6 Exercises
6.1 (Warm-up — convergence rate). For the path sheaf with a single weight edge (weight matrix \(W \in \mathbb{R}^{n\times n}\), \(v_0\) pinned, \(v_1\) free), compute \(\mathcal{L}_\mathcal{F}[\Omega,\Omega] = W^T W + I\) and find its eigenvalues. Show that the convergence rate is \(\alpha(1 + \sigma_{\min}(W)^2)\) where \(\sigma_{\min}\) is the smallest singular value of \(W\).
6.2 (Discrete stability). Show that the forward Euler scheme (6.3) converges if and only if \(\alpha\,\mathrm{dt} < 2/\lambda_{\max}(\mathcal{L}_\mathcal{F}[\Omega,\Omega])\). (Hint: all eigenvalues of \(I - \alpha\,\mathrm{dt}\,\mathcal{L}\) must lie in \((-1,1)\).)
6.3 (Discrete vs. continuous rate). Show that for the optimal step size \(\alpha\,\mathrm{dt} = 1/\lambda_{\max}\), the discrete-time convergence rate per iteration is \(1 - \lambda_{\min}/\lambda_{\max} = 1 - 1/\kappa\) where \(\kappa = \lambda_{\max}/\lambda_{\min}\) is the condition number. Interpret: a poorly conditioned Laplacian (high \(\kappa\)) converges slowly.
6.4 (Local update rule). Write out the component-wise update rule (6.2) for the three-agent discourse sheaf from Ch. 4. Verify that the update for the middle agent depends only on the two outer agents’ opinions, not on any global quantity.
6.5 (Stubbornness / reluctance). In the opinion dynamics framework, a “stubborn” agent \(v\) maintains its own opinion with coupling strength \(\gamma > 0\) by adding a penalty term \(\tfrac{\gamma}{2}\|x_v - p_v\|^2\) to the Dirichlet energy (where \(p_v\) is the agent’s anchor value). Show that this augments the Laplacian by \(\gamma I\) at the diagonal block for \(v\). (This is the reluctance mechanism of [2], which becomes pinned neurons in Ch. 11.)
6.6 (Project — convergence validation). Reproduce Figure 6 from [1] for a [2, 4, 1] network with identity output: initialize all free stalks randomly, run the discrete sheaf heat equation (6.3), and plot the total discord \(\|\delta\omega^{(k)}\|^2\) on a log scale. Verify exponential decay and identify the empirical rate. Compare to the theoretical rate \(\alpha\lambda_{\min}(\mathcal{L}_\mathcal{F}[\Omega,\Omega])\). Use lab-06-sheaf-diffusion.
8.7 Coding lab
lab-06-sheaf-diffusion — Implement the full sheaf heat equation for a fixed sheaf: compute the restricted Laplacian \(\mathcal{L}_\mathcal{F}[\Omega,\Omega]\), run the forward Euler iteration (6.3), and track total discord and per-edge discord over time. Apply this to three cases: (1) the constant sheaf on a path (should reproduce Ch. 2 linear interpolation), (2) the three-agent discourse sheaf (watch the middle agent converge to the harmonic extension), (3) a fixed-weight [2, 4, 1] neural sheaf (reproduce Figure 6a of [1]). In case (3), also plot the output stalk trajectory and verify it converges to the forward pass value. This is the smooth baseline for the switching dynamics of Chs. 9–10.
8.8 Further reading
The convergence of sheaf diffusion to the harmonic extension is proved in [2] (their Theorem 5.1). The spectral framework is developed in [5]. For the sheaf neural network perspective — using one or a few steps of sheaf diffusion as a graph neural network layer — see [3] and [4]. The joint opinion-expression dynamics that underlie the training algorithm of Ch. 12 are introduced in [2] §10 and extended in [6]. For a parallel development using predictive coding as sheaf diffusion in linear recurrent networks, see [7].
8.9 FAQ / common misconceptions
Q: If sheaf diffusion is just gradient descent on a quadratic, isn’t it trivial? It is simple — but “simple” is not the same as “trivial.” The key insight is that the quadratic whose minimum is the forward pass output has a positive definite Hessian for every activation pattern, and this positive definiteness follows from a structural property (the unitriangular coboundary, Lemma 3.2 of [1]). The same argument does not work for architectures with skip connections or shared parameters. The simplicity of the fixed-sheaf case is what makes the switching-ReLU extension tractable.
Q: Why run the heat equation to compute a forward pass? The forward pass is instant. The motivation is not efficiency. It is that sheaf diffusion propagates information bidirectionally — from later layers back to earlier ones, and vice versa. This is what enables pinned neurons (Ch. 11), backpropagation-free training (Ch. 12), and post-hoc diagnostics on any trained network (Ch. 13). The forward pass, as a limit of sheaf diffusion, is the warm-up that explains why the sheaf structure is the right one to build.
Q: What is the relationship between sheaf diffusion here and the diffusion in Bodnar et al.? In [3], sheaf diffusion is used as a message-passing operator for graph data (like molecular graphs or social networks). The “sheaf” there is learned as part of the GNN architecture, and diffusion runs for a few discrete steps. Here, sheaf diffusion is an inference algorithm on a fixed neural network: the network’s weights define the sheaf, and diffusion runs to convergence. The mathematical machinery (sheaf Laplacian, harmonic extensions) is the same; the purpose is completely different.