11  The Neural Sheaf Heat Equation: Identity Output

Purpose. Proves exponential convergence of the sheaf heat equation on the neural sheaf with identity output, despite ReLU switching.

11.1 Key concepts & results

  • The sheaf heat equation ẋ = −L_F(σ(t)) x with state-dependent activation pattern σ(t).
  • Common Lyapunov function: Dirichlet energy E(x) = ½‖δ(σ)x‖² is decreasing across all activation patterns.
  • Theorem 4.1: exponential convergence to the forward-pass value; rate bounded below by the worst-case spectral gap.
  • Filippov framework + uniform positive-definiteness gives convergence despite switching.

Prerequisites: Ch 3, Ch 6, Ch 8

11.2 Motivating example

Take the [2, 4, 1] sheaf with input pinned to \(x_{\text{in}} = (1, -1)\) but initialize the interior cochain \(x(0)\) at something arbitrary — say i.i.d. Gaussian noise far from the forward-pass value. Run the sheaf heat equation \(\dot{x} = -L_\mathcal{F}(\sigma(t)) \, x\) numerically. Plot \(\|x(t) - \texttt{forward}(x_{\text{in}})\|\) on a semi-log axis. The curve decays roughly as a straight line, with small visible “kinks” at each instant where one of the hidden pre-activations \(z^{(1)}_j\) crosses zero and the sheaf switches to a new activation pattern \(\sigma\). Despite those switches, the envelope stays exponential — by the end, \(x(t)\) has converged to the unique harmonic extension, which Ch. 8 identified as the forward pass itself. This is Theorem 4.1 of [1] in action: the identity-output heat equation converges exponentially, at a rate bounded below by the worst-case spectral gap over activation patterns.

11.3 Intuition

A switched linear system, one whose state matrix jumps between finitely many choices, can behave badly. Even when every branch is individually Hurwitz (exponentially stable), switching between them too fast or at the wrong times can destabilize the trajectory; examples of this are standard in the control-theory literature. So on the face of it, running the ReLU heat equation should be frightening: the Laplacian \(L_\mathcal{F}(\sigma)\) changes every time an interior neuron crosses its activation boundary, and the activation boundaries can be crossed arbitrarily often during a single trajectory.

The miracle — the central pleasant surprise of [1] §4 — is that all branches of this switched system share a common Lyapunov function, and it is the object we cared about anyway: Dirichlet energy \(E(x) = \tfrac{1}{2} \|\delta(\sigma) x\|^2\). The unitriangular factorization of Ch. 8 implies that for every activation pattern, \(L_\mathcal{F}(\sigma)\) restricted to the free vertices is positive definite, with a spectrum uniformly bounded away from zero. So along any trajectory — whatever sequence of activation patterns is visited — \(E\) decreases at a rate at least \(2 \lambda_{\min}^{\text{free}}\), where \(\lambda_{\min}^{\text{free}}\) is the minimum over all patterns of the smallest eigenvalue of the free Laplacian. Exponential decay of \(E\) forces exponential convergence of \(x\) to the unique equilibrium (the harmonic extension = forward pass).

The Filippov framework (Ch. 3) is what lets us even talk about the dynamics on measure-zero boundary sets where \(\sigma\) is ambiguous: it promotes the ODE to a differential inclusion, and the Clarke subdifferential of \(E\) takes over the role of \(\nabla E\) at the switching surfaces. The argument is two ingredients — Filippov existence + common Lyapunov function — glued by the unitriangular identity. Everything else in this book that says “despite ReLU, things converge” will be a decorated version of this argument (Ch. 10 adds a nonlinear output; Ch. 12 couples a slow weight flow).

Intuition device (planned): Animation overlaying (i) trajectory in cochain space, (ii) time-axis with shaded activation-pattern epochs, (iii) semi-log energy decay.

11.4 Formal development

Throughout, \(\mathcal{F}\) is the neural sheaf of a ReLU network with identity output (\(\varphi = I\)), the input vertex is pinned to \(x\), and \(L_{\text{free}}(\sigma) = \delta_\Omega(\sigma)^T \delta_\Omega(\sigma)\) is the positive-definite free Laplacian from Def. 8.3. Let \(c^\star\) denote the forward-pass cochain from Prop. 8.6. Write \(y = c|_\Omega \in \mathbb{R}^{\dim \Omega}\) for the free part of a 0-cochain.

The state-dependent heat equation

Definition 9.1 (Neural-sheaf heat equation, identity output). The neural-sheaf heat equation with pinned input \(x\) and identity output is the gradient flow of Dirichlet energy on the free coordinates: \[\dot{y}(t) = -L_{\text{free}}(\sigma(y(t)))\, \bigl(y(t) - y^\star\bigr) + \text{(const)}, \tag{9.1}\] where \(\sigma(y) \in \{0,1\}^N\) is the activation pattern read off from \(y\)’s pre-activation coordinates, and \(y^\star = c^\star|_\Omega\) is the free part of the forward-pass cochain.

The constant absorbs the pinned boundary contribution and vanishes at equilibrium; for readability we write (9.1) as \(\dot{y} = -L_{\text{free}}(\sigma(y))(y - y^\star)\).
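As a concrete illustration, the right-hand side of (9.1) can be sketched for the [2, 2, 1] sheaf used later in the chapter. The following is a minimal sketch under stated assumptions, not the construction of [1]: the free coordinates are ordered \((z_1, z_2, a_1, a_2, w, o)\) with \(w\) the second-layer pre-activation and \(o\) the output, biases are zero, \(W^{(2)} = (1, 1)\), and the pinned input enters only through the constant vector `b` with \(W^{(1)} x = (1, 1)\). This is our reading of the "(const)" term; the names `delta_free`, `activation_pattern` and `heat_rhs` are hypothetical.

```python
import numpy as np

W2 = np.array([1.0, 1.0])                           # second-layer weights (worked example)
b = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])        # assumed pinned term: W1 x = (1, 1)
y_star = np.array([1.0, 1.0, 1.0, 1.0, 2.0, 2.0])   # forward-pass cochain on the free vertices


def activation_pattern(y):
    """Read sigma off the hidden pre-activation coordinates (z1, z2)."""
    return (y[:2] > 0).astype(float)


def delta_free(sigma):
    """Unit lower-triangular free coboundary for coordinates (z1, z2, a1, a2, w, o)."""
    D = np.eye(6)
    D[2, 0] = -sigma[0]   # ReLU edge, neuron 1: residual a1 - R_11 z1
    D[3, 1] = -sigma[1]   # ReLU edge, neuron 2: residual a2 - R_22 z2
    D[4, 2:4] = -W2       # affine edge: residual w - W2 a
    D[5, 4] = -1.0        # identity output edge: residual o - w
    return D


def heat_rhs(y):
    """Negative gradient of the Dirichlet energy with the pattern frozen at sigma(y)."""
    D = delta_free(activation_pattern(y))
    return -D.T @ (D @ y - b)


# A point in the forward-pass region sigma = (1, 1): the field equals -L (y - y_star).
y = np.array([0.5, 2.0, 0.3, -1.0, 0.2, 0.9])
D = delta_free(activation_pattern(y))
print(np.allclose(heat_rhs(y), -(D.T @ D) @ (y - y_star)))
print(heat_rhs(y_star))
```

In the region whose pattern matches the forward pass, the constant cancels and the field reduces to the readable form \(-L_{\text{free}}(\sigma)(y - y^\star)\); in other regions the constant is generally nonzero in this sketch.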

The right-hand side of (9.1) is discontinuous at any \(y\) whose pre-activation has a zero coordinate — the same measure-zero set on which the Clarke subdifferential of ReLU fails to be single-valued (Ch. 3). The correct notion of solution is therefore a Filippov solution.

Filippov interpretation

For a measurable discontinuous field \(F : \mathbb{R}^n \to \mathbb{R}^n\), the Filippov regularization is the set-valued map \[\mathcal{F}[F](y) := \bigcap_{\varepsilon > 0} \bigcap_{\mathcal{N} : |\mathcal{N}| = 0} \overline{\mathrm{conv}}\, F(B_\varepsilon(y) \setminus \mathcal{N}),\] and a Filippov solution is an absolutely continuous \(y(\cdot)\) satisfying \(\dot{y}(t) \in \mathcal{F}[F](y(t))\) for a.e. \(t\).

Definition 9.2 (Filippov solutions of the sheaf heat equation). A Filippov solution of (9.1) is an absolutely continuous curve \(y : [0, \infty) \to \mathbb{R}^{\dim \Omega}\) satisfying \[\dot{y}(t) \in -\mathcal{F}\!\left[L_{\text{free}}(\sigma(\cdot))(\cdot - y^\star)\right](y(t)) \quad \text{for a.e. } t.\]

For \(y\) with every pre-activation coordinate nonzero, the Filippov inclusion reduces to the classical ODE (9.1). At switching points, the right-hand side is the convex hull of the neighboring branches, which is exactly the negative of the Clarke subdifferential of Dirichlet energy ([2]).
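To make the convex-hull description concrete, consider a point \(\bar y\) at which exactly one pre-activation vanishes, so that only two activation patterns \(\sigma_0, \sigma_1\) occur nearby. The display below is a sketch under that assumption, for a field \(F\) that is continuous on each side of the surface; the symbols \(F_0\), \(F_1\) and \(\alpha\) are notation introduced here for illustration.

```latex
\[
F_r(\bar y) := \lim_{\substack{y \to \bar y \\ \sigma(y) = \sigma_r}} F(y), \quad r \in \{0, 1\},
\qquad
\mathcal{F}[F](\bar y) = \bigl\{ (1-\alpha)\, F_0(\bar y) + \alpha\, F_1(\bar y) \;:\; \alpha \in [0, 1] \bigr\}.
\]
```

When no pre-activation vanishes at \(\bar y\), only one pattern occurs nearby and the set is the singleton \(\{F(\bar y)\}\).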

Common Lyapunov function

Theorem 9.3 (Dirichlet energy is a common Lyapunov function; [1], Lem. 4.3). The Dirichlet energy \(V(y) = \tfrac{1}{2} (y - y^\star)^T L_{\text{free}}(\sigma(y)) (y - y^\star)\) is continuous, positive definite (as a function of \(y - y^\star\)), and satisfies along every Filippov solution of (9.1): \[\dot{V}(y(t)) \leq -2 \lambda_{\min}^{\text{free}}\, V(y(t)) \quad \text{for a.e. } t,\] where \(\lambda_{\min}^{\text{free}} > 0\) is the uniform lower bound from Corollary 8.4.

The content of the theorem is that even though the quadratic form defining \(V\) depends on \(\sigma\) — so \(V\) appears to change when the pattern switches — the value \(V(y)\) is continuous across switching surfaces, because activation-pattern changes only zero out or activate blocks of \(\delta_\Omega\) that contribute zero discord at the switching configuration itself. This is what makes \(V\) a common Lyapunov function across all branches of the switched system.
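The continuity claim can be checked numerically on the [2, 2, 1] sheaf. The sketch below is illustrative only and rests on assumptions not fixed by the text: coordinates \((z_1, z_2, a_1, a_2, w, o)\), zero biases, \(W^{(2)} = (1, 1)\), and the energy written as \(\tfrac{1}{2}\|\delta_\Omega(\sigma(y))\, y - b\|^2\) with the pinned input folded into a constant vector `b`. The test point `base` is arbitrary and the function names are hypothetical.

```python
import numpy as np

W2 = np.array([1.0, 1.0])                      # second-layer weights (worked example)
b = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])   # assumed pinned term: W1 x = (1, 1)


def delta_free(sigma):
    """Unit lower-triangular free coboundary for coordinates (z1, z2, a1, a2, w, o)."""
    D = np.eye(6)
    D[2, 0] = -sigma[0]   # ReLU edge, neuron 1
    D[3, 1] = -sigma[1]   # ReLU edge, neuron 2
    D[4, 2:4] = -W2       # affine edge a -> w
    D[5, 4] = -1.0        # identity output edge w -> o
    return D


def dirichlet_energy(y):
    """Energy evaluated with the pattern read off from y itself."""
    sigma = (y[:2] > 0).astype(float)
    r = delta_free(sigma) @ y - b
    return 0.5 * r @ r


# Arbitrary point with z1 = 0, nudged to either side of the switching surface.
base = np.array([0.0, 3.0, 0.7, -0.2, 0.4, 1.1])
eps = 1e-8
left, right = base.copy(), base.copy()
left[0] -= eps    # pattern (0, 1)
right[0] += eps   # pattern (1, 1)

print(dirichlet_energy(left), dirichlet_energy(right))
```

The two coboundary matrices differ in one entry, yet the two printed energies agree up to terms of order `eps`, which is the sense in which the value of \(V\) is common to both branches.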

Theorem 4.1: exponential convergence

Theorem 9.4 (Convergence of the sheaf heat equation, identity output; Theorem 4.1 of [1]). Let \(y(\cdot)\) be a Filippov solution of (9.1) with initial condition \(y(0)\). Then \(y(t)\) converges exponentially to the forward-pass value \(y^\star\): \[\|y(t) - y^\star\| \leq \kappa\, e^{-\lambda_{\min}^{\text{free}}\, t}\, \|y(0) - y^\star\| \quad \text{for all } t \geq 0,\] where \(\kappa\) is a condition-number constant depending only on \(\theta\), and \(\lambda_{\min}^{\text{free}}\) is the uniform spectral-gap bound from Corollary 8.4.

The theorem is the direct specialization of the pinned-heat-equation convergence theorem (Thm. 2.5) to the state-dependent sheaf setting, made possible by Theorem 9.3.

Remark 9.5 (Rate depends on the worst pattern). The rate \(\lambda_{\min}^{\text{free}} = \inf_\sigma \lambda_{\min}(L_{\text{free}}(\sigma))\) is the worst-case spectral gap over activation patterns. For most initializations, \(y(t)\) spends most of its time within a single activation region, where the local rate \(\lambda_{\min}(L_{\text{free}}(\sigma(y(t))))\) may be considerably larger. The bound in Theorem 9.4 is therefore typically loose in practice; Ch. 13 discusses how to estimate the realized rate from per-edge discords.

Remark 9.6 (Role of unitriangularity). Theorem 9.3 rests on Corollary 8.4, which rests on Lemma 8.2. The unit-determinant identity \(\det \delta_\Omega(\sigma) = 1\) is therefore the ultimate source of convergence. Ch. 14 flags that this identity fails for non-path-graph architectures, where even stating an analogue of Theorem 9.4 is an open problem.

11.5 Theorem demonstrations

Proof of Thm. 9.3 (common Lyapunov). We verify three claims: (a) \(V\) is continuous in \(y\); (b) \(V\) is positive definite in \(y - y^\star\); (c) \(\dot V \leq -2\lambda_{\min}^{\text{free}} V\) along Filippov solutions.

Continuity. Write \(V(y) = \tfrac{1}{2} \|\delta_\Omega(\sigma(y))(y - y^\star)\|^2\). The map \(y \mapsto \delta_\Omega(\sigma(y))\) is piecewise constant, with discontinuities only at switching surfaces \(\{y : c_{z^{(\ell)},j} = 0\}\). Across such a surface, two activation patterns \(\sigma, \sigma'\) differ in the \((\ell, j)\)-th diagonal entry of \(R_\sigma\), i.e. in exactly one row of one ReLU edge. That row of \(\delta_\Omega(\sigma)\) is \((-R_{\sigma,jj}\, e_j^T,\, e_j^T)\) applied to the pair of columns \((c_{z^{(\ell)}}, c_{a^{(\ell)}})\); it contributes \((c_{a^{(\ell)},j} - R_{\sigma,jj}\, c_{z^{(\ell)},j})^2\) to \(2 V(y)\). At the switching configuration \(c_{z^{(\ell)},j} = 0\), both values \(R_{\sigma,jj} \in \{0, 1\}\) yield the same contribution \((c_{a^{(\ell)},j})^2\). The contributions from all other edges are unaffected. Hence \(V\) agrees on both branches at every switching surface and is continuous.

Positive definiteness. Each \(L_{\text{free}}(\sigma)\) is positive definite (Cor. 8.4), so \(V(y) \geq \tfrac{1}{2} \lambda_{\min}^{\text{free}} \|y - y^\star\|^2 \geq 0\), with equality iff \(y = y^\star\).

Decrease rate. Fix a Filippov solution \(y(\cdot)\). At any \(t\) where \(\sigma(y(t))\) is locally constant, (9.1) is a classical linear ODE \(\dot y = -L_{\text{free}}(\sigma)(y - y^\star)\), and \(\dot V = -(y - y^\star)^T L_{\text{free}}(\sigma)^2 (y - y^\star)\). Using \(L_{\text{free}}(\sigma)^2 \succeq \lambda_{\min}^{\text{free}} L_{\text{free}}(\sigma)\), \[\dot V \leq -\lambda_{\min}^{\text{free}} (y - y^\star)^T L_{\text{free}}(\sigma)(y - y^\star) = -2 \lambda_{\min}^{\text{free}} V.\] At switching points, continuity of \(V\) plus the chain rule for Clarke’s generalized gradient ([2] Prop. 2.2) gives the same inequality as a one-sided derivative. The measure-zero switching set does not affect the a.e. statement. \(\square\)
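The matrix inequality used in the decrease-rate step can be spelled out with an eigendecomposition. The display below is a worked version of that step for a fixed pattern, writing \(L = L_{\text{free}}(\sigma)\) and \(\tilde y = y - y^\star\); the eigenpairs \((\lambda_i, q_i)\) are notation introduced here.

```latex
\begin{aligned}
L &= \sum_i \lambda_i\, q_i q_i^T,
  \qquad \lambda_i \ge \lambda_{\min}(L) \ge \lambda_{\min}^{\text{free}} > 0, \\
\tilde y^T L^2\, \tilde y
  &= \sum_i \lambda_i^2\, (q_i^T \tilde y)^2
  \;\ge\; \lambda_{\min}^{\text{free}} \sum_i \lambda_i\, (q_i^T \tilde y)^2
  \;=\; \lambda_{\min}^{\text{free}}\, \tilde y^T L\, \tilde y
  \;=\; 2\, \lambda_{\min}^{\text{free}}\, V, \\
\dot V
  &= \tilde y^T L\, \dot{\tilde y}
  \;=\; -\,\tilde y^T L^2\, \tilde y
  \;\le\; -2\, \lambda_{\min}^{\text{free}}\, V .
\end{aligned}
```

Integrating the last inequality (Grönwall) gives \(V(y(t)) \le e^{-2\lambda_{\min}^{\text{free}} t}\, V(y(0))\), which is the starting point of the proof of Thm. 9.4.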

Proof of Thm. 9.4 (exponential convergence). From Thm. 9.3, \(V(y(t)) \leq e^{-2\lambda_{\min}^{\text{free}} t}\, V(y(0))\). Using \(V(y) \geq \tfrac{1}{2} \lambda_{\min}^{\text{free}} \|y - y^\star\|^2\) on the lower side and \(V(y) \leq \tfrac{1}{2} \lambda_{\max}^{\text{free}} \|y - y^\star\|^2\) on the upper side, where \(\lambda_{\max}^{\text{free}} := \sup_\sigma \lambda_{\max}(L_{\text{free}}(\sigma))\), \[\|y(t) - y^\star\|^2 \leq \frac{2}{\lambda_{\min}^{\text{free}}} V(y(t)) \leq \frac{\lambda_{\max}^{\text{free}}}{\lambda_{\min}^{\text{free}}}\, e^{-2 \lambda_{\min}^{\text{free}} t}\, \|y(0) - y^\star\|^2.\] Taking square roots gives the claim with \(\kappa = \sqrt{\lambda_{\max}^{\text{free}} / \lambda_{\min}^{\text{free}}}\), a condition number depending only on the weights \(\theta\). \(\square\)

11.6 Worked example: heat-equation convergence on the [2, 2, 1] sheaf

Continue the [2, 2, 1] sheaf from the worked examples of Chs. 7–8. Pin \(x = (1, 2)^T\); the forward-pass cochain is \(y^\star = ((1,1),\ (1,1),\ 2,\ 2)\).

Free Laplacian at \(\sigma = (1,1)\). Using the \(\delta_\Omega\) from Ch. 8, \[L_{\text{free}}(\sigma) = \delta_\Omega(\sigma)^T \delta_\Omega(\sigma) = \begin{pmatrix} 2 I_2 & -I_2 & 0 & 0 \\ -I_2 & I_2 + (W^{(2)})^T W^{(2)} & -(W^{(2)})^T & 0 \\ 0 & -W^{(2)} & 2 & -1 \\ 0 & 0 & -1 & 1 \end{pmatrix}.\] Plugging in \(W^{(2)} = (1, 1)\) gives an explicit \(6 \times 6\) symmetric matrix whose eigenvalues are (numerically) approximately \(\{0.080,\ 0.382,\ 1.249,\ 2.291,\ 2.618,\ 4.380\}\). As a check, they sum to the trace, \(11\), and multiply to \(\det L_{\text{free}}(\sigma) = (\det \delta_\Omega(\sigma))^2 = 1\). All are strictly positive, confirming Corollary 8.4. The spectral gap of this branch is \(\lambda_{\min}(L_{\text{free}}(\sigma)) \approx 0.080\) for \(\sigma = (1,1)\).

Heat equation near the fixed point. Let \(\tilde{y}(t) = y(t) - y^\star\). Within the activation region, (9.1) becomes \[\dot{\tilde{y}} = -L_{\text{free}}(\sigma)\, \tilde{y},\] a linear ODE with solution \(\tilde{y}(t) = e^{-L_{\text{free}}(\sigma) t}\, \tilde{y}(0)\). Dirichlet energy satisfies \[E(\tilde{y}(t)) = \tfrac{1}{2} \tilde{y}(t)^T L_{\text{free}}(\sigma)\, \tilde{y}(t) \leq e^{-2 \lambda_{\min}(L_{\text{free}}(\sigma))\, t}\, E(\tilde{y}(0)),\] which is the estimate of Theorem 9.3 on this branch, with the branch gap \(\lambda_{\min}(L_{\text{free}}(\sigma)) \approx 0.080\) in place of the worst-case \(\lambda_{\min}^{\text{free}}\). Taking square roots as in Theorem 9.4, \(\|\tilde{y}(t)\|\) decays at rate \(\approx 0.080\).
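The spectrum and the single-region energy bound can be reproduced numerically. The sketch below assembles the block matrix displayed above with \(W^{(2)} = (1, 1)\), prints its eigenvalues together with two sanity checks (trace and determinant), and then evaluates the energy along the linear flow through the eigendecomposition. The random initial deviation and the helper name `energy_along_flow` are illustrative choices, not part of the source.

```python
import numpy as np

W2 = np.array([[1.0, 1.0]])      # second-layer weights, as a 1 x 2 row
I2 = np.eye(2)
Z21 = np.zeros((2, 1))
Z12 = np.zeros((1, 2))

# Block matrix for sigma = (1, 1), blocks ordered (z, a, w, o).
L = np.block([
    [2 * I2, -I2,            Z21,                Z21],
    [-I2,    I2 + W2.T @ W2, -W2.T,              Z21],
    [Z12,    -W2,            np.array([[2.0]]),  np.array([[-1.0]])],
    [Z12,    Z12,            np.array([[-1.0]]), np.array([[1.0]])],
])

lam, Q = np.linalg.eigh(L)       # eigenvalues in ascending order
print("eigenvalues:", np.round(lam, 3))
print("trace:", np.trace(L), " det:", round(float(np.linalg.det(L)), 6))


def energy_along_flow(y0_tilde, t):
    """E(t) = 1/2 * y~(t)^T L y~(t) for y~(t) = exp(-L t) y~(0), in the eigenbasis."""
    coeffs = Q.T @ y0_tilde
    return 0.5 * np.sum(lam * np.exp(-2.0 * lam * t) * coeffs**2)


rng = np.random.default_rng(0)
y0_tilde = rng.standard_normal(6)            # arbitrary initial deviation from y_star
ts = np.linspace(0.0, 30.0, 301)
E = np.array([energy_along_flow(y0_tilde, t) for t in ts])
bound = np.exp(-2.0 * lam[0] * ts) * E[0]    # envelope with the branch spectral gap
print("bound holds:", bool(np.all(E <= bound * (1 + 1e-12))))
```

Working in the eigenbasis avoids a matrix exponential and makes the bound transparent: every mode decays at least as fast as the slowest one.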

A switching trajectory. Initialize \(y(0)\) with \(c_{z^{(1)}}(0) = (-0.5,\ 3)\), so the activation pattern at \(t = 0\) is \(\sigma(0) = (0, 1)\), different from the forward-pass pattern \((1, 1)\). The free Laplacian at \(\sigma = (0, 1)\) has one active and one inactive ReLU, producing a different \(6 \times 6\) matrix: the inactive pre-activation decouples with eigenvalue \(1\), and the remaining \(5 \times 5\) block has spectral gap \(\approx 0.099\). Integrating (9.1) from \(y(0)\):

  • In the time window \([0, t_1]\), the first pre-activation \(c_{z^{(1)},1}\) evolves on the \(\sigma = (0, 1)\) branch. Its ReLU edge is inactive there, so the coordinate decouples and relaxes toward its forward-pass value \(1\) at unit rate: \(c_{z^{(1)},1}(t) = 1 - 1.5\, e^{-t}\), provided the second pre-activation stays positive on this window. It rises through zero at \(t_1 = \ln(3/2) \approx 0.41\).
  • At \(t = t_1\), \(c_{z^{(1)},1}\) crosses zero and the pattern switches to \(\sigma = (1, 1)\). The sheaf’s \(R_{z^{(1)}}\) block flips its first diagonal entry from \(0\) to \(1\).
  • For \(t > t_1\), the dynamics proceeds under \(L_{\text{free}}(1,1)\), and \(y(t) \to y^\star\) exponentially at the rate set by the spectral gap of that branch.

Despite the switch, Dirichlet energy \(V(y(t)) = \tfrac{1}{2}(y(t) - y^\star)^T L_{\text{free}}(\sigma(t))(y(t) - y^\star)\) is continuous across \(t = t_1\) (Theorem 9.3): the two branch values of \(V\) agree on \(\{c_{z^{(1)},1} = 0\}\) because the contribution of the first diagonal entry of \(R_{z^{(1)}}\) to the discord vanishes on that switching surface. A semi-log plot of \(V(y(t))\) shows one visible kink at \(t_1\) but otherwise clean exponential decay. On each segment \(V\) decays at a rate of at least twice the spectral gap of the branch in force, so the whole curve stays below the envelope \(e^{-2 \lambda_{\min}^{\text{free}} t}\, V(y(0))\) guaranteed by Theorem 9.3, where \(\lambda_{\min}^{\text{free}}\) is the smallest gap over all four patterns and hence no larger than either gap visited here.
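The switching trajectory can be reproduced with a few lines of explicit Euler. This is a minimal sketch under assumptions the text does not fix: coordinates \((z_1, z_2, a_1, a_2, w, o)\), zero biases, \(W^{(2)} = (1, 1)\), the pinned input entering through a constant vector `b` with \(W^{(1)} x = (1, 1)\), and all coordinates other than \(c_{z^{(1)}}(0)\) initialized at zero. The step size, horizon and the function names are illustrative choices.

```python
import numpy as np

W2 = np.array([1.0, 1.0])                           # second-layer weights (worked example)
b = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])        # assumed pinned term: W1 x = (1, 1)
y_star = np.array([1.0, 1.0, 1.0, 1.0, 2.0, 2.0])   # forward-pass cochain


def delta_free(sigma):
    """Unit lower-triangular free coboundary for coordinates (z1, z2, a1, a2, w, o)."""
    D = np.eye(6)
    D[2, 0] = -sigma[0]
    D[3, 1] = -sigma[1]
    D[4, 2:4] = -W2
    D[5, 4] = -1.0
    return D


def simulate(y0, dt=0.01, T=120.0):
    """Explicit Euler with the pattern frozen during each step; records switch times."""
    y = np.array(y0, dtype=float)
    sigma = (y[:2] > 0).astype(float)
    energies, switches = [], []
    for k in range(int(round(T / dt))):
        D = delta_free(sigma)
        r = D @ y - b                       # per-edge discords
        energies.append(0.5 * r @ r)        # Dirichlet energy before the step
        y = y - dt * (D.T @ r)              # Euler step of the heat equation
        new_sigma = (y[:2] > 0).astype(float)
        if not np.array_equal(new_sigma, sigma):
            switches.append((k + 1) * dt)   # first grid time on the new branch
            sigma = new_sigma
    return y, np.array(energies), switches


# c_z(0) = (-0.5, 3) from the text; remaining coordinates start at zero (assumption).
y_final, E, switches = simulate([-0.5, 3.0, 0.0, 0.0, 0.0, 0.0])
print("switch times:", switches)
print("predicted first switch, ln(3/2):", np.log(1.5))
print("final error:", np.linalg.norm(y_final - y_star))
```

Under these assumptions the first pre-activation obeys \(\dot z_1 = -(z_1 - 1)\) while its ReLU is inactive, so the first recorded switch should land within one step of \(\ln(3/2)\). Plotting `E` on a semi-log axis against `dt * np.arange(len(E))` gives the energy curve discussed above.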

Takeaways.

  • The worst-case spectral gap \(\lambda_{\min}^{\text{free}} = \inf_\sigma \lambda_{\min}(L_{\text{free}}(\sigma))\) bounds convergence; the realized rate varies within each region.
  • Common Lyapunov function ≠ common matrix: the quadratic form defining \(V\) changes with \(\sigma\), but its value is continuous across switches.
  • Kinks in the semi-log \(V(t)\) plot are the empirical fingerprint of activation switches; their number is a diagnostic for how far off the activation pattern of \(y(0)\) was.

11.7 Coding lab

Lab 09 — ReLU-Sheaf Heat Equation — Integrate the sheaf heat equation (Def. 9.1) for the [2, 4, 1] network from a random cochain initialization. Use a simple explicit Euler scheme inside each activation region and detect region crossings by sign changes in \(c_{z^{(\ell)}}\). Plot (a) the trajectory’s projection onto two hidden-layer coordinates, (b) Dirichlet energy \(V(t)\) on a semi-log axis, and (c) a time-axis banded by the active \(\sigma\)-pattern. Confirm the exponential envelope predicted by Thm. 9.4 and locate the kinks at switching times.

11.8 Exercises

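  1. (warm-up) For the [2, 2, 1] example’s \(L_{\text{free}}(\sigma = (1,1))\), compute the eigenvalues numerically with your favorite linear-algebra library and compare the smallest with the spectral gap quoted in Section 11.6. As a sanity check, confirm that the eigenvalues sum to the trace, \(11\), and that their product is \(1\).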
  2. (warm-up) Show that the fixed points of (9.1) are exactly the cochains \(c\) with \(\delta_\Omega(\sigma(c))(c - c^\star) = 0\), and explain why uniqueness of the fixed point follows from Cor. 8.4.
  3. (intermediate) Enumerate all four activation patterns \(\sigma \in \{0,1\}^2\) of the [2, 2, 1] hidden layer and compute \(\lambda_{\min}(L_{\text{free}}(\sigma))\) for each. Which is the worst case? Compare to the bound \(\lambda_{\min}^{\text{free}}\) used in Thm. 9.4.
  4. (intermediate) Prove directly that \(V(y) = \tfrac{1}{2}\|\delta_\Omega(\sigma(y))(y - y^\star)\|^2\) is continuous at switching surfaces, using the explicit fact that \(R_{\sigma, jj}\, c_{z^{(\ell)}, j} = 0\) when \(c_{z^{(\ell)}, j} = 0\).
  5. (project) Empirically estimate the realized convergence rate \(\hat\lambda(t) = -\frac{1}{2 V(t)} \frac{dV}{dt}\) for several random initializations of the [2, 30, 1] network. How much slack is there between \(\hat\lambda(t)\) and the worst-case bound \(\lambda_{\min}^{\text{free}}\)? Relate to Rem. 9.5.
  6. (advanced) Construct an example on a 3-vertex path where two restriction maps are non-orthogonal projections, and show that the corresponding \(V\) is not continuous across their joint switching surface. Identify which property of the ReLU projection the example violates.

11.9 Further reading

[1] §4 and Lem. 4.3 cover the common-Lyapunov construction; Theorem 4.1 is the main convergence statement. [2] is the standard reference for Filippov solutions and Clarke subdifferentials used in Def. 9.2. For common-Lyapunov theorems in switched systems more generally, see the control-theory literature on switched linear systems (Liberzon’s monograph and Shorten et al.’s survey are standard entry points). The path-graph spectral-gap scaling \(\lambda_{\min} \sim 1/k^2\) is classical spectral graph theory (e.g., Chung’s spectral-graph-theory monograph).

11.10 FAQ / common misconceptions

Is the “common Lyapunov function” the same quadratic form across all \(\sigma\)? No — the quadratic form \(\tfrac{1}{2}(y - y^\star)^T L_{\text{free}}(\sigma) (y - y^\star)\) depends on \(\sigma\). What is common is the value \(V(y)\) as a function of \(y\), by the switching-surface-continuity argument of Thm. 9.3.

Why Filippov solutions specifically, rather than Carathéodory or classical? Because the right-hand side of (9.1) is discontinuous in \(y\) at switching surfaces. Classical ODE theory does not apply, and Carathéodory's theory allows discontinuity in \(t\) only. Filippov’s framework is the standard extension that handles state-discontinuous fields: it gives existence of solutions and the chain rule we need for Thm. 9.3. Uniqueness is not automatic and needs additional conditions on the field.

Can the trajectory switch regions infinitely often? Potentially yes, even in finite time (chattering). But Thm. 9.4 is uniform over all trajectories, so the exponential bound holds regardless.

Why does the rate depend on the worst pattern, not the realized one? Because without knowing a priori which patterns will be visited, the only guarantee we can give uses \(\inf_\sigma \lambda_{\min}(L_{\text{free}}(\sigma))\). In practice the realized rate is usually faster (Rem. 9.5); the diagnostics of Ch. 13 can estimate it from per-edge discords.