16 Frontiers: Deeper Architectures, Scaling, and Open Questions

Purpose. Surveys what the framework does and does not yet cover, and outlines open mathematical and computational problems.

16.1 Key concepts & results

Beyond MLPs: convolutions (weight-sharing as quotient sheaves?), residual connections (graph with cycles), attention / transformers.
Scaling: when does the spectral-gap bound become loose? Where are tight rates known?
Stochastic sheaf dynamics: noise injection, regularization.
Sheaf-based interpretability: per-edge discord as a saliency analogue.
Connections to equilibrium propagation, deep equilibrium models, neural ODEs, predictive coding.
Open problems explicitly listed in paper §7.

Prerequisites: all earlier chapters

16.2 Motivating example

Take the smallest architecture that breaks the path-graph assumption: a single residual block. Two affine + ReLU sub-layers from input \(x\) to output \(z\), plus a skip edge that feeds \(x\) directly back into \(z\). The underlying graph of the sheaf now has a chord — it is no longer a path. Write out the coboundary \(\delta_\Omega\) for this sheaf. Three facts immediately break down. (1) \(\delta_\Omega\) is no longer square: there is one more edge than there are interior vertices, so the unit-determinant identity of Lemma 3.2 fails. (2) Topological order is still well-defined (skip connections don’t introduce cycles), but the triangular structure of \(\delta_\Omega\) acquires extra off-diagonal blocks from the chord. (3) The forward pass is still computable — the block is feedforward — but it is no longer the unique harmonic extension in the sense of Ch. 8, because positive-definiteness of \(L_{\text{free}}\) now requires a separate argument.

This is the gateway example for Ch. 14. Everything past it — deep ResNets, U-Nets, attention, graph neural networks — layers further combinations of chord, tied weights, or dynamic topology on top. The questions are: which parts of Chs. 7–13 survive, and which need new machinery?

16.3 Intuition

Everything in this book rides on three structural coincidences of the plain MLP. First, the computation graph is a path, so the coboundary \(\delta_\Omega\) is square and unitriangular (Lemma 3.2), which is what makes the forward pass equal to the harmonic extension. Second, the only hidden-layer nonlinearity is ReLU, so each nonlinear restriction map is an orthogonal projection \(R_{z^{(\ell)}}\) — and orthogonal projections are exactly the right object for Dirichlet energy to remain a common Lyapunov function. Third, the output is at the end of the path, so pinning it is a clean Dirichlet condition on a single vertex.

Each of these coincidences breaks for some architecture of interest.

Residual connections, encoder–decoder, multi-modal couplings break the path. \(\delta_\Omega\) is now tall instead of square; positive-definiteness may still hold case by case, but the unit-determinant identity is gone (Open Problem 2 of paper §7).
Convolutions break “one weight matrix per edge” — a single kernel is shared across many edges. The natural reframing is a quotient sheaf under a translation group action, but the convergence theory has not been worked out.
Attention and transformers break the fixed base graph: the graph itself is produced by the data (the attention pattern). The framework has to be lifted to a sheaf-of-sheaves, or to a data-dependent sheaf, before the identity of Ch. 8 can even be stated.
Non-ReLU activations (GELU, SiLU, smooth activations generally) break the orthogonal-projection picture. The local-adjoint machinery of Ch. 10 covers them, but the common-Lyapunov argument of Ch. 9 needs a new sector hypothesis.
Scaling is its own frontier: the spectral-gap bound of Theorem 4.1 becomes loose in depth (the gap scales as \(1/k^2\) on a length-\(k\) path), predicting impractically slow convergence for deep networks. Tighter rates are known in linear predictive coding ([1]); extending them to ReLU is Open Problem 3.

The three open problems [2] list in §7 — (i) joint CPWA dynamics with simultaneously moving switching surfaces, (ii) extension beyond path graphs, (iii) scaling — organize this chapter. Adjacent programs worth watching: the concurrent work of [1] on sheaf cohomology of linear predictive-coding networks (diagnostic via \(H^1\)); the companion paper [3] on joint state–structure dynamics that supplies the \(\beta \sim 1/n_{\text{train}}\) scaling law empirically confirmed in §6.3; and the broader geometric deep learning program that the sheaf picture sits inside.

Intuition device (planned): A ‘tree of extensions’ diagram showing each architecture as a different base graph.

16.4 Formal development

This chapter is a landscape, not a theory — so the “formal development” below is a careful list of statements and open problems rather than new proofs. Where specific partial results exist, they are cited.

Architectures beyond the path graph

For each architecture, we record (a) the base graph \(G\) of the natural sheaf, (b) which of the three structural coincidences of Ch. 7 (path base, orthogonal-projection nonlinearities, single output vertex) survives, and (c) the open problem.

Residual networks. Let \(G_{\text{res}}\) be the path graph with a chord edge from \(v_{a^{(\ell - 1)}}\) to \(v_{a^{(\ell+1)}}\) implementing a skip. The coboundary \(\delta_\Omega(\sigma)\) is no longer square: there is one more edge than there are free vertices, so Lemma 8.2 (unit determinant) is no longer meaningful. Positive-definiteness of \(\delta_\Omega^T \delta_\Omega\) may still hold case-by-case but has to be checked directly.

Open Problem 14.1 (Path → DAG extension — Open Problem 2 of [2] §7). Give sufficient conditions on a directed acyclic base graph \(G\) and a cellular sheaf \(\mathcal{F}\) on \(G\) under which the pinned Dirichlet problem has a unique minimizer equal to a well-defined “forward pass” of the underlying computation.

Convolutional networks. The natural base graph is the product \(G_{\text{conv}} = G_{\text{feedforward}} \times \Lambda\) where \(\Lambda\) is the spatial lattice. The weight-sharing across \(\Lambda\) is a \(\Lambda\)-action on the sheaf; the natural framework is a quotient sheaf \(\mathcal{F} / \Lambda\) whose stalks are \(\Lambda\)-invariant subspaces.

Open Problem 14.2 (Quotient sheaves for weight-sharing). Define a quotient-sheaf analogue of the neural sheaf that encodes weight-sharing across translations (for CNNs) or across permutation groups (for message-passing GNNs, see [4]). Extend Prop. 8.6 and Thm. 9.4 to the quotient.

Attention and transformers. The attention pattern depends on the data itself (via the attention scores), so the base graph \(G = G(x; \theta)\) is state-dependent. The sheaf’s base is therefore not a fixed combinatorial object but a map \(\mathbb{R}^{n_0} \times \Theta \to \text{Graphs}\).

Open Problem 14.3 (Data-dependent sheaves). Formalize attention as a family of cellular sheaves indexed by \((x, \theta)\), so that the forward pass through a transformer becomes a harmonic extension on the data-dependent base graph. Identify sufficient smoothness/Lipschitz conditions on \(G(\cdot; \theta)\) for analogues of Thm. 9.4 and Thm. 10.4 to hold.

Non-ReLU activations

Smooth activations \(\varphi \in \{\text{GELU}, \text{SiLU}, \text{Softplus}\}\) break the orthogonal-projection structure of the ReLU edge — \(\varphi\) is not a projection on any half-space. The local-adjoint machinery of Ch. 10 (Def. 10.1) covers them, replacing the hard-switching diagonal \(R_{\sigma}\) with the smooth Jacobian \(J_\varphi\).

Proposition 14.4 (Smooth activations via local adjoints). Let \(\mathcal{F}\) be the neural sheaf with ReLU replaced by a continuously differentiable activation \(\varphi\) satisfying Def. 10.3 (Lipschitz + sector-monotone). Then \(L_{\text{free}}^{\text{aug}}(\sigma, y)\) is uniformly positive definite and Theorem 10.4 applies. The Filippov apparatus of Ch. 9 is not needed: trajectories are classical.

The catch: the common-Lyapunov argument of Theorem 9.3 uses the fact that ReLU edges’ restriction maps are \(0/1\) projections and contribute zero discord on the switching surface itself. For smooth activations the “switching surface” is replaced by a continuum, and the uniform bound in Def. 10.3 takes over. The resulting theory is cleaner but gives weaker constants.

Stochastic and regularized dynamics

Definition 16.1 Definition 14.5 (Stochastic sheaf heat equation). The stochastic sheaf heat equation is \[dy(t) = -L_{\text{free}}(\sigma(y(t)))\, (y(t) - y^\star)\, dt + \sqrt{2 T}\, dW(t),\] with Brownian motion \(W\) and temperature \(T > 0\).

Steady-state distributions of Def. 14.5 are Gibbs on Dirichlet energy: \(\pi(y) \propto \exp(-E(y; \sigma(y)) / T)\) within each activation region, with switching at region boundaries. This is the natural stochastic analogue of Ch. 9; \(T \to 0\) recovers the deterministic theorem, and \(T > 0\) gives a noise-injected training regime of potential interest for regularization.

Scaling and rate tightness

The depth-dependence of \(\lambda_{\min}^{\text{free}}\) is the central quantitative challenge. On a length-\(k\) path with unit weights, \(\lambda_{\min}^{\text{free}} \sim 1/k^2\) (the standard path-graph scaling), predicting \(t_c \sim k^2\) mixing time — too slow for production depths.

Open Problem 14.6 (Tight scaling — Open Problem 3 of [2] §7). Establish depth-and-width scaling laws for \(\lambda_{\min}^{\text{free}}\) tighter than the generic path-graph bound. Separate the contribution of the affine (weight-dependent) and ReLU (pattern-dependent) spectral clusters (Prop. 13.4).

[1] treats the analogous question in a linear predictive-coding setting (recurrent topologies, no ReLU) and obtains \(H^1\)-based diagnostics and tighter rates. Extending these to the nonlinear setting of this book is a direct avenue.

Joint CPWA dynamics

The joint flow of Ch. 11 has been analyzed here under frozen-parameter switching. When the parameters themselves move, the activation-region boundaries also move — a “joint CPWA dynamical system” with simultaneously evolving state and switching surfaces.

Open Problem 14.7 (Convergence under moving switching surfaces — Open Problem 1 of [2] §7). Establish a convergence theorem for the joint state–parameter flow (Def. 11.4) in the regime where \(\eta_c, \eta_\theta\) are comparable and the activation-region boundaries move on the same timescale as the cochain. The timescale-separated regime (Ch. 12) is the \(\eta_\theta / \eta_c \to 0\) limit; the general case is open.

Interpretability and adjacent frameworks

The per-edge discord of Ch. 13 is a sheaf-native saliency. Relevant adjacent programs:

Equilibrium propagation ([5]): two-phase clamped training; the joint flow of Ch. 11 is a continuous-time, graph-native generalization.
Predictive coding ([6], [7], [1]): local prediction-error dynamics; the linear case admits cohomological diagnostics (\(H^1\)) that parallel Ch. 13 in language.
Deep equilibrium models: forward pass as a fixed point of an implicit map; the sheaf harmonic extension of Ch. 8 is a specific case.
Neural ODEs and continuous-depth networks: limit of infinitely-thin layers; the path-graph sheaf sends to a continuous base whose cohomology is the natural object of study.
Geometric deep learning ([8]): encompassing framework; the sheaf picture sits inside the “graphs + symmetry” slice.

The frontier is whether the harmonic-extension / common-Lyapunov / per-edge-discord packaging of [2] will extend across this landscape as cleanly as Chs. 7–13 extend it across the path-graph MLP.

16.5 Theorem demonstrations

This chapter is primarily a landscape of open problems; two propositions carry explicit content and admit short proofs.

Proof of Prop. 14.4 (smooth activations). Replace each hidden-layer restriction map \(R_\sigma\) (a \(\{0,1\}\) projection) with the smooth Jacobian \(J_\varphi(c_{z^{(\ell)}})\), and include the output-edge block \(J_{\varphi_{\text{out}}}^T J_{\varphi_{\text{out}}}\) as in Thm. 10.4. Under Def. 10.3 applied to every hidden-layer activation, each Jacobian factor \(J_\varphi\) satisfies \(J_\varphi + J_\varphi^T \succeq 2 m\, I\), so \(J_\varphi^T J_\varphi \succeq m^2 I\). The affine structure of Lem. 8.2 still provides the off-diagonal \(-W^{(\ell)}\) pattern; the diagonal blocks pick up \(J_\varphi^T J_\varphi \succeq m^2 I\) in place of the ReLU-projection identity. A row-by-row Cauchy-Schwarz (or equivalently, a Schur-complement) argument then gives \(L_{\text{free}}^{\text{aug}} \succeq \tilde\lambda\, I\) for some \(\tilde\lambda > 0\) depending on \(m\), \(\|W^{(\ell)}\|\), and \(L_\varphi\). Exponential convergence then follows as in Thm. 10.4. Because \(\varphi\) is \(C^1\), \(y \mapsto L_{\text{free}}^{\text{aug}}(y)\) is continuous, and the Filippov machinery of Ch. 9 is not needed — solutions of the ODE (10.4) are classical. \(\square\)

Proof of Prop. 14.8 (residual block full-rank). Order free coordinates as \((c_{z^{(1)}}, c_{a^{(1)}}, c_{z^{(2)}}, c_{\hat y})\) with dimensions \((2, 2, 1, 1)\) summing to \(6\), and edges as \((e_1^{\text{aff}}, e_1^{\text{act}}, e_2^{\text{aff}}, e^{\text{out}}, e^{\text{skip}})\) with edge-stalk dimensions \((2, 2, 1, 1, 1)\) summing to \(7\). The first four rows form the path-graph \(\delta_\Omega\) of Ch. 8 (block lower-triangular with identity diagonals), which has full column rank \(6\) by Lem. 8.2. The skip row contributes a single additional row \((-W^{\text{skip}}\text{-pin}, 0_{1 \times 2}, 0_{1 \times 2}, 1, 0)\) — a \(1 \times 6\) vector where the \(c_{z^{(2)}}\)-coordinate carries \(+1\) and the rest encode the pinned contribution (zero on free coordinates except through \(c_{z^{(2)}}\)). Adding any additional row to a matrix of full column rank cannot reduce column rank, so \(\delta_\Omega\) has rank \(6\) (full column rank). Hence \(L_{\text{free}}(\sigma) = \delta_\Omega^T \delta_\Omega\) is positive definite: \(v^T L_{\text{free}} v = \|\delta_\Omega v\|^2 \geq 0\), with equality iff \(v = 0\). The “generic” qualifier handles the measure-zero parameter locus where the augmented matrix could degenerate; for Lebesgue-almost every \((W^{(1)}, W^{(2)}, W^{\text{skip}})\) the condition holds. \(\square\)

Open problems. Open Problems 14.1, 14.2, 14.3, 14.6, 14.7 are open — no proofs, no partial proofs. Their placement here is to flag what the extension of the theory has to address, not to claim progress on them. Where partial results exist (e.g., [1] for linear predictive coding) they are cited inline. \(\square\)

16.6 Worked example: a single residual block

We make Open Problem 14.1 concrete by working through the smallest non-path example: a one-block residual network.

Architecture. Two-layer MLP with a skip: \[z^{(1)} = W^{(1)} x + b^{(1)}, \quad a^{(1)} = \mathrm{ReLU}(z^{(1)}), \quad z^{(2)} = W^{(2)} a^{(1)} + b^{(2)} + W^{\text{skip}} x.\]

Base graph. Start with the path graph of Ch. 7 (\(v_x \to v_{z^{(1)}} \to v_{a^{(1)}} \to v_{z^{(2)}} \to v_{\hat{y}}\)) and add a chord edge \(e^{\text{skip}}: v_x \to v_{z^{(2)}}\). Now \(|V| = 5\) (unchanged) but \(|E| = 5\) (one more edge than before).

Edge inventory. Four path edges as in Ch. 7 (affine/ReLU/affine/output, now with identity output for simplicity) plus the skip edge \(e^{\text{skip}}\) with stalk \(\mathbb{R}^1\) and restriction maps \[\mathcal{F}_{v_x \trianglelefteq e^{\text{skip}}} = W^{\text{skip}} \in \mathbb{R}^{1 \times 2}, \qquad \mathcal{F}_{v_{z^{(2)}} \trianglelefteq e^{\text{skip}}} = I_1.\] The edge-agreement condition is \(W^{\text{skip}} c_{v_x} = c_{z^{(2)}}\), which, combined with the path-edge condition \(c_{z^{(2)}} = W^{(2)} c_{a^{(1)}} + b^{(2)}\), gives the two-term residual update only at the harmonic extension.

Dimension count (analogue of Remark 7.3). Take widths \(n_0 = 2\), \(n_1 = 2\), \(n_2 = 1\) as before.

\(\dim C^0 = 2 + 2 + 2 + 1 + 1 = 8\); after pinning \(v_x\), \(\dim \Omega = 6\).
\(\dim C^1 = 2 + 2 + 1 + 1 + 1 = 7\) (four path edges plus one skip).

So \(\delta_\Omega \in \mathbb{R}^{7 \times 6}\) — tall, not square. Lemma 8.2’s “\(\det \delta_\Omega = 1\)” is no longer meaningful; \(\delta_\Omega\) has no determinant at all.

What survives. Positive-definiteness of \(L_{\text{free}}(\sigma) = \delta_\Omega^T \delta_\Omega\) is now equivalent to \(\delta_\Omega\) having full column rank. One can check:

Proposition 14.8 (Residual block — full-rank check). For generic \(W^{(1)}, W^{(2)}, W^{\text{skip}}\) in the network above, \(\delta_\Omega(\sigma)\) has full column rank, hence \(L_{\text{free}}(\sigma)\) is positive definite. Explicitly, the extra skip row of \(\delta_\Omega\) is \((-W^{\text{skip}}, 0, 0, 0, 1, 0)\) in the free-coordinate order \((c_{z^{(1)}}, c_{a^{(1)}}, c_{z^{(2)}}, c_{\hat{y}})\); its rank-1 contribution to \(L_{\text{free}}\) does not cancel the existing path-graph blocks.

So the harmonic extension still exists and is unique, and it still equals the forward pass \(z^{(2)} = W^{(2)} \mathrm{ReLU}(W^{(1)} x + b^{(1)}) + b^{(2)} + W^{\text{skip}} x\). But the route from coboundary structure to this conclusion is no longer “unitriangular → determinant 1 → unique.”

What fails. The clean back-substitution procedure of Ch. 8 is gone: the skip edge introduces a coupling from \(c_{v_x}\) directly to \(c_{z^{(2)}}\), which is not sequential in the layer ordering. A direct linear solve of \(L_{\text{free}}(\sigma) y_\Omega = -L[\Omega, U] u\) still works, but it is now a full \(6 \times 6\) solve rather than a \(6\)-step back-substitution.

What is open. For a generic DAG \(G\) (multiple skips, U-Net topologies, dense skip patterns), two questions remain:

Sufficient conditions on \(G\) and on the restriction maps under which \(\delta_\Omega(\sigma)\) is full rank for every \(\sigma\). (Sufficient conditions exist for the one-block case above; no general result is known for arbitrary DAGs.)
Convergence rates as a function of the DAG topology. The path-graph gap \(\lambda_{\min}^{\text{free}} \sim 1/k^2\) is replaced by a graph-dependent rate whose scaling with depth and skip density is an open problem (Open Problem 14.6).

Even in the single-residual-block case, the common-Lyapunov argument of Theorem 9.3 carries through verbatim as long as Prop. 14.8’s full-rank condition holds — an encouraging sign. But for deep ResNets (multiple blocks, dense skips) no proof is known: both existence of the harmonic extension and convergence of the heat equation are open questions even for identity-output networks.

Sketch: attention as a data-dependent sheaf

A transformer block attends tokens \(x_1, \ldots, x_T\) via learned \(Q, K, V\) matrices producing an attention matrix \(A(x; W) \in \mathbb{R}^{T \times T}\). The natural sheaf has base graph \(G(x; W)\) with vertices \(\{v_{x_1}, \ldots, v_{x_T}\}\) and weighted edges \((v_{x_i}, v_{x_j})\) with weight \(A_{ij}(x; W)\). This \(G\) depends on both the parameters \(W\) and the data \(x\) — the sheaf is a functor from \((x, W)\) to the category of cellular sheaves, not a single fixed sheaf.

The formal setup is a parameterized family of sheaves \(\{\mathcal{F}(x, W)\}_{(x, W)}\); the harmonic-extension story of Ch. 8 asks for a coherent choice of forward pass across this family, continuous in \((x, W)\). No such framework has been developed. This is the content of Open Problem 14.3.

16.7 Coding lab

Lab 14 — Residual-Block Sheaf — Build the sheaf of a single-residual-block network on the chord-augmented path graph. Compute \(\delta_\Omega\) (tall \(7 \times 6\)), verify Prop. 14.8 numerically for random weights, and solve the pinned Dirichlet problem via a full linear solve (since back-substitution fails). Compare the output to forward(x) of the equivalent PyTorch module. Bonus: stack two residual blocks and investigate whether the full-rank property persists.

16.8 Exercises

(warm-up) Write down the augmented base graph for a [2, 2, 1] network with a skip from \(v_x\) to \(v_{z^{(2)}}\). Count vertices and edges, compute \(\dim \Omega\) and \(\dim C^1\), and confirm \(\delta_\Omega\) is tall.
(warm-up) Verify that smooth activations \(\varphi(z) = z / (1 + e^{-z})\) (SiLU/Swish) satisfy Def. 10.3 on a compact domain; compute explicit constants \(m, M, L_\varphi\).
(intermediate) For a U-Net-style graph with two down-sample and two up-sample skip connections, predict \(\dim C^1 - \dim \Omega\) (the number of “extra” rows in \(\delta_\Omega\)) and argue whether Prop. 14.8’s argument generalizes.
(intermediate) Integrate the stochastic sheaf heat equation (Def. 14.5) for the [2, 2, 1] network at \(T = 0.01\). Histogram the steady-state distribution of an interior cochain coordinate and compare with the Gibbs prediction \(\pi \propto \exp(-E/T)\).
(project) Pick one of the five open problems (14.1–14.3, 14.6, 14.7) and write a two-page survey of progress since [2]. Cite at least three recent papers, identify the missing ingredient, and propose a concrete first step.
(advanced) Attempt a quotient-sheaf construction (Open Problem 14.2) for a 1D convolution with kernel size 3 and stride 1 on a length-8 signal. Write down the quotient base graph and the induced restriction maps; identify where the sheaf structure forces the shared-weight constraint.

16.9 Further reading

[2] §7 lists the three open problems that structure this chapter. [1] is the most closely related recent work, extending sheaf cohomology to linear predictive-coding networks and providing the template for tighter scaling laws (Open Problem 14.6). For weight-sharing via group actions on sheaves, see [8] Chs. 4–5 and the equivariant-CNN literature (e.g., Cohen–Welling’s group-equivariant convolutional networks). [4] introduces neural sheaf diffusion for GNNs — a natural bridge to Open Problem 14.2. Vaswani et al.’s original transformer paper is the reference for Open Problem 14.3; the sheaf-of-sheaves / data-dependent sheaf formalism that Open Problem 14.3 asks for remains to be developed.

16.10 FAQ / common misconceptions

Are the open problems in §7 just “the hard cases”? They are structural: each names a specific structural coincidence of the plain MLP that breaks, and asks how the theory recovers. Solving any one of them would extend Chs. 7–13 to a large class of architectures.

Has anyone written down a forward-pass-as-harmonic-extension theorem for ResNets? For a single residual block, Prop. 14.8 does it. For general ResNets with many skips, no full result is published as of 2026 — though the linear case is tractable via direct \(L_{\text{free}}\)-PD arguments.

Why do attention-based transformers need a whole new formalism? Because the base graph itself is a function of the data: the attention pattern \(A(x; W)\) tells you which token-tokens talk, and this changes across inputs. A cellular sheaf sits on a fixed base; a “data-dependent sheaf” is not yet formally defined.

Can I use the stochastic heat equation (Def. 14.5) as a regularizer now? You can, but without convergence theory the hyperparameter \(T\) is uncalibrated. [3] takes a first step in the stochastic direction.

Does this book cover everything neural-sheaf? No — it covers the [2] MLP story plus enough of [3] for Ch. 12’s scaling. Parallel developments in graph learning ([4]), opinion dynamics ([9]), and predictive coding ([1]) are adjacent and cited throughout Ch. 14.