16 Frontiers: Deeper Architectures, Scaling, and Open Questions
Purpose. Surveys what the framework does and does not yet cover, and outlines open mathematical and computational problems.
16.1 Key concepts & results
- Beyond MLPs: convolutions (weight-sharing as quotient sheaves?), residual connections (graph with cycles), attention / transformers.
- Scaling: when does the spectral-gap bound become loose? Where are tight rates known?
- Stochastic sheaf dynamics: noise injection, regularization.
- Sheaf-based interpretability: per-edge discord as a saliency analogue.
- Connections to equilibrium propagation, deep equilibrium models, neural ODEs, predictive coding.
- Open problems explicitly listed in paper §7.
Prerequisites: all earlier chapters
16.2 Motivating example
Take the smallest architecture that breaks the path-graph assumption: a single residual block. Two affine + ReLU sub-layers from input \(x\) to output \(z\), plus a skip edge that feeds \(x\) directly into \(z\). The underlying graph of the sheaf now has a chord — it is no longer a path. Write out the coboundary \(\delta_\Omega\) for this sheaf. Three facts immediately break down. (1) \(\delta_\Omega\) is no longer square: there is one more edge than there are interior vertices, so the unit-determinant identity of Lemma 3.2 fails. (2) Topological order is still well-defined (skip connections don’t introduce cycles), but the triangular structure of \(\delta_\Omega\) acquires extra off-diagonal blocks from the chord. (3) The forward pass is still computable — the block is feedforward — but it is no longer the unique harmonic extension in the sense of Ch. 8, because positive-definiteness of \(L_{\text{free}}\) now requires a separate argument.
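The shape claims above can be checked numerically on a scalar-stalk toy version of this sheaf. A minimal sketch (assumptions: scalar stalks, the coboundary convention \((\delta x)_e = x_{\text{head}} - W_e\, x_{\text{tail}}\), and only the input vertex pinned; the variable names are illustrative, not from the lab code):

```python
import numpy as np

W2 = -1.3  # illustrative weight on the edge h -> z; pinned-vertex terms
           # (the columns for x) move to the right-hand side and drop out

# Path x -> h -> z: interior vertices (h, z), edges (x->h, h->z).
delta_path = np.array([
    [1.0, 0.0],   # edge x->h: x is pinned, only the h column survives
    [-W2, 1.0],   # edge h->z
])
assert delta_path.shape == (2, 2)
assert np.isclose(np.linalg.det(delta_path), 1.0)  # unitriangular, det = 1

# Residual block: add the chord x -> z. One extra row, no extra column.
delta_res = np.vstack([delta_path, [0.0, 1.0]])  # skip edge: x pinned
assert delta_res.shape == (3, 2)  # tall: one more edge than interior vertices

# L_free = delta^T delta is still positive definite in this instance,
# but no longer by the det = 1 argument -- it needs a separate check.
L_free = delta_res.T @ delta_res
assert np.all(np.linalg.eigvalsh(L_free) > 0)
```

In this scalar toy the separate positive-definiteness check is a one-line eigenvalue test; the open question is what replaces it in general.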
This is the gateway example for this chapter. Everything past it — deep ResNets, U-Nets, attention, graph neural networks — layers further combinations of chords, tied weights, and dynamic topology on top of it. The questions are: which parts of Chs. 7–13 survive, and which need new machinery?
16.3 Intuition
Everything in this book rides on three structural coincidences of the plain MLP. First, the computation graph is a path, so the coboundary \(\delta_\Omega\) is square and unitriangular (Lemma 3.2), which is what makes the forward pass equal to the harmonic extension. Second, the only hidden-layer nonlinearity is ReLU, so each nonlinear restriction map is an orthogonal projection \(R_{z^{(\ell)}}\) — and orthogonal projections are exactly the right object for Dirichlet energy to remain a common Lyapunov function. Third, the output is at the end of the path, so pinning it is a clean Dirichlet condition on a single vertex.
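The second coincidence is easy to verify concretely: on the activation region of a pre-activation vector, ReLU acts as a diagonal 0/1 matrix, which is symmetric and idempotent — an orthogonal projection, hence nonexpansive. A minimal numerical check (hypothetical single-layer setup; \(R\) here is the diagonal activation-pattern matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=8)  # a pre-activation vector

# On the activation region of z, ReLU acts as the diagonal projection R:
R = np.diag((z > 0).astype(float))
assert np.allclose(R @ z, np.maximum(z, 0.0))          # agrees with ReLU at z
assert np.allclose(R @ R, R) and np.allclose(R, R.T)   # idempotent, symmetric

# Orthogonal projections are nonexpansive: ||R v|| <= ||v|| for all v --
# the property the common-Lyapunov (Dirichlet energy) argument leans on.
v = rng.normal(size=8)
assert np.linalg.norm(R @ v) <= np.linalg.norm(v)
```

Replace ReLU by GELU and the pointwise map is no longer idempotent, which is exactly where the orthogonal-projection picture fails.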
Each of these coincidences breaks for some architecture of interest.
- Residual connections, encoder–decoder skips, and multi-modal couplings break the path. \(\delta_\Omega\) is now tall instead of square; positive-definiteness may still hold case by case, but the unit-determinant identity is gone (Open Problem 2 of paper §7).
- Convolutions break “one weight matrix per edge” — a single kernel is shared across many edges. The natural reframing is a quotient sheaf under a translation group action, but the convergence theory has not been worked out.
- Attention and transformers break the fixed base graph: the graph itself is produced by the data (the attention pattern). The framework has to be lifted to a sheaf-of-sheaves, or to a data-dependent sheaf, before the identity of Ch. 8 can even be stated.
- Non-ReLU activations (GELU, SiLU, smooth activations generally) break the orthogonal-projection picture. The local-adjoint machinery of Ch. 10 covers them, but the common-Lyapunov argument of Ch. 9 needs a new sector hypothesis.
- Scaling is its own frontier: the spectral-gap bound of Theorem 4.1 becomes loose in depth (the gap scales as \(1/k^2\) on a length-\(k\) path), predicting impractically slow convergence for deep networks. Tighter rates are known in linear predictive coding ([1]); extending them to ReLU is Open Problem 3.
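The \(1/k^2\) scaling of the gap is easy to verify for the ordinary path-graph Laplacian, to which the sheaf Laplacian reduces when all stalks are scalar and all restriction maps are the identity (an assumption made purely for this check):

```python
import numpy as np

def path_laplacian(k):
    """Graph Laplacian (degree minus adjacency) of the path on k vertices."""
    A = np.diag(np.ones(k - 1), 1) + np.diag(np.ones(k - 1), -1)
    return np.diag(A.sum(axis=1)) - A

for k in (8, 32, 128):
    lam = np.linalg.eigvalsh(path_laplacian(k))   # ascending eigenvalues
    gap = lam[1]                                  # smallest nonzero eigenvalue
    # Closed form for the path: gap = 2(1 - cos(pi/k)) ~ pi^2 / k^2
    assert np.isclose(gap, 2 * (1 - np.cos(np.pi / k)), atol=1e-8)
    print(f"k={k:4d}  gap={gap:.6f}  pi^2/k^2={np.pi**2 / k**2:.6f}")
```

The gap-based rate degrades quadratically in depth, which is why tighter, structure-aware rates are the interesting target.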
The three open problems listed in §7 of [2] — (i) joint CPWA dynamics with simultaneously moving switching surfaces, (ii) extension beyond path graphs, (iii) scaling — organize this chapter. Adjacent programs worth watching: the concurrent work of [1] on sheaf cohomology of linear predictive-coding networks (diagnostic via \(H^1\)); the companion paper [3] on joint state–structure dynamics that supplies the \(\beta \sim 1/n_{\text{train}}\) scaling law empirically confirmed in §6.3; and the broader geometric deep learning program that the sheaf picture sits inside.
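Of the breaks above, the convolutional one is the easiest to see concretely: a 1D convolution is multiplication by a banded matrix whose rows all reuse the same kernel entries — exactly the weight tying that a quotient-sheaf construction would have to encode. A minimal sketch (illustrative names; the quotient-sheaf reframing itself is the open part):

```python
import numpy as np

# 1D convolution (stride 1, "valid" padding) as a weight-shared banded matrix.
kernel = np.array([0.25, 0.5, 0.25])
n = 8
T = np.zeros((n - len(kernel) + 1, n))
for i in range(T.shape[0]):
    T[i, i:i + len(kernel)] = kernel  # same weights tied across every row/edge

x = np.arange(n, dtype=float)
# np.convolve flips its second argument, so flip the kernel to match T @ x
assert np.allclose(T @ x, np.convolve(x, kernel[::-1], mode="valid"))
```

One kernel, many edges: the sheaf has far fewer free parameters than edges, and the convergence theory would need to respect that symmetry rather than treat each edge weight independently.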
Intuition device (planned): A ‘tree of extensions’ diagram showing each architecture as a different base graph.
16.4 Formal development
[TO FILL: formal development — definitions, statements, careful notation]
16.5 Theorem demonstrations
[TO FILL: proofs / proof sketches of the key results named above. Proofs should come *after* the intuition section, as agreed.]
16.6 Worked examples
[TO FILL: worked example(s) carried out by hand]
16.7 Coding lab
lab-14-residual-block-sheaf —
[TO FILL: one-paragraph description of the lab's goal]
16.8 Exercises
[TO FILL: 3–6 exercises, graded from warm-up to project-level]
16.9 Further reading
[TO FILL: annotated paragraph of 3–6 references]
16.10 FAQ / common misconceptions
[TO FILL: short Q&A for things readers frequently get wrong]