9  Building a Sheaf from a Feedforward ReLU Network

Purpose. Constructs the central object of the paper: the cellular sheaf on the path graph whose vertices are intermediate quantities and whose edges encode affine, ReLU, and output operations.

9.1 Key concepts & results

  • Path-graph base: one vertex per intermediate quantity.
  • Three edge types: affine edges (W_ℓ + bias b_ℓ), ReLU edges (activation-pattern-indexed diagonal), output edges (identity or nonlinear).
  • Stalk dimensions, restriction maps, the state-dependent sheaf view.
  • Pinning the input vertex stalk = setting boundary data.

Prerequisites: Ch 1, Ch 4, Ch 5

9.2 Motivating example

Take the [2, 4, 1] running network from Ch. 1. Its computation graph is a short pipeline of intermediate quantities: input \(x \in \mathbb{R}^2\), pre-activation \(z^{(1)} \in \mathbb{R}^4\), post-activation \(a^{(1)} \in \mathbb{R}^4\), and output \(z^{(2)} \in \mathbb{R}\). Each arrow is a concrete operation — affine (\(W^{(1)} \cdot + b^{(1)}\)), elementwise ReLU, final affine. Lay this pipeline flat as an undirected path graph, attach a vector space at each vertex sized to the corresponding quantity, and encode each operation as a pair of restriction maps from its edge stalk into its two endpoint stalks. You now have a cellular sheaf \(\mathcal{F}\) on the path graph. The striking payoff, which Ch. 8 will prove, is that the forward pass of the network is the unique harmonic extension of the input data on this sheaf — the same Dirichlet problem from Ch. 2, but with \(\mathcal{F}\) in place of the constant sheaf \(\underline{\mathbb{R}}\).

In other words, the network’s sequential “run the layers in order” semantics is recovered as one particular way of solving a single, globally defined sheaf-theoretic equation. This chapter is about writing down that sheaf carefully; subsequent chapters exploit it.

9.3 Intuition

A sheaf wraps a directed computation graph inside a symmetric, undirected object. Instead of “input flows into output,” the sheaf view says: “neighboring vertices have to agree across each edge; pick the cochain that makes all those local agreements hold simultaneously.” Each edge \(e = (u, v)\) carries its own vector space \(\mathcal{F}(e)\) — the edge stalk, interpreted as a shared frame in which the two endpoints report their state. Restriction maps \(\mathcal{F}_{u \trianglelefteq e}\) and \(\mathcal{F}_{v \trianglelefteq e}\) push each endpoint’s cochain into that shared frame, and the edge discrepancy \(\mathcal{F}_{u \trianglelefteq e} x_u - \mathcal{F}_{v \trianglelefteq e} x_v\) measures how badly they disagree.

Three edge types encode the three kinds of operation in an MLP. Affine edges wire the pre-activation \(z^{(\ell)}\) to the previous post-activation \(a^{(\ell-1)}\): the restriction on the \(z\)-side is the identity, and the restriction on the \(a\)-side is the weight matrix \(W^{(\ell)}\) (with the bias folded in as an edge offset). ReLU edges wire \(a^{(\ell)}\) to \(z^{(\ell)}\): the restriction on the \(a\)-side is the identity, and the restriction on the \(z\)-side is the diagonal projection \(R_{z^{(\ell)}}\) from Ch. 1 — a \(0/1\) matrix picking out the active neurons. Output edges are either identity (regression) or nonlinear (sigmoid, softmax — Ch. 10). Pinning the input vertex stalk \(\mathcal{F}(v_x)\) to a concrete input vector is exactly the Dirichlet boundary data of Ch. 2.

The one subtlety is that the ReLU edge’s restriction map \(R_{z^{(\ell)}}\) depends on the sign of \(z^{(\ell)}\), which depends on the current cochain. So \(\mathcal{F}\) is a state-dependent sheaf: its restriction maps change as the cochain moves. For a fixed activation pattern \(\sigma \in \{0,1\}^N\), the sheaf is piecewise constant; across activation-region boundaries (Ch. 1) it switches. Chs. 8–9 will exploit the fact that within a pattern, the sheaf is linear and well-behaved, and that Dirichlet energy is a common Lyapunov function across all patterns.

Intuition device (planned): Side-by-side wiring diagrams: PyTorch computation graph on the left, cellular sheaf on the path graph on the right, with arrows linking each layer to its edge.

9.4 Formal development

The base graph

Fix a feedforward ReLU network with \(k\) hidden layers, widths \(n_1, \ldots, n_k\), input dimension \(n_0\), and output dimension \(n_{k+1}\), with weights \(W^{(\ell)} \in \mathbb{R}^{n_\ell \times n_{\ell-1}}\) and biases \(b^{(\ell)} \in \mathbb{R}^{n_\ell}\) for \(\ell = 1, \ldots, k+1\). Following [1] §3.1, associate to this network a path graph \(G = (V, E)\) with \(2k + 3\) vertices \[V = \{v_x,\, v_{z^{(1)}},\, v_{a^{(1)}},\, v_{z^{(2)}},\, v_{a^{(2)}},\, \ldots,\, v_{z^{(k)}},\, v_{a^{(k)}},\, v_{z^{(k+1)}},\, v_{\hat{y}}\}\] and \(2k+2\) edges ordered along the path: \[E = \{\underbrace{e_1^{\text{aff}}, e_1^{\text{act}}}_{\text{layer 1}}, \underbrace{e_2^{\text{aff}}, e_2^{\text{act}}}_{\text{layer 2}}, \ldots, \underbrace{e_k^{\text{aff}}, e_k^{\text{act}}}_{\text{layer } k}, \underbrace{e_{k+1}^{\text{aff}}, e^{\text{out}}}_{\text{output}}\}.\] Each edge is oriented in the forward direction: \(e_\ell^{\text{aff}} : v_{a^{(\ell-1)}} \to v_{z^{(\ell)}}\) (with \(a^{(0)} := x\)) and \(e_\ell^{\text{act}} : v_{z^{(\ell)}} \to v_{a^{(\ell)}}\). The output edge \(e^{\text{out}} : v_{z^{(k+1)}} \to v_{\hat{y}}\) carries the final activation.

Vertex stalks

Assign the stalks in the natural dimension: \[\mathcal{F}(v_x) = \mathbb{R}^{n_0}, \qquad \mathcal{F}(v_{z^{(\ell)}}) = \mathcal{F}(v_{a^{(\ell)}}) = \mathbb{R}^{n_\ell}, \qquad \mathcal{F}(v_{\hat{y}}) = \mathbb{R}^{n_{k+1}}.\] A 0-cochain \(c \in C^0(G; \mathcal{F})\) is a tuple \((c_x, c_{z^{(1)}}, c_{a^{(1)}}, \ldots, c_{\hat{y}})\) with \(c_{z^{(\ell)}}, c_{a^{(\ell)}} \in \mathbb{R}^{n_\ell}\). The interpretation is that the cochain stores one candidate value for every intermediate quantity in the network, independently of any forward-pass evaluation.

Edge types and restriction maps

Affine edges. For \(\ell = 1, \ldots, k+1\), the affine edge \(e_\ell^{\text{aff}}\) has stalk \(\mathcal{F}(e_\ell^{\text{aff}}) = \mathbb{R}^{n_\ell}\) and restriction maps \[\mathcal{F}_{v_{a^{(\ell-1)}} \trianglelefteq e_\ell^{\text{aff}}} = W^{(\ell)}, \qquad \mathcal{F}_{v_{z^{(\ell)}} \trianglelefteq e_\ell^{\text{aff}}} = I_{n_\ell}.\] The bias \(b^{(\ell)}\) enters as an edge offset, so the edge agreement condition is \(W^{(\ell)} c_{a^{(\ell-1)}} + b^{(\ell)} = c_{z^{(\ell)}}\). We absorb the offset into the coboundary by working in an affine cochain space, or equivalently by appending a constant “\(1\)” coordinate to each stalk and treating the bias as a column of \(W^{(\ell)}\); we adopt the latter convention for compactness.

Activation (ReLU) edges. For \(\ell = 1, \ldots, k\), the activation edge \(e_\ell^{\text{act}}\) has stalk \(\mathcal{F}(e_\ell^{\text{act}}) = \mathbb{R}^{n_\ell}\) and restriction maps \[\mathcal{F}_{v_{z^{(\ell)}} \trianglelefteq e_\ell^{\text{act}}} = R_{c_{z^{(\ell)}}}, \qquad \mathcal{F}_{v_{a^{(\ell)}} \trianglelefteq e_\ell^{\text{act}}} = I_{n_\ell},\] where \(R_{c_{z^{(\ell)}}} = \mathrm{diag}\!\left(\mathbf{1}[c_{z^{(\ell)}}_j \geq 0]\right)\) is the diagonal ReLU matrix from Ch. 1. The edge agreement condition is \(R_{c_{z^{(\ell)}}} c_{z^{(\ell)}} = c_{a^{(\ell)}}\), i.e. the post-activation equals \(\mathrm{ReLU}\) applied to the pre-activation.

Output edge. The output edge \(e^{\text{out}}\) has stalk \(\mathcal{F}(e^{\text{out}}) = \mathbb{R}^{n_{k+1}}\) and restriction maps \[\mathcal{F}_{v_{z^{(k+1)}} \trianglelefteq e^{\text{out}}} = \varphi, \qquad \mathcal{F}_{v_{\hat{y}} \trianglelefteq e^{\text{out}}} = I_{n_{k+1}},\] where \(\varphi\) is the task-specific output activation (identity for regression; sigmoid or softmax for classification). When \(\varphi\) is nonlinear, this restriction map is no longer linear, and the coboundary on that edge must be interpreted in the local-adjoint sense of Ch. 10; for the linear/identity case treated through Ch. 9, \(\varphi = I\) and the output edge is just an identity edge that can be eliminated by Schur complement (App. A.1 of the paper).

The state-dependent sheaf

Definition 9.1 Definition 7.1 (Neural sheaf). The neural sheaf \(\mathcal{F} = \mathcal{F}_{\theta, c}\) associated to a feedforward ReLU network with parameters \(\theta = (W^{(\ell)}, b^{(\ell)})\) and cochain \(c \in C^0(G; \mathcal{F})\) is the cellular sheaf on the path graph \(G\) defined by the vertex stalks, edge stalks, and restriction maps above.

The sheaf depends on two things: the parameters \(\theta\), which fix the affine restriction maps; and the cochain \(c\), which fixes the ReLU projections \(R_{c_{z^{(\ell)}}}\) on the activation edges. The activation pattern of \(c\) is the binary vector \(\sigma(c) \in \{0,1\}^N\) (where \(N = \sum_{\ell=1}^k n_\ell\)) encoding the signs of all \(c_{z^{(\ell)}}_j\); holding \(\sigma\) fixed, \(\mathcal{F}\) is a fixed cellular sheaf in the sense of Def. 4.1, and all constructions of Chs. 4–6 apply verbatim.

The coboundary in block form

Writing \(\delta\) in the block structure dictated by the vertex ordering \((v_x, v_{z^{(1)}}, v_{a^{(1)}}, \ldots, v_{\hat{y}})\) gives a block bidiagonal matrix. For \(\ell = 1, \ldots, k\), the pair of rows for layer \(\ell\) reads \[\begin{pmatrix} -W^{(\ell)} & I & 0 & 0 \\ 0 & -R_{c_{z^{(\ell)}}} & I & 0 \end{pmatrix}\] acting on \((c_{a^{(\ell-1)}}, c_{z^{(\ell)}}, c_{a^{(\ell)}}, c_{a^{(\ell+1)}})\)-blocks; the output rows are \((-W^{(k+1)}, I)\) and \((-\varphi, I)\). The identities on the “downstream” column are the observation used in Ch. 8 to prove that the restricted coboundary \(\delta_\Omega\) is unitriangular.

Pinning

Definition 9.2 Definition 7.2 (Pinned input). For a given input \(x \in \mathbb{R}^{n_0}\), the input-pinned configuration fixes \(c_{x} = x\) and leaves the remaining vertex stalks free. The boundary is \(U = \{v_x\}\) and the interior (free) vertices are \(\Omega = V \setminus U\). Two-sided pinning (Ch. 11) adds \(v_{\hat{y}}\) to \(U\) with the supervised target.

Pinning is the sheaf-theoretic form of supplying boundary data to a Dirichlet problem (Def. 2.3). Solving the Dirichlet problem on the pinned neural sheaf is the content of Ch. 8.

Remark 7.3 (Stalk dimensions). Summed over all vertices, \(\dim C^0(G; \mathcal{F}) = n_0 + 2 \sum_{\ell=1}^k n_\ell + n_{k+1} + n_{k+1}\); summed over all edges, \(\dim C^1(G; \mathcal{F}) = \sum_{\ell=1}^{k+1} n_\ell + \sum_{\ell=1}^k n_\ell + n_{k+1}\). After pinning the input (dropping \(n_0\)) the two counts match: \(\dim C^1 = \dim C^0 - n_0\). This dimensional coincidence\(\delta_\Omega\) is square — is the path-graph fact that makes Lemma 3.2 even statable. Chapter 14 returns to architectures where this coincidence fails.

9.5 Theorem demonstrations

The only result in this chapter that requires justification is that Def. 7.1 actually produces a cellular sheaf in the sense of Def. 4.1, and the claim in Remark 7.3 about dimensions.

Def. 7.1 is a well-defined cellular sheaf (for each fixed activation pattern \(\sigma\)). Recall from Ch. 4 that a cellular sheaf on \(G\) is specified by (i) a vector space \(\mathcal{F}(\tau)\) for each cell \(\tau\) and (ii) a linear restriction map \(\mathcal{F}_{v \trianglelefteq e}\) for each incident pair \(v \trianglelefteq e\). We have exhibited both: the vertex and edge stalks in §7.1 are finite-dimensional real vector spaces, and for each edge we have written down linear maps from each endpoint stalk to the edge stalk. The only subtlety is the activation edge: with \(\sigma\) fixed, the matrix \(R_{c_{z^{(\ell)}}}\) reduces to the constant diagonal \(R_\sigma\) and the restriction map is a single fixed linear map. Nothing in Def. 4.1 requires the restriction maps to be injective, surjective, or otherwise structured, so \(\mathcal{F}\) is a cellular sheaf as claimed. When \(\sigma\) is allowed to vary with the cochain, \(\mathcal{F}\) becomes a state-dependent sheaf — a family of cellular sheaves \(\{\mathcal{F}_\sigma\}_{\sigma \in \{0,1\}^N}\) indexed by the activation pattern, with switching surfaces where two patterns agree on the sheaf; this is precisely the setting of Chs. 9–10. \(\square\)

Remark 7.3 (dimensional coincidence). We verify \(\dim C^1 = \dim C^0 - n_0\). Summing edge stalks, \[\dim C^1(G; \mathcal{F}) = \underbrace{\sum_{\ell=1}^{k+1} n_\ell}_{\text{affine edges}} + \underbrace{\sum_{\ell=1}^{k} n_\ell}_{\text{ReLU edges}} + \underbrace{n_{k+1}}_{\text{output edge}} = 2\sum_{\ell=1}^{k} n_\ell + 2 n_{k+1}.\] Summing vertex stalks, \[\dim C^0(G; \mathcal{F}) = n_0 + 2\sum_{\ell=1}^k n_\ell + n_{k+1} + n_{k+1} = n_0 + 2\sum_{\ell=1}^k n_\ell + 2 n_{k+1}.\] Subtracting gives \(\dim C^0 - \dim C^1 = n_0\), so pinning the input (removing \(n_0\) rows from \(C^0\)) produces a square coboundary \(\delta_\Omega : C^0(\Omega; \mathcal{F}) \to C^1(G; \mathcal{F})\). This is the structural fact behind Lem. 8.2. \(\square\)

9.6 Worked example: the neural sheaf for a [2, 2, 1] network

To get the construction on paper, we trim from the paper’s [2, 4, 1] running network to an even smaller [2, 2, 1] network. The smaller width makes every stalk and every restriction map easy to write down in one line.

Parameters. Take \(k = 1\) hidden layer with widths \(n_0 = 2\), \(n_1 = 2\), \(n_2 = 1\), and \[W^{(1)} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad b^{(1)} = \begin{pmatrix} 0 \\ -1 \end{pmatrix}, \quad W^{(2)} = (1, \ 1), \quad b^{(2)} = 0,\] with identity output \(\varphi = I\).

Base graph. Path graph with \(2k + 3 = 5\) vertices \[V = \{v_x,\ v_{z^{(1)}},\ v_{a^{(1)}},\ v_{z^{(2)}},\ v_{\hat{y}}\},\] and \(2k + 2 = 4\) edges \(E = \{e_1^{\text{aff}}, e_1^{\text{act}}, e_2^{\text{aff}}, e^{\text{out}}\}\) oriented left-to-right.

Stalks. \(\mathcal{F}(v_x) = \mathbb{R}^2\), \(\mathcal{F}(v_{z^{(1)}}) = \mathcal{F}(v_{a^{(1)}}) = \mathbb{R}^2\), \(\mathcal{F}(v_{z^{(2)}}) = \mathcal{F}(v_{\hat{y}}) = \mathbb{R}\). Edge stalks match the downstream vertex: \(\mathcal{F}(e_1^{\text{aff}}) = \mathcal{F}(e_1^{\text{act}}) = \mathbb{R}^2\), \(\mathcal{F}(e_2^{\text{aff}}) = \mathcal{F}(e^{\text{out}}) = \mathbb{R}\).

Restriction maps. Following Ch. 7.1: \[\begin{array}{|l|l|l|} \hline \text{edge } e & \mathcal{F}_{\text{upstream} \trianglelefteq e} & \mathcal{F}_{\text{downstream} \trianglelefteq e} \\ \hline e_1^{\text{aff}} & W^{(1)} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} & I_2 \\ e_1^{\text{act}} & R_{c_{z^{(1)}}} = \mathrm{diag}(\mathbf{1}[c^1_1 \geq 0],\, \mathbf{1}[c^1_2 \geq 0]) & I_2 \\ e_2^{\text{aff}} & W^{(2)} = (1, \ 1) & I_1 \\ e^{\text{out}} & I_1 & I_1 \\ \hline \end{array}\]

The bias \(b^{(1)} = (0, -1)^T\) enters as a constant offset on \(e_1^{\text{aff}}\); we absorb it by augmenting \(W^{(1)}\) with a bias column and \(x\) with a “1” coordinate (as discussed in Ch. 7.1).

Activation pattern at \(x = (1, 2)\). Pre-activation \(z^{(1)} = W^{(1)} x + b^{(1)} = (1, \ 1)^T\), so both hidden neurons are active: \(\sigma = (1, 1)\) and \(R_{z^{(1)}} = I_2\). Post-activation \(a^{(1)} = R_{z^{(1)}} z^{(1)} = (1, 1)^T\). Output \(z^{(2)} = W^{(2)} a^{(1)} + b^{(2)} = 2\) and \(\hat{y} = 2\).

Activation pattern at \(x = (-1, -1)\). Pre-activation \(z^{(1)} = (-1, -2)^T\), so both hidden neurons are inactive: \(\sigma = (0, 0)\) and \(R_{z^{(1)}} = 0_{2 \times 2}\). Post-activation \(a^{(1)} = 0\). Output \(z^{(2)} = 0\), \(\hat{y} = 0\). Notice the sheaf’s restriction map on \(e_1^{\text{act}}\) has changed — this is the state-dependence of Def. 7.1 made visible.

Dimensional check. \(\dim C^0(G; \mathcal{F}) = 2 + 2 + 2 + 1 + 1 = 8\); pinning \(v_x\) drops \(2\), leaving \(\dim C^0(\Omega; \mathcal{F}) = 6\). \(\dim C^1(G; \mathcal{F}) = 2 + 2 + 1 + 1 = 6\). The coincidence \(6 = 6\) is Remark 7.3 in action: \(\delta_\Omega\) is square. Ch. 8 will exploit this to solve the Dirichlet problem by back-substitution.

9.7 Coding lab

Lab 07 — Build the Neural Sheaf — Construct the neural sheaf for an arbitrary [\({n_0}, {n_1}, \ldots, n_{k+1}\)] ReLU MLP as a data structure: vertex list, edge list, edge-type labels, restriction-map callables. Populate it from a trained PyTorch model, evaluate the coboundary \(\delta(\sigma)\) on a concrete input, and verify the dimensional coincidence \(\dim C^1 = \dim C^0 - n_0\) (Rem. 7.3). Includes a side-by-side visualization of the PyTorch computation graph and the sheaf’s path-graph base.

9.8 Exercises

  1. (warm-up) For the [2, 2, 1] worked example with \(x = (0, 0.5)\), compute the activation pattern \(\sigma\), write down \(R_{z^{(1)}}\), and enumerate the restriction maps on each of the four edges.
  2. (warm-up) Verify Rem. 7.3 for a [3, 5, 4, 2] network: count \(\dim C^0\), \(\dim C^1\), and \(\dim \Omega = \dim C^0 - n_0\).
  3. (intermediate) Show that re-absorbing the bias \(b^{(\ell)}\) by appending a constant “\(1\)” coordinate to \(c_{a^{(\ell-1)}}\) and a bias column to \(W^{(\ell)}\) preserves the restriction-map formulas, and verify the pinned boundary data still lives in \(\mathbb{R}^{n_0}\) (not \(\mathbb{R}^{n_0 + 1}\)).
  4. (intermediate) For the [2, 2, 1] network with \(W^{(1)}\) replaced by \(\mathrm{diag}(-1, 1)\), find all activation regions of the pre-activation at layer 1 in the input domain \([-2, 2]^2\) and sketch them.
  5. (project) Pick a small pretrained MLP (e.g., MNIST classifier with two hidden layers of width 16). Build its neural sheaf and serialize \(\delta(\sigma)\) as a sparse matrix for a single input. Report matrix density, eigenvalue histogram of \(L_{\text{free}}(\sigma)\), and compare with the prediction of Prop. 13.4.

9.9 Further reading

[1] §3 is the primary reference for the construction, with the state-dependent subtlety noted in Remark 1.4. [2] Ch. 2–3 develops cellular sheaves abstractly; read alongside [3] for the sheaf-diffusion viewpoint on networks. For the path-graph vs DAG distinction that becomes Ch. 14’s territory, [4] motivates why path graphs are the comfortable case and where GNN extensions need more machinery. [5] sets the weight initialization used throughout our worked examples.

9.10 FAQ / common misconceptions

Is the sheaf the same thing as the computation graph? No. The computation graph is a directed object with arrows; the sheaf lives on the underlying undirected graph and encodes operations as restriction-map pairs (upstream and downstream). The asymmetry of “information flows forward” is recovered from the unitriangular structure of \(\delta_\Omega\) (Ch. 8), not baked into the sheaf.

Why does the ReLU edge have a state-dependent restriction map? Because ReLU is not a single linear map — it is a family of linear projections indexed by the activation pattern. The sheaf picks a fresh projection at each cochain. Within a single activation region, everything is linear and ordinary.

Don’t I have to pin the output vertex too? Only for training (Ch. 11). For plain inference, pinning the input alone suffices: the output falls out of the harmonic extension.

What changes if I use tanh instead of ReLU? The activation edge’s restriction map becomes a smooth Jacobian \(J_{\tanh}\) rather than a \(0/1\) projection. Chs. 9–10 still work with the local-adjoint machinery, but the common-Lyapunov argument picks up sector-monotonicity constants instead of the clean switching-surface continuity (see Prop. 14.4).