12 Nonlinear Output Activations and Local Adjoints
Purpose. Extends Theorem 4.1 to networks whose output edge is itself a nonlinear map (sigmoid, softmax), using local adjoint structure.
12.1 Key concepts & results
- The augmented Laplacian when the output edge is nonlinear.
- Local adjoints: each nonlinear restriction map contributes a Jacobian-transpose term that plays the role of backward information flow on that edge.
- Theorem 4.2: exponential convergence under a Lipschitz + monotonicity (sector) hypothesis on the output.
- Connection to backpropagation: the local-adjoint pieces are exactly the building blocks that backprop assembles globally.
Prerequisites: Ch 9
12.2 Motivating example
Take the circular binary-classification task of [1] §6 and attach a sigmoid output head to the [2, 4, 1] sheaf. The output edge is no longer the identity: its stalk carries the probability \(\varphi(z^{(2)}) = \sigma(z^{(2)})\), and the supervised target \(y \in \{0, 1\}\) enters as the Dirichlet data on the output vertex. At equilibrium the interior cochain balances two forces — the affine and ReLU edges pulling it toward the forward-pass value of \(z^{(2)}\), and the sigmoid edge pulling \(z^{(2)}\) toward whatever value makes \(\sigma(z^{(2)}) \approx y\). The pull from the output edge, written in the cochain frame, turns out to be exactly \(J_\varphi^T (\varphi(z^{(2)}) - y)\) — the same quantity backpropagation computes as the output-layer gradient of binary cross-entropy. The sheaf’s local adjoint on the sigmoid edge is the Jacobian transpose; backpropagation is nothing more than the concatenation of these local adjoints along the path graph.
12.3 Intuition
With identity output the Laplacian is linear and \(L_\mathcal{F} = \delta^T \delta\), so Ch. 9’s common-Lyapunov argument works verbatim. With a nonlinear output, the restriction map on the output edge is a smooth map \(\varphi : \mathbb{R}^{n_{k+1}} \to \mathbb{R}^{n_{k+1}}\) (sigmoid, softmax, tanh), so \(\delta\) no longer acts linearly there. The natural move — the one [1] borrow from Gould’s thesis (Def. 7.2.9) — is to replace “coboundary on the output edge” by its local adjoint, i.e. the Jacobian \(J_\varphi\) at the current cochain, transposed. The flow equation on the output vertex then reads \(\dot{z}^{(k+1)} = -J_\varphi^T (\varphi(z^{(k+1)}) - y) - (\text{interior pull})\), and the Dirichlet energy \(E\) is replaced by the augmented energy \(\tfrac{1}{2} \|\delta_{\text{int}} x\|^2 + \tfrac{1}{2} \|\varphi(z^{(k+1)}) - y\|^2\).
Two hypotheses make everything go through (Theorem 4.2 of [1]): \(\varphi\) is Lipschitz (uniform bound on \(\|J_\varphi\|\), so the output force is bounded) and sector-monotone (the Jacobian is uniformly positive in an appropriate sense, so the output edge cannot add energy). Sigmoid, softmax, and tanh all satisfy both by direct computation. Under these, the augmented Laplacian is uniformly positive-definite on the free vertices, Dirichlet energy (augmented) is still a common Lyapunov function across activation patterns, and convergence is still exponential.
A pleasant algebraic cancellation makes classification particularly clean. For softmax with cross-entropy loss, or sigmoid with binary cross-entropy, the Jacobian–gradient product collapses to the residual: \(J_\varphi^T \nabla_{\varphi} L = \varphi(z) - y\) (paper Eq. 27). In the sheaf picture this says that the output-edge force for classification is identical to the output-edge force for regression — just with \(\varphi(z)\) in place of \(z\). So classification and regression fit under a single local-update rule with no special-casing. This identity is also why the rest of the book can keep treating “output edge” as a conceptually uniform thing, swapping the nonlinearity in and out at will.
Intuition device (planned): Side-by-side ‘backprop on a node ↔︎ local adjoint on an edge’ diagram.
12.4 Formal development
Fix a ReLU network with a nonlinear output activation \(\varphi : \mathbb{R}^{n_{k+1}} \to \mathbb{R}^{n_{k+1}}\) (sigmoid, softmax, tanh), its neural sheaf \(\mathcal{F}\) from Ch. 7, and a pinned input \(x\). Unlike in Ch. 9, the output edge’s restriction map on the \(z^{(k+1)}\)-side is the nonlinear map \(\varphi\), not a matrix.
The nonlinear output edge
The edge \(e^{\text{out}} : v_{z^{(k+1)}} \to v_{\hat{y}}\) carries the agreement condition \(\varphi(c_{z^{(k+1)}}) = c_{\hat{y}}\). The corresponding edge contribution to Dirichlet energy is \[E^{\text{out}}(c) = \tfrac{1}{2} \|\varphi(c_{z^{(k+1)}}) - c_{\hat{y}}\|^2. \tag{10.1}\] In a regression-style setup with \(c_{\hat{y}}\) free, the minimum of (10.1) at fixed \(c_{z^{(k+1)}}\) is attained at \(c_{\hat{y}} = \varphi(c_{z^{(k+1)}})\) and equals zero. In a classification/training setup (Ch. 11), \(c_{\hat{y}}\) is pinned to the target \(y\), and (10.1) becomes the standard squared-residual loss — though paired with sigmoid/softmax it reproduces BCE/cross-entropy after the cancellation in Remark 10.8.
Local adjoints
Following [2] (Def. 7.2.9), the local contribution of a nonlinear restriction map to the sheaf coboundary is governed by its local adjoint.
Definition 12.1 Definition 10.1 (Local adjoint). Let \(\varphi : \mathbb{R}^m \to \mathbb{R}^m\) be continuously differentiable. The local adjoint of \(\varphi\) at \(z\) is the linear map \[\varphi^*_z : \mathbb{R}^m \to \mathbb{R}^m, \qquad \varphi^*_z(w) := J_\varphi(z)^T w,\] where \(J_\varphi(z) = \partial \varphi / \partial z\) is the Jacobian.
The local adjoint plays, on each nonlinear edge, the role that \(F^T\) plays on each linear edge in the construction of the sheaf Laplacian (Ch. 5). Concretely, the output-edge contribution to the force \(\nabla_c E\) at the \(z^{(k+1)}\)-vertex is \[\nabla_{c_{z^{(k+1)}}} E^{\text{out}}(c) = J_\varphi(c_{z^{(k+1)}})^T \bigl(\varphi(c_{z^{(k+1)}}) - c_{\hat{y}}\bigr). \tag{10.2}\]
The augmented Laplacian and heat equation
Combining (10.2) with the linear contributions of Chs. 8–9 gives the augmented Laplacian acting on free cochains \(y = c|_\Omega\): \[L_{\text{free}}^{\text{aug}}(\sigma, y) := L_{\text{free}}^{\text{lin}}(\sigma) + L_{\text{free}}^{\text{out}}(y), \tag{10.3}\] where \(L_{\text{free}}^{\text{lin}}(\sigma)\) is the free Laplacian of Chs. 8–9 restricted to the linear (affine + ReLU) edges, and \(L_{\text{free}}^{\text{out}}(y)\) is the outer-product contribution of the output edge — schematically, \(L_{\text{free}}^{\text{out}}(y)\) has a block \(J_\varphi(c_{z^{(k+1)}})^T J_\varphi(c_{z^{(k+1)}})\) at the \(z^{(k+1)}\)-diagonal.
Definition 12.2 Definition 10.2 (Neural-sheaf heat equation, nonlinear output). The nonlinear-output heat equation is the gradient flow \[\dot{y}(t) = -L_{\text{free}}^{\text{aug}}(\sigma(y(t)), y(t))\, (y(t) - y^\star) \tag{10.4}\] on free cochains, interpreted as a Filippov solution at switching surfaces as in Def. 9.2.
Because \(L_{\text{free}}^{\text{out}}\) depends smoothly on \(y\) (the Jacobian \(J_\varphi\) is continuous), the switching structure of (10.4) is inherited entirely from the ReLU edges — same as in Ch. 9. What is new is the \(y\)-dependence of the output block.
Regularity and sector hypotheses
Definition 12.3 Definition 10.3 (Lipschitz + sector output). The output map \(\varphi\) is Lipschitz if \(\|J_\varphi(z)\| \leq L_\varphi\) for all \(z\), and sector-monotone with constants \(0 < m \leq M\) if \[m\, I \preceq \tfrac{1}{2}\bigl(J_\varphi(z) + J_\varphi(z)^T\bigr) \preceq M\, I \quad \text{for all } z.\]
Sigmoid, softmax, and tanh all satisfy Def. 10.3 with explicit bounds; ReLU does not (its Jacobian is not continuous, and degenerates to zero on a half-space). For \(\varphi\) satisfying Def. 10.3, the output-edge contribution \(L_{\text{free}}^{\text{out}}(y)\) is uniformly bounded and positive semidefinite.
Theorem 4.2: convergence
Theorem 12.1 Theorem 10.4 (Convergence with nonlinear output — Theorem 4.2 of [1]). Assume \(\varphi\) satisfies Def. 10.3. Then the augmented Laplacian \(L_{\text{free}}^{\text{aug}}(\sigma, y)\) is uniformly positive definite: there is \(\tilde{\lambda}_{\min}^{\text{free}} > 0\) depending only on \(\theta\), \(m\), \(M\), and \(L_\varphi\) such that \[\inf_{\sigma, y} \lambda_{\min}\!\left(L_{\text{free}}^{\text{aug}}(\sigma, y)\right) \geq \tilde{\lambda}_{\min}^{\text{free}}.\] In particular, Filippov solutions of (10.4) converge exponentially to the unique equilibrium \(y^\star_\varphi\), at rate \(\tilde{\lambda}_{\min}^{\text{free}}\).
The proof structure parallels Theorem 9.4: use the common Lyapunov function (augmented Dirichlet energy) with Theorem 9.3 replaced by an augmented version that absorbs \(L_{\text{free}}^{\text{out}}\) via the sector hypothesis.
Local adjoints as backpropagation
Proposition 10.5 (Backpropagation as concatenated local adjoints). Fix a pinned input \(x\) and pinned target \(y\) (Ch. 11). The gradient of the end-to-end loss \(L(\theta) = \tfrac{1}{2} \|\varphi(z^{(k+1)}(x; \theta)) - y\|^2\) with respect to parameters \(\theta\) equals, layer by layer, the composition of local adjoints applied to the output residual \(\varphi(z^{(k+1)}) - y\): \[\nabla_{W^{(\ell)}} L = \underbrace{\left(\prod_{j=\ell+1}^{k+1} (W^{(j)})^T R_{z^{(j-1)}}\right)}_{\text{linear local adjoints (aff + ReLU)}} \cdot \underbrace{J_\varphi(z^{(k+1)})^T (\varphi(z^{(k+1)}) - y)}_{\text{output local adjoint}} \cdot (a^{(\ell-1)})^T.\]
Proposition 10.5 is not a new identity — it is standard backpropagation. What the sheaf-theoretic packaging shows is that each factor is a local quantity attached to a single edge of the sheaf; the global gradient is the graph-theoretic composition of these local quantities along the path.
Remark 10.6 (Backprop vs forward local updates). The local-adjoint factors in Prop. 10.5 are assembled globally by backprop, which propagates the output residual through the network. The same factors appear, un-composed, as the per-edge forces in the joint state-parameter flow of Ch. 11 — where each edge updates using only its own endpoints. Prop. 10.5 is the bridge between the two: backprop equals forward local updates at a fixed point of the fast cochain dynamics.
Remark 10.7 (Why sigmoid and softmax “just work”). Sigmoid has \(J_\sigma(z) = \mathrm{diag}(\sigma(z) \odot (1 - \sigma(z)))\), with spectrum in \((0, 1/4]\) — uniformly Lipschitz and sector-monotone on any compact set. Softmax has \(J_{\mathrm{softmax}}(z) = \mathrm{diag}(p) - pp^T\) where \(p = \mathrm{softmax}(z)\); its spectrum lies in \([0, 1]\). Both are sector-monotone on their respective domains, so Theorem 10.4 applies.
Remark 10.8 (The Jacobian–gradient cancellation, paper Eq. 27). For softmax with cross-entropy loss \(L = -\sum_i y_i \log p_i\) (or sigmoid with BCE), a direct computation gives \[J_\varphi(z)^T \nabla_\varphi L = \varphi(z) - y,\] so the “local adjoint applied to the \(\nabla_\varphi\) loss” is the residual. Substituting into (10.2), the output force for classification reads simply \(\varphi(z) - y\) — the same form as the regression residual. This is why classification fits Theorem 10.4 with no special-casing: the sheaf’s local-update rule is the same for regression and classification when the output activation and loss are matched in the usual way.
12.5 Theorem demonstrations
Proof of Thm. 10.4. We show the augmented free Laplacian is uniformly positive definite. Decompose \[L_{\text{free}}^{\text{aug}}(\sigma, y) = L_{\text{free}}^{\text{lin}}(\sigma) + L_{\text{free}}^{\text{out}}(y).\] By Cor. 8.4, \(L_{\text{free}}^{\text{lin}}(\sigma) \succeq \lambda_{\min}^{\text{free}} I\) uniformly in \(\sigma\). The output-edge contribution equals, in the \(z^{(k+1)}\)-block, \[L_{\text{free}}^{\text{out}}(y)|_{z^{(k+1)}, z^{(k+1)}} = J_\varphi(c_{z^{(k+1)}})^T J_\varphi(c_{z^{(k+1)}}),\] with zeros elsewhere. The sector hypothesis (Def. 10.3) gives \(J_\varphi + J_\varphi^T \succeq 2m\, I\), so \(J_\varphi^T J_\varphi\) has all eigenvalues \(\geq m^2\) (by polar decomposition \(J_\varphi = U\Sigma V^T\): the sector bound forces \(\Sigma \succeq m\, I\)). Hence \(L_{\text{free}}^{\text{out}}(y) \succeq 0\), and \[L_{\text{free}}^{\text{aug}}(\sigma, y) \succeq L_{\text{free}}^{\text{lin}}(\sigma) \succeq \lambda_{\min}^{\text{free}} I.\] Take \(\tilde\lambda_{\min}^{\text{free}} = \lambda_{\min}^{\text{free}}\); if one further uses the output-block contribution, \(\tilde\lambda_{\min}^{\text{free}} \geq \lambda_{\min}^{\text{free}} + m^2 / \|L_{\text{free}}^{\text{lin}}\|_{\text{op}}^{-1}\), but the bare bound suffices.
For exponential convergence, use the augmented Dirichlet energy \[\tilde V(y) := \tfrac{1}{2} \|\delta_\Omega^{\text{lin}}(\sigma(y)) (y - y^\star_\varphi)\|^2 + \tfrac{1}{2} \|\varphi(c_{z^{(k+1)}}) - y\|^2\] (with \(y\) the pinned target). Continuity across ReLU switching surfaces follows exactly as in Thm. 9.3 (the output term is continuous in \(y\) since \(\varphi\) is continuous). Along the flow (10.4), \[\dot{\tilde V} = -(y - y^\star_\varphi)^T \bigl(L_{\text{free}}^{\text{aug}}\bigr)^2 (y - y^\star_\varphi) \leq -2 \tilde\lambda_{\min}^{\text{free}}\, \tilde V,\] so \(\tilde V(y(t)) \leq e^{-2 \tilde\lambda_{\min}^{\text{free}} t}\, \tilde V(y(0))\) and the claim follows as in Thm. 9.4. \(\square\)
Proof sketch of Prop. 10.5. Write the end-to-end loss as the composition \(L = \ell \circ \varphi \circ z^{(k+1)} \circ a^{(k)} \circ \cdots \circ z^{(\ell)} \circ (W^{(\ell)}, b^{(\ell)})\) and apply the chain rule, remembering that \(\partial z^{(j)} / \partial a^{(j-1)} = W^{(j)}\) and \(\partial a^{(j)} / \partial z^{(j)} = R_{z^{(j)}}\) (with the usual subgradient convention at zero). The resulting telescoping product is \[\nabla_{W^{(\ell)}} L = \left(\prod_{j=\ell+1}^{k+1} (W^{(j)})^T R_{z^{(j-1)}}\right) J_\varphi(z^{(k+1)})^T \nabla_\varphi \ell \cdot (a^{(\ell-1)})^T.\] Each factor is the local adjoint of one edge of the neural sheaf: affine edges contribute \(W^T\), ReLU edges contribute \(R_z\) (which is its own transpose), the output edge contributes \(J_\varphi^T\). The outer rank-one factor \((a^{(\ell-1)})^T\) is the parameter-gradient of the edge \(e_\ell^{\text{aff}}\) — the one direct dependence on \(W^{(\ell)}\). For matched output/loss pairs (softmax+CE, sigmoid+BCE), the Jacobian-gradient identity of Rem. 10.8 collapses \(J_\varphi^T \nabla_\varphi \ell\) to \(\varphi(z^{(k+1)}) - y\). \(\square\)
12.6 Worked example: sigmoid head on the [2, 2, 1] sheaf
Continue the [2, 2, 1] sheaf. Replace the identity output \(\varphi = I\) with a sigmoid head \(\varphi(z) = \sigma(z) = 1/(1 + e^{-z})\) and pin the output vertex to a binary target \(y \in \{0, 1\}\). The interior structure (hidden layer, ReLU, second affine) is unchanged from Chs. 7–8.
Output-edge potential. The edge \(e^{\text{out}}\) now carries \[E^{\text{out}}(c) = \tfrac{1}{2} \bigl(\sigma(c_{z^{(2)}}) - c_{\hat{y}}\bigr)^2.\] With \(c_{\hat{y}} = y\) pinned, this is the standard squared-residual loss in sigmoid-output coordinates.
Jacobian. The Jacobian of sigmoid is the scalar \(\sigma'(z) = \sigma(z)(1 - \sigma(z)) \in (0, 1/4]\).
Output-edge force. By (10.2), \[\nabla_{c_{z^{(2)}}} E^{\text{out}} = \sigma'(c_{z^{(2)}})\bigl(\sigma(c_{z^{(2)}}) - y\bigr).\]
BCE-style cancellation. Replace the squared residual by binary cross-entropy \[L^{\text{BCE}}(c) = -y \log \sigma(c_{z^{(2)}}) - (1 - y) \log(1 - \sigma(c_{z^{(2)}})).\] Direct differentiation: \[\frac{\partial L^{\text{BCE}}}{\partial c_{z^{(2)}}} = \sigma(c_{z^{(2)}}) - y.\] The factor \(\sigma'(c_{z^{(2)}})\) from the squared-residual force has canceled. The output-edge force under BCE is just the residual \(\sigma(c_{z^{(2)}}) - y\) — the same algebraic form as the regression residual \(c_{z^{(2)}} - y\) in Chs. 8–9, with \(\sigma(c_{z^{(2)}})\) in place of \(c_{z^{(2)}}\). This is the Jacobian–gradient identity of Remark 10.8 / paper Eq. 27, written out in one concrete case.
Numerical instance. Take \(x = (1, 2)^T\) with the weights from Ch. 7, so \(c_{z^{(2)}} = 2\) and \(\sigma(c_{z^{(2)}}) = 1/(1 + e^{-2}) \approx 0.881\).
- If \(y = 1\): output-edge force under BCE is \(0.881 - 1 = -0.119\); the edge pulls \(c_{z^{(2)}}\) upward (toward \(+\infty\)). Under squared residual: force \(= \sigma'(2)(0.881 - 1) = 0.105 \cdot (-0.119) \approx -0.0125\), about \(9\times\) weaker.
- If \(y = 0\): force under BCE is \(0.881\); under squared residual \(0.105 \cdot 0.881 \approx 0.093\).
The \(9\times\) weakening near saturated sigmoids is the vanishing-gradient problem for squared-residual classification training; BCE with its Jacobian–gradient cancellation is the standard fix.
Augmented free Laplacian. At \(\sigma = (1, 1)\) (the pattern from Ch. 7 at \(x = (1, 2)\)) and cochain \(c = c^\star\), the augmented free Laplacian (10.3) differs from the identity-output case only in the diagonal block at \(c_{z^{(2)}}\), which becomes \[(L_{\text{free}}^{\text{aug}})_{z^{(2)}, z^{(2)}} = 2 + \sigma'(c_{z^{(2)}})^2 \approx 2.011 \quad (\text{BCE case: use } 2 + 1 = 3).\] Either way, strictly positive. The full \(L_{\text{free}}^{\text{aug}}\) remains symmetric positive definite, uniformly over \(\sigma\) and \(c\) — Theorem 10.4 applies.
Gradient of \(L^{\text{BCE}}\) w.r.t. \(W^{(2)}\) (backprop vs local). Standard backprop gives \[\nabla_{W^{(2)}} L^{\text{BCE}} = (\sigma(c_{z^{(2)}}) - y) \cdot a^{(1)} = -0.119 \cdot (1, 1) = (-0.119, -0.119)\] when \(y = 1\). The sheaf’s local update (Def. 12.3) at the equilibrium cochain produces the same rank-one outer product using only the two endpoints of edge \(e_2^{\text{aff}}\). This is Proposition 10.5 made concrete: backpropagation assembles the local adjoints; the sheaf uses them directly.
12.7 Coding lab
Lab 10 — Nonlinear Output Activation — Attach a sigmoid output to a [2, 8, 1] binary classifier on the circular task of [1] §6. Integrate the nonlinear-output heat equation (Def. 10.2) under two loss choices — squared residual vs BCE — and compare the output-edge force magnitude as the pre-output sweeps through the saturation regime. Verify numerically the Jacobian–gradient cancellation of Rem. 10.8: the BCE output force equals \(\sigma(z) - y\) while the squared-residual force picks up the extra factor \(\sigma'(z)\), and plot the resulting slowdown.
12.8 Exercises
- (warm-up) Show by direct differentiation that for sigmoid \(\varphi = \sigma\) and BCE loss \(L = -y\log\sigma(z) - (1-y)\log(1-\sigma(z))\), \(\partial L / \partial z = \sigma(z) - y\).
- (warm-up) Compute \(\|J_\sigma\|_{\text{op}}\) and verify sigmoid is Lipschitz with \(L_\varphi = 1/4\) and sector-monotone with \(m = 0\), \(M = 1/4\). Where does the sector lower bound fail, and how does Thm. 10.4 still apply?
- (intermediate) Do the same for softmax on \(K\) classes. Explicitly derive \(J_{\text{softmax}}(z) = \mathrm{diag}(p) - p p^T\) and show its spectrum lies in \([0, 1]\). What is the effective sector constant?
- (intermediate) Prove Prop. 10.5 by writing the end-to-end loss as a composition and applying the chain rule edge-by-edge. Mark which factor corresponds to which edge’s local adjoint.
- (project) Quantify the vanishing-gradient problem for squared-residual sigmoid training: initialize \(W^{(1)}\) so that \(|z^{(2)}| > 5\) for most training inputs and measure the per-step loss drop under (a) squared residual, (b) BCE. Report the wall-clock training gap and relate to the Jacobian–gradient cancellation.
- (advanced) Extend Def. 10.1 to a genuinely non-elementwise output (e.g., a spectral normalization layer). Is the local adjoint still well-defined? Is the sector hypothesis reasonable?
12.9 Further reading
[1] §4 Theorem 4.2 and Remark 4.4 are the primary references for nonlinear output convergence. The local-adjoint formalism comes from [2] Def. 7.2.9; read §7.1–7.3 for the broader implicit-function / declarative-networks viewpoint. The Jacobian–gradient cancellation is classical (see any standard neural-networks textbook’s treatment of matched output/loss pairs) but the sheaf-theoretic packaging is new to [1] Eq. 27. For the parallel predictive-coding picture of nonlinear outputs, see [3].
12.10 FAQ / common misconceptions
Is a local adjoint the same as a backward pass? Each local adjoint is a single edge’s backward contribution. Prop. 10.5 says that composing them along the path graph recovers the standard backward pass; conceptually they are backprop’s building blocks, not a replacement.
What breaks if the output activation is not sector-monotone? The augmented Laplacian’s output block can have a negative-definite component, and Thm. 10.4 no longer guarantees positive-definiteness of \(L_{\text{free}}^{\text{aug}}\). In practice, all standard output activations (sigmoid, softmax, tanh) satisfy the sector hypothesis; the one to watch is a custom output that mixes signed components.
Why does BCE work so much better than squared residual with sigmoid? Because the Jacobian factor \(\sigma'(z) = \sigma(z)(1-\sigma(z))\) vanishes at saturation, and BCE cancels it algebraically (Rem. 10.8). Squared residual leaves it in the force, so the gradient vanishes on saturated examples — the classic vanishing-gradient symptom that BCE fixes.
Do I have to use the matched output/loss pair? For Thm. 10.4 itself, no — any Lipschitz + sector-monotone output works. For the clean algebraic collapse to “force \(=\) residual,” yes, and this is what makes classification fit the regression-style formulas without special casing.