12  Nonlinear Output Activations and Local Adjoints

Purpose. Extends Theorem 4.1 to networks whose output edge is itself a nonlinear map (sigmoid, softmax), using local adjoint structure.

12.1 Key concepts & results

  • The augmented Laplacian when the output edge is nonlinear.
  • Local adjoints: each nonlinear restriction map contributes a Jacobian-transpose term that plays the role of backward information flow on that edge.
  • Theorem 4.2: exponential convergence under a Lipschitz + monotonicity (sector) hypothesis on the output.
  • Connection to backpropagation: the local-adjoint pieces are exactly the building blocks that backprop assembles globally.

Prerequisites: Ch. 9

12.2 Motivating example

Take the circular binary-classification task of [1] §6 and attach a sigmoid output head to the [2, 4, 1] sheaf. The output edge is no longer the identity: its stalk carries the probability \(\varphi(z^{(2)}) = \sigma(z^{(2)})\), and the supervised target \(y \in \{0, 1\}\) enters as the Dirichlet data on the output vertex. At equilibrium the interior cochain balances two forces: the affine and ReLU edges pull it toward the forward-pass value of \(z^{(2)}\), while the sigmoid edge pulls \(z^{(2)}\) toward whatever value makes \(\sigma(z^{(2)}) \approx y\). Written in the cochain frame, the pull from the output edge is exactly \(J_\varphi^T (\varphi(z^{(2)}) - y)\): the Jacobian transpose applied to the output residual, which is precisely the local building block that backpropagation chains together when it assembles the output-layer gradient of binary cross-entropy. The sheaf's local adjoint on the sigmoid edge is the Jacobian transpose; backpropagation is nothing more than the composition of these local adjoints along the path graph.
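The output-edge pull is concrete enough to compute by hand. Here is a minimal numpy sketch for a scalar sigmoid head (the function names are illustrative, not the book's lab code): the \(1 \times 1\) Jacobian of \(\sigma\) is \(\sigma'(z) = \sigma(z)(1-\sigma(z))\), so the pull is \(\sigma'(z)\,(\sigma(z) - y)\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_edge_pull(z, y):
    """Force the sigmoid output edge exerts on the cochain at z.

    For a scalar sigmoid head the Jacobian J_phi is just sigma'(z),
    so J_phi^T (phi(z) - y) = sigma'(z) * (sigma(z) - y)."""
    phi = sigmoid(z)
    J = phi * (1.0 - phi)   # sigma'(z), the 1x1 Jacobian of the output map
    return J * (phi - y)

# Near z = 0 the pull toward a mismatched target is strongest;
# at saturation (|z| large) the Jacobian factor damps it.
print(output_edge_pull(0.0, 1.0))   # 0.25 * (0.5 - 1.0) = -0.125
```

Note the sign: the pull vanishes both at a perfect fit (\(\varphi(z) = y\)) and at deep saturation, where \(\sigma'(z) \to 0\).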

12.3 Intuition

With identity output the Laplacian is linear and \(L_\mathcal{F} = \delta^T \delta\), so Ch. 9's common-Lyapunov argument works verbatim. With a nonlinear output, the restriction map on the output edge is a smooth map \(\varphi : \mathbb{R}^{n_{k+1}} \to \mathbb{R}^{n_{k+1}}\) (sigmoid, softmax, tanh), so \(\delta\) no longer acts linearly there. The natural move — the one [1] borrows from Gould's thesis (Def. 7.2.9) — is to replace "coboundary on the output edge" by its local adjoint, i.e. the Jacobian \(J_\varphi\) at the current cochain, transposed. The flow equation on the output vertex then reads \(\dot{z}^{(k+1)} = -J_\varphi^T (\varphi(z^{(k+1)}) - y) - (\text{interior pull})\), and the Dirichlet energy \(E\) is replaced by the augmented energy \(\tfrac{1}{2} \|\delta_{\text{int}} x\|^2 + \tfrac{1}{2} \|\varphi(z^{(k+1)}) - y\|^2\).
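To see the flow equation and the augmented energy interact, the sketch below integrates the output-vertex flow in isolation (interior pull set to zero, scalar sigmoid head, explicit Euler with a small step; all names are illustrative). Along the discretized flow the output-edge piece of the augmented energy is nonincreasing, which is the Lyapunov behavior Theorem 4.2 generalizes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def augmented_energy(z, y):
    # Output-edge piece of the augmented energy: (1/2) ||phi(z) - y||^2
    return 0.5 * (sigmoid(z) - y) ** 2

def flow_step(z, y, dt=0.1):
    # Explicit Euler on  z' = -J_phi^T (phi(z) - y), interior pull omitted
    phi = sigmoid(z)
    J = phi * (1.0 - phi)
    return z - dt * J * (phi - y)

z, y = -2.0, 1.0
energies = [augmented_energy(z, y)]
for _ in range(2000):
    z = flow_step(z, y)
    energies.append(augmented_energy(z, y))

# Energy is monotonically nonincreasing along the discretized flow
assert all(a >= b - 1e-12 for a, b in zip(energies, energies[1:]))
print(z, energies[-1])
```

Convergence here is slow near saturation (the Jacobian factor vanishes as \(\sigma(z) \to 1\)); the exponential rate of Theorem 4.2 is what the sector hypothesis buys on the region where it holds.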

Two hypotheses make everything go through (Theorem 4.2 of [1]): \(\varphi\) is Lipschitz (uniform bound on \(\|J_\varphi\|\), so the output force is bounded) and sector-monotone (the Jacobian is uniformly positive in an appropriate sense, so the output edge cannot add energy). Sigmoid, softmax, and tanh all satisfy both by direct computation. Under these, the augmented Laplacian is uniformly positive-definite on the free vertices, Dirichlet energy (augmented) is still a common Lyapunov function across activation patterns, and convergence is still exponential.

A pleasant algebraic cancellation makes classification particularly clean. For softmax with cross-entropy loss, or sigmoid with binary cross-entropy, the Jacobian–gradient product collapses to the residual: \(J_\varphi^T \nabla_{\varphi} L = \varphi(z) - y\) (paper Eq. 27). In the sheaf picture this says that the output-edge force for classification is identical to the output-edge force for regression — just with \(\varphi(z)\) in place of \(z\). So classification and regression fit under a single local-update rule with no special-casing. This identity is also why the rest of the book can keep treating “output edge” as a conceptually uniform thing, swapping the nonlinearity in and out at will.
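The cancellation identity is easy to verify numerically. The following sketch (illustrative code, not the book's lab) forms the softmax Jacobian \(J_\varphi = \operatorname{diag}(p) - pp^T\) and the cross-entropy gradient \(\nabla_\varphi L = -y/p\), and checks that their product collapses to the residual \(p - y\), as in paper Eq. 27.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())     # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=5)
y = np.zeros(5)
y[2] = 1.0                      # one-hot target

p = softmax(z)
J = np.diag(p) - np.outer(p, p)     # softmax Jacobian (symmetric PSD)
grad_L_wrt_p = -y / p               # gradient of L = -sum_i y_i log p_i

# Cancellation: J^T (-y/p) = -y + p (y . 1) = p - y, since sum(y) = 1
assert np.allclose(J.T @ grad_L_wrt_p, p - y)
```

The algebra behind the assert is two lines: \((\operatorname{diag}(p) - pp^T)(-y/p) = -y + p\,(y^T \mathbf{1}) = p - y\) whenever \(y\) sums to one, which is exactly why classification and regression share one local-update rule.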

Intuition device (planned): Side-by-side ‘backprop on a node ↔︎ local adjoint on an edge’ diagram.

12.4 Formal development

[TO FILL: formal development — definitions, statements, careful notation]

12.5 Theorem demonstrations

[TO FILL: proofs / proof sketches of the key results named above. Proofs should come *after* the intuition section, as agreed.]

12.6 Worked examples

[TO FILL: worked example(s) carried out by hand]

12.7 Coding lab

lab-10-nonlinear-output
[TO FILL: one-paragraph description of the lab's goal]

12.8 Exercises

[TO FILL: 3–6 exercises, graded from warm-up to project-level]

12.9 Further reading

[TO FILL: annotated paragraph of 3–6 references]

12.10 FAQ / common misconceptions

[TO FILL: short Q&A for things readers frequently get wrong]