26  Lab 10 — Nonlinear Output Activation

Anchor chapter: Chapter 10 — Nonlinear Output Activations and Local Adjoints.

Goal. Extend the neural sheaf to a sigmoid output neuron, verify the local-adjoint construction, and confirm convergence for a binary classification task.

Attach a sigmoid output to a [2, 8, 1] binary classifier on the circular task. Integrate the nonlinear-output heat equation (Def. 10.2) under two loss choices, squared residual and binary cross-entropy (BCE), and compare the output-edge force magnitude as the pre-output \(z\) sweeps through the saturation regime. Numerically verify the Jacobian–gradient cancellation of Remark 10.8: the BCE output force equals \(\sigma(z) - y\), while the squared-residual force picks up the extra factor \(\sigma'(z)\). Then plot the resulting slowdown.
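The cancellation is a chain-rule computation. The sketch below writes \(\hat{y} = \sigma(z)\), uses the standard identity \(\sigma'(z) = \sigma(z)\,(1 - \sigma(z)) = \hat{y}(1-\hat{y})\), and assumes the two losses take their usual forms (BCE and half the squared residual); the chapter's own statement in Remark 10.8 remains the reference.

\[
\begin{aligned}
\text{BCE:}\quad
\mathcal{L} &= -y\log\hat{y} - (1-y)\log(1-\hat{y}),
&\frac{\partial \mathcal{L}}{\partial \hat{y}} &= -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})},
&\frac{\partial \mathcal{L}}{\partial z} &= \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}\,\sigma'(z) = \sigma(z) - y, \\[4pt]
\text{Squared residual:}\quad
\mathcal{L} &= \tfrac{1}{2}(\hat{y}-y)^2,
&\frac{\partial \mathcal{L}}{\partial \hat{y}} &= \hat{y}-y,
&\frac{\partial \mathcal{L}}{\partial z} &= (\sigma(z)-y)\,\sigma'(z).
\end{aligned}
\]

In the BCE case the denominator \(\hat{y}(1-\hat{y})\) of the loss gradient cancels against the sigmoid Jacobian \(\sigma'(z)\); in the squared-residual case nothing cancels, so the factor \(\sigma'(z)\) survives.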

Tip: Runs in your browser

This lab requires only NumPy and Matplotlib, loaded automatically via Pyodide. Code cells run directly in the page via WebAssembly — no local Python installation needed.

Prefer a local Jupyter environment? Download lab-10-nonlinear-output.ipynb

26.1 Setup

26.2 1. Build the object

We generate a circular binary classification dataset (inner ring: class 0, outer ring: class 1) and initialise a [2, 8, 1] ReLU network with a sigmoid output as plain NumPy arrays. The local-adjoint output restriction map \(\mathcal{F}_{\text{out} \trianglelefteq e_{\text{out}}}\) encodes the loss gradient. For BCE loss \(\mathcal{L} = -y \log \hat{y} - (1-y)\log(1-\hat{y})\), with \(\hat{y} = \sigma(z^{(2)})\), the output-edge force is \(\nabla_{z^{(2)}} \mathcal{L} = \sigma(z^{(2)}) - y\). For squared residual \(\mathcal{L} = \tfrac{1}{2}(\hat{y} - y)^2\) (called MSE below), the force is \((\sigma(z^{(2)}) - y)\,\sigma'(z^{(2)})\). The extra \(\sigma'\) factor causes slowdown near saturation (Remark 10.8).
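The construction described above might be sketched as follows. This is not the lab's own cell: the helper names (`make_rings`, `forward`), the ring radii, the noise level, the seed, and the 0.5 weight scale are all assumptions chosen for illustration. The sketch builds 200 points on two rings, runs a forward pass through a [2, 8, 1] network, and checks both analytic output-edge forces against central finite differences of the corresponding loss with respect to \(z^{(2)}\).

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed (assumption)

def make_rings(n=200, r_inner=1.0, r_outer=2.0, noise=0.1):
    """Two noisy concentric rings: inner ring is class 0, outer ring is class 1.
    Radii and noise level are hypothetical choices."""
    y = (np.arange(n) >= n // 2).astype(float)  # first half inner, second half outer
    radius = np.where(y == 0, r_inner, r_outer) + noise * rng.standard_normal(n)
    angle = rng.uniform(0.0, 2.0 * np.pi, n)
    X = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=1)
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    """[2, 8, 1] network: ReLU hidden layer, sigmoid output. Returns (z2, y_hat)."""
    W1, b1, W2, b2 = params
    h = np.maximum(0.0, X @ W1 + b1)   # hidden activations, shape (n, 8)
    z2 = (h @ W2 + b2).ravel()         # pre-output z^(2), shape (n,)
    return z2, sigmoid(z2)

X, y = make_rings()
params = (0.5 * rng.standard_normal((2, 8)), np.zeros(8),
          0.5 * rng.standard_normal((8, 1)), np.zeros(1))
z2, y_hat = forward(params, X)

# Analytic output-edge forces from the text.
sigma_prime = y_hat * (1.0 - y_hat)    # sigma'(z) = sigma(z) (1 - sigma(z))
force_bce = y_hat - y
force_mse = (y_hat - y) * sigma_prime

def bce(z):
    # Equal to -y log(sigma(z)) - (1-y) log(1 - sigma(z)), written stably
    # as softplus(z) - y z.
    return np.logaddexp(0.0, z) - y * z

def mse(z):
    return 0.5 * (sigmoid(z) - y) ** 2

# Central finite differences of each per-point loss with respect to z^(2).
eps = 1e-5
fd_bce = (bce(z2 + eps) - bce(z2 - eps)) / (2.0 * eps)
fd_mse = (mse(z2 + eps) - mse(z2 - eps)) / (2.0 * eps)

print("max |fd - analytic| (BCE):", np.max(np.abs(fd_bce - force_bce)))
print("max |fd - analytic| (MSE):", np.max(np.abs(fd_mse - force_mse)))
```

Writing BCE as `logaddexp(0, z) - y*z` avoids taking the logarithm of a probability that has rounded to 0 or 1, which matters once points start to saturate.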

26.3 2. Verify a theorem / run an experiment

We plot the output-edge force as a function of the pre-activation \(z^{(2)}\) for a fixed target \(y=1\), sweeping \(z \in [-6, 6]\). The BCE force \(\sigma(z)-1\) is approximately linear near \(z=0\) and stays bounded: it tends to \(-1\) as \(z \to -\infty\) (confident but wrong) and to \(0\) as \(z \to +\infty\) (confident and correct). The squared-residual (MSE) force \((\sigma(z)-1)\,\sigma'(z)\) carries the factor \(\sigma'(z) = \sigma(z)(1-\sigma(z))\), which decays like \(e^{-|z|}\), so it vanishes exponentially in \(|z|\) on both sides. Gradient descent is therefore extremely slow when the network is confident but wrong. We verify the analytic formulas numerically for all 200 training points and plot the force ratio MSE/BCE, which equals \(\sigma'(z)\) everywhere.
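A minimal NumPy sketch of the sweep, assuming a grid of 241 points on \([-6, 6]\) (the grid size is an arbitrary choice) and omitting the Matplotlib plotting calls. It evaluates both forces for \(y = 1\) and checks that their ratio reproduces \(\sigma'(z)\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0                              # fixed target, as in the text
z = np.linspace(-6.0, 6.0, 241)      # sweep of the pre-activation (grid size assumed)
s = sigmoid(z)
sigma_prime = s * (1.0 - s)          # sigma'(z)

force_bce = s - y                    # BCE output-edge force
force_mse = (s - y) * sigma_prime    # squared-residual (MSE) output-edge force

# The BCE force is nonzero for every finite z, so the ratio is well defined.
ratio = force_mse / force_bce

print("BCE force at z = -6:", force_bce[0])
print("MSE force at z = -6:", force_mse[0])
print("max |ratio - sigma'|:", np.max(np.abs(ratio - sigma_prime)))
```

At the confident-but-wrong end (\(z = -6\)) the BCE force is close to \(-1\) while the MSE force is smaller by the factor \(\sigma'(-6)\), which is the slowdown the plot is meant to show.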

26.4 Exercises

  1. Saturation slowdown. Initialise the network so that one data point has \(|z^{(2)}| = 5\) (strongly saturated). Compare the number of gradient steps needed to move that point’s output by 0.1 under BCE vs MSE loss. Quantify the speedup.

  2. Softmax multiclass. Replace the sigmoid with a 3-class softmax head and a 3-class dataset (anisotropic blobs). Derive the local-adjoint force for cross-entropy loss (the result should be \(\hat{y} - y_{\text{onehot}}\), analogous to the BCE case). Verify numerically.

  3. Force as edge cochain. In the sheaf framework the output-edge force lives in the edge stalk \(\mathcal{F}(e_{\text{out}}) = \mathbb{R}^{n_2}\). For the BCE case, show that the gradient \(\nabla_\theta \mathcal{L}\) can be expressed as \(-\delta_\Omega^\top f_{\text{out}}\) where \(f_{\text{out}} = \sigma(z^{(2)}) - y\) and \(\delta_\Omega\) is the neural sheaf coboundary. Verify this equality numerically for one training point.

  4. BCE vs MSE convergence. Implement a simple gradient-flow training loop (weight update \(\dot{W} = -\nabla_W \mathcal{L}\)) for both losses on the circular dataset. Plot loss vs iteration and show that BCE converges significantly faster on points near saturation.
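As a possible warm-up for Exercise 1, the scalar toy below applies gradient steps directly to a single pre-activation \(z\) instead of to network weights. That is a deliberate simplification, not the exercise itself. The helper `steps_to_move`, the starting point \(z_0 = -5\) with target \(y = 1\), the step size 0.1, and the step cap are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_force(z, y):
    return sigmoid(z) - y

def mse_force(z, y):
    s = sigmoid(z)
    return (s - y) * s * (1.0 - s)

def steps_to_move(force, z0=-5.0, y=1.0, lr=0.1, delta=0.1, max_steps=100_000):
    """Count gradient steps on z until sigma(z) has moved by `delta` from its start.
    Hypothetical helper: updates z itself, not the weights of a network."""
    z = z0
    start = sigmoid(z0)
    for step in range(1, max_steps + 1):
        z -= lr * force(z, y)          # explicit gradient step on the pre-activation
        if abs(sigmoid(z) - start) >= delta:
            return step
    return None                        # output did not move by delta within max_steps

steps_bce = steps_to_move(bce_force)
steps_mse = steps_to_move(mse_force)
print("steps under BCE:", steps_bce)
print("steps under MSE:", steps_mse)
```

In the full exercise the update acts on the weights, so the step counts will differ from this toy. The qualitative gap between the two losses is what carries over.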