15 Spectral and Per-Edge Discord Diagnostics

Purpose. Develops the diagnostic toolkit: spectral signatures of the restricted Laplacian and per-edge discord measures localized to layer/operation type.

15.1 Key concepts & results

Per-edge discord d_e(x) = ‖F_{u⊴e} x_u − F_{v⊴e} x_v‖²; summed over edges it gives twice the Dirichlet energy.
Spectrum of L_free: eigenvalue clusters tied to layer types (affine vs ReLU vs output).
Diagnosing bottlenecks (slow-converging modes localized to specific edges).
Use during training: monitor per-layer / per-operation discord to detect dead neurons, vanishing gradients, etc.

Prerequisites: Ch. 9, Ch. 12

15.2 Motivating example

Train the [2, 30, 1] paraboloid network of Ch. 12 for a few hundred epochs, but this time record the per-edge discord \(d_e(x) = \|\mathcal{F}_{u \trianglelefteq e} x_u - \mathcal{F}_{v \trianglelefteq e} x_v\|^2\) at every step, organized as a layered heatmap with one row per edge (grouped by operation type: affine vs ReLU vs output) and one column per epoch. Several common pathologies light up immediately in this picture.

Dead ReLU neurons appear as flat zero rows on specific ReLU edges — a neuron whose pre-activation is always negative contributes no discord, because its restriction map \(R_{z^{(\ell)}}\) zeros it out on the post-activation side. Vanishing gradients show up as persistent near-zero discord on affine edges near the input, paired with large discord deeper in the network — the “signal” has nowhere to propagate. Output saturation (sigmoid stuck at \(0\) or \(1\)) shows up as a persistent hot stripe on the output edge. One figure, three diagnostics, each automatically labeled by layer and operation type. Compare this to the standard practice of probing gradient norms and activation statistics from a pile of opaque forward/backward buffers.

15.3 Intuition

Dirichlet energy \(E(x) = \tfrac{1}{2} \|\delta x\|^2 = \tfrac{1}{2} \sum_e d_e(x)\) is a single scalar that tells you how far the state is from harmonic, that is, how badly the network’s internal representations disagree across the sheaf. Split this sum by edge, and you get a “where” instead: which edges carry the discord, and how that attribution redistributes over training. Because each edge in the neural sheaf corresponds to a specific operation (affine, ReLU, or output) at a specific layer, the per-edge discord \(d_e\) is automatically labeled (“layer 3 ReLU is contributing 40% of the total residual energy”), making it the sheaf-side analogue of saliency for training dynamics.

The spectrum of the free Laplacian \(L_{\text{free}}\) tells a complementary, pre-training story. Its eigenvalues cluster by edge type (a cluster near \(\|W^{(\ell)}\|^2 + 1\) from affine edges, a cluster near \(2\) from ReLU edges, a cluster determined by the output operation; see Prop. 13.4), and its slowest-decaying eigenvector localizes on whichever layer has the weakest coupling. Running the heat equation once and watching which modes take longest to die out gives a cheap prediction, available before any training step, of which layer will bottleneck training. Per-edge discord during training then tells you whether that bottleneck actually materializes or whether it dissolves as \(\theta\) moves.

The whole point is that these diagnostics are not bolted on; they are built into the object. In the feedforward view you have to decide what to probe (gradient norm, activation histogram, weight update magnitude) and then ask what it means. In the sheaf view the object you are already computing with — Dirichlet energy and its edgewise decomposition — is the diagnostic, labeled by operation type for free.

Intuition device (planned): Layered heatmap (layers × time) of per-edge discord, with color-coded operation types.

15.4 Formal development

Fix a neural sheaf \(\mathcal{F}_\theta\) with coboundary \(\delta_\theta(\sigma)\) and its associated free Laplacian \(L_{\text{free}}(\sigma)\) (Def. 8.3). Let \(c\) be a cochain — during inference, the equilibrium \(c^\star\) of Ch. 9; during training, the two-sided-pinned equilibrium of Ch. 11; during sampling-based diagnostics, any cochain of interest.

Per-edge discord

Definition 13.1 (Per-edge discord). For edge \(e \in E\) with endpoints \(u, v\) and restriction maps \(\mathcal{F}_{u \trianglelefteq e}\), \(\mathcal{F}_{v \trianglelefteq e}\), the per-edge discord at cochain \(c\) is \[d_e(c) := \|\mathcal{F}_{u \trianglelefteq e}\, c_u - \mathcal{F}_{v \trianglelefteq e}\, c_v\|^2_{\mathcal{F}(e)} = \|(\delta c)_e\|^2.\]

Proposition 13.2 (Discord partitions Dirichlet energy). For every cochain \(c\), \[E(c) = \tfrac{1}{2} \|\delta c\|^2 = \tfrac{1}{2} \sum_{e \in E} d_e(c).\] In particular, the total Dirichlet energy is exactly half the sum of per-edge discords, so \(\tfrac{1}{2} d_e(c)\) is the localized contribution of edge \(e\) to the global tension.

Each edge in the neural sheaf corresponds to a specific operation (affine at a specific layer, ReLU at a specific layer, output), so Prop. 13.2 partitions \(E(c)\) by operation and layer. “Layer-3 ReLU contributes 40% of the total energy at this cochain” is a precise, local statement.
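As a numerical sanity check of Prop. 13.2, the following minimal sketch evaluates the four edge residuals of a hypothetical [2, 4, 1] chain on an arbitrary cochain. The vertex names (`a0`, `z1`, `a1`, `z2`, `y`), the edge layout, the omission of biases, and the random cochain are all assumptions made for illustration, not the book's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical [2, 4, 1] network; biases are omitted to keep the sketch short.
W1 = rng.normal(size=(4, 2))
W2 = rng.normal(size=(1, 4))

# An arbitrary cochain: one vector per vertex (a0, z1, a1, z2, y).
c = {
    "a0": rng.normal(size=2),
    "z1": rng.normal(size=4),
    "a1": rng.normal(size=4),
    "z2": rng.normal(size=1),
    "y": rng.normal(size=1),
}
R = np.diag((c["z1"] > 0).astype(float))  # ReLU activation pattern at z1

# Edge residuals (delta c)_e = F_u c_u - F_v c_v, one per edge.
residual = {
    "aff1": W1 @ c["a0"] - c["z1"],
    "act1": R @ c["z1"] - c["a1"],
    "aff2": W2 @ c["a1"] - c["z2"],
    "out": c["z2"] - c["y"],
}

# Per-edge discord d_e(c) and Dirichlet energy E(c) = 0.5 * ||delta c||^2.
discord = {e: float(r @ r) for e, r in residual.items()}
delta_c = np.concatenate(list(residual.values()))
energy = 0.5 * float(delta_c @ delta_c)

print(discord, energy)
```

The fractions `0.5 * discord[e] / energy` are the per-edge shares of the total energy referred to in the text.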

Per-layer and per-type aggregates

Definition 13.3 (Per-layer discord). Write \(E_\ell^{\text{aff}}(c) = \tfrac{1}{2} d_{e_\ell^{\text{aff}}}(c)\) and \(E_\ell^{\text{act}}(c) = \tfrac{1}{2} d_{e_\ell^{\text{act}}}(c)\) for the contributions of the affine and activation edges at layer \(\ell\); write \(E^{\text{out}}(c) = \tfrac{1}{2} d_{e^{\text{out}}}(c)\) for the output-edge contribution.

These aggregates group edges by operation type and layer depth. The three per-type aggregates \[E^{\text{aff}}(c) = \sum_{\ell=1}^{k+1} E_\ell^{\text{aff}}(c), \quad E^{\text{act}}(c) = \sum_{\ell=1}^{k} E_\ell^{\text{act}}(c), \quad E^{\text{out}}(c)\] sum to \(E(c)\).
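The bookkeeping behind these aggregates can be sketched in a few lines. The discord values below are illustrative placeholders for a network with \(k = 2\) hidden layers, and the `(type, layer)` keys are a hypothetical labeling chosen for this sketch.

```python
from collections import defaultdict

# Hypothetical per-edge discords d_e(c), keyed by (edge type, layer).
# k = 2 hidden layers: affine edges at layers 1..3, activation edges at 1..2.
discord = {
    ("aff", 1): 0.02, ("act", 1): 0.10,
    ("aff", 2): 0.04, ("act", 2): 0.06,
    ("aff", 3): 0.08, ("out", 3): 0.50,
}

# E^type(c) = sum over layers of 0.5 * d_e(c) for edges of that type.
per_type = defaultdict(float)
for (edge_type, _layer), d in discord.items():
    per_type[edge_type] += 0.5 * d

total_energy = 0.5 * sum(discord.values())
shares = {t: e / total_energy for t, e in per_type.items()}
print(dict(per_type), shares)
```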

Spectral structure of the free Laplacian

Proposition 13.4 (Edge-type spectral clusters). For a feedforward ReLU network with uniform-variance weight initialization, the spectrum of \(L_{\text{free}}(\sigma)\) decomposes approximately into three edge-type clusters:

(affine cluster) eigenvalues near \(\|W^{(\ell)}\|^2 + 1\), one cluster per layer;

(activation cluster) eigenvalues near \(2\) (corresponding to the \(R_\sigma^T R_\sigma + I\) blocks), with size equal to the number of active neurons;

(output cluster) eigenvalues determined by \(\|J_\varphi\|^2\) or the identity block.

The inter-cluster gaps scale as the ratios of layer widths and weight norms.

Proposition 13.4 is useful ahead of training: diagonalizing \(L_{\text{free}}\) at initialization reveals which cluster sits nearest zero, and therefore which layer will bottleneck convergence.
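One way to run this check numerically is sketched below for a [2, 4, 1] chain. The layout is an assumption of the sketch, not taken from Def. 8.3: the input \(a^{(0)}\) and the target \(y\) are pinned, the free vertices are ordered \((z^{(1)}, a^{(1)}, z^{(2)})\), biases are dropped, and every hidden unit is taken to be active. The helper `free_laplacian` is hypothetical.

```python
import numpy as np

def free_laplacian(W2, mask):
    """Assemble L_free = delta^T delta on the free vertices (z1, a1, z2).

    Rows of delta are the edges aff1, act1, aff2, out. In this assumed layout
    W1 multiplies only the pinned input, so it does not enter the matrix.
    """
    n2, n1 = W2.shape
    R = np.diag(mask.astype(float))
    I1, I2 = np.eye(n1), np.eye(n2)
    Z = np.zeros
    delta = np.block([
        [-I1, Z((n1, n1)), Z((n1, n2))],          # aff1: W1 a0 - z1
        [R, -I1, Z((n1, n2))],                    # act1: R z1 - a1
        [Z((n2, n1)), W2, -I2],                   # aff2: W2 a1 - z2
        [Z((n2, n1)), Z((n2, n1)), I2],           # out:  z2 - y
    ])
    return delta.T @ delta

rng = np.random.default_rng(0)
n1, n2 = 4, 1
W2 = rng.normal(scale=np.sqrt(2.0 / n1), size=(n2, n1))  # He-style scale
mask = np.ones(n1, dtype=bool)                            # all units active

L_free = free_laplacian(W2, mask)
eigvals = np.linalg.eigvalsh(L_free)  # ascending order
print(np.round(eigvals, 3))
```

The printed eigenvalues can then be compared against the clusters that Prop. 13.4 predicts; how closely they match depends on the width and on the assumptions listed above.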

Bottleneck localization

Definition 13.5 (Bottleneck edge). Let \(\phi_{\min}\) be the eigenvector of \(L_{\text{free}}(\sigma)\) at its smallest eigenvalue. The bottleneck edge of the current configuration is the edge maximizing the contribution \(\|(\delta \phi_{\min})_e\|^2\) to the smallest-eigenmode discord.

Proposition 13.6 (Bottleneck edge dominates convergence). Under the heat-equation dynamics of Ch. 9, the mode along \(\phi_{\min}\) decays at rate \(\lambda_{\min}(L_{\text{free}}(\sigma))\). The bottleneck edge is the edge through which that slow mode flows; its discord \(d_{e^\star}(c(t))\) is proportional (to leading order) to the total residual \(E(c(t) - c^\star)\).
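A minimal sketch of locating the bottleneck edge of Def. 13.5 follows, under assumptions that are not the book's: a [2, 4, 1] chain with input and target pinned, free vertices \((z^{(1)}, a^{(1)}, z^{(2)})\), no biases, and all hidden units active. If the smallest eigenvalue is degenerate, `phi_min` is only one representative of its eigenspace.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 4, 1
W2 = rng.normal(scale=np.sqrt(2.0 / n1), size=(n2, n1))
R = np.eye(n1)                       # assumption: every hidden unit is active
I1, I2 = np.eye(n1), np.eye(n2)
Z = np.zeros

# Coboundary rows per edge, restricted to the free vertices (z1, a1, z2).
edges = {
    "aff1": np.hstack([-I1, Z((n1, n1)), Z((n1, n2))]),
    "act1": np.hstack([R, -I1, Z((n1, n2))]),
    "aff2": np.hstack([Z((n2, n1)), W2, -I2]),
    "out": np.hstack([Z((n2, n1)), Z((n2, n1)), I2]),
}
delta = np.vstack(list(edges.values()))

lam, Phi = np.linalg.eigh(delta.T @ delta)
phi_min = Phi[:, 0]                  # unit eigenvector at the smallest eigenvalue

# ||(delta phi_min)_e||^2 for each edge; these contributions sum to lambda_min.
contribution = {e: float(np.sum((B @ phi_min) ** 2)) for e, B in edges.items()}
bottleneck = max(contribution, key=contribution.get)
print(bottleneck, contribution)
```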

Diagnostic signatures

The following three failure modes have clean diagnostic signatures in per-edge discord.

Proposition 13.7 (Dead-ReLU signature). The \(j\)-th neuron of layer \(\ell\) is dead in the training distribution if for every example \(i\), \(c^{(i)}_{z^{(\ell)}, j} < 0\) at equilibrium. A dead neuron contributes \(d_{e_\ell^{\text{act}}, j}(c^{(i)}) = 0\) for every \(i\), appearing as a flat-zero row in the per-neuron discord heatmap. Additionally, \(\nabla_{W^{(\ell)}_{j, :}} \mathcal{L}(\theta) = 0\) (Prop. 11.5), so the neuron cannot recover via gradient descent.

Proposition 13.8 (Vanishing-gradient signature). Vanishing gradients at early layers manifest as near-zero \(d_{e_\ell^{\text{aff}}}(c^{(i)})\) for small \(\ell\), paired with non-zero discord at late layers. This identifies the structural obstruction: the early-layer agreement condition is satisfied trivially because the signal has collapsed before it reaches the deep layers.

Proposition 13.9 (Output-saturation signature). Sigmoid/softmax output saturation, meaning \(\varphi(c_{z^{(k+1)}}) \approx 0\) or \(1\) with non-negligible residual \(\|\varphi(c_{z^{(k+1)}}) - y\|\), manifests as persistent large \(d_{e^{\text{out}}}(c^{(i)})\) for examples saturated on the wrong side of their label, visible as a hot stripe on the output-edge row of the discord heatmap.

The diagnostic dashboard

Definition 13.10 (Per-edge discord dashboard). The discord dashboard is the space-time heatmap \(D(\ell, \text{type}; t)\) indexed by layer \(\ell\), edge type (affine/act/output), and training step \(t\), with cell values equal to the batch-averaged per-edge discord \(\frac{1}{B} \sum_i d_{e_\ell^{\text{type}}}(c^{(i)\star}(\theta_t))\).

The dashboard visualizes the evolution of per-edge discord across training, with automatic labeling by operation type and layer. Props. 13.7–13.9 say that dead ReLUs, vanishing gradients, and output saturation appear as distinguishable patterns in this one picture.
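A single dashboard column (one training step) might be computed as in the sketch below. It makes several simplifying assumptions that are not from the text: a [2, 4, 1] chain without biases, a paraboloid target \(y = x_1^2 + x_2^2\), and an activation pattern frozen at its forward-pass value, so that the two-sided-pinned equilibrium reduces to a linear least-squares problem. The helper `equilibrium_discords` is hypothetical.

```python
import numpy as np

def equilibrium_discords(W1, W2, x, y):
    """Per-edge discords at the two-sided-pinned equilibrium (frozen mask)."""
    n1, n2 = W1.shape[0], W2.shape[0]
    mask = (W1 @ x > 0).astype(float)   # activation pattern from the forward pass
    R, I1, I2 = np.diag(mask), np.eye(n1), np.eye(n2)
    Z = np.zeros
    # Columns: free vertices z1, a1, z2. Rows: edges aff1, act1, aff2, out.
    delta_free = np.block([
        [-I1, Z((n1, n1)), Z((n1, n2))],
        [R, -I1, Z((n1, n2))],
        [Z((n2, n1)), W2, -I2],
        [Z((n2, n1)), Z((n2, n1)), I2],
    ])
    # Contribution of the pinned vertices (a0 = x and y) to each edge residual.
    pinned = np.concatenate([W1 @ x, np.zeros(n1), np.zeros(n2), -y])
    # Minimizing the Dirichlet energy over the free vertices is least squares.
    c_free, *_ = np.linalg.lstsq(delta_free, -pinned, rcond=None)
    r = delta_free @ c_free + pinned
    pieces = np.split(r, np.cumsum([n1, n1, n2]))
    return np.array([float(p @ p) for p in pieces])

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 2))
W2 = rng.normal(size=(1, 4))
X = rng.uniform(-2.0, 2.0, size=(8, 2))
Y = (X ** 2).sum(axis=1, keepdims=True)   # assumed paraboloid targets

# Batch average over examples: rows are (aff1, act1, aff2, out).
column = np.mean([equilibrium_discords(W1, W2, x, y) for x, y in zip(X, Y)], axis=0)
print(column)
```

Stacking one such column per logged step gives the heatmap. The frozen mask is a shortcut: the pattern implied by the equilibrium cochain may differ from the forward-pass one.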

Remark 13.11 (Relation to standard diagnostics). The standard training-diagnostic toolkit includes gradient-norm monitoring, activation histograms, and loss curves. Per-edge discord subsumes these: loss curve = \(\mathcal{L}(\theta_t)\); gradient norm tells you that a layer is stuck, discord tells you on which edge and operation the disagreement sits; activation histograms are redundant given pre- and post-activation stalks of each cochain. The dashboard is not a novel observable: it is a relabeling that makes the standard diagnostics localize automatically to specific edges, and that extends naturally to training schemes where gradients don’t exist globally (e.g., joint flow with discrete switching).

15.5 Theorem demonstrations

Proof of Prop. 13.2. The 1-cochain space \(C^1(G; \mathcal{F}) = \bigoplus_e \mathcal{F}(e)\) is an orthogonal direct sum under the inner product inherited from the edge stalks. For any \(c\), \((\delta c)_e \in \mathcal{F}(e)\) is the \(e\)-component of \(\delta c\), so \[\|\delta c\|^2 = \sum_e \|(\delta c)_e\|^2 = \sum_e d_e(c).\] Dividing by \(2\) gives Dirichlet energy. \(\square\)

Proof of Prop. 13.6 (bottleneck dominates convergence). Expand the deviation from equilibrium in the eigenbasis of \(L_{\text{free}}(\sigma)\) (fixed \(\sigma\) for simplicity): \(c(t) - c^\star = \sum_m \alpha_m(t)\, \phi_m\), with \(\dot\alpha_m = -\lambda_m \alpha_m\) so \(\alpha_m(t) = e^{-\lambda_m t} \alpha_m(0)\). (The mode index \(m\) is unrelated to the depth \(k\).) The total residual decays as \(\sum_m e^{-2\lambda_m t} \alpha_m(0)^2\); for large \(t\), the slowest-decaying mode \(\phi_{\min}\) at eigenvalue \(\lambda_{\min}\) dominates, giving \(E(c(t) - c^\star) \sim e^{-2\lambda_{\min} t} \alpha_{\min}(0)^2\). Since \((\delta \phi_m)_e\) is fixed per mode, the edge with largest \(\|(\delta \phi_{\min})_e\|^2\), namely the bottleneck edge \(e^\star\) of Def. 13.5, receives the largest slice of the slow mode’s discord: \[d_{e^\star}(c(t)) \approx \alpha_{\min}(t)^2 \|(\delta \phi_{\min})_{e^\star}\|^2 \propto E(c(t) - c^\star)\] to leading order as \(t \to \infty\). \(\square\)
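The modal-decay argument can be checked numerically on a stand-in operator. The sketch below uses a synthetic symmetric positive definite matrix with prescribed eigenvalues in place of \(L_{\text{free}}(\sigma)\) (an assumption made for illustration; no actual sheaf is built) and measures the late-time decay rate of the squared deviation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for L_free: symmetric positive definite with known spectrum.
lam_true = np.array([0.5, 2.0, 3.0, 4.0])
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # random orthonormal basis
L = Q @ np.diag(lam_true) @ Q.T
L = 0.5 * (L + L.T)                             # symmetrize against round-off

lam, Phi = np.linalg.eigh(L)                    # ascending eigenvalues
alpha0 = Phi.T @ (Q @ np.ones(4))               # equal initial weight on each mode

def residual(t):
    """Squared deviation sum_m exp(-2 lam_m t) alpha_m(0)^2 under dc/dt = -L c."""
    alpha_t = np.exp(-lam * t) * alpha0
    return float(alpha_t @ alpha_t)

# Late-time logarithmic decay rate of the residual.
t1, t2 = 20.0, 21.0
late_rate = -np.log(residual(t2) / residual(t1)) / (t2 - t1)
print(late_rate, 2.0 * lam[0])
```

At late times the measured rate approaches \(2 \lambda_{\min}\), since the faster modes have already died out.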

Proof of Prop. 13.7 (dead-ReLU signature). By hypothesis, \(c^{(i)}_{z^{(\ell)}, j} < 0\) at equilibrium for every \(i\), so \((R_{\sigma^{(i)}})_{jj} = 0\). Then the \((\ell, j)\)-component of the ReLU restriction maps agree at both endpoints: upstream contributes \(0 \cdot c^{(i)}_{z^{(\ell)}, j} = 0\), downstream contributes \(c^{(i)}_{a^{(\ell)}, j}\); minimizing Dirichlet energy sets \(c^{(i)}_{a^{(\ell)}, j} = 0\), so the per-neuron discord vanishes. For the weight row: by Prop. 11.5, \((\nabla_{W^{(\ell)}_{j,:}} E)(c^{(i)}) = (\delta c^{(i)})_{e_\ell^{\text{aff}}, j} \cdot (c^{(i)}_{a^{(\ell-1)}})^T\); at a two-sided equilibrium of an always-negative neuron, \((\delta c^{(i)})_{e_\ell^{\text{aff}}, j} = 0\) as well (no pull from downstream since the ReLU edge already agrees at zero). Summing over \(i\) gives \(\nabla_{W^{(\ell)}_{j,:}} \mathcal{L}(\theta) = 0\). \(\square\)

Proof of Prop. 13.8 (vanishing-gradient signature). Suppose \(\|W^{(1)}\| \ll 1\). At two-sided equilibrium, the linear system \(L_{\text{free}}(\sigma; U) y_\Omega = -L[\Omega, U] u\) has right-hand side with an input piece of magnitude \(\|W^{(1)} x + b^{(1)}\| \ll \|y\|\) and an output piece of magnitude \(\|y\|\). The interior cochain therefore concentrates near the output-pulled solution, with \(c_{z^{(1)}} \approx W^{(1)} x + b^{(1)}\) up to a small perturbation. This gives \(d_{e_1^{\text{aff}}}(c^\star) = \|W^{(1)} x + b^{(1)} - c^\star_{z^{(1)}}\|^2 = O(\|W^{(1)}\|^2)\), vanishingly small, while \(d_{e^{\text{out}}}(c^\star) = (c^\star_{z^{(2)}} - y)^2 = O(1)\). Hence the characteristic late-hot, early-cold pattern. \(\square\)

Proof of Prop. 13.9 (saturation signature). At initialization with \(\|c_{z^{(k+1)}}\| \gg 1\), \(\varphi(c_{z^{(k+1)}})\) approaches \(0\) or \(1\) and \(\varphi'(c_{z^{(k+1)}}) \to 0\). The output-edge Jacobian is near-singular, so the augmented Laplacian’s output-block contribution shrinks and the \(\theta\)-update through the output edge is weak. The residual \(|\varphi(c_{z^{(k+1)}}) - y|\) stays \(\Theta(1)\) for misclassified examples (wrong saturation side), giving persistently large \(d_{e^{\text{out}}}(c^{(i)})\). The squared-residual loss retains the vanishing Jacobian factor, explaining the slow recovery; BCE cancels it (Rem. 10.8), accelerating the approach. \(\square\)
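The last step of the proof can be illustrated with the standard scalar sigmoid. In the sketch below the pre-output values and labels are made-up examples saturated on the wrong side, and the gradients are taken with respect to the pre-output \(z\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -5.0, 5.0, 10.0])   # saturated pre-outputs
y = np.array([1.0, 1.0, 0.0, 0.0])       # labels on the opposite side
p = sigmoid(z)

residual = p - y                          # stays close to +/-1

# Squared loss 0.5 * (p - y)^2: the gradient keeps the factor p * (1 - p).
grad_mse = residual * p * (1.0 - p)

# Binary cross-entropy: the factor p * (1 - p) cancels, leaving p - y.
grad_bce = residual

print(np.abs(grad_mse).max(), np.abs(grad_bce).min())
```

The squared-loss gradient is suppressed by the vanishing sigmoid derivative, while the cross-entropy gradient stays at the size of the residual.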

15.6 Worked example: a dashboard for the [2, 4, 1] paraboloid network

Train the [2, 4, 1] network of [1] §6 on the paraboloid task of Ch. 12 using Algorithm 12.4. Record the per-edge discord dashboard (Def. 13.10) at epochs 0, 50, and 500.

Edge inventory. The [2, 4, 1] sheaf has \(2k + 2 = 4\) edges grouped into three types:

Affine edges: \(e_1^{\text{aff}}\) (carries \(W^{(1)} \in \mathbb{R}^{4 \times 2}\)), \(e_2^{\text{aff}}\) (carries \(W^{(2)} \in \mathbb{R}^{1 \times 4}\)).
ReLU edge: \(e_1^{\text{act}}\), which decomposes into 4 per-neuron sub-discords.
Output edge: \(e^{\text{out}}\) (identity for regression).

Total: 2 + 4 + 1 = 7 rows in the dashboard per training step.

Epoch 0 (random init). All weight matrices random; forward passes don’t match the targets. Expected signature:

\(\bar{d}_{e^{\text{out}}}\): large (forward pass \(\neq\) target).
\(\bar{d}_{e_2^{\text{aff}}}\): small (affine edges agree at equilibrium whenever both endpoints are free; discord only appears when the output pull propagates backward).
\(\bar{d}_{e_1^{\text{aff}}}\): small, similarly.
\(\bar{d}_{e_1^{\text{act}}, j}\) for \(j = 1, \ldots, 4\): distributed roughly uniformly; no neuron dominant yet.

Epoch 50 (mid-training). Expected signature under healthy training:

Total residual \(\bar{R}\) has dropped by \(\sim 10\times\).
\(\bar{d}_{e^{\text{out}}}\) shrinks fastest (the output edge has the strongest pull and no downstream obstruction).
Per-neuron ReLU discords begin to separate: neurons contributing to the fit concentrate, idle neurons flatten.
Affine discords are small throughout.

Epoch 500 (late training). Convergence. Expected:

\(\bar{R}\) plateaus close to the irreducible training loss set by the network’s expressivity.
All surviving non-zero discords come from the output edge (in-sample prediction error) and a small-but-nonzero spread on affine edges reflecting residual approximation error.

Three failure signatures

Dead ReLU (Prop. 13.7). Inject a dead ReLU at epoch 0 by setting \(W^{(1)}_{3, :} = 0\) and \(b^{(1)}_3 = -100\), so the pre-activation of neuron 3 is negative independently of the input. For every \(x_i \in [-2, 2]^2\), \(c^{(i)}_{z^{(1)}, 3} < 0\), so \(d_{e_1^{\text{act}}, 3}(c^{(i)}) = 0\) and \(\nabla_{W^{(1)}_{3,:}} \mathcal{L} = 0\). Dashboard: row 3 of the per-neuron ReLU block is flat black across every epoch; the other three neurons’ discords rebalance to compensate. The plateau of total \(\bar{R}\) rises, because the network cannot use the dead neuron.

Vanishing gradient (Prop. 13.8). Initialize \(W^{(1)}\) with variance \(2/n_0 \cdot 10^{-4}\) (standard deviation 100× too small). At epoch 0, forward passes \(W^{(1)} x + b^{(1)} \approx 0\); ReLUs all fire at their boundaries; the “signal” does not reach the output. Dashboard: \(\bar{d}_{e_1^{\text{aff}}}\) is vanishingly small (the output residual pulls on \(c_{z^{(1)}}\) but can only shift it by a tiny amount, so the affine-edge mismatch stays \(\approx 0\)), while \(\bar{d}_{e^{\text{out}}}\) is large. The characteristic pattern: small at early layers, large at late layers. Compare to a healthy init where discords are roughly uniform across the affine-edge layers.

Output saturation (Prop. 13.9), illustrated with a sigmoid head. Replace identity output with sigmoid (Ch. 10) and run binary classification on circular data. At initialization, most pre-outputs \(c_{z^{(2)}}\) have \(|c_{z^{(2)}}| \gg 1\), so \(\varphi(c_{z^{(2)}}) \approx 0\) or \(1\) regardless of class. Dashboard: \(\bar{d}_{e^{\text{out}}}\) is a persistent hot stripe until enough training pushes \(|c_{z^{(2)}}|\) down; BCE loss (Ch. 10 worked example) accelerates the recovery via the Jacobian–gradient cancellation.
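When injecting a dead ReLU, it is worth confirming that the chosen weight row and bias really keep the pre-activation negative over the whole input square \([-2, 2]^2\). The sketch below checks the forward-pass pre-activation on a grid; the helper `always_negative` and the two candidate settings are illustrative assumptions.

```python
import numpy as np

# Grid of inputs covering [-2, 2]^2, corners included.
g = np.linspace(-2.0, 2.0, 41)
X = np.array([(a, b) for a in g for b in g])

def always_negative(w_row, bias):
    """True if the pre-activation w_row . x + bias is < 0 at every grid point."""
    return bool(np.all(X @ w_row + bias < 0))

# Zero weight row with a large negative bias: negative everywhere.
dead_by_bias = always_negative(np.zeros(2), -100.0)

# Large negative weights flip sign at negative inputs,
# e.g. x = (-2, -2) gives 200 + 200 - 100 = 300 > 0.
dead_by_weights = always_negative(np.array([-100.0, -100.0]), -100.0)

print(dead_by_bias, dead_by_weights)
```

A large-magnitude weight row is not enough on its own, because the sign of the pre-activation then depends on where the input lies in the domain.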

Spectral diagnostic

Before training, compute the eigenvalues of \(L_{\text{free}}(\sigma_0)\) at the initialization and plot them. By Prop. 13.4 we expect three clusters: one near \(\|W^{(1)}\|^2 + 1 \approx 2\) (affine-\(e_1\)), one near \(\|W^{(2)}\|^2 + 1 \approx 2\) (affine-\(e_2\)), one near \(2\) (activation, all active). With He initialization these overlap nicely; with too-small initialization the affine cluster collapses to \(\approx 1\), bringing \(\lambda_{\min}^{\text{free}}\) down and slowing convergence. The spectral diagnostic is a five-second check that predicts whether training will mix fast.

Dashboard reading protocol

Load the log. \(D(\text{edge}, t)\) arrays at each logged epoch.
Normalize. Divide each edge’s discord by total \(\bar{R}(t)\) to see relative contribution.
Flag anomalies. Zero-rows on ReLU edges → dead neurons. Cold affine edges near the input paired with hot late edges → vanishing signal (Prop. 13.8). Persistent hot output stripe → output saturation or inherent unlearnability.
Correlate with loss curve. If \(\bar{R}\) plateaus but specific edge discords are dropping, the network is redistributing residual without reducing it — check for capacity limits.
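The normalize and flag steps of the protocol can be sketched as follows. The log `D` is a made-up 7-row example in the layout of the [2, 4, 1] dashboard, the row names are hypothetical, and each column is normalized by its own sum (which differs from \(\bar{R}\) only by a constant factor) so that the shares add up to one.

```python
import numpy as np

# Hypothetical log: rows are dashboard rows, columns are logged epochs.
rows = ["aff1", "act1_n1", "act1_n2", "act1_n3", "act1_n4", "aff2", "out"]
D = np.array([
    [0.02, 0.01, 0.01],
    [0.10, 0.06, 0.02],
    [0.12, 0.05, 0.03],
    [0.00, 0.00, 0.00],   # candidate dead neuron
    [0.09, 0.04, 0.02],
    [0.03, 0.02, 0.01],
    [0.90, 0.30, 0.05],
])

# Normalize: relative contribution of each row at each epoch.
share = D / D.sum(axis=0, keepdims=True)

# Flag: ReLU rows that are numerically zero at every logged epoch.
eps = 1e-12
zero_rows = [r for r, d in zip(rows, D) if r.startswith("act") and np.all(d < eps)]

# Row carrying the largest share at the last logged epoch.
dominant = rows[int(np.argmax(share[:, -1]))]
print(zero_rows, dominant)
```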

The dashboard is an edge-level analogue of the “loss curve + activation histograms + gradient norms” ensemble used in standard deep-learning practice, with the structural advantage that every observable is automatically labeled by operation type and layer.

15.7 Coding lab

Lab 13 — Diagnostics Dashboard — Build the per-edge discord dashboard of Def. 13.10 for a sheaf-trained [2, 4, 1] network on the paraboloid task. Log per-edge, per-example discords at every epoch; render the space–time heatmap with color-coded edge types; compute the running spectrum of \(L_{\text{free}}(\sigma_t)\) to track \(\lambda_{\min}\) over training. Inject three controlled pathologies — a dead ReLU (hardcoded negative bias), a vanishing gradient (too-small \(W^{(1)}\)), an output saturation (sigmoid head with large pre-output) — and verify each produces the signature predicted by Props. 13.7–13.9.

15.8 Exercises

(warm-up) For the [2, 2, 1] Ch. 11 worked-example equilibrium, recompute the per-edge discords and verify they sum to \(2 R \approx 0.22\).
(warm-up) Show algebraically that for an affine edge \(e_\ell^{\text{aff}}\) at two-sided equilibrium, \(d_{e_\ell^{\text{aff}}}(c^\star) = 0\) iff the current weights exactly interpolate between \(c^\star_{a^{(\ell-1)}}\) and \(c^\star_{z^{(\ell)}}\) on that pair.
(intermediate) Use Prop. 13.4 to predict the eigenvalue clusters of \(L_{\text{free}}\) at He initialization for a [10, 100, 100, 10] network, and check numerically.
(intermediate) Simulate a dead ReLU and verify both signatures: (a) flat-zero row in the per-neuron discord heatmap, (b) zero gradient for the corresponding weight row. Can the neuron ever recover? What perturbation would revive it?
(project) For a vanilla MLP trained on MNIST with SGD, compute the discord dashboard post-hoc (using the equilibrium cochain at each training example under the final weights). Does it correlate with the per-example loss? What per-edge patterns distinguish well-classified from misclassified examples?
(advanced) Extend the discord diagnostic to Ch. 14’s residual block. Define per-edge discords including the skip edge and study how the skip-edge discord evolves relative to the path-edge discords during training.

15.9 Further reading

[1] §6 illustrates the diagnostic toolkit across four tasks; Figs. 10–13 are the visual references for the dashboard. For the spectral-clusters perspective, see Chung’s spectral graph theory monograph on graph Laplacian spectra and [2] Ch. 6 for the geometric-deep-learning framing. [3] develops analogous cohomological diagnostics (\(H^1\)) for linear predictive-coding networks; the nonlinear version is Open Problem 14.6. For comparison with standard deep-learning diagnostics (gradient norms, activation histograms), the optimization chapter of Goodfellow–Bengio–Courville is the textbook reference.

15.10 FAQ / common misconceptions

Is per-edge discord the same as per-layer gradient norm? Related but distinct. Gradient norm is \(\|\nabla_\theta \mathcal{L}\|\); discord is \(\|(\delta c)_e\|^2\), the edge disagreement at the equilibrium cochain. The two tend to flag the same stuck layers, but discord is automatically indexed by operation type (affine / ReLU / output) and depends on the cochain, not the loss gradient directly.

Why eigenvalue clusters by edge type (Prop. 13.4)? Because \(L_{\text{free}} = \sum_e\) (contribution of edge \(e\)) and different edge types contribute blocks with different spectra (affine \(\to\) \(\|W\|^2 + 1\), ReLU \(\to\) projection eigenvalues, output \(\to\) \(\|J_\varphi\|^2\)). At moderate width, the blocks don’t mix much and the combined spectrum looks like a union of three rough clusters.

Does a dead neuron always mean a dead weight row? Under plain gradient descent, yes (Prop. 13.7). Under noisy or momentum-based training, a dead ReLU can sometimes recover via a perturbation that flips its sign — but the zero-discord signature persists until it does.

Can the dashboard be computed in real time during training? Yes. Under sheaf training, the per-edge discords are a by-product of the fast-phase equilibrium cochain, which is already computed in Algorithm 12.4, so the overhead is negligible. Under SGD no equilibrium cochain is computed, so each diagnostic point requires an extra cochain relaxation.

What does the dashboard miss? It tells you where tension lives in the sheaf, not why. Two networks with identical discord dashboards can have very different semantic failure modes (e.g., both underfitting on different feature subsets). Saliency-style per-input analysis (Ch. 14) is complementary.