13  Pinned Neurons and Joint State-Parameter Dynamics

Purpose. Introduces the pinned-neuron construction and the joint flow on (cochain, parameters) that underlies sheaf-based training.

13.1 Key concepts & results

  • Pinning: fixing the value of any vertex stalk (not only input).
  • Output pinning ⇔ supervised target; bidirectional propagation drives interior neurons to a harmonic extension consistent with both boundaries.
  • Joint state-parameter dynamics: simultaneous flow on cochain x and weights θ minimizing total Dirichlet energy ½‖δ_θ x‖².
  • Stationary points characterize trained networks under the sheaf objective.

Prerequisites: Ch 8, Ch 10

13.2 Motivating example

Pin both ends of the [2, 4, 1] path graph. Clamp the input vertex to a concrete training point \(x_i\) and clamp the output vertex to its supervised target \(y_i\). The Dirichlet problem now has two boundary conditions instead of one. With the current weights fixed, the interior cochain relaxes to whatever configuration minimizes the Dirichlet energy \(E(x; \theta) = \tfrac{1}{2} \|\delta_\theta x\|^2\) subject to both clamps. That configuration is the best compromise the current parameters allow between the two boundary values; its energy need not be zero, so it need not correspond to fitting \((x_i, y_i)\).

Now release \(\theta\) and let the weights also flow, slowly, down the same energy. The zero-energy fixed points of the joint flow on \((x, \theta)\) correspond exactly to the parameter settings for which some interior cochain realizes both boundary conditions with zero residual, i.e., the networks that fit the training pair exactly. With multiple training pairs, the parameter flow averages over them, and the fixed points are the stationary points of the average residual energy (Thm. 11.6); this is the sense in which they characterize the attainable fit. This is sheaf-based training: no forward/backward alternation, no stored activations, just a single gradient flow of Dirichlet energy on cochain and weights simultaneously.

13.3 Intuition

In Chs. 8–10 the input was pinned and the forward-pass output fell out of the harmonic extension. Nothing in the sheaf picture privileged the input vertex, though — pinning is just Dirichlet boundary data, and any vertex can carry it. Supervised learning, read through this lens, is two-sided pinning: fix the input stalk at \(x_i\), fix the output stalk at \(y_i\), and ask what configuration of interior state + weights is consistent with both ends. The remaining Dirichlet energy measures how badly the two ends disagree through the current network; driving it to zero is fitting the example.

The picture to keep is a string clamped at both ends, draping between them under tension. With the weights frozen, the cochain settles into whatever shape the tension permits — a harmonic interpolation between input and target, mediated by the current sheaf. With the weights also free to move (on a slower timescale), the sheaf itself deforms to reduce the tension further. At a fixed point, the string is slack: the interior is consistent with both ends, which is the same as saying the network maps \(x_i\) to \(y_i\).

Two structural features distinguish this from ordinary backprop. First, the update is local: each weight’s update depends only on the two stalks it connects, not on chain-rule products propagated from the loss. Second, the two-sided pinning is symmetric: the framework treats input and target interchangeably. This symmetry is what lets Ch. 11 generalize cleanly to partial clamping — pin any subset of stalks to any subset of values (Note 5.1 of the paper), which covers hidden-neuron steering, missing-feature imputation, counterfactual editing, and ordinary supervised learning as the special case “pin the input and the output.” The mathematical machinery for all of these is the same Dirichlet problem with different boundary sets.

Intuition device (planned): Picture of two clamps on the path graph, with the cochain ‘draping’ between them like a string under tension.

13.4 Formal development

Fix a neural sheaf \(\mathcal{F}_\theta\) parameterized by \(\theta = (W^{(\ell)}, b^{(\ell)})_{\ell=1}^{k+1}\) (Ch. 7). Let \(G = (V, E)\) be the path graph, \(C^0(G; \mathcal{F}_\theta) = \bigoplus_v \mathcal{F}(v)\), and \(\delta_\theta(\sigma)\) the coboundary at activation pattern \(\sigma\).

Partial clamping

Definition 11.1 (Partial clamping; Note 5.1 of [1]). Let \(U \subseteq V\) be any subset of vertices, and let \(u : U \to \bigsqcup_{v \in U} \mathcal{F}(v)\) assign a value \(u_v \in \mathcal{F}(v)\) to each clamped vertex. The partially clamped Dirichlet problem on \(\mathcal{F}_\theta\) asks for a cochain \(c \in C^0(G; \mathcal{F}_\theta)\) with \(c_v = u_v\) for all \(v \in U\) that minimizes Dirichlet energy \(E(c; \sigma) = \tfrac{1}{2} \|\delta_\theta(\sigma) c\|^2\).

Three special cases:

  • Input-only pinning (\(U = \{v_x\}\)): inference, Chs. 8–10.
  • Input + output pinning (\(U = \{v_x, v_{\hat{y}}\}\)): supervised fitting of a single example; this is the clamp setup of Chs. 11–12.
  • Hidden-neuron steering (\(U \supseteq \{v_x\}\) plus a subset of \(v_{z^{(\ell)}}\) or \(v_{a^{(\ell)}}\)): counterfactual editing, missing-feature imputation.

Each case is a Dirichlet problem on the same sheaf, differing only in the boundary set.
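Once \(\sigma\) is held fixed, the coboundary is affine in the cochain, so all three cases reduce to one least-squares solve over the free coordinates. The NumPy sketch below is illustrative only: the function name `solve_clamped`, the flat-vector layout of the cochain, and the toy three-vertex scalar path with identity restriction maps are assumptions made for this example, not constructions taken from [1].

```python
import numpy as np

def solve_clamped(D, beta, clamped_idx, clamped_vals):
    """Minimize 0.5 * ||D @ c - beta||^2 subject to c[clamped_idx] = clamped_vals.

    D    : (edge_dim, vertex_dim) linear part of the coboundary at a fixed sigma
    beta : (edge_dim,) offsets carried by affine edges (zero if there are no biases)
    """
    n = D.shape[1]
    clamped_idx = np.asarray(clamped_idx, dtype=int)
    clamped_vals = np.asarray(clamped_vals, dtype=float)
    free_idx = np.setdiff1d(np.arange(n), clamped_idx)

    # Move the clamped columns to the right-hand side; only free coordinates stay unknown.
    rhs = beta - D[:, clamped_idx] @ clamped_vals
    c_free, *_ = np.linalg.lstsq(D[:, free_idx], rhs, rcond=None)

    c = np.empty(n)
    c[clamped_idx] = clamped_vals
    c[free_idx] = c_free
    energy = 0.5 * float(np.sum((D @ c - beta) ** 2))
    return c, energy

# Toy sheaf (hypothetical): scalar stalks on a path v0 - v1 - v2, identity restriction maps.
D = np.array([[-1.0, 1.0, 0.0],
              [0.0, -1.0, 1.0]])
beta = np.zeros(2)

c_in, E_in = solve_clamped(D, beta, [0], [0.0])            # input-only pinning
c_two, E_two = solve_clamped(D, beta, [0, 2], [0.0, 2.0])  # input + output pinning
c_hid, E_hid = solve_clamped(D, beta, [0, 1], [0.0, 5.0])  # input + hidden pinning
print(c_in, E_in)
print(c_two, E_two)
print(c_hid, E_hid)
```

Only the index set passed as `clamped_idx` changes between the three calls, which mirrors the statement that the cases differ only in the boundary set. With input-only pinning the energy is zero; the other two boundary sets leave a nonzero residual on this toy path.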

Solvability of the two-sided-pinned problem

Let \(U = \{v_x, v_{\hat{y}}\}\) with \(c_{v_x} = x\) and \(c_{v_{\hat{y}}} = y\) (the supervised target). Let \(\Omega = V \setminus U\) be the interior vertices.

Proposition 11.2 (Two-sided Dirichlet problem). For every fixed \(\theta\) and every \(\sigma\), the restricted Laplacian \(L_{\text{free}}(\sigma; U)\) associated to the boundary set \(U = \{v_x, v_{\hat{y}}\}\) is positive semidefinite, and it is positive definite when the output edge is either the identity or sector-monotone (Def. 10.3). In the positive definite case the two-sided Dirichlet problem has a unique minimizer \(c^\star = c^\star(\theta, \sigma; x, y) \in C^0(G; \mathcal{F}_\theta)\).

The minimizer \(c^\star\) interpolates between \(x\) and \(y\) through the current sheaf. The residual Dirichlet energy at the minimizer, \[R(\theta; x, y) := \min_{c : c_{v_x} = x,\, c_{v_{\hat{y}}} = y} E(c; \sigma(c)), \tag{11.1}\] measures how incompatibly the current network connects \(x\) to \(y\). Zero residual means the network maps \(x\) to \(y\) exactly.

Training loss as residual Dirichlet energy

Definition 11.3 (Sheaf training loss). For a training set \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n_{\text{train}}}\), the sheaf training loss is the average residual Dirichlet energy: \[\mathcal{L}(\theta) := \frac{1}{n_{\text{train}}} \sum_{i=1}^{n_{\text{train}}} R(\theta; x_i, y_i).\]

When \(\mathcal{L}(\theta^\star) = 0\), the network fits the training set exactly; in general \(\mathcal{L}\) plays the role that the mean squared regression residual plays in ordinary training.
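To make the comparison with squared loss concrete, the residual has a closed form in the simplest case. The following is a sketch under assumptions not made above: one hidden layer, scalar output with identity output edge, and the activation pattern \(\sigma\) held fixed rather than re-evaluated at the minimizer. The four edge discrepancies are \(z^{(1)} - W^{(1)} x - b^{(1)}\), \(a^{(1)} - R_\sigma z^{(1)}\), \(z^{(2)} - W^{(2)} a^{(1)} - b^{(2)}\), and \(y - z^{(2)}\). Stacked, they are affine in the three free stalks, and the image of the linear part misses exactly one direction, spanned by \(n = (R_\sigma (W^{(2)})^T,\ (W^{(2)})^T,\ 1,\ 1)\). The minimum energy is therefore the squared component of the stacked discrepancy along \(n\): \[R_\sigma(\theta; x, y) = \frac{\bigl(f_\sigma(x) - y\bigr)^2}{2\,\bigl(2 + \|W^{(2)}\|^2 + \|W^{(2)} R_\sigma\|^2\bigr)}, \qquad f_\sigma(x) = W^{(2)} R_\sigma \bigl(W^{(1)} x + b^{(1)}\bigr) + b^{(2)}.\] In this regime the residual is the squared prediction error divided by a weight-dependent factor, so it vanishes exactly when the \(\sigma\)-linearized forward pass hits the target.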

Joint state–parameter dynamics

The joint flow treats the cochain \(c\) and the parameters \(\theta\) as evolving together, both driven by Dirichlet energy minimization. Parameter gradients are computed edgewise: each edge’s restriction map contains some parameter block, and that block’s gradient is a rank-one outer product of the edge’s discrepancy with the stalk value at the edge’s source end (Prop. 11.5).

Definition 11.4 (Joint state–parameter flow; [1] §5.2). For a single pinned pair \((x, y)\), the joint flow is the system \[\begin{aligned} \dot{c}(t) &= -\eta_c\, \nabla_c E(c(t); \sigma(c(t)), \theta(t)), & \text{(cochain flow, free coordinates only)} \\ \dot{\theta}(t) &= -\eta_\theta\, \nabla_\theta E(c(t); \sigma(c(t)), \theta(t)), & \text{(parameter flow)} \end{aligned}\] with \(c_{v_x}(t) = x\) and \(c_{v_{\hat{y}}}(t) = y\) pinned for all \(t\), and rate constants \(\eta_c, \eta_\theta > 0\).

For a batch \(\{(x_i, y_i)\}_{i=1}^B\), the joint flow carries \(B\) independent cochains \(c^{(i)}(t)\) (one per example) sharing a single \(\theta(t)\); the parameter update sums (or averages) the per-example gradients: \[\dot{\theta}(t) = -\frac{\eta_\theta}{B} \sum_{i=1}^B \nabla_\theta E(c^{(i)}(t); \sigma(c^{(i)}(t)), \theta(t)).\]
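A minimal explicit-Euler discretization of the single-pair flow can make the definition concrete. Everything specific in this sketch is an assumption chosen for illustration: the [2, 2, 1] weights, the step size `dt`, the rate constants, the number of steps, and the choice to treat \(\sigma\) as locally constant when differentiating. The function names `residuals`, `energy`, and `joint_flow` are hypothetical.

```python
import numpy as np

def residuals(c, theta, x, y):
    """Edge discrepancies (delta c)_e of a one-hidden-layer path sheaf."""
    z1, a1, z2 = c
    W1, b1, W2, b2 = theta
    m = (z1 > 0).astype(float)   # activation pattern sigma(c)
    r1 = z1 - W1 @ x - b1        # affine edge 1
    r2 = a1 - m * z1             # ReLU edge
    r3 = z2 - W2 @ a1 - b2       # affine edge 2
    r4 = y - z2                  # identity output edge
    return m, r1, r2, r3, r4

def energy(c, theta, x, y):
    _, *r = residuals(c, theta, x, y)
    return 0.5 * sum(float(np.sum(ri ** 2)) for ri in r)

def joint_flow(c, theta, x, y, eta_c=1.0, eta_theta=0.05, dt=0.1, steps=5000):
    """Explicit-Euler steps of the joint flow with x and y pinned throughout."""
    z1, a1, z2 = c
    W1, b1, W2, b2 = theta
    for _ in range(steps):
        m, r1, r2, r3, r4 = residuals((z1, a1, z2), (W1, b1, W2, b2), x, y)
        # grad_c E on the free coordinates (sigma treated as locally constant).
        g_z1, g_a1, g_z2 = r1 - m * r2, r2 - W2.T @ r3, r3 - r4
        # grad_theta E: each block uses only its own edge residual and source stalk.
        g_W1, g_b1 = -np.outer(r1, x), -r1
        g_W2, g_b2 = -np.outer(r3, a1), -r3
        z1 = z1 - dt * eta_c * g_z1
        a1 = a1 - dt * eta_c * g_a1
        z2 = z2 - dt * eta_c * g_z2
        W1, b1 = W1 - dt * eta_theta * g_W1, b1 - dt * eta_theta * g_b1
        W2, b2 = W2 - dt * eta_theta * g_W2, b2 - dt * eta_theta * g_b2
    return (z1, a1, z2), (W1, b1, W2, b2)

# Hypothetical weights, chosen only so that the forward pass at x outputs 2.
theta0 = (np.array([[1.0, 0.0], [0.0, 0.5]]), np.zeros(2),
          np.array([[1.0, 1.0]]), np.zeros(1))
x, y = np.array([1.0, 2.0]), np.array([3.0])
z1_0 = theta0[0] @ x + theta0[1]
a1_0 = np.maximum(z1_0, 0.0)
c0 = (z1_0, a1_0, theta0[2] @ a1_0 + theta0[3])   # start from the forward pass

E0 = energy(c0, theta0, x, y)
c_T, theta_T = joint_flow(c0, theta0, x, y)
E_T = energy(c_T, theta_T, x, y)
W1, b1, W2, b2 = theta_T
y_hat = W2 @ np.maximum(W1 @ x + b1, 0.0) + b2    # forward pass with the adapted weights
print(E0, E_T, y_hat)
```

The cochain and the weights are updated in the same loop iteration, with no separate forward or backward phase; the only difference between them is the rate constant. For a batch, one would carry one `(z1, a1, z2)` triple per example and average the `g_W`, `g_b` terms before the parameter step.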

Locality of the parameter gradient

Proposition 11.5 (Parameter gradient is local). For an affine edge \(e_\ell^{\text{aff}}\) with weight \(W^{(\ell)}\) and bias \(b^{(\ell)}\), \[\nabla_{W^{(\ell)}} E(c; \sigma, \theta) = (W^{(\ell)} c_{a^{(\ell-1)}} + b^{(\ell)} - c_{z^{(\ell)}})\, (c_{a^{(\ell-1)}})^T = -(\delta c)_{e_\ell^{\text{aff}}}\, (c_{a^{(\ell-1)}})^T.\] The gradient at edge \(e_\ell^{\text{aff}}\) depends only on the two endpoint cochain values \(c_{a^{(\ell-1)}}\), \(c_{z^{(\ell)}}\) and, through them, on the edge’s discrepancy \((\delta c)_{e_\ell^{\text{aff}}}\); it uses no information about the rest of the graph.

The same locality holds for every other parameterized edge. This is the sense in which sheaf training is backprop-free: each weight’s update uses only its own two endpoints.
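The outer-product formula is easy to check numerically on a single affine edge. In the sketch below the shapes, the random values, and the helper name `edge_energy` are all hypothetical; the check compares the closed form against central finite differences of that one edge’s energy term.

```python
import numpy as np

def edge_energy(W, b, c_a, c_z):
    """Contribution of one affine edge to the Dirichlet energy."""
    return 0.5 * float(np.sum((W @ c_a + b - c_z) ** 2))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))   # hypothetical weight block W^(l)
b = rng.normal(size=3)        # hypothetical bias b^(l)
c_a = rng.normal(size=2)      # source stalk value c_{a^(l-1)}
c_z = rng.normal(size=3)      # target stalk value c_{z^(l)}

# Closed form: (W c_a + b - c_z) times c_a transposed, a rank-one matrix.
analytic = np.outer(W @ c_a + b - c_z, c_a)

# Central finite differences, one weight entry at a time.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        step = np.zeros_like(W)
        step[i, j] = eps
        numeric[i, j] = (edge_energy(W + step, b, c_a, c_z)
                         - edge_energy(W - step, b, c_a, c_z)) / (2 * eps)

max_err = float(np.max(np.abs(analytic - numeric)))
rank = int(np.linalg.matrix_rank(analytic))
print(max_err, rank)
```

Note that `edge_energy` takes only the edge’s own parameters and its two endpoint values as arguments, so nothing from the rest of the graph can enter the gradient.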

Stationary points

Theorem 11.6 (Stationary points characterize trained networks). A pair \((c^\star, \theta^\star)\) is a stationary point of the joint flow (Def. 11.4) on batch \(\mathcal{D}\) if and only if:

(i) For every example \(i\), \(c^{(i)\star}\) is the unique minimizer of \(E(\cdot; \sigma(\cdot), \theta^\star)\) subject to \(c^{(i)}_{v_x} = x_i\) and \(c^{(i)}_{v_{\hat{y}}} = y_i\) (i.e., the harmonic extension of the two-sided boundary data).

(ii) \(\nabla_\theta \mathcal{L}(\theta^\star) = 0\), where \(\mathcal{L}\) is the sheaf training loss of Def. 11.3.

At a stationary point, the fast cochain dynamics has equilibrated and the residual Dirichlet energy has vanishing gradient in \(\theta\). This is the sheaf-theoretic analogue of the critical-point condition \(\nabla L(\theta^\star) = 0\) in SGD.

Remark 11.7 (Relation to equilibrium propagation). Def. 11.4 generalizes equilibrium propagation [2]: its two-phase clamped/unclamped energy-based training is the special case of the joint flow in which the cochain equilibrates exactly between parameter updates. The sheaf-theoretic packaging makes the graph structure underlying the local update rule explicit (the cellular sheaf replaces the energy function of Hopfield/EP networks), and covers both regression (identity output) and classification (nonlinear output via Ch. 10) uniformly.

Remark 11.8 (Convergence and timescales). Local convergence of the joint flow around a stationary point \((c^\star, \theta^\star)\) is treated in Ch. 12 under a timescale-separation assumption \(\eta_\theta \ll \eta_c\), which reduces the slow \(\theta\)-flow to a gradient flow of \(\mathcal{L}(\theta)\) with the fast cochain always at its equilibrium \(c^\star(\theta)\).
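The reduction to a gradient flow of \(\mathcal{L}(\theta)\) rests on an envelope identity: at the equilibrium cochain, the local gradient of Prop. 11.5 equals the total derivative of the residual \(R\) with respect to the weights. The NumPy sketch below checks this numerically for \(W^{(2)}\). It assumes hypothetical [2, 2, 1] weights and holds \(\sigma\) fixed at the all-active pattern; the helper name `equilibrium` is also an assumption of this example.

```python
import numpy as np

def equilibrium(W1, b1, W2, b2, x, y):
    """Two-sided harmonic extension and residual energy with all ReLUs active.

    Holding sigma fixed at the all-active pattern is an assumption of this sketch.
    """
    h, o = W1.shape[0], W2.shape[0]
    Ih, Io = np.eye(h), np.eye(o)
    # Rows: affine-1, ReLU, affine-2, output edge. Columns: z1, a1, z2.
    D_free = np.block([
        [Ih,               np.zeros((h, h)), np.zeros((h, o))],
        [-Ih,              Ih,               np.zeros((h, o))],
        [np.zeros((o, h)), -W2,              Io],
        [np.zeros((o, h)), np.zeros((o, h)), -Io],
    ])
    rhs = np.concatenate([W1 @ x + b1, np.zeros(h), b2, -y])
    c, *_ = np.linalg.lstsq(D_free, rhs, rcond=None)
    res = D_free @ c - rhs
    return c[:h], c[h:2 * h], c[2 * h:], 0.5 * float(res @ res)

# Hypothetical weights for a [2, 2, 1] network (illustration only).
W1, b1 = np.array([[1.0, 0.0], [0.0, 0.5]]), np.zeros(2)
W2, b2 = np.array([[1.0, 1.0]]), np.zeros(1)
x, y = np.array([1.0, 2.0]), np.array([3.0])

z1, a1, z2, R0 = equilibrium(W1, b1, W2, b2, x, y)

# Local formula of Prop. 11.5, evaluated at the equilibrium cochain.
local_grad = np.outer(W2 @ a1 + b2 - z2, a1)

# Total derivative of R(theta) in W2, re-solving for the equilibrium each time.
eps = 1e-6
total_grad = np.zeros_like(W2)
for j in range(W2.shape[1]):
    dW = np.zeros_like(W2)
    dW[0, j] = eps
    R_plus = equilibrium(W1, b1, W2 + dW, b2, x, y)[3]
    R_minus = equilibrium(W1, b1, W2 - dW, b2, x, y)[3]
    total_grad[0, j] = (R_plus - R_minus) / (2 * eps)

print(local_grad, total_grad)
```

The finite-difference loop re-solves the Dirichlet problem for each perturbed weight, so it accounts for the motion of \(c^\star(\theta)\); agreement with `local_grad` shows that this motion contributes nothing to first order.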

13.5 Theorem demonstrations

Proof of Prop. 11.2. The restricted Laplacian \(L_{\text{free}}(\sigma; U)\) is the submatrix of the full sheaf Laplacian \(L_\mathcal{F}(\sigma) = \delta(\sigma)^T \delta(\sigma)\) obtained by deleting the rows and columns indexed by \(U\); equivalently, \(L_{\text{free}} = \delta_\Omega^T \delta_\Omega\), where \(\delta_\Omega\) is \(\delta\) restricted to interior columns. It is symmetric (as a principal submatrix of a symmetric matrix) and positive semidefinite (as a principal submatrix of a PSD matrix). For strict positive definiteness: let \(v \in C^0(\Omega; \mathcal{F})\) with \(L_{\text{free}}(\sigma; U) v = 0\). Then \(v^T L_{\text{free}} v = \|\delta_\Omega(\sigma) v\|^2 = 0\). Extend \(v\) by zero on \(U\) to \(\tilde v \in C^0(G; \mathcal{F})\); then \(\delta \tilde v = \delta_\Omega v = 0\), so \(\tilde v \in \ker \delta\). Under the identity output hypothesis, Lem. 8.2 applies with the larger boundary set \(U = \{v_x, v_{\hat y}\}\): removing the output vertex keeps \(\delta_\Omega\) block lower-triangular with identity diagonal (the output edge’s downstream identity is simply deleted), so \(\delta_\Omega\) is still injective and \(v = 0\). Under the sector-monotone output (Def. 10.3), the output-edge contribution adds a positive-definite block, so the sum is positive definite. Uniqueness of the minimizer follows from strict convexity of the restricted quadratic. \(\square\)

Proof of Prop. 11.5. Write Dirichlet energy as \(E(c; \sigma, \theta) = \tfrac{1}{2} \sum_e \|(\delta c)_e\|^2\). Only one edge, namely \(e_\ell^{\text{aff}}\), carries the parameter \(W^{(\ell)}\) in its restriction map. Differentiating just that edge’s contribution, \[\frac{\partial}{\partial W^{(\ell)}}\, \tfrac{1}{2} \|W^{(\ell)} c_{a^{(\ell-1)}} + b^{(\ell)} - c_{z^{(\ell)}}\|^2 = (W^{(\ell)} c_{a^{(\ell-1)}} + b^{(\ell)} - c_{z^{(\ell)}})\, (c_{a^{(\ell-1)}})^T.\] Sign convention: with \((\delta c)_{e_\ell^{\text{aff}}} = c_{z^{(\ell)}} - W^{(\ell)} c_{a^{(\ell-1)}} - b^{(\ell)}\) this is \(-(\delta c)_{e_\ell^{\text{aff}}} (c_{a^{(\ell-1)}})^T\). Every other edge’s restriction map is independent of \(W^{(\ell)}\), so its contribution to \(\partial E / \partial W^{(\ell)}\) vanishes. Hence the gradient involves only the two endpoint values \(c_{a^{(\ell-1)}}, c_{z^{(\ell)}}\), which is the claim. \(\square\)

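Proof of Thm. 11.6. Stationarity of the joint flow means \(\dot c^{(i)} = 0\) and \(\dot \theta = 0\) at \((c^\star, \theta^\star)\). From Def. 11.4, \(\dot c^{(i)} = -\eta_c \nabla_{c^{(i)}} E = 0\) iff \(c^{(i)\star}\) minimizes \(E(\cdot; \sigma, \theta^\star)\) over the affine subspace \(\{c : c_{v_x} = x_i, c_{v_{\hat y}} = y_i\}\); under Prop. 11.2 the minimizer is unique. This is (i). For (ii), \(\dot \theta = 0\) reads \(\nabla_\theta (B^{-1} \sum_i E(c^{(i)\star}; \sigma, \theta)) = 0\), with the gradient taken at fixed cochains. By (i), \(E(c^{(i)\star}; \sigma, \theta^\star) = R(\theta^\star; x_i, y_i)\). Moreover, since \(\nabla_c E\) vanishes on the free coordinates at \(c^{(i)\star}\), the chain rule shows that the dependence of the minimizer on \(\theta\) contributes nothing to first order (with \(\sigma\) locally constant), so \(\nabla_\theta R(\theta^\star; x_i, y_i)\) equals the fixed-cochain gradient \(\nabla_\theta E(c^{(i)\star}; \sigma, \theta^\star)\). Averaging over \(i\) gives \(\nabla_\theta \mathcal{L}(\theta^\star) = 0\). Conversely, if (i) and (ii) hold at \((c^\star, \theta^\star)\), the same identity shows that both flow components vanish, giving stationarity. \(\square\)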

13.6 Worked example: two-sided pinning on the [2, 2, 1] sheaf

Continue the [2, 2, 1] sheaf with identity output from Chs. 7–8. Pin both ends: \(c_{v_x} = (1, 2)^T\) and \(c_{v_{\hat{y}}} = y = 3\) (a target that does not match the current forward-pass value of \(2\)).

Interior vertices. \(\Omega = \{v_{z^{(1)}},\ v_{a^{(1)}},\ v_{z^{(2)}}\}\), dimension \(2 + 2 + 1 = 5\). Boundary set \(U = \{v_x,\ v_{\hat{y}}\}\), pinned values \(u = ((1, 2),\ 3)\).

Free Laplacian. Remove the \(v_{\hat{y}}\)-block from the matrix in Ch. 9 (now that vertex is on the boundary, not in the interior). The new \(L_{\text{free}}(\sigma; U)\) at \(\sigma = (1, 1)\) is the \(5 \times 5\) matrix \[L_{\text{free}}(\sigma; U) = \begin{pmatrix} 2 I_2 & -I_2 & 0 \\ -I_2 & I_2 + (W^{(2)})^T W^{(2)} & -(W^{(2)})^T \\ 0 & -W^{(2)} & 2 \end{pmatrix},\] where the \((z^{(2)}, z^{(2)})\)-entry is now \(2\) (two incident edges, both linear: \(e_2^{\text{aff}}\) with \(I_1\) restriction on the \(z^{(2)}\)-side and \(e^{\text{out}}\) with \(I_1\) restriction on both sides). With \(W^{(2)} = (1\ \ 1)\), the eigenvalues are (numerically) \(\{0.38,\ 0.70,\ 2.00,\ 2.62,\ 4.30\}\), which sum to the trace \(10\). All are positive, so Prop. 11.2 applies.
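The block formula can be checked against \(\delta_\Omega^T \delta_\Omega\) in a few lines. In this NumPy sketch the value \(W^{(2)} = (1\ \ 1)\) is an assumption (the actual weights come from Chs. 7–8 and are not restated in this section), as is the row ordering chosen for the edges.

```python
import numpy as np

W2 = np.array([[1.0, 1.0]])   # assumed value of W^(2); see the note above
I2 = np.eye(2)

# Coboundary restricted to the interior columns (z1, a1, z2) at sigma = (1, 1).
# Rows: affine-1 (2 rows), ReLU (2 rows), affine-2 (1 row), output edge (1 row).
D_omega = np.block([
    [I2,               np.zeros((2, 2)), np.zeros((2, 1))],
    [-I2,              I2,               np.zeros((2, 1))],
    [np.zeros((1, 2)), -W2,              np.ones((1, 1))],
    [np.zeros((1, 2)), np.zeros((1, 2)), -np.ones((1, 1))],
])
L_from_delta = D_omega.T @ D_omega

# The block matrix displayed in the text.
L_blocks = np.block([
    [2 * I2,           -I2,            np.zeros((2, 1))],
    [-I2,              I2 + W2.T @ W2, -W2.T],
    [np.zeros((1, 2)), -W2,            2 * np.ones((1, 1))],
])

blocks_match = bool(np.allclose(L_from_delta, L_blocks))
eigs = np.linalg.eigvalsh(L_blocks)   # ascending order
print(blocks_match, eigs, eigs.sum())
```

Positive definiteness shows up as a strictly positive smallest eigenvalue, and the eigenvalue sum can be compared with the trace of the block matrix as a quick sanity check.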

Harmonic extension with two-sided boundary. Solve the linear system \(L_{\text{free}}(\sigma; U)\, c_\Omega = -L[\Omega, U]\, u\) for the interior cochain \(c_\Omega\). The right-hand side combines the pinned input contribution \(W^{(1)} x + b^{(1)} = (1, 1)^T\) and the pinned output contribution \(y = 3\).

With \(W^{(2)} = (1\ \ 1)\) and \(b^{(2)} = 0\), carrying out the algebra (or solving numerically) gives \[c^\star_{z^{(1)}} = (\tfrac{7}{6},\ \tfrac{7}{6}) \approx (1.17,\ 1.17), \quad c^\star_{a^{(1)}} = (\tfrac{4}{3},\ \tfrac{4}{3}) \approx (1.33,\ 1.33), \quad c^\star_{z^{(2)}} = \tfrac{17}{6} \approx 2.83.\]

Notice what happened: the interior cochain is pulled between the two boundaries. \(c_{z^{(1)}}\) is no longer exactly \(W^{(1)} x + b^{(1)} = (1, 1)\); it has shifted toward values consistent with the output target. Likewise \(c^\star_{z^{(2)}}\) sits between the value \(W^{(2)} c^\star_{a^{(1)}} \approx 2.67\) arriving from the input side and the target \(y = 3\).

Residual Dirichlet energy. Plugging \(c^\star\) back into \(E(c; \sigma)\): \[R(\theta; x, y) = E(c^\star; \sigma) = \tfrac{1}{2} \sum_e d_e(c^\star) = \tfrac{1}{12} \approx 0.083.\] The residual is nonzero because the current weights cannot map \(x = (1, 2)\) to \(y = 3\) exactly. If we let \(\theta\) flow too (Def. 11.4), \(R\) will decrease.

Per-edge discord at the two-sided equilibrium.

  • \(d_{e_1^{\text{aff}}}(c^\star) = \|W^{(1)} x + b^{(1)} - c^\star_{z^{(1)}}\|^2 = \|(1, 1) - (1.17, 1.17)\|^2 \approx 0.056\).
  • \(d_{e_1^{\text{act}}}(c^\star) = \|R_\sigma c^\star_{z^{(1)}} - c^\star_{a^{(1)}}\|^2 = \|(1.17, 1.17) - (1.33, 1.33)\|^2 \approx 0.056\).
  • \(d_{e_2^{\text{aff}}}(c^\star) = (W^{(2)} c^\star_{a^{(1)}} - c^\star_{z^{(2)}})^2 = (2.67 - 2.83)^2 \approx 0.028\).
  • \(d_{e^{\text{out}}}(c^\star) = (c^\star_{z^{(2)}} - y)^2 = (2.83 - 3)^2 \approx 0.028\).

Sum \(\approx 0.167 = 2 R(\theta; x, y)\), matching Prop. 13.2. Every scalar component of every edge discrepancy has magnitude \(\tfrac{1}{6}\), so the two first-layer edges (two components each) carry a third of the total discord apiece, and the second affine edge and the output edge a sixth each. Under the current weights the disagreement between input and target is spread along the whole path rather than concentrated at the output. Per-edge profiles of this kind are the diagnostic signature of Ch. 13.

One step of parameter update. By Prop. 11.5, the local update to \(W^{(2)}\) is \[\Delta W^{(2)} = -\eta_\theta\, (W^{(2)} c^\star_{a^{(1)}} - c^\star_{z^{(2)}})\, (c^\star_{a^{(1)}})^T \approx -\eta_\theta \cdot (-0.17) \cdot (1.33,\ 1.33) \approx \eta_\theta\, (0.22,\ 0.22),\] which raises both output weights so that \(W^{(2)} c^\star_{a^{(1)}}\) moves up toward \(c^\star_{z^{(2)}}\). Similarly, \(W^{(1)}\)’s update is \[\Delta W^{(1)} = -\eta_\theta\, (W^{(1)} x + b^{(1)} - c^\star_{z^{(1)}})\, x^T \approx -\eta_\theta \cdot (-0.17, -0.17)^T \cdot (1, 2) \approx \eta_\theta \begin{pmatrix} 0.17 & 0.33 \\ 0.17 & 0.33 \end{pmatrix},\] pushing \(W^{(1)}\) to increase its pre-activations so that \(W^{(1)} x + b^{(1)}\) aligns with \(c^\star_{z^{(1)}}\). This is the backprop-free update rule: each weight uses only its endpoint cochains.
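The numbers in this example can be recomputed directly. The NumPy sketch below solves the two-sided problem through the normal equations and then evaluates the per-edge discords and the local update directions. The weights are assumptions: this section fixes only \(W^{(1)} x + b^{(1)} = (1, 1)\) and a forward-pass output of \(2\), so `W1`, `b1`, `W2`, `b2` below are one hypothetical choice consistent with those facts.

```python
import numpy as np

# Assumed weights (hypothetical; chosen so that W1 @ x + b1 = (1, 1) and forward(x) = 2).
W1, b1 = np.array([[1.0, 0.0], [0.0, 0.5]]), np.zeros(2)
W2, b2 = np.array([[1.0, 1.0]]), np.zeros(1)
x, y = np.array([1.0, 2.0]), np.array([3.0])
I2 = np.eye(2)

# Coboundary on interior columns (z1, a1, z2) at sigma = (1, 1).
# Rows: affine-1, ReLU, affine-2, output edge.
D_omega = np.block([
    [I2,               np.zeros((2, 2)), np.zeros((2, 1))],
    [-I2,              I2,               np.zeros((2, 1))],
    [np.zeros((1, 2)), -W2,              np.ones((1, 1))],
    [np.zeros((1, 2)), np.zeros((1, 2)), -np.ones((1, 1))],
])
# Pinned input, biases, and pinned output moved to the right-hand side.
rhs = np.concatenate([W1 @ x + b1, np.zeros(2), b2, -y])

# Normal equations: L_free c = D_omega^T rhs.
L_free = D_omega.T @ D_omega
c = np.linalg.solve(L_free, D_omega.T @ rhs)
z1, a1, z2 = c[:2], c[2:4], c[4:]

# Per-edge discords and residual energy at the equilibrium.
r = D_omega @ c - rhs
discord = {
    "aff1": float(r[:2] @ r[:2]),
    "act": float(r[2:4] @ r[2:4]),
    "aff2": float(r[4] ** 2),
    "out": float(r[5] ** 2),
}
R = 0.5 * sum(discord.values())

# Local update directions (Delta W divided by eta_theta), per Prop. 11.5.
dW1 = -np.outer(W1 @ x + b1 - z1, x)
dW2 = -np.outer(W2 @ a1 + b2 - z2, a1)

print(z1, a1, z2)
print(discord, R)
print(dW1, dW2)
```

Only `W1 @ x + b1` enters the equilibrium, so any other first-layer weights with the same pre-activation at \(x\) would give the same interior cochain; the update `dW1` depends on \(x\) itself but not on the particular choice of `W1`.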

13.7 Coding lab

Lab 11 — Output Pinning — Clamp both endpoints of a small trained MLP: input to a training point, output to its target. Solve the two-sided Dirichlet problem (Prop. 11.2) and plot the interior cochain as a “string draped between two endpoints.” Compute the residual Dirichlet energy \(R(\theta; x, y)\) and the per-edge discords. Then add a few slow gradient steps in \(\theta\) and watch the string go slack as weights adapt. Optional: experiment with hidden-neuron pinning (counterfactual editing) — fix an interior vertex to a target value and observe how it reshapes both upstream and downstream cochains.

13.8 Exercises

  1. (warm-up) For the [2, 2, 1] worked example with pinning \(x = (1,2)\), \(y = 3\), verify the reported equilibrium \(c^\star_{z^{(1)}} \approx (1.17, 1.17)\) by solving the \(5 \times 5\) linear system directly.
  2. (warm-up) Show that if \(y = \mathtt{forward}(x)\) (the current network already fits the pair), the two-sided-pinned harmonic extension equals the input-only harmonic extension. Conclude that \(R(\theta; x, y) = 0\).
  3. (intermediate) Derive the parameter-gradient formula in Prop. 11.5 for the ReLU edge’s restriction map (which carries no weights) and confirm \(\nabla_{R_\sigma} E \equiv 0\) — ReLU edges contribute to the cochain gradient but not the parameter gradient.
  4. (intermediate) Consider pinning only a single hidden-layer neuron \(c_{a^{(\ell)}, j} = u\) (with input and output unpinned). What Dirichlet problem does this define? Write down the boundary set \(U\) and the resulting \(L_{\text{free}}(\sigma; U)\).
  5. (project) Implement the joint flow of Def. 11.4 as a pair of coupled ODEs and reproduce Theorem 11.6 empirically: initialize near a stationary point, perturb, and verify the pair returns to \((c^\star, \theta^\star)\).
  6. (advanced) Investigate whether the two-sided Dirichlet minimizer is always in the activation region of the forward-pass pattern at \(x\). Construct an example (or prove the claim) where pinning the output shifts the realized \(\sigma\) at \(c^\star\) relative to \(\sigma(\mathtt{forward}(x))\).

13.9 Further reading

[1] §5.1–5.2 is the primary reference for partial clamping and joint state-parameter dynamics. [2] equilibrium propagation is the closest antecedent in the machine-learning literature; compare the two-phase (clamped/unclamped) energy decomposition with the joint flow above. For hidden-neuron pinning / counterfactual editing, the activation-editing literature on generative networks (e.g., Bau et al.’s work on GAN dissection) is a practice-side companion. [3] presents predictive-coding’s analogous local-update rule in a neuroscience-flavored setting.

13.10 FAQ / common misconceptions

Is the two-sided-pinned equilibrium the same as “fitting” \((x, y)\)? Fitting means \(R(\theta; x, y) = 0\): the network can connect \(x\) to \(y\) with zero sheaf tension. The equilibrium exists at any \(\theta\); it is “fitted” only when the tension is zero.

Why can I pin any vertex? Isn’t the input special? Nothing in the sheaf structure privileges the input vertex. Pinning is Dirichlet data; any subset of vertices can carry it. The specialness of input/output is architectural convention, not mathematical.

How is this different from Lagrangian training (adding constraints to the loss)? Lagrangian training imposes \(c_{v_x} = x\) and \(c_{v_{\hat y}} = y\) via multipliers and descent on an augmented loss. Two-sided pinning eliminates the pinned coordinates entirely: they are fixed boundary data, never variables. The resulting Dirichlet problem has fewer unknowns and no multiplier dynamics.

Does pinning the output mean the network is trained? No. Pinning the output during a single forward cochain relaxation fixes the target; letting \(\theta\) flow simultaneously is what constitutes training (Ch. 12). Pinning alone just asks “what interior state is consistent with both endpoints under the current weights.”