14 Backprop-Free Training, Batches, and Timescale Separation
Purpose. Spells out the training algorithm: joint diffusion with timescale separation, batch processing, and how it relates to (but differs from) SGD.
14.1 Key concepts & results
- Timescale separation: fast cochain dynamics vs slow weight dynamics; adiabatic approximation recovers a local, backprop-flavored update rule.
- Batch processing: per-example pinned diffusions sharing parameters; aggregate weight update.
- Comparison with SGD: scaling laws predicted from the spectral gap of L_free.
- Honest discussion of the performance gap: sheaf-based training is not yet competitive with SGD, but obeys quantitative theory-predicted scaling.
Prerequisites: Ch 11
14.2 Motivating example
Train the [2, 30, 1] network from [1] §6 on the paraboloid task \(f(x_1, x_2) = x_1^2 + x_2^2 - \tfrac{2}{3}\) on \([-2, 2]^2\), using sheaf-based training instead of SGD. For each of \(N\) training pairs \((x_i, y_i)\) in the current batch, run the two-sided-pinned diffusion of Ch. 11 on a copy of the sheaf until the cochain \(x^{(i)}\) has equilibrated. Read off the per-example Dirichlet energy \(E(x^{(i)}; \theta)\) and its gradient with respect to the shared parameters \(\theta\) — a local edge-by-edge computation, no chain rule. Average the gradients across the batch, take one small step in \(\theta\), and repeat.
Run this for a few hundred epochs on \(N = 200\) random training points. The final MSE is comparable to an SGD baseline on the same architecture. The number of steps to reach that MSE is larger — by a factor that scales like \(1/\lambda_{\min}(L_{\text{free}})\), exactly as [1] predict. The scaling \(\beta \sim 1/n_{\text{train}}\) confirmed in their Fig. 10 falls out of the same bound.
14.3 Intuition
The joint flow of Ch. 11 has two natural timescales. The cochain \(x\) is fast: within each pinned diffusion it equilibrates on the timescale \(1/\lambda_{\min}(L_{\text{free}})\), which is the spectral-gap rate of Ch. 9. The parameters \(\theta\) are slow: they only move once per batch, and each move is proportional to the current per-edge discords. Whenever a dynamical system has a fast variable and a slow variable, the standard move is adiabatic approximation: hold \(\theta\) fixed long enough for \(x\) to reach its \(\theta\)-dependent equilibrium \(x^\star(\theta)\), substitute \(x^\star(\theta)\) into the \(\theta\)-update, and ignore any transient \(x\)-dynamics. The result is a reduced flow on \(\theta\) alone whose fixed points coincide with fixed points of the joint flow.
What does this adiabatic \(\theta\)-update look like? Because the gradient of Dirichlet energy \(E(x; \theta) = \tfrac{1}{2} \|\delta_\theta x\|^2\) with respect to \(\theta\) is a sum of per-edge contributions — and each per-edge contribution only touches the two stalks \(x_u\) and \(x_v\) that the edge connects — the \(\theta\)-update is fully local. No backward pass through the graph is needed; every edge updates using only its own endpoints’ equilibrium values. This is the sheaf-side analogue of contrastive Hebbian learning and equilibrium propagation, and it is the sense in which sheaf training is “backprop-free.”
Batches are the natural unit of parallelism. Each training pair \((x_i, y_i)\) produces its own pinned sheaf with its own equilibrium cochain \(x^{(i)\star}(\theta)\); the \(\theta\)-update is then the batch-mean of the per-example edge contributions. No shared buffers, no cross-example dependencies during the fast phase — a cleanly parallel structure that maps naturally onto distributed hardware.
The honest discussion [1] §6.2 asks for, and this chapter repeats: the sheaf training rule is not yet competitive with SGD in wall-clock or final accuracy on hard tasks. What it has going for it is (a) locality — no chain-rule propagation — which is attractive for hardware and biologically-plausible learning models; (b) a theory-predicted scaling, \(\beta \sim 1/n_{\text{train}}\), that SGD does not have; and (c) a framework that extends unchanged to partial clamping and to the parameter sheaf \(\mathcal{H}_W\) of App. B.1 of the paper, both of which are awkward to express in SGD language.
Intuition device (planned): Dual time-axis diagram (fast x, slow θ) + bar chart of operations-per-step vs SGD.
14.4 Formal development
Work in the joint state–parameter framework of Ch. 11. Let \(B\) be the batch size, \(c^{(i)}\) the per-example cochain with \(c^{(i)}_{v_x} = x_i\) and \(c^{(i)}_{v_{\hat{y}}} = y_i\) pinned, \(\theta\) the shared parameters, and \(E^{(i)}(\theta) = \min_{c : c_{v_x} = x_i, c_{v_{\hat{y}}} = y_i} E(c; \sigma(c), \theta) = R(\theta; x_i, y_i)\) the per-example residual from (11.1).
Timescale separation
Definition 14.1 Definition 12.1 (Timescale separation). The joint flow (Def. 11.4) satisfies timescale separation with ratio \(\varepsilon = \eta_\theta / \eta_c \ll 1\) when the cochain dynamics equilibrates on the fast timescale \(t_c \sim 1 / (\eta_c \lambda_{\min}^{\text{free}})\) before any significant parameter motion occurs on the slow timescale \(t_\theta \sim 1 / (\eta_\theta \|\nabla_\theta \mathcal{L}\|)\).
The typical implementation runs each cochain to convergence (fast phase), then takes a single parameter step (slow phase), then repeats. For a pre-trained network near a flat region of \(\mathcal{L}\), \(\varepsilon \to 0\) recovers the adiabatic approximation.
Adiabatic approximation
Proposition 12.2 (Adiabatic reduction — [1] §5.3). Under timescale separation (Def. 12.1), the parameter flow reduces to a gradient flow on \(\mathcal{L}(\theta)\): \[\dot{\theta} = -\eta_\theta\, \nabla_\theta \mathcal{L}(\theta) + O(\varepsilon),\] where the gradient is evaluated at the per-example equilibrium cochains \(c^{(i)\star}(\theta)\). Consequently, fixed points of the adiabatic flow coincide with stationary points of \(\mathcal{L}\) (Theorem 11.6).
The \(O(\varepsilon)\) term is a transient from finite fast-phase convergence; Proposition 12.2 asserts that no “slow drift” or adiabatic invariant introduces a persistent bias.
The local update rule
Definition 14.2 Definition 12.3 (Local parameter update). Given equilibrium cochains \(c^{(i)\star}(\theta)\) for \(i = 1, \ldots, B\), the local update rule for weight \(W^{(\ell)}\) is \[W^{(\ell)} \leftarrow W^{(\ell)} - \eta_\theta\, \frac{1}{B} \sum_{i=1}^B (\delta c^{(i)\star})_{e_\ell^{\text{aff}}}\, (c^{(i)\star}_{a^{(\ell-1)}})^T,\] with analogous formulas for biases and for any other parameterized restriction map.
Each affine weight is updated using a per-example rank-one outer product of its two endpoint stalks at the equilibrium cochain — no chain-rule factors from elsewhere in the graph (Proposition 11.5). The cost per parameter per step is \(O(B \cdot \text{edge fan-in})\), matching SGD’s \(O(B)\) up to constants.
The sheaf training algorithm
Algorithm 12.4 (Sheaf trainer). Inputs: training set \(\mathcal{D}\), batch size \(B\), fast-phase tolerance \(\tau_c\), slow-phase step \(\eta_\theta\).
- Initialize \(\theta\) at random (e.g., [2]).
- Repeat until the validation residual stabilizes:
- Sample a batch \(\{(x_i, y_i)\}_{i=1}^B \subseteq \mathcal{D}\).
- Fast phase. For each example \(i\), solve the two-sided Dirichlet problem (Prop. 11.2): integrate the pinned cochain flow (or run a direct linear solve at the current \(\sigma\)) to tolerance \(\tau_c\).
- Slow phase. Apply the local update rule (Def. 12.3) using the per-example equilibrium cochains.
- Return \(\theta\).
The fast phase is the only computationally substantial step. At fixed activation pattern \(\sigma\), it reduces to a single \(L_{\text{free}}(\sigma)\)-linear solve of dimension \(\dim \Omega\); across pattern switches, it is iterated.
Convergence rate and scaling
Theorem 14.1 Theorem 12.5 (Training-rate scaling — [1] §5.3). Under Def. 12.1 and the hypotheses of Theorem 10.4, the adiabatic parameter flow converges locally around a stationary point \(\theta^\star\) at rate \[\beta_\theta \geq \eta_\theta\, \lambda_{\min}\!\left(\nabla^2 \mathcal{L}(\theta^\star)\right) \cdot \left(1 + O\!\left(\frac{1}{\lambda_{\min}^{\text{free}}}\right)\right),\] where \(\lambda_{\min}^{\text{free}}\) is the uniform free-Laplacian spectral gap (Cor. 8.4). In particular, the wall-clock rate per sample is suppressed by a factor \(\sim 1/\lambda_{\min}^{\text{free}}\) relative to idealized SGD.
The suppression factor is where sheaf training “loses” to SGD on standard benchmarks: shortening the fast phase increases \(\varepsilon\) and re-introduces the \(O(\varepsilon)\) adiabatic error. A careful operating-point choice trades fast-phase tolerance against slow-phase step size.
Batches, parallelism, and empirical scaling
Proposition 12.6 (Batch parallelism). The \(B\) fast-phase sub-problems in Algorithm 12.4 are independent: they share \(\theta\) but no intermediate cochain state. They can be solved in parallel with no synchronization until the slow-phase aggregation.
Empirically, [1] §6.3 observe a scaling \(\beta \sim 1 / n_{\text{train}}\) in the training convergence rate as a function of training-set size — predicted by the companion paper [3] from the common-Lyapunov bound over a per-example ensemble. This is a quantitative scaling SGD does not have.
Parameter sheaf and regularization-as-augmentation
Two constructions from App. B of [1] fit naturally here.
Definition 14.3 Definition 12.7 (Parameter sheaf \(\mathcal{H}_W\)). The parameter sheaf is the cellular sheaf on the dual graph of \(G\) whose stalks store the weight matrices \(W^{(\ell)}\) and whose edges encode weight compatibility (e.g., dimension matching across layers). The slow-phase parameter flow of Def. 11.4 is precisely a sheaf-diffusion on \(\mathcal{H}_W\).
Definition 14.4 Definition 12.8 (Regularization as augmentation). \(L_2\) regularization \(\mathcal{L}(\theta) + \lambda \|\theta\|^2\) equals Dirichlet energy on the augmented graph \(\tilde{G}\) obtained from \(G\) by adding, for each parameterized edge, a “reluctance edge” to a zero-anchor vertex with weight \(\sqrt{\lambda}\).
Both constructions are cosmetic relabelings that make existing training techniques fit inside the sheaf framework — but they are honest relabelings: \(\mathcal{H}_W\)-diffusion recovers standard SGD-like updates on weights, and augmentation-graph Dirichlet energy recovers standard L2-regularized loss. The payoff is that novel regularization schemes (e.g., non-isotropic reluctance) can be designed by modifying \(\tilde{G}\) rather than by modifying the loss function.
Remark 12.9 (Honest performance comparison). On the benchmark tasks of [1] §6 — paraboloid regression, saddle, circular binary classification, four-class anisotropic blobs — Algorithm 12.4 matches SGD in final test loss on small networks ([2, 4, 1], [2, 30, 1]), but requires more wall-clock time by the factor in Theorem 12.5. On larger networks or harder tasks, the gap widens. The current state is: sheaf training is a theoretically well-founded, fully local alternative to SGD, competitive on small scales, not competitive on production-scale tasks. Closing that gap is an active research program.
14.5 Theorem demonstrations
Proof sketch of Prop. 12.2 (adiabatic reduction). Rescale time by \(\tau = \eta_\theta t\) so that the joint flow (Def. 11.4) reads \[\varepsilon^{-1} \frac{dc^{(i)}}{d\tau} = -\nabla_{c^{(i)}} E^{(i)}(c^{(i)}; \theta), \qquad \frac{d\theta}{d\tau} = -\nabla_\theta \Big[ B^{-1} \sum_i E^{(i)}(c^{(i)}; \theta) \Big],\] with \(\varepsilon = \eta_\theta / \eta_c \ll 1\). This is a standard Tikhonov/singular-perturbation problem (cf. [4] for the Filippov version). On the slow timescale \(\tau\), the fast layer equilibrates at \(c^{(i)} = c^{(i)\star}(\theta)\), the unique minimizer of \(E^{(i)}(\cdot; \theta)\) guaranteed by Prop. 11.2. Substituting into the slow equation and applying the envelope theorem (chain rule for implicitly defined minimizers), \[\nabla_\theta E^{(i)}(c^{(i)\star}(\theta); \theta) = \nabla_\theta R(\theta; x_i, y_i),\] since the \(c\)-gradient vanishes at the minimizer. Hence \(\dot\theta = -\nabla_\theta \mathcal{L}(\theta) + O(\varepsilon)\), where the \(O(\varepsilon)\) captures the finite-rate fast-phase transient. Stationary points of the reduced flow satisfy \(\nabla_\theta \mathcal{L}(\theta^\star) = 0\), matching Thm. 11.6(ii). \(\square\)
Proof sketch of Thm. 12.5 (training-rate scaling). Linearize the adiabatic flow \(\dot\theta = -\eta_\theta \nabla_\theta \mathcal{L}(\theta)\) around a stationary point \(\theta^\star\) with Hessian \(H := \nabla^2 \mathcal{L}(\theta^\star) \succeq 0\). The linearized dynamics is \(\dot{\Delta\theta} = -\eta_\theta H \Delta\theta\), giving local exponential rate \(\eta_\theta \lambda_{\min}(H)\). However, the Hessian is evaluated through the envelope: by the implicit-function theorem applied to the first-order condition \(\nabla_c E^{(i)} = 0\), \[H = \sum_i \Big[ \nabla^2_{\theta\theta} E^{(i)} - \nabla_{\theta c} E^{(i)} (\nabla^2_{cc} E^{(i)})^{-1} \nabla_{c\theta} E^{(i)} \Big]_{c = c^{(i)\star}}.\] The second term is the envelope correction, and its norm is controlled by \(\|(\nabla^2_{cc} E^{(i)})^{-1}\| = 1 / \lambda_{\min}(L_{\text{free}}^{\text{aug}})\). So relative to the idealized rate (no fast-slow coupling), the realized rate picks up a multiplicative \(1 + O(1/\lambda_{\min}^{\text{free}})\) factor. \(\square\)
Proof of Prop. 12.6 (batch parallelism). In the fast phase of Algorithm 12.4, example \(i\)’s cochain flow is \(\dot c^{(i)} = -\nabla_{c^{(i)}} E^{(i)}(c^{(i)}; \theta)\), depending only on \(\theta\) (held constant in the fast phase) and \(c^{(i)}\) itself — no other \(c^{(j)}\) appears on the right-hand side. Hence the \(B\) sub-problems are decoupled and admit trivially parallel integration. Synchronization is only required at the slow-phase aggregation step \(\sum_i \nabla_\theta E^{(i)}\), which is a batched reduce. \(\square\)
14.6 Worked example: one epoch of Algorithm 12.4 on the paraboloid
Sketch one epoch of the sheaf trainer on a small instance of the paraboloid regression task from [1] §6.1.
Task. Learn \(f(x_1, x_2) = x_1^2 + x_2^2 - \tfrac{2}{3}\) on \([-2, 2]^2\). Sample \(n_{\text{train}} = 16\) points \(\{(x_i, y_i)\}\) on a \(4 \times 4\) grid.
Architecture. [2, 4, 1] ReLU network with identity output. Initialize weights with [2]: \[W^{(1)} \sim \mathcal{N}(0, 2/n_0) \in \mathbb{R}^{4 \times 2}, \quad W^{(2)} \sim \mathcal{N}(0, 2/n_1) \in \mathbb{R}^{1 \times 4}, \quad b^{(\ell)} = 0.\]
Hyperparameters. Batch size \(B = 16\) (full batch), fast-phase tolerance \(\tau_c = 10^{-4}\), slow-phase step \(\eta_\theta = 0.01\).
Epoch.
Step 1 (fast phase). For each \((x_i, y_i)\):
- Pin \(c^{(i)}_{v_x} = x_i\) and \(c^{(i)}_{v_{\hat{y}}} = y_i\).
- Initialize \(c^{(i)}\) at the current forward pass of \(x_i\) through the network.
- Solve the two-sided-pinned Dirichlet problem: within a single activation pattern, this is a \(5\)-dimensional linear solve (dimension \(\sum_\ell n_\ell = 4 + 1 = 5\), from the 4-wide hidden layer plus pre-output). Across pattern switches, iterate.
- At convergence, record the equilibrium cochain \(c^{(i)\star}\) and its per-edge discords.
Step 2 (slow phase). Apply Def. 12.3. For the affine edge \(e_1^{\text{aff}}\) carrying \(W^{(1)}\), the batch-averaged update is \[\Delta W^{(1)} = -\frac{\eta_\theta}{B} \sum_{i=1}^{16} (W^{(1)} x_i + b^{(1)} - c^{(i)\star}_{z^{(1)}})\, x_i^T.\] The term in parentheses is the \(e_1^{\text{aff}}\) discord vector for example \(i\), and the outer product with \(x_i^T\) is the local edge gradient (Prop. 11.5). Analogous updates for \(W^{(2)}\), \(b^{(1)}\), \(b^{(2)}\).
Step 3 (diagnostics). Record the batch-mean residual \(\bar{R} = (1/B) \sum_i E(c^{(i)\star}; \sigma^{(i)})\) and the per-edge discord vector \((\bar{d}_e)_{e \in E}\). Log them for the dashboard (Ch. 13).
What to expect. On this task the first few epochs show large output-edge discord \(\bar{d}_{e^{\text{out}}}\) (the network cannot yet express the paraboloid) and small affine discords (weights initialized near agreement). Over time the output-edge discord drops, the ReLU-edge discords redistribute as dead neurons activate, and the total \(\bar{R}\) decreases roughly exponentially. The final MSE is comparable to an SGD baseline on the same architecture but reached in \(\sim 1 / \lambda_{\min}^{\text{free}}\) times as many steps — Theorem 12.5’s prediction, visible as a linear factor on a loss-vs-steps plot.
Comparing operations per step.
| Operation (per example per step) | SGD | Sheaf trainer |
|---|---|---|
| Forward pass | 1 | \(O(\kappa)\) (iterative fast phase, \(\kappa \sim 1/\lambda_{\min}^{\text{free}}\)) |
| Backward pass | 1 | 0 (replaced by local edge computation in slow phase) |
| Parameter update | 1 | 1 |
| Per-edge diagnostic | — | free (by-product of Step 3) |
The asymptotic cost advantage of the sheaf trainer — no backward pass, free diagnostics — is real. The extra \(O(\kappa)\) factor in the forward phase is the current bottleneck, and the heart of the \(1/\lambda_{\min}^{\text{free}}\) scaling in Theorem 12.5.
Joint-dynamics alternative. Instead of Algorithm 12.4’s alternation (fully relax cochain, then step \(\theta\)), one can integrate the joint flow of Def. 11.4 with small \(\eta_\theta / \eta_c\). This avoids the tolerance choice \(\tau_c\) at the cost of \(O(\varepsilon)\) adiabatic error (Prop. 12.2). In the [2, 30, 1] experiments of [1] §6.2, both variants give comparable final loss; Algorithm 12.4 is simpler to tune.
14.7 Coding lab
Lab 12 — Sheaf Trainer — Implement Algorithm 12.4 end-to-end on the [2, 30, 1] paraboloid task. Train with sheaf-based updates and an SGD baseline, both with the same initialization, batch size, and learning rate. Plot training loss vs epoch for both, measure wall-clock time, and confirm the \(1/\lambda_{\min}^{\text{free}}\) slowdown predicted by Thm. 12.5. Extend the lab with a batch-parallelism demonstration: run \(B\) fast-phase cochain solves concurrently (e.g., via joblib) and time the speedup.
14.8 Exercises
- (warm-up) Derive the local update rule for the bias \(b^{(\ell)}\) (analogous to Def. 12.3 for \(W^{(\ell)}\)). Verify it reduces to the edge discord vector itself — no rank-one structure since the bias isn’t multiplied by an endpoint.
- (warm-up) For the [2, 2, 1] network at the equilibrium of the Ch. 11 worked example, apply one step of Def. 12.3 with \(\eta_\theta = 0.1\) and compute the updated \(W^{(1)}\) explicitly.
- (intermediate) Show that in the timescale-separated limit \(\varepsilon \to 0\), the slow flow of Prop. 12.2 is exactly gradient descent on \(\mathcal{L}(\theta)\), with no adiabatic correction.
- (intermediate) Prove Prop. 12.6 directly: write out the per-example gradient \(\nabla_\theta E^{(i)}\) and verify no cross-example coupling appears until the \(\sum_i\) averaging step.
- (project) Fit the \(\beta \sim 1/n_{\text{train}}\) scaling empirically (paper Fig. 10) by running Algorithm 12.4 on paraboloid with \(n_{\text{train}} \in \{16, 64, 256, 1024\}\) and plotting the extracted convergence rate against \(1/n_{\text{train}}\).
- (advanced) Replace the fast-phase cochain flow with a direct linear solve \(y = L_{\text{free}}^{-1} b\) inside each activation region, detecting region switches by sign changes. Measure the practical speedup over iterative integration and discuss the tolerance choice \(\tau_c\).
- (advanced) Add \(L_2\) regularization by augmenting the graph as in Def. 12.8 and verify empirically that the resulting update matches standard SGD + weight decay on the same loss.
14.9 Further reading
[1] §5.3 is the primary reference for the sheaf trainer and the rate-scaling theorem; §6.2 has the empirical comparison to SGD. The companion paper [3] derives the \(\beta \sim 1/n_{\text{train}}\) scaling law. [5] is the most direct precedent for the algorithm in its equilibrium-propagation form; [6] reviews predictive-coding alternatives. For a theoretical framing of adiabatic approximations in two-timescale systems, see Borkar’s monograph on stochastic approximation. [2] is the weight-initialization reference used in the lab.
14.10 FAQ / common misconceptions
Is Algorithm 12.4 a replacement for SGD? Not in 2026 — it is a theoretically well-founded alternative whose empirical performance lags on production-scale tasks. Its current appeal is locality, free diagnostics (Ch. 13), and a theory-predicted scaling SGD doesn’t have.
Does “backprop-free” mean no chain rule at all? It means no global chain-rule propagation through the graph. Each edge still uses elementary calculus locally — computing \((\delta c)_e\) and an outer product — but no information crosses an edge in the weight-update step.
Why does the fast phase cost more than SGD’s forward pass? Because the cochain has to converge to the two-sided-pinned equilibrium, not just propagate once. The cost is \(O(\kappa)\) iterations with \(\kappa \sim 1/\lambda_{\min}^{\text{free}}\). Direct linear solve per activation region is a standard accelerator (Exercise 6).
Can I use momentum / Adam with the sheaf trainer? Yes — the slow-phase update in Def. 12.3 is a gradient, and any standard gradient-based optimizer applies. Whether this helps or hurts is empirical, and to our knowledge not yet benchmarked thoroughly.
What happens when \(\varepsilon\) is not small? Prop. 12.2 introduces an \(O(\varepsilon)\) bias; the joint flow may oscillate or be attracted to spurious fixed points. In practice, either run the fast phase to tight tolerance (Algorithm 12.4) or integrate the joint flow with a small step size.