14 Backprop-Free Training, Batches, and Timescale Separation
Purpose. Spells out the training algorithm: joint diffusion with timescale separation, batch processing, and how it relates to (but differs from) SGD.
14.1 Key concepts & results
- Timescale separation: fast cochain dynamics vs slow weight dynamics; adiabatic approximation recovers a local, backprop-flavored update rule.
- Batch processing: per-example pinned diffusions sharing parameters; aggregate weight update.
- Comparison with SGD: scaling laws predicted from the spectral gap of L_free.
- Honest discussion of the performance gap: sheaf-based training is not yet competitive with SGD, but it obeys scaling laws that the theory predicts quantitatively.
Prerequisites: Ch. 11
14.2 Motivating example
Train the [2, 30, 1] network from [1] §6 on the paraboloid task \(f(x_1, x_2) = x_1^2 + x_2^2 - \tfrac{2}{3}\) on \([-2, 2]^2\), using sheaf-based training instead of SGD. For each of \(N\) training pairs \((x_i, y_i)\) in the current batch, run the two-sided-pinned diffusion of Ch. 11 on a copy of the sheaf until the cochain \(x^{(i)}\) has equilibrated. Read off the per-example Dirichlet energy \(E(x^{(i)}; \theta)\) and its gradient with respect to the shared parameters \(\theta\) — a local edge-by-edge computation, no chain rule. Average the gradients across the batch, take one small step in \(\theta\), and repeat.
Run this for a few hundred epochs on \(N = 200\) random training points. The final MSE is comparable to an SGD baseline on the same architecture. The number of steps to reach that MSE is larger — by a factor that scales like \(1/\lambda_{\min}(L_{\text{free}})\), exactly as [1] predicts. The scaling \(\beta \sim 1/n_{\text{train}}\) confirmed in Fig. 10 of [1] falls out of the same bound.
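The loop above can be sketched in code. This is a minimal sketch, not the \([2, 30, 1]\) setup of [1]: it assumes a toy linear sheaf on a path graph (input, hidden, output stalks with linear restriction maps `W1`, `W2`), so the two-sided-pinned equilibrium has a closed form and the per-edge gradients are explicit outer products. The names `equilibrate` and `edge_grads` are illustrative, not from the paper. The structure is the point: equilibrate per example, read off local gradients, average, step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sheaf: path graph  input -- hidden -- output  with
# linear restriction maps W1 (hidden x in) and W2 (out x hidden).
d_in, d_hid, d_out = 2, 5, 1
W1 = 0.1 * rng.standard_normal((d_hid, d_in))
W2 = 0.1 * rng.standard_normal((d_out, d_hid))

def equilibrate(W1, W2, x0, x2):
    """Two-sided-pinned equilibrium of the hidden cochain x1:
    minimizes E = 0.5||W1 x0 - x1||^2 + 0.5||W2 x1 - x2||^2 over x1."""
    A = np.eye(d_hid) + W2.T @ W2          # free-Laplacian block
    return np.linalg.solve(A, W1 @ x0 + W2.T @ x2)

def edge_grads(W1, W2, x0, x1, x2):
    """Per-edge gradients of E; each touches only its own endpoints."""
    r1 = W1 @ x0 - x1                      # discord on edge 1
    r2 = W2 @ x1 - x2                      # discord on edge 2
    return np.outer(r1, x0), np.outer(r2, x1)

# Toy data: paraboloid targets, as in the motivating task.
N = 64
X = rng.uniform(-2, 2, size=(N, d_in))
Y = (X**2).sum(axis=1, keepdims=True) - 2.0 / 3.0

def batch_energy(W1, W2):
    E = 0.0
    for x0, x2 in zip(X, Y):
        x1 = equilibrate(W1, W2, x0, x2)
        E += 0.5 * np.sum((W1 @ x0 - x1)**2) + 0.5 * np.sum((W2 @ x1 - x2)**2)
    return E / N

E_before = batch_energy(W1, W2)
lr = 0.02
for epoch in range(150):
    G1 = np.zeros_like(W1); G2 = np.zeros_like(W2)
    for x0, x2 in zip(X, Y):
        x1 = equilibrate(W1, W2, x0, x2)   # fast phase, per example
        g1, g2 = edge_grads(W1, W2, x0, x1, x2)
        G1 += g1 / N; G2 += g2 / N
    W1 -= lr * G1; W2 -= lr * G2           # slow phase, one batch step
E_after = batch_energy(W1, W2)
```

A linear sheaf cannot fit a quadratic target, so the final energy is far from zero here; the sketch only verifies that the batched local rule descends the equilibrium energy.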
14.3 Intuition
The joint flow of Ch. 11 has two natural timescales. The cochain \(x\) is fast: within each pinned diffusion it equilibrates on the timescale \(1/\lambda_{\min}(L_{\text{free}})\), which is the spectral-gap rate of Ch. 9. The parameters \(\theta\) are slow: they only move once per batch, and each move is proportional to the current per-edge discords. Whenever a dynamical system has a fast variable and a slow variable, the standard move is adiabatic approximation: hold \(\theta\) fixed long enough for \(x\) to reach its \(\theta\)-dependent equilibrium \(x^\star(\theta)\), substitute \(x^\star(\theta)\) into the \(\theta\)-update, and ignore any transient \(x\)-dynamics. The result is a reduced flow on \(\theta\) alone whose fixed points coincide with fixed points of the joint flow.
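The fast timescale can be checked numerically. The sketch below assumes the same kind of toy linear sheaf (one free hidden stalk, two pinned neighbors), for which the free-Laplacian block is \(I + W_2^\top W_2\); explicit-Euler relaxation of the cochain then contracts each eigenmode by \(|1 - \eta\lambda|\) per step, so equilibration time scales like \(1/\lambda_{\min}(L_{\text{free}})\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical linear sheaf: E(x1) = 0.5||W1 x0 - x1||^2 + 0.5||W2 x1 - x2||^2
# with x0, x2 clamped and only the hidden stalk x1 free.
d_in, d_hid, d_out = 2, 5, 1
W1 = rng.standard_normal((d_hid, d_in))
W2 = rng.standard_normal((d_out, d_hid))
x0 = rng.uniform(-2, 2, d_in)
x2 = rng.uniform(-1, 1, d_out)

L_free = np.eye(d_hid) + W2.T @ W2         # Hessian block on the free node
b = W1 @ x0 + W2.T @ x2
x_star = np.linalg.solve(L_free, b)        # theta-dependent equilibrium

lam = np.linalg.eigvalsh(L_free)           # ascending order
lam_min, lam_max = lam[0], lam[-1]
eta = 0.9 / lam_max                        # stable explicit-Euler step

x1 = np.zeros(d_hid)
err0 = np.linalg.norm(x1 - x_star)
T = 50
for _ in range(T):
    x1 = x1 - eta * (L_free @ x1 - b)      # fast cochain relaxation
errT = np.linalg.norm(x1 - x_star)

# Slowest-mode contraction: (1 - eta*lam_min) per step, so the number
# of steps to equilibrate scales like 1/lam_min.
bound = (1 - eta * lam_min) ** T * err0
```

The observed error after \(T\) steps sits below the slowest-mode bound, which is the spectral-gap rate the adiabatic argument relies on.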
What does this adiabatic \(\theta\)-update look like? Because the gradient of Dirichlet energy \(E(x; \theta) = \tfrac{1}{2} \|\delta_\theta x\|^2\) with respect to \(\theta\) is a sum of per-edge contributions — and each per-edge contribution only touches the two stalks \(x_u\) and \(x_v\) that the edge connects — the \(\theta\)-update is fully local. No backward pass through the graph is needed; every edge updates using only its own endpoints’ equilibrium values. This is the sheaf-side analogue of contrastive Hebbian learning and equilibrium propagation, and it is the sense in which sheaf training is “backprop-free.”
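The "no backward pass" claim can be tested directly in the toy linear-sheaf setting (an assumption of this sketch, not the paper's architecture). Because \(\partial E / \partial x = 0\) on the free stalks at equilibrium, the total derivative of \(E(x^\star(\theta); \theta)\) with respect to an edge's parameters equals the partial derivative at \(x^\star\) — the local outer product of that edge's discord with its endpoint — with no chain rule through \(x^\star(\theta)\). The check below compares the local gradient to a finite difference that re-equilibrates at every probe.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical linear-sheaf toy on a path graph, hidden stalk free.
d_in, d_hid, d_out = 2, 4, 1
W1 = rng.standard_normal((d_hid, d_in))
W2 = rng.standard_normal((d_out, d_hid))
x0 = rng.uniform(-2, 2, d_in)
x2 = rng.uniform(-1, 1, d_out)

def energy_at_equilibrium(W1, W2):
    """Equilibrate x1, then return E(x1*; W1, W2) and x1*."""
    A = np.eye(d_hid) + W2.T @ W2
    x1 = np.linalg.solve(A, W1 @ x0 + W2.T @ x2)
    E = 0.5 * np.sum((W1 @ x0 - x1)**2) + 0.5 * np.sum((W2 @ x1 - x2)**2)
    return E, x1

E0, x1 = energy_at_equilibrium(W1, W2)
g_local = np.outer(W1 @ x0 - x1, x0)       # local edge-1 gradient at x1*

# Central finite difference of E(x1*(W1); W1) through re-equilibration.
eps = 1e-6
g_fd = np.zeros_like(W1)
for i in range(d_hid):
    for j in range(d_in):
        Wp = W1.copy(); Wp[i, j] += eps
        Wm = W1.copy(); Wm[i, j] -= eps
        Ep, _ = energy_at_equilibrium(Wp, W2)
        Em, _ = energy_at_equilibrium(Wm, W2)
        g_fd[i, j] = (Ep - Em) / (2 * eps)
```

The two gradients agree to finite-difference accuracy, which is exactly the envelope-theorem statement behind the backprop-free update.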
Batches are the natural unit of parallelism. Each training pair \((x_i, y_i)\) produces its own pinned sheaf with its own equilibrium cochain \(x^{(i)\star}(\theta)\); the \(\theta\)-update is then the batch-mean of the per-example edge contributions. No shared buffers, no cross-example dependencies during the fast phase — a cleanly parallel structure that maps naturally onto distributed hardware.
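In the toy linear-sheaf setting (again an assumption of this sketch), the independence of the per-example fast phases is visible in the linear algebra: every example shares the same free-Laplacian block but has its own right-hand side, so the whole batch equilibrates as one solve with \(N\) independent columns, and the batch-mean edge update is a single averaged outer product.

```python
import numpy as np

rng = np.random.default_rng(3)

# Shared parameters, per-example pinned data (hypothetical toy sheaf).
d_in, d_hid, d_out, N = 2, 5, 1, 32
W1 = rng.standard_normal((d_hid, d_in))
W2 = rng.standard_normal((d_out, d_hid))
X0 = rng.uniform(-2, 2, (N, d_in))         # batch of clamped inputs
X2 = rng.uniform(-1, 1, (N, d_out))        # batch of clamped outputs

A = np.eye(d_hid) + W2.T @ W2              # shared free-Laplacian block
B = X0 @ W1.T + X2 @ W2                    # one RHS per example (N x d_hid)
X1 = np.linalg.solve(A, B.T).T             # all N equilibria at once

# Batch-mean per-edge update for W2: average of per-example outer products.
R2 = X1 @ W2.T - X2                        # per-example edge-2 discords
G2 = R2.T @ X1 / N                         # (1/N) sum_i outer(r2_i, x1_i)
```

Each column of the solve is one example's equilibrium; nothing couples them until the averaging step, which is the cleanly parallel structure described above.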
The honest discussion that [1] §6.2 asks for, and that this chapter repeats: the sheaf training rule is not yet competitive with SGD in wall-clock time or final accuracy on hard tasks. What it has going for it is (a) locality — no chain-rule propagation — which is attractive for hardware and for biologically plausible learning models; (b) a theory-predicted scaling, \(\beta \sim 1/n_{\text{train}}\), that SGD does not have; and (c) a framework that extends unchanged to partial clamping and to the parameter sheaf \(\mathcal{H}_W\) of App. B.1 of the paper, both of which are awkward to express in SGD language.
Intuition device (planned): Dual time-axis diagram (fast x, slow θ) + bar chart of operations-per-step vs SGD.
14.4 Formal development
[TO FILL: formal development — definitions, statements, careful notation]
14.5 Theorem demonstrations
[TO FILL: proofs / proof sketches of the key results named above. Proofs should come *after* the intuition section, as agreed.]
14.6 Worked examples
[TO FILL: worked example(s) carried out by hand]
14.7 Coding lab
lab-12-sheaf-trainer —
[TO FILL: one-paragraph description of the lab's goal]
14.8 Exercises
[TO FILL: 3–6 exercises, graded from warm-up to project-level]
14.9 Further reading
[TO FILL: annotated paragraph of 3–6 references]
14.10 FAQ / common misconceptions
[TO FILL: short Q&A for things readers frequently get wrong]