5  Nonsmooth Dynamics: Filippov, Clarke, and CPWA Flows

Purpose. Equips the reader with the minimal nonsmooth-dynamics vocabulary needed to interpret the sheaf heat equation under ReLU switching, and to understand the proof of convergence (Theorem 4.1 of [1]).

5.1 Motivating example

Run the heat equation on a ReLU network. The state \(z^{(\ell)}(t)\) evolves continuously. But whenever a pre-activation coordinate \(z^{(\ell)}_j\) crosses zero, the corresponding diagonal entry \((R_{z^{(\ell)}})_{jj}\) of the ReLU matrix flips between 0 and 1. At that crossing, the velocity \(\dot{z}^{(\ell)}\) jumps discontinuously, even though the state \(z^{(\ell)}\) itself does not. This is a switching system, and ordinary ODE theory does not immediately apply.

The right framework is Filippov’s theory of differential equations with discontinuous right-hand sides. It defines what it means for a trajectory to “pass through” a discontinuity, handles trajectories that slide along the switching surface without jumping, and provides the Lyapunov tools needed to prove that the energy decreases even during sliding. This chapter develops all of that.

5.2 Key concepts and results

  • Why ODEs with discontinuous right-hand side need a generalized solution concept.
  • Filippov set-valued map \(\mathcal{F}[f](x)\) and the definition of a Filippov solution.
  • Clarke subdifferential \(\partial_C f\) for locally Lipschitz \(f\).
  • CPWA dynamical systems: piecewise-affine vector fields, switching surfaces, sliding modes.
  • The fast selection rule: which Filippov selection governs CPWA ReLU dynamics.
  • Nonsmooth Lyapunov theory and LaSalle’s invariance principle for Filippov systems.
  • Bounded fast solutions and convergence to generalized critical points.

5.3 Intuition

Think of a ball rolling on a V-shaped valley. The bottom of the V is a switching surface — the ball’s velocity changes sign there. If you push the ball straight toward the bottom, it crosses. If you aim the ball at a shallow angle, the ball can slide along the bottom for a while before continuing. Filippov’s theory handles both cases: crossing is just a fast passage through the discontinuity; sliding is a differential inclusion — the velocity lies in the convex hull of the velocities on either side of the surface.

The punchline for our setting is this: the sheaf Laplacian’s energy \(\|\delta \omega\|^2\) decreases even during sliding. So even if the heat equation trajectory spends time on a ReLU switching surface, it is still making progress toward the forward pass output. That is the key ingredient in the proof of Theorem 4.1.

5.4 Formal development

Differential equations with discontinuous right-hand side

Consider the ODE \[\dot{x} = f(x), \qquad x \in \mathbb{R}^n, \tag{5.1}\] where \(f : \mathbb{R}^n \to \mathbb{R}^n\) is measurable and locally bounded but not necessarily continuous. The classical notion of a \(C^1\) solution does not apply when \(f\) is discontinuous.

Definition 5.1 (Filippov set-valued map). For a measurable, locally bounded \(f\), the Filippov regularization is the set-valued map \[\mathcal{F}[f](x) = \bigcap_{\delta > 0} \bigcap_{\mu(S) = 0} \overline{\mathrm{co}}\, f(B_\delta(x) \setminus S),\] where \(B_\delta(x)\) is the open \(\delta\)-ball, the inner intersection runs over all sets \(S\) of Lebesgue measure zero, and \(\overline{\mathrm{co}}\) denotes the closed convex hull. Intuitively, \(\mathcal{F}[f](x)\) is the smallest closed convex set containing all essential limiting values of \(f\) near \(x\).

Definition 5.2 (Filippov solution). An absolutely continuous function \(x : [0,T] \to \mathbb{R}^n\) is a Filippov solution to (5.1) if \[\dot{x}(t) \in \mathcal{F}[f](x(t)) \quad \text{for almost every } t \in [0,T].\]

At points where \(f\) is continuous, \(\mathcal{F}[f](x) = \{f(x)\}\) and the Filippov solution reduces to the classical solution. At a discontinuity, \(\mathcal{F}[f](x)\) is a convex set of possible velocities. Existence of Filippov solutions under measurability and local boundedness is guaranteed by [2]; see also [3] for an accessible treatment.
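For scalar fields, the regularization can be approximated numerically. A minimal sketch, assuming nothing beyond NumPy (the helper `filippov_interval` is our own illustration, not from the references): in one dimension the closed convex hull of the values of \(f\) on a small punctured ball is just an interval.

```python
import numpy as np

def filippov_interval(f, x, delta=1e-6, samples=1000):
    """Approximate the 1D Filippov set F[f](x) as an interval.

    In one dimension the closed convex hull of the values of f on the
    punctured ball B_delta(x) is [min, max] of a fine sample; dropping
    the single point x mimics removing a Lebesgue-null set.
    """
    pts = x + delta * np.linspace(-1.0, 1.0, samples)
    vals = f(pts[pts != x])
    return float(vals.min()), float(vals.max())

sign = np.sign
at_zero = filippov_interval(sign, 0.0)  # whole interval: F[sign](0) = [-1, 1]
away = filippov_interval(sign, 1.0)     # singleton: F[sign](1) = {1}
```

At the discontinuity the approximation recovers the full interval \([-1, 1]\); at continuity points it collapses to the classical value, as Definition 5.1 predicts.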

Switching surfaces and sliding

For a CPWA vector field \(f\), the switching surface \(\mathcal{S}\) is the set where \(f\) is discontinuous: typically a union of hyperplanes \(\{h_i(x) = 0\}\).

At a point \(x \in \mathcal{S}\), the Filippov map \(\mathcal{F}[f](x)\) is the convex hull of the one-sided limits \(f^+(x)\) and \(f^-(x)\) from the two sides of \(\mathcal{S}\). There are three cases:

  1. Crossing. The normal components of \(f^+(x)\) and \(f^-(x)\) have the same sign: the trajectory passes straight through \(\mathcal{S}\).
  2. Repulsion. Both velocities point away from \(\mathcal{S}\): trajectories starting on \(\mathcal{S}\) leave it immediately, to one side or the other (uniqueness can fail here).
  3. Sliding. The two velocities point toward \(\mathcal{S}\) from opposite sides: the trajectory is trapped on \(\mathcal{S}\) and must follow a velocity in \(\mathcal{F}[f](x) \cap T_x\mathcal{S}\) (the tangential sliding velocity).
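The three cases can be detected from the normal components of the two one-sided velocities. A minimal sketch (the helper `classify_switching` is our own illustration, not from [1] or [4]):

```python
import numpy as np

def classify_switching(f_plus, f_minus, normal):
    """Classify behaviour at a switching-surface point.

    `normal` is the unit normal pointing into the '+' side, so cp < 0
    means f^+ pushes toward the surface and cm > 0 means f^- does.
    """
    cp = float(np.dot(f_plus, normal))
    cm = float(np.dot(f_minus, normal))
    if cp * cm > 0:
        return "crossing"    # both velocities traverse in the same direction
    if cp > 0 and cm < 0:
        return "repulsion"   # both point away from the surface
    if cp < 0 and cm > 0:
        return "sliding"     # both point toward the surface
    return "tangent"         # degenerate: some normal component vanishes

n = np.array([0.0, 1.0])
sliding = classify_switching(np.array([-1.0, -1.0]), np.array([1.0, 1.0]), n)
crossing = classify_switching(np.array([-1.0, -1.0]), np.array([1.0, -1.0]), n)
```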

In the neural sheaf setting, the switching surface for neuron \(j\) at layer \(\ell\) is \(\{z^{(\ell)}_j = 0\}\). [1] adopt the fast selection rule of [4] (Definition 7.4.17 of that thesis): among all Filippov selections at a switching surface, the one with maximal norm is chosen. This rule (i) guarantees existence and uniqueness of solutions, and (ii) ensures trajectories exit switching surfaces rather than sliding indefinitely on non-critical ones (Proposition 7.4.14 and Remark 7.4.20 of [4]).
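Among a finite set of candidate velocities the rule is easy to render schematically. The sketch below is our gloss on the idea, not Definition 7.4.17 of [4] verbatim (which selects over the full Filippov set):

```python
import numpy as np

def fast_selection(candidates):
    """Return the candidate Filippov velocity of maximal Euclidean norm.

    Schematic version of the fast selection rule: on a switching surface
    the largest-norm selection is taken, which pushes the trajectory off
    non-critical surfaces instead of letting it slide indefinitely.
    """
    return max(candidates, key=np.linalg.norm)

# A sliding candidate (zero tangential velocity) loses to an exit velocity:
chosen = fast_selection([np.zeros(2), np.array([-1.0, 0.5])])
```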

Clarke subdifferential

For Lyapunov arguments on Filippov systems, the appropriate derivative concept is the Clarke subdifferential.

Definition 5.3 (Clarke subdifferential). Let \(V : \mathbb{R}^n \to \mathbb{R}\) be locally Lipschitz. By Rademacher's theorem, \(\nabla V\) exists almost everywhere, and the Clarke subdifferential at \(x\) is \[\partial_C V(x) = \overline{\mathrm{co}}\left\{\lim_{k \to \infty} \nabla V(x_k) : x_k \to x,\ \nabla V(x_k) \text{ exists}\right\}.\]

The Clarke subdifferential reduces to the ordinary gradient \(\{\nabla V(x)\}\) at smooth points. At a kink — for example, \(V(x) = |x|\) at \(x = 0\) — it gives the interval \([-1,1]\), capturing all limiting gradients.
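Limiting gradients can be sampled numerically. A minimal sketch, with an illustrative finite-difference helper `clarke_interval` (1D only, our own construction):

```python
import numpy as np

def clarke_interval(V, x, delta=1e-5, samples=1001, h=1e-9):
    """Approximate the 1D Clarke subdifferential of V at x as an interval:
    the convex hull of central-difference gradients sampled near, but not
    at, the point x (where V may be non-differentiable)."""
    pts = x + delta * np.linspace(-1.0, 1.0, samples)
    pts = pts[np.abs(pts - x) > 100 * h]   # stay clear of the kink itself
    grads = (V(pts + h) - V(pts - h)) / (2 * h)
    return float(grads.min()), float(grads.max())

lo, hi = clarke_interval(np.abs, 0.0)                            # |x| at 0
r_lo, r_hi = clarke_interval(lambda z: np.maximum(z, 0.0), 0.0)  # ReLU at 0
```

The sampled hulls reproduce \(\partial_C |x|(0) = [-1, 1]\) and \(\partial_C \mathrm{ReLU}(0) = [0, 1]\).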

Definition 5.4 (Generalized critical point). A point \(x^*\) is a generalized critical point of \(V\) (in the Clarke sense) if \(0 \in \partial_C V(x^*)\).

Nonsmooth Lyapunov theory

The standard tool for proving convergence in smooth systems is Lyapunov’s method: find a function \(V \geq 0\) with \(\dot{V} \leq 0\). For Filippov systems, the appropriate generalization is:

Theorem 5.1 (Nonsmooth LaSalle, after [5]). Let \(\dot{x} \in \mathcal{F}[f](x)\) be a Filippov system and let \(V : \mathbb{R}^n \to \mathbb{R}\) be locally Lipschitz. Suppose:

  1. \(V\) is nonincreasing along Filippov solutions: for a.e. \(t\), \(\frac{d}{dt}V(x(t)) \leq 0\).
  2. The sublevel sets \(\{x : V(x) \leq c\}\) are bounded for all \(c\).

Then every trajectory converges to the largest invariant subset of \(\{x : 0 \in \partial_C V(x)\}\).

The key verification requirement is condition 1: showing that \(V\) decreases even during sliding. For the sheaf heat equation, the candidate Lyapunov function is the total discrepancy \(V(\omega) = \tfrac{1}{2}\|\delta_t \omega\|^2\). The heart of the convergence proof (Ch. 9) is checking that \(\frac{d}{dt}V \leq 0\) on the switching surfaces \(\{z^{(\ell)}_j = 0\}\), which [1] verify by observing that the two one-sided gradients of \(V\) with respect to \((R_{z^{(\ell)}})_{jj}\) agree on the surface (because \(\mathrm{ReLU}(0) = 0\) regardless of convention).
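The key observation, that \(\mathrm{ReLU}(0) = 0\) under either convention for the indicator at the surface, can be checked in a toy model. A minimal sketch with an illustrative quadratic discrepancy (this `V` is a stand-in with made-up downstream weights, not the sheaf energy of [1]):

```python
def V(z, convention):
    """Toy discrepancy depending on z only through a = R(z) * z."""
    # Convention for the ReLU indicator exactly at z = 0:
    R = float(z > 0) if convention == "open" else float(z >= 0)
    a = R * z                          # ReLU(z) under the chosen convention
    return 0.5 * (1.0 - 2.0 * a) ** 2  # arbitrary fixed downstream weights

# On the switching surface z = 0, both conventions yield a = 0, so the two
# one-sided branches of V agree and V is continuous across the surface.
on_surface = (V(0.0, "open"), V(0.0, "closed"))
```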

CPWA systems and bounded solutions

Definition 5.5 (CPWA dynamical system). A vector field \(f : \mathbb{R}^n \to \mathbb{R}^n\) is CPWA if there exists a finite polyhedral decomposition \(\{\mathcal{R}_i\}\) of \(\mathbb{R}^n\) such that \(f|_{\mathcal{R}_i}\) is affine for each \(i\).

The sheaf heat equation (with fixed weights) is a CPWA system: the polyhedral regions are exactly the activation regions from Ch. 1, and within each region the Laplacian is constant, so the flow is linear.
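A CPWA field is evaluated by locating the active polyhedral region and applying that region's affine piece. A minimal sketch keyed on sign patterns (the coefficients below are hypothetical, not a sheaf Laplacian):

```python
import numpy as np

def cpwa_field(z, A_regions, b_regions):
    """Evaluate a CPWA vector field: the sign pattern of z indexes the
    polyhedral region, and the field is affine (A_i z + b_i) within it."""
    pattern = tuple(int(zi > 0) for zi in z)
    return A_regions[pattern] @ z + b_regions[pattern]

# Two regions in 1D, split by z = 0 (illustrative coefficients):
A = {(1,): np.array([[-0.5]]), (0,): np.array([[-1.0]])}
b = {(1,): np.zeros(1), (0,): np.zeros(1)}
v_pos = cpwa_field(np.array([2.0]), A, b)   # active region:  -0.5 * 2
v_neg = cpwa_field(np.array([-2.0]), A, b)  # inactive region: -1.0 * -2
```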

Theorem 5.2 (Bounded fast solutions; Theorem 7.4.21 of [4]). For a CPWA gradient system \(\dot{x} \in -\partial_C V(x)\) under the fast selection rule, all Filippov solutions are bounded.

Boundedness (Theorem 5.2) together with the nonsmooth LaSalle principle (Theorem 5.1) gives convergence to generalized critical points. The remaining step (Ch. 9) is identifying which generalized critical points are present in the sheaf setting, and showing that the only one is \(\omega^*\), the forward pass cochain.

5.5 Worked example: a one-neuron switching system

The simplest CPWA system arising from a ReLU is the scalar equation \[\dot{z} = -\alpha\bigl(z - R_z\, z \cdot v^2 \bigr), \qquad R_z = \mathbf{1}[z \geq 0], \quad \alpha, v > 0.\] This models a single pre-activation \(z\) on a weight edge (upstream contribution \(v^2 z\) from an active post-activation, zero if inactive).

Two regions. For \(z > 0\): \(R_z = 1\), so \(\dot{z} = -\alpha (1 - v^2) z\); if \(v < 1\), then \(\dot{z} < 0\) and \(z\) decays exponentially to \(0\). For \(z < 0\): \(R_z = 0\), so \(\dot{z} = -\alpha z > 0\) and \(z\) increases toward \(0\) from below.

Switching surface. At \(z = 0\): the two one-sided velocities are \(\dot{z}^+ = 0\) (from the right) and \(\dot{z}^- = 0\) (from the left). Both are zero, so \(z = 0\) is an equilibrium — the fixed point is exactly the forward pass value (the zero pre-activation, dead neuron case of Remark 3.5 in [1]).

Lyapunov function. Take \(V(z) = \tfrac{1}{2}z^2\). Then \(\frac{d}{dt}V = z\dot{z} = -\alpha z^2 (1 - R_z v^2) \leq 0\) for \(v \leq 1\) in both regions, including at \(z = 0\). Convergence to \(z = 0\) follows from Theorem 5.1.

This one-variable example illustrates every element of the full convergence proof: two affine regions, a switching surface that is the equilibrium, a Lyapunov function that decreases in both regions, and the fast selection rule resolving the discontinuity.
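The example can be checked numerically with a forward-Euler discretization. A minimal sketch (the step size and horizon are arbitrary choices, not from [1]):

```python
import numpy as np

# Forward-Euler simulation of  zdot = -alpha * (z - R_z * v**2 * z),
# R_z = 1[z >= 0], verifying monotone decrease of V(z) = z**2 / 2 and
# convergence of z to the equilibrium z* = 0.
alpha, v, dt = 1.0, 0.5, 1e-3

def step(z):
    R = 1.0 if z >= 0 else 0.0
    return z + dt * (-alpha * (z - R * v**2 * z))

finals = []
for z0 in (2.0, -1.5):
    z, V_prev = z0, 0.5 * z0**2
    for _ in range(20_000):
        z = step(z)
        V = 0.5 * z**2
        assert V <= V_prev  # Lyapunov decrease at every step, both regions
        V_prev = V
    finals.append(z)
```

From either side of the switching surface, \(V\) decreases at every step and the state converges to the forward pass value \(z^* = 0\).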

5.6 Exercises

5.1 (Filippov map). Let \(f : \mathbb{R} \to \mathbb{R}\) be \(f(x) = \mathrm{sign}(x)\) (i.e., \(+1\) for \(x > 0\), \(-1\) for \(x < 0\)). Compute \(\mathcal{F}[f](x)\) for \(x = 0\) and \(x \neq 0\). Draw the phase portrait of the resulting Filippov system \(\dot{x} \in \mathcal{F}[f](x)\).

5.2 (Clarke subdifferential). Compute \(\partial_C V\) for (a) \(V(x) = |x|\), (b) \(V(x) = \tfrac{1}{2}x^2\), (c) \(V(x) = \max(0, x)\) (the ReLU itself). In each case, identify all generalized critical points.

5.3 (Sliding mode). Consider the 2D Filippov system with \(f^+(x) = (-1, -1)\) for \(x_2 > 0\) and \(f^-(x) = (1, 1)\) for \(x_2 < 0\). On the switching surface \(\{x_2 = 0\}\), both velocities point toward the surface. Show that the tangential part of the Filippov map, \(\mathcal{F}[f](x) \cap T_x\{x_2 = 0\}\), consists only of the zero vector, so trajectories reach the switching surface in finite time and every point of the surface is an equilibrium of the sliding dynamics.

5.4 (Energy decrease during sliding). In the sheaf heat equation, the switching surface for neuron \(j\) is \(\{z^{(\ell)}_j = 0\}\). The Lyapunov function is \(V(\omega) = \tfrac{1}{2}\|\delta_t \omega\|^2\). Verify (following the proof sketch in [1], Appendix A.2) that \(\frac{d}{dt}V \leq 0\) on this surface by showing that the one-sided values of \(V\) agree there and that the tangential gradient \(\nabla_\tau V\) points away from or along the surface.

5.5 (Project — phase-plane trajectories). Reproduce the phase-plane plots from Figure 13 of [1] for the [2, 4, 1] and [2, 6, 4, 1] networks. Identify: (a) the activation-region boundaries, (b) boundary crossings, (c) any transient sliding episodes. Verify that the Dirichlet energy decreases monotonically throughout, including during any sliding. Use lab-03-switched-gradient-flow.

5.7 Coding lab

lab-03-switched-gradient-flow — Simulate a CPWA gradient flow on a simple 2D state space with two activation regions separated by the line \(z_1 = 0\). Initialize from several starting points and observe: crossing trajectories, sliding-mode trajectories, and convergence to the fixed point. Plot the Lyapunov function \(V(t)\) on a log scale to verify monotone decrease. Then embed a single-hidden-layer neural sheaf and run the restricted heat equation, watching ReLU boundary crossings and energy decrease. This lab builds the intuition for the convergence proof in Ch. 9.
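As a starting point, the lab's first stage might look like the following sketch: a 2D flow with two affine regions separated by \(z_1 = 0\). The matrices are illustrative stand-ins chosen negative definite, not the lab's actual Laplacians:

```python
import numpy as np

A_pos = np.array([[-1.0, 0.2], [0.2, -1.0]])   # region z1 > 0
A_neg = np.array([[-2.0, 0.0], [0.0, -1.0]])   # region z1 <= 0

def flow(z0, dt=1e-3, steps=10_000):
    """Forward-Euler CPWA flow; returns the trajectory as an array."""
    z, traj = np.asarray(z0, float), []
    for _ in range(steps):
        A = A_pos if z[0] > 0 else A_neg
        z = z + dt * (A @ z)
        traj.append(z.copy())
    return np.array(traj)

traj = flow([1.0, -2.0])
V = 0.5 * np.sum(traj**2, axis=1)   # Lyapunov function along the trajectory
# V decreases monotonically (including at region switches) and z -> 0:
monotone = bool(np.all(np.diff(V) <= 0))
converged = bool(np.linalg.norm(traj[-1]) < 1e-2)
```

Plotting `V` on a log scale (e.g. with `plt.semilogy(V)`) makes the monotone decrease and the change of decay rate at boundary crossings visible.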

5.8 Further reading

The foundational reference for Filippov solutions is [2]. An accessible entry point is the tutorial [3], which covers the Clarke subdifferential, Filippov maps, and Lyapunov theory for discontinuous systems. The Lyapunov stability theory for nonsmooth systems with the formulation used here is [5]. The specific CPWA sheaf convergence theory used in [1] — including the fast selection rule, bounded fast solutions, and the exclusion of spurious critical points — is developed in [4]. Davis, Drusvyatskiy, Kakade, and Lee’s paper on stochastic subgradient descent through the Clarke subdifferential provides a complementary perspective from optimization.

5.9 FAQ / common misconceptions

Q: If Filippov solutions exist anyway, why do we need the “fast selection rule”? Existence of Filippov solutions requires only measurability and local boundedness; uniqueness is not guaranteed in general. Moreover, the standard theory allows sliding solutions that sit on a switching surface indefinitely, even when that surface is not a critical point of the energy. The fast selection rule ([4], Def. 7.4.17) selects the Filippov velocity with maximal norm, which both singles out a unique solution and forces trajectories to leave non-critical switching surfaces quickly. Without it, you get existence but neither uniqueness nor convergence.

Q: Does sliding actually happen in the neural sheaf? Yes, transiently. Figure 13 of [1] shows phase-plane trajectories for the [2, 6, 4, 1] network with 111 ReLU boundary crossings in the first few hundred iterations. Some of these involve brief sliding, visible as segments of the trajectory running parallel to a ReLU boundary. After the transient, the trajectory enters a single activation region and proceeds as a linear system.

Q: What happens at a “dead neuron” (\(z^{(\ell)}_j = 0\) forever)? A dead neuron has \(z^{(\ell)}_j = 0\) at the forward pass solution. This is a codimension-one event in weight space (Remark 3.5 in [1]). At such a point, both conventions \((R_{z^{(\ell)}})_{jj} = 0\) and \((R_{z^{(\ell)}})_{jj} = 1\) give the same activation \(a^{(\ell)}_j = 0\), so the harmonic extension is the same regardless. The convergence theorems assume the forward pass equilibrium lies in the interior of an activation region; dead neurons violate this assumption but do not affect the harmonic extension, only the labeling of the activation pattern.