5  Nonsmooth Dynamics: Filippov, Clarke, and CPWA Flows

Purpose. Equips the reader with the minimal nonsmooth-dynamics vocabulary needed to interpret the sheaf heat equation under ReLU switching, and to understand the proof of convergence (Theorem 4.1 of [1]).

5.1 Motivating example

Run the heat equation on a ReLU network. The state \(z^{(\ell)}(t)\) evolves continuously. But whenever a pre-activation coordinate \(z^{(\ell)}_j\) crosses zero, the corresponding diagonal entry \((R_{z^{(\ell)}})_{jj}\) of the ReLU matrix flips between 0 and 1. At that crossing, the velocity \(\dot{z}^{(\ell)}\) jumps discontinuously, even though the state \(z^{(\ell)}\) itself does not. This is a switching system, and ordinary ODE theory does not immediately apply.

The right framework is Filippov’s theory of differential equations with discontinuous right-hand sides. It defines what it means for a trajectory to “pass through” a discontinuity, handles trajectories that slide along the switching surface without jumping, and provides the Lyapunov tools needed to prove that the energy decreases even during sliding. This chapter develops all of that.

5.2 Key concepts and results

  • Why ODEs with discontinuous right-hand side need a generalized solution concept.
  • Filippov set-valued map \(\mathcal{F}[f](x)\) and the definition of a Filippov solution.
  • Clarke subdifferential \(\partial_C f\) for locally Lipschitz \(f\).
  • CPWA dynamical systems: piecewise-affine vector fields, switching surfaces, sliding modes.
  • The fast selection rule: which Filippov selection governs CPWA ReLU dynamics.
  • Nonsmooth Lyapunov theory and LaSalle’s invariance principle for Filippov systems.
  • Bounded fast solutions and convergence to generalized critical points.

5.3 Intuition

Think of a ball rolling on a V-shaped valley. The bottom of the V is a switching surface — the ball’s velocity changes sign there. If you push the ball straight toward the bottom, it crosses. If you aim the ball at a shallow angle, the ball can slide along the bottom for a while before continuing. Filippov’s theory handles both cases: crossing is just a fast passage through the discontinuity; sliding is a differential inclusion — the velocity lies in the convex hull of the velocities on either side of the surface.

The punchline for our setting is this: the sheaf Laplacian’s energy \(\|\delta \omega\|^2\) decreases even during sliding. So even if the heat equation trajectory spends time on a ReLU switching surface, it is still making progress toward the forward pass output. That is the key ingredient in the proof of Theorem 4.1.

5.4 Formal development

Differential equations with discontinuous right-hand side

Consider the ODE \[\dot{x} = f(x), \qquad x \in \mathbb{R}^n, \tag{5.1}\] where \(f : \mathbb{R}^n \to \mathbb{R}^n\) is measurable and locally bounded but not necessarily continuous. The classical notion of a \(C^1\) solution does not apply when \(f\) is discontinuous.

Definition 5.1 (Filippov set-valued map). For a measurable, locally bounded \(f\), the Filippov regularization is the set-valued map \[\mathcal{F}[f](x) = \bigcap_{\delta > 0} \bigcap_{\mu(S) = 0} \overline{\mathrm{co}}\, f(B_\delta(x) \setminus S),\] where \(B_\delta(x)\) is the open \(\delta\)-ball, the inner intersection runs over all sets \(S\) of Lebesgue measure zero, and \(\overline{\mathrm{co}}\) denotes the closed convex hull. Intuitively, \(\mathcal{F}[f](x)\) is the smallest closed convex set containing all essential limiting values of \(f\) near \(x\).

Definition 5.2 (Filippov solution). An absolutely continuous function \(x : [0,T] \to \mathbb{R}^n\) is a Filippov solution to (5.1) if \[\dot{x}(t) \in \mathcal{F}[f](x(t)) \quad \text{for almost every } t \in [0,T].\]

At points where \(f\) is continuous, \(\mathcal{F}[f](x) = \{f(x)\}\) and the Filippov solution reduces to the classical solution. At a discontinuity, \(\mathcal{F}[f](x)\) is a convex set of possible velocities. Existence of Filippov solutions under measurability and local boundedness is guaranteed by [2]; see also [3] for an accessible treatment.
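For scalar fields, the regularization can be approximated numerically. A minimal sketch, assuming nothing beyond NumPy (the helper `filippov_interval` is our own illustration, not from the references): in one dimension the closed convex hull of the values of \(f\) on a small punctured ball is just an interval.

```python
import numpy as np

def filippov_interval(f, x, delta=1e-6, samples=1000):
    """Approximate the 1D Filippov set F[f](x) as an interval.

    In one dimension the closed convex hull of the values of f on the
    punctured ball B_delta(x) is [min, max] of a fine sample; dropping
    the single point x mimics removing a Lebesgue-null set.
    """
    pts = x + delta * np.linspace(-1.0, 1.0, samples)
    vals = f(pts[pts != x])
    return float(vals.min()), float(vals.max())

sign = np.sign
at_zero = filippov_interval(sign, 0.0)  # whole interval: F[sign](0) = [-1, 1]
away = filippov_interval(sign, 1.0)     # singleton: F[sign](1) = {1}
```

At the discontinuity the approximation recovers the full interval \([-1, 1]\); at continuity points it collapses to the classical value, as Definition 5.1 predicts.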

Switching surfaces and sliding

For a CPWA vector field \(f\), the switching surface \(\mathcal{S}\) is the set where \(f\) is discontinuous: typically a union of hyperplanes \(\{h_i(x) = 0\}\).

At a point \(x \in \mathcal{S}\), the Filippov map \(\mathcal{F}[f](x)\) is the convex hull of the one-sided limits \(f^+(x)\) and \(f^-(x)\) from the two sides of \(\mathcal{S}\). There are three cases:

  1. Crossing. The normal components of \(f^+(x)\) and \(f^-(x)\) have the same sign: the trajectory passes straight through \(\mathcal{S}\).
  2. Repulsion. Both velocities point away from \(\mathcal{S}\): trajectories starting on \(\mathcal{S}\) leave it immediately, to one side or the other (uniqueness can fail here).
  3. Sliding. The two velocities point toward \(\mathcal{S}\) from opposite sides: the trajectory is trapped on \(\mathcal{S}\) and must follow a velocity in \(\mathcal{F}[f](x) \cap T_x\mathcal{S}\) (the tangential sliding velocity).
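The three cases can be detected from the normal components of the two one-sided velocities. A minimal sketch (the helper `classify_switching` is our own illustration, not from [1] or [4]):

```python
import numpy as np

def classify_switching(f_plus, f_minus, normal):
    """Classify behaviour at a switching-surface point.

    `normal` is the unit normal pointing into the '+' side, so cp < 0
    means f^+ pushes toward the surface and cm > 0 means f^- does.
    """
    cp = float(np.dot(f_plus, normal))
    cm = float(np.dot(f_minus, normal))
    if cp * cm > 0:
        return "crossing"    # both velocities traverse in the same direction
    if cp > 0 and cm < 0:
        return "repulsion"   # both point away from the surface
    if cp < 0 and cm > 0:
        return "sliding"     # both point toward the surface
    return "tangent"         # degenerate: some normal component vanishes

n = np.array([0.0, 1.0])
sliding = classify_switching(np.array([-1.0, -1.0]), np.array([1.0, 1.0]), n)
crossing = classify_switching(np.array([-1.0, -1.0]), np.array([1.0, -1.0]), n)
```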

In the neural sheaf setting, the switching surface for neuron \(j\) at layer \(\ell\) is \(\{z^{(\ell)}_j = 0\}\). [1] adopt the fast selection rule of [4] (Definition 7.4.17 of that thesis): among all Filippov selections at a switching surface, the one with maximal norm is chosen. This rule (i) guarantees existence and uniqueness of solutions, and (ii) ensures trajectories exit switching surfaces rather than sliding indefinitely on non-critical ones (Proposition 7.4.14 and Remark 7.4.20 of [4]).
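Among a finite set of candidate velocities the rule is easy to render schematically. The sketch below is our gloss on the idea, not Definition 7.4.17 of [4] verbatim (which selects over the full Filippov set):

```python
import numpy as np

def fast_selection(candidates):
    """Return the candidate Filippov velocity of maximal Euclidean norm.

    Schematic version of the fast selection rule: on a switching surface
    the largest-norm selection is taken, which pushes the trajectory off
    non-critical surfaces instead of letting it slide indefinitely.
    """
    return max(candidates, key=np.linalg.norm)

# A sliding candidate (zero tangential velocity) loses to an exit velocity:
chosen = fast_selection([np.zeros(2), np.array([-1.0, 0.5])])
```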

Clarke subdifferential

For Lyapunov arguments on Filippov systems, the appropriate derivative concept is the Clarke subdifferential.

Definition 5.3 (Clarke subdifferential). Let \(V : \mathbb{R}^n \to \mathbb{R}\) be locally Lipschitz. By Rademacher's theorem, \(\nabla V\) exists almost everywhere, and the Clarke subdifferential at \(x\) is \[\partial_C V(x) = \overline{\mathrm{co}}\left\{\lim_{k \to \infty} \nabla V(x_k) : x_k \to x,\ \nabla V(x_k) \text{ exists}\right\}.\]

The Clarke subdifferential reduces to the ordinary gradient \(\{\nabla V(x)\}\) at smooth points. At a kink — for example, \(V(x) = |x|\) at \(x = 0\) — it gives the interval \([-1,1]\), capturing all limiting gradients.
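Limiting gradients can be sampled numerically. A minimal sketch, with an illustrative finite-difference helper `clarke_interval` (1D only, our own construction):

```python
import numpy as np

def clarke_interval(V, x, delta=1e-5, samples=1001, h=1e-9):
    """Approximate the 1D Clarke subdifferential of V at x as an interval:
    the convex hull of central-difference gradients sampled near, but not
    at, the point x (where V may be non-differentiable)."""
    pts = x + delta * np.linspace(-1.0, 1.0, samples)
    pts = pts[np.abs(pts - x) > 100 * h]   # stay clear of the kink itself
    grads = (V(pts + h) - V(pts - h)) / (2 * h)
    return float(grads.min()), float(grads.max())

lo, hi = clarke_interval(np.abs, 0.0)                            # |x| at 0
r_lo, r_hi = clarke_interval(lambda z: np.maximum(z, 0.0), 0.0)  # ReLU at 0
```

The sampled hulls reproduce \(\partial_C |x|(0) = [-1, 1]\) and \(\partial_C \mathrm{ReLU}(0) = [0, 1]\).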

Definition 5.4 (Generalized critical point). A point \(x^*\) is a generalized critical point of \(V\) (in the Clarke sense) if \(0 \in \partial_C V(x^*)\).

Nonsmooth Lyapunov theory

The standard tool for proving convergence in smooth systems is Lyapunov’s method: find a function \(V \geq 0\) with \(\dot{V} \leq 0\). For Filippov systems, the appropriate generalization is:

Theorem 5.1 (Nonsmooth LaSalle, after [5]). Let \(\dot{x} \in \mathcal{F}[f](x)\) be a Filippov system and let \(V : \mathbb{R}^n \to \mathbb{R}\) be locally Lipschitz. Suppose:

  1. \(V\) is nonincreasing along Filippov solutions: for a.e. \(t\), \(\frac{d}{dt}V(x(t)) \leq 0\).
  2. The sublevel sets \(\{x : V(x) \leq c\}\) are bounded for all \(c\).

Then every trajectory converges to the largest invariant subset of \(\{x : 0 \in \partial_C V(x)\}\).

The key verification requirement is condition 1: showing that \(V\) decreases even during sliding. For the sheaf heat equation, the candidate Lyapunov function is the total discrepancy \(V(\omega) = \tfrac{1}{2}\|\delta_t \omega\|^2\). The heart of the convergence proof (Ch. 9) is checking that \(\frac{d}{dt}V \leq 0\) on the switching surfaces \(\{z^{(\ell)}_j = 0\}\), which [1] verify by observing that the two one-sided gradients of \(V\) with respect to \((R_{z^{(\ell)}})_{jj}\) agree on the surface (because \(\mathrm{ReLU}(0) = 0\) regardless of convention).
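The key observation, that \(\mathrm{ReLU}(0) = 0\) under either convention for the indicator at the surface, can be checked in a toy model. A minimal sketch with an illustrative quadratic discrepancy (this `V` is a stand-in with made-up downstream weights, not the sheaf energy of [1]):

```python
def V(z, convention):
    """Toy discrepancy depending on z only through a = R(z) * z."""
    # Convention for the ReLU indicator exactly at z = 0:
    R = float(z > 0) if convention == "open" else float(z >= 0)
    a = R * z                          # ReLU(z) under the chosen convention
    return 0.5 * (1.0 - 2.0 * a) ** 2  # arbitrary fixed downstream weights

# On the switching surface z = 0, both conventions yield a = 0, so the two
# one-sided branches of V agree and V is continuous across the surface.
on_surface = (V(0.0, "open"), V(0.0, "closed"))
```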

CPWA systems and bounded solutions

Definition 5.5 (CPWA dynamical system). A vector field \(f : \mathbb{R}^n \to \mathbb{R}^n\) is CPWA if there exists a finite polyhedral decomposition \(\{\mathcal{R}_i\}\) of \(\mathbb{R}^n\) such that \(f|_{\mathcal{R}_i}\) is affine for each \(i\).

The sheaf heat equation (with fixed weights) is a CPWA system: the polyhedral regions are exactly the activation regions from Ch. 1, and within each region the Laplacian is constant, so the flow is linear.
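A CPWA field is evaluated by locating the active polyhedral region and applying that region's affine piece. A minimal sketch keyed on sign patterns (the coefficients below are hypothetical, not a sheaf Laplacian):

```python
import numpy as np

def cpwa_field(z, A_regions, b_regions):
    """Evaluate a CPWA vector field: the sign pattern of z indexes the
    polyhedral region, and the field is affine (A_i z + b_i) within it."""
    pattern = tuple(int(zi > 0) for zi in z)
    return A_regions[pattern] @ z + b_regions[pattern]

# Two regions in 1D, split by z = 0 (illustrative coefficients):
A = {(1,): np.array([[-0.5]]), (0,): np.array([[-1.0]])}
b = {(1,): np.zeros(1), (0,): np.zeros(1)}
v_pos = cpwa_field(np.array([2.0]), A, b)   # active region:  -0.5 * 2
v_neg = cpwa_field(np.array([-2.0]), A, b)  # inactive region: -1.0 * -2
```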

Theorem 5.2 (Bounded fast solutions; Theorem 7.4.21 of [4]). For a CPWA gradient system \(\dot{x} \in -\partial_C V(x)\) under the fast selection rule, all Filippov solutions are bounded.

Boundedness (Theorem 5.2) together with the nonsmooth LaSalle principle (Theorem 5.1) gives convergence to generalized critical points. The remaining step (Ch. 9) is identifying which generalized critical points are present in the sheaf setting, and showing that the only one is \(\omega^*\), the forward pass cochain.

5.5 Worked example: a one-neuron switching system

The simplest CPWA system arising from a ReLU is the scalar equation \[\dot{z} = -\alpha\bigl(z - R_z\, z \cdot v^2 \bigr), \qquad R_z = \mathbf{1}[z \geq 0], \quad \alpha, v > 0.\] This models a single pre-activation \(z\) on a weight edge (upstream contribution \(v^2 z\) from an active post-activation, zero if inactive).

Two regions. For \(z > 0\): \(R_z = 1\), so \(\dot{z} = -\alpha (1 - v^2) z\); if \(v < 1\), then \(\dot{z} < 0\) and \(z\) decays exponentially to \(0\). For \(z < 0\): \(R_z = 0\), so \(\dot{z} = -\alpha z > 0\) and \(z\) increases toward \(0\) from below.

Switching surface. At \(z = 0\): the two one-sided velocities are \(\dot{z}^+ = 0\) (from the right) and \(\dot{z}^- = 0\) (from the left). Both are zero, so \(z = 0\) is an equilibrium — the fixed point is exactly the forward pass value (the zero pre-activation, dead neuron case of Remark 3.5 in [1]).

Lyapunov function. Take \(V(z) = \tfrac{1}{2}z^2\). Then \(\frac{d}{dt}V = z\dot{z} = -\alpha z^2 (1 - R_z v^2) \leq 0\) for \(v \leq 1\) in both regions, including at \(z = 0\). Convergence to \(z = 0\) follows from Theorem 5.1.

This one-variable example illustrates every element of the full convergence proof: two affine regions, a switching surface that is the equilibrium, a Lyapunov function that decreases in both regions, and the fast selection rule resolving the discontinuity.
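The example can be checked numerically with a forward-Euler discretization. A minimal sketch (the step size and horizon are arbitrary choices, not from [1]):

```python
import numpy as np

# Forward-Euler simulation of  zdot = -alpha * (z - R_z * v**2 * z),
# R_z = 1[z >= 0], verifying monotone decrease of V(z) = z**2 / 2 and
# convergence of z to the equilibrium z* = 0.
alpha, v, dt = 1.0, 0.5, 1e-3

def step(z):
    R = 1.0 if z >= 0 else 0.0
    return z + dt * (-alpha * (z - R * v**2 * z))

finals = []
for z0 in (2.0, -1.5):
    z, V_prev = z0, 0.5 * z0**2
    for _ in range(20_000):
        z = step(z)
        V = 0.5 * z**2
        assert V <= V_prev  # Lyapunov decrease at every step, both regions
        V_prev = V
    finals.append(z)
```

From either side of the switching surface, \(V\) decreases at every step and the state converges to the forward pass value \(z^* = 0\).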

5.6 Exercises

5.1 (Filippov map). Let \(f : \mathbb{R} \to \mathbb{R}\) be \(f(x) = \mathrm{sign}(x)\) (i.e., \(+1\) for \(x > 0\), \(-1\) for \(x < 0\)). Compute \(\mathcal{F}[f](x)\) for \(x = 0\) and \(x \neq 0\). Draw the phase portrait of the resulting Filippov system \(\dot{x} \in \mathcal{F}[f](x)\).

5.2 (Clarke subdifferential). Compute \(\partial_C V\) for (a) \(V(x) = |x|\), (b) \(V(x) = \tfrac{1}{2}x^2\), (c) \(V(x) = \max(0, x)\) (the ReLU itself). In each case, identify all generalized critical points.

5.3 (Sliding mode). Consider the 2D Filippov system with \(f^+(x) = (-1, -1)\) for \(x_2 > 0\) and \(f^-(x) = (1, 1)\) for \(x_2 < 0\). On the switching surface \(\{x_2 = 0\}\), both velocities point toward the surface. Show that the tangential part of the Filippov map, \(\mathcal{F}[f](x) \cap T_x\{x_2 = 0\}\), consists only of the zero vector, so trajectories reach the switching surface in finite time and every point of the surface is an equilibrium of the sliding dynamics.

5.4 (Energy decrease during sliding). In the sheaf heat equation, the switching surface for neuron \(j\) is \(\{z^{(\ell)}_j = 0\}\). The Lyapunov function is \(V(\omega) = \tfrac{1}{2}\|\delta_t \omega\|^2\). Verify (following the proof sketch in [1], Appendix A.2) that \(\frac{d}{dt}V \leq 0\) on this surface by showing that the one-sided values of \(V\) agree there and that the tangential gradient \(\nabla_\tau V\) points away from or along the surface.

5.5 (Project — phase-plane trajectories). Reproduce the phase-plane plots from Figure 13 of [1] for the [2, 4, 1] and [2, 6, 4, 1] networks. Identify: (a) the activation-region boundaries, (b) boundary crossings, (c) any transient sliding episodes. Verify that the Dirichlet energy decreases monotonically throughout, including during any sliding. Use lab-03-switched-gradient-flow.

5.7 Coding lab

lab-03-switched-gradient-flow — Simulate a CPWA gradient flow on a simple 2D state space with two activation regions separated by the line \(z_1 = 0\). Initialize from several starting points and observe: crossing trajectories, sliding-mode trajectories, and convergence to the fixed point. Plot the Lyapunov function \(V(t)\) on a log scale to verify monotone decrease. Then embed a single-hidden-layer neural sheaf and run the restricted heat equation, watching ReLU boundary crossings and energy decrease. This lab builds the intuition for the convergence proof in Ch. 9.
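As a starting point, the lab's first stage might look like the following sketch: a 2D flow with two affine regions separated by \(z_1 = 0\). The matrices are illustrative stand-ins chosen negative definite, not the lab's actual Laplacians:

```python
import numpy as np

A_pos = np.array([[-1.0, 0.2], [0.2, -1.0]])   # region z1 > 0
A_neg = np.array([[-2.0, 0.0], [0.0, -1.0]])   # region z1 <= 0

def flow(z0, dt=1e-3, steps=10_000):
    """Forward-Euler CPWA flow; returns the trajectory as an array."""
    z, traj = np.asarray(z0, float), []
    for _ in range(steps):
        A = A_pos if z[0] > 0 else A_neg
        z = z + dt * (A @ z)
        traj.append(z.copy())
    return np.array(traj)

traj = flow([1.0, -2.0])
V = 0.5 * np.sum(traj**2, axis=1)   # Lyapunov function along the trajectory
# V decreases monotonically (including at region switches) and z -> 0:
monotone = bool(np.all(np.diff(V) <= 0))
converged = bool(np.linalg.norm(traj[-1]) < 1e-2)
```

Plotting `V` on a log scale (e.g. with `plt.semilogy(V)`) makes the monotone decrease and the change of decay rate at boundary crossings visible.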

5.8 Further reading

The foundational reference for Filippov solutions is [2]. An accessible entry point is the tutorial [3], which covers the Clarke subdifferential, Filippov maps, and Lyapunov theory for discontinuous systems. The Lyapunov stability theory for nonsmooth systems with the formulation used here is [5]. The specific CPWA sheaf convergence theory used in [1] — including the fast selection rule, bounded fast solutions, and the exclusion of spurious critical points — is developed in [4]. Davis, Drusvyatskiy, Kakade, and Lee’s paper on stochastic subgradient descent through the Clarke subdifferential provides a complementary perspective from optimization.

5.9 FAQ / common misconceptions

Q: If Filippov solutions exist anyway, why do we need the “fast selection rule”? Existence of Filippov solutions requires only measurability and local boundedness; uniqueness is not guaranteed in general. Moreover, the standard theory allows sliding solutions that sit on a switching surface indefinitely, even when that surface is not a critical point of the energy. The fast selection rule ([4], Def. 7.4.17) selects the Filippov velocity with maximal norm, which both singles out a unique solution and forces trajectories to leave non-critical switching surfaces quickly. Without it, you get existence but neither uniqueness nor convergence.

Q: Does sliding actually happen in the neural sheaf? Yes, transiently. Figure 13 of [1] shows phase-plane trajectories for the [2, 6, 4, 1] network with 111 ReLU boundary crossings in the first few hundred iterations. Some of these involve brief sliding, visible as segments of the trajectory running parallel to a ReLU boundary. After the transient, the trajectory enters a single activation region and proceeds as a linear system.

Q: What happens at a “dead neuron” (\(z^{(\ell)}_j = 0\) forever)? A dead neuron has \(z^{(\ell)}_j = 0\) at the forward pass solution. This is a codimension-one event in weight space (Remark 3.5 in [1]). At such a point, both conventions \((R_{z^{(\ell)}})_{jj} = 0\) and \((R_{z^{(\ell)}})_{jj} = 1\) give the same activation \(a^{(\ell)}_j = 0\), so the harmonic extension is the same regardless. The convergence theorems assume the forward pass equilibrium lies in the interior of an activation region; dead neurons violate this assumption but do not affect the harmonic extension, only the labeling of the activation pattern.