3 ReLU Networks as Continuous Piecewise-Affine Maps
Purpose. Reframes a feedforward ReLU MLP as a CPWA function; introduces activation patterns, polyhedral cells, and the bent-hyperplane arrangement that will become the ‘state’ of the sheaf later.
3.1 Motivating example
A two-input ReLU network with a single hidden neuron computes \(\hat{y} = \max(w_1 x_1 + w_2 x_2 + b, 0) \cdot v + c\). Plot \(\hat{y}\) as a function of \((x_1, x_2)\): you do not get a smooth surface. You get two flat pieces joined along a single crease, the line where the pre-activation is zero. Add more hidden neurons and the surface becomes a patchwork of planar facets, bent along lines that cut through the input plane. Each facet is a region where every hidden neuron is either fully on or fully off. The bending lines are the zero-sets of the pre-activations, the activation boundaries.
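The bent surface is easy to probe numerically. A minimal sketch of the single-neuron example (the weights \(w = (1, -1)\), \(b = 0\), \(v = 2\), \(c = 0.5\) are arbitrary illustrative choices, not from the text):

```python
import numpy as np

# Single hidden neuron: y-hat = ReLU(w.x + b) * v + c.
w, b, v, c = np.array([1.0, -1.0]), 0.0, 2.0, 0.5

def yhat(x):
    return max(w @ x + b, 0.0) * v + c

# Where w.x + b >= 0 the surface is the tilted plane v*(w.x + b) + c;
# where w.x + b < 0 it is the flat plane y = c. The crease is w.x + b = 0.
print(yhat(np.array([2.0, 1.0])))   # active side:   2*(2-1) + 0.5 = 2.5
print(yhat(np.array([1.0, 2.0])))   # inactive side: 0.5
```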
This structure is not an accident or a pathology of ReLU. It is the defining feature, and every theorem in this book about ReLU networks exploits it.
3.2 Key concepts and results
- Feedforward ReLU MLP: precise definition, layer notation (\(z^{(\ell)}\) pre-activation, \(a^{(\ell)}\) post-activation).
- Continuous piecewise-affine (CPWA) functions and why every ReLU network is one.
- Activation patterns \(\sigma \in \{0,1\}^N\) and activation regions.
- The bent-hyperplane arrangement: how the input space is cut into polyhedral cells.
- Why ReLU non-differentiability is structured, not pathological.
3.3 Intuition
Think of a ReLU network as a switching system. At any given input \(x\), each hidden neuron is either active (pre-activation \(\geq 0\), output = input) or inactive (pre-activation \(< 0\), output \(= 0\)). The pattern of active and inactive neurons is the activation pattern \(\sigma(x) \in \{0,1\}^N\). Within the region of input space where \(\sigma(x)\) is constant, the network is simply an affine function of \(x\) — matrix multiply, add bias, done. The nonlinearity comes entirely from the region boundaries, where \(\sigma\) switches.
The key insight: there are finitely many such regions (one per activation pattern), each is a convex polyhedron, and the affine functions on adjacent regions agree on their shared boundary. That is exactly what it means for the global function to be continuous and piecewise-affine.
This is why ReLU non-differentiability does not cause the trouble it appears to. Classically, “non-differentiable” sounds like “unpredictable.” Here it means “piecewise-linear with a finite combinatorial structure.” The breaking points are organized along hyperplanes in input space, not scattered randomly. Ch. 3 will develop the tools (Filippov solutions, Clarke subdifferentials) needed to work with these breaking points in a dynamical context.
3.4 Formal development
Feedforward ReLU network
A feedforward ReLU network with \(k\) hidden layers has:
- Input \(x \in \mathbb{R}^{n_0}\).
- Layers \(\ell \in \{1, \ldots, k+1\}\) with widths \(n_1, \ldots, n_{k+1}\).
- Weight matrices \(W^{(\ell)} \in \mathbb{R}^{n_\ell \times n_{\ell-1}}\) and bias vectors \(b^{(\ell)} \in \mathbb{R}^{n_\ell}\).
Setting \(a^{(0)} \equiv x\), each hidden layer computes: \[z^{(\ell)} = W^{(\ell)} a^{(\ell-1)} + b^{(\ell)}, \qquad a^{(\ell)} = \mathrm{ReLU}(z^{(\ell)}), \qquad 1 \leq \ell \leq k, \tag{1}\] where \(\mathrm{ReLU}(z) = \max(z, 0)\) is applied element-wise. The output layer applies a task-specific activation \(\varphi\): \[\hat{y} = \varphi\!\left(W^{(k+1)} a^{(k)} + b^{(k+1)}\right). \tag{2}\]
For regression, \(\varphi = \mathrm{id}\). For classification, \(\varphi = \mathrm{softmax}\) or \(\sigma\) (sigmoid). The hidden-layer computation (1) is the focus of Chs. 1–9; nonlinear output \(\varphi\) is treated in Ch. 10.
The quantities \(z^{(\ell)}\) are called pre-activations and \(a^{(\ell)}\) post-activations (or activations). The full forward pass is the composition of \(k+1\) layers.
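Equations (1) and (2) translate directly into code. A sketch in NumPy (the identity output activation, the layer sizes, and the hand-picked weights below are illustrative choices):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, weights, biases, phi=lambda z: z):
    """Eq. (1) for each hidden layer, then eq. (2) with output activation phi."""
    a = x                                        # a^(0) = x
    for W, b in zip(weights[:-1], biases[:-1]):  # hidden layers
        a = relu(W @ a + b)                      # z^(l) -> a^(l)
    return phi(weights[-1] @ a + biases[-1])     # output layer

# A [2, 4, 1] instance with hand-picked weights:
weights = [np.array([[1., 0.], [0., 1.], [1., 1.], [1., -1.]]),
           np.array([[1., 1., 1., 1.]])]
biases  = [np.zeros(4), np.zeros(1)]
print(forward(np.array([1., 2.]), weights, biases))  # z^(1) = [1,2,3,-1], ReLU -> [1,2,3,0], output [6.]
```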
Continuous piecewise-affine functions
Definition 1.1 (CPWA function). A function \(f : \mathbb{R}^n \to \mathbb{R}^m\) is continuous piecewise-affine (CPWA) if there exists a finite collection of closed convex polyhedra \(\{P_i\}\) that covers \(\mathbb{R}^n\) and on each \(P_i\) the restriction \(f|_{P_i}\) is affine. Continuity requires that the affine pieces agree on shared boundaries.
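A one-line sanity check of the definition: \(|x|\) is CPWA with two pieces, \(-x\) on \((-\infty, 0]\) and \(x\) on \([0, \infty)\), which agree at the shared boundary \(x = 0\):

```python
import numpy as np

xs = np.linspace(-2.0, 2.0, 9)
pieces = np.where(xs >= 0, xs, -xs)      # affine piece chosen per polyhedron
assert np.allclose(pieces, np.abs(xs))   # the glued pieces reproduce |x|
assert pieces[xs == 0] == 0.0            # both pieces agree at the boundary
```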
A piecewise-linear (PWL) function is CPWA with all affine pieces passing through the origin (no constant term). The terms CPWA and CPWL are often used interchangeably in the literature; we follow [1] in using CPWA.
Theorem 1.2 (Arora et al. [2]). A function \(f : \mathbb{R}^n \to \mathbb{R}^m\) is CPWA if and only if it can be represented by a feedforward ReLU network of sufficient depth and width.
We do not prove Theorem 1.2 here; we cite it to place ReLU networks in the broader class of CPWA functions and to confirm that this class is exactly right — no larger, no smaller. What we do derive, and will use throughout the book, is the activation-pattern decomposition that makes the CPWA structure of a ReLU network explicit.
Activation patterns and regions
For a network with \(k\) hidden layers and layer widths \(n_1, \ldots, n_k\), let \(N = \sum_{\ell=1}^k n_\ell\) be the total number of hidden neurons. The activation pattern of an input \(x\) is the binary vector \[\sigma(x) = \bigl(\sigma^{(1)}(x), \ldots, \sigma^{(k)}(x)\bigr) \in \{0,1\}^N, \quad \sigma^{(\ell)}_j(x) = \mathbf{1}\!\left[z^{(\ell)}_j(x) \geq 0\right].\]
For each pattern \(\sigma \in \{0,1\}^N\), the activation region \(\mathcal{R}_\sigma\) is the set of inputs that produce that pattern: \[\mathcal{R}_\sigma = \{x \in \mathbb{R}^{n_0} : \sigma(x) = \sigma\}.\]
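The pattern \(\sigma(x)\) is cheap to compute alongside the forward pass. A sketch for a network with two hidden layers (widths and weights below are illustrative):

```python
import numpy as np

def activation_pattern(x, weights, biases):
    """sigma(x): concatenated indicators 1[z_j^(l) >= 0] over all hidden layers."""
    a, bits = x, []
    for W, b in zip(weights, biases):      # hidden layers only
        z = W @ a + b
        bits.append((z >= 0).astype(int))  # sigma^(l)(x)
        a = np.maximum(z, 0.0)
    return np.concatenate(bits)

# Two hidden layers of width 2 each (N = 4), illustrative weights:
Ws = [np.array([[1., 0.], [0., -1.]]), np.array([[1., 1.], [-1., 0.]])]
bs = [np.zeros(2), np.zeros(2)]
print(activation_pattern(np.array([1., 2.]), Ws, bs))   # [1 0 1 0]
```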
Within \(\mathcal{R}_\sigma\), the ReLU at each neuron acts as a fixed linear map (either the identity or the zero map), so the entire network reduces to a single affine function of \(x\). On the boundary between two regions, continuity of ReLU ensures the two affine functions agree.
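The claim that the network is affine on each region can be checked numerically: an affine map satisfies the midpoint identity \(f\!\left(\tfrac{x+y}{2}\right) = \tfrac{f(x)+f(y)}{2}\) whenever \(x\), \(y\), and the midpoint all carry the same pattern. A sketch with illustrative weights:

```python
import numpy as np

W1 = np.array([[1., 0.], [0., 1.], [1., 1.], [1., -1.]]); b1 = np.zeros(4)
W2 = np.array([[1., 1., 1., 1.]]);                        b2 = np.zeros(1)

def f(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def pattern(x):
    return tuple((W1 @ x + b1 >= 0).astype(int))

x, y = np.array([1.0, 2.0]), np.array([0.5, 1.0])
m = (x + y) / 2
assert pattern(x) == pattern(y) == pattern(m)   # all three in the same region
assert np.allclose(f(m), (f(x) + f(y)) / 2)     # midpoint identity holds there
```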
Proposition 1.3 (Activation regions are convex polyhedra). Each non-empty activation region \(\mathcal{R}_\sigma\) is a convex polyhedron (in general neither open nor closed: active neurons contribute closed half-spaces \(z^{(\ell)}_j \geq 0\), inactive neurons open half-spaces \(z^{(\ell)}_j < 0\)). Its boundary faces are contained in the union of the hyperplanes \(\{z^{(\ell)}_j(x) = 0\}\), one per hidden neuron.
Proof sketch. The pre-activation \(z^{(\ell)}_j(x)\) is an affine function of \(x\) within any region where layers \(1, \ldots, \ell-1\) have fixed activation patterns (induction on \(\ell\)). The region \(\{z^{(\ell)}_j(x) \geq 0\}\) is therefore a closed half-space, and \(\mathcal{R}_\sigma\) is the intersection of \(N\) half-spaces and their complements — a convex polyhedron. \(\square\)
The collection \(\{\overline{\mathcal{R}_\sigma}\}\) over all non-empty patterns \(\sigma\) forms a polyhedral decomposition of \(\mathbb{R}^{n_0}\): a finite cover by closed convex polyhedra that overlap only on boundaries. See [3] and [4] for bounds on the number of non-empty regions.
The bent-hyperplane arrangement
The activation boundaries \(\{z^{(\ell)}_j(x) = 0\}\) partition \(\mathbb{R}^{n_0}\) into cells. Because \(z^{(\ell)}_j\) depends on the activation pattern of earlier layers, these boundaries are not globally linear — they are bent along the boundaries of lower-layer regions. This is the bent-hyperplane arrangement of the network.
The arrangement is the geometric object that encodes everything about the network’s expressivity: how many linear regions it has, how they are arranged, and (in the sheaf picture of Chs. 7–9) which activation pattern is “active” at a given point during diffusion.
Remark 1.4 (State-dependent sheaf). When the sheaf heat equation is run on a ReLU network (Ch. 9), the activation pattern \(\sigma\) depends on the current state \(z^{(\ell)}\), which is evolving. The sheaf structure — specifically the diagonal projection matrices \(R_{z^{(\ell)}}\) on the activation edges — therefore changes with time. This is what [1] call a state-dependent sheaf, and it is why the convergence proof (Theorem 4.1 of that paper) requires CPWA dynamical systems theory rather than standard linear diffusion.
The diagonal ReLU matrix
For a fixed activation pattern, define the diagonal ReLU matrix \(R_{z^{(\ell)}} \in \mathbb{R}^{n_\ell \times n_\ell}\) by \[(R_{z^{(\ell)}})_{jj} = \begin{cases} 1 & \text{if } z^{(\ell)}_j \geq 0, \\ 0 & \text{if } z^{(\ell)}_j < 0. \end{cases}\]
This is an orthogonal projection: \((R_{z^{(\ell)}})^2 = R_{z^{(\ell)}}\) and \((R_{z^{(\ell)}})^T = R_{z^{(\ell)}}\). With this notation, \(a^{(\ell)} = R_{z^{(\ell)}} z^{(\ell)}\), and the forward pass (1) becomes two matrix multiplications per layer. This representation — the activation as a state-dependent projection — is exactly the form that the sheaf construction in Ch. 7 will encode as a restriction map.
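The projection identities are immediate to confirm numerically (the vector \(z\) below is an arbitrary choice; note the entry at exactly zero):

```python
import numpy as np

z = np.array([0.7, -1.2, 0.0, 3.1])
R = np.diag((z >= 0).astype(float))   # (R)_jj = 1 iff z_j >= 0 (convention: 1 at 0)

assert np.allclose(R @ R, R)                    # idempotent
assert np.allclose(R.T, R)                      # symmetric -> orthogonal projection
assert np.allclose(R @ z, np.maximum(z, 0.0))   # a = R z recovers ReLU(z)
```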
3.5 Worked example: the [2, 4, 1] network
We work through the [2, 4, 1] running network used in [1] (Figure 6 of that paper): two inputs, four hidden neurons, one output, identity final activation.
Setup. Let \(x = (x_1, x_2)^T \in \mathbb{R}^2\). The hidden layer computes \[z^{(1)} = W^{(1)} x + b^{(1)} \in \mathbb{R}^4, \qquad a^{(1)} = R_{z^{(1)}} z^{(1)} \in \mathbb{R}^4,\] and the output layer computes \(z^{(2)} = W^{(2)} a^{(1)} + b^{(2)} \in \mathbb{R}\), \(\hat{y} = z^{(2)}\).
Activation patterns. There are \(2^4 = 16\) possible activation patterns, but not all are realized; the actual number of non-empty regions depends on the weights. For four lines in general position in the plane, at most \(\binom{4}{0} + \binom{4}{1} + \binom{4}{2} = 11\) regions are non-empty; if all four lines pass through a common point (as with zero biases), the count drops to \(8\).
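The region count can be estimated empirically by sweeping a grid and collecting the distinct patterns. A sketch (random weights with a fixed seed; the grid bounds and resolution are arbitrary, so the sweep gives a lower bound on the true count):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal(4)

xs = np.linspace(-6.0, 6.0, 300)
X1, X2 = np.meshgrid(xs, xs)
Z = W1 @ np.stack([X1.ravel(), X2.ravel()]) + b1[:, None]   # (4, 90000)
patterns = set(map(tuple, (Z >= 0).T.astype(int)))
print(len(patterns))   # at most 11 by the general-position bound
```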
Within one region. Suppose \(\sigma = (1,1,0,1)\) — neurons 1, 2, 4 active, neuron 3 inactive. Then \(R_{z^{(1)}} = \mathrm{diag}(1,1,0,1)\), and the output is \[\hat{y} = W^{(2)} \mathrm{diag}(1,1,0,1) (W^{(1)} x + b^{(1)}) + b^{(2)},\] which is an affine function of \(x\). The region \(\mathcal{R}_\sigma\) is the intersection of four half-spaces: \(z^{(1)}_1 \geq 0\), \(z^{(1)}_2 \geq 0\), \(z^{(1)}_3 < 0\), \(z^{(1)}_4 \geq 0\). Each constraint is a linear inequality in \((x_1, x_2)\), so \(\mathcal{R}_\sigma\) is a convex polygon in the input plane.
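The within-region affine formula can be verified against the plain ReLU forward pass. A sketch with illustrative weights chosen so that the pattern at the test point is \(\sigma = (1,1,0,1)\), matching the pattern above:

```python
import numpy as np

W1 = np.array([[1., 0.], [0., 1.], [-1., -1.], [1., -1.]]); b1 = np.zeros(4)
W2 = np.array([[1., 1., 1., 1.]]);                          b2 = np.zeros(1)

x  = np.array([2.0, 1.0])
z1 = W1 @ x + b1                        # [2, 1, -3, 1] -> sigma = (1, 1, 0, 1)
R  = np.diag((z1 >= 0).astype(float))   # diag(1, 1, 0, 1)

y_relu   = W2 @ np.maximum(z1, 0.0) + b2
y_affine = W2 @ R @ (W1 @ x + b1) + b2  # affine formula, valid inside R_sigma
assert np.allclose(y_relu, y_affine)    # both give [4.]
```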
The diagonal ReLU matrix as a sheaf restriction map. Notice that \(R_{z^{(1)}}\) depends on the sign of \(z^{(1)}\), which depends on \(x\). In the sheaf picture (Ch. 7), this matrix becomes the restriction map on the activation edge connecting \(v_{z^{(1)}}\) to \(v_{a^{(1)}}\). When the state evolves under the heat equation, \(z^{(1)}\) changes, \(R_{z^{(1)}}\) can switch, and the activation-region boundary is crossed. Tracking these crossings is the hard part of the convergence proof.
3.6 Exercises
1.1 (Warm-up — CPWA by hand). Let \(f : \mathbb{R} \to \mathbb{R}\) be the scalar network \(f(x) = \mathrm{ReLU}(w_1 x + b_1) \cdot v_1 + \mathrm{ReLU}(w_2 x + b_2) \cdot v_2 + c\) with \(w_1 = 1, b_1 = 0, v_1 = 1, w_2 = -1, b_2 = 1, v_2 = 2, c = 0\). (a) Identify all activation regions. (b) Write the affine formula for \(f\) on each region. (c) Verify continuity at each boundary.
1.2 (Activation pattern count). Show that a single-hidden-layer ReLU network with \(n\) neurons, zero biases, and input dimension \(d\) has at most \(2\sum_{i=0}^{d-1}\binom{n-1}{i}\) non-empty activation regions. (Hint: with zero biases the activation boundaries are hyperplanes through the origin; count the regions of a central hyperplane arrangement in \(\mathbb{R}^d\).)
1.3 (ReLU as projection). Verify directly that \(R_{z^{(\ell)}}\) is an orthogonal projection for any \(z^{(\ell)}\): show \((R_{z^{(\ell)}})^2 = R_{z^{(\ell)}}\) and \((R_{z^{(\ell)}})^T = R_{z^{(\ell)}}\).
1.4 (Piecewise-affine composition). Show that the composition of two CPWA functions is CPWA. Conclude that any feedforward ReLU network defines a CPWA function directly from the definition, without invoking Theorem 1.2.
1.5 (Structured non-differentiability). Let \(f(x) = \mathrm{ReLU}(x)\) and \(g(x) = |x|\). Both are non-differentiable at \(x = 0\). Explain why the two non-differentiabilities are of the same type. Then construct a function \(h : \mathbb{R} \to \mathbb{R}\) that is everywhere continuous but nowhere differentiable, and explain why no finite ReLU network can represent \(h\).
1.6 (Project — visualizing the arrangement). For the [2, 4, 1] network, choose concrete random weights and (a) plot the bent-hyperplane arrangement on the input plane, (b) shade each region by its activation pattern, (c) verify that the output function is constant-slope within each region. Use the lab notebook lab-01-activation-regions.
3.7 Coding lab
lab-01-activation-regions — Build the [2, 4, 1] running network in NumPy, sweep the input plane on a grid, and record the activation pattern at each point. Plot the resulting bent-hyperplane arrangement. Then: perturb the weights and watch the arrangement change; identify a set of weights that creates the maximum number of regions for a 2D input with 4 neurons. The lab is the geometric foundation for the sheaf construction in Ch. 7: the activation regions you plot here are exactly the polyhedral cells over which the state-dependent sheaf \(\mathcal{F}_t\) is constant.
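A sketch of the core of the lab (plotting is left out; `region_id` is the integer label array one would pass to `plt.imshow` to shade the regions; the seed, window, and resolution are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal(4)

xs = np.linspace(-3.0, 3.0, 200)
X1, X2 = np.meshgrid(xs, xs)
Z = W1 @ np.stack([X1.ravel(), X2.ravel()]) + b1[:, None]     # (4, 40000)
bits = (Z >= 0).astype(int)                                   # pattern per grid point
region_id = (bits * (2 ** np.arange(4))[:, None]).sum(0).reshape(X1.shape)
print(len(np.unique(region_id)))   # number of regions visible in this window
```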
3.8 Further reading
The CPWA characterization of ReLU networks is developed in detail by [2], who prove Theorem 1.2 and study depth-width tradeoffs. The geometry of activation regions — specifically the connection to hyperplane arrangements and persistent homology — is explored in [4]. For an earlier study of the number of linear regions, see [3]. Bosca, Rask, Tanweer, Tawfeek, and Stone [5] study topological invariants of these regions and their relation to expressivity.
3.9 FAQ / common misconceptions
Q: If ReLU is non-differentiable, how does backpropagation work? Backpropagation uses a subgradient of ReLU at zero, conventionally taken to be 0 or 1. Because ReLU is non-differentiable only on a set of measure zero (the hyperplane \(z^{(\ell)}_j = 0\)), this ambiguity almost never matters in practice. In the sheaf setting (Ch. 9), the formal tool for handling it is the Clarke subdifferential (Ch. 3); the convention \((R_{z^{(\ell)}})_{jj} = 1\) for \(z^{(\ell)}_j = 0\) is adopted throughout [1].
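The \(z^{(\ell)}_j = 0\) convention is a single character in code: use `>=` rather than `>` when forming the diagonal.

```python
import numpy as np

z = np.array([-1.0, 0.0, 2.0])
R_diag = (z >= 0).astype(float)   # convention: (R)_jj = 1 when z_j = 0
print(R_diag)                     # [0. 1. 1.]
```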
Q: Does the CPWA structure change during training? Yes. As weights update, the hyperplanes \(\{z^{(\ell)}_j(x) = 0\}\) move, so the polyhedral decomposition changes. In the joint-dynamics training of Ch. 12, both the cochain (the activations) and the restriction maps (the weights) evolve simultaneously, and the activation-region boundaries move with them.
Q: Is a [2, 4, 1] network the simplest interesting example? It is the smallest network with enough hidden neurons to show a nontrivial arrangement in 2D input space. The [2, 3, 1] network (three neurons) can make at most 7 regions, and a [2, 2, 1] network makes at most 4. Four neurons in 2D allows richer arrangements while remaining tractable by hand. The paper [1] uses [2, 4, 1] for convergence experiments and [2, 30, 1] for training experiments.