3 ReLU Networks as Continuous Piecewise-Affine Maps
Purpose. Reframes a feedforward ReLU MLP as a CPWA function; introduces activation patterns, polyhedral cells, and the bent-hyperplane arrangement that will become the ‘state’ of the sheaf later.
3.1 Motivating example
A two-input ReLU network with a single hidden neuron computes \(\hat{y} = \max(w_1 x_1 + w_2 x_2 + b, 0) \cdot v + c\). Plot \(\hat{y}\) as a function of \((x_1, x_2)\): you do not get a smooth surface. You get two flat pieces joined along a single crease, the line where the pre-activation is zero. Add more hidden neurons and the surface becomes a patchwork of planar facets, bent along lines that cut through the input plane. Each facet is a region where every hidden neuron is either fully on or fully off. The bending lines are the zero-sets of the pre-activations, the activation boundaries.
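The bent surface is easy to probe numerically. A minimal sketch of the single-neuron example (the weights \(w = (1, -1)\), \(b = 0\), \(v = 2\), \(c = 0.5\) are arbitrary illustrative choices, not from the text):

```python
import numpy as np

# Single hidden neuron: y-hat = ReLU(w.x + b) * v + c.
w, b, v, c = np.array([1.0, -1.0]), 0.0, 2.0, 0.5

def yhat(x):
    return max(w @ x + b, 0.0) * v + c

# Where w.x + b >= 0 the surface is the tilted plane v*(w.x + b) + c;
# where w.x + b < 0 it is the flat plane y = c. The crease is w.x + b = 0.
print(yhat(np.array([2.0, 1.0])))   # active side:   2*(2-1) + 0.5 = 2.5
print(yhat(np.array([1.0, 2.0])))   # inactive side: 0.5
```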
This structure is not an accident or a pathology of ReLU. It is the defining feature, and every theorem in this book about ReLU networks exploits it.
3.2 Key concepts and results
- Feedforward ReLU MLP: precise definition, layer notation (\(z^{(\ell)}\) pre-activation, \(a^{(\ell)}\) post-activation).
- Continuous piecewise-affine (CPWA) functions and why every ReLU network is one.
- Activation patterns \(\sigma \in \{0,1\}^N\) and activation regions.
- The bent-hyperplane arrangement: how the input space is cut into polyhedral cells.
- Why ReLU non-differentiability is structured, not pathological.
3.3 Intuition
Think of a ReLU network as a switching system. At any given input \(x\), each hidden neuron is either active (pre-activation \(\geq 0\), output = input) or inactive (pre-activation \(< 0\), output \(= 0\)). The pattern of active and inactive neurons is the activation pattern \(\sigma(x) \in \{0,1\}^N\). Within the region of input space where \(\sigma(x)\) is constant, the network is simply an affine function of \(x\) — matrix multiply, add bias, done. The nonlinearity comes entirely from the region boundaries, where \(\sigma\) switches.
The key insight: there are finitely many such regions (one per activation pattern), each is a convex polyhedron, and the affine functions on adjacent regions agree on their shared boundary. That is exactly what it means for the global function to be continuous and piecewise-affine.
This is why ReLU non-differentiability does not cause the trouble it appears to. Classically, “non-differentiable” sounds like “unpredictable.” Here it means “piecewise-linear with a finite combinatorial structure.” The breaking points are organized along hyperplanes in input space, not scattered randomly. Ch. 3 will develop the tools (Filippov solutions, Clarke subdifferentials) needed to work with these breaking points in a dynamical context.
3.4 Formal development
Feedforward ReLU network
A feedforward ReLU network with \(k\) hidden layers has:
- Input \(x \in \mathbb{R}^{n_0}\).
- Layers \(\ell \in \{1, \ldots, k+1\}\) with widths \(n_1, \ldots, n_{k+1}\).
- Weight matrices \(W^{(\ell)} \in \mathbb{R}^{n_\ell \times n_{\ell-1}}\) and bias vectors \(b^{(\ell)} \in \mathbb{R}^{n_\ell}\).
Setting \(a^{(0)} \equiv x\), each hidden layer computes: \[z^{(\ell)} = W^{(\ell)} a^{(\ell-1)} + b^{(\ell)}, \qquad a^{(\ell)} = \mathrm{ReLU}(z^{(\ell)}), \qquad 1 \leq \ell \leq k, \tag{1}\] where \(\mathrm{ReLU}(z) = \max(z, 0)\) is applied element-wise. The output layer applies a task-specific activation \(\varphi\): \[\hat{y} = \varphi\!\left(W^{(k+1)} a^{(k)} + b^{(k+1)}\right). \tag{2}\]
For regression, \(\varphi = \mathrm{id}\). For classification, \(\varphi = \mathrm{softmax}\) or \(\sigma\) (sigmoid). The hidden-layer computation (1) is the focus of Chs. 1–9; nonlinear output \(\varphi\) is treated in Ch. 10.
The quantities \(z^{(\ell)}\) are called pre-activations and \(a^{(\ell)}\) post-activations (or activations). The full forward pass is the composition of \(k+1\) layers.
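Equations (1) and (2) translate directly into code. A sketch in NumPy (the identity output activation, the layer sizes, and the hand-picked weights below are illustrative choices):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, weights, biases, phi=lambda z: z):
    """Eq. (1) for each hidden layer, then eq. (2) with output activation phi."""
    a = x                                        # a^(0) = x
    for W, b in zip(weights[:-1], biases[:-1]):  # hidden layers
        a = relu(W @ a + b)                      # z^(l) -> a^(l)
    return phi(weights[-1] @ a + biases[-1])     # output layer

# A [2, 4, 1] instance with hand-picked weights:
weights = [np.array([[1., 0.], [0., 1.], [1., 1.], [1., -1.]]),
           np.array([[1., 1., 1., 1.]])]
biases  = [np.zeros(4), np.zeros(1)]
print(forward(np.array([1., 2.]), weights, biases))  # z^(1) = [1,2,3,-1], ReLU -> [1,2,3,0], output [6.]
```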
Continuous piecewise-affine functions
Definition 1.1 (CPWA function). A function \(f : \mathbb{R}^n \to \mathbb{R}^m\) is continuous piecewise-affine (CPWA) if there exists a finite collection of closed convex polyhedra \(\{P_i\}\) that covers \(\mathbb{R}^n\) and on each \(P_i\) the restriction \(f|_{P_i}\) is affine. Continuity requires that the affine pieces agree on shared boundaries.
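A one-line sanity check of the definition: \(|x|\) is CPWA with two pieces, \(-x\) on \((-\infty, 0]\) and \(x\) on \([0, \infty)\), which agree at the shared boundary \(x = 0\):

```python
import numpy as np

xs = np.linspace(-2.0, 2.0, 9)
pieces = np.where(xs >= 0, xs, -xs)      # affine piece chosen per polyhedron
assert np.allclose(pieces, np.abs(xs))   # the glued pieces reproduce |x|
assert pieces[xs == 0] == 0.0            # both pieces agree at the boundary
```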
A piecewise-linear (PWL) function is CPWA with all affine pieces passing through the origin (no constant term). The terms CPWA and CPWL are often used interchangeably in the literature; we follow [1] in using CPWA.
Theorem 1.2 (Arora et al. [2]). A function \(f : \mathbb{R}^n \to \mathbb{R}^m\) is CPWA if and only if it can be represented by a feedforward ReLU network of sufficient depth and width.
We do not prove Theorem 1.2 here; we cite it to place ReLU networks in the broader class of CPWA functions and to confirm that this class is exactly right — no larger, no smaller. What we do derive, and will use throughout the book, is the activation-pattern decomposition that makes the CPWA structure of a ReLU network explicit.
Activation patterns and regions
For a network with \(k\) hidden layers and layer widths \(n_1, \ldots, n_k\), let \(N = \sum_{\ell=1}^k n_\ell\) be the total number of hidden neurons. The activation pattern of an input \(x\) is the binary vector \[\sigma(x) = \bigl(\sigma^{(1)}(x), \ldots, \sigma^{(k)}(x)\bigr) \in \{0,1\}^N, \quad \sigma^{(\ell)}_j(x) = \mathbf{1}\!\left[z^{(\ell)}_j(x) \geq 0\right].\]
For each pattern \(\sigma \in \{0,1\}^N\), the activation region \(\mathcal{R}_\sigma\) is the set of inputs that produce that pattern: \[\mathcal{R}_\sigma = \{x \in \mathbb{R}^{n_0} : \sigma(x) = \sigma\}.\]
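The pattern \(\sigma(x)\) is cheap to compute alongside the forward pass. A sketch for a network with two hidden layers (widths and weights below are illustrative):

```python
import numpy as np

def activation_pattern(x, weights, biases):
    """sigma(x): concatenated indicators 1[z_j^(l) >= 0] over all hidden layers."""
    a, bits = x, []
    for W, b in zip(weights, biases):      # hidden layers only
        z = W @ a + b
        bits.append((z >= 0).astype(int))  # sigma^(l)(x)
        a = np.maximum(z, 0.0)
    return np.concatenate(bits)

# Two hidden layers of width 2 each (N = 4), illustrative weights:
Ws = [np.array([[1., 0.], [0., -1.]]), np.array([[1., 1.], [-1., 0.]])]
bs = [np.zeros(2), np.zeros(2)]
print(activation_pattern(np.array([1., 2.]), Ws, bs))   # [1 0 1 0]
```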
Within \(\mathcal{R}_\sigma\), the ReLU at each neuron acts as a fixed linear map (either the identity or the zero map), so the entire network reduces to a single affine function of \(x\). On the boundary between two regions, continuity of ReLU ensures the two affine functions agree.
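The claim that the network is affine on each region can be checked numerically: an affine map satisfies the midpoint identity \(f\!\left(\tfrac{x+y}{2}\right) = \tfrac{f(x)+f(y)}{2}\) whenever \(x\), \(y\), and the midpoint all carry the same pattern. A sketch with illustrative weights:

```python
import numpy as np

W1 = np.array([[1., 0.], [0., 1.], [1., 1.], [1., -1.]]); b1 = np.zeros(4)
W2 = np.array([[1., 1., 1., 1.]]);                        b2 = np.zeros(1)

def f(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def pattern(x):
    return tuple((W1 @ x + b1 >= 0).astype(int))

x, y = np.array([1.0, 2.0]), np.array([0.5, 1.0])
m = (x + y) / 2
assert pattern(x) == pattern(y) == pattern(m)   # all three in the same region
assert np.allclose(f(m), (f(x) + f(y)) / 2)     # midpoint identity holds there
```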
Proposition 1.3 (Activation regions are convex polyhedra). Each non-empty activation region \(\mathcal{R}_\sigma\) is a convex polyhedron (in general neither open nor closed: active neurons contribute closed half-spaces \(z^{(\ell)}_j \geq 0\), inactive neurons open half-spaces \(z^{(\ell)}_j < 0\)). Its boundary faces are contained in the union of the hyperplanes \(\{z^{(\ell)}_j(x) = 0\}\), one per hidden neuron.
Proof sketch. The pre-activation \(z^{(\ell)}_j(x)\) is an affine function of \(x\) within any region where layers \(1, \ldots, \ell-1\) have fixed activation patterns (induction on \(\ell\)). The region \(\{z^{(\ell)}_j(x) \geq 0\}\) is therefore a closed half-space, and \(\mathcal{R}_\sigma\) is the intersection of \(N\) half-spaces and their complements — a convex polyhedron. \(\square\)
The collection \(\{\overline{\mathcal{R}_\sigma}\}\) over all non-empty patterns \(\sigma\) forms a polyhedral decomposition of \(\mathbb{R}^{n_0}\): a finite cover by closed convex polyhedra that overlap only on boundaries. See [3] and [4] for bounds on the number of non-empty regions.
The bent-hyperplane arrangement
The activation boundaries \(\{z^{(\ell)}_j(x) = 0\}\) partition \(\mathbb{R}^{n_0}\) into cells. Because \(z^{(\ell)}_j\) depends on the activation pattern of earlier layers, these boundaries are not globally linear — they are bent along the boundaries of lower-layer regions. This is the bent-hyperplane arrangement of the network.
The arrangement is the geometric object that encodes everything about the network’s expressivity: how many linear regions it has, how they are arranged, and (in the sheaf picture of Chs. 7–9) which activation pattern is “active” at a given point during diffusion.
Remark 1.4 (State-dependent sheaf). When the sheaf heat equation is run on a ReLU network (Ch. 9), the activation pattern \(\sigma\) depends on the current state \(z^{(\ell)}\), which is evolving. The sheaf structure — specifically the diagonal projection matrices \(R_{z^{(\ell)}}\) on the activation edges — therefore changes with time. This is what [1] call a state-dependent sheaf, and it is why the convergence proof (Theorem 4.1 of that paper) requires CPWA dynamical systems theory rather than standard linear diffusion.
The diagonal ReLU matrix
For a fixed activation pattern, define the diagonal ReLU matrix \(R_{z^{(\ell)}} \in \mathbb{R}^{n_\ell \times n_\ell}\) by \[(R_{z^{(\ell)}})_{jj} = \begin{cases} 1 & \text{if } z^{(\ell)}_j \geq 0, \\ 0 & \text{if } z^{(\ell)}_j < 0. \end{cases}\]
This is an orthogonal projection: \((R_{z^{(\ell)}})^2 = R_{z^{(\ell)}}\) and \((R_{z^{(\ell)}})^T = R_{z^{(\ell)}}\). With this notation, \(a^{(\ell)} = R_{z^{(\ell)}} z^{(\ell)}\), and the forward pass (1) becomes two matrix multiplications per layer. This representation — the activation as a state-dependent projection — is exactly the form that the sheaf construction in Ch. 7 will encode as a restriction map.
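The projection identities are immediate to confirm numerically (the vector \(z\) below is an arbitrary choice; note the entry at exactly zero):

```python
import numpy as np

z = np.array([0.7, -1.2, 0.0, 3.1])
R = np.diag((z >= 0).astype(float))   # (R)_jj = 1 iff z_j >= 0 (convention: 1 at 0)

assert np.allclose(R @ R, R)                    # idempotent
assert np.allclose(R.T, R)                      # symmetric -> orthogonal projection
assert np.allclose(R @ z, np.maximum(z, 0.0))   # a = R z recovers ReLU(z)
```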
3.5 Worked example: the [2, 4, 1] network
We work through the [2, 4, 1] running network used in [1] (Figure 6 of that paper): two inputs, four hidden neurons, one output, identity final activation.
Setup. Let \(x = (x_1, x_2)^T \in \mathbb{R}^2\). The hidden layer computes \[z^{(1)} = W^{(1)} x + b^{(1)} \in \mathbb{R}^4, \qquad a^{(1)} = R_{z^{(1)}} z^{(1)} \in \mathbb{R}^4,\] and the output layer computes \(z^{(2)} = W^{(2)} a^{(1)} + b^{(2)} \in \mathbb{R}\), \(\hat{y} = z^{(2)}\).
Activation patterns. There are \(2^4 = 16\) possible activation patterns, but not all are realized; the actual number of non-empty regions depends on the weights. For four lines in general position in the plane, at most \(\binom{4}{0} + \binom{4}{1} + \binom{4}{2} = 11\) regions are non-empty; if all four lines pass through a common point (as with zero biases), the count drops to \(8\).
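The region count can be estimated empirically by sweeping a grid and collecting the distinct patterns. A sketch (random weights with a fixed seed; the grid bounds and resolution are arbitrary, so the sweep gives a lower bound on the true count):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal(4)

xs = np.linspace(-6.0, 6.0, 300)
X1, X2 = np.meshgrid(xs, xs)
Z = W1 @ np.stack([X1.ravel(), X2.ravel()]) + b1[:, None]   # (4, 90000)
patterns = set(map(tuple, (Z >= 0).T.astype(int)))
print(len(patterns))   # at most 11 by the general-position bound
```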
Within one region. Suppose \(\sigma = (1,1,0,1)\) — neurons 1, 2, 4 active, neuron 3 inactive. Then \(R_{z^{(1)}} = \mathrm{diag}(1,1,0,1)\), and the output is \[\hat{y} = W^{(2)} \mathrm{diag}(1,1,0,1) (W^{(1)} x + b^{(1)}) + b^{(2)},\] which is an affine function of \(x\). The region \(\mathcal{R}_\sigma\) is the intersection of four half-spaces: \(z^{(1)}_1 \geq 0\), \(z^{(1)}_2 \geq 0\), \(z^{(1)}_3 < 0\), \(z^{(1)}_4 \geq 0\). Each constraint is a linear inequality in \((x_1, x_2)\), so \(\mathcal{R}_\sigma\) is a convex polygon in the input plane.
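The within-region affine formula can be verified against the plain ReLU forward pass. A sketch with illustrative weights chosen so that the pattern at the test point is \(\sigma = (1,1,0,1)\), matching the pattern above:

```python
import numpy as np

W1 = np.array([[1., 0.], [0., 1.], [-1., -1.], [1., -1.]]); b1 = np.zeros(4)
W2 = np.array([[1., 1., 1., 1.]]);                          b2 = np.zeros(1)

x  = np.array([2.0, 1.0])
z1 = W1 @ x + b1                        # [2, 1, -3, 1] -> sigma = (1, 1, 0, 1)
R  = np.diag((z1 >= 0).astype(float))   # diag(1, 1, 0, 1)

y_relu   = W2 @ np.maximum(z1, 0.0) + b2
y_affine = W2 @ R @ (W1 @ x + b1) + b2  # affine formula, valid inside R_sigma
assert np.allclose(y_relu, y_affine)    # both give [4.]
```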
The diagonal ReLU matrix as a sheaf restriction map. Notice that \(R_{z^{(1)}}\) depends on the sign of \(z^{(1)}\), which depends on \(x\). In the sheaf picture (Ch. 7), this matrix becomes the restriction map on the activation edge connecting \(v_{z^{(1)}}\) to \(v_{a^{(1)}}\). When the state evolves under the heat equation, \(z^{(1)}\) changes, \(R_{z^{(1)}}\) can switch, and the activation-region boundary is crossed. Tracking these crossings is the hard part of the convergence proof.
3.6 Exercises
1.1 (Warm-up — CPWA by hand). Let \(f : \mathbb{R} \to \mathbb{R}\) be the scalar network \(f(x) = \mathrm{ReLU}(w_1 x + b_1) \cdot v_1 + \mathrm{ReLU}(w_2 x + b_2) \cdot v_2 + c\) with \(w_1 = 1, b_1 = 0, v_1 = 1, w_2 = -1, b_2 = 1, v_2 = 2, c = 0\). (a) Identify all activation regions. (b) Write the affine formula for \(f\) on each region. (c) Verify continuity at each boundary.
1.2 (Activation pattern count). Show that a single-hidden-layer ReLU network with \(n\) neurons, zero biases, and input dimension \(d\) has at most \(2\sum_{i=0}^{d-1}\binom{n-1}{i}\) non-empty activation regions. (Hint: with zero biases the activation boundaries are hyperplanes through the origin; count the regions of a central hyperplane arrangement in \(\mathbb{R}^d\).)
1.3 (ReLU as projection). Verify directly that \(R_{z^{(\ell)}}\) is an orthogonal projection for any \(z^{(\ell)}\): show \((R_{z^{(\ell)}})^2 = R_{z^{(\ell)}}\) and \((R_{z^{(\ell)}})^T = R_{z^{(\ell)}}\).
1.4 (Piecewise-affine composition). Show that the composition of two CPWA functions is CPWA. Conclude that any feedforward ReLU network defines a CPWA function directly from the definition, without invoking Theorem 1.2.
1.5 (Structured non-differentiability). Let \(f(x) = \mathrm{ReLU}(x)\) and \(g(x) = |x|\). Both are non-differentiable at \(x = 0\). Explain why the two non-differentiabilities are of the same type. Then construct a function \(h : \mathbb{R} \to \mathbb{R}\) that is everywhere continuous but nowhere differentiable, and explain why no finite ReLU network can represent \(h\).
1.6 (Project — visualizing the arrangement). For the [2, 4, 1] network, choose concrete random weights and (a) plot the bent-hyperplane arrangement on the input plane, (b) shade each region by its activation pattern, (c) verify that the output function is constant-slope within each region. Use the lab notebook lab-01-activation-regions.
3.7 Coding lab
lab-01-activation-regions — Build the [2, 4, 1] running network in NumPy, sweep the input plane on a grid, and record the activation pattern at each point. Plot the resulting bent-hyperplane arrangement. Then: perturb the weights and watch the arrangement change; identify a set of weights that creates the maximum number of regions for a 2D input with 4 neurons. The lab is the geometric foundation for the sheaf construction in Ch. 7: the activation regions you plot here are exactly the polyhedral cells over which the state-dependent sheaf \(\mathcal{F}_t\) is constant.
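A sketch of the core of the lab (plotting is left out; `region_id` is the integer label array one would pass to `plt.imshow` to shade the regions; the seed, window, and resolution are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal(4)

xs = np.linspace(-3.0, 3.0, 200)
X1, X2 = np.meshgrid(xs, xs)
Z = W1 @ np.stack([X1.ravel(), X2.ravel()]) + b1[:, None]     # (4, 40000)
bits = (Z >= 0).astype(int)                                   # pattern per grid point
region_id = (bits * (2 ** np.arange(4))[:, None]).sum(0).reshape(X1.shape)
print(len(np.unique(region_id)))   # number of regions visible in this window
```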
3.8 Further reading
The CPWA characterization of ReLU networks is developed in detail by [2], who prove Theorem 1.2 and study depth-width tradeoffs. The geometry of activation regions — specifically the connection to hyperplane arrangements and persistent homology — is explored in [4]. For an earlier study of the number of linear regions, see [3]. Bosca, Rask, Tanweer, Tawfeek, and Stone [5] study topological invariants of these regions and their relation to expressivity.
3.9 FAQ / common misconceptions
Q: If ReLU is non-differentiable, how does backpropagation work? Backpropagation uses a subgradient of ReLU at zero, conventionally taken to be 0 or 1. Because ReLU is non-differentiable only on a set of measure zero (the hyperplane \(z^{(\ell)}_j = 0\)), this ambiguity almost never matters in practice. In the sheaf setting (Ch. 9), the formal tool for handling it is the Clarke subdifferential (Ch. 3); the convention \((R_{z^{(\ell)}})_{jj} = 1\) for \(z^{(\ell)}_j = 0\) is adopted throughout [1].
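The \(z^{(\ell)}_j = 0\) convention is a single character in code: use `>=` rather than `>` when forming the diagonal.

```python
import numpy as np

z = np.array([-1.0, 0.0, 2.0])
R_diag = (z >= 0).astype(float)   # convention: (R)_jj = 1 when z_j = 0
print(R_diag)                     # [0. 1. 1.]
```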
Q: Does the CPWA structure change during training? Yes. As weights update, the hyperplanes \(\{z^{(\ell)}_j(x) = 0\}\) move, so the polyhedral decomposition changes. In the joint-dynamics training of Ch. 12, both the cochain (the activations) and the restriction maps (the weights) evolve simultaneously, and the activation-region boundaries move with them.
Q: Is a [2, 4, 1] network the simplest interesting example? It is the smallest network with enough hidden neurons to show a nontrivial arrangement in 2D input space. The [2, 3, 1] network (three neurons) can make at most 7 regions, and a [2, 2, 1] network makes at most 4. Four neurons in 2D allows richer arrangements while remaining tractable by hand. The paper [1] uses [2, 4, 1] for convergence experiments and [2, 30, 1] for training experiments.