ReLU
Deep Learning

The Rectified Linear Unit activation function, defined as f(x) = max(0, x), which has become the default non-linearity in modern deep networks due to its simple gradient and computational efficiency.
ReLU (Rectified Linear Unit) computes f(x) = max(0, x): it outputs the input directly if positive, and zero otherwise. Despite its simplicity, ReLU was a breakthrough that helped make deep learning practical. Its gradient is either 1 (for positive inputs) or 0 (for negative inputs; at exactly zero the derivative is undefined, and implementations conventionally use 0), which mitigates the vanishing gradient problem that plagued sigmoid and tanh activations.
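The forward pass and gradient described above can be sketched in a few lines of NumPy (an illustrative implementation, not any particular framework's):

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), applied elementwise
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for positive inputs, 0 otherwise
    # (the kink at x = 0 is conventionally assigned gradient 0)
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # zero for non-positive inputs, identity for positive
print(relu_grad(x))  # 0 for non-positive inputs, 1 for positive
```

Note that the gradient depends only on the sign of the input, so backpropagation through ReLU is a cheap elementwise mask.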
ReLU offers several advantages: its gradient does not saturate for positive values (unlike sigmoid/tanh), enabling effective training of deep networks; it produces sparse activations (for roughly zero-mean pre-activations, about half of neurons output zero), which is computationally efficient and creates more interpretable representations; and max(0, x) is trivially fast to compute compared to the exponentials in sigmoid/tanh.
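The sparsity claim is easy to check empirically. A minimal sketch, assuming zero-mean Gaussian pre-activations (an illustrative distribution, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated zero-mean pre-activations, as produced by a well-initialized layer
pre_acts = rng.standard_normal(100_000)
sparsity = np.mean(np.maximum(0.0, pre_acts) == 0.0)
print(f"fraction of zero activations: {sparsity:.3f}")  # close to 0.5
```

Because a zero-mean distribution puts about half its mass below zero, ReLU silences roughly half the units on any given input.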
The main drawback is the dying ReLU problem: neurons that receive only negative inputs always output zero and stop learning entirely (their gradient is permanently zero). Variants address this: Leaky ReLU allows a small gradient for negative inputs (f(x) = max(0.01x, x)), and GELU (used in transformers) provides a smooth alternative. He initialization was specifically developed to account for ReLU's variance reduction (zeroing half of activations), using Var(w) = 2/n_in instead of Xavier's 2/(n_in + n_out).
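The Leaky ReLU variant and He initialization described above can be sketched as follows (an illustrative NumPy sketch; the function names and the 512-unit layer size are assumptions for the example):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # f(x) = max(alpha * x, x): the small negative slope keeps
    # gradients nonzero, so "dead" units can still recover
    return np.maximum(alpha * x, x)

def he_init(n_in, n_out, rng):
    # He initialization: Var(w) = 2 / n_in compensates for ReLU
    # zeroing roughly half of the activations in each layer
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

rng = np.random.default_rng(0)
W = he_init(512, 512, rng)
print(leaky_relu(np.array([-1.0, 2.0])))  # negative input scaled by alpha
print(W.var())                            # close to 2 / 512
```

Scaling by sqrt(2/n_in) rather than Xavier's sqrt(2/(n_in + n_out)) restores the variance that ReLU halves, keeping activation magnitudes stable across many layers.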
Last updated: February 22, 2026