
Perceptron and Multilayer Neural Networks

Fundamentals of Neural Networks


Neural networks are inspired by biology, but have long since become an independent mathematical field. Rosenblatt's perceptron (1957) is the simplest formal neuron that already reveals the main principle: linear transformation + nonlinear activation. Multilayer networks (MLP) made of such neurons turned out to be universal approximators.

Formal neuron and perceptron

Biological motivation: A neuron in the brain receives signals through dendrites, sums them in the cell body, and "fires" (spike) only when the threshold is exceeded. The perceptron models this: weighted sum of inputs + threshold activation function.

Perceptron: $y = \operatorname{sign}(w^\mathrm{T}x + b) = \operatorname{sign}(\sum_i w_ix_i + b)$, where $x \in \mathbb{R}^n$ is the input vector, $w \in \mathbb{R}^n$ are weights, $b$ is bias (threshold), sign is the sign function ($\pm1$). Geometrically: $w$ and $b$ define the hyperplane $w^\mathrm{T}x + b = 0$ in $\mathbb{R}^n$. The perceptron divides the space into two half-spaces.

Perceptron convergence theorem (Rosenblatt, 1957): If the classes are linearly separable, the update rule $w_{t+1} = w_t + y_ix_i$ (applied whenever the perceptron errs on $(x_i, y_i)$) converges in a finite number of steps. The number of updates is at most $(R/\gamma)^2$, where $R = \max_i \|x_i\|$ is the data radius and $\gamma$ is the separation margin.
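To make the update rule concrete, here is a minimal NumPy sketch of the perceptron algorithm. The bias is folded into the weight vector via an appended constant input; the function names and the AND-style toy dataset are illustrative assumptions, not part of the original text.

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100):
    """Rosenblatt's rule: w += y_i * x_i on every misclassified point.
    X: (m, n) inputs, y: labels in {-1, +1}. Bias handled as an extra weight."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # append 1 for the bias term
    w = np.zeros(X_aug.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X_aug, y):
            if y_i * (w @ x_i) <= 0:      # misclassified (or on the boundary)
                w += y_i * x_i            # w_{t+1} = w_t + y_i x_i
                errors += 1
        if errors == 0:                   # converged: every point classified correctly
            break
    return w[:-1], w[-1]                  # weights, bias

# Example: a linearly separable (AND-like) problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w, b = train_perceptron(X, y)
print(np.sign(X @ w + b))                 # -> [-1, -1, -1, 1]
```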

Limitation: If the classes are not linearly separable, the perceptron cannot learn them. Classic counterexample: the XOR function (Minsky and Papert, 1969) — it is not linearly separable, so a single perceptron cannot solve it. This contributed to the first "AI winter".

Activation functions

Nonlinear activation is the key difference from linear regression. Without it: a stack of linear layers = one linear layer.

Sigmoid (logistic): $\sigma(z) = 1/(1 + e^{-z}) \in (0,1)$. Derivative: $\sigma'(z) = \sigma(z)[1-\sigma(z)]$. Problem: for large $|z|$, derivative $\rightarrow 0$ $\rightarrow$ "saturation," gradient vanishing. Used in early networks.

tanh: $\tanh(z) = (e^z - e^{-z})/(e^z + e^{-z}) \in (-1,1)$. Derivative: $1 - \tanh^2(z)$. Centered around 0 (better than sigmoid), but also saturates.

ReLU (Rectified Linear Unit): $\mathrm{ReLU}(z) = \max(0, z)$. Derivative: 1 for $z>0$, 0 for $z<0$. Advantages: no saturation for $z>0$ $\rightarrow$ efficient gradient propagation; fast to compute. Problem: "dead neurons" (dying ReLU) — if $z<0$ for all inputs, the neuron always outputs 0, its gradient is 0, and it never updates.

Leaky ReLU: $\max(\alpha z, z)$ with a small $\alpha$ (typically $\alpha = 0.01$) — a small slope for $z<0$ eliminates dying ReLU.

GELU: $G(z) = z \cdot \Phi(z)$, where $\Phi$ is the standard normal CDF. Used in BERT and GPT. Smoother than ReLU; combines a stochastic ("noisy") gating intuition with near-linear behavior for large inputs.
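A minimal NumPy sketch of the activations listed above and their derivatives. The exact GELU uses the normal CDF via scipy.special.erf, and the test points at the end are an arbitrary illustrative choice.

```python
import numpy as np
from scipy.special import erf  # exact normal CDF for GELU

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1 - s)                      # sigma'(z) = sigma(z)(1 - sigma(z))

def d_tanh(z):
    return 1 - np.tanh(z) ** 2              # tanh'(z) = 1 - tanh^2(z)

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(z.dtype)          # 1 for z > 0, 0 otherwise

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)    # small slope for z < 0

def gelu(z):
    phi = 0.5 * (1 + erf(z / np.sqrt(2)))   # standard normal CDF Phi(z)
    return z * phi

z = np.linspace(-3.0, 3.0, 7)
print(d_sigmoid(z).max())   # <= 0.25: sigmoid gradients are small -> vanishing-gradient risk
print(d_relu(z))            # [0 0 0 0 1 1 1]: zero gradient for z < 0 (dead-neuron risk)
```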

Multilayer network (MLP)

Forward pass:

Layer $l$: $a^l = \sigma(W^l a^{l-1} + b^l)$

$W^l \in \mathbb{R}^{n_l \times n_{l-1}}$ — weight matrix of layer $l$, $b^l$ — bias vector, $\sigma$ — activation function (applied elementwise), $a^l$ — activations of layer $l$, with $a^0 = x$.

Network function: $f(x; \theta) = a^{L} = \sigma(W^{L} \sigma(W^{L-1}\ldots \sigma(W^{1} x + b^{1}) \ldots + b^{L-1}) + b^{L})$. Parameters: $\theta = \{W^{1}, b^{1}, \ldots, W^{L}, b^{L}\}$. Number of parameters: $\sum_l (n_l \cdot n_{l-1} + n_l)$.
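A minimal sketch of the forward pass in NumPy. The layer sizes, tanh activation, and random initialization are illustrative assumptions; the parameter count printed at the end matches the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, weights, biases, activation=np.tanh):
    """Forward pass: a^l = sigma(W^l a^{l-1} + b^l); returns all activations, a^0 = x."""
    activations = [x]
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b
        activations.append(activation(z))
    return activations

# Illustrative architecture: 2 -> 4 -> 1, random weights, zero biases
sizes = [2, 4, 1]
weights = [rng.standard_normal((n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases  = [np.zeros(n_out) for n_out in sizes[1:]]

x = np.array([1.0, -1.0])
print(forward(x, weights, biases)[-1])                        # f(x; theta)
print(sum(W.size + b.size for W, b in zip(weights, biases)))  # sum_l (n_l*n_{l-1} + n_l) = 17
```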

Backpropagation: An efficient algorithm for computing $\partial L/\partial \theta$ via the chain rule. Write $z^l = W^l a^{l-1} + b^l$ for the pre-activation, so that $a^l = \sigma(z^l)$.

$\delta^L = \partial L/\partial z^L = \nabla_{a^L} L \odot \sigma'(z^L)$ (last layer error)
$\delta^l = (W^{l+1})^\mathrm{T} \delta^{l+1} \odot \sigma'(z^l)$ (backpropagation of error)
$\partial L/\partial W^l = \delta^l (a^{l-1})^\mathrm{T}$, $\partial L/\partial b^l = \delta^l$ (gradients)

Computational complexity: $O(\text{number of parameters})$ — the same order as the forward pass.
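A sketch of the three backprop equations in NumPy, assuming ReLU hidden layers, a linear output layer, and squared-error loss (these output-layer and loss choices are assumptions for this sketch, made to match the numerical example below):

```python
import numpy as np

def relu(z):   return np.maximum(0.0, z)
def d_relu(z): return (z > 0).astype(float)

def backprop(x, y, weights, biases):
    """Gradients dL/dW^l, dL/db^l for L = (a^L - y)^2, ReLU hidden layers, linear output."""
    # Forward pass, caching pre-activations z^l and activations a^l
    a, zs, activations = x, [], [x]
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ a + b
        zs.append(z)
        a = z if l == len(weights) - 1 else relu(z)   # last layer kept linear
        activations.append(a)
    # delta^L = dL/dz^L  (sigma' = 1 for the linear output layer)
    delta = 2 * (activations[-1] - y)
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    for l in reversed(range(len(weights))):
        grads_W[l] = np.outer(delta, activations[l])            # dL/dW^l = delta^l (a^{l-1})^T
        grads_b[l] = delta                                      # dL/db^l = delta^l
        if l > 0:
            delta = (weights[l].T @ delta) * d_relu(zs[l - 1])  # delta^{l-1}
    return grads_W, grads_b
```

With the weights from the numerical example below, this reproduces $\partial L/\partial W^2 = [-1.8, -0.9, 0, -1.8]$.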

Weight initialization

Poor initialization $\rightarrow$ vanishing or exploding gradients. If $W_{ij} \sim \mathcal{N}(0, \sigma^2)$, the variance of the activations in layer $l$ is roughly $n_{l-1}\sigma^2$ times the variance of its input, so over many layers it grows or decays exponentially unless $n_{l-1}\sigma^2 \approx 1$.

Glorot (Xavier) initialization: $\sigma^2 = 2/(n_l + n_{l-1})$ — keeps the variance of activations the same. For tanh/sigmoid.

He initialization: $\sigma^2 = 2/n_{l-1}$ — for ReLU (accounts for zero half).
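A small experiment illustrating why the scale of $\sigma^2$ matters: push a random vector through a deep stack of ReLU layers under naive vs. He initialization. The width, depth, and seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 20
x = rng.standard_normal(n)

for name, std in [("naive sigma=1", 1.0), ("He sigma=sqrt(2/n)", np.sqrt(2.0 / n))]:
    a = x
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * std
        a = np.maximum(0.0, W @ a)          # one ReLU layer
    print(f"{name}: activation std after {depth} layers = {a.std():.3e}")
# Naive init blows up by many orders of magnitude; He init keeps the scale roughly constant.
```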

Numerical example

MLP: [2 $\rightarrow$ 4 $\rightarrow$ 1], ReLU, MSE loss. Input $x = (1, -1)$. $W^1 = \begin{bmatrix}0.5 & -0.5 \\ 1 & 0.5 \\ -0.5 & 1 \\ 0 & -1\end{bmatrix}$, $b^1 = [0, 0, 0, 0]$. $W^2 = [1, -1, 0.5, -0.5]$, $b^2 = [0.1]$.

Forward: $z^1 = W^1x + b^1 = [0.5+0.5,\ 1-0.5,\ -0.5-1,\ 0+1] = [1, 0.5, -1.5, 1]$. $a^1 = \mathrm{ReLU}(z^1) = [1, 0.5, 0, 1]$. $z^2 = W^2a^1 + b^2 = (1 - 0.5 + 0 - 0.5) + 0.1 = 0 + 0.1 = 0.1$. Output $f(x) = 0.1$.

For $y_\mathrm{true} = 1$: $L = (0.1-1)^2 = 0.81$. Backprop: $\delta^2 = 2(0.1-1) = -1.8$. $\partial L/\partial W^2 = \delta^2 \cdot (a^1)^\mathrm{T} = [-1.8, -0.9, 0, -1.8]$.
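The same worked example, reproduced in NumPy as a quick check of the forward and backward numbers:

```python
import numpy as np

x  = np.array([1.0, -1.0])
W1 = np.array([[0.5, -0.5], [1.0, 0.5], [-0.5, 1.0], [0.0, -1.0]])
b1 = np.zeros(4)
W2 = np.array([[1.0, -1.0, 0.5, -0.5]])
b2 = np.array([0.1])

z1 = W1 @ x + b1                  # [1, 0.5, -1.5, 1]
a1 = np.maximum(0.0, z1)          # [1, 0.5, 0, 1]
z2 = W2 @ a1 + b2                 # [0.1]
y_true = 1.0
L = ((z2 - y_true) ** 2).item()   # 0.81

delta2 = 2 * (z2 - y_true)        # [-1.8]
dW2 = np.outer(delta2, a1)        # [[-1.8, -0.9, 0, -1.8]]
print(z1, a1, z2, L, dW2)
```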

Assignment: Implement an MLP from scratch in NumPy (without PyTorch): [2$\rightarrow$8$\rightarrow$4$\rightarrow$1], ReLU+sigmoid. Train on the XOR problem (4 points). Implement SGD with momentum. Plot the decision boundary. How many epochs to reach 100% accuracy? How does convergence speed change with momentum $\beta=0.9$ vs $\beta=0$?
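As a starting point for the assignment, here is a minimal sketch of the SGD-with-momentum update only (classical momentum; the function name is illustrative and the MLP itself is left as the exercise). With $\beta = 0$ it reduces to plain SGD, which is exactly the comparison the assignment asks for.

```python
import numpy as np

def sgd_momentum_step(params, grads, velocities, lr=0.1, beta=0.9):
    """Classical momentum: v <- beta*v + grad;  theta <- theta - lr*v.
    params, grads, velocities are parallel lists of NumPy arrays (updated in place)."""
    for p, g, v in zip(params, grads, velocities):
        v *= beta
        v += g
        p -= lr * v
```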
