Week 2: Training Linear Classifiers & The XOR Problem

Accompanying Notebooks:

Click File -> Save a copy in Drive

Summary:

In this module, you will:

Implement the perceptron learning algorithm
Visualize how decision boundaries evolve during training
Understand the mathematical foundation of weight updates
Discover why some problems require deeper networks
Understand how multi-layer perceptrons makes it possible to solve non-linearly separable problems

Introduction

In Week 1, you manually adjusted weights using sliders to find decision boundaries. But manual tuning would be impossible if you had thousands or millions of weights. This week, we’ll learn how computers can automatically find optimal weights through the perceptron learning rule.

Section 3.1: The Perceptron Model

Mathematical Foundation

The perceptron makes predictions using the same linear model from Week 1, but now we’ll see how to train it automatically.

Prediction Formula:

\[\hat{y} = \text{sign}(w \cdot x + b)\]

Where:

$x = [x_1, x_2]$ is the input vector (our 2D coordinates)
$w = [w_1, w_2]$ are the weights (learnable parameters)
$b$ is the bias (also learnable)
$\hat{y}$ is the predicted class: $+1$ or $-1$

The sign function returns $+1$ if its input is non-negative, and $-1$ otherwise. This creates a hard threshold at zero.

Alternative Formulation:

You might also see this written as:

\[\hat{y} = \begin{cases} +1 & \text{if } w_1x_1 + w_2x_2 + b \geq 0 \\ -1 & \text{if } w_1x_1 + w_2x_2 + b < 0 \end{cases}\]

Both formulations are equivalent - the perceptron checks which side of the decision boundary a point lies on.

Geometric Interpretation

Remember from Week 1 that $w_1x_1 + w_2x_2 + b = 0$ defines a line. The perceptron uses this line as its decision boundary:

Points above/on the line: classified as $+1$
Points below the line: classified as $-1$

In our examples, we label the $+1$ region as $blue$ and the $-1$ region as $red$.

The weight vector $w = [w_1, w_2]$ is perpendicular to the decision boundary. This geometric fact will be crucial for understanding how the learning algorithm works.

Section 3.2: The Perceptron Learning Rule

How Does Learning Work?

The steps of how a perceptron learns are as follows:

Initialize weights and bias to some random small values
Train a Epoch: one pass through all of the data:
- For each data point (iteration):
  - Make a prediction
  - If the prediction is wrong, update the weights
  - If the prediction is correct, do nothing
Repeat Step 2 for multiple epochs (until there are no errors or a limit is reached).

Vocab:

Iteration: Updating weights for a single example
Epoch: Updating weights after seeing every example in the dataset once

The Update Rule

When the perceptron makes a mistake on input $x_i$ with true label $y_i$:

\[w_{\text{new}} = w_{\text{old}} + \alpha \cdot y_i \cdot x_i\] \[b_{\text{new}} = b_{\text{old}} + \alpha \cdot y_i\]

Where:

$\alpha$ is the learning rate (typically 0.1 or 0.01), which adjusts the magnitude of the update step.
$y_i \in {-1, +1}$ is the true label
$x_i$ is the input that was misclassified

Note: The Learning Rate Trade-off

The learning rate ($\alpha$) controls how drastically we change the weights:

If too large: The model becomes unstable. The decision boundary swings wildly, potentially “overshooting” the solution and un-learning previous correct classifications.
If too small: The model learns too slowly. It requires thousands of epochs to move the boundary to the correct place.

Example:

Suppose we have a blue point (label $y_i = +1$) at position $x_i = [1, 1]$, but our perceptron incorrectly predicts $-1$ (red).

This means $w \cdot x_i + b < 0$ when it should be $\geq 0$. We need to increase the score $w \cdot x_i + b$.

By the update rule: $w_{\text{new}} = w_{\text{old}} + \alpha \cdot (+1) \cdot [1, 1] = w_{\text{old}} + \alpha [1, 1]$

This adds a positive contribution in the direction of $x_i$, which increases $w \cdot x_i$.

Conversely: If we mistakenly predict $+1$ for a red point (true label $-1$), we multiply by $-1$, which subtracts from $w$, decreasing the score.

Why are we changing the labels from {0, 1} to {-1, +1} in our code?

The perceptron algorithm requires labels to be {-1, +1} instead of the more intuitive {0, 1}.

The Mathematical Reason:

Recall the update rule when we make a mistake:

\[w_{\text{new}} = w_{\text{old}} + \alpha \cdot y_i \cdot x_i\]

The label $y_i$ acts as a direction multiplier:

When $y_i = +1$ (blue point): we add $+\alpha x_i$ to the weights
When $y_i = -1$ (red point): we add $-\alpha x_i$ to the weights

What would happen with {0, 1} labels?

If we used $y_i \in {0, 1}$:

When $y_i = 1$: update works → $w \leftarrow w + \alpha \cdot 1 \cdot x_i$
When $y_i = 0$: no update → $w \leftarrow w + \alpha \cdot 0 \cdot x_i = w$

Multiplying by zero produces no weight change for Class 0 misclassifications. The algorithm would only correct mistakes on Class 1 points, making it unable to learn the correct boundary.

Why {-1, +1} works:

With signed labels, both classes contribute to learning:

Class 1 mistakes ($y_i = +1$): push the boundary toward the misclassified point
Class 0 mistakes ($y_i = -1$): push the boundary away from the misclassified point

The negative sign for Class 0 encodes the “opposite direction” we need to move the decision boundary.

In Practice:

Since most datasets use {0, 1} labels naturally (like our OR and XOR gates), we convert them before training:

# Convert 0 labels to -1, leave 1 labels as is
y_signed = np.where(y == 0, -1, 1)

Geometric Interpretation of Updates

When we update $w \leftarrow w + \alpha y_i x_i$, we are actually rotating the decision boundary. Hopefully you’ve noticed this in week 1’s module.

If we misclassify a blue point ($y_i = +1$), we rotate the boundary toward that point
If we misclassify a red point ($y_i = -1$), we rotate the boundary away from that point

The learning rate $\alpha$ controls how much we rotate per mistake. Smaller $\alpha$ means slower, more careful learning.

Section 3.3: Training Dynamics on the OR Gate

Why OR is Linearly Separable

Recall the OR gate truth table:

$x_1$	$x_2$	OR Output
0	0	0
0	1	1
1	0	1
1	1	1

Observing Training

When you run the perceptron on OR in the notebook, you’ll see:

Early epochs: The boundary is random, making many mistakes
Middle epochs: The boundary rotates toward the correct orientation
Later epochs: The boundary stabilizes, correctly separating all points

The color gradient visualization (purple → yellow) shows this progression beautifully. Notice how the boundary “settles” into a valid position.

Typical Convergence: OR gate usually converges in 5-10 epochs with learning rate 0.1, though this varies with initialization.

Sample gif of training:

Section 4.1: Why XOR is Not Linearly Separable

The XOR Problem

XOR is true when inputs are different.

Here’s the truth table again as a refresher:

$x_1$	$x_2$	XOR Output
0	0	0
0	1	1
1	0	1
1	1	0

In our 2D space:

Red points: (0,0) and (1,1)
Blue points: (0,1) and (1,0)

The Impossibility Proof

Although you have already explored XOR manually last week and visually confirmed that a single line cannot create a boundary for these points, we will still do a simple proof.

Claim: No single line can separate XOR’s classes.

▶ Click to see the proof

Proof by Contradiction:

Suppose such a line exists: $w_1x_1 + w_2x_2 + b = 0$

For correct classification, we need:

$(0,0)$ is red: $w_1(0) + w_2(0) + b < 0 \Rightarrow b < 0$
$(0,1)$ is blue: $w_1(0) + w_2(1) + b > 0 \Rightarrow w_2 + b > 0$
$(1,0)$ is blue: $w_1(1) + w_2(0) + b > 0 \Rightarrow w_1 + b > 0$
$(1,1)$ is red: $w_1(1) + w_2(1) + b < 0 \Rightarrow w_1 + w_2 + b < 0$

From conditions 2 and 3: $w_2 > -b$ and $w_1 > -b$

Adding these: $w_1 + w_2 > -2b$

But from condition 1: $b < 0$, so $-2b > 0$

Therefore: $w_1 + w_2 > 0$

But condition 4 requires: $w_1 + w_2 < -b$

Since $b < 0$, we have $-b > 0$, but we also need $w_1 + w_2 < -b$.

This creates a contradiction when we add condition 1’s constraint. No solution exists

What the Perceptron Does on XOR

When you train a perceptron on XOR, it never converges. The decision boundary keeps rotating, trying to satisfy contradictory requirements, resulting in a indefinite oscillation behavior. (It might correctly classify 3 points but always fails on the 4th)

In the provided notebook, we manually set the epoch limit to 25. Use the slider to see the oscilation.

Thus, this fundamental limitation motivated the development of multi-layer networks.

Section 5.1: Two-Layer Solution for XOR

Why Depth Solves Non-Linearity

A single-layer perceptron is limited to linear decision boundaries. But by adding a hidden layer, we can:

Create multiple decision boundaries in the first layer
Combine them non-linearly in the second layer

This is the key insight behind neural networks

Manual Weight Construction

Let’s build a two-layer perceptron that solves XOR. Our architecture:

Input layer: 2 neurons ($x_1, x_2$)
Hidden layer: 2 neurons ($h_1, h_2$)
Output layer: 1 neuron (final prediction)

Hidden Layer Design

Neuron 1: Detects “at least one input is 1” (OR-like)

Weights: $w^{(1)} = [1, 1]$
Bias: $b^{(1)} = -0.5$
Activation: $h_1 = \text{step}(x_1 + x_2 - 0.5)$

This fires ($h_1 = 1$) when $x_1 + x_2 \geq 0.5$, which happens for (0,1), (1,0), and (1,1).

Neuron 2: Detects “NOT both inputs are 1” (NAND-like)

Weights: $w^{(2)} = [1, 1]$
Bias: $b^{(2)} = -1.5$
Activation: $h_2 = \text{step}(x_1 + x_2 - 1.5)$

This only fires ($h_2 = 1$) when $x_1 + x_2 \geq 1.5$, which happens only for (1,1).

Output Layer Design

Output Neuron: Combines $h_1$ and $h_2$ to create XOR

Weights: $w^{(\text{out})} = [1, -1]$
Bias: $b^{(\text{out})} = -0.5$
Output: $\text{step}(h_1 - h_2 - 0.5)$

This computes: “first neuron fires AND second neuron doesn’t fire”

Forward Pass Verification

Let’s verify all four XOR inputs by following the data through each layer:

Input (0,0): (Expected output: 0)

First, we compute the hidden layer activations:

Hidden Neuron 1 (OR-like detector):
- Calculation: $h_1 = \text{step}(w_1^{(1)} \cdot x_1 + w_2^{(1)} \cdot x_2 + b^{(1)}) = \text{step}(1 \cdot 0 + 1 \cdot 0 - 0.5) = \text{step}(-0.5) = 0$
- Interpretation: Neither input is 1, so this neuron doesn’t fire
Hidden Neuron 2 (NAND-like detector):
- Calculation: $h_2 = \text{step}(w_1^{(2)} \cdot x_1 + w_2^{(2)} \cdot x_2 + b^{(2)}) = \text{step}(1 \cdot 0 + 1 \cdot 0 - 1.5) = \text{step}(-1.5) = 0$
- Interpretation: Inputs don’t sum to 2, so this neuron doesn’t fire

Now we compute the output using the hidden layer values:

Output Neuron:
- Calculation: $\text{Output} = \text{step}(w_1^{(\text{out})} \cdot h_1 + w_2^{(\text{out})} \cdot h_2 + b^{(\text{out})}) = \text{step}(1 \cdot 0 + (-1) \cdot 0 - 0.5) = \text{step}(-0.5) = 0$
- Result: 0 ✓ Correct! (matches expected XOR output)

Input (0,1):

$h_1 = \text{step}(0 + 1 - 0.5) = \text{step}(0.5) = 1$
$h_2 = \text{step}(0 + 1 - 1.5) = \text{step}(-0.5) = 0$
Output = $\text{step}(1 - 0 - 0.5) = \text{step}(0.5) = 1$ Correct

Input (1,0):

$h_1 = \text{step}(1 + 0 - 0.5) = \text{step}(0.5) = 1$
$h_2 = \text{step}(1 + 0 - 1.5) = \text{step}(-0.5) = 0$
Output = $\text{step}(1 - 0 - 0.5) = \text{step}(0.5) = 1$ Correct

Input (1,1):

$h_1 = \text{step}(1 + 1 - 0.5) = \text{step}(1.5) = 1$
$h_2 = \text{step}(1 + 1 - 1.5) = \text{step}(0.5) = 1$
Output = $\text{step}(1 - 1 - 0.5) = \text{step}(-0.5) = 0$ Correct

All four cases are able to be correctly classified when a hidden layer is introduced, which solves our linerarity problem.

Building XOR from Simpler Gates

You might have noticed that the two logic gates we have used are OR and NAND (NOT AND).

XOR = OR AND NAND

The whole premise of multi-layer perceptrons and neural networks is combining simple linearly separable logic gates to build more complex ones.

Here’s the truth table showing how OR and NAND combine to create XOR:

$x_1$	$x_2$	OR ($h_1$)	NAND ($h_2$)	OR AND NAND	XOR (Expected)
0	0	0	1	(0 AND 1) → 0	0
0	1	1	1	(1 AND 1) → 1	1
1	0	1	1	(1 AND 1) → 1	1
1	1	1	0	(1 AND 0) → 0	0

Notice how the “OR AND NAND” column perfectly matches the XOR output.

Geometric Interpretation

The hidden layer performs a feature transformation:

Original space: $(x_1, x_2)$ - XOR is not linearly separable
Hidden space: $(h_1, h_2)$ - XOR becomes linearly separable.

In the hidden feature space:

(0,0) maps to [0, 0] - red
(0,1) maps to [1, 0] - blue
(1,0) maps to [1, 0] - blue
(1,1) maps to [1, 1] - red

Now the classes can be separated by the line $h_1 - h_2 = 0.5$.

This is the power of hidden layers: they create new representations where previously impossible problems become solvable.

XOR Hidden Space

Why We Can’t Train This Yet

The simple perceptron learning rule outlined at the top of this module only works for single-layer networks.

For multi-layer networks, we face a problem: when the output is wrong, how do we know how much each hidden neuron is “to blame”? How do we update the hidden layer weights?

This is called the credit assignment problem, and it remained unsolved until the invention of backpropagation in the 1980s.

Next module, we’ll learn backpropagation and gradient descent, which allow us to train multi-layer networks automatically.

Resources

Check out these resources if you’re interested in learning more:

StatQuest: The Perceptron
Interactive TensorFlow Playground
- Try solving XOR with different architectures
- See how hidden layers create feature transformations

Week 2: Perceptron Training & The XOR Problem

Implementing training method for perceptrons and introduction to hidden layer.

Week 2: Training Linear Classifiers & The XOR Problem

Accompanying Notebooks:

Summary:

Introduction

Section 3.1: The Perceptron Model

Mathematical Foundation

Geometric Interpretation

Section 3.2: The Perceptron Learning Rule

How Does Learning Work?

The Update Rule

Note: The Learning Rate Trade-off

Example:

Why are we changing the labels from {0, 1} to {-1, +1} in our code?

Geometric Interpretation of Updates

Section 3.3: Training Dynamics on the OR Gate

Why OR is Linearly Separable

Observing Training

Sample gif of training:

Section 4.1: Why XOR is Not Linearly Separable

The XOR Problem

The Impossibility Proof

What the Perceptron Does on XOR

Section 5.1: Two-Layer Solution for XOR

Why Depth Solves Non-Linearity

Manual Weight Construction

Hidden Layer Design

Output Layer Design

Forward Pass Verification

Building XOR from Simpler Gates

Geometric Interpretation

Why We Can’t Train This Yet

Resources