Skip to the content.

Week 2: Training Linear Classifiers & The XOR Problem

Accompanying Notebooks:

Open the Complementary Notebook In Colab

Click File -> Save a copy in Drive

Summary:

In this module, you will:

Introduction

In Week 1, you manually adjusted weights using sliders to find decision boundaries. But manual tuning would be impossible if you had thousands or millions of weights. This week, we’ll learn how computers can automatically find optimal weights through the perceptron learning rule.


Section 3.1: The Perceptron Model

Mathematical Foundation

The perceptron makes predictions using the same linear model from Week 1, but now we’ll see how to train it automatically.

Prediction Formula:

\[\hat{y} = \text{sign}(w \cdot x + b)\]

Where:

The sign function returns $+1$ if its input is non-negative, and $-1$ otherwise. This creates a hard threshold at zero.

Alternative Formulation:

You might also see this written as:

\[\hat{y} = \begin{cases} +1 & \text{if } w_1x_1 + w_2x_2 + b \geq 0 \\ -1 & \text{if } w_1x_1 + w_2x_2 + b < 0 \end{cases}\]

Both formulations are equivalent - the perceptron checks which side of the decision boundary a point lies on.

Geometric Interpretation

Remember from Week 1 that $w_1x_1 + w_2x_2 + b = 0$ defines a line. The perceptron uses this line as its decision boundary:

In our examples, we label the $+1$ region as $blue$ and the $-1$ region as $red$.

The weight vector $w = [w_1, w_2]$ is perpendicular to the decision boundary. This geometric fact will be crucial for understanding how the learning algorithm works.


Section 3.2: The Perceptron Learning Rule

How Does Learning Work?

The steps of how a perceptron learns are as follows:

  1. Initialize weights and bias to some random small values
  2. Train a Epoch: one pass through all of the data:
    • For each data point (iteration):
      • Make a prediction
      • If the prediction is wrong, update the weights
      • If the prediction is correct, do nothing
  3. Repeat Step 2 for multiple epochs (until there are no errors or a limit is reached).

Vocab:

The Update Rule

When the perceptron makes a mistake on input $x_i$ with true label $y_i$:

\[w_{\text{new}} = w_{\text{old}} + \alpha \cdot y_i \cdot x_i\] \[b_{\text{new}} = b_{\text{old}} + \alpha \cdot y_i\]

Where:

Note: The Learning Rate Trade-off

The learning rate ($\alpha$) controls how drastically we change the weights:

Example:

Suppose we have a blue point (label $y_i = +1$) at position $x_i = [1, 1]$, but our perceptron incorrectly predicts $-1$ (red).

This means $w \cdot x_i + b < 0$ when it should be $\geq 0$. We need to increase the score $w \cdot x_i + b$.

By the update rule: \(w_{\text{new}} = w_{\text{old}} + \alpha \cdot (+1) \cdot [1, 1] = w_{\text{old}} + \alpha [1, 1]\)

This adds a positive contribution in the direction of $x_i$, which increases $w \cdot x_i$.

Conversely: If we mistakenly predict $+1$ for a red point (true label $-1$), we multiply by $-1$, which subtracts from $w$, decreasing the score.

Why are we changing the labels from {0, 1} to {-1, +1} in our code?

The perceptron algorithm requires labels to be {-1, +1} instead of the more intuitive {0, 1}.

The Mathematical Reason:

Recall the update rule when we make a mistake:

\[w_{\text{new}} = w_{\text{old}} + \alpha \cdot y_i \cdot x_i\]

The label $y_i$ acts as a direction multiplier:

What would happen with {0, 1} labels?

If we used $y_i \in {0, 1}$:

Multiplying by zero produces no weight change for Class 0 misclassifications. The algorithm would only correct mistakes on Class 1 points, making it unable to learn the correct boundary.

Why {-1, +1} works:

With signed labels, both classes contribute to learning:

The negative sign for Class 0 encodes the “opposite direction” we need to move the decision boundary.

In Practice:

Since most datasets use {0, 1} labels naturally (like our OR and XOR gates), we convert them before training:

# Convert 0 labels to -1, leave 1 labels as is
y_signed = np.where(y == 0, -1, 1)

Geometric Interpretation of Updates

When we update $w \leftarrow w + \alpha y_i x_i$, we are actually rotating the decision boundary. Hopefully you’ve noticed this in week 1’s module.

The learning rate $\alpha$ controls how much we rotate per mistake. Smaller $\alpha$ means slower, more careful learning.


Section 3.3: Training Dynamics on the OR Gate

Why OR is Linearly Separable

Recall the OR gate truth table:

$x_1$ $x_2$ OR Output
0 0 0
0 1 1
1 0 1
1 1 1

Observing Training

When you run the perceptron on OR in the notebook, you’ll see:

  1. Early epochs: The boundary is random, making many mistakes
  2. Middle epochs: The boundary rotates toward the correct orientation
  3. Later epochs: The boundary stabilizes, correctly separating all points

The color gradient visualization (purple → yellow) shows this progression beautifully. Notice how the boundary “settles” into a valid position.

Typical Convergence: OR gate usually converges in 5-10 epochs with learning rate 0.1, though this varies with initialization.

Sample gif of training:

OR Gate Training

Section 4.1: Why XOR is Not Linearly Separable

The XOR Problem

XOR is true when inputs are different.

Here’s the truth table again as a refresher:

$x_1$ $x_2$ XOR Output
0 0 0
0 1 1
1 0 1
1 1 0

In our 2D space:

The Impossibility Proof

Although you have already explored XOR manually last week and visually confirmed that a single line cannot create a boundary for these points, we will still do a simple proof.

Claim: No single line can separate XOR’s classes.

▶ Click to see the proof

Proof by Contradiction:

Suppose such a line exists: $w_1x_1 + w_2x_2 + b = 0$

For correct classification, we need:

From conditions 2 and 3: $w_2 > -b$ and $w_1 > -b$

Adding these: $w_1 + w_2 > -2b$

But from condition 1: $b < 0$, so $-2b > 0$

Therefore: $w_1 + w_2 > 0$

But condition 4 requires: $w_1 + w_2 < -b$

Since $b < 0$, we have $-b > 0$, but we also need $w_1 + w_2 < -b$.

This creates a contradiction when we add condition 1’s constraint. No solution exists

What the Perceptron Does on XOR

When you train a perceptron on XOR, it never converges. The decision boundary keeps rotating, trying to satisfy contradictory requirements, resulting in a indefinite oscillation behavior. (It might correctly classify 3 points but always fails on the 4th)

In the provided notebook, we manually set the epoch limit to 25. Use the slider to see the oscilation.

Thus, this fundamental limitation motivated the development of multi-layer networks.


Section 5.1: Two-Layer Solution for XOR

Why Depth Solves Non-Linearity

A single-layer perceptron is limited to linear decision boundaries. But by adding a hidden layer, we can:

  1. Create multiple decision boundaries in the first layer
  2. Combine them non-linearly in the second layer

This is the key insight behind neural networks

Manual Weight Construction

Let’s build a two-layer perceptron that solves XOR. Our architecture:

Hidden Layer Design

Neuron 1: Detects “at least one input is 1” (OR-like)

This fires ($h_1 = 1$) when $x_1 + x_2 \geq 0.5$, which happens for (0,1), (1,0), and (1,1).

Neuron 2: Detects “NOT both inputs are 1” (NAND-like)

This only fires ($h_2 = 1$) when $x_1 + x_2 \geq 1.5$, which happens only for (1,1).

Output Layer Design

Output Neuron: Combines $h_1$ and $h_2$ to create XOR

This computes: “first neuron fires AND second neuron doesn’t fire”

Forward Pass Verification

Let’s verify all four XOR inputs by following the data through each layer:

Input (0,0): (Expected output: 0)

First, we compute the hidden layer activations:

Now we compute the output using the hidden layer values:

Input (0,1):

Input (1,0):

Input (1,1):

All four cases are able to be correctly classified when a hidden layer is introduced, which solves our linerarity problem.

Building XOR from Simpler Gates

You might have noticed that the two logic gates we have used are OR and NAND (NOT AND).

XOR = OR AND NAND

The whole premise of multi-layer perceptrons and neural networks is combining simple linearly separable logic gates to build more complex ones.

Here’s the truth table showing how OR and NAND combine to create XOR:

$x_1$ $x_2$ OR ($h_1$) NAND ($h_2$) OR AND NAND XOR (Expected)
0 0 0 1 (0 AND 1) → 0 0
0 1 1 1 (1 AND 1) → 1 1
1 0 1 1 (1 AND 1) → 1 1
1 1 1 0 (1 AND 0) → 0 0

Notice how the “OR AND NAND” column perfectly matches the XOR output.

Geometric Interpretation

The hidden layer performs a feature transformation:

In the hidden feature space:

Now the classes can be separated by the line $h_1 - h_2 = 0.5$.

This is the power of hidden layers: they create new representations where previously impossible problems become solvable.

XOR Hidden Space

Why We Can’t Train This Yet

The simple perceptron learning rule outlined at the top of this module only works for single-layer networks.

For multi-layer networks, we face a problem: when the output is wrong, how do we know how much each hidden neuron is “to blame”? How do we update the hidden layer weights?

This is called the credit assignment problem, and it remained unsolved until the invention of backpropagation in the 1980s.

Next module, we’ll learn backpropagation and gradient descent, which allow us to train multi-layer networks automatically.


Resources

Check out these resources if you’re interested in learning more:

  1. StatQuest: The Perceptron

  2. Interactive TensorFlow Playground

    • Try solving XOR with different architectures
    • See how hidden layers create feature transformations