Teach XOR to a network in 200 lines of Rust
A neural network sounds like a black box until you build the smallest one that actually learns โ then it collapses into a handful of loops over f64 arrays. We'll grow a network from a single neuron up to a two-layer perceptron that learns XOR, the textbook problem a single neuron cannot solve. Along the way Rust's const generics turn dimension bugs into compile errors.
The core idea, in one sentence
A network maps inputs to outputs through layers of weighted sums, each squashed by a nonlinear function; training nudges those weights until the outputs match the targets. Everything below is plumbing around two operations โ a forward pass that computes an answer, and a backward pass that assigns blame for the error.
Activations: where nonlinearity lives
Stack two linear layers and you still have a linear function โ you've gained nothing. The nonlinear activation function applied after each layer is what lets a network bend space and learn curves. We model it as a trait, so layers stay generic over which one they use.
Start with the two methods every activation must provide:
pub trait Activation: Clone + Copy {
fn activate(&self, x: f64) -> f64;
fn derivative(&self, x: f64) -> f64;
}
activate is the forward direction โ the squash applied to a neuron's output. derivative is for backprop: backpropagation is calculus, and the chain rule needs the slope of every function it passes through, so the derivative is not optional decoration.
The supertraits earn their place too. Our activations are zero-sized marker structs carrying no data, so making them Copy means the same one can be handed to two layers for free, with no ownership dance.
Now a concrete one. First the struct and the forward formula:
#[derive(Clone, Copy)]
pub struct Sigmoid;
impl Activation for Sigmoid {
fn activate(&self, x: f64) -> f64 {
1.0 / (1.0 + (-x).exp())
}
// โฆ
}
Sigmoid holds no fields โ it's purely a type-level tag that selects this impl. The formula 1 / (1 + e^-x) is the classic S-curve that maps any real number into (0, 1), which is why its output reads naturally as a probability-like score.
Then the slope, the other method of the same impl:
fn derivative(&self, x: f64) -> f64 {
let s = self.activate(x);
s * (1.0 - s)
}
The sigmoid derivative is ฯ'(x) = ฯ(x)ยท(1 โ ฯ(x)) โ it can be written purely in terms of the function's own output. That tidy form, computed from a single forward call, is exactly why sigmoid was the historical default: backprop gets the slope almost for free.
A layer that knows its own shape
A fully connected layer is a weight matrix plus a bias vector. The clever part is encoding the dimensions in the type using const generics:
pub struct Layer<const I: usize, const O: usize, A>
where
A: Activation,
{
pub weights: [[f64; I]; O],
pub biases: [f64; O],
activation: A,
}
I is the input width and O the output width, baked into the type itself. The weights array is indexed weights[neuron][input] โ O neurons, each with I incoming weights โ and the arrays are fixed-size and stack-allocated, no heap, no Vec.
This is where the type system starts catching bugs for you. Layer<2, 3, Sigmoid> is a genuinely different type from Layer<3, 2, Sigmoid>, so feeding a 3-element input to a layer expecting 2 simply won't compile โ no runtime shape checks, no silent broadcasting surprises.
Now the constructor. Weights start random, and the reason is subtle:
pub fn new(activation: A) -> Self {
let mut rng = rand::rng();
// โฆ fill weights and biases, then:
Self { weights, biases, activation }
}
rng is the random source from the rand crate. Initialization can't be all-zeros: if every weight began at zero, every neuron would compute the identical thing and stay locked in lockstep forever. Random values break the symmetry that would otherwise freeze learning.
Fill the weight matrix:
let weights = array::from_fn(|_| {
array::from_fn(|_| {
rng.random_range(-1.0..1.0)
})
});
array::from_fn builds a fixed-size array by calling a closure per index, and here it's nested โ the outer call builds one row per neuron, the inner builds I weights in [-1, 1] per row. Because the lengths come from the const generics, the compiler knows the exact shape at this call site.
Then the biases, one per neuron:
let biases = array::from_fn(|_| {
rng.random_range(-1.0..1.0)
});
Each neuron gets one bias, also seeded randomly in [-1, 1]. The bias shifts a neuron's threshold independently of its inputs, giving the layer one extra degree of freedom per neuron.
The forward pass
For each neuron, multiply every input by its weight, sum, add the bias, then activate. Begin with the signature and a sanity check:
pub fn forward(
&self,
inputs: &[f64],
) -> ([f64; O], [f64; O]) {
assert_eq!(inputs.len(), I);
// โฆ compute sums, then activations
}
The return type is a pair of O-length arrays, not one. We return both the activated outputs and the raw pre-activation sums, because backprop will need those raw values later. The assert_eq! guards the one place a slice (&[f64]) sneaks past the const-generic checks.
Compute the weighted sums:
let weighted_sums = array::from_fn(|i| {
self.weights[i].iter().zip(inputs)
.map(|(w, x)| w * x)
.sum::<f64>() + self.biases[i]
});
For neuron i, zip pairs each weight with its matching input, map multiplies them, and sum collapses the products into one number โ the dot product. Adding biases[i] gives the raw pre-activation value, conventionally called z.
Then squash each sum and return both arrays:
let activations = array::from_fn(|i| {
self.activation.activate(weighted_sums[i])
});
(activations, weighted_sums)
This second pass runs every z through the activation to get the neuron's actual output. This single method is a perceptron โ train it alone and it learns AND or OR perfectly.
Why one layer isn't enough
XOR's truth table โ [0,1]โ1, [1,0]โ1, but [1,1]โ0 โ cannot be separated by a single straight line. No setting of weights on one neuron will ever solve it. The fix is a hidden layer that re-represents the inputs into a space where the output layer can draw its line.
Encode that two-layer pipeline as one struct:
pub struct Mlp<
const I: usize,
const N: usize,
const O: usize,
A: Activation,
> {
hidden_layer: Layer<I, N, A>,
output_layer: Layer<N, O, A>,
learning_rate: f64,
}
Three const parameters describe the whole shape: I inputs, N hidden neurons, O outputs. The A: Activation bound rides inline in the generic list instead of a separate where clause โ fewer lines, and equally valid. Watch how the const generics chain: the hidden layer is I โ N, the output layer is N โ O, and that shared N is the wiring between them. A mismatched pipeline simply won't type-check. Inference is then just two forward calls in series, and learning_rate is the knob we'll use during training.
Backpropagation: assigning blame
Training is a loop: forward pass, measure error, push the error backward into gradients, update weights. We measure error with mean squared error, then compute a delta for each neuron โ how much it should change. The output deltas are direct; the hidden deltas come from propagating output deltas backward through the connecting weights.
Take the output neuron first:
fn output_delta_at(
&self,
i: usize,
expected: &[f64],
f: &ForwardPassResult<N, O>,
) -> f64 {
// โฆ error times slope
}
f carries the saved arrays from the forward pass โ both the activations and the raw z sums for both layers. We pass an index i and compute the delta for one output neuron at a time, which keeps each function small and easy to test.
Its body is just error times slope:
let z = f.output_weighted_sums[i];
let err = expected[i] - f.output_activations[i];
err * self.output_layer.derivative(z)
err is how far this neuron's output sits from the target, and derivative(z) is the activation's slope at that point. Multiplying them gives the delta โ large when the neuron is both wrong and responsive, near zero when it's already saturated.
The hidden neuron is the interesting case. Same shape of signature:
fn hidden_delta_at(
&self,
i: usize,
out_deltas: &[f64; O],
f: &ForwardPassResult<N, O>,
) -> f64 {
// โฆ sum the blame, then scale by slope
}
A hidden neuron has no target of its own โ it never directly touches the answer. Instead it receives out_deltas, the deltas already computed for the output layer, and must reconstruct its share of the blame from them.
Sum the blame flowing back through its connections:
let err: f64 = out_deltas.iter().enumerate()
.map(|(j, &d)| {
d * self.output_layer.weights[j][i]
})
.sum();
Each output neuron j "blames" hidden neuron i in proportion to the weight weights[j][i] between them โ a strong connection carries more of the error back. Summing over every j gives the total error charged to this hidden neuron. This weighted sum is the chain rule made concrete.
Then scale by its own slope, exactly like the output case:
let z = f.hidden_weighted_sums[i];
err * self.hidden_layer.derivative(z)
The same error ร slope pattern closes the loop: the propagated error is multiplied by the hidden activation's derivative at its own z. One ergonomic note โ #[derive(Deref)] from derive_more lets us call layer.derivative(z) directly instead of layer.activation.derivative(z).
The update step that makes it learn
With deltas in hand, every weight moves a small step proportional to its delta and its input. Update an output weight and its bias:
let lr = self.learning_rate;
self.output_layer.weights[i][j] +=
lr * out_delta * hidden_activation[j];
self.output_layer.biases[i] += lr * out_delta;
A weight's nudge scales with three things: the learning rate lr, the neuron's delta, and the input that fed it (hidden_activation[j]) โ an input that contributed nothing gets adjusted by nothing. The bias has no input, so it just moves by lr * out_delta.
Note the += rather than -=. Our delta already bakes in (expected โ predicted), the direction of improvement, so we add the step instead of subtracting a gradient. The learning_rate controls step size โ too large overshoots the target, too small crawls toward it.
Now wire it together and train:
let mut mlp =
Mlp::<2, 4, 1, _>::new(0.5, Sigmoid);
mlp.train(&data, 10_000);
This builds a 2 โ 4 โ 1 network: two inputs, four hidden neurons, one output, with a learning rate of 0.5. We run 10,000 passes over the four XOR examples โ enough repetitions for the deltas to accumulate into weights that actually separate the classes.
Check that it learned:
assert!(mlp.predict(&[1.0, 0.0])[0] > 0.8);
After training, feeding [1, 0] should yield an output well above 0.8 โ confidently 1, exactly as XOR demands. A single neuron could never make this assertion pass; the hidden layer is what got us here.
Where to go next / in production
This network is honest but tiny. Real workloads want:
- Better init โ fixed
[-1,1]randomness stalls on deep nets; Xavier (sigmoid/tanh) or He (ReLU) initialization scale to layer width. - More activations โ
ReLU(x.max(0.0)) avoids sigmoid's vanishing gradients in hidden layers;Tanhis zero-centered for faster convergence. Both drop in via the same trait, zero changes elsewhere. - Batching and matrices โ per-example loops are clear but slow; production code multiplies whole matrices via
ndarrayor BLAS. - Arbitrary depth โ two fixed layers become a
Vecof boxed layers once you outgrow const generics.
Why build it
Implement this once and "neuron", "gradient", and "backprop" stop being incantations โ they become arithmetic you can step through in a debugger. The whole network is f64 arrays and the chain rule, and Rust's type system catches your dimension mistakes before the program ever runs. That's the entire foundation of deep learning, held in one afternoon and one file.