Multilayer feedforward networks — also called multilayer perceptrons (MLPs) — extend the single-layer perceptron by inserting one or more layers of hidden units between the input and output layers. Each unit computes a weighted sum of its inputs, adds a bias, and passes the result through a nonlinear activation function. This architecture overcomes the perceptron's inability to solve linearly inseparable problems such as XOR, and it became the workhorse of connectionist cognitive modeling after Rumelhart, Hinton, and Williams (1986) popularized an efficient learning algorithm for adjusting the weights.
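The XOR case can be made concrete with a hand-wired two-layer network; the particular weights and threshold units below are illustrative choices, not taken from any published model:

```python
# A minimal hand-wired network solving XOR, which no single-layer
# perceptron can. One hidden unit detects OR, the other detects AND;
# the output fires when OR holds but AND does not.
def step(z):
    """Threshold activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)        # hidden unit 1: OR detector
    h2 = step(x1 + x2 - 1.5)        # hidden unit 2: AND detector
    return step(h1 - 2 * h2 - 0.5)  # output: OR AND NOT AND

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_net(x1, x2))
```

The hidden layer re-represents the input so that the originally inseparable classes become linearly separable for the output unit.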
Architecture and Computation
For a network with one hidden layer, activation propagates in two stages:

Hidden layer: hⱼ = f(Σᵢ wⱼᵢ⁽¹⁾ xᵢ + bⱼ⁽¹⁾)

Output layer: oₖ = g(Σⱼ wₖⱼ⁽²⁾ hⱼ + bₖ⁽²⁾)
Common activation functions:
Sigmoid: f(z) = 1 / (1 + e⁻ᶻ)
Hyperbolic tangent: f(z) = tanh(z)
ReLU: f(z) = max(0, z)
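The layer equations above translate directly into code. The following sketch implements a one-hidden-layer forward pass with tanh hidden units and a sigmoid output; all weight values are illustrative placeholders:

```python
import math

def sigmoid(z):
    """Logistic activation: 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, b1, w2, b2, f=math.tanh, g=sigmoid):
    """One forward pass: h_j = f(sum_i w1[j][i]*x_i + b1[j]),
    o_k = g(sum_j w2[k][j]*h_j + b2[k])."""
    h = [f(sum(wji * xi for wji, xi in zip(row, x)) + bj)
         for row, bj in zip(w1, b1)]
    o = [g(sum(wkj * hj for wkj, hj in zip(row, h)) + bk)
         for row, bk in zip(w2, b2)]
    return h, o

# Two inputs, two hidden units, one output (arbitrary example weights):
w1 = [[0.5, -0.4], [0.3, 0.8]]
b1 = [0.1, -0.2]
w2 = [[1.2, -0.7]]
b2 = [0.05]
h, o = forward([1.0, 0.0], w1, b1, w2, b2)
```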
Information flows strictly forward from input to output — there are no feedback connections. Each hidden unit learns to detect a particular feature or combination of features in the input, and the output layer combines these hidden representations to produce the network's response. The choice of activation function at the output depends on the task: a sigmoid for binary classification, softmax for multi-class categorization, and a linear function for continuous regression.
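The three task-dependent output functions mentioned above have standard definitions, sketched here from scratch for illustration:

```python
import math

def sigmoid(z):
    """Binary classification: squashes z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Multi-class categorization: exponentiate and normalize so the
    outputs form a probability distribution over classes."""
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def identity(z):
    """Continuous regression: pass the weighted sum through unchanged."""
    return z

probs = softmax([2.0, 1.0, 0.1])
```

Softmax generalizes the sigmoid: with two classes, softmax over (z, 0) reduces to sigmoid(z).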
Universal Approximation Theorem
Cybenko (1989) and Hornik, Stinchcombe, and White (1989) proved that a feedforward network with a single hidden layer containing a sufficient number of units can approximate any continuous function on a compact subset of ℝⁿ to any desired degree of accuracy. This universal approximation theorem guarantees that the representational capacity of multilayer networks is, in principle, unlimited. However, the theorem says nothing about how many hidden units are needed or whether learning will find the right weights — these remain practical challenges.
In mathematical psychology, multilayer networks have been used to model categorization (as in the ALCOVE model), reading aloud (the triangle model of Plaut, McClelland, Seidenberg, & Patterson, 1996), and past-tense learning (Rumelhart & McClelland, 1986). These applications emphasize that hidden layers allow networks to discover internal representations — distributed patterns of activation that capture the statistical structure of the task domain — which serve as models of how the mind represents categories, words, and rules.
The depth of a network — the number of hidden layers — affects both its representational efficiency and the difficulty of training. While a single hidden layer suffices in theory, deeper networks can represent certain functions exponentially more efficiently. Modern deep learning exploits this, using dozens or hundreds of layers, but the cognitive plausibility of such deep architectures remains debated. In mathematical psychology, most connectionist models use one or two hidden layers, prioritizing interpretability over raw performance.