Different Types of Neural Networks

Part 4

So far, we have only covered one type of neural network: the feedforward neural network (FNN), also known as a multilayer perceptron (MLP). But there exist hundreds of other "varieties" of networks that we haven't discussed, each of which works in a slightly different way. Over the years, researchers have developed many different “architectures”, each optimised for solving specific kinds of problems. Here are some of the major types:

▬▬▬▬▬▬

Feedforward Neural Networks (FNNs)

We probably don't need to spend that much time on this section. The last 3 tutorials were dedicated to this type of neural network, but here's a summary:

Feedforward neural networks are the simplest type of neural network. They consist of layers of nodes (neurons) with a strictly forward flow: from the input layer, through one or more hidden layers, to the output layer. Each node takes a weighted sum of its inputs, applies an activation function, and forwards the result to the next layer. Since there are no feedback connections, information flows in just one direction. FNNs are universal function approximators (a universal function approximator is a computational model, such as a feedforward neural network, that can learn to compute any continuous function to an arbitrary degree of accuracy, given enough complexity and resources) and are often used for tasks where the inputs map directly to the outputs, such as simple regression (a regression problem involves a machine learning algorithm learning the continuous relationship between independent variables, or features, and a dependent variable, or output, in order to predict a numerical value).

Neural Network Diagram (H.Y. using Canva)
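
To make the forward pass concrete, here is a minimal sketch in Python (NumPy) of a tiny FNN with one hidden layer. The layer sizes, random weights, and the sigmoid activation are arbitrary choices for illustration, not values from the earlier tutorials.

```python
import numpy as np

def sigmoid(z):
    # Squashes each value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example sizes: 3 inputs -> 4 hidden neurons -> 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer weights and biases
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # output layer weights and biases

def forward(x):
    # Each layer: weighted sum of inputs, then an activation function
    h = sigmoid(W1 @ x + b1)      # hidden layer
    y = sigmoid(W2 @ h + b2)      # output layer
    return y                      # information only ever flows forward

print(forward(np.array([0.5, -1.2, 3.0])))
```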

▬▬▬▬▬▬

Convolutional Neural Networks (CNNs)

Convolutional neural networks were designed to handle data with a grid-like structure, most commonly images. Instead of connecting every neuron in one layer to every neuron in the next, CNNs slide small convolutional filters (small moving windows) across the input. These filters detect local features such as edges, textures, or shapes. As data passes through successive convolutional layers, the network learns progressively higher-level features, from simple edges to complex objects. CNNs also add a few extra components. For example, pooling layers (in the context of CNNs, pooling is a step where the network reduces the size of its data by summarising small regions, usually by taking the max or the average, which keeps the key information while cutting out extra detail) reduce the spatial size of the data, keeping things efficient and pulling out only the most crucial details. CNNs work well because they exploit spatial locality (nearby pixels in an image are usually related, so patterns often appear in local regions) and weight sharing (the same filter, i.e. the same set of weights, is applied across different parts of the input, so the network can detect the same feature no matter where it shows up, making the model more efficient and less prone to overfitting), which makes them excellent at image recognition and similar applications.

Convolutional Neural Network Diagrams (H.Y. using Canva)
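
To give a rough feel for what a convolutional filter and a pooling layer actually compute, here is a small NumPy sketch that slides a 3×3 vertical-edge filter over a toy "image" and then max-pools the result. The image and filter values are made up for the example, and real CNN libraries implement this far more efficiently.

```python
import numpy as np

def convolve2d(image, kernel):
    # Slide the same small filter over every position (weight sharing)
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    # Summarise each size-by-size region by its maximum value
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    return feature_map[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

# Toy 6x6 "image" with a bright right half, and a vertical-edge filter
image = np.zeros((6, 6))
image[:, 3:] = 1.0
edge_filter = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]])

features = convolve2d(image, edge_filter)   # strong responses (here negative) along the edge
print(max_pool(features))
```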

▬▬▬▬▬▬

Recurrent Neural Networks (RNNs)

Recurrent neural networks specialise in processing sequential data, like text, speech, or time-series signals. Unlike FNNs, RNNs have feedback connections, so they can keep a kind of "memory" of what they have seen in the past. At every step, the network computes a new hidden state based on the current input and the previous hidden state. This allows the network to process sequences of any length and learn dependencies between earlier and later items. But RNNs struggle with long-term dependencies because the memory fades over many steps, leading to issues like vanishing gradients (BotPenguin, n.d.). The vanishing gradient problem is when the weights and biases update very slowly during gradient descent, which means the early layers hardly change, so the network has trouble learning deep patterns.

Recurrent Neural Network Diagram (H.Y. using Canva)
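
The recurrence itself is only a line or two of maths. Below is a hedged NumPy sketch of a single RNN step applied over a short sequence; the tanh activation and the weight shapes are common conventions rather than the only possibility.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 5                                 # arbitrary sizes for the example
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))    # input -> hidden weights
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))   # hidden -> hidden (the feedback loop)
b   = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # The new hidden state depends on the current input AND the previous hidden state
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Process a sequence of any length by reusing the same step
h = np.zeros(hidden_size)
sequence = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0])]
for x_t in sequence:
    h = rnn_step(x_t, h)
print(h)   # the "memory" after seeing the whole sequence
```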

▬▬▬▬▬▬

Long Short-Term Memory Networks (LSTMs)

LSTMs are a more advanced form of RNN that solves the problem of long-term memory. They have a more complex cell structure with gates that control the flow of information. The input gate decides how much of the current input to remember, the forget gate decides what to erase, and the output gate decides what to send forward. This architecture allows LSTMs to retain information for much longer than regular RNNs, which makes them extremely powerful at tasks like machine translation, speech recognition, and music composition.

In LSTMs, there are essentially two types of memory: the hidden state (\(h_t\)) and the cell state (\(C_t\)). The hidden state acts as "short-term" memory and mainly carries information about the most recent steps. The cell state, on the other hand, is more like "long-term" memory, and it helps solve the vanishing gradient problem (BotPenguin, n.d.) that normal RNNs have.

▬▬▬▬▬▬

At each step of the process, the LSTM has access to the current input, the previous hidden state (short-term memory), and the previous cell state (long-term memory).

With this, the LSTM first decides what to "forget". This is done by the forget gate, which looks at the current input as well as the previous hidden state and decides which parts of the old cell state to throw away.
For example, “Forget the subject of the sentence if a new one has started.”

Then, the input gate decides what new information to add to the cell state (the long-term memory). So, in our example, it might be, “Add the new subject into memory.”

Finally, the output gate decides what part of the updated cell state should influence the hidden state (the short-term output used right now).

LSTM Diagram (H.Y. using Canva)
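
The gate logic described above can be written out directly. This is a simplified NumPy sketch of one LSTM cell step using the standard sigmoid/tanh formulation; the weight shapes and the single stacked weight matrix are illustrative choices, not the exact layout used by any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    # W maps the concatenated [h_prev, x_t] to the four gate pre-activations
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i, o, g = np.split(z, 4)

    f = sigmoid(f)              # forget gate: what to erase from the cell state
    i = sigmoid(i)              # input gate: how much new information to add
    o = sigmoid(o)              # output gate: what to expose as the hidden state
    g = np.tanh(g)              # candidate values to write into memory

    C_t = f * C_prev + i * g    # long-term memory (cell state)
    h_t = o * np.tanh(C_t)      # short-term memory (hidden state)
    return h_t, C_t

# Arbitrary sizes for illustration
hidden, inp = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + inp))
b = np.zeros(4 * hidden)

h, C = np.zeros(hidden), np.zeros(hidden)
h, C = lstm_step(np.array([1.0, -0.5, 2.0]), h, C, W, b)
print(h, C)
```

Note how the cell state is updated additively (f * C_prev + i * g) rather than being rewritten from scratch, which is what lets information (and gradients) survive over many steps.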

▬▬▬▬▬▬

Transformers

Transformers are a newer architecture (Sora, 2025) that has largely replaced RNNs and LSTMs for sequence modelling tasks. (A neural network architecture is the specific arrangement of layers, neurons, and connections that defines how a network processes and transforms input data to produce an output; a sequence modelling task is one where the goal is to predict, generate, or analyse data that comes in a specific order, like words in a sentence or frames in a video, which requires the model to capture patterns and dependencies across the sequence.) Unlike RNNs, which process a sequence one step at a time, transformers process the entire sequence at once and use self-attention layers to figure out which parts of the input are most relevant to each other. A self-attention layer lets each element in a sequence look at and weigh all the other elements to decide which ones are most relevant; it computes attention scores that determine how much influence each element has on the others, allowing the network to capture relationships across the entire sequence. This lets transformers handle long-range dependencies efficiently, with residual connections (H.Y., 2025) and feedforward layers helping to blend and stabilise the information. (A residual connection is a shortcut that adds the input of a layer directly to its output, helping the network retain important information and making training deeper networks more stable by preventing the signal from vanishing.)

The key strength of transformers lies in self-attention, which enables the model to weigh the relationships between all elements in a sequence simultaneously. For instance, in a sentence, the model can work out that "bank" is related to "river" in one context and to "money" in another. Because the model sees everything at once, positional encodings are added to keep track of the order of the words. By combining attention with parallel processing (Sora, 2025), where all elements of a sequence are processed at the same time instead of one by one, allowing faster computation and better handling of long-range dependencies, transformers excel particularly at translation, summarisation, and text generation. Although this may seem vastly different from the networks above, and despite their more complex architecture, transformers are still neural networks at their core.
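
The core of self-attention (the scaled dot-product form) fits in a few lines. This NumPy sketch uses a single attention head with random projection matrices purely for illustration, and leaves out positional encodings, masking, and the surrounding residual and feedforward layers.

```python
import numpy as np

def softmax(z):
    # Normalise each row into attention weights that sum to 1
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Every position produces a query, a key, and a value
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    # Each position scores its relevance to every other position...
    scores = Q @ K.T / np.sqrt(d_k)
    # ...then takes a weighted mix of the values, all in parallel
    return softmax(scores) @ V

# A toy "sentence" of 5 tokens, each an 8-dimensional embedding
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(8, 8)) for _ in range(3))

print(self_attention(X, W_q, W_k, W_v).shape)   # (5, 8): one updated vector per token
```

Each row of the softmax output is exactly the set of attention weights described above: how much that token "looks at" every other token in the sequence.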

▬▬▬▬▬▬

Roughly, this is how a transformer works:

Transformer Diagram (H.Y. using Canva)

▬▬▬▬▬▬

Generative Adversarial Networks (GANs)

Generative adversarial networks (GANs) work quite differently from the other types of neural networks mentioned above. A GAN consists of two competing models “fighting” one another: a generator and a discriminator. The generator tries to produce data (like artificial images) that looks similar to real data, while the discriminator tries to figure out whether a given sample is real or fake. As training goes on, the generator gets better and better until it can produce highly realistic outputs. Because of this, GANs are usually used for generating things like images, videos, and music.

GAN Diagram (H.Y. using Canva)
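
The adversarial training loop can be sketched briefly. This hedged example uses PyTorch with tiny fully connected models and a one-dimensional toy data distribution, so the network sizes, learning rates, noise dimension, and number of steps are all illustrative choices rather than a recipe.

```python
import torch
import torch.nn as nn

# Toy "real" data: samples from a normal distribution the generator must imitate
def real_batch(n=64):
    return torch.randn(n, 1) * 0.5 + 2.0

generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

loss_fn = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

for step in range(2000):
    # Train the discriminator: label real samples 1 and fake samples 0
    real = real_batch()
    fake = generator(torch.randn(64, 8)).detach()   # detach: don't update the generator here
    d_loss = loss_fn(discriminator(real), torch.ones(64, 1)) + \
             loss_fn(discriminator(fake), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Train the generator: try to make the discriminator output 1 for fakes
    fake = generator(torch.randn(64, 8))
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

print(generator(torch.randn(5, 8)).detach().squeeze())
```

In this toy setup, the generated samples should gradually drift towards the real distribution (mean around 2.0) as the two models push against each other.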

▬▬▬▬▬▬