How CNNs Learn Spatial Hierarchies of Features
Basic Architecture of CNNs
- Input Layer: Receives the raw pixel data of an image.
- Convolutional Layers: Apply filters (kernels) to detect features like edges and textures.
- Activation Functions: Introduce non-linearity, often using ReLU (Rectified Linear Unit).
- Pooling Layers: Reduce the spatial dimensions, preserving essential features.
- Fully Connected Layers: Integrate features for classification.
- Output Layer: Produces the final prediction, often using a softmax function for classification.

This hierarchical structure allows CNNs to learn from simple to complex features, mimicking how humans recognize patterns in images.
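The flow through these stages can be sketched as a toy forward pass. This is a minimal NumPy illustration of how shapes move through the pipeline; the layer sizes (an 8x8 image, one 3x3 filter, two output classes) are illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy forward pass through the stages above; all sizes are illustrative.
x = rng.standard_normal((8, 8))                  # input layer: 8x8 grayscale image

k = rng.standard_normal((3, 3))                  # one 3x3 convolutional filter
conv = np.array([[np.sum(x[i:i+3, j:j+3] * k)    # convolution -> 6x6 feature map
                  for j in range(6)] for i in range(6)])

act = np.maximum(0, conv)                        # ReLU activation

pool = act.reshape(3, 2, 3, 2).max(axis=(1, 3))  # 2x2 max pooling -> 3x3

flat = pool.ravel()                              # flatten to 9 features
W, b = rng.standard_normal((2, 9)), np.zeros(2)  # fully connected: 2 classes
logits = W @ flat + b

e = np.exp(logits - logits.max())                # softmax output layer
probs = e / e.sum()                              # class probabilities sum to 1
```

Each stage shrinks or reshapes the data: 8x8 pixels become a 6x6 feature map, a 3x3 pooled map, and finally a 2-element probability vector.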
Convolutional Layers: The Core of Feature Extraction
- Convolution Operation: A filter slides over the input image, computing a dot product between the filter and the image patch.
- Feature Maps: The output of the convolution operation, highlighting specific features detected by the filter.
- Example: a filter designed to detect vertical edges produces large values wherever such edges appear as it slides across the image, yielding a feature map that highlights them.
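The vertical-edge example above can be sketched directly in NumPy. The `conv2d` helper and the toy two-tone image are illustrative; note that, as in most deep learning libraries, this computes cross-correlation (no kernel flip), which is what "convolution" conventionally means in CNNs.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution: slide the kernel over the image and take a
    dot product between the kernel and each image patch."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # elementwise product, summed
    return out

# A classic vertical-edge kernel: responds to left-to-right intensity changes.
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]], dtype=float)

# Toy image: dark left half, bright right half -> one vertical edge.
img = np.zeros((5, 5))
img[:, 3:] = 1.0

feature_map = conv2d(img, vertical_edge)
```

In the resulting 3x3 feature map, the entries covering the dark-to-bright boundary have large magnitude, while flat regions produce zeros.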
Activation Functions: Introducing Non-Linearity
- ReLU (Rectified Linear Unit): Sets negative values to zero, preserving positive values.
- Purpose: Allows the network to learn complex, non-linear patterns.
ReLU is preferred in CNNs because it reduces the risk of the vanishing gradient problem, enabling faster and more effective training.
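ReLU and its gradient are simple enough to write in a couple of lines; this NumPy sketch also shows why it helps with vanishing gradients: the gradient is exactly 1 for every active unit, so it never shrinks as it propagates backward.

```python
import numpy as np

def relu(x):
    # Elementwise max(0, x): negatives become 0, positives pass through.
    return np.maximum(0, x)

def relu_grad(x):
    # Gradient is 1 where x > 0 and 0 elsewhere; active units pass
    # gradients through unchanged, which mitigates vanishing gradients.
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
out = relu(x)       # -> [0., 0., 0., 1.5, 3.]
grad = relu_grad(x) # -> [0., 0., 0., 1., 1.]
```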
Pooling Layers: Reducing Dimensionality
- Max Pooling: Takes the maximum value from a defined window (e.g., 2x2) in the feature map.
- Purpose:
  - Reduces computational complexity
  - Makes the network more robust to spatial variations
- Analogy: pooling is like summarizing a paragraph into a sentence; it captures the most important information while discarding less relevant details.
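Max pooling with a 2x2 window can be sketched as follows; the `max_pool` helper and the example feature map are illustrative.

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Max pooling: keep only the largest value in each window."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i*stride:i*stride+size,
                                 j*stride:j*stride+size]
            out[i, j] = window.max()  # summarize the window by its maximum
    return out

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 8]], dtype=float)

pooled = max_pool(fm)  # -> [[4., 2.], [2., 8.]]
```

The 4x4 map shrinks to 2x2, and each output value would be unchanged if a detected feature shifted by a pixel within its window, which is what makes pooling robust to small spatial variations.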
Fully Connected Layers: Integrating Features
- Structure: Each neuron is connected to every neuron in the previous layer.
- Function: Combines features to make high-level decisions, such as classifying an image as a cat or dog.
Fully connected layers act as the "decision-making" part of the network, using the features extracted by earlier layers to produce a final prediction.
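A fully connected layer is a matrix-vector product followed by softmax at the output; this NumPy sketch uses randomly initialized weights and an 8-dimensional feature vector purely for illustration (the two classes stand in for "cat" and "dog").

```python
import numpy as np

def fully_connected(features, W, b):
    # Every output neuron is connected to every input feature:
    # a single matrix-vector product plus a bias.
    return W @ features + b

def softmax(z):
    # Subtract the max for numerical stability, then normalize.
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
features = rng.standard_normal(8)    # flattened features from earlier layers
W = rng.standard_normal((2, 8))      # 2 output classes (e.g. cat vs. dog)
b = np.zeros(2)

logits = fully_connected(features, W, b)
probs = softmax(logits)              # class probabilities summing to 1
```

In a trained network, `W` and `b` are learned so that the class with the highest probability matches the image's label.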