How CNNs Learn Spatial Hierarchies of Features
Basic Architecture of CNNs
- Input Layer: Receives the raw pixel data of an image.
- Convolutional Layers: Apply filters (kernels) to detect features like edges and textures.
- Activation Functions: Introduce non-linearity, often using ReLU (Rectified Linear Unit).
- Pooling Layers: Reduce the spatial dimensions, preserving essential features.
- Fully Connected Layers: Integrate features for classification.
- Output Layer: Produces the final prediction, often using a softmax function for classification.
This hierarchical structure allows CNNs to learn from simple to complex features, mimicking how humans recognize patterns in images.
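The whole pipeline above can be sketched end-to-end in plain NumPy (assumed available). The filter and dense weights here are random stand-ins for learned parameters; in a real network they would be adjusted by training:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation, as in most deep learning libraries)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def relu(x):
    return np.maximum(0, x)          # non-linearity

def max_pool(x, size=2):
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

# Forward pass on a random 8x8 "image": conv -> ReLU -> pool -> dense -> softmax.
image = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))               # one filter (random stand-in)
features = max_pool(relu(conv2d(image, kernel)))   # 8x8 -> 6x6 -> 3x3
flat = features.ravel()                            # flatten for the dense head
W, b = rng.standard_normal((2, flat.size)), np.zeros(2)  # 2-class head
probs = softmax(W @ flat + b)
print(probs)   # two class probabilities summing to 1
```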
Convolutional Layers: The Core of Feature Extraction
- Convolution Operation: A filter slides over the input image, computing a dot product between the filter and the image patch.
- Feature Maps: The output of the convolution operation, highlighting specific features detected by the filter.
- Consider a filter designed to detect vertical edges.
- As it moves across the image, it produces high values where vertical edges are present, creating a feature map that highlights these edges.
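The vertical-edge example above can be made concrete with a Sobel-style filter and a tiny synthetic image whose left half is dark and right half is bright (NumPy assumed available):

```python
import numpy as np

# 6x6 image: dark left half (0), bright right half (1) -> one vertical edge.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Sobel-style vertical-edge filter: responds where intensity changes left-to-right.
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    return np.array([[np.sum(image[i:i+kh, j:j+kw] * kernel)
                      for j in range(ow)] for i in range(oh)])

fmap = conv2d(image, kernel)
print(fmap)
# Each row reads [0. 4. 4. 0.]: high values exactly where the
# filter straddles the edge, zero over the flat regions.
```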
Activation Functions: Introducing Non-Linearity
- ReLU (Rectified Linear Unit): Sets negative values to zero, preserving positive values.
- Purpose: Allows the network to learn complex, non-linear patterns.
ReLU is preferred in CNNs because it reduces the risk of the vanishing gradient problem, enabling faster and more effective training.
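ReLU itself is one line, and its gradient behaviour explains the point above: the derivative is exactly 1 for positive inputs, so it does not shrink toward zero the way sigmoid or tanh gradients do:

```python
import numpy as np

def relu(x):
    # Element-wise: negatives -> 0, positives unchanged.
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))                  # [0.  0.  0.  1.5 3. ]
print((x > 0).astype(float))    # ReLU derivative: [0. 0. 0. 1. 1.]
```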
Pooling Layers: Reducing Dimensionality
- Max Pooling: Takes the maximum value from a defined window (e.g., 2x2) in the feature map.
- Purpose:
- Reduces computational complexity
- Makes the network more robust to spatial variations
- Think of pooling as summarizing a paragraph into a sentence.
- It captures the most important information while discarding less relevant details.
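A 2x2 max pool on a small feature map shows the "summarizing" effect directly. This sketch assumes NumPy and uses non-overlapping windows (the common default):

```python
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 2],
                 [7, 2, 9, 1],
                 [3, 5, 4, 8]], dtype=float)

def max_pool2x2(x):
    # Split into non-overlapping 2x2 windows and keep the max of each.
    h, w = x.shape[0] // 2, x.shape[1] // 2
    return x.reshape(h, 2, w, 2).max(axis=(1, 3))

print(max_pool2x2(fmap))
# 4x4 -> 2x2:
# [[6. 2.]
#  [7. 9.]]
```

Three quarters of the values are discarded, but the strongest activation in each neighborhood survives, which is why small shifts of the input change the pooled output little.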
Fully Connected Layers: Integrating Features
- Structure: Each neuron is connected to every neuron in the previous layer.
- Function: Combines features to make high-level decisions, such as classifying an image as a cat or dog.
Fully connected layers act as the "decision-making" part of the network, using the features extracted by earlier layers to produce a final prediction.
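Numerically, a fully connected layer is just a matrix-vector product: one weight per (output neuron, input feature) pair, which is what "connected to every neuron" means. The shapes and weights below are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal(9)   # flattened feature map (hypothetical size)
W = rng.standard_normal((2, 9))     # one weight per (class, feature) pair
b = np.zeros(2)

scores = W @ features + b           # every input feeds every output neuron
print(scores)                       # one raw score ("logit") per class
```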
Output Layer: Producing Predictions
- Softmax Function: Converts the network's output into probabilities for each class.
In a cat vs. dog classifier, the softmax output might be 0.8 for "cat" and 0.2 for "dog," indicating an 80% confidence that the image is a cat.
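The cat-vs-dog numbers above fall out of the softmax formula directly. The logits here are hypothetical, chosen so the output lands near the 0.8 / 0.2 split in the example:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 0.6])   # hypothetical raw scores for "cat", "dog"
probs = softmax(logits)
print(probs)                    # ~ [0.80, 0.20], summing to 1
```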
Number of Layers
- The number of layers affects the network's ability to learn complex features.
- Deeper networks can capture more abstract patterns but may require more data and computational power.
Kernel Size and Stride
- Kernel Size: Determines the area of the image each filter covers.
- Larger Kernels: Capture more global features.
- Smaller Kernels: Focus on local details.
- Stride: Controls how much the filter moves with each step.
- Larger Strides: Reduce the size of the feature map, increasing computational efficiency.
- Smaller Strides: Provide more detailed feature maps.
A 3x3 kernel with a stride of 1 captures fine details, while a 5x5 kernel with a stride of 2 captures broader patterns with less overlap.
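The trade-off follows from the standard output-size formula, floor((n + 2p - k) / s) + 1, for input size n, padding p, kernel size k, and stride s:

```python
def conv_output_size(n, k, stride=1, padding=0):
    # floor((n + 2p - k) / s) + 1
    return (n + 2 * padding - k) // stride + 1

print(conv_output_size(32, 3, stride=1))  # 30: fine detail, large feature map
print(conv_output_size(32, 5, stride=2))  # 14: broader view, smaller map
```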
Activation Function Selection
- ReLU: Commonly used for its simplicity and effectiveness.
- Other Options: Leaky ReLU, tanh, or sigmoid, depending on the specific requirements of the network.
Avoid using sigmoid or tanh in hidden layers of CNNs, as they can cause the vanishing gradient problem, slowing down training.
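Two quick computations back this advice up: the sigmoid's gradient never exceeds 0.25 and collapses toward zero for large inputs, while Leaky ReLU keeps a small slope on negatives so units cannot go fully "dead":

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def leaky_relu(x, alpha=0.01):
    # Small slope alpha on negatives instead of zeroing them out.
    return np.where(x > 0, x, alpha * x)

x = np.array([-10.0, -1.0, 1.0, 10.0])

s = sigmoid(x)
print(s * (1 - s))        # sigmoid gradient: <= 0.25, tiny at the extremes
print(leaky_relu(x))      # negatives scaled by alpha, positives unchanged
```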
Loss Function
- Cross-Entropy Loss: Measures the difference between predicted probabilities and actual labels.
- Role: Guides the network's learning by penalizing incorrect predictions.
- The choice of loss function is critical.
- For classification tasks, cross-entropy loss is preferred because it effectively handles probability distributions.
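For a single example, cross-entropy is just the negative log-probability the network assigns to the true class, so confident wrong answers are penalized far more heavily than confident right ones:

```python
import numpy as np

def cross_entropy(probs, true_class):
    # Negative log-probability assigned to the correct class.
    return -np.log(probs[true_class])

confident_right = np.array([0.9, 0.1])   # true class is index 0
confident_wrong = np.array([0.1, 0.9])

print(cross_entropy(confident_right, 0))  # ~0.105: small penalty
print(cross_entropy(confident_wrong, 0))  # ~2.303: large penalty
```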
Why Spatial Hierarchies Matter
- Local to Global Features: CNNs start by detecting simple features (e.g., edges) and progressively learn complex patterns (e.g., shapes, textures).
- Adaptive Learning: The network adjusts its filters during training to optimize feature detection for the specific task.
Real-World Applications
- Image Classification: Identifying objects in photos.
- Object Detection: Locating and classifying multiple objects within an image.
- Medical Imaging: Analyzing X-rays or MRIs to detect anomalies.