Image classification
Image classification is a fundamental task in computer vision and it is the base for Object detection. In image classification, the label of an image is classified using an image classification model. The goal is to classify the image by assigning it to a specific label. Typically, Image Classification refers to images which has only one object. But, object detection involves both classification and localization, and is used to analyze more real life use cases in which multiple objects may exist in an image.
Image classification is a supervised learning task. That means we have to prepare the dataset in which images of same label are collected and compiled together for all the labels.
Object Detection: Image classification with localization
Object detection is a computer vision technique that allows us to identify and locate objects in an image or video. The output of the model will be bounding boxes.
CNN for Object detection
CNN Convolutional Neural Network is a kind of Neural Networks that has filter which convolves with the image pixels and widely used in visual data interpretation task in computer vision.
There are two major kind of CNN architecture for object detection,
- Single Step Object detection
- Two Step Object detection
Single-Step Object Detection
These kind of model uses just one type of architecture (CNN) for feature extraction, localization and classification end to end and can be easily trainable.
Below are some of the models that uses this approach,
- YOLO,
- SSD,
- RetinaNet
Two-Step Object Detection
These kind of model takes two different steps in order to get the detection outputs. Typically, these kind has convolutional layers for feature extraction and RPN (region proposal network) and fully connected layers for localization and classification.
Below are some of the models that uses this approach,
- RCNN
- Fast-RCNN
- Faster-RCNN
- Mask-RCNN
Transformers for Object detection
CNN only architecture have been the best architecture so far until transformers comes in and leaded the benchmarks of object detection. Transformers were introduced to the field of Natural Language processing and soon it finds its way to computer vision. And by the way, transformers is not replacing CNN, it is only enhancing the architecture by using CNN as feature extractor.
Example of Vision Transformer is DETR(Detection Transformer) which is an end to end object detection model that does object classification and localization i.e boundary box detection.
Applications for Object detection
- Face detection
- Counting and Analysis
- Visual search engine
- Visual tagging
- Aerial view analysis
- Video analytics
- And much more.