YOLOv4 VS YOLOv4-tiny

Training custom YOLO detectors for Object Detection

What is YOLO?

YOLO stands for You Only Look Once. It is a state-of-the-art, real-time object detection system, originally developed by Joseph Redmon, that can recognize multiple objects in a single frame. YOLOv2, YOLOv3 and YOLOv4 are its successive versions.

YOLO takes a different approach from earlier detection systems. It applies a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities.

The basic idea of YOLO is shown in the figure below. YOLO divides the input image into an S × S grid, and each grid cell is responsible for predicting the object centered in that grid cell.

Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object (probability) and how accurate it thinks the predicted box is (IOU).
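To make the grid and confidence ideas concrete, here is a minimal Python sketch. It assumes boxes are given as (x_min, y_min, x_max, y_max) tuples and box centers are normalized to the range 0 to 1. The function names and the grid size of 7 are illustrative choices, not part of any YOLO codebase. The confidence is computed as the object probability multiplied by the IOU between the predicted box and the ground-truth box, as defined in the original YOLO paper.

```python
def responsible_cell(x_center, y_center, grid_size):
    """Return (row, col) of the grid cell that contains a normalized box center."""
    # Clamp so that a center exactly on the right/bottom edge stays inside the grid.
    col = min(int(x_center * grid_size), grid_size - 1)
    row = min(int(y_center * grid_size), grid_size - 1)
    return row, col


def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def box_confidence(p_object, predicted_box, truth_box):
    """Confidence score: Pr(object) multiplied by IOU(predicted, truth)."""
    return p_object * iou(predicted_box, truth_box)


if __name__ == "__main__":
    # An object centered in the middle of the image, with an example 7 x 7 grid.
    print(responsible_cell(0.5, 0.5, grid_size=7))
    # Two overlapping 2 x 2 boxes, with an assumed object probability of 0.9.
    print(box_confidence(0.9, (0, 0, 2, 2), (1, 1, 3, 3)))
```

A box that matches the ground truth perfectly has an IOU of 1, so its confidence equals the object probability. A box with no overlap gets a confidence of 0.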

The YOLO model has several advantages over classifier-based systems. It can recognize multiple objects in a single frame. It looks at the whole image at test time, so its predictions are informed by the global context in the image. It also makes predictions with a single network evaluation, unlike systems such as R-CNN, which require thousands of evaluations for a single image. This makes it extremely fast: more than 1000x faster than R-CNN and 100x faster than Fast R-CNN. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.

See the following papers on this for more details on the full system.

About YOLOv4

YOLOv4 is an object detection algorithm that evolved from the YOLOv3 model. It was created by Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. It is twice as fast as EfficientDet with comparable performance. In addition, AP (Average Precision) and FPS (Frames Per Second) in YOLOv4 have increased by 10% and 12% respectively compared to YOLOv3. Its architecture is composed of CSPDarknet53 as the backbone, a spatial pyramid pooling (SPP) additional module, a PANet path-aggregation neck, and a YOLOv3 head.

YOLOv4 uses many new features and combines some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) on the MS COCO dataset at a real-time speed of ~65 FPS on a Tesla V100. It uses the following features:

Weighted-Residual-Connections (WRC)
Cross-Stage-Partial-connections (CSP)
Cross mini-Batch Normalization (CmBN)
Self-adversarial-training (SAT)
Mish activation
Mosaic data augmentation
DropBlock regularization
Complete Intersection over Union loss (CIoU loss)
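As an illustration of one item from this list, the Mish activation is defined as mish(x) = x · tanh(softplus(x)), where softplus(x) = ln(1 + e^x). The sketch below uses only the Python standard library. It is a plain scalar version for reading, not the implementation used inside Darknet.

```python
import math


def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    # max(x, 0) + ln(1 + e^-|x|) avoids overflow for large positive x.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


if __name__ == "__main__":
    for value in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(value, mish(value))
```

Unlike ReLU, Mish is smooth and lets small negative values through instead of cutting them to zero. For large positive inputs it behaves almost like the identity function.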

What is YOLOv4-tiny?

YOLOv4-tiny is the compressed version of YOLOv4. It was proposed on the basis of YOLOv4 to simplify the network structure and reduce the number of parameters, so that it becomes feasible to deploy on mobile and embedded devices.

We can use YOLOv4-tiny for much faster training and much faster detection. It has only two YOLO heads as opposed to three in YOLOv4, and it has been trained from 29 pre-trained convolutional layers, whereas YOLOv4 has been trained from 137 pre-trained convolutional layers.

The FPS (Frames Per Second) of YOLOv4-tiny is approximately eight times that of YOLOv4. However, the accuracy of YOLOv4-tiny is roughly two-thirds that of YOLOv4 when tested on the MS COCO dataset.

The YOLOv4-tiny model achieves 22.0% AP (42.0% AP50) at a speed of 443 FPS on an RTX 2080Ti. With TensorRT, batch size = 4 and FP16 precision, it achieves 1774 FPS.
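The figures quoted in this article can be compared with a few lines of arithmetic. The sketch below uses only the numbers stated above. Note that the two FPS figures were measured on different GPUs (Tesla V100 for YOLOv4, RTX 2080Ti for YOLOv4-tiny), so the speed ratio here is only a rough indication and not a like-for-like benchmark.

```python
# Figures quoted in this article (MS COCO).
yolov4 = {"AP": 43.5, "AP50": 65.7, "FPS": 65.0}        # FPS on Tesla V100
yolov4_tiny = {"AP": 22.0, "AP50": 42.0, "FPS": 443.0}  # FPS on RTX 2080Ti

# Ratio of YOLOv4-tiny to YOLOv4 for each metric.
ratios = {name: yolov4_tiny[name] / yolov4[name] for name in yolov4}

if __name__ == "__main__":
    for name, ratio in ratios.items():
        print(f"{name}: YOLOv4-tiny / YOLOv4 = {ratio:.2f}")
```

With these numbers, the AP50 ratio comes out close to two-thirds, while the stricter AP metric gives a ratio of about one half.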

For real-time object detection, YOLOv4-tiny is the better option of the two, because in a real-time environment faster inference matters more than precision or accuracy.

YOLOv4 Custom Detector vs YOLOv4-tiny Custom Detector

Face Mask Detection

Original Video by Max Fischer from Pexels

I trained both YOLOv4 and YOLOv4-tiny detectors on the same mask dataset of about 1500 images. After 6000 iterations each, the average loss reached around 0.68 for YOLOv4 and around 0.15 for YOLOv4-tiny. As a rule of thumb, the average loss should be between 0.05 and 0.3 for a good detector model.

**YOLOv4** (left) and **YOLOv4-tiny** (right)

TESTING THE TRAINED CUSTOM DETECTORS

When tested for real-time object detection using a webcam, YOLOv4-tiny is better because of its much faster inference time. However, when tested on images and videos, YOLOv4 gives noticeably better detections.

Testing Detectors on Images

I ran both the trained detectors on the same images. See their output side by side below with YOLOv4-tiny predicted images on the left and YOLOv4 predicted images on the right.

YOLOv4-tiny (left) and YOLOv4 (right)

Original Photo by Life Matters from Pexels

Original Photo by Brett Sayles from Pexels

Testing Detectors on Videos

I also ran both the detectors on the same videos. You can watch the side-by-side comparison of their video compilations below.

You can watch the entire video comparison for both the trained detectors on YouTube here.

My Custom Mask Dataset

I have shared my labeled mask dataset on the link below. It is a relatively small dataset but it will give you a good start on how to train your own custom detector model using YOLO. You can find larger datasets with better quality images and label them yourself later.

https://www.kaggle.com/techzizou/labeled-mask-dataset-yolo-darknet

The obj.zip file contains 1510 images along with their YOLO format labeled text files. I have labeled around 1350 of these and downloaded 149 labeled images from Roboflow.
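For readers new to the format: in a YOLO (Darknet) label file, each line describes one object as `class_id x_center y_center width height`, with all four box values normalized by the image size. The sketch below converts one such line back to pixel coordinates. The function name and the sample line are made up for illustration, and the meaning of each class id depends on the names file used with your own dataset.

```python
def yolo_line_to_pixel_box(line, image_width, image_height):
    """Convert one YOLO label line to (class_id, (x_min, y_min, x_max, y_max)) in pixels."""
    class_id, x_center, y_center, width, height = line.split()
    x_center, y_center = float(x_center), float(y_center)
    width, height = float(width), float(height)

    # Values in the label file are fractions of the image width and height.
    x_min = (x_center - width / 2) * image_width
    y_min = (y_center - height / 2) * image_height
    x_max = (x_center + width / 2) * image_width
    y_max = (y_center + height / 2) * image_height
    return int(class_id), (x_min, y_min, x_max, y_max)


if __name__ == "__main__":
    # Hypothetical label line for a 640 x 480 image.
    print(yolo_line_to_pixel_box("0 0.5 0.5 0.25 0.5", 640, 480))
```

Drawing the resulting boxes on a few images is a quick way to check that labels were created correctly before training.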

NOTE: This dataset has mostly close-up images (around 1300) and very few long-shot images (around 200). If you want to download more long-shot images, you can search for datasets online. There are many sites where you can find more datasets. I have given a few links at the bottom under Dataset Sources. You can also add your own images and their YOLO format labeled text files to the dataset.

Since my dataset mostly had close-up images, the detection for close-ups in images and videos is really good. On the other hand, having only 200 long-shot images gives us average performance for long-shot detections.

This goes to show how important it is to collect datasets and label them correctly. Always remember this rule: Garbage In = Garbage Out. Choosing and labeling images is therefore the most important part. Try to find good-quality images. The quality of the data goes a long way towards determining the quality of the result.

Check out the following blogs on how to train your custom object detectors using YOLOv4 & YOLOv4-tiny.

CREDITS

References

Dataset Sources

You can download datasets for many objects from the sites mentioned below. In addition, these sites also contain images of many classes of objects along with their annotations/labels in multiple formats such as the YOLO_DARKNET text files and the PASCAL_VOC xml files.
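If a dataset only provides PASCAL VOC annotations, the boxes need to be converted before training with Darknet. VOC stores each box as pixel corners (xmin, ymin, xmax, ymax), while the YOLO text format uses a normalized center, width and height. The sketch below shows only the box arithmetic. Reading the XML file is left out, and the function names are illustrative.

```python
def voc_box_to_yolo(x_min, y_min, x_max, y_max, image_width, image_height):
    """Convert pixel corner coordinates to normalized (x_center, y_center, width, height)."""
    x_center = (x_min + x_max) / 2 / image_width
    y_center = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return x_center, y_center, width, height


def format_yolo_line(class_id, yolo_box):
    """Format one object as a line for a YOLO label text file."""
    x_center, y_center, width, height = yolo_box
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


if __name__ == "__main__":
    # Hypothetical box in a 640 x 480 image, assigned to class id 0.
    box = voc_box_to_yolo(240, 120, 400, 360, 640, 480)
    print(format_yolo_line(0, box))
```

The class id written to the file must match the position of that class in the names file used for training.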

Mask Dataset Sources

I have used these 3 datasets for my labeled dataset:

More Mask Datasets

Prasoonkottarathil Kaggle (20000 images)
Ashishjangra27 Kaggle (12000 images)
Andrewmvd Kaggle

Video Sources

https://www.pexels.com/