Implementing YOLOV2 from Scratch using TensorFlow 2.0

In this notebook I am going to re-implement YOLOV2 as described in the paper YOLO9000: Better, Faster, Stronger. The goal is to replicate the model as described in the paper and train it on the VOC 2012 dataset.

Introduction

Most of the code in this notebook comes from a series of blog posts by Yumi. I followed those posts to get things working. The original posts use TensorFlow 1.x, so I had to change a few things to make the code run on TensorFlow 2.x, but most of it remains the same. I am linking all the blog posts here, and I highly recommend taking a look at them as they explain everything in much more detail.

Yumi’s Blog Posts with explanation

Google colab with end to end training and evaluation on VOC 2012

I followed Yumi’s blogs to replicate YOLOV2 for the VOC 2012 dataset. If you are looking for a consolidated Python notebook with everything working, you can clone this Google Colab notebook.

https://colab.research.google.com/drive/14mPj3NYg_lJwWCRclzgPzdpKXoQutxUb?usp=sharing


Data Preprocessing

I use the VOC 2012 dataset since its size is manageable, which makes it easy to train on Google Colab.

First, I download and extract the dataset.

--2020-07-06 20:57:53--  http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
Resolving host.robots.ox.ac.uk (host.robots.ox.ac.uk)... 129.67.94.152
Connecting to host.robots.ox.ac.uk (host.robots.ox.ac.uk)|129.67.94.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1999639040 (1.9G) [application/x-tar]
Saving to: ‘VOCtrainval_11-May-2012.tar.1’

VOCtrainval_11-May- 100%[===================>]   1.86G  9.38MB/s    in 3m 35s  

2020-07-06 21:01:28 (8.88 MB/s) - ‘VOCtrainval_11-May-2012.tar.1’ saved [1999639040/1999639040]

Next, we define a function that parses the annotations from the XML files and stores them in an array.
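A minimal sketch of such a parser using only the standard library; the field names follow the PASCAL VOC annotation format, and the function name is hypothetical (the notebook’s version, from Yumi’s post, parses a whole directory and also collects label counts):

```python
import os
import xml.etree.ElementTree as ET

def parse_annotation(ann_path, img_dir):
    """Parse one PASCAL VOC XML file into a dict of image info and objects."""
    root = ET.parse(ann_path).getroot()
    img = {
        "filename": os.path.join(img_dir, root.findtext("filename")),
        "width": int(root.findtext("size/width")),
        "height": int(root.findtext("size/height")),
        "object": [],
    }
    for obj in root.findall("object"):
        img["object"].append({
            "name": obj.findtext("name"),
            "xmin": int(float(obj.findtext("bndbox/xmin"))),
            "ymin": int(float(obj.findtext("bndbox/ymin"))),
            "xmax": int(float(obj.findtext("bndbox/xmax"))),
            "ymax": int(float(obj.findtext("bndbox/ymax"))),
        })
    return img
```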

We prepare the arrays with training_image and seen_train_labels for the whole dataset.

As opposed to YOLOV1, YOLOV2 uses K-means clustering to find the best anchor box sizes for the given dataset.

The ANCHORS defined below are taken from the following blog:

Part 1 Object Detection using YOLOv2 on Pascal VOC2012 - anchor box clustering.

Instead of rerunning the K-means algorithm, we use the ANCHORS obtained by Yumi as-is.
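For reference, the clustering uses 1 − IOU as the distance metric, so box shapes cluster together rather than box positions. A compact numpy sketch of that procedure, under the assumptions that the (w, h) pairs are already normalized and that no cluster goes empty during the iterations (Yumi’s post has the full, more robust version):

```python
import numpy as np

def iou_matrix(boxes, anchors):
    """IOU between all (w, h) boxes and all (w, h) anchors, assuming each
    pair shares the same center."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = ((boxes[:, 0] * boxes[:, 1])[:, None] +
             (anchors[:, 0] * anchors[:, 1])[None, :] - inter)
    return inter / union

def kmeans_anchors(wh, k, seed=0, iters=100):
    """K-means on (w, h) pairs using 1 - IOU as the distance."""
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_matrix(wh, anchors), axis=1)  # nearest = highest IOU
        # recompute each centroid (assumes no cluster goes empty)
        new = np.array([wh[assign == i].mean(axis=0) for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors
```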

N train = 17125

Next, we define an ImageReader class to process an image. It takes in an image and returns the resized image along with all the objects in it.

Here’s a sample usage of the ImageReader class.

******************************
Input
  object: [{'name': 'person', 'xmin': 174, 'ymin': 101, 'xmax': 349, 'ymax': 351}]
  filename: VOCdevkit/VOC2012/JPEGImages/2007_000027.jpg
  width: 486
  height: 500
******************************
Output
          [{'name': 'person', 'xmin': 148, 'ymin': 84, 'xmax': 298, 'ymax': 292}]

png

Next, we define BestAnchorBoxFinder, which finds the best anchor box for a particular object. This is done by finding the anchor box with the highest IOU (Intersection over Union) with the bounding box of the object.
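Since only sizes are compared here (the box and anchor are imagined to share a common center), the IOU reduces to an overlap of widths and heights. A function-style sketch of the core computation the BestAnchorBoxFinder class wraps:

```python
def iou_wh(w1, h1, w2, h2):
    """IOU of two boxes that share the same center, given only (w, h)."""
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)

def find_best_anchor(w, h, anchors):
    """Return (index, iou) of the anchor box with the highest IOU."""
    ious = [iou_wh(w, h, aw, ah) for aw, ah in anchors]
    best = max(range(len(anchors)), key=lambda i: ious[i])
    return best, ious[best]
```

Running this against the four example anchors shown below reproduces assignments like bounding box (w = 0.5, h = 0.5) matching anchor index 3 with IOU ≈ 0.70.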

Here’s a sample usage of the BestAnchorBoxFinder class.

................................................................................
The three example anchor boxes:
anchor box index=0, w=0.08285376, h=0.13705531
anchor box index=1, w=0.20850361, h=0.39420716
anchor box index=2, w=0.80552421, h=0.77665105
anchor box index=3, w=0.42194719, h=0.62385487
................................................................................
Allocate bounding box of various width and height into the three anchor boxes:
bounding box (w = 0.1, h = 0.1) --> best anchor box index = 0, iou = 0.63
bounding box (w = 0.1, h = 0.3) --> best anchor box index = 0, iou = 0.38
bounding box (w = 0.1, h = 0.5) --> best anchor box index = 1, iou = 0.42
bounding box (w = 0.1, h = 0.7) --> best anchor box index = 1, iou = 0.35
bounding box (w = 0.3, h = 0.1) --> best anchor box index = 0, iou = 0.25
bounding box (w = 0.3, h = 0.3) --> best anchor box index = 1, iou = 0.57
bounding box (w = 0.3, h = 0.5) --> best anchor box index = 3, iou = 0.57
bounding box (w = 0.3, h = 0.7) --> best anchor box index = 3, iou = 0.65
bounding box (w = 0.5, h = 0.1) --> best anchor box index = 1, iou = 0.19
bounding box (w = 0.5, h = 0.3) --> best anchor box index = 3, iou = 0.44
bounding box (w = 0.5, h = 0.5) --> best anchor box index = 3, iou = 0.70
bounding box (w = 0.5, h = 0.7) --> best anchor box index = 3, iou = 0.75
bounding box (w = 0.7, h = 0.1) --> best anchor box index = 1, iou = 0.16
bounding box (w = 0.7, h = 0.3) --> best anchor box index = 3, iou = 0.37
bounding box (w = 0.7, h = 0.5) --> best anchor box index = 2, iou = 0.56
bounding box (w = 0.7, h = 0.7) --> best anchor box index = 2, iou = 0.78
center_x and center_w should range between 0 and 13
center_y and center_h should range between 0 and 13
center_x = 07.031 range between 0 and 13
center_y = 05.906 range between 0 and 13
center_w = 04.688 range between 0 and 13
center_h = 06.562 range between 0 and 13
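The values above come from rescaling box coordinates from resized-image pixels into grid units. A sketch, assuming the 416×416 input and 13×13 grid used throughout; the helper name and the example box are hypothetical (the box was chosen so the results land near the values printed above):

```python
IMAGE_W, IMAGE_H = 416, 416
GRID_W, GRID_H = 13, 13

def rescale_to_grid(obj):
    """Convert a box given in resized-image pixels to grid units (0..13)."""
    cell_w, cell_h = IMAGE_W / GRID_W, IMAGE_H / GRID_H   # 32 x 32 pixels per cell
    cx = 0.5 * (obj["xmin"] + obj["xmax"]) / cell_w
    cy = 0.5 * (obj["ymin"] + obj["ymax"]) / cell_h
    w = (obj["xmax"] - obj["xmin"]) / cell_w
    h = (obj["ymax"] - obj["ymin"]) / cell_h
    return cx, cy, w, h

# hypothetical example box in resized-image pixels
print(rescale_to_grid({"xmin": 150, "ymin": 84, "xmax": 300, "ymax": 294}))
# (7.03125, 5.90625, 4.6875, 6.5625)
```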

Next, we define a custom batch generator that returns a batch of 16 images and their corresponding bounding boxes.

array([ 1.07709888,  1.78171903,  2.71054693,  5.12469308, 10.47181473,
       10.09646365,  5.48531347,  8.11011331])
x_batch: (BATCH_SIZE, IMAGE_H, IMAGE_W, N channels)           = (16, 416, 416, 3)
y_batch: (BATCH_SIZE, GRID_H, GRID_W, BOX, 4 + 1 + N classes) = (16, 13, 13, 4, 25)
b_batch: (BATCH_SIZE, 1, 1, 1, TRUE_BOX_BUFFER, 4)            = (16, 1, 1, 1, 50, 4)
igrid_h=11,igrid_w=06,iAnchor=00, person

png

------------------------------
igrid_h=07,igrid_w=05,iAnchor=03, person
igrid_h=08,igrid_w=05,iAnchor=03, person
igrid_h=09,igrid_w=05,iAnchor=02, sofa

png

------------------------------
igrid_h=08,igrid_w=06,iAnchor=02, bird

png

------------------------------
igrid_h=09,igrid_w=08,iAnchor=02, sofa

png

------------------------------
igrid_h=05,igrid_w=06,iAnchor=02, dog

png

------------------------------
igrid_h=06,igrid_w=06,iAnchor=02, car

png
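The grid-cell assignments printed above (igrid_h, igrid_w, iAnchor) come from writing each ground-truth box into y_batch. A minimal numpy sketch of that encoding step for a single box; the helper name is hypothetical, and the class index assumes the alphabetical VOC label order (the notebook’s generator does this inline and also fills b_batch):

```python
import numpy as np

GRID_H, GRID_W, BOX, N_CLASSES = 13, 13, 4, 20

def encode_box(y, cx, cy, w, h, best_anchor, class_idx):
    """Write one ground-truth box (already in grid units) into y of shape
    (GRID_H, GRID_W, BOX, 4 + 1 + N_CLASSES)."""
    gx, gy = int(cx), int(cy)                  # responsible grid cell
    y[gy, gx, best_anchor, 0:4] = [cx, cy, w, h]
    y[gy, gx, best_anchor, 4] = 1.0            # objectness
    y[gy, gx, best_anchor, 5 + class_idx] = 1.0  # one-hot class
    return y

y = np.zeros((GRID_H, GRID_W, BOX, 4 + 1 + N_CLASSES))
# class_idx=14 would be "person" in the alphabetical VOC ordering (assumption)
y = encode_box(y, 6.5, 11.2, 2.0, 3.0, best_anchor=0, class_idx=14)
```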

Next, I add a function to prepare the input and the output. The input is a (416, 416, 3) image and the output is a (13, 13, 4, 25) tensor, following S x S x B x (4 + 1 + C), where:

S x S is the number of grid cells, B is the number of anchor boxes per grid cell, and C is the number of classes (each anchor also predicts 4 box coordinates and 1 confidence score).
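For the VOC setup used in this notebook, the numbers work out as follows (a quick sanity check against the y_batch shape shown earlier):

```python
GRID_H, GRID_W = 13, 13   # S x S grid cells
BOX = 4                   # anchor boxes per grid cell
N_CLASSES = 20            # VOC classes

# each anchor predicts (x, y, w, h, confidence) plus the class scores
depth_per_anchor = 4 + 1 + N_CLASSES           # 25
conv_output_channels = BOX * depth_per_anchor  # 100 channels in the final conv
output_shape = (GRID_H, GRID_W, BOX, depth_per_anchor)
```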

Training the model

Next, I define a custom generator that returns a batch of inputs and outputs.

Next, we create instances of the generator for our training and validation sets.

Define a custom output layer

We need to reshape the output from the model so we define a custom Keras layer for it.

Defining the YOLO model.

Next, we define the model as described in the original paper.

YOLO V2

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_3 (InputLayer)            [(None, 416, 416, 3) 0                                            
__________________________________________________________________________________________________
conv_1 (Conv2D)                 (None, 416, 416, 32) 864         input_3[0][0]                    
__________________________________________________________________________________________________
norm_1 (BatchNormalization)     (None, 416, 416, 32) 128         conv_1[0][0]                     
__________________________________________________________________________________________________
leaky_re_lu_22 (LeakyReLU)      (None, 416, 416, 32) 0           norm_1[0][0]                     
__________________________________________________________________________________________________
max_pooling2d_5 (MaxPooling2D)  (None, 208, 208, 32) 0           leaky_re_lu_22[0][0]             
__________________________________________________________________________________________________
conv_2 (Conv2D)                 (None, 208, 208, 64) 18432       max_pooling2d_5[0][0]            
__________________________________________________________________________________________________
norm_2 (BatchNormalization)     (None, 208, 208, 64) 256         conv_2[0][0]                     
__________________________________________________________________________________________________
leaky_re_lu_23 (LeakyReLU)      (None, 208, 208, 64) 0           norm_2[0][0]                     
__________________________________________________________________________________________________
max_pooling2d_6 (MaxPooling2D)  (None, 104, 104, 64) 0           leaky_re_lu_23[0][0]             
__________________________________________________________________________________________________
conv_3 (Conv2D)                 (None, 104, 104, 128 73728       max_pooling2d_6[0][0]            
__________________________________________________________________________________________________
norm_3 (BatchNormalization)     (None, 104, 104, 128 512         conv_3[0][0]                     
__________________________________________________________________________________________________
leaky_re_lu_24 (LeakyReLU)      (None, 104, 104, 128 0           norm_3[0][0]                     
__________________________________________________________________________________________________
conv_4 (Conv2D)                 (None, 104, 104, 64) 8192        leaky_re_lu_24[0][0]             
__________________________________________________________________________________________________
norm_4 (BatchNormalization)     (None, 104, 104, 64) 256         conv_4[0][0]                     
__________________________________________________________________________________________________
leaky_re_lu_25 (LeakyReLU)      (None, 104, 104, 64) 0           norm_4[0][0]                     
__________________________________________________________________________________________________
conv_5 (Conv2D)                 (None, 104, 104, 128 73728       leaky_re_lu_25[0][0]             
__________________________________________________________________________________________________
norm_5 (BatchNormalization)     (None, 104, 104, 128 512         conv_5[0][0]                     
__________________________________________________________________________________________________
leaky_re_lu_26 (LeakyReLU)      (None, 104, 104, 128 0           norm_5[0][0]                     
__________________________________________________________________________________________________
max_pooling2d_7 (MaxPooling2D)  (None, 52, 52, 128)  0           leaky_re_lu_26[0][0]             
__________________________________________________________________________________________________
conv_6 (Conv2D)                 (None, 52, 52, 256)  294912      max_pooling2d_7[0][0]            
__________________________________________________________________________________________________
norm_6 (BatchNormalization)     (None, 52, 52, 256)  1024        conv_6[0][0]                     
__________________________________________________________________________________________________
leaky_re_lu_27 (LeakyReLU)      (None, 52, 52, 256)  0           norm_6[0][0]                     
__________________________________________________________________________________________________
conv_7 (Conv2D)                 (None, 52, 52, 128)  32768       leaky_re_lu_27[0][0]             
__________________________________________________________________________________________________
norm_7 (BatchNormalization)     (None, 52, 52, 128)  512         conv_7[0][0]                     
__________________________________________________________________________________________________
leaky_re_lu_28 (LeakyReLU)      (None, 52, 52, 128)  0           norm_7[0][0]                     
__________________________________________________________________________________________________
conv_8 (Conv2D)                 (None, 52, 52, 256)  294912      leaky_re_lu_28[0][0]             
__________________________________________________________________________________________________
norm_8 (BatchNormalization)     (None, 52, 52, 256)  1024        conv_8[0][0]                     
__________________________________________________________________________________________________
leaky_re_lu_29 (LeakyReLU)      (None, 52, 52, 256)  0           norm_8[0][0]                     
__________________________________________________________________________________________________
max_pooling2d_8 (MaxPooling2D)  (None, 26, 26, 256)  0           leaky_re_lu_29[0][0]             
__________________________________________________________________________________________________
conv_9 (Conv2D)                 (None, 26, 26, 512)  1179648     max_pooling2d_8[0][0]            
__________________________________________________________________________________________________
norm_9 (BatchNormalization)     (None, 26, 26, 512)  2048        conv_9[0][0]                     
__________________________________________________________________________________________________
leaky_re_lu_30 (LeakyReLU)      (None, 26, 26, 512)  0           norm_9[0][0]                     
__________________________________________________________________________________________________
conv_10 (Conv2D)                (None, 26, 26, 256)  131072      leaky_re_lu_30[0][0]             
__________________________________________________________________________________________________
norm_10 (BatchNormalization)    (None, 26, 26, 256)  1024        conv_10[0][0]                    
__________________________________________________________________________________________________
leaky_re_lu_31 (LeakyReLU)      (None, 26, 26, 256)  0           norm_10[0][0]                    
__________________________________________________________________________________________________
conv_11 (Conv2D)                (None, 26, 26, 512)  1179648     leaky_re_lu_31[0][0]             
__________________________________________________________________________________________________
norm_11 (BatchNormalization)    (None, 26, 26, 512)  2048        conv_11[0][0]                    
__________________________________________________________________________________________________
leaky_re_lu_32 (LeakyReLU)      (None, 26, 26, 512)  0           norm_11[0][0]                    
__________________________________________________________________________________________________
conv_12 (Conv2D)                (None, 26, 26, 256)  131072      leaky_re_lu_32[0][0]             
__________________________________________________________________________________________________
norm_12 (BatchNormalization)    (None, 26, 26, 256)  1024        conv_12[0][0]                    
__________________________________________________________________________________________________
leaky_re_lu_33 (LeakyReLU)      (None, 26, 26, 256)  0           norm_12[0][0]                    
__________________________________________________________________________________________________
conv_13 (Conv2D)                (None, 26, 26, 512)  1179648     leaky_re_lu_33[0][0]             
__________________________________________________________________________________________________
norm_13 (BatchNormalization)    (None, 26, 26, 512)  2048        conv_13[0][0]                    
__________________________________________________________________________________________________
leaky_re_lu_34 (LeakyReLU)      (None, 26, 26, 512)  0           norm_13[0][0]                    
__________________________________________________________________________________________________
max_pooling2d_9 (MaxPooling2D)  (None, 13, 13, 512)  0           leaky_re_lu_34[0][0]             
__________________________________________________________________________________________________
conv_14 (Conv2D)                (None, 13, 13, 1024) 4718592     max_pooling2d_9[0][0]            
__________________________________________________________________________________________________
norm_14 (BatchNormalization)    (None, 13, 13, 1024) 4096        conv_14[0][0]                    
__________________________________________________________________________________________________
leaky_re_lu_35 (LeakyReLU)      (None, 13, 13, 1024) 0           norm_14[0][0]                    
__________________________________________________________________________________________________
conv_15 (Conv2D)                (None, 13, 13, 512)  524288      leaky_re_lu_35[0][0]             
__________________________________________________________________________________________________
norm_15 (BatchNormalization)    (None, 13, 13, 512)  2048        conv_15[0][0]                    
__________________________________________________________________________________________________
leaky_re_lu_36 (LeakyReLU)      (None, 13, 13, 512)  0           norm_15[0][0]                    
__________________________________________________________________________________________________
conv_16 (Conv2D)                (None, 13, 13, 1024) 4718592     leaky_re_lu_36[0][0]             
__________________________________________________________________________________________________
norm_16 (BatchNormalization)    (None, 13, 13, 1024) 4096        conv_16[0][0]                    
__________________________________________________________________________________________________
leaky_re_lu_37 (LeakyReLU)      (None, 13, 13, 1024) 0           norm_16[0][0]                    
__________________________________________________________________________________________________
conv_17 (Conv2D)                (None, 13, 13, 512)  524288      leaky_re_lu_37[0][0]             
__________________________________________________________________________________________________
norm_17 (BatchNormalization)    (None, 13, 13, 512)  2048        conv_17[0][0]                    
__________________________________________________________________________________________________
leaky_re_lu_38 (LeakyReLU)      (None, 13, 13, 512)  0           norm_17[0][0]                    
__________________________________________________________________________________________________
conv_18 (Conv2D)                (None, 13, 13, 1024) 4718592     leaky_re_lu_38[0][0]             
__________________________________________________________________________________________________
norm_18 (BatchNormalization)    (None, 13, 13, 1024) 4096        conv_18[0][0]                    
__________________________________________________________________________________________________
leaky_re_lu_39 (LeakyReLU)      (None, 13, 13, 1024) 0           norm_18[0][0]                    
__________________________________________________________________________________________________
conv_19 (Conv2D)                (None, 13, 13, 1024) 9437184     leaky_re_lu_39[0][0]             
__________________________________________________________________________________________________
norm_19 (BatchNormalization)    (None, 13, 13, 1024) 4096        conv_19[0][0]                    
__________________________________________________________________________________________________
conv_21 (Conv2D)                (None, 26, 26, 64)   32768       leaky_re_lu_34[0][0]             
__________________________________________________________________________________________________
leaky_re_lu_40 (LeakyReLU)      (None, 13, 13, 1024) 0           norm_19[0][0]                    
__________________________________________________________________________________________________
norm_21 (BatchNormalization)    (None, 26, 26, 64)   256         conv_21[0][0]                    
__________________________________________________________________________________________________
conv_20 (Conv2D)                (None, 13, 13, 1024) 9437184     leaky_re_lu_40[0][0]             
__________________________________________________________________________________________________
leaky_re_lu_42 (LeakyReLU)      (None, 26, 26, 64)   0           norm_21[0][0]                    
__________________________________________________________________________________________________
norm_20 (BatchNormalization)    (None, 13, 13, 1024) 4096        conv_20[0][0]                    
__________________________________________________________________________________________________
lambda_2 (Lambda)               (None, 13, 13, 256)  0           leaky_re_lu_42[0][0]             
__________________________________________________________________________________________________
leaky_re_lu_41 (LeakyReLU)      (None, 13, 13, 1024) 0           norm_20[0][0]                    
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 13, 13, 1280) 0           lambda_2[0][0]                   
                                                                 leaky_re_lu_41[0][0]             
__________________________________________________________________________________________________
conv_22 (Conv2D)                (None, 13, 13, 1024) 11796480    concatenate_1[0][0]              
__________________________________________________________________________________________________
norm_22 (BatchNormalization)    (None, 13, 13, 1024) 4096        conv_22[0][0]                    
__________________________________________________________________________________________________
leaky_re_lu_43 (LeakyReLU)      (None, 13, 13, 1024) 0           norm_22[0][0]                    
__________________________________________________________________________________________________
conv_23 (Conv2D)                (None, 13, 13, 100)  102500      leaky_re_lu_43[0][0]             
__________________________________________________________________________________________________
reshape_1 (Reshape)             (None, 13, 13, 4, 25 0           conv_23[0][0]                    
__________________________________________________________________________________________________
input_4 (InputLayer)            [(None, 1, 1, 1, 50, 0                                            
__________________________________________________________________________________________________
lambda_3 (Lambda)               (None, 13, 13, 4, 25 0           reshape_1[0][0]                  
                                                                 input_4[0][0]                    
==================================================================================================
Total params: 50,650,436
Trainable params: 50,629,764
Non-trainable params: 20,672
__________________________________________________________________________________________________
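The lambda_2 layer in the summary above is the “passthrough” (reorg) layer from the paper: it stacks each 2×2 spatial block of the 26×26×64 feature map into channels, giving 13×13×256 so it can be concatenated with the 13×13 path. In TensorFlow this is tf.nn.space_to_depth; a numpy sketch of the same rearrangement (the channel ordering should match TF’s, but treat that as an assumption):

```python
import numpy as np

def space_to_depth(x, block=2):
    """Rearrange (H, W, C) -> (H//block, W//block, C*block*block) by moving
    each block x block spatial patch into the channel dimension."""
    h, w, c = x.shape
    x = x.reshape(h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (h', w', block_h, block_w, c)
    return x.reshape(h // block, w // block, c * block * block)

feat = np.random.rand(26, 26, 64)
print(space_to_depth(feat).shape)  # (13, 13, 256)
```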

Next, we download the pre-trained weights for YOLO V2.

--2020-07-06 21:02:41--  https://pjreddie.com/media/files/yolov2.weights
Resolving pjreddie.com (pjreddie.com)... 128.208.4.108
Connecting to pjreddie.com (pjreddie.com)|128.208.4.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 203934260 (194M) [application/octet-stream]
Saving to: ‘yolov2.weights.1’

yolov2.weights.1    100%[===================>] 194.49M   867KB/s    in 3m 6s   

2020-07-06 21:05:47 (1.05 MB/s) - ‘yolov2.weights.1’ saved [203934260/203934260]
all_weights.shape = (50983565,)
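The printed shape (50,983,565 floats, i.e. 203,934,260 bytes / 4) suggests the file is read as one flat float32 array, with the first 4 values treated as a header. A sketch of such a reader; the real loading code then walks the conv and batch-norm layers and assigns consecutive slices to each:

```python
import numpy as np

class WeightReader:
    """Minimal reader for a darknet-style weight file: a small header
    followed by a flat stream of float32 values (a sketch)."""
    def __init__(self, path):
        self.offset = 4  # skip the 4-value header
        self.all_weights = np.fromfile(path, dtype="float32")

    def read(self, size):
        """Return the next `size` weights and advance the offset."""
        self.offset += size
        return self.all_weights[self.offset - size:self.offset]
```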

Define a custom learning rate scheduler

The paper uses different learning rates for different epochs, so we define a custom Keras callback to adjust the learning rate during training.
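A sketch of such a schedule as a plain function; the breakpoints follow the paper’s detection-training description (start at 1e-3, divide by 10 at epochs 60 and 90), though the notebook’s exact callback may differ:

```python
def lr_schedule(epoch):
    """Piecewise-constant learning rate per the paper's detection schedule."""
    if epoch < 60:
        return 1e-3
    if epoch < 90:
        return 1e-4
    return 1e-5
```

This kind of function can be hooked into training with tf.keras.callbacks.LearningRateScheduler(lr_schedule).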

Define the loss function

Next, we define a custom loss function for the model. Take a look at this blog post to understand the loss function used in YOLO in more detail.

I understood the loss function but didn’t implement it on my own. I took the implementation as it is from the above blog post.

The original blog post was written for TensorFlow 1.x, so I had to update some of the code to make it run on TensorFlow 2.x.
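To make the pieces concrete, here is a heavily simplified numpy sketch of just the coordinate term of the loss: squared error on (x, y) and (w, h), masked so only the anchors responsible for a ground-truth box contribute. The real implementation adds the confidence and class terms and several scaling details this sketch glosses over:

```python
import numpy as np

def loss_xywh(true_xy, pred_xy, true_wh, pred_wh, obj_mask):
    """Squared-error coordinate loss over responsible anchors only.
    obj_mask is 1 where an anchor owns a ground-truth box, else 0."""
    mask = obj_mask[..., None]          # broadcast over the (x, y) / (w, h) pair
    n_boxes = np.sum(obj_mask) + 1e-6   # avoid divide-by-zero
    loss_xy = np.sum(mask * np.square(true_xy - pred_xy)) / n_boxes / 2.0
    loss_wh = np.sum(mask * np.square(true_wh - pred_wh)) / n_boxes / 2.0
    return loss_xy + loss_wh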

******************************
prepare inputs
y_pred before scaling = (16, 13, 13, 4, 25)
******************************
define tensor graph
******************************
output
******************************

pred_box_xy (16, 13, 13, 4, 2)
  bounding box x at iGRID_W=00 MIN= 0.45, MAX= 0.55
  bounding box x at iGRID_W=01 MIN= 1.45, MAX= 1.54
  bounding box x at iGRID_W=02 MIN= 2.45, MAX= 2.55
  bounding box x at iGRID_W=03 MIN= 3.45, MAX= 3.55
  bounding box x at iGRID_W=04 MIN= 4.45, MAX= 4.55
  bounding box x at iGRID_W=05 MIN= 5.45, MAX= 5.55
  bounding box x at iGRID_W=06 MIN= 6.46, MAX= 6.55
  bounding box x at iGRID_W=07 MIN= 7.45, MAX= 7.55
  bounding box x at iGRID_W=08 MIN= 8.46, MAX= 8.55
  bounding box x at iGRID_W=09 MIN= 9.44, MAX= 9.55
  bounding box x at iGRID_W=10 MIN=10.46, MAX=10.55
  bounding box x at iGRID_W=11 MIN=11.46, MAX=11.55
  bounding box x at iGRID_W=12 MIN=12.45, MAX=12.55
  bounding box y at iGRID_H=00 MIN= 0.45, MAX= 0.55
  bounding box y at iGRID_H=01 MIN= 1.45, MAX= 1.54
  bounding box y at iGRID_H=02 MIN= 2.46, MAX= 2.54
  bounding box y at iGRID_H=03 MIN= 3.45, MAX= 3.55
  bounding box y at iGRID_H=04 MIN= 4.45, MAX= 4.54
  bounding box y at iGRID_H=05 MIN= 5.45, MAX= 5.54
  bounding box y at iGRID_H=06 MIN= 6.45, MAX= 6.55
  bounding box y at iGRID_H=07 MIN= 7.45, MAX= 7.55
  bounding box y at iGRID_H=08 MIN= 8.46, MAX= 8.54
  bounding box y at iGRID_H=09 MIN= 9.46, MAX= 9.55
  bounding box y at iGRID_H=10 MIN=10.45, MAX=10.54
  bounding box y at iGRID_H=11 MIN=11.46, MAX=11.54
  bounding box y at iGRID_H=12 MIN=12.45, MAX=12.54

pred_box_wh (16, 13, 13, 4, 2)
  bounding box width  MIN= 0.88, MAX=12.49
  bounding box height MIN= 1.46, MAX=12.64

pred_box_conf (16, 13, 13, 4)
  confidence  MIN= 0.45, MAX= 0.56

pred_box_class (16, 13, 13, 4, 20)
  class probability MIN=-0.26, MAX= 0.28

We extract the ground truth.

Input y_batch = (16, 13, 13, 4, 25)
******************************
output
******************************

true_box_xy (16, 13, 13, 4, 2)
  bounding box x at iGRID_W=01 MIN= 1.56, MAX= 1.56
  bounding box x at iGRID_W=02 MIN= 2.36, MAX= 2.36
  bounding box x at iGRID_W=03 MIN= 3.09, MAX= 3.41
  bounding box x at iGRID_W=05 MIN= 5.00, MAX= 5.94
  bounding box x at iGRID_W=06 MIN= 6.22, MAX= 6.67
  bounding box x at iGRID_W=07 MIN= 7.66, MAX= 7.66
  bounding box x at iGRID_W=08 MIN= 8.56, MAX= 8.86
  bounding box x at iGRID_W=09 MIN= 9.09, MAX= 9.39
  bounding box y at iGRID_H=01 MIN= 1.58, MAX= 1.58
  bounding box y at iGRID_H=05 MIN= 5.34, MAX= 5.42
  bounding box y at iGRID_H=06 MIN= 6.50, MAX= 6.91
  bounding box y at iGRID_H=07 MIN= 7.02, MAX= 7.38
  bounding box y at iGRID_H=08 MIN= 8.08, MAX= 8.64
  bounding box y at iGRID_H=09 MIN= 9.20, MAX= 9.88
  bounding box y at iGRID_H=10 MIN=10.14, MAX=10.36
  bounding box y at iGRID_H=11 MIN=11.11, MAX=11.42

true_box_wh (16, 13, 13, 4, 2)
  bounding box width  MIN= 0.00, MAX=12.97
  bounding box height MIN= 0.00, MAX=13.00

true_box_conf (16, 13, 13, 4)
  confidence, unique value = [0. 1.]

true_box_class (16, 13, 13, 4)
  class index, unique value = [ 0  2  6  7  8 11 14 15 17 19]
******************************
output
******************************
loss_xywh = 4.148
******************************
output
******************************
loss_class = 3.018

png

float64 float64
loss tf.Tensor(7.290645, shape=(), dtype=float32)

Add a callback for saving the weights

Next, I define a callback to keep saving the best weights.

Compile the model

Finally, I compile the model using the custom loss function that was defined above.

WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of batches seen.

Train the model

Now that we have everything set up, we call model.fit to train the model. The run below is configured for 50 epochs, with early stopping cutting it short once the loss stops improving.

Epoch 1/50
1071/1071 [==============================] - ETA: 0s - loss: 0.0836
.
.
.
Epoch 00011: loss did not improve from 0.04997
1071/1071 [==============================] - 287s 268ms/step - loss: 0.0612
Epoch 00011: early stopping





<tensorflow.python.keras.callbacks.History at 0x7f0494d50940>

Evaluation

Now that we have trained our model, let’s use it to predict the class labels and bounding boxes for a few images.

Let’s pass an image through the model and look at the prediction.

(416, 416, 3)
(1, 416, 416, 3)
(1, 13, 13, 4, 25)

Note that y_pred needs to be scaled up, so we define a class called OutputRescaler for it.
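A numpy sketch of the decoding OutputRescaler performs, following the YOLOV2 parameterization: sigmoid plus the cell offset for (x, y), the anchor size times an exponential for (w, h), sigmoid for confidence, and softmax over the class scores. The function name is hypothetical; the ANCHORS values below are the ones printed earlier:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rescale_output(y_pred, anchors):
    """Decode raw output of shape (GRID_H, GRID_W, BOX, 4 + 1 + N_CLASSES)."""
    gh, gw, nbox, _ = y_pred.shape
    out = y_pred.copy()
    cx = np.tile(np.arange(gw), (gh, 1))           # x offset of each cell
    cy = np.tile(np.arange(gh)[:, None], (1, gw))  # y offset of each cell
    out[..., 0] = sigmoid(y_pred[..., 0]) + cx[..., None]
    out[..., 1] = sigmoid(y_pred[..., 1]) + cy[..., None]
    out[..., 2:4] = np.exp(y_pred[..., 2:4]) * np.reshape(anchors, (1, 1, nbox, 2))
    out[..., 4] = sigmoid(y_pred[..., 4])          # objectness confidence
    e = np.exp(y_pred[..., 5:] - y_pred[..., 5:].max(axis=-1, keepdims=True))
    out[..., 5:] = e / e.sum(axis=-1, keepdims=True)  # class softmax
    return out
```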

Let’s try out the OutputRescaler class.

Next, let’s define a method to find bounding boxes with high confidence probability.
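The selection step amounts to scoring each anchor by confidence times its best class probability and thresholding. A sketch, with a hypothetical function name:

```python
import numpy as np

def find_high_prob_boxes(y_rescaled, obj_threshold=0.03):
    """Return (grid_y, grid_x, anchor) indices whose confidence x best class
    probability exceeds obj_threshold."""
    conf = y_rescaled[..., 4]
    class_prob = y_rescaled[..., 5:].max(axis=-1)
    score = conf * class_prob
    return np.argwhere(score > obj_threshold)
```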

Let’s try out the above function and see if it works.

obj_threshold=0.015
In total, YOLO can produce GRID_H * GRID_W * BOX = 676 bounding boxes 
I found 20 bounding boxes with top class probability > 0.015

obj_threshold=0.03
In total, YOLO can produce GRID_H * GRID_W * BOX = 676 bounding boxes 
I found 12 bounding boxes with top class probability > 0.03

Next, we define a function to draw the bounding boxes on the image.

Plot with low object threshold
person     0.082 xmin= 183,ymin=  39,xmax= 292,ymax= 276
person     0.090 xmin= 181,ymin=  25,xmax= 291,ymax= 293
person     0.033 xmin= 180,ymin=  43,xmax= 251,ymax= 299
person     0.518 xmin= 186,ymin=  26,xmax= 286,ymax= 314
person     0.865 xmin= 178,ymin=  31,xmax= 291,ymax= 312
person     0.027 xmin= 179,ymin=  27,xmax= 344,ymax= 304
bicycle    0.017 xmin=  85,ymin= 180,xmax= 141,ymax= 237
bicycle    0.045 xmin=  62,ymin= 144,xmax= 174,ymax= 260

png

Plot with high object threshold
person     0.082 xmin= 183,ymin=  39,xmax= 292,ymax= 276
person     0.090 xmin= 181,ymin=  25,xmax= 291,ymax= 293
person     0.033 xmin= 180,ymin=  43,xmax= 251,ymax= 299
person     0.518 xmin= 186,ymin=  26,xmax= 286,ymax= 314
person     0.865 xmin= 178,ymin=  31,xmax= 291,ymax= 312
bicycle    0.045 xmin=  62,ymin= 144,xmax= 174,ymax= 260
bicycle    0.368 xmin=  76,ymin= 135,xmax= 219,ymax= 283
bicycle    0.315 xmin=  68,ymin= 132,xmax= 235,ymax= 289

png

Notice that each object has multiple bounding boxes around it, so we define a function to apply non-max suppression, which keeps the most confident box and suppresses any overlapping boxes with a high IOU.
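A greedy NMS sketch (in practice this is run per class; the threshold value is an assumption):

```python
def iou(a, b):
    """IOU of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.4):
    """Repeatedly keep the highest-scoring box and drop every remaining box
    that overlaps it above iou_threshold; returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Feeding it the overlapping person/bicycle detections shown earlier leaves one box per object.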

Let’s use the above function to see if it reduces the number of bounding boxes.

2 final number of boxes
bicycle    0.368 xmin=  76,ymin= 135,xmax= 219,ymax= 283
person     0.865 xmin= 178,ymin=  31,xmax= 291,ymax= 312

png

Next, let’s have some fun by evaluating more images and looking at the results.

bird       0.070 xmin= 249,ymin=  47,xmax= 416,ymax= 406
bird       0.070 xmin= 249,ymin=  47,xmax= 416,ymax= 406
bird       0.070 xmin= 249,ymin=  47,xmax= 416,ymax= 406
bottle     0.274 xmin= 250,ymin=  23,xmax= 416,ymax= 403
bird       0.070 xmin= 249,ymin=  47,xmax= 416,ymax= 406
chair      0.030 xmin= 265,ymin= 373,xmax= 394,ymax= 410
bottle     0.274 xmin= 250,ymin=  23,xmax= 416,ymax= 403
chair      0.030 xmin= 265,ymin= 373,xmax= 394,ymax= 410

png

chair      0.851 xmin= 272,ymin= 179,xmax= 416,ymax= 405
chair      0.312 xmin= 340,ymin=   3,xmax= 408,ymax=  49
chair      0.312 xmin= 340,ymin=   3,xmax= 408,ymax=  49
chair      0.851 xmin= 272,ymin= 179,xmax= 416,ymax= 405
chair      0.312 xmin= 340,ymin=   3,xmax= 408,ymax=  49
chair      0.312 xmin= 340,ymin=   3,xmax= 408,ymax=  49
chair      0.312 xmin= 340,ymin=   3,xmax= 408,ymax=  49
chair      0.312 xmin= 340,ymin=   3,xmax= 408,ymax=  49

png

Conclusion

It was a good exercise to implement YOLO V2 from scratch and understand the various nuances of building a model from the ground up. This implementation won’t achieve the accuracy reported in the paper, since we skipped the pretraining step.

Vivek Maskara