Since the proposal of a fast and efficient learning algorithm for deep networks, deep neural networks and deep learning techniques have drawn increasing interest because of their inherent ability to overcome the drawback of traditional machine learning algorithms, which depend on hand-designed features. Deep learning approaches have also been found suitable for big data analysis, with successful applications to computer vision, pattern recognition, speech recognition, natural language processing, recommendation systems and autonomous driving systems. The need for efficient, accurate, fast and low-power object detection with localization has grown with the rise of autonomous vehicles, video surveillance and face authentication systems. These object detection systems not only recognize and classify objects in an image but also localize each object by drawing a proper boundary (a bounding box) around it. As you have probably guessed by now, this is why object detection is a harder task than image classification. Computer vision and perception applications can be broadly categorized as below.

- Classification, assigns a label to an entire image
- Localization, assigns a bounding box to a particular label object
- Object Detection, draws multiple bounding boxes in an image
- Image segmentation, creates precise segments of where objects lie

Object detection for autonomous driving requires high accuracy and real-time inference. A small model size and energy efficiency are also desirable for deployment on embedded systems. SqueezeDet is a fully convolutional neural network for detection that aims to simultaneously satisfy high accuracy, real-time inference, and low memory and power requirements, and is thus well suited for embedded systems.

In SqueezeDet, convolutional layers not only extract feature maps but are also used as the output layer to compute bounding boxes and class probabilities. The detection pipeline of this model contains only a single forward pass of a neural network, so it is extremely fast. The network is fully convolutional, which leads to a small model size and better energy efficiency. Inspired by the YOLO object detection network, it uses stacked convolution filters to extract a high-dimensional, low-resolution feature map for the input image. A convolutional layer then takes the feature map as input, computes a large number of object bounding boxes and predicts their categories. Finally, it filters these bounding boxes to obtain the final detections. The "backbone" convolutional neural network (CNN) architecture is SqueezeNet, which achieves AlexNet-level ImageNet accuracy with a model size of less than 5MB that can be further compressed to 0.5MB. After strengthening the SqueezeNet model with additional layers followed by ConvDet, the total model size is still less than 8MB. The detection pipeline of SqueezeDet is shown below.

SqueezeDet has a single-stage detection pipeline: region proposal and classification are performed by a single network simultaneously. As shown in the figure below, a convolutional neural network first takes an image as input and extracts a low-resolution, high-dimensional feature map from the image. The feature map is then fed into the ConvDet layer to compute bounding boxes centered around W × H uniformly distributed spatial grids. Here, W and H are the numbers of grid centers along the horizontal and vertical axes. Each bounding box is associated with C + 1 values, where C is the number of classes to distinguish, and the extra 1 is for the confidence score, which indicates how likely the bounding box is to actually contain an object. The confidence score is defined similarly to YOLO's: a high confidence score implies a high probability that an object of interest exists and that the overlap between the predicted bounding box and the ground truth is high. The other outputs represent the conditional class probability distribution given that the object exists within the bounding box; the label with the highest conditional probability is assigned to the bounding box. Finally, the top N bounding boxes with the highest confidence are kept, and Non-Maximum Suppression (NMS) is used to filter redundant bounding boxes to obtain the final detections.

During inference, the entire detection pipeline consists of only one forward pass of one neural network with minimal post-processing. The design of the ConvDet layer enables SqueezeDet to generate tens of thousands of region proposals with far fewer model parameters than YOLO. ConvDet is essentially a convolutional layer that is trained to output bounding box coordinates and class probabilities. It works as a sliding window that moves through each spatial position on the feature map. At each position, it computes K × (4 + 1 + C) values that encode the bounding box predictions, where K is the number of reference bounding boxes with pre-selected shapes. These reference bounding boxes are called anchors.
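The ConvDet output can be sized with simple arithmetic. A small sketch, using the KITTI setup of this post (K = 9 anchors, C = 3 classes); the grid dimensions W = 76, H = 22 are illustrative values for a 1248×384 input, not taken from the reference implementation:

```python
# Sizing the ConvDet output. K = 9 anchors and C = 3 classes match the
# KITTI setup in this post; W = 76, H = 22 are illustrative grid sizes.
def convdet_output_channels(K, C):
    # each of the K anchors per grid cell predicts 4 box deltas,
    # 1 confidence score and C class scores
    return K * (4 + 1 + C)

def total_candidate_boxes(W, H, K):
    # one bounding box per anchor per grid cell
    return W * H * K

channels = convdet_output_channels(K=9, C=3)
boxes = total_candidate_boxes(W=76, H=22, K=9)
print(channels, boxes)  # 72 output channels, 15048 candidate boxes
```

This is how a single convolutional layer can emit tens of thousands of proposals: the box count scales with the grid area, while the parameter count depends only on the channel depth.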

As noted above, the SqueezeDet architecture is composed of a convolutional layer, ten sequential fire modules, and a final convolutional layer that outputs predictions of object locations and class labels for a given input image. The main contributor here is its feature extractor (SqueezeNet), which is composed of sequential fire modules. As shown below, this module consists of a 1 × 1 convolutional layer that compresses the feature volume without decreasing the spatial resolution.

This first operation is known as the squeeze layer. Two parallel convolution operations are applied to its output, one layer with 1 × 1 convolutions and another with 3 × 3 convolutions. This parallel stage is called the expand layer, as the depth of the feature volume is increased here. The reasoning behind the fire module is that the expand layer incurs a high computational cost, which can be relieved by first compressing the feature representations. As with other single-shot detectors, the benefit of architectures such as SqueezeDet over R-CNN-based approaches is the complete sharing of computation across all ROIs, which benefits both inference speed and the integration of global contextual information into each prediction.
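To make the cost argument concrete, here is a rough (bias-free) weight count for a fire module versus a plain 3 × 3 convolution with the same output depth. This is not the authors' code; the channel sizes loosely follow SqueezeNet's fire2 configuration and are illustrative:

```python
# Rough (bias-free) weight counts: fire module vs. a plain 3x3 conv with
# the same 128-channel output. Channel sizes loosely follow SqueezeNet's
# fire2 configuration and are illustrative.
def fire_params(c_in, s1x1, e1x1, e3x3):
    squeeze = c_in * s1x1                    # 1x1 squeeze convolutions
    expand = s1x1 * e1x1 + s1x1 * e3x3 * 9   # parallel 1x1 and 3x3 expand
    return squeeze + expand

def plain_3x3_params(c_in, c_out):
    return c_in * c_out * 9                  # a direct 3x3 convolution

fire = fire_params(c_in=96, s1x1=16, e1x1=64, e3x3=64)
plain = plain_3x3_params(c_in=96, c_out=128)
print(fire, plain)  # 11776 vs 110592 weights
```

Squeezing to 16 channels before expanding cuts the weight count by nearly an order of magnitude, which is the core of SqueezeNet's size advantage.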

#### Deep learning networks on edge devices

Though the model size of a network like SqueezeDet is small compared to its counterparts, the high computational complexity and memory requirements of such deep learning-based models have still been a problem for efficient deployment of object detection networks on mobile, autonomous and other embedded devices. To successfully deploy an inference engine on these devices, we need an efficient and accurate low-precision on-device inference method. Deep learning training and inference are highly compute-intensive operations, and using full-precision (Float32) computation on conventional hardware is inefficient. Using 8-bit representations and calculations helps models run faster and use less power. But even though low precision brings many systems benefits, the conversion to low precision introduces noise in representation and arithmetic, limiting classification and object detection accuracy and other performance metrics. So we need to carefully handle the process of floating point to integer conversion and arithmetic for such a deep object detection network without blowing up the arithmetic and representation error, while keeping the performance intact … How 🙂

#### Low precision Quantization scheme for Image Classification

Deep networks can be trained with floating point precision, but an effective quantization scheme for the parameters and activations makes the model small. This generally speeds up inference in hardware thanks to faster integer-only arithmetic and fewer memory accesses. Once the parameters and activations are integer-only (fixed-point representations), fixed-point arithmetic units can be used for faster inference. These fixed-point units are faster and consume far less hardware resources and power than floating-point engines. Thus, lower-precision integer-only arithmetic reduces the memory footprint and speeds up inference, enabling larger models to fit within the given memory capacity and speed requirements. Integer-only arithmetic with lower-precision operations and data for classification applications is now deployed in mobile devices. These number conversions, scaling and quantization are carried out with and without retraining of the parameters, and also by evaluating some form of performance measure, such as the KL divergence between the reference float output and the low-precision output. Lower-precision inference was thus shown to be possible for classification problems.

#### Low precision Quantization scheme for object Detection

As discussed above, object detection involves not only recognizing and classifying every object in an image, but also localizing each one by drawing the appropriate bounding box around it, which makes it a significantly harder task than image classification. An object detection system for autonomous driving broadly answers the following:

- Is an object (car/pedestrian/cyclist) present in the given image/video?
- Where exactly is the object in the image? Is the pedestrian on the footpath, or is he/she crossing the road?
- How many objects are there in the image? Are they the same or different?
- What is the size of each object? Where is the object boundary?

An ideal object detection network comes up with a list of bounding boxes (the (x, y)-coordinates for each object in an image), the class label associated with each bounding box, and the probability/confidence score associated with each bounding box and class label.

As you may have noticed, object detection is both a regression and a classification task. The localization of an object is a regression problem: regression returns numbers, such as the (x0, y0, width, height) of a bounding box, instead of a class. You train such a system with an image and a ground-truth bounding box, and use the L2 distance to calculate the loss between the predicted bounding box and the ground truth. This means that quantization noise plays a very important role in determining the box coordinates and thus affects the accuracy of the detection. Also, to calculate the precision we need to remove the boxes with low confidence, so the confidence score prediction must be accurate. Then we use the **Intersection over Union (IoU)**, a value between 0 and 1 corresponding to the overlapping area between the predicted box and the ground-truth box. The higher the IoU, the better the predicted location of the box for a given object. Usually, we keep all bounding box candidates with an IoU greater than some threshold. In binary classification, the Average Precision (AP) metric is a summary of the precision-recall curve. The metric commonly used for object detection challenges is the **mean Average Precision (mAP)**, which is simply the mean of the Average Precisions computed over all the classes of the challenge. The mAP metric avoids extreme specialization in a few classes at the cost of weak performance in others. The mAP score is usually computed for a fixed IoU, though a high number of bounding boxes can increase the number of candidate boxes.
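Assuming boxes given as corner coordinates, the IoU computation can be sketched as:

```python
# Intersection over Union (IoU) for two boxes given as (x1, y1, x2, y2)
# corner coordinates -- a plain sketch of the overlap measure described above.
def iou(box_a, box_b):
    # intersection rectangle corners
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # 0 if boxes are disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ~ 0.143
```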

#### Object Detection Pipeline in SqueezeDet

A feature extractor network such as SqueezeNet outputs the features required for object detection. This feature map is fed into the ConvDet layer to compute:

- Bounding boxes centered around W × H uniformly distributed spatial grids, where W and H are the numbers of grid centers along the horizontal and vertical axes. Each bounding box is associated with C + 1 values, where C is the number of classes to distinguish, and the extra 1 is for the confidence score, which indicates how likely the bounding box is to actually contain an object.
- The conditional class probability distribution given that the object exists within the bounding box. The label with the highest conditional probability is assigned to the bounding box.
- Finally, the top N bounding boxes with the highest confidence are kept, and Non-Maximum Suppression (NMS) is used to filter redundant bounding boxes to obtain the final detections.

During inference, the entire detection pipeline consists of only one forward pass of one neural network with minimal post-processing.
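The post-processing just described can be sketched as a greedy NMS pass; the code below is an illustration, not the reference implementation, and the 0.4 default threshold is just one plausible setting:

```python
# Greedy non-maximum suppression (NMS): keep the highest-confidence box,
# drop every remaining box that overlaps it too much, repeat.
def box_iou(a, b):
    # a, b: (x1, y1, x2, y2) corner boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_threshold=0.4):
    # process candidates from highest to lowest confidence
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # suppress boxes that overlap the kept one beyond the threshold
        order = [i for i in order if box_iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # box 1 heavily overlaps box 0 and is suppressed
```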

#### Solving the Regression problem in lower precision

Though conventional quantization schemes have been successfully applied to classification problems, the same schemes cannot be applied directly to an object detection network, as it has to deal with multiple regression and classification outputs simultaneously. The dynamic ranges of the three subsets of SqueezeDet outputs are different, making the problem sensitive to quantization. AlphaRT, the AlphaICs lower-precision deep learning inference library, uses an efficient quantization and scaling scheme that allows SqueezeDet object detection inference to be carried out using 8-bit integer-only (INT8) arithmetic while retaining the algorithmic performance. The AlphaRT library was used to successfully implement SqueezeDet on the AlphaICs Real AI Processor (RAP) for edge hardware while matching floating point inference.

#### INT8 SQUEEZEDET

We trained the SqueezeDet standard reference code and generated the neural network model in floating point format. As a floating point model is not well suited to low-power embedded and edge devices, it is converted to 8-bit integer operations, with all multiplications using 8-bit integer formats. This means that weights, biases and activations must be in 8-bit integer ranges. As the parameters are static and do not change during inference, they are scaled and quantized to 8-bit integers using the quantization modules, all offline. The activations are quantized to 8-bit integers at run time. There are no floating point computations anywhere in the network. In this attempt, we primarily focused on the performance and accuracy of the object detection inference task.

Btw, this task was not straightforward. A scheme for identifying the effect of scaling and quantization on the simultaneous classification and regression problem was built in. A set of performance metrics and estimation schemes is used to mitigate the effect of the noise introduced while moving to lower precision. The offline library takes on the heavy load of coming up with an effective system for scaling and quantizing the parameters. Readers, please note: we achieved very high accuracy using 8-bit integer precision on SqueezeDet without retraining. We neither pruned nor merged layers, keeping the model architecture intact, and we did not use quantization during training, unlike many other systems. This showed that by properly compressing the dynamic range of the input, it is possible to minimize the quantization loss and achieve very good accuracy. Since convolutions and other operations run entirely on 8-bit inputs and outputs, our quantization reduces the computational resources needed for inference.

#### Parameter scaling and quantization

For INT8 inference, the data type is set to INT8 for the data propagating through each layer. In this approach, we map parameters to the range [-128, 127]. The input and output are represented as 8-bit integers. The convolution involves 8-bit integer operands and a 32-bit intermediate integer accumulator. The ReLU activation uses 8-bit integer arithmetic, and convolution and matrix multiplication use 8-bit operands. A novel method for scaling and quantizing weights is employed, based on an ensemble of similarity and performance measures: various measures such as cosine similarity, root mean square error, spectral distances and noise modeling are used while designing the quantization scheme. This helped us quantize the weights to 8 bits. The bias is also quantized to 8 bits, with proper arithmetic shifts based on the accumulated sum. Though many researchers have found that accuracy is sensitive to bias quantization and needs a full 32 bits for comparable accuracy, our scheme manages to keep the bias in 8 bits. All weights and biases are quantized without retraining, pruning or compressing the network.
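As an illustration of metric-driven scale selection (the actual ensemble of measures in AlphaRT is not public, so the candidate search below and the use of cosine similarity alone are assumptions for the sketch):

```python
import numpy as np

# Metric-driven scale selection sketch: try several clipping ranges and
# keep the scale whose dequantized weights are closest to the float
# originals under cosine similarity. Illustrative, not the AlphaRT scheme.
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_int8_scale(w, fractions=np.linspace(0.5, 1.0, 11)):
    w = w.ravel().astype(np.float64)
    best_scale, best_sim = None, -1.0
    for frac in fractions:
        clip = frac * np.max(np.abs(w))            # candidate clipping range
        scale = clip / 127.0
        q = np.clip(np.round(w / scale), -127, 127)
        sim = cosine_similarity(w, q * scale)      # dequantized vs. float
        if sim > best_sim:
            best_scale, best_sim = scale, sim
    return best_scale, best_sim

rng = np.random.default_rng(0)
w = rng.normal(size=256).astype(np.float32)        # stand-in weight tensor
scale, sim = best_int8_scale(w)
print(scale, sim)  # similarity stays very close to 1 for 8-bit weights
```

The same search can be run with RMSE or any other distance in place of (or combined with) cosine similarity; the point is that the scale is chosen by measuring fidelity, not fixed blindly at the tensor's maximum.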

#### Activation scaling and quantization

Activation quantization needs a different approach from weight quantization. While weights are fixed after training, activations vary with the inputs. The activation values are also rescaled and limited to 8 bits at each layer to make sure that the precision loss is minimal. The different parts of the SqueezeDet layers have significantly different dynamic ranges. In a given layer, the activations result from thousands of accumulations, so they are usually much larger than the layer parameters. The parameters and activations in different layers also have different distributions, so a dynamic scaling parameter specific to each layer is used for efficient quantization. We do not use any calibration dataset for tuning the activations, making the library invariant to the activation distribution. The final layer output is rescaled and converted to float32 for bounding box prediction. Multiple similarity metrics in a weighted combination are used to tune the quantization for the predicted class probabilities, confidence scores and bounding box deltas, minimizing the joint noise. The quantization and fixed-point performance analysis was carried out layer by layer to get optimal performance.
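One common way to realize such per-layer runtime rescaling with integer-only arithmetic is a layer-specific right shift of the 32-bit accumulator back into the 8-bit range. The sketch below is our illustration of that general idea, not the exact AlphaRT scheme, and `shift` is a hypothetical per-layer constant:

```python
import numpy as np

# Requantize a 32-bit convolution accumulator back to int8 using a
# per-layer right shift with rounding and saturation. Illustrative only.
def requantize_to_int8(acc, shift):
    # add half the step before shifting so the shift rounds to nearest
    rounded = (acc + (1 << (shift - 1))) >> shift
    # saturate into the int8 range instead of wrapping around
    return np.clip(rounded, -128, 127).astype(np.int8)

acc = np.array([5000, -70000, 123456, -128], dtype=np.int32)
print(requantize_to_int8(acc, shift=10))
```

A pure shift restricts the effective scale to powers of two; schemes that need finer scales typically use an integer multiplier followed by a shift, at slightly higher cost.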

#### Performance Benchmarking

We evaluated our INT8 model on the KITTI object detection dataset, which is designed with autonomous driving in mind, by implementing the algorithm on the AlphaICs Real AI Processor (RAP). We analyzed our model's accuracy as measured by average precision (AP). In our experiments, we scaled all input images to 1248×384. To filter the bounding boxes, anchors with a confidence score below a threshold of 0.4 were first excluded from consideration. For the remaining anchors, the bounding boxes were calculated from the predicted deltas, and the final bounding boxes were then sorted by confidence. Starting with the bounding box with the highest confidence score, the IoU with every other bounding box was calculated, and any box with an IoU above 0.4 was dropped. This process of filtering the detections is known as non-maximum suppression. Our average precision (AP) results are on the validation set. We used a trained model that detects 3 categories of object (car, cyclist, pedestrian) and used 9 anchors for each grid in our model. At the inference stage, we kept only the top 64 detections with the highest confidence, and used NMS to filter the bounding boxes.

The trained weights were taken from https://www.dropbox.com/s/a6t3er8f03gdl4z/model_checkpoints.tgz and quantized using the techniques mentioned above. Shown below are glimpses of the effect of quantization and scaling captured while successfully implementing multiple layers.

##### Reference Float32 output

Can we say the INT8 output shown below is as cute as the reference float32 output?

Seeing is believing, but still… let us look at the precision-recall curves for the three classes.

Precision is the mantra… Shown below are the mean average precisions for the INT8 and float32 versions after validating on a set of 500 validation images…

Average precision in percentage from the **INT8** version

Average precision from the **FLOAT32** reference version

More details… the better…

The AlphaRT INT8 inference library achieves inference performance comparable to that of the float reference, without retraining the parameters and without calibration. The key features include:

- Parameters, including biases, in 8-bit integer format.
- Efficient scaling and quantization for parameters.
- Efficient run time scaling for activation signals.
- Efficient INT8 arithmetic support for matrix multiplication.
- Efficient offline performance tuning metrics using signal processing and modeling techniques
- Efficient activation quantization without calibration.
- Efficient quantization without retraining.

#### Conclusion

An efficient weight scaling and quantization scheme for SqueezeDet successfully mitigated the adverse effects of quantization on a regression problem like object detection. Performance on a subset of the validation set reveals that INT8 inference for SqueezeDet retains most of the classification and object detection performance. The current scheme neither retrained the quantized weights nor used any subset of the validation set for activation calibration; it can be further improved by tuning it on a calibration set and retraining the quantized weights. The results show that a scaling and quantization scheme using 8-bit integer arithmetic for weights and activations in the AlphaRT library, accelerated by the AlphaICs Real AI Processor, can be used for efficient inference in most classification and object detection networks.

#### References

- Bichen Wu, Alvin Wan, Forrest Iandola, Peter H. Jin, Kurt Keutzer. SqueezeDet: Unified, Small, Low Power Fully Convolutional Neural Networks for Real-Time Object Detection for Autonomous Driving. UC Berkeley / DeepScale.
- https://github.com/BichenWuUCB/squeezeDet
- Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size.