
IEEE/CAA Journal of Automatica Sinica
Citation: I. Ahmed, S. Din, G. Jeon, F. Piccialli, and G. Fortino, "Towards Collaborative Robotics in Top View Surveillance: A Framework for Multiple Object Tracking by Detection Using Deep Learning," IEEE/CAA J. Autom. Sinica, vol. 8, no. 7, pp. 1253-1270, Jul. 2021. doi: 10.1109/JAS.2020.1003453
COLLABORATIVE robotics has been gaining the attention of researchers and emerging as a key technology in areas such as industry, manufacturing, transport, domestic task handling, entertainment, healthcare services, navigation, localization, and, most prominently, intelligent video surveillance. The need for intelligent surveillance systems is increasing day by day. Numerous optical devices (cameras and sensors) have been installed in public places for security and monitoring purposes. The majority of surveillance systems rely on a centralized monitoring structure: a single room where video streams from multiple cameras are observed by human operators. Monitoring multiple video streams, however, is a tedious task for security operators. It is therefore desirable to use collaborative robotics to provide an intelligent automated surveillance system that monitors and analyzes multiple video streams and assists human operators as much as possible. Collaborative robotics expands a surveillance system's potential by utilizing smart camera devices and visual processing technology. The primary objective of such collaborative robotics-based surveillance systems is to provide useful information about different activities in a specific environment or scene. This information supports behavior analysis, event and activity analysis, managing and protecting people in crowded environments, movement pattern detection, and object tracking. These tasks are important in a wide range of real-life applications, e.g., security analysis [1], [2], autonomous or self-driving vehicles [3], face recognition [4], human-computer interaction (HCI) robotics [5]-[7], and location and navigation [8]. These applications may suffer from several factors, including variations in object appearance (size, body orientation, pose), different backgrounds, illumination conditions, cluttered scenes and camera viewpoints, abrupt variations in motion, close interaction between objects, and, most importantly, occlusion.
A number of deep learning, machine learning, and computer vision based methods have been presented to cope with these challenges by providing robust and efficient solutions [9], [10]. The majority of the developed approaches are primarily based on traditional handcrafted features [11]-[16] along with different machine learning classifiers [17]. Recent advances in deep learning models have made object detection [18]-[24] and object tracking [25]-[29] methods more robust and more efficient in terms of computation speed and accuracy. The key advantages of these models are the automatic selection of the most salient object features and stronger classification performance compared to laborious handcrafted features, which require extra training on images [30]. Furthermore, these models usually have more discriminative power for multi-class object classification regardless of object scale, size, pose, location, appearance with respect to the camera position, illumination, background condition, and occlusion.
The majority of techniques, whether feature-based or deep learning-based, are generally developed for normal, horizontal, or frontal view object detection and tracking, without utilizing a collaborative robotics setup. Most researchers have developed object tracking methods, e.g., [25], [26], [28], [29], based on a frontal or asymmetric camera perspective, as shown in Fig. 1. These methods are robust and reliable in detection and tracking for different applications, but they still suffer from occlusion challenges, as highlighted in Fig. 1.
In contrast with the frontal view, if the same object is captured and viewed from a top perspective, occlusion problems may be reduced. Likewise, the top view perspective provides greater visibility of the scene, and the object is more visible to the camera (Fig. 2). To minimize the occlusion problem, some researchers [36]-[38] suggested and utilized top view images or video sequences for object detection and tracking. Ahmad et al. [44] suggested that replacing multiple frontal view cameras with a single top view camera may reduce power consumption, human resources, and installation expenses. The change in camera perspective also causes significant variation in the object's appearance in terms of posture, size, body articulation, and visibility, as highlighted in Fig. 2.
This work introduces a collaborative robotics-based framework for a top view intelligent surveillance system capable of detecting and tracking multiple objects. A smart robotic camera is used with a processing unit that can facilitate human operators during surveillance and is suitable for real-world applications. For object detection, existing deep learning models, you only look once (YOLO) [22] and the single shot detector (SSD) [23], trained on frontal view data sets, have been employed. A top view data set, containing video sequences of multiple objects against various backgrounds, has been used for testing purposes. The object detection models are further combined with six different algorithms, namely GOTURN, MEDIANFLOW, TLD, KCF, MIL, and BOOSTING, to track the objects in top view scenes. This paper mainly focuses on the following:
1) A framework is introduced for top view collaborative surveillance that can assist human operators in multiple object detection and tracking tasks. The structure consists of a smart robotic camera with a visual processing unit that uses deep learning-based pre-trained object detection models.
2) The generalization performance of frontal view pre-trained object detection models is investigated by testing on a completely different data set, i.e., a top view data set. It contains video sequences of multiple objects recorded from the top view with various sizes, poses, body orientations, shapes, and camera resolutions in different background and illumination conditions.
3) Top view object tracking is performed by combining the deep learning object detection models with different tracking algorithms. A comparison of six tracking algorithms, along with the object detection models, has also been made.
4) The importance of top view multi-class multi-object detection and tracking over a traditional or frontal view, specifically in video surveillance, is explored with possible future directions.
The rest of this paper is organized as follows. A summary of existing work on object tracking and detection is provided in Section II. The recorded top view data set is explained in Section III. The deep learning-based object detection models employed for the top view data set and the different tracking algorithms are elaborated in Section IV. Section V provides a detailed explanation of the output results, the performance evaluation of the object detection models, and comparison results of the different tracking techniques. The conclusion of the paper with possible future directions is presented in Section VI.
A summary of different object tracking and detection approaches is presented in this section. The developed methods are mainly categorized into traditional generic, feature-based, and deep/machine learning-based models. Comprehensive studies of various tracking and detection techniques can also be found in [9], [10], [45], and [46].
Early object detection techniques are based upon handcrafted features. Researchers designed sophisticated feature-based methods such as color histograms [47], attributes [48]-[50], traditional handcrafted features including the histogram of oriented gradients (HOG) [14], [51], local binary patterns (LBP) [13], Haar-like features [12], [52], the scale invariant feature transform (SIFT) [15], and shape-based features. These methods mainly extract the most prominent object features, which are further utilized for training and testing machine learning algorithms, e.g., the support vector machine (SVM) [53], boosting [52], random forest [54], Hough forest [55], and structural learning [52], [56]. Besides object detection, traditional object tracking methods are categorized into frame differencing, optical flow, and feature-based methods. The traditional tracking algorithms focus on the target position in video sequences and its prediction using filter-based methods, e.g., particle and Kalman filters.
Some researchers used appearance features like shape, color, and texture to track different objects across frames [57]. Other researchers used template-based [58] and sparse representation [58] based methods for object tracking by concentrating on a search region similar to the tracked target. To differentiate the background and foreground, some researchers developed discriminative feature learning [59] based algorithms. Many of them used object features similar to those of object detection methods. Reference [60] presented a method for object tracking that combines multiple detection features with a probabilistic segmentation technique. Most of the developed detection and tracking techniques are based on data sets recorded from the frontal view, which may suffer from the occlusion problem, as discussed in Section I. To overcome this problem, many researchers [36], [61]-[65] used and suggested the top view camera perspective. In [66], the authors provided a top view feature-based method for person detection in industrial and indoor environments. In another work, [67] developed a feature-based method for tracking people in a top view industrial environment. Reference [68] used a blob-based method and provided a rotation-invariant person tracking solution for top view surveillance. The authors in [69] used an efficient rotated HOG based method along with an SVM classifier for overhead view person detection.
With the recent growth of deep learning-based models capable of learning high-level image/video frame features, object detection and tracking methods have started evolving at unprecedented speed. Deep learning detectors are characterized as two-stage or one-stage detection frameworks. The first two-stage detection framework was developed by [70], which proposed regions with CNN features (R-CNN). In [70], each region proposal is scaled into a fixed-size image and fed into the trained CNN model. Reference [71] proposed a two-stage object detection model using a spatial pyramid pooling network for object identification. The model consists of CNN layers, which enable the generation of fixed-length representation features regardless of the image's rescaling. Another two-stage object detection method was proposed by [20], which enables training the bounding box regressor and detector simultaneously in the same network using the Pascal VOC data set. Reference [72] developed the first real-time end-to-end model for target detection. In Faster-RCNN, the authors move the individual blocks of region proposal detection, bounding box regression, and feature extraction into an end-to-end integrated network, making it fast compared to former models. The COCO data set [73] has been employed for training and testing of the developed model. Another two-stage object detection model was developed by [74], based on [72], and offers the advantage of classifying objects across a wide range of scale variations. Redmon et al. [22] presented a one-stage object detection model named YOLO, in which a single neural network is applied to the whole image to extract regions and predict bounding boxes and probabilities for each region proposal. Improvements were made by [75] and [76], which further enhanced the detection accuracy of the previous model. Another one-stage detection model, named the single shot multibox detector (SSD), was proposed by Liu et al. [23]. It introduces a multi-resolution and multi-reference detection framework. References [18], [21], and [24] also used deep learning models, which show robustness, efficiency, and accuracy in different object detection applications.
Researchers have also developed deep learning-based object tracking methods [25], [26], [28], [29]. Deep learning tracking frameworks detect the object and store it in the form of feature information, which is further used for tracking purposes. Numerous methods [60], [77]-[82] have been proposed that use neural network architectures for object tracking. Reference [83] developed a feature-based visual object tracking method that mainly uses a pre-trained hierarchical detection network. Likewise, taking advantage of the region proposal network, [84] used a recurrent convolutional network model for object tracking. Reference [85] developed a neural network-based online object tracking technique using a frontal view data set. Reference [77] used a pre-trained network of deep layers for human tracking via frontal view images. Furthermore, [82] developed a generic feature-based object tracking method robust against several object appearance variations. Other CNN based object tracking methods were developed by [31] and [86], which, similarly to [21], use proposal information to track pets. Reference [81] adopted the CNN model to produce discriminative saliency maps, which were further combined with an SVM to track the object from the frontal view. For object feature extraction, they adopted the DLT [82] and CNN-SVM [81] classifiers. Cui et al. [87] proposed an RNN based object tracking method using the correlation filter. References [86] and [88] also developed CNN based trackers.
The majority of the developed deep learning models use frontal view images. Some researchers performed object detection and tracking tasks using aerial and satellite images [42], [89]-[92]. Others used deep learning for top view object detection and tracking, but their work was mainly limited to a single object class, i.e., person [42], [93], [94]. Ahmed et al. [43] applied two-stage detection models for top view multi-class object detection; in [43], Mask-RCNN and Faster-RCNN were used for top view object segmentation.
The video sequences used in this work are recorded in real-world, uncontrolled environments, with variations in illumination conditions and backgrounds across different indoor and outdoor settings. The movement of the objects within the scene is not restricted. Two top view scenarios are considered. In the first scenario (symmetric top view), the object is observed directly under the camera from a specific camera height. The newly recorded data set provides better visibility and wide coverage of the scene. The appearance of objects in the symmetric top view differs from the frontal view, as shown in the sample frames of the experimental results section. In the second scenario (asymmetric top view), the object is observed at different locations away from the centrally mounted camera, and its appearance varies significantly with respect to the camera. Multiple objects are considered during the recording, such as motorbikes, cars, trucks, buses, and, most prominently, persons. The distribution of different objects in indoor and outdoor environments is variable. The number of object classes varies, but the main focus is on persons, as the person is counted as one of the most important objects in video surveillance [43], [69]; therefore, a large portion of our data set comprises person video sequences in outdoor and indoor environments. The sample frames of the experimental results section highlight the change in object appearance caused by the change in camera perspective. The data set provides variation in object appearance (size, scale, pose, and orientation) with respect to the camera position. It is mainly recorded in surveillance environments at different times of day, covering different views, and different camera devices have been used. The details of the data set are elaborated in Table I.
# | Detail | Description
1 | Color space | RGB
2 | Video sequence duration | 5 to 30 minutes
3 | Video format | .mp4
4 | Frame rate | 20 frames per second
5 | Frame resolution | 640 × 480
6 | Height of camera | 3 to 4 meters from ground
7 | Recording location | Indoor and outdoor
8 | Illumination changes, reflections/shadows | Yes
9 | Frame format | .jpg
10 | No. of objects | Varying
In this work, we have introduced a top view collaborative video surveillance framework, as described in Fig. 3. The overall framework comprises a top view smart robotic camera source and a visual processing unit. The top view robotic camera is used for recording video sequences in different indoor and outdoor environments. The visual processing unit, embedded with the smart camera, is responsible for observing the scene and performing multiple object detection and tracking tasks. The results of the visual processing unit are sent to the surveillance monitoring/maintenance room, where they assist human operators in controlling and monitoring the surveillance system. The framework helps human operators with the multiple surveillance applications highlighted in Fig. 3.
The details of the visual processing unit are presented in Fig. 4. The visual processing unit is the combination of two modules: one for object detection and the other for tracking. The detection module aims to localize and detect objects in top view video sequences, while the tracking module tracks the objects in the top view scene. The video sequences, after conversion into subsequent frames, are fed into the object detection module. For top view object detection, the existing deep learning models YOLO [22] and SSD [23] have been adopted. The frontal view data set [73] was utilized for the pre-training of both models. The models produce a detected bounding box, a class label, and a confidence score for each identified object as output. This information is stored in a list, which further helps in object tracking. A tracker is then initialized, which assigns a tracking ID to each detected object. Six different pre-built trackers are used for tracking purposes, including GOTURN, MEDIANFLOW, TLD, KCF, MIL, and BOOSTING. All of these tracking algorithms were developed for frontal view data sets. In this work, the trackers use the detected output bounding box information to track the object from a top view. The following subsections provide a detailed explanation of the different tracking algorithms and deep learning-based object detection models.
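To make the data flow of Fig. 4 concrete, the sketch below outlines a minimal detect-then-track loop in Python with OpenCV. The detect(frame) and create_tracker(name) helpers are placeholders standing in for the detection routines and tracker constructors sketched later in this section; this is an illustration of the pipeline, not the authors' exact implementation.

```python
import cv2

def run_surveillance(video_path, detect, create_tracker):
    """Minimal detect-then-track loop: detect once, then track every object per frame."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        return

    # 1) Detection module: localize objects in the first frame.
    #    Each entry is assumed to be ((x, y, w, h), label, score).
    detections = detect(frame)

    # 2) Tracking module: one tracker (with an ID) per detected object.
    trackers = []
    for obj_id, (bbox, label, score) in enumerate(detections):
        tracker = create_tracker("GOTURN")   # or MEDIANFLOW, TLD, KCF, MIL, BOOSTING
        tracker.init(frame, tuple(bbox))
        trackers.append((obj_id, label, tracker))

    # 3) Update every tracker on subsequent frames and draw the tracked boxes.
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        for obj_id, label, tracker in trackers:
            success, bbox = tracker.update(frame)
            if success:
                x, y, w, h = map(int, bbox)
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
                cv2.putText(frame, f"{label} ID {obj_id}", (x, y - 5),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    cap.release()
```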
1) YOLO Based Object Detection: This section briefly discusses the first deep learning model, YOLO [22], applied for top view multiple object detection. It is faster and more generalized than the region proposal architectures (two-stage detection models) discussed in Section II, which involve multiple object detection steps. It localizes the object along with its predicted class using a single network structure. Since it uses one integrated network that sees the whole image, the classifier is constrained to region proposals that actually contain the object, which also helps in reducing false positives in background areas. The model is quite useful in terms of accuracy and computation. Because of these characteristics, we adopted it as the first model for top view object detection; for top view object tracking, the detection model is further embedded with the tracking algorithms. The general structure of the model is described in Fig. 5. It simply takes the input image, passes it through convolutional layers, and outputs a vector of bounding boxes with object confidence values and class label predictions. The YOLO model [22] was trained using the frontal view COCO data set [73]. It has an additional batch normalization layer, which makes network convergence faster. The model can randomly adjust the input image size during training, improving the detection results when testing on multi-scale images. The YOLO model's general structure for top view multiple object detection, demonstrated in Fig. 5, is explained in the following steps:
i) The model is composed of the Darknet architecture and convolutional layers. The pre-trained model is built in two phases. First, a classifier network similar to VGG-16 is trained; then fully connected (FC) layers are added and the network is evaluated for object detection on the PASCAL VOC data set. The initial convolutional layers are used to extract features, and the connected layers are used to predict output probabilities along with bounding box coordinates. The original model uses twenty-four convolutional layers.
ii) Unlike two-stage detectors or region based approaches, i.e., Faster RCNN, Fast-RCNN, and RCNN, which use local areas for detection, the YOLO model uses the features of the entire image. The model first divides the image into an S × S grid of cells:
– Each grid cell predicts B boundary boxes, each with a confidence score.
– Only one object is detected per grid cell, regardless of the number of predicted boxes B.
– Each grid cell also predicts C conditional class probabilities, one for each object class.
The grid cell within which the center of an object falls is responsible for predicting bounding boxes for that object. As depicted in Fig. 6(c), instead of predicting different bounding boxes for one object over the whole image, the model predicts five different bounding boxes.
iii) Each predicted bounding box has five components: the center coordinates x and y, the width w, the height h, and a confidence score.
iv) Fig. 6 depicts that the model predicts an individual object only if its center falls within the grid cell; otherwise, the grid cell's confidence value becomes zero. The coordinate (localization) loss over the predicted box dimensions is given as [22]
$L_{loc} = \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1^{obj}_{ij} \left[ (x_i - x^{*}_{i})^2 + (y_i - y^{*}_{i})^2 + \left(\sqrt{w_i} - \sqrt{w^{*}_{i}}\right)^2 + \left(\sqrt{h_i} - \sqrt{h^{*}_{i}}\right)^2 \right] \qquad (1)$
In the above equation, $\lambda_{coord}$ is a weighting factor for the coordinate loss, $(x_i, y_i, w_i, h_i)$ are the predicted box center coordinates, width, and height, and the starred terms are the corresponding ground-truth values.
– $1^{obj}_{ij} = 1$ if the target object is present in the $i$-th grid cell and the $j$-th box predictor is responsible for it; otherwise it is 0.
The second part of the loss function, related to the object class probabilities and confidence values, is given as [22]
$L_{cls} = \sum_{i=0}^{S^2} \sum_{j=0}^{B} \left[ 1^{obj}_{ij} + \lambda_{noobj}\left(1 - 1^{obj}_{ij}\right) \right] \left(C_{ij} - C^{*}_{ij}\right)^2 + \sum_{i=0}^{S^2} \sum_{c \in classes} 1^{obj}_{i} \left(p_{ic} - p^{*}_{ic}\right)^2 \qquad (2)$
In (2):
– $1^{obj}_{ij}$ indicates that the $j$-th box predictor in cell $i$ is responsible for the detection;
– $\lambda_{noobj}$ down-weights the confidence loss of boxes that contain no object;
– $C_{ij}$ is the predicted confidence score of the $j$-th box in cell $i$;
– $C^{*}_{ij}$ is the corresponding ground-truth confidence;
– $p_{ic}$ is the predicted conditional probability of class $c$ in cell $i$, and $p^{*}_{ic}$ is its ground truth;
– $1^{obj}_{i}$ indicates whether an object appears in cell $i$.
For the pre-trained model, the overall objective function is given as [22]
$L = L_{loc} + L_{cls}. \qquad (3)$
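As an illustration of how a COCO pre-trained YOLO model [22], [73] can be tested on top view frames, the sketch below runs a Darknet network through OpenCV's DNN module. The configuration and weight file names are placeholders, and the post-processing (confidence filtering followed by non-maximum suppression) follows the standard YOLO output layout; it is a sketch of the testing setup, not the authors' exact code.

```python
import cv2
import numpy as np

# Placeholder file names for a COCO pre-trained Darknet model.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
output_layers = net.getUnconnectedOutLayersNames()

def detect_yolo(frame, conf_threshold=0.5, nms_threshold=0.4):
    """Run YOLO on one frame and return [(box, class_id, confidence), ...]."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(output_layers)

    boxes, confidences, class_ids = [], [], []
    for output in outputs:                     # one output per YOLO detection layer
        for det in output:                     # det = [cx, cy, bw, bh, obj, class scores...]
            scores = det[5:]
            class_id = int(np.argmax(scores))
            confidence = float(scores[class_id])
            if confidence > conf_threshold:
                cx, cy, bw, bh = det[0:4] * np.array([w, h, w, h])
                boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
                confidences.append(confidence)
                class_ids.append(class_id)

    # Non-maximum suppression removes overlapping boxes for the same object.
    keep = cv2.dnn.NMSBoxes(boxes, confidences, conf_threshold, nms_threshold)
    return [(boxes[i], class_ids[i], confidences[i]) for i in np.array(keep).flatten()]
```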
2) SSD Based Object Detection: The second model used for top view multiple object detection is SSD [23]. This section briefly explains the general architecture of SSD, which shows excellent results for generic object detection. It uses convolutional neural networks as a pyramidal feature hierarchy for the detection of objects of various sizes. The general architecture of the SSD model used for top view object detection, represented in Fig. 7, is explained in the following steps:
i) The model uses the VGG-16 pre-trained network for useful feature extraction. Further, it adds several additional convolutional feature-extraction layers of decreasing size, as can be seen in Fig. 7. These layers form a pyramid representation of the image at different scales, and object detection is performed on each pyramid layer for objects of several sizes.
ii) Unlike the previously discussed model (YOLO), the image is not split into arbitrary-size grid cells [95]. Instead, the model predicts predefined default bounding boxes at every spatial location of each feature map, and each feature map is responsible for one particular scale of object, as shown in Fig. 7. In Fig. 8, a higher-level feature map is used for the detection of large objects, while a lower-level feature map is used for small-sized object detection. The model typically estimates four default bounding boxes for every location, with different scales and aspect ratios, as seen in Fig. 8. Thus, confidence scores and shape offsets are predicted at each location for each object category (Fig. 8).
iii) Each default bounding box is labeled as either positive or negative by matching it with the ground truth; the localization loss is computed only over the positive (matched) boxes. Similar to the previous model, the loss function is a combination of two terms: the localization and the confidence (classification) loss. The overall loss function is given as [23]
$L = \frac{1}{N}\left(L_{conf} + \alpha L_{loc}\right) \qquad (4)$
where $N$ is the number of matched default boxes and $\alpha$ weights the localization loss against the confidence loss. The confidence loss $L_{conf}$ is computed as [23]
$L_{conf} = -\sum_{i \in pos} X^{k}_{ij} \log c^{k}_{i} - \sum_{i \in neg} \log c^{0}_{i} \qquad (5)$
In (5):
– $X^{k}_{ij}$ is an indicator that equals 1 when the $i$-th default box is matched to the $j$-th ground-truth box of class $k$, and 0 otherwise;
– The total number of positive matched boxes is $N$, which is used for normalization in (4);
– $c^{k}_{i}$ is the predicted (softmax) confidence of the $i$-th default box for class $k$, with $c^{0}_{i}$ denoting the background class.
The second loss shown in (4) is the localization loss [23] calculated as
$L_{loc} = \sum_{i,j} \sum_{m \in \{x, y, w, h\}} X^{match}_{ij} \, L^{smooth}_{1}\left(d^{m}_{i} - t^{m}_{j}\right) \qquad (6)$
In the above equation, $d^{m}_{i}$ denotes the predicted box parameters ($m \in \{x, y, w, h\}$), $t^{m}_{j}$ the corresponding ground-truth box parameters, and $X^{match}_{ij}$ the indicator that matches default box $i$ to ground-truth box $j$.
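For comparison, a minimal sketch of running the MobileNet-SSD Caffe model referred to in Algorithm 1 through OpenCV's DNN module is given below. The prototxt file name and the 21 PASCAL-VOC-style class labels are assumptions based on the publicly available MobileNet-SSD release, not details confirmed by the paper.

```python
import cv2
import numpy as np

# Assumed class list of the public MobileNet-SSD Caffe release (PASCAL VOC classes).
CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus",
           "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike",
           "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]

net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt", "MobilenetSSD-deploy.caffemodel")

def detect_ssd(frame, conf_threshold=0.5):
    """Run MobileNet-SSD on one frame and return [(box, label, confidence), ...]."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 0.007843, (300, 300), 127.5)
    net.setInput(blob)
    detections = net.forward()                 # output shape: [1, 1, N, 7]
    results = []
    for i in range(detections.shape[2]):
        confidence = float(detections[0, 0, i, 2])
        if confidence > conf_threshold:
            class_id = int(detections[0, 0, i, 1])
            box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
            x1, y1, x2, y2 = box.astype(int)
            results.append(((x1, y1, x2 - x1, y2 - y1), CLASSES[class_id], confidence))
    return results
```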
In Fig. 4, it can be seen that, for tracking purposes, the output of each detection model is stored in the form of a list. A tracker is then initialized using one of the tracking methods; after initialization, we manually select the tracking algorithm. The tracking algorithm checks for object information in the list and starts tracking the object from the top view. In the object tracking module, if the number of detected objects in the list is greater than 0, the tracker is continuously updated and the object is tracked. In this work, for top view multiple object tracking, we used six different tracking algorithms implemented in OpenCV: GOTURN, MEDIANFLOW, TLD, MIL, KCF, and BOOSTING. These algorithms were originally proposed for frontal view object tracking. The detected bounding box information extracted from the object detection models is utilized to create a tracker list, which stores this information and tracks the multiple objects in the top view data set. The visualization results of the tracking algorithms are elaborated in the experimental results section. The tracking algorithms used in this work are discussed as follows (a construction sketch for these trackers is given after the list):
1) BOOSTING Tracker: A robust, real-time tracking technique running at 20 fps (frames per second) was developed by [96]. A limitation of this technique is that it treats object tracking as a binary classification problem (background versus object). For tracking, it considers the most discriminating features. The algorithm is similar to Haar cascades (e.g., Haar-like wavelets, local binary patterns, orientation histograms). It is slow and does not work as well as the others.
2) MIL Tracker: Babenko et al. [97] developed a method for object tracking known as multiple instance learning (MIL), which addresses the problem of adaptive appearance model learning. It employs a discriminative classifier to separate the object from the background. It permits updating the appearance model using image patches without knowing which image patch is responsible for capturing the object of interest. Its accuracy is better than that of the BOOSTING tracking algorithm, but it is poor at reporting failures.
3) KCF Tracker: To overcome the computational burden of the discriminative classifier, Henriques et al. [98] developed a fast learning and detection model using the fast Fourier transform. Reference [98] combined kernel machines with the efficiency of linear classifiers and analyzed the consequences of dense sampling in tracking. It is faster than the above two but does not handle occlusion and fails when there is variation in the size and position of an object.
4) TLD Tracker: In [99], the authors examined a long-term tracking algorithm for different objects in video sequences. The developed framework decomposes the long-term tracking task into tracking, learning, and detection (TLD). It localizes all observed appearances of the object in each frame of the video sequence. It suffers heavily from false positives.
5) MEDIANFLOW Tracker: Kalal et al. [100] developed a method based on the forward-backward error. In [100], the authors measured the forward and backward differences between two trajectories of the target. The proposed algorithm detects faults, helps in discovering tracking failures, and aids the selection of tracking paths. It is suitable for reporting failures but breaks down whenever there is a large jump or variation in motion, a sudden change in appearance, or fast motion.
6) GOTURN Tracker: The generic object tracking using regression networks (GOTURN) tracker [101] is based on CNN layers. The architecture is primarily trained on thousands of cropped frames, mainly from frontal view video sequences. The algorithm provides excellent results and handles several variations during tracking, e.g., viewpoint, lighting changes, and deformation, but it does not manage occlusion very well.
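The six trackers above are instantiated through OpenCV's tracking API. A minimal construction helper is sketched below; depending on the installed OpenCV build (opencv-contrib-python), the constructors live either directly under cv2 (3.x and early 4.x) or under cv2.legacy (4.5 and later), so both are tried. GOTURN additionally expects its goturn.prototxt and goturn.caffemodel files on disk. This is an illustrative helper, not the paper's exact code.

```python
import cv2

# Map the tracker names used in the paper to OpenCV constructor names.
FACTORY_NAMES = {
    "BOOSTING": "TrackerBoosting_create",
    "MIL": "TrackerMIL_create",
    "KCF": "TrackerKCF_create",
    "TLD": "TrackerTLD_create",
    "MEDIANFLOW": "TrackerMedianFlow_create",
    "GOTURN": "TrackerGOTURN_create",  # requires goturn.prototxt / goturn.caffemodel
}

def create_tracker(name):
    """Return a tracker instance by name, trying cv2.legacy first, then cv2."""
    fn_name = FACTORY_NAMES[name]
    for module in (getattr(cv2, "legacy", None), cv2):
        if module is not None and hasattr(module, fn_name):
            return getattr(module, fn_name)()
    raise RuntimeError(f"{name} tracker is not available in this OpenCV build")

# Typical usage: initialize once with a detected box, then update per frame.
# tracker = create_tracker("KCF")
# tracker.init(first_frame, tuple(bbox))    # bbox = (x, y, w, h) from the detector
# ok, bbox = tracker.update(next_frame)     # ok is False when tracking fails
```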
The overall algorithm for the top view multiple object detection and tracking framework shown in Fig. 4 is explained in Algorithm 1. The pre-trained deep learning models are applied to top view video sequences. The output of the pre-trained models consists of learned features, which are used for detecting objects along with their bounding boxes and confidence score values. The detected bounding box information is stored in the form of a list. The list contains the coordinate information (x, y, width, height) of each detected object, along with its class label and confidence score.
The experimental results of the deep learning models used for object detection in top view video surveillance are elaborated in this section. The deep learning-based models, along with the tracking algorithms, are implemented in OpenCV. The section is split into subsections. The first subsection mainly discusses the testing results for multiple object detection and tracking using the pre-trained deep learning models in different scenes with a variety of backgrounds, heights, poses, illumination conditions, camera angles, resolutions, and aspect ratios. In the second subsection, the results of the tracking algorithms are elaborated. The performance evaluation of the object detection models and tracking algorithms is carried out in the last subsection.
Algorithm 1 Top View Multiple Object Detection and Tracking
1: for each frame in the input video sequence do
2:   convert it into an RGB frame
3:   if framework = YOLO then
4:     test the model shown in Figs. 4 and 5:
5:     divide the input frame into an S × S grid of cells
6:     for each grid cell do
7:       detect the object (predict bounding boxes, confidence scores, and class probabilities)
8:       if confidence > threshold then
9:         detect the bounding box (x, y, w, h)
10:      end if
11:    end for
12:  else
13:    framework = SSD
14:    use the MobilenetSSD-deploy.caffemodel
15:    test the model shown in Figs. 4 and 7:
16:    for each detection value in the feature map do
17:      if confidence > threshold then
18:        detect the object with bounding box (x, y, w, h)
19:        assign a class ID to each detected object
20:        assign the bounding box coordinates to the detected object
21:      end if
22:    end for
23:  end if
24:  for each detection do
25:    assign a class number to each detected object
26:    if the object is present in the detected list then
27:      i = i + 1
28:      store the detection (bounding box, class ID, confidence score) in the list
29:    else
30:      the object is not added to the detected list
31:    end if
32:  end for
33:  for each object bounding box in the detected list do
34:    initialize a tracker (BOOSTING, MIL, KCF, TLD, MEDIANFLOW, or GOTURN)
35:    pass the detected bounding box (x, y, w, h) to the tracker
36:    add the tracker to the tracker list
37:    start tracking
38:    for each subsequent frame do
39:      track the object
40:      if the tracker update succeeds (the object is still in the scene) then
41:        update the bounding box position
42:        draw the tracked bounding box with its ID
43:        continue tracking the object
44:        update the tracker list
45:      else
46:        stop tracking
47:      end if
48:    end for
49:  end for
50: end for
Figs. 9 and 10 show the deep learning models' output results for top view person detection. In both scenes, the video sequences are recorded from a completely top (symmetric) view. The first scene is captured in an outdoor environment, while the second is captured in an indoor environment. From Figs. 9 and 10, it can be seen that the appearance of the person varies, but the deep learning models still give good detection results. In Fig. 9, the person's appearance changes throughout the video sequence depending upon the radial distance from the camera position. The radial distance causes a change in the person's scale and size, but the deep learning model still detects an accurate bounding box for each size and scale variation.
Similarly, in Fig. 10, the detected bounding box is automatically adjusted for different sizes and scales. The person in the sample frames moves exactly below the camera, which causes variations in the appearance of the person's body. The person's apparent size increases as the person moves away from the camera. The detected bounding box and class label (person), along with the confidence score, can be seen in the sample frames. We also tested the discussed deep learning models for multiple persons, as seen in Fig. 11. It can be determined from Fig. 11 that the models detect multiple people and assign an ID to each person. It can be seen from the sample frames that the person in the red shirt is assigned ID 0, which remains the same throughout the video sequences. The sample frames are captured in an outdoor environment where the person is going down the stairs, which causes variation in the size of the person with respect to the camera height. The detection results show that the model effectively detects and classifies the person without any false detection.
The deep learning object detection models are also tested using the asymmetric top view in outdoor and indoor environments. In video surveillance, the main focus is the detection and tracking of persons; therefore, in almost all scenes, they are considered the main region of interest. Another scene covers multiple objects from the symmetric top view in outdoor environments, and the last scene comprises sample frames of multiple objects from an asymmetric overhead view. All of the scenes are captured with different backgrounds and illumination conditions. The YOLO model results for top view multiple object detection, using different backgrounds and scenes, are examined in Fig. 12. It can be seen from the sample frames that, from the top view, there is significant variation in the size, orientation, scale, and appearance of objects, but the pre-trained deep learning model still efficiently detects the objects from an overhead view. In Fig. 12(a), the persons seen from the top perspective are efficiently identified even when they are very close to each other.
Similarly, in Figs. 12(b), 12(c), and 12(d), the person in an indoor environment is detected using the pre-trained deep learning model. In the sample frames of Figs. 12(e) and 12(f), the person and other objects, including a car, a bus, and a truck, are accurately detected from the top view. The confidence scores for different objects are also depicted in Fig. 12. The detected bounding box shows three values for each object, i.e., the object class label, the object class ID, and its confidence score. For example, in the sample image of Fig. 12(d), two objects with different class labels are detected: one is a bench and one is a person. Some false detections are also highlighted in Fig. 12; for example, the person's feet are detected as a skateboard. Likewise, in Fig. 12(b), two people seen from the overhead view are not identified by the model and are marked with red crosses, while one is missed in Fig. 12(e). Furthermore, in Fig. 12(c), two people are detected in the same bounding box, which is referred to as a missed detection. We conclude from the results that the YOLO model efficiently detects objects from an overhead view without any additional training. Although there are also some false detections, these could be reduced if the model were trained on the same top view data set.
Likewise, for the second deep learning model, SSD, the visual results of testing on top view frames are depicted in Fig. 13. As with the YOLO model, SSD also shows useful results for top view object detection. The SSD model also produces missed detections, as shown in Fig. 13(a), where multiple persons are detected in a single bounding box. Similarly, in Fig. 13(b), like YOLO, the SSD model also produces non-detections, highlighted with red crosses. As in Fig. 12(d), only the person is detected while the bench is not. However, overall, the results are good compared to traditional feature-based methods, which require extensive training on background samples. Like the first deep learning object detection model, the second model also detects each object in a bounding box with three values, i.e., the object class label, the object class ID, and its confidence score. For example, in the sample frame of Fig. 13(b), each detected person has a class label, a class ID, and a confidence score. For the SSD model, the confidence values of the objects are different; for example, in the sample images of Figs. 12(a) and 13(a), the person in red clothing is detected with different confidence scores of 0.96 and 0.83, respectively. Similarly, in Fig. 13(d), the identified persons have different confidence scores of 0.78 and 0.72, respectively. The detected bounding box information from the above models is further used for tracking purposes. In the following section, the tracking results for top view multiple objects are discussed.
In Fig. 14, the tracking results on the top view data set for different scenes can be visualized. It can also be observed from the sample frames of different video sequences that the object appearance, size, scale, and orientation change with respect to the camera location in comparison with a frontal view. However, the tracking algorithms developed for frontal view data sets still give good results. The YOLO and SSD models, pre-trained on frontal view images, accurately classify the objects in top view video frames. The detected object (person) with its confidence score can be seen in Fig. 14 with a green bounding box.
The results of the different tracking algorithms for top view multiple object tracking are good, but in some instances they suffer from failures. The BOOSTING and MEDIANFLOW trackers perform well but fail when there is an abrupt change in object size. Likewise, MIL sometimes fails during the sequence due to variation in the person's body size and changes in motion direction. The results of the KCF algorithm are quite good compared to the other trackers. The TLD algorithm also misses targets and suffers from failures, but these failures are not consistent. The MEDIANFLOW tracker gives false results and experiences failures when there is a sudden variation in the person's direction or gesture.
To avoid unnecessary repetition, we explain the tracking results of the best tracker, i.e., GOTURN. The first column of Fig. 14 shows the tracking results using the YOLO model, while the second column depicts the tracking results using SSD. In Fig. 14, to avoid confusion in the tracking trajectories, we show the outcomes for simple video sequences. As discussed earlier, since the primary interest in video surveillance is to track persons' trajectories, in this section we also focus on tracking individuals' trajectories. In Figs. 14(a) and 14(b), the tracking trajectories for an individual in an asymmetric top view can be seen. The person is moving across the scene, and the GOTURN tracker shows good results along with the deep learning models.
Likewise, in Figs. 14(c) and 14(d), the tracking results for symmetric top view person video sequences are depicted. The individual moves freely in the scene, as seen from the trajectory results in Figs. 14(c) and 14(d). Similarly, in the next video sequence, where two people are moving close to each other in the same direction, the algorithm efficiently tracks each person with some miss rate. Figs. 14(g) and 14(h) show results for outdoor video sequences captured from the asymmetric top view, in which a car is moving in a straight line while two persons move freely in the scene. The tracking trajectories of all objects are depicted in Figs. 14(g) and 14(h), with the trajectories of the different objects indicated by dotted lines of different colors.
Similarly, results for another scene containing two objects, a bus and a person, are depicted in Figs. 14(i) and 14(l). One object is stationary, while the other moves freely in the scene. In Figs. 14(i) and 14(l), the person's trajectories are presented with yellow dotted lines. The same holds for Figs. 14(m) and 14(n), where the person is continuously moving in the scene. That video sequence contains a person and a bus viewed from the asymmetric top view. The person is efficiently tracked by the tracking algorithm using the deep learning-based detection models.
In this section, the parameters utilized for the evaluation of the different tracking algorithms and both object detection models are discussed. The evaluation method is adopted from [67]. The parameters are categorized as follows:
i) False detection rate (FDR) and true detection rate (TDR): These two parameters are used to evaluate both deep learning-based top view multiple object detection models.
ii) Tracking accuracy (TA): It is utilized for the evaluation of different tracking algorithms.
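The exact metric definitions follow [67] and are not reproduced here; under the commonly used interpretation (stated as an assumption for clarity), they can be written as

$TDR = \dfrac{\text{correctly detected objects}}{\text{total ground-truth objects}}, \quad FDR = \dfrac{\text{false detections}}{\text{total detections}}, \quad TA = \dfrac{\text{correctly tracked frames}}{\text{total frames containing the target}}.$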
1) Detection Results: The detection results for top view multiple objects using the deep learning-based models are demonstrated in Figs. 15 and 16 and elaborated in Table II. It can be seen that both models produce good results. For top view multiple object detection, both models perform well without any extra background training. The performance of SSD is slightly better than that of the YOLO model; if preference is given to speed over accuracy, then YOLO performs better. For sample video frames containing persons, the YOLO model achieves a TDR of 92%, while for other objects the TDR ranges from 90% to 92%. Likewise, for the SSD model, the TDR for multiple top view objects ranges from 90% to 93%, and for persons it presents good results, achieving a TDR of 93%.
# | Object | YOLO TDR | YOLO FDR | SSD TDR | SSD FDR
1 | Person | 92% | 0.5% | 93% | 0.4%
2 | Car | 92% | 0.5% | 92% | 0.4%
3 | Motorbike | 90% | 0.5% | 90% | 0.5%
4 | Truck | 90% | 0.6% | 90% | 0.6%
5 | Bus | 90% | 0.6% | 90% | 0.6%
The TDR of both models in different symmetric and asymmetric top view scenarios for different objects is plotted in Figs. 15 and 16, from which the robustness of the deep learning models can be analyzed. Similarly, the FDR of both models for symmetric and asymmetric top views is depicted in Fig. 16. From the FDR results, we conclude that, rather than showing a high FDR because the models were not trained on a top view data set, both models yield minimal values of less than about 0.7%. This indicates that the deep learning models are better than traditional handcrafted features, which require additional background training to improve the TDR. From the FDR results, we also conclude that the results may be further improved if the same models are trained on top view images.
2) Tracking Results: Fig. 17 depicts the performance measure of the six different tracking algorithms for indoor and outdoor top view scenarios, in symmetric and asymmetric views, against each object. It can be clearly seen that the accuracy of the tracking algorithms changes for different target objects, which shows that changing the camera perspective significantly influences the shape and appearance of the object. The TA of the different tracking algorithms is also presented in Tables III and IV. With the YOLO model, the tracking algorithms achieve a tracking accuracy ranging from 90% to 93% for persons, as shown in Fig. 17(a), while with the SSD model the tracking accuracy for persons ranges from 90% to 94%. Fig. 17 reflects that, using multiple top view scenes, the tracking accuracy of the algorithms for multiple objects may be effectively improved. From Fig. 17, all tracking algorithms, along with both detection modules, achieve good accuracy results. The GOTURN performance is excellent compared to the other tracking algorithms, achieving a tracking accuracy of 94%. The tracking accuracy of the different tracking algorithms with each detection model is also elaborated in Tables III and IV.
# | Object | BOOSTING TA | MIL TA | KCF TA | TLD TA | MEDIANFLOW TA | GOTURN TA
1 | Person | 90% | 90% | 91% | 91% | 92% | 94%
2 | Car | 90% | 90% | 91% | 91% | 92% | 94%
3 | Motorbike | 90% | 90% | 90% | 91% | 91% | 92%
4 | Truck | 90% | 90% | 90% | 90% | 90% | 91%
5 | Bus | 90% | 90% | 90% | 90% | 90% | 90%
# | Object | BOOSTING TA | MIL TA | KCF TA | TLD TA | MEDIANFLOW TA | GOTURN TA
1 | Person | 91% | 91% | 92% | 91% | 93% | 94%
2 | Car | 90% | 91% | 92% | 91% | 93% | 94%
3 | Motorbike | 90% | 91% | 91% | 91% | 92% | 93%
4 | Truck | 90% | 90% | 90% | 91% | 92% | 92%
5 | Bus | 90% | 90% | 90% | 91% | 91% | 91%
Collaborative robotics plays a crucial role in video surveillance by providing autonomous solutions for many real-life applications. In this work, a robotic framework is presented for a top view surveillance system. The overall framework consists of a visual processing unit embedded with a smart robotic camera. The visual processing unit performs object detection and tracking in top view surveillance and assists human operators in managing and controlling different surveillance applications. For object detection, existing pre-trained deep learning models, i.e., SSD and YOLO, already trained on frontal view data sets, are tested against a top view data set. Different tracking algorithms, combined with the detection models, are investigated for tracking multiple objects from the top view. Although the change in camera perspective causes significant variation in the objects' appearance, the deep learning-based models show promising results. Overall, a TDR of 93% with an FDR of 0.6% is achieved for person images, and for other objects the TDR ranges from 90% to 92% in the case of YOLO. For the SSD model on top view person sample frames, the TDR is 93% with an FDR of 0.5%, and for other objects the TDR is up to 90% with an FDR of 0.6%.
Furthermore, when using the different tracking algorithms for multiple object tracking, almost all tracking algorithms achieve a tracking accuracy ranging from 90% to 94% with both detection models. To the best of our knowledge, this work is the first attempt at a top view collaborative surveillance system using generic models for object detection and tracking. In the future, this work will be extended to other deep learning models and a wider variety of top view objects. Training and testing the models on top view data sets may further boost detection and tracking performance. Collaborative robotics in top view surveillance may be further enhanced by incorporating different modules, e.g., behavior analysis, activity recognition, and event detection.
[1] L. G. Clift, J. Lepley, H. Hagras, and A. F. Clark, “Autonomous computational intelligence-based behaviour recognition in security and surveillance,” in Proc. SPIE 10802, Counterterrorism, Crime Fighting, Forensics, and Surveillance Technologies II, Berlin, Germany, 2018, pp. 108020L.
[2] H. M. Hodgetts, F. Vachon, C. Chamberland, and S. Tremblay, “See no evil: Cognitive challenges of security surveillance and monitoring,” J. Appl. Res. Mem. Cognit., vol. 6, no. 3, pp. 230–243, Sept. 2017. doi: 10.1016/j.jarmac.2017.05.001
[3] P. Bansal and K. M. Kockelman, “Are we ready to embrace connected and self-driving vehicles? A case study of Texans,” Transportation, vol. 45, no. 2, pp. 641–675, Mar. 2018. doi: 10.1007/s11116-016-9745-z
[4] M. Haghighat and M. Abdel-Mottaleb, “Low resolution face recognition in surveillance systems using discriminant correlation analysis,” in Proc. 12th IEEE Int. Conf. Automatic Face & Gesture Recognition, Washington, USA, 2017, pp. 912–917.
[5] Y. Jeong, S. Son, E. Jeong, and B. Lee, “An integrated self-diagnosis system for an autonomous vehicle based on an IoT gateway and deep learning,” Appl. Sci., vol. 8, no. 7, Article No. 1164, Jul. 2018. doi: 10.3390/app8071164
[6] M. Chen, J. Zhou, G. M. Tao, J. Yang, and L. Hu, “Wearable affective robot,” IEEE Access, vol. 6, pp. 64766–64776, Oct. 2018. doi: 10.1109/ACCESS.2018.2877919
[7] M. Chen and Y. X. Hao, “Label-less learning for emotion cognition,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 7, pp. 2430–2440, Jul. 2020.
[8] M. Chen, Y. Cao, R. Wang, Y. Li, D. Wu, and Z. C. Liu, “Deepfocus: Deep encoding brainwaves and emotions with multi-scenario behavior analytics for human attention enhancement,” IEEE Netw., vol. 33, no. 6, pp. 70–77, Nov.–Dec. 2019. doi: 10.1109/MNET.001.1900054
[9] Z. X. Zou, Z. W. Shi, Y. H. Guo, and J. P. Ye, “Object detection in 20 years: A survey,” arXiv preprint arXiv: 1905.05055, 2019.
[10] R. Yao, G. S. Lin, S. X. Xia, J. Q. Zhao, and Y. Zhou, “Video object segmentation and tracking: A survey,” ACM Trans. Intell. Syst. Technol., vol. 11, no. 4, pp. 1–47, May 2020.
[11] K. A. Joshi and D. G. Thakore, “A survey on moving object detection and tracking in video surveillance system,” Int. J. Soft Comput. Eng., vol. 2, no. 3, pp. 44–48, Jul. 2012.
[12] S. Hare, S. Golodetz, A. Saffari, V. Vineet, M. M. Cheng, S. L. Hicks, and P. H. S. Torr, “Struck: Structured output tracking with kernels,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 10, pp. 2096–2109, Oct. 2016. doi: 10.1109/TPAMI.2015.2509974
[13] F. Yang, H. Lu, W. Zhang, and G. Yang, “Visual tracking via bag of features,” IET Image Process., vol. 6, no. 2, pp. 115–128, Mar. 2012. doi: 10.1049/iet-ipr.2010.0127
[14] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition, San Diego, USA, 2005.
[15] J. L. Fan, X. H. Shen, and Y. Wu, “Scribble tracker: A matting-based approach for robust tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 8, pp. 1633–1644, Aug. 2012. doi: 10.1109/TPAMI.2011.257
[16] X. Li, A. Dick, C. H. Shen, A. Van den Hengel, and H. Z. Wang, “Incremental learning of 3D-DCT compact representations for robust visual tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 4, pp. 863–881, Apr. 2013. doi: 10.1109/TPAMI.2012.166
[17] H. S. Parekh, D. G. Thakore, and U. K. Jaliya, “A survey on object detection and tracking methods,” Int. J. Innovat. Res. Comput. Commun. Eng., vol. 2, no. 2, pp. 2970–2978, Feb. 2014.
[18] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” arXiv preprint arXiv: 1312.6229, 2013.
[19] G. Fortino, W. Russo, C. Savaglio, W. M. Shen, and M. C. Zhou, “Agent-oriented cooperative smart objects: From IoT system design to implementation,” IEEE Trans. Syst. Man Cybern. Syst., vol. 48, no. 11, pp. 1939–1956, Nov. 2018. doi: 10.1109/TSMC.2017.2780618
[20] R. Girshick, “Fast R-CNN,” in Proc. IEEE Int. Conf. Computer Vision, Santiago, Chile, 2015, pp. 1440–1448.
[21] S. Gidaris and N. Komodakis, “Object detection via a multi-region and semantic segmentation-aware CNN model,” in Proc. IEEE Int. Conf. Computer Vision, Santiago, Chile, 2015, pp. 1134–1142.
[22] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, USA, 2016, pp. 779–788.
[23] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in Proc. 14th European Conf. Computer Vision, Amsterdam, The Netherlands, 2016, pp. 21–37.
[24] J. F. Dai, Y. Li, K. M. He, and J. Sun, “R-FCN: Object detection via region-based fully convolutional networks,” in Proc. 30th Int. Conf. Neural Information Processing Systems, Barcelona, Spain, 2016, pp. 379–387.
[25] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr, “Fully-convolutional siamese networks for object tracking,” in Proc. European Conf. Computer Vision, Amsterdam, The Netherlands, 2016, pp. 850–865.
[26] A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah, “Visual tracking: An experimental survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp. 1442–1468, Jul. 2014. doi: 10.1109/TPAMI.2013.230
[27] G. Smart, N. Deligiannis, R. Surace, V. Loscri, G. Fortino, and Y. Andreopoulos, “Decentralized time-synchronized channel swapping for ad hoc wireless networks,” IEEE Trans. Vehicular Technol., vol. 65, no. 10, pp. 8538–8553, Oct. 2016. doi: 10.1109/TVT.2015.2509861
[28] Y. Wu, J. Lim, and M. H. Yang, “Object tracking benchmark,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1834–1848, Sept. 2015. doi: 10.1109/TPAMI.2014.2388226
[29] G. Ciaparrone, F. L. Sánchez, S. Tabik, L. Troiano, R. Tagliaferri, and F. Herrera, “Deep learning in video multi-object tracking: A survey,” Neurocomputing, vol. 381, pp. 61–88, Mar. 2020. doi: 10.1016/j.neucom.2019.11.023
[30] T. Kong, A. B. Yao, Y. R. Chen, and F. C. Sun, “Hypernet: Towards accurate region proposal generation and joint object detection,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, USA, 2016, pp. 845–853.
[31] G. Zhu, F. Porikli, and H. D. Li, “Robust visual tracking with deep convolutional neural network based object proposals on pets,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshops, Las Vegas, USA, 2016, pp. 1265–1272.
[32] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. van Gool, “Online multiperson tracking-by-detection from a single, uncalibrated camera,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 9, pp. 1820–1833, Sept. 2011. doi: 10.1109/TPAMI.2010.232
[33] K. Potdar, C. D. Pai, and S. Akolkar, “A convolutional neural network based live object recognition system as blind aid,” arXiv preprint arXiv: 1811.10399, 2018.
[34] A. Vavilin and K. H. Jo, “Motion analysis for scenes with multiple moving objects,” IEEJ Trans. Electron. Inf. Syst., vol. 133, no. 1, pp. 40–46, Jan. 2013.
[35] G. Khan, Z. Tariq, and M. U. G. Khan, “Multi-person tracking based on faster R-CNN and deep appearance features,” in Visual Object Tracking with Deep Neural Networks, P. L. Mazzeo, S. Ramakrishnan, and P. Spagnolo, Eds. IntechOpen, 2019.
[36] I. Ahmed and J. N. Carter, “A robust person detector for overhead views,” in Proc. 21st Int. Conf. Pattern Recognition, Tsukuba, Japan, 2012, pp. 1483–1486.
[37] I. Ahmed and A. Adnan, “A robust algorithm for detecting people in overhead views,” Cluster Comput., vol. 21, no. 1, pp. 633–654, Mar. 2018. doi: 10.1007/s10586-017-0968-3
[38] M. Ahmad, I. Ahmed, K. Ullah, I. Khan, and A. Adnan, “Robust background subtraction based person’s counting from overhead view,” in Proc. 9th IEEE Annu. Ubiquitous Computing, Electronics & Mobile Communication Conf., New York City, USA, 2018, pp. 746–752.
[39] H. Tayara and K. T. Chong, “Object detection in very high-resolution aerial images using one-stage densely connected feature pyramid network,” Sensors, vol. 18, no. 10, Article No. 3341, Oct. 2018. doi: 10.3390/s18103341
[40] A. van Etten, “You only look twice: Rapid multi-scale object detection in satellite imagery,” arXiv preprint arXiv: 1805.09512, 2018.
[41] M. Sigalas, M. Pateraki, and P. Trahanias, “Full-body pose tracking—the top view reprojection approach,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 8, pp. 1569–1582, Aug. 2016. doi: 10.1109/TPAMI.2015.2502582
[42] C. Migniot and F. Ababsa, “Hybrid 3D–2D human tracking in a top view,” J. Real-Time Image Process., vol. 11, no. 4, pp. 769–784, Dec. 2016. doi: 10.1007/s11554-014-0429-7
[43] I. Ahmed, S. Din, G. Jeon, and F. Piccialli, “Exploring deep learning models for overhead view multiple object detection,” IEEE Internet Things J., vol. 7, no. 7, pp. 5737–5744, Jul. 2020. doi: 10.1109/JIOT.2019.2951365
[44] M. Ahmad, I. Ahmed, K. Ullah, I. Khan, A. Khattak, and A. Adnan, “Energy efficient camera solution for video surveillance,” Int. J. Adv. Comput. Sci. Appl., vol. 10, no. 3, pp. 522–529, 2019.
[45] S. R. Zhou, M. L. Ke, J. Qiu, and J. Wang, “A survey of multi-object video tracking algorithms,” in Int. Conf. Applications and Techniques in Cyber Security and Intelligence, J. Abawajy, K. K. R. Choo, R. Islam, Z. Xu, and M. Atiquzzaman, Eds. Cham, Switzerland: Springer, 2018, pp. 351–369.
[46] P. X. Li, D. Wang, L. J. Wang, and H. C. Lu, “Deep visual tracking: Review and experimental comparison,” Pattern Recognit., vol. 76, pp. 323–338, Apr. 2018. doi: 10.1016/j.patcog.2017.11.007
[47] D. Comaniciu, V. Ramesh, and P. Meer, “Kernel-based object tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 564–577, May 2003. doi: 10.1109/TPAMI.2003.1195991
[48] M. Danelljan, F. S. Khan, M. Felsberg, and J. van de Weijer, “Adaptive color attributes for real-time visual tracking,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Columbus, USA, 2014, pp. 1090–1097.
[49] D. A. Ross, J. Lim, R. S. Lin, and M. H. Yang, “Incremental learning for robust visual tracking,” Int. J. Comput. Vis., vol. 77, no. 1-3, pp. 125–141, May 2008. doi: 10.1007/s11263-007-0075-7
[50] Q. Wang, F. Chen, W. L. Xu, and M. H. Yang, “Object tracking via partial least squares analysis,” IEEE Trans. Image Process., vol. 21, no. 10, pp. 4454–4465, Oct. 2012. doi: 10.1109/TIP.2012.2205700
[51] Y. Lu, T. F. Wu, and S. C. Zhu, “Online object tracking, learning, and parsing with and-or graphs,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2014, pp. 3462–3469.
[52] R. Yao, Q. F. Shi, C. H. Shen, Y. N. Zhang, and A. van den Hengel, “Part-based visual tracking with online latent structural learning,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Portland, USA, 2013, pp. 2363–2370.
[53] Y. C. Bai and M. Tang, “Robust tracking via weakly supervised ranking SVM,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Providence, USA, 2012, pp. 1854–1861.
[54] J. Santner, C. Leistner, A. Saffari, T. Pock, and H. Bischof, “Prost: Parallel robust online simple tracking,” in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition, San Francisco, USA, 2010, pp. 723–730.
[55] J. Gall, A. Yao, N. Razavi, L. van Gool, and V. Lempitsky, “Hough forests for object detection, tracking, and action recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 11, pp. 2188–2202, Nov. 2011. doi: 10.1109/TPAMI.2011.70
[56] L. Zhang and L. van der Maaten, “Preserving structure in model-free tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 4, pp. 756–769, Apr. 2013.
[57] J. Kwon and K. M. Lee, “Tracking by sampling and integrating multiple trackers,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp. 1428–1441, Jul. 2014. doi: 10.1109/TPAMI.2013.213
[58] D. Wang, H. C. Lu, and M. H. Yang, “Online object tracking with sparse prototypes,” IEEE Trans. Image Process., vol. 22, no. 1, pp. 314–325, Jan. 2012.
[59] R. T. Collins, Y. X. Liu, and M. Leordeanu, “Online selection of discriminative tracking features,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10, pp. 1631–1643, Oct. 2005. doi: 10.1109/TPAMI.2005.205
[60] S. Duffner and C. Garcia, “Pixeltrack: A fast adaptive algorithm for tracking non-rigid objects,” in Proc. IEEE Int. Conf. Computer Vision, Sydney, Australia, 2013, pp. 2480–2487.
[61] C. G. Ertler, H. Possegger, M. Opitz, and H. Bischof, “Pedestrian detection in RGB-D images from an elevated viewpoint,” in Proc. 22nd Computer Vision Winter Workshop, Wien, Austria, 2017.
[62] J. W. Perng, T. Y. Wang, Y. W. Hsu, and B. F. Wu, “The design and implementation of a vision-based people counting system in buses,” in Proc. Int. Conf. System Science and Engineering, Puli, China, 2016, pp. 1–3.
[63] P. Vera, S. Monjaraz, and J. Salas, “Counting pedestrians with a zenithal arrangement of depth cameras,” Mach. Vis. Appl., vol. 27, no. 2, pp. 303–315, Feb. 2016. doi: 10.1007/s00138-015-0739-1
[64] Y. W. Pang, Y. Yuan, X. L. Li, and J. Pan, “Efficient HOG human detection,” Signal Process., vol. 91, no. 4, pp. 773–781, Apr. 2011. doi: 10.1016/j.sigpro.2010.08.010
[65] T. W. Choi, D. H. Kim, and K. H. Kim, “Human detection in top-view depth image,” Contemp. Eng. Sci., vol. 9, no. 11, pp. 547–552, 2016.
[66] I. Ahmed, M. Ahmad, A. Adnan, A. Ahmad, and M. Khan, “Person detector for different overhead views using machine learning,” Int. J. Mach. Learn. Cyber., vol. 10, no. 10, pp. 2657–2668, Nov. 2019. doi: 10.1007/s13042-019-00950-5
[67] I. Ahmed, A. Ahmad, F. Piccialli, A. K. Sangaiah, and G. Jeon, “A robust features-based person tracker for overhead views in industrial environment,” IEEE Internet Things J., vol. 5, no. 3, pp. 1598–1605, Jun. 2018. doi: 10.1109/JIOT.2017.2787779
[68] K. Ullah, I. Ahmed, M. Ahmad, A. U. Rahman, M. Nawaz, and A. Adnan, “Rotation invariant person tracker using top view,” J. Ambient Intell. Humaniz. Comput., 2019. doi: 10.1007/s12652-019-01526-5
|
[69] |
I. Ahmed, M. Ahmad, M. Nawaz, K. Haseeb, S. Khan, and G. Jeon, “Efficient topview person detector using point based transformation and lookup table,” Comput. Commun., vol. 147, pp. 188–197, Nov. 2019. doi: 10.1016/j.comcom.2019.08.015
|
[70] |
R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Columbus, USA, 2014, pp. 580–587.
|
[71] |
K. M. He, X. Y. Zhang, S. Q. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Sept. 2015. doi: 10.1109/TPAMI.2015.2389824
|
[72] |
S. Q. Ren, K. M. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. 28th Int. Conf. Neural Information Processing Systems, Montreal, Canada, 2015, pp. 91–99.
|
[73] |
T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proc. 13th European Conf. Computer Vision, Zurich, Switzerland, 2014, pp. 740–755.
|
[74] |
T. Y. Lin, P. Dollár, R. Girshick, K. M. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Honolulu, USA, 2017, pp. 936–944.
|
[75] |
J. Redmon and A. Farhadi, “Yolo9000: Better, faster, stronger,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 6517–6525.
|
[76] |
J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” arXiv preprint arXiv: 1804.02767, 2018.
|
[77] |
J. L. Fan, W. Xu, Y. Wu, and Y. H. Gong, “Human tracking using convolutional neural networks,” IEEE Trans. Neural Netw., vol. 21, no. 10, pp. 1610–1623, Oct. 2010. doi: 10.1109/TNN.2010.2066286
|
[78] |
H. M. Lu, T. Uemura, D. Wang, J. H. Zhu, Z. Huang, and H. Kim, “Deep-sea organisms tracking using dehazing and deep learning,” Mobile Netw. Appl., vol. 25, no. 3, pp. 1008–1015, Jun. 2020. doi: 10.1007/s11036-018-1117-9
|
[79] |
J. Zhang, S. Yang, C. Bo, and H. Lu, “Single stage vehicle logo detector based on multi-scale prediction,” Trans. Information and Systems, vol. E103, no. 10, 2020.
|
[80] |
B. N. Zhong, H. X. Yao, S. Chen, R. R. Ji, T. J. Chin, and H. Z. Wang, “Visual tracking via weakly supervised learning from multiple imperfect oracles,” Pattern Recognit., vol. 47, no. 3, pp. 1395–1410, Mar. 2014. doi: 10.1016/j.patcog.2013.10.002
|
[81] |
S. Hong, T. You, S. Kwak, and B. Han, “Online tracking by learning discriminative saliency map with convolutional neural network,” in Proc. 32nd Int. Conf. Machine Learning, Lille, France, 2015, pp. 597–606.
|
[82] |
N. Y. Wang and D. Y. Yeung, “Learning a deep compact image representation for visual tracking,” in Proc. 26th Int. Conf. Neural Information Processing Systems, Lake Tahoe, USA, 2013, pp. 809–817.
|
[83] |
N. Y. Wang, S. Y. Li, A. Gupta, and D. Y. Yeung, “Transferring rich feature hierarchies for robust visual tracking,” arXiv preprint arXiv: 1501.04587, 2015.
|
[84] |
G. H. Ning, Z. Zhang, C. Huang, X. B. Ren, H. H. Wang, C. H. Cai, and Z. H. He, “Spatially supervised recurrent convolutional neural networks for visual object tracking,” in Proc. IEEE Int. Symp. Circuits and Systems, Baltimore, USA, 2017, pp. 1–4.
|
[85] |
N. Y. Wang and D. Y. Yeung, “Ensemble-based tracking: Aggregating crowdsourced structured time series data,” in Proc. 31st Int. Conf. Machine Learning, Beijing, China, 2014, pp. 1107–1115.
|
[86] |
J. Kuen, K. M. Lim, and C. P. Lee, “Self-taught learning of a deep invariant representation for visual tracking via temporal slowness principle,” Pattern Recognit., vol. 48, no. 10, pp. 2964–2982, Oct. 2015. doi: 10.1016/j.patcog.2015.02.012
|
[87] |
Z. Cui, S. T. Xiao, J. S. Feng, and S. C. Yan, “Recurrently target-attending tracking,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Las Vegas, USA, 2016, pp. 1449–1458.
|
[88] |
J. Y. Gao, T. Z. Zhang, X. S. Yang, and C. S. Xu, “Deep relative tracking,” IEEE Trans. Image Process., vol. 26, no. 4, pp. 1845–1858, Apr. 2017. doi: 10.1109/TIP.2017.2656628
|
[89] |
D. W. Du, Y. K. Qi, H. Y. Yu, Y. F. Yang, K. W. Duan, G. R. Li, W. G. Zhang, Q. M. Huang, and Q. Tian, “The unmanned aerial vehicle benchmark: Object detection and tracking,” in Proc. 15th European Conf. Computer Vision, Munich, Germany, 2018, pp. 375–391.
|
[90] |
P. F. Zhu, L. Y. Wen, D. W. Du, et al., “Visdrone-vdt2018: The vision meets drone video detection and tracking challenge results,” in Proc. European Conf. Computer Vision, Munich, Germany, 2018, pp. 437–468.
|
[91] |
Y. K. Qi, S. P. Zhang, W. G. Zhang, L. Su, Q. M. Huang, and M. H. Yang, “Learning attribute-specific representations for visual tracking,” in Proc. AAAI Conf. Artificial Intelligence, vol. 33, 2019, pp. 8835–8842.
|
[92] |
M. Z. Uddin, M. M. Hassan, A. Almogren, A. Alamri, M. Alrubaian, and G. Fortino, “Facial expression recognition utilizing local direction-based robust features and deep belief network,” IEEE Access, vol. 5, pp. 4525–4536, Mar. 2017. doi: 10.1109/ACCESS.2017.2676238
|
[93] |
M. Ahmad, I. Ahmed, and A. Adnan, “Overhead view person detection using YOLO,” in Proc. IEEE 10th Annual Ubiquitous Computing, Electronics & Mobile Communication Conf., New York City, USA, 2019, pp. 627–633.
|
[94] |
M. Ahmad, I. Ahmed, K. Ullah, and M. Ahmad, “A deep neural network approach for top view people detection and counting,” in Proc. IEEE 10th Annual Ubiquitous Computing, Electronics & Mobile Communication Conf., New York City, USA, 2019, pp. 1082–1088.
|
[95] |
D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, “Scalable object detection using deep neural networks,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Columbus, USA, 2014, pp. 2155–2162.
|
[96] |
H. Grabner, M. Grabner, and H. Bischof, “Real-time tracking via on-line boosting,” in Proc. British Machine Vision Conf., Edinburgh, UK, 2006, pp. 6.
|
[97] |
B. Babenko, M. H. Yang, and S. Belongie, “Visual tracking with online multiple instance learning,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Miami, USA, 2009, pp. 983–990.
|
[98] |
J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “Exploiting the circulant structure of tracking-by-detection with kernels,” in Proc. 12th European Conf. Computer Vision, Florence, Italy, 2012, pp. 702–715.
|
[99] |
Z. Kalal, K. Mikolajczyk, and J. Matas, “Tracking-learning-detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1409–1422, Jul. 2012. doi: 10.1109/TPAMI.2011.239
|
[100] |
Z. Kalal, K. Mikolajczyk, and J. Matas, “Forward-backward error: Automatic detection of tracking failures,” in Proc. 20th Int. Conf. Pattern Recognition, Istanbul, Turkey, 2010, 2756–2759.
|
[101] |
D. Held, S. Thrun, and S. Savarese, “Learning to track at 100 fps with deep regression networks,” in Proc. 14th European Conf. Computer Vision, Amsterdam, The Netherlands, 2016, pp. 749–765.
|
# | Description | Detail
1 | Color space | RGB
2 | Video sequence duration | 5 to 30 minutes
3 | Video format | .mp4
4 | Frame rate | 20 frames per second
5 | Frame resolution | 640 × 480
6 | Height of camera | 3 to 4 meters above the ground
7 | Recording location | Indoor and outdoor
8 | Illumination changes, reflections/shadows | Yes
9 | Frame format | .jpg
10 | Number of objects | Varying
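For context, the short sketch below (not taken from the paper) shows how a recorded sequence with the properties listed above could be read and exported frame by frame with OpenCV; the file name topview_sequence.mp4 and the frames/ output directory are placeholders.

```python
# Minimal sketch (not from the paper): reading one recorded sequence with the
# properties listed in the table above. "topview_sequence.mp4" and "frames/"
# are placeholder names.
import os
import cv2

os.makedirs("frames", exist_ok=True)
cap = cv2.VideoCapture("topview_sequence.mp4")    # .mp4 container
fps = cap.get(cv2.CAP_PROP_FPS)                   # expected ~20 frames per second
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))    # expected 640
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))  # expected 480
print(f"{width}x{height} @ {fps:.0f} fps")

frame_idx = 0
while True:
    ok, frame = cap.read()                        # 480x640x3 BGR image
    if not ok:
        break
    # Individual frames are stored as .jpg, matching the dataset description.
    cv2.imwrite(os.path.join("frames", f"frame_{frame_idx:06d}.jpg"), frame)
    frame_idx += 1
cap.release()
```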
# | Object | YOLO TDR | YOLO FDR | SSD TDR | SSD FDR
1 | Person | 92% | 0.5% | 93% | 0.4%
2 | Car | 92% | 0.5% | 92% | 0.4%
3 | Motorbike | 90% | 0.5% | 90% | 0.5%
4 | Truck | 90% | 0.6% | 90% | 0.6%
5 | Bus | 90% | 0.6% | 90% | 0.6%
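As a reading aid, the sketch below tallies the two scores reported above under the definitions commonly used for such results: TDR (true detection rate) as the fraction of annotated objects that are detected, and FDR (false detection rate) as the fraction of detections that are false positives. These definitions are an assumption here, not a quotation from the paper.

```python
# Hedged sketch of how TDR/FDR values like those above might be computed.
# The definitions below are assumed conventions, not taken from the paper.
def detection_rates(true_positives: int, false_positives: int, num_ground_truth: int):
    """Return (TDR, FDR) in percent for one object class."""
    tdr = 100.0 * true_positives / num_ground_truth if num_ground_truth else 0.0
    total_detections = true_positives + false_positives
    fdr = 100.0 * false_positives / total_detections if total_detections else 0.0
    return tdr, fdr

# Example: 920 of 1000 annotated persons detected, with 5 false detections,
# which is consistent with a TDR of ~92% and an FDR of ~0.5%.
tdr, fdr = detection_rates(true_positives=920, false_positives=5, num_ground_truth=1000)
print(f"TDR = {tdr:.0f}%, FDR = {fdr:.2f}%")
```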
# | Object | BOOSTING (TA) | MIL (TA) | KCF (TA) | TLD (TA) | MEDIANFLOW (TA) | GOTURN (TA)
1 | Person | 90% | 90% | 91% | 91% | 92% | 94%
2 | Car | 90% | 90% | 91% | 91% | 92% | 94%
3 | Motorbike | 90% | 90% | 90% | 91% | 91% | 92%
4 | Truck | 90% | 90% | 90% | 90% | 90% | 91%
5 | Bus | 90% | 90% | 90% | 90% | 90% | 90%
# | Object | BOOSTING (TA) | MIL (TA) | KCF (TA) | TLD (TA) | MEDIANFLOW (TA) | GOTURN (TA)
1 | Person | 91% | 91% | 92% | 91% | 93% | 94%
2 | Car | 90% | 91% | 92% | 91% | 93% | 94%
3 | Motorbike | 90% | 91% | 91% | 91% | 92% | 93%
4 | Truck | 90% | 90% | 90% | 91% | 92% | 92%
5 | Bus | 90% | 90% | 90% | 91% | 91% | 91%
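The tracker names in the two tables above correspond to trackers available in OpenCV's contrib module. The sketch below is not the authors' implementation; it only illustrates how such a tracking-by-detection loop could be wired up. It assumes the opencv-contrib-python 3.4.x API (in OpenCV 4.5+ most of these constructors live under cv2.legacy), and detect_objects is a hypothetical stand-in for the YOLO/SSD detection stage.

```python
# Hedged sketch of a tracking-by-detection loop with the OpenCV trackers
# evaluated above. Assumes the OpenCV 3.4.x contrib API; detect_objects()
# is a placeholder for the detection stage (e.g., YOLO or SSD).
import cv2

TRACKER_FACTORIES = {
    "BOOSTING":   cv2.TrackerBoosting_create,
    "MIL":        cv2.TrackerMIL_create,
    "KCF":        cv2.TrackerKCF_create,
    "TLD":        cv2.TrackerTLD_create,
    "MEDIANFLOW": cv2.TrackerMedianFlow_create,
    "GOTURN":     cv2.TrackerGOTURN_create,  # requires the GOTURN model files
}

def track_sequence(video_path: str, algorithm: str, detect_objects):
    """Initialize one tracker per detected box on the first frame, then update per frame."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        return
    # detect_objects(frame) is assumed to return a list of (x, y, w, h) boxes.
    trackers = []
    for box in detect_objects(frame):
        tracker = TRACKER_FACTORIES[algorithm]()
        tracker.init(frame, box)
        trackers.append(tracker)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        for tracker in trackers:
            success, box = tracker.update(frame)   # box is (x, y, w, h)
            if success:
                x, y, w, h = map(int, box)
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cap.release()
```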