Abstract

We present a fast and accurate visual tracking algorithm based on the multi-domain convolutional neural network (MDNet). The proposed approach accelerates the feature extraction procedure and learns more discriminative models for instance classification; it enhances the representation quality of target and background by maintaining a high-resolution feature map with a large receptive field per activation. We also introduce a novel loss term to differentiate foreground instances across multiple domains and learn a more discriminative embedding of target objects with similar semantics. The proposed techniques are integrated into the pipeline of a well-known CNN-based visual tracking algorithm, MDNet. We achieve an approximately 25-fold speed-up with almost identical accuracy compared to MDNet. Our algorithm is evaluated on multiple popular tracking benchmarks, including OTB2015, UAV123, and TempleColor, and consistently outperforms state-of-the-art real-time tracking methods even without dataset-specific parameter tuning.

Proposed Network Architecture

Figure 1. Network architecture of the proposed tracking algorithm. The network is composed of three convolutional layers that extract a shared feature map, an adaptive RoIAlign layer that extracts per-region features from regions of interest (RoIs), and three fully connected layers for binary classification. The number of channels and the size of each feature map are shown with the name of each layer.
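
For readers who prefer code, the following is a minimal PyTorch sketch of this pipeline, not the authors' implementation: the layer names, channel sizes, and the 3x3 RoI output size follow VGG-M/MDNet conventions and are assumptions here, and torchvision's standard roi_align stands in for the adaptive RoIAlign layer described in the paper.

```python
# Minimal sketch of the Figure 1 pipeline (assumed shapes, not the authors' code).
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class RTMDNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared fully convolutional feature extractor (conv1-conv3 of VGG-M).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(inplace=True),
            # conv3 uses dilation r = 3; the pooling after conv2 is dropped (see Figure 2).
            nn.Conv2d(256, 512, kernel_size=3, dilation=3), nn.ReLU(inplace=True),
        )
        # Three fully connected layers for target/background classification.
        self.fc = nn.Sequential(
            nn.Linear(512 * 3 * 3, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 2),  # 2-way output: target vs. background
        )

    def forward(self, image, rois, spatial_scale):
        fmap = self.backbone(image)                        # one shared feature map per frame
        feats = roi_align(fmap, rois, output_size=(3, 3),  # per-RoI features from the shared map
                          spatial_scale=spatial_scale, sampling_ratio=2)
        return self.fc(feats.flatten(1))                   # per-RoI classification scores
```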

Improved RoIAlign

Figure 2. Architecture of the fully convolutional part of our network for extracting a shared feature map. The max pooling layer after the conv2 layer in the original VGG-M network is removed, and a dilated convolution with rate r = 3 is applied to extract a dense feature map with higher spatial resolution.
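
The effect of this change can be illustrated with a short, self-contained PyTorch snippet (the input shape is a hypothetical conv2 output, not taken from the paper): removing the pooling keeps the feature map dense, while dilation r = 3 lets the 3x3 conv3 kernel cover a 7x7 window, so the receptive field per activation is preserved.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 23, 23)  # hypothetical conv2 output

# Original VGG-M head: 3x3 max pooling (stride 2) followed by a standard 3x3 conv3.
original = nn.Sequential(nn.MaxPool2d(3, stride=2), nn.Conv2d(256, 512, 3))
# Modified head: no pooling; conv3 is dilated with rate r = 3 instead.
dilated = nn.Sequential(nn.Conv2d(256, 512, 3, dilation=3))

print(original(x).shape)  # coarser map:  torch.Size([1, 512, 9, 9])
print(dilated(x).shape)   # denser map:   torch.Size([1, 512, 17, 17])
```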

Pretraining for Discriminative Instance Embedding

Figure 3. Multi-task learning for binary classification of the target object and instance embedding across multiple domains. The binary classification loss distinguishes the target from the background, while the instance embedding loss separates target instances across domains. Note that the minibatch in each training iteration is constructed by sampling from a single domain.
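
A minimal sketch of the two loss terms follows; it assumes a tensor layout in which the 2-way (background/target) scores of all D domain-specific branches are stacked per sample, and the loss weight lambda_inst is an assumption rather than a value taken from the paper.

```python
# Sketch of the multi-task loss in Figure 3 (assumed layout, not the authors' code).
# scores: [N, D, 2] branch outputs for one minibatch drawn from domain d.
# labels: [N] with 1 for target samples and 0 for background samples.
import torch
import torch.nn.functional as F

def multitask_loss(scores, labels, d, lambda_inst=0.1):
    # Binary classification loss: target vs. background within the current domain d.
    cls_loss = F.cross_entropy(scores[:, d, :], labels)

    # Instance embedding loss: a positive sample's target score should be high in its
    # own domain and low elsewhere, i.e. a D-way classification over domains using the
    # target (index 1) scores of positive samples only.
    pos = scores[labels == 1][:, :, 1]  # [N_pos, D] target scores per domain
    inst_loss = F.cross_entropy(pos, pos.new_full((pos.size(0),), d, dtype=torch.long))

    return cls_loss + lambda_inst * inst_loss
```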

Results

1. Results on multiple benchmarks (OTB2015, TempleColor, UAV123).

Figure 4. OTB2015 results

Figure 5. TempleColor results

Figure 6. UAV123 results

2. Ablation study

Figure 7. Impact of different feature extraction methods on the accuracy of RT-MDNet. Each network is pretrained on the VOT-OTB dataset and tested on OTB2015.

Figure 8. Internal comparison results of models pretrained on the ImageNet-Vid dataset.

3. Qualitative Results

Qualitative results of RT-MDNet.

Paper

Real-Time MDNet
Ilchae Jung, Jeany Son, Mooyeol Baek, and Bohyung Han
ECCV, 2018
[arXiv] [Bibtex]

Code

Check our GitHub repository: [github]