Invited column: Deep learning and its application
2018, 47(2): 203001. doi: 10.3788/IRLA201847.0203001
In close-range pedestrian detection, the balance between precision and speed is of great significance to the practical application of a detection algorithm. To detect close-range targets quickly and accurately, a pedestrian detection algorithm based on a fused fully convolutional network is proposed. First, a fully convolutional detection network detects targets in the image and produces a series of candidate bounding boxes. Second, pixel-level classification results for the image are obtained using a semantic segmentation network trained with weak supervision. Finally, the candidate bounding boxes and the pixel-level classification results are fused to complete the detection. Experimental results show that the algorithm performs well in both detection speed and precision.
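The abstract does not state how the bounding boxes and the pixel-level results are fused. The following is a minimal sketch of one plausible rule, not the authors' method: a candidate box is kept only if a large enough fraction of its pixels is labelled "pedestrian" by the segmentation mask. The function name `fuse_boxes_with_mask`, the threshold `min_coverage`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def fuse_boxes_with_mask(boxes, mask, min_coverage=0.5):
    """Keep candidate boxes whose interior is mostly labelled as pedestrian.

    boxes: iterable of (x1, y1, x2, y2) integer pixel coordinates (x2, y2 exclusive).
    mask: 2-D boolean array, True where the segmentation predicts "pedestrian".
    min_coverage: hypothetical threshold on the fraction of pedestrian pixels.
    """
    kept = []
    for x1, y1, x2, y2 in boxes:
        region = mask[y1:y2, x1:x2]
        if region.size == 0:
            continue  # skip degenerate boxes
        coverage = float(region.mean())  # fraction of True pixels inside the box
        if coverage >= min_coverage:
            kept.append(((x1, y1, x2, y2), coverage))
    return kept

# Toy example: one pedestrian-shaped blob and three candidate boxes.
mask = np.zeros((10, 10), dtype=bool)
mask[2:8, 3:6] = True
candidates = [(3, 2, 6, 8), (0, 0, 4, 4), (6, 0, 10, 10)]
fused = fuse_boxes_with_mask(candidates, mask)
print(fused)
```

Only the first candidate overlaps the blob enough to survive; the other two are rejected because the segmentation finds little or no pedestrian evidence inside them.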
2018, 47(2): 203002. doi: 10.3788/IRLA201847.0203002
Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) have developed rapidly in image classification, computer vision, natural language processing, speech recognition, machine translation and semantic analysis, which has drawn researchers' close attention to the automatic generation of image descriptions by computers. At present, the main problems in image description are sparse input text data, over-fitting of the model, and difficult convergence of the model loss function. In this paper, NIC is used as the baseline model. To address data sparseness, the one-hot text encoding of the baseline model is replaced by word2vec to map the text. To prevent over-fitting, regularization terms are added to the model and Dropout is used. To improve word-order memory, the associative memory unit GRU is used for text generation. In the experiments, the AdamOptimizer optimizer is used to update the parameters iteratively. The experimental results show that the improved model has fewer parameters and converges significantly faster, the loss curves are smoother, the maximum loss is reduced to 2.91, and the model accuracy increases by nearly 15% compared with NIC. The experiments validate that using word2vec to map the text clearly alleviates the data sparseness problem, that adding regularization terms and using Dropout effectively prevents over-fitting, and that introducing the associative memory unit GRU greatly reduces the number of trained parameters, speeds up convergence, and improves the accuracy of the entire model.
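The parameter reduction attributed to the GRU can be illustrated with a simple count. The sketch below assumes that the baseline's recurrent unit is an LSTM (the abstract does not say so) and uses the textbook formulation with one bias vector per gate; the layer sizes are hypothetical.

```python
def recurrent_params(input_dim, hidden_dim, n_gates):
    """Trainable parameters of a gated recurrent layer.

    Each gate (or candidate state) has an input weight matrix, a recurrent
    weight matrix, and one bias vector.
    """
    per_gate = hidden_dim * (input_dim + hidden_dim) + hidden_dim
    return n_gates * per_gate

# Hypothetical sizes: 512-dimensional word vectors, 512 hidden units.
input_dim, hidden_dim = 512, 512
lstm_params = recurrent_params(input_dim, hidden_dim, n_gates=4)  # i, f, o gates + candidate
gru_params = recurrent_params(input_dim, hidden_dim, n_gates=3)   # update, reset gates + candidate
print(lstm_params, gru_params, gru_params / lstm_params)
```

Under these assumptions the GRU layer has three quarters of the parameters of an LSTM layer of the same size, regardless of the chosen dimensions.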
2018, 47(2): 203003. doi: 10.3788/IRLA201847.0203003
Generative adversarial networks (GANs) have shown promising potential in conditional image generation and appear particularly suitable for image super-resolution (SR) reconstruction. However, SR images reconstructed with GANs suffer from excessive smoothness and a lack of high-frequency detail. To address the problem that single-image super-resolution reconstruction ignores the spatio-temporal relationship between image frames, a multi-frame infrared image super-resolution reconstruction method based on generative adversarial networks (M-GANs) is proposed in this paper. First, motion compensation is used to register the low-resolution image frames. Second, a weight representation convolutional layer is used to calculate the weight transfer. Finally, the generative adversarial network is used to reconstruct the high-resolution image. Experimental results demonstrate that the proposed method surpasses current state-of-the-art methods in both subjective and objective evaluation.
2018, 47(2): 203004. doi: 10.3788/IRLA201847.0203004
Perceptual aliasing and perceptual variability caused by drastic appearance changes in the scene pose a great challenge to visual place recognition. Many existing CNN-based visual place recognition methods directly use the distance between CNN features and set thresholds to measure the similarity between two images, which performs poorly when the scene appearance changes drastically. A novel visual place recognition method based on multi-level feature difference maps is proposed. First, a CNN pretrained on a scene-centric dataset is used to extract features for perceptually different images of the same place and aliased images of different places. Then, according to the different properties of different CNN layers, a multi-level feature difference map is constructed from the multi-level CNN features to represent the difference between the two images. Finally, visual place recognition is treated as a binary classification task: the feature difference maps are used to train a new CNN classification model that determines whether two images are from the same place. Experimental results demonstrate that the feature difference map constructed from multi-level CNN features represents the difference between two images well, and that the proposed method effectively overcomes perceptual aliasing and perceptual variability and achieves better recognition performance under drastic appearance changes.
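The abstract does not specify how the multi-level feature difference map is built. As a rough sketch under stated assumptions, one option is to normalize the activations of each selected layer, take the element-wise absolute difference between the two images, and average over channels to get one spatial map per level. The function name, the normalization, the channel averaging, and the random arrays standing in for real CNN activations are all hypothetical.

```python
import numpy as np

def multilevel_difference_maps(feats_a, feats_b):
    """Return one spatial difference map per CNN level.

    feats_a, feats_b: lists of arrays of shape (channels, height, width),
    one array per selected CNN layer, in the same order for both images.
    """
    maps = []
    for fa, fb in zip(feats_a, feats_b):
        # L2-normalize each level so that levels with large activations do not dominate.
        fa = fa / (np.linalg.norm(fa) + 1e-12)
        fb = fb / (np.linalg.norm(fb) + 1e-12)
        # Element-wise absolute difference, averaged over the channel axis.
        maps.append(np.abs(fa - fb).mean(axis=0))
    return maps

# Random stand-ins for activations of two layers (shapes are hypothetical).
rng = np.random.default_rng(0)
feats_a = [rng.standard_normal((64, 28, 28)), rng.standard_normal((256, 14, 14))]
feats_b = [rng.standard_normal(f.shape) for f in feats_a]

diff_maps = multilevel_difference_maps(feats_a, feats_b)
same_maps = multilevel_difference_maps(feats_a, feats_a)
print([m.shape for m in diff_maps])
```

Maps of this kind, one per level, could then be fed to a binary classifier that decides whether the two images show the same place; comparing an image with itself yields all-zero maps.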
2018, 47(2): 203005. doi: 10.3788/IRLA201847.0203005
Texture synthesis is a hot research topic in computer graphics, vision, and image processing. Traditional texture synthesis methods generally extract effective feature patterns or statistics and generate random images under the constraint of this feature information. Generative adversarial networks (GANs) are a new type of deep network that can randomly generate new data with the same distribution as the observed data by training a generator and a discriminator in an adversarial learning mechanism. Inspired by this, a texture synthesis method based on GANs is proposed. The advantage of the algorithm is that it can generate more realistic texture images without iteration; the generated images are visually consistent with the observed texture image while also exhibiting randomness. A series of experiments on random and structured texture synthesis verifies the effectiveness of the proposed algorithm.
2018, 47(2): 203006. doi: 10.3788/IRLA201847.0203006
In view of the complicated backgrounds of fire images, the complicated process of extracting hand-crafted features, and the poor generalization ability, low accuracy, and high false alarm and miss rates of fire image recognition, a novel fire image detection method based on multilinear principal component analysis (MPCA) is presented in this paper. The fire image recognition model is established using MPCANet. The filters learned by the MPCA algorithm are used as the convolution kernels of the convolution layers of the deep learning network, features are extracted from high-dimensional images treated as tensor objects, and candle images and fireworks images are used as interference. Compared with other fire image recognition methods, the proposed method reaches a recognition accuracy of 97.5%, with a false alarm rate of 1.5% and a miss rate of 1%. Experimental results show that this method effectively solves the problems of fire image recognition.
2018, 47(2): 203007. doi: 10.3788/IRLA201847.0203007
Action recognition in natural scenes is affected by complex illumination conditions and cluttered backgrounds, and there is growing interest in solving these problems using 3D skeleton data. First, considering the spatio-temporal features of human actions, a spatio-temporal fusion deep learning network for action recognition is proposed. Second, view-invariant features are constructed from geometric features of the skeletons. Local spatial features are extracted by short-time CNN networks. A spatio-LSTM network is used to learn the relations between the joints of a skeleton frame, and a temporal LSTM is used to learn the spatio-temporal relations between skeleton sequences. Finally, the NTU RGB+D dataset is used to evaluate the network, and the proposed network achieves state-of-the-art performance for 3D human action analysis. Experimental results show that the network is strongly robust on view-invariant sequences.
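The abstract does not say which geometric features are used to obtain view invariance. One common choice, shown here only as an illustrative sketch, is the set of pairwise distances between joints, which does not change when the camera viewpoint rotates or translates the skeleton. The function names, the 25-joint random skeleton, and the rotation angle are assumptions.

```python
import numpy as np

def pairwise_joint_distances(joints):
    """joints: (J, 3) array of 3D joint positions.

    Returns the J*(J-1)/2 distances between all distinct joint pairs.
    """
    diff = joints[:, None, :] - joints[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    upper = np.triu_indices(len(joints), k=1)  # each pair counted once
    return dist[upper]

def rotation_about_y(theta):
    """Rotation matrix for a camera turning about the vertical axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

# Hypothetical skeleton with 25 joints, seen from two different viewpoints.
rng = np.random.default_rng(0)
skeleton = rng.standard_normal((25, 3))
other_view = skeleton @ rotation_about_y(0.7).T + np.array([1.0, -2.0, 0.5])

features = pairwise_joint_distances(skeleton)
features_other_view = pairwise_joint_distances(other_view)
print(features.shape, np.allclose(features, features_other_view))
```

Because rotations and translations preserve distances, both viewpoints give the same feature vector up to floating-point error.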
2018, 47(2): 203008. doi: 10.3788/IRLA201847.0203008
Learning rich representations efficiently plays an important role in RGB-D object recognition and is crucial to achieving high generalization performance. To address the long training time of convolutional neural networks, a Hybrid Convolutional Auto-Encoder Extreme Learning Machine structure (HCAE-ELM) is put forward, which consists of a Convolutional Neural Network (CNN) and an Auto-Encoder Extreme Learning Machine (AE-ELM) and combines the power of the CNN with the fast training of the AE-ELM. It uses convolution layers and pooling layers to abstract lower-level features from RGB and depth images separately. A shared layer then combines the features from each modality and feeds them to an AE-ELM to obtain higher-level features. The final abstracted features are fed to an ELM classifier, which leads to better generalization performance with faster learning speed. The performance of HCAE-ELM is evaluated on the RGB-D object dataset. Experimental results show that the proposed method achieves better testing accuracy with significantly shorter training time than deep learning methods and other ELM methods.
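The fast training of an ELM comes from fixing the hidden layer at random and solving for the output weights in closed form. The sketch below shows a basic ELM classifier only, not the HCAE-ELM pipeline; the hidden size, the ridge term `reg`, the tanh activation, and the toy data are assumptions.

```python
import numpy as np

def train_elm(X, y_onehot, n_hidden=64, reg=1e-3, seed=0):
    """Basic extreme learning machine with a ridge-regularized output layer."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))  # random input weights, never trained
    b = rng.standard_normal(n_hidden)                # random hidden biases, never trained
    H = np.tanh(X @ W + b)                           # hidden-layer activations
    # Output weights from one linear solve: (H^T H + reg*I) beta = H^T Y
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ y_onehot)
    return W, b, beta

def predict_elm(X, model):
    W, b, beta = model
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)

# Toy data: two well-separated Gaussian blobs in five dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, (100, 5)), rng.normal(2.0, 1.0, (100, 5))])
labels = np.repeat([0, 1], 100)

model = train_elm(X, np.eye(2)[labels])
accuracy = float((predict_elm(X, model) == labels).mean())
print(accuracy)
```

No gradient descent is involved: training reduces to a single linear solve, which is why ELM-based models can be trained much faster than networks trained by backpropagation.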
2018, 47(2): 203009. doi: 10.3788/IRLA201847.0203009
Fatigue driving is a main cause of traffic accidents and has a huge influence on public safety. Because lighting changes and glasses significantly increase the difficulty of monitoring human eyes, fatigue detection remains an unsolved problem. A new driver fatigue detection method based on morphological infrared features and deep learning is proposed. Facial images are captured using an 850 nm infrared light source. Faces and the landmarks indicating the eye region are located by a Convolutional Neural Network (CNN) using morphological features in the infrared image. Next, a filter module that measures head displacement is added to reduce the impact of posture changes. The collected facial states are then transformed into sequential data. Finally, the sequential data are passed to a Long Short-Term Memory (LSTM) network, which detects the fatigue state by analyzing the sequential correlations. Experimental results show that the accuracy of the fatigue detection algorithm reaches 94.48% with an average detection time of 65.64 ms.