<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JCC</journal-id><journal-title-group><journal-title>Journal of Computer and Communications</journal-title></journal-title-group><issn pub-type="epub">2327-5219</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jcc.2020.811003</article-id><article-id pub-id-type="publisher-id">JCC-104036</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Computer Science&amp;Communications</subject></subj-group></article-categories><title-group><article-title>
 
 
  Deep Convolutional Feature Fusion Model for Multispectral Maritime Imagery Ship Recognition
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Xiaohua</surname><given-names>Qiu</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Min</surname><given-names>Li</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Liqiong</surname><given-names>Zhang</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Rui</surname><given-names>Zhao</given-names></name><xref ref-type="aff" rid="aff2"><sup>2</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib></contrib-group><aff id="aff1"><addr-line>Xi’an Research Institute of Hi-Tech, Xi’an, China</addr-line></aff><aff id="aff2"><addr-line>School of Information Engineering, Engineering University of PAP, Xi’an, China</addr-line></aff><pub-date pub-type="epub"><day>05</day><month>11</month><year>2020</year></pub-date><volume>08</volume><issue>11</issue><fpage>23</fpage><lpage>43</lpage><history><date date-type="received"><day>9,</day>	<month>October</month>	<year>2020</year></date><date date-type="rev-recd"><day>9,</day>	<month>November</month>	<year>2020</year>	</date><date date-type="accepted"><day>12,</day>	<month>November</month>	<year>2020</year></date></history><permissions><copyright-statement>&#169; Copyright  2014 by authors and Scientific Research Publishing Inc. 
</copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  Multispectral data, which combines visible and infrared object information, is a promising data source for automatic maritime ship recognition. To take advantage of both deep convolutional neural networks and multispectral data, we model the multispectral ship recognition task as a convolutional feature fusion problem and propose a feature fusion architecture called Hybrid Fusion. We fine-tune the VGG-16 model pre-trained on ImageNet on three-channel single-spectral images and four-channel multispectral images, and use existing regularization techniques to avoid over-fitting. Hybrid Fusion and three other feature fusion architectures are investigated. Each fusion architecture consists of a visible image branch and an infrared image branch for feature extraction, in which the pre-trained and fine-tuned VGG-16 models serve as feature extractors. In each fusion architecture, the image features of the two branches are first extracted from the same layer or from different layers of the VGG-16 model. The features extracted from the two branches are then flattened and concatenated to produce a multispectral feature vector, which is finally fed into a classifier to perform ship recognition. Furthermore, based on these fusion architectures, we also evaluate the recognition performance of a feature vector normalization method and three combinations of feature extractors. Experimental results on the visible and infrared ship (VAIS) dataset show that the best Hybrid Fusion achieves 89.6% mean per-class recognition accuracy on daytime paired images and 64.9% on nighttime infrared images, outperforming the state-of-the-art method by 1.4% and 3.9%, respectively.
 
</p></abstract><kwd-group><kwd>Deep Convolutional Neural Network</kwd><kwd> Feature Fusion</kwd><kwd> Multispectral Data</kwd><kwd> Object Recognition</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>By integrating complementary information from visible (VIS) and infrared (IR) images, multispectral data has recently received much attention in machine learning and computer vision [<xref ref-type="bibr" rid="scirp.104036-ref1">1</xref>] [<xref ref-type="bibr" rid="scirp.104036-ref2">2</xref>] [<xref ref-type="bibr" rid="scirp.104036-ref3">3</xref>] [<xref ref-type="bibr" rid="scirp.104036-ref4">4</xref>] [<xref ref-type="bibr" rid="scirp.104036-ref5">5</xref>]. VIS images are sensitive to illumination variation and unfavourable weather conditions, which degrade the performance of computer vision systems built on these images. A thermal camera can ameliorate the problem, but it cannot provide images with the same high resolution as a visible camera, and its image quality often decreases during daytime due to a high background temperature. Multispectral images, which combine the two, have therefore been successfully applied to face recognition [<xref ref-type="bibr" rid="scirp.104036-ref6">6</xref>] [<xref ref-type="bibr" rid="scirp.104036-ref7">7</xref>] [<xref ref-type="bibr" rid="scirp.104036-ref8">8</xref>] [<xref ref-type="bibr" rid="scirp.104036-ref9">9</xref>], and in recent years have also been widely applied, by exploiting deep learning, to object recognition [<xref ref-type="bibr" rid="scirp.104036-ref10">10</xref>], person re-identification [<xref ref-type="bibr" rid="scirp.104036-ref11">11</xref>], pedestrian detection [<xref ref-type="bibr" rid="scirp.104036-ref12">12</xref>], and object tracking [<xref ref-type="bibr" rid="scirp.104036-ref13">13</xref>].</p><p>As is well known, after the breakthrough research by Krizhevsky et al. 
[<xref ref-type="bibr" rid="scirp.104036-ref14">14</xref>], deep convolutional neural networks (CNN) have achieved remarkable success for a large variety of tasks, and quickly became the dominant tool in computer vision. Meanwhile, some well-known deep CNN models have been reported, such as Oxford VGG Model [<xref ref-type="bibr" rid="scirp.104036-ref15">15</xref>], Google Inception Model [<xref ref-type="bibr" rid="scirp.104036-ref16">16</xref>] and Microsoft ResNet Model [<xref ref-type="bibr" rid="scirp.104036-ref17">17</xref>]. One factor for the dramatic improvement in performance of deep CNN is that many challenging datasets for training with millions of labeled examples are harvested from the web, such as ImageNet [<xref ref-type="bibr" rid="scirp.104036-ref18">18</xref>]. However, a large-scale training set is expensive or difficult to collect in the real world, and training a large neural network on a small dataset would lead to poor performance due to the problem of overfitting. The lack of a large-scale training set forces the computer vision community to find practical workarounds. Much recent effort [<xref ref-type="bibr" rid="scirp.104036-ref19">19</xref>] [<xref ref-type="bibr" rid="scirp.104036-ref20">20</xref>] [<xref ref-type="bibr" rid="scirp.104036-ref21">21</xref>] has been dedicated to developing methods that fine-tune the well-known pre-trained deep CNN models or directly take these models as feature extractors. Research in vision tasks based on multispectral data follows the same trend, e.g., action recognition [<xref ref-type="bibr" rid="scirp.104036-ref22">22</xref>], pedestrian detection [<xref ref-type="bibr" rid="scirp.104036-ref23">23</xref>], object recognition [<xref ref-type="bibr" rid="scirp.104036-ref10">10</xref>]. 
In previous work on multispectral data, whether the model is fine-tuned after feature fusion or features are extracted directly without fine-tuning, the features of VIS and IR images are produced at the same layer of the pre-trained deep CNN model. However, due to the aforementioned differences between VIS and IR images, the same layer may not yield the best features for both spectra, so feature fusion cannot take full advantage of multispectral data. How the features of VIS and IR images should be fused in a pre-trained or fine-tuned deep CNN model to achieve the best performance in a vision task therefore remains an open problem.</p><p>In this paper, we focus on using the pre-trained or fine-tuned deep CNN model to extract features of VIS and IR images, and propose a novel feature fusion architecture called Hybrid Fusion for multispectral maritime ship recognition. First, we model the multispectral maritime ship recognition task as a convolutional feature fusion problem. Second, we evaluate the feature representation ability of the pre-trained or fine-tuned deep CNN model for multispectral data. Third, Hybrid Fusion and three other feature fusion architectures are investigated. Finally, we compare Hybrid Fusion with other reported methods. Our idea is that combining high-level features of the VIS image with middle-level features of the IR image can provide rich multispectral information to the classifier for the final prediction. Because of the large gap between feature values at different layers, a feature normalization method is applied. In addition, three combinations of the feature extractors used for VIS and IR images are investigated.</p><p>Our major contribution is fourfold: First, we propose a feature fusion architecture named Hybrid Fusion, which combines high-level features of the VIS image and middle-level features of the IR image. 
Second, we investigate four distinct feature fusion architectures, namely Early Fusion, Halfway Fusion, Late Fusion and Hybrid Fusion, and evaluate them on a public multispectral maritime ship image collection, the VAIS dataset [<xref ref-type="bibr" rid="scirp.104036-ref10">10</xref>]. Third, we fine-tune the pre-trained VGG-16 model on both single-spectral and multispectral images, and exploit three existing regularization techniques to avoid over-fitting. Fourth, the best Hybrid Fusion achieves 89.6% mean per-class recognition accuracy on the daytime paired images of the VAIS dataset, outperforming the state-of-the-art method by 1.4%, and also achieves 64.9% on nighttime and 68.6% on all-time IR images.</p></sec><sec id="s2"><title>2. Related Work</title><p>Object recognition with deep convolutional feature fusion. Initializing a network with transferred features, whether they come from the low-level, middle-level or high-level layers of a pre-trained deep CNN, can improve generalization performance even after substantial fine-tuning on a new task [<xref ref-type="bibr" rid="scirp.104036-ref24">24</xref>]. Schwarz et al. [<xref ref-type="bibr" rid="scirp.104036-ref25">25</xref>] presented a feature fusion model for multi-modal object recognition, in which a pre-trained AlexNet model [<xref ref-type="bibr" rid="scirp.104036-ref14">14</xref>] is used to extract features from the last two fully connected layers. An extension of the fusion model further improves object recognition accuracy by fine-tuning the pre-trained AlexNet with multi-modal training data [<xref ref-type="bibr" rid="scirp.104036-ref19">19</xref>]. Furthermore, Zia et al. 
[<xref ref-type="bibr" rid="scirp.104036-ref26">26</xref>] proposed a hybrid 2D/3D convolutional neural network initialized by the pre-trained VGG-16 model [<xref ref-type="bibr" rid="scirp.104036-ref15">15</xref>], and fused the features separately extracted from the fully connected layers of three network architectures. Another interesting work [<xref ref-type="bibr" rid="scirp.104036-ref27">27</xref>] presented an unsupervised feature learning framework. In this framework, the pre-trained VGG-f model [<xref ref-type="bibr" rid="scirp.104036-ref28">28</xref>] is taken as a feature extractor, and a recursive neural network [<xref ref-type="bibr" rid="scirp.104036-ref29">29</xref>] is then used to reduce the dimension of the extracted features and learn high-level features. The aforementioned methods focus not only on convolutional feature fusion but also on the processing of modal data. The goal of our work is to leverage convolutional feature fusion and limited multispectral data to maximize ship recognition accuracy.</p><p>Ship recognition on single spectral image. Kanjir et al. [<xref ref-type="bibr" rid="scirp.104036-ref30">30</xref>] provided an overview of existing literature on ship detection and classification from optical satellite imagery. However, most of the reviewed methods operate on optical remote sensing images, whereas our work focuses on ship recognition from VIS and IR images. We therefore mainly review work on vessel/ship recognition from VIS images, since little work exists on IR images. Khellal et al. [<xref ref-type="bibr" rid="scirp.104036-ref31">31</xref>] used an extreme learning machine (ELM) to learn discriminative CNN features for IR image maritime ship recognition. Fouad et al. [<xref ref-type="bibr" rid="scirp.104036-ref32">32</xref>] presented an experimental study to investigate the ability of deep CNN features to capture details of VIS image maritime ships for fine-grained classification. Cuong et al. 
[<xref ref-type="bibr" rid="scirp.104036-ref33">33</xref>] trained and tested AlexNet with a dataset of 130,000 VIS images of maritime ships, which are collected from website ShipSpotting<sup>1</sup>. Gundogdu et al. [<xref ref-type="bibr" rid="scirp.104036-ref34">34</xref>] [<xref ref-type="bibr" rid="scirp.104036-ref35">35</xref>] introduced a large-scale VIS image maritime vessels dataset, namely MARVEL, for the fine-grained visual categorization, recognition, retrieval and verification tasks. To achieve the baseline results, both extracting feature from pre-trained VGG-f model and training AlexNet model have been used to perform the aforementioned tasks. To improve the performance of the tasks on MARVEL dataset, Solmaz et al. [<xref ref-type="bibr" rid="scirp.104036-ref36">36</xref>] exploited a multi-task learning framework based on deep CNN models to accompany deep metric learning with a proposed loss function. Milicevic et al. [<xref ref-type="bibr" rid="scirp.104036-ref37">37</xref>] used the training dataset of MARVEL to fine-tune VGG-19 model [<xref ref-type="bibr" rid="scirp.104036-ref15">15</xref>] pre-trained on ImageNet, then boosted the recognition accuracy by 3%. Huang et al. [<xref ref-type="bibr" rid="scirp.104036-ref38">38</xref>] exploited low-level and high-level features to classify ship categories on VIS images. The proposed method learns the high-level features via fine-tuning pre-trained deep CNN model, and incorporates the multi-scales rotation invariant features obtained by Gabor filter and multi-scale completed local binary patterns (MS-CLBP), then these features are fed into support vector machine (SVM) classifier. This method was extended to improve recognition performance by replacing SVM with ELM classifier in [<xref ref-type="bibr" rid="scirp.104036-ref39">39</xref>]. Shi et al. 
[<xref ref-type="bibr" rid="scirp.104036-ref40">40</xref>] proposed a classification framework consisting of a multi-feature ensemble based on convolutional neural network (ME-CNN).</p><p>Ship recognition on multispectral images. Currently, there is little literature on multispectral maritime ship recognition due to the lack of corresponding multispectral data. The VAIS dataset, which includes VIS and IR images, is the only public multispectral maritime ship dataset for image classification or object recognition. Zhang et al. [<xref ref-type="bibr" rid="scirp.104036-ref10">10</xref>] reported the VAIS dataset in detail, and combined the results of gnostic fields and deep CNN to provide the baseline recognition accuracy on this dataset, namely 87.4% mean per-class recognition accuracy during the daytime and 61% at nighttime. They also tried to fine-tune the pre-trained VGG-16 model, but failed to improve recognition performance. Aziz et al. [<xref ref-type="bibr" rid="scirp.104036-ref41">41</xref>] used a large-scale visible ship dataset to train a deep CNN, and then fine-tuned their pre-trained CNN model with the training images of the VAIS dataset. Santos et al. [<xref ref-type="bibr" rid="scirp.104036-ref42">42</xref>] proposed a decision-level fusion of convolutional neural networks using a probabilistic model, in which features are extracted from the last convolutional activation map of the pre-trained VGG-19 model. Zhang et al. [<xref ref-type="bibr" rid="scirp.104036-ref43">43</xref>] presented a multi-feature structure fusion based on spectral regression discriminant analysis (SF-SRDA) by combining structural fusion with linear discriminant analysis, and used the pre-trained models VGG-19 and ResNet-152 [<xref ref-type="bibr" rid="scirp.104036-ref17">17</xref>] to achieve a promising result. The above works have achieved good ship recognition performance. 
However, they did not consider how the individual convolutional layers of the pre-trained or fine-tuned models differ in their suitability for ship recognition in each spectrum; our work considers this difference and proposes a Hybrid Fusion model based on it.</p></sec><sec id="s3"><title>3. Proposed Feature Fusion Method</title><p>Intuitively, VIS and IR images provide complementary visual information in depicting ship objects. Encouraged by the recent tremendous advances in deep learning techniques, as well as inspired by the work on multispectral pedestrian detection [<xref ref-type="bibr" rid="scirp.104036-ref44">44</xref>], we explore the effectiveness of using the VGG-16 model pre-trained on the ImageNet dataset and fine-tuned on the VAIS dataset to perform multispectral ship recognition. The structure of our method is shown in <xref ref-type="fig" rid="fig1">Figure 1</xref>.</p><p>The proposed fusion framework mainly includes four stages:</p><p>1) Image preprocessing: as the pre-trained VGG-16 model expects three-channel images of 224&#215;224 pixels as input, we simply replicate the single IR channel three times. Both VIS and IR images are resized to 224&#215;224 using nearest-neighbor interpolation.</p><p>2) Feature extraction: there are two feature extraction branches, the visible image branch (abbreviated as VGG-16-VIS) and the infrared image branch (abbreviated as VGG-16-IR), as shown in <xref ref-type="fig" rid="fig1">Figure 1</xref>. Each branch takes the pre-trained or fine-tuned VGG-16 model as its feature extractor. 
In addition, the image features of the two branches are extracted from the same layer or from different layers of the VGG-16 model, depending on the feature fusion architecture.</p><p>3) Feature fusion: the features extracted from the two branches are flattened into feature vectors, which are then concatenated to produce a multispectral feature vector representing the maritime ship.</p><p>4) Classification: before the fused feature vector is fed into a linear SVM classifier for the final prediction, it is normalized by the l2-norm (abbreviated as L2) normalization method. For Hybrid Fusion, the feature vectors are normalized before feature fusion because of the large gap between feature values at different layers.</p><p>Additionally, the training samples of the VIS and IR images in the VAIS dataset are each used to fine-tune the pre-trained VGG-16 model in an end-to-end way. The two resulting fine-tuned VGG-16 models are also taken as feature extractors.</p><sec id="s3_1"><title>3.1. Feature Fusion Architecture</title><p>Because features at different levels of VGG-16 correspond to various levels of semantic information and fine visual details [<xref ref-type="bibr" rid="scirp.104036-ref45">45</xref>], feature fusion at different layers leads to different recognition results. The multispectral ship recognition task is therefore modelled as a convolutional feature fusion problem, i.e., the question of which feature fusion architecture achieves the best recognition performance. We propose a feature fusion architecture called Hybrid Fusion, which combines high-level features of the VIS image and middle-level features of the IR image. We investigate Hybrid Fusion as well as Early Fusion, Halfway Fusion and Late Fusion. These fusion architectures integrate two-branch convolutional features at different layers of the VGG-16 model, as shown in <xref ref-type="fig" rid="fig2">Figure 2</xref>. 
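To make the scale of the candidate features concrete, the following minimal sketch tabulates the flattened length of the feature taken after each of the five convolutional blocks and the two fully connected layers. It assumes the standard VGG-16 configuration with a 224 x 224 x 3 input and features read after each block's max-pooling layer; the layer keys and helper names are illustrative and are not taken from the authors' code.
<preformat>
```python
# Assumption: standard VGG-16 with a 224 x 224 x 3 input. Each entry is the
# (height, width, channels) shape after the max-pooling layer of one
# convolutional block. The keys are illustrative names, not the paper's code.
POOLED_SHAPES = {
    "block1": (112, 112, 64),
    "block2": (56, 56, 128),
    "block3": (28, 28, 256),
    "block4": (14, 14, 512),
    "block5": (7, 7, 512),
}
# The first two fully connected layers of VGG-16 each output 4096 values.
FC_DIMS = {"fc6": 4096, "fc7": 4096}


def flat_dim(layer):
    """Length of the feature vector obtained by flattening one layer's output."""
    if layer in FC_DIMS:
        return FC_DIMS[layer]
    height, width, channels = POOLED_SHAPES[layer]
    return height * width * channels


def fused_dim(vis_layer, ir_layer):
    """Length of the concatenated VIS + IR feature vector."""
    return flat_dim(vis_layer) + flat_dim(ir_layer)


for name in list(POOLED_SHAPES) + list(FC_DIMS):
    print(name, flat_dim(name))

# Same-layer fusion at block5 versus a hybrid of VIS fc7 and IR block5.
print("block5 + block5:", fused_dim("block5", "block5"))
print("fc7 (VIS) + block5 (IR):", fused_dim("fc7", "block5"))
```
</preformat>
Under these assumptions, two same-layer block5 features give a 50,176-dimensional fused vector, while pairing a fully connected VIS feature with a block5 IR feature gives 29,184 dimensions. 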
Each branch represents a single spectral image.</p><p>Early Fusion combines the feature maps from VIS and IR images immediately after the first and second convolutional layers (C1 and C2 layers) followed by a Max Pool layer (this fusion architecture is omitted from <xref ref-type="fig" rid="fig2">Figure 2</xref>). Since the C1 and C2 layers capture low-level visual features, such as color, corners and line segments, this fusion architecture fuses features at the low level.</p><p>Halfway Fusion also implements feature fusion at convolutional layers. Unlike Early Fusion, it fuses the features after the third, fourth and fifth convolutional layers (C3 - C5 layers) followed by a Max Pool layer, as shown in <xref ref-type="fig" rid="fig2">Figure 2</xref>(a). Features from the C3 - C5 layers contain more semantic information than features from the C1 and C2 layers, while retaining some fine visual details. This fusion architecture fuses features at the middle level.</p><p>Late Fusion combines features extracted from the first and second fully connected layers (F6 and F7 layers) followed by a ReLU activation layer, and thus performs feature fusion at the fully connected stage, as shown in <xref ref-type="fig" rid="fig2">Figure 2</xref>(b). Conventionally, the features of the F6 and F7 layers are used as new representations of ship objects. This fusion architecture performs high-level feature fusion.</p><p>Hybrid Fusion combines high-level features of VIS images and middle-level features of IR images, that is, the F6 and F7 layer features of VIS images and the C3 - C5 layer features of IR images, as shown in <xref ref-type="fig" rid="fig2">Figure 2</xref>(c), because the VGG-16 model represents each spectral image differently at different levels. Hybrid Fusion thus leverages feature representations from different levels for multispectral images.</p></sec><sec id="s3_2"><title>3.2. 
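Feature Fusion Method</title><p>After two-branch convolutional features are extracted from different levels of the pre-trained or fine-tuned VGG-16 model, the features of each branch are flattened into a feature vector. Following the work on multispectral data in [<xref ref-type="bibr" rid="scirp.104036-ref46">46</xref>], the concatenation fusion method is used to fuse the two feature vectors. The goal of fusion is to integrate two feature vectors <inline-formula><tex-math>F^{VIS}</tex-math></inline-formula> and <inline-formula><tex-math>F^{IR}</tex-math></inline-formula> into a fused feature vector <inline-formula><tex-math>F^{F}</tex-math></inline-formula>, where <inline-formula><tex-math>F^{VIS}</tex-math></inline-formula> and <inline-formula><tex-math>F^{IR}</tex-math></inline-formula> denote the feature vectors of the VIS and IR images, respectively. The concatenation fusion method directly concatenates the two feature vectors, and can be defined as:</p><p><disp-formula><label>(1)</label><tex-math>F^{F} = f^{concat}(F^{VIS}, F^{IR})</tex-math></disp-formula></p><p><disp-formula><label>(2)</label><tex-math>F^{F}_{d_1} = F^{VIS}_{d_1}</tex-math></disp-formula></p><p><disp-formula><label>(3)</label><tex-math>F^{F}_{D_1 + d_2} = F^{IR}_{d_2}</tex-math></disp-formula></p><p>where <inline-formula><tex-math>F^{VIS}_{d_1}</tex-math></inline-formula> denotes the <inline-formula><tex-math>d_1</tex-math></inline-formula>-th value of <inline-formula><tex-math>F^{VIS}</tex-math></inline-formula>, <inline-formula><tex-math>F^{IR}_{d_2}</tex-math></inline-formula> denotes the <inline-formula><tex-math>d_2</tex-math></inline-formula>-th value of <inline-formula><tex-math>F^{IR}</tex-math></inline-formula>, <inline-formula><tex-math>F^{F}_{d_1}</tex-math></inline-formula> is the <inline-formula><tex-math>d_1</tex-math></inline-formula>-th value of <inline-formula><tex-math>F^{F}</tex-math></inline-formula>, and <inline-formula><tex-math>F^{F}_{D_1 + d_2}</tex-math></inline-formula> is the <inline-formula><tex-math>(D_1 + d_2)</tex-math></inline-formula>-th value of <inline-formula><tex-math>F^{F}</tex-math></inline-formula>, with <inline-formula><tex-math>1 \le d_1 \le D_1</tex-math></inline-formula>, <inline-formula><tex-math>1 \le d_2 \le D_2</tex-math></inline-formula>, <inline-formula><tex-math>F^{VIS} \in \mathbb{R}^{D_1}</tex-math></inline-formula>, <inline-formula><tex-math>F^{IR} \in \mathbb{R}^{D_2}</tex-math></inline-formula> and <inline-formula><tex-math>F^{F} \in \mathbb{R}^{D_1 + D_2}</tex-math></inline-formula>. The fused vector therefore has dimension <inline-formula><tex-math>D_1 + D_2</tex-math></inline-formula>.</p></sec><sec id="s3_3"><title>3.3. Normalization and Classification</title><p>In order to evaluate the multispectral ship recognition performance of the four feature fusion architectures, we use a linear SVM as the classifier. It is crucial to normalize the feature vector before feeding it into the linear SVM classifier, for three reasons. First, it prevents features with a small value range from being dominated by features with a large value range, which improves the performance of the linear SVM classifier. Second, it brings values measured on different scales to a common scale, which facilitates comparison and joint processing. Third, it reduces numerical complexity in computation. To normalize the features of the training and test data, we use the L2 normalization method. 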
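As a minimal illustration of Equations (1)-(3) followed by normalization, the sketch below concatenates two toy feature vectors after dividing each by its Euclidean norm. It uses plain Python lists; the function names and toy values are assumptions made for illustration, and normalizing each branch before concatenation follows the ordering described for Hybrid Fusion.
<preformat>
```python
import math


def concat_fusion(f_vis, f_ir):
    """Concatenate two flattened feature vectors, as in Equations (1)-(3)."""
    return list(f_vis) + list(f_ir)


def l2_normalize(x, eps=1e-12):
    """Divide every element by the Euclidean (L2) norm of the vector.

    eps is an illustrative guard against division by zero for an all-zero
    vector; it is not part of the method described in the text.
    """
    norm = math.sqrt(sum(v * v for v in x))
    return [v / max(norm, eps) for v in x]


# Toy stand-ins for the flattened branch features (hypothetical values).
f_vis = [3.0, 4.0]        # D1 = 2, norm 5
f_ir = [0.0, 5.0, 12.0]   # D2 = 3, norm 13

# Normalize each branch first, then concatenate into a (D1 + D2)-vector.
fused = concat_fusion(l2_normalize(f_vis), l2_normalize(f_ir))
print(len(fused))
print(fused)
```
</preformat>
Normalizing each branch separately gives both branches unit norm, so neither dominates the fused vector purely because of its value range. 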
L2 normalization is a normalization method commonly used in machine learning. The main idea is to divide each element of a vector by the L2 norm of the vector, as defined in Equation (4).</p><p><disp-formula><label>(4)</label><tex-math>x_d^{L2} = \frac{x_d}{\sqrt{\sum_{k=1}^{D} |x_k|^2}}</tex-math></disp-formula></p><p>where <inline-formula><tex-math>x_d^{L2}</tex-math></inline-formula> represents the <inline-formula><tex-math>d</tex-math></inline-formula>-th value of the <inline-formula><tex-math>D</tex-math></inline-formula>-dimensional feature vector after L2 normalization, <inline-formula><tex-math>x_d</tex-math></inline-formula> is the <inline-formula><tex-math>d</tex-math></inline-formula>-th value of the <inline-formula><tex-math>D</tex-math></inline-formula>-dimensional feature vector before normalization, <inline-formula><tex-math>|\cdot|</tex-math></inline-formula> denotes the absolute value operator and <inline-formula><tex-math>1 \le d \le D</tex-math></inline-formula>.</p></sec></sec><sec id="s4"><title>4. Experiments</title><sec id="s4_1"><title>4.1. Dataset</title><p>To investigate our four feature fusion architectures for ship category recognition, we use the publicly available VAIS dataset [<xref ref-type="bibr" rid="scirp.104036-ref10">10</xref>]. To the best of our knowledge, it is the only existing public database of paired VIS and IR ship imagery. The dataset contains 2865 images (1623 VIS images and 1242 IR images), comprising 1088 unregistered “VIS-IR” image pairs and 154 nighttime IR images, and includes 6 categories: cargo ships, medium-other ships, passenger ships, sailing ships, small boats and tug boats. The images are captured at different distances and at various times of day, including dusk and dawn. Therefore, some images are high-resolution while others appear dim and are hard to recognize even by manual inspection. In the dataset, the paired VIS-IR image set is partitioned into 539 image pairs for training and 549 image pairs for testing. A sample pair from VAIS is illustrated in <xref ref-type="fig" rid="fig1">Figure 1</xref>, and <xref ref-type="table" rid="table1">Table 1</xref> shows the number of training and test samples for each class. 
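Because the classes are unbalanced (for example, 20 tug boats against 195 small boats in the daytime test set), a per-class average weights every category equally. The sketch below shows one common reading of mean per-class accuracy, the average of the per-class recalls, on toy labels; this definition and the function name are assumptions, since the metric is not spelled out in the text.
<preformat>
```python
from collections import defaultdict


def mean_per_class_accuracy(y_true, y_pred):
    """Average, over classes, of the fraction of each class predicted correctly.

    Assumption: this is the usual definition (mean of per-class recalls).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy, unbalanced example: 8 "small" samples and 2 "tug" samples.
y_true = ["small"] * 8 + ["tug"] * 2
y_pred = ["small"] * 8 + ["small", "tug"]   # one tug is misclassified

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(overall)                                   # 0.9
print(mean_per_class_accuracy(y_true, y_pred))   # 0.75
```
</preformat>
In this toy case the overall accuracy is 0.9 but the mean per-class accuracy is 0.75, because the single error falls on the rare class. 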
Following the baseline method [<xref ref-type="bibr" rid="scirp.104036-ref10">10</xref>], the</p><table-wrap id="table1" ><label><xref ref-type="table" rid="table1">Table 1</xref></label><caption><title>Number of training and test images per class for the paired images and the nighttime and all-time IR images of the VAIS dataset</title></caption><table><tbody><thead><tr><th align="center" valign="middle" >VAIS</th><th align="center" valign="middle" >Time</th><th align="center" valign="middle" >Cargo</th><th align="center" valign="middle" >Medium-other</th><th align="center" valign="middle" >Passenger</th><th align="center" valign="middle" >Sailing</th><th align="center" valign="middle" >Small</th><th align="center" valign="middle" >Tug</th><th align="center" valign="middle" >Total</th></tr></thead><tr><td align="center" valign="middle" >Train set</td><td align="center" valign="middle" >Daytime</td><td align="center" valign="middle" >83</td><td align="center" valign="middle" >62</td><td align="center" valign="middle" >58</td><td align="center" valign="middle" >148</td><td align="center" valign="middle" >158</td><td align="center" valign="middle" >30</td><td align="center" valign="middle" >539</td></tr><tr><td align="center" valign="middle" >Test set</td><td align="center" valign="middle" >Daytime</td><td align="center" valign="middle" >63</td><td align="center" valign="middle" >76</td><td align="center" valign="middle" >59</td><td align="center" valign="middle" >136</td><td align="center" valign="middle" >195</td><td align="center" valign="middle" >20</td><td align="center" valign="middle" >549</td></tr><tr><td align="center" valign="middle" ></td><td align="center" valign="middle" >Nighttime</td><td align="center" valign="middle" >34</td><td align="center" valign="middle" >14</td><td align="center" valign="middle" >12</td><td align="center" valign="middle" >15</td><td align="center" valign="middle" >30</td><td align="center" valign="middle" >49</td><td align="center" 
valign="middle" >154</td></tr><tr><td align="center" valign="middle" ></td><td align="center" valign="middle" >All time</td><td align="center" valign="middle" >97</td><td align="center" valign="middle" >90</td><td align="center" valign="middle" >71</td><td align="center" valign="middle" >151</td><td align="center" valign="middle" >225</td><td align="center" valign="middle" >69</td><td align="center" valign="middle" >703</td></tr></tbody></table></table-wrap><p>same training and test data are used, and the mean per-class recognition accuracy is taken as the evaluation metric in the experiments.</p></sec><sec id="s4_2"><title>4.2. Implementation Platform and Details</title><p>Our processing platform is a personal computer running Ubuntu 16.04 with a single Intel Core i7-7770K CPU (4.20 GHz) and 16 GB of random access memory (RAM). An NVIDIA GTX1080Ti graphics processing unit (GPU) is used for deep CNN computations. The software environment is Keras, a high-level neural network application programming interface written in Python, with the TensorFlow backend. Our experiment is divided into two stages: features are first extracted and stored, and are then fed into a linear SVM classifier. We use the LibSVM toolbox [<xref ref-type="bibr" rid="scirp.104036-ref47">47</xref>], which has been packaged as a module of scikit-learn<sup>2</sup>, as the classifier for ship classification; the penalty parameter C is set to 10 and the kernel function is set to linear. Due to limited RAM, we did not perform experiments on feature fusion at the first convolutional layer, but experiments at the second convolutional layer can reflect the performance of Early Fusion.</p><p>It is not easy to fine-tune the pre-trained VGG-16 model end to end on a small-scale dataset like VAIS, especially on IR images. The main problem is how to avoid over-fitting while still ensuring model convergence during fine-tuning. 
Some existing regularization techniques [<xref ref-type="bibr" rid="scirp.104036-ref48">48</xref>], such as data augmentation, dropout and L<sup>2</sup> parameter regularization (also known as weight decay), are used to fine-tune the pre-trained model on VIS and IR images. Additionally, to investigate whether the VGG-16 model can learn to fuse its inputs implicitly, four-channel multispectral images consisting of VIS and IR images (abbreviated as 4C VIS-IR) are also used as inputs to fine-tune the model. In the fine-tuning experiments, the initial learning rate is set to 0.001 for VIS images and 0.0001 for IR images and 4C VIS-IR images. A stochastic gradient descent optimizer is used, with the momentum set to 0.9 and the decay set to 0.00001. Training runs for 50 epochs with a batch size of 32. Random horizontal flips and random vertical flips are used for online data augmentation. Dropout is applied after the second fully connected layer with a rate of 0.5. L<sup>2</sup> weight decay is applied to the last fully connected layer with a value of 0.1.</p></sec><sec id="s4_3"><title>4.3. Experimental Results</title><sec id="s4_3_1"><title>4.3.1. Evaluation of the Pre-Trained and Fine-Tuned Models</title><p>Firstly, we evaluate the effects of existing regularization techniques when fine-tuning the VGG-16 model. <xref ref-type="table" rid="table2">Table 2</xref> compares the recognition performance with and without regularization techniques. Accuracy is reported as the mean and standard deviation over 10 groups of fine-tuning experiments. <xref ref-type="fig" rid="fig3">Figure 3</xref> and <xref ref-type="fig" rid="fig4">Figure 4</xref> give the accuracy and loss curves of fine-tuning the VGG-16 model on VIS and IR images, respectively, in one group of experiments. 
As shown in <xref ref-type="table" rid="table2">Table 2</xref>, data augmentation consistently improves the recognition accuracy on 4C VIS-IR images, and combining all three regularization techniques achieves the best result. However, for VIS and IR images, fine-tuning with dropout or L<sup>2</sup> weight decay alone yields a slightly higher average accuracy and a smaller standard deviation than fine-tuning with data augmentation. A combination of two or more regularization techniques does not significantly improve the performance of the fine-tuned model. Combining dropout and data augmentation even degrades the model when fine-tuning on VIS images. Furthermore, it can be observed from <xref ref-type="fig" rid="fig3">Figure 3</xref> and <xref ref-type="fig" rid="fig4">Figure 4</xref> that over-fitting is worse on IR images than on VIS images. Over-fitting on VIS images is easily overcome by any of the three regularization techniques, as shown in <xref ref-type="fig" rid="fig3">Figure 3</xref>. However, data augmentation and dropout make the accuracy and loss fluctuate too much on IR images, and the model being fine-tuned has difficulty converging, as shown in <xref ref-type="fig" rid="fig4">Figure 4</xref>. L<sup>2</sup> weight decay restrains the rise of the loss on IR images after about 20 epochs, and the model starts to converge. In summary, considering both over-fitting and model convergence, we use dropout for fine-tuning the model on VIS images and L<sup>2</sup> weight decay for fine-tuning the model on IR images. Meanwhile, the fine-tuned models on VIS and IR images whose accuracy is close to the corresponding average over the 10 groups of experiments are chosen as feature extractors in our fusion method.</p><p>Secondly, we analyze the feature representation ability of different layers of the pre-trained VGG-16 model for VIS and IR images. 
As indicated on the horizontal axis of <xref ref-type="fig" rid="fig5">Figure 5</xref>(a), C2 is a low-level layer, C3 - C5 are middle-level layers, and F6 - F7 are high-level layers.</p><table-wrap id="table2" ><label><xref ref-type="table" rid="table2">Table 2</xref></label><caption><title>Comparison of recognition performance (%) with and without regularization techniques when fine-tuning the VGG-16 model</title></caption><table><tbody><thead><tr><th align="center" valign="middle"  rowspan="2"  >Type</th><th align="center" valign="middle"  rowspan="2"  >DR</th><th align="center" valign="middle"  rowspan="2"  >L<sup>2</sup></th><th align="center" valign="middle"  colspan="3"  >Without data augmentation</th><th align="center" valign="middle"  colspan="3"  >With data augmentation</th></tr></thead><tr><td align="center" valign="middle" >4C VIS-IR</td><td align="center" valign="middle" >VIS</td><td align="center" valign="middle" >IR</td><td align="center" valign="middle" >4C VIS-IR</td><td align="center" valign="middle" >VIS</td><td align="center" valign="middle" >IR</td></tr><tr><td align="center" valign="middle" >Type 1</td><td align="center" valign="middle" >0.0</td><td align="center" valign="middle" >0.0</td><td align="center" valign="middle" >81.2 &#177; 2.2</td><td align="center" valign="middle" >85.6 &#177; 2.7</td><td align="center" valign="middle" >67.0 &#177; 2.0</td><td align="center" valign="middle" >82.5 &#177; 1.5</td><td align="center" valign="middle" >85.7 &#177; 2.0</td><td align="center" valign="middle" >66.2 &#177; 2.3</td></tr><tr><td align="center" valign="middle" >Type 2</td><td align="center" valign="middle" >0.5</td><td align="center" valign="middle" >0.0</td><td align="center" valign="middle" >81.7 &#177; 1.0</td><td align="center" valign="middle" >86.2 &#177; 1.3</td><td align="center" valign="middle" >67.5 &#177; 1.2</td><td align="center" valign="middle" >83.4 &#177; 1.5</td><td align="center" valign="middle" >83.9 &#177; 3.5</td><td align="center" valign="middle" >65.9 &#177; 2.0</td></tr><tr><td align="center" valign="middle" >Type 3</td><td align="center" valign="middle" >0.0</td><td align="center" valign="middle" >0.1</td><td align="center" valign="middle" >81.6 &#177; 1.4</td><td align="center" valign="middle" >85.7 &#177; 1.7</td><td align="center" valign="middle" >67.5 &#177; 1.6</td><td align="center" valign="middle" >83.1 &#177; 1.4</td><td align="center" valign="middle" >85.5 &#177; 2.3</td><td align="center" valign="middle" >67.5 &#177; 2.4</td></tr><tr><td align="center" valign="middle" >Type 4</td><td align="center" valign="middle" >0.5</td><td align="center" valign="middle" >0.1</td><td align="center" valign="middle" >82.9 &#177; 1.7</td><td align="center" valign="middle" >85.4 &#177; 1.7</td><td align="center" valign="middle" >68.0 &#177; 2.2</td><td align="center" valign="middle" >83.6 &#177; 1.2</td><td align="center" valign="middle" >83.6 &#177; 3.4</td><td align="center" valign="middle" >66.5 &#177; 1.3</td></tr></tbody></table></table-wrap><p>Notes: Accuracy is reported as the average value together with the standard deviation over 10 runs. A dropout rate (DR) or L<sup>2</sup> weight decay (L<sup>2</sup>) of 0.0 means that the corresponding regularization technique is not used.</p><p>VIS images obtain stronger feature representations at the high-level layers (see the black line with squares in <xref ref-type="fig" rid="fig5">Figure 5</xref>(a)) because the pre-trained VGG-16 model was trained on a large-scale dataset of VIS images. However, for ship recognition, IR images obtain richer features at the middle-level layers (see the black dotted line with squares in <xref ref-type="fig" rid="fig5">Figure 5</xref>(a)) than at the high-level layers. The main reason is that IR images differ from VIS images in properties such as high contrast, low resolution and insufficient detail. Meanwhile, we evaluate the effect of the two feature vector normalization methods on recognition accuracy. 
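</p><p>For reference, L2 normalization rescales a feature vector to unit Euclidean length, so that features whose raw values differ greatly in magnitude become comparable. The following minimal Python sketch uses the standard definition; the function name l2_normalize, the eps guard and the toy vectors are assumptions made for illustration and are not taken from the authors' implementation.</p>
<preformat>
```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a feature vector to unit Euclidean (L2) length."""
    norm = math.sqrt(sum(x * x for x in vector))
    # eps is an assumed guard against division by zero for an all-zero vector
    return [x / max(norm, eps) for x in vector]


# Two hypothetical feature vectors with very different value ranges
vis_feature = [3.0, 4.0]
ir_feature = [300.0, 400.0]

vis_unit = l2_normalize(vis_feature)
ir_unit = l2_normalize(ir_feature)
```
</preformat>
<p>In this toy case both vectors map to the same unit vector, although their raw magnitudes differ by a factor of 100.</p><p>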
<xref ref-type="fig" rid="fig5">Figure 5</xref>(a) shows that L2 normalization improves the recognition performance on IR images at almost all layers (see the blue dotted line with diamonds in <xref ref-type="fig" rid="fig5">Figure 5</xref>(a)). The main reason may be that IR images have more noise and are more blurry than VIS images, and L2 normalization reduces the influence of these small values.</p><p>Thirdly, we evaluate the recognition performance of the fine-tuned models. For convenience, the pre-trained VGG-16 model without fine-tuning is abbreviated as NOFT, the pre-trained VGG-16 model fine-tuned on VIS images as FTVIS, and the pre-trained VGG-16 model fine-tuned on IR images as FTIR. <xref ref-type="fig" rid="fig5">Figure 5</xref>(b) and <xref ref-type="fig" rid="fig5">Figure 5</xref>(c) compare the fine-tuned and pre-trained models without and with normalization. Fine-tuning on VIS images does not noticeably improve the performance of the VGG-16 layers (see the red line with circles in <xref ref-type="fig" rid="fig5">Figure 5</xref>(b) &amp; <xref ref-type="fig" rid="fig5">Figure 5</xref>(c)). Fine-tuning on IR images likewise does not noticeably improve the performance of the C2 - C4 layers, which indicates that the low-level and middle-level layers of the pre-trained VGG-16 model generalize well. However, the recognition accuracy of the C5, F6 and F7 layers of the FTVIS and FTIR models is better than that of the NOFT model (see the blue dotted line with diamonds and the red dotted line with circles in <xref ref-type="fig" rid="fig5">Figure 5</xref>(b) and <xref ref-type="fig" rid="fig5">Figure 5</xref>(c)). Thus, NOFT is taken as the feature extractor for VIS images, while NOFT, FTIR and FTVIS are all used as feature extractors for IR images. 
Therefore, three combinations of feature extractors for VIS and IR images are investigated, as shown in <xref ref-type="table" rid="table3">Table 3</xref>.</p></sec><sec id="s4_3_2"><title>4.3.2. Evaluation of Four Fusion Architectures</title><p>Firstly, we investigate the recognition performance of Early Fusion, Halfway Fusion and Late Fusion using L2 normalization with the three combinations. Because these three fusion architectures extract and fuse features at the same layer, the features are normalized after fusion, before being passed to the SVM classifier. <xref ref-type="fig" rid="fig6">Figure 6</xref> shows the recognition accuracy of the three fusion architectures with L2 normalization. For an intuitive comparison, the accuracy of feature fusion without normalization is also shown in <xref ref-type="fig" rid="fig6">Figure 6</xref>.</p><table-wrap id="table3" ><label><xref ref-type="table" rid="table3">Table 3</xref></label><caption><title>The three combinations of VGG-16 models used as feature extractors for VIS and IR images</title></caption><table><tbody><thead><tr><th align="center" valign="middle"  rowspan="2"  >Combination</th><th align="center" valign="middle"  colspan="2"  >Feature extractors of VIS and IR images</th></tr></thead><tr><td align="center" valign="middle" >VIS</td><td align="center" valign="middle" >IR</td></tr><tr><td align="center" valign="middle" >Combination 1</td><td align="center" valign="middle" >NOFT</td><td align="center" valign="middle" >NOFT</td></tr><tr><td align="center" valign="middle" >Combination 2</td><td align="center" valign="middle" >NOFT</td><td align="center" valign="middle" >FTIR</td></tr><tr><td align="center" valign="middle" >Combination 3</td><td align="center" valign="middle" >NOFT</td><td align="center" valign="middle" >FTVIS</td></tr></tbody></table></table-wrap><p>Notes: NOFT denotes the pre-trained VGG-16 model without fine-tuning, FTVIS denotes the pre-trained VGG-16 model fine-tuned on VIS images, and FTIR denotes the pre-trained VGG-16 model fine-tuned on IR images.</p><p>As shown in the figure, L2 normalization greatly degrades the recognition accuracy of feature fusion at the C4 layer but significantly improves it at the C5, F6 and F7 layers. This indicates that L2 normalization facilitates the feature representation of semantic information. Moreover, Late Fusion at the F6 layer with L2 normalization achieves nearly the best recognition accuracy among the three fusion architectures.</p><p>Secondly, Hybrid Fusion is compared to Late Fusion, which is the best of the above three fusion architectures. Hybrid Fusion integrates the high-level features of VIS images and the middle-level features of IR images; because feature values at different layers differ greatly in magnitude, the extracted features are normalized before being fused. <xref ref-type="table" rid="table4">Table 4</xref> shows the recognition accuracy of Late Fusion and Hybrid Fusion with L2 normalization in the three combinations. For each combination, Hybrid Fusion (F6C3) and Hybrid Fusion (F6C4) are better than Late Fusion (F6F6), but Hybrid Fusion (F6C5) is worse than Late Fusion (F6F6). Besides, Hybrid Fusion with the F7 layer is also better than Late Fusion (F7F7) in all combinations.</p></sec><sec id="s4_3_3"><title>4.3.3. 
Comparison with Other Reported Methods</title><p>We compare the proposed Hybrid Fusion with four methods for paired images: 1) the baseline method (CNN + Gnostic Fields) [<xref ref-type="bibr" rid="scirp.104036-ref10">10</xref>], 2) Multimodal CNN [<xref ref-type="bibr" rid="scirp.104036-ref41">41</xref>], 3) DyFusion [<xref ref-type="bibr" rid="scirp.104036-ref42">42</xref>], 4) SF-SRDA [<xref ref-type="bibr" rid="scirp.104036-ref43">43</xref>], and with three methods for VIS images in the paired images: 5) MFL (feature-level) + ELM [<xref ref-type="bibr" rid="scirp.104036-ref38">38</xref>], 6) CNN + Gabor + MS-CLBP [<xref ref-type="bibr" rid="scirp.104036-ref36">36</xref>], 7) ME-CNN [<xref ref-type="bibr" rid="scirp.104036-ref40">40</xref>], and with one method for all time IR images: 8) ELM-CNN [<xref ref-type="bibr" rid="scirp.104036-ref31">31</xref>]. <xref ref-type="table" rid="table5">Table 5</xref> shows the comparison results using the mean per-class recognition accuracy as the evaluation measure. 
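</p><p>As a reminder of how this measure behaves, mean per-class accuracy is commonly computed by averaging the accuracy obtained within each class, so that small classes weigh as much as large ones. The sketch below follows that common definition; the function name and the toy labels are hypothetical, and the authors' exact evaluation code is not shown in the paper.</p>
<preformat>
```python
from collections import defaultdict


def mean_per_class_accuracy(true_labels, predicted_labels):
    """Average the accuracy of each class so that every class counts equally."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, prediction in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == prediction:
            correct[truth] += 1
    # Accuracy within each class (fraction of its samples predicted correctly)
    per_class = {label: correct[label] / total[label] for label in total}
    return sum(per_class.values()) / len(per_class), per_class


# Hypothetical toy labels, not taken from the VAIS experiments
y_true = ["cargo", "cargo", "cargo", "cargo", "tug", "tug"]
y_pred = ["cargo", "cargo", "cargo", "cargo", "tug", "cargo"]

mean_acc, per_class_acc = mean_per_class_accuracy(y_true, y_pred)
overall_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```
</preformat>
<p>In this toy case the overall accuracy (5/6) exceeds the mean per-class accuracy (0.75), because the larger class dominates the overall figure while the smaller class is only half correct.</p><p>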
As shown in <xref ref-type="table" rid="table5">Table 5</xref>, Hybrid Fusion (F6C3) is 2.2% higher than the baseline method, outperforms the state-of-the-art (DyFusion) by 1.4% in daytime, and boosts the baseline method by 3.9% on</p><table-wrap id="table4" ><label><xref ref-type="table" rid="table4">Table 4</xref></label><caption><title> The recognition accuracy (%) of Late Fusion and Hybrid Fusion with L2 normalization method in three combinations</title></caption><table><tbody><thead><tr><th align="center" valign="middle"  rowspan="2"  >Fusion Architecture</th><th align="center" valign="middle"  colspan="3"  >Combination 1</th><th align="center" valign="middle"  colspan="3"  >Combination 2</th><th align="center" valign="middle"  colspan="3"  >Combination 3</th></tr></thead><tr><td align="center" valign="middle" >VIS</td><td align="center" valign="middle" >IR</td><td align="center" valign="middle" >VIS + IR</td><td align="center" valign="middle" >VIS</td><td align="center" valign="middle" >IR</td><td align="center" valign="middle" >VIS + IR</td><td align="center" valign="middle" >VIS</td><td align="center" valign="middle" >IR</td><td align="center" valign="middle" >VIS + IR</td></tr><tr><td align="center" valign="middle" >LF(F6F6)</td><td align="center" valign="middle" >86.9</td><td align="center" valign="middle" >65.9</td><td align="center" valign="middle" >88.2</td><td align="center" valign="middle" >86.9</td><td align="center" valign="middle" >66.9</td><td align="center" valign="middle" >85.8</td><td align="center" valign="middle" >86.9</td><td align="center" valign="middle" >65.4</td><td align="center" valign="middle" >88.3</td></tr><tr><td align="center" valign="middle" >HF(F6C3)</td><td align="center" valign="middle" >86.9</td><td align="center" valign="middle" >69.6</td><td align="center" valign="middle" >88.7</td><td align="center" valign="middle" >86.9</td><td align="center" valign="middle" >69.0</td><td align="center" valign="middle" >89.6</td><td 
align="center" valign="middle" >86.9</td><td align="center" valign="middle" >67.6</td><td align="center" valign="middle" >88.5</td></tr><tr><td align="center" valign="middle" >HF(F6C4)</td><td align="center" valign="middle" >86.9</td><td align="center" valign="middle" >72.9</td><td align="center" valign="middle" >89.1</td><td align="center" valign="middle" >86.9</td><td align="center" valign="middle" >71.4</td><td align="center" valign="middle" >88.3</td><td align="center" valign="middle" >86.9</td><td align="center" valign="middle" >72.5</td><td align="center" valign="middle" >88.9</td></tr><tr><td align="center" valign="middle" >HF(F6C5)</td><td align="center" valign="middle" >86.9</td><td align="center" valign="middle" >68.7</td><td align="center" valign="middle" >85.8</td><td align="center" valign="middle" >86.9</td><td align="center" valign="middle" >70.1</td><td align="center" valign="middle" >85.3</td><td align="center" valign="middle" >86.9</td><td align="center" valign="middle" >73.8</td><td align="center" valign="middle" >87.4</td></tr><tr><td align="center" valign="middle" >LF(F7F7)</td><td align="center" valign="middle" >81.6</td><td align="center" valign="middle" >64.3</td><td align="center" valign="middle" >83.1</td><td align="center" valign="middle" >81.6</td><td align="center" valign="middle" >67.9</td><td align="center" valign="middle" >82.0</td><td align="center" valign="middle" >81.6</td><td align="center" valign="middle" >66.8</td><td align="center" valign="middle" >84.5</td></tr><tr><td align="center" valign="middle" >HF(F7C3)</td><td align="center" valign="middle" >81.6</td><td align="center" valign="middle" >69.6</td><td align="center" valign="middle" >85.6</td><td align="center" valign="middle" >81.6</td><td align="center" valign="middle" >69.0</td><td align="center" valign="middle" >86.7</td><td align="center" valign="middle" >81.6</td><td align="center" valign="middle" >67.6</td><td align="center" valign="middle" >85.4</td></tr><tr><td 
align="center" valign="middle" >HF(F7C4)</td><td align="center" valign="middle" >81.6</td><td align="center" valign="middle" >72.9</td><td align="center" valign="middle" >88.0</td><td align="center" valign="middle" >81.6</td><td align="center" valign="middle" >71.4</td><td align="center" valign="middle" >86.5</td><td align="center" valign="middle" >81.6</td><td align="center" valign="middle" >72.5</td><td align="center" valign="middle" >87.2</td></tr><tr><td align="center" valign="middle" >HF(F7C5)</td><td align="center" valign="middle" >81.6</td><td align="center" valign="middle" >68.7</td><td align="center" valign="middle" >84.3</td><td align="center" valign="middle" >81.6</td><td align="center" valign="middle" >70.1</td><td align="center" valign="middle" >84.3</td><td align="center" valign="middle" >81.6</td><td align="center" valign="middle" >73.8</td><td align="center" valign="middle" >86.3</td></tr></tbody></table></table-wrap><p>Notes: Abbreviated symbol LF (F6F6) represents Late Fusion combining F6 layer features of VIS and IR images, the same as to LF (F7F7). Abbreviated symbol HF (F6C3) represents Hybrid Fusion combining F6 layer feature of VIS and C3 layer feature of IR, the same as to others. 
Bold denotes the recognition accuracy is the best one in the same combination.</p><table-wrap id="table5" ><label><xref ref-type="table" rid="table5">Table 5</xref></label><caption><title> Comparison of recognition accuracy (%) with other results reported on the VAIS dataset</title></caption><table><tbody><thead><tr><th align="center" valign="middle"  rowspan="2"  >Method</th><th align="center" valign="middle"  colspan="3"  >Daytime</th><th align="center" valign="middle"  rowspan="2"  >Nighttime IR</th><th align="center" valign="middle"  rowspan="2"  >All time IR</th></tr></thead><tr><td align="center" valign="middle" >VIS</td><td align="center" valign="middle" >IR</td><td align="center" valign="middle" >VIS + IR</td></tr><tr><td align="center" valign="middle" >CNN + Gnostic Fields [<xref ref-type="bibr" rid="scirp.104036-ref10">10</xref>]</td><td align="center" valign="middle" >81.0</td><td align="center" valign="middle" >56.8</td><td align="center" valign="middle" >87.4</td><td align="center" valign="middle" >61.0</td><td align="center" valign="middle" >-</td></tr><tr><td align="center" valign="middle" >MFL (feature-level) + ELM [<xref ref-type="bibr" rid="scirp.104036-ref38">38</xref>]</td><td align="center" valign="middle" >87.6</td><td align="center" valign="middle" >-</td><td align="center" valign="middle" >-</td><td align="center" valign="middle" >-</td><td align="center" valign="middle" >-</td></tr><tr><td align="center" valign="middle" >CNN + Gabor + MS-CLBP [<xref ref-type="bibr" rid="scirp.104036-ref39">39</xref>]</td><td align="center" valign="middle" >88.0</td><td align="center" valign="middle" >-</td><td align="center" valign="middle" >-</td><td align="center" valign="middle" >-</td><td align="center" valign="middle" >-</td></tr><tr><td align="center" valign="middle" >ME-CNN [<xref ref-type="bibr" rid="scirp.104036-ref40">40</xref>]</td><td align="center" valign="middle" >87.3</td><td align="center" valign="middle" >-</td><td align="center" 
valign="middle" >-</td><td align="center" valign="middle" >-</td><td align="center" valign="middle" >-</td></tr><tr><td align="center" valign="middle" >ELM-CNN [<xref ref-type="bibr" rid="scirp.104036-ref41">41</xref>]</td><td align="center" valign="middle" >-</td><td align="center" valign="middle" >-</td><td align="center" valign="middle" >-</td><td align="center" valign="middle" >-</td><td align="center" valign="middle" >61.2</td></tr><tr><td align="center" valign="middle" >Multimodal CNN [<xref ref-type="bibr" rid="scirp.104036-ref41">41</xref>]</td><td align="center" valign="middle" >80.2</td><td align="center" valign="middle" >63.5</td><td align="center" valign="middle" >86.7</td><td align="center" valign="middle" >-</td><td align="center" valign="middle" >-</td></tr><tr><td align="center" valign="middle" >DyFusion [<xref ref-type="bibr" rid="scirp.104036-ref42">42</xref>]</td><td align="center" valign="middle" >-</td><td align="center" valign="middle" >-</td><td align="center" valign="middle" >88.2</td><td align="center" valign="middle" >-</td><td align="center" valign="middle" >-</td></tr><tr><td align="center" valign="middle" >SF-SRDA [<xref ref-type="bibr" rid="scirp.104036-ref43">43</xref>]</td><td align="center" valign="middle" >87.6</td><td align="center" valign="middle" >74.7</td><td align="center" valign="middle" >88.0</td><td align="center" valign="middle" >57.8</td><td align="center" valign="middle" >71.0</td></tr><tr><td align="center" valign="middle" >Combination 1 Late Fusion (F6F6)</td><td align="center" valign="middle" >86.9</td><td align="center" valign="middle" >65.9</td><td align="center" valign="middle" >88.2</td><td align="center" valign="middle" >46.8</td><td align="center" valign="middle" >51.7</td></tr><tr><td align="center" valign="middle" >Combination 1 Hybrid Fusion (F6C4)</td><td align="center" valign="middle" >86.9</td><td align="center" valign="middle" >72.9</td><td align="center" valign="middle" >89.1</td><td align="center" 
valign="middle" >57.1</td><td align="center" valign="middle" >68.4</td></tr><tr><td align="center" valign="middle" >Combination 2 Hybrid Fusion (F6C3)</td><td align="center" valign="middle" >86.9</td><td align="center" valign="middle" >69.0</td><td align="center" valign="middle" >89.6</td><td align="center" valign="middle" >64.9</td><td align="center" valign="middle" >68.6</td></tr></tbody></table></table-wrap><p>Notes: For ship recognition on nighttime and all time IR images, Hybrid Fusion (F6C3) extracts the features of IR images from the F6 and C3 layers of the pre-trained or fine-tuned VGG-16 model, and likewise for F6C4, whereas Late Fusion (F6) extracts the features of IR images only from the F6 layer. Bold denotes the best result.</p><p>nighttime IR images. Furthermore, Hybrid Fusion performs better than Late Fusion (F6F6) on daytime, nighttime and all time IR images. Although the SF-SRDA method achieves higher accuracy than our proposed fusion models on daytime and all time IR images, Hybrid Fusion (F6C3) outperforms it by 1.6 percentage points on multispectral images and by 7.1 percentage points on nighttime IR images. Therefore, the proposed fusion models are more suitable for ship recognition on multispectral images than the other methods. Note that combination 1 requires no training at the feature extraction stage and is efficient, whereas combination 2 and combination 3 need a long time to fine-tune the VGG-16 model, and training techniques must be applied carefully when fine-tuning on a small-scale dataset.</p><p>In addition, normalized confusion matrices for Hybrid Fusion (F6C3) of combination 2 are shown in <xref ref-type="fig" rid="fig7">Figure 7</xref>. As shown in <xref ref-type="fig" rid="fig7">Figure 7</xref>(a), all categories except medium-other and tug are above 92% accuracy. Medium-other achieves only 64% because it is often confused with passenger and small ships. 
Besides, tug achieves only 60% in Hybrid Fusion (F6C3) because it has fewer training samples than the other classes (see <xref ref-type="table" rid="table1">Table 1</xref>) and is also confused with passenger and small ships. Nighttime IR images provide only the contour and few details of a ship due to blur, low resolution and a large pixel range, so it is difficult to classify the ship category on them. In the normalized confusion matrix on nighttime IR images shown in <xref ref-type="fig" rid="fig7">Figure 7</xref>(b), Hybrid Fusion (F6C3) performs worst on medium-other, which is misclassified as cargo in 50% of cases and as passenger in 43% of cases because they have similar contours. This also affects the recognition accuracy on all time IR images, as shown in <xref ref-type="fig" rid="fig7">Figure 7</xref>(c). <xref ref-type="fig" rid="fig8">Figure 8</xref> gives some visual examples that are misclassified when using either VIS or IR images alone, but correctly classified when using multispectral images.</p></sec></sec></sec><sec id="s5"><title>5. Discussion</title><p>It is not easy to use a small-scale dataset to fine-tune a VGG-16 model trained on a large-scale dataset like ImageNet, particularly for IR images, because of the differences in imaging principle and technique between IR and VIS images. Some existing regularization techniques are used to avoid over-fitting in our experiments; however, data augmentation using image transformations causes strong correlation between samples and does not improve the performance of the model on a small-scale dataset. Besides, early stopping is also used as a regularization technique to avoid over-fitting in our experiments, but it often stops training before the model converges when we fine-tune the pre-trained model on IR images. L<sup>2</sup> weight decay is sensitive to its manually set value: if the value is too small, it cannot restrain the rise of the loss. 
In short, if a small-scale dataset like VAIS is used to fit the large number of parameters of a deep CNN like VGG-16, data augmentation is necessary. Generative adversarial networks [<xref ref-type="bibr" rid="scirp.104036-ref49">49</xref>] [<xref ref-type="bibr" rid="scirp.104036-ref50">50</xref>] may be an effective tool for producing paired VIS-IR images and increasing the number of training samples.</p><p>We investigate the feature fusion performance of the pre-trained and fine-tuned VGG-16 models for multispectral images, and no techniques are used to project the high-dimensional features into a lower-dimensional space. Our work provides a basis for future work on multispectral maritime ship recognition. For example, our work, which is based on the pre-trained VGG-16 model, can be easily extended to other pre-trained deep CNN models, such as the VGG-19 model, the Inception model [<xref ref-type="bibr" rid="scirp.104036-ref16">16</xref>] and the ResNet model [<xref ref-type="bibr" rid="scirp.104036-ref17">17</xref>]. Besides, researchers can leverage unsupervised feature learning methods, such as principal components analysis, to reduce the feature dimension, and can also embed Network-in-Network (NIN) [<xref ref-type="bibr" rid="scirp.104036-ref51">51</xref>] for fine-tuning the well-known pre-trained deep CNN models. Furthermore, the baseline method and the state-of-the-art method [<xref ref-type="bibr" rid="scirp.104036-ref42">42</xref>] adopt decision-level fusion for ship recognition, and extract features from the last fully connected layer of the pre-trained VGG-16 model and the last convolutional layer of the pre-trained VGG-19 model, respectively. Based on our experimental results, features extracted from the same layer of the pre-trained deep CNN model are not the best for both VIS and IR images. We believe that our work can be further investigated in decision-level fusion.</p></sec><sec id="s6"><title>6. 
Conclusion</title><p>In this paper, we take advantage of the deep CNN model and multispectral data, and formulate the multispectral ship recognition task as a convolutional feature fusion problem. We propose a feature fusion architecture, namely Hybrid Fusion, and investigate it together with three other feature fusion architectures using the L2 normalization method. Meanwhile, we use existing regularization techniques to fine-tune the pre-trained VGG-16 model on VIS and IR images in the VAIS dataset, and investigate the ship recognition performance of three combinations. Experimental results demonstrate that the feature representation ability of the pre-trained VGG-16 model is strong at the high-level layers for VIS images and at the middle-level layers for IR images. Among the four feature fusion architectures, Hybrid Fusion achieves higher recognition accuracy than the other three. Besides, fine-tuning the pre-trained VGG-16 model can learn semantic information about ships and slightly improves the recognition performance of Hybrid Fusion. The best Hybrid Fusion achieves 89.6% mean per-class recognition accuracy and outperforms the state-of-the-art method. Our future work will focus on unsupervised feature learning and decision-level fusion.</p></sec><sec id="s7"><title>Funding Statement</title><p>This work is partly supported by the National Natural Science Foundation of China (Grant No. 61102170 and No. 62006240) and the National Social Science Foundation of China (Grant No. 15GJ003-243).</p></sec><sec id="s8"><title>Data Availability Statement</title><p>The Excel data used to support the findings of this study are included within the supplementary information file.</p></sec><sec id="s9"><title>Conflicts of Interest</title><p>The authors declare no conflicts of interest regarding the publication of this paper.</p></sec><sec id="s10"><title>Cite this paper</title><p>Qiu, X.H., Li, M., Zhang, L.Q. and Zhao, R. 
(2020) Deep Convolutional Feature Fusion Model for Multispectral Maritime Imagery Ship Recognition. Journal of Computer and Communications, 8, 23-43. https://doi.org/10.4236/jcc.2020.811003</p></sec><sec id="s11"><title>NOTES</title></sec></body><back><ref-list><title>References</title><ref id="scirp.104036-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Guo, K., Wu, S. and Xu, Y. (2017) Face Recognition Using Both Visible Light Image and Near-Infrared Image and a Deep Network. CAAI Transactions on Intelligence Technology, 2, 39-47. https://doi.org/10.1016/j.trit.2017.03.001</mixed-citation></ref><ref id="scirp.104036-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Dai, P., Ji, R., Wang, H., Wu, Q. and Huang, Y. (2018) Cross-Modality Person Re-Identification with Generative Adversarial Training. Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, 13-19 July 2018, 677-683. https://doi.org/10.24963/ijcai.2018/94</mixed-citation></ref><ref id="scirp.104036-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Li, C., Song, D., Tong, R. and Tang, M. (2019) Illumination-Aware Faster R-CNN for Robust Multispectral Pedestrian Detection. Pattern Recognition, 85, 161-171. https://doi.org/10.1016/j.patcog.2018.08.005</mixed-citation></ref><ref id="scirp.104036-ref4"><label>4</label><mixed-citation publication-type="other" xlink:type="simple">Cho, Y.R., Shin, S., Yim, S.H., Kong, K., Cho, H.W. and Song, W.J. (2019) Multistage Fusion with Dissimilarity Regularization for SAR/IR Target Recognition. IEEE Access, 7, 728-740. https://doi.org/10.1109/ACCESS.2018.2885736</mixed-citation></ref><ref id="scirp.104036-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Li, C., Wu, X., Zhao, N., Cao, X. and Tang, J. (2018) Fusing Two-Stream Convolutional Neural Networks for RGB-T Object Tracking. 
Neurocomputing, 281, 78-85. https://doi.org/10.1016/j.neucom.2017.11.068</mixed-citation></ref><ref id="scirp.104036-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Hong, C., Koschan, A., Abidi, M., Kong, S.G. and Won, C.-H. (2008) Multispectral Visible and Infrared Imaging for Face Recognition. 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Anchorage, 24-26 June 2008, 1-6. https://doi.org/10.1109/CVPRW.2008.4563054</mixed-citation></ref><ref id="scirp.104036-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Shoja Ghiass, R., Arandjelovic, O., Bendada, A. and Maldague, X. (2014) Infrared Face Recognition: A Comprehensive Review of Methodologies and Databases. Pattern Recognition, 47, 2807-2824. https://doi.org/10.1016/j.patcog.2014.03.015</mixed-citation></ref><ref id="scirp.104036-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Hermosilla, G., Rojas, M., Mendoza, J., Farias, G., Pizarro, F.T., San Martin, C. and Vera, E. (2018) Particle Swarm Optimization for the Fusion of Thermal and Visible Descriptors in Face Recognition Systems. IEEE Access, 6, 42800-42811. https://doi.org/10.1109/ACCESS.2018.2850281</mixed-citation></ref><ref id="scirp.104036-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Peng, C., Wang, N., Li, J. and Gao, X. (2019) DLFace: Deep Local Descriptor for Cross-Modality Face Recognition. Pattern Recognition, 90, 161-171. https://doi.org/10.1016/j.patcog.2019.01.041</mixed-citation></ref><ref id="scirp.104036-ref10"><label>10</label><mixed-citation publication-type="other" xlink:type="simple">Zhang, M.M., Choi, J., Daniilidis, K., Wolf, M.T. and Kanan, C. (2015) Vais: A Dataset for Recognizing Maritime Imagery in the Visible and Infrared Spectrums. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, 7-12 June 2015, 10-16. 
https://doi.org/10.1109/CVPRW.2015.7301291</mixed-citation></ref><ref id="scirp.104036-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">Kniaz, V.V., Knyaz, V.A., Hladuvka, J., Kropatsch, W.G. and Mizginov, V. (2019) Thermal-GAN: Multimodal Color-to-Thermal Image Translation for Person Re-Identification in Multispectral Dataset. Computer Vision ECCV 2018 Workshops, Volume 11134, 606-624. https://doi.org/10.1007/978-3-030-11024-6_46</mixed-citation></ref><ref id="scirp.104036-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">Zhang, L., Liu, Z., Zhang, S., Yang, X., Qiao, H., Huang, K. and Hussain, A. (2019) Cross-Modality Interactive Attention Network for Multispectral Pedestrian Detection. Information Fusion, 50, 20-29. https://doi.org/10.1016/j.inffus.2018.09.015</mixed-citation></ref><ref id="scirp.104036-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">Zhang, L., Gonzalez-Garcia, A., van de Weijer, J., Danelljan, M. and Khan, F.S. (2019) Synthetic Data Generation for End-to-End Thermal Infrared Tracking. IEEE Transactions on Image Processing, 28, 1837-1850. https://doi.org/10.1109/TIP.2018.2879249</mixed-citation></ref><ref id="scirp.104036-ref14"><label>14</label><mixed-citation publication-type="other" xlink:type="simple">Krizhevsky, A., Sutskever, I. and Hinton, G.E. (2012) ImageNet Classification with Deep Convolutional Neural Networks. In: Advances in Neural Information Processing Systems, Curran Associates Inc., Red Hook, 1097-1105.</mixed-citation></ref><ref id="scirp.104036-ref15"><label>15</label><mixed-citation publication-type="other" xlink:type="simple">Simonyan, K. and Zisserman, A. (2014) Very Deep Convolutional Networks for Large-Scale Image Recognition. 
http://arxiv.org/abs/1409.1556</mixed-citation></ref><ref id="scirp.104036-ref16"><label>16</label><mixed-citation publication-type="other" xlink:type="simple">Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. and Wojna, Z. (2016) Rethinking the Inception Architecture for Computer Vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 27-30 June 2016, 2818-2826. https://doi.org/10.1109/CVPR.2016.308</mixed-citation></ref><ref id="scirp.104036-ref17"><label>17</label><mixed-citation publication-type="other" xlink:type="simple">He, K., Zhang, X., Ren, S. and Sun, J. (2016) Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 27-30 June 2016, 770-778. https://doi.org/10.1109/CVPR.2016.90</mixed-citation></ref><ref id="scirp.104036-ref18"><label>18</label><mixed-citation publication-type="other" xlink:type="simple">Deng, J., Dong, W., Socher, R., Li, L.J., Li, K. and Fei-Fei, L. (2009) ImageNet: A Large-Scale Hierarchical Image Database. 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, 20-25 June 2009, 8. https://doi.org/10.1109/CVPR.2009.5206848</mixed-citation></ref><ref id="scirp.104036-ref19"><label>19</label><mixed-citation publication-type="other" xlink:type="simple">Eitel, A., Springenberg, J.T., Spinello, L., Riedmiller, M. and Burgard, W. (2015) Multimodal Deep Learning for Robust RGB-D Object Recognition. 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, 28 September-3 October 2015, 681-687. https://doi.org/10.1109/IROS.2015.7353446</mixed-citation></ref><ref id="scirp.104036-ref20"><label>20</label><mixed-citation publication-type="other" xlink:type="simple">Bui, H.M., Lech, M., Cheng, E., Neville, K. and Burnett, I.S. (2016) Object Recognition Using Deep Convolutional Features Transformed by a Recursive Network Structure. IEEE Access, 4, 10059-10066. 
https://doi.org/10.1109/ACCESS.2016.2639543</mixed-citation></ref><ref id="scirp.104036-ref21"><label>21</label><mixed-citation publication-type="other" xlink:type="simple">Ren, S., He, K., Girshick, R., Zhang, X. and Sun, J. (2017) Object Detection Networks on Convolutional Feature Maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 1476-1481. https://doi.org/10.1109/TPAMI.2016.2601099</mixed-citation></ref><ref id="scirp.104036-ref22"><label>22</label><mixed-citation publication-type="other" xlink:type="simple">Wang, T., Chen, Y., Zhang, M., Chen, J. and Snoussi, H. (2017) Internal Transfer Learning for Improving Performance in Human Action Recognition for Small Datasets. IEEE Access, 5, 17627-17633. https://doi.org/10.1109/ACCESS.2017.2746095</mixed-citation></ref><ref id="scirp.104036-ref23"><label>23</label><mixed-citation publication-type="other" xlink:type="simple">Park, K., Kim, S. and Sohn, K. (2018) Unified Multi-Spectral Pedestrian Detection Based on Probabilistic Fusion Networks. Pattern Recognition, 80, 143-155. https://doi.org/10.1016/j.patcog.2018.03.007</mixed-citation></ref><ref id="scirp.104036-ref24"><label>24</label><mixed-citation publication-type="other" xlink:type="simple">Yosinski, J., Clune, J., Bengio, Y. and Lipson, H. (2014) How Transferable Are Features in Deep Neural Networks? Proceedings of the 27th International Conference on Neural Information Processing Systems, Volume 2, 3320-3328.</mixed-citation></ref><ref id="scirp.104036-ref25"><label>25</label><mixed-citation publication-type="other" xlink:type="simple">Schwarz, M., Schulz, H. and Behnke, S. (2015) RGB-D Object Recognition and Pose Estimation Based on Pre-Trained Convolutional Neural Network Features. 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, 26-30 May 2015, 1329-1335. 
https://doi.org/10.1109/ICRA.2015.7139363</mixed-citation></ref><ref id="scirp.104036-ref26"><label>26</label><mixed-citation publication-type="other" xlink:type="simple">Zia, S., Yuksel, B., Yuret, D. and Yemez, Y. (2017) RGB-D Object Recognition Using Deep Convolutional Neural Networks. 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, 22-29 October 2017, 887-894. https://doi.org/10.1109/ICCVW.2017.109</mixed-citation></ref><ref id="scirp.104036-ref27"><label>27</label><mixed-citation publication-type="other" xlink:type="simple">Caglayan, A. and Can, A.B. (2019) Exploiting Multi-Layer Features Using a CNN-RNN Approach for RGB-D Object Recognition. In: Computer Vision ECCV 2018 Workshops, Springer International Publishing, Cham, 675-688. https://doi.org/10.1007/978-3-030-11015-4_51</mixed-citation></ref><ref id="scirp.104036-ref28"><label>28</label><mixed-citation publication-type="other" xlink:type="simple">Chatfield, K., Simonyan, K., Vedaldi, A. and Zisserman, A. (2014) Return of the Devil in the Details: Delving Deep into Convolutional Nets. Proceedings of the British Machine Vision Conference 2014, Nottingham, 1-5 September 2014, 6.1-6.12. https://doi.org/10.5244/C.28.6</mixed-citation></ref><ref id="scirp.104036-ref29"><label>29</label><mixed-citation publication-type="other" xlink:type="simple">Socher, R., Huval, B., Bhat, B., Manning, C.D. and Ng, A.Y. (2012) Convolutional-Recursive Deep Learning for 3D Object Classification. Proceedings of the 25th International Conference on Neural Information Processing Systems, Volume 1, 656-664. http://dl.acm.org/citation.cfm?id=2999134.2999208</mixed-citation></ref><ref id="scirp.104036-ref30"><label>30</label><mixed-citation publication-type="other" xlink:type="simple">Kanjir, U., Greidanus, H. and Ostir, K. (2018) Vessel Detection and Classification from Spaceborne Optical Images: A Literature Survey. Remote Sensing of Environment, 207, 1-26. 
https://doi.org/10.1016/j.rse.2017.12.033</mixed-citation></ref><ref id="scirp.104036-ref31"><label>31</label><mixed-citation publication-type="other" xlink:type="simple">Khellal, A., Ma, H. and Fei, Q. (2018) Convolutional Neural Network Based on Extreme Learning Machine for Maritime Ships Recognition in Infrared Images. Sensors, 18, 1490. https://doi.org/10.3390/s18051490</mixed-citation></ref><ref id="scirp.104036-ref32"><label>32</label><mixed-citation publication-type="other" xlink:type="simple">Bousetouane, F. and Morris, B. (2015) Off-the-Shelf CNN Features for Fine-Grained Classification of Vessels in a Maritime Environment. In: Advances in Visual Computing, Volume 9475, Springer International Publishing, Cham, 379-388. https://doi.org/10.1007/978-3-319-27863-6_35</mixed-citation></ref><ref id="scirp.104036-ref33"><label>33</label><mixed-citation publication-type="other" xlink:type="simple">Dao, C.D., Hua, X.H. and Morere, O. (2015) Maritime Vessel Images Classification Using Deep Convolutional Neural Networks. Proceedings of the Sixth International Symposium on Information and Communication Technology, Hue City, 3-4 December 2015, 1-6. https://doi.org/10.1145/2833258.2833266</mixed-citation></ref><ref id="scirp.104036-ref34"><label>34</label><mixed-citation publication-type="other" xlink:type="simple">Solmaz, B., Gundogdu, E., Yucesoy, V. and Koc, A. (2017) Generic and Attribute-Specific Deep Representations for Maritime Vessels. IPSJ Transactions on Computer Vision and Applications, 9, 22. https://doi.org/10.1186/s41074-017-0033-4</mixed-citation></ref><ref id="scirp.104036-ref35"><label>35</label><mixed-citation publication-type="other" xlink:type="simple">Gundogdu, E., Solmaz, B., Yucesoy, V. and Koc, A. (2017) MARVEL: A Large-Scale Image Dataset for Maritime Vessels. Computer Vision ACCV 2016, Volume 10115, 165-180. 
https://doi.org/10.1007/978-3-319-54193-8_11</mixed-citation></ref><ref id="scirp.104036-ref36"><label>36</label><mixed-citation publication-type="other" xlink:type="simple">Solmaz, B., Gundogdu, E., Yucesoy, V., Koc, A. and Alatan, A.A. (2018) Fine-Grained Recognition of Maritime Vessels and Land Vehicles by Deep Feature Embedding. IET Computer Vision, 12, 1121-1132. https://doi.org/10.1049/iet-cvi.2018.5187</mixed-citation></ref><ref id="scirp.104036-ref37"><label>37</label><mixed-citation publication-type="other" xlink:type="simple">Milicevic, M., Zubrinic, K., Obradovic, I. and Sjekavica, T. (2019) Application of Transfer Learning for Fine-Grained Vessel Classification Using a Limited Dataset. In: Applied Physics, System Science and Computers III, Volume 574, Springer International Publishing, Cham, 125-131. https://doi.org/10.1007/978-3-030-21507-1_19</mixed-citation></ref><ref id="scirp.104036-ref38"><label>38</label><mixed-citation publication-type="other" xlink:type="simple">Huang, L., Li, W., Chen, C., Zhang, F. and Lang, H. (2018) Multiple Features Learning for Ship Classification in Optical Imagery. Multimedia Tools and Applications, 77, 13363-13389. https://doi.org/10.1007/s11042-017-4952-y</mixed-citation></ref><ref id="scirp.104036-ref39"><label>39</label><mixed-citation publication-type="other" xlink:type="simple">Shi, Q., Li, W., Zhang, F., Hu, W., Sun, X. and Gao, L. (2018) Deep CNN with Multi-Scale Rotation Invariance Features for Ship Classification. IEEE Access, 6, 38656-38668. https://doi.org/10.1109/ACCESS.2018.2853620</mixed-citation></ref><ref id="scirp.104036-ref40"><label>40</label><mixed-citation publication-type="other" xlink:type="simple">Shi, Q., Li, W., Tao, R., Sun, X. and Gao, L. (2019) Ship Classification Based on Multifeature Ensemble with Convolutional Neural Network. Remote Sensing, 11, 419. 
https://doi.org/10.3390/rs11040419</mixed-citation></ref><ref id="scirp.104036-ref41"><label>41</label><mixed-citation publication-type="book" xlink:type="simple">Aziz, K. and Bouchara, F. (2018) Multimodal Deep Learning for Robust Recognizing Maritime Imagery in the Visible and Infrared Spectrums. In: Campilho, A., Karray, F. and ter Haar Romeny, B., Eds., Image Analysis and Recognition, Springer International Publishing, Berlin, 235-244. https://doi.org/10.1007/978-3-319-93000-8_27</mixed-citation></ref><ref id="scirp.104036-ref42"><label>42</label><mixed-citation publication-type="other" xlink:type="simple">Santos, C.E. and Bhanu, B. (2018) DyFusion: Dynamic IR/RGB Fusion for Maritime Vessel Recognition. 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, 7-10 October 2018, 1328-1332. https://doi.org/10.1109/ICIP.2018.8451745</mixed-citation></ref><ref id="scirp.104036-ref43"><label>43</label><mixed-citation publication-type="other" xlink:type="simple">Zhang, E., Wang, K. and Lin, G. (2019) Classification of Marine Vessels with Multi-Feature Structure Fusion. Applied Sciences, 9, 2153. https://doi.org/10.3390/app9102153</mixed-citation></ref><ref id="scirp.104036-ref44"><label>44</label><mixed-citation publication-type="other" xlink:type="simple">Liu, J., Zhang, S., Wang, S. and Metaxas, D. (2016) Multispectral Deep Neural Networks for Pedestrian Detection. Proceedings of the British Machine Vision Conference 2016, York, 19-22 September 2016, 73.1-73.13. https://doi.org/10.5244/C.30.73</mixed-citation></ref><ref id="scirp.104036-ref45"><label>45</label><mixed-citation publication-type="other" xlink:type="simple">Zeiler, M.D. and Fergus, R. (2014) Visualizing and Understanding Convolutional Networks. Computer Vision ECCV 2014, Volume 8689, 818-833. https://doi.org/10.1007/978-3-319-10590-1_53</mixed-citation></ref><ref id="scirp.104036-ref46"><label>46</label><mixed-citation publication-type="other" xlink:type="simple">Chen, Y., Xie, H. 
and Shin, H. (2018) Multi-Layer Fusion Techniques Using a CNN for Multispectral Pedestrian Detection. IET Computer Vision, 12, 1179-1187. https://doi.org/10.1049/iet-cvi.2018.5315</mixed-citation></ref><ref id="scirp.104036-ref47"><label>47</label><mixed-citation publication-type="other" xlink:type="simple">Chang, C.C. and Lin, C.J. (2011) LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2, 27:1-27:27. https://doi.org/10.1145/1961189.1961199</mixed-citation></ref><ref id="scirp.104036-ref48"><label>48</label><mixed-citation publication-type="other" xlink:type="simple">Goodfellow, I., Bengio, Y. and Courville, A. (2016) Deep Learning. MIT Press, Cambridge. http://www.deeplearningbook.org</mixed-citation></ref><ref id="scirp.104036-ref49"><label>49</label><mixed-citation publication-type="other" xlink:type="simple">Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y. (2014) Generative Adversarial Networks.</mixed-citation></ref><ref id="scirp.104036-ref50"><label>50</label><mixed-citation publication-type="other" xlink:type="simple">Isola, P., Zhu, J.Y., Zhou, T. and Efros, A.A. (2017) Image-to-Image Translation with Conditional Adversarial Networks. CVPR 2017, Hawaii, 22-25 July 2017, 1642-1650. https://doi.org/10.1109/CVPR.2017.632</mixed-citation></ref><ref id="scirp.104036-ref51"><label>51</label><mixed-citation publication-type="other" xlink:type="simple">Lin, M., Chen, Q. and Yan, S. (2013) Network in Network.</mixed-citation></ref></ref-list></back></article>