
arXiv:1409.1556v6 [cs.CV] 10 Apr 2015
Published as a conference paper at ICLR 2015

VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION

Karen Simonyan* & Andrew Zisserman+
Visual Geometry Group, Department of Engineering Science, University of Oxford
(* current affiliation: Google DeepMind; + current affiliation: University of Oxford and Google DeepMind)

ABSTRACT

In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

1 INTRODUCTION

Convolutional networks (ConvNets) have recently enjoyed a great success in large-scale image and video recognition (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014; Simonyan & Zisserman, 2014). With ConvNets becoming more of a commodity in the computer vision field, a number of attempts have been made to improve the original architecture of Krizhevsky et al. (2012) in a bid to achieve better accuracy. For instance, the best-performing submissions to the ILSVRC-2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014) utilised smaller receptive window size and smaller stride of the first convolutional layer. Another line of improvements dealt with training and testing the networks densely over the whole image and over multiple scales (Sermanet et al., 2014; Howard, 2014). In this paper, we address another important aspect of ConvNet architecture design: its depth. To this end, we fix other parameters of the architecture, and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3×3) convolution filters in all layers.

As a result, we come up with significantly more accurate ConvNet architectures, which not only achieve state-of-the-art accuracy on ILSVRC classification and localisation tasks, but are also applicable to other image recognition datasets, where they achieve excellent performance even when used as a part of relatively simple pipelines (e.g. deep features classified by a linear SVM without fine-tuning). We have released our two best-performing models (http://www.robots.ox.ac.uk/~vgg/research/very_deep/) to facilitate further research.

The rest of the paper is organised as follows. In Sect. 2, we describe our ConvNet configurations. The details of the image classification training and evaluation are then presented in Sect. 3, and the configurations are compared on the ILSVRC classification task in Sect. 4. Sect. 5 concludes the paper. For completeness, we also describe and assess our ILSVRC-2014 object localisation system in Appendix A, and discuss the generalisation of very deep features to other datasets in Appendix B. Finally, Appendix C contains the list of major paper revisions.

2 CONVNET CONFIGURATIONS

To measure the improvement brought by the increased ConvNet depth in a fair setting, all our ConvNet layer configurations are designed using the same principles, inspired by Ciresan et al. (2011) and Krizhevsky et al. (2012). In this section, we first describe a generic layout of our ConvNet configurations (Sect. 2.1) and then detail the specific configurations used in the evaluation (Sect. 2.2). Our design choices are then discussed and compared to the prior art in Sect. 2.3.

2.1 ARCHITECTURE

During training, the input to our ConvNets is a fixed-size 224×224 RGB image. The only pre-processing we do is subtracting the mean RGB value, computed on the training set, from each pixel. The image is passed through a stack of convolutional (conv.) layers, where we use filters with a very small receptive field: 3×3 (which is the smallest size to capture the notion of left/right, up/down, center). In one of the configurations we also utilise 1×1 convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity). The convolution stride is fixed to 1 pixel; the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3×3 conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all the conv. layers are followed by max-pooling). Max-pooling is performed over a 2×2 pixel window, with stride 2.

A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all networks.
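This layout is regular enough to sketch in code. Below is a minimal PyTorch sketch of the rules just described (an assumption on our part: the paper's own implementation was Caffe-based, and the helper names here are ours, not the authors'). After the five 2×2 poolings, a 224×224 input is reduced to a 7×7×512 feature map, which fixes the input size of the first FC layer.

    import torch.nn as nn

    def conv3x3(in_ch, out_ch):
        # 3x3 receptive field, stride 1, 1-pixel padding: resolution preserved
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
        )

    pool = nn.MaxPool2d(kernel_size=2, stride=2)  # 2x2 window, stride 2

    # FC head shared by all configurations (224 / 2^5 = 7):
    head = nn.Sequential(
        nn.Flatten(),
        nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
        nn.Linear(4096, 4096), nn.ReLU(inplace=True),
        nn.Linear(4096, 1000),  # 1000-way class scores; the soft-max sits on top
    )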

All hidden layers are equipped with the rectification (ReLU (Krizhevsky et al., 2012)) non-linearity. We note that none of our networks (except for one) contain Local Response Normalisation (LRN) (Krizhevsky et al., 2012): as will be shown in Sect. 4, such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time. Where applicable, the parameters for the LRN layer are those of (Krizhevsky et al., 2012).

2.2 CONFIGURATIONS

The ConvNet configurations, evaluated in this paper, are outlined in Table 1, one per column. In the following we will refer to the nets by their names (A-E). All configurations follow the generic design presented in Sect. 2.1, and differ only in the depth: from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers). The width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.

In Table 2 we report the number of parameters for each configuration. In spite of a large depth, the number of weights in our nets is not greater than the number of weights in a more shallow net with larger conv. layer widths and receptive fields (144M weights in (Sermanet et al., 2014)).

2.3 DISCUSSION

Our ConvNet configurations are quite different from the ones used in the top-performing entries of the ILSVRC-2012 (Krizhevsky et al., 2012) and ILSVRC-2013 competitions (Zeiler & Fergus, 2013; Sermanet et al., 2014). Rather than using relatively large receptive fields in the first conv. layers (e.g. 11×11 with stride 4 in (Krizhevsky et al., 2012), or 7×7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al., 2014)), we use very small 3×3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1). It is easy to see that a stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5; three such layers have a 7×7 effective receptive field.

Table 1: ConvNet configurations (shown in columns). The depth of the configurations increases from the left (A) to the right (E), as more layers are added (the added layers are shown in bold in the original). The convolutional layer parameters are denoted as "conv<receptive field size>-<number of channels>". The ReLU activation function is not shown for brevity.

    A           A-LRN       B           C           D           E
    11 weight   11 weight   13 weight   16 weight   16 weight   19 weight
    layers      layers      layers      layers      layers      layers
    ---------------------- input (224×224 RGB image) ----------------------
    conv3-64    conv3-64    conv3-64    conv3-64    conv3-64    conv3-64
                LRN         conv3-64    conv3-64    conv3-64    conv3-64
    ------------------------------- maxpool -------------------------------
    conv3-128   conv3-128   conv3-128   conv3-128   conv3-128   conv3-128
                            conv3-128   conv3-128   conv3-128   conv3-128
    ------------------------------- maxpool -------------------------------
    conv3-256   conv3-256   conv3-256   conv3-256   conv3-256   conv3-256
    conv3-256   conv3-256   conv3-256   conv3-256   conv3-256   conv3-256
                                        conv1-256   conv3-256   conv3-256
                                                                conv3-256
    ------------------------------- maxpool -------------------------------
    conv3-512   conv3-512   conv3-512   conv3-512   conv3-512   conv3-512
    conv3-512   conv3-512   conv3-512   conv3-512   conv3-512   conv3-512
                                        conv1-512   conv3-512   conv3-512
                                                                conv3-512
    ------------------------------- maxpool -------------------------------
    conv3-512   conv3-512   conv3-512   conv3-512   conv3-512   conv3-512
    conv3-512   conv3-512   conv3-512   conv3-512   conv3-512   conv3-512
                                        conv1-512   conv3-512   conv3-512
                                                                conv3-512
    ------------------------------- maxpool -------------------------------
                                  FC-4096
                                  FC-4096
                                  FC-1000
                                  soft-max
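Read column-wise, Table 1 is compact enough to be written down as data. The sketch below is our own encoding, not the authors' code: each configuration is a list of conv3 output widths, with 'M' marking a max-pool and ('1', k) marking configuration C's conv1 layers.

    import torch.nn as nn

    cfgs = {
        'A': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
        'B': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
        'C': [64, 64, 'M', 128, 128, 'M', 256, 256, ('1', 256), 'M',
              512, 512, ('1', 512), 'M', 512, 512, ('1', 512), 'M'],
        'D': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
              512, 512, 512, 'M', 512, 512, 512, 'M'],
        'E': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M',
              512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
    }

    def make_features(cfg):
        # Build the convolutional part of one column; A-LRN is A plus an LRN layer.
        layers, in_ch = [], 3
        for v in cfg:
            if v == 'M':
                layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
            else:
                k, out_ch = (1, v[1]) if isinstance(v, tuple) else (3, v)
                layers += [nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2),
                           nn.ReLU(inplace=True)]
                in_ch = out_ch
        return nn.Sequential(*layers)

Counting the conv entries in each list plus the three FC layers reproduces the 11/11/13/16/16/19 weight-layer depths of the table.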

Table 2: Number of parameters (in millions).

    Network                A, A-LRN   B     C     D     E
    Number of parameters   133        133   134   138   144

So what have we gained by using, for instance, a stack of three 3×3 conv. layers instead of a single 7×7 layer? First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative. Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3×3 convolution stack have C channels, the stack is parametrised by 3(3²C²) = 27C² weights; at the same time, a single 7×7 conv. layer would require 7²C² = 49C² parameters, i.e. 81% more. This can be seen as imposing a regularisation on the 7×7 conv. filters, forcing them to have a decomposition through the 3×3 filters (with non-linearity injected in between).
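The arithmetic is easy to verify with a throwaway sketch (ours):

    # Weights of three stacked 3x3 conv layers vs. one 7x7 layer, C channels
    # in and out; biases ignored, as in the paper's count.
    C = 512
    stack = 3 * (3 * 3 * C * C)    # 3(3^2 C^2) = 27 C^2
    single = 7 * 7 * C * C         # 7^2 C^2   = 49 C^2
    print(single / stack)          # 49/27 = 1.81..., i.e. 81% more parameters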

The incorporation of 1×1 conv. layers (configuration C, Table 1) is a way to increase the non-linearity of the decision function without affecting the receptive fields of the conv. layers. Even though in our case the 1×1 convolution is essentially a linear projection onto the space of the same dimensionality (the number of input and output channels is the same), an additional non-linearity is introduced by the rectification function. It should be noted that 1×1 conv. layers have recently been utilised in the "Network in Network" architecture of Lin et al. (2014).
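In code terms (a sketch, ours), such a layer pair is simply:

    import torch.nn as nn

    # A 1x1 convolution is a per-pixel linear projection across channels; with
    # equal input/output channels it changes neither the dimensionality nor the
    # receptive field, so the trailing ReLU is the only thing the pair adds.
    extra_nonlinearity = nn.Sequential(
        nn.Conv2d(256, 256, kernel_size=1),  # e.g. the conv1-256 layer of config C
        nn.ReLU(inplace=True),
    )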

Small-size convolution filters have been previously used by Ciresan et al. (2011), but their nets are significantly less deep than ours, and they did not evaluate on the large-scale ILSVRC dataset. Goodfellow et al. (2014) applied deep ConvNets (11 weight layers) to the task of street number recognition, and showed that the increased depth led to better performance. GoogLeNet (Szegedy et al., 2014), a top-performing entry of the ILSVRC-2014 classification task, was developed independently of our work, but is similar in that it is based on very deep ConvNets (22 weight layers) and small convolution filters (apart from 3×3, they also use 1×1 and 5×5 convolutions). Their network topology is, however, more complex than ours, and the spatial resolution of the feature maps is reduced more aggressively in the first layers to decrease the amount of computation. As will be shown in Sect. 4.5, our model outperforms that of Szegedy et al. (2014) in terms of the single-network classification accuracy.

3 CLASSIFICATION FRAMEWORK

In the previous section we presented the details of our network configurations. In this section, we describe the details of classification ConvNet training and evaluation.

3.1 TRAINING

The ConvNet training procedure generally follows Krizhevsky et al. (2012) (except for sampling the input crops from multi-scale training images, as explained later). Namely, the training is carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent (based on back-propagation (LeCun et al., 1989)) with momentum. The batch size was set to 256, momentum to 0.9. The training was regularised by weight decay (the L2 penalty multiplier set to 5·10⁻⁴) and dropout regularisation for the first two fully-connected layers (dropout ratio set to 0.5). The learning rate was initially set to 10⁻², and then decreased by a factor of 10 when the validation set accuracy stopped improving. In total, the learning rate was decreased 3 times, and the learning was stopped after 370K iterations (74 epochs). We conjecture that in spite of the larger number of parameters and the greater depth of our nets compared to (Krizhevsky et al., 2012), the nets required fewer epochs to converge due to (a) implicit regularisation imposed by greater depth and smaller conv. filter sizes; (b) pre-initialisation of certain layers.
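For concreteness, these hyper-parameters translate as follows into a modern optimiser API (a sketch under the assumption of PyTorch; the paper's own implementation was Caffe-based):

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1))  # stand-in for any Table 1 net

    optimizer = torch.optim.SGD(model.parameters(),
                                lr=1e-2,            # initial learning rate 10^-2
                                momentum=0.9,
                                weight_decay=5e-4)  # L2 penalty multiplier 5*10^-4
    # Drop the rate by a factor of 10 when validation accuracy plateaus
    # (this happened 3 times over the 370K iterations / 74 epochs reported above);
    # call scheduler.step(val_accuracy) after each validation pass.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.1)
    criterion = nn.CrossEntropyLoss()  # multinomial logistic regression objective
    # Dropout (ratio 0.5) follows the first two FC layers: nn.Dropout(p=0.5);
    # mini-batches contain 256 images each.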

The initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradient in deep nets. To circumvent this problem, we began with training the configuration A (Table 1), shallow enough to be trained with random initialisation. Then, when training deeper architectures, we initialised the first four convolutional layers and the last three fully-connected layers with the layers of net A (the intermediate layers were initialised randomly). We did not decrease the learning rate for the pre-initialised layers, allowing them to change during learning. For random initialisation (where applicable), we sampled the weights from a normal distribution with the zero mean and 10⁻² variance. The biases were initialised with zero. It is worth noting that after the paper submission we found that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio (2010).
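As a sketch (ours; std = 0.1 below follows the paper's wording of a 10⁻² variance, though many re-implementations use std = 0.01 instead):

    import torch.nn as nn

    def init_weights(m):
        # weights ~ N(0, 10^-2 variance), i.e. std = 0.1; biases set to zero
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.normal_(m.weight, mean=0.0, std=0.1)
            nn.init.zeros_(m.bias)

    # usage: net.apply(init_weights)
    # The Glorot & Bengio (2010) alternative mentioned above is available as
    # nn.init.xavier_uniform_ / nn.init.xavier_normal_.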

To obtain the fixed-size 224×224 ConvNet input images, they were randomly cropped from rescaled training images (one crop per image per SGD iteration). To further augment the training set, the crops underwent random horizontal flipping and random RGB colour shift (Krizhevsky et al., 2012). Training image rescaling is explained below.

Training image size. Let S be the smallest side of an isotropically-rescaled training image, from which the ConvNet input is cropped (we also refer to S as the training scale). While the crop size is fixed to 224×224, in principle S can take any value not less than 224: for S = 224 the crop will capture whole-image statistics, completely spanning the smallest side of a training image; for S ≫ 224 the crop will correspond to a small part of the image, containing a small object or an object part.

We consider two approaches for setting the training scale S. The first is to fix S, which corresponds to single-scale training (note that image content within the sampled crops can still represent multi-scale image statistics). In our experiments, we evaluated models trained at two fixed scales: S = 256 (which has been widely used in the prior art (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014)) and S = 384. Given a ConvNet configuration, we first trained the network using S = 256. To speed-up training of the S = 384 network, it was initialised with the weights pre-trained with S = 256, and we used a smaller initial learning rate of 10⁻³.

The second approach to setting S is multi-scale training, where each training image is individually rescaled by randomly sampling S from a certain range [Smin, Smax] (we used Smin = 256 and Smax = 512). Since objects in images can be of different size, it is beneficial to take this into account during training. This can also be seen as training set augmentation by scale jittering, where a single model is trained to recognise objects over a wide range of scales. For speed reasons, we trained multi-scale models by fine-tuning all layers of a single-scale model with the same configuration, pre-trained with fixed S = 384.
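A sketch of this scale jittering with torchvision (an assumption; the library post-dates the paper): the smallest image side is rescaled to a per-image random S, and a fixed 224×224 crop is then taken.

    import random
    from torchvision import transforms

    S_MIN, S_MAX = 256, 512

    def jittered_transform():
        # Re-sample S for each training image; transforms.Resize with a single
        # int rescales the smallest side to S, keeping aspect ratio (isotropic).
        S = random.randint(S_MIN, S_MAX)
        return transforms.Compose([
            transforms.Resize(S),
            transforms.RandomCrop(224),         # one 224x224 crop per SGD iteration
            transforms.RandomHorizontalFlip(),  # augmentation from Sect. 3.1
        ])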

3.2 TESTING

At test time, given a trained ConvNet and an input image, it is classified in the following way. First, it is isotropically rescaled to a pre-defined smallest image side, denoted as Q (we also refer to it as the test scale). We note that Q is not necessarily equal to the training scale S (as we will show in Sect. 4, using several values of Q for each S leads to improved performance). Then, the network is applied densely over the rescaled test image in a way similar to (Sermanet et al., 2014). Namely, the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7×7 conv. layer, the last two FC layers to 1×1 conv. layers). The resulting fully-convolutional net is then applied to the whole (uncropped) image. The result is a class score map with the number of channels equal to the number of classes, and a variable spatial resolution, dependent on the input image size. Finally, to obtain a fixed-size vector of class scores for the image, the class score map is spatially averaged (sum-pooled). We also augment the test set by horizontal flipping of the images; the soft-max class posteriors of the original and flipped images are averaged to obtain the final scores for the image.

Since the fully-convolutional network is applied over the whole image, there is no need to sample multiple crops at test time (Krizhevsky et al., 2012), which is less efficient as it requires network re-computation for each crop. At the same time, using a large set of crops, as done by Szegedy et al. (2014), can lead to improved accuracy, as it results in a finer sampling of the input image compared to the fully-convolutional net.
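The FC-to-conv conversion is mechanical: the first FC layer consumed a 7×7×512 feature map, so its weight matrix reshapes into a 7×7 convolution, and the remaining two become 1×1 convolutions. A sketch (ours, in PyTorch; fc1-fc3 stand for the three trained FC layers):

    import torch.nn as nn

    def fc_to_conv(fc1, fc2, fc3):
        # fc1: Linear(512*7*7, 4096), fc2: Linear(4096, 4096), fc3: Linear(4096, 1000)
        conv1 = nn.Conv2d(512, 4096, kernel_size=7)   # first FC -> 7x7 conv
        conv2 = nn.Conv2d(4096, 4096, kernel_size=1)  # second FC -> 1x1 conv
        conv3 = nn.Conv2d(4096, 1000, kernel_size=1)  # last FC -> 1x1 conv
        conv1.weight.data = fc1.weight.data.view(4096, 512, 7, 7)
        conv2.weight.data = fc2.weight.data.view(4096, 4096, 1, 1)
        conv3.weight.data = fc3.weight.data.view(1000, 4096, 1, 1)
        for conv, fc in ((conv1, fc1), (conv2, fc2), (conv3, fc3)):
            conv.bias.data = fc.bias.data
        return nn.Sequential(conv1, nn.ReLU(inplace=True),
                             conv2, nn.ReLU(inplace=True), conv3)

    # Applied to an uncropped image, the result is a class score map; spatial
    # averaging yields the fixed-size score vector: score_map.mean(dim=(2, 3)).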
