Very Deep Convolutional Networks for Large-Scale Image Recognition (2014), K. Simonyan and A. Zisserman

arXiv:1409.1556v6 [cs.CV] 10 Apr 2015

Published as a conference paper at ICLR 2015

VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION

Karen Simonyan* & Andrew Zisserman+
Visual Geometry Group, University of Oxford

1 INTRODUCTION

[...] For instance, the best-performing submissions to the ILSVRC-2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014) utilised smaller receptive window size and smaller stride of the first convolutional layer. Another line of improvements dealt with training and testing the networks densely over the whole image and over multiple scales (Sermanet et al., 2014; Howard, 2014). In this paper, we address another important aspect of ConvNet architecture design: its depth. To this end, we fix other parameters of the architecture, and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3×3) convolution filters in all layers.

As a result, we come up with significantly more accurate ConvNet architectures, which not only achieve the state-of-the-art accuracy on ILSVRC classification and localisation tasks, but are also applicable to other image recognition datasets, where they achieve excellent performance even when used as a part of relatively simple pipelines (e.g. deep features classified by a linear SVM without fine-tuning). We have released our two best-performing models¹ to facilitate further research.

The rest of the paper is organised as follows. In Sect. 2, we describe our ConvNet configurations. The details of the image classification training and evaluation are then presented in Sect. 3, and the configurations are compared on the ILSVRC classification task in Sect. 4. Sect. 5 concludes the paper. For completeness, we also describe and assess our ILSVRC-2014 object localisation system in Appendix A, and discuss the generalisation of very deep features to other datasets in Appendix B. Finally, Appendix C contains the list of major paper revisions.

* current affiliation: Google DeepMind    + current affiliation: University of Oxford and Google DeepMind
¹ http://www.robots.ox.ac.uk/~vgg/research/very_deep/

2 CONVNET CONFIGURATIONS

To measure the improvement brought by the increased ConvNet depth in a fair setting, all our ConvNet layer configurations are designed using the same principles, inspired by Ciresan et al. (2011); Krizhevsky et al. (2012). In this section, we first describe a generic layout of our ConvNet configurations (Sect. 2.1) and then detail the specific configurations used in the evaluation (Sect. 2.2). Our design choices are then discussed and compared to the prior art in Sect. 2.3.

2.1 ARCHITECTURE

During training, the input to our ConvNets is a fixed-size 224×224 RGB image. The only pre-processing we do is subtracting the mean RGB value, computed on the training set, from each pixel. The image is passed through a stack of convolutional (conv.) layers, where we use filters with a very small receptive field: 3×3 (which is the smallest size to capture the notion of left/right, up/down, center). In one of the configurations we also utilise 1×1 convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity). The convolution stride is fixed to 1 pixel; the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3×3 conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all the conv. layers are followed by max-pooling). Max-pooling is performed over a 2×2 pixel window, with stride 2.

A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all networks.

All hidden layers are equipped with the rectification (ReLU (Krizhevsky et al., 2012)) non-linearity. We note that none of our networks (except for one) contain Local Response Normalisation (LRN) (Krizhevsky et al., 2012): as will be shown in Sect. 4, such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time. Where applicable, the parameters for the LRN layer are those of (Krizhevsky et al., 2012).
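To make the generic layout concrete, the snippet below sketches its building blocks in PyTorch (an illustration for this edition, not the authors' implementation): 3×3 convolutions with stride 1 and 1-pixel padding, 2×2 max-pooling with stride 2, ReLU after every hidden layer, and the shared FC head. After five max-poolings a 224×224 input yields a 7×7 map with 512 channels, which fixes the FC input size.

```python
import torch
import torch.nn as nn

# One convolutional "unit" of the generic layout: 3x3 receptive field, stride 1,
# 1-pixel padding (so spatial resolution is preserved), followed by ReLU.
def conv3x3(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
    )

pool = nn.MaxPool2d(kernel_size=2, stride=2)   # 2x2 window, stride 2: halves H and W

# The FC head shared by all configurations; 512 * 7 * 7 input features come
# from applying five max-poolings to a 224x224 input.
head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),                     # one channel per ILSVRC class
)

x = torch.randn(1, 3, 224, 224)                # fixed-size 224x224 RGB input
y = pool(conv3x3(3, 64)(x))
print(y.shape)                                 # torch.Size([1, 64, 112, 112])
```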

2.2 CONFIGURATIONS

The ConvNet configurations, evaluated in this paper, are outlined in Table 1, one per column. In the following we will refer to the nets by their names (A-E). All configurations follow the generic design presented in Sect. 2.1, and differ only in the depth: from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers). The width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512.

Table 1: ConvNet configurations (shown in columns). The depth of the configurations increases from the left (A) to the right (E), as more layers are added. The convolutional layer parameters are denoted as "conv<receptive field size>-<number of channels>". The ReLU activation function is not shown for brevity.

                            ConvNet Configuration
A           A-LRN       B           C           D           E
11 weight   11 weight   13 weight   16 weight   16 weight   19 weight
layers      layers      layers      layers      layers      layers
                       input (224×224 RGB image)
conv3-64    conv3-64    conv3-64    conv3-64    conv3-64    conv3-64
            LRN         conv3-64    conv3-64    conv3-64    conv3-64
                                maxpool
conv3-128   conv3-128   conv3-128   conv3-128   conv3-128   conv3-128
                        conv3-128   conv3-128   conv3-128   conv3-128
                                maxpool
conv3-256   conv3-256   conv3-256   conv3-256   conv3-256   conv3-256
conv3-256   conv3-256   conv3-256   conv3-256   conv3-256   conv3-256
                                    conv1-256   conv3-256   conv3-256
                                                            conv3-256
                                maxpool
conv3-512   conv3-512   conv3-512   conv3-512   conv3-512   conv3-512
conv3-512   conv3-512   conv3-512   conv3-512   conv3-512   conv3-512
                                    conv1-512   conv3-512   conv3-512
                                                            conv3-512
                                maxpool
conv3-512   conv3-512   conv3-512   conv3-512   conv3-512   conv3-512
conv3-512   conv3-512   conv3-512   conv3-512   conv3-512   conv3-512
                                    conv1-512   conv3-512   conv3-512
                                                            conv3-512
                                maxpool
                                FC-4096
                                FC-4096
                                FC-1000
                                soft-max

In Table 2 we report the number of parameters for each configuration. In spite of a large depth, the number of weights in our nets is not greater than the number of weights in a more shallow net with larger conv. layer widths and receptive fields (144M weights in (Sermanet et al., 2014)).

Table 2: Number of parameters (in millions).

Network                A, A-LRN   B     C     D     E
Number of parameters   133        133   134   138   144
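Each column of Table 1 maps naturally onto a flat list of layers. The sketch below (a conventional rendering of the table, not code from the paper; `cfgs`, `make_layers` and `vgg` are illustrative names) builds configurations A-E and reproduces the parameter counts of Table 2:

```python
import torch.nn as nn

# Table 1 as data: an integer is the output width of a 3x3 conv., "M" is
# max-pooling, ("1x1", c) is a 1x1 conv. (configuration C only).
# "A-LRN" adds a parameter-free LRN layer to A and is omitted here.
cfgs = {
    "A": [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "B": [64, 64, "M", 128, 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "C": [64, 64, "M", 128, 128, "M", 256, 256, ("1x1", 256), "M",
          512, 512, ("1x1", 512), "M", 512, 512, ("1x1", 512), "M"],
    "D": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
          512, 512, 512, "M", 512, 512, 512, "M"],
    "E": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
          512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}

def make_layers(cfg):
    layers, in_ch = [], 3
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(2, 2))
        else:
            k, out_ch = (1, v[1]) if isinstance(v, tuple) else (3, v)
            layers += [nn.Conv2d(in_ch, out_ch, k, padding=k // 2),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
    return nn.Sequential(*layers)

def vgg(name):
    # Dropout is omitted since it does not affect the parameter count.
    return nn.Sequential(
        make_layers(cfgs[name]), nn.Flatten(),
        nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
        nn.Linear(4096, 4096), nn.ReLU(inplace=True),
        nn.Linear(4096, 1000),
    )

for name in cfgs:  # matches Table 2: A=133, B=133, C=134, D=138, E=144
    n = sum(p.numel() for p in vgg(name).parameters())
    print(name, round(n / 1e6))
```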

2.3 DISCUSSION

Our ConvNet configurations are quite different from the ones used in the top-performing entries of the ILSVRC-2012 (Krizhevsky et al., 2012) and ILSVRC-2013 competitions (Zeiler & Fergus, 2013; Sermanet et al., 2014). Rather than using relatively large receptive fields in the first conv. layers (e.g. 11×11 with stride 4 in (Krizhevsky et al., 2012), or 7×7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al., 2014)), we use very small 3×3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1). It is easy to see that a stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5; three such layers have a 7×7 effective receptive field. So what have we gained by using, for instance, a stack of three 3×3 conv. layers instead of a single 7×7 layer? First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative. Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3×3 convolution stack have C channels, the stack is parametrised by 3(3²C²) = 27C² weights; at the same time, a single 7×7 conv. layer would require 7²C² = 49C² parameters, i.e. 81% more. This can be seen as imposing a regularisation on the 7×7 conv. filters, forcing them to have a decomposition through the 3×3 filters (with non-linearity injected in between).
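A quick numerical check of this parameter count, with bias terms omitted (an illustrative sketch, not code from the paper):

```python
import torch.nn as nn

C = 256  # any channel width; the ratio is independent of C

stack = nn.Sequential(*[nn.Conv2d(C, C, 3, padding=1, bias=False) for _ in range(3)])
single = nn.Conv2d(C, C, 7, padding=3, bias=False)

n_stack = sum(p.numel() for p in stack.parameters())    # 3 * 3^2 * C^2 = 27 C^2
n_single = sum(p.numel() for p in single.parameters())  # 7^2 * C^2 = 49 C^2
print(n_stack, n_single, f"{n_single / n_stack - 1:.0%}")  # ... "81%" more
```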

The incorporation of 1×1 conv. layers (configuration C, Table 1) is a way to increase the non-linearity of the decision function without affecting the receptive fields of the conv. layers. Even though in our case the 1×1 convolution is essentially a linear projection onto the space of the same dimensionality (the number of input and output channels is the same), an additional non-linearity is introduced by the rectification function. It should be noted that 1×1 conv. layers have recently been utilised in the "Network in Network" architecture of Lin et al. (2014).
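Concretely, such a layer is a per-pixel linear map across channels followed by a ReLU; a minimal sketch, with an assumed channel width of 256:

```python
import torch
import torch.nn as nn

proj = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=1),  # linear projection, same dimensionality
    nn.ReLU(inplace=True),               # the added non-linearity
)
y = proj(torch.randn(1, 256, 56, 56))
print(y.shape)  # torch.Size([1, 256, 56, 56]): receptive field unaffected
```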

Small-size convolution filters have been previously used by Ciresan et al. (2011), but their nets are significantly less deep than ours, and they did not evaluate on the large-scale ILSVRC dataset. Goodfellow et al. (2014) applied deep ConvNets (11 weight layers) to the task of street number recognition, and showed that the increased depth led to better performance. GoogLeNet (Szegedy et al., 2014), a top-performing entry of the ILSVRC-2014 classification task, was developed independently of our work, but is similar in that it is based on very deep ConvNets (22 weight layers) and small convolution filters (apart from 3×3, they also use 1×1 and 5×5 convolutions). Their network topology is, however, more complex than ours, and the spatial resolution of the feature maps is reduced more aggressively in the first layers to decrease the amount of computation. As will be shown in Sect. 4.5, our model outperforms that of Szegedy et al. (2014) in terms of the single-network classification accuracy.

3 CLASSIFICATION FRAMEWORK

In the previous section we presented the details of our network configurations. In this section, we describe the details of classification ConvNet training and evaluation.

3.1 TRAINING

The ConvNet training procedure generally follows Krizhevsky et al. (2012) (except for sampling the input crops from multi-scale training images, as explained later). Namely, the training is carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent (based on back-propagation (LeCun et al., 1989)) with momentum. The batch size was set to 256, momentum to 0.9. The training was regularised by weight decay (the L2 penalty multiplier set to 5·10⁻⁴) and dropout regularisation for the first two fully-connected layers (dropout ratio set to 0.5). The learning rate was initially set to 10⁻², and then decreased by a factor of 10 when the validation set accuracy stopped improving. In total, the learning rate was decreased 3 times, and the learning was stopped after 370K iterations (74 epochs). We conjecture that in spite of the larger number of parameters and the greater depth of our nets compared to (Krizhevsky et al., 2012), the nets required fewer epochs to converge due to (a) implicit regularisation imposed by greater depth and smaller conv. filter sizes; (b) pre-initialisation of certain layers.
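Expressed with a modern optimiser API, the recipe might look as follows (a sketch; the `model` here is a trivial stand-in for one of the Table 1 configurations, and dropout would live inside the model definition between the FC layers):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 1000))

optimizer = torch.optim.SGD(model.parameters(),
                            lr=1e-2,            # initial learning rate 10^-2
                            momentum=0.9,
                            weight_decay=5e-4)  # L2 penalty multiplier 5*10^-4
# Divide the learning rate by 10 when validation accuracy stops improving
# (done 3 times over 370K iterations / 74 epochs in the paper).
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max",
                                                       factor=0.1)

# One illustrative SGD step on random data (the paper uses batches of 256):
images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,))
loss = nn.functional.cross_entropy(model(images), labels)  # multinomial logistic regression objective
loss.backward()
optimizer.step()
optimizer.zero_grad()
scheduler.step(0.5)   # once per epoch, with the measured validation accuracy
```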

The initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradient in deep nets. To circumvent this problem, we began with training the configuration A (Table 1), shallow enough to be trained with random initialisation. Then, when training deeper architectures, we initialised the first four convolutional layers and the last three fully-connected layers with the layers of net A (the intermediate layers were initialised randomly). We did not decrease the learning rate for the pre-initialised layers, allowing them to change during learning. For random initialisation (where applicable), we sampled the weights from a normal distribution with zero mean and 10⁻² variance. The biases were initialised with zero. It is worth noting that after the paper submission we found that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio (2010).
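A sketch of this scheme, assuming PyTorch modules (note that a variance of 10⁻² corresponds to a standard deviation of 0.1):

```python
import torch.nn as nn

def init_random(m):
    # Zero-mean normal with 10^-2 variance (i.e. std = 0.1); biases set to zero.
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=0.1)
        nn.init.zeros_(m.bias)

def init_glorot(m):
    # The pre-training-free alternative found after submission: Glorot & Bengio (2010).
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(64, 64, 3, padding=1))
net.apply(init_random)   # random initialisation, as for configuration A

# When training a deeper net, the layers pre-initialised from a trained net A
# can then be copied over, e.g. via load_state_dict(..., strict=False) with a
# partial state dict holding the first four conv. and the three FC layers.
```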

[...] Let S be the smallest side of an isotropically-rescaled training image, from which the fixed-size 224×224 crops are sampled: for S = 224 the crop captures whole-image statistics, while for S ≫ 224 the crop will correspond to a small part of the image, containing a small object or an object part.

We consider two approaches for setting the training scale S. The first is to fix S, which corresponds to single-scale training (note that image content within the sampled crops can still represent multi-scale image statistics). In our experiments, we evaluated models trained at two fixed scales: S = 256 (which has been widely used in the prior art (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014)) and S = 384. Given a ConvNet configuration, we first trained the network using S = 256. To speed up training of the S = 384 network, it was initialised with the weights pre-trained with S = 256, and we used a smaller initial learning rate of 10⁻³.

The second approach to setting S is multi-scale training, where each training image is individually rescaled by randomly sampling S from a certain range [Smin, Smax] (we used Smin = 256 and Smax = 512). Since objects in images can be of different size, it is beneficial to take this into account during training. This can also be seen as training set augmentation by scale jittering, where a single model is trained to recognise objects over a wide range of scales. For speed reasons, we trained multi-scale models by fine-tuning all layers of a single-scale model with the same configuration, pre-trained with fixed S = 384.
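A sketch of this rescaling and crop sampling, using torchvision as a stand-in (`ScaleJitterCrop` is an illustrative helper; the familiar ImageNet channel means stand in for the training-set mean RGB value, which is the paper's only pre-processing):

```python
import random
from PIL import Image
import torchvision.transforms as T
import torchvision.transforms.functional as TF

class ScaleJitterCrop:
    """Isotropically rescale so the smallest side is S ~ U[s_min, s_max],
    then sample a fixed-size 224x224 crop (S is drawn per image)."""
    def __init__(self, s_min=256, s_max=512, crop=224):
        self.s_min, self.s_max, self.crop = s_min, s_max, crop

    def __call__(self, img):
        s = random.randint(self.s_min, self.s_max)
        img = TF.resize(img, s)          # smallest side -> S, aspect ratio kept
        return T.RandomCrop(self.crop)(img)

train_tf = T.Compose([
    ScaleJitterCrop(256, 512),           # multi-scale; fix S=256 or S=384 for single-scale
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    # Mean RGB subtraction only: std is left at 1.
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[1.0, 1.0, 1.0]),
])

img = Image.new("RGB", (640, 480))       # dummy image for a smoke test
print(train_tf(img).shape)               # torch.Size([3, 224, 224])
```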

3.2 TESTING

At test time, given a trained ConvNet and an input image, it is classified in the following way. First, it is isotropically rescaled to a pre-defined smallest image side, denoted as Q (we also refer to it as the test scale). We note that Q is not necessarily equal to the training scale S (as we will show in Sect. 4, using several values of Q for each S leads to improved performance). Then, the network is applied densely over the rescaled test image in a way similar to (Sermanet et al., 2014). Namely, the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7×7 conv. layer, the last two FC layers to 1×1 conv. layers). The resulting fully-convolutional net is then applied to the whole (uncropped) image. The result is a class score map with the number of channels equal to the number of classes, and a variable spatial resolution, dependent on the input image size. Finally, to obtain a fixed-size vector of class scores for the image, the class score map is spatially averaged (sum-pooled).
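The conversion is mechanical, since an FC layer is a convolution whose kernel spans its whole input; its weights are simply reshaped. A sketch assuming the 512×7×7 feature geometry of the architectures above (`convolutionalize` is an illustrative helper):

```python
import torch
import torch.nn as nn

def convolutionalize(fc: nn.Linear, kernel: int, in_ch: int) -> nn.Conv2d:
    # An FC layer over an (in_ch, kernel, kernel) input equals a conv. layer
    # whose kernel covers the entire input.
    conv = nn.Conv2d(in_ch, fc.out_features, kernel_size=kernel)
    conv.weight.data = fc.weight.data.view(fc.out_features, in_ch, kernel, kernel)
    conv.bias.data = fc.bias.data
    return conv

fc1 = nn.Linear(512 * 7 * 7, 4096)
fc2 = nn.Linear(4096, 4096)
fc3 = nn.Linear(4096, 1000)

dense_head = nn.Sequential(
    convolutionalize(fc1, 7, 512),    # first FC layer -> 7x7 conv. layer
    nn.ReLU(inplace=True),
    convolutionalize(fc2, 1, 4096),   # last two FC layers -> 1x1 conv. layers
    nn.ReLU(inplace=True),
    convolutionalize(fc3, 1, 4096),
)

feat = torch.randn(1, 512, 12, 9)     # features of an uncropped, rescaled image
scores = dense_head(feat)             # class score map: (1, 1000, 6, 3)
image_scores = scores.mean(dim=(2, 3))  # spatial averaging -> fixed-size 1000-vector
```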

We also augment the test set by horizontal flipping of the images; the soft-max class posteriors of the original and flipped images are averaged to obtain the final scores for the image. Since the fully-convolutional network is applied over the whole image, there is no need to sample multiple crops at test time (Krizhevsky et al., 2012), which is less efficient as it requires network re-computation for each crop. At the same time, using a la[...]

本站链接:文库   一言   我酷   合作


客服QQ:2549714901微博号:文库网官方知乎号:文库网

经营许可证编号: 粤ICP备2021046453号世界地图

文库网官网©版权所有2025营业执照举报