Deep Learning for Vision: Tricks of the Trade
Marc'Aurelio Ranzato
Facebook, AI Group
BAVM, Friday, 4 October 2013
www.cs.toronto.edu/ranzato

Ideal Features
[Figure: an image fed to an "Ideal Feature Extractor", which outputs:]
- window, right
- chair, left
- monitor, top of shelf
- carpet, bottom
- drums, corner
- ...
- pillows on couch
Q.: What objects are in the image? Where is the lamp? What is on the couch? ...

The Manifold of Natural Images
[Figure]

Ideal Feature Extraction
[Figure labels: Pixel 1, Pixel 2, Pixel n; Expression, Pose; Ideal Feature Extractor]

Learning Non-Linear Features
Given a dictionary of simple non-linear functions: $g_1, \dots, g_n$
- Proposal #1: linear combination, $f(x) \approx \sum_j g_j$
  SHALLOW: kernel learning, boosting, ...
- Proposal #2: composition, $f(x) \approx g_1(g_2(\dots g_n(x) \dots))$
  DEEP: deep learning, scattering networks (wavelet cascade), S.C. Zhu & D. Mumford "grammar"

Linear Combination
Input image -> template matchers -> (+) -> prediction of class
BAD: it may require an exponential nr. of templates!

Composition
Input image -> low-level parts -> mid-level parts -> high-level parts -> prediction of class
GOOD: (exponentially) more efficient
- reuse intermediate parts
- distributed representations
Lee et al. "Convolutional DBNs ..." ICML 2009, Zeiler & Fergus

A Potential Problem with Deep Learning
Optimization is difficult: non-convex, non-linear system (layers 1, 2, 3, 4).
- Solution #1: freeze first N-1 layer (engineer the features), e.g. SIFT -> k-Means -> Pooling -> Classifier.
  It makes it shallow!
  - How to design features of features?
  - How to design features for new imagery?
- Solution #2: live with it!
  It will converge to a local minimum. It is much more powerful!
  Given lots of data, engineer less and learn more!
  Just need to know a few tricks of the trade.

Deep Learning in Practice
Optimization is easy, need to know a few tricks of the trade.
Q: What's the feature extractor? And what's the classifier?
A: No distinction, end-to-end learning!
It works very well in practice: [Figure]

KEY IDEAS: WHY DEEP LEARNING
- We need non-linear system
- We need to learn it from data
- Build feature hierarchies
- Distributed representations
- Compositionality
- End-to-end learning

What Is Deep Learning?

Buzz Words
- It's a Contrastive Divergence
- It's a Convolutional Net
- It's just old Neural Nets
- It's a Feature Learning
- It's a Deep Belief Net
- It's a Unsupervised Learning

(My) Definition
A Deep Learning method is: a method which makes predictions by using a sequence of non-linear processing stages. The resulting intermediate representations can be interpreted as feature hierarchies and the whole system is jointly learned from data.
Some deep learning methods are probabilistic, others are loss-based, some are supervised, others unsupervised. It's a large family!

THE SPACE OF MACHINE LEARNING METHODS
[Figure: a map of methods, built up over time.]
1957 (Rosenblatt): Perceptron.
80s (back-propagation & compute power): Perceptron, Neural Net, Autoencoder Neural Net.
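
The two proposals from the "Learning Non-Linear Features" slides, and the "sequence of non-linear processing stages" in the definition above, can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the function names, the per-term weights in the linear combination, and the toy dictionary (`math.tanh`, `relu`, `square`) are all assumptions chosen for the example.

```python
import math


def linear_combination(weights, gs, x):
    """Proposal #1 (shallow): f(x) = sum_j w_j * g_j(x).

    The weights w_j are an assumption added for illustration.
    """
    return sum(w * g(x) for w, g in zip(weights, gs))


def composition(gs, x):
    """Proposal #2 (deep): f(x) = g_1(g_2(... g_n(x) ...))."""
    for g in reversed(gs):  # apply g_n first and g_1 last
        x = g(x)
    return x


# Hypothetical dictionary of simple non-linear functions.
def relu(v):
    return max(0.0, v)


def square(v):
    return v * v


gs = [math.tanh, relu, square]

shallow = linear_combination([1.0, 1.0, 1.0], gs, 0.5)  # tanh(0.5) + relu(0.5) + 0.5**2
deep = composition(gs, 0.5)                             # tanh(relu(0.5**2))
print(shallow, deep)
```

In the shallow form every function sees the raw input and only the final sum mixes them; in the deep form each stage consumes the output of the previous one, which is what allows intermediate parts to be reused.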
90s (LeCun's CNNs): adds Convolutional Neural Net, Recurrent Neural Net, Sparse Coding, GMM.
00s (SVMs): adds SVM, Boosting, Restricted BM.
2006 (Hinton's DBN): adds BayesNP, Deep Belief Net.
2009 (ASR, data + GPU): adds Deep (sparse/denoising) Autoencoder.
2012 (CNNs, data + GPU): no new methods on the map.
Full map: Perceptron, Neural Net, Boosting, SVM, GMM, BayesNP, Convolutional Neural Net, Recurrent Neural Net, Autoencoder Neural Net, Sparse Coding, Restricted BM, Deep Belief Net, Deep (sparse/denoising) Autoencoder.

[Figure: TIME axis with Convolutional Neural Net 1988, Convolutional Neural Net 1998, Convolutional Neural Net 2012]
Q.: Did we make any progress since then?
A.: The main reason for the breakthrough is: data and GPU, but we have also made networks deeper and more non-linear.

ConvNets: History
- Fukushima 1980: designed network with same basic structure but did not train by backpropagation.
- LeCun from late 80s: figured out backpropagation for CNN, popularized and deployed CNN for OCR applications and others.
- Poggio from 1999: same basic structure but learning is restricted to top layer (k-means at second stage)
- LeCun from 2006: unsupervised feature learning
- DiCarlo from 2008: large scale experiments, normalization layer
- LeCun from 2009: harsher non-linearities, normalization layer, learning unsupervised and supervised.
- Mallat from 2011: provides a theory behind the architecture
- Hinton 2012: use bigger nets, GPUs, more data
LeCun et al. "Gradient-based learning applied to document recognition" IEEE 1998

ConvNets: till 2012
[Plot: loss vs. parameter]
Common wisdom: training does not work because we "get stuck in local minima"

ConvNets: today
[Plot: loss vs. parameter]
Local minima are all similar, there are long plateaus, it can take a long time to break symmetries.
[Figure labels: input/output invariant to permutations; breaking ties between parameters; saturating units]
Like walking on a ridge between valleys.
Optimization is not the real problem when:
- dataset is large
- units do not saturate too much
- normalization layer
Today's belief is that the challenge is about:
- generalization
  How many training samples to fit 1B parameters?
  How many parameters/samples to model spaces with 1M dim.?
- scalability

[Figure: the map of methods arranged along two axes, SHALLOW/DEEP and UNSUPERVISED/SUPERVISED.]

Deep Learning is a very rich family!
I am going to focus on a
few methods.

[Figure: the map of methods along the UNSUPERVISED/SUPERVISED and SHALLOW/DEEP axes, repeated, with CNN and DBN abbreviated.]

Deep Gated MRF
Ranzato et al. "Modeling natural images with gated MRFs" PAMI 2013

Layer 1, built up in steps (pixels $x_p$, $x_q$):
- pair-wise MRF: $E(x, h^c, h^m) = \frac{1}{2} x^\top \Sigma^{-1} x$
- pair-wise MRF: $E(x, h^c, h^m) = \frac{1}{2} x^\top C C^\top x$
- gated MRF: $E(x, h^c, h^m) = \frac{1}{2} x^\top C \,\mathrm{diag}(h^c)\, C^\top x$
- gated MRF: $E(x, h^c, h^m) = \frac{1}{2} x^\top C \,\mathrm{diag}(h^c)\, C^\top x + \frac{1}{2} x^\top x - x^\top W h^m$
$p(x, h^c, h^m) \propto e^{-E(x, h^c, h^m)}$
[Figure labels: $x_p$, $x_q$, $h^c_k$, $h^m_j$, $C$, $W$, $F$, $M$, $N$]

Inference of latent variables: just a forward pass.
Training: requires approximations (here we used MCMC methods).

[Figure: stacked model, input -> Layer 1 -> h2 -> Layer 2 -> h3 -> Layer 3]

Sampling High-Resolution Images
[Figure panels: Gaussian model; marginal wavelet (from Simoncelli 2005); Pair-wise MRF; FoE (from Schmidt, Gao, Roth CVPR 2010); gMRF: 1 layer (Ranzato et al. PAMI 2013); gMRF: 3 layers (Ranzato et al. PAMI 2013)]

Sampling After Training on Face Images
[Figure panels: Original; Input; 1st layer; 2nd layer; 3rd layer; 4th layer; 10 times. Rows: unconstrained samples; conditional (on the left part of the face) samples.]
Ranzato et al. PAMI 2013

Expression Recognition Under Occlusion
[Figure]
Ranzato et al. PAMI 2013

Pros
- Feature extraction is fast
- Unprecedented generation quality
- Advances models of natural images
- Trains without labeled data
Cons
- Training is inefficient
  - Slow
  - Tricky
- Sampling scales badly with dimensionality
- What's the use case of generative models?
Conclusion
- If generation is not required, other feature learning methods are more efficient (e.g., sparse auto-encoders).
- What's the use case of generative models?

[Figure: the map of methods, repeated.]

CONV NETS: TYPICAL ARCHITECTURE
One stage (zoom): Convol. -> LCN -> Pooling
Whole system: Input Image -> 1st stage -> 2nd stage -> 3rd stage -> Fully Conn. Layers -> Class Labels
Conceptually similar to:
- SIFT -> K-Means -> Pyramid Pooling -> SVM
  Lazebnik et al. "...Spatial Pyramid Matching..." CVPR 2006
- SIFT -> Fisher Vect. -> Pooling -> SVM
  Sanchez et al. "Image classification with F.V.: Theory and practice" IJCV 2012

CONV NETS: EXAMPLES
- OCR / House
number & Traffic sign classification
  Ciresan et al. "MCDNN for image classification" CVPR 2012
  Wan et al. "Regularization of neural networks using dropconnect" ICML 2013
- Texture classification
  Sifre et al. "Rotation, scaling and deformation invariant scattering..." CVPR 2013
- Pedestrian detection
  Sermanet et al. "Pedestrian detection with unsupervised multi-stage..." CVPR 2013
- Scene Parsing
  Farabet et al. "Learning hierarchical features for scene labeling" PAMI 2013
- Segmentation 3D volumetric images
  Ciresan et al. "DNN segment neuronal membranes..." NIPS 2012
  Turaga et al. "Maximin learning of image segmentation" NIPS 2009
- Action recognition from videos
  Taylor et al. "Convolutional learning of spatio-temporal features" ECCV 2010
- Robotics
  Sermanet et al. "Mapping and planning ...with long range perception" IROS 2008
- Denoising (original / noised / denoised)
  Burger et al. "Can plain NNs compete with BM3D?" CVPR 2012
- Dimensionality reduction / learning embeddings
  Hadsell et al. "Dimensionality reduction by learning an invariant mapping" CVPR 2006
- Image classification (Object Recognizer -> "railcar")
  Krizhevsky et al. "ImageNet Classification with deep CNNs" NIPS 2012
- Deployed in commercial systems (Google & Baidu, spring 2013)

How To Use ConvNets... (properly)

CHOOSING THE ARCHITECTURE
- Task dependent
- Cross-validation
- [Convolution -> LCN -> pooling]* + fully connected layer
- The more data: the more layers and the more kernels
  - Look at the number of parameters at each layer
  - Look at the number of flops at each layer
- Computational cost
- Be creative :)

HOW TO OPTIMIZE
- SGD (with momentum) usually works very well
- Pick learning rate by running on a subset of the data
  Bottou "Stochastic Gradient Tricks" Neural Networks 2012
  - Start with large learning rate and divide by 2 until loss does not diverge
  - Decay learning rate by a factor of 100 or more by the end of training
- Use non-linearity
- Initialize parameters so that each feature across layers has similar variance. Avoid units in saturation.

HOW TO IMPROVE GENERALIZATION
- Weight sharing (greatly reduce the number of parameters)
- Data augmentation (e.g., jittering, noise injection, etc.)
- Dropout
  Hinton et al. "Improving NNs by preventing co-adaptation of feature detectors" arXiv 2012
- Weight decay (L2, L1)
- Sparsity in the hidden units
- Multi-task (unsupervised learning)

OTHER THINGS GOOD TO KNOW
- Check gradients numerically by finite differences
- Visualize features (feature maps need to be uncorrelated and have high variance)
  [Figure: matrix of samples vs. hidden units]
  Good training: hidden units are sparse across samples and across features.
  Bad training: many hidden units ignore the input and/or exhibit strong correlations.
- Visualize parameters
  Good training: learned filters exhibit structure and are uncorrelated.
  [Figure panels: GOOD; BAD (too noisy); BAD (too correlated); BAD (lack structure)]
- Measure error on both training and validation set.
- Test on a small subset of the data and check that the error goes to 0.

WHAT IF IT DOES NOT WORK?
- Training diverges:
  - Learning rate may be too large -> decrease learning rate
  - BPROP is buggy -> numerical gradient checking
- Parameters collapse / loss is minimized but accuracy is low
  - Check loss function:
    Is it appropriate for the task you want to solve?
    Does it have degenerate solutions?
- Network is underperforming
  - Compute flops and nr. params. -> if too small, make net larger
  - Visualize hidden units/params -> fix optimization
- Network is too slow
  - Compute flops and nr. params. -> GPU, distrib. framework, make net smaller

SUMMARY
- Deep Learning = Learning Hierarchical representations. Leverage compositionality to gain efficiency.
- Unsupervised learning: active research topic.
- Supervised learning: most