A fast learning algorithm for deep belief nets*

Geoffrey E. Hinton and Simon Osindero
Department of Computer Science, University of Toronto
10 Kings College Road, Toronto, Canada M5S 3G4
{hinton, osindero}@cs.toronto.edu

Yee-Whye Teh
Department of Computer Science, National University of Singapore
3 Science Drive 3, Singapore, 117543
tehyw@comp.nus.edu.sg

* To appear in Neural Computation 2006.

Abstract

We show how to use “complementary priors” to eliminate the explaining-away effects that make inference difficult in densely connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modelled by long ravines in the free-energy landscape of the top-level associative memory, and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.

1 Introduction

Learning is difficult in densely connected, directed belief nets that have many hidden layers because it is difficult to infer the conditional distribution of the hidden activities when given a data vector. Variational methods use simple approximations to the true conditional distribution, but the approximations may be poor, especially at the deepest hidden layer, where the prior assumes independence. Also, variational learning still requires all of the parameters to be learned together and makes the learning time scale poorly as the number of parameters increases.

We describe a model in which the top two hidden layers form an undirected associative memory (see figure 1) and the remaining hidden layers form a directed acyclic graph that converts the representations in the associative memory into observable variables such as the pixels of an image. This hybrid model has some attractive features:

1. There is a fast, greedy learning algorithm that can find a fairly good set of parameters quickly, even in deep networks with millions of parameters and many hidden layers.
2. The learning algorithm is unsupervised but can be applied to labeled data by learning a model that generates both the label and the data.
3. There is a fine-tuning algorithm that learns an excellent generative model which outperforms discriminative methods on the MNIST database of hand-written digits.
4. The generative model makes it easy to interpret the distributed representations in the deep hidden layers.
5. The inference required for forming a percept is both fast and accurate.
6. The learning algorithm is local: adjustments to a synapse strength depend only on the states of the pre-synaptic and post-synaptic neuron.
7. The communication is simple: neurons only need to communicate their stochastic binary states.

Section 2 introduces the idea of a “complementary” prior which exactly cancels the “explaining away” phenomenon that makes inference difficult in directed models.
An example of a directed belief network with complementary priors is presented. Section 3 shows the equivalence between restricted Boltzmann machines and infinite directed networks with tied weights. Section 4 introduces a fast, greedy learning algorithm for constructing multi-layer directed networks one layer at a time. Using a variational bound, it shows that as each new layer is added, the overall generative model improves. The greedy algorithm bears some resemblance to boosting in its repeated use of the same “weak” learner, but instead of re-weighting each data-vector to ensure that the next step learns something new, it re-represents it. The “weak” learner that is used to construct deep directed nets is itself an undirected graphical model.

Figure 1: The network used to model the joint distribution of digit images and digit labels. [The figure shows a stack labelled, from top to bottom: 2000 top-level units; 500 units; 500 units; 28 x 28 pixel image; with 10 label units annotated “This could be the top level of another sensory pathway”.] In this paper, each training case consists of an image and an explicit class label, but work in progress has shown that the same learning algorithm can be used if the “labels” are replaced by a multilayer pathway whose inputs are spectrograms from multiple different speakers saying isolated digits. The network then learns to generate pairs that consist of an image and a spectrogram of the same digit class.

Section 5 shows how the weights produced by the fast greedy algorithm can be fine-tuned using the “up-down” algorithm. This is a contrastive version of the wake-sleep algorithm (Hinton et al., 1995) that does not suffer from the “mode-averaging” problems that can cause the wake-sleep algorithm to learn poor recognition weights.

Section 6 shows the pattern recognition performance of a network with three hidden layers and about 1.7 million weights on the MNIST set of handwritten digits. When no knowledge of geometry is provided and there is no special preprocessing, the generalization performance of the network is 1.25% errors on the 10,000-digit official test set. This beats the 1.5% achieved by the best back-propagation nets when they are not hand-crafted for this particular application. It is also slightly better than the 1.4% errors reported by Decoste and Schoelkopf (2002) for support vector machines on the same task.

Finally, section 7 shows what happens in the mind of the network when it is running without being constrained by visual input. The network has a full generative model, so it is easy to look into its mind: we simply generate an image from its high-level representations.

Figure 2: A simple logistic belief net containing two independent, rare causes that become highly anti-correlated when we observe the house jumping. The bias of −10 on the earthquake node means that, in the absence of any observation, this node is e^10 times more likely to be off than on. If the earthquake node is on and the truck node is off, the jump node has a total input of 0, which means that it has an even chance of being on. This is a much better explanation of the observation that the house jumped than the odds of e^−20 which apply if neither of the hidden causes is active. But it is wasteful to turn on both hidden causes to explain the observation because the probability of them both happening is e^−10 × e^−10 = e^−20. When the earthquake node is turned on it “explains away” the evidence for the truck node.
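The arithmetic in the caption of figure 2 can be checked directly with a short sketch. This is only an illustration: the −10 biases on the two causes are stated in the caption, but the −20 bias on the jump node and the +20 weights from each cause are assumptions inferred from the caption's numbers, and the logistic rule used is Eq. 1 of section 2.

```python
import numpy as np

# Biases and weights for the net in figure 2. The -10 biases on the two
# causes are from the caption; the -20 bias on the jump node and the +20
# weights are assumptions inferred from the caption's arithmetic
# (total input 0 with one cause on, odds of e^-20 with both causes off).
b_truck, b_quake, b_jump = -10.0, -10.0, -20.0
w_truck, w_quake = 20.0, 20.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_joint(truck, quake, jump=1):
    """log p(truck, quake, jump) under the logistic rule of Eq. 1."""
    lp = 0.0
    for s, b in ((truck, b_truck), (quake, b_quake)):
        p_on = sigmoid(b)
        lp += np.log(p_on if s else 1.0 - p_on)
    p_jump = sigmoid(b_jump + truck * w_truck + quake * w_quake)
    lp += np.log(p_jump if jump else 1.0 - p_jump)
    return lp

# Posterior over the two hidden causes given that the house jumped.
configs = [(t, q) for t in (0, 1) for q in (0, 1)]
log_p = np.array([log_joint(t, q) for t, q in configs])
posterior = np.exp(log_p - log_p.max())
posterior /= posterior.sum()
for (t, q), p in zip(configs, posterior):
    print(f"truck={t}, earthquake={q}: posterior = {p:.5f}")
# The two single-cause explanations get essentially all of the mass, and
# the both-causes configuration is orders of magnitude less probable, so
# the causes become strongly anti-correlated in the posterior: each one
# "explains away" the other.
```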
Throughout the paper, we will consider nets composed of stochastic binary variables, but the ideas can be generalized to other models in which the log probability of a variable is an additive function of the states of its directly-connected neighbours (see Appendix A for details).

2 Complementary priors

The phenomenon of explaining away (illustrated in figure 2) makes inference difficult in directed belief nets. In densely connected networks, the posterior distribution over the hidden variables is intractable except in a few special cases such as mixture models or linear models with additive Gaussian noise. Markov Chain Monte Carlo methods (Neal, 1992) can be used to sample from the posterior, but they are typically very time consuming. Variational methods (Neal and Hinton, 1998) approximate the true posterior with a more tractable distribution and they can be used to improve a lower bound on the log probability of the training data. It is comforting that learning is guaranteed to improve a variational bound even when the inference of the hidden states is done incorrectly, but it would be much better to find a way of eliminating explaining away altogether, even in models whose hidden variables have highly correlated effects on the visible variables. It is widely assumed that this is impossible.

A logistic belief net (Neal, 1992) is composed of stochastic binary units. When the net is used to generate data, the probability of turning on unit i is a logistic function of the states of its immediate ancestors, j, and of the weights, w_ij, on the directed connections from the ancestors:

p(s_i = 1) = 1 / (1 + exp(−b_i − Σ_j s_j w_ij))    (1)

where b_i is the bias of unit i. If a logistic belief net only has one hidden layer, the prior distribution over the hidden variables is factorial because their binary states are chosen independently when the model is used to generate data. The non-independence in the posterior distribution is created by the likelihood term coming from the data. Perhaps we could eliminate explaining away in the first hidden layer by using extra hidden layers to create a “complementary” prior that has exactly the opposite correlations to those in the likelihood term. Then, when the likelihood term is multiplied by the prior, we will get a posterior that is exactly factorial. It is not at all obvious that complementary priors exist, but figure 3 shows a simple example of an infinite logistic belief net with tied weights in which the priors are complementary at every hidden layer (see Appendix A for a more general treatment of the conditions under which complementary priors exist). The use of tied weights to construct complementary priors may seem like a mere trick for making directed models equivalent to undirected ones. As we shall see, however, it leads to a novel and very efficient learning algorithm that works by progressively untying the weights in each layer from the weights in higher layers.

2.1 An infinite directed model with tied weights

We can generate data from the infinite directed net in figure 3 by starting with a random configuration at an infinitely deep hidden layer [1] and then performing a top-down “ancestral” pass in which the binary state of each variable in a layer is chosen from the Bernoulli distribution determined by the top-down input coming from its active parents in the layer above. In this respect, it is just like any other directed acyclic belief net.
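To make Eq. 1 and the top-down ancestral pass concrete, here is a minimal Python/NumPy sketch of generating a sample layer by layer. The layer sizes, the randomly initialized weights, and the function name are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_layer_below(states_above, W, b):
    """One step of the top-down ancestral pass in a logistic belief net.

    Each unit in the layer below is turned on with the probability given by
    Eq. 1: a logistic function of its bias plus the weighted input from its
    active parents in the layer above.
    """
    p_on = 1.0 / (1.0 + np.exp(-(b + states_above @ W)))
    return (rng.random(p_on.shape) < p_on).astype(np.int8)

# A hypothetical two-hidden-layer net; sizes and weights are illustrative only.
n_top, n_mid, n_vis = 20, 30, 50
W_top_mid = rng.normal(0.0, 0.1, (n_top, n_mid)); b_mid = np.zeros(n_mid)
W_mid_vis = rng.normal(0.0, 0.1, (n_mid, n_vis)); b_vis = np.zeros(n_vis)

top = rng.integers(0, 2, n_top)                    # random top-level configuration
mid = sample_layer_below(top, W_top_mid, b_mid)    # ancestral pass, one layer down
vis = sample_layer_below(mid, W_mid_vis, b_vis)    # generated "visible" vector
```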
Unlike other directed nets, however, we can sample from the true posterior distribution over all of the hidden layers by starting with a data vector on the visible units and then using the transposed weight matrices to infer the factorial distributions over each hidden layer in turn. At each hidden layer we sample from the factorial posterior before computing the factorial posterior for the layer above [2]. Appendix A shows that this procedure gives unbiased samples because the complementary prior at each layer ensures that the posterior distribution really is factorial.

[1] The generation process converges to the stationary distribution of the Markov Chain, so we need to start at a layer that is deep compared with the time it takes for the chain to reach equilibrium.
[2] This is exactly the same as the inference procedure used in the wake-sleep algorithm (Hinton et al., 1995), but for the models described in this paper no variational approximation is required because the inference procedure gives unbiased samples.

Since we can sample from the true posterior, we can compute the derivatives of the log probability of the data. Let us start by computing the derivative for a generative weight, w^00_ij, from a unit j in layer H0 to unit i in layer V0 (see figure 3). In a logistic belief net, the maximum likelihood learning rule for a single data-vector, v^0, is:

∂ log p(v^0) / ∂w^00_ij = <h^0_j (v^0_i − v̂^0_i)>    (2)

where <·> denotes an average over the sampled states and v̂^0_i is the probability that unit i would be turned on if the visible vector was stochastically reconstructed from the sampled hidden states. Computing the posterior distribution over the second hidden layer, V1, from the sampled binary states in the first hidden layer, H0, is exactly the same process as reconstructing the data, so v^1_i is a sample from a Bernoulli random variable with probability v̂^0_i. The learning rule can therefore be written as:

∂ log p(v^0) / ∂w^00_ij = <h^0_j (v^0_i − v^1_i)>    (3)

The dependence of v^1_i on h^0_j is unproblematic in the derivation of Eq. 3 from Eq. 2 because v̂^0_i is an expectation that is conditional on h^0_j. Since the weights are replicated, the full derivative for a generative weight is obtained by summing the derivatives of the generative weights between all pairs of layers:

∂ log p(v^0) / ∂w_ij = <h^0_j (v^0_i − v^1_i)> + <v^1_i (h^0_j − h^1_j)> + <h^1_j (v^1_i − v^2_i)> + ...    (4)

All of the vertically aligned terms cancel, leaving the Boltzmann machine learning rule of Eq. 5.

Figure 3: An infinite logistic belief net with tied weights. The downward arrows represent the generative model. The upward arrows are not part of the model. They represent the parameters that are used to infer samples from the posterior distribution at each hidden layer of the net when a datavector is clamped on V0. [The figure shows layers V0, H0, V1, H1, V2, etc., with generative weights W between successive layers and inference weights W^T pointing upward; typical units are labelled v^0_i, h^0_j, v^1_i, h^1_j, v^2_i.]

3 Restricted Boltzmann machines and contrastive divergence learning

It may not be immediately obvious that the infinite directed net in figure 3 is equivalent to a Restricted Boltzmann Machine (RBM). An RBM has a single layer of hidden units which are not connected to each other and have undirected, symmetrical connections to a layer of visible units. To generate data from an RBM, we can start with a random state in one of the layers and then perform alternating Gibbs sampling: all of the units in one layer are updated in parallel given the current states of the units in the other layer, and this is repeated until the system is sampling from its equilibrium distribution. Notice that this is exactly the same process as generating data from the infinite belief net with tied weights. To perform maximum likelihood learning in an RBM, we can use the difference between two correlations. For each weight, w_ij, between a visible unit i and a hidden unit j, we measure the correlation <v^0_i h^0_j> when a datavector is clamped on the visible units and the hidden states are sampled from their conditional distribution, which is factorial. Then, using alternating Gibbs sampling, we run the Markov chain shown in figure 4 until it reaches its stationary distribution and measure the correlation <v^∞_i h^∞_j>. The gradient of the log probability of the training data is then

∂ log p(v^0) / ∂w_ij = <v^0_i h^0_j> − <v^∞_i h^∞_j>    (5)

This learning rule is the same as the maximum likelihood learning rule for the infinite logistic belief net with tied weights, and each step of Gibbs sampling corresponds to computing the exact posterior distribution in a layer of the infinite logistic belief net.

Figure 4: This depicts a Markov chain that uses alternating Gibbs sampling. In one full step of Gibbs sampling, the hidden units in the top layer are all updated in parallel by applying Eq. 1 to the inputs received from the current states of the visible units in the bottom layer, then the visible units are all updated in parallel given the current hidden states. The chain is initialized by setting the binary states of the visible units to be the same as a data-vector. The correlations in the activities of a visible and a hidden unit are measured after the first update of the hidden units and again at the end of the chain. The difference of these two correlations provides the learning signal for updating the weight on the connection. [The figure shows the chain unrolled over t = 0, 1, 2, ..., infinity, with <v^0_i h^0_j> measured at t = 0 and <v^∞_i h^∞_j> at t = infinity.]

Maximizing the log probability of the data is exactly the same as minimizing the Kullback-Leibler divergence, KL(P^0 || P^∞_θ), between the distribution of the data, P^0, and the equilibrium distribution defined by the model, P^∞_θ. In contrastive divergence learning (Hinton, 2002), we only run the Markov chain for n full steps [3] before measuring the second correlation. This is equivalent to ignoring the derivatives that come from the higher layers of the infinite net. The sum of all these ignored derivatives is the derivative of the log probability of the posterior distribution in layer Vn, which is also the derivative of the Kullback-Leibler divergence between the posterior distribution in layer Vn, P^n_θ, and the equilibrium distribution defined by the model. So contrastive divergence learning minimizes the difference of two Kullback-Leibler divergences:

KL(P^0 || P^∞_θ) − KL(P^n_θ || P^∞_θ)    (6)

Ignoring sampling noise, this difference is never negative because Gibbs sampling is used to produce P^n_θ from P^0 and Gibbs sampling always reduces the Kullback-Leibler divergence with the equilibrium distribution. It is important to notice that P^n_θ depends on the current model parameters and the way in which P^n_θ changes as the parameters change is being ignored by contrastive divergence learning. This problem does not arise with P^0 because the training data does not depend on the parameters. An empirical investigation of the relationship between the maximum likelihood and the contrastive divergence learning rules can be found in Carreira-Perpinan and Hinton (2005).

[3] Each full step consists of updating h given v, then updating v given h.
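To make the learning rule of Eq. 5 and its CD-n approximation concrete, here is a minimal NumPy sketch of CD-1 for a single RBM with binary units. The mini-batch handling, learning rate, and the use of hidden probabilities rather than samples in the negative phase are practical assumptions, not specifics taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, lr=0.1):
    """One CD-1 update for a binary RBM.

    Approximates Eq. 5 by replacing the equilibrium correlation <v_inf h_inf>
    with the correlation measured after one full step of alternating Gibbs
    sampling (the n = 1 case of CD-n).
    """
    # Positive phase: clamp the data and sample the hidden states.
    p_h0 = sigmoid(b_hid + v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

    # One full Gibbs step: reconstruct the visible units, then recompute
    # the hidden probabilities (probabilities are used here in the
    # negative phase, a common practical choice to reduce noise).
    p_v1 = sigmoid(b_vis + h0 @ W.T)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(b_hid + v1 @ W)

    # Contrastive divergence learning signal, averaged over the mini-batch.
    n = v0.shape[0]
    W += lr * (v0.T @ h0 - v1.T @ p_h1) / n
    b_vis += lr * (v0 - v1).mean(axis=0)
    b_hid += lr * (h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid

# Illustrative usage on random binary "data"; sizes are assumptions.
n_vis, n_hid = 784, 500
W = rng.normal(0.0, 0.01, (n_vis, n_hid))
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
batch = (rng.random((32, n_vis)) < 0.1).astype(float)
W, b_vis, b_hid = cd1_update(batch, W, b_vis, b_hid)
```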
Contrastive divergence learning in a restricted Boltzmann machine is efficient enough to be practical (Mayraz and Hinton, 2001). Variations that use real-valued units and different sampling schemes are described in Teh et al. (2003) and have been quite successful for modeling the formation of topographic maps (Welling et al., 2003), for denoising natural images (Roth and Black, 2005) or images of biological cells (Ning et al., 2005). Marks and Movellan (2001) describe a way of using contrastive divergence to perform factor analysis and Welling et al. (2005) show that a network with logistic, binary visible units and linear, Gaussian hidden units can be used for rapid document retrieval. However, it appears that the efficiency has been bought at a high price: when applied in the obvious way, contrastive divergence learning fails for deep, multilayer networks with different weights at each layer because these networks take far too long even to reach conditional equilibrium with a clamped data-vector. We now show that the equivalence between RBM's and infinite directed nets with tied weights suggests an efficient learning algorithm for multilayer networks in which the weights are not tied.

4 A greedy learning algorithm for transforming representations

An efficient way to learn a complicated model is to combine a set of simpler models that are learned sequentially. To force each model in the sequence to learn something different from the previous models, the data is modified in some way after each model has been learned. In boosting (Freund, 1995), each model in the sequence is trained on re-weighted data that emphasizes the cases that the preceding models got wrong. In one version of principal components analysis, the variance in a modeled direction is removed, thus forcing the next modeled direction to lie in the orthogonal subspace (Sanger, 1989). In projection pursuit (Friedman and Stuetzle, 1981), the data is transformed by nonlinearly distorting one direction in the data-space to remove all non-Gaussianity in that direction. The idea behind our greedy algorithm is to allow each model in the sequence to receive a different representation of the data. The model performs a non-linear transformation on its input vectors and produces as output the vectors that will be used as input for the next model in the sequence.
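The re-representation idea above can be sketched as repeatedly training an RBM with CD-1 and then passing the hidden probabilities on as the "data" for the next model. This is only a sketch under assumed sizes and hyperparameters; the helper names are hypothetical, and it omits the weight-untying and fine-tuning that the paper goes on to describe.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hid, n_epochs=10, lr=0.1):
    """Train a single binary RBM with CD-1 and return its weights and hidden biases."""
    n_vis = data.shape[1]
    W = rng.normal(0.0, 0.01, (n_vis, n_hid))
    b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
    for _ in range(n_epochs):
        p_h0 = sigmoid(b_hid + data @ W)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        p_v1 = sigmoid(b_vis + h0 @ W.T)          # reconstruction
        p_h1 = sigmoid(b_hid + p_v1 @ W)
        n = data.shape[0]
        W += lr * (data.T @ h0 - p_v1.T @ p_h1) / n
        b_vis += lr * (data - p_v1).mean(axis=0)
        b_hid += lr * (h0 - p_h1).mean(axis=0)
    return W, b_hid

def greedy_layerwise(data, layer_sizes):
    """Greedily stack RBMs: each new model is trained on the re-representation
    of the data produced by the hidden units of the model below it."""
    weights, rep = [], data
    for n_hid in layer_sizes:
        W, b_hid = train_rbm(rep, n_hid)
        weights.append((W, b_hid))
        rep = sigmoid(b_hid + rep @ W)   # input vectors for the next model
    return weights

# Illustrative run on random binary "images"; the sizes loosely echo figure 1
# but are assumptions, not a prescription.
fake_data = (rng.random((64, 784)) < 0.1).astype(float)
stack = greedy_layerwise(fake_data, [500, 500, 2000])
```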