A fast learning algorithm for deep belief nets ∗
Geoffrey E. Hinton and Simon Osindero
Department of Computer Science University of Toronto
10 King's College Road
Toronto, Canada M5S 3G4
{hinton, osindero}@cs.toronto.edu
Yee-Whye Teh
Department of Computer Science
National University of Singapore
3 Science Drive 3, Singapore, 117543
tehyw@comp.nus.edu.sg

∗To appear in Neural Computation, 2006
Abstract
We show how to use “complementary priors” to
eliminate the explaining away effects that make
inference difficult in densely-connected belief
nets that have many hidden layers. Using com-
plementary priors, we derive a fast, greedy algo-
rithm that can learn deep, directed belief networks
one layer at a time, provided the top two lay-
ers form an undirected associative memory. The
fast, greedy algorithm is used to initialize a slower
learning procedure that fine-tunes the weights us-
ing a contrastive version of the wake-sleep algo-
rithm. After fine-tuning, a network with three
hidden layers forms a very good generative model
of the joint distribution of handwritten digit im-
ages and their labels. This generative model gives
better digit classification than the best discrimi-
native learning algorithms. The low-dimensional
manifolds on which the digits lie are modelled by
long ravines in the free-energy landscape of the
top-level associative memory and it is easy to ex-
plore these ravines by using the directed connec-
tions to display what the associative memory has
in mind.
1 Introduction
Learning is difficult in densely-connected, directed belief nets
that have many hidden layers because it is difficult to infer the
conditional distribution of the hidden activities when given a
data vector. Variational methods use simple approximations
to the true conditional distribution, but the approximations
may be poor, especially at the deepest hidden layer where
the prior assumes independence. Also, variational learning
still requires all of the parameters to be learned together and
makes the learning time scale poorly as the number of param-
eters increases.
We describe a model in which the top two hidden layers
form an undirected associative memory (see figure 1) and the
remaining hidden layers form a directed acyclic graph that
converts the representations in the associative memory into
observable variables such as the pixels of an image. This hy-
brid model has some attractive features:
1. There is a fast, greedy learning algorithm that can find
a fairly good set of parameters quickly, even in deep
networks with millions of parameters and many hidden
layers.
2. The learning algorithm is unsupervised but can be ap-
plied to labeled data by learning a model that generates
both the label and the data.
3. There is a fine-tuning algorithm that learns an excel-
lent generative model which outperforms discrimina-
tive methods on the MNIST database of hand-written
digits.
4. The generative model makes it easy to interpret the dis-
tributed representations in the deep hidden layers.
5. The inference required for forming a percept is both fast
and accurate.
6. The learning algorithm is local: adjustments to a
synapse strength depend only on the states of the pre-
synaptic and post-synaptic neuron.
7. The communication is simple: neurons only need to
communicate their stochastic binary states.
Section 2 introduces the idea of a “complementary” prior
which exactly cancels the “explaining away” phenomenon
that makes inference difficult in directed models. An exam-
ple of a directed belief network with complementary priors
is presented. Section 3 shows the equivalence between re-
stricted Boltzmann machines and infinite directed networks
with tied weights.
Section 4 introduces a fast, greedy learning algorithm
for constructing multi-layer directed networks one layer at
a time. Using a variational bound it shows that as each new
layer is added, the overall generative model improves. The
greedy algorithm bears some resemblance to boosting in its
repeated use of the same “weak” learner, but instead of re-
weighting each data-vector to ensure that the next step learns
something new, it re-represents it. The “weak” learner that
is used to construct deep directed nets is itself an undirected graphical model.

Figure 1: The network used to model the joint distribution of digit images and digit labels. A 28 × 28 pixel image feeds into a hidden layer of 500 units, then a second hidden layer of 500 units, whose states, together with 10 label units, provide input to 2000 top-level units; the label units could equally be the top level of another sensory pathway. In this paper, each training case consists of an image and an explicit class label, but work in progress has shown that the same learning algorithm can be used if the “labels” are replaced by a multilayer pathway whose inputs are spectrograms from multiple different speakers saying isolated digits. The network then learns to generate pairs that consist of an image and a spectrogram of the same digit class.
Section 5 shows how the weights produced by the fast
greedy algorithm can be fine-tuned using the “up-down” al-
gorithm. This is a contrastive version of the wake-sleep algorithm of Hinton et al. (1995) that does not suffer from the
“mode-averaging” problems that can cause the wake-sleep al-
gorithm to learn poor recognition weights.
Section 6 shows the pattern recognition performance of
a network with three hidden layers and about 1.7 million
weights on the MNIST set of handwritten digits. When no
knowledge of geometry is provided and there is no special
preprocessing, the generalization performance of the network
is 1.25% errors on the 10,000 digit official test set. This beats
the 1.5% achieved by the best back-propagation nets when
they are not hand-crafted for this particular application. It is
also slightly better than the 1.4% errors reported by Decoste
and Schoelkopf (2002) for support vector machines on the
same task.
Finally, section 7 shows what happens in the mind of the
network when it is running without being constrained by vi-
sual input. The network has a full generative model, so it is
easy to look into its mind – we simply generate an image from
its high-level representations.
Throughout the paper, we will consider nets composed of stochastic binary variables, but the ideas can be generalized to other models in which the log probability of a variable is an additive function of the states of its directly-connected neighbours (see Appendix A for details).

Figure 2: A simple logistic belief net containing two independent, rare causes that become highly anti-correlated when we observe the house jumping. The bias of $-10$ on the earthquake node means that, in the absence of any observation, this node is $e^{10}$ times more likely to be off than on. If the earthquake node is on and the truck node is off, the jump node has a total input of $0$, which means that it has an even chance of being on. This is a much better explanation of the observation that the house jumped than the odds of $e^{-20}$ which apply if neither of the hidden causes is active. But it is wasteful to turn on both hidden causes to explain the observation, because the probability of them both happening is $e^{-10} \times e^{-10} = e^{-20}$. When the earthquake node is turned on it “explains away” the evidence for the truck node.
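As a concrete check of the arithmetic in this example, here is a short Python sketch (a hypothetical illustration, not from the paper) that enumerates the four configurations of the two causes given the observation. It assumes, as the caption implies, biases of $-10$ on each cause, a bias of $-20$ on the jump node, and weights of $+20$ on both connections:

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

b_cause = -10.0   # bias on each of the two rare causes (from the caption)
b_jump = -20.0    # bias on the jump node (implied by the caption's e^-20 odds)
w = 20.0          # weight from each cause to the jump node (implied)

# Enumerate the four configurations of (truck, earthquake) given jump = 1.
posterior = {}
for truck, quake in itertools.product([0, 1], repeat=2):
    n_on = truck + quake
    prior = sigmoid(b_cause) ** n_on * (1 - sigmoid(b_cause)) ** (2 - n_on)
    likelihood = sigmoid(b_jump + w * n_on)   # p(jump = 1 | truck, quake)
    posterior[(truck, quake)] = prior * likelihood

z = sum(posterior.values())
for config in sorted(posterior):
    print(config, posterior[config] / z)
# The two single-cause configurations share almost all the posterior mass,
# and (1, 1) is roughly e^10 times less probable than either of them: the
# causes have become anti-correlated, which is explaining away.
```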
2 Complementary priors
The phenomenon of explaining away (illustrated in figure 2)
makes inference difficult in directed belief nets. In densely
connected networks, the posterior distribution over the hid-
den variables is intractable except in a few special cases such
as mixture models or linear models with additive Gaussian
noise. Markov Chain Monte Carlo methods (Neal, 1992) can
be used to sample from the posterior, but they are typically
very time consuming. Variational methods (Neal and Hinton,
1998) approximate the true posterior with a more tractable
distribution and they can be used to improve a lower bound on
the log probability of the training data. It is comforting that
learning is guaranteed to improve a variational bound even
when the inference of the hidden states is done incorrectly,
but it would be much better to find a way of eliminating ex-
plaining away altogether, even in models whose hidden vari-
ables have highly correlated effects on the visible variables.
It is widely assumed that this is impossible.
A logistic belief net (Neal, 1992) is composed of stochas-
tic binary units. When the net is used to generate data, the
probability of turning on unit $i$ is a logistic function of the states of its immediate ancestors, $j$, and of the weights, $w_{ij}$, on the directed connections from the ancestors:

$$p(s_i = 1) = \frac{1}{1 + \exp\left(-b_i - \sum_j s_j w_{ij}\right)} \qquad (1)$$

where $b_i$ is the bias of unit $i$. If a logistic belief net only
has one hidden layer, the prior distribution over the hidden
variables is factorial because their binary states are chosen
independently when the model is used to generate data. The
non-independence in the posterior distribution is created by
the likelihood term coming from the data. Perhaps we could
eliminate explaining away in the first hidden layer by using
extra hidden layers to create a “complementary” prior that
has exactly the opposite correlations to those in the likeli-
hood term. Then, when the likelihood term is multiplied by
the prior, we will get a posterior that is exactly factorial. It is
not at all obvious that complementary priors exist, but figure 3
shows a simple example of an infinite logistic belief net with
tied weights in which the priors are complementary at every
hidden layer (see Appendix A for a more general treatment of
the conditions under which complementary priors exist). The
use of tied weights to construct complementary priors may
seem like a mere trick for making directed models equiva-
lent to undirected ones. As we shall see, however, it leads
to a novel and very efficient learning algorithm that works
by progressively untying the weights in each layer from the
weights in higher layers.
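As an illustration of Eq. 1 in use, the following minimal NumPy sketch samples one layer of stochastic binary units given the layer above; the layer sizes and random weights are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_layer(parent_states, weights, biases):
    """Sample a layer of stochastic binary units from Eq. 1.

    weights[i, j] is the directed connection from parent j to unit i."""
    p_on = sigmoid(biases + weights @ parent_states)
    return (rng.random(p_on.shape) < p_on).astype(np.int8)

# One top-down ancestral step: 50 hidden causes generate 100 visible units.
h = rng.integers(0, 2, size=50)            # binary states of the layer above
W = rng.normal(0.0, 0.1, size=(100, 50))   # illustrative generative weights
b = np.zeros(100)                          # biases b_i
v = sample_layer(h, W, b)                  # stochastic binary states below
```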
2.1 An infinite directed model with tied weights
We can generate data from the infinite directed net in fig-
ure 3 by starting with a random configuration at an infinitely
deep hidden layer¹ and then performing a top-down “ances-
tral” pass in which the binary state of each variable in a layer
is chosen from the Bernoulli distribution determined by the
top-down input coming from its active parents in the layer
above. In this respect, it is just like any other directed acyclic
belief net. Unlike other directed nets, however, we can sam-
ple from the true posterior distribution over all of the hidden
layers by starting with a data vector on the visible units and
then using the transposed weight matrices to infer the fac-
torial distributions over each hidden layer in turn. At each
hidden layer we sample from the factorial posterior before
computing the factorial posterior for the layer above². Ap-
pendix A shows that this procedure gives unbiased samples
because the complementary prior at each layer ensures that
the posterior distribution really is factorial.
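The following sketch illustrates these two passes on a finite truncation of the infinite net, with illustrative layer sizes and random tied weights; biases are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(np.int8)

W = rng.normal(0.0, 0.1, size=(784, 500))  # the single tied weight matrix

def infer_up(v0, n_layers=3):
    """Sample each hidden layer in turn from its factorial posterior,
    using the transposed weights, starting from a data vector v0.
    An odd n_layers ends on an H-type layer of 500 units."""
    states = [v0]
    for k in range(n_layers):
        Wk = W.T if k % 2 == 0 else W      # alternate V -> H -> V -> H ...
        states.append(sample(sigmoid(Wk @ states[-1])))
    return states

def generate_down(h_top, n_layers=3):
    """Top-down ancestral pass with the tied generative weights,
    starting from a configuration h_top of a deep H-type layer."""
    states = [h_top]
    for k in range(n_layers):
        Wk = W if k % 2 == 0 else W.T      # alternate H -> V -> H -> V ...
        states.append(sample(sigmoid(Wk @ states[-1])))
    return states

v0 = sample(np.full(784, 0.5))             # stand-in for a binary data vector
hidden = infer_up(v0)                       # unbiased posterior samples
fantasy = generate_down(hidden[-1])         # a top-down generated chain
```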
Since we can sample from the true posterior, we can com-
pute the derivatives of the log probability of the data.

¹The generation process converges to the stationary distribution of the Markov chain, so we need to start at a layer that is deep compared with the time it takes for the chain to reach equilibrium.

²This is exactly the same as the inference procedure used in the wake-sleep algorithm (Hinton et al., 1995), but for the models described in this paper no variational approximation is required because the inference procedure gives unbiased samples.

Let us start by computing the derivative for a generative weight, $w_{ij}^{00}$, from a unit $j$ in layer $H_0$ to unit $i$ in layer $V_0$ (see figure 3). In a logistic belief net, the maximum likelihood learning rule for a single data-vector, $\mathbf{v}^0$, is:

$$\frac{\partial \log p(\mathbf{v}^0)}{\partial w_{ij}^{00}} = \left\langle h_j^0 \left(v_i^0 - \hat{v}_i^0\right) \right\rangle \qquad (2)$$
where $\langle\cdot\rangle$ denotes an average over the sampled states and $\hat{v}_i^0$ is the probability that unit $i$ would be turned on if the visible vector was stochastically reconstructed from the sampled hidden states. Computing the posterior distribution over the second hidden layer, $V_1$, from the sampled binary states in the first hidden layer, $H_0$, is exactly the same process as reconstructing the data, so $v_i^1$ is a sample from a Bernoulli random variable with probability $\hat{v}_i^0$. The learning rule can therefore be written as:

$$\frac{\partial \log p(\mathbf{v}^0)}{\partial w_{ij}^{00}} = \left\langle h_j^0 \left(v_i^0 - v_i^1\right) \right\rangle \qquad (3)$$

The dependence of $v_i^1$ on $h_j^0$ is unproblematic in the derivation of Eq. 3 from Eq. 2 because $\hat{v}_i^0$ is an expectation that is conditional on $h_j^0$. Since the weights are replicated, the full derivative for a generative weight is obtained by summing the derivatives of the generative weights between all pairs of layers:

$$\frac{\partial \log p(\mathbf{v}^0)}{\partial w_{ij}} = \left\langle h_j^0 \left(v_i^0 - v_i^1\right) \right\rangle + \left\langle v_i^1 \left(h_j^0 - h_j^1\right) \right\rangle + \left\langle h_j^1 \left(v_i^1 - v_i^2\right) \right\rangle + \cdots \qquad (4)$$
All of the vertically aligned terms cancel, leaving the Boltzmann machine learning rule of Eq. 5.
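To make the cancellation concrete: expanding each bracket in Eq. 4 gives $\langle h_j^0 v_i^0\rangle - \langle h_j^0 v_i^1\rangle + \langle v_i^1 h_j^0\rangle - \langle v_i^1 h_j^1\rangle + \cdots$, so every intermediate correlation appears once with each sign and only the two end terms survive. The hypothetical sketch below verifies this numerically for a small tied-weight chain (all sizes and weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

n_v, n_h, n_cases, n_pairs = 6, 4, 500, 3
W = rng.normal(0.0, 0.5, size=(n_v, n_h))   # tiny illustrative tied weights
v0 = sample(np.full((n_cases, n_v), 0.5))   # a batch standing in for data
h0 = sample(sigmoid(v0 @ W))                # posterior sample for layer H0

total = np.zeros_like(W)
v_k, h_k = v0, h0
for _ in range(n_pairs):
    v_next = sample(sigmoid(h_k @ W.T))     # sample V_{k+1} given H_k
    h_next = sample(sigmoid(v_next @ W))    # sample H_{k+1} given V_{k+1}
    total += (v_k - v_next).T @ h_k         # <h^k (v^k - v^{k+1})> term
    total += v_next.T @ (h_k - h_next)      # <v^{k+1}(h^k - h^{k+1})> term
    v_k, h_k = v_next, h_next

# All intermediate correlations cancel, leaving the two end terms of Eq. 5
# (here as unnormalized sums over the batch; dividing by n_cases would give
# the averages without affecting the identity).
direct = v0.T @ h0 - v_k.T @ h_k
assert np.allclose(total, direct)
```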
3 Restricted Boltzmann machines and
contrastive divergence learning
It may not be immediately obvious that the infinite directed
net in figure 3 is equivalent to a Restricted Boltzmann Ma-
chine (RBM). An RBM has a single layer of hidden units
which are not connected to each other and have undirected,
symmetrical connections to a layer of visible units. To gen-
erate data from an RBM, we can start with a random state
in one of the layers and then perform alternating Gibbs sam-
pling: All of the units in one layer are updated in parallel
given the current states of the units in the other layer and this
is repeated until the system is sampling from its equilibrium
distribution. Notice that this is exactly the same process as
generating data from the infinite belief net with tied weights.
To perform maximum likelihood learning in an RBM, we can
use the difference between two correlations. For each weight,
$w_{ij}$, between a visible unit $i$ and a hidden unit $j$, we measure the correlation $\langle v_i^0 h_j^0 \rangle$ when a datavector is clamped on the visible units and the hidden states are sampled from their conditional distribution, which is factorial. Then, using alternating Gibbs sampling, we run the Markov chain shown in figure 4 until it reaches its stationary distribution and measure the correlation $\langle v_i^\infty h_j^\infty \rangle$. The gradient of the log probability of the training data is then

$$\frac{\partial \log p(\mathbf{v}^0)}{\partial w_{ij}} = \langle v_i^0 h_j^0 \rangle - \langle v_i^\infty h_j^\infty \rangle \qquad (5)$$

Figure 3: An infinite logistic belief net with tied weights. The layers alternate between visible layers $V_0, V_1, V_2, \ldots$ and hidden layers $H_0, H_1, \ldots$, with the same weight matrix $W$ between every pair of layers. The downward arrows represent the generative model. The upward arrows are not part of the model. They represent the parameters ($W^T$) that are used to infer samples from the posterior distribution at each hidden layer of the net when a datavector is clamped on $V_0$.
This learning rule is the same as the maximum likelihood
learning rule for the infinite logistic belief net with tied
weights, and each step of Gibbs sampling corresponds to
computing the exact posterior distribution in a layer of the
infinite logistic belief net.
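As a concrete illustration, here is a minimal NumPy sketch of this learning rule in an RBM, approximating the equilibrium correlation by running the alternating Gibbs chain of figure 4 for a fixed number of steps. The chain length, layer sizes, learning rate and data are illustrative assumptions, and biases are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

def rbm_update(v0, W, n_steps=50, lr=0.01):
    """One gradient step of Eq. 5 on a batch of binary data vectors v0,
    approximating the equilibrium correlation with a finite Gibbs chain."""
    h0 = sample(sigmoid(v0 @ W))            # hidden states given the data
    pos = v0.T @ h0 / len(v0)               # <v_i^0 h_j^0>
    v, h = v0, h0
    for _ in range(n_steps):                # alternating Gibbs sampling
        v = sample(sigmoid(h @ W.T))        # update all visibles in parallel
        h = sample(sigmoid(v @ W))          # update all hiddens in parallel
    neg = v.T @ h / len(v0)                 # stands in for <v_i^inf h_j^inf>
    return W + lr * (pos - neg)

W = rng.normal(0.0, 0.1, size=(784, 500))
data = sample(np.full((100, 784), 0.2))     # stand-in for binarized images
W = rbm_update(data, W)
```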
Maximizing the log probability of the data is exactly the same as minimizing the Kullback-Leibler divergence, $KL(P^0 \,\|\, P_\theta^\infty)$, between the distribution of the data, $P^0$, and the equilibrium distribution defined by the model, $P_\theta^\infty$. In contrastive divergence learning (Hinton, 2002), we only run the Markov chain for $n$ full steps³ before measuring the second correlation.

³Each full step consists of updating $\mathbf{h}$ given $\mathbf{v}$, then updating $\mathbf{v}$ given $\mathbf{h}$.

Figure 4: A Markov chain that uses alternating Gibbs sampling, shown at steps $t = 0, 1, 2, \ldots, \infty$. In one full step of Gibbs sampling, the hidden units in the top layer are all updated in parallel by applying Eq. 1 to the inputs received from the current states of the visible units in the bottom layer; then the visible units are all updated in parallel given the current hidden states. The chain is initialized by setting the binary states of the visible units to be the same as a data-vector. The correlations in the activities of a visible and a hidden unit are measured after the first update of the hidden units ($\langle v_i^0 h_j^0 \rangle$) and again at the end of the chain ($\langle v_i^\infty h_j^\infty \rangle$). The difference of these two correlations provides the learning signal for updating the weight on the connection.

This is equivalent to ignoring the derivatives
that come from the higher layers of the infinite net. The sum
of all these ignored derivatives is the derivative of the log
probability of the posterior distribution in layer $V_n$, which is also the derivative of the Kullback-Leibler divergence between the posterior distribution in layer $V_n$, $P_\theta^n$, and the equilibrium distribution defined by the model. So contrastive divergence learning minimizes the difference of two Kullback-Leibler divergences:

$$KL(P^0 \,\|\, P_\theta^\infty) - KL(P_\theta^n \,\|\, P_\theta^\infty) \qquad (6)$$
Ignoring sampling noise, this difference is never negative because Gibbs sampling is used to produce $P_\theta^n$ from $P^0$ and Gibbs sampling always reduces the Kullback-Leibler divergence with the equilibrium distribution. It is important to notice that $P_\theta^n$ depends on the current model parameters and the way in which $P_\theta^n$ changes as the parameters change is being ignored by contrastive divergence learning. This problem does not arise with $P^0$ because the training data does not
depend on the parameters. An empirical investigation of the
relationship between the maximum likelihood and the con-
trastive divergence learning rules can be found in Carreira-
Perpinan and Hinton (2005).
Contrastive divergence learning in a restricted Boltzmann
machine is efficient enough to be practical (Mayraz and Hin-
ton, 2001). Variations that use real-valued units and differ-
ent sampling schemes are described in Teh et al. (2003) and
have been quite successful for modeling the formation of to-
pographic maps (Welling et al., 2003), for denoising natural
images (Roth and Black, 2005) or images of biological cells
(Ning et al., 2005). Marks and Movellan (2001) describe a
way of using contrastive divergence to perform factor analy-
sis and Welling et al. (2005) show that a network with logistic,
binary visible units and linear, Gaussian hidden units can be
used for rapid document retrieval. However, it appears that
the efficiency has been bought at a high price: When applied
in the obvious way, contrastive divergence learning fails for
deep, multilayer networks with different weights at each layer
because these networks take far too long even to reach condi-
tional equilibrium with a clamped data-vector. We now show
that the equivalence between RBMs and infinite directed nets
with tied weights suggests an efficient learning algorithm for
multilayer networks in which the weights are not tied.
4 A greedy learning algorithm for
transforming representations
An efficient way to learn a complicated model is to combine
a set of simpler models that are learned sequentially. To force
each model in the sequence to learn something different from
the previous models, the data is modified in some way after
each model has been learned. In boosting (Freund, 1995),
each model in the sequence is trained on re-weighted data that
emphasizes the cases that the preceding models got wrong. In
one version of principal components analysis, the variance in
a modeled direction is removed thus forcing the next modeled
direction to lie in the orthogonal subspace (Sanger, 1989).
In projection pursuit (Friedman and Stuetzle, 1981), the data
is transformed by nonlinearly distorting one direction in the
data-space to remove all non-Gaussianity in that direction.
The idea behind our greedy algorithm is to allow each model
in the sequence to receive a different representation of the
data. The model performs a non-linear transformation on its
input vectors and produces as output the vectors that will be used as input for the next model in the sequence.
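A hypothetical sketch of this greedy scheme follows: RBMs trained with contrastive divergence are stacked so that each new model receives the re-represented data produced by the one below it. The training schedule, learning rate and data are illustrative assumptions, and the label units of figure 1 are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

def rbm_update(v0, W, lr=0.01):
    """One CD-1 step of Eq. 5 (the earlier RBM sketch with n_steps = 1)."""
    h0 = sample(sigmoid(v0 @ W))
    v1 = sample(sigmoid(h0 @ W.T))
    h1 = sample(sigmoid(v1 @ W))
    return W + lr * (v0.T @ h0 - v1.T @ h1) / len(v0)

layer_sizes = [784, 500, 500, 2000]         # as in figure 1 (the label units
                                            # on the top RBM are omitted here)
data = sample(np.full((100, 784), 0.2))     # stand-in for binarized images

weights, x = [], data
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    W = rng.normal(0.0, 0.1, size=(n_in, n_out))
    for _ in range(10):                     # train this "weak" learner with
        W = rbm_update(x, W)                # contrastive divergence
    x = sample(sigmoid(x @ W))              # re-represent the data for the
    weights.append(W)                       # next model in the sequence
```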