Building High-level Features
Using Large Scale Unsupervised Learning
Quoc V. Le quocle@cs.stanford.edu
Marc’Aurelio Ranzato ranzato@google.com
Rajat Monga rajatmonga@google.com
Matthieu Devin mdevin@google.com
Kai Chen kaichen@google.com
Greg S. Corrado gcorrado@google.com
Jeff Dean jeff@google.com
Andrew Y. Ng ang@cs.stanford.edu
Abstract
We consider the problem of building high-
level, class-specific feature detectors from
only unlabeled data. For example, is it pos-
sible to learn a face detector using only unla-
beled images? To answer this, we train a 9-
layered locally connected sparse autoencoder
with pooling and local contrast normalization
on a large dataset of images (the model has
1 billion connections, the dataset has 10 mil-
lion 200x200 pixel images downloaded from
the Internet). We train this network using
model parallelism and asynchronous SGD on
a cluster with 1,000 machines (16,000 cores)
for three days. Contrary to what appears to
be a widely-held intuition, our experimental
results reveal that it is possible to train a face
detector without having to label images as
containing a face or not. Control experiments
show that this feature detector is robust not
only to translation but also to scaling and
out-of-plane rotation. We also find that the
same network is sensitive to other high-level
concepts such as cat faces and human bod-
ies. Starting with these learned features, we
trained our network to obtain 15.8% accu-
racy in recognizing 20,000 object categories
from ImageNet, a leap of 70% relative im-
provement over the previous state-of-the-art.
Appearing in Proceedings of the 29 th International Confer-
ence on Machine Learning, Edinburgh, Scotland, UK, 2012.
Copyright 2012 by the author(s)/owner(s).
1. Introduction
The focus of this work is to build high-level, class-
specific feature detectors from unlabeled images. For
instance, we would like to understand if it is possible to
build a face detector from only unlabeled images. This
approach is inspired by the neuroscientific conjecture
that there exist highly class-specific neurons in the hu-
man brain, generally and informally known as “grand-
mother neurons.” The extent of class-specificity of
neurons in the brain is an area of active investiga-
tion, but current experimental evidence suggests the
possibility that some neurons in the temporal cortex
are highly selective for object categories such as faces
or hands (Desimone et al., 1984), and perhaps even
specific people (Quiroga et al., 2005).
Contemporary computer vision methodology typically
emphasizes the role of labeled data to obtain these
class-specific feature detectors. For example, to build
a face detector, one needs a large collection of images
labeled as containing faces, often with a bounding box
around the face. The need for large labeled sets poses
a significant challenge for problems where labeled data
are rare. Although approaches that make use of inex-
pensive unlabeled data are often preferred, they have
not been shown to work well for building high-level
features.
This work investigates the feasibility of building high-
level features from only unlabeled data. A positive
answer to this question will give rise to two significant
results. Practically, this provides an inexpensive way
to develop features from unlabeled data. But perhaps
more importantly, it answers an intriguing question as
to whether the specificity of the “grandmother neuron”
could possibly be learned from unlabeled data. Infor-
mally, this would suggest that it is at least in principle
possible that a baby learns to group faces into one class
because it has seen many of them and not because it
is guided by supervision or rewards.
Unsupervised feature learning and deep learning have
emerged as methodologies in machine learning for
building features from unlabeled data. Using unlabeled
data in the wild to learn features is the key idea be-
hind the self-taught learning framework (Raina et al.,
2007). Successful feature learning algorithms and their
applications can be found in recent literature using a
variety of approaches such as RBMs (Hinton et al.,
2006), autoencoders (Hinton & Salakhutdinov, 2006;
Bengio et al., 2007), sparse coding (Lee et al., 2007)
and K-means (Coates et al., 2011). So far, most of
these algorithms have only succeeded in learning low-
level features such as “edge” or “blob” detectors. Go-
ing beyond such simple features and capturing com-
plex invariances is the topic of this work.
Recent studies observe that it is quite time intensive
to train deep learning algorithms to yield state of the
art results (Ciresan et al., 2010). We conjecture that
the long training time is partially responsible for the
lack of high-level features reported in the literature.
For instance, researchers typically reduce the sizes of
datasets and models in order to train networks in a
practical amount of time, and these reductions under-
mine the learning of high-level features.
We address this problem by scaling up the core compo-
nents involved in training deep networks: the dataset,
the model, and the computational resources. First,
we use a large dataset generated by sampling random
frames from random YouTube videos.1 Our input data
are 200x200 images, much larger than typical 32x32
images used in deep learning and unsupervised fea-
ture learning (Krizhevsky, 2009; Ciresan et al., 2010;
Le et al., 2010; Coates et al., 2011). Our model, a
deep autoencoder with pooling and local contrast nor-
malization, is scaled to these large images by using
a large computer cluster. To support parallelism on
this cluster, we use the idea of local receptive fields,
e.g., (Raina et al., 2009; Le et al., 2010; 2011b). This
idea reduces communication costs between machines
and thus allows model parallelism (parameters are dis-
tributed across machines). Asynchronous SGD is em-
ployed to support data parallelism. The model was
trained in a distributed fashion on a cluster with 1,000
machines (16,000 cores) for three days.
Experimental results using classification and visualiza-
tion confirm that it is indeed possible to build high-
level features from unlabeled data. In particular, using
a hold-out test set consisting of faces and distractors,
we discover a feature that is highly selective for faces.
1This is different from the work of (Lee et al., 2009) who
trained their model on images from one class.
This result is also validated by visualization via nu-
merical optimization. Control experiments show that
the learned detector is not only invariant to translation
but also to out-of-plane rotation and scaling.
Similar experiments reveal the network also learns the
concepts of cat faces and human bodies.
The learned representations are also discriminative.
Using the learned features, we obtain significant leaps
in object recognition with ImageNet. For instance, on
ImageNet with 20,000 categories, we achieved 15.8%
accuracy, a relative improvement of 70% over the state-
of-the-art.
2. Training set construction
Our training dataset is constructed by sampling frames
from 10 million YouTube videos. To avoid duplicates,
each video contributes only one image to the dataset.
Each example is a color image with 200x200 pixels.
A subset of training images is shown in Ap-
pendix A. To check the proportion of faces in
the dataset, we run an OpenCV face detector on
60x60 randomly-sampled patches from the dataset
(http://opencv.willowgarage.com/wiki/). This experiment
shows that patches detected as faces by the OpenCV face
detector account for less than 3% of the 100,000 sampled
patches.
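The patch-level check above can be sketched as follows. This is a minimal illustration rather than the original pipeline: the detector is injected as a callable (for instance, a wrapper around an OpenCV Haar cascade), since the exact OpenCV version and detector settings are assumptions here.

```python
import numpy as np

def face_patch_fraction(images, detect, n_patches=1000, patch=60, seed=0):
    """Estimate the fraction of random patch x patch crops that a
    face detector fires on. `detect` is any callable returning True
    for a face patch; it stands in for the OpenCV face detector."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_patches):
        img = images[rng.integers(len(images))]   # e.g., a 200x200 frame
        h, w = img.shape[:2]
        y = rng.integers(0, h - patch + 1)        # random top-left corner
        x = rng.integers(0, w - patch + 1)
        if detect(img[y:y + patch, x:x + patch]):
            hits += 1
    return hits / n_patches
```

With 100,000 sampled patches, this estimate is what the less-than-3% figure above refers to.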
3. Algorithm
In this section, we describe the algorithm that we use
to learn features from the unlabeled training set.
3.1. Previous work
Our work is inspired by recent successful algorithms in
unsupervised feature learning and deep learning (Hin-
ton et al., 2006; Bengio et al., 2007; Ranzato et al.,
2007; Lee et al., 2007). It is strongly influenced by the
work of (Olshausen & Field, 1996) on sparse coding.
According to their study, sparse coding can be trained
on unlabeled natural images to yield receptive fields
akin to V1 simple cells (Hubel & Wiesel, 1959).
One shortcoming of early approaches such as sparse
coding (Olshausen & Field, 1996) is that their archi-
tectures are shallow and typically capture low-level
concepts (e.g., edge “Gabor” filters) and simple invari-
ances. Addressing this issue is a focus of recent work
in deep learning (Hinton et al., 2006; Bengio et al.,
2007; Bengio & LeCun, 2007; Lee et al., 2008; 2009)
which build hierarchies of feature representations. In
particular, Lee et al (2008) show that stacked sparse
RBMs can model certain simple functions of the V2
area of the cortex. They also demonstrate that con-
volutional DBNs (Lee et al., 2009), trained on aligned
images of faces, can learn a face detector. This result
is interesting, but unfortunately requires a certain de-
gree of supervision during dataset construction: their
training images (i.e., Caltech 101 images) are aligned,
homogeneous and belong to one selected category.
Figure 1. The architecture and parameters in one layer of
our network. The overall network replicates this structure
three times. For simplicity, the images are in 1D.
3.2. Architecture
Our algorithm is built upon these ideas and can be
viewed as a sparse deep autoencoder with three impor-
tant ingredients: local receptive fields, pooling and lo-
cal contrast normalization. First, to scale the autoen-
coder to large images, we use a simple idea known as
local receptive fields (LeCun et al., 1998; Raina et al.,
2009; Lee et al., 2009; Le et al., 2010). This biologically
inspired idea proposes that each feature in the autoen-
coder can connect only to a small region of the lower
layer. Next, to achieve invariance to local deforma-
tions, we employ local L2 pooling (Hyvärinen et al.,
2009; Le et al., 2010) and local contrast normaliza-
tion (Jarrett et al., 2009). L2 pooling, in particular,
allows the learning of invariant features (Hyvärinen
et al., 2009; Le et al., 2010).
Our deep autoencoder is constructed by replicating
three times the same stage composed of local filtering,
local pooling and local contrast normalization. The
output of one stage is the input to the next one and
the overall model can be interpreted as a nine-layered
network (see Figure 1).
The first and second sublayers are often known as fil-
tering (or simple) and pooling (or complex) respec-
tively. The third sublayer performs local subtractive
and divisive normalization and it is inspired by bio-
logical and computational models (Pinto et al., 2008;
Lyu & Simoncelli, 2008; Jarrett et al., 2009).2
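The subtractive and divisive normalization of this third sublayer (detailed in footnote 2) can be sketched for a single feature map as follows. This is a minimal dense version for clarity; the zero-padding behavior at the borders is an assumption.

```python
import numpy as np

def local_contrast_normalize(h, G, c=0.01):
    """Subtractive then divisive normalization of one feature map h
    with a (normalized) Gaussian weighting window G, as in footnote 2.
    Borders are zero-padded, which is an assumption of this sketch."""
    H, W = h.shape
    k = G.shape[0]
    pad = k // 2

    def correlate(x):
        # Weighted local average of x under the window G.
        xp = np.pad(x, pad)
        out = np.empty((H, W))
        for i in range(H):
            for j in range(W):
                out[i, j] = (G * xp[i:i + k, j:j + k]).sum()
        return out

    g = h - correlate(h)                 # subtractive normalization
    sigma = np.sqrt(correlate(g ** 2))   # local weighted L2 magnitude
    return g / np.maximum(c, sigma)      # divisive normalization
```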
As mentioned above, central to our approach is the use
of local connectivity between neurons. In our experi-
ments, the first sublayer has receptive fields of 18x18
pixels and the second sublayer pools over 5x5 overlapping
neighborhoods of features (i.e., pooling size). The
neurons in the first sub-
layer connect to pixels in all input channels (or maps)
whereas the neurons in the second sublayer connect
to pixels of only one channel (or map).3 While the
first sublayer outputs linear filter responses, the pool-
ing layer outputs the square root of the sum of the
squares of its inputs, and therefore, it is known as L2
pooling.
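L2 pooling over a single feature map can be sketched as follows; this is a minimal dense illustration (the distributed, multi-map implementation is more involved), with zero-padded borders as an assumption.

```python
import numpy as np

def l2_pool(h, size=5):
    """L2 pooling: each output is the square root of the sum of
    squares of the inputs under an overlapping size x size window
    centered at that location (zero-padded borders)."""
    H, W = h.shape
    pad = size // 2
    hp = np.pad(h ** 2, pad)
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sqrt(hp[i:i + size, j:j + size].sum())
    return out
```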
Our style of stacking a series of uniform modules,
switching between selectivity and tolerance layers, is
reminiscent of the Neocognitron and HMAX (Fukushima
& Miyake, 1982; LeCun et al., 1998; Riesenhuber &
Poggio, 1999). It has also been argued to be an archi-
tecture employed by the brain (DiCarlo et al., 2012).
Although we use local receptive fields, they are not
convolutional: the parameters are not shared across
different locations in the image. This is a stark differ-
ence between our approach and previous work (LeCun
et al., 1998; Jarrett et al., 2009; Lee et al., 2009). In
addition to being more biologically plausible, unshared
weights allow the learning of more invariances other
than translational invariances (Le et al., 2010).
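The contrast with a convolution can be illustrated with a toy locally connected layer whose filter weights are indexed by output position, so nothing is shared across locations. The shapes and stride here are illustrative, not the paper's configuration.

```python
import numpy as np

def locally_connected(x, W, rf=18, stride=10):
    """Locally connected (untied) filtering: like a convolution, but
    each output location (i, j) applies its OWN weight matrix
    W[i, j] of shape (rf*rf, n_filters). No parameter sharing."""
    H = (x.shape[0] - rf) // stride + 1
    out = np.empty((H, H, W.shape[-1]))
    for i in range(H):
        for j in range(H):
            patch = x[i * stride:i * stride + rf,
                      j * stride:j * stride + rf].ravel()
            out[i, j] = patch @ W[i, j]   # position-specific weights
    return out
```

A convolution would use a single shared weight matrix for every (i, j); here the weight tensor grows with the number of output locations, which is why the parameter count reaches one billion.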
In terms of scale, our network is perhaps one of the
largest known networks to date. It has 1 billion train-
able parameters, which is more than an order of mag-
nitude larger than other large networks reported in
literature, e.g., (Ciresan et al., 2010; Sermanet & Le-
Cun, 2011) with around 10 million parameters. It is
worth noting that our network is still tiny compared to
the human visual cortex, which is 10^6 times larger in
terms of the number of neurons and synapses (Pakken-
berg et al., 2003).
3.3. Learning and Optimization
Learning: During learning, the parameters of the
second sublayers (H) are fixed to uniform weights,
whereas the encoding weights W1 and decoding
weights W2 of the first sublayers are adjusted using
2The subtractive normalization removes the weighted
average of neighboring neurons from the current neuron:
g_{i,j,k} = h_{i,j,k} − sum_{u,v} G_{u,v} h_{i+u,j+v,k}.
The divisive normalization computes
y_{i,j,k} = g_{i,j,k} / max{ c, ( sum_{u,v} G_{u,v} g_{i+u,j+v,k}^2 )^{1/2} },
where c is set to be a small number, 0.01, to prevent numerical
errors. G is a Gaussian weighting window. (Jarrett et al., 2009)
3For more details regarding connectivity patterns and
parameter sensitivity, see Appendix B and E.
the following optimization problem:

minimize_{W1,W2}  sum_{i=1}^{m} ( || W2 W1^T x^{(i)} − x^{(i)} ||_2^2
                  + λ sum_{j=1}^{k} sqrt( ε + H_j (W1^T x^{(i)})^2 ) ).   (1)
Here, λ is a tradeoff parameter between sparsity and
reconstruction; m, k are the number of examples and
pooling units in a layer respectively; Hj is the vector of
weights of the j-th pooling unit. In our experiments,
we set λ = 0.1.
This optimization problem is also known as recon-
struction Topographic Independent Component Anal-
ysis (Hyvärinen et al., 2009; Le et al., 2011a).4 The
first term in the objective ensures the representations
encode important information about the data, i.e.,
they can reconstruct input data; whereas the second
term encourages pooling features to group similar fea-
tures together to achieve invariances.
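Equation (1) can be sketched directly in code for a batch of examples. The value of the small constant under the square root (written ε above; 1e-3 here) and the dense matrix shapes are assumptions of this illustration.

```python
import numpy as np

def rica_objective(W1, W2, H, X, lam=0.1, eps=1e-3):
    """Objective of Eq. (1) for a batch X of shape (m, n):
    reconstruction error plus a pooled-sparsity penalty.
    W1, W2 have shape (n, f); H has shape (k, f), one row of
    pooling weights H_j per pooling unit."""
    Z = X @ W1                  # encodings W1^T x^(i), shape (m, f)
    R = Z @ W2.T                # reconstructions W2 W1^T x^(i), shape (m, n)
    recon = ((R - X) ** 2).sum()
    # sum over examples i and pooling units j of sqrt(eps + H_j z^2)
    pooled = np.sqrt(eps + (Z ** 2) @ H.T).sum()
    return recon + lam * pooled
```

With λ = 0.1, as in the experiments, the second term pulls features pooled under the same H_j toward firing together, which is what produces the invariances discussed above.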
Optimization: All parameters in our model were
trained jointly with the objective being the sum of the
objectives of the three layers.
To train the model, we implemented model parallelism
by distributing the local weights W1, W2 and H to
different machines. A single instance of the model
partitions the neurons and weights out across 169 ma-
chines (where each machine had 16 CPU cores). A
set of machines that collectively make up a single copy
of the model is referred to as a “model replica.” We
have built a software framework called DistBelief that
manages all the necessary communication between the
different machines within a model replica, so that users
of the framework merely need to write the desired up-
wards and downwards computation functions for the
neurons in the model, and don’t have to deal with the
low-level communication of data across machines.
We further scaled up the training by implementing
asynchronous SGD using multiple replicas of the core
model. For the experiments described here, we di-
vided the training into 5 portions and ran a copy of
the model on each of these portions. The models com-
municate updates through a set of centralized “param-
eter servers,” which keep the current state of all pa-
rameters for the model in a set of partitioned servers
(we used 256 parameter server partitions for training
the model described in this paper). In the simplest
implementation, before processing each mini-batch a
4In (Bengio et al., 2007; Le et al., 2011a), the encod-
ing weights and the decoding weights are tied: W1 = W2.
However, for better parallelism and better features, our
implementation does not enforce tied weights.
model replica asks the centralized parameter servers
for an updated copy of its model parameters. It then
processes a mini-batch to compute a parameter gra-
dient, and sends the parameter gradients to the ap-
propriate parameter servers, which then apply each
gradient to the current value of the model parame-
ter. We can reduce the communication overhead by
having each model replica request updated parame-
ters every P steps and by sending updated gradient
values to the parameter servers every G steps (where
G might not be equal to P). Our DistBelief software
framework automatically manages the transfer of pa-
rameters and gradients between the model partitions
and the parameter servers, freeing implementors of the
layer functions from having to deal with these issues.
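The fetch-every-P-steps / push-every-G-steps scheme can be sketched with a toy single-partition parameter server. This is an illustration of the protocol, not DistBelief: the gradient function, learning rate, and single-machine setting are all stand-ins.

```python
import numpy as np

class ParameterServer:
    """Toy single-partition parameter server: holds current
    parameters and applies pushed gradients in arrival order."""
    def __init__(self, params, lr=0.01):
        self.params, self.lr = params.copy(), lr
    def fetch(self):
        return self.params.copy()
    def push(self, grad):
        self.params -= self.lr * grad   # apply gradient to current value

class Replica:
    """Model replica that refreshes its (possibly stale) parameters
    every P steps and pushes accumulated gradients every G steps."""
    def __init__(self, server, grad_fn, P=4, G=2):
        self.server, self.grad_fn = server, grad_fn
        self.P, self.G = P, G
        self.params = server.fetch()
        self.acc = np.zeros_like(self.params)
    def step(self, t, batch):
        if t % self.P == 0:                  # request updated parameters
            self.params = self.server.fetch()
        self.acc += self.grad_fn(self.params, batch)
        if t % self.G == self.G - 1:         # send accumulated gradient
            self.server.push(self.acc)
            self.acc[:] = 0
```

Because replicas fetch and push independently, a slow or failed replica only delays its own updates, which is the robustness property noted below.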
Asynchronous SGD is more robust to failure than stan-
dard (synchronous) SGD. Specifically, for synchronous
SGD, if one of the machines goes down, the entire
training process is delayed; whereas for asynchronous
SGD, if one machine goes down, only one copy of SGD
is delayed while the rest of the optimization can still
proceed.
In our training, at every step of SGD, the gradient is
computed on a minibatch of 100 examples. We trained
the network on a cluster with 1,000 machines for three
days. See Appendix B, C, and D for more details re-
garding our implementation of the optimization.
4. Experiments on Faces
In this section, we describe our analysis of the learned
representations in recognizing faces (“the face detec-
tor”) and present control experiments to understand
invariance properties of the face detector. Results for
other concepts are presented in the next section.
4.1. Test set
The test set consists of 37,000 images sam-
pled from two datasets: Labeled Faces In the
Wild dataset (Huang et al., 2007) and ImageNet
dataset (Deng et al., 2009). There are 13,026 faces
sampled from non-aligned Labeled Faces in The Wild.5
The rest are distractor objects randomly sampled from
ImageNet. These images are resized to fit the visible
areas of the top neurons. Some example images are
shown in Appendix A.
4.2. Experimental protocols
After training, we used this test set to measure the
performance of each neuron in classifying faces against
distractors. For each neuron, we found its maximum
and minimum activation values, then picked 20 equally
5http://vis-www.cs.umass.edu/lfw/lfw.tgz
spaced thresholds in between. The reported accuracy
is the best classification accuracy among 20 thresholds.
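The per-neuron protocol above can be sketched as follows. Treating activations above the threshold as "face" is our reading of the protocol; the labels encode faces as 1 and distractors as 0.

```python
import numpy as np

def best_threshold_accuracy(acts, labels, n_thresholds=20):
    """Sweep n_thresholds equally spaced thresholds between a
    neuron's minimum and maximum activation and return the best
    accuracy at classifying faces (1) against distractors (0)."""
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels)
    best = 0.0
    for t in np.linspace(acts.min(), acts.max(), n_thresholds):
        best = max(best, float(((acts > t) == labels).mean()))
    return best
```

For calibration, a degenerate neuron that never fires predicts "distractor" everywhere, which on the 37,000-image test set with 13,026 faces gives the 64.8% all-negative baseline reported below.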
4.3. Recognition
Surprisingly, the best neuron in the network performs
very well in recognizing faces, despite the fact that no
supervisory signals were given during training. The
best neuron in the network achieves 81.7% accuracy in
detecting faces. There are 13,026 faces in the test set,
so guessing all negative only achieves 64.8%. The best
neuron in a one-layered network only achieves 71% accuracy,
while the best linear filter, selected among 100,000
filters sampled randomly from the training set, only
achieves 74%.
To understand their contribution, we removed the lo-
cal contrast normalization sublayers and trained the
network again. Results show that the accuracy of the
best neuron drops to 78.5%. This agrees with a previous
study showing the importance of local contrast
normalization (Jarrett et al., 2009).
We visualize histograms of activation values for face
images and random images in Figure 2. It can be seen
that, even with exclusively unlabeled data, the neuron
learns to differentiate between faces and random distractors.
Specifically, when we give a face as the input image, the
neuron tends to output a value larger than the threshold,
0. In contrast, if we give a random image as the input,
the neuron tends to output a value less than 0.
Figure 2. Histograms of faces (red) vs. no faces (blue).