Communicated by Jeffrey Elman
Adaptive Mixtures of Local Experts
Robert A. Jacobs
Michael I. Jordan
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology,
Cambridge, MA 02139 USA
Steven J. Nowlan
Geoffrey E. Hinton
Department of Computer Science, University of Toronto,
Toronto, Canada M5S 1A4
We present a new supervised learning procedure for systems composed
of many separate networks, each of which learns to handle a subset of
the complete set of training cases. The new procedure can be viewed
either as a modular version of a multilayer supervised network, or as
an associative version of competitive learning. It therefore provides
a new link between these two apparently different approaches. We
demonstrate that the learning procedure divides up a vowel discrimi-
nation task into appropriate subtasks, each of which can be solved by
a very simple expert network.
1 Making Associative Learning Competitive
If backpropagation is used to train a single, multilayer network to per-
form different subtasks on different occasions, there will generally be
strong interference effects that lead to slow learning and poor gener-
alization. If we know in advance that a set of training cases may be
naturally divided into subsets that correspond to distinct subtasks, inter-
ference can be reduced by using a system composed of several different
“expert” networks plus a gating network that decides which of the ex-
perts should be used for each training case.¹ Hampshire and Waibel
(1989) have described a system of this kind that can be used when the
division into subtasks is known prior to training, and Jacobs et al. (1990)
have described a related system that learns how to allocate cases to ex-
perts. The idea behind such a system is that the gating network allocates
a new case to one or a few experts, and, if the output is incorrect, the
weight changes are localized to these experts (and the gating network).
¹This idea was first presented by Jacobs and Hinton at the Connectionist Summer
School in Pittsburgh in 1988.
Neural Computation 3, 79-87 (1991) © 1991 Massachusetts Institute of Technology
So there is no interference with the weights of other experts that special-
ize in quite different cases. The experts are therefore local in the sense
that the weights in one expert are decoupled from the weights in other
experts. In addition they will often be local in the sense that each expert
will be allocated to only a small local region of the space of possible input
vectors.
Unfortunately, both Hampshire and Waibel and Jacobs et al. use an
error function that does not encourage localization. They assume that the
final output of the whole system is a linear combination of the outputs of
the local experts, with the gating network determining the proportion of
each local output in the linear combination. So the final error on case c
is
E^c = ‖ d^c − Σ_i p_i^c o_i^c ‖²   (1.1)
where o_i^c is the output vector of expert i on case c, p_i^c is the proportional
contribution of expert i to the combined output vector, and d^c is the
desired output vector on case c.
This error measure compares the desired output with a blend of the
outputs of the local experts, so, to minimize the error, each local expert
must make its output cancel the residual error that is left by the combined
effects of all the other experts. When the weights in one expert change,
the residual error changes, and so the error derivatives for all the other
local experts change.² This strong coupling between the experts causes
them to cooperate nicely, but tends to lead to solutions in which many
experts are used for each case. It is possible to encourage competition by
adding penalty terms to the objective function to encourage solutions in
which only one expert is active (Jacobs et al. 1990), but a simpler remedy
is to redefine the error function so that the local experts are encouraged
to compete rather than cooperate.
Instead of linearly combining the outputs of the separate experts, we
imagine that the gating network makes a stochastic decision about which
single expert to use on each occasion (see Fig. 1). The error is then the
expected value of the squared difference between the desired and actual
output vectors
E^c = ⟨ ‖ d^c − o_i^c ‖² ⟩ = Σ_i p_i^c ‖ d^c − o_i^c ‖²   (1.2)
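As a numerical illustration of the two error measures (a minimal numpy sketch; the function names and toy values are ours, not from the paper):

```python
import numpy as np

def cooperative_error(d, outputs, p):
    """Equation 1.1: squared error between the desired vector and the
    gated blend of all expert outputs, so the experts are coupled."""
    blend = (p[:, None] * outputs).sum(axis=0)
    return float(((d - blend) ** 2).sum())

def competitive_error(d, outputs, p):
    """Equation 1.2: expected squared error when the gate selects a
    single expert stochastically; each expert must match all of d."""
    per_expert = ((d - outputs) ** 2).sum(axis=1)
    return float((p * per_expert).sum())

# Toy case: two experts, one of which already matches the target.
d = np.array([1.0, 0.0])
outputs = np.array([[1.0, 0.0],   # expert 0: perfect
                    [0.0, 1.0]])  # expert 1: wrong
p = np.array([0.5, 0.5])
```

Under equation 1.1 the perfect expert still sees a nonzero error, because its output is blended with the wrong expert's; under equation 1.2 its own term is exactly zero.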
Notice that in this new error function, each expert is required to pro-
duce the whole of the output vector rather than a residual. As a result,
the goal of a local expert on a given training case is not directly affected
by the weights within other local experts. There is still some indirect
²For Hampshire and Waibel, this problem does not arise because they do not learn the
task decomposition. They train each expert separately on its own preassigned subtask.
Figure 1: A system of expert and gating networks. Each expert is a feed-
forward network and all experts receive the same input and have the same
number of outputs. The gating network is also feedforward, and typically
receives the same input as the expert networks. It has normalized outputs
p_j = e^{x_j} / Σ_i e^{x_i}, where x_j is the total weighted input received by out-
put unit j of the gating network. The selector acts like a multiple input, single
output stochastic switch; the probability that the switch will select the output
from expert j is p_j.
coupling because if some other expert changes its weights, it may cause
the gating network to alter the responsibilities that get assigned to the ex-
perts, but at least these responsibility changes cannot alter the sign of the
error that a local expert senses on a given training case. If both the gating
network and the local experts are trained by gradient descent in this new
error function, the system tends to devote a single expert to each training
case. Whenever an expert gives less error than the weighted average of
the errors of all the experts (using the outputs of the gating network to
decide how to weight each expert’s error) its responsibility for that case
will be increased, and whenever it does worse than the weighted average
its responsibility will be decreased.
The error function in equation 1.2 works in practice, but in the simulations
reported below we used a different error function which gives
better performance:

E^c = −log Σ_i p_i^c e^{−½ ‖ d^c − o_i^c ‖²}   (1.3)
The error defined in equation 1.3 is simply the negative log probability
of generating the desired output vector under the mixture of gaussians
model described at the end of the next section. To see why this error
function works better, it is helpful to compare the derivatives of the two
error functions with respect to the output of an expert. From equation 1.2
we get

∂E^c/∂o_i^c = −2 p_i^c (d^c − o_i^c)   (1.4)

while from equation 1.3 we get

∂E^c/∂o_i^c = −[ p_i^c e^{−½‖d^c − o_i^c‖²} / Σ_j p_j^c e^{−½‖d^c − o_j^c‖²} ] (d^c − o_i^c)   (1.5)
In equation 1.4 the term p_i^c is used to weight the derivative for expert i.
In equation 1.5 we use a weighting term that takes into account how well
expert i does relative to other experts. This is a more useful measure of
the relevance of expert i to training case c, especially early in the train-
ing. Suppose, for example, that the gating network initially gives equal
weights to all experts and ‖d^c − o_i^c‖ > 1 for all the experts. Equation 1.4
will adapt the best-fitting expert the slowest, whereas equation 1.5 will
adapt it the fastest.
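The difference between the two weightings can be seen directly in a short numpy sketch (the function `grads` and its toy values are ours):

```python
import numpy as np

def grads(d, outputs, p):
    """Derivatives of the two error functions with respect to each expert's
    output. Eq. 1.4 weights each residual by the prior gating proportion
    p_i; eq. 1.5 weights it by the posterior probability that expert i
    would have generated d."""
    resid = d - outputs                                  # (n_experts, n_out)
    g_eq14 = -2.0 * p[:, None] * resid
    lik = p * np.exp(-0.5 * (resid ** 2).sum(axis=1))    # p_i e^{-1/2 ||d - o_i||^2}
    post = lik / lik.sum()                               # posterior responsibility
    g_eq15 = -post[:, None] * resid
    return g_eq14, g_eq15, post
```

With equal gating proportions, the posterior weights concentrate on whichever expert currently fits the case best, so equation 1.5 moves that expert fastest.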
2 Making Competitive Learning Associative
It is natural to think that the "data" vectors on which a competitive net-
work is trained play a role similar to the input vectors of an associative
network that maps input vectors to output vectors. This correspondence
is assumed in models that use competitive learning as a preprocessing
stage within an associative network (Moody and Darken 1989). A quite
different view is that the data vectors used in competitive learning cor-
respond to the output vectors of an associative network. The competitive
network can then be viewed as an inputless stochastic generator of output
vectors and competitive learning can be viewed as a procedure for mak-
ing the network generate output vectors with a distribution that matches
the distribution of the "data" vectors. The weight vector of each com-
petitive hidden unit represents the mean of a multidimensional gaussian
distribution, and output vectors are generated by first picking a hidden
unit and then picking an output vector from the gaussian distribution
determined by the weight vector of the chosen hidden unit. The log
probability of generating any particular output vector o^c is then

ln P(o^c) = ln Σ_i p_i (1/k) e^{−½ ‖ o^c − μ_i ‖²}   (2.1)

where i is an index over the hidden units, μ_i is the weight vector of
hidden unit i, k is a normalizing constant, and p_i is the probability of
picking hidden unit i, so the p_i are constrained to sum to 1. In the statis-
tics literature (McLachlan and Basford 1988), the p_i are called "mixing
proportions."
“Soft” competitive learning modifies the weights (and also the vari-
ances and the mixing proportions) so as to increase the product of the
probabilities (i.e., the likelihood) of generating the output vectors in the
training set (Nowlan 1990a). “Hard” competitive learning is a simple
approximation to soft competitive learning in which we ignore the pos-
sibility that a data vector could be generated by several different hidden
units. Instead, we assume that it must be generated by the hidden unit
with the closest weight vector, so only this weight vector needs to be
modified to increase the probability of generating the data vector.
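The distinction between the soft and hard responsibility assignments can be sketched as follows, assuming unit-variance gaussians (the function and variable names are ours):

```python
import numpy as np

def responsibilities(data_vec, means, mix, soft=True):
    """Posterior probability that each unit-variance gaussian hidden unit
    generated data_vec. Soft competitive learning uses the full posterior;
    hard competitive learning replaces it with a one-hot vector for the
    unit whose weight vector is closest."""
    lik = mix * np.exp(-0.5 * ((data_vec - means) ** 2).sum(axis=1))
    post = lik / lik.sum()
    if soft:
        return post
    hard = np.zeros_like(post)
    hard[np.argmax(post)] = 1.0
    return hard
```

In the soft case every unit's weight vector would be nudged toward the data vector in proportion to its responsibility; in the hard case only the closest unit is modified.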
If we view a competitive network as generating output vectors, it is
not immediately obvious what role input vectors could play. However,
competitive learning can be generalized in much the same way as Barto
(1985) generalized learning automata by adding an input vector and mak-
ing the actions of the automaton be conditional on the input vector. We
replace each hidden unit in a competitive network by an entire expert
network whose output vector specifies the mean of a multidimensional
gaussian distribution. So the means are now a function of the current
input vector and are represented by activity levels rather than weights.
In addition, we use a gating network which allows the mixing propor-
tions of the experts to be determined by the input vector. This gives us
a system of competing local experts with the error function defined in
equation 1.3. We could also introduce a mechanism to allow the input
vector to dynamically determine the covariance matrix for the distribu-
tion defined by each expert network, but we have not yet experimented
with this possibility.
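Putting the pieces together, a minimal sketch of the resulting system might look as follows. The class name, layer sizes, and learning rate are illustrative assumptions, the experts are bias-free linear maps, and the gradients follow equations 1.3 and 1.5:

```python
import numpy as np

rng = np.random.default_rng(0)

class MixtureOfExperts:
    """Minimal sketch: linear experts plus a softmax gating network,
    trained by gradient descent on the error of equation 1.3.
    Names and sizes are illustrative, not from the paper."""

    def __init__(self, n_in, n_out, n_experts):
        self.W = rng.normal(0.0, 0.1, (n_experts, n_out, n_in))  # expert weights
        self.V = rng.normal(0.0, 0.1, (n_experts, n_in))         # gating weights

    def forward(self, x):
        o = self.W @ x                           # expert outputs, shape (k, n_out)
        s = self.V @ x                           # gating net inputs, shape (k,)
        p = np.exp(s - s.max())
        p /= p.sum()                             # softmax mixing proportions
        return o, p

    def step(self, x, d, lr=0.1):
        """One gradient update on E = -log sum_i p_i e^{-1/2 ||d - o_i||^2}."""
        o, p = self.forward(x)
        resid = d - o                            # (k, n_out)
        lik = p * np.exp(-0.5 * (resid ** 2).sum(axis=1))
        h = lik / lik.sum()                      # posterior responsibilities
        # dE/do_i = -h_i (d - o_i), eq. 1.5; chain rule through the linear experts
        self.W += lr * (h[:, None] * resid)[:, :, None] * x[None, None, :]
        # dE/ds_i = p_i - h_i for the softmax gate, so descend along h - p
        self.V += lr * (h - p)[:, None] * x[None, :]
        return -np.log(lik.sum())                # the error of equation 1.3
```

Trained repeatedly on a single case, the equation 1.3 error falls as one expert takes responsibility for the case and the gate learns to select it.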
3 Application to Multispeaker Vowel Recognition
The mixture of experts model was evaluated on a speaker independent,
four-class, vowel discrimination problem (Nowlan 199Ob). The data con-
sisted of the first and second formants of the vowels Ii], [I], [al, and [A]
(usually denoted [11]) from 75 speakers (males, females, and children) ut-
tered in a hVd context (Peterson and Barney 1952). The data forms two
Figure 2: Data for vowel discrimination problem, and expert and gating net-
work decision lines. The horizontal axis is the first formant value, and the
vertical axis is the second formant value (the formant values have been lin-
early scaled by dividing by a factor of 1000). Each example is labeled with its
corresponding vowel symbol. Vowels [i] and [I] form one overlapping pair of
classes, vowels [a] and [A] form the other pair. The lines labeled Net 0, 1, and 2
represent the decision lines for 3 expert networks. On one side of these lines the
output of the corresponding expert is less than 0.5, on the other side the output
is greater than 0.5. Although the mixture in this case contained 4 experts, one
of these experts made no significant contribution to the final mixture since its
mixing proportion p_i was effectively 0 for all cases. The line labeled Gate 0:2 in-
dicates the decision between expert 0 and expert 2 made by the gating network.
To the left of this line p_2 > p_0, to the right of this line p_0 > p_2. The boundary
between classes [a] and [A] is formed by the combination of the left part of Net
2's decision line and the right part of Net 0's decision line. Although the system
tends to use as few experts as it can to solve a problem, it is also sensitive to
specific problem features such as the slightly curved boundary between classes
[a] and [A].
pairs of overlapping classes, and different experts learn to concentrate
on one pair of classes or the other (Fig. 2).
We compared standard backpropagation networks containing a sin-
gle hidden layer of 6 or 12 units with mixtures of 4 or 8 very simple
experts. The architecture of each expert was restricted so it could form
only a linear decision surface, which is defined as the set of input vec-
tors for which the expert gives an output of exactly 0.5. All models were
trained with data from the first 50 speakers and tested with data from
the remaining 25 speakers. The small number of parameters for each ex-
pert allows excellent generalization performance (Table 1), and permits
System      Train % correct   Test % correct   Average number of epochs   SD
4 Experts         88                90                   1124             23
8 Experts         88                90                   1083             12
BP 6 Hid          88                90                   2209             83
BP 12 Hid         88                90                   2435            124
Table 1: Summary of Performance on Vowel Discrimination Task. Results are
based on 25 simulations for each of the alternative models. The first column of
the table indicates the system simulated. The second column gives the percent
of training cases classified correctly by the final set of weights, while the third
column indicates the percent of testing cases classified correctly. The last two
columns contain the average number of epochs required to reach the error
criterion, and the standard deviation of the distribution of convergence times.
Although the squared error was used to decide when to stop training, the
criterion for correct performance is based on a weighted average of the outputs
of all the experts. Each expert assigns a probability distribution over the classes
and these distributions are combined using proportions given by the gating
network. The most probable class is then taken to be the response of the system.
The identical performance of all the systems is due to the fact that, with this
data set, the set of misclassified examples is not sensitive to small changes in
the decision surfaces. Also, the test set is easier than the training set.
a graphic representation of the process of task decomposition (Figure 3).
The number of hidden units in the backpropagation networks was cho-
sen to give roughly equal numbers of parameters for the backpropagation
networks and mixture models. All simulations were performed using a
simple gradient descent algorithm with fixed step size ε. To simplify
the comparisons, no momentum or other acceleration techniques were
used. The value of ε for each system was chosen by performing a lim-
ited exploration of the convergence from the same initial conditions for
a range of ε. Batch training was used with one weight update for each
pass through the training set (epoch). Each system was trained until an
average squared error of 0.08 over the training set was obtained.
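The training recipe above can be sketched on a toy linear problem. The data, model size, and step size here are invented stand-ins; only the fixed-step batch updates and the 0.08 stopping criterion come from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for the training data: 2-d inputs, exactly linear 1-d targets.
X = rng.normal(size=(50, 2))
D = X @ np.array([[0.5], [-0.3]])

W = np.zeros((1, 2))      # a single linear unit
eps = 0.05                # fixed step size; no momentum, as in the text
mse = float("inf")
epochs = 0
while mse > 0.08 and epochs < 10000:
    out = X @ W.T                               # (50, 1) batch of outputs
    err = D - out
    mse = float((err ** 2).sum(axis=1).mean())  # average squared error per case
    W += eps * err.T @ X / len(X)               # one batch update per epoch
    epochs += 1
```

The loop stops once the average squared error over the training set drops below the criterion, and `epochs` plays the role of the convergence times reported in Table 1.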
The mixtures of experts reach the error criterion significantly faster
than the backpropagation networks (p > 0.999), requiring only about half
as many epochs on average (Table 1). The learning time for the mixture
model also scales well as the number of experts is increased: The mixture
of 8 experts has a small, but statistically significant (p > 0.95), advantage
in the average number of epochs required to reach the error criterion.
In contrast, the 12 hidden unit backpropagation network requires more
epochs (p > 0.95) to reach the error criterion than the network with 6
Figure 3: The trajectories of the decision lines of some experts during one
simulation. The horizontal axis is the first formant value, and the vertical axis
is the second formant value. Each trajectory is represented by a sequence of
dots, one per epoch, each dot marking the intersection of the expert’s decision
line and the normal to that line passing through the origin. For clarity, only 5
of the 8 experts are shown and the number of the expert is shown at the start
of the trajectory. The point labeled T0 indicates the optimal decision line for
a single expert trained to discriminate [i] from [I]. Similarly, T1 represents the
optimal decision line to discriminate [a] from [A]. The point labeled X is the
decision line learned by a single expert trained with data from all 4 classes, and
represents a type of average solution.
hidden units (Table 1). All statistical comparisons are based on a t test
with 48 degrees of freedom and a pooled variance estimator.
Figure 3 shows how the decision lines of different experts move
around as the system learns to allocate pieces of the task to different
experts. The system begins in an unbiased state, with the gating net-
work assigning equal mixing proportions to all experts in all cases. As
a result, each expert tends to get errors from roughly equal numbers of
cases in all 4 classes, and all experts head towards the point X, which
represents the optimal decision line for an expert that must deal with
all the cases. Once one or more experts begin to receive more error
from cases in one class pair than the other, this symmetry is broken and
the trajectories begin to diverge as different experts concentrate on one
class pair or the other. In this simulation, expert 5 learns to concentrate
on discriminating classes [i] and