Communicated by Jeffrey Elman
Adaptive Mixtures of Local Experts
Robert A. Jacobs
Michael I. Jordan
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology,
Cambridge, MA 02139 USA
Steven J. Nowlan
Geoffrey E. Hinton
Department of Computer Science, University of Toronto,
Toronto, Canada M5S 1A4
We present a new supervised learning procedure for systems composed
of many separate networks, each of which learns to handle a subset of
the complete set of training cases. The new procedure can be viewed
either as a modular version of a multilayer supervised network, or as
an associative version of competitive learning. It therefore provides
a new link between these two apparently different approaches. We
demonstrate that the learning procedure divides up a vowel discrimi-
nation task into appropriate subtasks, each of which can be solved by
a very simple expert network.
1 Making Associative Learning Competitive
If backpropagation is used to train a single, multilayer network to per-
form different subtasks on different occasions, there will generally be
strong interference effects that lead to slow learning and poor gener-
alization. If we know in advance that a set of training cases may be
naturally divided into subsets that correspond to distinct subtasks, inter-
ference can be reduced by using a system composed of several different
“expert” networks plus a gating network that decides which of the ex-
perts should be used for each training case.¹ Hampshire and Waibel
(1989) have described a system of this kind that can be used when the
division into subtasks is known prior to training, and Jacobs et al. (1990)
have described a related system that learns how to allocate cases to ex-
perts. The idea behind such a system is that the gating network allocates
a new case to one or a few experts, and, if the output is incorrect, the
weight changes are localized to these experts (and the gating network).
¹This idea was first presented by Jacobs and Hinton at the Connectionist Summer
School in Pittsburgh in 1988.
Neural Computation 3, 79-87 (1991) © 1991 Massachusetts Institute of Technology
So there is no interference with the weights of other experts that special-
ize in quite different cases. The experts are therefore local in the sense
that the weights in one expert are decoupled from the weights in other
experts. In addition they will often be local in the sense that each expert
will be allocated to only a small local region of the space of possible input
vectors.
Unfortunately, both Hampshire and Waibel and Jacobs et al. use an
error function that does not encourage localization. They assume that the
final output of the whole system is a linear combination of the outputs of
the local experts, with the gating network determining the proportion of
each local output in the linear combination. So the final error on case c
is
E^c = ‖ d^c − Σ_i p_i^c o_i^c ‖²   (1.1)
where o_i^c is the output vector of expert i on case c, p_i^c is the proportional
contribution of expert i to the combined output vector, and d^c is the
desired output vector on case c.
This error measure compares the desired output with a blend of the
outputs of the local experts, so, to minimize the error, each local expert
must make its output cancel the residual error that is left by the combined
effects of all the other experts. When the weights in one expert change,
the residual error changes, and so the error derivatives for all the other
local experts change.² This strong coupling between the experts causes
them to cooperate nicely, but tends to lead to solutions in which many
experts are used for each case. It is possible to encourage competition by
adding penalty terms to the objective function to encourage solutions in
which only one expert is active (Jacobs et al. 1990), but a simpler remedy
is to redefine the error function so that the local experts are encouraged
to compete rather than cooperate.
Instead of linearly combining the outputs of the separate experts, we
imagine that the gating network makes a stochastic decision about which
single expert to use on each occasion (see Fig. 1). The error is then the
expected value of the squared difference between the desired and actual
output vectors
E^c = ⟨ ‖ d^c − o_i^c ‖² ⟩ = Σ_i p_i^c ‖ d^c − o_i^c ‖²   (1.2)
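As a numerical illustration of the two error measures (a minimal numpy sketch; the function names and toy values are ours, not from the paper):

```python
import numpy as np

def cooperative_error(d, outputs, p):
    """Equation 1.1: squared error between the desired vector and the
    gated blend of all expert outputs, so the experts are coupled."""
    blend = (p[:, None] * outputs).sum(axis=0)
    return float(((d - blend) ** 2).sum())

def competitive_error(d, outputs, p):
    """Equation 1.2: expected squared error when the gate selects a
    single expert stochastically; each expert must match all of d."""
    per_expert = ((d - outputs) ** 2).sum(axis=1)
    return float((p * per_expert).sum())

# Toy case: two experts, one of which already matches the target.
d = np.array([1.0, 0.0])
outputs = np.array([[1.0, 0.0],   # expert 0: perfect
                    [0.0, 1.0]])  # expert 1: wrong
p = np.array([0.5, 0.5])
```

Under equation 1.1 the perfect expert still sees a nonzero error, because its output is blended with the wrong expert's; under equation 1.2 its own term is exactly zero.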
Notice that in this new error function, each expert is required to pro-
duce the whole of the output vector rather than a residual. As a result,
the goal of a local expert on a given training case is not directly affected
by the weights within other local experts. There is still some indirect
²For Hampshire and Waibel, this problem does not arise because they do not learn the
task decomposition. They train each expert separately on its own preassigned subtask.
Figure 1: A system of expert and gating networks. Each expert is a feed-
forward network and all experts receive the same input and have the same
number of outputs. The gating network is also feedforward, and typically
receives the same input as the expert networks. It has normalized outputs
p_j = e^{x_j} / Σ_i e^{x_i}, where x_j is the total weighted input received by out-
put unit j of the gating network. The selector acts like a multiple input, single
output stochastic switch; the probability that the switch will select the output
from expert j is p_j.
coupling because if some other expert changes its weights, it may cause
the gating network to alter the responsibilities that get assigned to the ex-
perts, but at least these responsibility changes cannot alter the sign of the
error that a local expert senses on a given training case. If both the gating
network and the local experts are trained by gradient descent in this new
error function, the system tends to devote a single expert to each training
case. Whenever an expert gives less error than the weighted average of
the errors of all the experts (using the outputs of the gating network to
decide how to weight each expert’s error) its responsibility for that case
will be increased, and whenever it does worse than the weighted average
its responsibility will be decreased.
The error function in equation 1.2 works in practice, but in the simulations
reported below we used a different error function which gives
better performance:

E^c = −log Σ_i p_i^c e^{−½ ‖ d^c − o_i^c ‖²}   (1.3)
The error defined in equation 1.3 is simply the negative log probability
of generating the desired output vector under the mixture of gaussians
model described at the end of the next section. To see why this error
function works better, it is helpful to compare the derivatives of the two
error functions with respect to the output of an expert. From equation 1.2
we get

∂E^c/∂o_i^c = −2 p_i^c (d^c − o_i^c)   (1.4)

while from equation 1.3 we get

∂E^c/∂o_i^c = −[ p_i^c e^{−½‖d^c − o_i^c‖²} / Σ_j p_j^c e^{−½‖d^c − o_j^c‖²} ] (d^c − o_i^c)   (1.5)
In equation 1.4 the term p_i^c is used to weight the derivative for expert i.
In equation 1.5 we use a weighting term that takes into account how well
expert i does relative to other experts. This is a more useful measure of
the relevance of expert i to training case c, especially early in the train-
ing. Suppose, for example, that the gating network initially gives equal
weights to all experts and ‖d^c − o_i^c‖ > 1 for all the experts. Equation 1.4
will adapt the best-fitting expert the slowest, whereas equation 1.5 will
adapt it the fastest.
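The difference between the two weightings can be seen directly in a short numpy sketch (the function `grads` and its toy values are ours):

```python
import numpy as np

def grads(d, outputs, p):
    """Derivatives of the two error functions with respect to each expert's
    output. Eq. 1.4 weights each residual by the prior gating proportion
    p_i; eq. 1.5 weights it by the posterior probability that expert i
    would have generated d."""
    resid = d - outputs                                  # (n_experts, n_out)
    g_eq14 = -2.0 * p[:, None] * resid
    lik = p * np.exp(-0.5 * (resid ** 2).sum(axis=1))    # p_i e^{-1/2 ||d - o_i||^2}
    post = lik / lik.sum()                               # posterior responsibility
    g_eq15 = -post[:, None] * resid
    return g_eq14, g_eq15, post
```

With equal gating proportions, the posterior weights concentrate on whichever expert currently fits the case best, so equation 1.5 moves that expert fastest.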
2 Making Competitive Learning Associative
It is natural to think that the "data" vectors on which a competitive net-
work is trained play a role similar to the input vectors of an associative
network that maps input vectors to output vectors. This correspondence
is assumed in models that use competitive learning as a preprocessing
stage within an associative network (Moody and Darken 1989). A quite
different view is that the data vectors used in competitive learning cor-
respond to the output vectors of an associative network. The competitive
network can then be viewed as an inputless stochastic generator of output
vectors and competitive learning can be viewed as a procedure for mak-
ing the network generate output vectors with a distribution that matches
the distribution of the "data" vectors. The weight vector of each com-
petitive hidden unit represents the mean of a multidimensional gaussian
distribution, and output vectors are generated by first picking a hidden
unit and then picking an output vector from the gaussian distribution
determined by the weight vector of the chosen hidden unit. The log
probability of generating any particular output vector o^c is then

ln P(o^c) = ln Σ_i p_i (1/k) e^{−½ ‖ o^c − μ_i ‖²}   (2.1)

where i is an index over the hidden units, μ_i is the weight vector of
hidden unit i, k is a normalizing constant, and p_i is the probability of
picking hidden unit i, so the p_i are constrained to sum to 1. In the statis-
tics literature (McLachlan and Basford 1988), the p_i are called "mixing
proportions."
“Soft” competitive learning modifies the weights (and also the vari-
ances and the mixing proportions) so as to increase the product of the
probabilities (i.e., the likelihood) of generating the output vectors in the
training set (Nowlan 1990a). “Hard” competitive learning is a simple
approximation to soft competitive learning in which we ignore the pos-
sibility that a data vector could be generated by several different hidden
units. Instead, we assume that it must be generated by the hidden unit
with the closest weight vector, so only this weight vector needs to be
modified to increase the probability of generating the data vector.
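The distinction between the soft and hard responsibility assignments can be sketched as follows, assuming unit-variance gaussians (the function and variable names are ours):

```python
import numpy as np

def responsibilities(data_vec, means, mix, soft=True):
    """Posterior probability that each unit-variance gaussian hidden unit
    generated data_vec. Soft competitive learning uses the full posterior;
    hard competitive learning replaces it with a one-hot vector for the
    unit whose weight vector is closest."""
    lik = mix * np.exp(-0.5 * ((data_vec - means) ** 2).sum(axis=1))
    post = lik / lik.sum()
    if soft:
        return post
    hard = np.zeros_like(post)
    hard[np.argmax(post)] = 1.0
    return hard
```

In the soft case every unit's weight vector would be nudged toward the data vector in proportion to its responsibility; in the hard case only the closest unit is modified.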
If we view a competitive network as generating output vectors, it is
not immediately obvious what role input vectors could play. However,
competitive learning can be generalized in much the same way as Barto
(1985) generalized learning automata by adding an input vector and mak-
ing the actions of the automaton be conditional on the input vector. We
replace each hidden unit in a competitive network by an entire expert
network whose output vector specifies the mean of a multidimensional
gaussian distribution. So the means are now a function of the current
input vector and are represented by activity levels rather than weights.
In addition, we use a gating network which allows the mixing propor-
tions of the experts to be determined by the input vector. This gives us
a system of competing local experts with the error function defined in
equation 1.3. We could also introduce a mechanism to allow the input
vector to dynamically determine the covariance matrix for the distribu-
tion defined by each expert network, but we have not yet experimented
with this possibility.
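Putting the pieces together, a minimal sketch of the resulting system might look as follows. The class name, layer sizes, and learning rate are illustrative assumptions, the experts are bias-free linear maps, and the gradients follow equations 1.3 and 1.5:

```python
import numpy as np

rng = np.random.default_rng(0)

class MixtureOfExperts:
    """Minimal sketch: linear experts plus a softmax gating network,
    trained by gradient descent on the error of equation 1.3.
    Names and sizes are illustrative, not from the paper."""

    def __init__(self, n_in, n_out, n_experts):
        self.W = rng.normal(0.0, 0.1, (n_experts, n_out, n_in))  # expert weights
        self.V = rng.normal(0.0, 0.1, (n_experts, n_in))         # gating weights

    def forward(self, x):
        o = self.W @ x                           # expert outputs, shape (k, n_out)
        s = self.V @ x                           # gating net inputs, shape (k,)
        p = np.exp(s - s.max())
        p /= p.sum()                             # softmax mixing proportions
        return o, p

    def step(self, x, d, lr=0.1):
        """One gradient update on E = -log sum_i p_i e^{-1/2 ||d - o_i||^2}."""
        o, p = self.forward(x)
        resid = d - o                            # (k, n_out)
        lik = p * np.exp(-0.5 * (resid ** 2).sum(axis=1))
        h = lik / lik.sum()                      # posterior responsibilities
        # dE/do_i = -h_i (d - o_i), eq. 1.5; chain rule through the linear experts
        self.W += lr * (h[:, None] * resid)[:, :, None] * x[None, None, :]
        # dE/ds_i = p_i - h_i for the softmax gate, so descend along h - p
        self.V += lr * (h - p)[:, None] * x[None, :]
        return -np.log(lik.sum())                # the error of equation 1.3
```

Trained repeatedly on a single case, the equation 1.3 error falls as one expert takes responsibility for the case and the gate learns to select it.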
3 Application to Multispeaker Vowel Recognition
The mixture of experts model was evaluated on a speaker independent,
four-class, vowel discrimination problem (Nowlan 199Ob). The data con-
sisted of the first and second formants of the vowels Ii], [I], [al, and [A]
(usually denoted [11]) from 75 speakers (males, females, and children) ut-
tered in a hVd context (Peterson and Barney 1952). The data forms two
Figure 2: Data for vowel discrimination problem, and expert and gating net-
work decision lines. The horizontal axis is the first formant value, and the
vertical axis is the second formant value (the formant values have been lin-
early scaled by dividing by a factor of 1000). Each example is labeled with its
corresponding vowel symbol. Vowels [i] and [I] form one overlapping pair of
classes, vowels [a] and [A] form the other pair. The lines labeled Net 0, 1, and 2
represent the decision lines for 3 expert networks. On one side of these lines the
output of the corresponding expert is less than 0.5, on the other side the output
is greater than 0.5. Although the mixture in this case contained 4 experts, one
of these experts made no significant contribution to the final mixture since its
mixing proportion p_i was effectively 0 for all cases. The line labeled Gate 0:2 in-
dicates the decision between expert 0 and expert 2 made by the gating network.
To the left of this line p_2 > p_0, to the right of this line p_0 > p_2. The boundary
between classes [a] and [A] is formed by the combination of the left part of Net
2's decision line and the right part of Net 0's decision line. Although the system
tends to use as few experts as it can to solve a problem, it is also sensitive to
specific problem features such as the slightly curved boundary between classes
[a] and [A].
pairs of overlapping classes, and different experts learn to concentrate
on one pair of classes or the other (Fig. 2).
We compared standard backpropagation networks containing a sin-
gle hidden layer of 6 or 12 units with mixtures of 4 or 8 very simple
experts. The architecture of each expert was restricted so it could form
only a linear decision surface, which is defined as the set of input vec-
tors for which the expert gives an output of exactly 0.5. All models were
trained with data from the first 50 speakers and tested with data from
the remaining 25 speakers. The small number of parameters for each ex-
pert allows excellent generalization performance (Table 1), and permits
System      Train % correct   Test % correct   Average number of epochs   SD
4 Experts         88                90                   1124             23
8 Experts         88                90                   1083             12
BP 6 Hid          88                90                   2209             83
BP 12 Hid         88                90                   2435            124
Table 1: Summary of Performance on Vowel Discrimination Task. Results are
based on 25 simulations for each of the alternative models. The first column of
the table indicates the system simulated. The second column gives the percent
of training cases classified correctly by the final set of weights, while the third
column indicates the percent of testing cases classified correctly. The last two
columns contain the average number of epochs required to reach the error
criterion, and the standard deviation of the distribution of convergence times.
Although the squared error was used to decide when to stop training, the
criterion for correct performance is based on a weighted average of the outputs
of all the experts. Each expert assigns a probability distribution over the classes
and these distributions are combined using proportions given by the gating
network. The most probable class is then taken to be the response of the system.
The identical performance of all the systems is due to the fact that, with this
data set, the set of misclassified examples is not sensitive to small changes in
the decision surfaces. Also, the test set is easier than the training set.
a graphic representation of the process of task decomposition (Figure 3).
The number of hidden units in the backpropagation networks was cho-
sen to give roughly equal numbers of parameters for the backpropagation
networks and mixture models. All simulations were performed using a
simple gradient descent algorithm with fixed step size ε. To simplify
the comparisons, no momentum or other acceleration techniques were
used. The value of ε for each system was chosen by performing a lim-
ited exploration of the convergence from the same initial conditions for
a range of ε. Batch training was used with one weight update for each
pass through the training set (epoch). Each system was trained until an
average squared error of 0.08 over the training set was obtained.
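The training recipe above can be sketched on a toy linear problem. The data, model size, and step size here are invented stand-ins; only the fixed-step batch updates and the 0.08 stopping criterion come from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for the training data: 2-d inputs, exactly linear 1-d targets.
X = rng.normal(size=(50, 2))
D = X @ np.array([[0.5], [-0.3]])

W = np.zeros((1, 2))      # a single linear unit
eps = 0.05                # fixed step size; no momentum, as in the text
mse = float("inf")
epochs = 0
while mse > 0.08 and epochs < 10000:
    out = X @ W.T                               # (50, 1) batch of outputs
    err = D - out
    mse = float((err ** 2).sum(axis=1).mean())  # average squared error per case
    W += eps * err.T @ X / len(X)               # one batch update per epoch
    epochs += 1
```

The loop stops once the average squared error over the training set drops below the criterion, and `epochs` plays the role of the convergence times reported in Table 1.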
The mixtures of experts reach the error criterion significantly faster
than the backpropagation networks (p > 0.999), requiring only about half
as many epochs on average (Table 1). The learning time for the mixture
model also scales well as the number of experts is increased: The mixture
of 8 experts has a small, but statistically significant (p > 0.95), advantage
in the average number of epochs required to reach the error criterion.
In contrast, the 12 hidden unit backpropagation network requires more
epochs (p > 0.95) to reach the error criterion than the network with 6
Figure 3: The trajectories of the decision lines of some experts during one
simulation. The horizontal axis is the first formant value, and the vertical axis
is the second formant value. Each trajectory is represented by a sequence of
dots, one per epoch, each dot marking the intersection of the expert’s decision
line and the normal to that line passing through the origin. For clarity, only 5
of the 8 experts are shown and the number of the expert is shown at the start
of the trajectory. The point labeled T0 indicates the optimal decision line for
a single expert trained to discriminate [i] from [I]. Similarly, T1 represents the
optimal decision line to discriminate [a] from [A]. The point labeled X is the
decision line learned by a single expert trained with data from all 4 classes, and
represents a type of average solution.
hidden units (Table 1). All statistical comparisons are based on a t test
with 48 degrees of freedom and a pooled variance estimator.
Figure 3 shows how the decision lines of different experts move
around as the system learns to allocate pieces of the task to different
experts. The system begins in an unbiased state, with the gating net-
work assigning equal mixing proportions to all experts in all cases. As
a result, each expert tends to get errors from roughly equal numbers of
cases in all 4 classes, and all experts head towards the point X, which
represents the optimal decision line for an expert that must deal with
all the cases. Once one or more experts begin to receive more error
from cases in one class pair than the other, this symmetry is broken and
the trajectories begin to diverge as different experts concentrate on one
class pair or the other. In this simulation, expert 5 learns to concentrate
on discriminating classes [i] and