[Hui Jiang and Xinwei Li]
[An advanced method of discriminative training for speech and language processing]

In machine learning, there are two distinct groups of learning methods for
building pattern classifiers: generative learning and discriminative
learning. The generative learning scheme aims to estimate the joint
probability distribution of observation data and class label and
then use the estimated distribution for classification according
to the well-known Bayes decision rule. In generative learning, it
is normal practice to adopt the so-called parametric modeling
approach, where it is assumed that unknown probability dis-
tributions belong to computationally tractable function fami-
lies. In this way, the difficult density estimation problem
becomes a more tractable parameter estimation problem.
The advantage of generative learning is that some inherent
dependency structures and/or independency assumptions
applicable to the underlying data can be explicitly
exploited by choosing an appropriate statistical model,
such as mixture models or even highly structured graphi-
cal models. On the other hand, the discriminative learn-
ing scheme makes no explicit attempt to model the data
distribution and instead optimizes a mapping function
from inputs to any desired outputs. As a result, only the
decision boundary is adjusted without forming a data gen-
erator in the entire feature space. The advantage of dis-
criminative learning lies in that the mapping function can
be estimated based on criteria that are more relevant to the
ultimate classification and regression task. Because of their
complementary nature, recent research work in machine learn-
ing [1], [9], [10], [12], [32] has shown the potential benefit of
combining generative and discriminative learning methods.
One active research topic in speech and language processing is
how to learn generative models using discriminative learning
approaches. For example, discriminative training (DT) of hidden Markov
models (HMMs) for automatic speech recognition (ASR) has been intensively
studied for several decades.
The essential idea of DT is to
apply a discriminative criterion
to the training procedure of
HMMs. Many discriminative
criteria have been proposed for
ASR, such as maximum mutu-
al information estimation
(MMIE) [2], [35], minimum classification error (MCE) [18],
minimum phone (word) error (MPE/MWE) [29], and large
margin estimation (LME) [13], [20], [31], [40]. In a standard
DT process, an objective function is first constructed accord-
ing to one of these discriminative criteria. Then, optimization
methods are used to maximize/minimize the objective func-
tion with respect to the model parameters. The major difficulty
of DT in ASR lies in the fact that normally the above-men-
tioned DT criteria result in large-scale nonconvex optimization
problems. For example, in a state-of-the-art large vocabulary
ASR system, these DT criteria normally lead to complex non-
convex objective functions involving millions or even tens
of millions of free variables. Generally speaking, optimizing
nonconvex objective functions of such a large number of vari-
ables is extremely difficult since it is very easy to get trapped
in a shallow local optimum point in the complicated surface of
the objective functions. During the last two decades, a signifi-
cant amount of research effort in the speech community has
been devoted to this problem. Many different optimization
methods have been proposed and applied to DT of HMMs in
ASR. For example, the extended Baum-Welch (EBW) method
was proposed based on a growth function [7], which was first
applied to MMIE [35] and then extended to other discrimina-
tive criteria including MCE and MPE/MWE [8]. Moreover, a
stochastic approximation method called generalized probabi-
listic descent (GPD) [18] was first proposed for MCE, and then
an approximate second-order Quasi-Newton method called
quickprop was applied to MCE and other criteria [26]. More
recently, constrained line search [21] and trust region [37],
[22] methods have also been proposed for DT of HMMs in ASR.
Generally speaking, all of these methods perform nonconvex optimization and normally lead to better recognition performance only when carefully and skillfully implemented, such as in [35]. These nonconvex optimization methods, especially
the EBW approach, remain popular optimization methods for
DT of HMMs in ASR. Interested readers may refer to several
recent survey articles [8], [17] for details. In this tutorial arti-
cle, as part of applications of convex optimization in signal
processing, we focus on more recent research work that
applies convex optimization methods to DT for speech and
language processing applications.
INTRODUCTION
To the best of our knowledge, there have been at least two
independent research efforts to apply convex optimization
methods to the discriminative learning of HMMs for ASR.
The advantage of convex optimization is that it does not
suffer from the local optimum
problem because any local
optimum is always globally
optimal in convex optimiza-
tion problems. Both research
groups have been motivated
by advances of large-margin
classifiers in machine learning
and attempt to estimate Gaussian mixture HMMs for ASR
based on the large margin criterion. In [30] and [31], a large
margin-based DT method was proposed to estimate Gaussian
mixture models for multiclass pattern classification tasks and
then extended to Gaussian mixture HMMs for sequential clas-
sification such as ASR. In this work, LME of Gaussians is for-
mulated as a regularized optimization problem where the
regularization term is calculated based on traces of parameter
matrices, and a so-called reparameterization method is pro-
posed to relax multivariate Gaussian parameters to formulate
LME as a convex optimization problem, which is solved by a
customized gradient descent method for efficiency. In [13]
and [15], a different strategy is applied to LME of Gaussian
mixture HMMs for ASR, where LME of HMMs is directly for-
mulated as a minimax optimization problem based on the
principle of maximizing the minimum margin of the HMM-
based classifiers. Then, an iterative optimization method,
called approximation optimization (AM), is used to solve the
original minimax optimization problem through a locality
approximation. Convex relaxation methods may be used to
convert the original nonconvex optimization into standard
convex optimization problems, such as semidefinite program-
ming (SDP) [19], [39] and second-order cone programming
(SOCP) [36], [38]. Even though this work was initially pro-
posed only for LME of mean parameters of Gaussian mixture
HMMs, it can be easily extended to estimate other parameters
of HMMs, such as covariance matrices and transition proba-
bilities, as well as other discriminative criteria, such as MMIE
and MCE. As shown in [16], the proposed AM method is gen-
eral enough for DT of generative models from a wide class of
statistical models, namely finite mixtures of exponential fam-
ily models, while the method in [30], [31] is specially tailored
for Gaussians. Over the past few years, we have successfully
applied a variety of convex optimization methods under the
AM framework to solve some DT problems arising in speech
and language processing tasks, initially starting with
Gaussian mixture HMMs for speech recognition [19], [36],
[38], [39] and more recently extending to multinomial-based
models for language processing [25], [28]. Here, convex opti-
mization plays a critical role in solving the underlying large-
scale optimization problems in DT of the generative models.
In this tutorial article, as an example of convex optimization
for signal processing, we summarize recent research efforts
that apply convex optimization to DT of various statistical
models under the AM framework and highlight the emerging
role of convex optimization in these traditional speech and
language applications.
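To give a flavor of why convexity matters here, the toy sketch below (not the HMM formulations of [13], [19], [36], [38], and using the cvxpy package purely as an assumed illustration) maximizes the minimum margin of a linear classifier on synthetic two-dimensional data; with a norm constraint on the parameters, the minimax problem becomes a small second-order cone program with a unique global optimum. For Gaussian mixture HMMs the margin is not an affine function of the model parameters, which is exactly why the locality approximation and convex relaxation steps summarized later are needed.

```python
import numpy as np
import cvxpy as cp

# Toy illustration only (not the HMM formulation in [13], [19], [36], [38]):
# for linear discriminant functions, "maximize the minimum margin" can be
# written directly as a convex problem, here a small SOCP solved with the
# (assumed) cvxpy package.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 1.0, (20, 2)),   # class -1 samples
               rng.normal(+1.5, 1.0, (20, 2))])  # class +1 samples
y = np.hstack([-np.ones(20), np.ones(20)])

w = cp.Variable(2)    # linear discriminant parameters
b = cp.Variable()
rho = cp.Variable()   # minimum margin to be maximized

constraints = [cp.multiply(y, X @ w + b) >= rho,  # every sample's margin >= rho
               cp.norm(w, 2) <= 1]                # normalization (second-order cone)
prob = cp.Problem(cp.Maximize(rho), constraints)
prob.solve()
print("minimum margin:", rho.value)
```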
GENERATIVE VERSUS
DISCRIMINATIVE LEARNING
Pattern classification is the task of classifying an unknown object into one of a set of a priori known pattern classes based on some noisy observation data (also known as features). The generative learning scheme originates from the Bayes decision rule (also known as the maximum a posteriori decision rule), which is guaranteed to achieve the minimum classification error rate provided the underlying data distributions are known. For an unknown object, we assume its class label, $S$, and its observation data, $X$, are both random variables. If the true joint probability distribution of $X$ and $S$ is known, i.e., $p(X, S) = p(S)\cdot p(X|S)$, the optimal pattern classifier can be built based on the MAP (maximum a posteriori) decision rule. In other words, the class label of an unknown observation $X$ is predicted as follows:
$$\hat{S} = \arg\max_{S}\, p(S|X) = \arg\max_{S}\, p(S)\cdot p(X|S). \qquad (1)$$
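As a minimal sketch of the MAP rule in (1) (with illustrative, assumed inputs rather than anything from the article), the predicted label is simply the class maximizing the log prior plus the log class-conditional likelihood:

```python
import numpy as np

# Minimal sketch of the MAP rule in (1): pick the class S that maximizes
# p(S) * p(X|S).  `priors` and `likelihoods` are illustrative inputs,
# not part of the original article.
def map_classify(x, priors, likelihoods):
    """priors[s] = p(S=s); likelihoods[s] is a callable returning p(x|S=s)."""
    scores = [np.log(priors[s]) + np.log(likelihoods[s](x))
              for s in range(len(priors))]
    return int(np.argmax(scores))   # predicted class label S_hat
```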
As indicated in (1), constructing the optimal pattern classifier requires computing two probability distributions, namely the prior probability $p(S)$ and the conditional probability $p(X|S)$, which are usually unknown in practice. The essential problem
in generative learning is how to estimate them from some
available training data. To simplify the estimation problem,
we adopt the so-called parametric modeling approach, where
it is assumed that the unknown probability distributions
belong to a computationally tractable function family, such as
the exponential family [4]. Furthermore, some latent vari-
ables may be introduced to derive even more complex models.
A rich family of multimodal distributions can be obtained by
introducing discrete latent variables. In this way, the difficult
density estimation problem turns into a more tractable
parameter estimation problem. One major benefit of genera-
tive learning is that there exist efficient learning algorithms
that can estimate rather complicated models based on the
maximum likelihood (ML) criterion, e.g., the expectation-
maximization (EM) algorithm [6].
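For instance, if a single Gaussian is assumed for each class-conditional distribution $p(X|S)$, ML estimation reduces to the familiar closed-form sample statistics; the following sketch (an assumed illustration, not code from the article) makes this concrete:

```python
import numpy as np

# Minimal sketch (not from the article): with a Gaussian assumed for p(X|S),
# the ML estimates of the class-conditional parameters have the familiar
# closed form (sample mean and covariance per class).
def ml_gaussian_estimates(X, y, num_classes):
    """X: (N, d) feature array, y: (N,) integer class labels (numpy arrays)."""
    params = []
    for s in range(num_classes):
        Xs = X[y == s]
        mu = Xs.mean(axis=0)                       # ML mean
        cov = np.cov(Xs, rowvar=False, bias=True)  # ML (biased) covariance
        prior = len(Xs) / len(X)                   # ML class prior
        params.append((prior, mu, cov))
    return params
```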
Discriminative learning uses a mapping function (from
observation X to class label S) to model class boundaries
without forming a data distribution in the entire feature space.
The mapping function is estimated using criteria that are
more relevant to the ultimate classification and regression
task, such as maximum conditional likelihood (MCL) estima-
tion [11], empirical risk minimization (ERM), or LME [41].
Traditionally, only some relatively simple models are consid-
ered in the discriminative learning scheme due to computa-
tional complexity issues, e.g., linear discriminant functions
for support vector machines (SVMs). Other discriminative
models include logistic regression and neural networks. It is
an interesting topic to extend the discriminative learning
scheme to other more complicated models, particularly mix-
ture models and latent graphi-
cal models widely adopted in
the generat ive l earn ing
scheme. Recent work in both
machine learning [1], [9], [10],
[12], [32] and speech recogni-
tion [2], [18], [35] has shown
the benefit of learning genera-
tive models discriminatively.
Discriminative learning of these generative models imposes a
computational challenge since it normally leads to a compli-
cated nonconvex optimization problem. A significant amount
of research effort has been devoted to this problem. In this
article, we introduce one method proposed for discriminative
learning of a wide class of generative models, i.e., mixtures of
exponential family (e-family), and particularly underline the
role of convex optimization in this method by using convex
relaxation techniques.
In the next section, we first introduce notations for a general
canonical form of the e-family and its mixtures since many pop-
ular generative models fall into this category. Following that, we
will present a general framework to learn mixtures of exponen-
tial family models in a discriminative way.
STATISTICAL MODELS: THE E-FAMILY AND ITS MIXTURE
In practice, we normally choose generative models from a
rather general family of statistical models, namely the e-fam-
ily, due to its high computational tractability in parame-
ter estimation.
THE E-FAMILY
As in [4], all statistical models in the e-family can be represented in the following general canonical form:
$$p(X|\boldsymbol{\lambda}) = \exp\{A(\mathbf{x}) + \mathbf{x}^T \boldsymbol{\lambda} - K(\boldsymbol{\lambda})\},$$
where the bold $\boldsymbol{\lambda}$ denotes the natural parameter vector of the e-family distribution, and the bold $\mathbf{x}$ is called the sufficient statistics. In most cases, the natural parameters $\boldsymbol{\lambda}$ may take a different form from the original model parameters $\lambda$, and the sufficient statistics $\mathbf{x}$ can be represented as a function of the original data $X$. Furthermore, $K(\boldsymbol{\lambda})$ is called the cumulant generating function, which is a convex function of the natural parameters $\boldsymbol{\lambda}$ and is independent of $X$. $A(\mathbf{x})$ is a function of the sufficient statistics $\mathbf{x}$ and is independent of $\boldsymbol{\lambda}$.
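In this canonical form, the log-density is simply $A(\mathbf{x}) + \mathbf{x}^T\boldsymbol{\lambda} - K(\boldsymbol{\lambda})$, so a generic evaluation routine only needs the per-distribution ingredients. The following minimal sketch (an illustration assumed here, not code from the article) makes the point:

```python
import numpy as np

# Minimal sketch: evaluate an e-family log-density from its canonical
# ingredients.  A and K are callables supplied per distribution; x_stats and
# natural_params are the sufficient statistics and natural parameter vectors.
def efamily_logpdf(x_stats, natural_params, A, K):
    return A(x_stats) + x_stats @ natural_params - K(natural_params)
```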
Many common probability distributions belong to the e-family, including the Gaussian, multinomial, Bernoulli, exponential, gamma, and Poisson distributions. For example, the one-dimensional Gaussian (with unknown mean and variance) can be represented in the canonical e-family form as
$$\mathcal{N}(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} = \exp\{A(\mathbf{x}) + \mathbf{x}^T \boldsymbol{\lambda} - K(\boldsymbol{\lambda})\},$$
where the sufficient statistics $\mathbf{x} = [x, -x^2/2]$, the natural parameters $\boldsymbol{\lambda} = [\mu/\sigma^2, 1/\sigma^2]$, $A(\mathbf{x}) = -(1/2)\ln 2\pi$, and the
cumulant generating function $K(\boldsymbol{\lambda}) = -(1/2)\ln\lambda_2 + \lambda_1^2/(2\lambda_2)$ (with the constraint $\lambda_2 > 0$), which is clearly a convex function of $\boldsymbol{\lambda}$. We can represent all other common probability distributions in a similar way. In Table 1, we summarize the major results for the multivariate Gaussian and multinomial distributions since they are the most relevant to the remainder of this article.
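The canonical form can be checked numerically: plugging the sufficient statistics, natural parameters, $A(\mathbf{x})$, and $K(\boldsymbol{\lambda})$ above into $\exp\{A(\mathbf{x}) + \mathbf{x}^T\boldsymbol{\lambda} - K(\boldsymbol{\lambda})\}$ reproduces the usual Gaussian density (a small sketch assuming NumPy and SciPy are available):

```python
import numpy as np
from scipy.stats import norm

# Numerical check of the one-dimensional Gaussian example above.
mu, sigma, x = 0.7, 1.3, 2.0
x_stats = np.array([x, -x**2 / 2.0])                  # sufficient statistics
lam = np.array([mu / sigma**2, 1.0 / sigma**2])       # natural parameters
A = -0.5 * np.log(2 * np.pi)                          # A(x)
K = -0.5 * np.log(lam[1]) + lam[0]**2 / (2 * lam[1])  # cumulant generating function

canonical = np.exp(A + x_stats @ lam - K)
print(np.isclose(canonical, norm.pdf(x, mu, sigma)))  # True
```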
One important property of the e-family is that products of
e-family distributions remain in the e-family. The natural
parameters of the product distribution can be easily con-
structed by concatenating natural parameters of all member
distributions.
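Concretely, if two independent variables each follow an e-family distribution, the joint log-density is the sum of the two canonical exponents, so concatenating the sufficient statistics and natural parameters (and summing the $A$ and $K$ terms) keeps the product in the canonical form. A tiny numerical sketch of the key identity, with toy values assumed for illustration:

```python
import numpy as np

# Sketch: the product of independent e-family members stays in the e-family;
# its sufficient statistics and natural parameters are the concatenations of
# the members' (while the A and K terms simply add up).  Toy arrays only.
x1, lam1 = np.array([2.0, -2.0]), np.array([0.4, 0.6])
x2, lam2 = np.array([1.0]), np.array([-0.3])

x_joint = np.concatenate([x1, x2])
lam_joint = np.concatenate([lam1, lam2])
# x1 @ lam1 + x2 @ lam2 == x_joint @ lam_joint, so the joint keeps the canonical form
print(np.isclose(x1 @ lam1 + x2 @ lam2, x_joint @ lam_joint))  # True
```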
MIXTURES OF THE E-FAMILY
All e-family distributions are computationally tractable in
parameter estimation but they are normally too simple to
model real-world data appropriately. A widely adopted strategy
is to introduce latent variables to derive more complicated
models from these simple e-family distributions. A rather rich
family of multimodal distributions can be obtained by intro-
ducing discrete latent variables to derive the so-called finite
mixture models.
Mixtures of the e-family are a generalization of the above
simple e-family distribution by considering a convex combina-
tion of different e-family distributions with different parameters.
In a very general form, mixtures of the e-family can be represented as
$$p(X|\boldsymbol{\lambda}) = \sum_k w_k \cdot \exp\{A_k(\mathbf{x}_k) + \boldsymbol{\lambda}^T \mathbf{x}_k - K_k(\boldsymbol{\lambda})\},$$
where $\boldsymbol{\lambda}$ denotes the natural parameters of the mixture model, concatenated from the natural parameters of its component distributions; the sufficient statistics $\mathbf{x}_k$ and $K_k(\boldsymbol{\lambda})$ may vary across mixture components; and $w_k$ stands for the mixture weight, which satisfies the sum-to-one constraint $\sum_k w_k = 1$.
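A minimal sketch of evaluating this mixture density is given below (with illustrative inputs; the log-sum-exp trick is a standard numerical device, not something prescribed by the article):

```python
import numpy as np
from scipy.special import logsumexp

# Minimal sketch of the mixture form above: each component contributes
# w_k * exp{A_k(x_k) + lambda^T x_k - K_k(lambda)}; the sum is computed in
# the log domain for numerical stability.  Inputs are illustrative.
def efamily_mixture_logpdf(weights, component_log_densities):
    """weights: (K,) mixture weights summing to one;
    component_log_densities: (K,) values of A_k(x_k) + lambda^T x_k - K_k(lambda)."""
    return logsumexp(np.log(weights) + np.asarray(component_log_densities))
```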
Mixtures of the e-family
represent a wide class of sta-
tistical models and are fre-
quently used in machine
learning and pattern classification. Mixtures of the e-fami-
ly include the Gaussian mixture model (GMM), the multi-
nomial mixture model (MMM), and the HMM. In the HMM,
the latent variable is a complete state sequence, $\mathbf{x}_k$ represents the sufficient statistics collected along a state sequence, and each mixture component is a product of various e-family distributions.
In the generative learning scheme, statistical models are
normally estimated from available training data based on the
maximum likelihood (ML) criterion. Due to the convexity of
the cumulant generating function $K(\boldsymbol{\lambda})$, ML estimation of an e-family model normally results in a simple solution. In many cases, ML estimation of e-family models even has a closed-form solution. Furthermore, ML estimation of
mixtures of the e-family can also be iteratively derived in a rel-
atively easy manner, relying on the well-known EM algo-
rithm [6].
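As an illustration of the last point, the following compact sketch runs EM for a one-dimensional GMM, one of the simplest mixtures of the e-family; it is an assumed toy implementation rather than the training code used in the referenced systems:

```python
import numpy as np
from scipy.stats import norm

# Toy EM for a one-dimensional Gaussian mixture model (a mixture of e-family
# members), illustrating iterative ML estimation; not production code.
def em_gmm_1d(x, K=2, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    w = np.full(K, 1.0 / K)                # mixture weights
    mu = rng.choice(x, K, replace=False)   # initial means picked from the data
    var = np.full(K, x.var())              # initial variances
    for _ in range(iters):
        # E-step: gamma[k, t] = posterior responsibility of component k for sample t
        dens = np.stack([w[k] * norm.pdf(x, mu[k], np.sqrt(var[k])) for k in range(K)])
        gamma = dens / dens.sum(axis=0, keepdims=True)
        # M-step: weighted ML re-estimates of the component parameters
        Nk = gamma.sum(axis=1)
        w = Nk / x.size
        mu = (gamma @ x) / Nk
        var = (gamma * (x[None, :] - mu[:, None]) ** 2).sum(axis=1) / Nk
    return w, mu, var
```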
A NEW FRAMEWORK FOR DT OF GENERATIVE MODELS
In this section, we consider the use of alternative discrimina-
tive learning approaches to estimate mixtures of e-family mod-
els. Unlike the conventional ML estimation, discriminative
learning of these generative models faces significant computa-
tional challenges, no longer enjoying simple updating rules
and relatively fast convergence of the EM algorithm. Generally
speaking, discriminative learning of generative models is an
optimization problem, where an objective function must be
optimized in an iterative manner. The discriminative objective
function is normally formulated according to some popular
discriminative criteria, such as maximum mutual information
[TABLE 1] CANONICAL FORM OF E-FAMILY DISTRIBUTIONS: A) MULTIVARIATE GAUSSIAN WITH KNOWN PRECISION MATRIX;
B) MULTIVARIATE GAUSSIAN WITH UNKNOWN MEAN AND PRECISION MATRIX; C) MULTINOMIAL IN A CONSTRAINED FORM;
D) MULTINOMIAL IN AN UNCONSTRAINED FORM.

A) Gaussian $\mathcal{N}(\mu, \Sigma_0)$ (mean):
   $\boldsymbol{\lambda} = \mu$;  $\mathbf{x} = \Sigma_0^{-1} x$;  $K(\boldsymbol{\lambda}) = \frac{1}{2}\boldsymbol{\lambda}^T \Sigma_0^{-1} \boldsymbol{\lambda}$;  no constraint.

B) Gaussian $\mathcal{N}(\mu, \Sigma)$ (mean, covariance):
   $\boldsymbol{\lambda} = [\Sigma^{-1}\mu, \Sigma^{-1}]$;  $\mathbf{x} = [x, -\frac{1}{2} x x^T]$;  $K(\boldsymbol{\lambda}) = -\frac{1}{2}\ln|\boldsymbol{\lambda}_2| + \frac{1}{2}\boldsymbol{\lambda}_1^T \boldsymbol{\lambda}_2^{-1} \boldsymbol{\lambda}_1$;  constraint $\boldsymbol{\lambda}_2 \succ 0$.

C) Multinomial (constrained) $C \prod_{d=1}^{D} \mu_d^{x_d}$:
   $\boldsymbol{\lambda} = [\ln\mu_1, \ldots, \ln\mu_D]$;  $\mathbf{x} = [x_1, \ldots, x_D]$;  $K(\boldsymbol{\lambda}) = 0$;  constraints $\boldsymbol{\lambda} < 0$ and $\sum_{d=1}^{D} e^{\lambda_d} = 1$.

D) Multinomial (unconstrained) $C \prod_{d=1}^{D} \mu_d^{x_d}$:
   $\boldsymbol{\lambda} = \big[\ln\tfrac{\mu_1}{1-\sum_{d=1}^{D-1}\mu_d}, \ldots, \ln\tfrac{\mu_{D-1}}{1-\sum_{d=1}^{D-1}\mu_d}\big]$;  $\mathbf{x} = \big[\tfrac{x_1}{\sum_{d=1}^{D} x_d}, \ldots, \tfrac{x_{D-1}}{\sum_{d=1}^{D} x_d}\big]$;  $K(\boldsymbol{\lambda}) = \ln\big(1 + \sum_{d=1}^{D-1} e^{\lambda_d}\big)$;  no constraint.
(MMI) [2], [35] (maximum
conditional likelihood [11]),
MCE [18], and LME [13], [31].
Historically, these discriminative criteria have been proposed in different contexts. In the following, we first present a unified view of these discriminative criteria, centering on the concept of margin. Following that, we discuss a general method to optimize these discriminative objective functions using locality approximation and convex optimization, referred to as the AM method.
A UNIFIED VIEW OF VARIOUS DT CRITERIA
In supervised learning, given a set of training samples, denoted as $\mathcal{D} = \{X_1, X_2, \ldots, X_T\}$, we usually know the true class labels for all training samples in $\mathcal{D}$, denoted as $\mathcal{L} = \{S_1, S_2, \ldots, S_T\}$. For notational convenience, we use the uppercase letter $S_t$ to represent the true transcription of each training sample $X_t$, and use the lowercase $s_t$ to denote a variable that may take all possible labels in a hypothesis space.
Following [5] and [34], a multiclass separation margin for each training sample $X_t$ is defined as follows:
$$d(X_t|\Lambda) = \ln p(S_t, X_t|\Lambda) - \max_{s_t \in \mathcal{M}_t} \ln p(s_t, X_t|\Lambda), \qquad (2)$$
where $\Lambda$ denotes the whole set of model parameters from all classes and the max is taken over a particular hypothesis space, denoted as $\mathcal{M}_t$.
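A small sketch of computing the margin in (2) for one training sample is given below, assuming a callable that returns the log joint score $\ln p(s, X_t|\Lambda)$ under the current model; taking the hypothesis space to exclude the true label (competing hypotheses only) is an assumption of this sketch, not a requirement stated above:

```python
# Minimal sketch of the multiclass separation margin in (2): the log-joint
# score of the correct label minus the best competing score over the
# hypothesis space M_t.  `log_joint` is an illustrative callable.
def separation_margin(x_t, true_label, hypothesis_space, log_joint):
    """log_joint(s, x) should return ln p(s, x | Lambda) under the current model."""
    correct = log_joint(true_label, x_t)
    # Assumption: M_t contains competing hypotheses only (true label excluded).
    competing = max(log_joint(s, x_t) for s in hypothesis_space if s != true_label)
    return correct - competing
```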