Segmental Minimum Bayes-Risk Decoding for Automatic Speech Recognition

ont color=blue>the Internet Archive. Yahoo! is not affiliated with the authors of this page or responsible for its content.
Segmental Minimum Bayes-Risk Decoding for Automatic Speech Recognition
234
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 12, NO. 3, MAY 2004
Segmental Minimum Bayes-Risk Decoding for
Automatic Speech Recognition
Vaibhava Goel, Shankar Kumar, and William Byrne, Member, IEEE
AbstractMinimum Bayes-Risk (MBR) speech recognizers
have been shown to yield improvements over the conventional
maximum a-posteriori probability (MAP) decoders through
N-best list rescoring and
search over word lattices. We present
a Segmental Minimum Bayes-Risk decoding (SMBR) framework
that simplifies the implementation of MBR recognizers through
the segmentation of the
N-best lists or lattices over which the
recognition is to be performed. This paper presents lattice cutting
procedures that underly SMBR decoding. Two of these procedures
are based on a risk minimization criterion while a third one
is guided by word-level confidence scores. In conjunction with
SMBR decoding, these lattice segmentation procedures give
consistent improvements in recognition word error rate (WER)
on the Switchboard corpus. We also discuss an application of
risk-based lattice cutting to multiple-system SMBR decoding and
show that it is related to other system combination techniques
such as ROVER. This strategy combines lattices produced from
multiple ASR systems and is found to give WER improvements in
a Switchboard evaluation system.
Index TermsASR system combination, extended-ROVER, lat-
tice cutting, minimum Bayes-risk decoding, segmental minimum
Bayes-risk decoding.
I. I
NTRODUCTION
I
N ASR, an acoustic observation sequence
is to be mapped to a word string
, where
are words belonging to a vocabulary
.
We assume that a language
is known; it is a subset of the
set of all word strings over
. This language specifies the word
strings that could have produced any acoustic data seen by the
ASR system. We further assume that the ASR classifier selects
its hypothesis from a set
of word strings. This set, called the
hypothesis space of the classifier, would usually be a subset of
the language. The ASR classifier can then be described as the
functional mapping
.
Let
be a real valued loss function that describes the
cost incurred when an utterance
belonging to language
is mistranscribed as
. An example loss function, the
one that we focus on in this paper, is Levenshtein distance [1]
that measures the minimum string edit distance (word error rate)
between
and
. This loss function is defined as the min-
Manuscript received February 8, 2003; revised August 4, 2003. This work was
supported by the National Science Foundation under Grants IIS-9810517 and
IIS-9820687. The associate editor coordinating the review of this manuscript
and approving it for publication was Dr. Jerome R. Bellegarda.
V. Goel is with the IBM T. J. Watson Research Center, Yorktown Heights,
NY 10598 USA (e-mail: vgoel@us.ibm.com).
S. Kumar and W. Byrne are with the Center for Language and Speech
Processing, Department of Electrical and Computer Engineering, The Johns
Hopkins University, Baltimore, MD 21218 USA (e-mail: skumar@jhu.edu;
byrne@jhu.edu).
Digital Object Identifier 10.1109/TSA.2004.825678
imum number of substitutions, insertions and deletions needed
to transform one word string into another.
Suppose the true distribution
of speech and lan-
guage is known. It would then be possible to measure the per-
formance of a classifier
as
(1)
This is the expected loss when
is used as the classifica-
tion rule for data generated under
. Given a loss func-
tion and a distribution, the classification rule that minimizes
is given by [2]
(2)
We note that while the sum in (2) is carried out over the entire
language of the recognizer, only those word strings with non
zero conditional probability
contribute to the sum. Let
denote the subset of
such that
(3)
Equation (2) can now be re-written as
(4)
We shall refer to the sum
in (4)
as conditional risk and classifier given by this equation as the
Minimum Bayes-Risk (MBR) classifier.
The set
serves as the evidence for the MBR classifier
using which it selects the hypothesis. Therefore, we shall refer
to
as the evidence space for the acoustic observations
. The
distribution
describes the evidence space and shall be
referred to as the evidence distribution.
Our treatment so far assumes that the true distribution over
the evidence is available, however this is not the case in practice.
This distribution is obtained by applying Bayes rule
(5)
where the component distributions are approximated by models.
As is commonly done,
is approximated as the language
model and
is obtained from a hidden Markov model
acoustic likelihoods.
II. S
EGMENTAL
M
INIMUM
B
AYES
-R
ISK
D
ECODING
For most practical ASR tasks, the spaces
and
are
large and the minimum Bayes-risk recognizer of (4) faces com-
putational problems. Previous work has focused on efficient
1063-6676/04$20.00 © 2004 IEEE GOEL et al.: SEGMENTAL MINIMUM BAYES-RISK DECODING FOR AUTOMATIC SPEECH RECOGNITION
235
search procedures to implement (4). Here we discuss an al-
ternate set of strategies that segment hypothesis and evidence
spaces of the MBR recognizer. The segmentation transforms the
original search problem into a series of search problems, which
due to their smaller sizes, can be more easily solved. These
strategies are collectively referred to as segmental MBR recog-
nition [3].
For rigor, we introduce a segmentation rule
which di-
vides strings in the language
into
segments of zero or more
words each. We denote the
segment of
as
. In this
way, we impose a segmentation of the space
into segment
sets
, where
, when applied to
, generates
evidence segment sets
. We now define the marginal probability
of any word string
(6)
The application of the segmentation rule to the hypothesis
space yields hypothesis segment sets
. The concatena-
tion of the sets
yields a search space
that is the cross-product of the hypothesis segment sets
. Concatenating the sets may introduce
new hypotheses since suffixes can be appended to prefixes in
ways that were not permitted in the original space. However
no hypotheses are lost through the concatenation. It is our
goal to search over this larger space and, by considering more
hypotheses, possibly achieve improved performance.
We now discuss the inclusion of the segmentation rule into
MBR decoding. We begin by making the strong assumption that
the loss function between any pair of evidence and hypothesis
strings
,
distributes over the segmentation,
i.e.,
(7)
Under this assumption, we can now state the following propo-
sition [4].
Proposition: An utterance level MBR recognizer given by
(8)
can be implemented as a concatenation
(9)
where
(10)
This proposition defines the Segmental MBR (SMBR) decoder.
Equation (10) follows by substituting (7) into (8).
A special case of segmental MBR recognition is particularly
useful in practice. It arises when the strings in the hypothesis
and evidence segment sets are restricted to length one or zero,
i.e., individual words or the NULL word. We also assume that
there is a 0/1 loss function on the segment sets
if
otherwise.
(11)
Under these conditions the segmental MBR recognizer of (10)
become
(12)
Equation (12) is the maximum a-posteriori decision over each
hypothesis segment set. In each segment set the posterior prob-
ability of all the words are first computed based on the evidence
space, and the word with the highest posterior probability is
selected. We call this procedure segmental MBR voting. This
simplification has been utilized in several recently developed
-best list and lattice based hypothesis selection procedures to
improve the recognition word error rate [5][7].
This summarizes the relationship between SMBR decoding,
MAP decoding and segmental MBR voting. From (8), if
no lattice cutting had been done, MBR decoding under the
0/1 loss function would lead to the standard MAP rule:
. Introducing hypothesis space
segmentation transforms the standard MAP rule to segmental
MBR voting as in (12).
For a given loss function, evidence space and hypothesis
space, it may not be possible to find a segmentation rule such
that (7) is satisfied for any pair of hypothesis and evidence
strings. However, given any segmentation rule, we can specify
an associated induced loss function defined as
(13)
From the discussion of Proposition 1, we see that the segmental
MBR recognizer is equivalent to an utterance level MBR recog-
nizer under the loss function
. Therefore, the overall perfor-
mance of the SMBR recognizer under a desired loss function
will depend on how well
approximates .
For ASR, we are particularly interested in the Levenshtein
loss function. Here, a segmentation of the hypothesis and
evidence spaces will rule out some string alignments between
word sequences. Therefore, under a given segmentation rule,
the alignments permitted between any two word strings from
and
might not include the optimal alignment needed
to achieve the Levenshtein distance. Therefore, the choice of
a given segmentation involves a trade-off between two types
of errors: search errors from MBR decoding on large segment
sets and the errors in approximating the loss function due to
the segmentation.