THE UNIVERSITY OF CHICAGO
Analysis and Automatic Recognition of Tones in Mandarin Chinese
A DISSERTATION SUBMITTED TO
THE FACULTY OF THE DIVISION OF THE PHYSICAL SCIENCES
IN CANDIDACY FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
DEPARTMENT OF COMPUTER SCIENCE
BY
DINOJ SURENDRAN
CHICAGO, ILLINOIS
SEPTEMBER 27, 2007

TABLE OF CONTENTS
1 INTRODUCTION
    1.1 Syllables in Mandarin Chinese
    1.2 Contributions of this Thesis
        1.2.1 How important is Tone in Mandarin?
        1.2.2 What are good basic features based on Duration, Pitch, and Intensity?
        1.2.3 Can Voice Quality help Tone Recognition?
        1.2.4 How useful is Context?
        1.2.5 Are Strong Syllables easier to recognize?
2 QUANTIFYING THE IMPORTANCE OF RECOGNIZING TONES
    2.1 The Simplest Definition of Functional Load
    2.2 Functional Load of Mandarin Tones (I)
    2.3 Generalized Functional Load calculations
    2.4 Interpretation of Functional Load Computations
    2.5 Functional Load of Mandarin Tones (II)
    2.6 Functional Load Versus Perceptual Ease
    2.7 Conclusions
3 LOCAL FEATURES BASED ON DURATION, PITCH, AND INTENSITY
    3.1 Evaluating Classification Performance
    3.2 Speaker Normalization
    3.3 Features based on Duration
    3.4 Features Based on Pitch
    3.5 Features based on Overall Intensity
    3.6 Combining the Duration, Pitch, and Intensity Features
    3.7 Conclusions
4 CONTOUR HEIGHT ADJUSTMENT
    4.1 Pitch Height Adjustment
    4.2 Intensity Adjustments
    4.3 Conclusions
5 VOICE QUALITY MEASURES FOR MANDARIN TONE RECOGNITION
    5.1 Measures of Voice Quality Considered
        5.1.1 Glottal Flow Estimation
        5.1.2 Harmonic-Formant Differences
        5.1.3 Spectral Center of Gravity
        5.1.4 Spectral Tilt
        5.1.5 Band Energy
    5.2 Classification Task
    5.3 Results
    5.4 Band Energy Features
    5.5 Subsets of Band Energy Features
    5.6 Conclusions
6 COARTICULATION
    6.1 Using different classifiers for different contexts
    6.2 Adding True Labels of Neighboring Syllables as Features
    6.3 Adding Predicted Probabilities of Labels as Features
    6.4 Conclusions
7 STRENGTH
    7.1 Predicting Focus in Lab Speech
    7.2 Predicting Strength in Broadcast Speech
    7.3 Conclusions
8 CONCLUSIONS

CHAPTER 1
INTRODUCTION
All human languages use sequences of words to convey information. In languages
like English, Dutch and most Indo-European languages, words are just sequences of
discrete units called phonemes. However, as Yip (2002) points out, most languages
in the world are tonal, which means that their words are also defined by intonational
patterns based on the pitch (rate of vocal fold vibration) with which words are said.
Each pattern, or tone, is associated with a unit, such as a syllable, word, or morpheme.
Speakers of non-tonal languages learning a tonal language have been observed to have
activity in previously unused parts of their cortex (Wang et al. (2003)).
This thesis is an investigation of tones in Mandarin Chinese. Chapter 2 tackles the
question of how important it is to recognize tones, while the remaining chapters focus
on the automatic recognition of tones.
Figure 1.1 Averaged pitch contours for four citation-speech utterances of ma. From
Xu (1997).
1.1 Syllables in Mandarin Chinese
Each syllable in Mandarin has one of five tones:
1. High Tone. Also called High-Level since the pitch stays fairly constant.
2. Rising Tone.
3. Low Tone. Also called Low-Rising, since the pitch tends to start off low and
then increase.
4. Falling Tone.
5. Neutral Tone. This is, to some extent, a "none of the above" category. All
syllables with neutral tone are unstressed.
The distribution of these tones is far from uniform. Falling tone is the most common,
carried by about a third of all syllables, while only around six percent of syllables have
neutral tone. Table 1.1 gives the distribution based on forty thousand syllables from
the Mandarin Voice of America TDT 2 corpus (Wayne (2000)).
Table 1.1 Distribution of the five tones in Mandarin test data (40 798 syllables) from
news broadcasts in the Mandarin Voice of America TDT 2 corpus. About a third of
all syllables have Falling Tone.

High    Rising    Low     Falling    Neutral
0.23    0.24      0.14    0.33       0.06
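As a rough illustration of what this skew implies for recognition, the sketch below computes the accuracy of a trivial classifier that always guesses the most frequent tone. The proportions are copied from Table 1.1; the function name and the framing as a "majority baseline" are illustrative assumptions, not part of the thesis.

```python
# Tone proportions copied from Table 1.1 (test data, 40 798 syllables).
TONE_DISTRIBUTION = {
    "High": 0.23,
    "Rising": 0.24,
    "Low": 0.14,
    "Falling": 0.33,
    "Neutral": 0.06,
}


def majority_baseline(distribution):
    """Return (label, accuracy) for a classifier that always guesses the most frequent label."""
    label = max(distribution, key=distribution.get)
    return label, distribution[label]


# Sanity check: the rounded proportions should add up to (about) one.
total = sum(TONE_DISTRIBUTION.values())
label, accuracy = majority_baseline(TONE_DISTRIBUTION)
print(f"proportions sum to {total:.2f}")
print(f"always guessing {label} tone is right about {accuracy:.0%} of the time")
```

Any tone recognizer evaluated on this data should therefore be compared against roughly one correct guess in three, not one in five.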
We shall write each Mandarin syllable using the form PPPT, where PPP is its phonemic
component and T, a number from 1 to 5, is its tonal component. For example, the
monosyllabic word ma1 ("mother") is ma said with a high tone, while ma2 ("hemp") is
ma said with a rising tone, ma3 ("horse") is ma said with a low tone, and ma4
("scold, curse") is ma said with a falling tone.
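The PPPT notation is easy to split mechanically. The following is a minimal sketch, assuming the phonemic part is written in lowercase ASCII letters; the names `parse_syllable` and `TONE_NAMES` are hypothetical helpers introduced only for this example.

```python
import re

# Tone numbers as listed in Section 1.1.
TONE_NAMES = {1: "High", 2: "Rising", 3: "Low", 4: "Falling", 5: "Neutral"}

# Assumption: PPP is one or more lowercase ASCII letters, T is a digit from 1 to 5.
_SYLLABLE = re.compile(r"([a-z]+)([1-5])")


def parse_syllable(text):
    """Split a PPPT string such as 'ma3' into (phonemic part, tone number)."""
    match = _SYLLABLE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a PPPT syllable: {text!r}")
    return match.group(1), int(match.group(2))


for word in ["ma1", "ma2", "ma3", "ma4"]:
    phonemes, tone = parse_syllable(word)
    print(word, "->", phonemes, TONE_NAMES[tone])
```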
Figure 1.1 from Xu (1997) shows stereotypical shapes of these tones based on their
average pitch contour over several speakers and utterances. In practice, pitch contours
rarely achieve these idealized shapes. There are several reasons for this, some of which
we outline here.
- Anticipatory Coarticulation. The pitch contour of a syllable is affected by that
of the syllable after it.
- Carryover Coarticulation. The pitch contour of a syllable is affected by that of
the syllable before it.
- Syllable Strength. Some syllables are said more clearly than others.
- Phonology. Certain sequences of tones do not occur. The most famous example
is third tone sandhi, where a low tone is converted to a rising tone if it is followed
by another low tone, i.e. two low tones do not occur in succession.
- Phrase level effects such as Declination, where the average pitch steadily
decreases as the utterance progresses.
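Of the effects listed above, third tone sandhi is the one that can be stated as a simple rewrite rule on tone numbers. The sketch below applies the rule exactly as worded here: a low tone (3) becomes a rising tone (2) when the following underlying tone is also low. It is a deliberate simplification: how runs of three or more low tones are realized can depend on phrasing, which this sketch ignores. The function name is hypothetical.

```python
def apply_third_tone_sandhi(tones):
    """Return surface tones for a list of underlying tone numbers (1 to 5).

    A low tone (3) is rewritten as a rising tone (2) whenever the next
    underlying tone is also low. Simplifying assumption: every low tone in a
    run except the last one changes, regardless of phrasing.
    """
    surface = list(tones)
    for i in range(len(tones) - 1):
        if tones[i] == 3 and tones[i + 1] == 3:
            surface[i] = 2
    return surface


print(apply_third_tone_sandhi([3, 3]))        # two low tones in succession
print(apply_third_tone_sandhi([1, 3, 3, 4]))  # the same pair inside a longer phrase
print(apply_third_tone_sandhi([3, 4, 3]))     # no adjacent low tones, so no change
```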
Most Mandarin syllables are of the form [C]V[N] or [C]VV[N], where C = consonant,
V = vowel, N = nasal, and square brackets denote optionality (Chao (1968)). (The
exceptions include, for example, degenerate syllables of the form N.) The initial C, if
present, is called the syllable's onset. The rest of the syllable is called its rhyme.
We shall refer to a collection of syllables said in a single breath as a phrase.
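The syllable template and the onset/rhyme split can be sketched as a pattern over segment classes. This is only an illustration under stated assumptions: each segment has already been labeled by hand as C, V, or N (with N reserved for a final nasal, so an initial nasal such as m is labeled C), and the function name and example segmentations are hypothetical.

```python
import re

# [C]V[N] or [C]VV[N]: optional onset consonant, one or two vowels, optional final nasal.
_TEMPLATE = re.compile(r"(C?)(VV?N?)")


def split_onset_rhyme(phones, classes):
    """Split a syllable into (onset, rhyme) using its segment classes.

    phones  -- list of segment strings, e.g. ["m", "a"]
    classes -- string of 'C', 'V', 'N' labels, one per segment, e.g. "CV"
    Returns None for syllables outside the template (e.g. a bare nasal "N").
    """
    if len(phones) != len(classes):
        raise ValueError("need exactly one class label per segment")
    match = _TEMPLATE.fullmatch(classes)
    if match is None:
        return None
    n_onset = len(match.group(1))
    return phones[:n_onset], phones[n_onset:]


print(split_onset_rhyme(["m", "a"], "CV"))   # onset present
print(split_onset_rhyme(["a", "n"], "VN"))   # no onset, nasal-final rhyme
print(split_onset_rhyme(["n"], "N"))         # degenerate syllable, outside the template
```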
1.2 Contributions of this Thesis
Until recently, tone recognition methods were so poor that it was better to leave them
out of the speech recognition pipeline entirely. This has started to change. For example,
Lei et al. (2005) improved character-level classification accuracy from 64.3% to 66.8%
on multi-speaker telephone speech by adding the posteriors output by a separate tone
recognition module to the traditional MFCC feature vector used at the base of a
complete speech recognition system.
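To put the two accuracy figures in perspective, the short calculation below expresses the change both as an absolute gain and as a relative reduction in error rate. The figures are the ones quoted above; the choice of relative error reduction as a summary measure, and the function name, are assumptions made for this illustration.

```python
def error_reduction(acc_before, acc_after):
    """Return (absolute gain in points, relative error reduction) for accuracies given in percent."""
    absolute_gain = acc_after - acc_before
    error_before = 100.0 - acc_before
    return absolute_gain, absolute_gain / error_before


gain, relative = error_reduction(64.3, 66.8)
print(f"absolute gain: {gain:.1f} percentage points")
print(f"relative error reduction: {relative:.1%}")
```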
Most of this thesis focuses on ways that could improve such modules.
One of the primary ways in which this thesis differs from earlier work is the scope of
its experiments, which are run on a large dataset. Our primary dataset is a collection
of 1159 news stories from the Mandarin Voice of America (VOA) Topic Detection and
Tracking (TDT) 2 dataset of Wayne (2000). It has about ten hours of speech containing
over 160 000 syllables. To deal with such a large dataset, we implemented¹ a series of
classifiers based on the fast Conjugate Gradient Least Squares algorithm of Keerthi
and DeCoste (2005).
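For readers unfamiliar with conjugate gradient least squares, the following is a minimal sketch of the generic textbook iteration for regularized least squares, minimizing ||Xw - y||^2 + lam * ||w||^2. It is not the implementation used in this thesis, and it omits whatever surrounds this inner solver in the cited method; all names and the toy data are assumptions. The appeal for large datasets is that each iteration needs only one product with X and one with its transpose.

```python
def cgls(X, y, lam=0.0, max_iter=100, tol=1e-12):
    """Approximately minimize ||X w - y||^2 + lam * ||w||^2 by conjugate gradients.

    X is a list of rows, y a list of targets. Pure-Python sketch for clarity, not speed.
    """
    n_rows, n_cols = len(X), len(X[0])

    def matvec(v):   # X v
        return [sum(X[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]

    def rmatvec(u):  # X^T u
        return [sum(X[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]

    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    w = [0.0] * n_cols
    r = list(y)       # residual y - X w, starting from w = 0
    s = rmatvec(r)    # X^T r - lam * w: residual of the normal equations
    p = list(s)       # current search direction
    gamma = dot(s, s)
    for _ in range(max_iter):
        if gamma <= tol:
            break     # normal-equation residual is small enough
        q = matvec(p)
        alpha = gamma / (dot(q, q) + lam * dot(p, p))   # exact step length along p
        w = [wj + alpha * pj for wj, pj in zip(w, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        s = [sj - lam * wj for sj, wj in zip(rmatvec(r), w)]
        gamma_new = dot(s, s)
        p = [sj + (gamma_new / gamma) * pj for sj, pj in zip(s, p)]
        gamma = gamma_new
    return w


# Toy problem whose exact least-squares solution is w = [1, 2].
X_demo = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_demo = [1.0, 2.0, 3.0]
print(cgls(X_demo, y_demo))
```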
The contributions of this thesis, listed in order of importance, are:
1. Finding a set of new band energy features that improve tone recognition,
particularly of low and neutral tones. This was determined during the course of
testing about twenty types of voice quality measures. The recognition of these
two tones, particularly neutral tone,