Analysis and Synthesis of Pathological Voice Quality
Revised February, 2006
by Jody Kreiman, Bruce R. Gerratt, and Norma Antoñanzas-Barroso
Bureau of Glottal Affairs, Division of Head/Neck Surgery, UCLA School of Medicine, 31-24 Rehab Center, Los Angeles, CA 90095-1794
This research was supported by grant DC01797 from the National Institute on Deafness and Other Communication Disorders.
© 2001-2006 by The Regents of the University of California
Software © 2001-2006 The Regents of the University of California

The following terms apply to all files associated with the software unless explicitly disclaimed in individual files.

The authors hereby grant permission to use, copy, modify, distribute, and license this software and its documentation for any purpose, provided that existing copyright notices are retained in all copies and that this notice is included verbatim in any distributions. No written agreement, license, or royalty fee is required for any of the authorized uses. Modifications to this software may be copyrighted by their authors and need not follow the licensing terms described here, provided that the new terms are clearly indicated on the first page of each file where they apply.

IN NO EVENT SHALL THE AUTHORS OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OF THIS SOFTWARE, ITS DOCUMENTATION, OR ANY DERIVATIVES THEREOF, EVEN IF THE AUTHORS HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

THE AUTHORS AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT. THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, AND THE AUTHORS AND DISTRIBUTORS HAVE NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
Table of Contents

I. Introduction .... 5
    Organization of the Manual .... 5
    Technical Credits .... 5
II. Inverse Filtering .... 6
    Part 1: Background .... 6
        Introduction .... 6
        Recording Voice Samples for Inverse Filtering .... 6
        Estimating the Vocal Tract Filter .... 7
        Inverse Filtering Method .... 8
    Part 2: Step by Step Procedures .... 8
        Program Installation .... 8
        Inverse Filtering Procedure: Introduction .... 9
        Open a File .... 9
        Run the Inverse Filter .... 15
        Print and Save Files for Use in the Synthesizer .... 20
    Part 3: Other Features of the Inverse Filter .... 24
        Introduction .... 24
        File Menu .... 24
        Help Menu .... 24
        Display Menu .... 24
        Edit Menu .... 26
        Glottal Analysis Menu .... 26
III. Voice Synthesis Software
    Part 1: Introduction .... 28
        About Voice Synthesis and the UCLA Voice Synthesizer .... 28
        Issues in Source Modeling .... 28
        Modeling the Inharmonic Part of the Source .... 32
        Frequency and Amplitude Modulations (Tremor) .... 34
        Modeling the Effects of Source/Filter Interactions .... 35
        The Synthesis Process .... 35
    Part 2: Step-by-Step Synthesis Procedures
        Program Installation .... 36
        The Synthesizer Interface .... 37
        Step 1: Open a File .... 38
        Step 2: Initialize Variables .... 38
        Step 3: Fit an LF Model to the Inverse Filtered Source Pulses .... 39
        Step 4: Track F0 .... 41
        Step 5: Model Frequency and Amplitude Modulations .... 41
        Step 6: Model the Inharmonic Part of the Source (Noise Excitation) .... 43
        Step 7: Synthesize the Voice .... 43
    Part 3: Making Changes to the Synthetic Voices
        Introductory Remarks .... 45
        Editing the Vocal Tract Configuration .... 46
        Editing the Source .... 48
        Editing the Tremor Parameters .... 50
        Adjusting Levels of Jitter, Shimmer, and Noise .... 51
        Saving Your Work and Creating Stimuli .... 53
    Part 4: Menu Commands and Other Features of the Synthesizer .... 53
        File Menu .... 53
        Variables Menu .... 55
        Display Menu .... 55
        LF Fit Menu .... 56
        Analysis Menu .... 56
        Synthesis Menu .... 57
        Play Menu .... 57
        Restore Menu .... 57
        Help Menu .... 58
    Part 5: Index of File Names and What They Mean .... 58
IV. Sky Analysis Program
    About Sky .... 60
    Menu Function .... 60
        File Menu .... 60
        Using the File Menu to Convert File Formats .... 63
    View Menu .... 63
    Setup Menu .... 64
    Sample/Play Menu .... 64
    Help Menu .... 64
    Display Menu .... 64
    Analysis Menu .... 71
    Edit Menu .... 80
    Batch Menu .... 82
    AAA Menu .... 83
    MatLab Menu .... 83
V. References .... 84
VI. Last-Minute Changes .... 87
I. INTRODUCTION

This document describes software for inverse filtering (invf.exe), voice synthesis (synthesis.exe), and voice analysis (sky.exe). This software was developed at the UCLA Bureau of Glottal Affairs, with support from the National Institute on Deafness and Other Communication Disorders (grant DC01797). The software is distributed as shareware, and the code is available on an open source basis. C++ code, executable files, and limited documentation are available for download from www.surgery.medsch.ucla.edu/glottalaffairs/. Two sample voices (one male and one female) are also available at that site.

This software requires Windows 3.11 or later and a sound card. All software is best suited for fast computers with 1 GB or more of memory.

The site also includes MATLAB code for a previous version of the synthesizer. This code will not run on MATLAB versions more recent than 4.11, but it is included in the event that someone wants a head start on developing a more modern MATLAB implementation. The present version requires a MATLAB-compatible sound card (SoundBlaster or equivalent), and will not run on computers with 1 GB or more of memory regardless of the MATLAB version installed.

Users are implored to report any bugs they find in any of this software to Norma Antoñanzas-Barroso (nab@ucla.edu) or to Jody Kreiman (jkreiman@ucla.edu). Suggestions for modifications, additions, and clarifications are also very welcome, but technical support is not available beyond the information provided in this document.

Organization of the Manual

The organization of this manual follows the steps of the analysis/synthesis process. The usual first step in this process is estimating the vocal source and vocal tract transfer functions, and accordingly, Section II of the manual describes inverse filtering software developed for these purposes. (It is also possible to omit this step by opening and modifying one of the sample cases, as described in the synthesizer documentation.) Subsequent analyses and synthesis are conducted using the synthesizer software, which is described in Section III. Finally, a number of specialized tools for voice analysis are described in Section IV.

Each section of the manual begins with a brief introduction to some of the relevant theoretical and technical considerations, followed by step-by-step instructions for completing typical analyses. The final part of each section describes additional features and the function of each menu command.

We have assumed some previous knowledge of speech acoustics and signal processing, especially in the introductory sections. In particular, a basic understanding of the acoustic theory of speech production is needed to understand much of what follows. Users without such background may wish to skip the introductory sections of each chapter and proceed directly to the "cookbook" sections that provide step-by-step instructions for using the software.

Technical Credits

Inverse filtering algorithms were written by Norma Antoñanzas-Barroso in C++ to run under Windows. Algorithms for source and noise modeling and interactive synthesis were originally programmed by Brian Gabelman in MATLAB, and subsequently were significantly revised and adapted for Windows by Norma Antoñanzas-Barroso and Diane Budzik. Significant technical advice has been provided by Lloyd Rice and Michael Döllinger, whose help we gratefully acknowledge.
II. INVERSE FILTERING

Part 1: Background

Introduction

Estimates of the shape of the harmonic part of the glottal source can be obtained by inverse filtering the voice signal (Figure 1). In source-filter theory, the vocal tract is modeled as an all-pole filter shaping the input glottal source, and radiation at the lips (which increases the output sound energy level by 6 dB/octave) is modeled by a differentiator. To recover the glottal source, these factors must be canceled out. The vocal tract transfer function is canceled by applying an all-zero filter (the inverse of the vocal tract model) to the speech signal. This process removes the effects of the transfer function from the signal, leaving behind an estimate of the glottal flow derivative. If the radiation characteristic is also canceled, an estimate of the actual glottal pulse shape is generated. This introductory section describes some of the theoretical and practical issues involved in inverse filtering, along with the technical details of the algorithms. Step-by-step instructions appear in Part 2.

Figure 1. The inverse filtering process. From A. Ní Chasaide & C. Gobl, "Voice source variation," in W.J. Hardcastle & J. Laver, The Handbook of Phonetic Sciences (Oxford, Blackwell, 1997), p. 430.

Recording Voice Samples for Inverse Filtering

Recording techniques are not particularly critical in investigations of the spectral characteristics of the voice source, as long as good quality equipment is used in a controlled environment. However, when the goal of the analysis is to recover the shape of the glottal pulse (or its derivative, usually referred to as the flow derivative) accurately in the time domain, voice recording for inverse filtering must preserve phase relationships among the different spectral components. Two recording methods theoretically can preserve spectral phase characteristics (and thus pulse shapes): direct digitization from a precision condenser microphone with an appropriate frequency response, or recording the flow signal with a pneumotachographic mask and a differential pressure transducer, as described by Rothenberg (1973; 1977). Signals from a precision condenser microphone can also be recorded on an FM tape recorder to preserve phase, but direct digitization is preferable due to tape recorder wow and flutter distortions and restricted frequency range. Standard audio tape recorders do not preserve phase information.

Each recording method has advantages and drawbacks (Javkin et al., 1987). Recording in free field with a condenser microphone provides an excellent high-frequency response. High fidelity ½ inch condenser microphones, like those manufactured by Bruel & Kjaer, can transduce acoustic energy down to about 0.1 Hz. However, a microphone cannot capture the low
frequency components of the airflow that arise when the glottis fails to close completely, so information about any constant DC offset is generally lost when this method is applied (although it may be possible to use calibration techniques to estimate the DC airflow without use of a flow mask; see Alku et al., 1998, for details). Finally, the effect of radiation from the lips, equivalent to a differentiation of the signal, must be taken into account with microphone signals to recover actual glottal pulse shapes, although this is not an issue when the goal is to recover the glottal flow derivative. To recover the glottal pulse shape, the signal must be integrated to remove radiation effects, producing a high frequency de-emphasis of 6 dB per octave. This de-emphasis has the effect of enhancing any low frequency noise in the signal.

Airflow masks preserve the DC component of the signal and give a calibrated, quantitative measurement of actual glottal flow. The mask also eliminates radiation effects. Thus, low frequency noise is less of a problem with a flow mask system. However, the flow mask has a poor high frequency response (only up to about 1200 Hz; Rothenberg, 1973, 1977), which may cause significant errors in estimation of the flow waveform shape. In particular, the glottal airflow waveform has the most abrupt changes during the closing phase, and high frequency information is needed to represent these fast changes (Alku & Vilkman, 1995). In addition, the filtering effects of the mask placed over the face make it difficult to estimate voice formant frequencies accurately.

The particular recording method selected thus depends on the specific application.
In our case (where the goal is to derive the input for a synthesizer), loss of high frequency information and difficulties estimating vocal tract resonance frequencies have proven far more problematic than contamination by low frequency noise, so recordings are made with a condenser microphone rather than a flow mask system. Voices for our studies are transduced with a ½ inch Bruel and Kjaer condenser microphone (model 4193) and directly digitized. Signals are sampled at 20 kHz, with 16-bit resolution. They are subsequently downsampled to 10 kHz for analysis.

Estimating the Vocal Tract Filter

Success in inverse filtering is usually defined as an output pulse with minimal residual formant ripple, indicating that most of the effect of the formants has been canceled, and a smoothly decreasing source spectrum conforming to theoretical expectations (Fant, 1979). A successful result depends mostly on the correct specification of formants and bandwidths. In particular, the frequency and bandwidth of F1 must be determined rather precisely to avoid distorting the glottal pulse shape. The frequency and bandwidth of the formants above F3 do not have a large effect on the overall source pulse shape (Ní Chasaide & Gobl, 1997), but are important for correct modeling of glottal closure and vocal tract excitation, as discussed below (Alku & Vilkman, 1995).

Because of interactions between the vocal tract and the source, formant frequencies and bandwidths modulate during the open phase of the glottal cycle. For this reason, the most accurate estimate of vocal tract parameters should be obtained during the glottal closed phase, which can be detected from the LPC residual signal (Childers et al., 1983, 1990; Childers & Krishnamurthy, 1985; Childers & Lee, 1991). The closed phase (to the extent that there is one) begins one sample after the residual peak.
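The idea of locating the closed phase from the LPC residual can be illustrated with a small, self-contained Python sketch. It is a toy under stated assumptions, not the algorithm used in invf.exe or in the cited papers: the "voice" is an impulse train passed through a single two-pole resonance, the LPC analysis is a plain autocorrelation (Levinson-Durbin) fit of order 2, and the function names (resonator, autocorr_lpc, lpc_residual), the 700 Hz/80 Hz resonance, and the half-of-maximum threshold are all hypothetical choices made for illustration.

```python
import math

FS = 10_000  # sampling rate in Hz (the analysis rate used in this manual)

def resonator(x, freq_hz, bw_hz, fs=FS):
    """Pass x through one two-pole resonance (a toy stand-in for the vocal tract)."""
    r = math.exp(-math.pi * bw_hz / fs)
    b1, b2 = 2 * r * math.cos(2 * math.pi * freq_hz / fs), -r * r
    y, y1, y2 = [], 0.0, 0.0
    for v in x:
        out = v + b1 * y1 + b2 * y2
        y.append(out)
        y1, y2 = out, y1
    return y

def autocorr_lpc(frame, order):
    """Autocorrelation-method LPC (Levinson-Durbin); returns [1, a1, ..., a_order]."""
    n = len(frame)
    r = [sum(frame[i] * frame[i + lag] for i in range(n - lag))
         for lag in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a

def lpc_residual(frame, a):
    """Prediction error: the frame filtered by the all-zero filter A(z)."""
    return [sum(a[j] * frame[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(frame))]

# Toy "voice": one excitation impulse per 80-sample cycle (F0 = 125 Hz).
excitation_times = [40, 120, 200, 280]
source = [1.0 if n in excitation_times else 0.0 for n in range(400)]
speech = resonator(source, 700.0, 80.0)

a = autocorr_lpc(speech, order=2)
error = lpc_residual(speech, a)
threshold = 0.5 * max(abs(e) for e in error)   # arbitrary illustrative threshold
residual_peaks = [n for n, e in enumerate(error) if abs(e) > threshold]
closed_phase_starts = [n + 1 for n in residual_peaks]  # one sample after each peak
print(residual_peaks, closed_phase_starts)
```

With real voices the residual is far less tidy than in this toy case, which is why the peaks often have to be chosen by eye in the procedures of Part 2.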
In LPC analysis, the "error" left over after the vocal tract filter has been estimated approximates the source component of the signal (assuming a linear source-filter theory). Thus, a noisy residual signal indicates that the LPC model of the resonances leaves variance unaccounted for. In theory this may signal a need to adjust the
formants later in the modeling process, although in practice we have not noticed a correlation between a noisy residual signal and a "poor" inverse filtering result.

Once the closed phase has been detected, formant frequencies and bandwidths are estimated using a closed phase covariance LPC analysis of 30-56 points, depending on F0 (40 is typical). The number of poles is not restricted, to assure the smoothest possible flow derivative and source spectrum slope. When there is no closed phase, or when the result is unsatisfactory, it is also possible to compute an autocorrelation LPC analysis over a larger window as an alternative to covariance analysis. The variability introduced by the longer window adds its own error to the analysis, but sometimes produces a better result in cases where the assumptions of covariance analysis are violated.

Inverse Filtering Method

Inverse filtering is performed using the method described by Javkin et al. (1987). For signals sampled at 10 kHz, 5-6 zeros are generally appropriate, although the inverse filter software allows up to 10 zeros to be specified. The filter also includes 6 poles (to remove spectral zeros), although in our experience these have not proven particularly useful.

Given that the whole inverse filtering procedure is noisy, an interactive process has been developed that allows the user to manipulate formants and bandwidths to produce the "best" result possible. Use of an interactive filter minimizes the need for precise vocal tract estimation, because a poor estimate can easily be corrected to improve the inverse filtering result. In practice, however, care must be taken because manipulation of the inverse filter to eliminate formant ripple and smooth the source spectrum often simultaneously smooths away perceptually-important details about the shape of the glottal pulse, indicating that the traditional criteria for successful inverse filtering should be applied with caution. This difficulty may be overcome in part by smoothing a theoretically less-than-ideal inverse filter output with a theoretical model instead of attempting to completely model the pulses in the inverse filter. The best approach appears to be limiting intervention in the filter to removing spurious low-frequency poles (a pole at F0, for example) and only enough high-frequency ripple to ensure that the model-fitting algorithm does not crash. Less definitely appears to be more in this case.

Finally, there is no way to know for certain that the inverse filtering process has recovered the "true" or "correct" shape of the glottal pulses, even when the analysis goes smoothly and all traditional criteria for success are met. Depending on the application, different standards for validating the results may be applied. In our case, the recovered source pulses are imported into the synthesizer and then adjusted to produce a synthetic voice that perceptually matches the original natural target voice sample. The source pulse that produces a match to the target voice is considered to be "perceptually correct," although its relationship to underlying physiological vocal function remains unknown. Individual researchers should be aware of validity issues surrounding the output of the inverse filter, and should take steps to validate their results if they plan to make any claims that require or imply correctness or accuracy.

Part 2: Step by Step Procedures

Program Installation

Create the directory C:\Program Files\invf. Copy the file invf.exe from the webpage or CD into this directory. Use the Task and Start Menu wizard (found under "Settings" in the Windows Start menu) to add the inverse filter to the start menu if desired. The software will automatically create the other directories it needs on first use.
Inverse Filtering Procedure: Introduction

The inverse filter as described here serves as a way of estimating the voice source so that it can be modeled and used to synthesize the voice in question. Obviously, there are other reasons to inverse filter vowels--for example, to gain information about the source to assist in evaluating patients with voice disorders--and the inverse filter can be used for these purposes as well. Procedures will vary slightly depending on the purpose. Major procedural variants are noted below.

This may seem like a complicated process from the number of pages it takes to describe it, but once you get the hang of it you can finish an average analysis in about a minute.

Open a File

First, open a candidate audio file (Figure 2). The default format in the inverse filter is the home-grown .AUD format, in which microphone data have the extension .AUD and flow mask data have the extension .FLO. The .WAV format can also be used with the command File-Open a WAVE file-filename.

Figure 2. Inverse filter file opening dialog box.

Figure 3 shows the newly opened file. The inverse filter is optimized for a sample rate of 10 kHz, so the sound file may need to be downsampled before proceeding further. Files can be downsampled in the inverse filter itself to rates that are integer submultiples of the original sampling rate, but non-integrally-related sample rates must be converted in some other utility (for example, SoundForge). To downsample, execute the command Edit-Downsample (Figure 4). The rate defaults to 10,000 Hz (which is what you want). Clicking OK replaces the current file with the downsampled file, and also saves a copy of the downsampled file with the character "d" appended (e.g., "if.aud" is saved as "ifd.aud"; Figure 5).

Figure 3. Inverse filtering window showing open file.

Figure 4. Downsample dialog box.

Figure 5. Downsampled file is saved and reopened in active window.

Next, screen your voice sample for any undesirable features, and identify a segment to work on. Play the whole file by clicking on the sound icon in the toolbar. To select a segment of the file, set the beginning by left clicking, and then right click to set the end (Figure 6). Click the ZI button on the tool bar to Zoom In on the segment (Figure 7). (Click the R button to Restore the original complete file.)
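The integer-submultiple restriction on downsampling mentioned above can be sketched in a few lines of Python. This is only an illustration of the general technique (low-pass filter, then keep every Mth sample); the manual does not document how invf.exe filters the signal, so the Hamming-windowed sinc filter, its length, and the function names decimation_factor, lowpass_taps, and downsample are assumptions. The filter is symmetric and applied centered on each output sample, so it adds no phase shift, in keeping with the emphasis on phase in Part 1.

```python
import math

def decimation_factor(original_rate, target_rate):
    """Return M such that original_rate == M * target_rate, else raise ValueError."""
    if target_rate <= 0 or original_rate % target_rate != 0:
        raise ValueError(
            f"{target_rate} Hz is not an integer submultiple of {original_rate} Hz; "
            "convert the file in another utility first")
    return original_rate // target_rate

def lowpass_taps(factor, half_width=16):
    """Hamming-windowed sinc low-pass with cutoff at the new Nyquist frequency."""
    cutoff = 0.5 / factor                      # in cycles per input sample
    mid = half_width * factor
    taps = []
    for n in range(2 * mid + 1):
        k = n - mid
        ideal = 2 * cutoff if k == 0 else math.sin(2 * math.pi * cutoff * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (2 * mid))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]            # unity gain at 0 Hz

def downsample(samples, original_rate, target_rate):
    """Low-pass filter (zero phase), then keep every Mth sample."""
    factor = decimation_factor(original_rate, target_rate)
    if factor == 1:
        return list(samples)
    taps = lowpass_taps(factor)
    mid = len(taps) // 2
    out = []
    for center in range(0, len(samples), factor):
        acc = 0.0
        for j, tap in enumerate(taps):
            i = center + mid - j               # filter centered on the kept sample
            if 0 <= i < len(samples):
                acc += tap * samples[i]
        out.append(acc)
    return out

print(decimation_factor(20_000, 10_000))       # the 20 kHz to 10 kHz case
```

A 44.1 kHz recording, by contrast, makes decimation_factor raise an error for a 10 kHz target, which corresponds to the case where an external utility is needed.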
Figure 6. Define a speech segment by left and right clicking.
Figure 7. Zoom in on a defined segment by clicking the ZI button.
Figure 8. Use the menu to play a segment of speech; use the sound button on the toolbar to play the whole file.

Play the selected segment (zoomed or not) by using the command Play-Play speech-Play segment (Figure 8). Continue zooming and playing until you have isolated a segment at least 8 cycles long that meets your analysis criteria. In most cases, this segment will be fairly steady in quality, representative of the overall sample, and free of recording artifacts.

To begin the analysis, click the FFT button on the toolbar (Figure 9). Normally, the default choices of Hamming Window and Preemphasis are appropriate. The default window size is 256 points, which should cover about 2.5 periods. If it doesn't (because F0 is less than 100 Hz), increase the window size to 512 points. Choice of window size often involves compromise. A longer window will give a better analysis, but if the window is too long, variability in the vocal tract configuration may introduce errors. A window of about 2.5 periods has proven to be a reasonable compromise in the past. Click OK to continue.
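The window-size rule of thumb above is simple arithmetic: at a 10 kHz sample rate, 2.5 periods of a voice with a given F0 occupy 2.5 x 10,000 / F0 samples. The short Python sketch below works this out; the function name fft_window_size and the restriction to power-of-two sizes starting at 256 are assumptions made for illustration, not a description of the dialog box.

```python
def fft_window_size(f0_hz, sample_rate_hz=10_000, periods=2.5, minimum=256):
    """Smallest power-of-two window (>= minimum) spanning `periods` glottal cycles."""
    needed = periods * sample_rate_hz / f0_hz   # samples needed to cover the cycles
    size = minimum
    while size < needed:
        size *= 2
    return size

# A 220 Hz and a 100 Hz voice fit in the default window; an 80 Hz voice does not.
for f0 in (220, 100, 80):
    print(f0, "Hz ->", fft_window_size(f0), "points")
```

At F0 = 100 Hz the requirement is 250 samples, which is why 256 points is just sufficient at that pitch and 512 points are suggested for lower voices.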
Figure 9. FFT analysis dialog box.

A spectrum will now appear in the lower left part of the analysis window. Next, click the LPC button (Figure 10). Select autocorrelation and preemphasis, as shown in Figure 10. Window size considerations are as above. Order 14 is usually good for 10 kHz sample rates. If you change the analysis order, the window size will also change automatically. If necessary, reset the value to the window size you want before clicking OK. Changing the window size will not affect the order (unless you set the window size to less than twice the order; in that case, the analysis will be rejected).

Figure 10. Autocorrelation LPC analysis dialog box.

After OK is clicked, the number of cycles in the upper right window will decrease, an LPC envelope will appear over the FFT spectrum, and numbers will appear in the table of formants and bandwidths in the upper left part of the window (Figure 11). An error signal will also appear under the waveform in the center of the window.

Figure 11. Output of autocorrelation LPC analysis. Error signal appears below waveform in the middle of the figure. Left and right click to define a cycle, as described in the text.

Referring to this error signal, mark the beginning and end of a cycle so that F0 can be estimated. To do this, first find the peaks in the error signal. Click the left mouse button near the left peak in the error signal, and click the right button near the right peak. Precision is not critical, and the choice of peaks is not necessarily straightforward, as Figure 11 shows. You may have several choices of peak, or you may have to guess at the cycle boundaries. There does not appear to be any particular correlation between the prominence of the peaks and the quality of the inverse filtering--prominent peaks can give a rotten result, and peaks placed by guessing can give a very good result. Also refer to the time series above the error signal. The part of the waveform corresponding to the peaks you choose should look like a complete cycle of phonation, with the left cursor at the beginning and the right cursor at the end. You can reset the cursors if necessary by reclicking--you may have to move the cursor far to the left or right of the location you want for this to work (because the software looks for a peak near the cursor).

Once you're satisfied and have finished marking a period, click the F0 button on the toolbar to compute F0 for that cycle. The value will appear in the caption at the top right of the frame, just above the time series waveform. Check it to be sure it is sensible given your previous listening to the voice. If it isn't, something is wrong and you should start over.

Run the Inverse Filter

To run the inverse filter using the autocorrelation estimates of vocal tract resonances, just click the IF button on the toolbar (Figure 12). If the extension for the file in use is .AUD, the program assumes that this is a microphone signal and automatically cancels the radiation characteristic. If the file extension is .FLO, the program assumes this is data from a flow mask and does not cancel the radiation characteristic. The right panels of Figure 12 show the output of the inverse filter.
The top tracing is the glottal waveform, the second trace is the flow derivative, and the bottom shows the spectrum of the flow derivative. This result is not very satisfactory, due to the large amount of ripple in the flow derivative. The flow derivative spectrum is not smoothly decreasing, as one would expect from a correctly inverse-filtered signal. An examination of the formant values in the top left panel of the figure shows the reason for this: A resonance has incorrectly been placed below F1 at 441 Hz.

Figure 12. Initial output of the inverse filter.

To remove this unwanted resonance (or any other resonance), point the cursor at its location in the spectrum shown in the lower left panel of the display, and double right click. The formant will be deleted and the inverse filter automatically reapplied with the new vocal tract model, as shown in Figure 13. (Resonances can also be deleted manually by editing the values in the table, and then clicking the IF button to apply the new vocal tract model.) Spurious resonances below F1 occur rather commonly when inverse filtering is based on autocorrelation estimates of the vocal tract transfer function, and simply removing them often yields a good result.
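The effect of a spurious resonance in the formant table can be reproduced with a minimal Python sketch of an all-zero inverse filter built from (frequency, bandwidth) pairs. Everything here is a simplified assumption rather than the code in invf.exe: each resonance is a second-order section with pole or zero radius exp(-pi*B/fs), the "vowel" is an impulse train passed through three such resonances, and the names section, resonate, antiresonate, and inverse_filter are hypothetical. Only the 829 Hz and 441 Hz values come from the example in the text; the other frequencies and all bandwidths are made up.

```python
import math

FS = 10_000  # Hz

def section(freq_hz, bw_hz, fs=FS):
    """Coefficients (c1, c2) of A(z) = 1 + c1*z^-1 + c2*z^-2 for one resonance."""
    r = math.exp(-math.pi * bw_hz / fs)
    return -2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs), r * r

def resonate(x, freq_hz, bw_hz):
    """All-pole section 1/A(z): adds a formant (used here only to fake a vowel)."""
    c1, c2 = section(freq_hz, bw_hz)
    y, y1, y2 = [], 0.0, 0.0
    for v in x:
        out = v - c1 * y1 - c2 * y2
        y.append(out)
        y1, y2 = out, y1
    return y

def antiresonate(x, freq_hz, bw_hz):
    """All-zero section A(z): cancels a formant with this frequency and bandwidth."""
    c1, c2 = section(freq_hz, bw_hz)
    y, x1, x2 = [], 0.0, 0.0
    for v in x:
        y.append(v + c1 * x1 + c2 * x2)
        x1, x2 = v, x1
    return y

def inverse_filter(signal, formant_table):
    """Apply one anti-resonance per (frequency, bandwidth) row of the table."""
    for freq_hz, bw_hz in formant_table:
        signal = antiresonate(signal, freq_hz, bw_hz)
    return signal

def max_error(a, b):
    return max(abs(p - q) for p, q in zip(a, b))

true_tract = [(829.0, 90.0), (1500.0, 110.0), (2600.0, 160.0)]  # illustrative values
source = [1.0 if n % 80 == 0 else 0.0 for n in range(400)]       # toy excitation
speech = source
for f, b in true_tract:
    speech = resonate(speech, f, b)

estimated = [(441.0, 100.0)] + true_tract       # table with a spurious low resonance
before = inverse_filter(speech, estimated)
after = inverse_filter(speech, [fb for fb in estimated if fb[0] != 441.0])
print(max_error(before, source), max_error(after, source))
```

In this idealized case the source is recovered essentially exactly once the extra row is deleted, whereas leaving it in distorts every pulse. Real voices never cancel this cleanly, but the direction of the effect is the same.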
Figure 13. When the spurious low-frequency resonance is removed, the inverse filtering result improves significantly.

The inverse filter also allows the user to add resonances and to manipulate bandwidths interactively, as shown in Figures 14-17. To add a new resonance, point the cursor to the appropriate place in the spectrum in the lower left panel of the display and double left click. A resonance will appear in that location (at 1872 Hz, indicated by an arrow in the figure) and in the table at the top of the display, with default bandwidth of 100 Hz (Figure 14). Notice the change in the shape of the flow derivative spectrum (lower right panel), also indicated by an arrow. To remove this resonance, double right click it, as described above.

Figure 15 shows the effect of deleting a resonance from the analysis. In this case, the formant at 4636 Hz has been deleted (by double right clicking), resulting in a large increase in flow derivative ripple and an extra bump in the flow derivative spectrum, both indicated by arrows in the figure.
Figure 14. Results of adding an extra resonance to the inverse filter.
Figure 15. Result obtained by deleting a high-frequency pole from the inverse filter.
Existing formant frequencies can be manipulated in two ways: by typing a new value into the table and clicking the IF button, or by single left clicking the formant in question and dragging it to a new position. Figure 16 shows the result of dragging F1 from its starting value of 829 Hz to a value of 711 Hz; notice the increase in ripple in the flow derivative. As the formant moves, the inverse filter and display update automatically, showing the effect of the new resonance value on the estimated glottal waveform, flow derivative, and flow derivative spectrum. When you are happy with the value, single right click to lock the resonance in place.
Bandwidths may also be manipulated interactively using the sliders to the right of the table of resonance values. Dragging a slider to the right widens the bandwidth of the resonance in question; in Figure 17, the bandwidth of the first formant has been widened to excess. Dragging the slider to the left narrows the bandwidth. As with the formant values, the effects of changes in slider position are shown immediately in the output display of glottal waveform, flow derivative, and flow derivative spectrum. Bandwidths can be edited directly in the table as well. In this case, the IF button must be clicked to apply the new values. (If you are using more than 6 resonances, you will have to edit bandwidth values for the higher resonances in the table, because there are only 6 sliders. Click IF to implement your changes in the filter model.)
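For readers who want to see why adding, deleting, or widening a resonance changes the output, here is a minimal sketch. It is not the program's code. It assumes that each row of the resonance table contributes one second-order anti-resonance (a pair of zeros) to the inverse filter, using the standard mapping from frequency and bandwidth to a zero angle and radius. The function names, the 10 kHz sampling rate, and the bandwidth values are all hypothetical; only the frequencies echo the example above.

```python
import math

def antiresonator(freq_hz, bw_hz, fs):
    """Second-order FIR section [1, a1, a2] whose zeros cancel one resonance."""
    r = math.exp(-math.pi * bw_hz / fs)      # zero radius, set by the bandwidth
    theta = 2.0 * math.pi * freq_hz / fs     # zero angle, set by the frequency
    return [1.0, -2.0 * r * math.cos(theta), r * r]

def convolve(a, b):
    """Multiply two polynomials in z^-1, i.e. cascade two FIR sections."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def inverse_filter(resonances, fs):
    """Cascade one antiresonator per (frequency, bandwidth) table row."""
    coeffs = [1.0]
    for freq_hz, bw_hz in resonances:
        coeffs = convolve(coeffs, antiresonator(freq_hz, bw_hz, fs))
    return coeffs

def gain_db(coeffs, freq_hz, fs):
    """Gain of the FIR inverse filter at one frequency, in dB."""
    w = 2.0 * math.pi * freq_hz / fs
    re = sum(c * math.cos(w * k) for k, c in enumerate(coeffs))
    im = -sum(c * math.sin(w * k) for k, c in enumerate(coeffs))
    return 20.0 * math.log10(math.hypot(re, im))

FS = 10000  # hypothetical sampling rate in Hz
# Frequencies echo values mentioned in the text; bandwidths are made up.
table = [(441, 100), (829, 90), (4636, 300)]
with_spurious = inverse_filter(table, FS)
without_spurious = inverse_filter(table[1:], FS)   # like deleting the 441 Hz row
widened = inverse_filter([(441, 100), (829, 400), (4636, 300)], FS)

for name, coeffs in [("with 441 Hz", with_spurious),
                     ("441 Hz deleted", without_spurious),
                     ("F1 bandwidth widened", widened)]:
    print(name, round(gain_db(coeffs, 441, FS), 1), round(gain_db(coeffs, 829, FS), 1))
```

Each resonance in the table cuts a notch in the inverse filter at its own frequency. Deleting a row removes the notch, and widening a bandwidth makes the notch shallower and broader, which is why the flow derivative spectrum changes shape as you edit the table.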
Figure 16. Result of decreasing the frequency of F1 by dragging the resonance peak to a lower value.
Figure 17. Result obtained when the bandwidth of F1 is increased by dragging the sliding cursor.
Figure 18. Covariance LPC analysis dialog box.
Autocorrelation estimates of vocal tract parameters are robust, but may contain errors in unstable voices due to the long analysis window, as the above example illustrates. Vocal tract parameters may also be estimated in the inverse filter using covariance LPC analysis, which applies a short window but assumes complete or near-complete glottal closure. To estimate the vocal tract using covariance analysis, begin by windowing the signal, calculating an FFT, and using autocorrelation LPC analysis to select a cycle and calculate F0, as described above. Then click the LPC button on the toolbar again, and this time select covariance (Figure 18). The default window size of 56 points is usually too long. Depending on F0, adjust this value so that the analysis just includes the most-closed phase of the cycle (usually this is the section with the largest excitation ripples). If you use fewer than 29 points, change the order to 12. When you click OK, a bar will appear above the time series waveform showing the position and size of the window applied in estimating the vocal tract (as indicated by the arrow in Figure 19). When you are satisfied with the window size, click the IF button to proceed with the analysis, as above. The output of the inverse filter based on the covariance LPC analysis is shown in Figure 19. The result is very similar to that obtained using autocorrelation LPC, except for a spurious formant at 5 kHz which produces a very steep drop-off in the flow derivative spectrum at high frequencies. This often occurs with covariance analysis inverse filtering unless you fuss with the analysis order ahead of time. Remove this formant by double right clicking and all will be well.
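The idea behind a short-window covariance analysis can be sketched as follows. This is a textbook least-squares formulation, offered as an illustration only; it is not the program's implementation. The function names, the sampling rate, and the test signal (a single decaying resonance standing in for a closed phase) are all assumptions. The demonstration uses order 2 because the test signal has only one resonance; a higher order on such a simple signal would make the equations singular.

```python
import math

def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(aug[row][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system; try a lower order or a longer window")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row][c] -= factor * aug[col][c]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        acc = aug[row][n] - sum(aug[row][c] * x[c] for c in range(row + 1, n))
        x[row] = acc / aug[row][row]
    return x

def covariance_lpc(signal, start, length, order):
    """Predictor coefficients a_k minimizing the squared error of
    s[n] ~ sum_k a_k * s[n-k] over the window [start, start + length)."""
    if start < order:
        raise ValueError("window needs 'order' samples of history before it")
    span = range(start, start + length)
    phi = [[sum(signal[n - i] * signal[n - k] for n in span)
            for k in range(1, order + 1)] for i in range(1, order + 1)]
    rhs = [sum(signal[n] * signal[n - i] for n in span)
           for i in range(1, order + 1)]
    return solve_linear(phi, rhs)

# Demo signal: one decaying resonance with no excitation inside the window.
FS = 10000                                   # hypothetical sampling rate in Hz
r = math.exp(-math.pi * 80 / FS)             # 80 Hz bandwidth (made up)
b1 = 2 * r * math.cos(2 * math.pi * 800 / FS)
b2 = -r * r
signal = [1.0, b1]
for n in range(2, 60):
    signal.append(b1 * signal[-1] + b2 * signal[-2])

coeffs = covariance_lpc(signal, start=5, length=20, order=2)
radius = math.sqrt(-coeffs[1])
est_freq = math.acos(coeffs[0] / (2 * radius)) * FS / (2 * math.pi)
est_bw = -math.log(radius) * FS / math.pi
print(round(est_freq, 1), round(est_bw, 1))
```

Because the window contains no excitation, 20 samples are enough to recover the resonance. This is the sense in which covariance analysis trades the long window for the assumption of glottal closure.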
Figure 19. Result of covariance LPC analysis. Arrow shows window size indicator.
The inverse filter also includes a provision for canceling apparent spectral zeros by adding a pole to the model. To do this, type the frequency and bandwidth into the table between the formant values and bandwidth sliders. This is fun to play around with, but we have never found it to be particularly helpful.
Print and Save Files for Use in the Synthesizer
Once you are satisfied with your result, you can print the inverse filter window by clicking on the printer icon. (Be sure to use landscape format.) You can also save the files needed to import this case into the synthesizer. To save, select File-Custom Save-Save For Synthesizer-Windows (Figure 20). This command creates the directory \synthesis\work\filename (in this case, filename = ifd), into which it places 3 files: filename.lv, filename.par, and filename.s. Filename.lv is a 1-second sample of the original voice used as a standard of comparison for later modeling efforts. Listen to this file (it's in ASCII format; convert to .wav or .aud in the Sky utility program if necessary) to be sure that it is representative and suitable. Filename.par contains various parameter values needed by the synthesizer, and filename.s contains several cycles of the inverse filtered source waveform. (The Save-Matlab option shown in the figure saves a different set of files needed to run the Matlab version of the synthesizer, which is no longer supported.)
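If the Sky utility is not at hand, an ASCII sample file can in principle be converted to .wav with a few lines of script. The sketch below rests on assumptions that the manual does not confirm: that the file holds plain numeric sample values separated by whitespace, and that you know the sampling rate of the recording. The function names are hypothetical. The samples are rescaled to the full 16-bit range because the value range of the ASCII file is not documented here, so absolute level is not preserved.

```python
import struct
import wave

def parse_ascii_samples(text):
    """Assumed layout: whitespace-separated numeric sample values."""
    return [float(token) for token in text.split()]

def write_wav(samples, path, fs):
    """Write mono 16-bit PCM, rescaling so the largest sample is full scale."""
    peak = max((abs(s) for s in samples), default=0.0) or 1.0
    scale = 32767.0 / peak
    frames = b"".join(struct.pack("<h", int(round(s * scale))) for s in samples)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)       # 2 bytes = 16 bits per sample
        wav.setframerate(fs)
        wav.writeframes(frames)

samples = parse_ascii_samples("0\n120\n-340\n95\n")
# To convert a real file, read its text and supply the true sampling rate:
# with open("ifd.lv") as f:
#     write_wav(parse_ascii_samples(f.read()), "ifd_lv.wav", fs=10000)
```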
Figure 20. Dialog boxes used to save files needed for input to the voice synthesizer.
If you are not completely happy with the inverse filtering results, you can try repeating the analysis on a single cycle of phonation. This removes variability in resonances across the sample, and can improve the outcome in cases where formant estimation is particularly difficult. To do this, once you have finished analyzing the connected speech as described above, use the command: File-Custom Save-Concatenated Cycles (Figure 21). This creates a file, filenamec.aud (e.g., IFc.aud), with the selected cycle repeated a number of times. (The c stands for concatenated.) (All files created from now on will have the form filenamec.xxx, where filename = the original filename.) This pulse will form the basis for source modeling in the synthesizer, so be sure you like it. After you have saved the concatenated cycles, the inverse filter closes the original audio file and opens filenamec.aud (the new file containing a single concatenated cycle of phonation; Figure 22). Repeat the inverse filtering process on this concatenated cycle:
1. Click the FFT button on the toolbar; adjust analysis length if necessary.
2. Click the LPC button on the toolbar; perform an autocorrelation analysis, adjusting window length if necessary.
3. Right and left click to define a cycle; click the F0 button on the toolbar.
4. If desired, click the LPC button again and perform a covariance LPC analysis; adjust window size and analysis order if necessary.
5. Click the IF button to inverse filter the signal. Adjust formants and bandwidths until satisfied with the output.
Filter parameters can now be adjusted as desired to alter the output of the inverse filter, as described above. The result of this process for the example voice is shown in Figure 23.
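The concatenation step itself is simple, and a sketch may make clear what the new file contains. The code below is an illustration under stated assumptions, not the program's code: the function names are hypothetical, the cycle is taken to run from a start marker up to (but not including) an end marker, and F0 is taken to be the sampling rate divided by the cycle length in samples.

```python
import math

def concatenate_cycle(signal, start, end, repeats):
    """Repeat the samples in signal[start:end] end to end."""
    cycle = list(signal[start:end])
    if not cycle:
        raise ValueError("cycle markers must enclose at least one sample")
    return cycle * repeats

def f0_from_cycle(start, end, fs):
    """Fundamental frequency implied by one marked cycle."""
    return fs / float(end - start)

FS = 10000                                     # hypothetical sampling rate in Hz
voice = [math.sin(2 * math.pi * n / 50) for n in range(200)]  # stand-in signal
train = concatenate_cycle(voice, start=50, end=100, repeats=10)
f0 = f0_from_cycle(50, 100, FS)
```

Because every cycle in the result is identical, any cycle-to-cycle variation in the resonances disappears, which is the point of the procedure. It also means that a poorly chosen cycle is repeated exactly, so the choice of cycle matters.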
Figure 21. File-Custom save-Concatenated cycles command saves the marked cycle as a series of concatenated pulses in the analysis window.
We offer these final suggestions regarding the use of the inverse filter. In general, we have found that it is better to undermodel than to overmodel. Remember, you will get rid of extra ripples and bumps when you LF fit the source pulse (see Section III); you don't have to do all the work here. Achieving a satisfactory result may require several trials over different cycles. Don't worry if the formants and bandwidths seem strange. The only purpose of the vocal tract model used here is to make a good inverse filter, not to model vowel quality, and many other factors (including but not limited to source-tract interactions) will influence the filter characteristics that you finally settle on. Also, remember that you are working on a noise-free simulation of a noisy signal. Error is built into the process right from the start, so obsessing about getting the "right answer" is usually misguided. You will get a chance to derive a perceptually corrected answer when you model this voice in the synthesizer, so try to be patient now.
Figure 22. File-Custom save-Concatenated cycles command concatenates a single cycle and places the result in the analysis window.
Figure 23. Results of inverse filtering process for a single concatenated cycle. These serve as input to the synthesis process.
Part 3: Other Features of the Inverse Filter
Introduction
This section describes additional functions available in the inverse filter that are not listed in the preceding section. Features are listed according to the menu in which they occur.
File Menu
File-Open a Text File: Use this command to open a sound file in ASCII format.
Help Menu
Not currently helpful.
Figure 24. Display menu options.
Display Menu
The Display menu is shown in Figure 24. The following additional commands are available:
Display-Glottal Window-Display Zero Line in Flow Derivative-Insert: This command inserts a line at the current zero value in the display for reference. If there is no constant DC offset in the signal, this will align with the closed portions of the flow derivative and flow pulse. It is particularly useful to check this if you are going to fit the pulses with the LF model in the synthesizer, because the LF fitting procedure can go awry if the pulses are substantially offset from zero. If this is the case for your data, proceed to the next command.
Display-Glottal Window-Remove DC in Flow Derivative: Figure 25 illustrates this process. To recenter the display around the "true" zero line, first left-click to set a cursor at the desired zero point. Then use this command to remove the DC offset and rezero the data. This command can be repeated if you are not satisfied with the first point you select.
Figure 25. Resetting the zero line to remove DC offset from the flow derivative. Top panel: inverse filtered file with zero placed below the apparent correct value. Bottom panel: Corrected zero line after removal of the DC offset.
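In effect, rezeroing amounts to subtracting a constant from every sample. The sketch below assumes the constant is simply the signal value at the point you clicked. The manual does not state exactly how the program derives the offset, so treat this as an illustration only; the function name is hypothetical.

```python
def remove_dc(flow_derivative, zero_index):
    """Shift the waveform so the sample at zero_index sits on the zero line."""
    offset = flow_derivative[zero_index]
    return [x - offset for x in flow_derivative]

# Made-up flow derivative whose closed portion sits at 0.25 instead of 0.
offset_signal = [0.25, 0.25, 1.25, -0.75, 0.25]
rezeroed = remove_dc(offset_signal, zero_index=0)
```

Choosing a different point and repeating the call corresponds to repeating the command when the first choice is unsatisfactory.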
Display-Glottal Window-Display Zero Line in Flow Derivative-Remove: Removes the zero line from the display. This does not affect the zero location; it just hides the line.
Edit Menu
The Edit menu includes the following additional commands.
Edit-Invert the Waveform: Multiplies the acoustic signal by -1.
Edit-Highpass: Applies a constant phase high pass filter to remove baseline noise caused by air currents in the recording suite. This is only a problem when the microphone has a very good DC response. Figure 26 shows a typical signal before and after use of this feature. A center frequency of 6 Hz and a transition band of 10 Hz usually work pretty well. Because the filter is linear phase, it does not affect the output of the inverse filter.
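A linear phase high pass filter of this general kind can be sketched with a windowed-sinc design. This is one common way to build such a filter and is not necessarily what the program does. The function names are hypothetical, the tap-count rule is the usual rule of thumb for a Hamming window, and the 1 kHz sampling rate in the demonstration is chosen only to keep the example fast. At realistic sampling rates, a 10 Hz transition band requires several thousand taps.

```python
import math

def highpass_taps(cutoff_hz, transition_hz, fs):
    """Symmetric (linear phase) FIR high pass: windowed-sinc low pass, inverted."""
    n_taps = int(math.ceil(3.3 * fs / transition_hz)) | 1   # force an odd length
    mid = n_taps // 2
    fc = cutoff_hz / fs
    taps = []
    for n in range(n_taps):
        k = n - mid
        ideal = 2.0 * fc if k == 0 else math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_taps - 1))  # Hamming
        taps.append(ideal * window)
    total = sum(taps)
    taps = [-t / total for t in taps]   # low pass with unit DC gain, negated
    taps[mid] += 1.0                    # delta minus low pass = high pass
    return taps

def filter_zero_delay(taps, signal):
    """Apply the filter centered on each sample, so features are not shifted in time.
    Samples beyond either end are replaced by the nearest edge value."""
    mid = len(taps) // 2
    last = len(signal) - 1
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, t in enumerate(taps):
            idx = min(max(n + mid - k, 0), last)
            acc += t * signal[idx]
        out.append(acc)
    return out

FS = 1000  # hypothetical (low) sampling rate in Hz, to keep the demo quick
taps = highpass_taps(cutoff_hz=6, transition_hz=10, fs=FS)
# A 100 Hz tone riding on a constant offset of 5.0 stands in for a drifting signal.
drifting = [5.0 + math.sin(2 * math.pi * 100 * n / FS) for n in range(600)]
cleaned = filter_zero_delay(taps, drifting)
```

Because the taps are symmetric, every frequency is delayed by the same amount, and centering the filter on each sample removes that delay. The waveform shape above the cutoff is therefore left essentially intact.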
Figure 26. Removing baseline drift by high pass filtering the file. Left panel: Audio signal including significant baseline drift. Right panel: The same file after application of a linear phase filter at 6 Hz.
Glottal Analysis Menu
The Glottal Analysis menu allows the user to make preliminary estimates of glottal timing features.
Mark Features: Marks and displays the instants of closing, opening, maximum flow, and maximum closing velocity for the current glottal waveform, as shown in Figure 27.
Compute: Using the marks shown in Figure 27, this command calculates the open quotient, speed quotient, speed index, rate quotient, DC offset, and peak flow. The last two measures are not meaningful for audio data.
Delete Glottal Marking: Erases the marked glottal features from the display.
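The manual does not give the formulas behind the Compute command, so the sketch below uses definitions that are common in the voice literature and may differ in detail from what the program calculates. Open quotient is taken as open time divided by the period, speed quotient as opening time divided by closing time, and speed index as (opening - closing) / (opening + closing). The function and argument names are hypothetical, and rate quotient is omitted rather than guessed at.

```python
def glottal_timing(t_open, t_peak, t_close, period):
    """Timing measures from the instants of opening, maximum flow, and closing.
    All arguments share one time unit (for example, milliseconds)."""
    opening = t_peak - t_open      # duration of the opening phase
    closing = t_close - t_peak     # duration of the closing phase
    if opening <= 0 or closing <= 0 or period <= 0:
        raise ValueError("expected t_open < t_peak < t_close and period > 0")
    return {
        "open_quotient": (opening + closing) / period,
        "speed_quotient": opening / closing,
        "speed_index": (opening - closing) / (opening + closing),
    }

# Made-up marks, in milliseconds, for a 100 Hz voice (10 ms period).
measures = glottal_timing(t_open=0.0, t_peak=3.0, t_close=5.0, period=10.0)
```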
Figure 27. Timing features marked in the glottal waveform.
III. VOICE SYNTHESIS SOFTWARE
Part 1: Introduction
About Voice Synthesis and the UCLA Voice Synthesizer
This introduction reviews the features of the synthesizer and some of the issues surrounding voice synthesis in general. It also describes the algorithms used in the synthesizer. The second part of the documentation describes step-by-step procedures for basic analysis and synthesis, and the third section provides details of some additional features of the synthesizer that aren't necessarily needed for every case.
The voice synthesizer is a formant synthesizer, based on the source-filter theory of speech production (Fant, 1960). Accordingly, users must model the vocal source, which is then filtered through a cascade of resonators that models the vocal tract response (e.g., Klatt, 1980). This synthesizer differs from most other formant synthesizers in the precision with which the source can be modeled, and in the degree of interactivity. It also differs from other synthesizers in that it is limited at present to modeling of vowels with steady-state resonances (although instabilities in the source functions can be modeled in great detail).
The synthesis process begins by generating an estimate of the shape of the harmonic part of the source. When the goal is to copy a specific voice, this can be accomplished through inverse filtering (as described in the previous chapter of this manual). Alternatively, a source can be imported from outside the program; for example, a source pulse with the desired characteristics can be created from scratch in another program (a text editor or other program), converted to ASCII format, and then imported. Finally, one of the supplied sample voices can be opened in the synthesizer and its source can be edited until it has the desired characteristics.
The inharmonic part of the source (noise excitation) is estimated through application of a cepstral-domain comb lifter like that described by de Krom (1993).
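The cascade of resonators mentioned earlier in this introduction can be illustrated with the second-order digital resonator widely used in formant synthesis (the form described by Klatt, 1980). The sketch below is an illustration of the general technique, not the synthesizer's own code. The function names, sampling rate, and formant values are hypothetical, and a bare impulse train stands in for the far more detailed source models the synthesizer provides.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs):
    """Coefficients for y[n] = a*x[n] + b*y[n-1] + c*y[n-2], with unit gain at DC."""
    t = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c

def resonate(signal, freq_hz, bw_hz, fs):
    """Pass a signal through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize(source, formants, fs):
    """Filter the source through the resonators one after another (a cascade)."""
    out = list(source)
    for freq_hz, bw_hz in formants:
        out = resonate(out, freq_hz, bw_hz, fs)
    return out

FS = 10000                                   # hypothetical sampling rate in Hz
F0 = 100                                     # hypothetical fundamental frequency
# Placeholder source: 0.1 s impulse train, one impulse per glottal cycle.
source = [1.0 if n % (FS // F0) == 0 else 0.0 for n in range(FS // 10)]
formants = [(800, 80), (1200, 90), (2500, 150)]   # made-up frequencies, bandwidths
vowel = synthesize(source, formants, FS)
```

In a cascade, the relative amplitudes of the formant peaks follow from the frequencies and bandwidths alone, so only those two values need to be specified for each formant.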
Noise analysis takes place within the synthesizer, as described below. It is not presently possible to import a noise spectrum from outside the synthesizer.
The third step in voice modeling is assessment of the patterns of F0 and amplitude modulation (vocal tremors). Several approaches are available. F0 and amplitude can be tracked within the synthesizer; the degree of smoothing applied to the contours can be specified by the user. Alternatively, pitch tracks can be imported from outside the synthesizer. Finally, users can model instabilities using two synthetic tremor models, one that models sinusoidal modulations and one that provides "random" contours. Jitter and shimmer are also modeled, as described below.
Finally, users model the vocal tract response by specifying formant frequencies and bandwidths. Again, these can be based on analyses of a specific voice to be copied; a desired configuration can be created from scratch; or the sample cases can be imported and then manipulated as desired.
Issues in Source Modeling
Accurate modeling of the voice source is an essential part of accounting for variations in voice quality (e.g., Ananthapadmanabha, 1984; Karlsson, 1991). Inverse filtering is commonly used to estimate the shape of the voice source, but despite an experimenter's best efforts the recovered glottal flow waveform often includes ripples, bumps, and other theoretically undesirable but in practice unavoidable features. It is hard to tell if these wiggles and bumps are errors or if they're real features of the voice source; we usually lack the data from imaging or aerodynamics to disambiguate this issue. However, synthesizing a voice without removing at least some of these bumps and lumps provides a terrible-sounding result, suggesting that at least some of them are in fact errors.
One common approach to coping with this situation is to fit the output of the inverse filter with a theoretical model of the glottal flow pulse. In practice, substituting the modeled flow for the experimentally derived flow eliminates errors, wiggles, bumps, and excess high-frequency formant ripple and attendant high-frequency distortion, while preserving most of the important features of the pulse shapes. Experiments with synthetic voices have further shown that smoothing with a theoretical model increases the accuracy with which various parameters of the glottal source can be estimated (Strik, 1998).
Many time-domain source models have been proposed (Ananthapadmanabha, 1984; Imaizumi et al., 1991; see Fujisaki & Ljungqvist, 1986, or Ní Chasaide & Gobl, 1997, for review), including physiological models (e.g., Ishizaka & Flanagan, 1972; Cranen & Schroeter, 1996), models of the glottal flow pulse (Rosenberg, 1971; Fant, 1979), and models of the glottal flow derivative (Fant et al., 1985; Fujisaki & Ljungqvist, 1985). The most common choice, and the one implemented in the present synthesizer, is the LF model (Figure 28; Fant et al., 1985). This model of the glottal flow derivative is well-documented and includes a relatively small number of parameters, which can be estimated from inverse filtered waveforms. Mode