A hierarchical method of automatic speech segmentation for synthesis applications

S Pauws*, Y Kamp, L Willems

*Corresponding author for this work

Research output: Contribution to journal > Article > Scientific > peer-review

Abstract

The paper describes a method for automatically segmenting a database of isolated words as required for the purpose of speech synthesis. The phoneme-like units in the phonetic transcription of the utterances are represented by dedicated hidden Markov models (HMMs) and segmentation is performed by aligning the speech signal against the sequence of HMMs representing the words. The specific advantage of the method presented here is that it does not need manually segmented speech material to initialize the training of the HMMs. Therefore, it can be regarded as an improved variant of established techniques for automatic segmentation. The problem of proper initialization of the HMMs without resorting to manually segmented material is solved by a hierarchical approach consisting of three successive steps. In the first step a segmentation in broad phonetic classes is realized that provides anchor points for the second stage, consisting of a sequence-constrained vector quantization. In this stage each broad phonetic class is further segmented into its constituent phonemes. The result is a crude phonetic segmentation which is then used as initialization of the HMMs in the last stage. Fine-tuning of the models is realized via Baum-Welch estimation. The final segmentation is obtained by Viterbi alignment of the utterances against the HMMs. This hierarchical approach was used to segment a database of isolated words recorded from a male speaker. An accuracy of 89.51% was obtained in the location of the phoneme boundaries with a tolerance of 20 ms.
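The last stage and the evaluation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes one left-to-right state per phoneme-like unit, frame log-likelihoods that have already been computed by some trained model, a 10 ms frame shift, and a boundary counted as correct when it lies within the tolerance inclusive. The function names `viterbi_align`, `boundaries_ms` and `boundary_accuracy` are hypothetical.

```python
from typing import List, Sequence

NEG_INF = float("-inf")


def viterbi_align(loglik: Sequence[Sequence[float]]) -> List[int]:
    """Align T frames to S left-to-right states (assumed: one state per unit).

    loglik[t][s] is the log-likelihood of frame t under state s.
    The path starts in state 0, ends in state S-1, and at each frame
    either stays in the current state or advances by exactly one.
    """
    n_frames, n_states = len(loglik), len(loglik[0])
    if n_frames < n_states:
        raise ValueError("need at least one frame per state")
    score = [[NEG_INF] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = loglik[0][0]
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else NEG_INF
            if move > stay:
                score[t][s], back[t][s] = move + loglik[t][s], s - 1
            else:
                score[t][s], back[t][s] = stay + loglik[t][s], s
    # Trace back from the final state to recover the best state sequence.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def boundaries_ms(path: Sequence[int], frame_shift_ms: float = 10.0) -> List[float]:
    """Boundary times: frames where the aligned state changes (assumed 10 ms shift)."""
    return [t * frame_shift_ms for t in range(1, len(path)) if path[t] != path[t - 1]]


def boundary_accuracy(auto: Sequence[float], ref: Sequence[float],
                      tol_ms: float = 20.0) -> float:
    """Fraction of automatic boundaries within tol_ms of the reference boundaries."""
    if not ref or len(auto) != len(ref):
        raise ValueError("boundary lists must be non-empty and of equal length")
    hits = sum(1 for a, r in zip(auto, ref) if abs(a - r) <= tol_ms)
    return hits / len(ref)


if __name__ == "__main__":
    # Toy scores: six frames, three units, two frames clearly favouring each unit.
    toy = [[0, -5, -5], [0, -5, -5], [-5, 0, -5],
           [-5, 0, -5], [-5, -5, 0], [-5, -5, 0]]
    found = boundaries_ms(viterbi_align(toy))
    print(found, boundary_accuracy(found, [25.0, 70.0]))
```

In the method the abstract describes, the scores would come from HMMs initialized by the two earlier stages and refined by Baum-Welch estimation; the toy matrix here only stands in for those scores.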

Original language: English
Pages (from-to): 207-220
Number of pages: 14
Journal: Speech Communication
Volume: 19
Issue number: 3
Publication status: Published - Sep 1996
Externally published: Yes

Keywords

  • speech segmentation
  • hidden Markov models (HMM)
  • vector quantization
  • AMERICAN ENGLISH
  • RECOGNITION
  • DURATION

Cite this

@article{4dfd7e60223e4f729331c281df8c2a48,
title = "A hierarchical method of automatic speech segmentation for synthesis applications",
abstract = "The paper describes a method for automatically segmenting a database of isolated words as required for the purpose of speech synthesis. The phoneme-like units in the phonetic transcription of the utterances are represented by dedicated hidden Markov models (HMMs) and segmentation is performed by aligning the speech signal against the sequence of HMMs representing the words. The specific advantage of the method presented here is that it does not need manually segmented speech material to initialize the training of the HMMs. Therefore, it can be regarded as an improved variant of established techniques for automatic segmentation. The problem of proper initialization of the HMMs without resorting to manually segmented material is solved by a hierarchical approach consisting of three successive steps. In the first step a segmentation in broad phonetic classes is realized that provides anchor points for the second stage, consisting of a sequence-constrained vector quantization. In this stage each broad phonetic class is further segmented into its constituent phonemes. The result is a crude phonetic segmentation which is then used as initialization of the HMMs in the last stage. Fine-tuning of the models is realized via Baum-Welch estimation. The final segmentation is obtained by Viterbi alignment of the utterances against the HMMs. This hierarchical approach was used to segment a database of isolated words recorded from a male speaker. An accuracy of 89.51{\%} was obtained in the location of the phoneme boundaries with a tolerance of 20 ms.",
keywords = "speech segmentation, hidden Markov models (HMM), vector quantization, AMERICAN ENGLISH, RECOGNITION, DURATION",
author = "S Pauws and Y Kamp and L Willems",
year = "1996",
month = "9",
language = "English",
volume = "19",
pages = "207--220",
journal = "Speech Communication",
issn = "0167-6393",
publisher = "Elsevier Science BV",
number = "3",
}

A hierarchical method of automatic speech segmentation for synthesis applications. / Pauws, S; Kamp, Y; Willems, L.

In: Speech Communication, Vol. 19, No. 3, 09.1996, p. 207-220.

TY - JOUR
T1 - A hierarchical method of automatic speech segmentation for synthesis applications
AU - Pauws, S
AU - Kamp, Y
AU - Willems, L
PY - 1996/9
Y1 - 1996/9
AB - The paper describes a method for automatically segmenting a database of isolated words as required for the purpose of speech synthesis. The phoneme-like units in the phonetic transcription of the utterances are represented by dedicated hidden Markov models (HMMs) and segmentation is performed by aligning the speech signal against the sequence of HMMs representing the words. The specific advantage of the method presented here is that it does not need manually segmented speech material to initialize the training of the HMMs. Therefore, it can be regarded as an improved variant of established techniques for automatic segmentation. The problem of proper initialization of the HMMs without resorting to manually segmented material is solved by a hierarchical approach consisting of three successive steps. In the first step a segmentation in broad phonetic classes is realized that provides anchor points for the second stage, consisting of a sequence-constrained vector quantization. In this stage each broad phonetic class is further segmented into its constituent phonemes. The result is a crude phonetic segmentation which is then used as initialization of the HMMs in the last stage. Fine-tuning of the models is realized via Baum-Welch estimation. The final segmentation is obtained by Viterbi alignment of the utterances against the HMMs. This hierarchical approach was used to segment a database of isolated words recorded from a male speaker. An accuracy of 89.51% was obtained in the location of the phoneme boundaries with a tolerance of 20 ms.

KW - speech segmentation
KW - hidden Markov models (HMM)
KW - vector quantization
KW - AMERICAN ENGLISH
KW - RECOGNITION
KW - DURATION
M3 - Article
VL - 19
SP - 207
EP - 220
JO - Speech Communication
JF - Speech Communication
SN - 0167-6393
IS - 3
ER -