Overview
HTS (the HMM-based Speech Synthesis System) is a statistical text-to-speech system that models speech with HMMs and then uses a vocoder to synthesize the waveform. It is compact, adaptable, and highly intelligible, though less natural-sounding than modern neural TTS.
Description
HTS represents speech as sequences of context-dependent HMM states that generate spectral features, F0, and durations. Decision trees cluster the rich linguistic contexts, and maximum likelihood training estimates the parameters of each stream. At runtime the system predicts state durations, concatenates the corresponding state distributions, and applies parameter generation with dynamic (delta) features and global variance compensation to produce smooth acoustic trajectories. A vocoder such as the MLSA filter or STRAIGHT then renders the final waveform.

Because the model is statistical and has a small footprint, it supports speaker adaptation with MLLR or MAP, multilingual voices, and flexible style control from limited data. Output is clear and intelligible, but timbre and prosody can sound buzzy or over-smoothed compared with neural models. HTS remained a workhorse for embedded devices and research until neural TTS became practical at scale.
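To illustrate the parameter-generation step, here is a minimal sketch of maximum likelihood parameter generation (MLPG) for a single feature dimension, assuming diagonal covariances and a simple delta window; the function name, window coefficients, and example data are illustrative, not taken from the HTS toolkit. It solves the weighted least-squares system W' S^-1 W c = W' S^-1 mu that relates the static trajectory c to the predicted static and delta means.

```python
# Minimal MLPG sketch for one feature dimension (illustrative, not HTS code).
import numpy as np

def mlpg(static_mean, static_var, delta_mean, delta_var):
    """Solve (W' S^-1 W) c = W' S^-1 mu for the static trajectory c."""
    T = len(static_mean)
    # Stack per-frame means and precisions for the static and delta streams.
    mu = np.concatenate([static_mean, delta_mean])               # (2T,)
    prec = np.concatenate([1.0 / static_var, 1.0 / delta_var])   # diagonal S^-1

    # Window matrix W maps the static trajectory c to [c; delta(c)].
    W = np.zeros((2 * T, T))
    W[:T, :] = np.eye(T)                  # identity window for static features
    for t in range(T):                    # delta window (-0.5, 0, +0.5)
        if t > 0:
            W[T + t, t - 1] = -0.5
        if t < T - 1:
            W[T + t, t + 1] = 0.5

    # Weighted least squares: a trajectory consistent with both streams.
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)

# Example: noisy static means are smoothed toward consistency with
# delta means of zero, which favours a slowly varying trajectory.
rng = np.random.default_rng(0)
static = np.linspace(0.0, 1.0, 50) + 0.1 * rng.standard_normal(50)
c = mlpg(static, np.full(50, 0.1), np.zeros(50), np.full(50, 0.01))
print(c[:5])
```

Because the delta rows of W couple neighbouring frames, the solved trajectory varies smoothly even when the per-state means are piecewise constant, which is how HTS avoids audible discontinuities at state boundaries.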