Diphone
A diphone is a unit of speech that spans the transition from one phoneme to the next. It runs from roughly the midpoint of one phoneme to the midpoint of the following phoneme, so it contains the second half of the first and the first half of the second. Diphones are used in speech synthesis to produce more natural-sounding speech because they capture coarticulation, the way the articulation of one phoneme influences the articulation of its neighbors. The cut points also fall in the relatively stable middle of each phoneme rather than in the rapidly changing transition between phonemes.
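The decomposition described above can be sketched in a few lines of Python. This is only an illustration: the function name `to_diphones`, the `a-b` label format, and the use of `_` as a silence symbol at the start and end of the utterance are assumptions made for the example, not an established convention.

```python
def to_diphones(phonemes, silence="_"):
    """Return the diphone labels for a sequence of phoneme symbols.

    The sequence is padded with a silence symbol (an assumption of this
    sketch) so that the first and last phonemes also get a transition.
    """
    padded = [silence] + list(phonemes) + [silence]
    # Each diphone pairs a phoneme with the one that follows it.
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


# The word "cat" as three phoneme symbols (ARPABET-like, for illustration).
print(to_diphones(["k", "ae", "t"]))
# → ['_-k', 'k-ae', 'ae-t', 't-_']
```

With silence padding, a sequence of n phonemes yields n + 1 diphones, one for each boundary including the utterance edges.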
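Diphone synthesis involves concatenating pre-recorded or synthesized diphones to form words and sentences. A diphone inventory aims to cover every phoneme-to-phoneme transition that occurs in a given language, so its size is bounded by roughly the square of the number of phonemes. For English, with about 40 phonemes, the upper bound is around 1,600 pairs (somewhat more if silence is counted as a unit), and not every pair occurs in practice.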
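As a rough check on inventory size, the following sketch enumerates every ordered pair over a phoneme set plus a silence symbol. The function name `diphone_inventory`, the silence symbol `_`, and the figure of 40 phonemes are assumptions for illustration; real inventories omit pairs that never occur in the language.

```python
from itertools import product


def diphone_inventory(phonemes, silence="_"):
    """Enumerate all ordered phoneme pairs, treating silence as a unit.

    This is an upper bound: it does not filter out pairs that the
    language never uses. The silence-to-silence pair is skipped.
    """
    units = list(phonemes) + [silence]
    return [
        f"{a}-{b}"
        for a, b in product(units, repeat=2)
        if not (a == silence and b == silence)
    ]


# Toy set of three phonemes: (3 + 1)^2 - 1 = 15 candidate diphones.
print(len(diphone_inventory(["k", "ae", "t"])))  # → 15

# Placeholder symbols standing in for an assumed 40-phoneme language.
placeholder_set = [f"p{i}" for i in range(40)]
print(len(diphone_inventory(placeholder_set)))  # → 1680
```

For n phonemes the count is (n + 1)^2 - 1, which shows why the inventory grows quadratically while a plain phoneme set grows only linearly.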
The advantage of using diphones over individual phonemes is that coarticulation is built into each unit, which gives smoother and more realistic speech output. The cost is an inventory far larger than a phoneme set, which takes more storage and more recording effort. Choosing the cut points is also difficult: the units on either side of a join usually come from different recordings, so mismatches in pitch, amplitude, or spectrum can produce audible discontinuities. Techniques such as smoothing and spectral modification are often employed to minimize these discontinuities at diphone boundaries.
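One very simple form of smoothing is a linear crossfade in the time domain, sketched below on plain lists of samples. This is a stand-in for illustration and not a spectral method; the function name `crossfade_concat` and the `overlap` parameter are assumptions of the example, and a practical synthesizer would also need to match pitch and amplitude across the join.

```python
def crossfade_concat(units, overlap):
    """Concatenate sample lists, crossfading `overlap` samples at each join.

    Within the overlap region the outgoing unit fades out linearly while
    the incoming unit fades in, which avoids an abrupt jump in amplitude.
    """
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        # Never overlap more samples than either side actually has.
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming unit
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        out.extend(unit[n:])
    return out


# Two toy "diphones": constant signals at different levels.
a = [1.0, 1.0, 1.0, 1.0]
b = [0.0, 0.0, 0.0, 0.0]
joined = crossfade_concat([a, b], overlap=2)
print(len(joined))  # → 6
```

Each join shortens the output by `overlap` samples, and setting `overlap=0` reduces the function to plain concatenation, which makes it easy to compare the smoothed and unsmoothed results.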