by: Sieb Nooteboom
Warning:
The highlighted literature references in the text can be opened by clicking on them. Be warned, however: they are all optical scans of old documents, and each may take a few minutes to download. Audio demonstrations and the corresponding texts can be opened by clicking on the highlighted words text or audio below the subsections. A warning is in order here too: some audio players start playing before the whole file has been downloaded, in which case a demonstration may be audibly interrupted for brief periods. If you immediately play the same demonstration a second time, it will normally be heard without interruption.
Introduction
This webpage contains a text describing developments in the area of speech synthesis in the former Institute for Perception Research in Eindhoven, in a period of roughly three decades, running from the early sixties to the early nineties of the twentieth century. The text also provides, via the literature references in the text, a number of hyperlinks to old documents describing particular developments in the corresponding period of time. Under most subheadings the highlighted word DEMO can be found, providing a hyperlink to audio demonstrations of the system to be discussed plus the corresponding texts.
The Institute for Perception Research (IPO) was set up in 1957, in the form of a legally independent foundation, by Philips Electronics, the University of Technology in Eindhoven, and the Dutch Organisation for Fundamental Research (ZWO). IPO ceased to exist as an independent organisation in 2001. From the beginning, the main research areas were the psychophysics of vision and audition, and ergonomics. Speech research started very soon afterwards.
IPOVOX 1
Speech research was initiated when Anthony Cohen joined the institute in 1959. The first attempts towards the synthesis of speech were perhaps rather primitive, but certainly bold (Cohen and ‘t Hart 1963, Cohen, Schouten and ‘t Hart 1962). Inspired by earlier listening experiments with gated-out segments of natural speech, a synthesising device was developed, later referred to as IPOVOX 1. This device had two sound sources: an adjustable multivibrator for periodic sounds and a noise source for aperiodic sounds. Spectral properties of each sound were approximated with only two formant regions. Only steady-state segments could be synthesised. Connected speech was simulated by carefully shaping the temporal aspects of each steady-state segment. This was achieved with an adjustable monostable multivibrator, which made it possible to adjust the temporal envelope of the segments with RC circuits. Amplitude rise time (t1), rise time plus time before decay (t2), and decay time (t3) could be adjusted separately with simple exponential functions. Transitions between successive segments were approximated without spectral transitions, simply by temporal overlap between the amplitude decay time of one segment and the rise time of the following segment. Stop consonants like /t/, /p/ and /k/ were made from the same sounds as the fricatives /s/, /f/ and /X/ respectively, by shortening those considerably and preceding them with silent intervals. Diphthongs were made from two steady-state segments without any spectral glide, but with temporally overlapping envelopes. In this way it was convincingly demonstrated that, if sufficient care is taken in shaping the temporal structure of speech, intelligible speech can be achieved by concatenating phoneme-sized steady-state segments.
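The envelope shaping and overlap described above can be illustrated with a small sketch. It is not a model of the actual IPOVOX 1 circuitry: the function names, the choice of time constants (t1/3 and t3/3, so that rise and decay are about 95 % complete after t1 and t3), and the rule that the next segment starts when the previous one begins to decay are all assumptions made for illustration.

```python
import math

def envelope(t, t1, t2, t3):
    """Amplitude (0..1) of one steady-state segment at time t in seconds.

    t1: rise time, t2: rise time plus time before decay, t3: decay time.
    Assumption: RC-style exponentials with time constants t1/3 and t3/3.
    """
    if t < 0.0:
        return 0.0
    if t <= t2:
        return 1.0 - math.exp(-3.0 * t / t1)
    peak = 1.0 - math.exp(-3.0 * t2 / t1)
    return peak * math.exp(-3.0 * (t - t2) / t3)

def segment_gains(t, segments):
    """Gain of every segment at time t.

    segments is a list of (t1, t2, t3) tuples. Assumption: each segment
    starts at the moment the previous one begins to decay, so the decay of
    one segment overlaps the rise of the next.
    """
    gains, onset = [], 0.0
    for t1, t2, t3 in segments:
        gains.append(envelope(t - onset, t1, t2, t3))
        onset += t2
    return gains

# Two hypothetical segments; 10 ms after the first starts to decay,
# both segments are audible at the same time.
print(segment_gains(0.11, [(0.02, 0.10, 0.03), (0.02, 0.08, 0.03)]))
```

Each gain would multiply the output of the corresponding steady-state sound; no spectral transition is involved, only the overlap in time.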
DEMO IPOVOX 1
If you want to know how intelligible the speech is when you do not know the text, play the audio-file first.
Dutch texts of the audio demonstrations of IPOVOX 1:
- "Schouten"
- "schaven"
- "schavuit"
- "Hitchcock"
- "Fokkema"
- "postcheque"
- "kunt U mij verstaan?" (English: "Can you understand me?")
- "ik heet IPOVOX" (English: "My name is IPOVOX")
- "Onze gegevens zijn afgeleid van geïsoleerde woordjes. Het gevolg is dat de zinsrhythmiek vreemd aandoet. Over intonatie is ons te weinig bekend om tot een werkelijk aanvaardbaar resultaat te kunnen komen." (English: "Our data were derived from isolated short words. As a result, the sentence rhythm sounds strange. We know too little about intonation to arrive at a truly acceptable result.")
The Intonator
One of the many problems with IPOVOX 1 was the lack of natural sentence melodies. This inspired Anthony Cohen and Hans ‘t Hart in particular to turn to the study of intonation. To this end an instrument was developed called the INTONATOR (Willems 1966b), based on a common channel vocoder, which made it possible to remove the original natural pitch contour from the segmental layer of speech and replace it with a stylised artificial pitch contour. This gave researchers the possibility to go back and forth between perceived sentence melodies and their physical correlates in a systematic way. So began the so-called “Dutch school of intonation”, later leading to “intonation grammars” for Dutch (Cohen, ‘t Hart and Collier 1967, 1990), English (De Pijper 1983), Russian (Odé 1989) and German (Adriaens 1991). The discrete units of such grammars were physically well-defined pitch movements, and the rules of the grammars determined which sequences of these units were grammatical and which were ungrammatical. The approach was greatly stimulated by the early finding that natural pitch contours could be highly stylised and simplified without any change in perceived melody. Further standardisation and stylisation were possible, such that the artificial pitch contour was perhaps not perceptually identical to its original, but was melodically equivalent to it. This made it possible to set up relatively simple grammars of intonation that could be applied in later systems for synthesis-by-rule.
DEMO INTONATOR
For the text spoken by the INTONATOR click here: TEXTS
For the re-synthesized speech spoken by the INTONATOR play the audio-file.
Dutch texts of the audio demonstrations of the IPO INTONATOR from the year 1968:
- "De temperatuur werd automatisch constant gehouden" (English: "The temperature was automatically kept constant")
  - natural speech
  - monotonous
  - monotonous with declination
- "We gaan vanavond naar de bioscoop" (English: "We are going to the cinema tonight")
  - natural speech
  - monotonous
  - monotonous with declination
  - with a “hat pattern” made by a rise on “Avond” followed by a high declination, plus a fall on “biosCOOP”
  - with a “pointed hat” on “biosCOOP”
  - with a “pointed hat” on “Avond”
- first sentence, with a “hat pattern” made by a rise on “TUUR” (temperaTUUR), followed by a high declination and a fall on “STANT” (conSTANT)
IPOVOX 2
The speech-synthesis system IPOVOX 1 was useful in drawing attention to the extreme importance of temporal details in the production and perception of speech. But the quality of the synthesised speech left much to be desired. A new attempt, still in the pre-computer era of IPO, led to IPOVOX 2 (Willems 1966a). In this approach 3 vowel formants were simulated with electronic band filters having adjustable peak frequencies and fixed bandwidths. There were separate noise formant filters. Again there were two sound sources, a buzz source for voiced segments and a noise source for unvoiced segments. The sound sources could be mixed for voiced fricatives. The instrument had an internal memory for 5 successive segments, an intonation contour generator, a transition generator for formant transitions between segments, and an envelope generator for the control of precise amplitude variations. In the description provided in Willems (1966a) the parameters were controlled from a read-in desk with a column of push-buttons for each parameter (see the photograph in Willems 1966a). Later IPOVOX 2 got a second input in the form of a special-purpose ferrite core electronic memory (a Philips C4 magnetic core storage unit). The storage capacity of this external memory was 256 memory words, and because two memory words were needed for each segment, this made it possible to synthesise utterances of 127 segments. The memory could be fed from the read-in desk of IPOVOX 2, but also from a punch tape reader. The latter feature made it possible to prepare the punch tape with a computer program for speech-synthesis-by-rule written by Slis and Muller (1971) and running on a P9202 digital computer. An example of how this set-up was used in experiments on the perception of speech is provided in Nooteboom (1972, chapter 3).
DEMO IPOVOX 2
If you want to know how intelligible the speech is when you do not know the text, play the audio-file.
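Dutch texts of the audio demonstrations of IPOVOX 2:
- “communicatie”
- “electronisch”
- “biofysica”
- “Eindhoven”
- “kunt U mij verstaan?” (English: “Can you understand me?”)
- “het IPO heeft open huis” (English: “IPO is holding an open house”)
- “in het IPO werken Philips en TH samen” (English: “At IPO, Philips and the TH, the University of Technology, work together”)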
IPOVOX 3
A third attempt at IPO to improve speech synthesis resulted in what we here, for the sake of convenience, refer to as IPOVOX 3 (Nooteboom, Slis and Willems 1973, Slis, Nooteboom and Willems 1977). In this new system, designed mainly as a research tool, the hardware was confined to the actual signal generator, a Digital Speech Synthesiser (Rockland Systems Model 4512), driven from a general-purpose digital computer (Philips P9202, 16 bits, 16 K). The Rockland synthesiser needed fresh information on all parameter values at each period of the fundamental frequency (“glottal period”). The rule system was segment based, although all long vowels and diphthongs were considered to consist of two segments. For each Dutch phoneme standard values were stored in a table. These values did not necessarily correspond to actual manifestations of these phonemes, but were chosen such that the rule system for changing them according to context was optimised. Adjustable parameters were 5 formant frequencies, a nasal formant, and a noise formant, all with corresponding bandwidths, amplitude of voice, noise and hiss, pitch level, and segment duration. Each parameter was assigned three values, viz. a target value, a time constant controlling the transition time from the previous value, and a timing value controlling the onset of change with respect to an abstract phoneme boundary. All transitions were made as straight-line interpolations between one value and the next in a domain of time versus an appropriate physical dimension. The rule system was made as modular as possible, to the extent that many subroutine rules could be replaced without changing the rest of the system.
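How the three values per parameter could combine into a parameter track is sketched below. This is only an illustration under stated assumptions: the function name, the tuple layout, the example numbers, and the interpolation in plain hertz are hypothetical, and transitions are assumed not to overlap one another.

```python
def value_at(t, start_value, events):
    """Value of one synthesis parameter at time t (seconds).

    events is a time-ordered list of tuples
        (boundary, target, transition, offset)
    where boundary is the abstract phoneme boundary, target the new target
    value, transition the duration of the straight-line transition, and
    offset the onset of change relative to the boundary (negative = early).
    Assumption: each transition has finished before the next one starts.
    """
    value = start_value
    for boundary, target, transition, offset in events:
        begin = boundary + offset
        if t <= begin:
            break                      # this change has not started yet
        if transition <= 0.0 or t >= begin + transition:
            value = target             # transition completed
        else:
            # straight-line interpolation between previous value and target
            value += (target - value) * (t - begin) / transition
            break
    return value

# Hypothetical first-formant track: 300 Hz, then 700 Hz, then 400 Hz.
f1_events = [(0.10, 700.0, 0.04, -0.02), (0.25, 400.0, 0.05, 0.0)]
for t in (0.05, 0.10, 0.20, 0.275, 0.40):
    print(t, round(value_at(t, 300.0, f1_events), 1))
```

In a system driven once per glottal period, a function like this would be evaluated for every parameter at the start of each period.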
DEMO IPOVOX 3
If you want to know how intelligible the speech is when you do not know the text, play the audio-file.
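Dutch texts of the audio demonstrations of IPOVOX 3:
- “fototoestel”
- “opmerken”
- “dit is synthese volgens regels” (English: “This is synthesis by rule”)
- “kunt U mij verstaan?” (English: “Can you understand me?”)
- “IPO betekent Instituut voor Perceptie Onderzoek” (English: “IPO stands for Institute for Perception Research”)
- “Ik ga u een klein verhaaltje vertellen. Vorige week liep een man in gedachten verzonken langs de straat een liedje te fluiten. Hij keek daarbij niet goed uit en struikelde daardoor over een stoeprandje. Door een stom toeval kon ie onder een geparkeerde auto kijken waar hij een al lang vermist document vond. De eigenaren keerden hem een ruime beloning uit”. (English: “I am going to tell you a little story. Last week a man, lost in thought, was walking along the street whistling a tune. He was not watching where he was going and so tripped over a kerb. By sheer coincidence he could look under a parked car, where he found a document that had long been missing. The owners paid him a generous reward.”)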
The Formator
A system for the analysis and re-synthesis of speech using Linear Predictive Coding
A major development in the history of speech synthesis, world-wide but also in IPO, was the advent of systems for the analysis, manipulation and re-synthesis of speech based on Linear Predictive Coding (LPC). Such a system, deriving formants from LPC coefficients, was developed in IPO by Vogten and Willems (1977). LPC analysis is applied to the speech signal in the order of 100 times per second, each time on the successive amplitude values of samples of the speech wave form within a time window in the order of 20 or 25 ms. Thus successive time windows overlap. LPC analysis consists of the calculation of an Mth-order digital filter, the coefficients of which are determined by minimising the mean squared error between the actual input sample and an Mth-order linear prediction of the input sample. The digital filter captures the slowly varying aspects of the speech wave form that stem from movements of the articulatory organs. Regularly occurring strong increases in the error function stem from the periodic source in voiced speech. From the M coefficients the speech wave can be re-synthesised as the output of the inverse filter with the same M coefficients, excited by either periodic pulses, for voiced fragments, or noise, for aperiodic fragments.
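The analysis and re-synthesis steps can be sketched as follows. This is a minimal illustration, not the IPO implementation: it uses the autocorrelation method with the Levinson-Durbin recursion, which is one common way to minimise the mean squared prediction error, and all function names are hypothetical. Windowing, pre-emphasis and gain handling are left out.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame."""
    return [sum(frame[n] * frame[n - k] for n in range(k, len(frame)))
            for k in range(max_lag + 1)]

def lpc(frame, order):
    """Return (a, err): coefficients a[0..order] with a[0] = 1 and the
    residual energy. The prediction error is e[n] = sum_k a[k] * x[n-k].
    Assumption: the frame is not silent (r[0] > 0)."""
    r = autocorrelation(frame, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err

def synthesise(a, excitation):
    """All-pole filter: y[n] = e[n] - sum_{k>=1} a[k] * y[n-k].
    The excitation is a pulse train (voiced) or noise (unvoiced)."""
    out = []
    for n, e in enumerate(excitation):
        y = e - sum(a[k] * out[n - k] for k in range(1, len(a)) if n - k >= 0)
        out.append(y)
    return out

# Round trip on a hypothetical 2nd-order resonator excited by one pulse.
true_a = [1.0, -1.2, 0.8]
signal = synthesise(true_a, [1.0] + [0.0] * 399)
estimate, residual = lpc(signal, 2)
print([round(c, 4) for c in estimate])
```

For speech, the same `lpc` call would be made with order 10 on each 20 to 25 ms window, about 100 times per second.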
The digital filter used in the FORMATOR was characterised by 10 filter coefficients. This filter was then analysed into a product of 5 quadratic terms, thus giving 5 digital filters each with 2 coefficients p and q. The 5 pairs of p and q coefficients could be calculated numerically. Unfortunately, the p and q values could not always be easily interpreted. For example they could not easily be ordered on a frequency scale and did not always correspond to complex poles. Therefore a slightly modified analysis procedure was designed (Willems, 1987) such that the 5 digital filters were ordered on a frequency scale and that these filters always corresponded to complex pole pairs. After this procedure the five digital filters could be interpreted as representing five formants, each with a particular peak frequency and a particular band width.
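The interpretation of one quadratic term as a formant can be made concrete with the standard pole-to-resonance relations. The sketch assumes each term has the form 1 + p z^-1 + q z^-2 and a 10 kHz sampling rate; both the sign convention and the sampling rate are assumptions, and the function names are hypothetical. The `None` case corresponds to the situation described above in which a term does not have a complex pole pair.

```python
import math

def section_to_formant(p, q, sample_rate):
    """Map one quadratic term 1 + p*z^-1 + q*z^-2 to (frequency, bandwidth)
    in Hz, or return None if its poles are real (no formant interpretation)."""
    if p * p >= 4.0 * q:
        return None
    radius = math.sqrt(q)                        # pole magnitude
    angle = math.acos(-p / (2.0 * radius))       # pole angle in radians
    frequency = angle * sample_rate / (2.0 * math.pi)
    bandwidth = -math.log(radius) * sample_rate / math.pi
    return frequency, bandwidth

def formant_to_section(frequency, bandwidth, sample_rate):
    """Inverse mapping, useful when formants are manipulated before re-synthesis."""
    radius = math.exp(-math.pi * bandwidth / sample_rate)
    p = -2.0 * radius * math.cos(2.0 * math.pi * frequency / sample_rate)
    return p, radius * radius

def ordered_formants(sections, sample_rate):
    """Formants of all sections with complex poles, ordered by frequency."""
    found = [section_to_formant(p, q, sample_rate) for p, q in sections]
    return sorted(f for f in found if f is not None)

p, q = formant_to_section(500.0, 80.0, 10000.0)
print(section_to_formant(p, q, 10000.0))
```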
In the FORMATOR the discontinuities in the error function obtained when the 10 coefficients of the digital filter were calculated were not used for estimating the original source function. Better results were obtained by using a separate procedure for estimating the presence or absence of periodicity and the pitch period. This procedure was applied every 10 ms on a window of 35 ms, so that at least 2 pitch periods were present in the wave form. In each time window the spectrum was flattened by a dynamic centre-clipping procedure, after which an auto-sign-correlation function was calculated. One of the maxima of this function, viz. the one that was found within a specific interval, was taken to be the pitch period. The specific interval was specified on the basis of the history of the analysis, thus among other things minimising the probability of octave errors.
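The following sketch shows one plausible reading of this pitch procedure. The text does not define the centre-clipping rule or the auto-sign-correlation in detail, so the clipping fraction, the voicing threshold, and the function names are assumptions. The search interval is passed in by the caller, standing in for the interval derived from the history of the analysis.

```python
import math

def centre_clip_sign(frame, fraction=0.6):
    """Three-level centre clipping: +1 / -1 where the signal exceeds a
    threshold set per window ("dynamic"), 0 elsewhere.
    Assumption: the threshold is a fixed fraction of the window's peak."""
    threshold = fraction * max(abs(x) for x in frame)
    return [1 if x > threshold else -1 if x < -threshold else 0 for x in frame]

def pitch_period(frame, sample_rate, min_period, max_period, voicing=0.3):
    """Return the pitch period in seconds, or None if the frame is judged
    aperiodic. Only lags between min_period and max_period are searched."""
    s = centre_clip_sign(frame)
    energy = sum(abs(v) for v in s)
    if energy == 0:
        return None
    lo = max(1, round(min_period * sample_rate))
    hi = min(len(s) - 1, round(max_period * sample_rate))
    best_lag, best = None, 0
    for lag in range(lo, hi + 1):
        # correlation of the sign signal with a delayed copy of itself
        c = sum(s[n] * s[n - lag] for n in range(lag, len(s)))
        if c > best:
            best_lag, best = lag, c
    if best_lag is None or best / energy < voicing:
        return None
    return best_lag / sample_rate

# A 35 ms window of a hypothetical 100 Hz tone sampled at 10 kHz.
frame = [math.sin(2.0 * math.pi * 100.0 * n / 10000.0) for n in range(350)]
print(pitch_period(frame, 10000.0, 0.008, 0.0125))
```

Restricting the search to an interval around the recently found period is what keeps a maximum at twice or half the true period from being picked.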
This procedure led to a set of parameters that could be used, for example, to drive a Rockland 4512 digital synthesiser. However, it soon became common practice to re-synthesise speech in software, using the inverse digital filter, and to calculate the formants only when this was necessary for specific purposes, for example extreme data reduction, with the reduced parameters used to drive a formant-based speech synthesis chip, the MEA8000 (see below). Another reason for deriving formants could be the wish to manipulate formant structure before re-synthesis. A full description of the method for analysis, manipulation, and re-synthesis of speech is given in Vogten (1983, see also ‘t Hart, Nooteboom, Vogten and Willems, 1982). Vogten (1983) also provides extensive perceptual tests on the quality of the resulting speech.
The system for the analysis, manipulation and re-synthesis of speech described here, although basically remaining the same, was improved in many details in later years. The system has been an important research tool in studying the perception of speech prosody, because it allowed manipulation of pitch fluctuation and temporal structure in high-quality, natural-sounding speech. Examples of studies using this system or one of its immediate successors for perceptual investigations are De Rooij (1979), Brokx (1979), Willems (1982), De Pijper (1983), Kruyt (1985), Terken (1985), Odé (1989), Adriaens (1991), Blaauw (1995), Sanders (1996), Sanderman (1996). This system also supplied the main tool for later developments in the area of speech synthesis from diphones (see below).
Audio demonstrations and texts resynthesised with the FORMATOR
- Original recording (Dutch): "Wij proberen de spraak te begrijpen en te beheersen". (English: "We try to understand and control speech".)
- Copy-synthesis of the same spoken sentence.
- Attempt to make a woman's voice 1: Re-synthesis with pitch made an octave higher.
- Attempt to make a woman's voice 2: Re-synthesis with normal pitch and all formants 25 % higher.
- Attempt to make a woman's voice 3: Re-synthesis with both increased pitch and higher formants.
- Completely monotonous speech.
- Twice as fast.
- Twice as slow.
- Copy-synthesis of original recording (Dutch): "Dat is niet waar de apparaten spreken." (English: "That is not where the devices speak.")
- Adapted version of same recording (Dutch): "Dat is niet waar, de apparaten spreken." (English: "That is not true, the devices speak.")
The MEA8000 Speech Synthesis Chip
The MEA8000 speech synthesis chip was developed in the early eighties by Philips ELCOMA on the basis of an earlier “voice response unit” (van Essen and Willems, 1978). This voice response unit made it possible to generate a speech signal from a highly parsimonious code (1 kbit/second), stored on a programmable read-only memory (PROM). The MEA8000 was basically an integrated version of this voice response unit, generating a speech signal with 4 resonance filters. It needed parameter values for frame duration, periodic vs aperiodic sound source, pitch, amplitude, and 4 formant frequencies. It contained an 8-bit digital-to-analog converter plus a table for converting the 32 bits of each synthesis frame into the required synthesis parameters. It also contained an interface to the most popular microprocessors and PROMs of the time. The chip was in production and was sold by Philips Electronics for a number of years. In 1988 the MEA8000 got a successor, the PCF8200, which had some essential improvements such as a 5 kHz bandwidth, an adjustable bit rate ranging from 450 to 4550 bits/second, 5 formants instead of 4, selectable male and female parameter tables, CMOS technology, and an 11-bit digital-to-analog converter.
DEMO MEA8000
If you want to know how intelligible the speech is when you do not know the text, play the audio-file first.
Texts belonging to the audio demonstrations of the Philips ELCOMA speech synthesis chip MEA8000.
This is very parsimoniously coded re-synthesized natural speech.
- (English:) “You are listening to the ELCOMA talking chip MEA8000”.
- (English:) “One, two, three, four”.
- (French:) “Un, deux, trois, quatre”.
- (German:) “Eins, zwei, drei, vier”.
- (Dutch:) “U luistert naar de sprekende chip, ontwikkeld door IPO, Nat. lab. en ELCOMA”. (English: “You are listening to the talking chip, developed by IPO, Nat. Lab. and ELCOMA”.)
- (Dutch:) “Deze spraak is honderd keer zuiniger gecodeerd dan normaal”. (English: “This speech is coded a hundred times more economically than normal”.)
- (Dutch:) “Als u na het verlaten van deze kamer de eerste gang rechts neemt, vindt u daar de kantine waar u een kopje koffie wordt aangeboden”. (English: “If, after leaving this room, you take the first corridor on the right, you will find the canteen there, where you will be offered a cup of coffee”.)
Towards a system for the synthesis of Dutch from diphones
In IPOVOX 3 speech was synthesised starting from a set of fixed parameter values for each of the phonemes of the language. All spectral changes in the synthesised speech had to be made by rule. Rules had to be found by closely inspecting recordings of natural speech and generalising from there, and also by trial and error. This approach has the great advantage that, if one succeeds in synthesising reasonable speech, it is clear that this reflects intimate knowledge of and insight into the properties and regularities of phoneme-sized segments in real speech. However, the approach has the great disadvantage that it may take enormous effort and a long period of time before anything like convincing, natural-sounding and intelligible synthesised speech is achieved. Moreover, research interests in IPO were oriented more towards speech prosody than towards detailed properties of phoneme-sized segments.
No wonder, then, that as soon as an LPC-based system for analysis, manipulation and re-synthesis of speech had become available, an attempt was started to synthesise speech from analysed fragments of natural speech, in which the intricate transitions between successive speech sounds were pre-compiled. In order to restrict the number of necessary building blocks for synthesis, the choice was made to use diphones instead of syllables, demisyllables, whole morphemes or units of arbitrary length (Elsendoorn and 't Hart 1982, Elsendoorn 1984). Diphones in this context were fragments of speech running from somewhere in one acoustic phoneme-sized speech segment to somewhere in the following phoneme-sized speech segment. These fragments of speech were excised from natural human speech, after analysis with the FORMATOR, so that speech properties could later be changed by rule if necessary. Each diphone consisted of a number of successive frames, each frame being specified for 5 formant values plus bandwidths, periodic or aperiodic sound source, and relative amplitude. Pitch was omitted, because it was to be replaced in synthesis by artificial values. Note that duration could easily be changed by rule by changing the frame durations. To this end each diphone had a “flag” at the boundary between the two partial phonemic segments, so that after concatenation durational rules could be applied to the stretches of speech between flags.
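The role of the flags can be illustrated with a small data-structure sketch. The class names, field names and the uniform scaling rule are hypothetical; the actual frame format and durational rules are not specified in the text above. Formant data are reduced to a placeholder.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    duration_ms: float
    formants: tuple          # placeholder for formant and amplitude data

@dataclass
class Diphone:
    name: str
    frames: list             # list of Frame
    flag: int                # index of the first frame after the phoneme boundary

def concatenate(diphones):
    """Join diphones into one frame sequence; return (frames, flags), where
    flags holds the positions of the phoneme boundaries in that sequence.
    Frames are copied so that the stored diphones stay unchanged."""
    frames, flags = [], []
    for d in diphones:
        flags.append(len(frames) + d.flag)
        frames.extend(Frame(f.duration_ms, f.formants) for f in d.frames)
    return frames, flags

def set_segment_duration(frames, flags, segment, target_ms):
    """Durational rule applied between two successive flags: scale all frame
    durations of that stretch so that they add up to target_ms."""
    stretch = frames[flags[segment]:flags[segment + 1]]
    scale = target_ms / sum(f.duration_ms for f in stretch)
    for f in stretch:
        f.duration_ms *= scale

# Hypothetical diphones /a-t/ and /t-e/ with 10 ms frames.
at = Diphone("a-t", [Frame(10.0, ()) for _ in range(5)], flag=3)
te = Diphone("t-e", [Frame(10.0, ()) for _ in range(4)], flag=2)
frames, flags = concatenate([at, te])
set_segment_duration(frames, flags, 0, 60.0)   # lengthen the /t/ to 60 ms
print(flags, sum(f.duration_ms for f in frames))
```

The stretch between the two flags is the complete /t/: its first half comes from the end of one diphone and its second half from the start of the next.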
Initial attempts showed that results could be improved considerably by excising CV diphones such that the vowel portion always had the same duration for the same vowel phoneme, and VC diphones such that the duration of the vowel portion depended on the following consonant. Irregularities in parameter values from frame to frame also had to be removed, and care had to be taken that there were no conspicuous discontinuities in parameter values at the boundaries between successive diphones, most particularly for boundaries within vowel segments. Over the years the speech quality for a specific set of diphones was improved considerably, simply by correcting disturbing errors whenever they were found. The full set of diphones contained diphones from each Dutch phoneme/allophone to each other Dutch phoneme/allophone. There were separate diphones from silence to each initial phoneme/allophone, and from each final phoneme/allophone to silence. There were also separate diphones from glottal stop to each vowel and from each vowel to glottal stop. These glottal stops were introduced to obtain more natural word boundaries following and/or preceding a vowel. Finally, a set of triphones was developed to obtain satisfactory results for the intervocalic /h/. This was necessary because the /h/ appeared in all productions to be so strongly coloured by the preceding and following vowel sound that the diphone approach failed. The full set of diphones/triphones in the end counted over 2000 units. The relatively rapid success of the diphone approach stimulated further research on text-to-speech systems.
Audio demonstrations of an early attempt to synthesise speech from Dutch diphones
- Isolated diphones for the Dutch word "attentie".
- Concatenated diphones for the Dutch word "attentie".
- Concatenated diphones for the Dutch word "attentie" with an artificial speech melody.
- Concatenated diphones for the Dutch phrase "goed gedaan, jochie".
- Synthesizing Dutch with Dutch diphones, from the newspaper.
- Synthesizing German with Dutch diphones.
- Synthesizing English with Dutch diphones (1).
- Synthesizing English with Dutch diphones (2).
- Synthesizing French with Dutch diphones.
The Tiepstem
A keyboard-to-speech synthesis system for the speech impaired.
The possibility of synthesising speech from parsimoniously coded diphones, using a simple and miniaturised signal generator such as the MEA8000, opened the way towards the development of a keyboard-to-speech synthesis system as an aid for people who have temporarily or permanently lost the capacity to speak. Such a system was developed and described by Deliege (1989a, 1989b). Its architecture was inspired by that of the multilingual text-to-speech system (Van Rijnsoever 1988), but simplified and streamlined. The accent location rules and the grapheme-to-phoneme rules were developed in the Phonetics Department of Nijmegen University. The set of diphones was developed in IPO and re-coded in terms of the parameters of the MEA8000. The output of the system was either audible speech, generated with the MEA8000, or digital code for the control parameters of a speech synthesis chip. The latter facility made it possible to use the “Tiepstem” for preparing utterances and storing these in the memory of a much smaller, more portable, and more user-friendly device, the “Pocket-stem” (Waterham, 1989). This device made it possible to store and use a number of synthetic utterances, adapted to the particular speech-impaired user and accessible via an extremely simple display.
Spraakmaker
A text-to-speech system for the Dutch language.
A major aim of the earlier mentioned national strategic research programme ASSP (see Van Heuven and Pols 1993 for reports on many of the projects that were part of this programme) was to develop an experimental system for high-quality text-to-speech conversion. Because results on different aspects of text-to-speech conversion from six different research groups had to be integrated, a highly flexible and modular software environment was created in IPO, called Speech Maker, characterized by a language-independent architecture (Van Leeuwen and Te Lindert 1990, Van Leeuwen 1993). Unlike the multilingual system for text-to-speech conversion (Van Rijnsoever 1988), Speech Maker employed a multi-level, synchronized data structure called a “grid”, in which different types of information such as morphology, orthography and pronunciation were represented on different levels, and these levels were synchronized with sync marks placed between data items on each level. This gave a transparent data structure, which also made it possible to retain, and later refer to, information on earlier levels that would have been erased in a linear data structure. Most importantly, separate modules in the system became more independent of each other. The full grid had 14 different levels, viz. (1) sentence, (2) intonation phrase, (3) word class, (4) accent, (5) morpheme, (6) syllable, (7) grapheme, (8) phonemic segment, (9) segment duration, (10) pitch type, (11) pitch type anchor, (12) pitch type onset, (13) pitch type duration, (14) pitch type excursion. Note that different levels of the grid do not correspond to different modules.
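A grid of this kind might be sketched as follows. The class and method names are hypothetical and are not taken from Speech Maker itself. Sync marks are modelled as integers shared by all levels, and every data item spans the stretch between two sync marks; the phoneme symbols in the example are illustrative only.

```python
from collections import defaultdict

class Grid:
    """Multi-level data structure; items on all levels share sync marks."""

    def __init__(self):
        self.levels = defaultdict(list)   # level name -> [(start, end, value)]

    def add(self, level, start, end, value):
        """Place a data item on a level between sync marks start and end."""
        self.levels[level].append((start, end, value))

    def within(self, level, start, end):
        """Items on a level that lie inside the span start..end."""
        return [v for s, e, v in self.levels[level] if s >= start and e <= end]

    def covering(self, level, start, end):
        """Items on a level that span at least start..end."""
        return [v for s, e, v in self.levels[level] if s <= start and e >= end]

# The Dutch word "school" on three of the levels named above.
grid = Grid()
for i, g in enumerate(["s", "ch", "oo", "l"]):
    grid.add("grapheme", i, i + 1, g)
for i, ph in enumerate(["s", "x", "o:", "l"]):
    grid.add("phonemic segment", i, i + 1, ph)
grid.add("syllable", 0, 4, "school")

# Which phoneme corresponds to the grapheme "ch", and in which syllable?
print(grid.within("phonemic segment", 1, 2), grid.covering("syllable", 1, 2))
```

Because the grapheme level is still present after the phonemic level has been filled in, a later module can consult the spelling, which a linear representation rewritten in place would have lost.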
Speech Maker was used to build a specific text-to-speech system for Dutch, called Spraakmaker. Spraakmaker had eight different modules that operated on and changed the information in the grid:
- LABEL, marking in the input text the beginnings and ends of sentences, word groups and words, and marking character sequences as words, telephone numbers, amounts of money etc.;
- EXPAND, dealing with the special character sequences that are not common words;
- WORD, dealing with several aspects of grapheme-to-phoneme conversion;
- PROS, determining locations of sentence accents and major prosodic boundaries;
- MORPHON, adjusting the phonemic representation where necessary;
- DURATION, determining segment durations;
- INTONATION, determining the relevant pitch movement parameters;
- SYNTHESIS, generating the speech wave form.
Each module could in principle use information from all levels in the grid. For example, it was not excluded that the module providing intonation used aspects of the level of orthography.
Although Speech Maker was physically developed in IPO, Spraakmaker was a collaborative effort of five different research groups, involving intense discussions on architecture, types of information, implementation strategies etc. The resulting text-to-speech system for Dutch was the last major development in this area in IPO. The development of Spraakmaker was instrumental in several later developments elsewhere in the Netherlands.
DEMO SPRAAKMAKER
If you want to know how intelligible the speech is when you do not know the text, play the audio-file first.
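Dutch text of the audio demonstration of the Dutch text-to-speech system SPRAAKMAKER
"De taal- en spraaktechnologie zal, in de jaren negentig, een steeds belangrijker rol gaan spelen. U luistert nu naar spraak, voortgebracht door een experimenteel systeem. Dit systeem is het resultaat van een gezamenlijke inspanning van een zestal onderzoeksgroepen. Dat zijn de spraakonderzoeksgroepen in Amsterdam, Eindhoven, Leiden, Leidschendam, Nijmegen en Utrecht". (English: "Language and speech technology will play an increasingly important role in the nineties. You are now listening to speech produced by an experimental system. This system is the result of a joint effort by six research groups. These are the speech research groups in Amsterdam, Eindhoven, Leiden, Leidschendam, Nijmegen and Utrecht".)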
References
Adriaens, L.M.H. (1991). Ein Modell deutscher Intonation. Eine experimentell-phonetische Untersuchung nach den perzeptiv relevanten Grundfrequenzänderungen in vorgelesenem Text. Doctoral thesis, Eindhoven University of Technology.
Blaauw, E. (1995). On the perceptual classification of spontaneous and read speech. Doctoral thesis, Utrecht University.
Brokx, J.P.L. (1979). Waargenomen continuïteit in spraak: Het belang van toonhoogte. Doctoral thesis, Eindhoven University of Technology.
Cohen, A., Collier, R. and ‘t Hart, J. (1967). On the anatomy of intonation. Lingua 19, 177-192.
Cohen, A., Collier, R. and ‘t Hart, J. (1991). A perceptual study of intonation. An experimental-phonetic approach. (Cambridge Studies in Speech Science and Communication). Cambridge: Cambridge University Press.
Cohen, A. and ‘t Hart, J. (1963). Speech synthesis of steady-state segments. Proceedings of the Speech Communication Seminar, Stockholm 1962, F1.
Cohen, A., Schouten, J.F., ‘t Hart, J. (1962). Contribution of the time parameter to the perception of speech. Proceedings of the 4th International Congress of Phonetic Sciences Helsinki 1961, The Hague: Mouton, pp. 555-560.
Deliege, R.J.H. (1989a). A stand-alone text-to-speech system. IPO Annual Progress Report 24, 43-4.
Deliege, R.J.H. (1989b). The “Tiepstem”: an experimental Dutch keyboard-to-speech system for the speech impaired. Doctoral thesis, Eindhoven University of Technology.
De Pijper, J.R. (1983). Modelling British English Intonation. An analysis by resynthesis of British English intonation. Doctoral thesis, Utrecht University.
De Rooij, J.J. (1979). Speech punctuation. An acoustic and perceptual study of some aspects of speech prosody in Dutch. Doctoral thesis, Utrecht University.
Elsendoorn, B.A.G. (1984). Heading for a diphone speech synthesis system for Dutch. IPO Annual Progress Report 19, 32-35.
Elsendoorn, B.A.G. and ‘t Hart, J. (1982). Exploring the possibilities of speech synthesis with Dutch diphones. IPO Annual Progress Report 17, 63-65.
Kruyt, J.G. (1985). Accents from speakers to listeners. An experimental study of the production and perception of accent patterns in Dutch. Doctoral thesis, Leyden University.
Nooteboom, S.G. (1972). Production and perception of vowel duration. A study of durational properties of vowels in Dutch. Doctoral thesis, Utrecht University, also published as Philips Research Reports, 1972, 5 (165 pp).
Nooteboom, S.G., Slis, I.H. and Willems, L.F. (1973). IPO Annual Progress Report 8, 3-13.
Sanderman, A.A. (1996). Prosodic phrasing. Production, perception, acceptability and comprehension. Doctoral thesis, Eindhoven University of Technology.
Sanders, M.J. (1996). Intonation contour choice in English. Doctoral thesis, Utrecht University.
Slis, I.H. and Muller, H.F. (1971). A computer programme for synthesis by rule. IPO Annual Progress Report 6, 24-28.
Slis, I.H., Nooteboom, S.G. and Willems, L.F. (1977). Speech synthesis by rule: an overview of a system used in IPO. Hamburger Phonetische Beiträge 22, pp. 161-187.
Terken, J.M.B. (1985). Use and function of accentuation. Some experiments. Doctoral thesis, Leyden University.
't Hart, J., Nooteboom, S.G., Vogten, L.L.M., and Willems, L.F. (1982). Manipulaties met spraakgeluid. Philips Technisch Tijdschrift 40, no 4, 108-119.
Van Heuven, V.J. and Pols, L.C.W. (1993). Analysis and synthesis of speech. Strategic research towards high-quality text-to-speech generation. Berlin: Mouton de Gruyter.
Van Leeuwen, H.C. and Te Lindert, (1990). Spraakmaker: A text-to-speech system for the Dutch language. IPO Annual Progress Report 25, 40-49.
Van Rijnsoever, P.A. (1988). A multilingual text-to-speech system. IPO Annual Progress Report 23, 34-40.
Vogten, L.L.M. (1983). Analyse, zuinige codering en resynthese van spraakgeluid. Doctoral thesis, Eindhoven University of Technology.
Vogten, L.L.M. and Willems, L.F. (1977). The FORMATOR: A speech analysis-synthesis system based on formant extraction from linear prediction coefficients. IPO Annual Progress Report 12.
Waterham, R.P. (1989). The “Pocketstem”: an easy-to-use speech communication aid for the vocally handicapped. Doctoral thesis, Eindhoven University of Technology.
Willems, L.F. (1966a). IPOVOX II: A speech synthesizer. IPO Annual Progress Report 1, 120-123.
Willems, L.F. (1966b). The INTONATOR. IPO Annual Progress Report 1, 123-125.
Willems, L.F. (1987). Robust Formant Analysis for Speech Synthesis Applications. Proceedings of the European Conference on Speech Technology, Edinburgh 1987, 250-253.
Willems, N.J. (1982). English intonation from a Dutch point of view. Doctoral thesis, Utrecht University.