- Connexion -
Friday, June 24, 2011, 9:30am-1pm / IRCAM, Igor Stravinsky Conference Room
This workshop will feature leading figures in speech processing, who will present works-in-progress in speech technologies, from recognition, transformation, and synthesis to interaction.
Axel Roebel and Xavier Rodet, Analysis/Synthesis Team. IRCAM.
For roughly the past seven years, the interest of composers and musical assistants at IRCAM in speech synthesis and transformation techniques has continued to grow. As a result, speech processing has become one of the central research objectives of the Analysis/Synthesis team at IRCAM. This introduction will present some of the key results of these research efforts, with examples relating notably to spectral envelope estimation, estimation of the parameters of the LF glottal pulse model, text-to-speech synthesis, shape-invariant signal transformation in the phase vocoder, speaker transformation, voice conversion, and the transformation of emotional states.
Jean-François Bonastre, Laboratoire d'Informatique d'Avignon. University of Avignon.
The main approaches to speaker recognition are based on statistical modelling of the acoustic space. This modelling usually relies on a Gaussian Mixture Model (GMM), denoted the Universal Background Model (UBM), with a large number of components and trained on a large set of speech data gathered from hundreds of speakers. Each target model is derived from the UBM through MAP adaptation of the Gaussian mean parameters only. An important evolution of the UBM/GMM paradigm was to treat the UBM as defining a new data representation space, given by the concatenation of the Gaussian mean parameters. This space, denoted the "supervector" space, made it possible to use Support Vector Machine (SVM) classifiers fed with supervectors. A second step in this evolution was the direct modelling of session variability in the supervector space using the Joint Factor Analysis (JFA) approach. More recently, the Total Variability space was introduced as an evolution of JFA: it models the total variability in the supervector space in order to build a smaller space that concentrates the information and in which session and speaker variability can more easily be modelled jointly. Looking at this evolution, three remarks can be made: the evolution is always tied to large models with thousands of parameters; the new approaches are largely unable to work at the frame-by-frame level; and, finally, these approaches rely on the general statistical paradigm in which a piece of information is considered strong when it occurs very often.
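To make the UBM/supervector pipeline described above concrete, here is a minimal illustrative sketch in Python using scikit-learn. It is not the speaker's system: the number of components, the relevance factor, and the function names are assumptions chosen for illustration. It fits a diagonal-covariance GMM as the UBM, MAP-adapts only the Gaussian means to one target speaker, and stacks the adapted means into a supervector of the kind that SVM, JFA, or Total Variability modelling would then build on.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(background_features, n_components=512):
    # Fit the Universal Background Model on pooled background speech features
    # (rows = frames, columns = e.g. MFCC dimensions). 512 components is an
    # assumed, typical order of magnitude, not a value from the talk.
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag")
    ubm.fit(background_features)
    return ubm

def map_adapt_means(ubm, speaker_features, relevance=16.0):
    # Relevance-MAP adaptation of the Gaussian means only: each component mean
    # is shifted towards the speaker data in proportion to how much data that
    # component "saw". The relevance factor 16 is an illustrative default.
    post = ubm.predict_proba(speaker_features)            # frame posteriors, shape (T, K)
    n_k = post.sum(axis=0)                                # soft counts per component
    f_k = post.T @ speaker_features                       # first-order statistics, shape (K, D)
    alpha = (n_k / (n_k + relevance))[:, None]            # adaptation coefficients
    speaker_mean = f_k / np.maximum(n_k[:, None], 1e-8)   # data-driven component means
    return alpha * speaker_mean + (1.0 - alpha) * ubm.means_

def supervector(adapted_means):
    # Concatenate all adapted means into one high-dimensional vector: the
    # "supervector" representation on which SVM, JFA and Total Variability
    # approaches operate.
    return adapted_means.ravel()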
This talk analyses the consequences of these remarks and presents a new paradigm for speaker recognition, based on a discrete binary representation, that is able to overcome the limitations of the previous approaches.
Nick Campbell, Centre for Language & Communications Studies. Trinity College, Dublin.
This talk describes a robot interface for gathering conversational data currently on exhibition in the Science Gallery of Trinity College Dublin.
We use a small LEGO Mindstorms NXT device as a platform for a high-definition webcam and microphones, in conjunction with a finite-state dialogue machine and recordings of several human utterances that are played back through a sound-warping device so that the robot appears to be speaking them. Visual processing using OpenCV forms the core of the device, interacting with the discourse model to engage passers-by in a brief conversation so that we can record the exchange and learn more about discourse strategies for advanced human-computer interaction.
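The following Python sketch illustrates the kind of loop described above, with OpenCV face detection driving a tiny finite-state dialogue. It is a hypothetical illustration, not the exhibit's actual code: the state names, greeting logic and the play_utterance() hook are assumptions standing in for the real finite-state machine and sound-warped playback.

import cv2

def play_utterance(name):
    # Placeholder for playing back a pre-recorded, sound-warped utterance.
    print(f"[robot plays: {name}]")

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
camera = cv2.VideoCapture(0)   # the webcam mounted on the robot platform
state = "IDLE"

while True:
    ok, frame = camera.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    if state == "IDLE" and len(faces) > 0:
        play_utterance("greeting")       # a passer-by has been detected
        state = "ENGAGED"
    elif state == "ENGAGED" and len(faces) == 0:
        play_utterance("farewell")       # the visitor has walked away
        state = "IDLE"
    # In the real exhibit, the ENGAGED state would branch on the visitor's
    # responses and the whole exchange would be recorded for later analysis.

camera.release()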
Simon King, Centre for Speech Technology Research. The University of Edinburgh.
Some text-to-speech synthesisers are now as intelligible as human speech. This is a remarkable achievement, but the next big challenge is to approach human-like naturalness, which will be even harder. I will describe several lines of research which are attempting to imbue speech synthesisers with the properties they need to sound more "natural" - whatever that means.
The starting point is personalised speech synthesis, which allows the synthesiser to sound like an individual person without requiring substantial amounts of their recorded speech. I will then describe how we can work from imperfect recordings or achieve personalised speech synthesis across languages, with a few diversions to consider what it means to sound like the same person in two different languages and how vocal attractiveness plays a role.
Since the voice is not only our preferred means of communication but also a central part of our identity, losing it can be distressing. Current voice-output communication aids offer a very poor selection of voices, but recent research means that soon it will be possible to provide people who are losing the ability to speak, perhaps due to conditions such as Motor Neurone Disease, with personalised communication aids that sound just like they used to, even if we do not have a recording of their original voice.
There will be plenty of examples, including synthetic child speech, personalised synthesis across the language barrier, and the reconstruction of voices from recordings of disordered speech.
This work was done with Junichi Yamagishi, Sandra Andraszewicz, Oliver Watts, Mirjam Wester and many others.