Recently, automatic speech and speaker recognition has matured to the degree that it has entered the daily lives of thousands
of Europe's citizens, e.g., on their smartphones or in call services. In the coming years, speech processing technology
will move to a new level of social awareness to make interaction more intuitive, make speech retrieval more efficient, and lend
additional competence to computer-mediated communication and speech-analysis services in the commercial, health, security,
and other sectors. To reach this goal, rich speaker traits and states such as age, height, personality, and physical and
mental state, as carried by the tone of the voice and the spoken words, must be reliably identified by machines.
The iHEARu project aims to push the limits of intelligent systems for computational paralinguistics by considering Holistic analysis of multiple speaker attributes at once, Evolving and self-learning systems, and deeper Analysis of acoustic parameters, all on Realistic data on a large scale. Ultimately, it aims to progress from individual analysis tasks towards universal speaker characteristics analysis, which can easily learn about, and be adapted to, new, previously unexplored characteristics.
From a methodological point of view, today's speaker characteristic recognition mostly relies on standard machine learning techniques
that have proven successful for various audio recognition tasks, including speech and speaker recognition.
However, there still remains a major gap between today's systems and humans, who analyse speech in a holistic fashion,
learn how speaker states and traits influence each other, and continuously improve their skills through interaction with others.
In the iHEARu project, ground-breaking methodology, including novel techniques for multi-task and semi-supervised learning, will deliver for the first time intelligent, holistic, and evolving analysis, in real-life conditions, of universal speaker characteristics, which have so far been considered only in isolation. Today's sparseness of annotated realistic speech data will be overcome by large-scale speech and meta-data mining from public sources such as social media, by crowd-sourcing for labelling and quality control, and by shared semi-automatic annotation. All stages, from pre-processing and feature extraction to statistical modelling, will evolve in "life-long learning" according to new data, by utilising feedback, deep, and evolutionary learning methods. Human-in-the-loop system validation and novel perception studies will analyse the self-organising systems and the relation of automatic signal processing to human interpretation in a previously unseen variety of speaker classification tasks.
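As a minimal, purely illustrative sketch of the multi-task idea, assume two hypothetical speaker-attribute targets (an "age"-like regression and a "gender"-like classification) learned jointly through one shared representation. All data, dimensions, and learning-rate choices below are synthetic assumptions, not the project's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # 200 utterances x 10 acoustic features
y_age = X @ rng.normal(size=10)           # synthetic continuous "age" target
y_gen = (X[:, 0] > 0).astype(float)       # synthetic binary "gender" target

W = rng.normal(scale=0.1, size=(10, 5))   # shared layer, trained by BOTH tasks
w_age = np.zeros(5)                       # head 1: linear regression
w_gen = np.zeros(5)                       # head 2: logistic classification

mse0 = float(np.mean(y_age ** 2))         # error of the untrained model
lr = 0.02
for _ in range(1000):
    H = X @ W                             # shared representation
    err_age = H @ w_age - y_age           # regression residual
    p = 1.0 / (1.0 + np.exp(-(H @ w_gen)))
    err_gen = p - y_gen                   # logistic residual
    # both task heads back-propagate into the same shared layer W
    W -= lr * X.T @ (np.outer(err_age, w_age) + np.outer(err_gen, w_gen)) / len(X)
    w_age -= lr * H.T @ err_age / len(X)
    w_gen -= lr * H.T @ err_gen / len(X)

mse = float(np.mean((X @ W @ w_age - y_age) ** 2))
acc = float(((X @ W @ w_gen > 0) == (y_gen > 0.5)).mean())
```

Because both heads push gradients into one shared layer, the learned representation must serve both attributes at once, which is the core of the multi-task argument above.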
Today’s studies consider speaker characteristics in isolation, i.e., a single, or only a few, speaker characteristics are considered at once. There is very little exploitation of the interplay and synergies between different characteristics, yet in reality, strong interdependencies between pieces of paralinguistic information exist. Still, before this can be exploited on a larger scale, richly annotated data sets will have to be created: at present, databases provide only one or a few speaker characteristics in parallel. The iHEARu project aims to provide the knowledge and technology required for a holistic understanding of all the paralinguistic facets of human speech in tomorrow’s real-life information, communication, and entertainment systems.
Self-learning and self-improvement in the iHEARu project will not be limited to iterative data collection. Rather, iHEARu will consider self-optimising feature extraction and self-organising classifiers: the whole process of speaker characteristics learning and analysis shall be self-optimising, as depicted in the flow chart above. To realise these ambitious goals, deep learning combined with neuroevolutionary methods and nonparametric Bayesian learning will play an essential role, providing promising means for creating self-optimising statistical models and hierarchical input representations with very little supervision.
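The evolutionary side of this can be hinted at with a toy (1+1) evolution strategy: mutate a parameter vector and keep the mutant only if it scores no worse. The fitness function below is a hypothetical stand-in for a validation score, not the project's actual objective:

```python
import random

random.seed(0)

def fitness(params):
    # Hypothetical stand-in for a validation score (higher is better);
    # the optimum is every parameter equal to 0.5.
    return -sum((p - 0.5) ** 2 for p in params)

parent = [random.random() for _ in range(6)]   # random initial "model"
best = fitness(parent)
start = best
for _ in range(1000):
    # mutation: Gaussian perturbation of every parameter
    child = [p + random.gauss(0.0, 0.05) for p in parent]
    f = fitness(child)
    if f >= best:                              # selection: keep improvements
        parent, best = child, f
```

In a neuroevolutionary setting the parameter vector would encode network weights or topology and the fitness would be a held-out recognition score, but the select-if-better loop is the same.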
The iHEARu project approaches the acoustic feature generation and selection issue by trying to understand human reasoning in challenging conditions, from very low SNR, voice conversion algorithms, and speech compression, all the way to deliberate faking of voice or speaker states by the subjects. As a consequence, the iHEARu project will address not only environmental (technical) robustness but, more importantly, also robustness against fraud.
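For the low-SNR conditions mentioned above, a common way to create controlled test material is to mix noise into clean speech at a chosen signal-to-noise ratio. A small stand-alone helper (an illustrative sketch, not part of the project's toolchain) might look like:

```python
import math
import random

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals `snr_db`, then mix."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    scale = math.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [c + scale * n for c, n in zip(clean, noise)]

# Toy example: a sine as "speech", white noise mixed in at 0 dB
random.seed(1)
clean = [math.sin(0.1 * i) for i in range(1000)]
noise = [random.gauss(0.0, 1.0) for _ in range(1000)]
mixed = mix_at_snr(clean, noise, 0.0)        # equal signal and noise power
added = [m - c for m, c in zip(mixed, clean)]
```

Sweeping `snr_db` downward then yields progressively harder test conditions of the kind the paragraph describes.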
To automatically obtain robust speech detection and segmentation into meaningful units, the iHEARu project aims to improve all of the pre-processing algorithms, including speech separation, noise reduction, voice activity detection, and segmentation, in a loop with the subsequent analysis algorithms and the confidence scores they provide (cf. flowchart). Further, dealing with real-life data also means coping with various transmission channels.
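As a baseline illustration of voice activity detection, one of the pre-processing steps named above, a classic short-time-energy detector can be sketched as follows (the frame length and threshold heuristic are illustrative choices, not the project's settings):

```python
import math

def vad(samples, frame_len=160, ratio=4.0):
    # Split into non-overlapping frames and compute short-time energy.
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    energies = [sum(s * s for s in f) / frame_len for f in frames]
    # Estimate the noise floor from the quietest ~10% of frames.
    noise_floor = sorted(energies)[max(1, len(energies) // 10)]
    # A frame counts as speech if it is clearly above the noise floor.
    return [e > ratio * noise_floor for e in energies]

# Toy signal: near-silence, a loud "speech" burst, near-silence again.
sig = [0.01] * 800 + [math.sin(0.3 * i) for i in range(800)] + [0.01] * 800
flags = vad(sig)   # one boolean per 160-sample frame
```

In the loop the text describes, such per-frame decisions (and their confidence) would be refined by feedback from the downstream analysis rather than fixed thresholds.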
The iHEARu project addresses the automatic recognition of speaker attributes and speaking styles that can be clearly identified by humans. However, the iHEARu approach to universal analysis is not to simply define more and more new recognition tasks chosen 'ad hoc'; rather, it aims to develop data-driven methods for a framework that can automatically identify characteristics of interest by looking at crowd-sourced resources, such as tag collections, opinions in textual comments, or explicitly collected annotations from paid click-workers.
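A minimal sketch of the quality-control idea for crowd-sourced annotations, assuming hypothetical clip names and emotion labels: keep a majority-vote label only when enough annotators agree.

```python
from collections import Counter

def aggregate(votes_per_clip, min_agreement=0.6):
    gold = {}
    for clip, votes in votes_per_clip.items():
        label, n = Counter(votes).most_common(1)[0]
        if n / len(votes) >= min_agreement:
            gold[clip] = label     # confident consensus: accept the label
        # otherwise: route the clip back for re-annotation
    return gold

votes = {
    "clip_01": ["angry", "angry", "neutral"],   # 2/3 agree -> kept
    "clip_02": ["happy", "sad", "neutral"],     # no consensus -> dropped
}
gold = aggregate(votes)
```

Real crowd-sourcing pipelines weight annotators by reliability rather than counting votes equally, but the accept-or-requeue decision shown here is the basic quality-control mechanism.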
Last Updated: 14 July 2016