Concepts and terminology

Lip tracking = detection of the lip contour in every frame of a video sequence except the first, using the lip contour already detected in the previous frames.
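The definition above can be read as a simple loop over frames. The following is only a minimal sketch under stated assumptions: the contour of the first frame is supplied from outside (for example by a separate detection step), and the names `track_lips`, `update` and `shift_update` are hypothetical, not taken from any particular method. The toy "frames" are just displacements, standing in for real image data.

```python
from typing import Any, Callable, List, Sequence, Tuple

Point = Tuple[int, int]
Contour = List[Point]


def track_lips(
    frames: Sequence[Any],
    first_contour: Contour,
    update: Callable[[Any, List[Contour]], Contour],
) -> List[Contour]:
    """Return one contour per frame.

    The first frame is not tracked: its contour is given (assumption).
    Every later frame is handled by `update`, which sees the current
    frame and all contours found so far.
    """
    if not frames:
        return []
    contours = [first_contour]
    for frame in frames[1:]:
        contours.append(update(frame, contours))
    return contours


def shift_update(frame: Any, previous: List[Contour]) -> Contour:
    """Toy update rule (hypothetical): the frame is a (dx, dy) displacement
    and the new contour is the most recent contour moved by that amount."""
    dx, dy = frame
    return [(x + dx, y + dy) for x, y in previous[-1]]


# Toy sequence: the first frame carries no displacement because it is not tracked.
frames = [None, (1, 0), (0, 2)]
tracked = track_lips(frames, [(0, 0), (1, 1)], shift_update)
```

In a real system the update rule would search the image near the previous contour; the loop structure stays the same.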

Region of interest = the part of the image to be analyzed in a specific application; in our case, the region of interest is the mouth region.
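As an illustration, extracting a rectangular region of interest amounts to cropping the image. This is a minimal sketch that assumes the image is a list of pixel rows and that the mouth rectangle is already known; the function name `crop_roi` and the rectangle coordinates are hypothetical.

```python
from typing import List


def crop_roi(image: List[List[int]], top: int, left: int,
             height: int, width: int) -> List[List[int]]:
    """Return the height x width sub-image whose upper-left corner
    is at row `top`, column `left`."""
    if top < 0 or left < 0 or height <= 0 or width <= 0:
        raise ValueError("rectangle must have a non-negative origin and positive size")
    if not image or top + height > len(image) or left + width > len(image[0]):
        raise ValueError("rectangle does not fit inside the image")
    return [row[left:left + width] for row in image[top:top + height]]


# Toy 4 x 6 image in which each pixel value encodes its position (row * 10 + column).
image = [[r * 10 + c for c in range(6)] for r in range(4)]

# Hypothetical mouth rectangle: 2 rows by 3 columns in the lower part of the image.
mouth = crop_roi(image, top=2, left=1, height=2, width=3)
```

Restricting later processing to `mouth` means the lip contour search never has to look at the rest of the face.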

Bimodal (or multimodal) speech = the combination of the audio speech signal and the video speech signal; the two modalities are audio and video (alternatively, another modality can be text).

Speech recognition = given an audio, video or audio-video spoken sequence, recognize the words/sounds/phrases pronounced in this sequence → text output.

Speech synthesis (“talking heads”) = the opposite task to speech recognition (text input → audio-visual sequence as output).

Aristotle University of Thessaloniki
