Concepts and terminology
Lip tracking = detection of the lip contour in each frame of a video sequence except the first, using the lip contour already detected in the previous frames.
Region of interest = the part of the image to be analyzed in the specific application; in our case, the region of interest is the mouth region.
Bimodal (/multimodal) speech = the combination of the audio speech signal and the video speech signal; the two modalities are audio and video (alternatively, another modality can be text).
Speech recognition = given an audio, video or audio-video spoken sequence, recognize the words/sounds/phrases pronounced in this sequence → text output.
Speech synthesis (“talking heads”) = the inverse task of speech recognition (text input → audio-visual sequence as output).
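The definitions of lip tracking and region of interest can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the method used in this work: the names `crop_roi`, `track_lips`, `detect` and `refine` are hypothetical, frames are plain nested lists, and the first frame is assumed to be handled by a separate detection step, since tracking by definition starts from the second frame.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[int, int]
Contour = List[Point]
Box = Tuple[int, int, int, int]  # (top, left, height, width), hypothetical layout


def crop_roi(frame: Sequence[Sequence[int]], box: Box) -> List[List[int]]:
    """Return the region of interest (e.g. the mouth region) of a frame."""
    top, left, height, width = box
    return [list(row[left:left + width]) for row in frame[top:top + height]]


def track_lips(
    frames: Sequence,
    detect: Callable[[object], Contour],
    refine: Callable[[object, Contour], Contour],
) -> List[Contour]:
    """Return one lip contour per frame.

    Assumption: `detect` finds the contour from scratch in the first frame,
    and `refine` updates the previous contour for every later frame.
    """
    contours: List[Contour] = []
    for index, frame in enumerate(frames):
        if index == 0:
            contour = detect(frame)                 # initial detection
        else:
            contour = refine(frame, contours[-1])   # tracking step
        contours.append(contour)
    return contours
```

In this sketch the tracker never searches the whole image again after the first frame; each contour is obtained only from the current frame and the contour found in the frame before it.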
Department of Informatics
Aristotle University of Thessaloniki