Content-based information retrieval (2)
Content-based video parsing involves temporal segmentation of the video into elementary units and extraction of the content of those units based on visual and/or audio semantic primitives.
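As an illustration of temporal segmentation, the sketch below cuts a frame sequence into units wherever consecutive frame histograms differ strongly. The text does not name a segmentation method, so the histogram-difference criterion, the function names `shot_boundaries` and `split_into_units`, and the threshold value are all assumptions made for this example.

```python
def shot_boundaries(histograms, threshold=0.5):
    """Return frame indices at which a new elementary unit starts.

    `histograms` is a list of normalized per-frame histograms (each sums to 1).
    The L1 distance between two such histograms lies in [0, 2]; the threshold
    is an assumed value that would need tuning on real data.
    """
    boundaries = []
    for i in range(1, len(histograms)):
        prev, curr = histograms[i - 1], histograms[i]
        distance = sum(abs(a - b) for a, b in zip(prev, curr))
        if distance > threshold:
            boundaries.append(i)
    return boundaries


def split_into_units(n_frames, boundaries):
    """Turn boundary indices into (start, end) frame ranges, end exclusive."""
    starts = [0] + list(boundaries)
    ends = list(boundaries) + [n_frames]
    return list(zip(starts, ends))


# Toy input: two frames of one scene followed by two frames of another.
toy_histograms = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
cuts = shot_boundaries(toy_histograms)
units = split_into_units(len(toy_histograms), cuts)
```

Each resulting unit can then be passed to the audio and visual analysers that assign semantic labels.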
Audio-video interaction serves two purposes:
(a) to enhance the content findings of one source by exploiting the knowledge offered by the other sources, and
(b) to offer a more detailed content description of the same video instances by combining the semantic labels of all data sources using fusion rules.
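A fusion rule of the kind mentioned in (b) could look like the following minimal sketch. The label classes `AudioLabel` and `VisualLabel`, the function `fuse`, and the specific rules and output strings are hypothetical; they only show how labels from two sources might be combined into one richer description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AudioLabel:
    speech: bool                    # False means silence
    speaker: Optional[str] = None   # speaker identity, if recognized


@dataclass(frozen=True)
class VisualLabel:
    face: bool                      # face presence / absence
    talking_face: bool = False      # face with visible speech activity


def fuse(audio: AudioLabel, visual: VisualLabel) -> str:
    """Combine audio and visual labels of one video unit (assumed rules)."""
    if audio.speech and visual.talking_face:
        who = audio.speaker or "unknown speaker"
        return f"on-screen speaker: {who}"
    if audio.speech and visual.face:
        return "speech with a non-talking face on screen"
    if audio.speech:
        return "off-screen speech"
    if visual.face:
        return "silent face"
    return "no speech, no face"


description = fuse(AudioLabel(speech=True, speaker="anchor"),
                   VisualLabel(face=True, talking_face=True))
```

The first rule illustrates the gain from fusion: neither source alone can state that the recognized speaker is the person visible on screen.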
The interaction of audio and visual semantic labels (e.g., speech, silence, speaker identity, face presence, face absence, talking face presence, etc.) can be exploited to issue queries.
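To make the query idea concrete, here is a small sketch that filters labelled video units by a conjunction of audio and visual labels. The record layout, the label keys, and the function `query` are assumptions for illustration, not a prescribed index format.

```python
def query(units, **conditions):
    """Return the units whose labels match every given condition."""
    return [
        unit for unit in units
        if all(unit["labels"].get(key) == value
               for key, value in conditions.items())
    ]


# Hypothetical index: one record per elementary unit, times in seconds.
index = [
    {"start": 0.0, "end": 4.0,
     "labels": {"speech": True, "face": True, "talking_face": True, "speaker": "A"}},
    {"start": 4.0, "end": 9.5,
     "labels": {"speech": True, "face": False, "talking_face": False, "speaker": "B"}},
    {"start": 9.5, "end": 12.0,
     "labels": {"speech": False, "face": True, "talking_face": False, "speaker": None}},
]

# "Find the units where speaker A is talking on screen."
talking_a = query(index, speaker="A", talking_face=True)

# "Find the units with speech but no visible face" (e.g., voice-over).
voice_over = query(index, speech=True, face=False)
```

Combining conditions across modalities is what distinguishes such queries from ones that a single-source index could answer.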