We introduce a new multimodal interaction dataset with extensive annotations in a conversational Human-Robot Interaction (HRI) scenario. It has been recorded and annotated to benchmark perceptual tasks relevant to enabling a robot to converse with multiple humans: speaker localization, keyword spotting, and speech recognition in the audio domain; tracking, pose estimation, nodding detection, and visual focus of attention estimation in the visual domain; and audio-visual tasks such as addressee detection.