A Multimodal Sensor Fusion Architecture for Audio-Visual Speech Recognition