Learning to Localize the Temporal Extent of Human Activities in Videos