Abstract
Tracking objects, and pedestrians in particular, requires detecting and re-identifying (re-id) them correctly throughout the video stream. These algorithms must run on every frame of the video, which is difficult to do in real time because the underlying networks are compute intensive. To bring the system close to real time we use smaller detection and re-id networks: OpenPose at a reduced network resolution for detection, and MobileNet-v2 for feature extraction and matching. This end-to-end pedestrian detection and re-id pipeline runs efficiently on embedded platforms, at the cost of some accuracy. The drop in accuracy comes mainly from the detection algorithm, which cannot run at its full potential under the memory and power constraints of the edge device. The re-id network also fails in occlusion scenarios, particularly dynamic occlusions where pedestrians cross each other, again because of the missed detections. To address these limitations we explored algorithms that learn the movement patterns of individual pedestrians and predict their future positions. With these predictions we no longer rely solely on the detection network: any missed detection can be replaced by the predicted position of that pedestrian. Likewise, when a pedestrian is partially or fully occluded by another pedestrian and cannot be detected in the scene, we fall back on the predicted positions. In this way we aim to handle both the missed detections incurred by the detection algorithm and the occlusions that are very frequent in real-world scenes.

Long Short-Term Memory (LSTM) neural networks have been proven to achieve state-of-the-art performance on pattern recognition problems. They inherently have a memory cell that keeps track of the relevant data they have seen and learns to recognize the hidden patterns in it. We leverage these pattern learning capabilities in this research and train an LSTM to predict the future positions of pedestrians in the scene. The LSTM is trained at a coarse granularity of 5 frames per second using sequences shot at 60 frames per second. We quantify its performance and find that predicting 5 future frames is optimal for our system. The trained LSTM is then integrated into the existing end-to-end system, and its performance is evaluated against the system without the LSTM on the DukeMTMC dataset. We analyze its impact and present a qualitative study of why it does not improve the system's accuracy for some cameras in the dataset. We then fine-tune the trained LSTM model for each camera individually and observe an increase in accuracy for every camera. This provides a proof of concept that the prediction model needs to be specific to each camera, learning the movement patterns of that camera's particular perspective. We conclude the study by comparing our complete end-to-end system's performance with the state of the art.
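The central mechanism described above is a fallback rule: use the detector and re-id match when one is available, and substitute the LSTM's predicted position when the pedestrian is missed or occluded. The Python sketch below illustrates this idea under stated assumptions; the names (Track, update_tracks, predict_next) and the history length are illustrative placeholders rather than the actual interfaces of the system, while the 5-frame prediction horizon follows the value reported above.

```python
# Minimal sketch of the detection-or-prediction fallback described in the abstract.
# All names and the history length are illustrative assumptions, not the system's API.

from collections import deque
from typing import Callable, Dict, List, Optional, Tuple

Position = Tuple[float, float]

PREDICTION_HORIZON = 5   # future frames predicted by the LSTM, as chosen in this study
HISTORY_LENGTH = 10      # assumed number of past positions fed to the LSTM


class Track:
    def __init__(self, track_id: int):
        self.track_id = track_id
        self.history: deque = deque(maxlen=HISTORY_LENGTH)  # recent (x, y) positions
        self.missed = 0                                      # consecutive missed detections


def update_tracks(
    tracks: List[Track],
    matches: Dict[int, Optional[Position]],               # re-id match per track (None = missed)
    predict_next: Callable[[List[Position]], Position],   # trained LSTM wrapped as a callable
) -> None:
    """Use the detection/re-id match when available; otherwise fall back to the
    LSTM's predicted position (e.g. during occlusion) for up to
    PREDICTION_HORIZON consecutive frames."""
    for track in tracks:
        position = matches.get(track.track_id)
        if position is not None:
            # Normal case: the detector and re-id network found this pedestrian.
            track.history.append(position)
            track.missed = 0
        elif track.missed < PREDICTION_HORIZON and len(track.history) == HISTORY_LENGTH:
            # Missed detection or occlusion: substitute the LSTM forecast.
            track.history.append(predict_next(list(track.history)))
            track.missed += 1
        # Beyond the horizon the track is left to the tracker's usual
        # termination rules, which are outside the scope of this sketch.
```

In this sketch the LSTM is only consulted when the detector fails, which matches the role given to it in the abstract: it supplements, rather than replaces, the detection and re-id networks.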