Figure: (a) raw video frames as input; (b) video frames enhanced with motion salience maps as input.
It can be seen that processing with motion salience maps significantly improves recognition accuracy on scene-related datasets: UCF101 increases from 94.9% to 96.8%, HMDB51 from 71.2% to 73.5%, and Kinetics-400 from 75.1% to 77.2%. However, SSV1, a temporally related dataset with simple backgrounds, shows no significant improvement. The main purpose of the motion salience maps is to reduce interference from irrelevant background information. Since scene-related datasets have cluttered backgrounds, highlighting motion-salient regions helps the network focus on the human action, which leads to more accurate results.
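To make the role of the motion salience maps more concrete, the following minimal PyTorch sketch weights each input frame by a salience map derived from frame differences, so that static background is attenuated before the frames enter the network. The function name motion_salience_enhance, the frame-difference salience computation, and the blending factor alpha are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def motion_salience_enhance(frames, alpha=0.5, eps=1e-6):
    """Sketch of frame enhancement with a motion salience map.

    frames: tensor of shape (T, C, H, W) with values in [0, 1].
    The salience map here is a smoothed, normalized absolute frame
    difference; the actual salience computation in the paper may differ.
    """
    # Temporal difference between consecutive frames, averaged over channels.
    diff = (frames[1:] - frames[:-1]).abs().mean(dim=1, keepdim=True)  # (T-1, 1, H, W)
    # Repeat the first map so there is one salience map per frame.
    diff = torch.cat([diff[:1], diff], dim=0)                          # (T, 1, H, W)
    # Smooth the raw difference to obtain coherent motion regions.
    salience = F.avg_pool2d(diff, kernel_size=7, stride=1, padding=3)
    # Normalize each map to [0, 1].
    s_min = salience.amin(dim=(2, 3), keepdim=True)
    s_max = salience.amax(dim=(2, 3), keepdim=True)
    salience = (salience - s_min) / (s_max - s_min + eps)
    # Re-weight frames: motion-salient regions are preserved,
    # static background is attenuated by a factor controlled by alpha.
    return frames * (alpha + (1.0 - alpha) * salience)

# Example: enhance a clip of 8 RGB frames at 224x224 before feeding the network.
clip = torch.rand(8, 3, 224, 224)
enhanced = motion_salience_enhance(clip)
print(enhanced.shape)  # torch.Size([8, 3, 224, 224])
```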
Comparison with State-of-the-art: We compare our method with three basic 2D CNN methods, TSN [3], TSM [13], and TEA [14], using TSN as the baseline (almost all advanced and widely used 2D CNN-based methods originate from TSN).
The experimental results show that our method achieves higher recognition accuracy than TSN and TSM on all four datasets, while the improvement over TEA [14] is relatively small or negligible. The reason is that TSN [3] contains no module for modeling temporal information and lacks motion cues. Consequently, on the scene-related datasets UCF101, HMDB51, and Kinetics-400 its accuracy is only slightly lower than that of the other three methods, whereas on the temporally related dataset SSV1 it lags far behind.
Table 2: Comparison between our method and three benchmark 2D CNN methods (accuracies on UCF101 and HMDB51 are obtained with 16 video segments; the other two datasets use 8 segments).