mechanism. Although Transformer-based video algorithms have achieved state-of-the-art accuracy, their memory requirements and computational overhead are large, so subsequent research has focused mainly on reducing memory costs and computational complexity.
Methods: A new framework for video human action recognition is
proposed that requires neither additional extraction of optical-flow
information nor 3D convolution kernels and numerous attention modules.
In addition, a method that exploits the three dimensions of time,
space, and channel to obtain more discriminative features is developed.
The temporal, spatial, and action features of human behavior are
excited and integrated across the time, space, and channel dimensions
to form more discriminative spatio-temporal action features, so that
the network can process temporal, spatial, and action features
simultaneously, thereby improving computational effectiveness.
The main computational process is shown in Figure 1; it is based on a
deep learning model that operates at coarse and fine time granularities
and combines motion salience with multi-dimensional excitation.
- Based on the dual-stream temporal segment deep learning network, the
video is divided into T segments with equal time intervals and no
overlap, T = {T1, T2, T3, ⋯, Tn}. One frame is sampled from each
segment as a key-frame, Kframe = {K1, K2, K3, ⋯, Kn}, and m further
frames are sampled from each segment as non-key-frames, denoted NKframe.
- A time-domain Fourier transform is applied to the pixel changes that
actions cause along the time dimension in order to locate motion-salient
regions and generate motion salience maps. These maps are then applied
to the original key-frames Kframe and non-key-frames NKframe to
highlight the salient motion regions where human behavior occurs and to
suppress irrelevant background information (see the salience-map sketch
after this list).
- A spatio-temporal differential feature extraction method based on
coarse and fine time granularity is used to model actions explicitly.
Spatio-temporal differences computed at the coarse time granularity
represent the long-term action changes between video segments, while
differences computed at the fine time granularity capture more refined,
short-term action changes. The long-term and short-term action changes
are then integrated (a differencing sketch follows this list).
- A 2D CNN is used to extract spatial information from the video and
generate appearance features, which are then integrated with the action
features to jointly construct new spatio-temporal video features with
stronger expressive power.
- To reduce computational complexity while maintaining recognition
accuracy, a deformable-convolution-based action feature excitation
method (DCME) is used to excite the motion information in the
spatio-temporal action features. It applies two deformable convolutions
of different scales to the feature differences of adjacent key-frames
across the whole video, producing excitation parameters that enhance
the action features and thereby generating video-level spatio-temporal
action features (see the DCME sketch after this list).
- Because different channels contribute unequally, a spatio-temporal
channel excitation method based on temporal correlation is used to
obtain the temporal and spatial information of the channel dimension;
the two are fused into channel-wise spatio-temporal features. These are
combined with the video-level spatio-temporal action features produced
by the motion excitation method to jointly construct richer and more
descriptive action features (a channel-excitation sketch follows this
list).
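The motion salience step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the grayscale clip layout [T, H, W], the frequency cutoff index k, and the weighting used to highlight salient regions are all assumptions.

```python
import torch

def motion_salience_map(clip_gray: torch.Tensor, k: int = 1) -> torch.Tensor:
    """Minimal sketch: per-pixel temporal FFT over a [T, H, W] grayscale clip.

    Frequency components above index `k` (assumed cutoff) are treated as motion
    energy; their magnitudes are summed and normalized to [0, 1] as a salience map.
    """
    spec = torch.fft.rfft(clip_gray, dim=0)       # [T//2+1, H, W] complex spectrum
    motion_energy = spec[k:].abs().sum(dim=0)     # drop near-DC (static background) terms
    m = motion_energy - motion_energy.min()
    return m / (m.max() + 1e-6)                   # [H, W] salience map in [0, 1]

def highlight_motion(frames: torch.Tensor, salience: torch.Tensor) -> torch.Tensor:
    """Emphasize salient regions and attenuate background for [T, 3, H, W] frames;
    the 0.5 + 0.5 * salience weighting is an assumption for the sketch."""
    return frames * (0.5 + 0.5 * salience)
```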
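The coarse/fine differencing idea can be illustrated with simple feature differences on already-extracted frame features; the tensor layouts [N, T, C, H, W] and [N, T, m, C, H, W], the averaging over intra-segment differences, and the fusion by padding and summation are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def coarse_fine_motion_features(key_feats: torch.Tensor,
                                nonkey_feats: torch.Tensor) -> torch.Tensor:
    """Sketch of coarse/fine spatio-temporal differencing.

    key_feats:    [N, T, C, H, W]    features of one key-frame per segment
    nonkey_feats: [N, T, m, C, H, W] features of m non-key-frames per segment
    """
    # Coarse granularity: differences between adjacent segments (long-term changes).
    coarse = key_feats[:, 1:] - key_feats[:, :-1]              # [N, T-1, C, H, W]
    # Fine granularity: differences between adjacent frames inside each segment.
    fine = nonkey_feats[:, :, 1:] - nonkey_feats[:, :, :-1]    # [N, T, m-1, C, H, W]
    fine = fine.mean(dim=2)                                    # [N, T, C, H, W]
    # Integrate long- and short-term changes (zero-padding to length T, then summing).
    coarse = F.pad(coarse, (0, 0, 0, 0, 0, 0, 0, 1))           # pad T-1 -> T along dim 1
    return coarse + fine
```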
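A rough sketch of the deformable-convolution excitation (DCME) step is given below, built on torchvision.ops.DeformConv2d. The channel reduction ratio, the 3x3/5x5 kernel pairing, and the sigmoid residual gating are assumptions; only the overall idea of exciting features from adjacent key-frame differences follows the description above.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DCMESketch(nn.Module):
    """Sketch of deformable-convolution motion excitation (not the exact DCME)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        c = channels // reduction
        self.squeeze = nn.Conv2d(channels, c, 1)
        # Two deformable convolutions at different scales; offsets predicted per branch.
        self.offset3 = nn.Conv2d(c, 2 * 3 * 3, 3, padding=1)
        self.deform3 = DeformConv2d(c, c, 3, padding=1)
        self.offset5 = nn.Conv2d(c, 2 * 5 * 5, 5, padding=2)
        self.deform5 = DeformConv2d(c, c, 5, padding=2)
        self.expand = nn.Conv2d(c, channels, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        """feats: [N, T, C, H, W] key-frame features; returns excited features."""
        n, t, c, h, w = feats.shape
        x = feats.flatten(0, 1)                         # [N*T, C, H, W]
        x = self.squeeze(x).view(n, t, -1, h, w)
        diff = x[:, 1:] - x[:, :-1]                     # adjacent key-frame differences
        diff = torch.cat([diff, diff[:, -1:]], dim=1)   # keep temporal length T
        d = diff.flatten(0, 1)
        m = self.deform3(d, self.offset3(d)) + self.deform5(d, self.offset5(d))
        gate = torch.sigmoid(self.expand(m)).view(n, t, c, h, w)
        return feats + feats * gate                     # residual excitation
```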
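The channel excitation step resembles a squeeze-and-excitation gate extended with a 1D convolution along the time axis to capture temporal correlation; the layer sizes and the residual gating here are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class ChannelExcitationSketch(nn.Module):
    """Sketch of spatio-temporal channel excitation based on temporal correlation."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        c = channels // reduction
        self.pool = nn.AdaptiveAvgPool2d(1)                        # squeeze spatial dims
        self.down = nn.Conv1d(channels, c, kernel_size=1)
        self.temporal = nn.Conv1d(c, c, kernel_size=3, padding=1)  # correlate along time
        self.up = nn.Conv1d(c, channels, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        """feats: [N, T, C, H, W]; returns channel-excited features of the same shape."""
        n, t, c, h, w = feats.shape
        s = self.pool(feats.flatten(0, 1)).view(n, t, c)   # [N, T, C] channel descriptors
        s = s.transpose(1, 2)                              # [N, C, T] for Conv1d
        gate = torch.sigmoid(self.up(self.temporal(self.down(s))))
        gate = gate.transpose(1, 2).view(n, t, c, 1, 1)
        return feats + feats * gate
```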
In summary, the proposed algorithm differs from existing 2D CNN methods
that integrate spatio-temporal and motion features in three respects:
(1) motion salience is introduced, making this type of 2D CNN method
more robust for recognizing human actions in videos of complex scenes;
(2) by combining the interaction information between frames within
video segments and between video segments, spatio-temporal features and
action features are built and fused at fine and coarse time
granularity, respectively, yielding a more accurate model of
spatio-temporal action features at the video-segment level in the time
dimension; (3) two deformable convolutions of different scales are
adopted to excite action features with diverse spatial scale changes
and irregular deformations, and these features are combined with the
channel-dimension features, so the spatio-temporal action features are
excited in both the spatial and channel dimensions.
Data Set: Four popular video recognition benchmarks are used in the
experiments of this paper: UCF101 [9] (101 action categories and 13,320
video clips), HMDB51 [10] (51 action categories and 6,766 video clips),
Kinetics-400 [11] (400 action categories, each containing at least 400
video clips), and Something-Something [12] (two versions, SSV1 and
SSV2, both containing 174 action categories). All data are divided into
training and testing sets using the officially designated splits.
UCF101, HMDB51, and Kinetics-400 are scene-related datasets, while the
Something-Something datasets are temporally related.
Computational Process and Parameters: Experiments are run on Ubuntu
18.04 with an Intel Xeon E5-2620 2.10 GHz CPU, 64 GB RAM, and 4 GeForce
GTX 1080 Ti GPUs, using Python and the PyTorch 1.8 deep learning
framework with CUDA 11.0.
Training phase. Each video is first divided into T segments of equal
duration, 5 frames are randomly sampled from each segment, and the 4th
frame is selected as the key-frame. The short side of these frames is
resized to 256, and they are then cropped to 224 x 224 using random
cropping and center cropping before being fed to the network. The
network input is therefore [N, T, 3, 224, 224], where N is the batch
size and T is the number of segments (key-frames) per video. In the
experiments, T is set to 8 or 16.
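As a rough illustration of this sampling scheme and input layout (not the authors' data loader; the helper name and the assumption num_frames >= 5 * T are introduced only for the sketch):

```python
import random

def sample_clip(num_frames: int, T: int = 8, per_segment: int = 5):
    """Sketch of segment-based sampling: T equal segments, 5 random frames per
    segment, with the 4th sampled frame of each segment used as the key-frame.
    Assumes num_frames >= 5 * T."""
    seg_len = num_frames / T
    segments, key_frames = [], []
    for i in range(T):
        lo, hi = int(i * seg_len), int((i + 1) * seg_len)
        idx = sorted(random.sample(range(lo, hi), per_segment))
        segments.append(idx)        # non-key-frames are drawn from these indices
        key_frames.append(idx[3])   # 4th sampled frame is the key-frame
    return segments, key_frames

# After decoding, resizing the short side to 256, and cropping to 224 x 224, a
# batch of key-frames is stacked into the network input of shape [N, T, 3, 224, 224].
```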
ResNet-50 is taken as the backbone network and initialized with a model
pre-trained on ImageNet. The learning rate starts at 0.01, and
mini-batch stochastic gradient descent (SGD) with a momentum of 0.9 and
a weight decay of 0.0001 is used. The learning rate is reduced at the
30th, 40th, and 50th epochs, and the network is trained for 70 epochs
in total.
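This optimization setup corresponds to a standard PyTorch configuration such as the following sketch; the plain ResNet-50 stands in for the full network, and the decay factor 0.1 at the milestone epochs is an assumption, since the text only states that the learning rate is adjusted there.

```python
import torch
from torchvision.models import resnet50

model = resnet50(pretrained=True)   # ImageNet-pretrained backbone (stand-in for the full network)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
# Decay factor 0.1 at epochs 30/40/50 is an assumption; 70 epochs in total.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[30, 40, 50], gamma=0.1)

for epoch in range(70):
    # ... one training epoch over the sampled clips ...
    scheduler.step()
```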
Testing phase. Two testing strategies are used to balance recognition
accuracy and efficiency. The first is center crop/1 clip, which crops
the center of each sampled video frame to 224 x 224 as network input;
due to the randomness and sparsity of sampling, this strategy yields
lower recognition accuracy. The second, higher-accuracy strategy is
3-crop/10-clip: each video is sampled 10 times, each resulting set of
frames is cropped 3 times as network input, and all recognition results
are averaged to obtain the final prediction. This strategy gathers more
comprehensive information through multiple sampling and cropping,
giving higher recognition accuracy but lower recognition efficiency.
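The 3-crop/10-clip protocol amounts to averaging class scores over 30 views per video. A minimal sketch is given below; the model interface and the softmax-then-average combination are assumptions.

```python
import torch

@torch.no_grad()
def predict_multi_view(model, views: torch.Tensor) -> torch.Tensor:
    """views: [V, T, 3, 224, 224] with V = 10 clips x 3 crops = 30 views per video.
    Class scores are averaged over all views to give the final prediction."""
    model.eval()
    scores = model(views)                      # [V, num_classes]; model layout is assumed
    return scores.softmax(dim=1).mean(dim=0)   # averaged class probabilities
```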
The evaluation indicators used in this paper cover efficiency and
accuracy. The efficiency indicator is the number of floating-point
operations (FLOPs), i.e., the computational complexity of the model.
Recognition accuracy is evaluated with the average accuracy (A).
According to the rule used to decide whether a sample is correctly
classified, accuracy is reported as mean accuracy (mA), Top-1 accuracy
(A_top1), where a sample counts as correct if the class with the
highest predicted probability is the ground-truth class, and Top-5
accuracy (A_top5), where a sample counts as correct if the five classes
with the highest predicted probabilities contain the ground-truth class.
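Top-1 and Top-5 accuracy as defined above can be computed with a generic routine such as the following (a sketch, not tied to the paper's code):

```python
import torch

def topk_accuracy(scores: torch.Tensor, labels: torch.Tensor, k: int = 1) -> float:
    """scores: [N, num_classes] predicted class scores; labels: [N] ground-truth classes.
    A sample counts as correct if the true class is among the k highest-scoring classes."""
    topk = scores.topk(k, dim=1).indices                 # [N, k]
    correct = (topk == labels.unsqueeze(1)).any(dim=1)
    return correct.float().mean().item()

# A_top1 = topk_accuracy(scores, labels, k=1); A_top5 = topk_accuracy(scores, labels, k=5)
```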
Ablation Experiment: The goal of the ablation experiment is to verify
the effectiveness of the components of the algorithm. The original
video frames and the video frames with highlighted motion regions were
used as input for action feature extraction, and experimental
verification was conducted on the four datasets, as shown in Table 1
(the recognition accuracy on UCF101 and HMDB51 was obtained with 16
video segments, while the other two datasets used 8 segments).
Table 1: Validation of the effectiveness of the motion salience map.