2Guangzhou Xinhua University, Guangzhou, P.R. China
*E-mail: kaishixu@hotmail.com
Video human action recognition is an important research problem. Owing to challenges such as complex scenes, large variations in spatial scale, and irregular deformation of the recognised targets, existing algorithms generally suffer from large parameter counts and high computational cost. This paper proposes a new computing framework based on a lightweight deep learning model. A time-domain Fourier transform is used to generate motion salience maps that highlight human motion regions during feature extraction, and intra-segment and inter-segment feature differences at different time scales are exploited to capture action information at different temporal granularities, yielding a more accurate model of human actions with varying spatio-temporal scales. In addition, an action excitation method based on deformable convolution is proposed to address irregular deformation, spatial multi-scale variation, and the loss of low-level information as network depth increases. Experiments verify the effectiveness of the proposed algorithm in terms of both computational efficiency and accuracy.
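As a rough illustration of the motion salience idea summarised above, the sketch below computes a per-pixel salience map by applying a Fourier transform along the time axis of a short clip and retaining only the non-static (AC) temporal energy. The function name, clip shape, and normalisation are assumptions for illustration and do not reproduce the paper's actual pipeline.

```python
# Minimal sketch: motion salience from a time-domain Fourier transform.
# Assumes a clip tensor of shape (T, C, H, W); all names here are
# illustrative, not taken from the paper.
import torch

def motion_salience_map(clip: torch.Tensor) -> torch.Tensor:
    """Highlight moving regions by suppressing the temporal DC component.

    clip: float tensor of shape (T, C, H, W), T = number of frames.
    returns: salience map of shape (H, W), normalised to [0, 1].
    """
    gray = clip.mean(dim=1)                 # (T, H, W): per-frame intensity
    spec = torch.fft.fft(gray, dim=0)       # FFT along the time axis
    spec[0] = 0                             # drop the DC (static background) term
    energy = spec.abs().pow(2).sum(dim=0)   # temporal AC energy per pixel
    energy = energy - energy.min()
    return energy / (energy.max() + 1e-8)   # normalise to [0, 1]

if __name__ == "__main__":
    clip = torch.rand(16, 3, 112, 112)      # dummy 16-frame RGB clip
    sal = motion_salience_map(clip)
    print(sal.shape)                        # torch.Size([112, 112])
```

In such a sketch, pixels whose intensity varies over time accumulate high temporal AC energy and therefore receive high salience, while static background regions are suppressed; the resulting map could then weight the features extracted by the lightweight backbone.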
Introduction: The computational frameworks for video human action recognition based on deep learning mainly include Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers. Simonyan et al. [1] proposed a dual-stream CNN for video human action recognition, and Feichtenhofer et al. [2] used a residual network (ResNet) to implement a dual-stream structure, though both required additional optical flow images to obtain action features. RNNs can model time-series data [3] but cannot fully exploit spatio-temporal information.
Tran et al. [4] proposed the C3D network, Tran et al. [5]
extended ResNet and proposed the Res3D network, and Diba et al. [6]
designed the Temporal 3D Convolutional Network (T3D).
In addition, Girdhar et al. [7] combined Fast-RCNN and Transformer
and proposed the Video Action Transformer Network, while Bertasius et al.
[8] proposed the TimeSformer network based on a divided attention