
Action Recognition and Detection Based on Deep Learning: A Comprehensive Summary

Computers, Materials & Continua, 2023, Issue 10

Yong Li, Qiming Liang, Bo Gan and Xiaolong Cui

1College of Information Engineering, Engineering University of PAP, Xi'an, 710086, China

2PAP of Heilongjiang Province, Heihe Detachment, Heihe, 164300, China

3National Key Laboratory of Science and Technology on Electromagnetic Energy, Naval University of Engineering, Wuhan, 430033, China

4Joint Laboratory of Counter Terrorism Command and Information Engineering, Engineering University of PAP, Xi'an, 710086, China

ABSTRACT Action recognition and detection is an important research topic in computer vision, which can be divided into action recognition and action detection. At present, the distinction between action recognition and action detection is not clear, and the relevant reviews are not comprehensive. Thus, this paper summarizes deep learning-based action recognition and detection methods and datasets to accurately present the research status of this field. Firstly, according to how temporal and spatial features are extracted, commonly used action recognition models are divided by architecture into two-stream models, temporal models, spatiotemporal models, and transformer models. The characteristics of the four kinds of models are briefly analyzed, and the accuracy of various algorithms on common datasets is introduced. Then, from the perspective of the tasks to be completed, action detection is further divided into temporal action detection and spatiotemporal action detection, and commonly used datasets are introduced. Algorithms for temporal action detection are reviewed from the perspectives of two-stage and one-stage methods, and algorithms for spatiotemporal action detection are summarized in detail. Finally, the relationship between the different parts of action recognition and detection is discussed, the difficulties faced by current research are summarized, and future development is prospected.

KEYWORDS Action recognition; action detection; deep learning; convolutional neural networks; dataset

    1 Introduction

Being widely used in fields such as security, video content review, and human-computer interaction, action recognition and detection is among the important research directions in computer vision [1]. It comprises two components with distinct concepts: action recognition and action detection [2]. Action recognition refers to judging the category of human action in a given video clip, while action detection, besides classifying the action, determines the start and end times of certain actions in the video and locates the spatial position of the figures in the picture. More specifically, action detection can be divided into temporal action detection and spatiotemporal action detection. Temporal action detection only determines the start and end times of certain actions, while spatiotemporal action detection further determines the position of the figures in the picture.

Previous reviews of action recognition and detection focus more on action recognition alone and fail to summarize the current research status comprehensively and accurately. Some literature does not clearly explain the difference and connection between the concepts of action recognition and action detection. For example, Hassner [3] reviewed the early development of action recognition, focusing on its commonly used datasets. Luo et al. [4] reviewed various algorithms commonly used in action recognition from the perspective of descriptors. Zhao et al. [5] provided insight into traditional recognition methods and deep learning-based recognition methods from two aspects: input content and network depth. Chai et al. [6] focused on the comparison between descriptor-based and deep learning-based action recognition methods before prospecting the development direction of action recognition. Zhang [2] provided a comprehensive summary of the research status of both action recognition and action detection, but the latest research results are not mentioned. Zhu et al. [7] reviewed the research status of action recognition and detection in detail, but the concept of action recognition is not clearly distinguished from that of action detection. Sun et al. [8] summarized the current research on action recognition and detection in detail from the perspective of data modality.

In recent years, the transformer model, typified by the Vision Transformer (VIT) [9], has made remarkable achievements, which reveals a new trend in the field of action recognition and detection. The existing literature rarely reviews action recognition and action detection side by side, and there are few introductions to transformer-based models. Therefore, after dividing the structure of action recognition and detection, this paper summarizes action recognition and detection in detail from the perspective of model structure, with much emphasis on the prominent transformer model. Besides, this paper summarizes the various algorithms of action recognition and detection, points out the difficulties faced by current research, and explores subsequent trends.

    2 Action Recognition

As shown in Fig. 1, action recognition can fit into either traditional frameworks or deep learning-based frameworks, which mainly consist of three steps: preprocessing, action expression, and classification [10]. Preprocessing includes serialization of the video and extraction of optical flow features. In traditional frameworks, action expression mainly includes feature extraction and coding, while deep learning frameworks use various deep neural networks to extract features. For action classification, traditional frameworks mainly use algorithms such as Support Vector Machines (SVM) and random forests, while deep learning frameworks mainly use Softmax and SVM.
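As a concrete illustration of the deep learning branch of this pipeline, the minimal PyTorch sketch below samples frames, extracts per-frame features, pools them over time, and classifies with a Softmax head; the backbone choice (ResNet-18) and layer sizes are illustrative assumptions, not any specific published model.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FrameLevelRecognizer(nn.Module):
    """Minimal sketch: per-frame features from a 2D CNN, mean-pooled over time,
    classified with a Softmax head (the deep learning branch of Fig. 1)."""
    def __init__(self, num_classes: int):
        super().__init__()
        backbone = resnet18(weights=None)          # action expression: feature extractor
        backbone.fc = nn.Identity()                # drop the ImageNet classification head
        self.backbone = backbone
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, T, 3, H, W); preprocessing has already sampled T frames
        b, t, c, h, w = clip.shape
        feats = self.backbone(clip.flatten(0, 1))  # (B*T, 512)
        feats = feats.view(b, t, -1).mean(dim=1)   # temporal average pooling
        return self.classifier(feats)              # logits; Softmax is applied in the loss

logits = FrameLevelRecognizer(num_classes=101)(torch.randn(2, 8, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 101])
```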

    2.1 Action Recognition Datasets

The earliest published dataset for action recognition was Kungliga Tekniska Högskolan (KTH) [11]. In recent years, UCF-101 [12] and HMDB51 [13] have been widely used in action recognition. The KTH dataset contains 6 types of actions completed by 25 people in 4 different scenarios, with a total of 2,391 video samples, while UCF-101 contains 13,320 video samples in 101 categories, and HMDB51 includes 6,849 video samples in 51 action categories. With the deepening of research, the scenarios, action categories, and sample sizes covered by UCF-101 and HMDB51 have become insufficient for the needs of current studies, so they are now being phased out. Table 1 shows the comparison of action recognition algorithms on UCF-101 and HMDB51.

Table 1: Comparison of action recognition algorithms

Figure 1: Action recognition flow chart

In recent years, some research institutions have released datasets with larger sample sizes and richer scenarios, such as the Kinetics [38] series and the Something-Something [39] series. The Kinetics400 [38] dataset was released in 2017 and includes a total of 306,245 video samples in 400 categories. DeepMind then expanded the Kinetics400 dataset to release the larger Kinetics600 [41] and Kinetics700 [44]. The Something-Something v1 dataset was released at ICCV 2017, with 2–6 seconds per video, divided into training, test, and validation sets according to a ratio of 8:1:1 and containing more than 100,000 samples. Something-Something v2 was released at CVPR 2020, with a further expanded sample size of more than 200,000, and the data format was updated from the previous JPG frames to WebM.

At present, action recognition datasets are devoted to expanding around specific scenarios for fine-grained motion analysis to meet various task scenarios. Table 2 shows the basic information of commonly used datasets for action recognition.

Table 2: Common datasets for action recognition

Table 3: Comparison of algorithms on ActivityNet-1.3

Table 4: Comparison of algorithms on THUMOS'14

    2.2 Descriptor-Based Action Recognition

Before deep learning was widely used, action recognition mainly adopted descriptor-based methods, which can be divided into global feature-based and local feature-based methods. Global feature extraction initially used the histogram of oriented gradients (HOG), before developing into two methods: the contour silhouette method and the human joint point method. For example, Bobick et al. [48] generated Motion History Images (MHI) through the construction of a two-dimensional motion energy map to achieve action classification. Yang et al. [49] constructed the coordinates of joint points and combined static posture, motion attributes, and overall dynamics for action recognition. Local feature extraction mainly includes two methods: spatiotemporal interest point sampling and dense trajectory tracking. For example, Willems et al. [50] proposed an action recognition method based on 3D Harris corner detection, and Wang et al. [16,51] proposed the dense-trajectory-based action recognition methods Dense Trajectories (DT) and Improved Dense Trajectories (IDT). Through multi-scale dense sampling, feature points are closely tracked in the temporal dimension to form trajectories, from which the category of action is eventually judged.

    2.3 Deep Learning-Based Action Recognition

When judging the category of an action, humans usually need to distinguish both the static information of the actor and the dynamic information of the actor's movement. Therefore, the implementation of action recognition relies on the static spatial information and dynamic temporal information of the action in the video. According to the network structures used to obtain these two types of information, deep learning-based action recognition models can generally be divided into four categories: two-stream models, temporal models, spatiotemporal models, and the latest transformer models.

    2.3.1 Two Stream Models

To obtain spatial and temporal features, the two-stream model uses two parallel pathways to extract spatial and temporal features, respectively. It fuses the spatiotemporal feature information through an appropriate fusion method before finally classifying the action. Such an idea was first proposed in 2014 by Simonyan et al. [17]. They input preprocessed video frames and optical flow maps into two parallel paths and then use AlexNet [52] for feature extraction on both paths, where static spatial features are obtained from the video frames and dynamic temporal features are obtained from the optical flow maps. The feature information is fused at the end of the two channels to achieve action classification.
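A minimal sketch of this two-stream idea is given below, assuming ResNet-18 backbones and a ten-frame optical flow stack purely for illustration (the original work used an AlexNet-style CNN); class scores from the two streams are averaged at the end, i.e., late fusion.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TwoStreamNet(nn.Module):
    """Sketch of a two-stream network: a spatial stream on a single RGB frame and a
    temporal stream on a stack of optical flow maps, fused by averaging class scores."""
    def __init__(self, num_classes: int, flow_stack: int = 10):
        super().__init__()
        self.spatial = resnet18(weights=None, num_classes=num_classes)
        self.temporal = resnet18(weights=None, num_classes=num_classes)
        # optical flow has 2 channels (x/y) per frame pair, stacked over flow_stack pairs
        self.temporal.conv1 = nn.Conv2d(2 * flow_stack, 64, kernel_size=7,
                                        stride=2, padding=3, bias=False)

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # rgb: (B, 3, H, W); flow: (B, 2*flow_stack, H, W)
        return (self.spatial(rgb) + self.temporal(flow)) / 2   # late score fusion

net = TwoStreamNet(num_classes=101)
scores = net(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 224, 224))
```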

Feichtenhofer et al. [21] improved the feature fusion method on this basis. Feature fusion is performed earlier, in the convolutional layers of the two channels, and additional fusion is performed at the prediction layer to replace the previous end fusion. This fusion method not only reduces the parameters of the model but also improves the accuracy of recognition. Then, Feichtenhofer et al. [53] introduced He et al.'s residual network (ResNet) [54] into the two-stream model, adding residual connections between the two streams to enhance spatiotemporal feature interaction and fusion.

Similarly focusing on the fusion of two-stream features, Ng et al. [55] introduced the Long Short-Term Memory network (LSTM) [56] into the two-stream network. LSTM is used to fuse the outputs of the two-stream CNNs, effectively expressing the order of frames through the memory unit of LSTM, strengthening the ability to extract temporal information, and realizing the recognition of actions in long videos.

To achieve feature extraction from long-term video information, Wang et al. [24] used a sparse temporal sampling strategy to sample multiple clips from the entire video at the input, give a preliminary judgment of the action category in each segment, and then combine the results of the multiple segments to reach a "consensus" on the action class, thereby realizing the recognition of long-range actions.

The two-stream model preprocesses video frames and optical flow maps at the input, which requires much time and computing power at the data preprocessing stage and keeps the model far from end-to-end recognition. For the above reasons, Zhu et al. [21] established a network structure called MotionNet based on the two-stream model, which can directly model the temporal features of video frames, replacing the role of the optical flow map.

Feichtenhofer et al. [57] made great improvements to the two-stream model and built a lightweight two-stream recognition network named SlowFast. In SlowFast, a slow pathway operating at a low frame rate captures spatial semantics, while a fast pathway operating at a high frame rate captures temporal information with fine temporal resolution. Finally, lateral connections fuse features from the fast pathway into the slow pathway to achieve action classification.
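The pathway and lateral-connection structure can be sketched roughly as follows; the layer sizes, channel counts, and frame-rate ratio alpha are illustrative assumptions, not the published SlowFast configuration.

```python
import torch
import torch.nn as nn

class TinySlowFast(nn.Module):
    """Sketch of the SlowFast idea: the slow pathway sees every alpha-th frame with many
    channels, the fast pathway sees every frame with few channels, and a time-strided
    lateral connection feeds fast-pathway features into the slow pathway."""
    def __init__(self, num_classes: int, alpha: int = 8):
        super().__init__()
        self.alpha = alpha
        self.slow = nn.Conv3d(3, 64, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3))
        self.fast = nn.Conv3d(3, 8, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3))
        # lateral connection: maps fast features onto the slow pathway's frame rate
        self.lateral = nn.Conv3d(8, 16, kernel_size=(5, 1, 1), stride=(alpha, 1, 1), padding=(2, 0, 0))
        self.head = nn.Linear(64 + 16 + 8, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, 3, T, H, W); the slow pathway subsamples frames by alpha
        fast = self.fast(video)
        slow = self.slow(video[:, :, ::self.alpha])
        slow = torch.cat([slow, self.lateral(fast)], dim=1)   # fuse fast -> slow
        pooled = torch.cat([slow.mean(dim=(2, 3, 4)), fast.mean(dim=(2, 3, 4))], dim=1)
        return self.head(pooled)

logits = TinySlowFast(num_classes=400)(torch.randn(1, 3, 32, 224, 224))
```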

Compared with the temporal models, spatiotemporal models, and the latest transformer models, the two-stream models are more complex and their training is cumbersome, so it is difficult to truly achieve end-to-end recognition. The idea of two-stream models, however, provides important inspiration for algorithm innovation in the field of action detection and promotes its development. The two-stream model is an architectural compromise made at the early stage of action recognition algorithms, and some algorithms even need to train the two pathways step by step during model training, which is very time-consuming. The architecture of the two-stream model clearly shows how difficult the extraction of action features is.

    2.3.2 Temporal Models

To obtain spatial and temporal features, the temporal model adopts a cascading method, in which spatial semantic information is first extracted by a convolutional neural network (CNN), and temporal feature information is then extracted by a recurrent neural network (RNN). Simple RNNs suffer from exploding or vanishing gradients when processing long-term feature information, so practical temporal models use LSTM with a forget gate.

In the study of Donahue et al. [15], AlexNet and LSTM are cascaded to model spatial and temporal features respectively before classification by a fully connected layer at the end, constructing Long-term Recurrent Convolutional Networks (LRCN) for action recognition. To better represent spatial relationships and eliminate redundant information, Sudhakaran et al. [58] introduced ConvLSTM [59] to replace the traditional LSTM for violent scene recognition, which realized the fusion of spatiotemporal information and further improved recognition accuracy.
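The cascade structure shared by LRCN-style temporal models can be sketched as follows, with an assumed ResNet-18 feature extractor and a single LSTM layer standing in for the components used in the cited works.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CnnLstmRecognizer(nn.Module):
    """Sketch of the cascade structure of temporal models: a 2D CNN extracts per-frame
    spatial features, an LSTM models their temporal order, and a fully connected layer
    classifies the final hidden state."""
    def __init__(self, num_classes: int, hidden: int = 256):
        super().__init__()
        cnn = resnet18(weights=None)
        cnn.fc = nn.Identity()
        self.cnn = cnn
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = clip.shape                 # (B, T, 3, H, W)
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        _, (h_n, _) = self.lstm(feats)             # h_n: (1, B, hidden)
        return self.head(h_n[-1])                  # classify the last hidden state

logits = CnnLstmRecognizer(num_classes=51)(torch.randn(2, 16, 3, 224, 224))
```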

Li et al. [19] introduced the attention mechanism into the LSTM network and constructed a new action recognition model, VideoLSTM, through the fusion of ConvLSTM and Attention LSTM. VideoLSTM introduces motion features and attention mechanisms over spatiotemporal positions, focusing on preserving spatial feature information between video frames. Wang et al. [60] combined I3D with LSTM, modeling the high-level temporal features obtained by the I3D model through LSTM.

When the CNN becomes too complex, the constructed spatial feature maps become too abstract, resulting in the loss of temporal feature information and limiting LSTM's ability to process temporal information. That accounts for the declining popularity of research on temporal-model-based action recognition. Since RNNs and their variants cannot implement multi-GPU parallel computing, temporal models cannot be trained in parallel on multiple devices. Action recognition algorithms have high hardware requirements during model training, so the inability to build action recognition models through multi-device parallel training causes great trouble for researchers, which is another important factor in the development bottleneck of current temporal models.

    2.3.3 Spatiotemporal Models

Spatiotemporal models use an integrated structure to obtain spatial and temporal feature information at the same time. A spatiotemporal model usually uses 3D convolution, with the 3D convolution operation applied to data that includes the temporal dimension. In recent years, some scholars have proposed specially designed data processing methods within the model that fuse spatiotemporal feature information in advance and then model it with 2D convolution, which can also realize action recognition.

The use of 3D convolution for action recognition was first proposed by Ji et al. [61]. Du et al. [14] further extended 3D convolution to the pooling process to form 3D pooling and established the C3D (Convolutional 3D) model. Diba et al. [20] adopted transfer learning in the model construction process and proposed a new temporal transition layer (TTL), embedding TTL into a DenseNet [62] extended to a 3D structure, thereby constructing the new network Temporal 3D ConvNets (T3D).
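A minimal C3D-style building block, with illustrative sizes, looks like the sketch below: a 3x3x3 convolution over the clip followed by 3D pooling, so temporal and spatial features are extracted jointly.

```python
import torch
import torch.nn as nn

class Conv3dBlock(nn.Module):
    """Sketch of a C3D-style block: 3x3x3 convolution over (T, H, W) followed by 3D
    pooling; the first block keeps the temporal length and halves the spatial size."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.pool = nn.MaxPool3d(kernel_size=(1, 2, 2))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W)
        return self.pool(self.act(self.conv(x)))

y = Conv3dBlock(3, 64)(torch.randn(2, 3, 16, 112, 112))
print(y.shape)  # torch.Size([2, 64, 16, 56, 56])
```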

3D convolution is very computationally intensive. To alleviate this problem, Qiu et al. [18] formed a new convolutional block, Pseudo-3D (P3D), by factorizing the convolution based on ResNet. Based on the P3D block, the new action recognition model P3D ResNet was constructed. Experiments show that P3D ResNet significantly improves action recognition performance.

Transfer learning is widely applied in deep learning, making it easier to train new models. For example, in the field of object recognition, transfer learning from a model trained on ImageNet can accelerate the convergence of a new model. A similar approach can be used in action recognition to reduce the training workload. To be able to use pretrained models with a 3D convolutional network, Carreira et al. [25] expanded the two-dimensional convolution and pooling kernels to 3D based on Inception-v1 [63]. They then pre-trained the three-dimensional model implicitly on ImageNet before obtaining the pre-trained 3D convolution model on Kinetics. After pre-training, the Inflated 3D (I3D) model obtained by transfer learning gains a great improvement in action recognition accuracy, and the difficulty of model training is also greatly reduced.
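The inflation step can be sketched as below: a pretrained 2D kernel is repeated along a new temporal axis and rescaled, so that a "boring" video of identical frames reproduces the 2D network's activations. This is a simplified reading of the I3D initialization, not the full procedure.

```python
import torch

def inflate_conv_weight(w2d: torch.Tensor, time_dim: int) -> torch.Tensor:
    """Repeat a 2D kernel (out, in, kH, kW) time_dim times along a new temporal axis
    and rescale by 1/time_dim, yielding a 3D kernel (out, in, T, kH, kW)."""
    w3d = w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
    return w3d / time_dim

w2d = torch.randn(64, 3, 7, 7)       # e.g., an ImageNet-pretrained first-layer kernel
w3d = inflate_conv_weight(w2d, time_dim=7)
print(w3d.shape)  # torch.Size([64, 3, 7, 7, 7])
```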

To make 3D convolution models more lightweight, scholars have proposed many innovative methods, but it remains difficult for 3D convolution to outperform 2D convolution in efficiency. In 2019, Lin et al. [64] creatively shifted slices of the feature map along the temporal dimension and proposed the Temporal Shift Module (TSM), which processes temporal features before feature extraction. TSM fuses temporal information into spatial features invisibly, so that 2D convolution alone can achieve the effect of 3D convolution, alleviating computational overhead by sacrificing storage. Then, Shao et al. [65] proposed a new deformable shift module, the Temporal Interlacing Network (TIN), based on TSM, which further strengthened the fusion of spatiotemporal information. Fan et al. [66] proposed a learnable 3D shift network, RubiksNet, which shifts simultaneously in both the spatial and temporal dimensions and dynamically learns the proportion of shifted channels; RubiksNet obtains a larger range of spatiotemporal information as well as higher accuracy.
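The core shift operation of TSM can be sketched as follows, mirroring the commonly described formulation in which a fraction of channels is shifted forward and another fraction backward in time before an ordinary 2D convolution.

```python
import torch

def temporal_shift(x: torch.Tensor, n_segment: int, fold_div: int = 8) -> torch.Tensor:
    """For frame features of shape (N*T, C, H, W), shift 1/fold_div of the channels one
    step towards earlier frames, another 1/fold_div towards later frames, and leave the
    rest untouched, so a plain 2D convolution afterwards mixes neighbouring frames."""
    nt, c, h, w = x.shape
    n = nt // n_segment
    x = x.view(n, n_segment, c, h, w)
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift towards earlier frames
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift towards later frames
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # untouched channels
    return out.view(nt, c, h, w)

shifted = temporal_shift(torch.randn(2 * 8, 64, 56, 56), n_segment=8)
```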

In addition, Li et al. [67] extracted adjacent-frame information and multi-frame global information by establishing a temporal excitation and aggregation (TEA) block. Both short-term motion and long-term feature aggregation are considered, which effectively reduces the complexity of the network and avoids the drawbacks of 3D CNNs.

Before 2019, 3D convolution was mainly used in spatiotemporal models. Since then, through the ingenious design of data preprocessing, 2D convolution can also achieve accurate action classification. By sacrificing part of the storage, the computational overhead is greatly alleviated, and action recognition based on the spatiotemporal model has become a prominent direction of current research.

    2.3.4 Transformer Models

The transformer is an attention-based encoder-decoder model that originated in the field of natural language processing (NLP) and began to achieve high accuracy in computer vision applications after the release of the VIT model in 2021. The transformer is a commonly used encoder-decoder architecture in NLP, which has advantages in extracting "contextual" correlation information. It is now shining in the field of computer vision, including action recognition, and is becoming an important cross-modal architecture.

The transformer was first used in action recognition after the release of the Video Vision Transformer (VIVIT) [68] in 2021. Similar to VIVIT, Ullah et al. [69] completely abandoned CNNs, building on VIT and adopting an attention structure to achieve action recognition. To alleviate the redundancy of temporal information, Patrick et al. [70] introduced trajectory information into the transformer and obtained high accuracy on multiple datasets.
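A toy VIVIT-style model might look like the following sketch; the tubelet size, embedding dimension, and depth are illustrative assumptions rather than any published configuration.

```python
import torch
import torch.nn as nn

class TinyVideoTransformer(nn.Module):
    """Sketch of a video transformer: the clip is cut into spatio-temporal 'tubelet'
    tokens with a 3D convolution, a learnable class token is prepended, and a standard
    self-attention encoder produces the representation used for classification."""
    def __init__(self, num_classes: int, dim: int = 192, depth: int = 4):
        super().__init__()
        # tubelets of 2 frames x 16 x 16 pixels -> one token each
        self.tokenize = nn.Conv3d(3, dim, kernel_size=(2, 16, 16), stride=(2, 16, 16))
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, 1 + 8 * 14 * 14, dim))  # for 16x224x224 input
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, 3, T, H, W), e.g., (B, 3, 16, 224, 224)
        tok = self.tokenize(clip).flatten(2).transpose(1, 2)          # (B, N, dim)
        tok = torch.cat([self.cls.expand(tok.size(0), -1, -1), tok], dim=1) + self.pos
        return self.head(self.encoder(tok)[:, 0])                     # classify the class token

logits = TinyVideoTransformer(num_classes=400)(torch.randn(1, 3, 16, 224, 224))
```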

Truong et al. proposed an end-to-end transformer structure, DirecFormer [71]. The structure introduces ordinal temporal learning into the transformer, which helps to understand the chronological order of actions. To strengthen the ability to model different spatiotemporal views, Google proposed Multiview Transformers for Video Recognition (MTV) [72], which consists of multiple independent encoders representing different dimensional views of the input video; MTV fuses information between the views through lateral connections. The Self-Supervised Video Transformer (SVT) [73] is a new self-supervised approach that trains a teacher-student model using similarity objectives, matching attention-based representations along spatiotemporal dimensions.

Although the transformer achieves very high accuracy in computer vision, including action recognition, it incurs huge computational overhead, which places a burden on research institutions or researchers with average research conditions. Therefore, the Recurrent Vision Transformer (RViT) [74] introduces a recurrence mechanism and integrates an attention gate to establish a connection between the current frame and the previous hidden state, thereby extracting global space-time features between frames and alleviating the problem of insufficient computing power to a certain extent.

    3 Action Detection

    3.1 Action Detection Datasets

The datasets commonly used for temporal action detection are mainly THUMOS14 [75], MEXaction2 [76], and ActivityNet [77]. The THUMOS14 dataset includes an action recognition part and a temporal action detection part. The action recognition section includes all the categories covered by the UCF-101 dataset. The temporal action detection section includes 20 categories, divided into a training set, validation set, background fragment set, and test set. The MEXaction2 dataset includes two categories: horseback riding and bullfighting. The background fragments of MEXaction2 are relatively long, while the proportion of labeled action fragments is low, which makes it more challenging for temporal action detection. ActivityNet is currently the largest database and also contains two tasks: action classification and temporal action detection. ActivityNet has a very large sample size of more than 20,000 videos covering 200 action categories; it can only be downloaded by writing a script based on the official YouTube links. The above datasets only coarsely label the temporal action information, which easily causes the problem of unclear temporal action boundaries during experiments, so ECCV 2022 released the carefully labeled FineAction [78] dataset. The FineAction dataset contains nearly 17,000 untrimmed videos and 103,000 fine-grained temporal action annotations across 106 action categories; the category definitions are clearer and the temporal annotations are more accurate.

J-HMDB-21 [79], UCF101-24 [80], and Atomic Visual Actions (AVA) [81] are commonly used spatiotemporal action detection datasets. J-HMDB-21 is a subset of the HMDB dataset, containing a total of 21 categories and 960 video samples. UCF101-24 is a subset of UCF101, including a total of 24 action categories and 3,207 video samples. Compared to either of the previous datasets, the labels of the AVA dataset are much sparser. The AVA dataset consists of 300 movies, each captured for 15 min and labeled second by second. The newly released MultiSports [82] dataset from ECCV 2022 further increases the sample size and includes more complex scenes; it is a large-scale spatiotemporal action detection dataset mainly covering basketball, football, gymnastics, and volleyball events.

    3.2 Temporal Action Detection

Temporal action detection is different from action recognition in that it not only needs to classify the action itself but also needs to locate the temporal position of the action in the video, specifically to accurately locate the start and end times of certain actions in a long video containing background clips and to determine the category of the action. Temporal action detection usually requires video data with a long time span during model training; such data is huge, and much time and computing power is consumed in data preprocessing and model training, so temporal action detection is very difficult for research institutions with poor research conditions or weak teams.

Tables 3 and 4 respectively show the accuracy of commonly used temporal action detection algorithms on the ActivityNet-1.3 dataset and the THUMOS'14 dataset, where mAP@k represents the mean average precision of an algorithm when the temporal intersection-over-union threshold equals k.
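For reference, the temporal IoU underlying mAP@k can be computed as in the small sketch below: a prediction is counted as correct at threshold k when its temporal IoU with a ground-truth segment of the same class is at least k.

```python
def temporal_iou(pred: tuple, gt: tuple) -> float:
    """Temporal IoU between two segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((2.0, 7.0), (4.0, 9.0)))  # 3 / 7 ~= 0.43
```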

    3.2.1 Action Detection Based on Descriptors

Traditional temporal action detection methods use descriptors to generate target fragments, thereby achieving the detection of temporal action. For example, Richard et al. [102] identified action types by merging two models, one of which is a length model that incorporates action duration information and the other a language model that incorporates contextual information. Yuan et al. [103] extracted a pyramid of score distribution features (PSDF) based on IDT features. They then used an LSTM network to process the PSDF feature sequence and obtained predictions of action fragments according to the output frame-level action category confidence scores. By training on video, Hou et al. [104] automatically determined the number as well as the types of sub-actions in each action. To locate an action, an objective function combining the appearance, duration, and temporal structure of a sub-action is optimized as a shortest path problem in a network flow formulation, and the best combination is selected by considering both the sub-action scores and the distances between sub-actions.

    3.2.2 Deep Learning-Based Action Detection

Another primary approach to temporal action detection is to use deep neural networks. According to whether target candidate regions need to be extracted in a separate step, object detection algorithms can be divided into one-stage and two-stage algorithms. Similarly, temporal action detection algorithms can also be divided into one-stage and two-stage methods, according to whether temporal candidate regions, where actions may occur, are extracted independently.

    Two-Stage Method

Inspired by the common object detection algorithm R-CNN, Shou et al. [93] proposed a temporal action detection method based on sliding windows, Segment-CNN (S-CNN), in 2016. S-CNN cuts the original video into several clips of different lengths and then sends them through pooling operations into the C3D network to conduct action detection. The flexibility of S-CNN in judging the start and end times of actions is limited by the sliding window mechanism. In the following year, Shou et al. [94] drew on the ideas of the Fully Convolutional Network (FCN) [105] and introduced Convolutional-De-Convolutional (CDC) filters based on the C3D algorithm. Frame-level fine-grained temporal action detection is then achieved through the joint effect of upsampling in the temporal dimension and downsampling in the spatial dimension.
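The multi-scale sliding-window stage of such two-stage pipelines can be sketched as follows; the window lengths and overlap are illustrative, and the C3D scoring of each resulting clip is omitted.

```python
def sliding_window_proposals(num_frames: int, window_sizes=(16, 32, 64), overlap=0.75):
    """Slide windows of several lengths over a video with a fixed overlap; each
    (start, end) proposal would then be scored by a clip-level 3D CNN."""
    proposals = []
    for size in window_sizes:
        stride = max(1, int(size * (1 - overlap)))
        for start in range(0, max(1, num_frames - size + 1), stride):
            proposals.append((start, start + size))
    return proposals

print(len(sliding_window_proposals(256)))  # number of candidate segments
```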

Also inspired by object detection algorithms, Xu et al. [83] built the Region Convolutional 3D Network (R-C3D) based on Faster R-CNN [106], which encodes the video stream with a 3D fully convolutional network, generates temporal candidate regions that may contain actions, and then classifies and fine-tunes the candidate regions. Unlike S-CNN, R-C3D can perform end-to-end detection of actions in videos of arbitrary length.

Subject to a similar influence, Chao et al. [85] used a Faster R-CNN-based multiscale framework to improve the calibration of receptive fields, enabling resilience to extreme changes in the duration of actions in certain videos. Then, by constructing a two-stream network, the red-green-blue (RGB) and optical flow features are fused, and action classification is conducted using a late fusion mechanism.

For temporal action detection in actual scenarios, there may be more than one action fragment in the video to be detected, so actions can be judged comprehensively by combining the action categories of multiple proposals. Hence, Zeng et al. [88] used Graph Convolutional Networks (GCN) [107] to explore the connections among proposals and constructed a GCN-based temporal action detection framework, Proposal-GCN (P-GCN).

Liu et al. [99] proposed using both coarse-grained and fine-grained features to build an end-to-end multi-granularity generator (MGG) for finding action fragments through two modules, a segment proposal producer (SPP) and a frame actions producer (FAP). Gao et al. [108] proposed a relation-aware pyramid network (RapNet) based on the pyramid network, which enhances the global feature representation and locates action fragments of different lengths. Lin et al. [109] established a novel Dense Boundary Generator (DBG), which extracts spatiotemporal features like the two-stream model and establishes an action-aware completeness regression branch and a temporal boundary classification branch to realize rapid detection of actions.

The two-stage method can achieve high detection accuracy by obtaining temporal proposals before classifying the action. However, apart from its low operation speed, the two-stage network model is too complex and requires a lot of computing resources.

    One-Stage Method

The conventional approach for one-stage temporal action detection is to use convolutional layers to generate proposals for recognition and boundary regression. The action proposals obtained by this method are assigned the same receptive fields, yet the temporal length varies for different actions. To solve this problem, Long et al. [110] proposed Gaussian kernel learning, which expresses temporal information by learning a Gaussian kernel. Piergiovanni et al. [111] proposed a new convolutional layer, the Temporal Gaussian Mixture (TGM) layer, which also adopts a Gaussian model; it can effectively capture long-distance dependencies in video by using Gaussian kernels to aggregate features at other time points near the current temporal window. Inspired by object detection algorithms, Lin et al. combined I3D with an anchor-free object detection algorithm, proposed a boundary consistency learning loss, and constructed the Anchor-Free Saliency-Based Detector (AFSD), an anchor-free action detection method based on learning salient boundary features.

In recent years, with the rapid development of the transformer, some scholars have also begun to solve the problem of temporal action detection with the transformer. Liu et al. [100] embedded the video features and positions extracted by CNNs as input and decoded a set of action predictions in parallel through the transformer. By adaptively focusing on certain segments in a video, it extracts the contextual information required to make action predictions, which greatly simplifies the process of temporal action detection and increases detection speed. Shi et al. [112] also proposed a DETR-like temporal action detection method based on the transformer, which introduces IoU-decay-related attention, an action classification enhancement loss, and fragment quality prediction, analyzing them from three aspects: the attention mechanism, the training loss, and network inference. Zhang et al. [113] used the transformer as a basic module to design a minimalist temporal action detection scheme, where a feature pyramid and a local self-attention mechanism are used to model long-range temporal features, and classification and regression are realized without generating proposals or predefined bounding boxes. In addition, Liu et al. [114] also combined the transformer to propose an end-to-end temporal action detection scheme, which obtains higher accuracy and a faster detection rate by constructing a medium-resolution benchmark detector.

    3.3 Spatiotemporal Action Detection

Spatiotemporal action detection determines the temporal and spatial position of certain actors in a video containing background clips and realizes action classification. In other words, spatiotemporal action detection needs to mark the position of the actor in the spatial picture on top of temporal action detection. Spatiotemporal action detection can often be divided into multiple stages, including action recognition, target tracking, and object detection. Therefore, spatiotemporal action detection usually requires the construction of multiple networks in the process of model construction, which makes model training difficult. Table 5 shows the accuracy of spatiotemporal action detection algorithms on the J-HMDB-21 and UCF101-24 datasets.

Table 5: Comparison of spatiotemporal action detection algorithms

Puscas et al. [115] employed a selective search method to produce an initial segmentation of still-image-based video frames. This initial proposal set is pruned and temporally extended using optical flow and transductive learning.

Inspired by object detection algorithms, Kalogeiton et al. [117] built an Action Tubelet detector (ACT) based on the SSD framework. ACT focuses on temporal features between successive frames, reduces the ambiguity of action prediction, and improves the accuracy of spatiotemporal localization. Gu et al. [118] used I3D for contextual temporal modeling and Faster R-CNN for end-to-end localization and action classification, which is also derived from object detection.

Feichtenhofer et al. [57] used SlowFast for action recognition, DeepSORT for object tracking, and YOLO for object detection, realizing action detection through a combination of the three algorithms. Inspired by the human visual nervous system, Köpüklü et al. [119] proposed You Only Watch Once (YOWO), a unified architecture for spatiotemporal action detection. The network structure of YOWO is similar to the two-stream model, in which a 3D-CNN branch and a 2D-CNN branch are used in parallel to extract spatiotemporal feature information, with feature fusion and candidate region regression carried out at the end. YOWO uses 3D-ResNet-101 [120] to extract spatiotemporal features and solves the classification problem with the 3D-CNN branch. To solve the spatial localization problem, DarkNet-19 [121] is used to extract the 2D features of keyframes.
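The dual-branch fusion idea behind YOWO can be sketched with tiny stand-in networks as below; the real model uses 3D-ResNet-101 and DarkNet-19 rather than these single convolutions, which are illustrative only.

```python
import torch
import torch.nn as nn

class TwoBranchBackbone(nn.Module):
    """Sketch of a YOWO-style backbone: a 3D-CNN branch encodes the whole clip, a 2D-CNN
    branch encodes only the keyframe, and the two feature maps are concatenated
    channel-wise for a downstream detection head (not shown)."""
    def __init__(self):
        super().__init__()
        self.branch3d = nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1)
        self.branch2d = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, 3, T, H, W); the keyframe is taken to be the last frame
        f3d = self.branch3d(clip).mean(dim=2)        # collapse time: (B, 64, H/2, W/2)
        f2d = self.branch2d(clip[:, :, -1])          # keyframe features: (B, 64, H/2, W/2)
        return torch.cat([f3d, f2d], dim=1)          # fused map for the detection head

fused = TwoBranchBackbone()(torch.randn(1, 3, 16, 224, 224))
print(fused.shape)  # torch.Size([1, 128, 112, 112])
```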

Based on YOWO, Mo et al. [122] proposed using LinkNet to introduce a connection between the 2D and 3D convolutional structures. They also use custom bounding boxes similar to YOLOv2 to achieve precise positioning of actors and update the YOWO network to a second version, which effectively reduces the complexity of the model and further improves its accuracy.

The Holistic Interaction Transformer (HIT) [116] network is a comprehensive dual-modal framework based on the transformer, which includes an RGB stream and a pose stream. Each stream models human, object, and hand interactions. Within each sub-network, an Intra-Modal Aggregation module (IMA) is introduced, which selectively merges individual interaction units. An Attentive Fusion Mechanism (AFM) is then used to glue together the features produced by each modality. Finally, HIT extracts cues from the temporal context using cached memory to better classify the possible actions.

    4 Discussion

Action recognition and action detection have been widely used in practical scenarios. Action recognition can be applied to human-computer interaction, video content review, and other fields, while action detection can be applied to intelligent security, video content positioning, video search, and other fields. Action recognition is the prior work of action detection, and only when the relevant action recognition algorithms mature can action detection develop well. Fig. 2 shows the important algorithms for action recognition and detection; above the arrow are the action recognition algorithms, and below are the action detection algorithms. In the field of action recognition, mainstream models include the two-stream model, the temporal model, the spatiotemporal model, and the transformer model.

Structurally, the two-stream model uses parallel CNNs to extract spatiotemporal feature information separately, and it is difficult to train the two pathways separately during model training. The temporal model uses a cascading method to extract spatial information and temporal information respectively, which has a simple structure, is less difficult to train, and can achieve end-to-end recognition. The traditional spatiotemporal model uses a 3D convolutional network; the model rises from two dimensions to three, which extracts spatiotemporal feature information at the same time but increases model complexity. In recent years, spatiotemporal models have used specially designed data preprocessing methods to reduce the dimension of the convolution, which greatly simplifies the model structure and reduces the difficulty of training. The recently popular transformer model comes from NLP; it mainly adopts an attention mechanism that differs from the other three models, its complexity is high, and its training is difficult.

In terms of development trends, the transformer is currently the most popular model, which has high accuracy but is limited by model complexity, and there is still a big gap before actual deployment. After 2019, spatiotemporal models have taken a new direction of using data preprocessing to achieve spatiotemporal feature fusion, which has provided new ideas for many researchers for some time. The temporal model relies on the LSTM network and its variants to obtain temporal information, but it has hit a bottleneck, and there have been no good innovations recently. As one of the earliest deep learning models in the field of action recognition, the two-stream model has structural drawbacks, but it is still an important research direction. Table 6 shows a comparison of the characteristics of the four models.

    Table 6:Comparison of action recognition models

Action detection emerged later than action recognition, but it has developed faster under the influence of action recognition algorithms. Temporal action detection can be divided into the one-stage method and the two-stage method according to whether candidate regions are obtained in a separate step. The latter can obtain higher accuracy, but the corresponding model is more complex, while the former has a more concise structure yet lower accuracy. Action detection plays the same role in the video field as object detection plays in the image field, so many action detection algorithms have been influenced by object detection.

In terms of the tasks to be completed, action recognition is the premise of action detection and the most important step in action detection. From the perspective of algorithms, research on action recognition algorithms is the basis of action detection research, and action detection research must solve a series of problems faced by action recognition. Action detection not only needs to pay attention to the action recognition step but also needs to solve the problem of the temporal and spatial position of the actor. In general, many algorithms for action detection are just starting out, and room for improvement in efficiency and accuracy remains large.

    4.1 Difficulties and Challenges

    4.1.1 Difficulty in Data Collection

At present, action recognition and action detection based on deep learning mainly rely on supervised learning, which has a strong dependence on data [123]. Hence the requirements for sample size as well as the scenarios covered by datasets are increasing. Action recognition and action detection datasets usually consist of video data, which is larger than image data and more cumbersome in the process of data collection, pruning, and labeling.

Action detection datasets need to label not only the categories of actions but also the spatial and temporal locations of actions, which is an arduous task. Temporal action detection requires the start and end times of actions to be annotated at the frame level. Spatiotemporal action detection also requires accurate labeling of the spatial location of the actor. This has led to a significant increase in the workload of dataset annotation, so some datasets have to adopt compromises such as sparse annotation to ease the pressure of data labeling.

    4.1.2 High Hardware Requirements

At present, action recognition and action detection face massive computing power costs. With the gradual complication of deep learning models, especially since the transformer has penetrated the field of computer vision, it has become more difficult to train models, as the requirements for GPU computing power have gradually increased. The computing power overhead cannot be provided by ordinary hardware, which prevents the large-scale application of current action recognition and detection algorithms [124].

With the development of action recognition and action detection, to obtain features from videos effectively, the scale of the corresponding datasets has been expanding continuously. Thus, some open-source datasets reach hundreds of GB or even several TB, which is a huge burden on the storage and read-write capacity of computers. At present, most datasets need to undergo preprocessing such as serialization or optical flow extraction before training, bringing a heavy computational and read-write burden to the computer.

    4.1.3 Difficulty Judging Action Features

    Action recognition faces the following difficulties:

The first is the complexity of fine-grained recognition. Just as there are a thousand Hamlets in the eyes of a thousand viewers, human action is complex and diverse and may have different meanings from different perspectives. Therefore, it is difficult to strictly divide the categories of action; for a flapping action, for example, its speed directly determines whether the action itself has violent attributes.

The second is the complexity of spatial information. Lighting variations, occlusion, and noise caused by video background information can adversely affect feature extraction. Different viewing angles towards the actor also cause scale transformation problems, which further complicate the judgment of action characteristics.

The third is the complexity of temporal information. The modeling of the temporal dimension is the core problem of action recognition, yet also a key difficulty. Judging from the current state of development, the extraction of temporal information remains very difficult [125].

    Action detection also faces several difficulties:

The first is that action detection is limited by the performance of action recognition. Action recognition is the basis of action detection, but there are still many problems to be solved in current action recognition tasks, which brings fundamental difficulties to subsequent action detection.

The second is the ambiguity of positioning actions in the space-time dimensions. From the perspective of the temporal dimension, the definition of the start and end points of certain actions is vague, and the length of actions also varies. From the perspective of the spatial dimension, motion must be considered when positioning the actor, and multiple frames must be combined to avoid jitter.

    4.2 Future Research Trends

    4.2.1 Enrich the Dataset

Action detection and recognition mainly adopt supervised learning, so sufficient data support must be ensured. Due to the complexity of human action and the fine-grained requirements of practical applications, it is necessary to further expand the existing datasets. At present, there are many datasets in the field of action recognition, but for specific task scenarios, the data covering related types of actions still needs to be expanded. Research on action detection started late, hence the datasets in state-of-the-art research are few, and the coverage of scenarios remains incomplete. Therefore, it is now necessary to focus on enriching relevant datasets to overcome data scarcity. For specific task scenarios, existing datasets can also be extended using data augmentation such as rotation, Mixup, adding noise, and so on, which can alleviate the problem of insufficient data [126].

    4.2.2 Few-Shot Learning

Given the above problems, in specific scenarios of action recognition and detection, such as the recognition and detection of violent acts in security scenarios or of illegal operations in industrial production scenarios, few-shot learning can be used to relieve the pressure. The basic idea of few-shot learning is to train the network to learn meta-knowledge from a large number of prior tasks and then use the existing prior knowledge to guide the model to learn faster in the new task [127]. Few-shot learning can obtain data features from a small number of samples, reducing the intensive dependence on data in action recognition and action detection.

    4.2.3 Model Lightweight

Because existing algorithms have huge computing power overhead and are difficult to promote and deploy on a large scale, lightweight operations such as pruning the model are an important direction for subsequent research. Besides, it is also necessary to consider reducing the complexity of the model when designing the network structure [128]. For example, algorithms such as the SlowFast network and the TSM network effectively promote the progress of research by reducing computing overhead. From the perspective of practical application, action recognition and action detection algorithms often cannot be deployed in actual task scenarios because their model architectures are too complex, which imposes high hardware requirements. Therefore, from the perspective of real needs, model lightweighting is a task that must be completed.

    4.2.4 Transformer Model

From the above content, the current transformer model participates extensively in action recognition and action detection. As a popular model across the fields of NLP and computer vision, the transformer will play an important role in the future development of action recognition and detection. At present, the transformer has higher accuracy than CNNs in action recognition and action detection tasks and can better connect or collaborate with NLP in future large-model research. The transformer model still needs to be optimized in terms of model parameters to reduce hardware requirements, so it can be further optimized on this basis.

    5 Conclusion

This paper systematically reviews the current research status of action recognition and detection, focuses on four commonly used models for action recognition, divides action detection into temporal action detection and spatiotemporal action detection, and elaborates on the development of the algorithms in each setting. Finally, this paper summarizes action recognition and detection, sorts out the differences and connections between the various algorithms, and expounds on the prominent problems faced by current research and the general direction of future development.

Acknowledgement: None.

Funding Statement: This work was supported by the National Educational Science 13th Five-Year Plan Project (JYKYB2019012), the Basic Research Fund for the Engineering University of PAP (WJY201907), and the Basic Research Fund of the Engineering University of PAP (WJY202120).

Author Contributions: Study conception and design: Y. Li, X. Cui; data collection: Q. Liang; analysis and interpretation of results: B. Gan; draft manuscript preparation: Q. Liang. All authors reviewed the results and approved the final version of the manuscript.

Availability of Data and Materials: All data in this paper can be found in Google Scholar.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
