• <tr id="yyy80"></tr>
  • <sup id="yyy80"></sup>
  • <tfoot id="yyy80"><noscript id="yyy80"></noscript></tfoot>
  • 99热精品在线国产_美女午夜性视频免费_国产精品国产高清国产av_av欧美777_自拍偷自拍亚洲精品老妇_亚洲熟女精品中文字幕_www日本黄色视频网_国产精品野战在线观看 ?

    Scribble-Supervised Video Object Segmentation

    2022-01-25 12:51:00PeiliangHuangJunweiHanNianLiuJunRenandDingwenZhang
    IEEE/CAA Journal of Automatica Sinica 2022年2期

    Peiliang Huang,Junwei Han,,Nian Liu,Jun Ren,and Dingwen Zhang,

    Abstract—Recently,video object segmentation has received great attention in the computer vision community.Most of the existing methods heavily rely on the pixel-wise human annotations,which are expensive and time-consuming to obtain.To tackle this problem,we make an early attempt to achieve video object segmentation with scribble-level supervision,which can alleviate large amounts of human labor for collecting the manual annotation.However,using conventional network architectures and learning objective functions under this scenario cannot work well as the supervision information is highly sparse and incomplete.To address this issue,this paper introduces two novel elements to learn the video object segmentation model.The first one is the scribble attention module,which captures more accurate context information and learns an effective attention map to enhance the contrast between foreground and background.The other one is the scribble-supervised loss,which can optimize the unlabeled pixels and dynamically correct inaccurate segmented areas during the training stage.To evaluate the proposed method,we implement experiments on two video object segmentation benchmark datasets,YouTube-video object segmentation (VOS),and densely annotated video segmentation(DAVIS)-2017.We first generate the scribble annotations from the original per-pixel annotations.Then,we train our model and compare its test performance with the baseline models and other existing works.Extensive experiments demonstrate that the proposed method can work effectively and approach to the methods requiring the dense per-pixel annotations.

    I.INTRODUCTION

    VIDEO object segmentation (VOS) aims at segmenting one or multiple foreground objects from a video sequence,which has been actively studied in computer vision applications,such as object tracking [1],video summarization[2],and action recognition [3].Segmenting objects on videos are partially inspired by researches about semantic segmentation,e.g.,semantic segmentation in the urban scene[4],[5].However,VOS needs to simultaneously explore both the spatial and temporal information.

    Based on the levels of human intervention in test scenarios,existing VOS approaches can be divided into two main categories,which are the semi-supervised approaches [6]–[10]and unsupervised approaches [11]–[15],respectively.The semi-supervised VOS approaches need the ground-truth annotations of the objects for the first frame for each test video sequence,while the unsupervised approaches do not need such annotation.However,both the semi-supervised and unsupervised approaches need the per-pixel annotations of each video frame in the training stage,and the per-pixel annotation on large-scale video data is prohibitively expensive and hard to obtain.Due to the lack of prior knowledge of the primary objects in the first frame,the unsupervised approaches cannot select the objects of interest flexibly.

    In this paper,to achieve the trade-off between annotating efficiency and model performance,we propose a novel method to achieve video object segmentation that needs only the scribble-level supervision.Specifically,our method requires the scribble annotation on each video frame in training while only requires the scribble annotation on the first video frame in testing.Consequently,compared to the conventional semi-supervised VOS and unsupervised VOS methods,our scribble-supervised VOS method has two advantages.Firstly,as shown in Fig.1 (a) and 1(b),unlike the per-pixel annotations,drawing a scribble greatly reduces the cost of labeling.Secondly,the annotated scribble on the first frame of each test video can provide informative guidance for segmenting the desired objects,which is beyond the capacity of the unsupervised VOS approaches.

    Implementing the scribble-supervised VOS process is very challenging.As shown in Fig.1 (a),per-pixel annotations allow fully supervised VOS methods to achieve reliable segmentation results since they provide supervision on all aspects of the targeted objects,including structure,position,boundary detail,etc.However,scribble-level annotations (see Fig.1 (b)) only provide a limited number of labeled pixels inside the objects which are highly sparse and most of the pixel information is missing.In this case,missing the supervision information would limit the performance of the deep model trained on scribble-level annotations.Directly using conventional network architectures and object functions to learn video object segmentation models with these sparse scribbles may lead to segmentation results that have poor boundary and details,as shown in Fig.1 (c).

    To achieve high-quality video object segmentation results,we introduce a novel scribble attention module and a scribblesupervised loss to build the video object segmentation model.

    Fig.1.The illustration of different annotations and the results of different methods supervised by scribbles.(a) Manually annotating cost with the perpixel annotations.(b) Manually annotating cost with the scribble-level annotations.(c) Baseline model:trained only on scribbles with partial crossentropy loss [16].(d) Ours.

    First,we design a novel scribble attention module that learns an effective attention map to enhance the contrast between the foreground and background.Most existing selfattention models utilize global context regions to construct contextual features,where the information at each location is aggregated.However,not all contextual information is reliable for the final decision.Instead of learning dense pairwise relationships globally,the scribble attention module is able to selectively learn the relationship between reliable foreground positions and the query position with the help of the foreground scribble annotation information.Specifically,the scribble attention module computes the response at a query position as a weighted sum of the features at the scribble positions in the input feature maps.Intuitively,if the query position belongs to the background region,the pairwise affinities between it and all scribble positions are relatively low,so is the aggregated response of this query position.On the contrary,if the query position belongs to the foreground region,the pairwise affinities between it and all scribble positions are relatively high,so is the aggregated response of this query position.Therefore,the scribble attention module can capture more accurate context information and learn an effective attention map to enhance the contrast between the foreground and background.

    Second,we propose a novel scribble-supervised loss which optimizes both the labeled and unlabeled pixels by making full use of pixel positions and red-green-blue (RGB) color.This scribbled-supervised loss can be seen as a regularized loss and cooperated with the partial cross-entropy loss [16] to optimize the training process.Specifically,we leverage the partial cross-entropy (pCE) loss [16] on the scribble labeled pixels and scribble-supervised loss on both the labeled and unlabeled pixels in “shallow” segmentation [16],[17].Our scribble-supervised loss penalizes disagreements between the pairs of pixel points and softly enforces output consistency among all pixels points based on predefined pairwise affinities.These pairwise affinities are based on dense conditional random field (CRF) [18] which makes full use of pixels position and RGB color.By adding this regularization term,all pixel information is used throughout the training process,which compensates for missing information in scribble-level annotations.Furthermore,we dynamically divide the predicted mask into three different confidence regions and assign different scribble-supervised loss weights to these three regions at the same time.Thus,we can achieve better segmentation of the object’s boundary by dynamically correcting the inaccurate segmentation area throughout the training process.

    The contributions of this paper can be summarized as follows:

    1) We propose a novel weakly-supervised VOS method which trains the model with only scribble-level annotations.To the best of our knowledge,this is one of the earliest VOS frameworks working under the scribble-level supervision.

    2) We design a novel scribble attention module which can capture more accurate context information and learn an effective attention map to enhance the contrast between foreground and background.

    3) We propose the scribble-supervised loss and integrate it into the training process to compensate for the missing information of scribble-level annotations and dynamically to correct the inaccurate segmentation area.

    4) We evaluate the proposed algorithm on two benchmark datasets quantitatively and qualitatively.The extensive experiments show that the proposed algorithm effectively reduces the gap to approaches trained with the per-pixel annotations.

    The remainder of this paper is organized as follows.A brief review of the related works on video object segmentation,scribble-supervised semantic segmentation,and self-attention module is presented in Section II.Then,the problem formulation and detailed explanations of the proposed method are presented in Section III,followed by the experimental results in Section IV.Finally,the conclusions are drawn in Section V.

    II.RELATED WORKS

    We start with providing an overview of related works on video object segmentation,followed by an overview of scribble-supervised semantic segmentation,and finally review several related works for self-attention models.

    A.Video Object Segmentation

    Semi-Supervised Methods:In the semi-supervised setting,the ground truth masks of the foreground objects are given for the first frame in the video sequence,the goal is to propagate the object masks of the given objects for the rest of the video sequence.Perazziet al.[6] combined offline and online learning strategies,where the former produces a refined mask from the estimation on previous frames,and the latter captures the appearance of the specific object instance.Caelleset al.[7] proposed a one-shot video object segmentation (OSVOS)method,which pre-trains a convolutional network for foreground-background segmentation and fine-tunes it on the first-frame ground truth of the video sequence at test time.After that,Maniniset al.[19] improved the OSVOS by semantic information from an instance segmentation network.Voigtlaender and Leibe [20] extended the OSVOS by updating the network online using training examples selected based on the confidence of the network and the spatial configuration.

    These online methods greatly improve segmentation results but sacrifice running efficiency.To address the time cost in the fine-tuning stage,some recent works achieved a better runtime and a satisfactory segmentation performance without fine-tuning.Chenet al.[21] proposed a blazingly fast video object segmentation method with pixel-wise metric learning(PML),which uses a pixel-wise embedding space learned with a triplet loss together with the nearest neighbor classifier.Inspired by PML,Voigtlaenderet al.[10] proposed a fast endto-end embedding learning method for video object segmentation,which uses a semantic pixel-wise embedding mechanism together with a global and a local matching mechanisms to transfer information from the first frame and the previous frame of the video to the current frame.Johnanderet al.[9] proposed a generative appearance model(A-GAME) which learns a powerful representation of the target and background appearance in a single forward pass.Although semi-supervised VOS methods have achieved tremendous progress,they rely on the high cost of per-pixel annotation.

    Unsupervised Methods:In the unsupervised setting,the ground truth masks of the foreground objects are not given at all.Early unsupervised VOS approaches mainly leveraged object proposals [11],[21]–[25],analyzed long-term motion information (trajectories) [12],[26]–[29] or utilized saliency information [30]–[33],to infer the target.Recently,with the success of deep learning,many convolutional neural network(CNN)-based models were proposed.For example,Jainet al.[34] proposed a two-stream architecture which fuses motion and appearance in a unified framework for segmenting generic objects in videos.Chenget al.[35] proposed an end-to-end trainable SegFlow network which simultaneously predicts optical flow and object segmentation in video sequences.The information of optical flow and object segmentation are propagated bidirectionally in this unified framework.This method needs to fine-tune the pre-trained segmentation network for specific objects.Liet al.[36] proposed an instance embedding network to produce an embedding vector for each pixel.This embedding vector can identify all pixels belonging to the same foreground.Then,they combined motion-based bilateral networks for identifying the background.Songet al.[13] used a video salient object detection method to segment objects,which fine-tunes the pre-trained segmentation network for extracting spatial saliency features and trains ConvLSTM to capture temporal dynamics.Luet al.[14] proposed a novel co-attention siamese network to address the unsupervised video object segmentation task from a holistic view.This method emphasizes the importance of inherent correlation among video frames and incorporates a global co-attention mechanism.Wanget al.[37] conducted a systematic study on the role of visual attention in the unsupervised VOS task,which decouples unsupervised VOS into two sub-tasks:Attention-guided object segmentation in the spatial domain and unsupervised VOS-driven dynamic visual attention prediction in spatiotemporal domain.However,due to the lack of prior knowledge about the primary objects,these unsupervised VOS methods are difficult to correctly distinguish the primary objects from the complex background in real-world scenarios.Additionally,they also need fully annotated masks in the training stage like the semi-supervised method,which is expensive and time-consuming.

    Our method uses scribble-level annotations as the supervision of each video frame in the training stage.We also use scribble-level annotations as the object’s inference mask of the first frame in the testing stage.This greatly reduces the cost of labeling during both the training and testing phases.Furthermore,it avoids the lack of prior knowledge about the primary objects.

    B.Scribble-Supervised Semantic Segmentation

    To avoid requiring expensive pixel-wise labels,semantic segmentation methods attempt to learn segmentation from low-cost scribble-level annotations [16],[17],[38],[39].Linet al.[38] first proposed scribble-supervised convolutional networks for semantic segmentation.Tanget al.[16]combined partial cross-entropy loss on labeled pixels and normalized cut loss for unlabeled pixels.After that,Tanget al.[17] further explored the role of different regularized losses for scribble-supervised segmentation tasks.Wanget al.[39] proposed a boundary perception guided method,which employs class-agnostic edge maps for supervision to guide the segmentation network.To the best of our knowledge,there is no published work for scribble-supervised VOS.Comparing with the scribble-supervised semantic segmentation,the scribble-supervised VOS task is more complex since it is still costly to manually draw scribble-level labels for each object in a large-scale video dataset.Further,the model has to detect and segment the objects along the video sequence depending on the scribble-level label only in the first frame.In this paper,we simulate the human annotations on the original per-pixel annotated datasets.Therefore,the time cost of manual labeling is further reduced.Furthermore,we propose a novel scribble attention module and a scribble-supervised loss to build our video object segmentation model.

    C.Self-Attention Mechanisms in Neural Networks

    A self-attention module computes the response at a position by using the weighted sum of all the features in an embedding space,which can capture the long-range information.Vaswaniet al.[40] first proposed the self-attention mechanism to draw global dependencies between input and output and applied it in the task of machine translation.This attention mechanism has been attended in the computer vision field.Wanget al.[41] proposed a non-local network for capturing long-range dependencies,which computed the response at a position as a weighted sum of the features at all positions.Inspired from[41],Yiet al.[42] proposed an improved non-local operation to fuse temporal information from consecutive frames for video super-resolution,which was designed to avoid the complex motion estimation and motion compensation and make better use of spatiotemporal information.Fuet al.[43]proposed a dual attention network to model the semantic interdependencies in spatial and channel dimensions respectively.Zhanget al.[44] proposed an aggregated cooccurrent feature module for semantic segmentation,which learned a fine-grained spatial invariant representation to capture co-occurrent context information across the scene.The difference between the co-occurrent feature module and non-local network is that the co-occurrent feature model learns a prior distribution conditioned on the semantic context[44].

    Our scribble attention module is inspired by the success of above attention modules.The key difference between our scribble attention module and them is that the scribble attention module computes the response at a position as a weighted sum of the features at the scribble positions labeled in objects by multiplying scribble-level ground truth and feature maps.

    III.METHOD

    This work aims to design an algorithm for weakly supervised video object segmentation with scribbles as supervision.We introduce the mask-propagation module [9]to propagate the object appearance and introduce the appearance module [9] to provide discriminative posterior class probabilities.To achieve high-quality segmental results with sparsely annotated scribbles,we add a scribble attention module in the two backbones to enhance the contrast between the foreground and background and propose a scribblesupervised loss to take both the labeled and unlabeled pixels into account.In this section,we first present an overview of our proposed VOS method in Subsection A.Then,we discuss the mask-propagation module and the appearance module in Subsection B.In Subsections C and D,we introduce in detail the scribble attention module and the scribble-supervised loss,respectively.Finally,we elaborate on the generation process of scribble annotation in Subsection E.For easy retrieval,we summarize the used symbols of this paper in Table I.

    A.Overview

    The flowchart of complete VOS in this work is shown in Fig.2,which can be divided into two stages:training stage and test stage.In the training stage,we generate the scribble annotations for all the frames and the detailed generation process is described in Subsection E.The scribble annotations of each frame are used to train our scribble-supervised video object segmentation (SSVOS) model.In the test stage,we only generate the scribble annotation for the first video frame and predict the segmentation mask for each frame.The proposed SSVOS model is illustrated in Fig.3.Given a video frame It,the features are first extracted with a backbone network.Then,they are input to the mask-propagation module and the appearance module.The features and the scribble annotation of the initial frame I0are input to the maskpropagation module to provide a rough prior.The coarse segmentation mask of frame It?1is input to the maskpropagation module to propagate the object’s appearanceinformation from frame It?1to frame It.The coarse segmentation mask and the mean and covariance of frameIt?1are used to update the appearance module.The two features from the mask-propagation module and the appearance module are combined in the fusion module,which comprises two convolutional layers.We first concatenate these two features along the channel dimension and obtain aggregated feature maps.Then,we fuse the aggregated features to generate a coarse mask encoding by using two convolution layers with 3×3 and 1×1 convolutional kernels,respectively.The coarse mask encoding is input to a convolution layer with 3×3 convolutional kernels to generate a coarse segmentation mask.This mask is further used to update the maskpropagation module to propagate the object’s appearance information from frame Itto frame It+1.The coarse segmentation mask is also used to update the appearance module to provide posterior possibilities for frame It+1.Finally,the coarse mask encoding from the fusion module is passed through the upsampling module and a convolution layer with 3×3 convolutional kernels to produce the final segmentation mask.

    TABLE IA SUMMARY OF SYMBOLS USED IN THIS PAPER

    Fig.2.Flowchart of complete video object segmentation in this work.

    Fig.3.The architecture of the proposed scribble-supervised video object segmentation model (SSVOS).The scribble attention (SA) module is only embedded into the third block of the backbone network.The features and the scribble annotation of the initial frame I0 are first concatenated with the features of the current frame It and the coarse segmentation mask of the previous frame It?1.Then,the concatenated features are input to the mask propagation module.The features of It and the coarse segmentation mask of I t?1 are input to the appearance module.Furthermore,the mean and covariance of I t?1 are also input to the appearance module.The two features from the mask propagation module and the appearance module are fused in the fusion module and generate a coarse mask encoding.This coarse mask encoding is input to a convolution layer to generate a coarse segmentation mask which is then used to update the maskpropagation module and the appearance module for the next frame I t+1.The coarse mask encoding is then refined by the upsampling module and a convolution layer to generate the final segmentation.The pCE loss and the auxiliary pCE loss only consider the labeled pixels and scribble-supervised (SS) loss considers both the labeled and unlabeled pixels.“Conv” indicates “convolution”.“r” indicates dilation rate [45].“Cov” indicates covariance.

    B.Mask-Propagation and Appearance Modules

    Inspired by A-GAME [9],we use a mask-propagation module to propagate the object’s appearance and keep the temporal-spatial consistency.As shown in Fig.3,the maskpropagation module consists of three convolutional layers,where the first and third layers are two convolution layers with 1×1 and 3×3 convolutional kernels,respectively.The middle layer is a dilation pyramid [45],which has three 3×3 convolution branches with dilation rates of 1,3,and 6,respectively.In the first frame I1,the scribble annotation of initial frame I0is used as input to the mask-propagation module to propagate the object’s appearance from frame I0to frame I1.In the subsequent frames,where the scribble annotations are not available,the coarse segmentation mask from It?1is used as the input to update the mask-propagation module.Thus,the mask-propagation can propagate the object’s appearance from frame It?1to frame It.As the displacements between adjacent frames are very small,temporal-spatial consistency can be kept,and the mask-propagation module can provide the accurate foreground prior estimation to frame It.

    Furthermore,we also use an appearance module [9] to return the posterior class probabilities at each image location.These posterior class probabilities form an extremely strong cue for foreground/background discrimination.Like [9],let us denote the of features extracted from the video frames as{Mp}.The feature Mpat each pixel locationpis aDdimensional vector.We model these feature vectors as i.i.d samples drawn from the underlying distribution:

    wherezpis a discrete random variable which assigns the observation Mpto a specific componentzp=k.We use a uniform prior Ψ(zp=k)=1/Kforzp.Kis the number of components and each component exclusively models the feature vectors of either the foreground or the background.In(1),each class-conditional density is a multi-variate Gaussian with mean μkand covariance matrix Σk:

    In the first frame,our module parameter is inferred from the extracted features and the scribble annotation of initial frame I0.In subsequent frames,we update the module using the network coarse segmentation predictions as soft class labels.We compute the appearance module parameter of frame Itas follows:

    Given the module parameters computed in the previous frame It?1,τt?1,our appearance module can predict the posterior map of frame Itas follows:

    C.Scribble Attention Module

    As illustrated in Fig.4,given an input feature X ∈RC×H×W,we first transform it into three new feature maps Φ(X),Θ(X)and Ω(X),respectively,where {Φ(X),Θ(X),Ω(X)}∈RC/2×H×W.We resize the scribble-level ground truth to R1×H×W.Then,we perform an element-wise multiplication between the Φ(X) and the scribble-level ground truth.Here,the foreground scribble pixels are set to 1 and the rest are set to 0.After that,we calculate the attention as

    where Ajiindicates the extent to which the model attends to theith position when synthesizing thejth position.δf sindicates the foreground scribble pixels position.Tindicates the transpose operation.Φ(Xi)=WφXi,Θ (Xj)=WθXj,where Wφand Wθare two weight matrices which are learned via 1×1 convolution.Then,the output of the attention layer in thejth position can be represented:

    where Ω(Xi)=WωXiand Wωis a weight matrix which is learned via 1×1 convolution.Finally,we further multiplyOjby a scale parameter γ and perform an element-wise sum operation with the input feature X to obtain the final output Z ∈RH×W×Cas follows:

    where γ is initialized as 0 and gradually learned to assign more weight [43],[46].The output feature Zjof the scribble attention module at each position is a weighted sum of the features at the positions of the foreground scribbles and original features.

    The scribble attention module computes the response at a position as a weighted sum of the features at the scribble positions labeled in objects.Each pixel in the scribble belongs to the foreground object.Thus,the module can capture more accurate context information and learn an effective attention map to enhance the contrast between foreground and background,which plays an important role in the segmentation task.

    Fig.4.Illustration of the proposed scribble attention module.

    D.Scribble-Supervised Loss

    As shown in Fig.3,the annotated scribbles are sparsely provided,and the pixels that are not annotated are considered as unknown.We use the (auxiliary) partial cross-entropy loss[16] to consider the partially labeled pixels which can effectively ignore unknown regions.This partial cross-entropy loss can be seen as a sampling of cross-entropy loss with perpixel masks by rewriting it as follows:

    whereq(i) is a binary indicator,which equals 1 if pixeliis labeled on the foreground scribble and 0 otherwise;P (i) is the softmax output probability of the network;δf sindicates that the scribble labeled pixels.

    So far,all scribble labeled pixels have been used for training with partial cross-entropy loss.However,comparing with perpixel annotations,scribbles are often labeled on the internal the objects which are lack of determinate boundary information of objects and confident background regions.Therefore,with limited pixels as supervision,object structure,and boundary details cannot be easily inferred.Furthermore,since the scribble-level annotations only label extremely fewer pixels,training the CNN networks on scribbles would lead to overfitting.The proposed scribble-supervised loss optimizes the training process on both the scribble-labeled and unlabeled pixels,which compensates for missing pixel information in unlabeled pixels.Thus,the scribble-supervised loss can be seen as a regularized loss and can prevent overfitting.Our scribble-supervised loss penalizes disagreements between the pairs of pixel points and softly enforces output consistency among all pixel points based on predefined pairwise affinities.These pairwise affinities are based on dense CRF [18] which makes full use of pixels position and RGB color.To this end,we first define the energy formulation between pixeliand pixeljbased on [47],[48] as follows:

    wherecaandcbare the class labels.(i) and(j) are the soft-max outputs of network at pixeliand pixelj,respectively.g(i,j) is a Gaussian kernel bandwidth filter which considers both pixel spatial positions and RGB colors and can be represented as follows:

    whereNis the normalized factor,dandrare the pixel spatial position and RGB color,respectively.δdand δrare the hyper parameters that control the bandwidths of Gaussian kernels.We use the Potts model [49] as a discontinuity preserving function to penalty disagreements between the pairs of pixels and encourage the pixels in the same region to have equal labels.Assuming that the segmentation results P are restricted to binary class indicators P ∈{0,1},the standard Potts model[49] could be represented via Iverson brackets [·],as on the left-hand side below:

    whereg(i,j) is a matrix of pairwise discontinuity costs or an affinity matrix.The right-hand side above is a particularly straightforward quadratic relaxation of the Potts model [49]that works for relaxed P ∈[0,1] corresponding to a typical soft-max output of CNNs [17].

    Inspired by [48],we use a modified quadratic relaxation of the Potts model [49] to simplify the energy formulation:

    The affinity matrixg(i,j) is based on a dense Gaussian kernel which makes full use of pixels position and RGB color.We can see that the Potts model [49] penalizes disagreements between the pairs of pixels only based on the spatial and appearance information of the pixels,and there is no need for any supervision information of groundtruth.

    The dense CRF loss [17],[47],[48] can be represented as follows:

    As we all know,the difficulty for the segmentation task is the segmentation of the object’s boundary.In this task,since scribbles have only limited pixels as supervision,it is more difficult to obtain accurate boundary segmentation masks.In order to better segment the boundary regions,we divide the predicted mask into three subregions by predicted mask scores.Specially,we identify pixels with scores larger than 0.3 as reliable foreground pixels,and smaller than 0.05 as reliable background pixels.The remaining pixels are classified as unreliable areas.Then,we assign different loss weights to these areas.Based on this idea,we design a scribblesupervised (SS) loss:

    wherew(i) is an adaptive region weight and can be represented as follows:

    where β controls the weight of the scribble-supervised loss on unreliable regions.

    For each prediction branch,we deploy a new joint loss function combining the partial cross-entropy loss with the designed scribble-supervised loss function:

    where λ controls the weight of the scribble-supervised loss.

    E.Scribble Simulation Generation

    Compared with per-pixel labels,scribbles greatly reduce the expensive cost of annotation.However,it is still costly to manually draw scribble-level labels for each object in a largescale video dataset.In order to reduce the annotation cost of video tasks,Liuet al.[50] designed an interactive computer vision system to allow a user to efficiently annotate motion.Karasevet al.[51] proposed a frame selection method for video annotation,which naturally allows for uncertainty in the measurements,and thus is applicable to both an automated annotation,as well as a manual annotation.Luoet al.[52]used an active learning framework to obtain higher segmentation accuracy with less user annotation.Differently from the above methods,we are inspired by [53] to simulate scribble-level annotations on each video frame from two available per-pixel annotated datasets,YouTube-VOS [54],and densely annotated video segmentation (DAVIS)-2017[55].In these two datasets,as all video frames are provided with per-pixel annotations,we can obtain the scribbles of all video frames by leveraging the proposed scribble simulation generation process.For each video frame,we use morphological skeletonization [56] to automatically generate realistic foreground object scribbles and randomly generate a rectangular box as the background scribbles.As shown in Fig.5,the whole process consists of four steps:1) Foreground skeletonization:we achieve the skeletonization of the perpixel annotated masks to get foreground scribbles by using a fast implementation of the thinning algorithm [56].2)Morphological dilation:these scribbles directly obtained through the thinning algorithm in Step 1 are a single-pixel width.These single-pixel width scribbles have a gap to the manually annotated scribbles.Therefore,we use a morphological dilation algorithm to obtain the five-pixel width scribbles.Then,we use the original per-pixel annotated mask and the obtained scribble-level mask to conduct dot multiplication to eliminate the pixels which are extended to the background due to morphological expansion.We use these scribbles as the final foreground object scribbles.3)Background generation:We randomly generate a five-pixel width rectangular box as the background scribble in the background area around the foreground scribbles.When the foreground object touches an edge of the image,the rectangular edge corresponding to the edge of the image will not be needed,e.g.,the background scribble in Fig.3.4) We label the areas outside of the foreground and background scribble regions as unknown areas.

    Fig.5.Illustration of scribble simulation generation.From left to right:RGB images,per-pixel ground-truth annotations,foreground scribbles,dilated foreground scribbles,final scribbles annotations (best viewed in amplification).

    IV.ExPERIMENTS

    The proposed algorithm is evaluated on two most recent VOS benchmark:YouTube-VOS [54] and DAVIS-2017 [55]dataset.We first conduct ablation studies on YouTube-VOS[54].Then,we compare it with the state-of-the-art VOS methods on both YouTube-VOS [54] and DAVIS-2017 [55]datasets.

    A.Datasets

    YouTube-VOS:The YouTube-VOS consists of 3 471 and 474 videos in the training and validation sets,respectively,being the largest video object segmentation benchmark.Each video sequence is 20 to 180 frames long and every fifth frame is annotated.In the training set,there are 65 unique object categories which are regarded as seen categories.The validation set contains 91 unique object categories,which contain all the seen categories and 26 unseen categories.

    DAVIS2017:The DAVIS2017 is a popular benchmark dataset for VOS tasks which is a multi-object extension of DAVIS2016 [57].In the training set,there are 60 video sequences with multiple annotated instances.The validation and test-dev sets include 30 video sequences with multiple annotated instances,respectively.

    B.Evaluation Metrics

    We evaluate experiment results using three usual evaluation measures suggested by [58]:1) Region similarityJ:The mean intersection-over-union (mIoU) between the ground truth and the predicted segmentation masks.2) Contour accuracyF:TheF-measures of the contour-based precision and recall between the contour points of the ground truth and the predicted masks.3) Overall score G:The average score ofJandF.For the YouTube-VOS dataset,bothJandFare divided into two different measures,depending on whether the categories of the validation are included in the training set(JseenandFseen),or these categories are not included in the training set (JunseenandFunseen.

    C.Implementation Details

    Our framework is based on the A-GAME [9],which is the state-of-the-art in semi-supervised VOS.Our network is implemented based on PyTorch [59].We choose Adam [60]as the optimizer to minimize the joint loss.Adam is an algorithm for first-order gradient-based optimization of stochastic objective functions which is based on adaptive estimates of lower-order moments.Following [9],we divide network training into two stages.In the first stage,we train for 80 epochs on half resolution images (240×480).We set batch size to 4 video snippets and set 8 frames in each snippet.We set a learning rate of 1 0?4,exponential learning rate decay of 0.95 each epoch,and a weight decay of 10?5.In the second stage,we train 100 epochs on full image resolution.The batch size is set to 2 snippets,to accommodate longer sequences of 14 frames.We set a learning rate of 1 0?5,exponential learning rate decay of 0.985 each epoch,and a weight decay of 1 0?6.In both stages,we set the hyper parameters γdand γrof the Gaussian kernels bandwidths to 100 and 15,respectively.We set the scribble-supervised loss weight λ to 2× 10?15and set the unreliable region weight β to 0.1.We determine all hyperparameters via experience and grid search method.To show that they are the best settings,we vary these hyper-parameters in different ways,measuring the change in performance on the YouTube-VOS dataset [54].In Table II rows A and B,we vary the learning rateslr1andlr2in the two training stages,respectively.Each overall score decreases to varying degrees.In rows C and D,we observe that the overall score decreases to varying degrees when varying the hyper-parameters γdand γrof the Gaussian kernels bandwidths.In rows E and F,we vary the loss weight λ and region weight β.Each overall score decreases to varying degrees.These ablation experimental results prove that we can choose the best settings of the hyperparameters via experience and grid search method.

    D.Ablation Studies

    We conduct ablation studies on YouTube-VOS to verify the effectiveness of the scribble attention module and scribblesupervised loss.We compare our baseline with different variant models and the quantitative results for different models are shown in Table III.We also show qualitative comparison results in Fig.6.“pCE” denotes our model training on the scribble labels with the partial cross-entropy loss which can be seen as our baseline.“SA” denotes the proposed scribble attention module.“PSA” denotes the pointwise spatial attention module [61].“NL” denotes the nonlocal module [41].“CCA” denotes the criss-cross attention module [62].“SS” denotes the proposed scribble-supervised loss.“DCRF" denotes the dense CRF loss [17],[47],[48].“ ?” denotes the model with the corresponding module.

    Analysis on Scribble Attention Module:We first analyze the impact of the proposed scribble attention module.To confirm its effectiveness,we compare the performances of the baseline and its variant by adding the proposed scribble attention module.The second row of Table III shows the result of adding the scribble attention module to our baseline model.The overall score increases by more than 3.5% compared with our baseline model (first row of Table III).Fig.6 shows some qualitative results and the segmentation results (second column of Fig.6) with our scribble attention module are greatly improved compared to the baseline model.These quantitative and qualitative comparison results demonstrate usefulness of the proposed scribble attention module for this scribble-supervised video object segmentation.To further demonstrate the effectiveness of the proposed scribbleattention module in this special scribble-supervised task,we replace our scribble attention module with three recent attention modules,“PSA” [61],“NL” [41] and “CCA” [62].As shown in Table III,“PSA”,“NL” and “CCA” all achieve better performance compared with our baseline model,which demonstrates that the self-attention mechanism is able to capture contextual dependencies and is helpful for the VOS task.Compared with the proposed scribble attention module,the overall scores of “PSA”,“NL” and “CCA” drop from 47.8 to 46.5,46.6,and 46.9,respectively.The second and third columns of Fig.6 show the quality results of “SA” and“CCA”,respectively.We can see that the segmentation results of “SA” are better than those of “CCA”,such as human leg area.These quantitative and qualitative comparison results further demonstrate the superior performance of our scribble attention module.

    TABLE IIG OVERALL SCORE UNDER DIFFERENT HYPER-PARAMETERS SETTINGS.“BASE” INDICATES THE BASE HYPER-PARAMETERS SETTINGS WHICH ARE USED IN THIS PAPER.UNLISTED VALUES ARE IDENTICAL TO THOSE OF THE BASE HYPER-PARAMETERS SETTINGS

    TABLE IIIABLATION STUDIES ON YOUTUBE-VOS DATASET

    Scribble-Supervised Loss:Another ablation study is carried out on the use of the proposed scribble-supervised loss by adding it to the baseline model.The sixth row of Table III shows the quantitative result of our baseline with the proposed scribble-supervised loss.Experimental results show that the overall score increases by more than 7.1% compared with the baseline model (first row of Table III).In addition,the other four indicators have also been greatly improved with proposed scribble-supervised loss.As shown in Fig.6,compared with the baseline model (first column of Fig.6),the segmentation quality (fourth column of Fig.6) becomes much better with using the scribble-supervised loss.The baseline model can only segment out the general shape of the objects and the model with scribble-supervised loss can segment out the boundaries and details of the objects.For example,the boundary segmentation results of the man and the tennis racket are rough by the baseline model.Our scribblesupervised loss greatly improves the segmentation results.These quantitative and qualitative comparison results demonstrate that the proposed scribble-supervised loss can indirectly compensate for missing information in weak scribble-level labels.In order to further show the segmentation effects of our scribble-supervised loss on the boundary area,we replace our scribble-supervised loss with dense CRF loss.The overall score decreased from 49.5 to 48.7.As shown in Fig.6,the boundary segmentation results of the man with dense CRF loss (fifth column of Fig.6) are worse than those with scribble-supervised loss (fourth column of Fig.6).These quantitative and qualitative comparison results demonstrate that our scribble-supervised loss can dynamically correct inaccurately segmented areas.

    Fig.6.Quality comparison of different variant modules on Youtube-VOS.The first column to the sixth column are “pCE”,“pCE” with “SA”,“pCE” with“CCA”,“pCE” with “SS”,“pCE” with “DCRF”,and our complete model,respectively.

    TABLE IVQUANTITATIVE RESULTS ON THE YOUTUBE-VOS VALIDATION SET.“OL” DENOTES ONLINE LEARNING

    We also show the quantitative (eighth row of Table III) and qualitative (sixth column of Fig.6) results of our complete model which demonstrate that combining scribble attention module and scribble-supervised loss is able to further improve the segmentation results.

    E.Comparison With State-of-the-Art

    To our best knowledge,there are no published results for weakly-supervised VOS tasks to be compared on YouTube-VOS [54] and DAVIS-2017 [55].We first compare our approach with the state-of-the-art semi-supervised methods on YouTube-VOS [54].Then,we compare our approach with the state-of-the-art unsupervised methods and semi-supervised on DAVIS-2017 [55].

    YouTube-VOS:The validation set of this large-scale dataset contains 474 sequences with 91 categories,26 of which are not included in the training set.We evaluate our method on this validation set and evaluate the results on the open evaluation server [54].Table IV shows the comparison with previous start-of-the-art semi-supervised methods on YouTube-VOS [54].“OL” denotes online learning in the inference stage.Our method achieves 51.1 % of overall scores using scribbles as supervision and without online learning.We obtain almost the same overall scores like [63].Our method reaches comparable results to [6],[20],[64].The gap between our method and [20],[8] is very small.Compared with [7],[9],[54],[58],[65],although there is a certain gap in overall scores,this result is acceptable in view of the huge gap in supervision information.Our method closes the gap to approaches trained on per-pixel annotation.

    Besides,to our best knowledge,end-to-end recurrent network for video object segmentation (RVOS) [8] is the only one that published the results for zero-shot (unsupervised)VOS tasks in YouTube-VOS.Table V shows the quantitative comparison results of our scribble-supervised method with RVOS.We can see that although RVOS uses the per-pixel annotations as supervision during the training phase,our results are still far better than those of RVOS.One of the possible reasons for this is that our method provides the scribble annotation of the first frame in the testing phase,which is able to provide a certain prior knowledge of the objects.

    DAVIS2017:The dataset comprises 60 videos for training,30 videos for validation,and 30 videos for testing.Following[9],we first train on the Youtube-VOS and DAVIS2017 training sets and then evaluate to boost the performance.We evaluate our method on both the validation and test-dev setsand evaluate the results of the test-dev set on the open evaluation server [55].Table VI shows the quantitative comparison results on the DAVIS-2017 validation set of our method to the recent start-of-the-art unsupervised and semisupervised methods.We can see that our method achieves 55.7% ofJ&FMean using scribbles as supervision.Compared with unsupervised methods,our method exceeds [8] largely,and exceeds [13] to some extent.The gap between ours and[37] is very small.Compared with semi-supervised methods with per-pixel annotations as supervision,our method exceeds [63].Our method reaches comparable results to [15],[66].The gap between our method and [7],[8] is very small.Table VII shows the quantitative comparison results on the DAVIS-2017 test-dev set.Our method achieves 41.7 % ofJ&FMean and exceeds the most unsupervised methods [8],[13],[15].The large performance gap is due to the difference of the supervised information,which is acceptable.Compared with semi-supervised methods with per-pixel annotations as supervision,our method exceeds the semi-supervised method[63].The gap between our method and [67],[66] is very small.

    TABLE VQUANTITATIVE RESULTS FOR ZERO-SHOT METHODS IN THE YOUTUBE-VOS DATASET

    TABLE VIQUANTITATIVE RESULTS ON THE DAVIS2017 VALIDATION SET

    TABLE VIIQUANTITATIVE RESULTS ON THE DAVIS2017 TEST-DEV SET

    F.Qualitative Results

    Fig.7 shows the qualitative results of our methods on YouTube-VOS [54] and DAVIS-2017 [55].These videos all contain at least one object with diverse size,shape,motion,and occlusion,etc.It can be seen that in many cases our method can produce high-quality segmentation results.For example,our method is able to successfully segment the challenging motorcycle sequence (first row) in which objects move very fast.In the bmx-trees sequence (fourth row),the objects move fast and parts of objects are occlusion by the tree in some frames.Our method can successfully segment out people and the bike from these occlusion areas.In the last row,our method fails to segment some bottom edges of the car in the start frame.However,afterward,our method is able to recover from that error.

    G.Failure Case

    Fig.7.Qualitative results showing objects on videos from the Youtube-VOS and DAVIS2017.The first second rows are from Youtube-VOS validation set;the latter two are from DAVIS2017 validation set and test-dev set,respectively.

    Fig.8.Illustration of two failure cases on DAVIS-2017.The first row is the paragliding-launch video sequence,where the paragliders ropes are not segmented out properly.The second row is the pigs video sequence,where the segmentation results of the later frames are worse.

    While our approach verifies satisfactory results on both the quantitative and the qualitative evaluations in YouTube-VOS[54] and DAVIS-2017 [55],we find a few failure cases as shown in Fig.8.In the first row,our method fails to segment out paragliders’ ropes.It is because we fail to simulate generating scribbles of very small objects and thus scribbles of paragliders ropes are not well provided in the first frame during the inference stage.We believe that a good future direction is to generate more reliable scribbles with a better simulation generation algorithm.We can also use other data processing methods to handle the problems of the scribbles,such as [68]–[71].In the second row,we found that our method may be less stable to segment these frames which are far from the first frame.Since we take the scribbles as the guidance information in the first frame,these scribbles only occupy the limited pixels information of the objects.It is mostly because the model is difficult to propagate the limited guidance information from the first frame to the too far frames on very challenging scenes.The challenging scenarios can be resolved by incorporating the spatiotemporal information encoding module [72],which will be our future direction.In addition,we can also use other advanced technologies of image segmentation and deep learning [73]–[76] to further improve the performance of the failure cases.

    V.CONCLUSIONS AND FUTURE WORK

    In this paper,we propose a scribble-supervised method for video object segmentation.The first contribution of our method is the scribble attention module,which is designed to selectively learn the relationship between the context regions and the query position.Unlike conventional self-attention modules which consider all the context information,with this selective strategy,our attention module can capture more accurate context information and learn an effective attention map to enhance the contrast between the foreground and the background.Furthermore,to resolve the missing information of scribble-level annotations,we propose a novel scribblesupervised loss that can compensate for the missing information and dynamically correct inaccurate prediction regions.The proposed method alleviates a huge amount of human labor than that of the per-pixel annotation and closes the performance gap to approaches trained on per-pixel annotation,which achieves the trade-off between annotating efficiency and model performance.Therefore,it is of great significance to further exploit weakly supervised method[77]–[80] to learn video object segmentation.In future,we plan to explore the potential of using a new simulation method to generate more reliable scribbles and incorporating the spatiotemporal information encoding module [72],[81] to better propagate guidance information throughout the video sequence.

    制服人妻中文乱码| 国产午夜精品论理片| 欧美一级a爱片免费观看看| 欧美黑人巨大hd| 亚洲欧美日韩东京热| 成人18禁在线播放| 日韩有码中文字幕| 久久久久性生活片| 91麻豆av在线| 久久人妻av系列| 少妇熟女aⅴ在线视频| 九九在线视频观看精品| 久久精品国产亚洲av香蕉五月| 波多野结衣高清作品| 日韩欧美精品v在线| 亚洲精品国产精品久久久不卡| 国产黄色小视频在线观看| 非洲黑人性xxxx精品又粗又长| 欧美另类亚洲清纯唯美| h日本视频在线播放| 精品人妻偷拍中文字幕| 久久精品国产清高在天天线| 欧美成人a在线观看| 黄色片一级片一级黄色片| 国产成人欧美在线观看| 免费av观看视频| 熟女电影av网| 中文字幕人成人乱码亚洲影| 欧美一区二区亚洲| 一进一出抽搐动态| 欧洲精品卡2卡3卡4卡5卡区| 在线观看免费视频日本深夜| 中文字幕人成人乱码亚洲影| 免费人成视频x8x8入口观看| 精品无人区乱码1区二区| 日日夜夜操网爽| 色老头精品视频在线观看| 亚洲欧美日韩无卡精品| 国产成人系列免费观看| 美女黄网站色视频| 欧美成狂野欧美在线观看| 国产精品美女特级片免费视频播放器| 日本 欧美在线| 麻豆国产97在线/欧美| 啦啦啦免费观看视频1| 91久久精品电影网| 麻豆久久精品国产亚洲av| av福利片在线观看| 99热精品在线国产| 最近最新中文字幕大全电影3| 女警被强在线播放| 欧美一区二区亚洲| 久久精品国产自在天天线| 久久精品人妻少妇| av天堂中文字幕网| 真人做人爱边吃奶动态| 亚洲男人的天堂狠狠| 日本黄色片子视频| 欧美日韩福利视频一区二区| 免费人成视频x8x8入口观看| 国产av不卡久久| 国产一区二区激情短视频| 亚洲国产精品sss在线观看| 国产精品99久久久久久久久| 天堂影院成人在线观看| 真实男女啪啪啪动态图| 欧美色欧美亚洲另类二区| 一级作爱视频免费观看| av在线天堂中文字幕| 国产高清有码在线观看视频| 国产精品,欧美在线| 村上凉子中文字幕在线| 亚洲自拍偷在线| 亚洲成人免费电影在线观看| 少妇熟女aⅴ在线视频| 久久人妻av系列| 精品久久久久久久人妻蜜臀av| 欧美日韩综合久久久久久 | 亚洲精品成人久久久久久| 五月玫瑰六月丁香| 精品不卡国产一区二区三区| 日韩有码中文字幕| 亚洲aⅴ乱码一区二区在线播放| www.www免费av| 国产成人福利小说| 婷婷精品国产亚洲av在线| 国产一区二区三区视频了| 精品一区二区三区人妻视频| 草草在线视频免费看| 在线观看免费视频日本深夜| netflix在线观看网站| 免费在线观看日本一区| 亚洲一区高清亚洲精品| 色精品久久人妻99蜜桃| 熟女人妻精品中文字幕| 在线免费观看不下载黄p国产 | ponron亚洲| 99在线人妻在线中文字幕| 男女视频在线观看网站免费| 欧美大码av| 欧美日韩瑟瑟在线播放| 亚洲专区中文字幕在线| 麻豆成人av在线观看| 国产高清有码在线观看视频| 男女下面进入的视频免费午夜| 欧美又色又爽又黄视频| 久久精品亚洲精品国产色婷小说| 真实男女啪啪啪动态图| 黄色丝袜av网址大全| 中文字幕人妻熟人妻熟丝袜美 | 天天一区二区日本电影三级| 欧美乱码精品一区二区三区| 国内精品久久久久精免费| 在线观看免费午夜福利视频| 国内精品久久久久久久电影| bbb黄色大片| 欧美丝袜亚洲另类 | 久久亚洲真实| 色视频www国产| 好看av亚洲va欧美ⅴa在| 日韩免费av在线播放| 一区福利在线观看| 亚洲成人久久爱视频| xxx96com| 国产野战对白在线观看| 黄色视频,在线免费观看| 婷婷丁香在线五月| 欧美不卡视频在线免费观看| 婷婷六月久久综合丁香| 搡老妇女老女人老熟妇| 最近最新中文字幕大全电影3| 在线a可以看的网站| 在线免费观看的www视频| 亚洲黑人精品在线| 亚洲av成人精品一区久久| 欧美区成人在线视频| 欧美黑人欧美精品刺激| 午夜免费激情av| 怎么达到女性高潮| 好男人电影高清在线观看| 深爱激情五月婷婷| 欧美色欧美亚洲另类二区| 欧美一级a爱片免费观看看| 中文亚洲av片在线观看爽| 长腿黑丝高跟| 男人的好看免费观看在线视频| 天堂影院成人在线观看| 欧美日韩综合久久久久久 | 欧美性猛交黑人性爽| 十八禁人妻一区二区| 国产亚洲精品久久久com| a在线观看视频网站| 免费看日本二区| 亚洲久久久久久中文字幕| 国产淫片久久久久久久久 | 亚洲欧美日韩无卡精品| 亚洲精品在线美女| 一区二区三区高清视频在线| www日本黄色视频网| 男女下面进入的视频免费午夜| 国产蜜桃级精品一区二区三区| 日韩欧美一区二区三区在线观看| 午夜免费观看网址| 成人性生交大片免费视频hd| 手机成人av网站| 色噜噜av男人的天堂激情| 免费大片18禁| 亚洲欧美精品综合久久99| 真人做人爱边吃奶动态| 久久久成人免费电影| 一个人观看的视频www高清免费观看| 欧美中文综合在线视频| 日日夜夜操网爽| 老熟妇仑乱视频hdxx| 一个人看的www免费观看视频| 国产精品98久久久久久宅男小说| 国产精品98久久久久久宅男小说| 人人妻,人人澡人人爽秒播| 久久人人精品亚洲av| 精品人妻1区二区| 精品一区二区三区视频在线观看免费| 在线a可以看的网站| 日韩精品中文字幕看吧| 午夜两性在线视频| 久久精品国产亚洲av涩爱 | 中文资源天堂在线| 精品乱码久久久久久99久播| 欧美在线黄色| 精华霜和精华液先用哪个| 午夜福利高清视频| 国产黄片美女视频| 成人永久免费在线观看视频| 欧美精品啪啪一区二区三区| 中文亚洲av片在线观看爽| 欧美又色又爽又黄视频| 久久久久国产精品人妻aⅴ院| 国产精品 欧美亚洲| 一进一出抽搐动态| 久久精品91无色码中文字幕| 在线观看美女被高潮喷水网站 | 国产精品久久久久久人妻精品电影| 国产一区二区在线观看日韩 | 亚洲最大成人中文| 好男人电影高清在线观看| 日本精品一区二区三区蜜桃| 香蕉丝袜av| 无人区码免费观看不卡| 最近最新中文字幕大全电影3| 中出人妻视频一区二区| 丁香六月欧美| 久久久久亚洲av毛片大全| 国产精品久久久人人做人人爽| 最近最新中文字幕大全免费视频| 97超级碰碰碰精品色视频在线观看| 人妻丰满熟妇av一区二区三区| 国产伦一二天堂av在线观看| 啦啦啦免费观看视频1| 国产私拍福利视频在线观看| 一本一本综合久久| 久久久久免费精品人妻一区二区| 丰满的人妻完整版| tocl精华| 91在线观看av| 欧美日韩黄片免| 欧美+亚洲+日韩+国产| 国产精品av视频在线免费观看| 他把我摸到了高潮在线观看| 亚洲国产欧美人成| 欧美性猛交╳xxx乱大交人| 日本a在线网址| 岛国视频午夜一区免费看| 欧美日韩乱码在线| 在线观看美女被高潮喷水网站 | 欧美大码av| 真实男女啪啪啪动态图| 男女那种视频在线观看| 三级男女做爰猛烈吃奶摸视频| 欧美日本亚洲视频在线播放| 欧美性猛交黑人性爽| 欧美丝袜亚洲另类 | 国产三级在线视频| 国产精品一区二区三区四区久久| 成人18禁在线播放| 国产一区在线观看成人免费| x7x7x7水蜜桃| 亚洲成人久久性| 国产高清有码在线观看视频| 男女那种视频在线观看| av欧美777| 在线观看午夜福利视频| 色老头精品视频在线观看| 男人舔奶头视频| 亚洲精品久久国产高清桃花| 精品久久久久久久久久免费视频| 精品国产三级普通话版| 我要搜黄色片| 成人亚洲精品av一区二区| netflix在线观看网站| 日韩国内少妇激情av| 亚洲av成人不卡在线观看播放网| 国产激情欧美一区二区| 日日摸夜夜添夜夜添小说| 老师上课跳d突然被开到最大视频 久久午夜综合久久蜜桃 | 久久久久久久久久黄片| 国产精品精品国产色婷婷| 麻豆成人午夜福利视频| 精品无人区乱码1区二区| 99久久99久久久精品蜜桃| 精品福利观看| 国产成+人综合+亚洲专区| 日本黄色视频三级网站网址| 亚洲国产高清在线一区二区三| 身体一侧抽搐| 国产av在哪里看| 午夜福利18| 久久精品国产亚洲av香蕉五月| 九色成人免费人妻av| 国模一区二区三区四区视频| 熟女人妻精品中文字幕| 国产精品久久电影中文字幕| 中文资源天堂在线| 亚洲国产高清在线一区二区三| 国产一区二区亚洲精品在线观看| 日韩欧美国产在线观看| 在线观看免费视频日本深夜| 夜夜爽天天搞| 精品国内亚洲2022精品成人| www日本黄色视频网| 一级黄片播放器| 亚洲无线观看免费| 国产精品久久久久久精品电影| 亚洲国产色片| 亚洲电影在线观看av| 久久久久久国产a免费观看| 最近最新中文字幕大全电影3| 国产国拍精品亚洲av在线观看 | 亚洲色图av天堂| 亚洲精品美女久久久久99蜜臀| 欧美黑人巨大hd| 搞女人的毛片| 中文字幕人妻丝袜一区二区| 亚洲第一电影网av| 一区二区三区高清视频在线| 成人午夜高清在线视频| 亚洲第一电影网av| 九九久久精品国产亚洲av麻豆| 国产97色在线日韩免费| 欧美日韩黄片免| 欧美黑人巨大hd| 一级毛片高清免费大全| 长腿黑丝高跟| 日韩欧美在线二视频| 成人三级黄色视频| 男人和女人高潮做爰伦理| 在线a可以看的网站| 国产黄a三级三级三级人| 级片在线观看| 男人舔奶头视频| 色精品久久人妻99蜜桃| 色综合婷婷激情| 国产三级在线视频| e午夜精品久久久久久久| 免费av不卡在线播放| 岛国在线免费视频观看| 一级毛片女人18水好多| 日韩欧美三级三区| 亚洲av电影不卡..在线观看| 97人妻精品一区二区三区麻豆| www.www免费av| 网址你懂的国产日韩在线| 女生性感内裤真人,穿戴方法视频| 99久久成人亚洲精品观看| 12—13女人毛片做爰片一| 毛片女人毛片| 婷婷亚洲欧美| 欧美黑人巨大hd| 中文字幕久久专区| 亚洲av成人av| 国产成人aa在线观看| 色老头精品视频在线观看| 亚洲av一区综合| 免费人成在线观看视频色| 国产成年人精品一区二区| e午夜精品久久久久久久| 亚洲电影在线观看av| 精品免费久久久久久久清纯| 夜夜躁狠狠躁天天躁| av国产免费在线观看| 欧美丝袜亚洲另类 | 国产成人a区在线观看| 久久人人精品亚洲av| 女警被强在线播放| 动漫黄色视频在线观看| 亚洲,欧美精品.| 又爽又黄无遮挡网站| 无限看片的www在线观看| 亚洲国产精品999在线| 天天躁日日操中文字幕| 国产综合懂色| 波野结衣二区三区在线 | 国产av在哪里看| 免费看美女性在线毛片视频| 少妇人妻精品综合一区二区 | 国产精品99久久99久久久不卡| 国产私拍福利视频在线观看| 99riav亚洲国产免费| 欧美日韩亚洲国产一区二区在线观看| 中文亚洲av片在线观看爽| 草草在线视频免费看| 最近在线观看免费完整版| 最近最新中文字幕大全电影3| 黄色日韩在线| 国产成人福利小说| 欧美乱妇无乱码| 老司机午夜福利在线观看视频| 欧美日韩瑟瑟在线播放| 美女大奶头视频| 亚洲av日韩精品久久久久久密| 真人一进一出gif抽搐免费| 国产成人啪精品午夜网站| 我的老师免费观看完整版| 免费电影在线观看免费观看| a级毛片a级免费在线| 国产成人欧美在线观看| 无限看片的www在线观看| h日本视频在线播放| 久久中文看片网| 亚洲五月天丁香| 一本精品99久久精品77| 久久精品国产综合久久久| 精品国产亚洲在线| 亚洲av电影在线进入| 亚洲av免费高清在线观看| 国产美女午夜福利| 亚洲无线观看免费| 精品人妻1区二区| 久久久久久久午夜电影| 色视频www国产| 无限看片的www在线观看| www日本在线高清视频| 亚洲熟妇熟女久久| 精品熟女少妇八av免费久了| 99久久99久久久精品蜜桃| 在线视频色国产色| 欧美zozozo另类| 一本精品99久久精品77| 一进一出抽搐gif免费好疼| 欧美黑人巨大hd| 国产高清三级在线| 在线观看66精品国产| 久久婷婷人人爽人人干人人爱| 免费搜索国产男女视频| 午夜激情欧美在线| 免费无遮挡裸体视频| 国产精品av视频在线免费观看| 久久九九热精品免费| 内地一区二区视频在线| 亚洲成人免费电影在线观看| 欧美性猛交黑人性爽| 日本撒尿小便嘘嘘汇集6| 欧美一级a爱片免费观看看| 亚洲国产精品999在线| 一个人观看的视频www高清免费观看| 欧美一级a爱片免费观看看| 91久久精品电影网| 午夜福利免费观看在线| svipshipincom国产片| 白带黄色成豆腐渣| 国产精品一及| 天天添夜夜摸| 久久伊人香网站| 色在线成人网| 97超级碰碰碰精品色视频在线观看| 欧美av亚洲av综合av国产av| 午夜福利在线观看吧| 国产老妇女一区| 757午夜福利合集在线观看| 一个人观看的视频www高清免费观看| 在线国产一区二区在线| 男女午夜视频在线观看| 中国美女看黄片| 亚洲无线在线观看| 亚洲av电影在线进入| 国产97色在线日韩免费| 亚洲av五月六月丁香网| 热99在线观看视频| 亚洲人成电影免费在线| 亚洲精品456在线播放app | 亚洲国产精品成人综合色| 性欧美人与动物交配| 高潮久久久久久久久久久不卡| 成人欧美大片| 日本熟妇午夜| 99久久久亚洲精品蜜臀av| 免费看美女性在线毛片视频| 欧美国产日韩亚洲一区| 国产精品野战在线观看| 国产精品国产高清国产av| 国产精品久久久人人做人人爽| 91在线精品国自产拍蜜月 | 欧美日韩一级在线毛片| 可以在线观看毛片的网站| 嫩草影视91久久| 一区二区三区免费毛片| 亚洲国产日韩欧美精品在线观看 | 久久6这里有精品| 亚洲久久久久久中文字幕| 日本熟妇午夜| 波多野结衣高清作品| 很黄的视频免费| 女同久久另类99精品国产91| 国产国拍精品亚洲av在线观看 | 欧美一级a爱片免费观看看| av天堂中文字幕网| 精品乱码久久久久久99久播| 欧美日韩亚洲国产一区二区在线观看| 久久香蕉精品热| 一二三四社区在线视频社区8| 日韩欧美精品免费久久 | 夜夜爽天天搞| 动漫黄色视频在线观看| 国产欧美日韩一区二区精品| 人人妻人人澡欧美一区二区| 亚洲欧美激情综合另类| 亚洲精品一区av在线观看| 色精品久久人妻99蜜桃| 日本三级黄在线观看| 99在线视频只有这里精品首页| 91九色精品人成在线观看| 国产欧美日韩一区二区三| 色尼玛亚洲综合影院| 神马国产精品三级电影在线观看| 精品熟女少妇八av免费久了| 亚洲中文字幕日韩| 国产一区二区三区在线臀色熟女| 美女 人体艺术 gogo| 欧美国产日韩亚洲一区| 国产午夜福利久久久久久| 免费人成视频x8x8入口观看| 9191精品国产免费久久| 日韩 欧美 亚洲 中文字幕| 男插女下体视频免费在线播放| 变态另类成人亚洲欧美熟女| 免费看美女性在线毛片视频| 欧美高清成人免费视频www| 九九久久精品国产亚洲av麻豆| 蜜桃久久精品国产亚洲av| 在线天堂最新版资源| 18禁国产床啪视频网站| 午夜久久久久精精品| 少妇人妻一区二区三区视频| 99热6这里只有精品| 国产精品98久久久久久宅男小说| 国内毛片毛片毛片毛片毛片| 搡老熟女国产l中国老女人| 国产精品香港三级国产av潘金莲| 一本久久中文字幕| 窝窝影院91人妻| a级毛片a级免费在线| 亚洲国产精品久久男人天堂| 日韩免费av在线播放| 欧美激情在线99| av专区在线播放| 19禁男女啪啪无遮挡网站| 又爽又黄无遮挡网站| 嫩草影视91久久| 欧美zozozo另类| 天堂av国产一区二区熟女人妻| 岛国视频午夜一区免费看| 最新在线观看一区二区三区| 此物有八面人人有两片| 青草久久国产| 亚洲成人免费电影在线观看| 亚洲av一区综合| 级片在线观看| 少妇人妻一区二区三区视频| 成人av在线播放网站| 免费一级毛片在线播放高清视频| 亚洲精品成人久久久久久| 欧美一区二区亚洲| 欧美日韩国产亚洲二区| 国产欧美日韩精品亚洲av| 国产成人av激情在线播放| 久久精品国产自在天天线| 老司机福利观看| 亚洲乱码一区二区免费版| 亚洲精品乱码久久久v下载方式 | 一进一出抽搐动态| 久久人人精品亚洲av| 天堂影院成人在线观看| 丰满人妻一区二区三区视频av | 国产黄片美女视频| a在线观看视频网站| 亚洲精品在线观看二区| 51国产日韩欧美| 在线观看一区二区三区| 欧美最黄视频在线播放免费| 久久精品影院6| 欧美不卡视频在线免费观看| 最近视频中文字幕2019在线8| 国产探花在线观看一区二区| 美女高潮喷水抽搐中文字幕| 国产精品野战在线观看| 色播亚洲综合网| 国产精品98久久久久久宅男小说| 一区二区三区免费毛片| 激情在线观看视频在线高清| 深爱激情五月婷婷| 亚洲欧美日韩东京热| 欧美性感艳星| 亚洲最大成人中文| 俄罗斯特黄特色一大片| 免费观看精品视频网站| 国产高清激情床上av| bbb黄色大片| 亚洲自拍偷在线| 成人性生交大片免费视频hd| 亚洲成av人片在线播放无| 国产精品香港三级国产av潘金莲| 亚洲精品乱码久久久v下载方式 | 操出白浆在线播放| 成人特级黄色片久久久久久久| 国产亚洲精品久久久com| 白带黄色成豆腐渣| 久久久久九九精品影院| 国产aⅴ精品一区二区三区波| 又黄又爽又免费观看的视频| 午夜亚洲福利在线播放| 在线观看舔阴道视频| 中文字幕人妻丝袜一区二区| 亚洲国产中文字幕在线视频| 亚洲 国产 在线| 欧美黄色淫秽网站| 日本免费一区二区三区高清不卡| 久久久久性生活片| 久久久久久九九精品二区国产| 亚洲人与动物交配视频| 国产免费一级a男人的天堂| 黄色片一级片一级黄色片| 夜夜看夜夜爽夜夜摸| 麻豆国产av国片精品| 亚洲人成网站在线播放欧美日韩| 好男人在线观看高清免费视频| 久久香蕉精品热| 可以在线观看的亚洲视频| 亚洲精品在线美女| 在线观看日韩欧美| 国产高清视频在线播放一区| 在线看三级毛片| 日韩免费av在线播放| 成人亚洲精品av一区二区| 美女高潮喷水抽搐中文字幕| 亚洲色图av天堂| 黄片大片在线免费观看| 国产三级中文精品|