
    Using Speaker-Specific Emotion Representations in Wav2vec 2.0-Based Modules for Speech Emotion Recognition

    Computers, Materials & Continua, 2023, Issue 10

    Somin Park, Mpabulungi Mark, Bogyung Park and Hyunki Hong*

    1 College of Software, Chung-Ang University, Seoul, 06973, Korea

    2 Department of AI, Chung-Ang University, Seoul, 06973, Korea

    ABSTRACT Speech emotion recognition is essential for frictionless human-machine interaction, where machines respond to human instructions with context-aware actions. The properties of individuals' voices vary with culture, language, gender, and personality. These variations in speaker-specific properties may hamper the performance of standard representations in downstream tasks such as speech emotion recognition (SER). This study demonstrates the significance of speaker-specific speech characteristics and how they can be leveraged to improve the performance of SER models. In the proposed approach, two wav2vec-based modules (a speaker-identification network and an emotion classification network) are trained with the Arcface loss. The speaker-identification network has a single attention block to encode an input audio waveform into a speaker-specific representation. The emotion classification network uses a wav2vec 2.0 backbone as well as four attention blocks to encode the same input audio waveform into an emotion representation. These two representations are then fused into a single vector representation containing emotion and speaker-specific information. Experimental results showed that the use of speaker-specific characteristics improves SER performance. Additionally, combining these with an angular margin loss such as the Arcface loss improves intra-class compactness while increasing inter-class separability, as demonstrated by plots of t-distributed stochastic neighbor embeddings (t-SNE). The proposed approach outperforms previous methods using similar training strategies, with a weighted accuracy (WA) of 72.14% and an unweighted accuracy (UA) of 72.97% on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. This demonstrates its effectiveness and potential to enhance human-machine interaction through more accurate emotion recognition in speech.

    KEYWORDS Attention block; IEMOCAP dataset; speaker-specific representation; speech emotion recognition; wav2vec 2.0

    1 Introduction

    The recent rapid growth of computer technology has made human-computer interaction an integral part of the human experience. Advances in automatic speech recognition (ASR) [1] and text-to-speech (TTS) synthesis [2] have made smart devices capable of searching and responding to verbal requests. However, this only supports limited interactions and is not sufficient for interactive conversations. Most ASR methods generally focus on the content of speech (words) without regard for the intonation, nuance, and emotion conveyed through audio speech. Speech emotion recognition (SER) is one of the most active research areas in the computer science field because the friction in every human-computer interaction could be significantly reduced if machines could perceive and understand the emotions of their users and perform context-aware actions.

    Previous studies used low-level descriptors (LLDs) generated from frequency, amplitude, and spectral properties (spectrogram, Mel-spectrogram, etc.) to recognize emotions in audio speech. Although the potential of hand-crafted features has been demonstrated in previous works, features and their representations should be tailored and optimized for specific tasks. Deep learning-based representations generated from actual waveforms or LLDs have shown better performance in SER.

    Studies in psychology have shown that individuals have different vocal attributes depending on their culture, language, gender, and personality [3]. This implies that two speakers saying the same thing with the same emotion are likely to express different acoustic properties in their voices. The merits of considering speaker-specific properties in audio speech-related tasks have been demonstrated in several studies [4,5].

    In this paper, a novel approach in which a speaker-specific emotion representation is leveraged to improve speech emotion recognition performance is introduced. The proposed model consists of a speaker-identification network and an emotion classifier. The wav2vec 2.0 [6] (base model) is used as a backbone for both of the proposed networks, where it is used to extract emotion-related and speaker-specific features from input audio waveforms. A novel tensor fusion approach is used to combine these representations into a speaker-specific emotion representation. In this tensor fusion operation, the representation vectors are element-wise multiplied by a trainable fusion matrix, and the resultant vectors are summed. The main contributions of this paper are summarized as follows:

    • Two wav2vec 2.0-based modules (a speaker-identification network and an emotion classification network) that generate a speaker-specific emotion representation from an input audio segment are proposed. The two modules are trained and evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset [7]. Training networks on the IEMOCAP dataset is prone to over-fitting because it has only ten speakers. The representations generated by the speaker-identification network pre-trained on the VoxCeleb1 dataset [8] facilitate better generalization to unseen speakers.

    • A novel tensor fusion approach is used to combine the generated emotion and speaker-specific representations into a single vector representation suitable for SER. The use of the Arcface [9] and cross-entropy loss terms in the speaker-identification network was also explored, and detailed evaluations have been provided.

    2 Related Work

    2.1 Hand-Crafted Audio Representations

    A vast array of representations and models have been explored to improve audio speech-based emotion recognition. LLDs such as pitch and energy contours have been employed in conjunction with hidden Markov models [10] to recognize a speaker's emotion from audio speech. Reference [11] used the delta and delta-delta of a log Mel-spectrogram to reduce the impact of emotionally irrelevant factors on speech emotion recognition. In this approach, an attention layer automatically drove focus to emotionally relevant frames and generated discriminative utterance-level features. Global-Aware Multi-Scale (GLAM) [12] used Mel-frequency cepstral coefficient (MFCC) inputs and a global-aware fusion module to learn a multi-scale feature representation, which is rich in emotional information.

    Time-frequency representations such as the Mel-spectrogram and MFCCs merge frequency and time domains into a single representation using the Fast Fourier Transform (FFT). Reference [13] addressed the challenges associated with the tradeoff between accuracy in the frequency and time domains by employing a wavelet transform-based representation. Here, Morlet wavelets generated from an input audio sample are decomposed into child wavelets by applying a continuous wavelet transform (CWT) to the input signal with varying scale and translation parameters. These CWT features are considered a representation that can be employed in downstream tasks.

    2.2 Learning Audio Representation Using Supervised Learning

    In more recent approaches, models learn a representation directly from raw waveforms instead of hand-crafted representations like the human-perception-emulating Mel-filter banks used to generate the Mel-spectrogram. Time-Domain (TD) filter banks [14] use complex convolutional weights initialized with Gabor wavelets to learn filter banks from raw speech for end-to-end phone recognition. The proposed architecture has a convolutional layer followed by an l2 feature pooling-based modulus operation and a low-pass filter. It can be used as a learnable replacement for Mel-filter banks in existing deep learning models. In order to approximate the Mel-filter banks, the square of the Hanning window was used, and the biases of the convolutional layers were set to zero. Due to the absence of positivity constraints, a 1 was added to the output before applying log compression. A key limitation of this approach is that the log-scale compression and normalization that were used reduce the scale of spectrograms, regardless of their contents.

    Wang et al. [15] also proposed a learned drop-in alternative to the Mel-filter banks but replaced static log compression with dynamic compression and addressed the channel distortion problems in the Mel-spectrogram log transformation using Per-Channel Energy Normalization (PCEN). This was calculated using a smoothed version of the filter bank energy function, which was computed from a first-order infinite impulse response (IIR) filter. A smoothing coefficient was used in combining the smoothed version of the filter bank energy function and the current spectrogram energy function. In order to address the compression function's fixed non-linearity, PCEN was modified to learn channel-dependent smoothing coefficients alongside the other hyper-parameters [16] in a version of the model referred to as sPer-Channel Energy Normalization (sPCEN).

    2.3 Learned Audio Representation Using Self-Supervised Learning

    In supervised learning, class labels are used to design convolution filters and generate task-specific representations. Due to the vast amounts of unlabeled audio data available, self-supervised learning (SSL) methods have been proposed for obtaining generalized representations of input audio waveforms for downstream tasks. These audio SSL methods can be categorized into auto-encoding, siamese, clustering, and contrastive techniques [17].

    Audio2vec [18] was inspired by word2vec [19] and learned general-purpose audio representations using an auto-encoder-like architecture to reconstruct a Mel-spectrogram slice from past and future slices. Continuous Bag of Words (CBoW) and skip-gram variants were also implemented and evaluated. In the Mockingjay [20] network, bidirectional Transformer encoders trained to predict the current frame from past and future contexts were used to generate general-purpose audio representations. Bootstrap your own latent for audio (BYOL-A) [21] is a Siamese model-based architecture that assumes no relationships exist between time segments of audio samples. In this architecture, two neural networks were trained by maximizing the agreement in their outputs given the same input. Normalization and augmentation techniques were also used to differentiate between augmented versions of the same audio segment, thereby learning a general-purpose audio representation. Hidden unit bidirectional encoder representations from Transformers (HuBERT) [22] addressed the challenges associated with multiple sound units in an utterance, the absence of a lexicon of input sounds, and the variable length of sound units by using an offline clustering step to provide aligned target labels for a prediction loss similar to that in BERT [23]. This prediction loss was only applied over masked regions, forcing the model to learn a combined acoustic and language model over continuous inputs. The model was based on the wav2vec 2.0 architecture, which consists of a convolutional waveform encoder, projection layer, and code embedding layer, but has no quantization layer. The HuBERT and wav2vec 2.0 models have similar architectures but differ in the self-supervised training techniques that they employ. More specifically, wav2vec 2.0 masks a speech sequence in the latent space and solves a contrastive task defined over a quantization of the latent representation. On the other hand, the HuBERT model learns combined acoustic and language properties over continuous input by using an offline clustering step to provide aligned target labels for a BERT-like prediction loss applied over only the masked regions. Pseudo labels for the encoded vectors were generated by applying K-means clustering to the MFCCs of the input waveforms.

    Contrastive methods generate an output representation using a loss function that encourages the separation of positive from negative samples. For instance, Contrastive Learning of Auditory Representations (CLAR) [24] encoded both the waveform and spectrogram into audio representations. Here, the encoded representations of the positive and negative pairs are used contrastively.

    2.4 Using Speaker Attributes in SER

    The Individual Standardization Network (ISNet) [4] showed that considering speaker-specific attributes can improve emotion classification accuracy. Reference [4] used an aggregation of individuals' neutral speech to standardize emotional speech and improve the robustness of individual-agnostic emotion representations. A key limitation of this approach is that it only applies to cases where labeled neutral training data for each speaker is available. The Self-Speaker Attentive Convolutional Recurrent Neural Net (SSA-CRNN) [5] uses two classifiers that interact through a self-attention mechanism to focus on emotional information and ignore speaker-specific information. This approach is limited by its inability to generalize to unseen speakers.

    2.5 Wav2vec 2.0

    Wav2vec 2.0 converted an input speech waveform into spectrogram-like features by predicting the masked quantization representation over an entire speech sequence [6]. The first wav2vec [25] architecture attempted to predict future samples from a given signal context. It consists of an encoder network that embeds the audio signal into a latent space and a context network that combines multiple time steps of the encoder to obtain contextualized representations. VQ-wav2vec [26], a vector quantized (VQ) version of the wav2vec model, learned discrete representations of audio segments using a future time step prediction task in line with previous methods but replaced the original representation with a Gumbel-Softmax-based quantization module. Wav2vec 2.0 adopted both the contrastive and diversity losses in the VQ-wav2vec framework. In other words, wav2vec 2.0 compares positive and negative samples without predicting future samples.

    Wav2vec 2.0 comprises a feature encoder, contextual encoder, and quantization module. First, the feature encoder converts the normalized waveform into a two-dimensional (2-d) latent representation. The feature encoder is implemented using seven one-dimensional (1-d) convolution layers with different kernel sizes and strides. A Hanning window of the same size as the kernel and a short-time Fourier transform (STFT) with a hop length equal to the stride were used. The encoding that the convolutional layers generate from an input waveform is normalized and passed as input to two separate branches (the contextual encoder and the quantization module). The contextual encoder consists of a linear projection layer, a relative positional encoding 1-d convolution layer followed by a Gaussian error linear unit (GeLU), and a Transformer model. More specifically, each input is projected to a higher dimensional feature space and then encoded based on its relative position in the speech sequence. Here, the projected input and its relative positional encoding are summed and normalized. The resultant speech features are randomly masked and fed into the Transformer, which aggregates the local features into a context representation (C). The quantization module discretizes the feature encoder's output into a finite set of speech representations. This is achieved by choosing V quantized representations (codebook entries) from multiple codebooks using a Gumbel softmax operation, concatenating them, and applying a linear transformation to the final output. A diversity loss encourages the model to use codebook entries equally often.
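    The paper obtains these contextual features through the Fairseq release of the base model [42]. Purely to illustrate the shapes involved, the sketch below loads the same pre-trained base weights through torchaudio's pipeline wrapper; this tooling choice and the file name are assumptions for illustration, not the authors' code.

```python
# Minimal sketch: extracting wav2vec 2.0 base contextual features with torchaudio.
# The authors use the Fairseq release of the same weights; torchaudio is assumed
# here purely to illustrate the output shapes.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE        # pre-trained base model, no ASR head
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("utterance.wav")    # hypothetical 16 kHz mono file
if sr != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    features, _ = model(waveform)                  # shape: (batch, T frames, 768)

# Transposed per utterance, this is the R^{768xT} latent representation used in Section 3.
print(features.shape)
```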

    The contextual representation c_t of a masked time step t is compared with the quantized latent representation q_t at the same time step. The contrastive loss makes c_t similar to q_t and dissimilar to K sampled quantized representations (distractors) q̃ ∈ Q_t in every masked time step. The contrastive task's loss term is defined as

    $$\mathcal{L}_m = -\log \frac{\exp\left(sim\left(c_t, q_t\right)/\kappa\right)}{\sum_{\tilde{q} \in Q_t} \exp\left(sim\left(c_t, \tilde{q}\right)/\kappa\right)} \tag{1}$$

    where κ is the temperature of the contrastive loss and sim(·,·) denotes cosine similarity. The diversity loss and the contrastive loss are balanced using a hyper-parameter. A more detailed description is available in the wav2vec 2.0 paper [6].
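    To make the masked contrastive objective concrete, the sketch below computes the term in Eq. (1) for a single masked time step, assuming the context vector, its quantized target, and K sampled distractors are already given; it mirrors the formula rather than the Fairseq implementation.

```python
# Minimal sketch of the wav2vec 2.0 contrastive term for one masked time step.
# c_t: context vector, q_t: quantized target, distractors: K sampled negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(c_t, q_t, distractors, kappa=0.1):
    """c_t, q_t: (d,) tensors; distractors: (K, d); kappa is the temperature."""
    candidates = torch.cat([q_t.unsqueeze(0), distractors], dim=0)     # (K+1, d)
    sims = F.cosine_similarity(c_t.unsqueeze(0), candidates, dim=-1)   # (K+1,)
    logits = sims / kappa
    # The positive (q_t) sits at index 0, so the loss is a cross-entropy over K+1 candidates.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

d, K = 256, 100
loss = contrastive_loss(torch.randn(d), torch.randn(d), torch.randn(K, d))
```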

    Several variations of the wav2vec 2.0 model have been proposed in recent studies [27–29]. The wav2vec 2.0-robust model [27] was trained in more general setups where the domain of the unlabeled pre-training data differs from that of the labeled fine-tuning data. This study demonstrated that pre-training on various domains improves the performance of fine-tuned models on downstream tasks. In order to make speech technology accessible for other languages, several studies pre-trained the wav2vec 2.0 model on a wide range of tasks, domains, data regimes, and languages to achieve cross-lingual representations [28,29]. More specifically, in the wav2vec 2.0-xlsr and wav2vec 2.0-xls-r variations of the wav2vec 2.0 model, such as wav2vec 2.0-large-xlsr-53, wav2vec 2.0-large-xlsr-53-extended, wav2vec 2.0-xls-r-300m, and wav2vec 2.0-xls-r-1b, "xlsr" indicates that a single wav2vec 2.0 model was pre-trained to generate cross-lingual speech representations for multiple languages. Here, the "xlsr-53" model is large and was pre-trained on datasets containing 53 languages. Unlike the "xlsr" variations, the "xls-r" model variations are large-scale and were pre-trained on several large datasets with up to 128 languages. Here, "300m" and "1b" refer to the number of model parameters used. The difference between the "300m" and "1b" variations is mainly in the number of Transformer model parameters.

    The wav2vec 2.0 representation has been employed in various SER studies because of its outstanding ability to create generalized representations that can be used to improve acoustic model training. SUPERB [30] evaluated how well pre-trained audio SSL approaches performed on ten speech tasks. Pre-trained SSL networks with high performance can be frozen and employed on downstream tasks. SUPERB's wav2vec 2.0 models are variations of the wav2vec 2.0 with the original weights frozen and an extra fully connected layer added. For the SER task, the IEMOCAP dataset was used. Since the outputs of SSL networks effectively represent the frequency features in the speech sequence, the length of the representations varies with the length of the utterances. In order to obtain a fixed-size representation for utterances, average time pooling is performed before the fully connected layer. In [31], the feasibility of partly or entirely fine-tuning these weights was examined. Reference [32] proposed a transfer learning approach in which the outputs of several layers of the pre-trained wav2vec 2.0 model were combined using trainable weights that were learned jointly with a downstream model. In order to improve SER performance, reference [33] employed various fine-tuning strategies on the wav2vec 2.0 model, including task adaptive pre-training (TAPT) and pseudo-label task adaptive pre-training (P-TAPT). TAPT addressed the mismatch between the pre-training and target domains by continuing to pre-train on the target dataset. P-TAPT achieves better performance than the TAPT approach by altering the training objective to predicting the cluster assignment of emotion-specific features in masked frames. The emotion-specific features act as pseudo labels and are generated by applying k-means clustering to representations generated using the wav2vec model.

    2.6 Additive Angular Margin Loss

    Despite their popularity, earlier losses like the cross-entropy did not encourage intra-class compactness and inter-class separability [34] for classification tasks. In order to address this limitation, the contrastive, triplet [35], center [36], and Sphereface [37] losses encouraged separability between learned representations. The Additive Angular Margin Loss (Arcface) [9] and Cosface [38] achieved better separability by encouraging stronger boundaries between representations. In Arcface, the representations are distributed around feature centers on a hypersphere with a fixed radius. An additive angular penalty is employed to simultaneously enhance the intra-class compactness and inter-class discrepancy. Here, the angular differences between an input feature vector (x ∈ R^d) and the center representation vectors of the classes (W ∈ R^{N×d}) are calculated. A margin is added to the angular difference between features in the same class to make learned features separable with a larger angular distance. Reference [39] used the Arcface loss to train a bimodal audio-text network for SER and reported improved performance. A similar loss term is used in the proposed method.

    Eq. (2) is the equivalent of calculating the softmax with a bias of 0:

    $$L_{softmax} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i}}{\sum_{j=1}^{n}e^{W_{j}^{T}x_i}} \tag{2}$$

    After applying a logit transformation, Eq. (2) can be rewritten as Eq. (3):

    $$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\left\|W_{y_i}\right\|\left\|x_i\right\|\cos\theta_{y_i}}}{\sum_{j=1}^{n}e^{\left\|W_j\right\|\left\|x_i\right\|\cos\theta_j}} \tag{3}$$

    where ‖·‖ is the l2 normalization and θ_j is the angle between W_j and x_i. In Eq. (4), the additive margin penalty (m) is only added to the angle (θ_{y_i}) between the target weight (W_{y_i}) and the features (x_i). The features are re-scaled using the scaling factor (s). The final loss is defined as:

    $$L_{Arcface} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\left(\theta_{y_i}+m\right)}}{e^{s\cos\left(\theta_{y_i}+m\right)}+\sum_{j\neq y_i}e^{s\cos\theta_j}} \tag{4}$$

    Reference [39] demonstrated the Arcface loss term's ability to improve the performance of SER models. It is therefore employed in training the modules proposed in this study.
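    For reference, a minimal additive-angular-margin loss along the lines of Eq. (4) might be sketched as below. The scale s = 30 and margin m = 0.3 follow the values reported in Section 4.2; the class-center matrix W is a trainable parameter analogous to the Arcface center representation vectors described later, but the exact implementation is not the authors'.

```python
# Minimal sketch of an additive angular margin (ArcFace-style) loss in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceLoss(nn.Module):
    def __init__(self, feat_dim, num_classes, s=30.0, m=0.3):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_classes, feat_dim))  # class center vectors
        self.s, self.m = s, m

    def forward(self, x, labels):
        # Cosine of the angle between l2-normalized features and class centers.
        cos = F.linear(F.normalize(x), F.normalize(self.W)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        # Add the margin m only to the angle of each sample's target class.
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos) * self.s
        return F.cross_entropy(logits, labels)

criterion = ArcFaceLoss(feat_dim=768, num_classes=4)   # e.g. the four emotion classes
loss = criterion(torch.randn(8, 768), torch.randint(0, 4, (8,)))
```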

    3 Methodology

    In order to leverage speaker-specific speech characteristics to improve the performance of SER models, two wav2vec 2.0-based modules (a speaker-identification network and an emotion classification network) trained with the Arcface loss are proposed. The speaker-identification network extends the wav2vec 2.0 model with a single attention block, and it encodes an input audio waveform into a speaker-specific representation. The emotion classification network uses a wav2vec 2.0 backbone as well as four attention blocks to encode the same input audio waveform into an emotion representation. These two representations are then fused into a single vector representation that contains both emotion and speaker-specific information.

    3.1 Speaker-Identification and Emotion Classification Networks

    The speaker-identification network (Fig. 1) encodes the vocal properties of a speaker into a fixed-dimension vector (d). The wav2vec 2.0 model encodes input utterances into a latent 2-d representation of shape R^{768×T}, where T is the number of frames generated from the input waveform. This latent representation is passed to a single attention block prior to performing a max-pooling operation that results in a 1-d vector of length 768. Only a single attention block is used in the speaker-identification network because it is assumed that the core properties of a speaker's voice are unaffected by his or her emotional state. In other words, a speaker can be identified by his/her voice regardless of his/her emotional state. In order to achieve a more robust distinction between speakers, the speaker-identification representation (H_id ∈ R^d) and the Arcface center representation vectors for the speaker classes (W_id ∈ R^{#ID×d}) are l2 normalized, and their cosine similarity is computed. Configurations of the speaker-identification network using the cross-entropy loss were also explored. In experiments using the cross-entropy loss, the Arcface center representation vectors for the speaker classes were replaced with a fully connected (FC) layer. The FC outputs were then fed into a softmax function, and the probability of each speaker class was obtained. In Fig. 1, "#ID" represents the index of each speaker class. For example, in the VoxCeleb1 dataset with 1,251 speakers, the final #ID is #1,251.
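    A minimal sketch of this attention-and-pooling head (assuming the wav2vec 2.0 frame features are already computed) is given below. It uses PyTorch's multi-head self-attention with four heads and dropout 0.1, matching the values in Section 4.2, but the exact design of the authors' attention block may differ.

```python
# Minimal sketch of the speaker-identification head: one self-attention block over the
# wav2vec 2.0 frames followed by max-pooling across time, giving a 768-d speaker vector.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerHead(nn.Module):
    def __init__(self, d=768, num_speakers=1251, heads=4, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, dropout=dropout, batch_first=True)
        self.centers = nn.Parameter(torch.randn(num_speakers, d))  # Arcface centers W_id

    def forward(self, feats):                  # feats: (batch, T, 768) wav2vec 2.0 output
        h, _ = self.attn(feats, feats, feats)  # single attention block
        h_id = h.max(dim=1).values             # max-pool over the time axis -> (batch, 768)
        # Cosine similarity between the l2-normalized representation and speaker centers.
        return F.linear(F.normalize(h_id), F.normalize(self.centers)), h_id

logits, speaker_vec = SpeakerHead()(torch.randn(2, 120, 768))
```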

    Figure 1: Architecture of the speaker-identification network with the extended wav2vec 2.0 model (left) and l2 normalization, cosine similarity, and cross-entropy loss computation (right), and a single output for each speaker class

    In the emotion classification network (Fig. 2), the wav2vec 2.0 model encodes input utterances into an R^{768×T} representation. The generated encoding is passed to a ReLU activation layer before being fed into an FC layer and eventually passed to four attention blocks. The four attention blocks identify the parts of the generated emotion representation that are most relevant to SER. Experiments were also conducted for configurations with one, two, as well as three attention blocks. Max-pooling is applied across the time axis to the outputs of each attention block. The max-pooled outputs of the attention blocks h_i are concatenated before the tensor fusion operation. During tensor fusion, an element-wise multiplication between H_emo = {h_1, h_2, ..., h_k} and a trainable fusion matrix (W_fusion ∈ R^{k×d}) is performed. As shown in Eq. (5), all the k vectors are summed to generate the final embedding:

    $$E = \sum_{i=1}^{k} e_i = \sum_{i=1}^{k} h_i \odot W_{fusion,i} \tag{5}$$

    Figure 2: Architecture of the emotion classification network. Extended wav2vec 2.0 model (left) with four attention blocks and a tensor fusion operation. l2 normalization, cosine similarity, and cross-entropy loss computation (right) for emotion classes, with a single output for each emotion class

    where e_i ∈ R^d and W_fusion,i ∈ R^d. The final embedding (E) is l2 normalized prior to computing the cosine similarity with the Arcface center representation vectors (W_emo ∈ R^{#EMO×d}). In Fig. 2, "#EMO" represents the emotion class indices defined in the IEMOCAP dataset. Here, 1_EMO, 2_EMO, 3_EMO, and 4_EMO represent the angry, happy, sad, and neutral emotion classes, respectively.
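    The tensor fusion of Eq. (5) reduces to an element-wise product with a learned matrix followed by a sum over the k pooled attention outputs; a minimal sketch, with k and d as placeholders, is shown below.

```python
# Minimal sketch of the tensor fusion in Eq. (5): element-wise multiply the stacked
# attention-block outputs H (k x d) by a trainable fusion matrix and sum over k.
import torch
import torch.nn as nn

class TensorFusion(nn.Module):
    def __init__(self, k, d):
        super().__init__()
        self.W_fusion = nn.Parameter(torch.randn(k, d))  # one d-dim weight vector per block

    def forward(self, H):        # H: (batch, k, d) concatenated pooled outputs h_1..h_k
        return (H * self.W_fusion).sum(dim=1)            # E: (batch, d)

fusion = TensorFusion(k=4, d=768)    # k = 5 when the speaker vector is concatenated (Section 3.2)
E = fusion(torch.randn(2, 4, 768))
```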

    3.2 Speaker-Specific Emotion Representation Network

    Fig. 3 shows the architecture of the proposed SER approach. The same waveform is passed to the speaker-identification network as well as the emotion classification network. The speaker representation generated by the pre-trained speaker-identification network is passed to the emotion classification network. More specifically, the output vector of the attention block from the speaker-identification network is concatenated to the outputs of the emotion classification network's four attention blocks, resulting in a total of five attention block outputs (H ∈ R^{5×d}). The fusion operation shown in Eq. (5) combines these representations into a single speaker-specific emotion representation (E). The angular distance between the normalized tensor-fused output vector and the normalized centers of the four emotion representation vectors is calculated using Eq. (4). The emotion class predicted for any input waveform is determined by how close its representation vector is to an emotion class's center vector.

    Figure 3: Architecture of the speaker-specific emotion representation model with the speaker-identification network (top), which generates a speaker representation, and the emotion classification network (bottom), which generates a speaker-specific emotion representation from the emotion and speaker-identification representations

    4 Experiment Details

    4.1 Dataset

    The IEMOCAP [7] is a multimodal, multi-speaker emotion database recorded across five sessions, with five pairs of male and female speakers performing improvisations and scripted scenarios. It comprises approximately 12 h of audio-visual data, including facial images, speech, and text transcripts. The audio speech data provided is used to train and evaluate models for emotion recognition. Categorical (angry, happy, sad, and neutral) as well as dimensional (valence, activation, and dominance) labels are provided. Due to imbalances in the number of samples available for each label category, only the neutral, happy (combined with excited), sad, and angry classes have been used, in line with previous studies [4,30–33,39,40]. The 16 kHz audio sampling rate used in the original dataset is retained. The average length of the audio files is 4.56 s, with a standard deviation of 3.06 s. The minimum and maximum lengths of the audio files are 0.58 and 34.14 s, respectively. Audio files longer than 15 s are truncated to 15 s because almost all of the audio samples in the dataset are less than 15 s long. For audio files shorter than 3 s, a copy of the original waveform is recursively appended to the end of the audio file until the audio file is at least 3 s long. Fig. 4 shows how often various emotions are expressed by male and female speakers over the five sessions of the IEMOCAP dataset. As shown in Fig. 4, the dataset is unevenly distributed across emotion classes, with significantly more neutral and happy samples in most sessions.
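    A minimal sketch of this length normalization (truncate to 15 s, tile short clips until they reach at least 3 s, at 16 kHz) is given below; it only illustrates the rule described above.

```python
# Minimal sketch of the length handling for the audio clips: clips longer than 15 s are
# truncated, and clips shorter than 3 s have a copy of the original appended until they
# reach at least 3 s.
import torch

SAMPLE_RATE = 16_000
MAX_LEN = 15 * SAMPLE_RATE
MIN_LEN = 3 * SAMPLE_RATE

def normalize_length(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: 1-d tensor of samples at 16 kHz."""
    if waveform.numel() > MAX_LEN:
        return waveform[:MAX_LEN]
    original = waveform
    while waveform.numel() < MIN_LEN:
        waveform = torch.cat([waveform, original])   # append a copy of the original clip
    return waveform

clip = normalize_length(torch.randn(int(0.9 * SAMPLE_RATE)))   # a 0.9 s clip becomes >= 3 s
```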

    Figure 4: Distribution of male and female speakers across emotion classes in the IEMOCAP dataset

    In order to generate an evenly distributed random set of samples at each epoch, emotion classes with more samples are under-sampled. This implies that the samples of the training dataset are evenly distributed across all the emotion classes. Leave-one-session-out five-fold cross-validation is used.
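    One way to realize this per-epoch balancing, assumed here since the paper does not spell out the sampling routine, is to randomly under-sample every class to the size of the smallest one at the start of each epoch:

```python
# Minimal sketch of per-epoch under-sampling: every emotion class is randomly cut down
# to the size of the smallest class so each epoch sees a balanced training set.
import random
from collections import defaultdict

def balanced_indices(labels):
    """labels: list of emotion labels; returns a shuffled, class-balanced index list."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    n = min(len(v) for v in by_class.values())
    chosen = [i for idxs in by_class.values() for i in random.sample(idxs, n)]
    random.shuffle(chosen)
    return chosen

epoch_indices = balanced_indices(["angry", "happy", "happy", "sad", "neutral", "happy"])
```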

    In this study, VoxCeleb1's [8] large variation and diversity allow the speaker-identification module to be trained for better generalization to unseen speakers. VoxCeleb1 is an audio-visual dataset comprising 22,496 short interview clips extracted from YouTube videos. It features 1,251 speakers from diverse backgrounds and is commonly used for speaker identification and verification tasks. Its audio files have a sampling rate of 16 kHz, an average length of 8.2 s, and minimum and maximum lengths of 4 and 145 s, respectively. Additionally, audio clips in VoxCeleb1 are also limited to a maximum length of 15 s for consistency in the experiments.

    4.2 Implementation Details

    In recent studies [31,32], pre-training the wav2vec model on the Librispeech dataset [41] (with no fine-tuning for ASR tasks) has been shown to deliver better performance for SER tasks. In this study, the wav2vec 2.0 base model was selected because the wav2vec 2.0 large model does not offer any significant improvement in performance despite an increase in computational cost [31,32]. The key difference between "wav2vec2-large" and the base model is that the large model has an additional 12 Transformer layers intended to improve its generalization capacity. Using other versions of the wav2vec 2.0 model or weights may improve performance depending on the target dataset and the pre-training strategy [27–29,33]. This study proposes two networks based on the wav2vec 2.0 representation (Sub-section 2.5). In addition, reference [31] showed that either partially or entirely fine-tuning the wav2vec 2.0 segments results in the same boost in model performance on SER tasks despite the differences in computational costs. Therefore, the wav2vec 2.0 modules (the contextual encoder) used in this study were only partially fine-tuned. The model and weights are provided by Facebook research under the Fairseq sequence modeling toolkit [42].

    A two-step training process ensures that the proposed network learns the appropriate attributes. First, the speaker-identification network and the emotion network are trained separately. Then, the pre-trained networks are integrated and fine-tuned, with the tensor fusion matrix extended to match the size of the concatenated speaker-identification and emotion representations. In order to prevent over-fitting and exploding gradients, gradient values are clipped at 100 with n-step gradient accumulation. A weight decay of 10^-8 is applied, and the Adam [43] optimizer with beta values set to (0.9, 0.98) is used. The LambdaLR scheduler reduces the learning rate by multiplying it by 0.98 after every epoch. An early stopping criterion is added to prevent over-fitting. Each attention block consists of four attention heads with a dropout rate of 0.1. In the Arcface loss calculation, the feature re-scaling factor (s) is set to 30 and the additive margin penalty (m) to 0.3 for the experiments. Experiments were conducted using PyTorch in an Ubuntu 20.04 training environment running on a single GeForce RTX 3090 GPU. The specific hyper-parameters used in the experiments are shown in Table 1.
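    The optimizer and scheduler settings listed above can be reproduced roughly as follows; the learning rate, the number of accumulation steps, and the stand-in model and loss are placeholders, since the exact values and modules appear in Table 1 and Section 3.

```python
# Minimal sketch of the training setup: Adam with betas (0.9, 0.98) and 1e-8 weight decay,
# a LambdaLR scheduler that multiplies the learning rate by 0.98 each epoch, and gradient
# clipping at 100 with n-step gradient accumulation. lr and accum_steps are placeholders.
import torch

model = torch.nn.Linear(768, 4)                       # stand-in for the proposed network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.98), weight_decay=1e-8)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer,
                                              lr_lambda=lambda epoch: 0.98 ** epoch)
accum_steps = 4                                       # "n-step" accumulation; n is assumed

for epoch in range(10):
    dummy_batches = [(torch.randn(8, 768), torch.randint(0, 4, (8,)))] * 20
    for step, (x, y) in enumerate(dummy_batches):
        # cross-entropy used as a stand-in; the paper trains with the Arcface loss
        loss = torch.nn.functional.cross_entropy(model(x), y) / accum_steps
        loss.backward()
        if (step + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=100)
            optimizer.step()
            optimizer.zero_grad()
    scheduler.step()          # learning rate decays by a factor of 0.98 per epoch
```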

    Table 1: Hyper-parameters used during model evaluation

    4.3 Evaluation Metrics

    In this paper, weighted and unweighted accuracy metrics were used to evaluate the performance of the proposed model. Weighted accuracy (WA) is an evaluation index that intuitively represents model prediction performance as the ratio of correct predictions to the overall number of predictions. WA can be computed from a confusion matrix containing prediction scores as

    $$WA = \frac{TP + TN}{TP + TN + FP + FN}$$

    where the numbers of true positive, true negative, false positive, and false negative cases are TP, TN, FP, and FN, respectively. In order to mitigate the biases associated with the weighted accuracy in imbalanced datasets such as the IEMOCAP dataset, unweighted accuracy (UA), also called average recall, is widely employed and can be computed using

    $$UA = \frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{TP_c + FN_c}$$

    where C is the total number of emotion classes and is set to four for all the results presented in this study.
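    Both metrics can be computed directly from a confusion matrix, as sketched below with an arbitrary example matrix.

```python
# Minimal sketch of weighted accuracy (overall accuracy) and unweighted accuracy
# (average per-class recall) computed from a confusion matrix with true classes as rows.
import numpy as np

def weighted_accuracy(cm: np.ndarray) -> float:
    return np.trace(cm) / cm.sum()

def unweighted_accuracy(cm: np.ndarray) -> float:
    recalls = np.diag(cm) / cm.sum(axis=1)   # per-class recall: TP_c / (TP_c + FN_c)
    return recalls.mean()

cm = np.array([[50, 5, 3, 2],    # angry   (illustrative counts, not the paper's results)
               [4, 40, 6, 10],   # happy
               [2, 3, 55, 5],    # sad
               [6, 12, 4, 48]])  # neutral
print(weighted_accuracy(cm), unweighted_accuracy(cm))
```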

    5 Experimental Results

    5.1 Performance of Speaker-Identification Network and Emotion Classification Network

    Table 2 shows the performance of the speaker-identification network on the VoxCeleb1 identification test dataset. Training the speaker-identification network using the Arcface loss resulted in significantly better speaker classification than training with the cross-entropy loss. This indicates that the angular margin in the Arcface loss improves the network's discriminative abilities for speaker identification. Fig. 5 shows a t-distributed stochastic neighbor embedding (t-SNE) plot of speaker-specific representations generated from the IEMOCAP dataset using two configurations of the speaker-identification network. As shown in Fig. 5, training with the Arcface loss results in more distinct separations between speaker representations than training with the cross-entropy loss. As shown in Fig. 6, the speaker-identification network may be unable to generate accurate representations for audio samples that are too short. Representations of audio clips that are less than 3 s long are particularly likely to be misclassified. In order to ensure that input audio waveforms have the information necessary to generate a speaker-specific emotion representation, a 3-s minimum length requirement is imposed. In cases where the audio waveform is shorter than 3 s, a copy of the original waveform is recursively appended to the end of the waveform until it is at least 3 s long.

    Table 2: Overall performance of the proposed method when either the cross-entropy or Arcface loss was used in the speaker-identification network

    Figure 5: t-SNE plot of speaker-specific representations generated by the speaker-identification network when trained with different loss functions: (a) cross-entropy, (b) Arcface

    Figure 6: t-SNE plot of speaker-specific representations generated by the speaker-identification network when trained with audio segments of varying minimum lengths: (a) 1 s, (b) 2 s, (c) 3 s, (d) 4 s

    Table 3 shows a comparison of the proposed method's performance against that of previous studies. The first four methods employed the wav2vec 2.0 representation and used the cross-entropy loss [30–32]. Tang et al. [39] employed hand-crafted features and used the Arcface loss. Here, the individual vocal properties provided by the speaker-identification network are not used. Table 3 shows that the method proposed by Tang et al. [39] has a higher WA than UA. This implies that emotion classes with more samples, particularly in the imbalanced IEMOCAP dataset, are better recognized. The wav2vec 2.0-based methods [30–32] used average time pooling to combine features across the time axis. Reference [32] also included a long short-term memory (LSTM) layer to better model the temporal features. In the proposed method, the Arcface loss is used instead of the cross-entropy loss, and an attention block is used to model temporal features. Table 3 shows that the proposed attention-based method outperforms previous methods with similar training paradigms. It also demonstrates that using four attention blocks results in significantly better performance than using one, two, or three attention blocks. This is because four attention blocks can more effectively identify the segments of the combined emotion representation that are most relevant to SER. Reference [33]'s outstanding performance can be attributed to the use of a pseudo-label task adaptive pre-training (P-TAPT) strategy that is described in Subsection 2.5.

    Table 3: Comparing the performance of the proposed emotion classification approach (with a varying number of attention blocks) against that of previous methods also trained with the Arcface loss

    5.2 Partially and Entirely Fine-Tuning Networks

    The proposed speaker-identification network was fine-tuned under three different configurations: fine-tuning with the entire pre-trained network frozen (All Frozen), fine-tuning with the wav2vec 2.0 segment frozen and the Arcface center representation vectors unfrozen (Arcface Fine-tuned), and fine-tuning with both the wav2vec 2.0 weights and the Arcface center representation vectors unfrozen (All Fine-tuned). The wav2vec 2.0 feature encoder (convolutional layers) is frozen in all cases [31]. The IEMOCAP dataset only has 10 individuals. Therefore, the Arcface center representation vectors are reduced from 1,251 (in the VoxCeleb1 dataset) to 8 while jointly fine-tuning both the speaker-identification network and the emotion classification network. While fine-tuning with both the wav2vec 2.0 weights and the Arcface vectors unfrozen, the loss is computed as a combination of emotion and identification loss terms as shown in Eq. (6):

    $$\mathcal{L} = \alpha\,\mathcal{L}_{emo} + \beta\,\mathcal{L}_{id} \tag{6}$$

    α and β are used to control the extent to which the emotion and identification losses, respectively, affect the emotion recognition results. Since training the emotion classification network with four attention blocks showed the best performance in prior experiments, fine-tuning performance was evaluated under this configuration. Fig. 7 shows that freezing the speaker-identification network provides the best overall performance. Due to the small number of speakers in the IEMOCAP dataset, the model quickly converged on a representation that could distinguish the speakers it was trained on but was unable to generalize to unseen speakers. More specifically, the frozen version of the speaker-identification module was trained on the VoxCeleb1 dataset and frozen because it has 1,251 speakers' utterances. These utterances provide significantly larger variation and diversity than the utterances of the 8 speakers (training dataset) in the IEMOCAP dataset. This implies that the frozen version can better generalize to unseen speakers than versions fine-tuned on the 8 speakers of the IEMOCAP dataset, as shown in Figs. 7b and 7c.
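    In code, the joint objective of Eq. (6) is simply a weighted sum of the two loss terms; the α and β values below are placeholders, since the tested range is reported only in Fig. 7.

```python
# Minimal sketch of the joint fine-tuning objective in Eq. (6): a weighted combination of
# the emotion loss and the speaker-identification loss. alpha and beta are placeholders.
import torch

def joint_loss(emotion_loss: torch.Tensor, identification_loss: torch.Tensor,
               alpha: float = 1.0, beta: float = 0.5) -> torch.Tensor:
    return alpha * emotion_loss + beta * identification_loss

total = joint_loss(torch.tensor(1.25), torch.tensor(0.80))
```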

    Figure 7: Performance of the proposed method with the speaker-identification network fine-tuned to various levels: (a) All Frozen, (b) Arcface Fine-tuned, (c) All Fine-tuned

    Fig. 7b shows that increasing β, which controls the significance of the identification loss, improves emotion classification accuracy when only the Arcface center representation vectors are fine-tuned and the wav2vec 2.0 segment is frozen. Conversely, Fig. 7c shows that increasing β causes the emotion classification accuracy to deteriorate when the entire model is fine-tuned. This implies that partly or entirely freezing the weights of the speaker-identification network preserves the representation learned from the 1,251 speakers of the VoxCeleb1 dataset, resulting in better emotion classification performance. On the other hand, fine-tuning the entire model on the IEMOCAP dataset's eight speakers degrades the speaker-identification network's generalization ability. More specifically, in the partly frozen version, only the attention-pooling and speaker classification layers are fine-tuned, leaving the pre-trained weights of the speaker-identification network intact.

    Figs. 8 and 9 show t-SNE plots of emotion representations generated by the emotion classification network under various configurations. In Figs. 8a and 8b, the left column contains representations generated from the training set, and the right column contains those generated from the test set. In the top row of Figs. 8a and 8b, a representation's color indicates its predicted emotion class, and in the bottom row, it indicates its predicted speaker class. The same descriptors apply to Figs. 9a and 9b. More specifically, Fig. 8 illustrates the effect of employing the speaker-specific representations generated by the frozen speaker-identification network in the emotion classification network. As shown in Fig. 8, using the speaker-specific representations improves intra-class compactness and increases inter-class separability between emotion classes compared to training without the speaker-specific representation. The emotion representations generated when speaker-specific information was utilized show a clear distinction between the eight speakers of the IEMOCAP dataset and their corresponding emotion classes.

    Figure 8: t-SNE plot of emotion representations generated by the emotion classification network under two configurations: (a) without the speaker-specific representation, (b) with the speaker-specific representation

    Figure 9: t-SNE plot of emotion representations generated by the emotion classification network under two configurations: (a) only Arcface vector weights fine-tuned, (b) all fine-tuned

    Contrasting Figs. 9a and 9b, the results show that fine-tuning both the speaker-identification network and the emotion classification network increases the inter-class separability between the emotion representations of speakers while retaining speaker-specific information. This results in a slight improvement in the overall SER performance, which is in line with the findings shown in Figs. 7b and 7c.

    5.3 Comparing the Proposed Method against Previous Methods

    In Table 3, the proposed method is compared against previous SER methods that are based on the wav2vec 2.0 model or employ the Arcface loss. In Table 4, the performance of the proposed method under various configurations is compared against that of existing approaches on the IEMOCAP dataset. In Table 4, "EF" and "PF" stand for "entirely fine-tuned" and "partially fine-tuned," respectively. Experiments showed that the configuration using four attention blocks in the emotion network and fine-tuning with the speaker-identification network frozen (Fig. 7a) provided the best performance. Therefore, this configuration was used when comparing the proposed method against previous methods. The proposed method significantly improves the performance of SER models, even allowing smaller models to achieve performance close to that of much larger models. As shown in Table 4, reference [33] achieved better performance than the proposed method because it uses a pseudo-label task adaptive pre-training (P-TAPT) strategy, as described in Subsection 2.5.

    Table 4: Comparing the performance of the proposed method against previous SER methods

    Reference [44] is a HuBERT-large-based model that employs label-adaptive mixup as a data augmentation approach. It achieved the best performance among the approaches listed in Table 4. This is because the authors created a label-adaptive mixup method in which linear interpolation is applied in the feature space. Reference [45] employed balanced augmentation sampling on triple-channel log Mel-spectrograms before using a CNN and an attention-based bidirectional LSTM. Although this method was trained for several tasks, such as gender, valence/arousal, and emotion classification, it did not perform as well as the proposed method. This is because the proposed method uses speaker-specific properties while generating emotion representations from speaker utterances.

    5.4 Ablation Study

    Since the audio segments in the IEMOCAP dataset are unevenly distributed across emotion classes, emotion classes with more samples were under-sampled. In order to examine the effects of an imbalanced dataset, additional experiments were conducted with varying amounts of training data. More specifically, the model was trained on the entire dataset with and without under-sampling to examine the effects of an imbalanced dataset. The best-performing configuration of the proposed model (the speaker-specific emotion representation network with four attention blocks and the speaker-identification network frozen) was used in these experiments. Table 5 shows the results of experiments conducted under four configurations. In the experimental results, both the pre-trained and fine-tuned model variations showed their best performance when trained using the under-sampled version of the IEMOCAP dataset. This is because under-sampling addresses the dataset's imbalance problem adequately.

    Table 5: Performance of the speaker-specific emotion representation network trained under four training configurations (with under-sampled and complete versions of the IEMOCAP dataset)

    In order to investigate the effects of using the speaker-specific representation, experiments were conducted first using just the emotion classification network and then using the speaker-specific emotion representation network. More specifically, the cross-entropy and Arcface losses, as well as configurations of the networks with 1, 2, 3, and 4 attention blocks, were used to investigate the effects of using the speaker-specific representation. As shown in Table 6, the intra-class compactness and inter-class separability facilitated by the Arcface loss result in better performance than when the cross-entropy loss is used in almost all cases. Using the speaker-specific emotion representation outperformed the bare emotion representation under almost all configurations.

    Table 6: Performance (accuracy) of the speaker-specific emotion representation network under 1, 2, 3, and 4 attention block configurations and trained with cross-entropy and Arcface losses

    The computation time of the proposed method under various configurations was also examined. The length of the input audio segments (3, 5, 10, and 15 s) and the number of attention blocks (1, 2, 3, and 4) were varied. The proposed model (the speaker-specific emotion representation network) consists of two networks (speaker-identification and emotion classification). Table 7 shows the two networks' separate and combined computation times under the abovementioned configurations. As shown in Table 7, computation time increases as the length of the input audio segments and the number of attention blocks increase. Experiments show that the proposed model's best-performing configuration is that in which the speaker-specific emotion representation network has four attention blocks. Under this configuration, the model can process an audio segment in 27 ms.

    Table 7: Computation time (ms) of the proposed networks (speaker-identification, emotion classification, and speaker-specific emotion representation networks) for input audio segments of varying lengths (3, 5, 10, and 15 s)

    6 Conclusion

    This study proposes two modules for generating a speaker-specific emotion representation for SER. The proposed emotion classification and speaker-identification networks are based on the wav2vec 2.0 model. The networks are trained to respectively generate emotion and speaker representations from an input audio waveform using the Arcface loss. A novel tensor fusion approach was used to combine these representations into a speaker-specific representation. Employing attention blocks and max-pooling layers improved the performance of the emotion classification network. This was associated with the attention blocks' ability to identify which segments of the generated representation were most relevant to SER. Training the speaker-identification network on the VoxCeleb1 dataset (1,251 speakers) and entirely freezing it while using four attention blocks in the emotion network provided the best overall performance. This is because of the proposed method's robust generalization capabilities that extend to unseen speakers in the IEMOCAP dataset. The experimental results showed that the proposed approach outperforms previous methods with similar training strategies. In future works, various wav2vec 2.0 and HuBERT model variations are to be employed to improve the proposed method's performance. Novel pre-training and fine-tuning strategies, such as TAPT and P-TAPT, are also to be explored.

    Acknowledgement: The authors extend their appreciation to the University of Oxford's Visual Geometry Group as well as the University of Southern California's Speech Analysis & Interpretation Laboratory for their excellent work on the VoxCeleb1 and IEMOCAP datasets. The authors are also grateful to the authors of wav2vec 2.0 (Facebook AI) for making their source code and corresponding model weights available.

    Funding Statement: This research was supported by the Chung-Ang University Graduate Research Scholarship in 2021.

    Author Contributions: Study conception and model design: S. Park, M. Mpabulungi, B. Park; analysis and interpretation of results: S. Park, M. Mpabulungi, H. Hong; draft manuscript preparation: S. Park, M. Mpabulungi, H. Hong. All authors reviewed the results and approved the final version of the manuscript.

    Availability of Data and Materials: The implementation of the proposed method is available at https://github.com/ParkSomin23/2023-speaker_specific_emotion_SER.

    Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
