
    Speech Recognition via CTC-CNN Model

    2023-10-26 13:15:32
    Computers Materials & Continua, 2023, Issue 9

    Wen-Tsai Sung, Hao-Wei Kang and Sung-Jung Hsiao

    1Department of Electrical Engineering, National Chin-Yi University of Technology, Taichung, 411030, Taiwan

    2Department of Information Technology, Takming University of Science and Technology, Taipei, 11451, Taiwan

    ABSTRACT In the speech recognition system, the acoustic model is an important underlying model, and its accuracy directly affects the performance of the entire system. This paper introduces the construction and training process of the acoustic model in detail, studies the connectionist temporal classification (CTC) algorithm, which plays an important role in the end-to-end framework, and establishes a convolutional neural network (CNN) acoustic model combined with connectionist temporal classification to improve the accuracy of speech recognition. This study uses a sound sensor, the ReSpeaker Mic Array v2.0.1, to convert the collected speech signals into text or corresponding speech signals to improve communication and reduce noise and hardware interference. The baseline acoustic model in this study faces challenges such as long training time, a high error rate, and a certain degree of overfitting. The model is trained through continuous design and improvement of the relevant parameters of the acoustic model, and finally the model with excellent performance is selected according to the evaluation index, which reduces the error rate to about 18% and thus improves the accuracy. Finally, comparative verification was carried out on the selection of acoustic feature parameters, the selection of modeling units, and the speaker's speech rate, which further verified the excellent performance of the CTCCNN_5+BN+Residual model structure. In terms of experiments, to train and verify the CTC-CNN baseline acoustic model, this study uses the THCHS-30 and ST-CMDS speech data sets as training data; after 54 epochs of training, the word error rate of the acoustic model on the training set is 31%, and the word error rate on the test set is stable at about 43%. This experiment also considers the surrounding environmental noise. Under a noise level of 80-90 dB, the accuracy rate is 88.18%, the worst performance among all levels. In contrast, at 40-60 dB, the accuracy is as high as 97.33% due to less noise pollution.

    KEYWORDS Artificial intelligence; speech recognition; speech to text; convolutional neural network; automatic speech recognition

    1 Introduction

    Speech is a linguistic term coined by the Swiss linguist Saussure; it is a concept that stands in opposition to language. Speech activity is mainly controlled by the individual's free will; it has characteristics of personal pronunciation, word use, expression, emotion, etc. In contrast, language is the social part of speech activity: it is not dominated by individual will but shared by members of society, and it arises as a social psychological phenomenon. Speech activity, as defined by Saussure, collectively describes the phenomenon of human speech. Human language is a natural and effective means of communication, and it is required at most levels of life to communicate with and be understood by others. Verbal communication is taken for granted by most people. In contrast, if an individual's pronunciation or expression makes it difficult for others to even understand what they are saying, it is highly inconvenient and frustrating.

    Millions of people worldwide are unable to pronounce correctly and fluently due to disorders such as stroke, amyotrophic lateral sclerosis (ALS), cerebral palsy, traumatic brain injury, or Parkinson's disease. In response to this problem, we propose an end-to-end neural network architecture, the connectionist temporal classification-convolutional neural network (CTC-CNN), to help these people communicate normally. Deep learning is mainly used in visual recognition, speech recognition, natural language processing, biomedicine and other fields, and has achieved good results. This research uses deep learning technology to develop a deep intelligent speech processing system that effectively integrates signal processing, acoustic processing, language processing and deep learning. We research and develop intelligent multi-channel speech processing and speech separation, and optimize speech recognition, speech translation, and speech emotion recognition. In the front-end processing, we propose a multi-channel speech enhancement algorithm based on deep learning, which integrates beamforming technology and a deep neural network. In terms of speech separation, we propose a single-channel speech separation (SCSS) model based on Gaussian process (GP) regression: the source estimate is given by the predicted mean of the GP regression model, and hyper-parameter learning is performed with a nonlinear conjugate gradient algorithm. We propose the Hierarchical Extreme Learning Machine (HELM) for audiovisual speech enhancement as an alternative model for speech enhancement tasks. To enhance speech recognition, a novel graph regularization-based method is proposed to enhance speech features by preserving the intrinsic diversity structure of the amplitude modulation spectrum and excluding irrelevant components. In machine translation, bidirectional translation between English and Chinese is provided. The speech emotion recognition system uses a multi-feature extraction network based on deep learning and a self-developed recurrent neural network. To understand the semantics of dialogue, language understanding techniques for dialogue systems are developed. Through intelligent speech recognition technology, the speech learning robot built on this deep learning speech recognition system allows users to practice pronunciation in a speech environment. Robustness technology mitigates the adverse effects of environmental distortions to maintain acceptable performance levels for automatic speech recognition systems. Deep learning is widely used today because of its powerful learning ability: it can be trained on large-scale data sets and autonomously extract and learn complex features and models from them. Deep learning uses a multi-layer neural network model that extracts abstract feature representations of the data layer by layer. Through the combination of multiple hidden layers, the model can learn higher-level and more abstract feature representations, thereby improving its performance; at the same time, it can handle large amounts of data and use them efficiently to learn more accurate and generalizable models. It can also be trained in an end-to-end manner, that is, from the initial input data the final result is directly output, without the need to manually design complex feature engineering. This simplifies the development process of the machine learning system and improves the efficiency and accuracy of the model.

    However, in the field of speech recognition, the performance of the acoustic model directly affects the accuracy and stability of the final speech recognition system, requiring detailed consideration of its establishment, optimization and efficiency [1]. The experiments in this study adopt CTC-CNN to train the acoustic model, and CTC-CNN shows better performance than the Gaussian mixture model-hidden Markov model (GMM-HMM) acoustic model commonly used earlier. We use state-of-the-art techniques to validate our method, and experimental results show that the effect is remarkable. Fig. 1 illustrates the historical evolution of automatic speech recognition (ASR).

    In the traditional speech recognition system, the mismatch between the model training environment and the test environment is the primary cause of a decline in the recognition rate. For this issue, many solutions have been proposed in the past literature, such as robust CTC-CNN model prediction classification rules that introduce parameter uncertainty at the speech model end, or adaptation methods that adjust the model to the test environment, such as maximum a posteriori (MAP) adaptation and linear regression adaptation, and even methods that further consider the discriminability of speech models, such as minimum classification error linear regression (MCELR) adaptation. Among them, the CTC-CNN model prediction classification method properly introduces the uncertainty of the model parameters into the decision rule to achieve robustness of the decision method, and the parameter uncertainty reflects the variability of the noise environment and acoustics. It can be represented by the prior probability, and traditional CTC-CNN model learning provides a mechanism for estimating and updating the prior information of the parameters. In order to take into account both the robustness and the discriminability of the decision rule, this study proposes discriminative training and updating of the acoustic model and its prior probability model under the CTC-CNN model predictive classification framework. We use the discriminative criterion of minimum classification error (MCE) to estimate the hyper-parameters of the model parameters and propose two update methods: one directly updates the prior statistics of the hidden Markov model mean vector parameters; the other considers linear regression adaptation, in which the prior information of the regression matrix is updated under the minimum classification error criterion. In an evaluation experiment based on an environmental noise speech database, it is found that using the updated prior probability can improve the discriminability of the CTC-CNN model prediction classification and achieve the purpose of improving the performance of robust speech recognition.

    In this field of research, many scholars have proposed different methods to solve the mismatch problem, which we roughly divide into three categories: the signal space, the feature parameter space, and the model parameter space. The first category mainly uses speech enhancement methods; the idea is to reduce the noise in the signal caused by the environment through signal processing and obtain an approximately clean signal. The second category is similar in concept to processing in the signal space; it aims to restore the characteristics of the feature parameters in the original environment and compensate the feature parameters. The last category processes the model parameters that have already been trained. It is subdivided into two types: one uses a small amount of corpus obtained from the new environment to adapt the original model parameters so that they approach the new environment; the other considers the uncertainty of the model parameters to reduce the impact of model variation and thereby achieve a robust decision-making mechanism. In addition, during model training, the parameters or distributions of different models are often confusable, resulting in increased classification errors. Therefore, scholars have also proposed introducing discriminability into the training process of the model in order to achieve clearer results.

    In this study, based on considering the uncertainty of the parameters, it is hoped that the parameter uncertainty can be updated under a discriminative classification method to further achieve robust decision-making with a discriminative prior probability. The prior probability learning that considers uncertainty and discriminability is also implemented in the adjustment of model parameters, divided into direct adjustment and indirect adjustment of model parameters. On a continuous digit corpus dominated by environmental noise, an improvement in recognition performance can be achieved. This study uses Google's public training data set, the Speech Commands Dataset, for analysis and deep learning model training; it contains audio files for 30 different words, and each word has about 2300~2400 original WAV audio files. Based on this data set, data preprocessing (including analysis and conversion of sound waves, splitting of training and test data, etc.) is performed, and Keras and various packages are used in the Python environment to construct a convolutional neural network model and a long short-term memory model for image-style recognition training on the converted data, as sketched below.
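    As a minimal sketch of the kind of Keras model this preprocessing feeds (assuming TensorFlow 2.x; the input shape, filter sizes, and layer counts are illustrative placeholders, not the configuration used in this study):

        # Minimal sketch (not the authors' exact code): a small Keras CNN that classifies
        # fixed-size MFCC/log-Mel "images" derived from Speech Commands audio clips.
        import tensorflow as tf
        from tensorflow.keras import layers, models

        NUM_WORDS = 30              # the dataset contains 30 command words
        INPUT_SHAPE = (98, 40, 1)   # e.g., ~98 frames x 40 feature bins; depends on preprocessing

        def build_cnn(input_shape=INPUT_SHAPE, num_classes=NUM_WORDS):
            model = models.Sequential([
                layers.Input(shape=input_shape),
                layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
                layers.MaxPooling2D((2, 2)),
                layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
                layers.MaxPooling2D((2, 2)),
                layers.Flatten(),
                layers.Dense(128, activation="relu"),
                layers.Dropout(0.3),
                layers.Dense(num_classes, activation="softmax"),
            ])
            model.compile(optimizer="adam",
                          loss="sparse_categorical_crossentropy",
                          metrics=["accuracy"])
            return model

        model = build_cnn()
        model.summary()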

    2 Literature Survey

    In this era of technological progress, speech recognition technology has been applied in numerous fields, most of which are based on intelligent electronics and driving navigation products. In addition to helping people troubled by language barriers who cannot communicate normally due to disease or various disorders, this research has the potential to bring more convenience to their lives. In the experimental architecture, this model considers linguistics, speech recognition applications, and deep learning techniques. Because this approach assists individuals with speech and language impairments, it is necessary to understand basic linguistic theory. Because language is spontaneous human speech, it contains numerous irregular variables, such as personal pronunciation, words, and expressions, and these factors lead to a certain degree of complexity in establishing an acoustic model that meets the requirements as closely as possible. Deep learning serves to improve the efficiency and accuracy of the acoustic model.

    In view of the fact that the application of artificial intelligence in various fields has increased significantly in recent years, understanding the basic concepts of deep learning and their implementation in programs has become an important learning goal. From the obtained original data, we can further analyze and understand the characteristics of the data; selecting and using suitable neural networks to construct models is then an important issue. Among the application fields of deep learning, the most important are image recognition and natural language processing. Therefore, this research carries out a simple program implementation for the latter field, using a convolutional neural network (CNN) and long short-term memory (LSTM) to implement a simple speech recognition model intended to recognize simple words, and also tunes parameters and designs experiments in order to develop a high-accuracy recognition model.

    Speech recognition is a technology in which a computer converts the speaker's pronunciation into text by comparing acoustic features. In the 1980s, research in this field was initiated by a laboratory of the Massachusetts Institute of Technology, but due to the low recognition rate, it could not be applied commercially. It was not until 2012, when scientists replaced the traditional Gaussian-distribution calculation with DNN computation and greatly improved the recognition rate, that it gradually attracted the attention of large international companies. The main process of using a deep network to realize automatic speech recognition is to input speech fragments (spectrograms, MFCCs, etc.), convert the original speech into acoustic features, pass them through the judgment and probability distribution of the neural network, and finally output the corresponding text content.
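    As an illustration of this feature-extraction step, the following sketch converts a waveform into MFCC frames; librosa and the specific window settings are assumptions for the example, not choices stated in the paper.

        # Illustrative only: converting a waveform into MFCC feature frames with librosa.
        import librosa
        import numpy as np

        wav_path = "example.wav"                  # placeholder path
        y, sr = librosa.load(wav_path, sr=16000)  # resample to 16 kHz
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=400, hop_length=160)  # 25 ms window, 10 ms hop
        # Per-coefficient mean/variance normalization over time
        mfcc = (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)
        print(mfcc.shape)  # (13, num_frames): one 13-dimensional vector per frame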

    The two neural networks used in this study are the CNN and the LSTM, a variant of the recurrent neural network (RNN). A CNN is a convolutional neural network consisting of convolutional layers, fully connected layers, and pooling layers. With the backpropagation algorithm, it can exploit the two-dimensional structure of the input data to extract features, converge, and learn properly, and it performs excellently in speech recognition. The LSTM is an RNN with an active data memory that can be applied to a series of data to predict what will happen next. Its output is related not only to the current input and the network weights, but also to the inputs of previous time steps, so it is often used to process time-series data. It has been widely used in natural language understanding (such as speech-to-text, translation, and handwritten text generation), image and video recognition, and other fields.

    2.1 Speech Recognition

    In recent years, the use of speech recognition has been spreading widely across various fields; it is no longer limited to intelligent electronics products, but is gradually expanding to the healthcare industry and even to product sales and customer service. A good speech recognition system must allow organizations to customize and adapt the technology to their specific needs, from nuances in language and speech to everything else. For example:

    1. Language weighting: A discriminative weighted language model is proposed to better distinguish similar languages. Similar utterances or words are weighted to improve accuracy [2].

    2. Speaker markers: Speaker selection, taking turns, elaboration, and digression. After providing definitions of discourse markers, turns, floor control types/turn segments, topic units, and actions, a list of verbal and non-verbal discourse markers is specified and grouped into subcategories according to their semantic relationship [3].

    3. Acoustic training: Building acoustic models from large databases has been shown to benefit the accuracy of speech recognition systems. Deep learning is employed to train these systems to adapt to various acoustic environments, such as speaker pronunciation, speech rate, pitch, etc., to cope with a variety of different situations.

    4. Indecent content filtering: Filters are used to identify profanity or nonsense particles, etc., and eliminate this type of speech [4].

    2.1.1 Pattern Recognition

    Current mainstream large-vocabulary speech recognition systems mostly use statistical pattern recognition technology.A typical speech recognition system based on the statistical pattern recognition method consists of the following basic modules:

    1. Signal processing and feature extraction module: The main work of this module is to extract acoustic features from the input signal and provide them to the acoustic model for processing. It also includes signal processing techniques to minimize the influence of environmental noise, channel, speaker, and other factors on the features.

    2.Acoustic model: Typical systems are mostly modeled based on first-order hidden Markov models.

    3.Pronunciation dictionary:The pronunciation dictionary contains the vocabulary and pronunciations that can be handled by the system.The pronunciation dictionary provides the mapping between the acoustic model modeling unit and the language model modeling unit.

    4. Language model: A statistical language model represents a probability distribution over a sequence of words, and it mainly provides the context needed to distinguish words and phrases that have similar pronunciations but different meanings, as shown in the example in Fig. 2. This model is used in numerous natural language processing applications, such as speech recognition, machine translation, and part-of-speech tagging. Because words and sentences can be of any length and in any combination, strings that did not appear during training will appear, which makes it difficult to estimate the probabilities of strings in the database.

    5.Decoder:The decoder is one of the core aspects of the speech recognition system.It mainly uses the input signal to find the word string that outputs the signal with the greatest probability according to acoustics,language models,and dictionaries.

    Fig. 2 illustrates two homophonous English strings. In the language model, in addition to pronunciation affecting the accuracy of speech recognition, punctuation is likewise an important factor affecting the recognition of the system. Therefore, we discuss several considerations when constructing a post-processing system: (1) Restoring the original text requires a high-accuracy model of punctuation and capitalization. The model must make quick inferences on interim results and keep up with instant captions. (2) Resource usage: speech recognition is a computation-intensive technology, whereas punctuation models do not need to be so computationally intensive. (3) Ability to handle text not listed in the vocabulary: sometimes the system must add punctuation or capitalization to text that the model has not seen before.

    2.1.2 Speech Recognition Algorithms

    Speech recognition is considered one of the most complex fields in modern technology, as it involves linguistics, mathematics, and statistics. At present, a common speech recognition system is mainly composed of several technologies, such as speech signal input, feature extraction, acoustic model building, feature vectors, a decoder, and result output. Speech recognition technology is evaluated based on its accuracy, word error rate (WER), and speed, and a variety of factors affect the error rate. A minimal WER computation is sketched below.
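    WER is conventionally computed as the word-level edit distance between the reference and the hypothesis divided by the number of reference words; the following is a minimal sketch.

        # Minimal word error rate (WER) computation: Levenshtein distance over words
        # (substitutions + insertions + deletions) divided by the number of reference words.
        def wer(reference: str, hypothesis: str) -> float:
            ref, hyp = reference.split(), hypothesis.split()
            # dp[i][j] = edit distance between ref[:i] and hyp[:j]
            dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
            for i in range(len(ref) + 1):
                dp[i][0] = i
            for j in range(len(hyp) + 1):
                dp[0][j] = j
            for i in range(1, len(ref) + 1):
                for j in range(1, len(hyp) + 1):
                    cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                    dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                                   dp[i][j - 1] + 1,          # insertion
                                   dp[i - 1][j - 1] + cost)   # substitution
            return dp[len(ref)][len(hyp)] / max(len(ref), 1)

        print(wer("the weather is so nice today", "the weather is so nice to day"))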

    Here are some of the various algorithms and techniques that are currently most commonly used to recognize speech and convert it to text:

    1.Natural Language Processing(NLP)belongs to the field of artificial intelligence,which focuses on language interaction between humans and machines through speech and text.Numerous mobile devices currently incorporate speech recognition into their systems to provide more assistance.

    2.Hidden Markov Model(HMM)is used as a sequence model in speech recognition,assigning labels to each unit in the sequence,i.e.,words,syllables,sentences,etc.These labels map between them and the input provided,such that it can determine the most appropriate sequence of labels.

    3. N-grams are the simplest type of language model (LM), assigning probabilities to sentences or phrases. An N-gram is a sequence of N words. For example, "How are you" is a trigram, and "I'm fine thank you" is a 4-gram. Grammar and the probabilities of specific word sequences are used to improve recognition accuracy (a toy bigram estimate is sketched after this list) [2].

    4.Neural networks are mainly used in deep learning algorithms.They learn the mapping function through supervised learning and adjust it according to the loss function during gradient descent.

    5.Speaker Discrimination(SD)algorithms identify and separate utterances by speaker identity.This helps the system make better distinctions between individuals in a conversation[5].
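    A toy maximum-likelihood bigram estimate, for illustration only:

        # Toy bigram model: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).
        from collections import Counter

        corpus = ["how are you", "how are they", "i am fine thank you"]
        unigrams, bigrams = Counter(), Counter()
        for sentence in corpus:
            words = ["<s>"] + sentence.split() + ["</s>"]
            unigrams.update(words[:-1])
            bigrams.update(zip(words[:-1], words[1:]))

        def bigram_prob(prev, word):
            return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

        print(bigram_prob("how", "are"))   # 1.0 in this tiny corpus
        print(bigram_prob("are", "you"))   # 0.5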

    2.2 Convolutional Neural Networks Applied in This Study

    2.2.1 CTC-CNN Acoustic Model

    In the speech recognition system, the acoustic model is an important underlying model, and its accuracy directly affects the performance of the entire system. When the acoustic features remain unchanged, the performance of the speech recognition system is mainly improved by optimizing the acoustic model. Early speech recognition systems mainly employed the GMM-HMM acoustic model, which is a shallow model; thus, it is difficult to accurately describe the state-space distribution of the features. Furthermore, the frame-by-frame training mode requires forced alignment of the training speech, which increases the difficulty of model training. With the development of deep learning, speech recognition systems began to use deep learning-based acoustic models and achieved remarkable results. The latest end-to-end speech recognition frameworks abandon the more restrictive HMM and directly optimize the likelihood of input and output sequences, which significantly simplifies the training process. Deep neural networks, recurrent neural networks, and convolutional neural networks have achieved great results in the field of speech recognition with their respective advantages [6].

    In this study, a convolutional neural network is mainly used to build the acoustic model, combined with the connectionist temporal classification algorithm, which significantly improves the accuracy and performance of the speech recognition system. Based on the establishment of the baseline acoustic model, this research significantly reduces the error rate of the speech-to-pinyin conversion by continuously optimizing the acoustic model. As the output of the acoustic model, the choice of the modeling unit is also one of the factors affecting its performance. When selecting a modeling unit, it is necessary to consider: (1) whether the modeling unit fully represents the context information, i.e., its accuracy; (2) whether it can describe the acoustic features well, i.e., its generalizability; (3) whether there is sufficient language material for the modeling unit for model training, i.e., its trainability. When building the speech recognition system in this study, a non-complete end-to-end speech recognition framework is employed: the acoustic model uses an end-to-end recognition framework to convert speech into pinyin sequences, and a language model then converts the pinyin sequences into text. A convolutional neural network is used to build the acoustic model, combined with the connectionist temporal classification (CTC) algorithm, to realize the conversion of speech to pinyin sequences. Traditional classification methods face problems such as unequal input and output lengths and the need for frame-by-frame training. CTC can directly map the input speech sequence into a text sequence, so that it can optimize the likelihood of the input and output sequences, which significantly simplifies the training process. The acoustic model based on CTC is in essence still a sequence classification problem, meaning that the output of each node in the output layer of the neural network selects a generation path with the highest probability. Therefore, the input and output of CTC are often in a many-to-one relationship [7]. When the CTC-based acoustic model recognizes speech, the acoustic feature parameters are first extracted through the convolutional neural network, and the posterior probability matrix is then output through the fully connected network and the SoftMax layer. The maximum-probability label of each node is thus used as the output sequence. Finally, the CTC decoding algorithm optimizes the output label sequence to give the recognition result. A schematic diagram of the CTC-CNN acoustic model is shown in Fig. 3 [8].

    Figure 3:Schematic chart of CTC-CNN acoustic model
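    A minimal sketch of such a CTC-trained CNN acoustic model in Keras (assuming TensorFlow 2.x; the feature dimension, filter counts, and the pinyin label inventory size are placeholders, not the paper's configuration):

        # Sketch only: convolutions over the spectrogram, a per-frame dense + softmax layer,
        # and Keras's built-in CTC loss over the resulting posterior matrix.
        import tensorflow as tf
        from tensorflow.keras import layers

        NUM_LABELS = 1423 + 1   # assumed pinyin label inventory + 1 for the CTC blank

        def build_ctc_cnn(feat_dim=200, num_labels=NUM_LABELS):
            feats = layers.Input(shape=(None, feat_dim, 1), name="features")   # (time, freq, 1)
            x = layers.Conv2D(32, 3, padding="same", activation="relu")(feats)
            x = layers.MaxPooling2D(pool_size=(1, 2))(x)                       # keep time resolution
            x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
            x = layers.MaxPooling2D(pool_size=(1, 2))(x)
            x = layers.Reshape((-1, (feat_dim // 4) * 64))(x)                  # (time, channels*freq)
            x = layers.Dense(256, activation="relu")(x)
            y_pred = layers.Dense(num_labels, activation="softmax", name="posteriors")(x)

            labels = layers.Input(shape=(None,), dtype="int32", name="labels")
            input_len = layers.Input(shape=(1,), dtype="int32", name="input_length")
            label_len = layers.Input(shape=(1,), dtype="int32", name="label_length")
            ctc_loss = layers.Lambda(
                lambda args: tf.keras.backend.ctc_batch_cost(*args),
                name="ctc")([labels, y_pred, input_len, label_len])

            model = tf.keras.Model(inputs=[feats, labels, input_len, label_len], outputs=ctc_loss)
            model.compile(optimizer="adam", loss=lambda y_true, y_out: y_out)  # loss already computed
            return model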

    2.2.2 Core Idea of CTC

    The core ideas of CTC mainly include the following parts:

    (1) Expanding the output layer of CNN,adding a many-to-one spatial mapping between the output sequence and the recognition result (label sequence),and defining the CTC loss function on this basis.

    (2)Drawing on the idea of the forward algorithm of HMM,the dynamic programming algorithm is used to effectively calculate the CTC loss function and its derivative,thus solving the problem of end-to-end training of CNN[9].

    (3) Combined with the CTC decoding algorithm,the end-to-end prediction of sequence data is effectively realized[10].

    Assuming that the speech signal is x and the label sequence is l, the neural network obtains the probability distribution of the label sequence P(l|x) during the training process. Therefore, after inputting the speech, the output sequence with the highest probability is selected, and after CTC decoding optimization, the final recognition result O(x) can be output, where the operation formula is shown in Eq. (1).

    For CTC derivation training of the CNN acoustic model, we first assume that S is the training data set, X is the input space, Z is the target space (the set of labeled sequences), and L is defined as the set of all output labels (modeling units). CTC extends L to L' = L ∪ {Blank}. Under the given conditions, the probability of outputting a label k at a time t can be expressed as Eq. (2).

    Assuming that, under the condition of a given input sequence x, the output label probabilities at each time t are independent, and L'^T is defined as the set of output sequences of length T composed of labels in L', the conditional probability of a path π ∈ L'^T is given by Eq. (3).

    We define the mapping relationship B: L'^T → L^{≤T} from a path π to a label sequence l. This mapping keeps only one of any run of consecutive identical labels in the output sequence contained in the path π and removes the Blank label. Then, to calculate the probability of a label sequence l ∈ L^{≤T}, it is necessary to accumulate the probabilities of all paths belonging to l, and the calculation formula is shown in Eq. (4).
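    The display equations referenced here did not survive extraction. Under the standard CTC formulation, Eqs. (1)-(4) usually take the following form; this is a reconstruction consistent with the surrounding definitions, not the paper's verbatim equations.

        % Standard CTC expressions corresponding to Eqs. (1)-(4); a reconstruction.
        \begin{align}
        O(x) &= \operatorname*{arg\,max}_{l \in L^{\le T}} P(l \mid x) \tag{1}\\
        y_k^t &= P(k,\, t \mid x), \qquad k \in L' = L \cup \{\text{blank}\} \tag{2}\\
        P(\pi \mid x) &= \prod_{t=1}^{T} y_{\pi_t}^{t}, \qquad \pi \in L'^{T} \tag{3}\\
        P(l \mid x) &= \sum_{\pi \in B^{-1}(l)} P(\pi \mid x) \tag{4}
        \end{align}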

    The mapping of a path π to a label sequence l is shown in Fig. 4.

    Figure 4:Mapping output to label sequence

    Fig. 4 shows that the probability of the label sequence label_7 is equal to the total probability of all of its paths, that is, P(label_7) = P(path_1) + P(path_2) + P(path_3) + P(path_4). It is impractical to compute P(l|x) directly by brute force, as this would increase the training time of the model and consume computing power. Borrowing the forward-backward algorithm from HMMs effectively solves for P(l|x); it is assumed that, under the condition of a given input sequence x, the output label probabilities at each time t are independent, so that transition probabilities between states do not need to be considered. The derivation diagram of the forward and backward algorithm is shown in Fig. 5 [11].

    Figure 5:Derivation of forward and backward algorithm

    The calculation of the forward-backward algorithm is as follows. For the input sequence x and the label sequence l with time length T, the extended label sequence is l', with length |l'| = 2|l| + 1. The forward probability of outputting the extended label at the s-th position at time t is defined as α(t, s), and the posterior probability calculation formula of the label sequence is shown in Eq. (5) [12].

    Before calculating the forward probability, the parameters must be initialized; abbreviating the Blank label as b, the calculation formula is Eq. (6).

    The recursive calculation formula of the forward probability is shown in Eq. (7).

    The backward algorithm is similar to the forward algorithm. The backward probability of outputting the extended label at the s-th position at time t is defined as β(t, s), and the posterior probability calculation formula of the label sequence is shown in Eq. (8).

    Before calculating the backward probability β(t, s), we initialize the parameters as shown in Eq. (9).

    The recursive calculation formula of the backward probability is shown in Eq. (10).

    For any time t, the posterior probability of the label sequence can be calculated from the forward and backward probabilities, and the calculation formula is shown in Eq. (11).
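    As with Eqs. (1)-(4), the display equations are missing from the extracted text. A reconstruction of the standard forward-backward initializations and recursions that Eqs. (5)-(11) describe (following the usual CTC formulation; b denotes the Blank label and y_k^t the network output for label k at time t) is:

        % Standard CTC forward-backward recursions; a reconstruction of Eqs. (5)-(11).
        \begin{align}
        P(l \mid x) &= \alpha(T, |l'|) + \alpha(T, |l'|-1) \tag{5}\\
        \alpha(1,1) &= y_b^1, \quad \alpha(1,2) = y_{l'_2}^1, \quad \alpha(1,s) = 0 \ \ \forall s > 2 \tag{6}\\
        \alpha(t,s) &=
        \begin{cases}
        \bigl(\alpha(t-1,s) + \alpha(t-1,s-1)\bigr)\, y_{l'_s}^t, & \text{if } l'_s = b \text{ or } l'_s = l'_{s-2}\\[2pt]
        \bigl(\alpha(t-1,s) + \alpha(t-1,s-1) + \alpha(t-1,s-2)\bigr)\, y_{l'_s}^t, & \text{otherwise}
        \end{cases} \tag{7}\\
        P(l \mid x) &= \beta(1,1) + \beta(1,2) \tag{8}\\
        \beta(T,|l'|) &= y_b^T, \quad \beta(T,|l'|-1) = y_{l'_{|l'|-1}}^T, \quad \beta(T,s) = 0 \ \ \forall s < |l'|-1 \tag{9}\\
        \beta(t,s) &=
        \begin{cases}
        \bigl(\beta(t+1,s) + \beta(t+1,s+1)\bigr)\, y_{l'_s}^t, & \text{if } l'_s = b \text{ or } l'_s = l'_{s+2}\\[2pt]
        \bigl(\beta(t+1,s) + \beta(t+1,s+1) + \beta(t+1,s+2)\bigr)\, y_{l'_s}^t, & \text{otherwise}
        \end{cases} \tag{10}\\
        P(l \mid x) &= \sum_{s=1}^{|l'|} \frac{\alpha(t,s)\,\beta(t,s)}{y_{l'_s}^t} \qquad \text{for any } t \tag{11}
        \end{align}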

    With the posterior probability P(l|x) of the label sequence, the training target can be optimized and the parameters updated. The loss function of CTC is defined as the negative log probability of the label sequences on the training set S. The loss function L(x) for each sample is given by Eq. (12).

    The loss function L_S of the entire training set is given by Eq. (13).

    The derivative of the loss function L with respect to the network output is given by Eq. (14).

    The chain rule yields the partial derivative of the loss function with respect to the unnormalized network output before the SoftMax layer. Because a character k may appear multiple times in a label sequence, a set is defined to represent the positions where k appears: lab(l, k) = {s : l'_s = k}. The result is Eq. (15).
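    A reconstruction of the standard CTC loss and gradient expressions that Eqs. (12)-(15) describe (u_k^t denotes the unnormalized network output before SoftMax; again not the paper's verbatim equations):

        % Standard CTC loss and gradients; a reconstruction of Eqs. (12)-(15).
        \begin{align}
        L(x) &= -\ln P(l \mid x) \tag{12}\\
        L_S &= \sum_{(x,\,l) \in S} L(x) = -\sum_{(x,\,l) \in S} \ln P(l \mid x) \tag{13}\\
        \frac{\partial L(x)}{\partial y_k^t} &= -\frac{1}{P(l \mid x)}\,
            \frac{\partial P(l \mid x)}{\partial y_k^t} \tag{14}\\
        \frac{\partial L(x)}{\partial u_k^t} &= y_k^t - \frac{1}{P(l \mid x)\, y_k^t}
            \sum_{s \in \mathrm{lab}(l,k)} \alpha(t,s)\,\beta(t,s) \tag{15}
        \end{align}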

    The parameters of the neural network part are updated layer by layer and frame by frame according to the back-propagation algorithm. When CTC decodes the output, the output sequence must be optimized to obtain the final label sequence. This study adopts the best-path decoding algorithm, assuming that the maximum-probability path π* and the maximum-probability label sequence l* are in one-to-one correspondence, meaning that the many-to-one mapping B degenerates into a one-to-one mapping. The algorithm takes the label sequence corresponding to the output sequence formed by the maximum-probability label of each frame as the final recognition result. First, we calculate the maximum-probability path π* output by the network, and the operation formula is shown in Eq. (16) [13].

    Then, we calculate the label sequence output by the network and define l* = B(π*). The formula for l* is given by Eq. (17).

    The recognition result of the final acoustic model is given by Eq. (17). In essence, a CTC acoustic model could directly output Chinese characters end to end. Due to the limitation of the training corpus and the complexity of the model, the output of the acoustic model in this study is pinyin; the final result of speech recognition is obtained by feeding the pinyin sequence into the language model [14-16].
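    To make Eqs. (16) and (17) concrete, the following is a minimal Python sketch of best-path decoding (not the paper's code): the arg-max label is taken per frame, consecutive repeats are collapsed, and Blank labels are removed.

        # Best-path (greedy) CTC decoding over a per-frame posterior matrix.
        import numpy as np

        def ctc_best_path_decode(posteriors: np.ndarray, blank: int = 0) -> list:
            """posteriors: (time_steps, num_labels) matrix of per-frame label probabilities."""
            best_path = posteriors.argmax(axis=1)          # pi* = arg-max label per frame
            decoded, prev = [], None
            for label in best_path:
                if label != prev and label != blank:       # B(pi*): merge repeats, drop blanks
                    decoded.append(int(label))
                prev = label
            return decoded

        # Tiny example: 6 frames, 3 labels (0 = blank); expected output [1, 2]
        probs = np.array([[0.6, 0.3, 0.1],
                          [0.1, 0.8, 0.1],
                          [0.2, 0.7, 0.1],
                          [0.7, 0.2, 0.1],
                          [0.1, 0.1, 0.8],
                          [0.8, 0.1, 0.1]])
        print(ctc_best_path_decode(probs))   # [1, 2]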

    2.2.3 Construction and Training of Baseline Acoustic Model

    In a convolutional neural network, the structure of the convolutional and pooling layers means that input features with slight deformation and displacement can still be accurately recognized. This translation-invariance property is beneficial to the recognition of spectrogram features. The parallel-computing training mode of the convolutional neural network effectively shortens the training time and exploits the powerful parallel processing capability of the graphics processing unit (GPU). CTC handles the optimization of the loss function of the neural network and the optimization of the output sequence. Therefore, this study proposes a CTC-CNN acoustic model based on a CNN combined with the CTC algorithm. The overall structure of the CTC-CNN acoustic model is shown in Fig. 6 [17,18].

    Figure 6:Configuration of CTC-CNN acoustic model

    3 System Architecture

    3.1 System Design

    Fig. 7 portrays the architectural diagram of the hardware employed in this experiment. It comprises the ReSpeaker Mic Array v2.0.1 and a display screen. The ReSpeaker Mic Array v2.0.1 is used to record voice data, and the recorded voice signals are compared with the voice database. The algorithm processes them, and the calculated results are shown on the display screen as the words, sentences, or phrases obtained after the speech is converted into text.

    Fig. 8 shows the overall structure and flow chart of the speech recognition assistance system for language-impaired individuals. The ReSpeaker Mic Array v2.0.1 records the speech signals of individuals with language impairments, and a Python program extracts the recorded original voice recording files. The algorithms then extract voice features from the raw data and yield feature vectors, which are processed by the speech recognition system, including acoustic comparison and language decoding. The features are repeatedly compared and decoded in acoustic comparison and language decoding until the calculated result is very close to, or matches, the speaker's original intention, i.e., it yields the intended output. The result is presented as text on the display screen.

    Figure 8:System structure diagram

    The upper layer of acoustic comparison and language decoding is mainly divided into three parts, namely the acoustic model, the pronunciation dictionary, and the language model. The acoustic model is trained and adjusted using the speech corpus, enabling cross-comparison with the speaker's pronunciation, words, and expressions to improve the accuracy of recognition. The language model is generated in the same manner as the acoustic model: it is trained and adjusted through a text corpus to establish common words or sentences, and even multiple languages.

    3.2 ReSpeaker Mic Array v2.0.1

    The radio hardware component used in this experiment is the ReSpeaker Mic Array v2.0.1 by Seeed Studio. It is an upgrade of the original ReSpeaker microphone array v1.0. The upgraded version is based on XMOS's XVF-3000, a chip with significantly higher performance than the previously used XVSM-2000. A comparison of the XVF-3000 and XVSM-2000 specifications is shown in Table 1.

    Table 1:Comparison of XVF-3000 and XVSM-2000 specifications

    The microphones in this version have also been improved, with the number reduced to four compared to the seven in the first generation, and a significant increase in performance. It can be used on many occasions, such as smart speakers, smart voice assistant systems, voice conference systems, car voice assistants, etc. Compared with the XVSM-2000, this new chipset adds speech recognition algorithms to improve its performance. The following capabilities were added:

    1. Pick Up Voices From Far Away

    • Far-field voice capture enables the device to capture and understand requests from up to 5 m away

    2. Focus On The Right Voice

    • DoA (direction of arrival) allows the device to know the direction of a sound source

    • BF (beamforming) allows the device to focus only on sounds that come from the target direction

    • Background noise and chatter are ignored through NS (noise suppression)

    3. Improved Voice Audio Quality

    • De-reverberation reduces environmental voice echo

    • AEC (acoustic echo cancellation) removes the current audio output

    The ReSpeaker Mic Array v2.0.1 module has numerous voice algorithms and features, and its maximum sampling rate is 16 kHz. This small chip offers numerous functions, as the module is equipped with XMOS's XVF-3000 IC, which integrates advanced digital signal processing (DSP) algorithms, including acoustic echo cancellation (AEC), beamforming, demixing, noise suppression, and gain control. It further supports USB Audio Class 1.0 (UAC 1.0) and has twelve programmable RGB LED indicators for user freedom. The detailed specifications are shown in Table 1. Fig. 9 shows the ReSpeaker Mic Array v2.0.1 system diagram [19].

    Figure 9:ReSpeaker Mic Array v2.0.1 system graph

    3.3 System Technology Description

    The study proposes an end-to-end speech enhancement architecture that (1) models the original time-domain speech waveform, bypassing the phase processing operation of the traditional time-frequency conversion and avoiding phase pollution; (2) transforms the one-dimensional time-domain speech signal by mapping it to a two-dimensional representation, so that more sufficient information is mined from the high-dimensional representation of the speech signal, after which the codec network is used to learn the mapping from noisy to clean speech, followed by dimensionality reduction and reconstruction into a time-domain waveform signal; and (3) combines the evaluation indices with the loss function, using the commonality and differences between different evaluation indices to improve the perceptual ability of the model and obtain clearer speech.

    The end-to-end model framework is built on UNet [20], whose main structure is shown in Fig. 10 [21]. The UNet neural network was initially applied to medical image processing and achieved good results. The main structure of UNet is composed of an encoding stage (the left half of UNet) and a decoding stage (the right half of UNet). Between each corresponding encoding stage and decoding stage, skip connections are used. The skip connections here are not residual: they do not use the residual calculation method, but the splicing (concatenation) method.

    The structure of the model proposed in this study is shown in Fig. 11. The architecture consists of three parts, namely preprocessing of the original audio signal, an encoding and decoding module based on the UNet architecture, and post-processing of the enhanced speech synthesis. By directly modeling the time-domain speech signal, we avoid the defects and problems of the time-frequency transformation, and convert the one-dimensional signal into a two-dimensional signal through the convolution operation, so that the neural network can mine the speech signal in a high-dimensional space and obtain a deep representation. To reduce the number of parameters and the complexity of the model, the up-sampling operation in the decoding part of UNet here is not deconvolution, but bilinear interpolation [22], as sketched below.
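    A minimal sketch of one such decoder step (assuming Keras/TensorFlow 2.x; shapes and filter counts are placeholders, not the paper's configuration):

        # Sketch of a UNet-style decoder step that upsamples with bilinear interpolation
        # (parameter-free) instead of transposed convolution, then splices the skip connection.
        import tensorflow as tf
        from tensorflow.keras import layers

        def decoder_block(x, skip, filters):
            x = layers.UpSampling2D(size=(2, 2), interpolation="bilinear")(x)
            x = layers.Concatenate()([x, skip])            # splice (concatenate), not residual add
            x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
            x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
            return x

        # Example wiring with placeholder shapes
        enc = layers.Input(shape=(32, 32, 64))    # deeper encoder feature map
        skip = layers.Input(shape=(64, 64, 32))   # matching skip feature map
        out = decoder_block(enc, skip, filters=32)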

    Figure 10:End-to-end model framework UNet

    Figure 11:End-to-end speech enhancement framework

    4 Analyses of Experimental Results

    4.1 Basics Experimental Results

    Experimental results are presented in Figs. 12a and 12b as screenshots of the web graphical user interface (GUI). Fig. 12a shows the speaker saying "The weather is so nice today", and the system successfully displays the speaker's complete sentence. Fig. 12b shows the speaker saying "Good morning" twice in a row, but the recognition succeeds only once; the second attempt produces homophones. First, a voice recording is made on the ReSpeaker Mic Array v2.0.1. Subsequently, the algorithm rapidly performs voice recognition and displays the speaker's incomplete or intermittent sentences on the screen, helping the language-impaired person communicate smoothly and quickly with others [23].

    Figure 12:Recognizing experiment

    4.2 Experimental Data Analysis

    To train and verify the CTC-CNN baseline acoustic model, the THCHS-30 and ST-CMDS speech datasets were used as training data and divided into training and test sets. The training results are shown in Figs. 13a and 13b: after 54 epochs of training, the word error rate of the acoustic model on the training set is about 31%, and the word error rate on the test set is stable at about 43%, showing a certain degree of overfitting. A 43% word error rate is difficult to put into practical application, so it is necessary to optimize and adjust the network structure and parameters to further improve the accuracy of the acoustic model.

    Figure 13:(a)Loss variation of baseline acoustic model(b)word error rate change of baseline acoustic model

    Table 2 lists the recognition accuracies for utterances of different numbers of consecutive words. The results are obtained from 100 test samples. When the speaker speaks only one word, the recognition accuracy is the highest, reaching 98.11%. In contrast, when the speaker utters sentences with more than five words, the recognition accuracy falls to 93.77%. For sentences with more than five characters, the system may misrecognize words because of the speaker's pauses or because the pronunciations of words are too similar; for example, "recognize speech" and "wreck a nice beach" have similar pronunciations in English, as do "factors" and "Sonic" in Chinese. In addition to the above situations, environmental noise may also reduce recognition accuracy. Fig. 14 shows the prediction trend of recognition accuracy for various word counts.

    Table 2:Recognition accuracy of each character number

    Figure 14:Prediction trend graph of various word count recognition accuracy rates

    Therefore, this experiment also considers the surrounding ambient noise, conducting 80 tests at each decibel level, as shown in Table 3. Taking noise of 80-90 dB as an example, at this noise level the accuracy rate is 88.18%, the poorest performance among all levels. In contrast, at 40-60 dB, owing to less noise pollution, the accuracy rate is as high as 97.33%. Fig. 15 shows the prediction trend of the effect of environmental noise on recognition accuracy.

    Table 3:Environmental noise affects the recognition accuracy

    Because of the similarities in Chinese pronunciation, the recognition error rate of the system is expected to increase significantly. To this end, we designed this experiment based on the characteristics of Chinese consonants and vowels to verify their time-frequency maps. There are 21 consonants and 16 vowels in the Chinese phonetic alphabet. Vowels are formed mainly by changes in mouth shape, while consonants are formed by controlling the airflow through certain parts of the oral or nasal cavity. Therefore, the energy of consonants is small, their frequency is high, their duration is short, and most of them appear before vowels. Conversely, vowels have higher energy, lower frequency, and longer duration, and usually appear after consonants or independently. The energy and frequency differences of vowels can be verified through time-frequency graph experiments, and these differences can be used to perform simple vowel identification [24].

    Figure 15:Prediction trend of environmental noise impact recognition accuracy

    4.3 Tuning and Optimization of Acoustic Models

    The models are trained through continuous design and improvement of the relevant parameters of the acoustic model, and finally the model with excellent performance is selected according to the evaluation index. The baseline acoustic model in this study faces challenges such as long training time, a high error rate, and a certain degree of overfitting. Common optimization strategies for neural networks include dropout, normalization, and residual modules. Dropout was first proposed by Srivastava et al. and effectively alleviates the problem of overfitting. Batch normalization was first proposed by Sergey Ioffe and Christian Szegedy and can speed up model convergence and alleviate the overfitting problem to a certain extent. The residual module was proposed by Kaiming He et al. [25] and solves the problem of gradient vanishing caused by deepening the network layers.

    The input features of a neural network generally follow the standard normal distribution, and shallow models generally perform well under this assumption. However, as the depth of the network increases, the nonlinear layers of the network make the outputs interdependent, so they no longer follow a standard normal distribution; the problem of output center offset occurs, which makes training the network model difficult, and training deep models is particularly difficult. To solve the convergence problem, a normalization operation is added to the intermediate layers, i.e., the output of each layer is normalized to conform to the standard normal distribution. Through this processing, the inputs of each layer conform to the standard normal distribution, the network can be trained well, and convergence is accelerated. Because the data processed by a convolutional neural network is a four-dimensional tensor, there are several normalization methods: layer normalization (LN), instance normalization (IN), group normalization (GN), batch normalization (BN), etc. [26].

    Fig. 16 illustrates schematic diagrams of the normalizations for comparison. Taking a piece of voice data as an example, as the voice frequency range is roughly 250-3400 Hz and the high-frequency part is 2500-3400 Hz, four intrinsic mode function (IMF) component frequency diagrams are decomposed by the normalized comparison method, as shown in Figs. 17a-17d. From the density of the normalized amplitude values of each IMF component, the high-frequency region of speech is mainly concentrated in the first IMF component. Figs. 17a-17d indicate that the high-frequency region of the speech signal can be effectively extracted by empirical mode decomposition (EMD). However, traditional extraction algorithms are not suitable for extracting feature parameters in the high-frequency region, and a dedicated high-frequency feature parameter extraction algorithm must be sought [27].

    Figure 16:Contrast charts of applied normalization

    Figure 17:Contrast chart of normalizations based on IMF

    Fig. 18 shows a schematic diagram of the residual module, which transmits the original input information to the output layer through a new channel opened alongside the network. The residual module directly transfers the input of an earlier layer to a later layer by adding an identity mapping. The principle by which dropout suppresses overfitting is to temporarily set some neurons to zero during network training and ignore them during parameter optimization, so that the network structure is different in each training iteration, preventing the network from relying on a single feature for classification and prediction. Dropout, which effectively trains multiple neural networks and averages the results of the whole ensemble instead of training a single network, increases the sparsity of the network model and improves its generalization. A sketch of a block combining these ideas is given below.
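    A minimal Keras sketch of a convolutional block combining batch normalization, a residual (identity) connection, and dropout (filter sizes and dropout rate are illustrative, not the CTCCNN_5+BN+Residual configuration itself):

        # Residual convolutional block with BN and dropout; a sketch, not the paper's layers.
        import tensorflow as tf
        from tensorflow.keras import layers

        def residual_bn_block(x, filters, drop_rate=0.2):
            shortcut = x                                            # identity mapping
            y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
            y = layers.BatchNormalization()(y)                      # normalize layer outputs
            y = layers.Activation("relu")(y)
            y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
            y = layers.BatchNormalization()(y)
            if shortcut.shape[-1] != filters:                       # match channel count if needed
                shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
            y = layers.Add()([y, shortcut])                         # residual addition
            y = layers.Activation("relu")(y)
            return layers.Dropout(drop_rate)(y)                     # randomly zero units in training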

    Figs.19a and 19b show the training comparison of the baseline acoustic model and improved acoustic model,respectively[28].

    Fig. 19b shows that, by increasing the depth of the network model, the improved acoustic model reduces the WER on the test set by 3.5% compared to the baseline model. Although the error rate drops, the effect is still unsatisfactory. Fig. 19a shows that the improved acoustic model still faces the overfitting problem. Therefore, further optimization of this improved acoustic model is required. In the improved acoustic model, the number of network layers has reached 25. If the network continues to be deepened, the training time becomes too long, which likewise affects decoding performance. To solve the overfitting problem, dropout and batch normalization (BN) layers are employed in the network model. The network model structure is shown in Fig. 20 [29].

    Figure 18:Schematic graph of residual module

    Figure 19:(a)Loss contrast of acoustic model;(b)contrast of WER of the acoustic model

    Figs.21a and 21b show the training comparison diagrams of the Dropout and BN acoustic models.

    Fig. 21a shows that both the dropout and the BN acoustic models play a role in suppressing overfitting. However, as indicated in Fig. 21b, the error rate of the acoustic model using dropout does not drop but rises instead, showing the opposite effect. The acoustic model using BN effectively reduces the error rate and at the same time accelerates convergence, so the training speed of the model is increased. The error rate of the BN acoustic model drops to 23.67%, an 8% improvement over the baseline acoustic model. Considering the gradient vanishing problem that a deep convolutional neural network may face, the residual module is added on top of the BN acoustic model, which is expected to further reduce the error rate. Fig. 22 shows the acoustic model with the added residual module.

    Fig. 23a shows that the residual-plus-BN acoustic model has the fastest convergence among all models, i.e., the residual module effectively alleviates the problem of gradient vanishing and speeds up model training. As observed in Fig. 23b, the error rate of the model on the test set is reduced to 12.45%, an improvement of about 17% over the initial baseline acoustic model. An error rate of 13.52% is already an excellent result at the current scale of the dataset.

    Figure 20:Contrast of structures between dropout and BN acoustic models

    Figure 21:(a)Contrast of loss between dropout and BN acoustic models;(b)contrast of WER between dropout and BN acoustic models

    According to all the above experiments, the results show that every data item in the experiments performs very well, and through the feedback of the experimental data, the experimental methods and procedures were continuously revised; finally, the audio2text module achieves very good performance.

    Figure 22:Acoustic model of residual plus BN

    Figure 23:(a) Training loss graph of residual plus BN acoustic model;(b) WER change of residual plus BN acoustic model

    5 Conclusions and Future Directions

    In the speech recognition system, the acoustic model is an important underlying model whose accuracy directly affects the performance of the entire system. This paper introduces the construction and training process of the acoustic model in detail and studies the CTC algorithm, which plays an important role in the end-to-end framework. We constructed the CTC-CNN baseline acoustic model and, on this basis, carried out optimization, reducing the error rate to about 18% and hence improving the accuracy. Finally, the selection of acoustic feature parameters, the selection of modeling units, the speaker's speech rate, and other aspects were compared and verified, further confirming the excellent performance of the CTCCNN_5+BN+Residual model structure.

    This study briefly introduces the historical development of deep learning and the most widely used deep learning models, and presents the development and current situation of these models in the field of speech recognition. Deep learning research is still in its developmental stage, and the main problems are: (1) training usually must solve a highly nonlinear optimization problem, which easily leads to many local minima during network training; (2) if training takes too long, the results overfit. Thus, the use of deep neural networks to solve the robustness problem is currently the hottest topic in the field of speech recognition. In practical applications, the recognition rate of noisy speech is only about 85%, and there is no stable, efficient, and universal system that achieves a recognition rate of more than 95% for noisy speech. For future research on speech recognition, we believe that the best direction of development is brain-like computing: only by continuously conforming to the characteristics of human speech recognition can the recognition rate be improved to a satisfactory level. However, existing deep learning technology is far from sufficient to meet this requirement. How to better apply deep learning and meet the market demand for efficient speech recognition systems is a problem worthy of continued attention.

    Acknowledgement: This research was supported by the Department of Electrical Engineering at National Chin-Yi University of Technology. The authors would like to thank the National Chin-Yi University of Technology and Takming University of Science and Technology, Taiwan, for supporting this research.

    Funding Statement:The authors received no specific funding for this study.

    Author Contributions: W.-T.S. is responsible for research planning and providing improvement methods. H.-W.K. and S.-J.H. are responsible for thesis writing and experimental verification.

    Availability of Data and Materials:Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

    Conflicts of Interest:The authors declare that they have no conflicts of interest to report regarding the present study.
