
    Progress in Neural NLP: Modeling, Learning, and Reasoning


    Ming Zhou, Nan Duan, Shujie Liu, Heung-Yeung Shum*

    Microsoft Research Asia, Beijing 100080, China

Keywords: Natural language processing; Deep learning; Modeling, learning, and reasoning

A B S T R A C T Natural language processing (NLP) is a subfield of artificial intelligence that focuses on enabling computers to understand and process human languages. In the last five years, we have witnessed the rapid development of NLP in tasks such as machine translation, question-answering, and machine reading comprehension based on deep learning and an enormous volume of annotated and unannotated data. In this paper, we will review the latest progress in the neural network-based NLP framework (neural NLP) from three perspectives: modeling, learning, and reasoning. In the modeling section, we will describe several fundamental neural network-based modeling paradigms, such as word embedding, sentence embedding, and sequence-to-sequence modeling, which are widely used in modern NLP engines. In the learning section, we will introduce widely used learning methods for NLP models, including supervised, semi-supervised, and unsupervised learning; multitask learning; transfer learning; and active learning. We view reasoning as a new and exciting direction for neural NLP, but it has yet to be well addressed. In the reasoning section, we will review reasoning mechanisms, including the knowledge, existing non-neural inference methods, and new neural inference methods. We emphasize the importance of reasoning in this paper because it is important for building interpretable and knowledge-driven neural NLP models to handle complex tasks. At the end of this paper, we will briefly outline our thoughts on the future directions of neural NLP.

    1. Introduction

As an important branch of artificial intelligence (AI), natural language processing (NLP) studies the interactions between humans and computers via natural language. It studies fundamental technologies for the meaning expressions of words, phrases, sentences, and documents, and for syntactic and semantic processing such as word breaking, syntactic parsing, and semantic parsing, and it develops applications such as machine translation (MT), question-answering (QA), information retrieval, dialog, text generation, and recommendation systems. NLP is vital to search engines, customer support systems, business intelligence, and spoken assistants.

The history of NLP dates back to the 1950s. In the beginning of NLP research, rule-based methods were used to build NLP systems, including word/sentence analysis, QA, and MT. Such rules, edited by experts, were utilized in algorithms for various NLP tasks, starting with MT. Normally, designing rules required significant human effort. Furthermore, it is difficult to organize and manage rules when the number of rules is large. In the 1990s, along with the rapid development of the internet, large amounts of data became available, which enabled statistical learning methods to work on NLP tasks. With human-designed features, statistical learning models were learned by using labeled/mined data. The statistical learning method brought significant improvements to many NLP tasks, typically in MT and search engine technology. In 2012, deep learning approaches were introduced to NLP following deep learning's success in object recognition with ImageNet [1] and in speech recognition with Switchboard [2]. Deep learning approaches quickly outperformed statistical learning methods with surprisingly better results. At present, the neural network-based NLP (referred to as "neural NLP" hereafter) framework has achieved new levels of quality and has become the dominant approach for NLP tasks, such as MT, machine reading comprehension (MRC), chatbots, and so forth. For example, the MT system from Microsoft achieved a human parity result on the Chinese-to-English news translation task of the workshop on MT in 2017. R-NET and NLNet from Microsoft Research Asia (MSRA) achieved human-quality results on the Stanford Question Answering Dataset (SQuAD) evaluation task on both the exact match (EM) score and the fuzzy-match score (F1 score). Recently, pre-trained models such as generative pre-training (GPT) [3], bidirectional encoder representations from transformers (BERT) [4], and XLNet [5] have demonstrated strong capabilities in multiple NLP tasks. The neural NLP framework works well for supervised tasks in which there is abundant labeled data for learning neural models, but still performs poorly for low-resource tasks where there is limited or no labeled data.

This paper reviews the notable progress of the neural NLP framework in three categories of efforts: ① neural NLP modeling, aimed at designing appropriate network structures for different tasks; ② neural NLP learning, aimed at optimizing the model parameters; and ③ reasoning, aimed at generating answers to unseen questions by manipulating existing knowledge with inference techniques. Based on deep analysis of the current technologies and the challenges of each of these aspects, we seek to identify and sort out future directions that are critical to advancing NLP technology.

    2. Modeling

An NLP system consumes natural language sentences and generates a class type (for classification tasks), a sequence of labels (for sequence-labeling tasks), or another sentence (for QA, dialog, natural language generation, and MT). To apply neural NLP approaches, it is necessary to solve the following two key issues:

(1) Encode the natural language sentence (a sequence of words) in the neural network.

    (2) Generate a sequence of labels or another natural language sentence.

From these two aspects, in this section, we will introduce several widely used neural NLP models, including word embedding, sentence embedding, and sequence-to-sequence modeling. Word embedding maps words in the input sentences into continuous space vectors. Based on the word embedding, complex networks such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and self-attention networks can be used for feature extraction, considering the context information of the whole sentence to build context-aware word embedding, or integrating all the information of the sentence to construct the sentence embedding. Context-aware word embedding can be used for sequential labeling tasks such as part-of-speech (POS) tagging and named-entity recognition (NER), and sentence embedding can be used for sentence-level tasks, such as sentiment analysis and paraphrase classification. Sentence embedding can also be used as input to another RNN or self-attention network to generate another sequence, which forms the encoder-decoder framework for sequence-to-sequence modeling. Given an input sentence, sequence-to-sequence modeling can be used to generate an answer to a question (i.e., a QA task), or to produce a translation in another language (i.e., an MT task).

    2.1. Word embedding and sentence embedding

Word/sentence embedding attempts to map words and sentences from a discrete space into a semantic space, in which semantically similar words/sentences have similar embedding vectors.

    2.1.1. Context-independent word embedding

To map a word into a continuous semantic vector, Ref. [6] proposed the continuous bag-of-words (CBOW) and skip-gram models, based on which the implementation tool word2vec is used to learn word-embedding vectors with a large monolingual corpus. As shown in Fig. 1 [6], the CBOW model predicts the central word using its surrounding words in a window, while the skip-gram model predicts the surrounding words of the given word. These two models are designed based on the principle of "knowing a word by the company it keeps" [7]. In addition, to utilize the benefits of global co-occurrence statistics and meaningful linear substructures, Ref. [8] proposed a global log-bilinear regression model (GloVe) to learn word embedding.
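As a concrete illustration, the sketch below trains skip-gram embeddings on a toy corpus with the gensim library; the corpus, hyper-parameters, and library choice are illustrative assumptions rather than the exact setup used in Ref. [6].

```python
# Minimal sketch: learning skip-gram word embeddings with gensim (assumed library choice).
from gensim.models import Word2Vec

# Toy tokenized corpus; in practice a large monolingual corpus is used.
corpus = [
    ["an", "ant", "went", "to", "the", "river", "bank"],
    ["that", "is", "a", "good", "way", "to", "build", "up", "a", "bank", "account"],
]

# sg=1 selects the skip-gram objective (sg=0 would select CBOW).
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1)

vec = model.wv["bank"]                   # a single, context-independent vector per word
similar = model.wv.most_similar("bank")  # nearest neighbors in the embedding space
```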

Word2vec and GloVe learn a constant embedding vector for a word; the embedding is the same for a word in different sentences. For example, we can learn an embedding vector for the word "bank," and the embedding remains the same, regardless of whether the word "bank" is used in the sentence "an ant went to the river bank" or in the sentence "that is a good way to build up a bank account." Ostensibly, the embedding of the word "bank" in the first sentence should be different from that of "bank" in the second sentence. To deal with this issue, the context information of the sentence is used to predict a dynamic word embedding.

    2.1.2. RNN-based context-aware word embedding

ELMo [9] leverages the bidirectional recurrent neural network (the long short-term memory (LSTM) network is particularly used) to model the context information, in which the word embedding is the concatenation of the hidden states of a forward RNN and a backward one, modeling the context on the left side and the right side, respectively. For example, given the input sentence "an ant went to the river bank." as shown in Fig. 2, the forward RNN first takes the first word "an" as the input and generates the first hidden state, which contains the information of the first word. When the second word "ant" is inputted, the RNN combines the information of the first hidden state and the second word to generate the second hidden state, which should contain the information of the first two words. When the word "bank" is inputted, the previous hidden state should contain all the previous information of "an ant went to the river." Taking it as context information, the new hidden state of "bank" contains dynamic information from the given sentence.
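A minimal PyTorch sketch of this idea is shown below: a forward and a backward LSTM read the sentence, and their hidden states are concatenated to form context-aware word embeddings. The dimensions and single-layer setup are simplifying assumptions; ELMo itself uses multi-layer LSTMs trained with a language-model objective.

```python
# Sketch (assumed setup): context-aware embeddings from a bidirectional LSTM, in the spirit of ELMo.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 128, 256
embedding = nn.Embedding(vocab_size, emb_dim)
# bidirectional=True runs a forward and a backward LSTM over the sentence.
bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

token_ids = torch.randint(0, vocab_size, (1, 7))   # e.g., "an ant went to the river bank"
static_emb = embedding(token_ids)                  # context-independent embeddings
contextual_emb, _ = bilstm(static_emb)             # (1, 7, 2*hidden_dim): forward/backward states concatenated

bank_embedding = contextual_emb[0, -1]             # dynamic embedding of "bank", informed by both contexts
```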

    2.1.3. Self-attention-based context-aware word embedding

Ref. [3] proposed GPT, which leverages the self-attention network to train a multi-layer left-to-right language model. Compared with the RNN used in ELMo, which is also a left-to-right language model, the self-attention network used in GPT allows direct interaction between the current word and each previous word, which leads to better context representations. Ref. [4] proposed BERT, which leverages the self-attention network to jointly consider both the left and right context information in the sentence. Whereas the RNN processes the input sentence in a sequential order, from left to right or from right to left, as shown in Fig. 3, the self-attention network takes all the remaining words of the word "bank" as the context to build the context-aware word embedding, which is a weighted sum of the representations of all the words in the sentence; the weights are calculated by computing and normalizing the similarity between the current word "bank" and every word in the sentence. To consider the ordering information, a position index is also used to enrich the input by summing the word embedding and position embedding. To alleviate BERT's pretrain-fine-tune discrepancy issue, which means that artificial symbols such as [MASK] used by BERT during pre-training are absent from real data at fine-tuning time, Ref. [5] proposed XLNet, a pre-training method that enables the learning of bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order.
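The sketch below illustrates the core computation with scaled dot-product self-attention plus learned position embeddings; it is a simplified single-head version, whereas BERT and GPT stack many multi-head layers, and all dimensions are assumptions.

```python
# Sketch (assumed simplification): single-head self-attention with position embeddings.
import math
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 1000, 64, 128
tok_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(max_len, d_model)
W_q, W_k, W_v = (nn.Linear(d_model, d_model) for _ in range(3))

token_ids = torch.randint(0, vocab_size, (1, 7))
positions = torch.arange(7).unsqueeze(0)
x = tok_emb(token_ids) + pos_emb(positions)              # sum of word and position embeddings

q, k, v = W_q(x), W_k(x), W_v(x)
scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)    # similarity between every pair of words
weights = scores.softmax(dim=-1)                         # normalized attention weights
contextual_emb = weights @ v                             # each word: weighted sum over all words
```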

    2.1.4. CNN-based context-aware word embedding

Fig. 1. Context-independent word-embedding methods [6]. CBOW: using the context words in a window to predict the central word. Skip-gram: using the central word to predict the context words in a window. W_t is the tth word in the sentence.

    Fig. 2. RNN-based context-aware word embedding.

Both ELMo and BERT can consider all the context information in the input sentence to generate the dynamic embedding for a given word. In contrast to using all the words as context, a CNN can be used to generate the dynamic embedding with only the surrounding words as context [10]. As shown in Fig. 4, the CNN uses a window to slide along the sequence of the input sentence. Using the embeddings of the words in the window, linear mappings (i.e., filters) are used to generate a representation vector that integrates the information of the input words. For example, to generate the dynamic embedding for the word "bank," a window with size 3 can be used to cover the span of "river bank," and the word "river" can be used to generate a disambiguating dynamic embedding for the word "bank."
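A minimal PyTorch sketch of this windowed convolution is given below; the filter size of 3 matches the example, while the other dimensions are arbitrary assumptions.

```python
# Sketch (assumed dimensions): CNN-based context-aware word embeddings with a window of size 3.
import torch
import torch.nn as nn

emb_dim, num_filters, seq_len = 128, 256, 7
embeddings = torch.randn(1, seq_len, emb_dim)       # word embeddings of the input sentence

# Conv1d expects (batch, channels, length), so the embedding dimension becomes the channel dimension.
conv = nn.Conv1d(in_channels=emb_dim, out_channels=num_filters, kernel_size=3, padding=1)
contextual = conv(embeddings.transpose(1, 2)).transpose(1, 2)   # (1, seq_len, num_filters)

# The representation at position i now mixes words i-1, i, i+1, e.g., "river" disambiguates "bank".
bank_embedding = contextual[0, -1]
```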

    2.1.5. Sentence embedding

Based on the representation of each word in the input sentence, the sentence representation can be obtained via an RNN, a self-attention network, or a CNN. For the RNN, the last hidden state (the blue one) should contain all the information in the sentence by consuming the input words one by one. For the self-attention network, a sentence-ending symbol, <EOS>, can be added, and its hidden state (the blue one) can be used as the representation of the whole sentence. For the CNN, a max pooling layer can be used to select the maximum value for each dimension and generate one semantic vector (with the same size as the convolution layer output) to summarize the whole sentence, which is processed by a feed-forward network (FFN) to generate the final sentence representation. The generated sentence embedding can be used in other tasks, such as predicting the sentiment class (i.e., positive or negative) or predicting another sequence (MT).
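A short sketch contrasting these pooling strategies is given below; it assumes the context-aware word states have already been computed, and all shapes are illustrative.

```python
# Sketch: collapsing context-aware word vectors into a single sentence embedding.
import torch
import torch.nn as nn

seq_len, hidden_dim = 8, 256
word_states = torch.randn(1, seq_len, hidden_dim)   # encoder outputs for each word (plus <EOS>)

# RNN / self-attention style: take the state of the last position (here assumed to be the <EOS> symbol).
sentence_last = word_states[:, -1, :]

# CNN style: max-pool each dimension over time, then apply a feed-forward network (FFN).
ffn = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim))
sentence_pooled = ffn(word_states.max(dim=1).values)

# Either vector can feed a classifier (e.g., sentiment) or a decoder (e.g., MT).
```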

    2.2. Sequence-to-sequence modeling

    2.2.1. Task of sequence-to-sequence modeling

Sequence-to-sequence modeling attempts to generate one sequence with another sequence as input. Many NLP tasks can be formulated as a sequence-to-sequence task, such as MT (i.e., given the source language word sequence, generate the target language word sequence), QA (i.e., given the word sequence of a question, generate the word sequence of an answer), and dialog (i.e., given the word sequence of user input, generate the word sequence of the response).

    2.2.2. Encoder-decoder framework

Fig. 3. Self-attention-based context-aware word embedding. <EOS>: sentence-ending symbol.

    Fig. 4. CNN-based context-aware word embedding.

    Fig. 5. Encoder-decoder framework for MT from English to Chinese.

Ref. [11] proposed an encoder-decoder framework for sequence-to-sequence modeling. As shown in Fig. 5, the encoder-decoder framework contains two parts: an encoder and a decoder. The encoder is an RNN that encodes the input sentence into a semantic representation by consuming the words from left to right, one by one. The final hidden state should contain the information of all the words in the sentence, and is used as the context vector (i.e., representation) of the input sentence. Based on the context vector, another decoder RNN is used to generate the target sequence one word after another until the sentence-ending symbol (<EOS>) is generated. The decoder RNN takes the previous word, previous hidden state, and source sentence context vector as input to generate the current hidden state in order to predict the next target word.
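The sketch below implements a bare-bones GRU encoder-decoder of this kind in PyTorch; the vocabulary sizes and dimensions are assumptions, and teacher-forced decoding is used here for simplicity rather than step-by-step generation.

```python
# Sketch (assumed dimensions): a minimal GRU encoder-decoder for sequence-to-sequence generation.
import torch
import torch.nn as nn

src_vocab, tgt_vocab, emb_dim, hidden_dim = 1000, 1200, 128, 256

src_emb = nn.Embedding(src_vocab, emb_dim)
tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
output_proj = nn.Linear(hidden_dim, tgt_vocab)       # softmax layer over the target vocabulary

src_ids = torch.randint(0, src_vocab, (1, 9))
_, context = encoder(src_emb(src_ids))               # final hidden state = context vector of the source

# Teacher-forced decoding: feed the gold previous words, initialize the decoder with the context vector.
tgt_ids = torch.randint(0, tgt_vocab, (1, 7))
dec_states, _ = decoder(tgt_emb(tgt_ids), context)
logits = output_proj(dec_states)                     # per-step scores used to predict the next target word
```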

The original encoder-decoder framework has several drawbacks: ① Only the last hidden state is used to model the source sentence; with such a fixed-size vector, it is difficult to model any sentence in the source language. ② The information of the earlier words consumed by the encoder is difficult for the RNN cell to maintain, so it has limited influence on the target words. ③ It is difficult to use only one context vector to predict all the words in the target sentence.

    2.2.3. Attention-based encoder-decoder framework

In order to deal with these problems, the attention-based encoder-decoder framework [12] makes it possible for the neural network to pay attention to different parts of the input and to directly align the input sequence with the output result. As shown in Fig. 6, the attention mechanism leverages all the hidden states of the encoder and the previous decoder hidden state to compute a context vector, which is a weighted sum of the encoder hidden states; the weights are computed by normalizing the similarity between the hidden states of the decoder and encoder. Together with the previous decoder hidden state and the previous target word, this context vector is used as the input to the decoder RNN to generate the next decoder hidden state. In this way, not only are all the encoder hidden states leveraged, but the decoder hidden state is also directly related to the corresponding encoder hidden states.
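A compact sketch of this attention step is given below: the previous decoder state is compared with every encoder state, the similarities are normalized into weights, and the context vector is their weighted sum. The dot-product scoring is an assumed simplification; Ref. [12] uses a small additive network for the similarity.

```python
# Sketch: computing an attention context vector from encoder states and the previous decoder state.
import torch

src_len, hidden_dim = 9, 256
encoder_states = torch.randn(1, src_len, hidden_dim)   # one hidden state per source word
prev_dec_state = torch.randn(1, hidden_dim)            # previous decoder hidden state

# Similarity between the decoder state and every encoder state (dot product, an assumed choice).
scores = torch.bmm(encoder_states, prev_dec_state.unsqueeze(-1)).squeeze(-1)   # (1, src_len)
weights = scores.softmax(dim=-1)                                               # normalized attention weights

# Weighted sum of encoder states: the context vector fed to the decoder at this step.
context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)           # (1, hidden_dim)
```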

    2.2.4. All-attention-based encoder-decoder framework

To use the strong modeling capacity of self-attention, the Transformer [13] (as shown in Fig. 7) uses multi-head self-attention to replace the original attention mechanism and the RNN cells in the encoder and decoder. Multi-head attention is a combination of attention networks: By projecting each query, key, and value into N vectors with linear layers, N attention networks are used to generate N context vectors, which are concatenated into one context vector. For the decoder self-attention, only the previous hidden states are used to compute the decoder context vector, because future words cannot be used during inference.
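The sketch below shows masked multi-head self-attention for the decoder side using PyTorch's built-in module, with a causal mask so that each position attends only to previous positions; the dimensions and mask construction are illustrative assumptions.

```python
# Sketch: masked multi-head self-attention for a Transformer decoder (assumed dimensions).
import torch
import torch.nn as nn

d_model, num_heads, tgt_len = 512, 8, 7
self_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

decoder_input = torch.randn(1, tgt_len, d_model)

# Causal mask: position i may attend only to positions <= i (future words are hidden at inference).
causal_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)

# Query, key, and value are all projections of the same sequence (self-attention);
# each of the num_heads heads produces a context vector, and the results are concatenated.
context, attn_weights = self_attn(decoder_input, decoder_input, decoder_input, attn_mask=causal_mask)
```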

    Fig. 6. Attention-based encoder-decoder framework for MT from English to Chinese.

    Fig. 7. All-attention-based encoder-decoder framework for MT from English to Chinese.

    2.3. Summary

In this section, we introduced the network structures to learn word embedding, sentence embedding, and sequence generation. In order to improve modeling for various NLP tasks, several directions still require further exploration:

· Prior knowledge modeling. Even though word-embedding methods that are trained with a huge amount of data can model certain kinds of commonsense knowledge [6], the problem of how to integrate linguistic prior knowledge, such as WordNet and HowNet, into specific NLP tasks should receive more attention [14].

· Document/multi-turn modeling. Leveraging sentence context by means of word embedding effectively improves the performance of various tasks, but the problem of how to model long-distance context, such as other sentences in the same document (or even in another document), is still ongoing research [15]. For example, given the English sentence "the mouse is on the table," it is impossible to tell whether the word "mouse" should be translated into the Chinese "鼠标" (a computer device) or the Chinese "老鼠" (an animal) based only on the information in the given sentence. To perform semantic disambiguation, the context of the document should be leveraged. Similarly, the problem of how to better model the context information in multi-turn tasks such as chatbots and dialog systems [16] is still challenging.

· Non-autoregressive generation. The current sequence-to-sequence model generates the output sentence one word at a time in an autoregressive way, meaning that the word generated in the previous time-step is used as the input to generate the next word. Such an autoregressive generation process leads to the exposure bias problem, in which a mistake made in previous steps is amplified in subsequent steps. To deal with this problem, non-autoregressive structures have been proposed, but they show a performance drop compared with the autoregressive structure [17]. More attention should be paid to developing better non-autoregressive structures in the future.

    3. Learning

New and efficient training algorithms have been proposed to optimize the large number of parameters in deep learning models. To train the neural network, stochastic gradient descent (SGD) [18] is often used, which is usually based on back-propagation methods [19]. Momentum-based SGD has been proposed in order to introduce momentum to speed up the training process. The AdaGrad [20], AdaDelta [21], Adam [22], and RMSProp methods attempt to use different learning ratios for different parameters, which further improves the efficiency and stabilizes the training process. When the model is very complex, parallel training methods are used to leverage many computing devices, even hundreds or thousands (central processing units, graphics processing units, or field-programmable gate arrays). Depending on whether the parameters are updated synchronously or not, distributed training methods can be grouped into synchronous SGD and asynchronous SGD.

In addition to the progress that has been achieved in general optimization methods, better training methods have been proposed for specific NLP tasks. When large amounts of training data are available for rich-resource tasks, deep learning models achieve very good performance using supervised learning methods. For some specific tasks, such as MT for language pairs with a large volume of parallel data such as English-Chinese, neural models do a good job, sometimes achieving human parity results in the shared tasks. For many NLP tasks, however, it is difficult to acquire large amounts of labeled data. Such tasks are often referred to as low-resource tasks, including MT or sentiment analysis for rare languages. By using unlabeled data to enhance the models trained with a small amount of labeled data, semi-supervised learning methods can be used. Without any labeled data, unsupervised learning methods can be leveraged to learn NLP models. Another way to leverage unlabeled data is to pre-train models, which can then be transferred to specific tasks with transfer learning. Instead of leveraging in-task labeled data, labeled data from other tasks can also be used with the help of multitask learning. If there is no data that can be used, human annotators can be introduced to create training data using active learning, in order to maximize the model performance with a given budget.

    3.1. Supervised learning

Given a training pair consisting of a source sentence X and a target sentence Y = (y_1, y_2, ..., y_T), supervised learning for the sequence-to-sequence model maximizes the likelihood of the target sentence:

P(Y|X) = ∏_{i=1}^{T} p_θ(y_i | y_{i-1}, ..., y_1, X)

where p_θ(y_i | y_{i-1}, ..., y_1, X) is the softmax layer output of the decoder in the sequence-to-sequence model. Based on the likelihood, the log-likelihood loss function is defined as follows:

L(θ) = -∑_{i=1}^{T} log p_θ(y_i | y_{i-1}, ..., y_1, X)
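A short sketch of this token-level negative log-likelihood (cross-entropy) objective is shown below; the logits would come from a decoder such as the one sketched in Section 2.2, and the padding convention is an assumption.

```python
# Sketch: log-likelihood (cross-entropy) loss over decoder outputs, ignoring padded positions.
import torch
import torch.nn as nn

tgt_vocab, tgt_len, pad_id = 1200, 7, 0
logits = torch.randn(1, tgt_len, tgt_vocab, requires_grad=True)   # decoder softmax-layer inputs per step
gold = torch.randint(1, tgt_vocab, (1, tgt_len))                  # gold target words y_1, ..., y_T

# -sum_i log p(y_i | y_<i, X), averaged over non-padding tokens.
criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
loss = criterion(logits.view(-1, tgt_vocab), gold.view(-1))
loss.backward()                                                   # gradients for SGD/Adam/AdaDelta updates
```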

Based on this loss function, training algorithms (such as Adam or AdaDelta) can be used to optimize the parameters. Instead of only maximizing the generation probability of the golden target, in order to consider the task-specific error in the training process, Ref. [23] proposed minimum-risk (maximum bilingual evaluation understudy (BLEU) [24]) training for the sequence-to-sequence generation model. The model is first optimized using the cross-entropy loss with the bilingual training corpus, and then fine-tuned by maximizing the expected BLEU of the generated translation candidates (where BLEU is the evaluation metric for MT, which measures the n-gram accuracy of the generated candidates against a human-made reference). To deal with the exposure bias problem caused by the autoregressive decoding of the sequence-to-sequence model from left to right, Ref. [25] introduced two Kullback-Leibler (KL) divergences into the training objective to maximize the agreement between the candidates generated by left-to-right and right-to-left models. A deliberation network [26] is a method to refine the translation candidate based on two-pass decoding that simulates the human translation process; the first pass generates the initial translation and the second pass refines it. In order to deal with the exposure bias problem, as mentioned in Section 2.3, Ref. [27] proposed sampling the context words not only from the ground-truth sequence but also from the predicted sequence, in order to bridge the gap between training and inference in MT.

    3.2. Semi-supervised and unsupervised learning

Semi-supervised and unsupervised learning use unlabeled data to improve the model performance. For semi-supervised learning, the model is usually first trained with labeled data and then fine-tuned with the help of unlabeled data. There are many semi-supervised learning methods, such as self-learning, generative methods, and graph-based methods [28]. In these methods, pseudo data generated by models are usually used to fine-tune the model itself. In neural machine translation (NMT), to control the errors or noise introduced in semi-supervised learning, weights/rewards are usually leveraged to filter out bad translation candidates; examples include the expected BLEU method [29] and the dual-learning approach [30]. To utilize unlabeled data to improve the performance of the sequence-to-sequence model, back-translation [31] uses a reverse translation model to translate the target monolingual data in order to build a pseudo bilingual corpus, which is used to fine-tune the source-to-target (S2T) model. The joint training method (as shown in Fig. 8) extends the back-translation method to iteratively boost the S2T and target-to-source (T2S) translation models in a unified generalized expectation-maximization framework by leveraging both the source and target monolingual corpora [32]. The bilingual corpus is first used to train the NMT models, including the S2T and T2S models. With the source monolingual data, the S2T model is used to generate pseudo data to fine-tune the T2S model, and the target monolingual data is used to fine-tune the S2T model. This training process is iterated until the performance on held-out data no longer improves.
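The pseudo-code sketch below outlines one round of this back-translation-based joint training; the model and data interfaces (translate, fine_tune, etc.) are hypothetical placeholders rather than a specific toolkit API.

```python
# Sketch of joint back-translation training; all model methods below are hypothetical placeholders.
def joint_training(s2t_model, t2s_model, bilingual, src_mono, tgt_mono, epochs=3):
    # Warm up both directions on the real bilingual corpus.
    s2t_model.fine_tune(bilingual)
    t2s_model.fine_tune(reverse_pairs(bilingual))

    for epoch in range(epochs):
        # Source monolingual data -> pseudo pairs to fine-tune the T2S model.
        pseudo_t2s = [(s2t_model.translate(x), x) for x in src_mono]
        t2s_model.fine_tune(pseudo_t2s)

        # Target monolingual data -> pseudo pairs to fine-tune the S2T model (back-translation).
        pseudo_s2t = [(t2s_model.translate(y), y) for y in tgt_mono]
        s2t_model.fine_tune(pseudo_s2t)

        # In practice, iteration stops when held-out performance no longer improves.
    return s2t_model, t2s_model


def reverse_pairs(pairs):
    return [(tgt, src) for src, tgt in pairs]
```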

    Fig. 8. Joint training of S2T and T2S NMT models. m means the mth epoch; Y′ is the translation of the source monolingual data X; X′ is the translation of the target monolingual data Y.

Leveraging the deep learning technique, deep generative models have been proposed for unsupervised learning, such as the variational auto-encoder (VAE) [33] and generative adversarial networks (GANs) [34]. The VAE net follows the auto-encoder framework, in which there is an encoder to map the input to a semantic vector, and a decoder to reconstruct the input. Unlike the original auto-encoder methods, a VAE assumes that the distribution of the generated semantic vectors should be as close as possible to a standard normal distribution. Similar to VAE nets, GANs also have two parts: a generator, which uses a given semantic vector to generate the output, and a discriminator, which tries to distinguish between the generated samples and the real samples. With an adversarial training loss function, the generator tries to output samples that are similar to real ones in order to fool the discriminator, while the discriminator tries to distinguish between the real and fake samples. A great deal of research work has attempted to apply VAEs and GANs to natural language generation tasks [35,36].
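As an illustration of the VAE objective mentioned above, the sketch below computes the reconstruction loss plus the KL divergence that pushes the latent distribution toward a standard normal; the toy encoder/decoder networks are stand-ins, and the closed-form KL term assumes a diagonal Gaussian posterior.

```python
# Sketch: VAE loss = reconstruction term + KL(q(z|x) || N(0, I)), with assumed toy networks.
import torch
import torch.nn as nn

input_dim, latent_dim = 300, 32
encoder = nn.Linear(input_dim, 2 * latent_dim)       # outputs mean and log-variance of q(z|x)
decoder = nn.Linear(latent_dim, input_dim)           # reconstructs the input from z

x = torch.randn(16, input_dim)                       # e.g., sentence representations
mu, logvar = encoder(x).chunk(2, dim=-1)

# Reparameterization trick: z = mu + sigma * eps keeps sampling differentiable.
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

recon_loss = nn.functional.mse_loss(decoder(z), x)
kl_loss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = recon_loss + kl_loss
```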

Without using a bilingual corpus, but only a small dictionary and a large monolingual corpus, unsupervised learning methods can be used for MT [37]. This method uses the joint training method to boost the S2T and T2S translation models jointly with the source and target monolingual corpus by generating a pseudo bilingual corpus. As there is no real bilingual data, the generated pseudo data may contain errors and noise, which will be reinforced in the subsequent iterative training process when such examples are used for model training. To deal with this problem, Ref. [38] introduced statistical machine translation (SMT) models as posterior regularization (PR) to filter this noise. The SMT and NMT models are optimized jointly and boost each other incrementally in a unified expectation-maximization framework. The whole procedure of this method consists of two parts (as shown in Fig. 9 [38]): model initialization and unsupervised NMT with SMT as PR. Given a language pair X-Y, two initial SMT models are first built with language models pre-trained using monolingual data and word translation tables inferred from cross-lingual embeddings. Then the initial SMT models generate pseudo data to warm up two NMT models. The NMT models are trained using not only the pseudo data generated by the SMT models, but also those generated by the reverse NMT models with the joint training method. After that, the NMT-generated pseudo data is fed to the SMT models. As PR, the SMT models filter out noise and infrequent errors by constructing strong phrase tables with good and frequent translation patterns, and then generate denoised pseudo data to guide the subsequent NMT training. Benefiting from this process, NMT then produces better pseudo data for SMT to extract phrases of higher quality, while compensating for the deficiency in smoothness inherent in SMT via back-translation. The NMT and SMT models boost each other until final convergence. Compared with the work presented in Ref. [37], this method significantly improves the translation results on four translation tasks, with gains of 1.4 BLEU points from French to English, 3.5 BLEU points from English to French, 3.1 BLEU points from German to English, and 2.2 BLEU points from English to German. Ref. [39] further introduced GPT methods to the unsupervised NMT task and proposed cross-lingual language model pre-training to achieve new state-of-the-art performance.

    3.3. Multitask learning

Multitask learning attempts to leverage the information from other related tasks to improve the performance on the desired task. When a large amount of training data for the desired task is not available, a training corpus of related tasks can be introduced using a multitask learning approach. Ref. [10] proposed a unified neural network architecture for various NLP tasks, including POS tagging, chunking, named-entity recognition, and semantic role labeling (SRL). This method learns internal representations based on vast amounts of mostly unlabeled data. This work was a milestone in learning features automatically with a neural network for NLP, and it inspired the subsequent deep learning trend in the field of NLP.

Ref. [40] proposed treating ten NLP tasks, including QA, MT, summarization, natural language inference, and so forth, as QA tasks, and built a multitask question-answering network (MQAN) to model them in a unified framework, as shown in Fig. 10 [40]. The inputs of the MQAN are a question and a context document. This input format is natural for the original QA task. For the MT task, the question is roughly "What is the translation from language X to language Y?" and the context is the source language sentence. For summarization, the question is "What is the summary?" and the context is the document to be summarized. The question and the context are encoded with BiLSTMs, followed by dual co-attention to build conditional representations for both sequences. These conditional representations are processed with another two BiLSTMs, followed by two self-attention networks and two BiLSTMs to obtain the final encoding representations of the question and context. To generate the output, an attention mechanism is leveraged to focus on the necessary encoding hidden states, and a multi-pointer generator is used to decide whether to copy from the question and context or to generate a new word. This model achieved a state-of-the-art result on WikiSQL with a 72.4% EM and an 80.4% execution accuracy. With multitask learning, the MQAN can lead to better generalization for zero-shot learning on the zero-shot relation extraction task on the QA-ZRE dataset, with a gain of 11 F1 points over the best single-task models. Ref. [41] proposed the multitask deep neural network (MT-DNN), a multitask learning-based DNN built on BERT. By adding more specific tasks into the pre-training, MT-DNN obtains very good results on ten natural language understanding (NLU) tasks, including SNLI, SciTail, and eight out of nine GLUE [42] tasks. The effectiveness of MT-DNN also shows that different but related tasks can boost each other via multitask learning.

    Fig. 9. Illustration of the unsupervised NMT training [38].

Fig. 10. Network structure of the MQAN [40]. α is the attention weights; γ and λ are the scalars to switch the output distributions.

    3.4. Pre-trained models and transfer learning

Task-unspecific models are pre-trained first, and can then be transferred to specific tasks with a fine-tuning process. With pre-trained word embedding or sentence embedding, transfer learning can be used on top of them to fine-tune the task-specific models [43]. In recent years, many pre-trained models have been proposed, such as word2vec [6,8], ELMo [9], GPT [3], BERT [4], and XLNet [5], as introduced in Section 2.1, which are commonly used in NLP tasks such as MRC and QA.
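The sketch below shows the typical fine-tuning pattern with the Hugging Face transformers library, where a pre-trained BERT encoder is topped with a task-specific classification head; the checkpoint name, example labels, and single training step are illustrative assumptions.

```python
# Sketch: transferring a pre-trained BERT model to a classification task (assumed checkpoint and data).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["an ant went to the river bank", "that is a good way to build up a bank account"]
labels = torch.tensor([0, 1])                        # hypothetical task labels

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs, labels=labels)             # pre-trained encoder + newly initialized head

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs.loss.backward()                              # fine-tune all parameters on the downstream task
optimizer.step()
```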

Zero-shot and one-shot transfer learning have been explored with no or only a few labeled samples. Ref. [44] proposed a zero-shot transfer learning method for text classification, in which the model is trained to learn the relationships between sentences and categories with a large dataset, and is then transferred to new categories with no training data. Ref. [45] used semantic parsing to map natural language explanations of classifying concepts to formal constraints relating features grounded in the observed attributes of unlabeled data. Such constraints are combined using PR to yield a classifier for the zero-shot classification task. For some tasks in which there is a large amount of training data in a rich language (e.g., English), but little or no data in other languages (e.g., Romanian), cross-lingual transfer learning can be used to transfer the model trained in the rich language to a model for the rare language. Multitask learning over different MT language pairs can be transferred to enable zero-shot translation for language pairs with no bilingual corpus [46].

Another direction for transferring a pre-trained model to a wide variety of new and unseen situations is "learning to learn," also known as meta learning. First proposed by Ref. [47], meta learning has recently become a hot topic, and is used for pre-trained model transfer, hyper-parameter tuning, and neural network optimization. By directly optimizing an initial model that can be effectively transferred with the help of a few examples, model-agnostic meta learning (MAML) [48] has been proposed to learn a task-agnostic model. The learned model can be quickly adapted (with a small number of update steps) to several related tasks. Ref. [49] leveraged MAML to optimize NMT models using 18 European languages. By treating each example as a unique pseudo-task, the original structured query-generation task is reduced to a few-shot problem for which meta learning is used, bringing significant performance improvement.

Ref. [50] introduced an effective multitask learning framework to train a transferable sentence representation by combining a set of objectives, including multi-lingual NMT, natural language inference, constituency parsing, and skip-thought vectors (i.e., predicting the previous or following sentence); the learned representation can be transferred to a low-complexity classifier on a new task. By switching between different tasks, the gated recurrent unit (GRU)-based RNN network can learn different aspects of the sentence. With the NMT and parsing tasks, syntactic properties can be better encoded. Sentence characteristics such as length and word order have been found to be better encoded by parsing. The learned representations are visualized using dimensionality reduction and nearest-neighbor exploration on three different datasets (movie reviews, question type classification, and Wikipedia classification), and sentences are found to be clustered reasonably well according to the labels. The learned representations are used as features (without parameter updating), and transferred to sentiment-classification tasks (movie reviews, product reviews, subjectivity/objectivity classification) with gains of 1.1%-2.0%, question type classification with gains of 6%, and paraphrase identification (Microsoft Research Paraphrase Corpus) with gains of 2.3%.

    3.5. Active learning

Active learning can interactively query the user in order to selectively label the data, with the aim of maximizing the performance gain and minimizing the labeling costs. To deal with the low-resource problem, one straightforward method is to label more data, which presents the challenge of identifying which data should be labeled in order to improve the model performance the most. To deal with this problem, active learning methods [51] can be used to automatically and iteratively select useful instances to be labeled so as to maximize the model performance. With the labeled data, the learning model can be trained and applied to the unlabeled data. Based on the labeling results, the active selection model predicts the best selection for the editors to label, based on signals such as uncertainty. The newly labeled data can be used to obtain a better learning model, and this process can be iterated until an acceptable performance is achieved.
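A minimal sketch of this uncertainty-based selection loop is given below; the model interface and the entropy criterion are assumptions, and a real system would add stopping criteria and batch diversity.

```python
# Sketch: pool-based active learning with uncertainty (entropy) sampling; interfaces are hypothetical.
import math

def entropy(probs):
    return -sum(p * math.log(p + 1e-12) for p in probs)

def active_learning_loop(model, labeled, unlabeled, annotate, rounds=5, budget=100):
    for _ in range(rounds):
        model.train_on(labeled)                                  # retrain with the current labels
        # Rank unlabeled instances by predictive uncertainty.
        scored = sorted(unlabeled, key=lambda x: entropy(model.predict_proba(x)), reverse=True)
        selected = scored[:budget]                               # most uncertain instances
        labeled += [(x, annotate(x)) for x in selected]          # human editors label them
        unlabeled = [x for x in unlabeled if x not in selected]
    return model
```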

    3.6. Summary

In this section, we reviewed several typical training methods, including supervised, semi-supervised, and unsupervised learning; multitask learning; transfer learning; and active learning. When enough labeled data is available, the supervised method can achieve very good performance on many NLP tasks, such as MT and MRC. For other tasks, where there is insufficient labeled data, we have introduced several learning methods to enhance the model performance, including semi-supervised and unsupervised learning with unlabeled data, transfer learning with pre-trained models, multitask learning with labeled data from other tasks, and active learning to annotate the most valuable samples. To improve the performance of NLP models, we think that more research should be conducted on the following topics:

    · The topological training process. Although multitask learning and transfer learning can leverage the data from related tasks to enhance models for a desired task, the relationship between various NLP tasks should be further studied. For instance, a topological training process should be developed in the future, based on which the pre-trained models on fundamental tasks (e.g., language models, POS, and parsing) can be better transferred to higher-level tasks (e.g., MT and QA).

· Reinforcement learning (RL). RL has also been explored in many NLP model training cases, but it seems that it has not yet achieved satisfactory results. For example, in a dialog system for customer service, the error or loss is not available at each turn of the dialog session. The only information available is whether the tickets are booked successfully or not, or how many turns are used. For such scenarios, where only a long-term reward is available, RL can be used to learn a policy network to maximize the expected reward. However, the RL process still suffers from the exponential search space with respect to the length of the natural language sentence [52].

· GANs. Even though many research efforts have tried to apply GANs to NLP tasks, such as MT [53] and natural language generation [54], several challenges remain. It is difficult for the discriminator to pass the gradient signal to the generator directly, as it does in image and speech processing, due to the discrete output of the generator. In addition, the GAN is very sensitive to the random initialization and to small deviations from the best hyper-parameter choice [36]. Such challenges amplify the difficulty of GAN training.

    4. Reasoning

Neural approaches have achieved good progress on many NLP tasks, such as MT and MRC. However, they still have some unsolved problems. For example, most neural network models behave like a black box, which cannot explain how or why the system solved a problem in the way it did. Besides, for tasks such as QA and dialog systems, only knowing the literal meanings of the input utterances is often not enough. To generate the right responses, external and/or context knowledge may also be needed. To build such interpretable and knowledge-driven systems, reasoning is necessary.

    In this paper, we define reasoning as a mechanism that can generate answers to unseen questions by manipulating existing knowledge with inference techniques. Based on this definition, a reasoning system (Fig. 11) should have two components:

· Knowledge, such as a knowledge graph, common sense, rules, assertions extracted from raw texts, etc.;

· An inference engine, to generate answers to questions by manipulating existing knowledge.

    Next, we use two examples to illustrate why reasoning is important to NLP tasks.

The first example is a knowledge-based QA task. Given the question "When was Bill Gates' wife born?," the QA model must parse it into the logical form for answer generation: λxλy.DateOfBirth(y, x) ∧ Spouse(Bill Gates, y), where knowledge graph-based reasoning is needed. Starting with this question, new questions can be further appended to this context, such as: "What's his/her job?" To answer such context-aware questions, co-reference resolution determines which person is meant by "his/her." This is also a reasoning procedure, which needs the commonsense knowledge that "his" can only refer to men and "her" can only refer to women.

The second example is a dialog task. For example, if a user says "I am very hungry now," it would be more suitable to reply "Let me recommend some good restaurants to you" instead of "Let me recommend some good movies to you." This also requires reasoning, as the dialog system should know that being hungry leads to actions such as looking for restaurants instead of watching films.

    In the remainder of this section, we will first introduce two types of knowledge: the knowledge graph and common sense.Next, we will describe typical inference approaches, which have been or are being studied in the NLP area.

    Fig. 11. Overview of a reasoning system. ILP: integer linear programming; MLN: Markov logic network.

    4.1. Knowledge

Knowledge plays an important role in reasoning-driven NLP tasks. It may refer to any information (e.g., dictionaries, rules, knowledge graphs, annotations of specific tasks) that can guide NLP systems to complete specific tasks. Here, we focus on two types of knowledge for reasoning: the knowledge graph and common sense.

    4.1.1. The knowledge graph

A knowledge graph is a directed graph {V, E}, which consists of nodes V and edges E. Each node v ∈ V denotes an entity. Each edge e ∈ E denotes a predicate between the two entities it connects. Each triple <v1, e, v2> denotes a fact, where v1, v2 ∈ V and e ∈ E.

For example, <Microsoft, Founder, Bill Gates> is a knowledge triple from a knowledge graph, where Microsoft is the subject entity, Bill Gates is the object entity, and Founder is the predicate indicating that Bill Gates is the founder of Microsoft. A knowledge graph is essential to those NLP tasks in which parsing natural language into machine-executable structured queries is indispensable, such as semantic search, knowledge-based QA, and task-oriented dialog.
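The toy sketch below represents such triples and answers a simple one-hop query; it is only meant to make the <subject, predicate, object> structure concrete, not to stand in for a real knowledge graph store.

```python
# Sketch: a toy triple store with a one-hop lookup over <subject, predicate, object> facts.
triples = [
    ("Microsoft", "Founder", "Bill Gates"),
    ("Bill Gates", "Spouse", "Melinda Gates"),
]

def query(subject, predicate):
    """Return all objects o such that <subject, predicate, o> is a known fact."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(query("Microsoft", "Founder"))   # ['Bill Gates']
```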

    The approaches of knowledge graph construction can be grouped into three categories:

    · Handcrafted methods. In these methods, knowledge graphs are manually constructed by human experts. Knowledge graphs built in this way, such as WordNet [55], are usually of high quality, but cover limited facts. In addition, the cost of maintenance is usually very high.

    · Crowdsourcing methods. In these methods, community members construct knowledge graphs in a collaborative way. Compared with handcrafted methods, knowledge graphs built in this way, such as DBpedia [56], Freebase[57], and WikiData [58], are large scale and high quality.

· Information-extraction methods. These methods extract structured knowledge from web documents. KnowItAll [59], YAGO [60], and NELL [61] are representatives of this method. Compared with the first two approaches, these methods can extract more knowledge from free texts; however, they also bring a lot of noise into the resulting knowledge graphs.

    4.1.2. Common sense

Common sense refers to common knowledge of things in the world, such as their properties, relationships, and interactions, which all humans are expected to know. Such knowledge is mostly location-, language-, and culture-independent, and is rarely explicitly expressed in texts.

For example, "a father is a male" is property common sense, "the sun is in the sky" is spatial common sense, and "plants grow from seeds" is procedural common sense.

    Building a commonsense knowledge base (CKB) is difficult. At present, there are three major methods (which are similar to the knowledge graph), but all suffer from significant problems.

· Handcrafted methods. CYC [62] is a CKB built by human experts in this way, which focuses on things that are rarely written down or said, such as "every human has exactly one father and exactly one mother." The goal of CYC is to enable AI systems to perform human-like reasoning and to be less brittle when confronting unseen situations. The sizes of CKBs such as CYC are limited, as human labeling is expensive.

· Crowdsourcing methods. ConceptNet [63] is a CKB built by absorbing knowledge from knowledge graphs such as WordNet, Wiktionary, Wikipedia, DBpedia, and Freebase. Compared with CYC, ConceptNet covers many more assertions. However, a large portion of the ConceptNet assertions is actually not common sense, as it comes from existing knowledge bases and is about "named entities," such as Bill Gates and White House, instead of "common nouns," such as human and building.

· Information extraction methods. WebChild [64] is a CKB containing commonsense assertions that connect nouns with adjectives via fine-grained relations such as hasShape and hasTaste. Such assertions are extracted from web documents based on seed assertions from WordNet. Due to extraction errors, WebChild contains a great deal of noise. It now covers more than four million commonsense assertions.

    4.2. Inference engine

Although this paper focuses on neural NLP methods, we will still start from two typical non-neural inference methods: integer linear programming (ILP) and Markov logic networks (MLNs), as they have been studied a great deal and have already been used in some NLP tasks. After that, we will introduce memory networks as a typical neural inference method. We will also describe their applications in two reasoning tasks: semantic parsing and response generation, which are based on the knowledge graph and common sense, respectively.

    4.2.1. Non-neural inference methods: ILP and MLN

ILP is an optimization framework that maximizes a linear objective function over a finite set of integer variables x, subject to a set of linear inequality constraints:

maximize w^T x
subject to Ax ≤ b, x ∈ {0, 1}^n

where w is the parameter (weight) vector of the objective; A and b are pre-defined parameters in the constraints; and n is the number of variables.

The constraints used in ILP can be considered as prior knowledge, and the optimization procedure is to reason out a global prediction by incorporating learned models based on such prior knowledge. This method can be used in the information extraction task [65], whose objective is to recognize named entities and their relationships from texts. Fig. 12 gives an example.

    Usually, an NER and a relation extractor (RE) will be trained separately and used in this task in a cascaded manner. First, the NER detects entity mentions from input and assigns possible types to each mention. Then, the RE assigns possible relations to each entity pair. Five prediction tables are generated for these three named entities and two relations, which are shown in Fig. 13.

If local predictions with the highest probabilities are always chosen, errors will occur. For example, the type of Brooklyn will be recognized as Person and the relation between Adam and Anne will be classified as PlaceOfBirth. However, if it is known that the object type of PlaceOfBirth should be Location instead of Person, such errors can be avoided. Inference with such prior knowledge is a reasoning procedure, which can be done by ILP.

In order to obtain correct predictions based on the above five local prediction tables, we formulate the problem as an ILP and manually define the following four constraints: ① Each entity mention can be assigned an entity type only once. ② Each entity pair can be assigned a relation only once. ③ The type assignment to each entity mention should be consistent with the assignments to its neighboring relations. For example, when Anne is tagged as Person, if the type of Brooklyn is also recognized as Person, then the relation assignment between Anne and Brooklyn cannot be PlaceOfBirth, as its object entity type should be Location, which is inconsistent with Person. ④ The relation assignment to each entity pair should be consistent with the assignments to these two entities. Based on the local NER and RE outputs and these four constraints, ILP can obtain the globally optimized output (Fig. 14).

This time, the type of Brooklyn is recognized as Location, as it is consistent with the relation (i.e., PlaceOfBirth) assigned to Anne and Brooklyn. Similarly, the relation between Adam and Anne can be correctly recognized as SpouseOf.
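A compact sketch of this kind of formulation with the PuLP solver is shown below; the local scores are hypothetical, only a single mention (Brooklyn) and a single entity pair (Anne, Brooklyn) are modeled, and only constraints ①-③ are encoded to keep the example short.

```python
# Sketch: a tiny ILP for joint entity/relation prediction with PuLP (scores are hypothetical).
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, value

types = ["Person", "Location"]
relations = ["PlaceOfBirth", "SpouseOf"]
type_score = {"Person": 0.6, "Location": 0.4}          # local NER scores for "Brooklyn"
rel_score = {"PlaceOfBirth": 0.7, "SpouseOf": 0.3}     # local RE scores for (Anne, Brooklyn)

prob = LpProblem("joint_inference", LpMaximize)
t = {k: LpVariable(f"type_{k}", cat="Binary") for k in types}
r = {k: LpVariable(f"rel_{k}", cat="Binary") for k in relations}

# Objective: sum of the scores of the chosen assignments (w^T x).
prob += lpSum(type_score[k] * t[k] for k in types) + lpSum(rel_score[k] * r[k] for k in relations)

prob += lpSum(t.values()) == 1               # constraint 1: exactly one type for "Brooklyn"
prob += lpSum(r.values()) == 1               # constraint 2: exactly one relation for (Anne, Brooklyn)
prob += r["PlaceOfBirth"] <= t["Location"]   # constraint 3: PlaceOfBirth requires a Location object

prob.solve()
chosen = {name: value(var) for name, var in {**t, **r}.items()}
# With these scores, the solver picks Location + PlaceOfBirth (0.4 + 0.7) over Person + SpouseOf (0.6 + 0.3).
```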

ILP can be used in many NLP tasks, such as QA [66-68] and semantic role labeling [69,70]. It is especially useful when the size of the training data is small, as prior knowledge (used as constraints in ILP) can play a more important role in such scenarios. However, this framework is still independent of existing neural network methods.

An MLN [71] L = {(F_l, w_l)} is defined as a set of pairs, where each F_l is a first-order logic formula and w_l is the weight of F_l. It combines probability and first-order logic in a unified model and can generate a Markov network M_{L,C} by grounding all variables in the formulas to constants C = {c_1, ..., c_{|C|}}. The basic idea of an MLN is to soften hard constraints in first-order logic: When a world violates one formula in the knowledge graph, it is less probable, but not impossible.

    Fig. 12. An example of an information-extraction task.

Fig. 13. Sub-optimal results without using ILP. E: entities occurring in the sentence; R: relations between entities.

    Fig. 14. Optimal results when using ILP.

Ideally, an MLN can be applied to reasoning-required NLP tasks by the following steps. Here, we use the famous "friend smoking" case as an example:

(1) Given a world described in natural language, parse it into a set of first-order logic formulas F = {F_1, ..., F_{|F|}}. For example, given two sentences:

    Smoking causes cancer

    Friends have similar smoking habits

The sentences are parsed into two first-order logic formulas:

∀x Smokes(x) ⇒ Cancer(x)
∀x ∀y Friends(x, y) ⇒ (Smokes(x) ⇔ Smokes(y))

In practice, each formula has a weight, which should be learned from existing relational databases by algorithms such as the pseudo-likelihood algorithm [72].

(2) Given L and a set of constants C, a Markov network M_{L,C} is defined as follows: ① M_{L,C} contains one binary node for each possible grounding of each predicate in L. The value of the node is 1 if the ground atom is true, and 0 otherwise. ② M_{L,C} contains one feature for each possible grounding of each formula F_i in L. The value of this feature is 1 if the ground formula is true, and 0 otherwise. The weight of the feature is the w_i associated with F_i in L. For example, given two constants Anna (A) and Bob (B), the ground M_{L,C} is described in Fig. 15.

(3) Given a set of evidence, such as Friends(A, B) = 1 and Cancer(A) = 1, reason out the most likely state of the world, such as Smokes(A) = ?, Smokes(B) = ?, and Cancer(B) = ?. This inference procedure can answer questions such as "What is the probability that Anna smokes, given that Anna has cancer?" In practice, algorithms such as MC-SAT [66] can be used to solve such inference tasks.

As a statistical relational learning approach, the MLN has already been applied to some NLP tasks, such as semantic parsing [73], information extraction [74], and entity disambiguation [75]. However, it is still difficult to apply it to real tasks, as the size of the ground Markov network is usually very large, which raises computation issues. For example, based on the analysis in Ref. [76], the size of the corresponding ground Markov network is exponential in |C|, which is the number of constants. If the constants come from a practical knowledge base (such as Freebase, which contains 39 million entities), the inference complexity in the resulting ground Markov network will be enormous. In addition, similar to ILP, the MLN is independent of neural models. The necessity of and solutions for designing neural versions of the MLN are still worthy of discussion and study.

    4.2.2. Neural inference methods: MemNN and its variants

    A memory network (MemNN) [77-79] is a neural network architecture in which reasoning can be supported by processing knowledge stored in a long-term memory component multiple times before outputting a result.

We take the key-value memory network (KV-MemNN) proposed by Ref. [79] as an example to show how this framework can support reasoning. Fig. 16 [79] gives an overview of the KV-MemNN.

    Fig. 15. An example of a Markov network.

In the KV-MemNN, knowledge is stored in the memory as key-value pairs (k_{h_i}, v_{h_i}). In each hop, key addressing computes a relevance probability for each memory slot, p_{h_i} = Softmax(q · AΦ_K(k_{h_i})), and value reading returns the weighted sum of the values, o = Σ_i p_{h_i} AΦ_V(v_{h_i}). The question representation is then updated as q_2 = R_1(q + o), where q = AΦ_X(x) is the question vector, A is a d × D embedding matrix, Φ(·) is a feature map, and R_1 is a d × d matrix. This procedure is repeated H times (hops) using different R_j, and the final vector q_{H+1} is used to predict the output.
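The sketch below implements one such hop in PyTorch, following the key-addressing, value-reading, and query-update steps described above; the memory contents and dimensions are random placeholders.

```python
# Sketch: hops of a key-value memory network (random placeholder memories and dimensions).
import torch
import torch.nn as nn

d, num_slots, hops = 128, 50, 2
key_mem = torch.randn(num_slots, d)        # embedded keys   A*Phi_K(k_i)
value_mem = torch.randn(num_slots, d)      # embedded values A*Phi_V(v_i)
q = torch.randn(d)                         # question vector q = A*Phi_X(x)
R = [nn.Linear(d, d, bias=False) for _ in range(hops)]

for j in range(hops):
    scores = key_mem @ q                           # key addressing: similarity to each memory slot
    p = scores.softmax(dim=0)                      # relevance probabilities
    o = (p.unsqueeze(1) * value_mem).sum(dim=0)    # value reading: weighted sum of values
    q = R[j](q + o)                                # query update for the next hop

# After H hops, q is compared with candidate answers to predict the output.
```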

MemNN is widely used in many reasoning-required NLP tasks. For example, Ref. [77] first proposed the MemNN framework and applied it to bAbI, which is a reasoning-required QA task. Ref. [79] proposed the KV-MemNN and evaluated it on two QA datasets: WikiMovies and WikiQA [80]. Ref. [81] described an end-to-end goal-oriented dialog method based on MemNN.

In a broader sense, MemNN is a special case of memory-augmented neural networks (MANNs), which extend the capabilities of neural networks by coupling them to external memory resources. In such methods, neural models interact with the memory through attentional processes. As MANNs have many variants, we use two examples to illustrate the applications of a MANN in two reasoning-required tasks (i.e., semantic parsing and response generation), which are based on the knowledge graph and on commonsense knowledge, respectively.

Ref. [82] proposed a unified semantic parsing approach to handle a knowledge graph-based conversational QA task, in which a dialog memory motivated by MemNN is introduced to cope with co-reference and ellipsis phenomena in multi-turn interactions. Two examples are given below.

Fig. 16. Overview of the KV-MemNN [79]. a means the answer to the question; B means a d × D matrix, which can be constrained to be identical to A; R_j means a d × d matrix, which is used to update the representation of the input question in the jth hop.

    In order to answer the first question:

    (1) Where was Donald Trump born?

The semantic parser first generates its logical form, λx.PlaceOfBirth(Donald Trump, x), and then executes it against a knowledge base to get the answer New York. This is called context-independent semantic parsing, as only the current question is needed.

    In order to answer the second question:

    (2) Where did he graduate from?

Both questions (1) and (2) are needed, as "he" in question (2) refers to Donald Trump in question (1). This is called context-dependent semantic parsing.

Given a context-independent question (e.g., question (1)), semantic parsing is done as an action sequence prediction task. Starting from a root symbol "Start," a grammar-guided decoder recursively rewrites the leftmost nonterminal (i.e., the semantic category) in the logical form by applying a legitimate action. It terminates when no nonterminal is left. Fig. 17 illustrates this procedure.

A list of actions is defined in Table 1, where each action consists of three parts: a semantic category; a function symbol, which might be omitted; and a list of arguments, each of which can be a semantic category, a constant, or an action subsequence. The first 15 actions (A1-A15) are designed to cover typical operations in semantic parsing. We take A5 as an example: "num" is the semantic category, which denotes the return type of count(set); "count" is the function symbol, which returns the number of elements in set; and "set" is the only argument of count. A16-A18 are designed to instantiate entity e, predicate r, and number num, respectively. A19-A21 are designed to reuse previously predicted action subsequences.

Given a context-dependent question (e.g., question (2)), a dialog memory is used to maintain all entities, predicates, and action subsequences that come from either the generated logical forms of previous questions or the predicted answers. Such contents are considered to be the conversation history and are used to restore the missing information (co-reference and entity/relation ellipsis) of the current question. For the example in Fig. 18, as "he" in the current question actually refers to Donald Trump in the previous question, the semantic parser can copy the action subsequence (i.e., A15 eDT) from the dialog memory to complete the logical form generation.

    Experiments are conducted on the CSQA dataset [83], which consists of 2 × 10^5 dialogs with 1.6 × 10^6 turns over 1.28 × 10^7 entities from WikiData, and this method achieves state-of-the-art results on both context-independent and context-dependent questions.

    Ref. [84] proposed a commonsense-aware encoder-decoder framework for the response-generation task in open-domain dialog systems. ConceptNet is used in this work to understand the background information of a given user utterance and then facilitate response generation.

    Table 1 List of actions, each consisting of a semantic category, function symbol, and list of arguments.

    In the decoder part, a knowledge-aware generator is designed to generate a response by making full use of the graphs retrieved from ConceptNet for the input utterance. This generator plays two roles: ① It attentively reads the retrieved graphs to obtain a graph-aware context vector, and uses this vector to update the decoder’s state; this procedure is similar to MemNN. ② It adaptively chooses a generic word or an entity from the retrieved graphs for response generation.
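    The sketch below illustrates these two roles for a single decoding step: an attentive read over retrieved graph vectors produces a graph-aware context that updates the decoder state, and a learned gate splits probability mass between generic vocabulary words and graph entities. The module structure, sizes, and names are our own simplifications rather than the exact architecture of Ref. [84].

```python
# A simplified single decoding step of a knowledge-aware generator.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeAwareStep(nn.Module):
    def __init__(self, d, vocab_size, n_entities):
        super().__init__()
        self.cell = nn.GRUCell(d, d)              # decoder state update
        self.word_proj = nn.Linear(d, vocab_size) # scores over generic words
        self.entity_proj = nn.Linear(d, n_entities)  # scores over graph entities
        self.gate = nn.Linear(d, 1)               # choose word vs. entity

    def forward(self, state, prev_embed, graph_vectors):
        attn = F.softmax(graph_vectors @ state, dim=0)   # role 1: read the graphs
        context = attn @ graph_vectors                   # graph-aware context vector
        # simplification: fold the graph context into the previous decoder state
        state = self.cell(prev_embed.unsqueeze(0), (state + context).unsqueeze(0)).squeeze(0)
        p_entity = torch.sigmoid(self.gate(state))       # role 2: word/entity gate
        word_dist = F.softmax(self.word_proj(state), dim=-1)
        entity_dist = F.softmax(self.entity_proj(state), dim=-1)
        return state, (1 - p_entity) * word_dist, p_entity * entity_dist

d = 16
step = KnowledgeAwareStep(d, vocab_size=100, n_entities=20)
state, prev_embed = torch.zeros(d), torch.zeros(d)
graphs = torch.randn(7, d)                               # 7 retrieved entity vectors
state, p_word, p_entity = step(state, prev_embed, graphs)
```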

    Experiments are conducted on the Commonsense Conversation dataset, and this method achieves state-of-the-art results, outperforming the traditional sequence-to-sequence approach with a copy mechanism.

    With deep learning, different types of knowledge can be represented and used in neural inference methods for various reasoning tasks. However, such approaches still lack sufficient reasoning flexibility. For example, the number of reasoning hops in MemNN is fixed regardless of the input content, which can be further improved in the future.

    4.3. Reasoning-aware shared tasks

    Based on the different types of knowledge used, we classify recently proposed reasoning tasks into four categories:

    Fig. 17. An example of context-independent semantic parsing. DT: Donald Trump.

    Fig. 18. An example of context-dependent semantic parsing.

    · Tasks based on knowledge graphs. These tasks include WikiSQL [85], LC-QuAD [86], CSQA [83], and ComplexWebQuestions [87], in which various types of questions, such as multi-hop, multi-constraint, superlative, and multi-turn questions, must be answered based on knowledge graphs. In these tasks, reasoning is required and is done by performing semantic parsing explicitly.

    · Tasks based on commonsense knowledge. These tasks include the Winograd Schema Challenge [88], ARC [89], CommonsenseQA [90], and ATOMIC [91], in which questions can be answered based on different types of commonsense knowledge, such as temporal, spatial, and causal common sense. In these tasks, reasoning can either be performed explicitly by designing specific inference models, or performed implicitly by end-to-end training.

    · Tasks based on texts. These tasks include HotpotQA [92], NarrativeQA [93], MultiRC [94], and CoQA [95], in which answers can be obtained by reasoning across paragraphs, documents, or conversation turns. Currently, state-of-the-art results on these tasks are achieved by end-to-end neural models rather than by explicit knowledge-based reasoning; one reason is that existing knowledge bases still suffer from low coverage of open-domain natural language texts.

    · Tasks based on both texts and visual content. These tasks include GQA [96] and VCR [97], in which the goal is to answer a natural language question based on a given image. As the questions in these two datasets are either multi-hop questions or require commonsense knowledge, a model needs a strong reasoning ability to achieve good performance.

    As data annotation is expensive, these datasets are often small in size, or large but generated automatically using templates. The models learned from such datasets lack strong generalization abilities. Recently, pre-trained models such as ELMo, GPT, BERT, and XLNet have shown good generalization performance across different NLP tasks. For the next step, it is practical to fine-tune existing pre-trained models on reasoning-aware datasets with the aim of building more generalized reasoning systems.
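    As an illustration of this direction, the hedged sketch below fine-tunes a placeholder pre-trained encoder on a multiple-choice reasoning dataset: each (question, choice) pair is encoded and scored, and the model is trained with cross-entropy over the choices. The PretrainedEncoder class is a stand-in for BERT/XLNet, the data is random, and none of this reflects a specific library API.

```python
# A hedged sketch of fine-tuning a pre-trained encoder on a multiple-choice
# reasoning dataset (CommonsenseQA-style). PretrainedEncoder is a placeholder.
import torch
import torch.nn as nn

class PretrainedEncoder(nn.Module):               # stand-in for BERT/XLNet
    def __init__(self, vocab_size=1000, d=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):                 # (batch, seq_len) -> (batch, d)
        return self.encoder(self.embed(token_ids)).mean(dim=1)

class MultipleChoiceModel(nn.Module):
    def __init__(self, encoder, d=32):
        super().__init__()
        self.encoder, self.scorer = encoder, nn.Linear(d, 1)

    def forward(self, token_ids):                 # (batch, n_choices, seq_len)
        b, c, t = token_ids.shape
        pooled = self.encoder(token_ids.view(b * c, t))
        return self.scorer(pooled).view(b, c)     # one score per answer choice

model = MultipleChoiceModel(PretrainedEncoder())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)   # small fine-tuning rate
batch = torch.randint(0, 1000, (2, 5, 16))                  # 2 questions, 5 choices each
labels = torch.tensor([1, 3])                               # gold choice indices
loss = nn.functional.cross_entropy(model(batch), labels)
loss.backward()
optimizer.step()
```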

    4.4. Summary

    This section briefly reviewed the progress of reasoning in NLP. Typical (non-neural and neural) inference methods were introduced, including ILP, MLN, and MemNN, all of which have been successfully used in various NLP tasks, such as QA, dialog systems, and information extraction.

    In order to build more powerful reasoning systems, some challenges remain:

    · Knowledge extraction. Due to the limited coverage of existing knowledge sources, only a small portion of the queries issued to current human-computer interaction systems (such as search, QA, and dialog engines) can be fully understood and addressed. Therefore, knowledge extraction remains a long-term research task for acquiring high-quality knowledge to support reasoning.

    · Reasoning with explicit knowledge and pre-trained models. Typical reasoning approaches are built on explicit knowledge bases. Recently, pre-trained models such as GPT, BERT, and XLNet have shown strong abilities on reasoning-required NLP tasks, such as the Winograd Schema Challenge [88] and SWAG [98]. Therefore, the question of how to combine explicit knowledge and pre-trained models into a unified reasoning framework is worth exploring.

    · Datasets and metrics. Although the latest approaches can perform well on many NLP tasks, it is still difficult to tell how much reasoning capability an NLP model has. In order to measure such abilities, large-scale datasets and specific metrics should be built for the community.

    In general, reasoning is critical to advancing NLP, as it can bring existing knowledge into all NLP tasks for better natural language understanding and generation.

    5. Conclusion

    This paper reviewed the latest progress of neural NLP from three perspectives: modeling, learning, and reasoning. In general, NLP is entering a new era, in which neural network-based models dominate both research and application systems. For rich-resource tasks with enough training data (e.g., NMT and SQuAD), supervised learning performs very well. For low-resource tasks with little training data, semi-supervised and unsupervised learning, multitask learning, transfer learning, and active learning can be used; these methods either generate more pseudo training data for model training, or leverage the knowledge learned from existing models and tasks. Different reasoning mechanisms have also been explored and used for tasks in which reasoning is required, such as commonsense QA, chitchat, and dialog systems.

    Looking ahead, we can see many exciting directions. For example, pre-trained models such as GPT, BERT, and XLNet have already shown strong performance on various NLP tasks, and there is no doubt that they will continue to evolve for both natural language understanding and generation tasks. It is also worth exploring how to integrate such pre-trained models into various NLP tasks with transfer learning, multitask learning, meta learning, and so on. Research on memory-augmented neural networks and their variants will further advance reasoning approaches, with the help of new knowledge-extraction techniques. Furthermore, by combining NLP with multi-modal tasks such as speech recognition, image/video captioning, and QA, new research and application scenarios will emerge. It is an incredibly exciting time to work on NLP, from research to applications.

    Compliance with ethics guidelines

    Ming Zhou, Nan Duan, Shujie Liu, and Heung-Yeung Shum declare that they have no conflict of interest or financial conflicts to disclose.
