Guo Jianwei,Yan Na,Chen Jiayu
(Information Resources Department of Beijing Institute of Science and Technology Information,Beijing 100044)
Abstract: Today, language processing technology is widely used in many aspects of people's life, learning and work, bringing great convenience. However, whether in Chinese search, Chinese speech recognition, or Chinese OCR, development is less mature than it is for English. If computer technology develops to the stage where we can design intelligent accounting software into which we enter economic business data and which then automatically generates entries, vouchers, and reports, the efficiency of a company's financial work will improve greatly, and relevant personnel will gain intelligence and knowledge support for their economic and financial decisions.
Keywords: Language Processing; Information Extraction; Automatic Abstraction
A query submitted to an Internet search engine is usually broken down into a few keywords rather than expressed as an actual problem in natural language. A context-based Q&A system can provide a better answer by gaining a clearer understanding of the question. By contrast, for a long query an Internet search engine can only return poor answers, or nothing of value; people have to convert their problems into one or more relevant keywords to try to get a reasonably appropriate reply. The long-term goal of information retrieval research is to develop retrieval models that provide accurate results for longer and more specialized queries, which is why a better understanding of text information is required. Natural Language Processing (NLP)[1][2] includes two parts: natural language understanding and natural language generation.
Natural language is the language of human communication. The directions of study in natural language processing include the following[2]:
Rule-based Method: With the rule-based method you can encode empirical knowledge, putting artificial experience into a rule base, and then handle unknown situations with the statistical method. Most of the time people do not talk according to probability; they just do it casually. For example, when asked how to lose weight, you might simply name something to eat for weight loss.
Statistical Method: A corpus is a sample database of documents. Probability statistics only make sense when the corpus is of a large scale, so that it can be assumed that many words and sentences appear multiple times.
Computing Framework: Algorithm design is difficult work that requires a very good grasp of data structures. The algorithm design techniques used in this paper mainly include the iterative method, the divide-and-conquer method, the dynamic programming method, and so on. With the massive amounts of data in Internet search, a distributed computing framework is required to perform calculations such as rating page importance (a small sketch of such a rating computation follows this list).
Semantic Database: The semantics of natural language are complex and variable; for example, in "buy a doll and send it to your girlfriend", the word "send" carries more than one semantic item. OpenCyc provides an English knowledge base in OWL format.
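As a minimal single-machine sketch of the page-importance rating mentioned above, the following code performs a PageRank-style power iteration. In practice this kind of calculation runs on a distributed framework; the tiny link graph here is a made-up toy example.

```python
# Minimal single-machine sketch of a PageRank-style page-importance rating.
# In practice this runs on a distributed framework; the link graph below is
# a made-up toy example.

def page_rank(links, damping=0.85, iterations=30):
    """links: dict mapping a page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}          # start with a uniform rating
    for _ in range(iterations):                          # iterative method: repeat until stable
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                             # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outlinks:
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

if __name__ == "__main__":
    toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(page_rank(toy_graph))
```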
There are two main modes of automatic syntax analysis[3]: phrase structure grammar and dependency grammar. As the latter better reflects the relationships between the words in a sentence, this section analyzes it. The Stanford Parser implements a syntactic parser based on a factored model; the main idea is to break a parser based on a lexicalized model into a syntactic parser based on multiple factors. The Stanford Parser breaks the lexicalized model into a probabilistic context-free grammar (PCFG) and a dependency model. SharpNLP, implemented in C#, can graphically display the syntax tree. Dependencies can also be used to improve the uni-gram word segmentation model as well as the bi-gram word segmentation model.
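To illustrate what extracting word-to-word dependencies looks like, here is a minimal sketch using the spaCy library as a stand-in for the Stanford Parser or SharpNLP discussed above; the example sentence is invented, and the small English model must be installed separately.

```python
# Minimal dependency-parsing sketch using spaCy (a substitute here for the
# Stanford Parser / SharpNLP mentioned in the text); the sentence is invented.
import spacy

nlp = spacy.load("en_core_web_sm")          # small English pipeline with a dependency parser
doc = nlp("The price of strawberries increased sharply last week.")

for token in doc:
    # print each word, its dependency relation, and the head word it depends on
    print(f"{token.text:<14}{token.dep_:<10}{token.head.text}")
```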
The Internet provides people with countless pages of information, many of which are repeated and redundant, which calls for de-duplication of documents[4]. For example, the credit center of the People's Bank of China receives massive profiles of loan applicants from different banks. In this case, duplicate information needs to be combined and integrated into a more complete version, which can be achieved by merging the data from various sources through similarity calculations. The semantic fingerprint is one practical method for document de-duplication. For semantic similarity calculation, this paper introduces methods such as the cosine of the angle between term vectors and the longest common substring algorithm. In a specific practice of plagiarism checking, you can generate a SimHash per sentence and then classify documents according to the generated document fingerprints.
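The following is a minimal sketch of the cosine-similarity measure mentioned above: documents are turned into term-frequency vectors and compared by the cosine of the angle between them. The two sample texts are invented.

```python
# Minimal sketch of cosine similarity for near-duplicate detection: documents
# are turned into term-frequency vectors and compared by the cosine of the
# angle between them. The two sample texts are invented.
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    vec_a, vec_b = Counter(text_a.split()), Counter(text_b.split())
    common = set(vec_a) & set(vec_b)
    dot = sum(vec_a[w] * vec_b[w] for w in common)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("the bank approved the loan", "the loan was approved by the bank"))
```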
Synonym replacement: "Nian Gao" is also known as "Xinjiang nut cake", and if someone does not understand one expression, it is still fine to switch to the other; it is therefore necessary to unify expressions. For example, you can replace "Internal Revenue Service" with "IRS", where IRS is the acronym of Internal Revenue Service. Likewise, the abbreviation and the full name of a company can be considered semantically identical; in general, short expressions can be replaced with long ones. For addresses, there is sometimes a variety of formulations and administrative region codes. One synonym replacement method here is to convert the Chinese number strings in door numbers into Arabic numerals, for example, "Building Number one in Ganjiakou" into "Building Number 1 in Ganjiakou". The rules for constructing abbreviations in Chinese are somewhat different, and there is no definite law about them at present. Hearing the question for the first time, almost everyone would make the conjecture that abbreviations are formed from the most core word of each component, such as "customs inspection" to "inspection" and "people's judge" to "judge". However, there are counterexamples as well, such as "ZIP code" to "ZIP", even though "code" is certainly the more core word there. Of course, these abbreviations have developed into common words and can be accepted into the corpus.
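A minimal sketch of the unification step described above follows: a small replacement table maps variant spellings to one canonical form. The table entries and the sample sentence are invented examples based on the text above.

```python
# Minimal sketch of synonym / abbreviation unification: a small replacement
# table maps variant spellings to one canonical form. The table entries are
# invented examples based on the text above.
import re

SYNONYMS = {
    "Internal Revenue Service": "IRS",
    "Number one": "Number 1",                 # Chinese-style number words -> Arabic numerals
}

def normalize(text):
    for variant, canonical in SYNONYMS.items():
        # replace whole occurrences of the variant, ignoring letter case
        text = re.sub(re.escape(variant), canonical, text, flags=re.IGNORECASE)
    return text

print(normalize("Building Number one in Ganjiakou, near the Internal Revenue Service office"))
```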
Information Extraction (IE) transforms the information contained in text into a tabular form through structured processing[5]. In an information extraction system, the input is the original text and the output is information points in a fixed format; integrating these extracted information points in a unified form is the main task of information extraction. For example, in "the price of strawberry increases and the price of cherry drops", when the semantic annotations of strawberry and cherry are both "fruit", fruit is the keyword. Information extraction technology is not used to fully understand the entire document, but only to analyze the relevant pieces of information contained in it. Information extraction also needs to complete the task of anaphora resolution, which concerns how to extract useful information from a web page and categorize it. For example, to find the collection of words whose edit distance from a word is less than k, the method of intersecting two finite state machines has been used.
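As a minimal sketch of turning the price-change sentence above into fixed-format information points, the following code collects each (item, direction) pair into a small table. The pattern and the sample sentence are illustrative only.

```python
# Minimal sketch of extracting fixed-format information points: each
# (item, direction) pair from price-change sentences is collected into a
# small table. The pattern and the sample sentence are illustrative only.
import re

PRICE_PATTERN = re.compile(r"the price of (\w+) (increases|drops)")

def extract_price_changes(text):
    return [{"item": item, "direction": direction}
            for item, direction in PRICE_PATTERN.findall(text)]

print(extract_price_changes("the price of strawberry increases and the price of cherry drops"))
# -> [{'item': 'strawberry', 'direction': 'increases'},
#     {'item': 'cherry', 'direction': 'drops'}]
```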
Figure 1 Flowchart of Keyword Extraction
Keyword extraction is an important task of text information processing. For example, it can be used to find hot topics in the news. The core subject term, similar to the keyword, has been widely used in many government documents. Besides, context-sensitive advertising systems may also use keyword extraction techniques, which rely on the term frequency within a document and the number of documents in which a term appears. TF stands for Term Frequency and IDF stands for Inverse Document Frequency. For example, if "of" appears in 40 documents out of 100, then DF (Document Frequency) is 40 and IDF is 1/40. If "of" appears 15 times in the first document, then TF*IDF(of) = 15 * 1/40 = 0.375. If another term, "anti-corruption", appears in 5 documents out of the 100, then its DF is 5 and IDF is 1/5. If "anti-corruption" appears 5 times in the first document, then TF*IDF(anti-corruption) = 5 * 1/5 = 1, so TF*IDF(anti-corruption) > TF*IDF(of).
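The following is a minimal sketch matching the worked TF*IDF example above, where IDF is taken simply as 1/DF as in the text (many systems use log(N/DF) instead). The document counts mirror the numbers above.

```python
# Minimal TF*IDF sketch matching the worked example above, where IDF is taken
# simply as 1/DF (many systems use log(N/DF) instead). The counts below
# mirror the numbers in the text.
def tf_idf(term_count_in_doc, docs_containing_term):
    idf = 1.0 / docs_containing_term           # simplified inverse document frequency
    return term_count_in_doc * idf

print(tf_idf(15, 40))   # "of":              15 * 1/40 = 0.375
print(tf_idf(5, 5))     # "anti-corruption":  5 * 1/5  = 1.0
```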
Approximate String Matching: This is mainly about how to suggest corrected words, and it is usually unnecessary to suggest words that users have never searched for. Since comparing the edit distance between the entered word and each correct word one by one is too slow, we need a way to find, from a large word list, the collection of words whose edit distance from the entered word is less than k. A finite state automaton (FSA) can be built that exactly accepts the set of strings within a given edit distance of the target word: you can enter any word, and the automaton decides to accept or reject it based on whether its edit distance from the target word is within the given bound. Furthermore, due to the inherent characteristics of an FSA, this check runs in O(n) time, where n is the length of the test string. By contrast, the standard dynamic programming computation of edit distance takes O(m*n) time, where m and n are the lengths of the two input words. As a result, an edit distance automaton can quickly check whether each of a number of words is within the given maximum distance of a target word.
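For reference, here is a minimal sketch of the standard dynamic-programming edit distance mentioned above, the O(m*n) baseline that the edit-distance automaton is meant to avoid; the test words are invented.

```python
# Minimal sketch of the standard dynamic-programming edit distance, which
# runs in O(m*n) time for input words of lengths m and n (the slower baseline
# that the edit-distance automaton described above is meant to avoid).
def edit_distance(a, b):
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

print(edit_distance("anti-corruption", "anti-coruption"))  # -> 1
```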
So-called automatic abstraction means having the computer automatically produce an abstract from the original document. For example, the size of a smartphone display is limited, so a much shorter summary of the news can be displayed on the phone. For long posts on a BBS, some users may ask for a summary. The simplest way to generate a summary is to return the first sentence; a slightly more complex way is to first identify the most important sentences and then generate a summary based on them.
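Along the lines just described, the following is a rough extractive-summary sketch: each sentence is scored by the frequency of the words it contains and the top-scoring sentences are returned. It is an illustration under simple assumptions, not a production summarizer, and the sample text is invented.

```python
# Rough extractive-summary sketch: score each sentence by the frequency of the
# words it contains and return the top-scoring ones in their original order.
import re
from collections import Counter

def summarize(text, max_sentences=1):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    word_freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        return sum(word_freq[w] for w in words) / (len(words) or 1)

    chosen = set(sorted(sentences, key=score, reverse=True)[:max_sentences])
    return " ".join(s for s in sentences if s in chosen)

print(summarize("Prices rose. Fruit prices rose sharply this week. The weather was mild."))
```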
Text classification lets the computer classify a collection of texts according to certain criteria. For example, Xiao Li is a football fan who likes to watch news about football, so a news recommendation system would automatically recommend football news to him by using text classification technology. A text categorization program classifies an unseen document into one or more known categories, such as categorizing news into domestic news and international news. Text classification technology can be used to categorize web pages, provide personalized news to users, and filter spam. Classifying a given document into one of two categories is called two-category classification, such as spam filtering, in which you only need to determine whether a message is or is not spam. Classifying into one of multiple categories is called multi-category classification; for example, the Chinese Library Classification catalogue classifies books into 22 basic categories. Text classification mainly includes two phases: the training phase and the forecasting phase. The training phase provides the basis for classification, which is called the classification model; the forecasting phase classifies new text according to the classification model. In the training phase, the text is usually segmented into words first, then the feature words used as the basis for classification are extracted, and finally the classified feature words and the related parameters are written into the model file. The step of extracting feature words is called feature extraction. In the early stage the Naive Bayesian method was often used for text classification, and later support vector machine (SVM) methods became the first choice. In addition, you can also perform cluster analysis on people's behavior: in the late 1990s the American professor S. Reiss showed, by factor analysis of more than 300 behaviors from over 2,300 subjects, that all human behaviors could be clustered into 15 categories.
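As a minimal sketch of two-category text classification with a Naive Bayes model, the following code uses scikit-learn; the tiny training set of football and finance news snippets is invented purely for illustration.

```python
# Minimal sketch of two-category text classification with a Naive Bayes model,
# using scikit-learn. The tiny training set (football vs. finance news) is
# invented purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "the striker scored twice in the football match",
    "the club signed a new goalkeeper before the season",
    "the central bank adjusted the interest rate",
    "quarterly profits and the stock price both fell",
]
train_labels = ["football", "football", "finance", "finance"]

# training phase: extract feature words and fit the classification model
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# forecasting phase: classify new, unseen text with the trained model
print(model.predict(["the team won the football cup final"]))
```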
Speech Recognition Technology[7], also known as Automatic Speech Recognition (ASR), is an interdisciplinary field closely related to people's lives and learning. Its goal is to convert the speaker's vocabulary content into computer-readable inputs, such as keys, binary encoding, or sequences of characters. For example, in the future when you call the customer service of a bank, you will be able to interact with the bank system directly in spoken language, instead of being made to follow a machine's instructions such as "please press 1 for Chinese"; this is voice interaction. When a beginner cannot write code, an experienced programmer can dictate the code for the beginner to type in. To save the programmer's time, speech recognition can transcribe the dictated code into text, and then the semantics can be recognized from the text, so that eventually the machine can communicate with us humans. After recognizing a picture, a child can say whether it shows a tiger or an elephant; the system will use speech recognition technology to determine whether the answer is correct, and give a tip automatically when it is not. It is not easy to do open speech recognition that can assist with the manual input of subtitles, similar to a voice input method. Julius is a high-performance, large-vocabulary speech recognition decoder for research and development, developed by Kyoto University in Japan together with a Japanese company. It is built on word N-grams and context-dependent HMM models. At present, the software has been applied to large-scale continuous speech recognition of Japanese and Chinese. There are two models in the Julius system: the language model and the acoustic model.
Figure 2 Speech Recognition Structure
Natural language processing technology includes many aspects, such as text classification, dialogue systems, machine translation, and so on. The query function people often use is supported by search engine technology; when users enter longer questions in search engines, the computer should be able to give accurate answers. In almost every future we see in film and television, search engines have developed to work just like human assistants, with which any complex problem about anything can be answered simply. However, even though Internet search engines can already navigate a very large scope of knowledge, there is still a long way to go before we have such a smart assistant.