Biomedical Relation Extraction Based on Word Representation and Deep Learning
Published: 2018-06-24 09:02
Topics: word representation + deep learning; Source: Dalian University of Technology, 2016 doctoral dissertation
[Abstract]: Protein relation extraction and drug relation extraction are important for building biomedical databases, for life science research, for drug development, and for disease prevention and treatment. Much current research on biomedical relation extraction focuses on selecting feature sets and designing kernel functions. After more than ten years of development, feature-based and kernel-based methods are relatively mature, and the room for further improvement is limited. To improve performance further, this dissertation studies extraction methods based on word representation and deep learning. Deep learning makes it possible to build deeper relation extraction models that improve extraction performance, while word representation encodes semantic information in word vectors and is a prerequisite for deep learning. The main contributions are as follows.
First, a word representation model is designed for the characteristics of biomedical text. Building on traditional word representation models, it fuses five kinds of information (word form, part of speech, stem, syntactic chunk, and biomedical named entity) to strengthen the semantic representation capacity of the word vectors. The model achieves good results on protein relation extraction, drug relation extraction, and other tasks. This confirms the value of incorporating rich information such as part of speech and entities into word representations, and it provides a sound word representation basis for deep-learning-based relation extraction.
Second, for binary protein relation extraction, an extraction model based on instance representation is proposed to overcome the dependence of traditional methods on features and kernel functions. The model has three components: word vectors, skeleton features, and feature combination. On larger corpora its extraction performance is on par with current advanced methods, which confirms the effectiveness of word representation and deep learning for protein relation extraction. The model takes the characteristics of protein relation instances into account, uses word vectors as input, and combines them with skeleton features and vector combination, so that the instance representation incorporates rich semantic information.
Third, for multi-class drug relation extraction, a two-stage method is proposed. In the first stage, instance representation is combined with syntactic features, and a logistic regression classifier identifies positive drug relation instances. In the second stage, a long short-term memory (LSTM) network classifies the positive instances into four drug relation types. To improve second-stage performance, the influence of several factors on the LSTM network is examined in terms of importance, implementation cost, and computational cost. Experiments show that word vectors, distance vectors, part-of-speech vectors, and a two-layer bidirectional LSTM network all improve second-stage classification performance, and they are important reasons why the two-stage method achieves good results.
In summary, for binary relation extraction between proteins and multi-class relation extraction between drugs, this dissertation proposes extraction methods that use representation and deep learning techniques. These methods overcome, to some extent, the limitations of feature-based and kernel-based methods and achieve good results. Word representation and deep learning have been active research topics in recent years but were adopted relatively late in biomedical text mining. The methods proposed here achieve some success on biomedical relation extraction tasks, which confirms their effectiveness and indicates that word representation and deep learning leave broad room for research in biomedical text mining and merit further exploration in future work.
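The two-stage drug relation method described in the abstract can be sketched as a simple cascade. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the weights, feature vectors, and the label names in `RELATION_TYPES` are hypothetical, and `toy_stage_two` is a placeholder standing in for a trained LSTM classifier.

```python
import math
from typing import Callable, Optional, Sequence

# Hypothetical placeholder names for the four drug relation types.
RELATION_TYPES = ["type_1", "type_2", "type_3", "type_4"]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_stage_one(
    weights: Sequence[float], bias: float, threshold: float = 0.5
) -> Callable[[Sequence[float]], bool]:
    """Build a binary logistic-regression filter (stage 1).

    The returned function reports True when a candidate drug pair is
    predicted to be a positive relation instance.
    """
    def is_positive(features: Sequence[float]) -> bool:
        score = sum(w * x for w, x in zip(weights, features)) + bias
        return sigmoid(score) >= threshold

    return is_positive


def toy_stage_two(tokens: Sequence[str]) -> str:
    """Placeholder for stage 2. NOT a real model.

    A trained LSTM would map the token sequence to one of the four
    relation types; this stub only picks a label deterministically.
    """
    return RELATION_TYPES[len(tokens) % len(RELATION_TYPES)]


def two_stage_predict(
    features: Sequence[float],
    tokens: Sequence[str],
    stage_one: Callable[[Sequence[float]], bool],
    stage_two: Callable[[Sequence[str]], str],
) -> Optional[str]:
    """Run the cascade: filter negatives first, then assign a type."""
    if not stage_one(features):
        return None  # rejected in stage 1: no relation predicted
    return stage_two(tokens)


if __name__ == "__main__":
    stage_one = make_stage_one(weights=[2.0, -1.0], bias=-0.5)
    sentence = ["drug_a", "inhibits", "drug_b"]
    print(two_stage_predict([1.0, 0.0], sentence, stage_one, toy_stage_two))
    print(two_stage_predict([0.0, 1.0], sentence, stage_one, toy_stage_two))
```

Separating the stages this way means the multi-class classifier only sees instances that passed the binary filter, which is the division of labor the abstract describes.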
[Degree-granting institution]: Dalian University of Technology
[Degree level]: Doctorate
[Year degree awarded]: 2016
[Classification number]: TP391.1
Article ID: 2060941
Article link: http://www.wukwdryxk.cn/shoufeilunwen/xxkjbs/2060941.html