Web Page Main-Text Extraction and Bilingual Website Detection Based on Multi-Feature Fusion
Published: 2019-04-10 12:57
[Abstract]: With the rapid development of the Internet, the volume of online information is growing exponentially, but its quality is uneven, and obtaining information accurately, quickly, and comprehensively is becoming increasingly difficult. Strong information extraction capability has therefore attracted considerable attention, and the sheer accumulation of information brings both new opportunities and new challenges for information extraction technology. Meanwhile, with the rapid progress of natural language processing, machine translation has become increasingly practical in everyday life: products such as Youdao Translation, Google Translate, and Baidu Translate have become important tools for non-specialists who study or work with foreign-language material.

Bilingual corpora are the foundation of machine translation. They are the key data for training, testing, and analyzing machine translation models. The quantity and quality of a bilingual corpus directly affect the training of machine translation parameters and, to a large extent, the performance of downstream machine translation products. Building a large, high-quality bilingual corpus therefore has great practical and academic value for machine translation, natural language processing, and related problems.

This thesis designs and implements a high-performance, efficient bilingual text extraction system. (The system is a subsystem of an Internet bilingual corpus harvesting system; it does not include the crawler or sentence alignment.) The research covers two topics: web page main-text extraction and bilingual web page detection.

For main-text extraction, the thesis uses multi-feature fusion. Unlike traditional approaches that build a DOM tree, it processes a page with a linearized reconstruction based on container tags, so that algorithms that would otherwise require tree operations reduce to processing a linear list. It then combines several features, including length, word segmentation results, and sentence count, to trace the main-text region, and obtains the main text through clustering based on information gain. For bilingual web page detection, it computes a mutual-translation rate based on local sentence anchor search to judge whether the bilingual text found in the extracted main text is mutually translated. On this basis, the thesis adds an auxiliary main-text judgment algorithm that uses features such as named-entity overlap and pronoun ratio, together with an automatic template generation algorithm based on a large number of pages from the same website, to improve accuracy.

The resulting main-text extraction and bilingual web page detection system performs at the top level of current work in the field. This system and its downstream processing system produced a 30-million-scale Chinese-English bilingual corpus, which passed strict testing by the Software Evaluation Center of the Heilongjiang Electronic Information Products Supervision and Inspection Institute with an accuracy above 95%. The experimental results also confirm the effectiveness of the proposed multi-feature fusion method for bilingual corpus mining.
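The container-tag linearization and multi-feature judgment described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The set of container tags, the feature weights, the regex tokenizer (a crude stand-in for real word segmentation), and the fixed score threshold are all assumptions made for illustration. In particular, the final selection uses a plain threshold instead of the information-gain clustering mentioned above, whose details the abstract does not give.

```python
import re
from html.parser import HTMLParser

# Assumed tag sets; the abstract does not list which tags count as containers.
CONTAINER_TAGS = {"div", "p", "td", "li", "table", "tr", "ul", "ol", "section", "article"}
SKIP_TAGS = {"script", "style"}


class ContainerLinearizer(HTMLParser):
    """Flatten a page into an ordered list of text blocks, cut at container tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished text blocks, in document order
        self._buffer = []  # text fragments of the block being built
        self._skip = 0     # nesting depth inside <script>/<style>

    def flush(self):
        # Close the current block, normalizing whitespace and dropping empties.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in CONTAINER_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in CONTAINER_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip:
            self._buffer.append(data)


def linearize(html):
    """Return the page as a flat list of text blocks (no tree is built)."""
    parser = ContainerLinearizer()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.blocks


def features(text):
    """Length, token count, and sentence count of one block."""
    sentences = [s for s in re.split(r"[.!?。!?]+", text) if s.strip()]
    tokens = re.findall(r"\w+", text)  # stand-in for a real word segmenter
    return {"length": len(text), "tokens": len(tokens), "sentences": len(sentences)}


def score(feat):
    # Hypothetical weights, chosen only to make the example concrete.
    return 0.01 * feat["length"] + 0.05 * feat["tokens"] + 1.0 * feat["sentences"]


def extract_main_text(html, threshold=2.0):
    """Keep blocks whose fused feature score reaches the (assumed) threshold."""
    return [b for b in linearize(html) if score(features(b)) >= threshold]


SAMPLE_PAGE = """<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div><p>The crawler fetched the page. The extractor then split it into blocks.</p>
<p>Each block was scored by several features. Long blocks with many sentences scored highest.</p></div>
<div>Copyright 2014</div>
</body></html>"""

if __name__ == "__main__":
    for block in extract_main_text(SAMPLE_PAGE):
        print(block)
```

Because the page is reduced to a flat list, each later step is a single pass over that list rather than a tree traversal, which is the simplification the abstract points to.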
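The mutual-translation rate based on local sentence anchor search can be sketched in the same spirit. The abstract names the technique but gives no formula, so everything below is an assumption. Sentences are taken as pre-tokenized. A small hypothetical bilingual lexicon decides whether two sentences match. Each source sentence is compared only with target sentences in a window around its proportional position. The rate is defined here as the fraction of source sentences that find such an anchor.

```python
def sentence_score(src_tokens, tgt_tokens, lexicon):
    """Fraction of source tokens with a lexicon translation in the target sentence."""
    if not src_tokens:
        return 0.0
    target = set(tgt_tokens)
    hits = sum(1 for word in src_tokens if lexicon.get(word, set()) & target)
    return hits / len(src_tokens)


def mutual_translation_rate(src_sents, tgt_sents, lexicon, window=2, min_score=0.5):
    """Share of source sentences that find an anchor in a local target window.

    `window` and `min_score` are illustrative parameters, not values from the thesis.
    """
    if not src_sents or not tgt_sents:
        return 0.0
    ratio = len(tgt_sents) / len(src_sents)
    anchors = 0
    for i, src in enumerate(src_sents):
        centre = int(i * ratio)  # expected position in the target text
        lo = max(0, centre - window)
        hi = min(len(tgt_sents), centre + window + 1)
        best = max(sentence_score(src, tgt_sents[j], lexicon) for j in range(lo, hi))
        if best >= min_score:
            anchors += 1
    return anchors / len(src_sents)


# Hypothetical toy lexicon and pre-tokenized sentences.
LEXICON = {
    "网页": {"web", "page", "webpage"},
    "正文": {"text", "body"},
    "提取": {"extract", "extraction"},
    "双语": {"bilingual"},
    "语料": {"corpus", "corpora"},
}
ZH = [["网页", "正文", "提取"], ["双语", "语料"]]
EN_PARALLEL = [["web", "page", "text", "extraction"], ["bilingual", "corpus"]]
EN_UNRELATED = [["contact", "us"], ["site", "map"]]

if __name__ == "__main__":
    print(mutual_translation_rate(ZH, EN_PARALLEL, LEXICON))
    print(mutual_translation_rate(ZH, EN_UNRELATED, LEXICON))
```

A page pair would then be accepted as bilingual when the rate exceeds some cutoff. Restricting the search to a local window keeps the cost roughly linear in the number of sentences instead of comparing every sentence pair.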
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP393.092
Article ID: 2455815
Article link: http://www.wukwdryxk.cn/guanlilunwen/ydhl/2455815.html