Research on Text Classification Algorithms Based on the MapReduce Model

Published: 2018-10-18 16:41

[Abstract]: As networks grow and the volume of information keeps increasing, text classification in a centralized environment can no longer meet current needs, so large-scale data processing in distributed environments has become a focus of the IT industry. Fields such as advertising and information retrieval both require text classification over large-scale data, which makes large-scale text classification in cloud computing environments an active research topic. Working on the Hadoop platform, and building on the inverted index tree structure designed in this thesis, this thesis studies text classification algorithms and their incremental variants. The main results, contributions, and innovations can be summarized as follows:
1. To address the computation speed of feature selection methods, the needs of text classification algorithms such as KNN and Bayes, and the sparsity of text vector dimensions, the thesis presents an inverted index tree structure and parallelizes it on a cloud platform.
2. Combining the inverted index tree structure with text classification algorithms, the thesis gives a construction algorithm and pruning strategy for inverted index trees over massive data, together with an incremental inverted index tree algorithm and its parallel design.
3. Based on the inverted index tree structure, the thesis designs an incremental K-means classification algorithm and gives its parallel design on the Hadoop platform.
4. Based on the inverted index tree structure, the thesis proposes a naive Bayes classification algorithm for the Hadoop cloud computing platform and gives three improved variants of it: naive Bayes text classification weighted by TFIDF, by mutual information, and by expected cross entropy. It also gives a local naive Bayes text classification algorithm based on the inverted index tree.
5. A Hadoop cluster was built for experimental analysis, which verified the classification accuracy, recall, and classification performance of the inverted index tree structure and the improved text classification algorithms.
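To make the map and reduce roles in index construction concrete, here is a minimal sketch in Python. It builds a plain inverted index (term to postings list), not the inverted index tree designed in the thesis, whose layout the abstract does not describe. The function names (`map_phase`, `shuffle`, `reduce_phase`, `build_inverted_index`), the whitespace tokenizer, and the toy corpus are all assumptions made for illustration, and the three phases run in a single process rather than on Hadoop.

```python
from collections import Counter, defaultdict


def map_phase(doc_id, text):
    """Map: emit (term, (doc_id, term_frequency)) pairs for one document."""
    counts = Counter(text.lower().split())  # assumed whitespace tokenizer
    return [(term, (doc_id, tf)) for term, tf in counts.items()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a MapReduce framework would."""
    grouped = defaultdict(list)
    for term, posting in pairs:
        grouped[term].append(posting)
    return grouped


def reduce_phase(term, postings):
    """Reduce: merge one term's postings into a list sorted by document id."""
    return term, sorted(postings)


def build_inverted_index(docs):
    """Run map, shuffle, and reduce in-process over a {doc_id: text} corpus."""
    pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
    return dict(reduce_phase(t, ps) for t, ps in shuffle(pairs).items())


# Hypothetical toy corpus, used only to exercise the sketch.
docs = {1: "hadoop text classification", 2: "text text index"}
index = build_inverted_index(docs)
```

Each entry of `index` maps a term to its `(doc_id, term_frequency)` postings. Per-term statistics of this kind are what weighting schemes such as TFIDF are computed from, which is one plausible reason an index-based structure suits the sparse text vectors the abstract mentions.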
[Degree-granting institution]: Liaoning University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.1


Article ID: 2279728
