Research on Text Classification Algorithms Based on the MapReduce Model
[Abstract]: As networks grow and the volume of information increases, text classification in a centralized environment can no longer meet demand, and large-scale data processing in distributed environments has become a focus of the IT industry. Fields such as advertising and information retrieval need to classify large-scale data, so text classification of large-scale data in cloud computing environments has become an active research topic. This thesis proposes an inverted index tree structure and, building on it, studies text classification algorithms and their incremental variants on the Hadoop system. The main results, contributions, and innovations are as follows:
1. To meet the speed requirements of feature selection and of the KNN and Bayes text classification algorithms, and to cope with the sparse distribution of text vector dimensions, an inverted index tree structure is proposed and parallelized on a cloud platform.
2. Combining the inverted index tree structure with text classification algorithms, a construction algorithm and a pruning strategy for inverted index trees over massive data are presented, together with an incremental inverted index tree algorithm and its parallel design.
3. Based on the inverted index tree structure, a K-means incremental classification algorithm is designed, and its parallel design on the Hadoop platform is given.
4. Based on the inverted index tree structure, a naive Bayes classification algorithm for the Hadoop cloud computing platform is proposed, along with three improved variants: naive Bayes weighted by TFIDF, by mutual information, and by expected cross-entropy. A local naive Bayes text classification algorithm based on the inverted index tree is also presented.
A Hadoop cluster was built for experimental analysis to verify the classification accuracy, recall, and performance of the inverted index tree structure and the improved text classification algorithms.
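As a rough illustration of the indexing idea the abstract builds on, the following is a minimal sketch of building a term-to-document index in map, shuffle, and reduce steps. It is an assumption-laden example, not the thesis's method: it builds a plain flat inverted index rather than the inverted index tree (whose layout is not described here), the function names `map_phase`, `shuffle`, `reduce_phase`, and `build_inverted_index` are hypothetical, and all three phases run in-process in Python instead of on a Hadoop cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Emit (term, (doc_id, 1)) pairs for one document, as a mapper would."""
    for term in text.lower().split():
        yield term, (doc_id, 1)


def shuffle(pairs):
    """Group mapper output by term, imitating the framework's shuffle step."""
    grouped = defaultdict(list)
    for term, value in pairs:
        grouped[term].append(value)
    return grouped


def reduce_phase(term, values):
    """Sum per-document counts to form the posting list for one term."""
    counts = defaultdict(int)
    for doc_id, n in values:
        counts[doc_id] += n
    return term, dict(sorted(counts.items()))


def build_inverted_index(docs):
    """Run map, shuffle, and reduce in-process over a {doc_id: text} dict."""
    pairs = chain.from_iterable(map_phase(d, t) for d, t in docs.items())
    grouped = shuffle(pairs)
    return dict(reduce_phase(term, values) for term, values in grouped.items())


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    docs = {
        "d1": "hadoop text classification",
        "d2": "text classification with naive bayes",
        "d3": "hadoop cluster",
    }
    index = build_inverted_index(docs)
    for term in sorted(index):
        print(term, index[term])
```

Each posting list maps a document ID to the term's frequency in that document, which is the kind of statistic that TFIDF weighting and naive Bayes term likelihoods are computed from. On a real cluster the shuffle step is done by the framework, and only the mapper and reducer would be user code.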
【Degree-granting institution】: Liaoning University
【Degree level】: Master's
【Year degree granted】: 2013
【Classification number】: TP391.1