Research on Feature Weighting Algorithms and Text Representation Strategies in Text Classification

Published: 2018-07-11 18:10

Topic: machine learning + text classification. Source: Northeast Normal University, doctoral dissertation, 2016

[Abstract]: Data has permeated every industry and become an important factor of production. With the arrival of the big-data era, demand for text information processing technology keeps growing, and manual management can no longer meet society's needs. Automatic text classification has therefore become increasingly important and is now an active research topic. After analyzing and summarizing the text classification framework, text representation models, text preprocessing, feature selection, feature extraction, feature weighting, text classifiers, and the evaluation of classification performance, this dissertation studies text feature weighting and text representation strategies in depth. It proposes two feature weighting algorithms for balanced datasets and one for imbalanced datasets, three supervised feature weighting algorithms in total. It also proposes an optimal text representation strategy for supervised feature weighting algorithms. The results obtained so far are as follows:

1. A feature weighting algorithm based on category information. For most text classifiers that use the vector space model, feature weighting has long been the bottleneck of classification, and its quality directly affects classifier performance. Building on an analysis of traditional feature weighting algorithms, the dissertation proposes a new one. By converting word-based features into category-based features, it reduces the feature dimension of a dataset from the original thousands or tens of thousands to the number of categories in the dataset, so the feature representation matrix is no longer sparse. Compared with other feature weighting methods, the proposed method improves classification accuracy and also reduces classification time.

2. A feature weighting algorithm based on class space density. Building on the inverse class frequency method used in traditional feature weighting, the dissertation introduces class space density and, from it, the inverse class space density frequency. When measuring a feature's discriminative power, features that have the same class frequency but different document frequencies at that class frequency can be assigned different weights. The method reflects the importance of a feature for classification more objectively and improves the distribution of samples in feature space, making samples of the same class more compact and samples of different classes more dispersed. Replacing the inverse class frequency parameter in the tf*icf and icf-based methods with the proposed inverse class space density frequency yields two new feature weighting algorithms, tf*ICSDF and ICSDF-based. Experimental results show that these algorithms achieve good text classification performance.

3. A feature weighting algorithm for imbalanced datasets. Common feature weighting algorithms often fail to achieve the expected results on imbalanced datasets, mainly because of the particular data distribution of such datasets. After analyzing the characteristics of that distribution, the dissertation proposes a feature weighting algorithm for imbalanced datasets. The algorithm combines the probability that a feature occurs in positive-class documents with the probability that it occurs in negative-class documents to measure how important each feature is for classification, and assigns feature weights accordingly. In the experiments, the proposed tf*WID algorithm is compared with four common feature weighting algorithms (tf*idf, tf*ig, tf*chi2, and tf*or) on two imbalanced datasets, WebKB and Yahoo! Answers (100-1000), using the Rocchio classifier and the support vector machine classifier, in terms of micro-averaged F1 and macro-averaged F1. The results show that the proposed algorithm effectively improves classification performance on imbalanced datasets.

4. An optimal text representation strategy for supervised feature weighting methods. After analyzing the traditional text representation strategies (the global strategy and the local strategy), the dissertation proposes, on the basis of the vector space model, an optimal text representation strategy for supervised feature weighting methods. The method follows the idea of searching for the best model on the training set: from the feature weighting vectors of all categories, it obtains the one that is optimal for the training set and applies it to the test set, which yields the optimal text representation of the test set. The method is validated on two datasets, the balanced 20Newsgroups and the imbalanced Reuters-21578. In the experiments, two common supervised feature weighting methods (tf*or and tf*rf) weight the feature matrices of both datasets, the proposed method finds the optimal feature weighting vector on the training set and applies it to the test set, and a support vector machine classifier performs the classification. Experimental results show that the proposed strategy effectively improves classification performance.
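The idea in item 2, that two features with the same class frequency can still deserve different weights, can be illustrated with a small sketch. The abstract does not give the dissertation's formulas, so everything here is an assumption made for illustration: `icf` uses one common form, log(1 + C / cf), and `icsdf` replaces the class count cf with the sum of per-class densities (documents of the class containing the term divided by documents in the class). The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict


def class_doc_frequencies(docs, labels):
    """Return (documents per class, per-class document frequency of each term)."""
    n_docs = Counter(labels)
    df = defaultdict(Counter)  # df[c][t] = number of class-c documents containing t
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            df[label][term] += 1
    return n_docs, df


def icf(term, n_docs, df):
    """Inverse class frequency, log(1 + C / cf); cf = number of classes containing the term.

    The term must occur in the training data, otherwise cf is zero.
    """
    cf = sum(1 for c in n_docs if df[c][term] > 0)
    return math.log(1 + len(n_docs) / cf)


def icsdf(term, n_docs, df):
    """ASSUMED density variant: cf is replaced by the summed class densities df_c / N_c."""
    density_sum = sum(df[c][term] / n_docs[c] for c in n_docs)
    return math.log(1 + len(n_docs) / density_sum)


def tf_weighted(tokens, factor, n_docs, df):
    """Weight each term of one document as raw term frequency times the given factor."""
    return {t: tf * factor(t, n_docs, df) for t, tf in Counter(tokens).items()}


# Hypothetical toy corpus: "goal" is in every sports document, "match" in only one.
docs = [["goal", "match"], ["goal", "referee"], ["goal"],
        ["cpu", "kernel"], ["cpu"], ["cpu", "cache"]]
labels = ["sports"] * 3 + ["tech"] * 3
n_docs, df = class_doc_frequencies(docs, labels)

for term in ("goal", "match"):
    print(term, round(icf(term, n_docs, df), 4), round(icsdf(term, n_docs, df), 4))
```

Both terms occur in exactly one class, so the class-frequency factor cannot tell them apart, while the density-based factor can because their document frequencies inside that class differ. Under this assumed formula the term that is rarer within its class receives the larger factor; the dissertation's actual definition may behave differently.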
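The experiments in item 3 report both micro-averaged and macro-averaged F1, and the difference matters on imbalanced data. The following is a minimal sketch of the standard definitions for single-label classification; the function name `f1_scores` and the toy labels are hypothetical and are not taken from the dissertation.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    classes = set(y_true) | set(y_pred)
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return micro, macro


# Hypothetical imbalanced example: 8 majority documents, 2 minority documents,
# with one minority document misclassified as majority.
y_true = ["majority"] * 8 + ["minority"] * 2
y_pred = ["majority"] * 9 + ["minority"]
micro, macro = f1_scores(y_true, y_pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
# prints: micro-F1 = 0.900, macro-F1 = 0.804
```

Micro-averaging is dominated by the majority class, while macro-averaging gives each class equal weight, so a single error on the minority class lowers the macro value much more. This is a plausible reason for reporting both measures on imbalanced datasets.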
[Degree-granting institution]: Northeast Normal University
[Degree level]: Doctoral
[Year degree granted]: 2016
[Classification number]: TP391.1

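Related doctoral dissertations (top 1)

1 Jia Longjia; Research on Feature Weighting Algorithms and Text Representation Strategies in Text Classification [D]; Northeast Normal University; 2016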
