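Topic-Feature Relation Extraction from Domain UGC Text and Its Applications

Published: 2018-05-08 08:48

  Topics: domain text + UGC; Source: University of Electronic Science and Technology of China, doctoral dissertation, 2016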


[Abstract]: In the Web 2.0 era, social media has made users both consumers and publishers of information. New data is generated online at every moment, network data resources have accumulated in bulk, and we have entered the era of big data. Big data is a double-edged sword: alongside its enormous value, the sheer volume of data and the variety of data structures pose major challenges for information processing. Text is one of the oldest ways of storing information, and UGC (user-generated content) text makes up a large share of network data resources. Massive UGC text contains rich information, especially domain-specific information. In recent years, text mining has been applied as a powerful tool in natural language processing research to extract useful information from documents. However, because its authors vary widely in skill, UGC text tends to be casual in expression and non-standard in writing, which makes information extraction from massive UGC text very challenging. In addition, traditional information extraction methods produce numerous, tangled information relations that are hard for users to understand. In an era of information explosion, mined information must match user needs and be easy to understand and remember. It is therefore important to extract information from UGC text by topic and, using the relationships among multiple topics, to build an information extraction and management system driven by user needs. On this basis, the dissertation studies information extraction from massive UGC text and its applications. The research content and conclusions are as follows.

(1) Compound new word discovery based on dependencies between word units. Word segmentation quality determines the quality of the final text mining results. Because traditional segmentation software handles compound new words in UGC text poorly, the dissertation proposes a statistical compound new word discovery method (FPSMC) that requires neither a dictionary nor prior corpus training. The method first mines candidate compound new words using sequential frequent patterns, then filters them by computing each candidate's sequence max-confidence, and iterates until it obtains the compound new words present in the text. Experiments show that FPSMC extracts compound new words well on UGC text datasets. Compared with other compound new word extraction algorithms, FPSMC is better at finding the named entities among compound new words, such as person names, place names, organization names, proper nouns, and time expressions. Named entities are usually topic words in UGC text, so FPSMC's extraction performance helps reveal the behavioral preferences that users express in UGC text datasets and lays a foundation for the subsequent topic identification, feature extraction, and business application analysis.

(2) Topic boundary delimitation and feature word extraction in domain text. Topics are important information elements implicit in UGC text, and organizing UGC text by topic lets users obtain its information more conveniently and completely. In traditional topic extraction, the results are often disturbed by common hot words, and many of the mined topic-related features are coarse-grained, generalized features. The dissertation therefore proposes a new document data association analysis method that identifies "hot topic words and topic boundaries" from massive UGC, then segments the UGC text along the hot topic boundaries and finds the "local feature words" associated with each hot topic word. Experiments show that the proposed TVS algorithm effectively screens out interference from high-frequency words and extracts a domain's hot topic words and their local features from large-scale web text data. Adaptability and scalability experiments further show that the algorithm applies to different types of text datasets, can be implemented with parallel computing, and also maintains good mining performance on a single computer.

(3) Application of multi-topic relations and their feature extraction in UGC text. Traditional topic discovery and extraction methods have difficulty identifying and untangling the relationships between topics in UGC text. Yet these relationships themselves carry information, and they help information users understand and absorb that information. Using travel blog text data and the corresponding multi-topic relation and feature extraction methods, the dissertation mines popular tourist attraction topics, the local features of those topics, and the relationships among them, and on this basis builds a tourism information extraction and management system driven by traveler needs. Starting from three needs that travelers face ("where to go", "what to do", and "how to get there"), the system comprises four modules: travel blog text preprocessing, extraction of popular attractions and their TOI, regionalization of popular attractions, and travel route discovery and recommendation. Together these modules address the three needs. The dissertation runs illustrative experiments for each module on a Beijing travel blog dataset and presents the results with visualization techniques. The experiments show that the system effectively extracts the travel information travelers need from large-scale travel blog text data and helps travelers plan their trips.
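The mine, filter, and iterate loop described for FPSMC in item (1) can be illustrated with a small Python sketch. This is not the dissertation's implementation: the abstract gives no formulas, so the sketch assumes that candidates are adjacent token pairs, that "max-confidence" is the standard association measure max(count(ab)/count(a), count(ab)/count(b)), and that accepted pairs are merged greedily from left to right. The names `find_compounds` and `merge_pairs` and the parameters `min_support`, `min_conf`, `max_rounds`, and `joiner` are hypothetical.

```python
from collections import Counter


def merge_pairs(tokens, accepted, joiner):
    """Greedily merge accepted adjacent pairs, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in accepted:
            out.append(tokens[i] + joiner + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def find_compounds(docs, min_support=2, min_conf=0.8, max_rounds=5, joiner=""):
    """Return the merged tokens left in the corpus after iterating.

    docs is a list of token lists (already split into word units).
    Assumed scoring: max(count(ab) / count(a), count(ab) / count(b)).
    """
    docs = [list(d) for d in docs]
    original_vocab = {tok for d in docs for tok in d}
    for _ in range(max_rounds):
        unigrams = Counter(tok for d in docs for tok in d)
        bigrams = Counter(pair for d in docs for pair in zip(d, d[1:]))
        accepted = set()
        for (a, b), n in bigrams.items():
            if n < min_support:
                continue  # not a frequent sequential pattern
            max_conf = max(n / unigrams[a], n / unigrams[b])
            if max_conf >= min_conf:
                accepted.add((a, b))
        if not accepted:
            break  # nothing left to merge, stop iterating
        docs = [merge_pairs(d, accepted, joiner) for d in docs]
    # Compounds are the tokens that did not exist before merging.
    return {tok for d in docs for tok in d} - original_vocab
```

Because merged tokens are fed back into the next round, a three-unit compound can emerge from two successive pair merges, which is one plausible reading of the "iterates" step in the abstract. Note that with this assumed measure a rare unit next to a very frequent one also scores highly, so real use would need additional filtering.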

[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Doctorate
[Year conferred]: 2016
[Classification number]: TP391.1; F592

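[Similar literature]

Related journal articles (top 10)

1 郎宇潔; Research on Developing UGC-Oriented Network Information Resources [J]; 科技創業月刊; 2012, No. 07

2 張建; 李益; Mobile UGC: A New Stage for Aesthetic Creation [J]; 新聞愛好者; 2012, No. 23

3 汪科科; UGC Video: Be the Director of Your Own Life [J]; 數碼影像時代; 2012, No. 02

4 王劍; The Performance and Response of Traditional Media in the UGC Context [J]; 視聽縱橫; 2012, No. 04

5 張博; 任殿順; The Value of UGC and Its Publishing Applications in the Context of Big Data [J]; 科技與出版; 2014, No. 03

6 仲釔霏; 杜志紅; Passivity and Initiative of Television Media in the UGC Era [J]; 視聽界; 2013, No. 02

7 王光文; On the Duty of Care of Video-Site UGC Operators Regarding Copyright Infringement [J]; 國際新聞界; 2012, No. 03

8 呂尚彬; Value UGC and Encourage Users to Share and Create [J]; 新聞戰線; 2013, No. 07

9 鄭無邊; The Circus That Does Not Exist: Notes on Text UGC on the Internet [J]; 數碼影像時代; 2013, No. 04

10 董平; UGC Will Be the Next Hot Spot of the Mobile Internet [J]; 通信世界; 2008, No. 05

Related major newspaper articles (top 4)

1 Staff reporter 傅盛裕; UGC, the Fan Economy, Author Marketing, and More [N]; 文匯報; 2014

2 Reporter 高少華; Video Sites Bet on a UGC Strategy [N]; 經濟參考報; 2013

3 馬斌; Mobile UGC Services Will Rise with 3G [N]; 人民郵電; 2008

4 Staff reporter 魏蔚; The Video Industry Borrows the E-Commerce Model to Run UGC Content for Profit [N]; 北京商報; 2012

Related doctoral dissertations (top 1)

1 徐華林; Topic-Feature Relation Extraction from Domain UGC Text and Its Applications [D]; University of Electronic Science and Technology of China; 2016

Related master's theses (top 3)

1 李莎; An Analysis of Tourist Destination Attractiveness Based on UGC [D]; Harbin Institute of Technology; 2011

2 嚴瑤; A Study of Audience Roles in User-Generated Content (UGC) [D]; Central China Normal University; 2014

3 藍勤華; A Study of Motivations for User-Generated Content (UGC) [D]; Nanjing University; 2011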



Article ID: 1860744


Article link: http://www.wukwdryxk.cn/shoufeilunwen/jjglbs/1860744.html

