Literature Record Analysis Algorithm for HTML Pages
Published: 2019-04-26 00:39
[Abstract]: To help publishers promptly find the literature they need among large numbers of web pages, an algorithm is needed that automatically extracts literature information from Hypertext Markup Language (HTML) pages. To this end, a literature record analysis algorithm based on conditional random fields (CRF) is designed. First, a segmentation algorithm for the document object tree is designed; it uses separator tags to divide the page data into independent blocks, each consisting of a sequence of tags and text. Next, this sequence is used as the feature vector of a conditional random field model to build a literature information labeling model. Finally, a heuristic algorithm is designed to extract the literature information from the labeling model, and experiments verify its effectiveness.
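[Author affiliations]: School of Information Engineering, Beijing Institute of Graphic Communication; Postdoctoral Research Station in Computer Science and Technology, Tsinghua University; Radio and Television Satellite Direct Broadcast Management Center, State Administration of Press, Publication, Radio, Film and Television
[Funding]: Beijing Municipal Education Commission Science and Technology Innovation Service Capacity Building Project (PXM2016_014223_000025); Beijing Institute of Graphic Communication University-Level Key Project (ea201507); Beijing Institute of Graphic Communication Faculty Development: Doctoral Start-up Fund Project (27170116005/062); Beijing Institute of Graphic Communication Research Project: Publication Data Asset Evaluation Laboratory Construction Project (20190116005/006)
[CLC number]: TP393.092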
Article ID: 2465603
Article link: http://www.wukwdryxk.cn/guanlilunwen/ydhl/2465603.html
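
The three stages named in the abstract (block segmentation, sequence features for a CRF, heuristic extraction) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the separator tag set, the feature names, the label names (`AUTHOR`, `TITLE`, `YEAR`) and every function name are assumptions made for illustration, and the CRF itself is not trained here, so the labels are supplied by hand where a trained model would predict them.

```python
from html.parser import HTMLParser

# Assumed separator tags; the abstract does not list the ones actually used.
SEPARATOR_TAGS = {"li", "p", "tr", "br", "hr", "div"}


class BlockSegmenter(HTMLParser):
    """Split a page into blocks of ("tag", name) and ("text", string) tokens."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = []

    def _flush(self):
        # Keep a block only if it contains at least one text token.
        if any(kind == "text" for kind, _ in self._current):
            self.blocks.append(self._current)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in SEPARATOR_TAGS:
            self._flush()          # a separator closes the current block
        else:
            self._current.append(("tag", tag))

    def handle_endtag(self, tag):
        if tag in SEPARATOR_TAGS:
            self._flush()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self._current.append(("text", text))

    def close(self):
        super().close()
        self._flush()              # emit whatever is left at end of input


def segment(html):
    """Stage 1: return the list of tag/text blocks found in an HTML string."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    return parser.blocks


def token_features(block, i):
    """Stage 2: build a feature dict for token i (hypothetical feature set)."""
    kind, value = block[i]
    feats = {"kind": kind, "position": i}
    if kind == "tag":
        feats["tag"] = value
    else:
        stripped = value.strip("().,;: ")
        feats["looks_like_year"] = len(stripped) == 4 and stripped.isdigit()
        feats["word_count"] = len(value.split())
        # Most recent opening tag seen before this text in the block, if any.
        prev_tags = [v for k, v in block[:i] if k == "tag"]
        feats["last_tag"] = prev_tags[-1] if prev_tags else "NONE"
    return feats


def extract_record(block, labels):
    """Stage 3: group labeled text tokens into one record; "O" means no field."""
    record = {}
    for (kind, value), label in zip(block, labels):
        if kind == "text" and label != "O":
            record.setdefault(label, []).append(value)
    return {field: " ".join(parts) for field, parts in record.items()}
```

In a full pipeline, the per-token feature dicts for each block would be passed to a CRF toolkit for training and prediction, and `extract_record` would then run on the predicted label sequence instead of hand-written labels.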