Research on the Theory and Algorithms of Data Currency

Published: 2018-06-21 21:43

Topics: data quality + data usability. Source: Harbin Institute of Technology, 2016 doctoral dissertation

[Abstract]: With the arrival of the big data era, data usability has attracted wide attention. The real world changes rapidly over time, which causes data stored in databases to become outdated and invalid. Existing statistics show that outdated data harms business decisions and everyday life, and that it degrades other dimensions of usability, for example by making data inconsistent, inaccurate, or incomplete. Ensuring data currency is therefore essential. Research on currency within the data usability field is still unsystematic and faces major challenges. First, many databases lack accurate, usable timestamps, so the currency of a data set at a given time, that is, its absolute currency, is hard to determine. Second, different queries and application scenarios place different requirements on currency, and in some settings absolute currency cannot be determined at all, which makes determining currency relative to a query or a user especially important. Third, once the currency of a database has been determined, a method for repairing currency errors is needed, yet existing work on data usability offers no repair method that applies directly to currency. Fourth, with only a single data source, fully repairing a database is very difficult and may be infeasible. Because different sources hold different data, one often has to use existing knowledge to combine data from other sources with the latest values in the target source in order to obtain the complete latest values of the target table. To address these challenges, this dissertation presents a series of theoretical results and algorithms that solve several key problems in data currency. The main contributions are summarized below.

(1) Representation and determination of absolute currency. To overcome the limitations of the two existing classes of currency determination methods, timestamp-based and rule-based, the dissertation formally defines uncertain currency rules and a corresponding data currency model. The rules and model can express uncertain domain knowledge, determine currency quantitatively, and decide whether data is outdated at a specific time. On this basis, the dissertation first studies fundamental problems of uncertain currency rules, such as axiomatization, satisfiability, and implication. It then gives a model for determining currency quantitatively, defining the currency of data items, tuples, and data sets. Next, it organizes the temporal order relations among data items into a temporal order graph and gives a polynomial-time currency determination algorithm based on that graph. Finally, experiments on real data verify the effectiveness of the algorithm.

(2) Representation and determination of relative currency. When the absolute currency of data cannot be determined, or when the result does not capture what users need, redundant records and currency rules can be used to determine relative currency. Using redundant records and currency rules, the dissertation builds a model for determining relative currency and proposes algorithms to solve it. It first defines query-relative currency and groups queries into two classes, latest-value queries and currency-sequence queries. For each class it designs a method for determining the currency of query results and, treating the class as a whole, gives a method for determining the average currency of a data set relative to that class. It then divides users into three classes by query preference and studies user-relative currency. Finally, experiments on real and synthetic data verify the effectiveness of the algorithms and analyze the effect of each parameter.

(3) A rule-based model and algorithms for repairing currency errors. Repairing outdated data in a database to its latest values is a key step in improving data quality. Existing data repair methods are mainly rule-based or statistics-based. Rule-based methods have difficulty expressing certain complex associations in the data, while statistics-based methods must learn fairly complex conditional probability distributions and cannot easily apply domain knowledge tied to data semantics. To overcome the drawbacks of both, the dissertation proposes a new class of repair rules that combines the two approaches: a rule expresses domain knowledge through its rule pattern and uses its own distribution table to describe statistics on how data changes over time. First, the dissertation studies the minimum rule pattern generation problem on static data, proves that it is NP-hard, and gives two polynomial-time approximation algorithms. Next, it studies the same problem on dynamic data and gives an algorithm that quickly updates the existing set of rule patterns as the data changes, needing only O(1) time in the best case. It also gives a distribution table learning algorithm for static data and a distribution table update algorithm for dynamically changing data. It then studies the generation of an optimal repair plan under different repair cost constraints, proves that the problem is solvable in polynomial time when the repair budget is unbounded and NP-hard otherwise, and gives solutions for both cases. Finally, experiments on real and synthetic data sets demonstrate the effectiveness of these methods.

(4) Query-based repair of currency errors. In data integration and Web settings, many tables are stored in scattered locations. These tables often overlap in part, but different sources are updated at different frequencies. When we request a table from a source or send it a query, we often cannot obtain the latest data of the target table because the source has not been updated in time. To repair the target table to its latest values, a conjunctive query must be constructed from the temporal order constraints and referential integrity constraints in the database so that its result consists exactly of the latest values corresponding to the target table. Such a query is called a currency-preserving query. The dissertation studies how to construct currency-preserving queries given the temporal order relations and referential integrity constraints of a database. First, it gives a formal definition of currency-preserving queries, which can be used to obtain the latest values of the target table. Next, it defines the schema currency graph, which expresses the temporal order constraints and referential integrity constraints between tables, and shows that a currency-preserving query can be expressed equivalently as a terminal tree in this graph. It then formalizes the minimum currency-preserving query generation problem, proves that minimizing a currency-preserving query is NP-hard, and gives minimization algorithms for different cases. Finally, experiments verify the effectiveness of the proposed model and algorithms.
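The temporal order graph mentioned in part (1) can be illustrated with a minimal Python sketch. This is not the dissertation's algorithm, which the abstract does not spell out. It assumes only that pairwise "older than" relations between data items are known, checks that they are consistent (acyclic) with Kahn's topological sort, and returns the items that no other item is known to be newer than. The function name `latest_items` and the sample job titles are hypothetical.

```python
from collections import defaultdict, deque


def latest_items(items, older_than):
    """Return (is_consistent, most_current_items).

    items:      list of data items (assumed to include every item
                that appears in older_than).
    older_than: iterable of (a, b) pairs meaning "a is older than b".
    """
    newer = defaultdict(set)            # a -> items known to be newer than a
    indegree = {x: 0 for x in items}    # number of known older items
    for a, b in older_than:
        if b not in newer[a]:
            newer[a].add(b)
            indegree[b] += 1

    # Kahn's algorithm: the relations are consistent exactly when
    # every item can be placed in a topological order (no cycle).
    queue = deque(x for x in items if indegree[x] == 0)
    visited = 0
    while queue:
        x = queue.popleft()
        visited += 1
        for y in newer[x]:
            indegree[y] -= 1
            if indegree[y] == 0:
                queue.append(y)
    if visited != len(items):
        return False, []

    # Items with no known newer item are candidates for the latest value.
    return True, [x for x in items if not newer[x]]


# Hypothetical example: three job titles recorded for one person.
titles = ["intern", "engineer", "manager"]
order = [("intern", "engineer"), ("engineer", "manager")]
print(latest_items(titles, order))  # (True, ['manager'])
```

Both the consistency check and the final scan take time linear in the number of items and relations, which fits the polynomial-time setting the abstract describes. When the known order is only partial, the function returns several candidates.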
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Doctoral
[Year conferred]: 2016
[Classification number]: TP311.13

Article ID: 2050107


Article link: http://www.wukwdryxk.cn/shoufeilunwen/xxkjbs/2050107.html
