自然鄰居思想概念及其在數(shù)據(jù)挖掘領(lǐng)域的應(yīng)用
發(fā)布時(shí)間:2018-07-28 14:44
【摘要】:近些年來(lái),最近鄰居的概念與對(duì)應(yīng)的鄰居搜索算法在數(shù)據(jù)挖掘、圖像處理以及模式識(shí)別等多個(gè)領(lǐng)域中有著廣泛的應(yīng)用基礎(chǔ)并取得了許多令人滿意的結(jié)果。最近鄰居概念中近鄰關(guān)系的定義是最近鄰居思想的根本基礎(chǔ),對(duì)該思想的各種方法起著決定性的作用。在眾多的近鄰關(guān)系中,最為廣泛應(yīng)用的無(wú)疑是k-最近鄰居(KNN:k-Nearest Neighbor)和逆k最近鄰居(RkNN:Reverse k-Nearest Neighbor)。然而,無(wú)論是KNN還是Rk NN,最近鄰居的概念中始終縈繞著一個(gè)懸而未決的問(wèn)題——如何選擇合適的鄰域大小。鄰域參數(shù)的最優(yōu)選擇并不固定,其取值通常依賴數(shù)據(jù)集自身的分布情況。比如在對(duì)數(shù)據(jù)集進(jìn)行分類操作時(shí),較大的鄰域參數(shù)能夠減少噪聲點(diǎn)對(duì)分類結(jié)果的影響,但同時(shí)會(huì)模糊不同類之間的差異性。特別是當(dāng)數(shù)據(jù)集呈現(xiàn)流形分布時(shí),過(guò)大的鄰域參數(shù)會(huì)造成短路的現(xiàn)象,對(duì)流形產(chǎn)生破壞。而與之相反,過(guò)小的鄰域參數(shù)會(huì)削弱數(shù)據(jù)間的鄰居關(guān)系,極端情況下則會(huì)將屬于同一個(gè)類的數(shù)據(jù)點(diǎn)分割到多個(gè)不同的區(qū)域中。因此,在當(dāng)前基于最近鄰居思想的算法中,鄰域選擇問(wèn)題成為了制約算法效率的重要部分。為了從根本上解決這個(gè)問(wèn)題,本文提出了自然鄰居思想及其應(yīng)用。首先,論文提出了自然鄰居思想的基本概念。自然鄰居思想擺脫了鄰域參數(shù)選擇的難題,在自然鄰居的查找過(guò)程中自適應(yīng)的完成鄰居關(guān)系的構(gòu)建,同時(shí)獲得具有數(shù)據(jù)集特征信息的自然鄰居特征值和自然鄰居鄰域圖。自然鄰居思想的主要特點(diǎn)為:1)自然鄰居思想能根據(jù)不同數(shù)據(jù)集的局部特征創(chuàng)建對(duì)應(yīng)的自然鄰居鄰域圖,其能夠直觀準(zhǔn)確地呈現(xiàn)數(shù)據(jù)分布規(guī)律,特別是流形數(shù)據(jù)和噪聲數(shù)據(jù)。2)自然鄰居思想能夠?qū)Σ煌瑪?shù)據(jù)集自適應(yīng)得到自然鄰居特征值,而自然鄰居特征值能夠動(dòng)態(tài)的反映不同數(shù)據(jù)集的分布狀態(tài)。3)自然鄰居思想中每個(gè)數(shù)據(jù)點(diǎn)的鄰居數(shù)量是可變的,鄰居的多少反映了數(shù)據(jù)點(diǎn)與數(shù)據(jù)集的真實(shí)關(guān)系。在自然鄰居思想的概念之上,論文提出了自然鄰居思想對(duì)傳統(tǒng)算法中鄰域參數(shù)選擇問(wèn)題的解決辦法。自然鄰居思想中的自然鄰居特征值反映了數(shù)據(jù)集的分布情況,因此其可以作為傳統(tǒng)最近鄰居思想中的鄰域參數(shù)k。基于該思路,論文提出了自然鄰居特征值的快速計(jì)算算法,高效地計(jì)算自然鄰居特征值,進(jìn)而將其作為鄰域參數(shù)應(yīng)用于離群檢測(cè)、聚類分析等領(lǐng)域的多個(gè)算法中,并且取得了令人滿意的實(shí)驗(yàn)結(jié)果。除了自然鄰居特征值之外,反映自然鄰居關(guān)系的自然鄰居鄰域圖也具有極強(qiáng)的研究?jī)r(jià)值。論文在最后提出了一種基于加權(quán)自然鄰居鄰域圖的數(shù)據(jù)挖掘算法,將自然鄰居查找過(guò)程中的查找深度作為自然鄰居鄰域圖中邊的權(quán)值構(gòu)造加權(quán)自然鄰居鄰域圖,在其基礎(chǔ)上能夠?qū)θ我夥植嫉臄?shù)據(jù)集一次性地進(jìn)行聚類分析、離群挖掘和數(shù)據(jù)可視化分析。
[Abstract]:In recent years, the concept of nearest neighbor and the corresponding neighbor search algorithm have been widely used in many fields, such as data mining, image processing and pattern recognition, and many satisfactory results have been obtained. The definition of nearest neighbor in the concept of nearest neighbor is the fundamental basis of the thought of nearest neighbor, which plays a decisive role in various methods of this idea. Among the many nearest neighbor relationships, the most widely used ones are undoubtedly k-nearest neighbor (KNN:k-Nearest Neighbor) and inverse k-nearest neighbor (RkNN:Reverse k-Nearest Neighbor). However, whether KNN or Rk NN, the concept of nearest neighbor is always haunted by the question of how to choose the appropriate neighborhood size. The optimal selection of neighborhood parameters is not fixed, and its value usually depends on the distribution of the data set itself. For example, when classifying data sets, larger neighborhood parameters can reduce the influence of noise points on classification results, but at the same time, the differences between different classes will be blurred. Especially when the data set presents manifold distribution, excessive neighborhood parameters will cause short-circuit and convection destruction. On the other hand, too small neighborhood parameters will weaken the neighbor relationship between the data, and in extreme cases, the data points belonging to the same class will be divided into several different regions. Therefore, in the current algorithm based on nearest neighbor, neighborhood selection problem has become an important part of the algorithm efficiency constraints. In order to solve this problem fundamentally, this paper puts forward the idea of natural neighbor and its application. Firstly, the paper puts forward the basic concept of the idea of natural neighbor. The idea of natural neighbor gets rid of the difficult problem of neighborhood parameter selection and adaptively completes the construction of neighbor relationship in the process of natural neighbor search. At the same time the natural neighbor characteristic value and natural neighbor neighborhood graph with feature information of data set are obtained at the same time. The main feature of the natural neighbor thought is: (1) the natural neighbor thought can create the corresponding natural neighbor neighborhood graph according to the local characteristics of different data sets, and it can present the data distribution law intuitively and accurately. In particular, manifold data and noise data .2) the idea of natural neighbor can adaptively obtain the eigenvalues of natural neighbors for different data sets. The natural neighbor feature value can dynamically reflect the distributed state of different data sets. 3) the number of neighbors of each data point is variable in the concept of natural neighbor. The number of neighbors reflects the real relationship between data points and data sets. Based on the concept of natural neighbor, this paper proposes a solution to the problem of neighborhood parameter selection in traditional algorithm. The eigenvalue of natural neighbor in natural neighbor thought reflects the distribution of data set, so it can be used as the neighborhood parameter kin the traditional nearest neighbor thought. Based on this idea, this paper proposes a fast algorithm for calculating the eigenvalues of natural neighbors, which can be used as neighborhood parameters in many algorithms in the field of outlier detection, clustering analysis and so on. Satisfactory experimental results have been obtained. Besides the eigenvalue of natural neighbor, the graph of natural neighbor which reflects the relation of natural neighbor is also of great value. At the end of this paper, a data mining algorithm based on weighted natural neighbor graph is proposed. The search depth in the natural neighbor search process is taken as the weight value of the edge of the natural neighbor graph to construct the weighted natural neighbor graph. On the basis of this method, clustering analysis, outlier mining and data visualization can be performed on arbitrarily distributed data sets at one time.
【學(xué)位授予單位】:重慶大學(xué)
【學(xué)位級(jí)別】:博士
【學(xué)位授予年份】:2016
【分類號(hào)】:TP311.13
本文編號(hào):2150574
[Abstract]:In recent years, the concept of nearest neighbor and the corresponding neighbor search algorithm have been widely used in many fields, such as data mining, image processing and pattern recognition, and many satisfactory results have been obtained. The definition of nearest neighbor in the concept of nearest neighbor is the fundamental basis of the thought of nearest neighbor, which plays a decisive role in various methods of this idea. Among the many nearest neighbor relationships, the most widely used ones are undoubtedly k-nearest neighbor (KNN:k-Nearest Neighbor) and inverse k-nearest neighbor (RkNN:Reverse k-Nearest Neighbor). However, whether KNN or Rk NN, the concept of nearest neighbor is always haunted by the question of how to choose the appropriate neighborhood size. The optimal selection of neighborhood parameters is not fixed, and its value usually depends on the distribution of the data set itself. For example, when classifying data sets, larger neighborhood parameters can reduce the influence of noise points on classification results, but at the same time, the differences between different classes will be blurred. Especially when the data set presents manifold distribution, excessive neighborhood parameters will cause short-circuit and convection destruction. On the other hand, too small neighborhood parameters will weaken the neighbor relationship between the data, and in extreme cases, the data points belonging to the same class will be divided into several different regions. Therefore, in the current algorithm based on nearest neighbor, neighborhood selection problem has become an important part of the algorithm efficiency constraints. In order to solve this problem fundamentally, this paper puts forward the idea of natural neighbor and its application. Firstly, the paper puts forward the basic concept of the idea of natural neighbor. The idea of natural neighbor gets rid of the difficult problem of neighborhood parameter selection and adaptively completes the construction of neighbor relationship in the process of natural neighbor search. At the same time the natural neighbor characteristic value and natural neighbor neighborhood graph with feature information of data set are obtained at the same time. The main feature of the natural neighbor thought is: (1) the natural neighbor thought can create the corresponding natural neighbor neighborhood graph according to the local characteristics of different data sets, and it can present the data distribution law intuitively and accurately. In particular, manifold data and noise data .2) the idea of natural neighbor can adaptively obtain the eigenvalues of natural neighbors for different data sets. The natural neighbor feature value can dynamically reflect the distributed state of different data sets. 3) the number of neighbors of each data point is variable in the concept of natural neighbor. The number of neighbors reflects the real relationship between data points and data sets. Based on the concept of natural neighbor, this paper proposes a solution to the problem of neighborhood parameter selection in traditional algorithm. The eigenvalue of natural neighbor in natural neighbor thought reflects the distribution of data set, so it can be used as the neighborhood parameter kin the traditional nearest neighbor thought. Based on this idea, this paper proposes a fast algorithm for calculating the eigenvalues of natural neighbors, which can be used as neighborhood parameters in many algorithms in the field of outlier detection, clustering analysis and so on. Satisfactory experimental results have been obtained. Besides the eigenvalue of natural neighbor, the graph of natural neighbor which reflects the relation of natural neighbor is also of great value. At the end of this paper, a data mining algorithm based on weighted natural neighbor graph is proposed. The search depth in the natural neighbor search process is taken as the weight value of the edge of the natural neighbor graph to construct the weighted natural neighbor graph. On the basis of this method, clustering analysis, outlier mining and data visualization can be performed on arbitrarily distributed data sets at one time.
【學(xué)位授予單位】:重慶大學(xué)
【學(xué)位級(jí)別】:博士
【學(xué)位授予年份】:2016
【分類號(hào)】:TP311.13
【參考文獻(xiàn)】
相關(guān)期刊論文 前2條
1 茍和平;景永霞;馮百明;李勇;;基于DBSCAN聚類的改進(jìn)KNN文本分類算法[J];科學(xué)技術(shù)與工程;2013年01期
2 李曉霞,李剛,林凌,劉玉良,王焱,李健,杜江;Outlier Detection in Near Infra-Red Spectra with Self-Organizing Map[J];Transactions of Tianjin University;2005年02期
,本文編號(hào):2150574
本文鏈接:http://www.wukwdryxk.cn/shoufeilunwen/xxkjbs/2150574.html
最近更新
教材專著