School of Electrical and Information Engineering

Foreign Literature Translation

English title: Data mining-clustering
Translated title: Data Mining: Cluster Analysis
Major: Automation
Name: ****
Class and student number: ****
Supervisor: ******
Source: Data mining, by Ian H. Witten and Eibe Frank

Clustering

5.1 INTRODUCTION
Clustering is similar to classification in that data are grouped. However, unlike classification, the groups are not predefined. Instead, the grouping is accomplished by finding similarities between data according to characteristics found in the actual data. The groups are called clusters. Some authors view clustering as a special type of classification. In this text, however, we follow the more conventional view that the two are different. Many definitions for clusters have been proposed:

Set of like elements. Elements from different clusters are not alike.

The distance between points in a cluster is less than the distance between a point in the cluster and any point outside it.

A term similar to clustering is database segmentation, where like tuples (records) in a database are grouped together. This is done to partition or segment the database into components that then give the user a more general view of the data. In this text, we do not differentiate between segmentation and clustering. A simple example of clustering is found in Example 5.1. This example illustrates the fact that determining how to do the clustering is not straightforward.

As illustrated in Figure 5.1, a given set of data may be clustered on different attributes. Here a group of homes in a geographic area is shown. The first type of clustering is based on the location of the home. Homes that are geographically close to each other are clustered together. In the second clustering, homes are grouped based on the size of the house.

Clustering has been used in many application domains, including biology, medicine, anthropology, marketing, and economics. Clustering applications include plant and animal classification, disease classification, image processing, pattern recognition, and document retrieval. One of the first domains in which clustering was used was biological taxonomy. Recent uses include examining Web log data to detect usage patterns.

When clustering is applied to a real-world database, many interesting problems occur:
Outlier handling is difficult. Here the elements do not naturally fall into any cluster. They can be viewed as solitary clusters. However, if a clustering algorithm attempts to find larger clusters, these outliers will be forced into some cluster. This process may result in the creation of poor clusters by combining two existing clusters and leaving the outlier in its own cluster.

Dynamic data in the database implies that cluster membership may change over time.

Interpreting the semantic meaning of each cluster may be difficult. With classification, the labeling of the classes is known ahead of time. With clustering, however, this may not be the case. Thus, when the clustering process finishes creating a set of clusters, the exact meaning of each cluster may not be obvious. This is where a domain expert is needed to assign a label or interpretation to each cluster.

There is no one correct answer to a clustering problem. In fact, many answers may be found. The exact number of clusters required is not easy to determine. Again, a domain expert may be required. For example, suppose we have a set of data about plants that have been collected during a field trip. Without any prior knowledge of plant classification, if we attempt to divide this set of data into similar groupings, it would not be clear how many groups should be created.

Another related issue is what data should be used for clustering. Unlike learning during a classification process, where there is some a priori knowledge concerning what the attributes of each classification should be, in clustering we have no supervised learning to aid the process. Indeed, clustering can be viewed as similar to unsupervised learning.

We can then summarize some basic features of clustering (as opposed to classification):

The (best) number of clusters is not known.
There may not be any a priori knowledge concerning the clusters.

Cluster results are dynamic.

The clustering problem is stated as shown in Definition 5.1. Here we assume that the number of clusters to be created is an input value, k. The actual content (and interpretation) of each cluster, Kj, 1 ≤ j ≤ k, is determined as a result of the function definition. Without loss of generality, we will view the result of solving a clustering problem as a set of clusters that is created: K = {K1, K2, ..., Kk}.

DEFINITION 5.1. Given a database D = {t1, t2, ..., tn} of tuples and an integer value k, the clustering problem is to define a mapping f : D → {K1, ..., Kk} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k. A cluster Kj contains precisely those tuples mapped to it; that is, Kj = {ti | f(ti) = Kj, 1 ≤ i ≤ n, and ti ∈ D}.
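To make Definition 5.1 concrete, the following is a minimal sketch (not from the original text) of one possible mapping f, using a k-means-style assignment; it assumes numeric tuples and Euclidean distance, and the data values are invented for illustration.

```python
# Sketch: assign each tuple in D to one of k clusters (Definition 5.1).
import random

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cluster(D, k, iterations=20):
    centroids = random.sample(D, k)          # seed with k tuples from D
    for _ in range(iterations):
        # f: map every tuple to its nearest centroid.
        K = [[] for _ in range(k)]
        for t in D:
            j = min(range(k), key=lambda j: euclidean(t, centroids[j]))
            K[j].append(t)
        # Recompute each centroid as the mean of its cluster.
        for j, Kj in enumerate(K):
            if Kj:
                centroids[j] = tuple(sum(xs) / len(Kj) for xs in zip(*Kj))
    return K

D = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.1), (4.8, 5.3), (9.0, 0.5)]
print(cluster(D, k=2))
```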
A classification of the different types of clustering algorithms is shown in Figure 5.2. Clustering algorithms themselves may be viewed as hierarchical or partitional. With hierarchical clustering, a nested set of clusters is created. Each level in the hierarchy has a separate set of clusters. At the lowest level, each item is in its own unique cluster. At the highest level, all items belong to the same cluster. With hierarchical clustering, the desired number of clusters is not input. With partitional clustering, the algorithm creates only one set of clusters. These approaches use the desired number of clusters to drive how the final set is created. Traditional clustering algorithms tend to be targeted to small numeric databases that fit into memory. There are, however, more recent clustering algorithms that look at categorical data and are targeted to larger, perhaps dynamic, databases. Algorithms targeted to larger databases may adapt to memory constraints by either sampling the database or using data structures that can be compressed or pruned to fit into memory.
The types of clustering algorithms can be further classified based on the implementation technique used. Hierarchical algorithms can be categorized as agglomerative or divisive. "Agglomerative" implies that the clusters are created in a bottom-up fashion, while divisive algorithms work in a top-down fashion. Although both hierarchical and partitional algorithms could be described using the agglomerative vs. divisive label, it typically is more associated with hierarchical algorithms. Another descriptive tag indicates whether each element is handled one by one (serial, sometimes called incremental) or whether all items are examined together (simultaneous). If a specific tuple is viewed as having attribute values for all attributes in the schema, then clustering algorithms could differ as to how the attribute values are examined. As is usually done with decision tree classification techniques, some algorithms examine attribute values one at a time (monothetic). Polythetic algorithms consider all attribute values at once. Finally, clustering algorithms can be labeled based on the mathematical formulation given to the algorithm: graph theoretic or matrix algebra. In this chapter we generally use the graph approach and describe the input to the clustering algorithm as an adjacency matrix labeled with distance measures.
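As an illustration of the hierarchical approach (not part of the original text), the sketch below runs bottom-up (agglomerative) clustering, assuming SciPy is available; the returned linkage encodes the nested set of clusters, and cutting it at a chosen level yields a flat partition.

```python
# Sketch: agglomerative (bottom-up) hierarchical clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.1], [4.8, 5.3]])
Z = linkage(points, method='single')  # single link merge rule (Section 5.2)
# Cut the hierarchy so that exactly two clusters remain.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # e.g., [1 1 2 2]
```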
We discuss many clustering algorithms in the following sections. This is only a representative subset of the many algorithms that have been proposed in the literature. Before looking at these algorithms, we first examine possible similarity measures and the impact of outliers.

5.2 SIMILARITY AND DISTANCE MEASURES
There are many desirable properties for the clusters created by a solution to a specific clustering problem. The most important one is that a tuple within one cluster is more like tuples within that cluster than it is like tuples outside it. As with classification, then, we assume the definition of a similarity measure, sim(ti, tl), defined between any two tuples ti, tl ∈ D. This provides a more strict and alternative clustering definition, as found in Definition 5.2. Unless otherwise stated, we use the first definition rather than the second. Keep in mind that the similarity relationship stated within the second definition is a desirable, although not always obtainable, property.

A distance measure, dis(ti, tj), as opposed to similarity, is often used in clustering. The clustering problem then has the desirable property that, given a cluster Kj, ∀ tjl, tjm ∈ Kj and ti ∉ Kj, dis(tjl, tjm) ≤ dis(tjl, ti).
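The sketch below (not from the original text) spells out this property under an assumed Euclidean dis(): every pair of tuples inside Kj must be at least as close to each other as any of them is to a tuple outside the cluster. The data values are invented.

```python
# Sketch: check the desirable within-cluster distance property.
def dis(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def has_property(Kj, outside):
    # dis(tjl, tjm) <= dis(tjl, ti) for all tjl, tjm in Kj and ti outside Kj.
    return all(dis(tjl, tjm) <= dis(tjl, ti)
               for tjl in Kj for tjm in Kj for ti in outside)

Kj = [(0.0, 0.0), (1.0, 0.0)]
outside = [(5.0, 5.0), (6.0, 5.0)]
print(has_property(Kj, outside))  # True for this toy data
```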
Some clustering algorithms look only at numeric data, usually assuming metric data points. Metric attributes satisfy the triangular inequality. The cluster can then be described by using several characteristic values. Given a cluster Km of N points {tm1, tm2, ..., tmN}, we make the following definitions [ZRL96]:

centroid: Cm = (Σ i=1..N tmi) / N
radius: Rm = sqrt( Σ i=1..N (tmi − Cm)² / N )
diameter: Dm = sqrt( Σ i=1..N Σ j=1..N (tmi − tmj)² / (N(N − 1)) )
Here the centroid is the "middle" of the cluster; it need not be an actual point in the cluster. Some clustering algorithms alternatively assume that the cluster is represented by one centrally located object in the cluster, called a medoid. The radius is the square root of the average mean squared distance from any point in the cluster to the centroid, and the diameter is the square root of the average mean squared distance between all pairs of points in the cluster. We use the notation Mm to indicate the medoid for cluster Km.
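Under the same numeric assumptions, these characteristic values are straightforward to compute; the following sketch (not from the original text) derives the centroid, radius, diameter, and medoid of a small invented cluster with NumPy.

```python
# Sketch: centroid Cm, radius Rm, diameter Dm [ZRL96], and medoid Mm.
import numpy as np

Km = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 2.0]])
N = len(Km)

Cm = Km.mean(axis=0)                              # centroid: the "middle"
Rm = np.sqrt(((Km - Cm) ** 2).sum() / N)          # radius
diffs = Km[:, None, :] - Km[None, :, :]           # all pairwise differences
Dm = np.sqrt((diffs ** 2).sum() / (N * (N - 1)))  # diameter
# Medoid: the actual member with the smallest total distance to the rest.
pairwise = np.sqrt((diffs ** 2).sum(axis=2))
Mm = Km[pairwise.sum(axis=1).argmin()]
print(Cm, Rm, Dm, Mm)
```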
Many clustering algorithms require that the distance between clusters (rather than elements) be determined. This is not an easy task given that there are many interpretations for distance between clusters. Given clusters Ki and Kj, there are several standard alternatives to calculate the distance between clusters. A representative list is:

Single link: Smallest distance between an element in one cluster and an element in the other. We thus have dis(Ki, Kj) = min(dis(til, tjm)) ∀ til ∈ Ki and ∀ tjm ∈ Kj.

Complete link: Largest distance between an element in one cluster and an element in the other. We thus have dis(Ki, Kj) = max(dis(til, tjm)) ∀ til ∈ Ki and ∀ tjm ∈ Kj.

Average: Average distance between an element in one cluster and an element in the other. We thus have dis(Ki, Kj) = mean(dis(til, tjm)) ∀ til ∈ Ki and ∀ tjm ∈ Kj.

Centroid: If clusters have a representative centroid, then the centroid distance is defined as the distance between the centroids. We thus have dis(Ki, Kj) = dis(Ci, Cj), where Ci is the centroid for Ki and similarly for Cj.

Medoid: Using a medoid to represent each cluster, the distance between the clusters can be defined by the distance between the medoids: dis(Ki, Kj) = dis(Mi, Mj).
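The five alternatives translate directly into code; here is a sketch (not from the original text) assuming Euclidean distance between numeric tuples, with invented data.

```python
# Sketch: standard inter-cluster distance alternatives.
import numpy as np

def dis(a, b):
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def medoid(K):
    # Member with the smallest total distance to the rest of the cluster.
    return min(K, key=lambda a: sum(dis(a, b) for b in K))

def single_link(Ki, Kj):
    return min(dis(a, b) for a in Ki for b in Kj)

def complete_link(Ki, Kj):
    return max(dis(a, b) for a in Ki for b in Kj)

def average_link(Ki, Kj):
    return sum(dis(a, b) for a in Ki for b in Kj) / (len(Ki) * len(Kj))

def centroid_dis(Ki, Kj):
    return dis(np.mean(Ki, axis=0), np.mean(Kj, axis=0))

def medoid_dis(Ki, Kj):
    return dis(medoid(Ki), medoid(Kj))

Ki = [(0.0, 0.0), (1.0, 0.0)]
Kj = [(4.0, 3.0), (5.0, 3.0)]
print(single_link(Ki, Kj), complete_link(Ki, Kj), average_link(Ki, Kj))
```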
5.3 OUTLIERS

As mentioned earlier, outliers are sample points with values much different from those of the remaining set of data. Outliers may represent errors in the data (perhaps a malfunctioning sensor recorded an incorrect data value) or could be correct data values that are simply much different from the remaining data. A person who is 2.5 meters tall is much taller than most people. In analyzing the height of individuals, this value probably would be viewed as an outlier.

Some clustering techniques do not perform well in the presence of outliers. This problem is illustrated in Figure 5.3. Here, if three clusters are found (solid line), the outlier will occur in a cluster by itself. However, if two clusters are found (dashed line), the two (obviously) different sets of data will be placed in one cluster because they are closer to each other than to the outlier. This problem is complicated by the fact that many clustering algorithms actually take as input the number of desired clusters to be found.
50、lly find and remove outliers to ensure that they perform better. However, care must be taken in actually removing outliers. For example, suppose that the data mining problem is to predict flooding. Extremely high water l
51、evel values occur very infrequently, and when compared with the normal water level values may seem to be outliers. However, removing these values may not allow the data mining algorithms to work effectively because there
52、 would be no data that showed</p><p> Outlier detection, or outlier mining, is the process of identifying outliers in a set of data. Clustering, or other data mining, algorithms may then choose to remove or
53、 treat these values differently. Some outlier detection techniques are based on statistical techniques. These usually assume that the set of data follows a known distribution and that outliers can be detected by well-kno
54、wn tests such as discordancy tests. However, these tests are not very realistic for real-world data because real</p><p><b> 聚類(lèi)分析</b></p><p> 5.1 INTRODUCTION 5.1簡(jiǎn)介 </p><