School of Electrical and Information Engineering

Foreign Literature Translation

English title: Data mining-clustering
Translated title: Data Mining: Cluster Analysis
Major: Automation
Name: ****
Class and student number: ****
Supervisor: ******
Source: Data mining, by Ian H. Witten and Eibe Frank

Clustering

5.1 INTRODUCTION
Clustering is similar to classification in that data are grouped. However, unlike classification, the groups are not predefined. Instead, the grouping is accomplished by finding similarities between data according to characteristics found in the actual data. The groups are called clusters. Some authors view clustering as a special type of classification. In this text, however, we follow the more conventional view that the two are different. Many definitions for clusters have been proposed:

Set of like elements. Elements from different clusters are not alike.

The distance between points in a cluster is less than the distance between a point in the cluster and any point outside it.

A term similar to clustering is database segmentation, where like tuples (records) in a database are grouped together. This is done to partition or segment the database into components that then give the user a more general view of the data. In this text, we do not differentiate between segmentation and clustering. A simple example of clustering is found in Example 5.1. This example illustrates the fact that determining how to do the clustering is not straightforward.

As illustrated in Figure 5.1, a given set of data may be clustered on different attributes. Here a group of homes in a geographic area is shown. The first type of clustering is based on the location of the home. Homes that are geographically close to each other are clustered together. In the second clustering, homes are grouped based on the size of the house.

Clustering has been used in many application domains, including biology, medicine, anthropology, marketing, and economics. Clustering applications include plant and animal classification, disease classification, image processing, pattern recognition, and document retrieval. One of the first domains in which clustering was used was biological taxonomy. Recent uses include examining Web log data to detect usage patterns.

When clustering is applied to a real-world database, many interesting problems occur:
Outlier handling is difficult. Here the elements do not naturally fall into any cluster. They can be viewed as solitary clusters. However, if a clustering algorithm attempts to find larger clusters, these outliers will be forced into some cluster. This process may result in the creation of poor clusters by combining two existing clusters and leaving the outlier in its own cluster.

Dynamic data in the database implies that cluster membership may change over time.

Interpreting the semantic meaning of each cluster may be difficult. With classification, the labeling of the classes is known ahead of time. With clustering, however, this may not be the case. Thus, when the clustering process finishes creating a set of clusters, the exact meaning of each cluster may not be obvious. This is where a domain expert is needed to assign a label or interpretation to each cluster.

There is no one correct answer to a clustering problem. In fact, many answers may be found. The exact number of clusters required is not easy to determine. Again, a domain expert may be required. For example, suppose we have a set of data about plants that have been collected during a field trip. Without any prior knowledge of plant classification, if we attempt to divide this set of data into similar groupings, it would not be clear how many groups should be created.

Another related issue is what data should be used for clustering. Unlike learning during a classification process, where there is some a priori knowledge concerning what the attributes of each classification should be, in clustering we have no supervised learning to aid the process. Indeed, clustering can be viewed as similar to unsupervised learning.

We can then summarize some basic features of clustering (as opposed to classification):

The (best) number of clusters is not known.
There may not be any a priori knowledge concerning the clusters.

Cluster results are dynamic.

The clustering problem is stated as shown in Definition 5.1. Here we assume that the number of clusters to be created is an input value, k. The actual content (and interpretation) of each cluster, Kj, 1 ≤ j ≤ k, is determined as a result of the function definition. Without loss of generality, we will view the result of solving a clustering problem as a set of clusters that is created: K = {K1, K2, ..., Kk}.

DEFINITION 5.1. Given a database D = {t1, t2, ..., tn} of tuples and an integer value k, the clustering problem is to define a mapping f : D → {K1, ..., Kk} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k. A cluster Kj contains precisely those tuples mapped to it; that is, Kj = {ti | f(ti) = Kj, 1 ≤ i ≤ n, and ti ∈ D}.
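To make Definition 5.1 concrete, the following is a minimal sketch (not from the original text) of one possible mapping f, using a k-means-style assignment; it assumes numeric tuples and Euclidean distance, and the data values are invented for illustration.

```python
# Sketch: assign each tuple in D to one of k clusters (Definition 5.1).
import random

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cluster(D, k, iterations=20):
    centroids = random.sample(D, k)          # seed with k tuples from D
    for _ in range(iterations):
        # f: map every tuple to its nearest centroid.
        K = [[] for _ in range(k)]
        for t in D:
            j = min(range(k), key=lambda j: euclidean(t, centroids[j]))
            K[j].append(t)
        # Recompute each centroid as the mean of its cluster.
        for j, Kj in enumerate(K):
            if Kj:
                centroids[j] = tuple(sum(xs) / len(Kj) for xs in zip(*Kj))
    return K

D = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.1), (4.8, 5.3), (9.0, 0.5)]
print(cluster(D, k=2))
```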
A classification of the different types of clustering algorithms is shown in Figure 5.2. Clustering algorithms themselves may be viewed as hierarchical or partitional. With hierarchical clustering, a nested set of clusters is created. Each level in the hierarchy has a separate set of clusters. At the lowest level, each item is in its own unique cluster. At the highest level, all items belong to the same cluster. With hierarchical clustering, the desired number of clusters is not input. With partitional clustering, the algorithm creates only one set of clusters. These approaches use the desired number of clusters to drive how the final set is created. Traditional clustering algorithms tend to be targeted to small numeric databases that fit into memory. There are, however, more recent clustering algorithms that look at categorical data and are targeted to larger, perhaps dynamic, databases. Algorithms targeted to larger databases may adapt to memory constraints by either sampling the database or using data structures that can be compressed or pruned to fit into memory.
The types of clustering algorithms can be further classified based on the implementation technique used. Hierarchical algorithms can be categorized as agglomerative or divisive. "Agglomerative" implies that the clusters are created in a bottom-up fashion, while divisive algorithms work in a top-down fashion. Although both hierarchical and partitional algorithms could be described using the agglomerative vs. divisive label, it typically is more associated with hierarchical algorithms. Another descriptive tag indicates whether each element is handled one by one (serial, sometimes called incremental) or whether all items are examined together (simultaneous). If a specific tuple is viewed as having attribute values for all attributes in the schema, then clustering algorithms could differ as to how the attribute values are examined. As is usually done with decision tree classification techniques, some algorithms examine attribute values one at a time (monothetic). Polythetic algorithms consider all attribute values at once. Finally, clustering algorithms can be labeled based on the mathematical formulation given to the algorithm: graph theoretic or matrix algebra. In this chapter we generally use the graph approach and describe the input to the clustering algorithm as an adjacency matrix labeled with distance measures.
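As an illustration of the hierarchical approach (not part of the original text), the sketch below runs bottom-up (agglomerative) clustering, assuming SciPy is available; the returned linkage encodes the nested set of clusters, and cutting it at a chosen level yields a flat partition.

```python
# Sketch: agglomerative (bottom-up) hierarchical clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.1], [4.8, 5.3]])
Z = linkage(points, method='single')  # single link merge rule (Section 5.2)
# Cut the hierarchy so that exactly two clusters remain.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # e.g., [1 1 2 2]
```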
We discuss many clustering algorithms in the following sections. This is only a representative subset of the many algorithms that have been proposed in the literature. Before looking at these algorithms, we first examine possible similarity measures and the impact of outliers.

5.2 SIMILARITY AND DISTANCE MEASURES
There are many desirable properties for the clusters created by a solution to a specific clustering problem. The most important one is that a tuple within one cluster is more like tuples within that cluster than it is like tuples outside it. As with classification, then, we assume the definition of a similarity measure, sim(ti, tl), defined between any two tuples ti, tl ∈ D. This provides a more strict and alternative clustering definition, as found in Definition 5.2. Unless otherwise stated, we use the first definition rather than the second. Keep in mind that the similarity relationship stated within the second definition is a desirable, although not always obtainable, property.

A distance measure, dis(ti, tj), as opposed to similarity, is often used in clustering. The clustering problem then has the desirable property that, given a cluster Kj, ∀ tjl, tjm ∈ Kj and ti ∉ Kj, dis(tjl, tjm) ≤ dis(tjl, ti).
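The sketch below (not from the original text) spells out this property under an assumed Euclidean dis(): every pair of tuples inside Kj must be at least as close to each other as any of them is to a tuple outside the cluster. The data values are invented.

```python
# Sketch: check the desirable within-cluster distance property.
def dis(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def has_property(Kj, outside):
    # dis(tjl, tjm) <= dis(tjl, ti) for all tjl, tjm in Kj and ti outside Kj.
    return all(dis(tjl, tjm) <= dis(tjl, ti)
               for tjl in Kj for tjm in Kj for ti in outside)

Kj = [(0.0, 0.0), (1.0, 0.0)]
outside = [(5.0, 5.0), (6.0, 5.0)]
print(has_property(Kj, outside))  # True for this toy data
```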
Some clustering algorithms look only at numeric data, usually assuming metric data points. Metric attributes satisfy the triangular inequality. The cluster can then be described by using several characteristic values. Given a cluster Km of N points {tm1, tm2, ..., tmN}, we make the following definitions [ZRL96]:

centroid: Cm = (Σ i=1..N tmi) / N
radius: Rm = sqrt( Σ i=1..N (tmi − Cm)² / N )
diameter: Dm = sqrt( Σ i=1..N Σ j=1..N (tmi − tmj)² / (N(N − 1)) )
Here the centroid is the "middle" of the cluster; it need not be an actual point in the cluster. Some clustering algorithms alternatively assume that the cluster is represented by one centrally located object in the cluster, called a medoid. The radius is the square root of the average mean squared distance from any point in the cluster to the centroid, and the diameter is the square root of the average mean squared distance between all pairs of points in the cluster. We use the notation Mm to indicate the medoid for cluster Km.
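Under the same numeric assumptions, these characteristic values are straightforward to compute; the following sketch (not from the original text) derives the centroid, radius, diameter, and medoid of a small invented cluster with NumPy.

```python
# Sketch: centroid Cm, radius Rm, diameter Dm [ZRL96], and medoid Mm.
import numpy as np

Km = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 2.0]])
N = len(Km)

Cm = Km.mean(axis=0)                              # centroid: the "middle"
Rm = np.sqrt(((Km - Cm) ** 2).sum() / N)          # radius
diffs = Km[:, None, :] - Km[None, :, :]           # all pairwise differences
Dm = np.sqrt((diffs ** 2).sum() / (N * (N - 1)))  # diameter
# Medoid: the actual member with the smallest total distance to the rest.
pairwise = np.sqrt((diffs ** 2).sum(axis=2))
Mm = Km[pairwise.sum(axis=1).argmin()]
print(Cm, Rm, Dm, Mm)
```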
Many clustering algorithms require that the distance between clusters (rather than elements) be determined. This is not an easy task given that there are many interpretations for distance between clusters. Given clusters Ki and Kj, there are several standard alternatives to calculate the distance between clusters. A representative list is:

Single link: Smallest distance between an element in one cluster and an element in the other. We thus have dis(Ki, Kj) = min(dis(til, tjm)) ∀ til ∈ Ki and ∀ tjm ∈ Kj.

Complete link: Largest distance between an element in one cluster and an element in the other. We thus have dis(Ki, Kj) = max(dis(til, tjm)) ∀ til ∈ Ki and ∀ tjm ∈ Kj.

Average: Average distance between an element in one cluster and an element in the other. We thus have dis(Ki, Kj) = mean(dis(til, tjm)) ∀ til ∈ Ki and ∀ tjm ∈ Kj.

Centroid: If clusters have a representative centroid, then the centroid distance is defined as the distance between the centroids. We thus have dis(Ki, Kj) = dis(Ci, Cj), where Ci is the centroid for Ki and similarly for Cj.

Medoid: Using a medoid to represent each cluster, the distance between the clusters can be defined by the distance between the medoids: dis(Ki, Kj) = dis(Mi, Mj).
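The five alternatives translate directly into code; here is a sketch (not from the original text) assuming Euclidean distance between numeric tuples, with invented data.

```python
# Sketch: standard inter-cluster distance alternatives.
import numpy as np

def dis(a, b):
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def medoid(K):
    # Member with the smallest total distance to the rest of the cluster.
    return min(K, key=lambda a: sum(dis(a, b) for b in K))

def single_link(Ki, Kj):
    return min(dis(a, b) for a in Ki for b in Kj)

def complete_link(Ki, Kj):
    return max(dis(a, b) for a in Ki for b in Kj)

def average_link(Ki, Kj):
    return sum(dis(a, b) for a in Ki for b in Kj) / (len(Ki) * len(Kj))

def centroid_dis(Ki, Kj):
    return dis(np.mean(Ki, axis=0), np.mean(Kj, axis=0))

def medoid_dis(Ki, Kj):
    return dis(medoid(Ki), medoid(Kj))

Ki = [(0.0, 0.0), (1.0, 0.0)]
Kj = [(4.0, 3.0), (5.0, 3.0)]
print(single_link(Ki, Kj), complete_link(Ki, Kj), average_link(Ki, Kj))
```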
5.3 OUTLIERS

As mentioned earlier, outliers are sample points with values much different from those of the remaining set of data. Outliers may represent errors in the data (perhaps a malfunctioning sensor recorded an incorrect data value) or could be correct data values that are simply much different from the remaining data. A person who is 2.5 meters tall is much taller than most people. In analyzing the height of individuals, this value probably would be viewed as an outlier.

Some clustering techniques do not perform well in the presence of outliers. This problem is illustrated in Figure 5.3. Here, if three clusters are found (solid line), the outlier will occur in a cluster by itself. However, if two clusters are found (dashed line), the two (obviously) different sets of data will be placed in one cluster because they are closer to each other than to the outlier. This problem is complicated by the fact that many clustering algorithms actually take as input the number of desired clusters to be found.
50、lly find and remove outliers to ensure that they perform better. However, care must be taken in actually removing outliers. For example, suppose that the data mining problem is to predict flooding. Extremely high water l
51、evel values occur very infrequently, and when compared with the normal water level values may seem to be outliers. However, removing these values may not allow the data mining algorithms to work effectively because there
52、 would be no data that showed</p><p> Outlier detection, or outlier mining, is the process of identifying outliers in a set of data. Clustering, or other data mining, algorithms may then choose to remove or
53、 treat these values differently. Some outlier detection techniques are based on statistical techniques. These usually assume that the set of data follows a known distribution and that outliers can be detected by well-kno
54、wn tests such as discordancy tests. However, these tests are not very realistic for real-world data because real</p><p><b> 聚類(lèi)分析</b></p><p> 5.1 INTRODUCTION 5.1簡(jiǎn)介 </p><