版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進行舉報或認(rèn)領(lǐng)
文檔簡介
1、<p><b> 本科畢業(yè)論文</b></p><p><b> 外文文獻及譯文</b></p><p> 文獻、資料題目: Cluster Analysis</p><p> —Basic Concepts and Algorithms</p><p> 文獻、資料來源:
2、http://www.statsoft.com</p><p> 文獻、資料發(fā)表(出版)日期:</p><p> 院 (部): 土木工程學(xué)院</p><p> 專 業(yè): 土木工程</p><p><b> 班 級: </b></p><p><b> 姓 名:
3、 </b></p><p><b> 學(xué) 號: </b></p><p><b> 指導(dǎo)教師:</b></p><p><b> 翻譯日期: </b></p><p><b> 外文文獻: </b></p><
4、p> Cluster Analysis</p><p> —Basic Concepts and Algorithms</p><p> Cluster analysis divides data into groups (clusters) that are meaningful, useful,or both. If meaningful groups are the go
5、al, then the clusters should capture the natural structure of the data. In some cases, however, cluster analysis is only a useful starting point for other purposes, such as data summarization. Whether for understanding o
6、r utility, cluster analysis has long played an important role in a wide variety of ?elds: psychology and other social sciences, biology,statistics, pattern rec</p><p> There have been many applications of c
7、luster analysis to practical problems. We provide some speci?c examples, organized by whether the purpose of the clustering is understanding or utility.</p><p> Clustering for Understanding Classes, or con
8、ceptually meaningful groups of objects that share common characteristics, play an important role in how people analyze and describe the world. Indeed, human beings are skilled at dividing objects into groups (clustering)
9、 and assigning particular objects to these groups (classi?cation). For example, even relatively young children can quickly label the objects in a photograph as buildings, vehicles, people, animals, plants, etc. In the co
10、ntext of unders</p><p> Biology. Biologists have spent many years creating a taxonomy (hierarchical classi?cation) of all living things: kingdom, phylum, class,order, family, genus, and species. Thus, it is
11、 perhaps not surprising that much of the early work in cluster analys is sought to create a discipline of mathematical taxonomy that could automatically ?nd such classi?cation structures. More recently, biologists have a
12、pplied clustering to analyze the large amounts of genetic information that are now available. For </p><p> ? Information Retrieval. The World Wide Web consists of billions of Web pages, and the results of a
13、 query to a search engine can return thousands of pages. Clustering can be used to group these search results into a small number of clusters, each of which captures a particular aspect of the query. For instance, a quer
14、y of “movie” might return Web pages grouped into categories such as reviews, trailers, stars, and theaters. Each category (cluster) can be broken into subcategories (sub-clusters), </p><p> ? Climate. Under
15、standing the Earth’s climate requires ?nding patternsin the atmosphere and ocean. To that end, cluster analysis has been applied to ?nd patterns in the atmospheric pressure of polar regions and areas of the ocean that ha
16、ve a signi?cant impact on land climate.</p><p> ? Psychology and Medicine. An illness or condition frequently has a number of variations, and cluster analysis can be used to identify these different subcate
17、gories. For example, clustering has been used to identify different types of depression. Cluster analysis can also be used to detect patterns in the spatial or temporal distribution of a disease.</p><p> ?
18、Business. Businesses collect large amounts of information on current and potential customers. Clustering can be used to segment customers into a small number of groups for additional analysis and marketing activities.<
19、;/p><p> Clustering for Utility:Cluster analysis provides an abstraction from individual data objects to the clusters in which those data objects reside. Additionally, some clustering techniques characterize e
20、ach cluster in terms of a cluster prototype; i.e., a data object that is representative of the other objects in the cluster. These cluster prototypes can be used as the basis for a number of data analysis or data process
21、ing techniques. Therefore, in the context of utility, cluster analysis is the st</p><p> ? Summarization. Many data analysis techniques, such as regression or PCA, have a time or space complexity of O(m2) o
22、r higher (where m is the number of objects), and thus, are not practical for large data sets. However, instead of applying the algorithm to the entire data set, it can be applied to a reduced data set consisting only of
23、cluster prototypes. Depending on the type of analysis, the number of prototypes, and the accuracy with which the prototypes represent the data, the results can be </p><p> ? Compression. Cluster prototypes
24、can also be used for data compres-sion. In particular, a table is created that consists of the prototypes for each cluster; i.e., each prototype is assigned an integer value that is its position (index) in the table. Eac
25、h object is represented by the index of the prototype associated with its cluster. This type of compression is known as vector quantization and is often applied to image, sound, and video data, where (1) many of the data
26、 objects are highly simila</p><p> ? Effciently Finding Nearest Neighbors. Finding nearest neighbors can require computing the pairwise distance between all points. Often clusters and their cluster prototyp
27、es can be found much more effciently. If objects are relatively close to the prototype of their cluster, then we can use the prototypes to reduce the number of distance computations that are necessary to ?nd the nearest
28、neighbors of an object. Intuitively, if two cluster prototypes are far apart, then the objects in the corresp</p><p> This chapter provides an introduction to cluster analysis. We begin with a high-level ov
29、erview of clustering, including a discussion of the various ap- proaches to dividing objects into sets of clusters and the different types of clusters. We then describe three speci?c clustering techniques that represent
30、broad categories of algorithms and illustrate a variety of concepts: K-means, agglomerative hierarchical clustering, and DBSCAN. The ?nal section of this chapter is devoted to cluster validity</p><p> 1.1Ov
31、erview</p><p> Before discussing speci?c clustering techniques, we provide some necessary background. First, we further de?ne cluster analysis, illustrating why it isdiffcult and explaining its relationship
32、 to other techniques that group data.Then we explore two important topics: (1) different ways to group a set ofobjects into a set of clusters, and (2) types of clusters.</p><p> 1.1.1What Is Cluster Analysi
33、s?</p><p> Cluster analysis groups data objects based only on information found in thedata that describes the objects and their relationships. The goal is that theobjects within a group be similar (or relat
34、ed) to one another and di?erent from(or unrelated to) the objects in other groups. The greater the similarity (orhomogeneity) within a group and the greater the di?erence between groups,the better or more distinct the cl
35、ustering.</p><p> Cluster analysis is related to other techniques that are used to divide data objects into groups. For instance, clustering can be regarded as a form of classi?cation in that it creates a l
36、abeling of objects with class (cluster) labels.However, it derives these labels only from the data. In contrast, classi?cationn the sense of Chapter 4 is supervised classi?cation; i.e., new, unlabeled objects are assigne
37、d a class label using a model developed from objects with known class labels. For this reaso</p><p> Also, while the terms segmentation and partitioning are sometimesused as synonyms for clustering, these t
38、erms are frequently used for approaches outside the traditional bounds of cluster analysis. For example, the termpartitioning is often used in connection with techniques that divide graphs into subgraphs and that are not
39、 strongly connected to clustering. Segmentation often refers to the division of data into groups using simple techniques; e.g.,an image can be split into segments based only o</p><p> 1.1.2 Different Types
40、of Clusterings</p><p> An entire collection of clusters is commonly referred to as a clustering, and in this section, we distinguish various types of clusterings: hierarchical (nested) versus partitional (u
41、nnested), exclusive versus overlapping versus fuzzy, and complete versus partial.</p><p> Hierarchical versus Partitional The most commonly discussed distinc- tion among different types of clusterings is wh
42、ether the set of clusters is nested or unnested, or in more traditional terminology, hierarchical or partitional. Apartitional clustering is simply a division of the set of data objects into non-overlapping subsets (clus
43、ters) such that each data object is in exactly onesubset. </p><p> If we permit clusters to have subclusters, then we obtain a hierarchical clustering, which is a set of nested clusters that are organized a
44、s a tree. Each node (cluster) in the tree (except for the leaf nodes) is the union of its children (subclusters), and the root of the tree is the cluster containing all the objects.Often, but not always, the leaves of th
45、e tree are singleton clusters of individual data objects. If we allow clusters to be nested, then one interpretation of Figure 8.1(a) is that</p><p> Exclusive versus Overlapping versus Fuzzy The clustering
46、s shown in Figure 8.1 are all exclusive, as they assign each object to a single cluster.There are many situations in which a point could reasonably be placed in more than one cluster, and these situations are better addr
47、essed by non-exclusiveclustering. In the most general sense, an overlapping or non-exclusiveclustering is used to re?ect the fact that an object can simultaneously belong to more than one group (class). For instance, a p
48、erso</p><p> In a fuzzy clustering, every object belongs to every cluster with a membership weight that is between 0 (absolutely doesn’t belong) and 1 (absolutelybelongs). In other words, clusters are treat
49、ed as fuzzy sets. (Mathematically,a fuzzy set is one in which an object belongs to any set with a weight thatis between 0 and 1. In fuzzy clustering, we often impose the additional constraint that the sum of the weights
50、for each object must equal 1.) Similarly,probabilistic clustering techniques compute th</p><p> Complete versus Partial A complete clustering assigns every object to a cluster, whereas a partial clustering
51、does not. The motivation for a partial clustering is that some objects in a data set may not belong to well-de?ned groups. Many times objects in the data set may represent noise, outliers, or“uninteresting background.” F
52、or example, some newspaper stories may share a common theme, such as global warming, while other stories are more genericor one-of-a-kind. Thus, to ?nd the important topi</p><p> 1.1.3Different Types of Clu
53、sters</p><p> Clustering aims to ?nd useful groups of objects (clusters), where usefulness is de?ned by the goals of the data analysis. Not surprisingly, there are several different notions of a cluster tha
54、t prove useful in practice. In order to visually illustrate the differences among these types of clusters, we use two-dimensional points, as shown in Figure 8.2, as our data objects. We stress, however, thatthe types of
55、clusters described here are equally valid for other kinds of data.</p><p> Well-Separated A cluster is a set of objects in which each object is closer (or more similar) to every other object in the cluster
56、than to any object notin the cluster. Sometimes a threshold is used to specify that all the objects in a cluster must be su?ciently close (or similar) to one another. This ideal istic de?nition of a cluster is satis?ed o
57、nly when the data contains natural clusters that are quite far from each other. Figure 8.2(a) gives an example of well-separated clusters that consis</p><p> Prototype-Based A cluster is a set of objects in
58、 which each object is closer(more similar) to the prototype that de?nes the cluster than to the prototype of any other cluster. For data with continuous attributes, the prototype of a cluster is often a centroid, i.e., t
59、he average (mean) of all the points in the cluster. When a centroid is not meaningful, such as when the data has categorical attributes, the prototype is often a medoid, i.e., the most representative pointof a cluster. F
60、or many type</p><p> Graph-Based If the data is represented as a graph, where the nodes are objects and the links represent connections among objects (see Section 2.1.2),then a cluster can be de?ned as a co
61、nnected component; i.e., a group of objects that are connected to one another, but that have no connection to objects outside the group. An important example of graph-based clusters are contiguity-based clusters, where t
62、wo objects are connected only if they are within a speci?ed distance of each other. This implie</p><p> Other types of graph-based clusters are also possible. One such approach (Section 8.3.2) de?nes a clus
63、ter as a clique; i.e., a set of nodes in a graph that are completely connected to each other. Speci?cally, if we add connections between objects in the order of their distance from one another, a cluster is formed when a
64、 set of objects forms a clique. Like prototype-based clusters, such clusters tend to be globular.</p><p> Density-Based A cluster is a dense region of objects that is surrounded bya region of low density. F
65、igure 8.2(d) shows some density-based clusters for data created by adding noise to the data of Figure 8.2(c). The two circular clusters are not merged, as in Figure 8.2(c), because the bridge between them fades into the
66、noise. Likewise, the curve that is present in Figure 8.2(c) also fades into the noise and does not form a cluster in Figure 8.2(d). A density-based de?nition of a cluster is often </p><p> Shared-Property (
67、Conceptual Clusters) More generally, we can de?ne a cluster as a set of objects that share some property. This de?nition encompasses all the previous de?nitions of a cluster; e.g., objects in a center-based cluster share
68、 the property that they are all closest to the same centroid or medoid. However, the shared-property approach also includes new types of clusters. Consider the clusters shown in Figure 8.2(e). A triangular area (cluster)
69、 is adjacent to a rectangular one, and ther</p><p><b> Road Map</b></p><p> In this chapter, we use the following three simple, but important techniques to introduce many of the co
70、ncepts involved in cluster analysis.</p><p> ? K-means. This is a prototype-based, partitional clustering technique that attempts to ?nd a user-speci?ed number of clusters (K ), which are represented by the
71、ir centroids.</p><p> ? Agglomerative Hierarchical Clustering. This clustering approach refers to a collection of closely related clustering techniques that produce a hierarchical clustering by starting wit
72、h each point as a singleton cluster and then repeatedly merging the two closest clusters until a single, all- encompassing cluster remains. Some of these techniques have a natural interpretation in terms of graph-based c
73、lustering, while others have an interpretation in terms of a prototype-based approach.</p><p> ? DBSCAN. This is a density-based clustering algorithm that producesa partitional clustering, in which the numb
74、er of clusters is automatically determined by the algorithm. Points in low-density regions are classi?ed as noise and omitted; thus, DBSCAN does not produce a complete lustering.</p><p><b> 中文譯文:</
75、b></p><p><b> 聚類分析</b></p><p><b> —基本概念及算法</b></p><p> 聚類分析將數(shù)據(jù)分為有意義的,有用的,或兩者兼而有之的組(集群)。如果目標(biāo)群體是有意義的,那么集群應(yīng)該捕獲數(shù)據(jù)的自然結(jié)構(gòu)。但是在某些情況下,聚類分析只是一個用于其他目的有用的起點,如數(shù)據(jù)匯總。無
76、論是理解或效用,聚類分析,長久以來在各個領(lǐng)域扮演重要角色:心理學(xué)等社會科學(xué),生物學(xué),統(tǒng)計,模式識別,信息檢索,機器學(xué)習(xí)和數(shù)據(jù)采集。</p><p> 聚類分析已經(jīng)被應(yīng)用到許多實際問題中。我們按照聚類的目的是了解或者實用而提供了一些具體的例子。 聚類理解類。對象或概念上是有意義的群體,有著共同的特點,在人們?nèi)绾畏治龊驼f明事物上有重要的作用。事實上,人類善于將對象和特定對象分成組(集群)并且將它們分類。例如
77、,即使是相對較小的兒童也可以快速識別出照片中的建筑物,車輛,人物,動物,植物等拍攝對象。在數(shù)據(jù)理解方面方面,集群是潛在的類,而聚類分析就是自動將集群分類的技術(shù)。以下是一些例子: ?生物學(xué)。生物學(xué)家們花了很多年創(chuàng)造了萬物分類(分層分類):領(lǐng)域、語系、類、秩序、科、屬和種。因此,這也許并不奇怪,在群集分析的早期多是試圖建立一個數(shù)學(xué)學(xué)科分類,可以自動找到這樣的分類結(jié)構(gòu)。最近,生物學(xué)家已經(jīng)將聚類應(yīng)用到遺傳信息處理放面。例如,集群已被用于
78、尋找具有類似的功能基因組。 ?信息檢索。萬維網(wǎng)有數(shù)十億的網(wǎng)頁,一個搜索引擎的查詢結(jié)果可以返回數(shù)千頁。集群可以用來將這些搜索結(jié)果歸為具有相同點的一大類。例如,一個“電影”的查詢可能會返回到諸如評論,預(yù)告片,明星和劇院類別分組的網(wǎng)頁。每個類別(集群)可以分成子類別(子集),產(chǎn)生一個層次結(jié)構(gòu),</p><p> 在一個模糊聚類,每個對象屬于每一個成員的體重是介于0(絕對不屬于)和1群集。換句話說,集群被視為模
79、糊集。 (數(shù)學(xué),模糊集在其中任何一個對象屬于0和1之間的權(quán)重)。模糊聚類,我們經(jīng)常施加額外的約束,即對每個對象的權(quán)重之和必須等于1。)同樣,概率聚類技術(shù)的概率計算每個點屬于每個集群,還必須總結(jié)這些概率為1。因為成員權(quán)或任何對象之和為1,概率聚類沒有解決真正的多用戶情況下,如一個學(xué)生的雇員,其中一個對象屬于多個類的情況。相反,這些方法是最合適的避免在分配一個對象只有一個群集的隨意性可能接近時數(shù)。 或概率的模糊聚類經(jīng)常轉(zhuǎn)換為非重疊群,每個對
80、象分配到集群中,其成員的重量或概率最高。 整的聚類與偏每個對象分配到集群,而不是一個局部聚類。對于一個局部聚類的動機是,在一個數(shù)據(jù)集的某些對象可能不屬于明確界定的群體。很多時候,在數(shù)據(jù)集對象可能代表噪音,孤立點,或“無趣的背景?!崩?,一些報紙報道,可能都有一個共同的主題,如全球變暖一類。因此,為了找到在上個月的故事中的重要議題,我們可能要搜索的文件是由一個共同的主題緊密相關(guān)的集群。在其他情況下,需求完整的對象聚類.例如,一個
81、應(yīng)用程序,使用聚類組織</p><p> 1.1.3簇的不同類型</p><p> 聚類的目的是找到對象(組),其中的用處是通過數(shù)據(jù)分析確定的目標(biāo)有用的群體。毫不奇怪,有幾種不同的群集的概念在實踐中證明是有益的。為了直觀地說明這些類型的集群中的差異,我們使用二維點,如圖8.2所示,由于我們的數(shù)據(jù)對象,我們強調(diào),但是,這里描述的集群類型同樣為其他類型的數(shù)據(jù)無效。</p>&
82、lt;p> 分隔集群是一組對象,其中每個對象是密切(或更多類似)集群中的每個對象比其他任何對象群集。有時,一個閾值用來指定集群中的所有對象都必須充分接近(或類似)彼此。這種理想的集群技術(shù)信息研究所的定義,只有當(dāng)數(shù)據(jù)滿足包含從很遠的自然相互集群。圖8.2(a)給出了一個良好的分離集群的兩個例子,在一個兩維空間點群組成。任意兩點之間的距離是不同的群體大于他任意兩點間的距離保持在一組。井分隔群集不須球狀,但可以有任意形狀。</p
83、><p> 基于原型的集群是一組對象,其中每個對象是密切(更多類似)的原型定義,而不是任何其他群集原型群集。對于連續(xù)屬性的數(shù)據(jù),集群的原型往往是重心,即所有在集群點的平均值(平均)。當(dāng)質(zhì)心是沒有意義的,例如,當(dāng)數(shù)據(jù)類別屬性,原型往往是最有代表性的集群。對于許多類型的數(shù)據(jù),該原型可以被看作是最核心的一點,在這種情況下,我們通常所說的中心的集群原型為基礎(chǔ)的集群。毫不奇怪,這種集群往往是球狀。圖8.2(b)顯示了一個中心
84、為基礎(chǔ)的集群的例子。</p><p> 基于圖的數(shù)據(jù)是,如果作為一個圖,其中的節(jié)點對象和對象之間的聯(lián)系表示連接(參見第2.1.2節(jié)),然后一組可以作為一個連接組件中定義的代表,也就是說,對象的組彼此相連,但不會對本集團以外對象的連接。一種基于圖的集群重要的例子是連續(xù)性的群集,其中只有兩個對象,如果它們連接在一個相互指定距離之內(nèi)。這意味著,每一個連續(xù)性的群集對象是接近到群集中的其他一些對象比任何一個不同的聚點。圖
85、8.2(c)給出了兩維點,如簇的例子。這群集的定義是有用的當(dāng)集群不規(guī)則或交織在一起,但可以有麻煩時,噪音存在,因為由兩個圖8.2(c)項的積分可以合并兩個不同的簇群小橋球形說明。的圖形為基礎(chǔ)的集群其他類型也是可能的。其中一種辦法(第8.3.2)定義為一個集團群集,也就是說,一組節(jié)點在一個完全相互連接圖。特別是,如果我們加入了對象之間的距離為彼此連接,形成一個群集,當(dāng)一個對象的形式設(shè)置一個集團。像原型為基礎(chǔ)的集群,這種集群往往是球狀。&
86、lt;/p><p> 基于密度的群集是一組環(huán)繞低密度區(qū)域?qū)ο蟮拿芗瘏^(qū)。圖8.2(d)顯示一定密度的數(shù)據(jù)為基礎(chǔ),加入噪聲的數(shù)據(jù)圖8.2(三)創(chuàng)建群集。集群的兩個圓形不合并,如圖8.2(c)項,因為它們之間的橋梁進入噪音消失。同樣地,存在的曲線圖8.2(c)又消失的噪音之中,并不構(gòu)成群集在圖8.2(d)項。集群的一個基于密度的定義時,往往采用集群不規(guī)則或交織,在噪聲和離群點都存在。相比之下,集群的一個連續(xù)性的定義都不能
87、很好的工作數(shù)據(jù),圖8.2(d)由于噪聲往往會形成集群之間的橋梁。</p><p> 共享屬性(概念集群)更普遍,我們可以定義為一個共享對象的一些屬性設(shè)置群集。這個定義包括所有以前的群集的定義,在一個中心的群集共享,例如,對象的屬性,它們都是相同的質(zhì)心。然而,共享屬性的方法還包括集群研究的新類型??紤]圖8.2所示的群集(五)。一個三角區(qū)(集群)毗鄰長方形之一,有兩個交織在一起的圓圈(集群)。在這兩種情況下,聚類算
88、法將需要集群的一個非常具體的概念,成功地檢測到這些集群。對查找安泰這種集群的過程稱為概念聚類。然而,過于復(fù)雜的集群的一個概念,會考慮在模式識別領(lǐng)域的我們,??所以我們只考慮在這本書集群簡單的類型。</p><p><b> 路線圖</b></p><p> 在這一章中,我們使用以下三個簡單的,但聚類分析所涉及的許多重要概念。</p><p>
89、; ?K均值。這是一個以原型為基礎(chǔ),試圖找到一個均值的聚類技術(shù)。</p><p> ?凝聚層次聚類。此分群方式是指一組密切相關(guān)的聚類技術(shù),開始時各自作為一個單身聚點,然后反復(fù),直到一個單一的合并兩個最接近的集群的層次聚類集合,全方位的集群仍然存在。這些方法在一些有基于圖形的聚類方面自然解釋,而其他人在一個原型為基礎(chǔ)的方法方面作出解釋。</p><p> ?DBSCAN。這是一個基于密
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 眾賞文庫僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時也不承擔(dān)用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
評論
0/150
提交評論