<p> What is Data Mining?</p>
<p> Many people treat data mining as a synonym for another popularly used term, “Knowledge Discovery in Databases”, or KDD. Others view data mining as simply an essential step in the process of knowledge discovery in databases. Knowledge discovery consists of an iterative sequence of the following steps: </p>
<p> · data cleaning: to remove noise or irrelevant data,</p>
<p> · data integration: where multiple data sources may be combined,</p>
<p> · data selection: where data relevant to the analysis task are retrieved from the database,</p>
<p> · data transformation: where data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations,</p>
<p> · data mining: an essential process where intelligent methods are applied in order to extract data patterns,</p>
<p> · pattern evaluation: to identify the truly interesting patterns representing knowledge based on some interestingness measures, and</p>
<p> · knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to the user.
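As a minimal sketch, the seven steps listed above can be chained as plain functions. Everything here is an assumption made for illustration and does not come from the text: the record layout (`region`, `items`), the function names, and the choice of pair co-occurrence counting as the "intelligent method" in the mining step.

```python
from collections import Counter
from itertools import combinations

def clean(records):
    """Data cleaning: drop records whose item list is missing or empty."""
    return [r for r in records if r.get("items")]

def integrate(*sources):
    """Data integration: combine several record sources into one list."""
    return [r for src in sources for r in src]

def select(records, region):
    """Data selection: keep only the records relevant to the task."""
    return [r for r in records if r.get("region") == region]

def transform(records):
    """Data transformation: reduce each record to a sorted tuple of items."""
    return [tuple(sorted(set(r["items"]))) for r in records]

def mine(transactions):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(t, 2))
    return counts

def evaluate(counts, n_transactions, min_support):
    """Pattern evaluation: keep pairs whose support meets the threshold."""
    return {pair: c / n_transactions for pair, c in counts.items()
            if c / n_transactions >= min_support}

def present(patterns):
    """Knowledge presentation: render the patterns as readable lines."""
    return [f"{a} + {b}: support={s:.2f}"
            for (a, b), s in sorted(patterns.items())]

def kdd(sources, region, min_support):
    """Run the steps in the order listed: clean, integrate, select, ..."""
    cleaned = [clean(src) for src in sources]
    transactions = transform(select(integrate(*cleaned), region))
    if not transactions:
        return []
    counts = mine(transactions)
    return present(evaluate(counts, len(transactions), min_support))

# Hypothetical toy data from two sources.
store_a = [
    {"region": "north", "items": ["bread", "milk"]},
    {"region": "north", "items": ["bread", "milk", "eggs"]},
    {"region": "south", "items": ["tea"]},
]
store_b = [
    {"region": "north", "items": ["milk", "eggs"]},
    {"region": "north", "items": []},  # noise: removed by clean()
]
report = kdd([store_a, store_b], region="north", min_support=0.5)
print(report)  # ['bread + milk: support=0.67', 'eggs + milk: support=0.67']
```

In a real system the steps are iterative rather than a single pass, and the evaluation threshold would typically come from the user or a knowledge base.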
</p>
<p> The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user, and may be stored as new knowledge in the knowledge base. Note that according to this view, data mining is only one step in the entire process, albeit an essential one since it uncovers hidden patterns for evaluation. </p>
<p> We agree that data mining is a step in the knowledge discovery process. However, in industry, in media, and in the database research milieu, the term “data mining” is becoming more popular than the longer term “knowledge discovery in databases”. Therefore, in this book, we choose to use the term “data mining”. We adopt a broad view of data mining functionality: data mining is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or
other information repositories.</p>
<p> Based on this view, the architecture of a typical data mining system may have the following major components: </p>
<p> 1. Database, data warehouse, or other information repository. This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data. </p>
<p> 2. Database or data warehouse server. The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request. </p>
<p> 3. Knowledge base. This is the domain knowledge that is used to guide the search or to evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may also be included. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).</p>
<p> 4. Data mining engine. This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association analysis,
classification, evolution and deviation analysis.</p>
<p> 5. Pattern evaluation module. This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns. It may access interestingness thresholds stored in the knowledge base. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process, so as to confine the search to only the interesting patterns.</p>
<p> 6. Graphical user interface. This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the
intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.</p>
<p> From a data warehouse perspective, data mining can be viewed as an advanced stage of on-line analytical processing (OLAP). However, data mining goes far beyond the narrow scope of summarization-style analytical processing of data warehouse systems by incorporating more advanced techniques for data understanding. </p>
<p> While there may be many “data mining systems” on the market, not all of them can perform true data mining. A data analysis system that does not handle large amounts of data can at most be categorized as a machine learning system, a statistical data analysis tool, or an experimental system prototype. A system that can only perform data or information retrieval, including finding aggregate values, or that performs deductive query answering in large databases should be more appropriately categorized as a database system, an information retrieval system, or a deductive database system.</p>
<p> Data mining involves an integration of techniques from multiple disciplines such as database technology, statistics, machine learning, high performance computing, pattern recognition, neural networks, data visualization,
information retrieval, image and signal processing, and spatial data analysis. We adopt a database perspective in our presentation of data mining in this book. That is, emphasis is placed on efficient and scalable data mining techniques for large databases. By performing data mining, interesting knowledge, regularities, or high-level information can be extracted from databases and viewed or browsed from different angles.</p>
<p> A classification of data mining systems </p>
<p> Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database systems, statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high performance computing. Depending on the kinds of data to be mined or on the given data mining application, the data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, business, bioinformatics, or psychology.</p>
<p> Because of the diversity of disciplines contributing to data mining, data mining research is expected to generate a large variety of data mining systems. Therefore, it is necessary to provide a clear classification of data mining systems. Such a classification may help potential users distinguish data mining systems and identify those that best match their needs. Data mining systems can be categorized according to various criteria, as follows. </p>
<p> 1) Classification according to the kinds of databases mined. </p>
<p> A data mining system can be classified according to the kinds of databases mined. Database systems themselves can be classified according to different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly. </p>
<p> For instance, if classifying according to data models, we may have a relational, transactional, object-oriented, object-relational, or data warehouse mining system. If classifying according to the special types of data handled, we may have a spatial, time-series, text, or multimedia data mining system, or a World-Wide Web mining system. Other system types include heterogeneous data mining systems and legacy data mining systems.</p>
<p> 2) Classification according to the kinds of knowledge mined.</p>
<p> Data mining systems can be categorized according to the kinds of
knowledge they mine, i.e., based on data mining functionalities, such as characterization, discrimination, association, classification, clustering, trend and evolution analysis, deviation analysis, similarity analysis, etc. A comprehensive data mining system usually provides multiple and/or integrated data mining functionalities. </p>
<p> Moreover, data mining systems can also be distinguished based on the granularity or levels of abstraction of the knowledge mined, including generalized knowledge (at a high level of abstraction), primitive-level knowledge (at a raw data level), or knowledge at multiple levels (considering several levels of abstraction). An advanced data mining system should facilitate the discovery of knowledge at multiple levels of abstraction.</p>
<p> 3) Classification according to the kinds of techniques utilized. </p>
<p> Data mining systems can also be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems), or the methods of data analysis employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization,
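pattern recognition, neural networks, and so on). A sophisticated data mining system will often adopt multiple data mining techniques, or an effective, integrated technique that combines the merits of several approaches.</p>
<p><b> What Is Data Mining? (Translation)</b></p>
<p> Many people regard data mining as a synonym for another commonly used term, knowledge discovery in databases, or KDD. Others regard data mining as merely one essential step in the process of knowledge discovery in databases. The knowledge discovery process consists of the following steps:</p>
<p> 1) Data cleaning: remove noise or inconsistent data,</p>
<p> 2) Data integration: multiple data sources may be combined,</p>
<p> 3) Data selection: retrieve from the database the data relevant to the analysis task,</p>
<p> 4) Data transformation: transform or consolidate the data into forms suitable for mining, for example through summary or aggregation operations,</p>
<p> 5) Data mining: the essential step, in which intelligent methods are used to extract data patterns,</p>
<p> 6) Pattern evaluation: identify the truly interesting patterns that represent knowledge, based on some interestingness measure,</p>
<p> 7) Knowledge presentation: use visualization and knowledge representation techniques to present the mined knowledge to the user.</p>
<p> The data mining step may interact with the user or with a knowledge base. Interesting patterns are presented to the user or stored in the knowledge base as new knowledge. Note that, according to this view, data mining is only one step in the whole process, albeit the most important one, because it uncovers hidden patterns.</p>
<p> We agree that data mining is one step in the knowledge discovery process. However, in industry, in the media, and in the database research community, “data mining” is more popular than the longer term “knowledge discovery in databases”. Therefore, this book uses the term data mining. We adopt a broad view of data mining: data mining is the process of discovering interesting knowledge from large amounts of data stored in databases or other information repositories.</p>
<p> Based on this view, a typical data mining system has the following major components:</p>
<p> Database, data warehouse, or other information repository: this is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and integration may be performed on the data.</p>
<p> Database or data warehouse server: based on the user's data mining request, the database or data warehouse server is responsible for fetching the relevant data.</p>
<p> Knowledge base: this is the domain knowledge used to guide the search or to evaluate the interestingness of resulting patterns. Such knowledge may include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge about user beliefs may also be included; it can be used to assess a pattern's interestingness based on its unexpectedness. Other examples of domain knowledge are interestingness constraints or thresholds, and metadata (for example, describing data from multiple heterogeneous sources).</p>
<p> Data mining engine: this is the essential part of a data mining system. It consists of a set of functional modules for characterization, association, classification, cluster analysis, and evolution and deviation analysis.</p>
<p> Pattern evaluation module: this component typically uses interestingness measures and interacts with the data mining modules so as to focus the search on interesting patterns. It may use interestingness thresholds to filter the discovered patterns. The pattern evaluation module may also be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is recommended to push pattern evaluation as deep as possible into the mining process, so as to confine the search to the interesting patterns.</p>
<p> From a data warehouse perspective, data mining can be viewed as an advanced stage of on-line analytical processing (OLAP). However, by incorporating more advanced techniques for data understanding, data mining goes further than the summarization-style analytical processing of data warehouses.</p>
<p> Although there are many “data mining systems” on the market, not all of them can perform true data mining. A data analysis system that cannot handle large amounts of data can at most be called a machine learning system, a statistical data analysis tool, or an experimental system prototype. A system that can only perform data or information retrieval, including finding aggregate values or answering deductive queries in large databases, should be categorized as a database system, an information retrieval system, or a deductive database system.</p>
<p> Data mining involves the integration of techniques from multiple disciplines, including database technology, statistics, machine learning, high performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis. In discussing data mining in this book, we adopt a database perspective. That is, emphasis is placed on efficient and scalable data mining techniques for large databases. An algorithm is scalable if, given the available system resources such as main memory and disk space, its running time grows linearly with the size of the database. Through data mining, interesting knowledge, regularities, or high-level information can be extracted from databases and viewed or browsed from different angles. The discovered knowledge can be applied to decision making, process control, information management, query processing, and so on. Therefore, data mining is considered by the information industry to be one of the most important frontiers of database systems and one of the most promising interdisciplinary developments in the information industry.</p>
<p> Data mining is an interdisciplinary field, influenced by a number of disciplines, including database systems, statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high performance computing. Depending on the kinds of data to be mined or on the given data mining application, a data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, business, bioinformatics, or psychology.</p>
<p> Because data mining draws on many disciplines, data mining research has produced a large variety of data mining systems. It is therefore necessary to give a clear classification of data mining systems. Such a classification can help users distinguish data mining systems and identify those that best match their needs. Data mining systems can be classified according to various criteria, as follows:</p>
<p> 1) Classification according to the kinds of databases mined.</p>
<p> A data mining system can be classified according to the kinds of databases mined. Database systems themselves can be classified according to different criteria (such as data models, or the types of data or applications involved), and each class may require its own data mining technique. Data mining systems can therefore be classified accordingly.</p>
<p> For example, if classifying according to data models, we may have relational, transactional, object-oriented, object-relational, or data warehouse data mining systems. If classifying according to the special types of data handled, we may have spatial, time-series, text, or multimedia data mining systems, or World-Wide Web data mining systems.</p>
<p> 2) Classification according to the kinds of knowledge mined.</p>
<p> Data mining systems can be classified according to the kinds of knowledge they mine, that is, according to data mining functionalities such as characterization, discrimination, association, classification, clustering, outlier analysis and evolution analysis, deviation analysis, and similarity analysis. A comprehensive data mining system should provide multiple and/or integrated data mining functionalities.</p>
<p> In addition, data mining systems can be distinguished according to the granularity or levels of abstraction of the knowledge mined, including generalized knowledge (at a high level of abstraction), primitive-level knowledge (at the raw data level), or knowledge at multiple levels (considering several levels of abstraction). An advanced data mining system should support knowledge discovery at multiple levels of abstraction.</p>
<p> Data mining systems can also be classified as those that mine data regularities (commonly occurring patterns) and those that mine data irregularities (such as exceptions or outliers). In general, concept description, association analysis, classification, prediction, and clustering mine data regularities and reject outliers as noise. These methods can also help detect outliers.</p>
<p> 3) Classification according to the kinds of techniques utilized.</p>
<p> Data mining systems can also be classified according to the data mining techniques they use. These techniques can be described according to the degree of user interaction (for example, autonomous systems, interactive exploratory systems, query-driven systems), or the methods of data analysis employed (for example, database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on). A sophisticated data mining system usually adopts multiple data mining techniques, or an effective, integrated technique that combines the merits of several approaches.</p>
<p> Data Mining and Data Publishing</p>
<p> Data mining is the extraction of large numbers of interesting patterns or knowledge from huge amounts of data. The initial idea of privacy-preserving data mining (PPDM) was to extend traditional data mining techniques to work with data modified to mask sensitive information. The key issues were how to modify the data and how to recover the data mining result from the modified data. Privacy-preserving data mining considers the problem of running data mining algorithms on confidential data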
that is not supposed to be revealed even to the party running the algorithm.</p>
<p> Although data mining is potentially useful, many data holders are reluctant to provide their data for data mining for fear of violating individual privacy. In recent years, studies have been carried out to ensure that the sensitive information of individuals cannot be identified easily.</p>
<p> Anonymity models: k-anonymization techniques have been the focus of intense research in the last few years. To anonymize data while minimizing the information loss resulting from data modifications, several extended models have been proposed, which are discussed as follows. </p>
<p> 1. k-Anonymity </p>
<p> k-anonymity is one of the most classic models: a technique that prevents joining attacks by generalizing and/or suppressing portions of the released microdata so
that no individual can be uniquely distinguished from a group of size k. A data set is k-anonymous (k ≥ 1) if each record in the data set is indistinguishable from at least (k − 1) other records within the same data set. The larger the value of k, the better the privacy is protected. k-anonymity can ensure that individuals cannot be uniquely identified by linking attacks.</p>
<p> 2. Extended Models </p>
<p> k-anonymity does not provide sufficient protection against attribute disclosure. The notion of l-diversity attempts to solve this problem by requiring that each equivalence class has at least l well-represented values for each sensitive attribute. l-diversity has some advantages over k-anonymity, because a k-anonymous data set permits strong attacks when the sensitive attributes lack diversity. In this model, an equivalence class is said to have l-diversity if there are at least l well-represented values for the sensitive attribute.</p>
<p> 3. Related Research Areas </p>
<p> Several polls show that the public has an increased sense of privacy loss. Since data mining is often a key component of information systems, homeland security systems, and monitoring and surveillance systems, it gives a wrong impression that data mining is a technique for privacy intrusion. This lack of trust has become an obstacle to benefiting from the technology. For example, the potentially beneficial data mining research project, Terrorism Information Awareness (TIA), was terminated by the US Congress due to its controversial procedures of collecting, sharing, and analyzing the trails left by individuals.</p>
<p> 1) PPDP focuses on techniques for publishing
data, not techniques for data mining. In fact, it is expected that standard data mining techniques are applied to the published data. In contrast, the data holder in PPDM needs to randomize the data in such a way that data mining results can be recovered from the randomized data. To do so, the data holder must understand the data mining tasks and algorithms involved. This level of involvement is not expected of the data holder in PPDP, who usually is not an expert in data mining.</p>
<p> 2) Neither randomization nor encryption preserves the truthfulness of values at the record level; therefore, the released data are basically meaningless to the recipients. In such
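a case, the data holder in PPDM may consider releasing the data mining results rather than the scrambled data. </p>
<p> 3) PPDP primarily “anonymizes” the data by hiding the identity of record owners, whereas PPDM seeks to directly hide the sensitive data. Excellent surveys and books on randomization and cryptographic techniques for PPDM can be found in the existing literature. A family of research work called privacy-preserving distributed data mining (PPDDM) aims at performing some data mining task on a set of private databases owned by different parties. It follows the principle of Secure Multiparty Computation (SMC), and prohibits any data sharing other than the final data mining result.</p>
<p> Some other works of statistical disclosure control (SDC) focus on the study of the non-interactive query model, in which the data recipients can submit one query to the system. This type of non-interactive query model may not fully address the information needs of data recipients because, in some cases, it is very difficult for a data recipient to accurately construct a query for a data mining task in one shot. Consequently, there are a series of studies on the interactive query model, in which the data recipients, including adversaries, can submit a sequence of queries based on previously received query results.</p>
<p> This paper presents a survey of the most common attack techniques for anonymization-based PPDM and PPDP and explains their effects on data privacy. k-anonymity protects the identity of respondents and reduces linking attacks. Under a homogeneity attack, however, a simple k-anonymity model fails, and a concept that prevents this attack is needed; the solution is l-diversity. All tuples are arranged in a well-represented form, so the adversary is diverted to l places, or l sensitive attributes. l-diversity is limited in the case of a background knowledge attack, because no one can predict the knowledge level of the adversary.</p>
<p><b> Data Mining and Data Publishing (Translation)</b></p>
<p> Data mining is the extraction of large numbers of interesting patterns or knowledge from huge amounts of data. The initial idea of privacy-preserving data mining (PPDM) was to extend traditional data mining techniques to work with data modified to mask sensitive information. The key issues were how to modify the data and how to recover the data mining result from the modified data. PPDM considers the problem of running data mining algorithms on confidential data that is not supposed to be revealed even to the party running the algorithm. In contrast, privacy-preserving data publishing (PPDP) is not necessarily tied to a specific data mining task, and the data mining task may be unknown at the time of data publishing. PPDP studies how to transform raw data into a version that is immunized against privacy attacks but still supports effective data mining tasks. PPDM and PPDP have become increasingly popular because they allow privacy-sensitive data to be shared for analysis purposes. One well-studied approach is the k-anonymity model, which in turn led to other models such as confidence bounding, l-diversity, t-closeness, (α,k)-anonymity, etc. In particular, all known mechanisms try to minimize information loss, and such an attempt provides a loophole for attacks. The aim of this paper is to present a survey of the most common attack techniques for anonymization-based PPDM and PPDP and to explain their effects on data privacy.</p>
<p> Although data mining is potentially useful, many data holders are reluctant to provide their data for data mining for fear of violating individual privacy. In recent years, studies have been carried out to ensure that the sensitive information of individuals cannot be identified easily.</p>
<p> Anonymity models (k-anonymity) have been the focus of research over the past few years. To anonymize data while minimizing the information loss caused by data modifications, several extended models have been proposed, which are discussed as follows.</p>
<p><b> 1. The k-Anonymity Model</b></p>
<p> k-anonymity is one of the most classic models: a technique that prevents joining attacks by generalizing and/or suppressing portions of the released microdata so that no individual can be uniquely distinguished from a group of size k. A data set is k-anonymous (k ≥ 1) if each record in the data set is indistinguishable from at least (k − 1) other records within the same data set. The larger the value of k, the better the privacy is protected. k-anonymity can ensure that individuals cannot be uniquely identified by linking attacks.</p>
<p><b> 2. Extended Models</b></p>
<p> k-anonymity does not provide sufficient protection against attribute disclosure. The notion of l-diversity attempts to solve this problem by requiring that each equivalence class has at least l well-represented values for each sensitive attribute. l-diversity has some advantages over k-anonymity, because a k-anonymous data set permits strong attacks when the sensitive attributes lack diversity. In this model, an equivalence class is said to have l-diversity if there are at least l well-represented values for the sensitive attribute. Because attribute values are semantically related and different values have different levels of sensitivity, a further requirement is that, after anonymization, the frequency (as a fraction) of a sensitive value in any equivalence class is no more than α.</p>
<p><b> 3. Related Research Areas</b></p>
<p> Several polls show that the public has an increased sense of privacy loss. Since data mining is often a key component of information systems, homeland security systems, and monitoring and surveillance systems, it gives a wrong impression that data mining is a technique for privacy intrusion. This lack of trust has become an obstacle to benefiting from the technology. For example, the potentially beneficial data mining research project, Terrorism Information Awareness (TIA), was terminated by the US Congress due to its controversial procedures of collecting, sharing, and analyzing the trails left by individuals. Motivated by privacy concerns about data mining tools, a research area called privacy-preserving data mining (PPDM) emerged in 2000. The initial idea of PPDM was to extend traditional data mining techniques to work with data modified to mask sensitive information. The key issues were how to modify the data and how to recover the data mining result from the modified data. The solutions were often tightly coupled with the data mining algorithms under consideration. In contrast, privacy-preserving data publishing (PPDP) is not necessarily tied to a specific data mining task, and the data mining task is sometimes unknown at the time of data publishing. Furthermore, some PPDP solutions emphasize preserving the truthfulness of the data at the record level, whereas PPDM solutions often do not preserve such a property. PPDP differs from PPDM in several major ways, as follows:</p>
<p> 1) PPDP focuses on techniques for publishing data, not techniques for data mining. In fact, it is expected that standard data mining techniques are applied to the published data. In contrast, the data holder in PPDM needs to randomize the data in such a way that data mining results can be recovered from the randomized data. To do so, the data holder must understand the data mining tasks and algorithms involved. This level of involvement is not expected of the data holder in PPDP, who usually is not an expert in data mining.</p>
<p> 2) Neither randomization nor encryption preserves the truthfulness of values at the record level; therefore, the released data are basically meaningless to the recipients. In such a case, the data holder in PPDM may consider releasing the data mining results rather than the scrambled data.</p>
<p> 3) PPDP primarily “anonymizes” the data by hiding the identity of record owners, whereas PPDM seeks to directly hide the sensitive data. Excellent surveys and books on randomization and cryptographic techniques for PPDM can be found in the existing literature. A family of research work called privacy-preserving distributed data mining (PPDDM) aims at performing some data mining task on a set of private databases owned by different parties. It follows the principle of Secure Multiparty Computation (SMC), and prohibits any data sharing other than the final data mining result. Clifton et al. present a suite of SMC operations, such as secure sum, secure set union, secure size of set intersection, and scalar product, that are useful for many data mining tasks. In contrast, PPDP does not perform the actual data mining task, but is concerned with how to publish the data so that the anonymized data are useful for data mining. We can say that PPDP protects privacy at the data level, while PPDDM protects privacy at the process level. They address different privacy models and data mining scenarios. In the field of statistical disclosure control (SDC), research focuses on privacy-preserving methods for publishing statistical tables. SDC considers three types of disclosure, namely identity disclosure, attribute disclosure, and inferential disclosure. Identity disclosure occurs if an adversary can identify a respondent from the published data. Revealing that an individual is a respondent of a data collection may or may not violate confidentiality requirements. Attribute disclosure occurs when confidential information about a respondent is revealed and can be attributed to the respondent.</p>
<p> Some other works of SDC focus on the study of the non-interactive query model, in which the data recipients can submit one query to the system. This type of non-interactive query model may not fully address the information needs of data recipients because, in some cases, it is very difficult for a data recipient to accurately construct a query for a data mining task in one shot. Consequently, there are a series of studies on the interactive query model, in which the data recipients, including adversaries, can submit a sequence of queries based on previously received query results. The database server is responsible for keeping track of all queries of each user and determining whether the currently received query violates the privacy requirement with respect to all previous queries. One limitation of any interactive privacy-preserving query system is that it can only answer a sublinear number of queries in total; otherwise, an adversary (or a group of corrupted data recipients) would be able to reconstruct the original data, which is a very strong violation of privacy. When the maximum number of queries is reached, the query service must be closed to avoid privacy leaks. In the case of the non-interactive query model, the adversary can issue only one query; therefore, the non-interactive query model cannot achieve the same degree of privacy as defined by the interactive model. One may consider privacy-preserving data publishing to be a special case of the non-interactive query model.</p>
<p> This paper presents a survey of the most common attack techniques for anonymization-based PPDM and PPDP and explains their effects on data privacy. k-anonymity protects the identity of respondents and reduces linking attacks. Under a homogeneity attack, however, a simple k-anonymity model fails, and a concept that prevents this attack is needed; the solution is l-diversity. All tuples are arranged in a well-represented form, so the adversary is diverted to l places, or l sensitive attributes. l-diversity is limited in the case of a background knowledge attack, because no one can predict the knowledge level of the adversary. It is observed that, when using generalization and suppression, these techniques are also applied to attributes that do not need this degree of privacy, which reduces the precision of the published table. e-NSTAM (extended sensitive tuples anonymity method) is applied to sensitive tuples and reduces information loss, but this method also fails in the case of multiple sensitive tuples. Generalization with suppression is also a cause of data loss, because suppression means that unsuitable values are not released.</p>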
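The k-anonymity and l-diversity definitions discussed above can be checked mechanically. The sketch below is illustrative only: the table, the column names, and the function names are hypothetical, and "well-represented" is read in its simplest form as "at least l distinct sensitive values per equivalence class" (the text does not fix a particular reading).

```python
from collections import defaultdict

def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].append(row)
    return groups

def is_k_anonymous(rows, quasi_identifiers, k):
    """Each record must be indistinguishable from at least k - 1 others."""
    groups = equivalence_classes(rows, quasi_identifiers)
    return all(len(g) >= k for g in groups.values())

def is_distinct_l_diverse(rows, quasi_identifiers, sensitive, l):
    """Each equivalence class must hold at least l distinct sensitive values."""
    groups = equivalence_classes(rows, quasi_identifiers)
    return all(len({r[sensitive] for r in g}) >= l for g in groups.values())

# Hypothetical released table: age generalized to a range, ZIP code
# partially suppressed.
table = [
    {"age": "20-29", "zip": "130**", "disease": "flu"},
    {"age": "20-29", "zip": "130**", "disease": "flu"},
    {"age": "30-39", "zip": "148**", "disease": "cancer"},
    {"age": "30-39", "zip": "148**", "disease": "flu"},
]
qi = ["age", "zip"]

print(is_k_anonymous(table, qi, k=2))                    # True
print(is_distinct_l_diverse(table, qi, "disease", l=2))  # False
```

The table is 2-anonymous but not 2-diverse: everyone in the first equivalence class has the same disease, so an adversary who can place a person in that class learns the sensitive value without identifying the record. This is the homogeneity attack that motivates l-diversity.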