<p> What is Data Mining?</p>
<p> Many people treat data mining as a synonym for another popularly used term, “Knowledge Discovery in Databases”, or KDD. Alternatively, others view data mining as simply an essential step in the process of knowledge discovery in databases. Knowledge discovery consists of an iterative sequence of the following steps: </p>
<p> · data cleaning: to remove noise or irrelevant data, </p>
<p> · data integration: where multiple data sources may be combined,</p>
<p> · data selection: where data relevant to the analysis task are retrieved from the database,</p>
<p> · data transformation: where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance,</p>
<p> · data mining: an essential process where intelligent methods are applied in order to extract data patterns, </p>
<p> · pattern evaluation: to identify the truly interesting patterns representing knowledge based on some interestingness measures, and </p>
<p> · knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to the user.
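The steps listed above can be sketched as a small pipeline. The following is a minimal illustration in Python, not an implementation from the source: the function names, the record fields (`customer`, `item`, `region`), the toy data, and the choice of item-pair support as both the pattern and the interestingness measure are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def clean(records):
    """Data cleaning: drop records with a missing customer or item."""
    return [r for r in records if r.get("customer") and r.get("item")]


def integrate(*sources):
    """Data integration: combine several sources into one list of records."""
    return [r for src in sources for r in src]


def select(records, region):
    """Data selection: keep only the records relevant to the analysis task."""
    return [r for r in records if r.get("region") == region]


def transform(records):
    """Data transformation: aggregate rows into one item set per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return list(baskets.values())


def mine(baskets):
    """Data mining: count how many baskets contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts


def evaluate(pair_counts, n_baskets, min_support):
    """Pattern evaluation: keep pairs whose support reaches the threshold."""
    return {
        pair: count / n_baskets
        for pair, count in pair_counts.items()
        if count / n_baskets >= min_support
    }


def present(patterns):
    """Knowledge presentation: format patterns, highest support first."""
    ordered = sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{a} & {b}: support {s:.2f}" for (a, b), s in ordered]


def run_kdd(sources, region, min_support):
    """Chain the seven steps in the order given in the text."""
    cleaned = [clean(src) for src in sources]
    records = select(integrate(*cleaned), region)
    baskets = transform(records)
    return present(evaluate(mine(baskets), len(baskets), min_support))


# Hypothetical toy data: two sources, one noisy record, one irrelevant region.
store_a = [
    {"customer": "c1", "item": "milk", "region": "north"},
    {"customer": "c1", "item": "bread", "region": "north"},
    {"customer": "c2", "item": "milk", "region": "north"},
    {"customer": "c2", "item": "bread", "region": "north"},
    {"customer": None, "item": "eggs", "region": "north"},
]
store_b = [
    {"customer": "c3", "item": "milk", "region": "north"},
    {"customer": "c3", "item": "eggs", "region": "north"},
    {"customer": "c4", "item": "bread", "region": "south"},
]

for line in run_kdd([store_a, store_b], "north", 0.5):
    print(line)
```

In a real system each step would be far more involved, and the process is iterative: the result of pattern evaluation may send the analyst back to selection or transformation.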
</p>
<p> The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user, and may be stored as new knowledge in the knowledge base. Note that according to this view, data mining is only one step in the entire process, albeit an essential one since it uncovers hidden patterns for evaluation. </p>
<p> We agree that data mining is a step in the knowledge discovery process. However, in industry, in media, and in the database research milieu, the term “data mining” is becoming more popular than the longer term “knowledge discovery in databases”. Therefore, in this book, we choose to use the term “data mining”. We adopt a broad view of data mining functionality: data mining is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or
other information repositories.</p>
<p> Based on this view, the architecture of a typical data mining system may have the following major components: </p>
<p> 1. Database, data warehouse, or other information repository. This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data. </p>
<p> 2. Database or data warehouse server. The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request. </p>
<p> 3. Knowledge base. This is the domain knowledge that is used to guide the search, or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may also be included. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (for example, describing data from multiple heterogeneous sources).</p>
<p> 4. Data mining engine. This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association analysis,
classification, evolution and deviation analysis.</p>
<p> 5. Pattern evaluation module. This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns. It may access interestingness thresholds stored in the knowledge base. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process, so as to confine the search to only the interesting patterns.</p>
<p> 6. Graphical user interface. This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the
intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.</p>
<p> From a data warehouse perspective, data mining can be viewed as an advanced stage of on-line analytical processing (OLAP). However, data mining goes far beyond the narrow scope of summarization-style analytical processing of data warehouse systems by incorporating more advanced techniques for data understanding. </p>
<p> While there may be many “data mining systems” on the market, not all of them can perform true data mining. A data analysis system that does not handle large amounts of data can at most be categorized as a machine learning system, a statistical data analysis tool, or an experimental system prototype. A system that can only perform data or information retrieval, including finding aggregate values, or that performs deductive query answering in large databases should be more appropriately categorized as a database system, an information retrieval system, or a deductive database system.</p>
<p> Data mining involves an integration of techniques from multiple disciplines such as database technology, statistics, machine learning, high performance computing, pattern recognition, neural networks, data visualization,
information retrieval, image and signal processing, and spatial data analysis. We adopt a database perspective in our presentation of data mining in this book. That is, emphasis is placed on efficient and scalable data mining techniques for large databases. By performing data mining, interesting knowledge, regularities, or high-level information can be extracted from databases and viewed or browsed from different angles.</p>
<p> A classification of data mining systems </p>
<p> Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database systems, statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high performance computing. Depending on the kinds of data to be mined or on the given data mining application, the data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, business, bioinformatics, or psychology.</p>
<p> Because of the diversity of disciplines contributing to data mining, data mining research is expected to generate a large variety of data mining systems. Therefore, it is necessary to provide a clear classification of data mining systems. Such a classification may help potential users distinguish data mining systems and identify those that best match their needs. Data mining systems can be categorized according to various criteria, as follows. </p>
<p> 1) Classification according to the kinds of databases mined. </p>
<p> A data mining system can be classified according to the kinds of databases mined. Database systems themselves can be classified according to different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly. </p>
<p> For instance, if classifying according to data models, we may have a relational, transactional, object-oriented, object-relational, or data warehouse mining system. If classifying according to the special types of data handled, we may have a spatial, time-series, text, or multimedia data mining system, or a World-Wide Web mining system. Other system types include heterogeneous data mining systems and legacy data mining systems.</p>
<p> 2) Classification according to the kinds of knowledge mined.</p>
<p> Data mining systems can be categorized according to the kinds of
knowledge they mine, i.e., based on data mining functionalities, such as characterization, discrimination, association, classification, clustering, trend and evolution analysis, deviation analysis, similarity analysis, etc. A comprehensive data mining system usually provides multiple and/or integrated data mining functionalities. </p>
<p> Moreover, data mining systems can also be distinguished based on the granularity or levels of abstraction of the knowledge mined, including generalized knowledge (at a high level of abstraction), primitive-level knowledge (at a raw data level), or knowledge at multiple levels (considering several levels of abstraction). An advanced data mining system should facilitate the discovery of knowledge at multiple levels of abstraction.</p>
<p> 3) Classification according to the kinds of techniques utilized. </p>
<p> Data mining systems can also be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems), or the methods of data analysis employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization,
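pattern recognition, neural networks, and so on). A sophisticated data mining system often adopts multiple data mining techniques, or an effective, integrated technique that combines the merits of several approaches.</p>
<p><b> What is data mining?</b></p>
<p> Many people treat data mining as a synonym for another commonly used term, knowledge discovery in databases, or KDD. Others view data mining as merely an essential step in the process of knowledge discovery in databases. The knowledge discovery process consists of the following steps:</p>
<p> 1) Data cleaning: remove noise or inconsistent data,</p>
<p> 2) Data integration: multiple data sources may be combined,</p>
<p> 3) Data selection: retrieve from the database the data relevant to the analysis task,</p>
<p> 4) Data transformation: transform or consolidate the data into forms appropriate for mining, for example through summary or aggregation operations,</p>
<p> 5) Data mining: the essential step, in which intelligent methods are applied to extract data patterns,</p>
<p> 6) Pattern evaluation: identify the truly interesting patterns that represent knowledge, based on some interestingness measure,</p>
<p> 7) Knowledge presentation: use visualization and knowledge representation techniques to present the mined knowledge to the user.</p>
<p> The data mining step may interact with the user or a knowledge base. Interesting patterns are presented to the user or stored as new knowledge in the knowledge base. Note that according to this view, data mining is only one step in the entire process, albeit the most important one, because it uncovers hidden patterns.</p>
<p> We agree that data mining is one step in the knowledge discovery process. However, in industry, in the media, and in the database research community, “data mining” is more popular than the longer term “knowledge discovery in databases”. Therefore, in this book, we use the term data mining. We adopt a broad view of data mining: data mining is the process of discovering interesting knowledge from large amounts of data stored in databases or other information repositories.</p>
<p>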
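Based on this view, a typical data mining system has the following major components:</p>
<p> Database, data warehouse, or other information repository: this is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and integration can be performed on the data.</p>
<p> Database or data warehouse server: based on the user's data mining request, the database or data warehouse server fetches the relevant data.</p>
<p> Knowledge base: this is the domain knowledge used to guide the search or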
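evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge of user beliefs can also be included; it can be used to assess a pattern's interestingness based on its unexpectedness. Other examples of domain knowledge are interestingness constraints or thresholds, and metadata (for example, describing data from multiple heterogeneous sources).</p>
<p> Data mining engine: this is the essential part of the data mining system. It consists of a set of functional modules for characterization, association, classification, cluster analysis, and evolution and deviation analysis.</p>
<p> Pattern evaluation module: this component typically uses interestingness measures and interacts with the data mining modules to focus the search on interesting patterns. It may use interestingness thresholds to filter the discovered patterns. The pattern evaluation module can also be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is recommended to push pattern evaluation as deep as possible into the mining process, so that the search is confined to the interesting patterns.</p>
<p> From a data warehouse perspective, data mining can be viewed as an advanced stage of online analytical processing (OLAP). However, by incorporating more advanced techniques for data understanding, data mining goes further than the summarization-style analytical processing of data warehouses.</p>
<p> Although there are many “data mining systems” on the market, not all of them can perform true data mining. A data analysis system that cannot handle large amounts of data can at most be called a machine learning system, a statistical data analysis tool, or an experimental system prototype. A system that can only perform data or information retrieval, including finding aggregate values or answering deductive queries in large databases, should be classified as a database system, an information retrieval system, or a deductive database system.</p>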
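The pattern evaluation module described earlier filters discovered patterns against interestingness thresholds. The text does not name a specific measure, so the sketch below assumes the two most common ones for association rules, support and confidence. The function names, the rule representation, and the sample transactions are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base


def interesting_rules(transactions, candidate_rules, min_support, min_confidence):
    """Keep only the candidate rules that meet both thresholds."""
    kept = []
    for antecedent, consequent in candidate_rules:
        s = support(transactions, set(antecedent) | set(consequent))
        c = confidence(transactions, antecedent, consequent)
        if s >= min_support and c >= min_confidence:
            kept.append((antecedent, consequent, s, c))
    return kept


# Hypothetical market-basket transactions and candidate rules.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
]
candidates = [
    (("milk",), ("bread",)),
    (("eggs",), ("milk",)),
    (("bread",), ("eggs",)),
]

for lhs, rhs, s, c in interesting_rules(transactions, candidates, 0.5, 0.6):
    print(lhs, "->", rhs, f"support={s:.2f}", f"confidence={c:.2f}")
```

This version filters after mining. Pushing the evaluation into the mining process, as the text recommends, would mean discarding low-support item sets before any rules are generated from them.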
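<p> Data mining involves an integration of techniques from multiple disciplines, including database technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis. In our discussion of data mining in this book, we adopt a database perspective; that is, we emphasize efficient and scalable data mining techniques for large databases. An algorithm is scalable if, given the available system resources such as main memory and disk space, its running time grows linearly with the size of the database. Through data mining, interesting knowledge, regularities, or high-level information can be extracted from databases and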
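viewed or browsed from different angles. The discovered knowledge can be applied to decision making, process control, information management, query processing, and so on. Data mining is therefore regarded by the information industry as one of the most important frontiers of database systems and one of the most promising interdisciplinary developments in the information industry.</p>
<p> Data mining is an interdisciplinary field, influenced by several disciplines, including database systems, statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high-performance computing. Depending on the kinds of data mined or on the given data mining application, a data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, business, bioinformatics, or psychology.</p>
<p> Because data mining draws on many disciplines, data mining research has produced a large variety of data mining systems. A clear classification of data mining systems is therefore needed. Such a classification can help users distinguish data mining systems and identify those that best match their needs. Data mining systems can be categorized according to various criteria, as follows:</p>
<p> 1) Classification according to the kinds of databases mined.</p>
<p> A data mining system can be classified according to the kinds of databases mined. Database systems themselves can be classified according to different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly.</p>
<p> For instance, if classifying according to data models, we may have a relational, transactional, object-oriented, object-relational, or data warehouse mining system. If classifying according to the special types of data handled, we may have a spatial, time-series, text, or multimedia data mining system, or a World Wide Web mining system.</p>
<p> 2) Classification according to the kinds of knowledge mined.</p>
<p> Data mining systems can be classified according to the kinds of knowledge they mine, that is, according to data mining functionalities such as characterization, discrimination, association, classification, clustering, outlier analysis and evolution analysis, deviation analysis, and similarity analysis. A comprehensive data mining system should provide multiple and/or integrated data mining functionalities.</p>
<p> Moreover, data mining systems can be distinguished by the granularity or levels of abstraction of the knowledge mined, including generalized knowledge (at a high level of abstraction), primitive-level knowledge (at the raw data level), or knowledge at multiple levels (considering several levels of abstraction). An advanced data mining system should support knowledge discovery at multiple levels of abstraction.</p>
<p> Data mining systems can also be classified as those that mine data regularities (commonly occurring patterns) and those that mine data irregularities (such as exceptions or outliers). In general, concept description, association analysis, classification, prediction, and clustering mine data regularities and reject outliers as noise. These methods may also help detect outliers.</p>
<p> 3) Classification according to the techniques used.</p>
<p> Data mining systems can also be classified according to the data mining techniques they use. These techniques can be described according to the degree of user interaction involved (for example, autonomous systems, interactive exploratory systems, query-driven systems),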
or the data analysis methods employed (for example, database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on). A sophisticated data mining system often adopts multiple data mining techniques, or an effective, integrated technique that combines the merits of several approaches.</p>
<p> Data Mining and Data Publishing</p>
<p> Data mining is the extraction of interesting patterns or knowledge from huge amounts of data. The initial idea of privacy-preserving data mining (PPDM) was to extend traditional data mining techniques to work with data modified to mask sensitive information. The key issues were how to modify the data and how to recover the data mining result from the modified data. Privacy-preserving data mining considers the problem of running data mining algorithms on confidential data
that is not supposed to be revealed to the party running the algorithm.</p>
<p> Although data mining is potentially useful, many data holders are reluctant to provide their data for data mining for fear of violating individual privacy. In recent years, studies have sought to ensure that the sensitive information of individuals cannot be identified easily.</p>
<p> Anonymity models: k-anonymization techniques have been the focus of intense research in the last few years. To ensure the anonymization of data while minimizing the information loss resulting from data modifications, several extended models have been proposed, which are discussed as follows. </p>
<p> 1. k-Anonymity </p>
<p> k-anonymity is one of the most classic models. It is a technique that prevents joining attacks by generalizing and/or suppressing portions of the released microdata so
that no individual can be uniquely distinguished from a group of size k. A data set is k-anonymous (k ≥ 1) if each record in the data set is indistinguishable from at least (k - 1) other records within the same data set. The larger the value of k, the better the privacy is protected. k-anonymity can ensure that individuals cannot be uniquely identified by linking attacks.</p>
<p> 2. Extended Models </p>
<p> k-anonymity does not provide sufficient protection against attribute disclosure. The notion of l-diversity attempts to solve this problem by requiring that each equivalence class has at least l well-represented values for each sensitive attribute. l-diversity has some advantages over k-anonymity, because a k-anonymous data set permits strong attacks due to a lack of diversity in the sensitive attributes. In this model, an equivalence class is said to have l-diversity if there are at least l well-represented values for the sensitive attribute.</p>
<p> 3. Related Research Areas </p>
<p> Several polls show that the public has an increased sense of privacy loss. Since data mining is often a key component of information systems, homeland security systems, and monitoring and surveillance systems, it gives the wrong impression that data mining is a technique for privacy intrusion. This lack of trust has become an obstacle to benefiting from the technology. For example, the potentially beneficial data mining research project Terrorism Information Awareness (TIA) was terminated by the US Congress because of its controversial procedures for collecting, sharing, and analyzing the trails left by individuals. Privacy-preserving data publishing (PPDP) differs from PPDM in several major ways, as follows.</p>
<p> 1) PPDP focuses on techniques for publishing
data, not techniques for data mining. In fact, it is expected that standard data mining techniques are applied to the published data. In contrast, the data holder in PPDM needs to randomize the data in such a way that data mining results can be recovered from the randomized data. To do so, the data holder must understand the data mining tasks and algorithms involved. This level of involvement is not expected of the data holder in PPDP, who usually is not an expert in data mining.</p>
<p> 2) Randomization and encryption do not preserve the truthfulness of values at the record level; therefore, the released data are basically meaningless to the recipients. In such
a case, the data holder in PPDM may consider releasing the data mining results rather than the scrambled data. </p>
<p> 3) PPDP primarily “anonymizes” the data by hiding the identity of record owners, whereas PPDM seeks to directly hide the sensitive data. Excellent surveys and books on randomization and cryptographic techniques for PPDM can be found in the existing literature. A family of research work called privacy-preserving distributed data mining (PPDDM) aims at performing some data mining task on a set of private databases owned by different parties. It follows the principle of Secure Multiparty Computation (SMC) and prohibits any data sharing other than the final data mining result.</p>
<p> Some other works in statistical disclosure control (SDC) focus on the study of the non-interactive query model, in which the data recipients can submit one query to the system. This type of non-interactive query model may not fully address the information needs of data recipients because, in some cases, it is very difficult for a data recipient to accurately construct a query for a data mining task in one shot. Consequently, there are a series of studies on the interactive query model, in which the data recipients, including adversaries, can submit a sequence of queries based on previously received query results.</p>
<p> This paper presents a survey of the most common attack techniques against anonymization-based PPDM and PPDP and explains their effects on data privacy. k-anonymity protects respondents' identities and reduces linking attacks. Under a homogeneity attack, however, a simple k-anonymity model fails, and a concept that prevents this attack is needed; one solution is l-diversity. All tuples are arranged in a well-represented form, so the adversary is diverted to l places or to l sensitive attributes. l-diversity is limited in the case of a background knowledge attack, because no one can predict the knowledge level of the adversary.</p>
<p><b> Data Mining and Data Publishing</b></p>
<p> Data
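mining is the extraction of interesting patterns or knowledge from huge amounts of data. The initial idea of privacy-preserving data mining (PPDM) was to extend traditional data mining techniques to work with data modified to mask sensitive information. The key issues were how to modify the data and how to recover the data mining result from the modified data. Privacy-preserving data mining considers the problem of running data mining algorithms on confidential data that is not supposed to be revealed to the party running the algorithm. In contrast, privacy-preserving data publishing (PPDP) is not necessarily tied to a specific data mining task, and the data mining task may be unknown at the time of data publishing. PPDP studies how to transform raw data into a version that is immunized against privacy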
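attacks but that still supports effective data mining tasks. Privacy-preserving data mining (PPDM) and data publishing (PPDP) have become increasingly popular because they allow privacy-sensitive data to be shared for analysis purposes. One of the most intensively studied approaches is the k-anonymity model, which in turn led to other models such as confidence bounding, l-diversity, t-closeness, and (α,k)-anonymity. In particular, all known mechanisms try to minimize information loss, and such an attempt provides a loophole for attacks. The aim of this paper is to present a survey of the most common attack techniques against PPDM and PPDP and to explain their effects on data privacy.</p>
<p> Although data mining is potentially useful, many data holders are reluctant to provide their data for data mining for fear of violating individual privacy. In recent years, studies have sought to ensure that the sensitive information of individuals cannot be identified easily.</p>
<p> Anonymity models: k-anonymization techniques have been the focus of research in the last few years. To ensure the anonymization of data while minimizing the information loss caused by data modifications, several extended models have been proposed, which are discussed as follows.</p>
<p><b> 1. k-Anonymity</b></p>
<p> k-anonymity is one of the most classic models. It is a technique that prevents joining attacks by generalizing and/or suppressing portions of the released microdata so that no individual can be uniquely distinguished from a group of size k. A data set is k-anonymous (k ≥ 1) if each record in the data set is indistinguishable from at least (k - 1) other records within the same data set. The larger the value of k, the better the privacy is protected. k-anonymity can ensure that individuals cannot be uniquely identified by linking attacks.</p>
<p><b> 2. Extended Models</b></p>
<p> k-anonymity does not provide sufficient protection against attribute disclosure. The notion of l-diversity attempts to solve this problem by requiring that each equivalence class has at least l well-represented values for each sensitive attribute. l-diversity has some advantages over k-anonymity, because a k-anonymous data set permits strong attacks due to a lack of diversity in the sensitive attributes. In this model, an equivalence class is said to have l-diversity if there are at least l well-represented values for the sensitive attribute. Because there are semantic relationships among attribute values, and different values have different levels of sensitivity, after anonymization the frequency (as a fraction) of a sensitive value in any equivalence class is no more than α.</p>
<p><b> 3. Related Research Areas</b></p>
<p> Several polls show that the public has an increased sense of privacy loss. Since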
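data mining is often a key component of information systems, homeland security systems, and monitoring and surveillance systems, it gives the wrong impression that data mining is a technique for privacy intrusion. This lack of trust has become an obstacle to benefiting from the technology. For example, the potentially beneficial data mining research project Terrorism Information Awareness (TIA) was terminated by the US Congress because of its controversial procedures for collecting, sharing, and analyzing the trails left by individuals. Motivated by the privacy concerns about data mining tools, a research area called privacy-preserving data mining (PPDM) emerged in 2000. The initial idea of PPDM was to extend traditional data mining techniques to work with data modified to mask sensitive information. The key issues were how to modify the data and how to recover the data mining result from the modified data. The solutions were often tightly coupled with the data mining algorithms under consideration. In contrast, privacy-preserving data publishing (PPDP) is not necessarily tied to a specific data mining task, and the data mining task is sometimes unknown at the time of data publishing. Furthermore, some PPDP solutions emphasize preserving data truthfulness at the record level, but PPDM solutions often do not preserve such a property. PPDP differs from PPDM in several major ways, as follows:</p>
<p> 1) PPDP focuses on techniques for publishing data, not techniques for data mining. In fact, it is expected that standard data mining techniques are applied to the published data. In contrast, the data holder in PPDM needs to randomize the data in such a way that data mining results can be recovered from the randomized data. To do so, the data holder must understand the data mining tasks and algorithms involved. This level of involvement is not expected of the data holder in PPDP, who usually is not an expert in data mining.</p>
<p> 2) Randomization and encryption do not preserve the truthfulness of values at the record level; therefore, the released data are basically meaningless to the recipients. In such a case, the data holder in PPDM may consider releasing the data mining results rather than the scrambled data.</p>
<p> 3) PPDP primarily “anonymizes” the data by hiding the identity of record owners, whereas PPDM seeks to directly hide the sensitive data. Excellent surveys and books on randomization and cryptographic techniques for PPDM can be found in the existing literature. A family of research work called privacy-preserving distributed data mining (PPDDM) aims at performing some data mining task on a set of private databases owned by different parties. It follows the principle of Secure Multiparty Computation (SMC) and prohibits any data sharing other than the final data mining result. Clifton et al. present a suite of SMC operations, such as secure sum, secure set union, secure size of set intersection, and scalar product, that are useful for many data mining tasks. In contrast, PPDP does not perform the actual data mining task but is concerned with how to publish the data so that the anonymous data are useful for data mining. We can say that PPDP protects privacy at the data level, while PPDDM protects privacy at the process level. They address different privacy models and data mining scenarios. In the field of statistical disclosure control (SDC), research focuses on privacy-preserving methods for publishing statistical tables. SDC is concerned with three types of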
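disclosure, namely identity disclosure, attribute disclosure, and inferential disclosure. Identity disclosure occurs if an adversary can identify a respondent from the published data. Revealing that an individual is a respondent of a data collection may or may not violate confidentiality requirements. Attribute disclosure occurs when confidential information about a respondent is revealed and can be attributed to the respondent.</p>
<p> Some other works in SDC focus on the study of the non-interactive query model, in which the data recipients can submit one query to the system. This type of non-interactive query model may not fully address the information needs of data recipients because, in some cases, it is very difficult for a data recipient to accurately construct a query for a data mining task in one shot. Consequently, there are a series of studies on the interactive query model, in which the data recipients, including adversaries, can submit a sequence of queries based on previously received query results. The database server is responsible for keeping track of all queries of each user and determining whether the currently received query violates the privacy requirement with respect to all previous queries. One limitation of any interactive privacy-preserving query system is that it can only answer a sublinear number of queries in total; otherwise, an adversary (or a group of corrupted data recipients) will be able to reconstruct the original data, which is a very strong violation of privacy. When the maximum number of queries is reached, the query service must be closed to avoid privacy leaks. In the case of the non-interactive query model, the adversary can issue only one query, and therefore the non-interactive query model cannot achieve the same degree of privacy defined by the interactive model. One may consider privacy-preserving data publishing to be a special case of the non-interactive query model.</p>
<p> This paper presents a survey of the most common attack techniques against PPDM and PPDP and explains their effects on data privacy. k-anonymity protects respondents' identities and reduces linking attacks. Under a homogeneity attack, however, a simple k-anonymity model fails, and a concept that prevents this attack is needed; l-diversity is one solution. All tuples are arranged in a well-represented form, so the adversary is diverted to l places or to l sensitive attributes. l-diversity is limited in the case of a background knowledge attack, because no one can predict the adversary's level of knowledge. It is observed that when generalization and suppression are used, these techniques are also applied to attributes that do not need this degree of privacy, which reduces the precision of the published table. e-NSTAM (extended Sensitive Tuples Anonymity Method) is applied to sensitive tuples only and reduces information loss, but this method also fails in the case of multiple sensitive tuples. Generalization with suppression is also a cause of data loss, because suppression emphasizes not releasing values that are not suitable for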
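The k-anonymity and l-diversity conditions discussed above can be checked mechanically on a released table. The sketch below is an illustration only: the column names, the sample rows, and the helper names are hypothetical, and "well-represented" is read in its simplest form, as at least l distinct sensitive values per equivalence class, which is an assumption rather than the only definition in use.

```python
from collections import defaultdict


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].append(row)
    return groups


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every equivalence class contains at least k records."""
    groups = equivalence_classes(rows, quasi_identifiers)
    return all(len(group) >= k for group in groups.values())


def is_l_diverse(rows, quasi_identifiers, sensitive, l):
    """True if every class has at least l distinct sensitive values."""
    groups = equivalence_classes(rows, quasi_identifiers)
    return all(
        len({row[sensitive] for row in group}) >= l
        for group in groups.values()
    )


# Hypothetical released table, already generalized on zip code and age.
released = [
    {"zip": "130**", "age": "<30", "disease": "flu"},
    {"zip": "130**", "age": "<30", "disease": "flu"},
    {"zip": "148**", "age": ">=40", "disease": "cancer"},
    {"zip": "148**", "age": ">=40", "disease": "flu"},
]

print(is_k_anonymous(released, ["zip", "age"], 2))
print(is_l_diverse(released, ["zip", "age"], "disease", 2))
```

The sample table is 2-anonymous but not 2-diverse: every record in the first class has the same disease, so an adversary who can place a person in that class learns the sensitive value without identifying the record. This is the homogeneity attack mentioned in the text.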