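<p><b>What is data mining?</b></p>
<p>Simply put, data mining means extracting or “mining” knowledge from large amounts of data. The term is actually something of a misnomer. Note that mining gold from rock or sand is called gold mining, not rock or sand mining. Data mining would therefore be more accurately named “knowledge mining from data”, which is unfortunately rather long. “Knowledge mining” is a shorter term, but it may not convey the idea of mining from large amounts of data. After all, mining is a vivid term that captures the process of finding a few nuggets in a large mass of raw material. So the misnomer, which carries both “data” and “mining”, became the popular choice. Several other terms have a similar but slightly different meaning, such as knowledge mining from databases, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.</p>
<p>Many people treat data mining as a synonym for another commonly used term, knowledge discovery in databases, or KDD. Others view data mining as just one essential step in the process of knowledge discovery in databases. The knowledge discovery process consists of the following steps:</p>
<p>1) Data cleaning: noise and inconsistent data are removed,</p>
<p>2) Data integration: multiple data sources may be combined,</p>
<p>3) Data selection: data relevant to the analysis task are retrieved from the database,</p>
<p>4) Data transformation: data are transformed or consolidated into forms appropriate for mining, for example by summary or aggregation operations,</p>
<p>5) Data mining: the essential step, in which intelligent methods are applied to extract data patterns,</p>
<p>6) Pattern evaluation: the truly interesting patterns representing knowledge are identified, based on some interestingness measure,</p>
<p>7) Knowledge presentation: visualization and knowledge representation techniques are used to present the mined knowledge to the user.</p>
<p>The data mining step may interact with the user or a knowledge base. Interesting patterns are presented to the user or stored as new knowledge in the knowledge base. Note that, in this view, data mining is only one step in the overall process, albeit the most important one, because it uncovers hidden patterns.</p>
<p>We agree that data mining is a step in the knowledge discovery process. However, in industry, in the media, and in the database research community, the term “data mining” is more popular than the longer “knowledge discovery in databases”. In this book we therefore use the term data mining. We adopt a broad view of data mining: it is the process of discovering interesting knowledge from large amounts of data stored in databases or other information repositories.</p>
<p>Based on this view, a typical data mining system has the following major components:</p>
<p>Database, data warehouse, or other information repository: one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and integration may be performed on the data.</p>
<p>Database or data warehouse server: responsible for fetching the relevant data, based on the user’s data mining request.</p>
<p>Knowledge base: the domain knowledge used to guide the search or to evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, which organize attributes or attribute values into different levels of abstraction. Knowledge about user beliefs may also be included; it can be used to assess a pattern’s interestingness based on its unexpectedness. Other examples of domain knowledge are interestingness constraints or thresholds, and metadata (for example, describing data from multiple heterogeneous sources).</p>
<p>Data mining engine: the essential part of the data mining system, consisting of a set of functional modules for characterization, association, classification, cluster analysis, and evolution and deviation analysis.</p>
<p>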
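Pattern evaluation module: this component typically uses interestingness measures and interacts with the data mining modules to focus the search on interesting patterns. It may use interestingness thresholds to filter the discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on how the data mining method is implemented. For efficient data mining, it is recommended to push pattern evaluation as deep as possible into the mining process, so that the search is confined to interesting patterns only.</p>
<p>Graphical user interface: this module communicates between users and the data mining system. It lets the user interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on intermediate results. It also lets the user browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.</p>
<p>From a data warehouse perspective, data mining can be viewed as an advanced stage of online analytical processing (OLAP). However, by incorporating more advanced techniques for data understanding, data mining goes well beyond the summarization-style analytical processing of data warehouses.</p>
<p>Although many “data mining systems” are on the market, not all of them can perform true data mining. A data analysis system that cannot handle large amounts of data is at most a machine learning system, a statistical data analysis tool, or an experimental system prototype. A system that can only perform data or information retrieval, including finding aggregate values or answering deductive queries in large databases, should be classified as a database system, an information retrieval system, or a deductive database system.</p>
<p>Data mining involves the integration of techniques from multiple disciplines, including database technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis. In discussing data mining in this book, we adopt a database perspective; that is, we emphasize efficient and scalable data mining techniques for large databases. An algorithm is scalable if, given the available system resources such as main memory and disk space, its running time grows linearly with the size of the database. Through data mining, interesting knowledge, regularities, or high-level information can be extracted from databases and viewed or browsed from different angles. The discovered knowledge can be applied to decision making, process control, information management, query processing, and so on. Data mining is therefore regarded by the information industry as one of the most important frontiers in database systems and one of the most promising interdisciplinary developments in the information industry.</p>
<p>Data mining is an interdisciplinary field, shaped by several disciplines including database systems, statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high-performance computing. Depending on the kinds of data to be mined or on the given data mining application, a data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, business, bioinformatics, or psychology.</p>
<p>Because data mining draws on many disciplines, data mining research has produced a large variety of data mining systems. A clear classification of these systems is therefore needed. Such a classification can help users distinguish data mining systems and identify the ones that best match their needs. Data mining systems can be classified according to various criteria, as follows:</p>
<p>1) Classification according to the kinds of databases mined.</p>
<p>A data mining system can be classified according to the kinds of databases mined. Database systems themselves can be classified by different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique. Data mining systems can be classified accordingly.</p>
<p>For example, classifying by data model gives relational, transactional, object-oriented, object-relational, or data warehouse mining systems. Classifying by the special types of data handled gives spatial, time-series, text, or multimedia data mining systems, or World Wide Web mining systems.</p>
<p>2) Classification according to the kinds of knowledge mined.</p>
<p>Data mining systems can be classified according to the kinds of knowledge they mine, that is, by data mining functionality, such as characterization, discrimination, association, classification, clustering, outlier analysis, evolution analysis, deviation analysis, and similarity analysis. A comprehensive data mining system should provide multiple and/or integrated data mining functionalities.</p>
<p>Data mining systems can also be distinguished by the granularity or level of abstraction of the knowledge mined, including generalized knowledge (at a high level of abstraction), primitive-level knowledge (at the raw data level), or knowledge at multiple levels (considering several levels of abstraction). An advanced data mining system should support knowledge discovery at multiple levels of abstraction.</p>
<p>Data mining systems can further be classified as those that mine data regularities (commonly occurring patterns) and those that mine data irregularities (such as exceptions or outliers). In general, concept description, association analysis, classification, prediction, and clustering mine data regularities and reject outliers as noise. These methods may also help detect outliers.</p>
<p>3) Classification according to the kinds of techniques utilized.</p>
<p>Data mining systems can also be classified by the underlying data mining techniques employed. These techniques can be described by the degree of user interaction involved (for example, autonomous systems, interactive exploratory systems, query-driven systems) or by the methods of data analysis employed (for example, database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on). A sophisticated data mining system often adopts multiple data mining techniques, or an effective, integrated technique that combines the merits of several individual approaches.</p>
<p>What is Data Mining?</p>
<p>Simply stated, data mining refers to extracting or “mining” knowledge from large amounts of data. The term is actually a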
misnomer. Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, “data mining” should have been more appropriately named “knowledge mining from data”, which is unfortunately somewhat long. “Knowledge mining”, a shorter term, may not reflect the emphasis on mining from large amounts of data. Nevertheless, mining</p>
<p>Many people treat data mining as a synonym for another popularly used term, “Knowledge Discovery in Databases”, or KDD. Alternatively, others view data mining as simply an essential step in the process of knowledge discovery in databases. Knowledge discovery consists of an iterative sequence of the following steps:</p>
<p>· data cleaning: to remove noise or irrelevant data,</p>
<p>· data integration: where multiple data sources may be combined,</p>
<p>· data selection: where data relevant to the analysis task are retrieved from the database,</p>
<p>· data transformation: where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance,</p>
<p>· data mining: an essential process where intelligent methods are applied in order to extract data patterns,</p>
<p>· pattern evaluation: to identify the truly interesting patterns representing knowledge based on some interestingness measures, and</p>
<p>· knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to the user.</p>
<p>The data mining step may interact with the user or a knowledge base.
The interesting patterns are presented to the user, and may be stored as new knowledge in the knowledge base. Note that according to this view, data mining is only one step in the entire process, albeit an essential one since it uncovers hidden patterns for evaluation.</p>
<p>We agree that data mining is a step in the knowledge discovery process. However, in industry, in media, and in the database research milieu, the term “data mining” is becoming more popular than the longer term “knowledge discovery in databases”. Therefore, in this book, we choose to use the term “data mining”. We adopt a broad view of data mining functionality: data mining is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or other</p>
<p>Based on this view, the architecture of a typical data mining system may have the following major components:</p>
<p>1. Database, data warehouse, or other information repository. This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.</p>
<p>2. Database or data warehouse server. The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.</p>
<p>3. Knowledge base. This is the domain knowledge that is used to guide the search, or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may also be included. Other examples of domain knowledge are additional interestingness constraints or threshold</p>
<p>4. Data mining engine. This is essential to the
data mining system and ideally consists of a set of functional modules for tasks such as characterization, association analysis, classification, evolution and deviation analysis.</p>
<p>5. Pattern evaluation module. This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns. It may access interestingness thresholds stored in the knowledge base. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern inter</p>
<p>6. Graphical user interface. This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.</p>
<p>From a data warehouse perspective, data mining can be viewed as an advanced stage of on-line analytical processing (OLAP). However, data mining goes far beyond the narrow scope of summarization-style analytical processing of data warehouse systems by incorporating more advanced techniques for data understanding.</p>
<p>While there may be many “data mining systems” on the market, not all of them can perform true data mining. A data analysis system that does not handle large amounts of data can at most be categorized as a machine learning system, a statistical data analysis tool, or an experimental system prototype. A system that can only perform data or information retrieval, including finding aggregate values, or that performs deductive query answering in large databases should be more appropriately categorize</p>
<p>Data mining involves an integration of techniques from multiple disciplines such as database
technology, statistics, machine learning, high performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis. We adopt a database perspective in our presentation of data mining in this book. That is, emphasis is placed on efficient and scalable data mining techniques for large databases. By performing data mi</p>
<p>A classification of data mining systems</p>
<p>Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database systems, statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high performance computing. Depending on the kinds of data to be mined or on the given data mining ap</p>
<p>Because of the diversity of disciplines contributing to data mining, data mining research is expected to generate a large variety of data mining systems. Therefore, it is necessary to provide a clear classification of data mining systems. Such a classification may help potential users distinguish data mining systems and identify those that best match their needs. Data mining systems can be categorized according to various criteria, as follows.</p>
<p>1) Classification according to the kinds of databases mined.</p>
<p>A data mining system can be classified according to the kinds of databases mined. Database systems themselves can be classified according to different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly.</p>
<p>For instance, if classifying according
to data models, we may have a relational, transactional, object-oriented, object-relational, or data warehouse mining system. If classifying according to the special types of data handled, we may have a spatial, time-series, text, or multimedia data mining system, or a World-Wide Web mining system. Other system types include heterogeneous data mining systems, and legacy data mining systems.</p>
<p>2) Classification according to the kinds of knowledge mined.</p>
<p>Data mining systems can be categorized according to the kinds of knowledge they mine, i.e., based on data mining functionalities, such as characterization, discrimination, association, classification, clustering, trend and evolution analysis, deviation analysis, similarity analysis, etc. A comprehensive data mining system usually provides multiple and/or integrated data mining functionalities.</p>
<p>Moreover, data mining systems can also be distinguished based on the granularity or levels of abstraction of the knowledge mined, including generalized knowledge (at a high level of abstraction), primitive-level knowledge (at a raw data level), or knowledge at multiple levels (considering several levels of abstraction). An advanced data mining system should facilitate the discovery of knowledge at multiple levels of abstraction.</p>
<p>3) Classification according to the kinds of techniques utilized.</p>
<p>Data mining systems can also be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems), or the methods of da</p>
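<p>As a rough illustration of the knowledge discovery steps listed earlier (data cleaning, integration, selection, transformation, mining, pattern evaluation, and presentation), the sketch below chains one small function per step. Everything in it is an assumption made for illustration and does not come from the text: the toy records, the function names, the choice of item-pair counting as the "intelligent method", and the use of support with a minimum threshold as the interestingness measure.</p>

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy sources: each record is (transaction_id, item).
SOURCE_A = [(1, "milk"), (1, "bread"), (2, "milk"), (2, "bread"),
            (2, None), (2, "bag")]
SOURCE_B = [(3, "Milk "), (3, "eggs"), (4, "bread"), (4, "milk"), (4, "eggs")]


def clean(records):
    """Data cleaning: drop missing items and normalize inconsistent spellings."""
    return [(tid, item.strip().lower())
            for tid, item in records if item and item.strip()]


def integrate(*sources):
    """Data integration: combine several sources into one record list."""
    return [record for source in sources for record in source]


def select(records, relevant_items):
    """Data selection: keep only records relevant to the analysis task."""
    return [(tid, item) for tid, item in records if item in relevant_items]


def transform(records):
    """Data transformation: consolidate records into one item set per transaction."""
    baskets = {}
    for tid, item in records:
        baskets.setdefault(tid, set()).add(item)
    return list(baskets.values())


def mine(baskets):
    """Data mining: count how many baskets contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts


def evaluate(pair_counts, n_baskets, min_support):
    """Pattern evaluation: keep pairs whose support meets the threshold."""
    return {pair: count / n_baskets
            for pair, count in pair_counts.items()
            if count / n_baskets >= min_support}


def present(patterns):
    """Knowledge presentation: readable lines, highest support first."""
    ranked = sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{a} & {b}: support={s:.2f}" for (a, b), s in ranked]


def run_pipeline(min_support=0.5):
    records = integrate(clean(SOURCE_A), clean(SOURCE_B))
    records = select(records, {"milk", "bread", "eggs"})
    baskets = transform(records)
    patterns = evaluate(mine(baskets), len(baskets), min_support)
    return present(patterns)


if __name__ == "__main__":
    for line in run_pipeline():
        print(line)
```

<p>Here the threshold is applied only after all pairs have been counted. The text's recommendation to push pattern evaluation deeper into the mining process would correspond to discarding low-support candidates while counting, which this sketch does not attempt.</p>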