10 Challenging Problems in Data Mining Research

Slide 1: Data Mining: Opportunities and Challenges
Xindong Wu
University of Vermont, USA; Hefei University of Technology, China (Changjiang Scholar Chair Professor of Computer Applications, Hefei University of Technology)

Slide 2: Deduction Induction: My Research Background

Slide 3: Outline
- Data Mining Opportunities
- Major Conferences and Journals in Data Mining
- Main Topics in Data Mining
- Some Research Directions in Data Mining
- 10 Challenging Problems in Data Mining Research

Slide 4: What Is Data Mining?
- The discovery of knowledge (in the form of rules, trees, frequent patterns, etc.) from large volumes of data
- A hot field: 15 "data mining" conferences in 2003, including KDD, ICDM, SDM, IDA, PKDD and PAKDD, excluding IJCAI, COMPSTAT, SIGMOD and other more general conferences that also publish data mining papers.

Slide 5: Main Activities in Data Mining: Conferences
- The birth of data mining/KDD: 1989 IJCAI Workshop on Knowledge Discovery in Databases
- 1991-1994: Workshops on Knowledge Discovery in Databases
- 1995 – date: International Conferences on Knowledge Discovery in Databases and Data Mining (KDD)
- 2001 – date: IEEE ICDM and SIAM-DM (SDM)
- Several regional conferences, incl. PAKDD (since 1997) & PKDD (since 1997).

Slide 6: Data Mining: Major Journals
- Data Mining and Knowledge Discovery (DMKD, since 1997)
- Knowledge and Information Systems (KAIS, since 1999)
- IEEE Transactions on Knowledge and Data Engineering (TKDE)
- Many others, incl. TPAMI, ML, IDA, …

Slide 7: ACM KDD vs. IEEE ICDM

Slide 8: Main Topics in Data Mining
- Association analysis (frequent patterns)
- Classification (trees, Bayesian methods, etc.)
- Clustering and outlier analysis
- Sequential and spatial patterns, and time-series analysis
- Text and Web mining
- Data visualization and visual data mining.

Slide 9: Some Research Directions
- Web mining (incl. Web structures, usage analysis, authoritative pages, and document classification)
- Intelligent data analysis in Bioinformatics
- Mining with data streams (in continuous, real-time, dynamic data environments)
- Integrated, intelligent data mining environments and tools (incl. induction, deduction, and heuristic computation).

Slide 10: Outline
- Data Mining Opportunities
- Major Conferences and Journals in Data Mining
- Main Topics in Data Mining
- Some Research Directions in Data Mining
- 10 Challenging Problems in Data Mining Research

Slide 11: 10 Challenging Problems in Data Mining Research
- Joint Efforts with Qiang Yang (Hong Kong Univ. of Sci. & Tech.)
- With Contributions from ICDM & KDD Organizers
- Xindong Wu (University of Vermont, USA; Hefei University of Technology, China)

Slide 12: Why "Most Challenging Problems"?
- What are the 10 most challenging problems in data mining, today?
- Different people have different views, a function of time as well
- What do the experts think?
- Experts we consulted: previous organizers of IEEE ICDM and ACM KDD
- We asked them to list their 10 problems (requests sent out in Oct 05, and replies obtained in Nov 05)
- Replies edited into an article: hopefully useful for young researchers
- Not in any particular order of importance

Slide 13: 1. Developing a Unifying Theory of Data Mining
- The current state of the art of data-mining research is too "ad-hoc"
  - techniques are designed for individual problems
  - no unifying theory
- Needs unifying research
- Exploration vs explanation
- Long standing theoretical issues
- How to avoid spurious correlations?
- Deep research.
  - Knowledge discovery on hidden causes?
  - Similar to discovery of Newton's Law?
An Example (from Tutorial Slides by Andrew Moore):
- VC dimension. If you've got a learning algorithm in one hand and a dataset in the other hand, to what extent can you decide whether the learning algorithm is in danger of overfitting or underfitting?
  - formal analysis into the fascinating question of how overfitting can happen
  - estimating how well an algorithm will perform on future data, based solely on its training set error and a property (VC dimension) of the learning algorithm
- VC-dimension thus gives an alternative to cross-validation, called Structural Risk Minimization (SRM), for choosing classifiers.
- CV, SRM, AIC and BIC.

Slide 14: 2. Scaling Up for High Dimensional Data and High Speed Streams
- Scaling up is needed
  - ultra-high dimensional classification problems (millions or billions of features, e.g., bio data)
  - Ultra-high speed data streams
- Streams.
  - continuous, online process
  - e.g. how to monitor network packets for intruders?
  - concept drift and environment drift?
  - RFID network and sensor network data
Excerpt from Jian Pei's Tutorial: http://www.cs.sfu.ca/~jpei/

Slide 15: 3. Sequential and Time Series Data
- How to efficiently and accurately cluster, classify and predict the trends?
- Time series data used for predictions are contaminated by noise.
  - How to do accurate short-term and long-term predictions?
  - Signal processing techniques introduce lags in the filtered data, which reduces accuracy
  - Key in source selection, domain knowledge in rules, and optimization methods
Figure: Real time series data obtained from wireless sensors in the Hong Kong UST CS department hallway

Slide 16: 4. Mining Complex Knowledge from Complex Data
- Mining graphs
- Data that are not i.i.d. (independent and identically distributed)
  - many objects are not independent of each other, and are not of a single type
  - mine the rich structure of relations among objects
  - e.g.: interlinked Web pages, social networks, metabolic networks in the cell

- Integration of data mining and knowledge inference
  - The biggest gap: mining systems are unable to relate the results of mining to the real-world decisions they affect; all they can do is hand the results back to the user
- More research on interestingness of knowledge.
Figure labels: Citation (Paper 2), Author (Paper 1), Title, Conference Name

Slide 17: 5. Data Mining in a Network Setting
- Community and Social Networks
  - Linked data between emails, Web pages, blogs, citations, sequences and people
  - Static and dynamic structural behavior
- Mining in and for Computer Networks.
  - detect anomalies (e.g., sudden traffic spikes due to a DoS (Denial of Service) attack)
  - Need to handle 10Gig Ethernet links: (a) detect, (b) trace back, (c) drop packet
Figures: Picture from Matthew Pirretti's slides, Penn State; An Example of packet streams (data courtesy of NCSA, UIUC)

Slide 18: 6. Distributed Data Mining and Mining Multi-agent Data
- Need to correlate the data seen at the various probes (such as in a sensor network)
- Adversary data mining: adversaries deliberately manipulate the data to sabotage the miners (e.g., make them produce false negatives)
- Game theory may be needed for help.

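The adversarial setting can be read as a two-player zero-sum game between the miner and an adversary. The sketch below is not taken from the slides: it assumes a matching-pennies payoff (the miner gains 1 when both players pick the same action, H or T, and loses 1 otherwise), and the names `MINER_PAYOFF`, `expected_miner_payoff`, `worst_case_payoff` and `best_mixed_strategy` are hypothetical. It searches for the miner's mixed strategy with the best guaranteed expected payoff.

```python
from fractions import Fraction

# Payoff to Player 1 (the miner); Player 2 (the adversary) receives the negative.
# ASSUMPTION: the miner wins (+1) when both actions match and loses (-1) otherwise.
MINER_PAYOFF = {
    ("H", "H"): 1, ("H", "T"): -1,
    ("T", "H"): -1, ("T", "T"): 1,
}


def expected_miner_payoff(p_miner_h, p_adv_h):
    """Expected miner payoff when each player plays H with the given probability."""
    probs = {
        ("H", "H"): p_miner_h * p_adv_h,
        ("H", "T"): p_miner_h * (1 - p_adv_h),
        ("T", "H"): (1 - p_miner_h) * p_adv_h,
        ("T", "T"): (1 - p_miner_h) * (1 - p_adv_h),
    }
    return sum(probs[actions] * MINER_PAYOFF[actions] for actions in MINER_PAYOFF)


def worst_case_payoff(p_miner_h):
    """Miner's guaranteed payoff against a best-responding adversary.

    The expected payoff is linear in the adversary's mixing probability,
    so checking the adversary's two pure actions is sufficient.
    """
    return min(expected_miner_payoff(p_miner_h, q) for q in (Fraction(0), Fraction(1)))


def best_mixed_strategy(steps=100):
    """Grid-search the miner's probability of playing H that maximizes the worst case."""
    grid = [Fraction(i, steps) for i in range(steps + 1)]
    return max(grid, key=worst_case_payoff)


if __name__ == "__main__":
    p = best_mixed_strategy()
    print("miner plays H with probability", p)
    print("guaranteed expected payoff", worst_case_payoff(p))
```

Under these assumed payoffs, any predictable (pure) strategy can be exploited by the adversary, while mixing evenly between H and T cannot. Exact fractions are used so the grid search is not affected by floating-point rounding.
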
Figure (Games): game tree with Player 1 (miner) and Player 2, actions H and T, and outcomes (-1,1), (-1,1), (1,-1), (1,-1)

Slide 19: 7. Data Mining for Biological and Environmental Problems
- New problems raise new questions
- Large scale problems especially so
  - Biological data mining, such as HIV vaccine design
  - DNA, chemical properties, 3D structures, and functional properties need to be fused
  - Environmental data mining
  - Mining for solving the energy crisis.

Slide 20: 8. Data-mining-Process Related Problems
- How to automate mining process?
  - the composition of data mining operations
  - Data cleaning, with logging capabilities
  - Visualization and mining automation.
- Need a methodology: help users avoid many data mining mistakes
  - What is a canonical set of data mining operations?
Figure labels: Sampling, Feature Sel, Mining…

Slide 21: 9. Security, Privacy and Data Integrity
- How to ensure the users' privacy while their data are being mined?
- How to do data mining for protection of security and privacy?
- Knowledge integrity assessment.
  - Data are intentionally modified from their original version, in order to misinform the recipients or for privacy and security
  - Development of measures to evaluate the knowledge integrity of a collection of
    - Data
    - Knowledge and patterns
http://www.cdt.org/privacy/
Headlines (Nov 21 2005): Senate Panel Approves Data Security Bill - The Senate Judiciary Committee on Thursday passed legislation designed to protect consumers against data security failures by, among other things, requiring companies to notify consumers when their personal information has been compromised. While several other committees in both the House and Senate have their own versions of data security legislation, S. 1789 breaks new ground by including provisions permitting consumers to access their personal files …

Slide 22: 10. Dealing with Non-static, Unbalanced and Cost-sensitive Data
- The UCI datasets are small and not highly unbalanced
- Real world data are large (10^5 features), but fewer than 1% of the examples belong to the useful (positive) classes
- There is much information on costs and benefits, but no overall model of profit and loss
- Data may evolve with a bias introduced by sampling.
Figure labels: Each test incurs a cost; Data extremely unbalanced; Data change with time

Slide 23: 10 Challenging Problems: Summary
1. Developing a Unifying Theory of Data Mining
2. Scaling Up for High Dimensional Data/High Speed Streams
3. Mining Sequence Data and Time Series Data
4. Mining Complex Knowledge from Complex Data
5. Data Mining in a Network Setting
6. Distributed Data Mining and Mining Multi-agent Data
7. Data Mining for Biological and Environmental Problems
8. Data-Mining-Process Related Problems
9. Security, Privacy and Data Integrity
10. Dealing with Non-static, Unbalanced and Cost-sensitive Data

Slide 24: Contributors
Pedro Domingos, Charles Elkan, Johannes Gehrke, Jiawei Han, David Heckerman, Daniel Keim, Jiming Liu, David Madigan, Gregory Piatetsky-Shapiro, Vijay V. Raghav
