Curs 1 Data Mining

  • date post

    26-Jun-2015
  • Category

    Education

Transcript of Curs 1 Data Mining

  • 1. Introduction to Data Mining. Course 1: General overview. Lucian Sasu, Ph.D., Universitatea Transilvania din Brașov, Facultatea de Matematică și Informatică, April 7, 2014. lucian.sasu@ieee.org (UNITBV), Curs 1, April 7, 2014, 1 / 42

2. Outline
  1. Recommended bibliography: course bibliography; lab bibliography
  2. Data Mining, an introduction: definitions, examples, and motivation; Data Mining and Knowledge Discovery; points of difficulty; origins of DM; types of DM applications

3-6. Course bibliography
  1. Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Data Mining, Addison-Wesley, 2006
  2. David J. Hand, Heikki Mannila, Padhraic Smyth: Principles of Data Mining, MIT Press, 2001
  3. Jiawei Han, Micheline Kamber, Jian Pei: Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann Publishers, 2011
  4. Trevor Hastie, Robert Tibshirani, Jerome Friedman: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer, 2009 (free to download)

7-10.
Lab bibliography
  1. RapidMiner: http://rapidminerresources.com
  2. RapidMiner: http://rapidminer.com/learning/getting-started/
  3. Weka: Ian H. Witten, Eibe Frank: Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann, 2005
  4. Weka: Weka course at https://weka.waikato.ac.nz/dataminingwithweka/course

11-12. Lab tools (1)
Weka: Data Mining Software in Java (download here). Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
Cross-platform software developed in Java; it can be used from the GUI or through its exposed API, and it can also be called from .NET via ikvm.net.

13. Lab tools (2)
RapidMiner Community Edition: The main product of Rapid-I, the data analysis solution RapidMiner, is the world-leading open-source system for data and text mining.
Features: Data Integration, Analytical ETL, Data Analysis, and Reporting; graphical user interface for the design of analysis processes; repositories for process, data and meta data handling; hundreds of data loading, data transformation, data modeling, and data visualization methods [. . . ]
Other widely used software, not covered in the lab:
http://www.kdnuggets.com/software/index.html
http://www.kdnuggets.com/polls/2010/data-mining-analytics-tools.html
http://www-users.cs.umn.edu/~kumar/dmbook/resources.htm

15. Definitions
Definition: Data Mining is the process of (semi)automatically discovering useful information in large data repositories (Tan et al).
Definition: Data Mining is the analysis of data sets, often large ones, obtained through observation, in order to find new relationships and to summarize the data in ways that are both easy to understand and useful to the data owner (Hand et al).
Definition: Data mining is the nontrivial process of extracting implicit, previously unknown, interesting, and potentially useful information from data, usually in the form of models and knowledge patterns (Schapiro et al).

16.
Alternative terms: knowledge mining from data; knowledge extraction (Knowledge Discovery), a debatable synonym; data/pattern analysis.
What Data Mining is NOT: retrieving the complete record of a person by querying a database; finding the web pages that contain given terms. These are information retrieval activities.

17. What Data Mining can do: discover that certain names are more frequent in some areas (O'Brien, O'Rurke, O'Reilly in the Boston area); group customers by a shared consumption profile; group the pages returned by a search engine by similarity, as the http://clusty.com/ engine does.

18. Web page clustering in Clusty

19. Farecast: should I buy a plane ticket now or not?

20. Why Data Mining: the business perspective (1)
Large amounts of data are collected and stored in data warehouse systems: Web and e-commerce data; purchases in stores and retail chains; financial transactions and debit/credit card data.
Computers have become ever cheaper and more powerful; distributed processing is commonplace.

21. Why Data Mining: the business perspective (2)
Competitive pressure is a strong motivator: acquiring a new customer for a telephone network costs up to 4 times more than retaining an existing one (customer attrition).
Business-specific requirements: customer profiling, targeted marketing, fraud detection.
Pressing questions: Who are the most profitable customers? Which purchased products drive the purchase of other products? How will the company/market evolve in the . . . segment? What are the market niches?

22.
Why Data Mining: the scientific perspective
In fields such as medicine, engineering, and science, data accumulates rapidly and must be exploited to lead to new discoveries.
Example: the development of satellite systems for climate observation.
Genetic data generated by microarrays: the goals are to fully decode the human genome, determine the genes that cause various conditions, and understand the structure and function of genes.
DM is a core tool for bioinformatics, that is, the application of statistics and computer science to molecular biology.

23. Competitions
Netflix Prize: 100,480,507 ratings given by 480,189 users for 17,770 movies.
KDD Cup (the competition goes back to 1997):
  2013: Author-Paper Identification Challenge
  2012: User Modeling based on Microblog Data and Search Click Data
  2011: Music recommendation
  2010: Evaluation of student performance
  2009: Customer relationship prediction
Other competitions: www.kdnuggets.com, Kaggle

24. Steps of a knowledge extraction process (1)
Data Mining is an integral part of Knowledge Discovery in Databases (KDD), which is the whole process of converting raw data into knowledge (information). The process consists of a sequence of steps.
Input data can come in a wide variety of formats: text files, relational databases, semi-structured data (e.g. XML, HTML), images, video, etc.

25.
Steps of a knowledge extraction process (2)
Data is selected from the multitude of available sources.
Preprocessing and transformation can include: selecting dimensions (features), dimensionality reduction, handling incomplete data, and normalization.
Preprocessing and transformation can take as much as 60% of the total duration of a knowledge extraction process.
The Data Mining step itself is carried out with a variety of techniques; several methods are often tried.
Finally, the resulting knowledge is post-processed (e.g. invalid or uninteresting results are removed) and must be presented in a form intelligible to decision makers (e.g. visualization or regu