KDD 2008: Tutorials

All tutorials will be held on Sunday, August 24, 2008.

+ J. Han, J. Lee, H. Gonzalez, X. Li, "Mining Massive RFID, Trajectory, and Traffic Data Sets"

Jiawei Han, Jae-Gil Lee, Hector Gonzalez, Xiaolei Li
Department of Computer Science, University of Illinois at Urbana-Champaign

Abstract
With the wide availability of satellite, RFID, GPS, sensor, wireless, and video technologies, moving-object data has been collected on a massive scale and is becoming increasingly rich, complex, and ubiquitous. There is an imminent need for scalable and flexible data analysis over moving-object information, and thus mining moving-object data has become one of the major challenges in data mining. There have been considerable research efforts on data mining for RFID, trajectory, and traffic data sets. However, there has been no systematic tutorial on knowledge discovery from such moving-object data sets. This tutorial presents a comprehensive, organized, and state-of-the-art survey of methodologies and algorithms for analyzing different kinds of moving-object data sets, with an emphasis on several important mining tasks: clustering, classification, outlier analysis, and multidimensional analysis. Besides a thorough survey of the recent research work on this topic, we also show how real-world applications can benefit from data mining of RFID, trajectory, and traffic data sets.

The tutorial consists of three parts: (1) RFID data mining, (2) trajectory data mining, and (3) traffic data mining. The first part explores warehousing, cleaning, and flow mining for RFID data. The second part explores pattern mining, clustering, classification, and outlier detection for trajectory data. The third part explores route discovery, destination prediction, and hot-route or outlier detection for traffic data. This tutorial is prepared for data mining, database, and machine learning researchers who are interested in moving-object data.

+ J. Neville, F. Provost, "Predictive Modeling with Social Networks"

Jennifer Neville, Purdue University
Foster Provost, New York University

Abstract
Recently there has been a surge of interest in methods for analyzing complex social networks: from communication networks, to friendship networks, to professional and organizational networks. The dependencies among linked entities in the networks present an opportunity to improve inference about properties of individuals, as birds of a feather do indeed flock together. For example, when deciding how to market a product to people in MySpace or Facebook, it may be helpful to consider whether a person's friends are likely to purchase the product.

This tutorial will explore the unique opportunities and challenges for modeling social network data. We will begin with a description of the problem setting, including examples of various applications of social network mining (e.g., marketing, fraud detection). We will then present a number of characteristics of social network data that differentiate it from traditional inference and learning settings, and outline the resulting opportunities for significantly improved inference and learning. We will discuss specific techniques for capitalizing on each of the opportunities in statistical models, and outline both methodological issues and potential modeling pathologies that are unique to network data. We will give links to the recent literature to guide study, and present results demonstrating the effectiveness of the techniques.

Prerequisites: The tutorial assumes a basic knowledge of AI-style inference and machine learning, equivalent to an introductory graduate or advanced undergraduate class.

+ J. Pei, M. Hua, Y. Tao, X. Lin, "Mining Uncertain and Probabilistic Data: Problems, Challenges, Methods, and Applications"

Jian Pei, Simon Fraser University, Canada
Ming Hua, Simon Fraser University, Canada
Yufei Tao, The Chinese University of Hong Kong
Xuemin Lin, The University of New South Wales, Australia

Abstract
Uncertain data are inherent in some important applications, such as environmental surveillance, market analysis, and quantitative economics research. Uncertain data in those applications are generally caused by factors such as data randomness and incompleteness, limitations of measuring equipment, and delayed data updates. Due to the importance of those applications and the rapidly increasing amount of uncertain data collected and accumulated, analyzing and mining large collections of uncertain data has become an important task and has attracted more and more interest from the data mining community. In this tutorial, we will give a systematic survey of the motivations and applications, the problems, the challenges, the fundamental principles, and the state-of-the-art methods of mining uncertain and probabilistic data. We will motivate the survey with several interesting practical applications of uncertain data analysis. To set the stage, we will briefly discuss two major models for uncertain and probabilistic data. We will cover several important data mining tasks on uncertain data, including clustering, classification, frequent pattern mining, and online analytical processing (OLAP). For each task, we will analyze the challenges posed by uncertain and probabilistic data and the state-of-the-art solutions.

+ R. Feldman, L. Ungar, "Applied Text Mining"

Ronen Feldman, Hebrew University
Lyle Ungar, University of Pennsylvania

Abstract
The information age has made it easy to store large amounts of data. The proliferation of documents available on the Web, on corporate intranets, on news wires, and elsewhere is overwhelming. However, while the amount of data available to us is constantly increasing, our ability to absorb and process this information remains constant. Search engines only exacerbate the problem by making more and more documents available in a matter of a few keystrokes. Text mining is an exciting research area that tries to solve the information overload problem by using techniques from data mining, machine learning, NLP, IR, and knowledge management. Text mining involves the preprocessing of document collections (text categorization, information extraction, term extraction), the storage of the intermediate representations, the techniques to analyze these intermediate representations (distribution analysis, clustering, trend analysis, association rules, etc.), and visualization of the results. In this tutorial we will present the general theory of text mining and will demonstrate several systems that use these principles to enable interactive exploration of large textual collections. We will present a general architecture for text mining and will outline the algorithms and data structures behind the systems. Special emphasis will be given to lessons learned from years of experience in developing real-world text mining systems. The tutorial will cover the state of the art in this rapidly growing area of research. Several real-world applications of text mining will be presented.

Important Dates

Proposals due: March 15, 2008
Notification of Acceptance: May 25, 2008

Call for Tutorial Proposals

KDD'08 will host tutorials covering topics in data mining of interest to the research community as well as application developers. The tutorials will be part of the main conference technical program and are free of charge to the attendees of the conference. We invite proposals for half-day tutorials from active researchers and experienced tutors. Ideally, a tutorial will cover the state-of-the-art research, development, and applications in a specific data mining direction, and stimulate and facilitate future work. Tutorials on interdisciplinary directions, novel and fast-growing directions, and significant applications are highly encouraged.

A tutorial proposal should be organized into the following sections:

Title
Abstract (up to 150 words)
Target audience and prerequisites. Proposals must clearly identify the intended audience for the tutorial (e.g., novice users of statistical techniques, or expert researchers in text mining). What background will be required of the audience? Why is this topic important/interesting to the KDD community? What is the benefit to participants?
Outline of the tutorial. Enough material should be included to provide a sense of both the scope of material to be covered and the depth to which it will be covered. The more details that can be provided, the better (up to and including links to the actual slides). Note that the tutors should NOT focus mainly on their own research results. A KDD tutorial is not a forum for promoting one's research or product.
If the tutorial or a similar/highly related tutorial has been presented by the same author(s) before: a list of those forums with their times and locations, highlighting the similarities and differences between each and the one proposed for KDD'08 (up to 100 words for each entry)
A list of tutorials on the same/similar/highly related topics given by other people, highlighting the differences between yours and theirs (up to 100 words for each entry)
A list of other tutorials given by the authors (titles, presenters, and forums only)
Tutors' short bio and their expertise related to the tutorial (up to 200 words per tutor)
A list of up to 20 most important references that will be covered in the tutorial
(Optional) URLs of the slides/notes of the previous tutorials given by the authors, and any specific audio/video/computer requirements for the tutorial.

Please send your submission to kdd08tutorial@gmail.com

Hans-Peter Kriegel, Peer Kröger, and Arthur Zimek
Institute for Informatics, Ludwig-Maximilians-Universität München, Germany

Huan Liu, Arizona State University
Nitin Agarwal, Arizona State University

Karsten Borgwardt, University of Cambridge
Xifeng Yan, IBM T.J. Watson Research Center New York
