KDD 2008: Papers

Research Track Accepted Papers

15. Spectral Domain-Transfer Learning. Xiao Ling, Wenyuan Dai, Gui-Rong Xue, Qiang Yang, Yong Yu.

35. Learning Classifiers from Only Positive and Unlabeled Data. Charles Elkan, Keith Noto.

38. Automatic Record Linkage using Seeded Nearest Neighbour and Support Vector Machine Classification. Peter Christen.

46. SPIRAL: Efficient and Exact Model Identification for Hidden Markov Models. Yasuhiro Fujiwara, Yasushi Sakurai, Masashi Yamamuro.

50. Microscopic Evolution of Social Networks. Jure Leskovec, Lars Backstrom, Ravi Kumar, Andrew Tomkins.

52. A Family of Dissimilarity Measures between Nodes Generalizing both the Shortest-Path and the Commute-time Distances. Luh Yen, Amin Mantrach, Masashi Shimbo, Marco Saerens.

62. Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree. Wei Fan, Kun Zhang, Hong Cheng, Jing Gao, Xifeng Yan, Jiawei Han, Philip Yu, Olivier Verscheure.

75. Mining Preferences from Superior and Inferior Examples. Bin Jiang, Jian Pei, Xuemin Lin, David W. Cheung, Jiawei Han.

89. Structured Metric Learning for High Dimensional Problems. Jason V. Davis, Inderjit S. Dhillon.

92. Permu-pattern: Discovery of Mutable Permutation Patterns with Proximity Constraint. Meng Hu, Jiong Yang, Wei Su.

99. Partitioned Logistic Regression for Spam Filtering. Ming-wei Chang, Wen-tau Yih, Christopher Meek.

105. Finding Non-Redundant, Statistically Significant Regions in High Dimensional Data: a Novel Approach to Projected and Subspace Clustering. Gabriela Moise, Jorg Sander.

106. Weighted graphs and disconnected components: Patterns and a generator. Mary McGlohon, Leman Akoglu, Christos Faloutsos.

125. Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model. Yehuda Koren.

127. Discrimination-aware Data Mining. Dino Pedreschi, Salvatore Ruggieri, Franco Turini.

140. Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams. Albert Bifet, Ricard Gavaldà.

142. The Cost of Privacy: Destruction of Data-Mining Utility in Anonymized Data Publishing. Justin Brickell, Vitaly Shmatikov.

149. Colibri: Fast Mining of Large Static and Dynamic Graphs. Hanghang Tong, Spiros Papadimitriou, Jimeng Sun, Philip S. Yu, Christos Faloutsos.

153. A Sequential Dual Method for Large Scale Multi-Class Linear SVMs. S. Sathiya Keerthi, S. Sundararajan, Kai-Wei Chang, Cho-Jui Hsieh, Chih-Jen Lin.

160. Efficient Computation of Personal Aggregate Queries on Blogs. Ka Cheung Sia, Junghoo Cho, Yun Chi, Belle L. Tseng.

163. Feedback Effects between Similarity and Social Influence in Online Communities. David Crandall, Dan Cosley, Daniel Huttenlocher, Jon Kleinberg, Siddharth Suri.

168. CutS3VM: A Fast Semi-Supervised SVM Algorithm. Bin Zhao, Fei Wang, Changshui Zhang.

169. Probabilistic Latent Semantic Visualization: Topic Model for Visualizing Documents. Tomoharu Iwata, Takeshi Yamada, Naonori Ueda.

181. Structured Entity Identification and Document Categorization: Two Tasks with One Joint Model. Indrajit Bhattacharya, Shantanu Godbole, Sachindra Joshi.

220. Categorizing and Mining Concept Drifting Data Streams. Peng Zhang, Xingquan Zhu, Yong Shi.

251. Efficient Semi-streaming Algorithms for Local Triangle Counting in Massive Graphs. Luca Becchetti, Paolo Boldi, Carlos Castillo, Aristides Gionis.

269. Angle-Based Outlier Detection in High-dimensional Data. Hans-Peter Kriegel, Matthias Schubert, Arthur Zimek.

276. Efficient Ticket Routing by Resolution Sequence Mining. Qihong Shao, Yi Chen, Shu Tao, Xifeng Yan, Nikos Anerousis.

277. Building Semantic Kernels for Text Classification using Wikipedia. Pu Wang, Carlotta Domeniconi.

289. Unsupervised Deduplication using Cross-Field Dependencies. Robert Hall, Charles Sutton, Andrew McCallum.

290. Interpretable Nonnegative Matrix Decompositions. Saara Hyvönen, Pauli Miettinen, Evimaria Terzi.

291. Constraint Programming for Itemset Mining. Luc De Raedt, Tias Guns, Siegfried Nijssen.

296. On Updates that Constrain the Features' Connections During Learning. Omid Madani, Jian Huang.

305. FastANOVA: an Efficient Algorithm for Genome-Wide Association Study. Xiang Zhang, Fei Zou, Wei Wang.

307. Fast Logistic Regression for Text Categorization with Variable-Length N-grams. Georgiana Ifrim, Goekhan Bakir, Gerhard Weikum.

318. Bridging Centrality: Graph Mining from Element Level to Group Level. Woochang Hwang, Taehyong Kim, Murali Ramanathan, Aidong Zhang.

320. Banded Structure in Binary Matrices. Gemma C. Garriga, Esa Junttila, Heikki Mannila.

325. Model-Based Document Clustering with a Collapsed Gibbs Sampler. Daniel David Walker, Eric K. Ringger.

335. Constructing Comprehensive Summaries of Large Event Sequences. Jerry Kiernan, Evimaria Terzi.

340. A Bayesian Mixture Model with Linear Regression Mixing Proportions. Xiuyao Song, Chris Jermaine, Sanjay Ranka, John Gums.

342. Training Structural SVMs with Kernels Using Sampled Cuts. Chun-Nam John Yu, Thorsten Joachims.

347. Using Ghost Edges for Classification in Sparsely Labeled Networks. Brian Gallagher, Hanghang Tong, Tina Eliassi-Rad, Christos Faloutsos.

349. Mobile Call Graphs: Beyond Power-Law and Lognormal Distributions. Mukund Seshadri, Sridhar Machiraju, Ashwin Sridharan, Jean Bolot, Christos Faloutsos, Jure Leskovec.

362. Stable Feature Selection via Dense Feature Groups. Lei Yu, Chris Ding, Steven Loscalzo.

372. Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers. Victor Sheng, Foster Provost, Panagiotis G. Ipeirotis.

378. Fast Collapsed Gibbs Sampling For Latent Dirichlet Allocation. Ian Porteous, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, Max Welling.

388. iSAX: Indexing and Mining Terabyte Sized Time Series. Jin Shieh, Eamonn Keogh.

389. Active Learning with Direct Query Construction. Charles X. Ling, Jun Du.

394. Local Peculiarity Factor and its Application in Outlier Detection. Jian Yang, Ning Zhong, Yiyu Yao, Jue Wang.

400. Locality Sensitive Hash Functions Based on Concomitant Rank Order Statistics. Kave Eshghi, Shyamsundar Rajaram.

401. Composition Attacks and Auxiliary Information in Data Privacy. Srivatsava Ranjit Ganta, Shiva Prasad Kasiviswanathan, Adam Smith.

402. Scaling Up Text Classification for Large File Systems. George Forman, Shyamsundar Rajaram.

404. Entity Categorization over Large Document Collections. Venkatesh Ganti, Arnd C. König, Rares Vernica.

413. The Structure of Information Pathways in Social Communication Networks. Gueorgi Kossinets, Jon Kleinberg, Duncan Watts.

420. SAIL: Summation-based Incremental Learning for Information-Theoretic Clustering. Junjie Wu, Hui Xiong, Jian Chen.

426. Stream Prediction Using a Generative Model Based on Frequent Episodes in Event Sequences. Srivatsan Laxman, Vikram Tankasali, Ryen W. White.

434. Knowledge Transfer via Multiple Model Local Structure Mapping. Jing Gao, Wei Fan, Jing Jiang, Jiawei Han.

439. Relational Learning via Collective Matrix Factorization. Ajit P. Singh, Geoffrey J. Gordon.

440. Classification with Partial Labels. Nam Nguyen, Rich Caruana.

442. Volatile Correlation Computation: A Checkpoint View. Wenjun Zhou, Hui Xiong.

455. Anonymizing Transaction Databases for Publication. Yabo Xu, Ke Wang, Ada Wai-Chee Fu, Philip S. Yu.

456. Can Complex Network Metrics Predict the Behavior of NBA Teams? Pedro O.S. Vaz de Melo, Virgilio A.F. Almeida, Antonio A.F. Loureiro.

460. Cut-And-Stitch: Efficient Parallel Learning of Linear Dynamical Systems on SMPs. Lei Li, Wenjie Fu, Fan Guo, Todd C. Mowry, Christos Faloutsos.

463. Information Extraction from Wikipedia: Moving Down the Long Tail. Fei Wu, Raphael Hoffmann, Daniel S. Weld.

469. Bypass Rates: Reducing Query Abandonment using Negative Inferences. Atish Das Sarma, Sreenivas Gollapudi, Samuel Ieong.

472. Community Evolution in Dynamic Multi-Mode Networks. Lei Tang, Huan Liu, Jianping Zhang, Zohreh Nazeri.

496. Knowledge Discovery of Semantic Relationships between Words Using Nonparametric Bayesian Graph Model. Issei Sato, Minoru Yoshida, Hiroshi Nakagawa.

518. Reconstructing Chemical Reaction Networks: Data Mining meets System Identification. Yong Ju Cho, Naren Ramakrishnan, Yang Cao.

537. Unsupervised Feature Selection for Principal Components Analysis. Christos Boutsidis, Michael W. Mahoney, Petros Drineas.

548. Effective Label Acquisition for Collective Classification. Mustafa Bilgic, Lise Getoor.

554. A Semi-Supervised Approach to Rapid and Reliable Labeling of Large Data Sets. Gyorgy J. Simon, Vipin Kumar, Zhi-Li Zhang, Francesco Bonchi.

563. Data Mining Using High Performance Data Clouds: Experimental Studies Using Sector and Sphere. Robert Grossman, Yunhong Gu.

569. Topical Query Decomposition. Francesco Bonchi, Carlos Castillo, Debora Donato, Aristides Gionis.

571. Automatic Identification of Quasi-experimental Designs for Discovering Causal Knowledge. David D. Jensen, Andrew S. Fast, Brian J. Taylor, Marc E. Maier.

576. Identifying Biologically Relevant Genes via Multiple Heterogeneous Data Sources. Zheng Zhao, Jiangxin Wang, Huan Liu, Jieping Ye, Yung Chang.

577. Anomaly Pattern Detection in Categorical Datasets. Kaustav Das, Jeff Schneider, Daniel B. Neill.

594. Semi-supervised Learning with Data Calibration for Long-Term Time Series Forecasting. Haibin Cheng, Pang-Ning Tan.

611. De-duping URLs via Rewrite Rules. Anirban Dasgupta, Ravi Kumar, Amit Sasturkar.

613. FAST: A ROC-based Feature Selection Metric for Small Samples and Imbalanced Data Classification Problems. Xue-wen Chen, Mike Wasikowski.

623. Asymmetric Support Vector Machines: Low False-Positive Learning Under the User Tolerance. Shan-Hung Wu, Keng-Pei Lin, Chung-Min Chen, Ming-Syan Chen.

632. Quantitative Evaluation of Approximate Frequent Pattern Mining Algorithms. Rohit Gupta, Gang Fang, Blayne Field, Michael Steinbach, Vipin Kumar.

672. Partial Least Squares Regression for Graph Mining. Hiroto Saigo, Nicole Krämer, Koji Tsuda.

681. Generating Succinct Titles for Web URLs. Deepayan Chakrabarti, Ravi Kumar, Kunal Punera.

685. Succinct Summarization of Transactional Databases: An Overlapped Hyperrectangle Scheme. Yang Xiang, Ruoming Jin, David Fuhry, Feodor F. Dragan.

686. Influence and Correlation in Social Networks. Aris Anagnostopoulos, Ravi Kumar, Mohammad Mahdian.

692. Extracting Shared Subspace for Multi-label Classification. Shuiwang Ji, Lei Tang, Shipeng Yu, Jieping Ye.

695. Effective and Efficient Itemset Pattern Summarization: Regression-based Approaches. Ruoming Jin, Muad Abu-Ata, Yang Xiang, Ning Ruan.

702. Learning Subspace Kernels for Classification. Jianhui Chen, Shuiwang Ji, Betul Ceran, Qi Li, Mingrui Wu, Jieping Ye.

750. Mining Multi-Faceted Overviews of Arbitrary Topics in a Text Collection. Xu Ling, Qiaozhu Mei, ChengXiang Zhai, Bruce Schatz.

751. Joint Latent Topic Models for Text and Citations. Ramesh M. Nallapati, Amr Ahmed, Eric P. Xing, William W. Cohen.

758. Hypergraph Spectral Learning for Multi-label Classification. Liang Sun, Shuiwang Ji, Jieping Ye.

769. Simultaneous Tensor Subspace Selection and Clustering: The Equivalence of High Order SVD and K-Means Clustering. Heng Huang, Chris Ding, Dijun Luo.

773. A Unified Approach for Schema Matching, Coreference, and Canonicalization. Michael L. Wick, Khashayar Rohanimanesh, Karl Schultz, Andrew McCallum.

787. Multi-class Cost-sensitive Boosting with p-norm Loss Functions. Aurelie C. Lozano, Naoki Abe.

836. Combinational Collaborative Filtering for Personalized Community Recommendation. Wen-Yen Chen, Dong Zhang, Edward Y. Chang.

850. Structured Learning for Non-Smooth Ranking Losses. Rajiv Khanna, Uma Sawant, Soumen Chakrabarti, Chiru Bhattacharyya.

Industrial/Government Applications Track Accepted Papers

65. Privacy Leaks Using Corpus-Based Association Rules. Richard Chow, Philippe Golle, Jessica Staddon.

80. TagMark: Reliable Estimations of RFID Tags for Business Processes. Leonardo Weiss Ferreira Chaves, Erik Buchmann, Klemens Böhm.

124. Spotting Out Emerging Artists Using Geo-Aware Analysis of P2P Query Strings. Noam Koenigstein, Yuval Shavitt, Tomer Tankel.

128. Identifying Authoritative Actors in Question-Answering Forums - The Case of Yahoo! Answers. Mohamed Bouguessa, Benoit Dumoulin, Shengrui Wang.

178. Text Classification, Business Intelligence, and Interactivity: Automating C-Sat Analysis for Services Industry. Shantanu Godbole, Shourya Roy.

183. Context-Aware Query Suggestion by Mining Click-Through and Session Data. Huanhuan Cao, Daxin Jiang, Jian Pei, Qi He, Zhen Liao, Enhong Chen, Hang Li.

221. Identifying Domain Expertise of Developers from Source Code. Renuka Sindhgatta.

265. Temporal Pattern Discovery for Trends and Transient Effects: Its Application to Patient Records. G. Niklas Norén, Andrew Bate, Johan Hopstadius, Kristina Star, I. Ralph Edwards.

328. Anticipating Annotations and Emerging Trends in Biomedical Literature. Fabian Moerchen, Mathaeus Dejori, Dmitriy Fradkin, Julien Etienne, Bernd Wachmann, Markus Bundschus.

330. A Visual-Analytic Toolkit for Dynamic Interaction Graphs. Xintian Yang, Sitaram Asur, Srinivasan Parthasarathy, Sameep Mehta.

337. Using Predictive Analysis to Improve Invoice-to-Cash Collection. Sai Zeng, Prem Melville, Christian A. Lang, Ioana Boier-Martin, Conrad Murphy.

368. Automated Cyclone Discovery and Tracking using Knowledge Sharing in Multiple Heterogeneous Satellite Data. Shen-Shyang Ho, Ashit Talukder.

391. Land Cover Change Detection: A Case Study. Shyam Boriah, Vipin Kumar, Michael Steinbach, Christopher Potter, Steven Klooster.

435. ArnetMiner: Extraction and Mining of Academic Social Networks. Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, Zhong Su.

466. Learning Methods for Lung Tumor Markerless Gating in Image-Guided Radiotherapy. Ying Cui, Jennifer G. Dy, Gregory C. Sharp, Brian M. Alexander, Steve B. Jiang.

563. Data Mining Using High Performance Data Clouds: Experimental Studies Using Sector and Sphere. Robert Grossman, Yunhong Gu.

593. Learning from Multi-Topic Web Documents for Contextual Advertising. Yi Zhang, Arun C. Surendran, John C. Platt, Mukund Narasimhan.

625. Heterogeneous Data Fusion for Alzheimer's Disease Study. Jieping Ye, Kewei Chen, Teresa Wu, Jing Li, Zheng Zhao, Rinkal Patel, Min Bae, Ravi Janardan, Huan Liu, Gene Alexander, Eric Reiman.

649. Scalable and Near Real-Time Burst Detection from eCommerce Queries. Nish Parikh, Neel Sundaresan.

650. Privacy-Preserving Cox Regression for Survival Analysis. Shipeng Yu, Glenn Fung, Romer Rosales, Sriram Krishnan, R. Bharat Rao, Cary Dehing-Oberije, Philippe Lambin.

688. Customer Targeting Models Using Actively-Selected Web Content. Prem Melville, Saharon Rosset, Richard D. Lawrence.

789. The Persuasive Phase of Visualization. Christine H. Chih, Douglass S. Parker.

806. Experimental Comparison of Scalable Online Ad Serving. Gang Wu, Brendan Kitts.

Important Dates

Research Track
- Electronic abstract submission: 11:30pm (PT), February 23, 2008
- Electronic paper submission: 11:30pm (PT), February 29, 2008
- Author Notification: May 25, 2008

Industrial/Government Applications Track
- Electronic abstract submission: 11:30pm (PT), February 23, 2008
- Electronic paper submission: 11:30pm (PT), February 29, 2008
- Author Notification: May 25, 2008

Paper Submission

Length of paper: NINE (9) pages in the ACM template (longer papers will be rejected without review).
Format: PDF, US Letter (8.5" x 11").
Templates: http://www.acm.org/sigs/publications/proceedings-templates
Submission website: https://cmt.research.microsoft.com/KDD2008/
The KDD-2008 review process will not be "double-blind".
The Microsoft conference management system does not send email confirmations after submission, but you can view your submission details from the author console once you have submitted.

We cannot accept submissions by e-mail, fax or postal mail.

Research Track Papers

Call for Papers

Please take note of the repeatability guideline.

We invite submissions on all aspects of knowledge discovery and data mining overlapping with topics from machine learning, statistics, databases, and pattern recognition. Papers are expected to describe innovative ideas and solutions that are rigorously evaluated and well-presented. Submissions that describe minor variations of existing methods or only make small or questionable improvements to existing algorithms are discouraged.

Areas of interest include, but are not limited to:

Data mining algorithms
Data mining foundations
High performance and parallel/distributed data mining
Innovative applications of data mining
Data mining systems
KDD framework and process
Mining data streams and sensor data
Mining multi-media data
Mining social networks and graph data
Mining spatial and temporal data
Mining text, Web and semi-structured data
Pre-processing and post-processing in data mining
Robust and scalable statistical methods
Security, privacy, and adversarial data mining
Visual data mining and data visualization

All submitted papers will be judged based on their technical merit, rigor, significance, originality, relevance, and clarity. Papers submitted to KDD-08 should be original work, not previously published in a peer-reviewed conference or journal. Papers substantially similar to papers submitted to KDD-08 should not be under review in another peer-reviewed conference or journal during the KDD-08 reviewing period.

Repeatability guideline: As the SIGKDD conference enters its fourteenth year of existence, we need to take steps to ensure the long-term viability of the research output of this community. A basic requirement is to enable the careful scrutiny and repeatability of evaluation results reported in a paper. The description of experimental results in submitted papers should be accompanied by all relevant implementation details and exact parameter specifications. Reviewers will be encouraged to downgrade ratings of papers that do not meet this guideline. Datasets used in the experiments should be made publicly available whenever possible. When you must use proprietary datasets, please make every effort to supplement your results with those from closely matching synthetic datasets or other public datasets.

Industrial/Government Applications Track

Call for Papers

The Industrial/Government Applications Track of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2008) will highlight challenges, lessons, concerns, and research issues arising out of deploying applications of KDD technology. The focus is on promoting the exchange of ideas between researchers and practitioners of data mining.

The KDD-2008 Industrial/Government Applications (I/G) Track seeks to:

provide a forum for exchanging ideas between KDD practitioners, researchers, companies, and government organizations;
help commercial and government organizations highlight successful KDD applications;
raise interesting (research) challenges and other concerns more specific to industry and government -- customer privacy issues, analysis of data not generally available in academia, issues of scale that arise more heavily in a corporate setting, etc.

The I/G Applications Track solicits papers describing implementations of KDD solutions relevant to commercial or government settings. The primary emphasis is on papers that advance our understanding of practical, applied, or pragmatic issues and highlight new research challenges in real KDD applications. Applications can be in any field including, but not limited to: e-commerce, medical and pharmaceutical, defense, public policy, engineering, manufacturing, telecommunications, and government.

The I/G Applications Track will consist of competitively-selected contributed papers - presented in oral and/or poster form - as well as invited talks. We envision submissions along four sub-areas:

Emerging applications and technology
Deployed KDD case studies
Comparative studies of KDD technology
Pragmatic issues and research considerations in fielding real applications

Emerging application and technology papers discuss prototype applications, tools for focused domains or tasks, useful techniques or methods, useful system architectures, scalability enablers, tool evaluations, or integration of KDD and other technologies. Case studies describe deployed projects with measurable benefits that include KDD technology. Such papers need to demonstrate the importance and impact of the work clearly. Comparative studies compare and contrast KDD technologies using specific examples (without being a product advertisement). Pragmatic issues and considerations include important practical and research considerations, approaches, and architectures that enable successful applications.

Submitters are encouraged (but not required) to select one (or more) of these sub-areas for their papers. In their submission, authors are required to explain why the application is important, the specific need for KDD technology to solve the problem (including why other methods perhaps not based on data mining may fall short), and any innovations or lessons learned in the solution.