STAT557: Data Mining
Fall 2017: Aug 21 - Dec 8


Teaching Assistant:  

Lectures:   TTH 1:35-2:50pm     067 Willard Bldg.

Course homepage:

Description of the course:

With rapid advances in information technology, we have witnessed explosive growth in our capability to generate and collect data over the last decade. In the business world, retailers have generated very large databases of commercial transactions. Huge amounts of scientific data have been generated in various fields as well. For instance, the human genome database project has collected gigabytes of data on the human genetic code. The World Wide Web provides another example, with billions of web pages of textual and multimedia information that are used by millions of people. How to analyze huge bodies of data so that they can be understood and used efficiently remains a challenging problem. Data mining addresses this problem by providing techniques and software to automate the analysis and exploration of large, complex data sets. Research on data mining has been pursued by researchers in a wide variety of fields, including statistics, machine learning, database management, and data visualization.

This course on data mining will cover methodology, major software tools, and applications in this field. By introducing principal ideas in statistical learning, the course will help students understand the conceptual underpinnings of data mining methods. Considerable effort will also be devoted to the computational aspects of algorithm implementation. To make an algorithm efficient for very large-scale data sets, issues such as algorithm scalability need to be carefully analyzed. Data mining and learning techniques developed in fields other than statistics, e.g., machine learning and signal processing, will also be introduced. Example topics include linear classification/regression, logistic regression, model regularization, dimension reduction, prototype methods, decision trees, mixture models, and hidden Markov models.

Students will be required to work on projects to practice applying existing software and, to a certain extent, developing their own algorithms. Classes will take three forms: lecture, project discussion, and special topic survey/research applications. Project discussion will enable students to share and compare ideas with each other and to receive specific guidance from the instructors. Efforts will be made to help students formulate real-world problems as mathematical models so that suitable algorithms can be applied with consideration of computational constraints. By surveying special topics, students will be exposed to the extensive literature and become more aware of recent research. Students are strongly encouraged to survey or present their own applications of data mining and statistical learning in graduate research and to lead discussions on data collection and problem formulation.

Prerequisites: Stat 414, 415, 416, or similar courses that cover the basics of probability, expectation, and conditional distribution. Basic programming skills. Matrix algebra and multivariate calculus.

Required: The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman

Recommended extra reading:


Note: late projects are not accepted unless a written request is submitted and approved one week ahead of the due date.

Academic Integrity:
All Penn State and Eberly College of Science policies regarding academic integrity apply to this course. See for details.

Lecture Notes & Other Course Materials:
Course notes, reading materials, data sets, and project description

Lecture Dates, academic calendar

------------- Updated in 2017 ---------------