Introduction
In recent years, the number of scholarly documents produced has grown at an unprecedented rate, making it increasingly necessary to organize and manage these documents and to support the exploration and discovery of new ones. Digital libraries facilitate this type of exploration and discovery through interfaces built on the metadata associated with scholarly documents. Information extraction refers to the identification and labeling of this metadata. However, given the rate at which scholarly documents are produced and the heterogeneous nature of the data they contain, such as tables, figures and citations, manual information extraction has become increasingly difficult to perform at scale, motivating the need for automatic methods. Scholarly information extraction is the process by which metadata and entities are extracted from scholarly documents using automated algorithms and systems. These algorithms and systems must handle heterogeneity both in the format of the documents and in the data contained within them. Furthermore, the approaches must be scalable in order to deal with the millions of documents that exist and that continue to be produced.

This half-day tutorial introduces the audience to the broad area of information extraction from scholarly documents. It will explore both the practical aspects of information extraction for scholarly digital libraries and the research opportunities that exist, with a focus on information extraction in a world of scholarly big data.

Learning Objectives and Outcomes
Attendees should leave the tutorial understanding:
- What is scholarly information extraction?
- What is the motivation for scholarly information extraction and what are the challenges?
- What approaches are there to scholarly information extraction and what readily available tools exist?
- How does information extraction fit into the larger digital library ingestion workflow?
- What research opportunities exist in information extraction and what are best practices?

Topical Outline


Presenters
Kyle Williams
A Ph.D. candidate in Information Sciences and Technology at Penn State University who has given several presentations on information extraction for scholarly digital libraries and has integrated information extraction tools into document workflows.
Dr. Jian Wu
A postdoctoral fellow in Information Sciences and Technology at Penn State University and the technical director of the CiteSeerX digital library. Dr. Wu has experience in designing, implementing and maintaining information extraction workflows as part of the CiteSeerX digital library.
Zhaohui Wu
A Ph.D. candidate in Computer Science and Engineering at Penn State University with experience in extracting entities from heterogeneous data types and using them to build novel digital libraries.
Dr. C. Lee Giles
The director of the CiteSeerX digital library project, with extensive experience in complex systems, digital libraries and the Web.


Last updated: 2 March 2016