CiteSeerX Software

The purpose of this page is to maintain a list of tools, publications, and Web services that are related to extracting information from scholarly documents so as to provide a point of reference for anyone interested in exploring this topic. The main focus is on header (title, authors, institutions, venue, etc.) and citation metadata extraction, though other types of information extraction are covered as well.

This page was created and is maintained by Kyle Williams and Sagnik Ray Choudhury.

For changes and additions to this page, please contact kwilliams (at) psu (dot) edu or sagnik (at) psu (dot) edu.

Extraction Tools
Publications
Services
- Web Services

Extraction Tools

These are publicly available tools for extracting information from scholarly documents.

Header Extraction

This list is based on Lipinski et al. (JCDL 2013). A big thanks to the authors for identifying all of these tools.

SVM Header Parse
http://sourceforge.net/projects/citeseerx/
License: Apache License v2.0
SVM Header Parse is a tool for metadata extraction based on support vector machines (SVMs) and is part of the SeerSuite package. It was developed at the Pennsylvania State University.

Grobid
https://github.com/kermitt2/grobid
License: Apache License v2.0
Grobid performs header and citation extraction using conditional random fields (CRFs).

ParsCit
http://aye.comp.nus.edu.sg/parsCit/
License: GNU Lesser General Public License (LGPL)
ParsCit performs header and citation extraction using CRFs.

Docear's PDF Inspector
http://www.docear.org/
License: Apache License v2.0, GPLv2, GPLv3
Extracts document metadata based on stylistic analysis.

Mendeley
http://www.mendeley.com/
License: Commercial
Mendeley is a software package for managing collections of academic documents; it also performs automatic metadata extraction using SVMs.

PDFMeat
http://code.google.com/p/pdfmeat/
License: GPLv2
Extracts appropriate terms from a paper and then queries Google Scholar to retrieve the metadata.

SciPlore Xtract
http://sciplore.org/
License: Unknown
Extracts header information based on a stylistic analysis of XML.
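
Several of the tools above rely on stylistic analysis. As a rough illustration of that idea only (this is not the implementation of any tool listed here), the sketch below guesses a title by picking the text set in the largest font on the first page. The `Span` type, its fields, and the `guess_title` function are hypothetical names, and the sketch assumes that some earlier step has already converted the PDF into text spans with font sizes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """A run of text with a single font size (hypothetical input format)."""
    text: str
    font_size: float
    page: int = 1


def guess_title(spans: List[Span]) -> Optional[str]:
    """Return the first-page text set in the largest font, or None."""
    # Only non-empty spans on the first page are candidates.
    first_page = [s for s in spans if s.page == 1 and s.text.strip()]
    if not first_page:
        return None
    largest = max(s.font_size for s in first_page)
    # Join every span at the largest size so multi-line titles stay whole.
    parts = [s.text.strip() for s in first_page if s.font_size == largest]
    return " ".join(parts)


if __name__ == "__main__":
    # Made-up example data, in reading order.
    example = [
        Span("Proceedings of Some Workshop", 9.0),
        Span("Automatic Metadata Extraction", 18.0),
        Span("from Scholarly Documents", 18.0),
        Span("A. Author, B. Author", 11.0),
    ]
    print(guess_title(example))
```

A heuristic this simple will fail on documents where a journal name or section heading uses the largest font, which is one reason the tools listed above combine several features or use trained models.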

Citation Extraction

ParsCit
http://aye.comp.nus.edu.sg/parsCit/
License: GNU Lesser General Public License (LGPL)
ParsCit performs header and citation extraction using CRFs.

HMM Metadata Extractor
http://gales.cdlib.org/~egh/hmm-citation-extractor/
License: Free for use
A citation parsing tool based on hidden Markov models (HMMs).

Other Extraction

TableSeer
http://sourceforge.net/projects/tableseer/
License: Unspecified, but open source
Automatically extracts tables and table data.

Pdffigures
pdffigures project by AllenAI
License: Apache
Automatically extracts figures and tables from PDF documents.

Publications

A list of publications related to metadata extraction, grouped by the type of extraction performed. I have not read all of these papers, but the list may be a good starting point for anyone interested in this topic. The references appear in different formats because they come from different sources.

Header Extraction

GROBID: Combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications. P. Lopez. Proceedings of the 13th European Conference on Digital Libraries (ECDL), Corfu, Greece, 2009.
J. Beel, B. Gipp, A. Shaker, and N. Friedrich, SciPlore Xtract: Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size), in Research and Advanced Technology for Digital Libraries: Proceedings of the 14th European Conference on Digital Libraries (ECDL'10), Glasgow, UK, 2010.
Huy Hoang Nhat Do, Muthu Kumar Chandrasekaran, Philip S. Cho, and Min-Yen Kan. (2013) Extracting and Matching Authors and Affiliations in Scholarly Documents. In Proceedings of the Thirteenth Annual International ACM/IEEE Joint Conference on Digital Libraries (JCDL'13), Indianapolis: ACM. 2013.
Han, H., Giles, C., Manavoglu, E., Zha, H., Zhang, Z., Fox, E. (2003). Automatic document metadata extraction using support vector machines. Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries.
Minh-Thang Luong, Thuy Dung Nguyen and Min-Yen Kan (2010) Logical Structure Recovery in Scholarly Articles with Rich Document Features. International Journal of Digital Library Systems (IJDLS), 1(4), 1-23.
Cui, Binge. "Scientific literature metadata extraction based on HMM." Cooperative Design, Visualization, and Engineering. Springer Berlin Heidelberg, 2009. 64-68.

Citation Extraction

Erik Hetzner. 2008. A simple method for citation metadata extraction using hidden markov models. In Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries (JCDL '08). ACM, New York, NY, USA, 280-284.
Isaac G. Councill, C. Lee Giles, Min-Yen Kan. (2008) ParsCit: An open-source CRF reference string parsing package. In Proceedings of the Language Resources and Evaluation Conference (LREC 08), Marrakesh, Morocco, May.
Guido Sautter and Klemens Bohm. 2012. Improved bibliographic reference parsing based on repeated patterns. In Proceedings of the Second international conference on Theory and Practice of Digital Libraries (TPDL'12), Panayiotis Zaphiris, George Buchanan, Edie Rasmussen, and Fernando Loizides (Eds.). Springer-Verlag, Berlin, Heidelberg, 370-382.
Eli Cortez, Altigran S. da Silva, Marcos Andre Goncalves, Filipe Mesquita, Edleno S. de Moura, FLUX-CIM: flexible unsupervised extraction of citation metadata, Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, June 18-23, 2007, Vancouver, BC, Canada.

Other Extraction

Khabsa, M., Treeratpituk, P., and Giles, C. L. (2012). AckSeer: A Repository and Search Engine for Automatically Extracted Acknowledgments from Digital Libraries, 185-194.
Liu, Y., Bai, K., Mitra, P., and Giles, C. (2007). TableSeer: automatic table metadata extraction and searching in digital libraries. Proceedings of the 7th annual international ACM/IEEE joint conference on Digital libraries - JCDL '07, 91-100.
Sagnik Ray Choudhury, Suppawong Tuarob, Prasenjit Mitra, Lior Rokach, Andi Kirk, Silvia Szep, Donald Pellegrino, Sue Jones, and Clyde Lee Giles. 2013. A figure search engine architecture for a chemistry digital library. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries (JCDL '13). ACM, New York, NY, USA, 369-370.
Sagnik Ray Choudhury, Prasenjit Mitra, Andi Kirk, Silvia Szep, Donald Pellegrino, Sue Jones, C. Lee Giles: Figure Metadata Extraction from Digital Documents. ICDAR 2013: 135-139

Comparisons

M. Lipinski, K. Yao, C. Breitinger, J. Beel, and B. Gipp, Evaluation of Header Metadata Extraction Approaches and Tools for Scientific PDF Documents, in Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), Indianapolis, IN, USA, 2013.

Datasets

Anzaroot, S., and McCallum, A. (2013). A New Dataset for Fine-Grained Citation Field Extraction. ICML Workshop on Peer Reviewing and Publishing Models, 28.

Services

Web Services

These are Web services that you can use to extract metadata without running any software locally.

CiteSeerExtractor
http://citeseerextractor.ist.psu.edu:8080
License: Apache License v2.0
Provides a RESTful API to the tools used for extraction in CiteSeerX.

ParsCit Web Service
http://aye.comp.nus.edu.sg/parsCit/#ws
License: N/A
A Web service for parsing citations. Also provides an online demo.

FreeCite
http://freecite.library.brown.edu/
License: MIT License
A Web service for parsing citations based on ParsCit.
