Structure Extraction competition @ ICDAR 2011

Goals

The goal of the competition is to evaluate and compare automatic techniques for extracting structural information from digitized books in order to build hyperlinked tables of contents.

Training Data

Resulting from the 1st Book Structure Extraction Competition at ICDAR 2009, a training set of 527 annotated books was distributed to all participants of the 2011 competition.

Contributors to the 2011 groundtruth additionally received the corresponding data, which was freely released in 2013 (see the training page for further information).

Additionally, a 100-book subset of the 2009 groundtruth is freely available, so as to facilitate access to the competition and foster further research. Please download it from the training data page, together with the evaluation software (.exe).

Motivation

Current digitization and OCR technologies produce the full text of digitized books with only minimal structure information. Pages and paragraphs are usually identified and marked up in the OCR, but more sophisticated structures, such as chapters, sections, etc., are not recognised. Such structures are, however, of great value in helping searchers and readers navigate inside digital books.

Task description

The task is to build hyperlinked tables of contents for a sample collection of 1,000 digitized books of different genres and styles.

As input, for each book, the output of the OCR process is given as a text file in DjVu XML format. In addition, a PDF of the book is also made available.
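
For orientation, the sketch below shows one way to read this input: it streams the DjVu XML of a book and yields the OCR words of each physical page. The element names used here (OBJECT for a page, WORD for a token) and the file path are assumptions about the hidden-text layer of the DjVu XML files, not a specification; adjust them to the actual corpus layout.

import xml.etree.ElementTree as ET

def pages_from_djvu_xml(path):
    """Yield (page_counter, words) pairs, counting physical pages from 1."""
    page_counter = 0
    for _, elem in ET.iterparse(path, events=("end",)):
        # Assumption: one OBJECT element per scanned page, one WORD per OCR token.
        if elem.tag.upper() == "OBJECT":
            page_counter += 1
            words = [w.text for w in elem.iter()
                     if w.tag.upper() == "WORD" and w.text]
            yield page_counter, words
            elem.clear()  # keep memory bounded on large books

# Placeholder path; each book directory contains its DjVu XML file.
for page, words in pages_from_djvu_xml("path/to/book/djvu.xml"):
    print(page, " ".join(words[:10]))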

The expected output is an XML file, containing the generated hyperlinked table of contents for each book in the sample set (the exact format is described in the Submission Format section below).

The tables of contents created by participants will be compared to a manually built groundtruth, as described in the Evaluation section.

Participants may submit up to 10 "runs", each run containing the table of contents for all 1,000 books in the sample set.

Application in other tasks

Participants of the competition may also be interested in the INEX Book Track, where they can test the results of their structure extraction techniques on related tasks such as book search or active reading. For more information, please refer to the INEX website.

Research questions

We list here some example research questions that participants may be interested in exploring and whose investigation is facilitated by this competition.
  • Can tables of contents be extracted from the actual table of contents pages of the book (where available), or can they be generated more reliably from the full content of the book?
  • Can tables of contents be extracted from textual information alone, or is page layout information necessary?
  • What techniques provide reliable logical page number recognition and extraction, and how can logical page numbers be mapped to physical page numbers? (An illustrative mapping heuristic is sketched after this list.)
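
One simple heuristic for the last question is sketched below: scan the header and footer tokens of each physical page for an isolated number and record it as that page's printed (logical) page number. This is a toy illustration only, not a method used by any participant; the pages argument is assumed to be a list of word lists, one per physical page, such as produced by the DjVu XML sketch above.

import re

def logical_to_physical(pages):
    """Map printed (logical) page numbers to physical page counters.
    pages: list of word lists, one per physical page (counter starts at 1)."""
    mapping = {}
    for counter, words in enumerate(pages, start=1):
        # Printed page numbers usually sit in the header or footer, so only
        # the first and last few tokens of each page are inspected.
        for token in words[:5] + words[-5:]:
            if re.fullmatch(r"\d{1,4}", token):
                mapping.setdefault(int(token), counter)
                break
    return mapping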

Evaluation

The tables of contents created by participants will be compared to a manually built groundtruth and will be evaluated using recall/precision-like measures at different structural levels (i.e., different depths in the table of contents).
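
As an illustration only (the official evaluation software can be downloaded from the training data page), a recall/precision-style comparison for a single book could look like the sketch below, where entries are (title, page, depth) tuples and only entries down to a chosen depth are considered. Exact matching is used here for simplicity; the official ICDAR 2009 measures are more tolerant, in particular with respect to title variations.

def precision_recall_f(generated, groundtruth, depth=None):
    """Compare two lists of (title, page, depth) ToC entries."""
    if depth is not None:
        generated = [e for e in generated if e[2] <= depth]
        groundtruth = [e for e in groundtruth if e[2] <= depth]
    matched = len(set(generated) & set(groundtruth))
    precision = matched / len(generated) if generated else 0.0
    recall = matched / len(groundtruth) if groundtruth else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Example: evaluate only the top two structural levels of a toy book.
generated = [("Introduction", 7, 1), ("What is covered?", 8, 2)]
groundtruth = [("Introduction", 7, 1), ("What is covered?", 8, 2), ("Preface", 6, 1)]
print(precision_recall_f(generated, groundtruth, depth=2))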

The groundtruth will be generated with the aid of the participating groups, using an annotation tool specifically designed for the purpose. The tool takes a generated table of contents as input and allows annotators to manually correct any mistakes; as a starting point, it will use the tables of contents generated by participants. The created groundtruth will be made available to all participants for use in future evaluations. Participants will be expected to contribute annotations for a minimum of 50 books (this number may be reduced if more groups participate).

For consistency with previous years, the official metrics will be those of ICDAR 2009. However, they will be complemented with the XRCE measures, which evaluate submissions link-wise rather than title-wise (full details on and discussion of these metrics can be found in the IJDAR 2010 paper referenced in the Competition Description section below).

Requirements

  • Participants are expected to submit tables of contents for the evaluation set, following the submission format described below (see the Submission Format section)
  • Participants are expected to contribute to the creation of the groundtruth data (see the Evaluation section)
  • The winning system will be asked to generate ToCs for the whole INEX Book Track corpus of 50,000 books

Important Dates

March 1: Registration starts
May 3: Submissions due
May 10 - June 7: Groundtruth creation
September 18-21: Result announcement and competition report presentation at ICDAR 2011 in Beijing [participation and attendance are welcome but not required]

Submission format

Submissions for the Structure Extraction task should conform to the following DTD:

<!ELEMENT bs-submission (source-files, description, book+)>
<!ATTLIST bs-submission
   participant-id 	CDATA 	#REQUIRED
   run-id 	CDATA 	#REQUIRED
   task 	(book-toc) #REQUIRED
   toc-creation 	(automatic | semi-automatic) #REQUIRED
   toc-source	(book-toc | no-book-toc | full-content 
   | other) #REQUIRED
>

<!ELEMENT source-files EMPTY>

<!ATTLIST source-files
   xml 	(yes|no) #REQUIRED
   pdf 	(yes|no) #REQUIRED
>

<!ELEMENT description (#PCDATA)>

<!ELEMENT book (bookid, toc-entry+)>

<!ELEMENT bookid	(#PCDATA)>

<!ELEMENT toc-entry (toc-entry*)>

<!ATTLIST toc-entry
   title 	CDATA 	#REQUIRED
   page 	CDATA 	#REQUIRED
>

Each submission must contain the following:
@participant-id: The participant ID number of the submitting institute.
@run-id: A run ID, which must be unique across all submissions sent by one organization (please use meaningful but short names where possible).
@task: Identification of the task, which should just be "book-toc".
@toc-creation: Specification whether the ToC was constructed fully automatically ("automatic") or with some manual aid ("semi-automatic").
@toc-source: Specification of whether the ToC was built based only on the table of contents part of the book ("book-toc"), on any other part of the book excluding the ToC pages ("no-book-toc"), or on the full content of the book ("full-content"). If none of these applies, please specify, or simply use "other".
source-files: Specification of the source files used as input, i.e., the XML file (@xml="yes") and/or the PDF file (@pdf="yes").
description: A description of the approach used to generate the ToC. Please add as much detail as you can, as this would help with the comparison and analysis of the results later on.

Furthermore, a run should contain the generated table of contents for each book, conforming to the following criteria:

book: Contains the ToC information for each book.
bookid: Each book should be identified using its bookID, which is the name of the directory that contains the XML source of the book (along with the MARC metadata file).
toc-entry: Contains details of each entry of the table of contents for a given book. Entries may be nested, e.g., sections in a chapter should be nested within the ToC entry of the chapter.
@title: The title of the ToC entry (e.g., chapter title).
@page: The page counter that corresponds to the start of the section represented by the ToC entry. The page counter starts at 1 on the first page of the book (i.e., the cover page). Note that this is different from the page number that may be printed in the book itself (which may only start on the first content page and may use different formats, e.g., v, xii, 2-18, etc.).

An example submission may be as follows:
<bs-submission participant-id="25"
   run-id="ToCExtractedDirectlyFromBookToC" 
   task="book-toc" 
   toc-creation="automatic" 
   toc-source="full">

<source-files xml="yes" pdf="no" />

<description>
  Extraction applied directly to recognised ToC pages of the book. 
  The page numbers are then converted to page counters using a
  pre-built page lookup table. The ToC levels are estimated
  based on the layout indentation of a ToC entry.
</description>

<book>
   <bookid>384D10DAEA4E34A8</bookid>
   <toc-entry title="Introduction" page="7">
      <toc-entry title="What is covered?" page="8" />
      <toc-entry title="Recommended reading order" page="11" />
   </toc-entry>
   ...
</book>

<book>
   <bookid>...</bookid>
   <toc-entry title="Preface" page="6" />
   ...
</book>

...

</bs-submission>
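
The skeleton of such a run file can also be generated programmatically. The short sketch below uses Python's standard library to emit a minimal, well-formed submission; the participant ID, run ID, attribute values and output file name are placeholders, not values prescribed by the organizers.

import xml.etree.ElementTree as ET

def build_run(participant_id, run_id, books, description):
    """books: iterable of (bookid, entries) pairs, where entries are
    (title, page, children) tuples nested to arbitrary depth."""
    root = ET.Element("bs-submission", {
        "participant-id": participant_id,
        "run-id": run_id,
        "task": "book-toc",
        "toc-creation": "automatic",   # placeholder values; set according
        "toc-source": "book-toc",      # to how the run was actually produced
    })
    ET.SubElement(root, "source-files", {"xml": "yes", "pdf": "no"})
    ET.SubElement(root, "description").text = description
    for bookid, entries in books:
        book = ET.SubElement(root, "book")
        ET.SubElement(book, "bookid").text = bookid
        _add_entries(book, entries)
    return ET.ElementTree(root)

def _add_entries(parent, entries):
    for title, page, children in entries:
        node = ET.SubElement(parent, "toc-entry",
                             {"title": title, "page": str(page)})
        _add_entries(node, children)

# Toy example with a single book and two nested entries.
toc = [("Introduction", 7, [("What is covered?", 8, [])])]
run = build_run("25", "my-run", [("384D10DAEA4E34A8", toc)],
                "Describe your approach here.")
run.write("my-run.xml", encoding="utf-8")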


Submission procedure

Further instructions will be provided closer to the submission deadlines. Please note that currently there are no plans to provide online validation of submission runs, so please make sure that your runs conform to the DTD.
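
Since no online validation will be provided, it is worth checking runs locally against the DTD before submitting. The sketch below does so with lxml (an external library chosen here for illustration; any validating XML tool will do), assuming the DTD above has been saved locally as bs-submission.dtd and the run as my-run.xml.

from lxml import etree

# File names are placeholders: a local copy of the submission DTD and the run to check.
dtd = etree.DTD(open("bs-submission.dtd", "rb"))
run = etree.parse("my-run.xml")

if dtd.validate(run):
    print("Run conforms to the DTD.")
else:
    for error in dtd.error_log.filter_from_errors():
        print(error)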

Results

The annotation process allowed us to gather ToC groundtruth for 513 books.
The resulting evaluation follows. Each RunID is linked to further details on the evaluation of the run.
RunID Participant F-measure (complete entries)
MDCS Microsoft Development Center Serbia 40.75%
Nankai-run1 Nankai University 33.06%
Nankai-run4 Nankai University 33.06%
Nankai-run2 Nankai University 32.46%
Nankai-run3 Nankai University 32.43%
XRCE-run1 Xerox Research Centre Europe 20.38%
XRCE-run2 Xerox Research Centre Europe 18.07%
GREYC-run2 GREYC - University of Caen, France 8.99%
GREYC-run1 GREYC - University of Caen, France 8.03%
GREYC-run3 GREYC - University of Caen, France 3.30%

Link-based Results

The results of the link-based evaluation follow. The detailed link-based evaluation results are available as a spreadsheet.
RunID Participant F-link
MDCS Microsoft Development Center Serbia 65.1%
Nankai-run1 Nankai University 63.2%
Nankai-run4 Nankai University 63.2%
Nankai-run2 Nankai University 59.8%
Nankai-run3 Nankai University 59.8%
XRCE-run2 Xerox Research Centre Europe 58.1%
XRCE-run1 Xerox Research Centre Europe 57.6%
GREYC-run1 GREYC - University of Caen, France 50.7%
GREYC-run2 GREYC - University of Caen, France 50.7%
GREYC-run3 GREYC - University of Caen, France 24.4%

Competition Description

The evaluation process and the whole competition methodology are described in the following 2010 IJDAR paper:

Antoine Doucet, Gabriella Kazai, Bodin Dresevic, Aleksandar Uzelac, Bogdan Radakovic and Nikola Todic, "Setting up a Competition Framework for the Evaluation of Structure Extraction from OCR-ed Books" (draft), in International Journal of Document Analysis and Recognition (IJDAR), special issue on "Performance Evaluation of Document Analysis and Recognition Algorithms", 22 pages, 2010. [ BibTex ]

The 2011 results are presented in the following 2011 ICDAR paper:

Antoine Doucet, Gabriella Kazai, Jean-Luc Meunier, "ICDAR 2011 Book Structure Extraction Competition", in Proceedings of the Eleventh International Conference on Document Analysis and Recognition (ICDAR'2011), Beijing, China, September 18-21, p.1501-1505, 2011. [ BibTex ]

Please check the publications page to find all the published papers related to this competition (if your paper is missing, please contact us!).

Organizers

Antoine Doucet, University of Caen, France
Gabriella Kazai, Microsoft Research Cambridge, UK
Jean-Luc Meunier, Xerox Research Centre Europe, France

Contact - Registration

Antoine Doucet: "antoine DOT doucet AT unicaen DOT fr"