Structure Extraction competition @ ICDAR 2009

Training Data

Since the 2011 competition, the ground truth data from the 2009 competition has been released. It is a separate set of exactly the same type of data. This release is meant to facilitate access to the competition and to foster further research: anyone can fine-tune their system, experiment, and publish, regardless of the competition schedule. Full details (and data) are available in the training section of this website.

Goals

The goal of this task is to test and compare automatic techniques for deriving structural information from digitized books in order to build a hyperlinked table of contents that could then be used to navigate inside the books.

Motivation

Current digitization and OCR technologies typically produce the full text of digitized books with only minimal structure information. Pages and paragraphs are usually identified and marked up in the OCR output, but more sophisticated structures, such as chapters and sections, are currently not recognised. In order to enable systems to provide users with richer browsing experiences, such additional structure needs to be made available, for example as XML markup embedded in the full text of the digitized books.

Task description

The task is to build a hyperlinked table of contents for each book in a sample collection of 1,000 digitized books of different genres and styles.

As input, the output of the OCR process is provided for each book as a text file in DjVu XML format. In addition, a PDF of the book is also made available.
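For illustration only, the sketch below shows one way to read the per-page text of a book from its DjVu XML file with the Python standard library. It assumes the common DjVu hidden-text layout (one OBJECT element per page, wrapping LINE and WORD elements) and a placeholder file path; it is not part of the task infrastructure, so check the actual files before relying on it.

import xml.etree.ElementTree as ET

def iter_pages(djvu_xml_path):
    """Yield (page_counter, list_of_text_lines) for each page of the book."""
    page_counter = 0
    # iterparse keeps memory usage low for books with hundreds of pages
    for _, element in ET.iterparse(djvu_xml_path, events=("end",)):
        if element.tag != "OBJECT":
            continue
        page_counter += 1
        lines = []
        for line in element.iter("LINE"):
            words = [word.text or "" for word in line.iter("WORD")]
            lines.append(" ".join(words))
        yield page_counter, lines
        element.clear()  # discard the processed page

if __name__ == "__main__":
    # "384D10DAEA4E34A8/djvu.xml" is a placeholder path, not an actual file name
    for page, lines in iter_pages("384D10DAEA4E34A8/djvu.xml"):
        print(page, lines[:2])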

The expected output is an XML file, containing the generated hyperlinked table of contents for each book in the sample set (the exact format is described in the Submission Format section below).

The table of contents created by participants will be compared to a manually built groundtruth, as described in the evaluation section.

Participants may submit up to 10 "runs", each run containing the table of contents for all 1,000 books in the sample set.

Application in other tasks

Participants of the competition may also be interested in the INEX 2009 Book Track, where they can test the results of their structure extraction techniques on related tasks such as book search or active reading. For more information, please refer to the INEX website.

Research questions

We list here some example research questions that participants may be interested in exploring and whose investigation this competition facilitates.
  • Can a table of contents be extracted from the actual table of contents pages of the book (where available), or can it be generated more reliably from the full content of the book?
  • Can a table of contents be extracted from textual information alone, or is page layout information necessary?
  • What techniques provide reliable logical page number recognition and extraction, and how can logical page numbers be mapped to physical page numbers? (A naive approach to this mapping is sketched after this list.)
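The following sketch illustrates one naive answer to the last question: scan the first and last text line of each physical page for a bare printed number, and build a lookup table from printed (logical) page numbers to page counters. It is an illustration only; a robust system would also handle roman numerals, OCR noise, and unnumbered pages, and iter_pages refers to the hypothetical helper sketched earlier.

import re

ARABIC_NUMBER = re.compile(r"^\d{1,4}$")

def build_page_lookup(pages):
    """pages: iterable of (page_counter, list_of_text_lines), e.g. from iter_pages().
    Returns a dict mapping printed page numbers to physical page counters."""
    lookup = {}
    for counter, lines in pages:
        if not lines:
            continue
        # printed page numbers typically sit alone on the first or last line
        for candidate in (lines[0].strip(), lines[-1].strip()):
            if ARABIC_NUMBER.match(candidate):
                lookup.setdefault(int(candidate), counter)
                break
    return lookup

A common refinement is a majority vote on the offset between printed numbers and page counters, which lets the mapping be extrapolated to pages where the OCR missed the number.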

Evaluation

The tables of contents created by participants will be compared to a manually built groundtruth and evaluated using recall/precision-like measures at different structural levels (i.e., different depths in the table of contents).

The groundtruth will be generated with the aid of the participating groups, using an annotation tool specifically designed for this purpose. The tool takes a generated table of contents as input and allows annotators to manually correct any mistakes. As a starting point, we will use the tables of contents generated by participants. The created groundtruth will be made available to all participants for use in future evaluations. Participants will be expected to contribute annotations for a minimum of 50-100 books (this number may be reduced if more groups participate).

The evaluation metrics are described here: [pdf]
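As a rough, unofficial point of reference, the sketch below scores a run with a simple recall/precision-like measure in which an entry only counts if its (normalised) title, page counter, and depth all match a groundtruth entry exactly. The official metrics are those defined in the document linked above; this is only a simplified stand-in.

def f_measure(submitted, groundtruth):
    """Both arguments are sets of (normalised_title, page_counter, depth) tuples."""
    if not submitted or not groundtruth:
        return 0.0
    matched = len(submitted & groundtruth)  # entries that are fully correct
    precision = matched / len(submitted)
    recall = matched / len(groundtruth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)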

Requirements

  • Participants are expected to submit tables of contents for the evaluation set, following the format described below (see Section Submission format).
  • Participants are expected to contribute to the creation of the groundtruth data (see Section Evaluation).
  • A paper describing the approach taken should be submitted to INEX (you are also encouraged to submit to ICDAR 2009). [Please note that attendance is not required at either ICDAR or INEX.]

Important Dates

May 8         Registration deadline
June 24       Submissions due
July 3        Start of the groundtruth annotation
July 21       Groundtruth annotation due
July 26-29    Result announcement and competition report presentation at ICDAR 2009 [participation and attendance are welcome but not required]
Nov 23        Papers due for the INEX 2009 workshop [attendance is welcome but not required]

Submission format

Submissions for the Structure Extraction task should conform to the following DTD:

<!ELEMENT bs-submission (source-files, description, book+)>

<!ATTLIST bs-submission
   participant-id 	CDATA 	#REQUIRED
   run-id 	CDATA 	#REQUIRED
   task 	(book-toc) #REQUIRED
   toc-creation 	(automatic | semi-automatic) #REQUIRED
   toc-source	(book-toc | no-book-toc | full-content 
   | other) #REQUIRED
>

<!ELEMENT source-files EMPTY>

<!ATTLIST source-files
   xml 	(yes|no) #REQUIRED
   pdf 	(yes|no) #REQUIRED
>

<!ELEMENT description (#PCDATA)>

<!ELEMENT book (bookid, toc-entry+)>

<!ELEMENT bookid	(#PCDATA)>

<!ELEMENT toc-entry (toc-entry*)>

<!ATTLIST toc-entry
	title 	CDATA 	#REQUIRED
	page 	CDATA 	#REQUIRED
>

Each submission must contain the following:
@participant-id: The Participant ID number of the submitting institute.
@run-id: A run ID (which must be unique across all submissions sent from one organization; please use meaningful but short names if possible).
@task: Identification of the task, which should simply be "book-toc".
@toc-creation: Specification of whether the ToC was constructed fully automatically ("automatic") or with some manual aid ("semi-automatic").
@toc-source: Specification of whether the ToC was built based only on the table of contents part of the book ("book-toc"), on any other part of the book excluding the ToC pages ("no-book-toc"), or on the full content of the book ("full-content"). If none of these applies, please specify in the description or simply use "other".
source-files: Specification of the source files used as input, i.e., the XML file (@xml="yes") and/or the PDF file (@pdf="yes").
description: A description of the approach used to generate the ToC. Please add as much detail as you can, as this will help with the comparison and analysis of the results later on.

Furthermore, a run should contain the generated table of contents for each book, conforming to the following criteria:

book: Contains the ToC information for each book.
bookid: Each book should be identified using its bookID, which is the name of the directory that contains the XML source of the book (along with the MARC metadata file).
toc-entry: Contains details of each entry of the table of contents for a given book. Entries may be nested, e.g., sections in a chapter should be nested within the ToC entry of the chapter.
@title: The title of the ToC entry (e.g., chapter title).
@page: The page counter that corresponds to the start of the section represented by the ToC entry. The page counter starts with 1 on the first page of the book (i.e., the cover page). Note that this is different from the page number that may be printed in the book itself (which may only start on the first content page and may use different formats, e.g., v, xii, 2-18, etc.).

An example submission may be as follows:
<bs-submission participant-id="25"
  run-id="ToCExtractedDirectlyFromBookToC" 
  task="book-toc" 
  toc-creation="automatic" 
  toc-source="full">

<source-files xml="yes" pdf="no" />

<description>
 Extraction applied directly to recognised ToC pages of the 
 book. The page numbers are then converted to page counters
 using a pre-built page lookup table. The ToC levels are
 estimated based on the layout indentation of a ToC entry.
</description>

<book>
  <bookid>384D10DAEA4E34A8</bookid>
  <toc-entry title="Introduction" page="7">
    <toc-entry title="What is covered?" page="8" />
    <toc-entry title="Recommended reading order" page="11" />
  </toc-entry>
  ...
</book>

<book>
  <bookid>...</bookid>
  <toc-entry title="Preface" page="6" />
  ...
</book>

...

</bs-submission>
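For illustration, a run in the above format can be serialised with the Python standard library as sketched below. The function names, placeholder description, and example values are not part of the task definition, and the resulting file should still be checked against the DTD.

import xml.etree.ElementTree as ET

def make_submission(participant_id, run_id, books, toc_source="book-toc"):
    """books: list of (bookid, entries); each entry is (title, page, child_entries)."""
    root = ET.Element("bs-submission", {
        "participant-id": participant_id, "run-id": run_id, "task": "book-toc",
        "toc-creation": "automatic", "toc-source": toc_source})
    ET.SubElement(root, "source-files", {"xml": "yes", "pdf": "no"})
    ET.SubElement(root, "description").text = "Describe your approach here."
    for bookid, entries in books:
        book = ET.SubElement(root, "book")
        ET.SubElement(book, "bookid").text = bookid
        add_entries(book, entries)
    return ET.ElementTree(root)

def add_entries(parent, entries):
    # nest child entries inside their parent toc-entry, as required by the format
    for title, page, children in entries:
        node = ET.SubElement(parent, "toc-entry", {"title": title, "page": str(page)})
        add_entries(node, children)

tree = make_submission("25", "example-run",
                       [("384D10DAEA4E34A8", [("Introduction", 7,
                                               [("What is covered?", 8, [])])])])
tree.write("submission.xml", encoding="utf-8", xml_declaration=True)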

Submission procedure

Further instructions will be provided closer to the submission deadlines. Please note that currently there are no plans to provide online validation of submission runs, so please make sure that your runs conform to the DTD.
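Runs can be checked locally, for example with the third-party lxml library as sketched below. The sketch assumes the DTD above has been saved to a file; both file names are placeholders.

from lxml import etree

# placeholder file names: the DTD above saved locally, and a run produced as shown earlier
with open("bs-submission.dtd", "rb") as dtd_file:
    dtd = etree.DTD(dtd_file)
doc = etree.parse("submission.xml")
if dtd.validate(doc):
    print("submission.xml conforms to the DTD")
else:
    print(dtd.error_log.filter_from_errors())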

Results

The annotation process allowed us to gather ToC groundtruth for 527 books.
The results of the evaluation are given below. Each RunID is linked to further details on the evaluation of that run.
RunID        Participant                           F-measure (complete entries)
MDCS         Microsoft Development Center Serbia   41.51%
XRCE-run2    Xerox Research Centre Europe          28.47%
XRCE-run1    Xerox Research Centre Europe          27.72%
XRCE-run3    Xerox Research Centre Europe          27.33%
Noopsis      Noopsis inc.                           8.32%
GREYC-run1   GREYC - University of Caen, France     0.08%
GREYC-run2   GREYC - University of Caen, France     0.08%
GREYC-run3   GREYC - University of Caen, France     0.08%

An additional measure was suggested by XRCE. Full details and corresponding results are presented here.

Competition Description

The evaluation process and the whole competition methodology are described in the following 2010 IJDAR paper:

Antoine Doucet, Gabriella Kazai, Bodin Dresevic, Aleksandar Uzelac, Bogdan Radakovic and Nikola Todic, "Setting up a Competition Framework for the Evaluation of Structure Extraction from OCR-ed Books" (draft), in International Journal of Document Analysis and Recognition (IJDAR), special issue on "Performance Evaluation of Document Analysis and Recognition Algorithms", 22 pages, 2010. [ BibTex ]

Organizers

Antoine Doucet, University of Caen, France
Gabriella Kazai, Microsoft Research Cambridge, UK
Document Layout Team at the Microsoft Development Center, Serbia

Contact - Registration

Antoine Doucet: "antoine DOT doucet AT unicaen DOT fr"