Structure Extraction competition @ ICDAR

News

The results of the 2013 competition are published. For the first time, participants were not required to take part in groundtruthing, so the investment needed to participate in the Structure Extraction competition was lower than ever. The 2013 ground truth consists of 967 books, available in the training section.

Goals

The goal of the competition is to evaluate and compare automatic techniques for extracting structural information from digitized books in order to build hyperlinked tables of contents.

Main changes in 2013

Thanks to the work done by the University of Innsbruck (Digitisation and Digital Preservation) within the EU-funded project IMPACT, the groundtruth for this competition will be distributed freely.

This implies that, for the first time, participants will not be required to get involved in the groundtruthing effort.

Training Data

Since the 2013 competition was launched, the ground truth data from the 2011 competition has been released. It is a different set of exactly the same type of data. This release is meant to facilitate access to the competition and foster further research: anyone can fine-tune their system, experiment, and publish, regardless of the competition schedule. Full details (and data) are available in the training section of this website.

The previous Book Structure Extraction Competitions at ICDAR 2009 and 2011 produced training sets of 527 and 513 annotated books, respectively, which are freely available online. Each of the archives available on the training page additionally contains evaluation software (.exe).

Motivation

Current digitization and OCR technologies produce the full text of digitized books with only minimal structure information. Pages and paragraphs are usually identified and marked up in the OCR, but more sophisticated structures, such as chapters and sections, are not recognised. Such structures are, however, of great value in helping searchers and readers navigate digital books.

Task description

The task is to build a hyperlinked table of contents for each book in a sample collection of 1,000 digitized books of different genres and styles.

As input, for each book, the output of the OCR process is given as a text file in DjVu XML format. A PDF of the book is also made available.
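As an illustration only, the sketch below shows one way to read such an OCR file in Python and recover the text lines of each page. The element names used here (OBJECT for a page, LINE and WORD for its text) and the sample content are assumptions about the DjVu XML layout and should be checked against the distributed files.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written sample in the assumed shape of a DjVu XML file.
# Real files are much larger and carry more attributes.
SAMPLE = """<DjVuXML>
  <BODY>
    <OBJECT usemap="book_0001.djvu">
      <HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
        <LINE><WORD coords="100,200,300,150">CONTENTS</WORD></LINE>
      </PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT>
    </OBJECT>
    <OBJECT usemap="book_0002.djvu">
      <HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
        <LINE><WORD coords="100,200,250,150">Chapter</WORD>
              <WORD coords="260,200,300,150">I</WORD></LINE>
      </PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT>
    </OBJECT>
  </BODY>
</DjVuXML>"""


def pages_as_lines(xml_text):
    """Return one list of text lines per page, in document order."""
    root = ET.fromstring(xml_text)
    pages = []
    for obj in root.iter("OBJECT"):  # assumption: one OBJECT element per page
        lines = []
        for line in obj.iter("LINE"):
            words = [w.text.strip() for w in line.iter("WORD") if w.text]
            if words:
                lines.append(" ".join(words))
        pages.append(lines)
    return pages


pages = pages_as_lines(SAMPLE)
# Enumerating from 1 gives a page counter that starts at the first page.
for counter, lines in enumerate(pages, start=1):
    print(counter, lines)
```

Keeping the pages in a list indexed by position makes it straightforward to report page counters rather than printed page numbers later on.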

The expected output is an XML file, containing the generated hyperlinked table of contents for each book in the sample set (the exact format is described in the Submission Format section below).

The tables of contents created by participants will be compared to a manually built groundtruth, as described in the evaluation section.

Participants may submit up to 10 "runs", each run containing the tables of contents for all 1,000 books in the sample set.

Application in other tasks

Participants in the competition may also be interested in the INEX Book Track, where they may test the results of their structure extraction techniques on related tasks such as book search or active reading. For more information, please refer to the INEX website.

Research questions

We list here some example research questions that this competition makes it possible to explore.

Can tables of contents be extracted from the actual table of contents pages of the book (where available), or could they be generated more reliably from the full content of the book?
Can tables of contents be extracted only from textual information, or is page layout information necessary?
What techniques provide reliable logical page number recognition and extraction, and how can logical page numbers be mapped to physical page numbers?
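To make the last question concrete, here is a minimal sketch of one possible way to map printed (logical) page numbers to physical page counters: estimate the most frequent offset between the two over the pages where a number was detected. The helper names and the sample data are hypothetical, and the single-offset assumption breaks down for books with unnumbered plates or restarted numbering.

```python
from collections import Counter

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(text):
    """Convert a roman numeral such as 'xii' to 12; return None otherwise."""
    text = text.lower()
    if not text or any(ch not in ROMAN for ch in text):
        return None
    total = 0
    for ch, nxt in zip(text, text[1:] + " "):
        value = ROMAN[ch]
        # A smaller value before a larger one is subtracted (iv = 4).
        total += -value if nxt != " " and ROMAN[nxt] > value else value
    return total


def parse_arabic(text):
    """Return the integer for an ASCII digit string, else None."""
    return int(text) if text.isascii() and text.isdigit() else None


def estimate_offset(printed_by_counter, parse):
    """Most common (page counter - printed number) among parsable pages."""
    offsets = Counter()
    for counter, printed in printed_by_counter.items():
        number = parse(printed)
        if number is not None:
            offsets[counter - number] += 1
    return offsets.most_common(1)[0][0] if offsets else None


# Hypothetical detections: page counter -> printed number found by OCR.
# Counter 12 carries an OCR error ("14" instead of "4").
detected = {5: "v", 6: "vi", 9: "1", 10: "2", 11: "3", 12: "14", 13: "5"}

body_offset = estimate_offset(detected, parse_arabic)
front_offset = estimate_offset(detected, roman_to_int)
print(body_offset, front_offset)
```

Voting on the offset makes the mapping tolerant of isolated OCR errors; front matter and body are handled separately because they usually follow different numbering schemes.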

Evaluation

The tables of contents created by participants will be compared to a manually built groundtruth and will be evaluated using recall/precision-like measures at different structural levels (i.e., different depths in the table of contents).

For consistency with previous years, the official metrics will be those used in previous rounds of the competition, complemented by the XRCE measures, which evaluate submissions link-wise rather than title-wise. Full details on those metrics, and a discussion of them, can be found in the IJDAR 2010 paper listed in the Competition Description section below.
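For readers unfamiliar with such measures, the following sketch computes precision, recall and F-measure for a single book. It is a simplified illustration and not the official evaluation: the matching rule (exact page, case- and whitespace-insensitive title) and the `use_titles` switch are assumptions made for this example, whereas the official definitions are those of the IJDAR paper.

```python
def normalise(title):
    """Lower-case a title and collapse runs of whitespace."""
    return " ".join(title.lower().split())


def prf(submitted, groundtruth, use_titles=True):
    """Precision, recall and F-measure for one book.

    Both arguments are lists of (title, page) tuples. With
    use_titles=False only the target pages are compared, a rough
    stand-in for a link-wise view.
    """
    def key(entry):
        title, page = entry
        return (normalise(title), page) if use_titles else page

    sub = {key(e) for e in submitted}
    gt = {key(e) for e in groundtruth}
    correct = len(sub & gt)
    precision = correct / len(sub) if sub else 0.0
    recall = correct / len(gt) if gt else 0.0
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f


# Hypothetical ground truth and submission for one book.
groundtruth = [("Preface", 6), ("Introduction", 7),
               ("What is covered?", 8), ("Recommended reading order", 11)]
submitted = [("introduction", 7), ("What is  covered?", 8), ("Index", 300)]

precision, recall, f_measure = prf(submitted, groundtruth)
print(round(precision, 3), round(recall, 3), round(f_measure, 3))
```

In this toy case two of the three submitted entries match two of the four groundtruth entries, so precision is 2/3, recall is 1/2 and the F-measure is 4/7.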

Requirements

Participants are expected to submit tables of contents for the evaluation set, following the format described in the Submission format section below.
Participants are further expected to provide a summary of their approach, to be used to write the competition overview.
The winning system will be asked to generate ToCs for the whole INEX Book Track corpus of 50,000 books.

Important Dates

January 21	Registration and data distribution starts
April 9	Runs and run descriptions due
June 15	Submission deadline for CLEF working notes [optional]
August 25-28	Result announcement and competition report presentation at ICDAR 2013 in Washington [participation and attendance are welcome but not required]
September 23-26	Presentation of selected working notes in CLEF 2013 in Valencia, Spain [participation and attendance are welcome but not required]

Submission format

Submissions for the Structure Extraction task should conform to the following DTD:

<!ELEMENT bs-submission (source-files, description, book+)>
<!ATTLIST bs-submission
   participant-id 	CDATA 	#REQUIRED
   run-id 	CDATA 	#REQUIRED
   task 	(book-toc) #REQUIRED
   toc-creation 	(automatic | semi-automatic) #REQUIRED
   toc-source	(book-toc | no-book-toc | full-content 
   | other) #REQUIRED
>

<!ELEMENT source-files EMPTY>

<!ATTLIST source-files
   xml 	(yes|no) #REQUIRED
   pdf 	(yes|no) #REQUIRED
>

<!ELEMENT description (#PCDATA)>

<!ELEMENT book (bookid, toc-entry+)>

<!ELEMENT bookid	(#PCDATA)>

<!ELEMENT toc-entry (toc-entry*)>

<!ATTLIST toc-entry
   title 	CDATA 	#REQUIRED
   page 	CDATA 	#REQUIRED
>

Each submission must contain the following:

@participant-id:	The Participant ID number of the submitting institute.
@run-id:	A run ID, which must be unique across all submissions sent from one organization. Please use meaningful but short names if possible.
@task:	Identification of the task, which should just be "book-toc".
@toc-creation:	Specification whether the ToC was constructed fully automatically ("automatic") or with some manual aid ("semi-automatic").
@toc-source:	Specification of whether the ToC was built based only on the table of contents part of the book ("book-toc"), any other part of the book excluding the ToC pages ("no-book-toc"), or based on the full content of the book ("full-content"). If none of these applies, please use "other" and explain in the description.
source-files:	Specification of the source files used as input, i.e., the XML file (@xml="yes") and/or the PDF file (@pdf="yes").
description:	A description of the approach used to generate the ToC. Please add as much detail as you can, as this would help with the comparison and analysis of the results later on.
Furthermore, a run should contain the table of contents of each book, conforming to the following criteria:
book:	Contains the ToC information for each book.
bookid:	Each book should be identified using its bookID, which is the name of the directory that contains the XML source of the book (along with the MARC metadata file).
toc-entry:	Contains details of each entry of the table of contents for a given book. Entries may be nested, e.g., sections in a chapter should be nested within the ToC entry of the chapter.
@title:	The title of the ToC entry (e.g., chapter title).
@page:	The page counter that corresponds to the start of the section represented by the ToC entry. The page counter starts with 1 on the first page of the book (i.e., the cover page). Note that this is different from the page number that may be printed in the book itself (which may only start on the first content page and may use different formats, e.g., v, xii, 2-18, etc.).

An example submission may be as follows:

<bs-submission participant-id="25"
   run-id="ToCExtractedDirectlyFromBookToC" 
   task="book-toc" 
   toc-creation="automatic" 
   toc-source="book-toc">

<source-files xml="yes" pdf="no" />

<description>
  Extraction applied directly to recognised ToC pages of the book. 
  The page numbers are then converted to page counters using a
  pre-built page lookup table. The ToC levels are estimated
  based on the layout indentation of a ToC entry.
</description>

<book>
   <bookid>384D10DAEA4E34A8</bookid>
   <toc-entry title="Introduction" page="7">
      <toc-entry title="What is covered?" page="8" />
      <toc-entry title="Recommended reading order" page="11" />
   </toc-entry>
   ...
</book>

<book>
   <bookid>...</bookid>
   <toc-entry title="Preface" page="6" />
   ...
</book>

...

</bs-submission>
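A file of this shape can be produced programmatically. The sketch below is one possible way to do so with Python's standard library. The function names, the nested-tuple representation of ToC entries and the run ID are hypothetical choices made for this example.

```python
import xml.etree.ElementTree as ET


def add_entries(parent, entries):
    """Append nested toc-entry elements.

    entries is a list of (title, page, children) tuples, where children
    has the same shape and may be empty.
    """
    for title, page, children in entries:
        node = ET.SubElement(parent, "toc-entry", title=title, page=str(page))
        add_entries(node, children)


def build_submission(participant_id, run_id, description, books):
    """Build a bs-submission element; books maps bookid -> entries."""
    root = ET.Element("bs-submission", {
        "participant-id": participant_id,
        "run-id": run_id,
        "task": "book-toc",
        "toc-creation": "automatic",
        "toc-source": "book-toc",
    })
    ET.SubElement(root, "source-files", xml="yes", pdf="no")
    ET.SubElement(root, "description").text = description
    for book_id, entries in books.items():
        book = ET.SubElement(root, "book")
        ET.SubElement(book, "bookid").text = book_id
        add_entries(book, entries)
    return root


books = {
    "384D10DAEA4E34A8": [
        ("Introduction", 7, [
            ("What is covered?", 8, []),
            ("Recommended reading order", 11, []),
        ]),
    ],
}
root = build_submission("25", "toc-pages-v1",
                        "Extraction applied to recognised ToC pages.", books)
print(ET.tostring(root, encoding="unicode"))
```

Building the tree with an XML library rather than by string concatenation ensures that characters such as quotes and ampersands in titles are escaped correctly.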

Submission procedure

Further instructions will be provided closer to the submission deadlines. Please note that currently there are no plans to provide online validation of submission runs, so please make sure that your runs conform to the DTD.
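Since no online validation is planned, a quick local check may help. The sketch below is not a full DTD validation (the Python standard library has no validating parser; a third-party library such as lxml can validate against the DTD itself). It only tests a few of the constraints stated above, and the page-counter check goes slightly beyond the DTD by following the description of @page. The function name and the sample runs are hypothetical.

```python
import xml.etree.ElementTree as ET

ALLOWED = {
    "task": {"book-toc"},
    "toc-creation": {"automatic", "semi-automatic"},
    "toc-source": {"book-toc", "no-book-toc", "full-content", "other"},
}


def check_submission(xml_text):
    """Return a list of problems; an empty list means none was detected."""
    root = ET.fromstring(xml_text)
    if root.tag != "bs-submission":
        return ["root element is not bs-submission"]
    problems = []
    for name in ("participant-id", "run-id"):
        if name not in root.attrib:
            problems.append(f"missing attribute {name}")
    for name, values in ALLOWED.items():
        if root.get(name) not in values:
            problems.append(f"{name}={root.get(name)!r} is not allowed")
    children = [child.tag for child in root]
    if children[:2] != ["source-files", "description"] \
            or set(children[2:]) != {"book"}:
        problems.append("expected source-files, description, then book+")
    for index, book in enumerate(root.findall("book"), start=1):
        tags = [child.tag for child in book]
        if tags[:1] != ["bookid"] or set(tags[1:]) != {"toc-entry"}:
            problems.append(f"book #{index}: expected bookid, then toc-entry+")
        for entry in book.iter("toc-entry"):  # includes nested entries
            page = entry.get("page", "")
            if "title" not in entry.attrib or not page.isdigit() \
                    or int(page) < 1:
                problems.append(f"book #{index}: bad title or page counter")
                break
    return problems


GOOD = """<bs-submission participant-id="25" run-id="r1" task="book-toc"
   toc-creation="automatic" toc-source="book-toc">
  <source-files xml="yes" pdf="no" />
  <description>demo</description>
  <book><bookid>384D10DAEA4E34A8</bookid>
    <toc-entry title="Introduction" page="7" /></book>
</bs-submission>"""

# Two deliberate mistakes: an unknown toc-source value and a missing bookid.
BAD = GOOD.replace('toc-source="book-toc"', 'toc-source="full"') \
          .replace("<bookid>384D10DAEA4E34A8</bookid>", "")

print(check_submission(GOOD))
print(check_submission(BAD))
```

Returning a list of messages rather than stopping at the first error lets a participant fix all detected problems in one pass.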

Results

The annotation process produced the ground truth ToC for 967 books.
The resulting title-based evaluation follows. Each RunID is linked to further details on the evaluation of the run.

RunID	Participant	F-measure (complete entries)
MDCS	Microsoft Development Center Serbia	43.61%
Nankai	Nankai University	35.41%
Innsbruck	University of Innsbruck	31.34%
Würzburg	University of Würzburg	19.61%
EPITA	EPITA Research and Development Laboratory	14.96%
GREYC-run-D	GREYC - University of Caen, France	8.82%
GREYC-run-C	GREYC - University of Caen, France	7.91%
GREYC-run-A	GREYC - University of Caen, France	6.21%
GREYC-run-E	GREYC - University of Caen, France	4.71%
GREYC-run-B	GREYC - University of Caen, France	3.79%

Link-based Results

The results of the link-based evaluation are found in the following table.

RunID	Participant	F-link
Innsbruck	University of Innsbruck	67.2%
MDCS	Microsoft Development Center Serbia	66.6%
Nankai	Nankai University	62.4%
GREYC-run-D	GREYC - University of Caen, France	45.0%
Würzburg	University of Würzburg	44.7%
GREYC-run-C	GREYC - University of Caen, France	41.8%
GREYC-run-A	GREYC - University of Caen, France	38.7%
EPITA	EPITA Research and Development Laboratory	35.0%
GREYC-run-B	GREYC - University of Caen, France	23.9%
GREYC-run-E	GREYC - University of Caen, France	23.9%

Competition Description

The evaluation process and the whole competition methodology are described in the following 2010 IJDAR paper:

Antoine Doucet, Gabriella Kazai, Bodin Dresevic, Aleksandar Uzelac, Bogdan Radakovic and Nikola Todic, "Setting up a Competition Framework for the Evaluation of Structure Extraction from OCR-ed Books" (draft), in International Journal of Document Analysis and Recognition (IJDAR), special issue on "Performance Evaluation of Document Analysis and Recognition Algorithms", 22 pages, 2010. [ BibTex ]

The 2013 results are presented in the following 2013 ICDAR paper:

Antoine Doucet, Gabriella Kazai, Sebastian Colutto, Günter Mühlberger, "Overview of the ICDAR 2013 Competition on Book Structure Extraction", in Proceedings of the Twelfth International Conference on Document Analysis and Recognition (ICDAR'2013), Washington DC, USA, August 25-28, 6 pages, 2013. [ BibTex ]

Please check the publications page to find all the published papers related to this competition (if your paper is missing, please contact us)!

Organizers

Antoine Doucet, University of Caen, France
Gabriella Kazai, Microsoft Research Cambridge, UK
Günter Mühlberger, University of Innsbruck, Austria

Registration and Contact

To register, please:

sign up for the Social Book Search track of INEX, and
send your INEX user ID to Antoine Doucet : "antoine DOT doucet AT unicaen DOT fr"

We will promptly send you a password to access and download the test set.
