Antoine Doucet - Full Professor - Information Retrieval and Multilingual Text Processing

Resources

Evaluation Datasets

Book Structure Extraction: The datasets and evaluation tools developed during the previous rounds of the competition are available on the competition's dedicated training page.
(Social) Book Search: Similarly, the datasets and evaluation tools developed within the Book Search track are available to INEX participants through the dedicated page of the INEX Website.

Demonstrations

DAnIEL: The DAnIEL Website, built by Gaël Lejeune, is the result of his Ph.D. work on multilingual epidemic surveillance from news data. On the site, you can find a sample of true and false positives and negatives for some of the languages the system handles. You can also test the system with your own documents.
TVGuido: TV listing for Finland. The information about every programme is provided in the language in which the programme is broadcast. Language detection and title matching are the result of a combination of machine learning techniques and user feedback.

Tools

MFS-mineSweep and MineMFS: These methods make it possible to efficiently extract maximal frequent sequences from sequential data sets of virtually any size. The extraction technique is most extensively described in the paper Fast extraction of discontiguous sequences in text: a new approach based on maximal frequent sequences [ BibTex ].
Unfortunately, the implementations of our algorithms to mine maximal frequent sequences are not currently distributed, but we are happy to discuss ways to run them on your own data set. Please get in touch.
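For readers unfamiliar with the notion, the following is a minimal brute-force sketch of what a maximal frequent sequence is. It is not MFS-mineSweep or MineMFS. It assumes that a sequence is frequent when it occurs, in order and with gaps allowed, in at least `min_support` documents, and that it is maximal when no other frequent sequence contains it. The function names and the `max_len` cap are hypothetical, and the enumeration is exponential, so this is only usable on toy data.

```python
from itertools import combinations


def is_subsequence(short, long):
    """Return True if `short` occurs in `long` in order, gaps allowed."""
    remaining = iter(long)
    # `item in remaining` consumes the iterator up to the first match,
    # so successive items must be found in left-to-right order.
    return all(item in remaining for item in short)


def maximal_frequent_sequences(docs, min_support, max_len=4):
    """Brute-force illustration only (assumption: no gap constraint).

    Sequences longer than `max_len` are never enumerated, so maximality
    is relative to that cap.
    """
    doc_freq = {}
    for doc in docs:
        # Distinct subsequences of this document, counted once per document.
        seen = set()
        for n in range(1, min(max_len, len(doc)) + 1):
            for positions in combinations(range(len(doc)), n):
                seen.add(tuple(doc[i] for i in positions))
        for seq in seen:
            doc_freq[seq] = doc_freq.get(seq, 0) + 1

    frequent = [seq for seq, count in doc_freq.items() if count >= min_support]
    # Keep only sequences that are not contained in another frequent sequence.
    maximal = [
        seq for seq in frequent
        if not any(seq != other and is_subsequence(seq, other) for other in frequent)
    ]
    return sorted(maximal)
```

With the three toy documents `a b c d`, `a x b d` and `b a d` and a support threshold of 2, the only maximal frequent sequence is `a b d`: every other frequent sequence (such as `a d` or `b`) is contained in it.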
Efficient computation of the Probability and Expected Document Frequency of discontiguous sequences: We developed an efficient technique to rank extracted discontiguous sequences, based on their statistical interestingness, regardless of their domain of application. To do so, we developed a method of linear time complexity to calculate the exact probability of occurrence of a discontiguous sequence of items. The technique is fully described in the journal article Probability and Expected Document Frequency of Discontinued Word Sequences, an efficient method for their exact computation [ BibTex ].
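To give an idea of the quantity being computed, here is a small sketch under a simplifying assumption that is ours for this example: each position of a document is filled independently, and the i-th item of the sequence appears at any given position with a fixed probability `item_probs[i]`. The probability that the items occur in order, with arbitrary gaps, is then obtained by tracking how many items have been matched so far. This is a plain dynamic programme whose cost is the document length times the sequence length. It is not the method of the article, and the function names are hypothetical.

```python
def occurrence_probability(item_probs, doc_length):
    """Probability that a discontiguous sequence occurs in a document.

    Assumption (illustrative only): positions are independent and the i-th
    item of the sequence shows up at a position with probability item_probs[i].
    state[j] is the probability that exactly the first j items have been
    matched so far, scanning the document from left to right.
    """
    k = len(item_probs)
    if k == 0:
        return 1.0  # the empty sequence always occurs
    state = [0.0] * (k + 1)
    state[0] = 1.0
    for _ in range(doc_length):
        # Update from the last state down so state[j - 1] is still the old value.
        for j in range(k, 0, -1):
            stay = 1.0 if j == k else 1.0 - item_probs[j]
            state[j] = state[j] * stay + state[j - 1] * item_probs[j - 1]
        state[0] *= 1.0 - item_probs[0]
    return state[k]


def expected_document_frequency(item_probs, doc_lengths):
    """Expected number of documents containing the sequence."""
    return sum(occurrence_probability(item_probs, n) for n in doc_lengths)
```

For example, with a two-word vocabulary where each word has probability 0.5, the sequence made of the first word followed later by the second occurs in a three-word document with probability 0.5: four of the eight possible documents contain it.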
Google n-gram filter: Since October 2012, a second version of Google's massive n-gram data set has been available. Per-year n-gram statistics from OCR-ed books are provided for several languages, together with part-of-speech (PoS) tags. However, much of this very large data set may be irrelevant to your current experiments. To avoid the unnecessary storage of very large data files, I developed a couple of tools to tailor adequate subcollections of the Google n-grams. Please get in touch if you need something, and let us see if they can help!
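As an illustration of the kind of tailoring meant here, the sketch below keeps only the rows of an n-gram file that fall within a range of years and reach a minimum count. It is not one of the tools mentioned above. It assumes the tab-separated layout of the second version of the data set, with one `ngram`, `year`, `match_count`, `volume_count` record per line, and the function name and parameters are hypothetical.

```python
def filter_ngrams(lines, first_year, last_year, min_match_count=1):
    """Yield (ngram, year, match_count, volume_count) for the rows to keep.

    Assumption: each line is "ngram<TAB>year<TAB>match_count<TAB>volume_count".
    Lines that do not have exactly four fields are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue
        ngram, year, match_count, volume_count = fields
        year, match_count = int(year), int(match_count)
        if first_year <= year <= last_year and match_count >= min_match_count:
            yield ngram, year, match_count, int(volume_count)
```

Because the function is a generator over lines, it can be fed directly from a compressed file opened with `gzip.open(path, "rt", encoding="utf-8")`, so the full file never needs to be unpacked or held in memory.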

Paper Dataset

Building engagement for MOOC students: Data set for our paper "Building engagement for MOOC students: introducing support for time management on online learning platforms", published at the Workshop on Web-based Education Technologies of the 23rd International World Wide Web Conference (WWW'14). The data collected during the survey can be downloaded as a rar archive.
