Antoine Doucet - Full Professor - Information Retrieval and Multilingual Text Processing

Antoine Doucet - Resources


Evaluation Datasets


  • DAnIEL: The DAnIEL website, built by Gaël Lejeune, is the result of his Ph.D. work on multilingual epidemic surveillance from news data. On the site, you can find sample true and false positives and negatives for some of the languages the system handles. You can also test the system on your own documents.
  • TVGuido: TV listings for Finland. The information about every programme is provided in the language in which the programme is broadcast. Language detection and title matching rely on a combination of machine learning techniques and user feedback.


  • MFS-mineSweep and MineMFS: These methods efficiently extract maximal frequent sequences from sequential data sets of virtually any size. The extraction technique is described most extensively in the paper Fast extraction of discontiguous sequences in text: a new approach based on maximal frequent sequences [ BibTex ].
    Unfortunately, the implementations of our algorithms to mine maximal frequent sequences are not currently distributed, but we are happy to discuss ways to run them on your own data set. Please get in touch.
  • Efficient computation of the Probability and Expected Document Frequency of discontiguous sequences: We developed an efficient technique to rank extracted discontiguous sequences, based on their statistical interestingness, regardless of their domain of application. To do so, we developed a method of linear time complexity to calculate the exact probability of occurrence of a discontiguous sequence of items. The technique is fully described in the journal article Probability and Expected Document Frequency of Discontinued Word Sequences, an efficient method for their exact computation [ BibTex ].
  • Google n-gram filter: Since October 2012, a second version of Google's massive n-gram data set has been available. It provides per-year n-gram statistics from OCR-ed books for several languages, together with part-of-speech (PoS) tags. However, parts of this very large data set may be irrelevant to your current experiments. To avoid storing very large data files unnecessarily, I developed a couple of tools to build tailored subcollections of the Google n-grams. Please get in touch if you need something, and let us see if they can help!
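
To illustrate what the probability of occurrence of a discontiguous sequence means, here is a minimal sketch. It is not the method from the journal article. It assumes a simplified model in which every position of a document is filled independently, and `item_probs[i]` is the assumed probability that a position holds the i-th item of the sequence. The function name and its parameters are hypothetical.

```python
def occurrence_probability(item_probs, doc_length):
    """Probability that the items occur in order, gaps allowed, in a random document.

    Assumption: each of the doc_length positions is drawn independently.
    """
    n = len(item_probs)
    # state[j] = probability that exactly the first j items have been matched so far
    state = [1.0] + [0.0] * n
    for _ in range(doc_length):
        nxt = [0.0] * (n + 1)
        for j in range(n):
            p = item_probs[j]
            nxt[j] += state[j] * (1.0 - p)   # position does not hold the awaited item
            nxt[j + 1] += state[j] * p       # position holds the awaited item
        nxt[n] += state[n]                   # a fully matched sequence stays matched
        state = nxt
    return state[n]
```

Under the same assumptions, multiplying this probability by the number of documents of that length gives an expected document frequency. The sketch runs in time proportional to the sequence length times the document length.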
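
Tailoring a subcollection of the Google n-grams can be as simple as streaming over the files and keeping only the rows of interest. The sketch below does not reproduce the tools mentioned above. It assumes the tab-separated row layout of the second-version files (n-gram, year, match count, volume count), and the function name and thresholds are hypothetical.

```python
def filter_ngram_lines(lines, first_year, last_year, min_match_count=1):
    """Yield only the rows within a year range and at or above a count threshold.

    Assumed row layout: ngram <TAB> year <TAB> match_count <TAB> volume_count
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip rows that do not match the assumed layout
        _ngram, year, match_count, _volume_count = fields
        if not (year.isdigit() and match_count.isdigit()):
            continue  # skip rows with non-numeric year or count
        if first_year <= int(year) <= last_year and int(match_count) >= min_match_count:
            yield line
```

Because the function is a generator over any iterable of lines, it can be fed a file opened with `gzip.open(path, "rt", encoding="utf-8")`, so the full file never has to be uncompressed on disk.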

Paper Dataset