The SHARES project corresponds to some aspects of the Topic Detection and Tracking (TDT) programme administered by the U.S. National Institute of Standards and Technology (NIST), which focuses on establishing objective, standard methods for evaluating system performance. In this programme, a set of exemplar articles is provided to define a topic, and the tracking system must identify all articles from the newsfeed that are on the same topic. For evaluation purposes, TDT uses news corpora containing many tens of thousands of articles, along with a corresponding set of manually compiled relevance results for each article against the pre-defined topics. The differences between our approach and the TDT approach are outlined here.

The corpus used within the SHARES project is the TDT2 corpus. Our Web GUI demonstrates our document-similarity approach using smaller sub-corpora derived from the main corpus, to allow faster processing for on-line use.

Our test corpora (stemmed and unstemmed) each contain 33 articles: 3 articles on each of 11 topics (1259 sentences, 27948 tokens, 5999 types). A list of the topics and articles used, together with TDT article IDs and article numbers in the test corpora, can be found here.
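
As an illustration of what the sentence, token and type counts above measure, the following is a minimal sketch in Python. It is not the tooling used to produce the SHARES figures: the function name corpus_stats, the naive sentence splitter and the lower-cased alphanumeric tokenisation are all assumptions made for the example, and a different tokeniser or stemmer would give different counts.

```python
import re


def corpus_stats(articles):
    """Return (sentences, tokens, types) for a list of article strings.

    Hypothetical helper for illustration only; the splitting rules
    below are assumptions, not the SHARES project's own method.
    """
    sentence_count = 0
    tokens = []
    for text in articles:
        # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        sentence_count += len(sentences)
        # Tokens: lower-cased runs of letters, digits and apostrophes.
        tokens.extend(re.findall(r"[a-z0-9']+", text.lower()))
    # Types are the distinct token forms.
    return sentence_count, len(tokens), len(set(tokens))


sample = [
    "The market fell. The market rose again.",
    "Talks resumed today.",
]
print(corpus_stats(sample))  # (3, 10, 8)
```

Here "tokens" counts every word occurrence and "types" counts distinct word forms, which is why the type count for a corpus is much smaller than its token count.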

