Last modified August 22, 2008 by Andrew Kehoe

APRIL

Analysis and Prediction of Innovation in the Lexicon

This work is concerned with the development of a system for the semi-automatic classification of rare words in journalistic text, over a period of years, with a view to extrapolating from the resultant analysis and predicting some aspects of the future structure of the language.

As with other Unit research, APRIL findings serve a dual purpose: a linguistic role in informing descriptions of the nature of rare words and their patterns of productivity, and an IT role in helping to refine indexes to large textual database systems.

The rare words of the lexicon constitute 50% of the types (different words) in any database, yet they are routinely ignored in the management of databases. Individually they are statistically insignificant, and the received wisdom is that they are a miscellany of typographical errors and ephemera that will not yield much informational benefit as retrieval mechanisms in database search.

However, this is not so. These singletons, or hapax legomena, which trickle into and out of the language, are an intrinsic part of its fabric. They form classes at and below the level of the word. In terms of word formation, for instance, they are primarily compounds and derivations. In terms of derivation, there is a clear ranking in the morphemes and classes of morphemes chosen. Grammatical trends are apparent.
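
To make the terms concrete, the following is a minimal sketch of how types and hapax legomena might be counted in a sample of text. It is an illustration only, not the APRIL tools: the function name hapax_share, the naive tokenisation rule and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter


def hapax_share(text: str) -> tuple[list[str], float]:
    """Return the hapax legomena in `text` and their share of all types.

    Hypothetical helper for illustration; not part of the APRIL toolset.
    """
    # Naive tokenisation (an assumption): lowercase runs of letters,
    # allowing internal hyphens and apostrophes.
    tokens = re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())

    # Each distinct token is a type; the count is its frequency.
    counts = Counter(tokens)

    # Hapax legomena are the types that occur exactly once.
    hapaxes = sorted(word for word, n in counts.items() if n == 1)

    # Share of types (not of running words) that are hapax legomena.
    share = len(hapaxes) / len(counts) if counts else 0.0
    return hapaxes, share


sample = "the cat sat on the mat and the dog sat too"
hapaxes, share = hapax_share(sample)
print(hapaxes, share)
```

In this toy sentence six of the eight types occur only once. A real corpus would need more careful tokenisation, and the proportion depends on the size and nature of the text.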

The study brings fascinating insights into the nature of productivity in the language, and offers objective criteria for augmenting indexes.

Sample new word listings are available on our Neologisms page.

The full set of tools developed for the APRIL project is available for use by registered partners only. A demo version using one month of data is available here.

Acknowledgement: The APRIL Project was funded by the EPSRC from 1997 to 2000 (Grant Reference GR/L08243/01).