
Chambers International Corpus

What is CHIC?

To ensure that our dictionaries accurately capture the way language is used right now, Chambers lexicographers use an in-house language resource called the Chambers International Corpus (CHIC) to study the usage patterns and behaviour of linguistic units. A corpus is essentially a very large collection of texts stored electronically and annotated in such a way as to allow lexicographers to collect evidence and answer questions about the way we use language.

CHIC is now approaching a billion words of modern (post-2002), international English. Corpus data sources include newspapers, magazines, fiction, non-fiction, blogs, websites and transcribed spoken data on a variety of subjects from a variety of English-speaking nations. The diversity of the sources is intended to reflect the diversity of language use. The corpus is updated frequently using a combination of customized web spidering systems and advanced electronic document processing techniques.

Before texts are added to the corpus database, they are processed and enriched with extra information. Each word is assigned a part-of-speech tag, and each document is associated with metadata such as its genre, domain, and date and place of publication. This extra information allows for more advanced interrogation of the corpus data.
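
As a rough illustration of what this enrichment makes possible, the following Python sketch models a tagged, metadata-bearing document and filters a small collection by metadata. It does not describe how CHIC is actually stored: the class names, field names, tag labels and sample data are all assumptions invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    text: str
    pos: str  # part-of-speech tag; the labels "ADJ", "NOUN" are assumed


@dataclass
class Document:
    tokens: List[Token]
    genre: str    # e.g. "newspaper", "blog"
    domain: str   # subject area
    year: int     # year of publication
    country: str  # place of publication


def select(docs, **criteria):
    """Return the documents whose metadata matches every given criterion."""
    return [
        d for d in docs
        if all(getattr(d, key) == value for key, value in criteria.items())
    ]


# Invented sample data, for illustration only.
docs = [
    Document([Token("stiff", "ADJ"), Token("breeze", "NOUN")],
             "newspaper", "weather", 2010, "UK"),
    Document([Token("strong", "ADJ"), Token("wind", "NOUN")],
             "blog", "sailing", 2012, "US"),
]

uk_docs = select(docs, country="UK")
us_blogs = select(docs, genre="blog", country="US")
```

Because each document carries its own metadata, a query can be narrowed to, say, UK newspapers without touching the text itself.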


How do Chambers lexicographers use CHIC?

There are a number of ways of exploiting the corpus for lexicographic ends. The tool we use to query the corpus (Sketch Engine) allows us to build word sketches, which are essentially summaries of the sentence elements a particular word tends to interact with and how it interacts with them. For example, the adjective stiff is typically used with breeze to describe one that blows strongly, but not with wind. Similarly, even though change and alter are synonyms, it's very common to change your mind but far more unusual (and less desirable) to alter your mind. Idiomatic examples of language use are particularly crucial in bilingual and learners' dictionaries. Word sketches are used to establish which structures are most typical of a particular headword, and the corpus acts as an important source of real-life examples of these structures.
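
The idea behind the stiff breeze example can be sketched in a few lines of Python. This is a toy, not Sketch Engine's method: it simply counts adjectives that appear immediately before a given noun in part-of-speech tagged sentences, whereas a real word sketch draws on grammatical relations and statistical association measures. The function name, tag labels and sample sentences are assumptions made up for the example.

```python
from collections import Counter


def adjective_collocates(tagged_sentences, noun):
    """Count adjectives that directly precede `noun` (raw adjacency only)."""
    counts = Counter()
    for sentence in tagged_sentences:
        # Walk over each pair of neighbouring (word, tag) tokens.
        for (word, pos), (next_word, next_pos) in zip(sentence, sentence[1:]):
            if pos == "ADJ" and next_pos == "NOUN" and next_word.lower() == noun:
                counts[word.lower()] += 1
    return counts


# Invented sample data, for illustration only.
sample = [
    [("a", "DET"), ("stiff", "ADJ"), ("breeze", "NOUN")],
    [("A", "DET"), ("Stiff", "ADJ"), ("breeze", "NOUN"), ("blew", "VERB")],
    [("a", "DET"), ("strong", "ADJ"), ("wind", "NOUN")],
    [("a", "DET"), ("gentle", "ADJ"), ("breeze", "NOUN")],
]

breeze = adjective_collocates(sample, "breeze")
wind = adjective_collocates(sample, "wind")
```

On a corpus of realistic size, comparing the two counters is what would show stiff to be a typical partner of breeze but not of wind.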

The rich metadata allows the lexicographers to compare the ways in which words are used across different subject domains or different geographical regions. US and UK spelling variants are easy to track, as is technical or domain-specific language. All of this information is important in the construction of a complete dictionary entry.
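
A comparison of this kind might look something like the Python sketch below, which tallies spelling variants per region. It assumes each document has been reduced to a (region, list of words) pair; that representation, the function name and the sample data are hypothetical and are not taken from CHIC.

```python
from collections import Counter, defaultdict


def variant_counts_by_region(documents, variants):
    """Tally how often each spelling variant occurs in each region."""
    wanted = {v.lower() for v in variants}
    counts = defaultdict(Counter)
    for region, words in documents:
        for word in words:
            word = word.lower()
            if word in wanted:
                counts[region][word] += 1
    return counts


# Invented sample data, for illustration only.
docs = [
    ("UK", ["the", "colour", "of", "the", "harbour"]),
    ("US", ["the", "color", "of", "the", "harbor"]),
    ("UK", ["a", "colour", "chart"]),
]

result = variant_counts_by_region(docs, ["colour", "color"])
```

The same grouping would work for any other metadata field, such as subject domain, by swapping the region for that field.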


Curious?

If you would like to know more about CHIC and how it can be used, please email us at: corpus@chambers.co.uk