Czech National Corpus

CNC – website

Hosting institution: Charles University


The main objective of the CNC research infrastructure is to continuously map the Czech language by building large general-purpose language corpora and providing access to them. The CNC’s linguistic data cover a wide range of genres and language varieties (including contemporary written, spoken and diachronic Czech). In addition, the InterCorp parallel corpus contains original and translated texts in Czech and more than 30 other languages to support contrastive research. The CNC corpora are widely used and valued for their ever-growing size (with regularly updated data), varied and well-defined composition, reliable metadata and high-quality data processing with state-of-the-art tools. CNC provides intuitive access to its corpora through efficient, specialized web-based applications, backed by user support (including a user forum with Q&A, bug reporting, documentation and a knowledge base), all available at the CNC research portal www.korpus.cz. CNC also provides users with data packages tailored to their specific requirements. CNC is currently the only research infrastructure in the Czech Republic that systematically focuses on developing and promoting the methodology of corpus linguistics. Despite its national character, the infrastructure is widely used by international users, and the exceptional range of CNC corpora attracts collaborative corpus-based research in contrastive language study, which requires comparable data in different languages. CNC cooperates closely with the LINDAT/CLARIN research infrastructure, the Czech national node of the pan-European research infrastructure CLARIN ERIC.

Future development

The development strategy of CNC is based on its own strategic research, current trends in empirical linguistics and user feedback. CNC plans to continue developing its operations by systematically building its user community, mainly by reaching out to new end-users, who increasingly come from the broader field of social sciences and humanities. It also plans to enrich the spectrum of collected data with semi-official language used on the internet, semi-formal spoken language and a monitor corpus that will eventually cover the period from 1850 to the present. CNC also strives to continuously improve corpus annotation and broaden its portfolio of user applications, both by upgrading existing applications and by developing new ones.

Socio-economic impact

The primary aim of the CNC research infrastructure is to provide open access to linguistic data for the research community in the social sciences and humanities (SSH) as well as for the general public. CNC now has more than 6,000 active registered users, who perform more than 1,900 corpus queries per day. CNC is a unique source of authentic linguistic data for both basic and applied linguistic research, and it also supports other areas of the humanities and natural language processing.