Czech National Corpus


Hosting institution:

  • Charles University in Prague

The Czech National Corpus (CNC) continuously maps the Czech language by building large general-purpose language corpora and providing access to them. Its linguistic data cover a wide range of genres and language varieties, including written, spoken and diachronic Czech. In addition, the InterCorp parallel corpus contains original and translated texts in Czech and more than 30 other languages. The CNC corpora constitute a unique resource of authentic language information for both basic and applied linguistic research, as well as for other domains of the social sciences and humanities. They are widely used thanks to their continuously growing size, varied and well-defined composition, reliable metadata and high-quality data processing with state-of-the-art tools.

The CNC provides intuitive access to its corpora through efficient, specialized web-based applications. User support is available at the CNC research portal, which also includes a User Forum (with Q&A, bug reporting, etc.) and a corpus linguistics Wiki. The CNC is the only research infrastructure in the Czech Republic that focuses systematically on developing the methodology of corpus linguistics. It also provides data packages tailored to specific users’ needs.

Despite its national character, the CNC is widely used by international users, and the exceptional range of its corpora attracts collaborative corpus-based research in contrastive language study, which requires comparable data in different languages. The CNC cooperates closely with the research infrastructure LINDAT/CLARIN, the Czech national node of the pan-European research infrastructure CLARIN ERIC.