|
Lecturer(s)
|
-
Radimský Jan, prof. PhDr. Ph.D.
|
|
Course content
|
1. Introduction to corpus linguistics (corpus, types of corpora, technical issues and methodological foundations, the nature of corpus data: what can and cannot be found in a corpus and why)
2. The Czech National Corpus project (types of corpora within the CNC, basic CNC tools, significance and uses, basic corpus querying), other major NLP centers in the Czech Republic and their computational tools
3. Basic and advanced word searches in an unannotated corpus using regular expressions
4. Corpus annotation (metadata in corpora and their use in query design, principles of corpus annotation, tokenization, lemmatization, tagging)
5. Basics of descriptive corpus statistics (absolute and relative frequency, frequency lists and their analysis, Zipf's laws, type and token frequency, reduced frequency and ARF)
6. Word co-occurrence in corpora (word combinations in text from the perspective of a linguist and a computational linguist; collocations: syntactic and semantic constraints, idiomaticity; word embeddings; rule-based and statistical identification of co-occurrences)
7. Multilingual corpora: comparable and parallel. Use of the InterCorp parallel corpus in translation practice
8. Selected national corpora (French: Frantext, Le Monde; Italian: La Repubblica, CORIS/CODIS, ITWAC; Spanish: CRAE, Ancora, Coser, Cluvi). Corpora available in Sketch Engine
9. Basic issues in natural language processing (NLP): foundations, limits, and possibilities (natural vs. formal languages, the Turing test, applied NLP problems and their solutions: rule-based systems, statistical machine learning, neural networks; semantic vectors, large language models, and basic principles of AI)
10. Machine translation and computational tools for translators (electronic dictionaries, CAT tools)
11-13. Current topics in corpus and computational linguistics; solving specific problems using corpora and machine translation tools
|
|
Learning activities and teaching methods
|
|
Monologic (reading, lecture, briefing), Dialogic (discussion, interview, brainstorming), Demonstration, Activating (simulations, games, drama), Work with multi-media resources (texts, internet, IT technologies)
|
|
Learning outcomes
|
The course introduces the basic concepts, methods, and issues of corpus and computational linguistics, as well as the possibilities this discipline offers, particularly for addressing applied linguistic questions. Students will become familiar with the principles of creating and analyzing language corpora, learn to work effectively with available corpus tools, and acquire methods for searching and interpreting linguistic data. Attention will also be given to applications in machine translation, the teaching of grammar and vocabulary, and working with large language models (AI). The course develops critical thinking in working with linguistic data and helps students better understand the structures and functioning of language, which they can apply in translation practice.
- Student explains the basic concepts, methods, and principles of corpus and computational linguistics and demonstrates, using concrete examples, how these approaches can be applied in translation practice (e.g. solving translation problems, choosing equivalents, analyzing usage and context).
- Student works effectively with language corpora (including the Czech National Corpus and other tools) and applies various methods of searching and analyzing linguistic data to support informed translation decisions and justify translation choices.
- Student understands the relationship between quantitative properties of language and translation processes (e.g. frequency, collocations, phraseology) and uses this knowledge to produce more natural and idiomatic translations.
- Student identifies and critically evaluates tools for working with multilingual data (electronic dictionaries, CAT tools, machine translation, etc.) and selects appropriate tools and strategies for specific translation tasks.
- Student explains the basic principles of rule-based and statistical methods, as well as machine learning approaches (including neural networks and large language models), and applies this understanding to critically assess and effectively use AI-assisted translation tools.
|
|
Prerequisites
|
The course is taught in English and is primarily intended for students who have not completed a similar course in their previous studies (at the Faculty of Arts, University of South Bohemia, typically URO/8KKL).
|
|
Assessment methods and criteria
|
Oral examination, Student performance assessment
Oral exam, ongoing completion of tasks assigned in the seminar.
|
|
Recommended literature
|
-
Barth, Danielle; Schnell, Stefan. Understanding corpus linguistics. First published. London ; New York: Routledge, 2022. ISBN 978-0-367-21962-8.
-
Li, Defeng; Corbett, John. The Routledge handbook of corpus translation studies. Abingdon, Oxon: Routledge, 2025.
-
Lüdeling, Anke; Kytö, Merja. Corpus linguistics : an international handbook. Volume 1. Berlin: Walter de Gruyter, 2008. ISBN 978-3-11-018043-5.
-
McEnery, Tony; Hardie, Andrew. Corpus linguistics : method, theory and practice. First published. Cambridge: Cambridge University Press, 2012. ISBN 978-0-521-54736-9.
-
Mitkov, Ruslan. The Oxford handbook of computational linguistics. Second edition. Oxford: Oxford University Press, 2022. ISBN 978-0-19-957369-1.
-
Stefanowitsch, Anatol. Corpus linguistics : a guide to the methodology. Berlin: Language Science Press, 2020.
-
Teubert, Wolfgang (ed.). Text Corpora and Multilingual Lexicography. Amsterdam: John Benjamins, 2007. ISBN 9789027239655.
|