1
Corpus | Language | Timespan | Size | Annotation | Availability | Licence | Additional comments | VLO
2
Hungarian Historical Corpus | Hungarian | 1772-2010 | 30 million words |  | Concordancer |  | Avail. through dedicated website | yes
3
Medieval Charter Sections Corpus | Czech, Latin | 14th century | 57 chapters | manually tagged, named entities | Download | CC-BY-NC-SA 4.0 | LINDAT | yes
4
Sheffield Corpus of Chinese | Chinese |  |  |  | Download | CC-BY-NC-SA 3.0 | Oxford Text Archive | yes
5
Reference corpus of historical Slovene goo300k 1.2 | Slovenian | 1584-1899 | 300,000 tokens | manually tokenised, lemmatised, PoS-tagged, modern synonyms for archaic words | Download, concordancer | CC-BY 4.0 | CLARIN.SI, KonText | yes
6
Digital library and corpus of historical Slovene IMP 1.1 | Slovenian | 1584-1919 | 17.7 million tokens | tokenised, lemmatised, PoS-tagged | Download, concordancer | CC-BY-SA 4.0 | CLARIN.SI, KonText | yes
7
IMP corpus n-grams 1.0 | Slovenian | 1584-1919 | 2.5 million n-grams |  | Download | CC-BY-SA 4.0 | CLARIN.SI | yes | ET: not sure if this really belongs here, as it is not a corpus.
8
Corpus of Historical American English - Kielipankki Korp version 2017H1 | American English | 1810-2009 | 385 million tokens | tokenised | Concordancer | CLARIN ACA | Kielipankki, Korp | yes
9
Historical Corpus of the Welsh Language 1500-1850 | Welsh | 1500-1850 | 420,000 words |  | Download, concordancer |  | Avail. through dedicated website | yes
10
GerManC. A Historical Corpus of German Newspapers 1650-1800 | German | 1650-1800 | 1650-1800, 800,000 words, sampled by genre |  | Download | CC-BY-NC-SA 3.0 | Oxford Text Archive | yes
11
The Old Bailey Corpus | Late Modern English | 1720-1913 | 134 million words | Detailed sociobiographical, pragmatic and textual annotation | Download, concordancer | CC-BY-NC-SA 4.0 | CLARIN-D, CLARIN Federated Content Search available | yes
12
"PolDiLemma" Middle Polish Diachrone Lemmatised CorpusPolish, German, Latin, Czech16th-18th centurylemmatisedDownloadPublic DomainCLARIN-D, CLARIN Federated Content Search availableyes
13
Helsinki Corpus of Scottish Correspondence (1540-1750) | English | 1540-1750 | 0.5 million tokens | tokenised | Concordancer | CLARIN ACA | Kielipankki, Korp | yes
14
Parsed Corpus of Early English Correspondence (PCEEC) | English | 1410-1681 | 2.2 million words | tokenised, PoS-tagged, syntactically parsed | Download (need to "apply for approval") |  | Oxford Text Archive | yes
15
B4 Tatian Corpus of Deviating Examples 2.1 | Latin, Old High German | 9th century | 11,300 tokens | tokenised, MSD-tagged | Download, concordancer | CC-BY | University of Hamburg | yes
16
Syntactic Reference Corpus of Medieval French | Old French | 9th-13th century | 245,000 tokens | syntactically parsed | Download | CLARIN ACA | CLARIN-D (external site?) | yes
17
Hamburg Corpus of Old Swedish with Syntactic Annotations (HaCOSSA) | English, German, Latin, Old Norse, Swedish |  | 128,204 words | syntactic and morphological annotation | Download | CLARIN RES | University of Hamburg | yes
18
Deutsches Textarchiv (DTA) | German | 1600-1900 |  |  |  | CLARIN PUB | LINDAT | yes
19
Reference Corpus Middle Low German/Low Rhenish (1200-1650) | Middle Low German | 1200-1650 | 200,700 tokens | tokenised, MSD-tagged | Download | CC-BY | University of Hamburg | yes
20
B4 Ludolf | Middle Low German | 1350 | 6,690 tokens | tokenised, tagged for clause type and grammatical function |  | CLARIN ACA | University of Hamburg | yes
21
B4 Historisches Predigtenkorpus zum Nachfeld | Middle High German |  | 9,2500 tokens | tokenised, syntactic, discursive annotation |  | CLARIN ACA | University of Hamburg | yes
22
Mannheimer Korpus Historischer Zeitungen und Zeitschriften | German | 18th and 19th centuries | 750 volumes, 3532 pages overall |  | Download |  |  | yes
23
Menota | Old Norse |  | 1.6 million tokens | tokenised, MSD-tagged, lemmatised | Concordancer | CC-BY | CLARINO, Corpuscle | no
24
Greek Medieval Texts | Ancient Greek | 4th-16th century | 3.4 million words |  | Available - Unrestricted Use | CC-BY | clarin:el | no
25
Austrian Baroque Corpus | Austrian | 1650-1750 | 200,000 | tokenised, PoS-tagged, lemmatised, named entities | Concordancer |  | Clarin Austria | no
26
Corpus Informatizado do Português Medieval | Portuguese | 9th to 16th centuries | 2 million | tokenised, PoS-tagged | Concordancer |  | Avail. through dedicated website | no
27
Parsed Corpus of Historical Portuguese | Portuguese | 1380-1881 | 3.3 million | tokenised, PoS-tagged (2 million), treebanked (1.2 million) |  |  | Avail. through dedicated website | no
28
OROSSIMO Corpus - History | Greek | n/a | 553,131 tokens | Structural annotation (paragraph) | Download | CC-BY | clarin:el | no
29
ARCHER Corpus | English | 1600-1999 |  | none | Restricted online access (users must apply, signed user agreement required) | none | Curated by University of Manchester; interface is likely to be CQPweb | no
30
Historical Corpora at Lancaster University | English | 1500- | Numerous resources; millions of tokens | Wordclass, in some cases also semantic tagging (USAS system) | Restricted online access (users can register online; access conditions for corpora vary, and some are UK users only) | none | Numerous historical corpora available via | no
31
Older Scottish texts : the Edinburgh DOST corpus / A.J. Aitken, Paul Bratley and Neil Hamilton-Smith | English | 1450-1600 | 877,000 tokens | none | Download | http://creativecommons.org/licenses/by-nc-sa/3.0/ | Oxford Text Archive | yes
32
Anthology of Middle English texts / Santiago Gonzalez y Fernandez-Corugedo | English, Middle (1100-1500); English; Hebrew | 1100-1400 | 4000 words | none | Download | Oxford Text Archive licence | Oxford Text Archive | yes
33
Helsinki corpus of English texts | English; English, Old (ca. 450-1100); English, Middle (1100-1500) | 730-1710 | 240000 words | none | Download | Oxford Text Archive licence | Oxford Text Archive | yes
34
Corpus of biblical text in Scots / John Kirk | Scots |  | not known | none | Download | Oxford Text Archive licence | Oxford Text Archive | yes
35
Pamphlets of the American Revolution : [selections] / edited by Bernard Bailyn | English | 1750-1776 |  | none | Download | http://creativecommons.org/licenses/by-nc-sa/3.0/ | Oxford Text Archive | yes
36
Corpus of Late Modern English prose / David Denison | English | 1837-1926 |  | none | Download | Oxford Text Archive licence | Oxford Text Archive | yes
37
The Helsinki corpus of Older Scots : [1450-1700] | Scots | 1450-1700 |  | none | Download | http://creativecommons.org/licenses/by-nc-sa/3.0/ | Oxford Text Archive | yes
38
The Lampeter Corpus of Early Modern English Tracts | English | 1640-1740 |  | none | Download | http://creativecommons.org/licenses/by-nc-sa/3.0/ | Oxford Text Archive | yes
39
Paris speech in the past | French, Middle (ca. 1400-1600); French | 2000-07 |  | none | Download | http://creativecommons.org/licenses/by-nc-sa/3.0/ | Oxford Text Archive | yes
40
The York-Helsinki parsed corpus of Old English poetry (YCOEP) | English, Old (ca. 450-1100) | 730–1710 |  | none | Download | Oxford Text Archive licence | Oxford Text Archive | yes
41
Corpus of Early English Correspondence Sampler (CEECS) | English | 1418–1680 |  | none | Download | Oxford Text Archive licence | Oxford Text Archive | yes
42
The York-Toronto-Helsinki Parsed Corpus of Old English prose (YCOE) | English, Old (ca. 450-1100); Latin | 600-1150 |  | none | Download | Oxford Text Archive licence | Oxford Text Archive | yes
43
The English language of the north-west in the late Modern English period: a Corpus of late 18c Prose | English | 1761-90 |  | none | Download | Oxford Text Archive licence | Oxford Text Archive | yes
44
Polish language of the 1960s | Polish | 1963-1967 |  | none | Download | http://creativecommons.org/licenses/by-nc-sa/3.0/ | Oxford Text Archive | yes
45
Dictionary of Old English Corpus in Electronic Form (DOEC) | English, Old (ca. 450-1100); Latin | 600-1150 |  | none | Download | Oxford Text Archive licence | Oxford Text Archive | yes
46
Partonopeus de Blois: transcriptions of all manuscripts and fragments | French, Old (ca. 842-1400) | 1166-1199 | not known | none | Download | http://creativecommons.org/licenses/by-nc-sa/3.0/ | Oxford Text Archive | yes
47
A Corpus of English Dialogues 1560-1760 (CED) | English | 1560-1760 | not known | none | Download | Oxford Text Archive licence | Oxford Text Archive | yes
48
Parsed Corpus of Early English Correspondence (PCEEC) | English; English, Middle (1100-1500) | 1410-1695 | 2.2 million words | POS-tagging and parsing | Download | Oxford Text Archive licence | Oxford Text Archive | yes
49
The Electronic Text Corpus of Sumerian Literature. Revised edition. | English; Sumerian | 2100 BCE-1700 BCE | not known | Each word form in the composite transliterations has been assigned to a lexeme which is specified by a citation form, word class information and basic English translation. | Download | http://creativecommons.org/licenses/by-nc-sa/3.0/ | Oxford Text Archive | yes
50
The Lancaster Newsbooks Corpus | English | 1654-1655 | not known | none | Download | http://creativecommons.org/licenses/by-nc-sa/3.0/ | Oxford Text Archive | yes
51
GeMi Corpus | German | 1500-1700 | 119,802 tokens | TEI Lite markup, no linguistic annotation | Download | http://creativecommons.org/licenses/by-nc-sa/3.0/ | Oxford Text Archive; full title The Nottingham Corpus of Early Modern German Midwifery and Women's Medicine (ca. 1500-1700) | yes
52
EEBO-TCP | English | 1450-1700 | 766 million tokens | TEI P5 markup, no linguistic annotation | Download | CC-0 | Oxford Text Archive; the 'corpus' is thousands of texts, available individually for download | yes
53
ECCO-TCP | English | 1700-1800 | 74 million tokens | TEI P5 markup, no linguistic annotation | Download | CC-0 | Oxford Text Archive; the 'corpus' is thousands of texts, available individually for download | yes
54
EVANS-TCP | English | 1640-1821 | 102 million tokens | TEI P5 markup, no linguistic annotation | Download | CC-0 | Oxford Text Archive; the 'corpus' is thousands of texts, available individually for download | yes
55
Hansard Corpus | English | 1803-2005 | 1.6 billion | POS-tags, lemmas, semantic tags | Concordancer | none | corpus.byu.edu (Brigham Young Corpora) | no
56
Corpus testuale del Tesoro della Lingua Italiana delle Origini | Italian |  | 23 million tokens | Lemmas | Web concordancer | unknown | Avail. through dedicated website | no
57
DiaCORIS | Italian | 1861-1945 |  |  | Web concordancer | unknown | Avail. through dedicated website | no
58
M.I.DIA. (Morfologia dell'Italiano in DIAcronia) | Italian | 13th-20th cent. | 7.5 million tokens |  | Web concordancer | CC-BY-NC 4.0 | Avail. through dedicated website | no
59
Archivio Datini | Italian |  |  | Lemmas | Web concordancer | unknown | Avail. through dedicated website | no
60
Frantext | French | 10th-21st cent. | 297 586 781 words | Lemmas, POS-tags | Web concordancer | unknown | Available by paid subscription | no
61
eFontes Mediae et Infimae Latinitatis Polonorum (Elektroniczny korpus polskiej łaciny średniowiecznej) | Polish, Latin | 1000–1550 | 5 million tokens | Lemmata | Web concordancer | unknown |  | no
62
Corpus of 16th-century Polish (Korpus polszczyzny XVI wieku) | Polish, Latin | 16th century |  | TEI P5 markup, lemmata, transcription | Corpus search | unknown |  | no
63
The Electronic Corpus of the 17th and 18th century Polish (Korpus tekstów polskich z XVII i XVIII w.) | Polish, Latin | 1601–1772 | 12 million tokens | POS tags (for 0.5M tokens), rich structural annotation | Corpus search | unknown |  | no
64
Corpus of old Polish texts until 1500 (Korpus tekstów staropolskich do roku 1500) | Polish, Latin | ?–1500 | 620 thousand tokens | TEI P5 markup, no linguistic annotation | Data available for download | unknown |  | no
65
Corpus of 19th-century Polish (Korpus polszczyzny XIX-wiecznej) | Polish | 1830–1918 | 625 thousand tokens | Lemmata, POS tags, transliteration, transcription | Corpus search | unknown |  | no
66
15th-century New Testament translations (Piętnastowieczne przekłady Nowego Testamentu – elektroniczna konkordancja staropolska) | Polish, Latin | 1380–1500 | 400 thousand tokens | TEI P5 markup, no linguistic annotation | Data download, translation browser, Polish and Latin word lists | unknown |  | no
67
IMPACT GT corpus (Korpus GT projektu IMPACT) | Polish | 1570–1756 | 1.5 million tokens | transcription | Corpus search | unknown |  | 
68
Chronopress | Polish | 1945–1954 | 16 million tokens |  | Web concordancer | CC BY SA |  | no
69
Bundesblatt/Feuille fédérale/Foglio federale | German/French/Italian | 1849-2014 | 203,585,806 tokens (German), 239,125,036 tokens (French), 85,223,085 tokens (Italian) | TreeTagger (all data), RFtagger (German data) | CQPweb |  | Université de Genève. SNF Project linked to this corpus containing documents published by the Swiss Federal Council: http://p3.snf.ch/project-143585 | no
70
DIAKORP v6 | Czech | 14th-20th century | 4 million tokens | currently only basic structural markup | Web concordancer | CC BY NC SA | Also available for download upon request from the Czech National Corpus. | no
71
Old Hungarian Corpus | Hungarian | 12th century - 17th century | 3 million tokens | segmented into tokens and sentences; partly normalized (to modern Hung. spelling), partly morphologically tagged; locus markers | Download & Concordancer | freely available for everyone | Avail. through dedicated website | not yet
72
Corpus of Old and Middle Hungarian court records and private correspondence | Hungarian | 16-18th century | 850 000 words | tokenised, lemmatised, morphosyntactically tagged, sociolinguistic metadata added | Concordancer | freely available for everyone | Avail. through dedicated website | not yet
73
Mikes dictionary | Hungarian | 1717-1761 | 1.5 million words | lemmatised | Concordancer | freely available for everyone | Avail. through dedicated website | not yet
74
Deutsches Textarchiv (German Text Archive, DTA) | German | 1600–1900 | 211 million tokens (growing further) | TEI text structures; tokenized, lemmatized, POS, normalized orthography | Download, Corpus Search, Text-Image-Display | Creative Commons (CC BY-NC, CC BY-SA, CC BY) | Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) -- CLARIN-D; manual transcription + TEI annotation, automatic linguistic annotation; wide range of text types | yes | DTA subcorpora
75
Dinglers Polytechnisches Journal (Polytechnical Journal of Dingler) | German | 1820–1931 | 77.5 million tokens | TEI text structures; tokenized, lemmatized, POS, normalized orthography | Download, Corpus Search, Text-Image-Display | CC BY-NC-SA 3.0 DE | Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) -- CLARIN-D; manual transcription + TEI annotation, automatic linguistic annotation | yes
76
Referenzkorpus Mittelhochdeutsch (Middle High German Reference Corpus) | German | 1050–1350 | 2.5 million tokens | tokenized, lemmatized, POS, normalized orthography, morphosyntactic description | Download, Corpus Search | CC BY-SA 4.0 International | Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) -- CLARIN-D; manual transcription + linguistic annotation | yes
77
Die Grenzboten (journal) | German | 1842–1921 | 89 million tokens | basic TEI text structures; tokenized, lemmatized, POS, normalized orthography | Download, Corpus Search, Text-Image-Display | free | Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) -- CLARIN-D; OCR, computer-aided TEI annotation, automatic linguistic annotation | yes
78
TreeTagger -- Middle High German parameter file | German; Middle High German | 1100-1500 | 10 million tokens | tokenized, lemmatized, POS | Download | free | Institute for Natural Language Processing, University of Stuttgart, CLARIN D, Middle High German Conceptual Database; CRETA | yes
79
OCR Post-correction | German: Antiqua and Fraktur | 18th to 20th century |  |  | web application | free | Institute for Natural Language Processing, University of Stuttgart, CLARIN D, OCR, post-correction, CRETA | yes
80
Part-of-speech tagging: mixed text | Latin, Middle English |  |  |  | web application | free | Institute for Natural Language Processing, University of Stuttgart, CLARIN D, CRETA | yes
81
DDR-Presseportal (GDR press portal) | German | 1945-1994 | 1.1 billion tokens | basic TEI text structures; tokenized, lemmatized, POS, normalized orthography | Corpus Search | CLARIN ACA | Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) -- CLARIN-D; OCR, computer-aided TEI annotation, automatic linguistic annotation | no
82
Brieven als buit (Letters as loot) | Dutch | 17th-18th century | 460,000 words (1,000 letters) | manually transcribed (diplomatically), automatically lemmatised and grammatically tagged | concordancer | free | Dutch Language Institute (INT) | No
83
Corpus Gysseling | Dutch | 13th century | 1.5 million words | manually lemmatised and POS-tagged | concordancer, download | INT Licence for researchers | Dutch Language Institute (INT) | No
84
The Morpho-Syntactic Database of Mikael Agricola's Works | Finnish | 1544-1551 | 83,678 Sentences; 428,314 Tokens; 38,308 Words | Turku Dependency Parser: keyword, part of speech, morphological components and syntactical function | Interface | CC BY ND | Kielipankki Korp | No
85
The Finnish Gutenberg Corpus | Finnish | up to 1925 (IPR expired) | 2,457,531 Sentences; 34,487,420 Words |  | Interface | CC BY | Kielipankki Korp | Yes
86
Aleksis Kivi Corpus (SKS) | Finnish, Swedish | 1834–1872 | 52,821 Sentences; 413,735 Words |  | Interface | CC BY NC | Kielipankki Korp | Yes
87
Finnish Folk Poetry | Multilingual | 1564-1939 | 1,435,012 Sentences; 7,141,783 Words |  | Interface | CC BY NC | Kielipankki Korp | Yes
88
Classics of Finnish Literature, Kielipankki Version | Finnish | 1880-1949 | 1,500,000 Words |  | Interface, Download | EUPL v.1.1 SA | Kielipankki Korp, Kielipankki Download | Yes
89
Corpus of Old Literary Finnish | Finnish | 1543-1810 | 167,400 Sentences; 4,133,202 Words |  | Interface | EUPL v.1.1 SA | Kielipankki Korp | Yes
90
Corpus of Early Modern Finnish, Kielipankki Version | Finnish | 1809-1899 | 8,600,000 Words |  | Interface | EUPL v.1.1 SA | Kielipankki Korp | Yes
91
The Letters of Paul Sinebrychoff, Kielipankki Version | Finnish, Swedish | 1895-1909 | 100,000 Words |  | Interface | CC BY | Kielipankki Korp, Subcorpus Finnish, Subcorpus Swedish | Subcorpus Finnish No, subcorpus Swedish Yes
92
The Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version | Finnish, Swedish | 1770-2011, approx. 10 corpora per decade | 612,061,367 Sentences; 8,728,581,153 Words |  | Interface | CC BY | Kielipankki Korp, Subcorpus Finnish and Subcorpus Swedish see separate entries on this list, N-grams for both | Main corpus Yes, subcorpora No, N-grams for both subcorpora Yes
93
Classics Library of the National Library of Finland - Kielipankki version | Finnish, Swedish | 1549-1944 | 692 works in Finnish, 285 works in Swedish will be available in the near future |  | Will be available in the near future at Interface, Download | CC BY | Kielipankki Korp, Kielipankki Download, Subcorpus Finnish, Subcorpus Swedish | No
94
Virtual Old Literary Finnish (VVKS) - Kielipankki Korp version | Finnish | 1543-1791 | 48 Texts |  | Interface, Download | CC BY NC ND | Kielipankki Korp, Kielipankki Download | No
95
The Newspaper and Periodical OCR Corpus of the National Library of Finland (1771-1874) | Finnish, Swedish | 1771-1874 | 15 Gb |  | Download | CC BY | Kielipankki Download | Yes
96
The Newspaper and Periodical OCR Corpus of the National Library of Finland (1875-1920) | Finnish, Swedish | 1875–1920 | 8,740,000,000 Tokens; 371 Gb |  | Download | CLARIN ACA | Kielipankki Download | No
97
Open Richly Annotated Cuneiform Corpus, Korp Version | cuneiform | ancient | 741,129 Tokens |  | Interface | CC BY SA | Kielipankki Korp | Yes
98
The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version | Finnish | 1840-2011 | 5,246,334,710 Tokens |  | Interface | CC BY SA | Kielipankki Korp | No
99
The Swedish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version | Swedish | 1770-1950 | 3,481,646,321 Tokens |  | Interface | CC BY SA | Kielipankki Korp | No
100
Corpus of Old Written Estonian | Estonian | 1224-1227, 1485-1889 | 134 texts; total 2,155,435 tokens; total 1,718,114 tokens in Estonian | The texts are in the original written form. 16th-18th century texts have been tagged with contemporary Estonian, morphological and language information. 19th century texts are unannotated. | Interface | CC BY NC | CELR Meta-Share | Yes