Historical corpora in the CLARIN infrastructure

	A	B	C	D	E	F	G	H	I	J
1	Corpus	Language	Timespan	Size	Annotation	Availability	Licence	Additional comments	In VLO

2	Hungarian Historical Corpus	Hungarian	1772-2010	30 million words		Concordancer		Avail. through dedicated website	yes
3	Medieval Charter Sections Corpus	Czech, Latin	14th century	57 chapters	manually tagged, named entities	Download	CC-BY-NC-SA 4.0	LINDAT	yes
4	Sheffield Corpus of Chinese	Chinese				Download	CC-BY-NC-SA 3.0	Oxford Text Archive	yes
5	Reference corpus of historical Slovene goo300k 1.2	Slovenian	1584-1899	300,000 tokens	manually tokenised, lemmatised, PoS-tagged, modern synonyms for archaic words	Download, concordancer	CC-BY 4.0	CLARIN.SI, KonText	yes
6	Digital library and corpus of historical Slovene IMP 1.1	Slovenian	1584-1919	17.7 million tokens	tokenised, lemmatised, PoS-tagged	Download, concordancer	CC-BY-SA 4.0	CLARIN.SI, KonText	yes
7	IMP corpus n-grams 1.0	Slovenian	1584-1919	2.5 million n-grams		Download	CC-BY-SA 4.0	CLARIN.SI	yes	ET: not sure if this really belongs here, as it is not a corpus.
8	Corpus of Historical American English - Kielipankki Korp version 2017H1	American English	1810-2009	385 million tokens	tokenised	Concordancer	CLARIN ACA	Kielipankki, Korp	yes
9	Historical Corpus of the Welsh Language 1500-1850	Welsh	1500-1850	420,000 words		Download, concordancer		Avail. through dedicated website	yes
10	GerManC. A Historical Corpus of German Newspapers 1650-1800	German	1650-1800	800,000 words, sampled by genre		Download	CC-BY-NC-SA 3.0	Oxford Text Archive	yes
11	The Old Bailey Corpus	Late Modern English	1720-1913	134 million words	Detailed sociobiographical, pragmatic and textual annotation	Download, concordancer	CC-BY-NC-SA 4.0	CLARIN-D, CLARIN Federated Content Search available	yes
12	"PolDiLemma" Middle Polish Diachrone Lemmatised Corpus	Polish, German, Latin, Czech	16th-18th century		lemmatised	Download	Public Domain	CLARIN-D, CLARIN Federated Content Search available	yes
13	Helsinki Corpus of Scottish Correspondence (1540-1750)	English	1540-1750	0.5 million tokens	tokenised	Concordancer	CLARIN ACA	Kielipankki, Korp	yes
14	Parsed Corpus of Early English Correspondence (PCEEC)	English	1410-1681	2.2 million words	tokenised, PoS-tagged, syntactically parsed	Download (need to "apply for approval")		Oxford Text Archive	yes
15	B4 Tatian Corpus of Deviating Examples 2.1	Latin, Old High German	9th century	11,300 tokens	tokenised, MSD-tagged	Download, concordancer	CC-BY	University of Hamburg	yes
16	Syntactic Reference Corpus of Medieval French	Old French	9th-13th century	245,000 tokens	syntactically parsed	Download	CLARIN ACA	CLARIN-D (external site?)	yes
17	Hamburg Corpus of Old Swedish with Syntactic Annotations (HaCOSSA)	English, German, Latin, Old Norse, Swedish		128,204 words	syntactic and morphological annotation	Download	CLARIN RES	University of Hamburg	yes
18	Deutsches Textarchiv (DTA)	German	1600-1900				CLARIN PUB	LINDAT	yes
19	Reference Corpus Middle Low German/Low Rhenish (1200-1650)	Middle Low German	1200-1650	200,700 tokens	tokenised, MSD-tagged	Download	CC-BY	University of Hamburg	yes
20	B4 Ludolf	Middle Low German	1350	6,690 tokens	tokenised, tagged for clause type and grammatical function		CLARIN ACA	University of Hamburg	yes
21	B4 Historisches Predigtenkorpus zum Nachfeld	Middle High German		9,2500 tokens	tokenised, syntactic, discursive annotation		CLARIN ACA	University of Hamburg	yes
22	Mannheimer Korpus Historischer Zeitungen und Zeitschriften	German	18th and 19th centuries	750 volumes, 3532 pages overall		Download			yes
23	Menota	Old Norse		1.6 million tokens	tokenised, MSD-tagged, lemmatised	Concordancer	CC-BY	CLARINO, Corpuscle	no
24	Greek Medieval Texts	Ancient Greek	4th-16th century	3.4 million words		Available - Unrestricted Use	CC-BY	clarin:el	no
25	Austrian Baroque Corpus	Austrian	1650-1750	200,000	tokenised, PoS-tagged, lemmatised, named entities	Concordancer		Clarin Austria	no
26	Corpus Informatizado do Português Medieval	Portuguese	9th to 16th centuries	2 million	tokenised, PoS-tagged	Concordancer		Avail. through dedicated website	no
27	Parsed Corpus of Historical Portuguese	Portuguese	1380-1881	3.3 million	tokenised, PoS-tagged (2 million), treebanked (1.2 million)			Avail. through dedicated website	no
28	OROSSIMO Corpus - History	Greek	n/a	553,131 tokens	structural annotation (paragraph)	Download	CC-BY	clarin:el	no
29	ARCHER Corpus	English	1600-1999		none	Restricted online access (users must apply, signed user agreement required)	none	Curated by University of Manchester; interface is likely to be CQPweb	no
30	Historical Corpora at Lancaster University	English	1500-	Numerous resources; millions of tokens	Wordclass, in some cases also semantic tagging (USAS system)	Restricted online access (users can register online; access conditions for corpora vary, and some are UK users only)	none	Numerous historical corpora available via	no
31	Older Scottish texts : the Edinburgh DOST corpus / A.J. Aitken, Paul Bratley and Neil Hamilton-Smith	English	1450-1600	877,000 tokens	none	Download	http://creativecommons.org/licenses/by-nc-sa/3.0/	Oxford Text Archive	yes
32	Anthology of Middle English texts / Santiago Gonzalez y Fernandez-Corugedo	English, Middle (1100-1500); English; Hebrew	1100-1400	4000 words	none	Download	Oxford Text Archive licence	Oxford Text Archive	yes
33	Helsinki corpus of English texts	English; English, Old (ca. 450-1100); English, Middle (1100-1500)	730-1710	240000 words	none	Download	Oxford Text Archive licence	Oxford Text Archive	yes
34	Corpus of biblical text in Scots / John Kirk	Scots	not known		none	Download	Oxford Text Archive licence	Oxford Text Archive	yes
35	Pamphlets of the American Revolution : [selections] / edited by Bernard Bailyn	English	1750-1776		none	Download	http://creativecommons.org/licenses/by-nc-sa/3.0/	Oxford Text Archive	yes
36	Corpus of Late Modern English prose / David Denison	English	1837-1926		none	Download	Oxford Text Archive licence	Oxford Text Archive	yes
37	The Helsinki corpus of Older Scots : [1450-1700]	Scots	1450-1700		none	Download	http://creativecommons.org/licenses/by-nc-sa/3.0/	Oxford Text Archive	yes
38	The Lampeter Corpus of Early Modern English Tracts	English	1640-1740		none	Download	http://creativecommons.org/licenses/by-nc-sa/3.0/	Oxford Text Archive	yes
39	Paris speech in the past	French, Middle (ca. 1400-1600); French	2000-07		none	Download	http://creativecommons.org/licenses/by-nc-sa/3.0/	Oxford Text Archive	yes
40	The York-Helsinki parsed corpus of Old English poetry (YCOEP)	English, Old (ca. 450-1100)	730–1710		none	Download	Oxford Text Archive licence	Oxford Text Archive	yes
41	Corpus of Early English Correspondence Sampler (CEECS)	English	1418–1680		none	Download	Oxford Text Archive licence	Oxford Text Archive	yes
42	The York-Toronto-Helsinki Parsed Corpus of Old English prose (YCOE)	English, Old (ca. 450-1100); Latin	600-1150		none	Download	Oxford Text Archive licence	Oxford Text Archive	yes
43	The English language of the north-west in the late Modern English period: a Corpus of late 18c Prose	English	1761-90		none	Download	Oxford Text Archive licence	Oxford Text Archive	yes
44	Polish language of the 1960s	Polish	1963-1967		none	Download	http://creativecommons.org/licenses/by-nc-sa/3.0/	Oxford Text Archive	yes
45	Dictionary of Old English Corpus in Electronic Form (DOEC)	English, Old (ca. 450-1100); Latin	600-1150		none	Download	Oxford Text Archive licence	Oxford Text Archive	yes
46	Partonopeus de Blois: transcriptions of all manuscripts and fragments	French, Old (ca. 842-1400)	1166-1199	not known	none	Download	http://creativecommons.org/licenses/by-nc-sa/3.0/	Oxford Text Archive	yes
47	A Corpus of English Dialogues 1560-1760 (CED)	English	1560-1760	not known	none	Download	Oxford Text Archive licence	Oxford Text Archive	yes
48	Parsed Corpus of Early English Correspondence (PCEEC)	English; English, Middle (1100-1500)	1410-1695	2.2 million words	POS-tagging and parsing	Download	Oxford Text Archive licence	Oxford Text Archive	yes
49	The Electronic Text Corpus of Sumerian Literature. Revised edition.	English; Sumerian	2100 BCE-1700 BCE	not known	Each word form in the composite transliterations has been assigned to a lexeme which is specified by a citation form, word class information and basic English translation.	Download	http://creativecommons.org/licenses/by-nc-sa/3.0/	Oxford Text Archive	yes
50	The Lancaster Newsbooks Corpus	English	1654-1655	not known	none	Download	http://creativecommons.org/licenses/by-nc-sa/3.0/	Oxford Text Archive	yes
51	GeMi Corpus	German	1500-1700	119,802 tokens	TEI Lite markup, no linguistic annotation	Download	http://creativecommons.org/licenses/by-nc-sa/3.0/	Oxford Text Archive; full title The Nottingham Corpus of Early Modern German Midwifery and Women's Medicine (ca. 1500-1700)	yes
52	EEBO-TCP	English	1450-1700	766 million tokens	TEI P5 markup, no linguistic annotation	Download	CC-0	Oxford Text Archive; the 'corpus' is thousands of texts, available individually for download	yes
53	ECCO-TCP	English	1700-1800	74 million tokens	TEI P5 markup, no linguistic annotation	Download	CC-0	Oxford Text Archive; the 'corpus' is thousands of texts, available individually for download	yes
54	EVANS-TCP	English	1640-1821	102 million tokens	TEI P5 markup, no linguistic annotation	Download	CC-0	Oxford Text Archive; the 'corpus' is thousands of texts, available individually for download	yes
55	Hansard Corpus	English	1803-2005	1.6 billion	POS-tags, lemmas, semantic tags	Concordancer	none	corpus.byu.edu (Brigham Young Corpora)	no
56	Corpus testuale del Tesoro della Lingua Italiana delle Origini	Italian		23 million tokens	Lemmas	Web concordancer	unknown	Avail. through dedicated website	no
57	DiaCORIS	Italian	1861-1945			Web concordancer	unknown	Avail. through dedicated website	no
58	M.I.DIA. (Morfologia dell'Italiano in DIAcronia)	Italian	13th-20th cent.	7.5 million tokens		Web concordancer	CC-BY-NC 4.0	Avail. through dedicated website	no
59	Archivio Datini	Italian			Lemmas	Web concordancer	unknown	Avail. through dedicated website	no
60	Frantext	French	10th-21st cent.	297,586,781 words	Lemmas, POS-tags	Web concordancer	unknown	Available by paid subscription	no
61	eFontes Mediae et Infimae Latinitatis Polonorum (Elektroniczny korpus polskiej łaciny średniowiecznej)	Polish, Latin	1000–1550	5 million tokens	Lemmata	Web concordancer	unknown		no
62	Corpus of 16th-century Polish (Korpus polszczyzny XVI wieku)	Polish, Latin	16th century		TEI P5 markup, lemmata, transcription	Corpus search	unknown		no
63	The Electronic Corpus of 17th- and 18th-century Polish (Korpus tekstów polskich z XVII i XVIII w.)	Polish, Latin	1601–1772	12 million tokens	POS tags (for 0.5M tokens), rich structural annotation	Corpus search	unknown		no
64	Corpus of Old Polish texts until 1500 (Korpus tekstów staropolskich do roku 1500)	Polish, Latin	?–1500	620 thousand tokens	TEI P5 markup, no linguistic annotation	Data available for download	unknown		no
65	Corpus of 19th-century Polish (Korpus polszczyzny XIX-wiecznej)	Polish	1830–1918	625 thousand tokens	Lemmata, POS tags, transliteration, transcription	Corpus search	unknown		no
66	15th-century New Testament translations (Piętnastowieczne przekłady Nowego Testamentu – elektroniczna konkordancja staropolska)	Polish, Latin	1380–1500	400 thousand tokens	TEI P5 markup, no linguistic annotation	Data download, translation browser, Polish and Latin word lists	unknown		no
67	IMPACT GT corpus (Korpus GT projektu IMPACT)	Polish	1570–1756	1.5 million tokens	transcription	Corpus search	unknown
68	Chronopress	Polish	1945–1954	16 million tokens		Web concordancer	CC BY SA		no
69	Bundesblatt/Feuille fédérale/Foglio federale	German/French/Italian	1849-2014	203,585,806 tokens (German), 239,125,036 tokens (French), 85,223,085 tokens (Italian)	TreeTagger (all data), RFtagger (German data)	CQPweb		Université de Genève. SNF Project linked to this corpus containing documents published by the Swiss Federal Council: http://p3.snf.ch/project-143585	no
70	DIAKORP v6	Czech	14th-20th century	4 million tokens	currently only basic structural markup	Web concordancer	CC BY NC SA	Also available for download upon request from the Czech National Corpus.	no
71	Old Hungarian Corpus	Hungarian	12th century - 17th century	3 million tokens	segmented into tokens and sentences; partly normalized (to modern Hung. spelling), partly morphologically tagged; locus markers	Download & Concordancer	freely available for everyone	Avail. through dedicated website	not yet
72	Corpus of Old and Middle Hungarian court records and private correspondence	Hungarian	16-18th century	850 000 words	tokenised, lemmatised, morphosyntactically tagged, sociolinguistic metadata added	Concordancer	freely available for everyone	Avail. through dedicated website	not yet
73	Mikes dictionary	Hungarian	1717-1761	1.5 million words	lemmatised	Concordancer	freely available for everyone	Avail. through dedicated website	not yet
74	Deutsches Textarchiv (German Text Archive, DTA)	German	1600–1900	211 million tokens (growing further)	TEI text structures; tokenized, lemmatized, POS, normalized orthography	Download, Corpus Search, Text-Image-Display	Creative Commons (CC BY-NC, CC BY-SA, CC BY)	* Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) -- CLARIN-D * manual transcription + TEI annotation, automatic linguistic annotation * wide range of text types	yes	DTA subcorpora
75	Dinglers Polytechnisches Journal (Polytechnical Journal of Dingler)	German	1820–1931	77.5 million tokens	TEI text structures; tokenized, lemmatized, POS, normalized orthography	Download, Corpus Search, Text-Image-Display	CC BY-NC-SA 3.0 DE	* Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) -- CLARIN-D * manual transcription + TEI annotation, automatic linguistic annotation	yes
76	Referenzkorpus Mittelhochdeutsch (Middle High German Reference Corpus)	German	1050–1350	2.5 million tokens	tokenized, lemmatized, POS, normalized orthography, morphosyntactic description	Download, Corpus Search	CC BY-SA 4.0 International	* Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) -- CLARIN-D * manual transcription + linguistic annotation	yes
77	Die Grenzboten (journal)	German	1842–1921	89 million tokens	basic TEI text structures; tokenized, lemmatized, POS, normalized orthography	Download, Corpus Search, Text-Image-Display	free	* Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) -- CLARIN-D * OCR, computer-aided TEI annotation, automatic linguistic annotation	yes
78	TreeTagger -- Middle High German parameter file	German; Middle High German	1100-1500	10 million tokens	tokenized, lemmatized, POS	Download	free	Institute for Natural Language Processing, University of Stuttgart, CLARIN-D, Middle High German Conceptual Database; CRETA	yes
79	OCR Post-correction	German: Antiqua and Fraktur	18th to 20th century			web application	free	Institute for Natural Language Processing, University of Stuttgart, CLARIN-D, OCR, post-correction, CRETA	yes
80	Part-of-speech tagging: mixed text	Latin, Middle English				web application	free	Institute for Natural Language Processing, University of Stuttgart, CLARIN-D, CRETA	yes
81	DDR-Presseportal (GDR press portal)	German	1945-1994	1.1 billion tokens	basic TEI text structures; tokenized, lemmatized, POS, normalized orthography	Corpus Search	CLARIN ACA	* Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) -- CLARIN-D * OCR, computer-aided TEI annotation, automatic linguistic annotation	no
82	Brieven als buit (Letters as loot)	Dutch	17th-18th century	460,000 words (1,000 letters)	manually transcribed (diplomatically), automatically lemmatised and grammatically tagged	concordancer	free	Dutch Language Institute (INT)	No
83	Corpus Gysseling	Dutch	13th century	1.5 million words	manually lemmatised and POS-tagged	concordancer, download	INT Licence for researchers	Dutch Language Institute (INT)	No
84	The Morpho-Syntactic Database of Mikael Agricola's Works	Finnish	1544-1551	83,678 Sentences; 428,314 Tokens; 38,308 Words	Turku Dependency Parser: keyword, part of speech, morphological components and syntactical function	Interface	CC BY ND	Kielipankki Korp	No
85	The Finnish Gutenberg Corpus	Finnish	up to 1925 (IPR expired)	2,457,531 Sentences; 34,487,420 Words		Interface	CC BY	Kielipankki Korp	Yes
86	Aleksis Kivi Corpus (SKS)	Finnish, Swedish	1834–1872	52,821 Sentences; 413,735 Words		Interface	CC BY NC	Kielipankki Korp	Yes
87	Finnish Folk Poetry	Multilingual	1564-1939	1,435,012 Sentences; 7,141,783 Words		Interface	CC BY NC	Kielipankki Korp	Yes
88	Classics of Finnish Literature, Kielipankki Version	Finnish	1880-1949	1,500,000 Words		Interface, Download	EUPL v.1.1 SA	Kielipankki Korp, Kielipankki Download	Yes
89	Corpus of Old Literary Finnish	Finnish	1543-1810	167,400 Sentences; 4,133,202 Words		Interface	EUPL v.1.1 SA	Kielipankki Korp	Yes
90	Corpus of Early Modern Finnish, Kielipankki Version	Finnish	1809-1899	8,600,000 Words		Interface	EUPL v.1.1 SA	Kielipankki Korp	Yes
91	The Letters of Paul Sinebrychoff, Kielipankki Version	Finnish, Swedish	1895-1909	100,000 Words		Interface	CC BY	Kielipankki Korp, Subcorpus Finnish, Subcorpus Swedish	Subcorpus Finnish No, subcorpus Swedish Yes
92	The Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version	Finnish, Swedish	1770-2011, appr. 10 corpora per decade	612,061,367 Sentences; 8,728,581,153 Words		Interface	CC BY	Kielipankki Korp, Subcorpus Finnish and Subcorpus Swedish see separate entries on this list, N-grams for both	Main corpus Yes, subcorpora No, N-grams for both subcorpora Yes
93	Classics Library of the National Library of Finland - Kielipankki version	Finnish, Swedish	1549-1944	692 works in Finnish, 285 works in Swedish will be available in the near future		Will be available in the near future at Interface, Download	CC BY	Kielipankki Korp, Kielipankki Download, Subcorpus Finnish, Subcorpus Swedish	No
94	Virtual Old Literary Finnish (VVKS) - Kielipankki Korp version	Finnish	1543-1791	48 Texts		Interface, Download	CC BY NC ND	Kielipankki Korp, Kielipankki Download	No
95	The Newspaper and Periodical OCR Corpus of the National Library of Finland (1771-1874)	Finnish, Swedish	1771-1874	15 GB		Download	CC BY	Kielipankki Download	Yes
96	The Newspaper and Periodical OCR Corpus of the National Library of Finland (1875-1920)	Finnish, Swedish	1875–1920	8,740,000,000 Tokens; 371 GB		Download	CLARIN ACA	Kielipankki Download	No
97	Open Richly Annotated Cuneiform Corpus, Korp Version	cuneiform	ancient	741,129 Tokens		Interface	CC BY SA	Kielipankki Korp	Yes
98	The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version	Finnish	1840-2011	5,246,334,710 Tokens		Interface	CC BY SA	Kielipankki Korp	No
99	The Swedish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version	Swedish	1770-1950	3,481,646,321 Tokens		Interface	CC BY SA	Kielipankki Korp	No
100	Corpus of Old Written Estonian	Estonian	1224-1227, 1485-1889	134 texts; total 2,155,435 tokens; total 1,718,114 tokens in Estonian	The texts are in the original written form. 16th-18th century texts have been tagged with contemporary Estonian, morphological and language information. 19th century texts are unannotated.	Interface	CC BY NC	CELR Meta-Share	Yes