1 | Timestamp | Name of the resource | URL | Type of resource | Language(s) | Size | Maximum length (number of words) of the annotated MWEs | Are the MWEs only contiguous or also non-contiguous? | Availability | Licence | Licence type | If you are a resource owner/developer and the resource is not available: are you interested in making it available (e.g. for research)? | Additional description of the resource | Other comments | Do you want to provide more detailed information? | Resource creator/owner | Contact email of the resource creator/owner | Relevant publications | Type of MWE description: Intensional or extensional | Size: the number of MWE base forms in the resource | Size: the number of MWE variants in the resource | Size: the number of variation patterns | Type(s) of MWEs | Special features | Grammatical framework | Lexical framework | Origin/source(s) of the MWEs in the resource | Sample entry |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 07/05/2014 06:50:01 | Lexicon of Arabic Modal Multiword Expressions and Repository of their Variation Patterns | http://www.rania-alsabbagh.com/am-mwe.html | MWE dictionary or lexicon (MWEs only) | Modern Standard Arabic, Egyptian Arabic | 10 K | 4 | Also non-contiguous | Available, unrestricted use | Creative Commons (CC): http://creativecommons.org/examples | Yes (click continue to fill in more information) | Rania Al-Sabbagh | alsabba1@illinois.edu | Rania Al-Sabbagh, Roxana Girju and Jana Diesner. 2014. Unsupervised Construction of a Lexicon and a Pattern Repository of Arabic Modal Multiword Expressions. In Proceedings of the 10th Workshop on Multiword Expressions at EACL 2014, Gothenburg, Sweden, April 26-27, 2014. | Extensional | Dictionary, repository of variation patterns | ||||||||||||
3 | 07/05/2014 07:57:39 | around 200 corpora for sixty languages | sketchengine.co.uk | Web service | 60 | computed at run time: millions | 20 | Only contiguous | Available, restricted use | yes | Terms and other MWEs will be available as a web service, as automatically identified (using grammar patterns and statistics over part-of-speech-tagged, lemmatised, very large corpora) | The survey doesn't fit our resources very well. Our resources are often the best there is for a language, so this is unfortunate | No (click continue to submit) | |||||||||||||||
4 | 07/05/2014 08:35:15 | National Corpus of Polish | http://clip.ipipan.waw.pl/NationalCorpusOfPolish | Treebank with MWE annotations | Polish | 20,000 multi-word named entities | 23 | Also non-contiguous | Available, unrestricted use | GNU GPL v.3 | GNU General Public Licence (GPL): http://www.gnu.org/licenses/gpl.html | The Named Entity level of the National Corpus of Polish is concerned. Its gold standard subcorpus, available under GPL, contains 87,300 NEs, annotated together with their nested NEs. As a result, annotation trees are provided. Over 22% of them are multi-word NEs. Coordinated NEs are annotated disjointly, which results in some discontinuities. | Yes (click continue to fill in more information) | Institute of Computer Science, Polish Academy of Sciences, with 3 partners | agata.savary@univ-tours.fr | WASZCZUK, J., GŁOWIŃSKA, K., SAVARY, A., PRZEPIÓRKOWSKI, A., LENART, M. (2013): Annotation tools for syntax and named entities in the National Corpus of Polish, in the International Journal of Data Mining, Modelling and Management, Vol. 5, No. 2, Inderscience Publishers, pp. 103-122, preprint. SAVARY, A., CHOJNACKA-KURAŚ, M., WESOŁEK, A., SKOWROŃSKA, D., ŚLIWIŃSKI, P. (2012), "Anotacja jednostek nazewniczych", in PRZEPIÓRKOWSKI, A., BAŃKO, M., GÓRSKI, R., LEWANDOWSKA-TOMASZCZYK, B. (eds.). Narodowy Korpus Języka Polskiego. Wydawnictwo Naukowe PWN, Warszawa, pp. 129--167. SAVARY, A., PISKORSKI, J. (2011), Language Resources for Named Entity Annotation in the National Corpus of Polish, in Control and Cybernetics 40(2), Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland, pp. 361-391. | corpus occurrences, their lemmas and other attributes | about 20,000 | Compound named entities: person names, organization names, geographical names, geopolitical names, dates, time expressions, as well as relative adjectives (e.g. Polish) and personal derivations (a varsovian) thereof. | Outermost NEs are annotated with all their nested NEs, e.g.: [National Corpus of [Polish]] The corpus is balanced with respect to different genres. | Corpus | [ [Irlandzkiej]relAdj(irlandzki;placeName(Irlandia)) Armii Republikańskiej ]orgName(Irlandzka Armia Republikańska) 'Irish Republican Army' | ||||||
5 | 07/05/2014 10:14:04 | ACL RD-TEC: a dataset for terminology extraction and classification | http://www.elra.info/Language-Resources-LRs.html | a terminological bank | English | 75,0000 entries | Only contiguous | Available, unrestricted use | ELRA, free for research | ELRA, free for research | yes | This is a terminological resource. Each entry is annotated as a valid or invalid term, and valid terms are further annotated as technology or non-technology terms. | No (click continue to submit) | |||||||||||||||
6 | 07/05/2014 15:30:18 | Comprehensive Multiword Expressions (CMWE) Corpus | http://www.ark.cs.cmu.edu/LexSem/ | Treebank with MWE annotations | English | 3500 instances (2400 types) | Also non-contiguous | Available, unrestricted use | CC-BY-SA | Creative Commons (CC): http://creativecommons.org/examples | This dataset provides human annotations of multiword expressions (MWEs) for sentences in social web reviews from the English Web Treebank corpus. 55,579 words (3,812 sentences, 723 documents) were annotated. MWEs are formed by grouping together words into strong (highly idiosyncratic) or weak (loosely collocational) expressions according to our English annotation guidelines (https://github.com/nschneid/nanni/wiki/MWE-Annotation-Guidelines). For example, I will sum_ it _up~with , it was worth_every_penny ! is annotated as containing 2 strong MWEs (sum_up, worth_every_penny) and 1 weak MWE (sum_up~with). These are comprehensive annotations, i.e., for each sentence, the annotator marked *all* expressions deemed MWEs. Every annotation was reviewed by at least two annotators. See (Schneider et al., LREC 2014) for details. The full text of the corpus is distributed by LDC. If you do not have access to the English Web Treebank you will only be able to see the annotated MWEs, not the surrounding context. A statistical system for MWE identification that was trained on this corpus is available at the same URL. | Yes (click continue to fill in more information) | Description of the corpus and annotation process: Nathan Schneider, Spencer Onuffer, Nora Kazour, Emily Danchik, Michael T. Mordowanec, Henrietta Conrad, and Noah A. Smith (2014). Comprehensive annotation of multiword expressions in a social web corpus. LREC. Annotation guidelines: https://github.com/nschneid/nanni/wiki/MWE-Annotation-Guidelines Description of MWE identification tool trained on the corpus: Nathan Schneider, Emily Danchik, Chris Dyer, and Noah A. Smith (2014). Discriminative lexical semantic segmentation with gaps: running the MWE gamut. Transactions of the Association for Computational Linguistics 2(April):193−206. http://www.cs.cmu.edu/~nschneid/mwe.pdf | multiword named entities; a wide variety of MWEs that are idiomatic in form, function, or frequency—this includes compounds, light/support verb constructions, verb particle constructions, prepositional verbs, phrasal idioms, and collocations. Each MWE instance is simply a "strong" or "weak" grouping of tokens; there is no explicit taxonomy of MWE categories. | Distinction between strong and weak MWEs (weak MWEs can contain nested strong MWEs as constituents). Gappy (non-contiguous) MWEs are allowed and other MWEs may occur inside the gap. | annotated directly in context | I will sum_ it _up~with , it was worth_every_penny ! is annotated as containing 2 strong MWEs (sum_up, worth_every_penny) and 1 weak MWE (sum_up~with). | |||||||||||
7 | 10/05/2014 07:52:01 | DICI (Dictionary of Italian Collocations) | no website | MWE dictionary or lexicon (MWEs only) | Italian | 11 | 3 | Also non-contiguous | yes | It is still a work in progress | No (click continue to submit) | |||||||||||||||||
8 | 12/05/2014 06:10:46 | Dictionary Development Process list of semantic domains. | http://semdom.org/ | list of domains | English (The materials have been translated into a number of other languages. The translations can be downloaded from http://rapidwords.net/. However I do not know the quality or naturalness of the translations.) | None of the MWEs are annotated, except for being tagged for semantic domain. | 6 | Also non-contiguous | Available, unrestricted use | Creative Commons--Share Alike | Creative Commons (CC): http://creativecommons.org/examples | The list of domains includes example words and MWEs for each domain. The list is posted at http://semdom.org/. The list can be downloaded from http://rapidwords.net/. | No (click continue to submit) | |||||||||||||||
9 | 15/05/2014 12:54:31 | Wiktionary English phrasal verbs | http://en.wiktionary.org/wiki/Category:English_phrasal_verbs | Monolingual list of MWEs | English | 2110 | English verbs accompanied by particles, such as prepositions and adverbs. | No (click continue to submit) | ||||||||||||||||||||
10 | 15/05/2014 12:56:03 | Wiktionary English idioms | http://en.wiktionary.org/wiki/Category:English_idioms | Monolingual list of MWEs | English | 7894 | English phrases understood by subjective, as opposed to literal meanings. | No (click continue to submit) | ||||||||||||||||||||
11 | 20/05/2014 01:07:56 | Proposition Bank | https://catalog.ldc.upenn.edu/LDC2004T14 | Treebank with MWE annotations | English | Also non-contiguous | Available, restricted use | LDC User Agreement for Non-Members Subscription & Standard Members, and Non-Members | LDC User Agreement for Non-Members | PropBank annotation was developed to provide training data for supervised machine learning classifiers. It provides semantic information, including the basic “who is doing what to whom,” in the form of predicate-by-predicate semantic role assignments. The annotation involves selection of a roleset, a coarse-grained sense of the predicate, which has a listing of the roles expressed as argument numbers associated with that sense. E.g., the roleset for Take.01: Take.01: acquire, come to have, choose, bring Arg0: Taker Arg1: Thing taken Arg2: Taken-from, source of thing taken Arg3: Destination The roleset and example sentences from frame files serve as a guide to annotators on how to assign argument numbers to annotation instances. The goal is to assign these labels across the many possible syntactic realizations of the same semantic role. The recent expansion of PB to provide coverage for noun, adjective, and complex predicates such as MWEs has enriched the semantics that PB is able to capture, but it has created an overwhelming number of new rolesets. To alleviate this, PB has opted to begin unifying frame files through a process of ‘aliasing’ (Bonial et al., 2014), in which related concepts are aliased to each other and unified so that there is a single roleset representing all instantiations. Extending aliasing to a variety of MWEs is explored, such that take it easy, as in “I’m just going to take it easy,” would be aliased to the existing lexical verb roleset for relax. | Type of resource: Proposition bank on top of a treebank For further information, see https://catalog.ldc.upenn.edu/. References: Claire Bonial, Julia Bonn, Kathryn Conger, Jena D. Hwang and Martha Palmer. In preparation. PropBank: Semantics of New Predicate Types. Proceedings of the Language Resources and Evaluation Conference - LREC-2014. Reykjavik, Iceland. | Yes (click continue to fill in more information) | Martha Palmer | Martha.Palmer@colorado.edu | http://verbs.colorado.edu/~mpalmer/projects/ace.html | Extensional | Phrasal verbs, light verb constructions, verbal expressions. | This is a proposition bank on top of a treebank. | Roleset id: take.26 , Project anger on someone, idiomatic, Source: , vncls: , framnet: take.26: Roleset added due to instances in CallHome corpus. Framed by Claire. No VN class. Roles: Arg0-PAG: angry person Arg1-PPT: usually "it", thing causing anger Arg2-GOL: person anger is projected on Example: Typical Usage person: ns, tense: ns, aspect: ns, voice: ns, form: ns Whether they take it out on Governor Schwartzeneggar in California could be another test of that as well. Arg0: they Rel: [take][out] Arg1: it Arg2: on Governor Schwartzeneggar Argm-loc: in California | |||||||||
12 | 20/05/2014 10:00:59 | Lassy Small | http://www.let.rug.nl/~vannoord/Lassy/ | Treebank with MWE annotations | Dutch | 30.557 | 57 | Also non-contiguous | Available, unrestricted use | academic free, fee for commercial use see http://tst-centrale.org/nl/producten/corpora/lassy-klein-corpus/6-66?cf_product_name=Lassy+Klein-corpus http://tst-centrale.org/nl/producten/corpora/lassy-klein-corpus-commercieel/6-83?cf_product_name=Lassy+Klein-corpus+commercieel | LASSY (Large Scale Syntactic Annotation of written Dutch) is a STEVIN project. STEVIN is a Flemish-Dutch Language and Speech Processing Technology Programme launched by de Nederlandse Taalunie. The STEVIN programme office is run jointly by NWO Humanities Division and SenterNovem. A large corpus of written Dutch texts (1,000,000 words) has been syntactically annotated (manually corrected), based on D-COI and its successor. In addition, a very large corpus (almost 700,000,000 words) has been syntactically annotated automatically. The project extends the available syntactically annotated corpora for Dutch both in size as well as with respect to the various text genres and topical domains. In addition, various browse and search tools for syntactically annotated corpora have been developed and made available. Their potential for applications in corpus linguistics and information extraction is illustrated and evaluated in a series of case studies. See also @incollection{van2013large, title={Large scale syntactic annotation of written Dutch: Lassy}, author={Van Noord, Gertjan and Bouma, Gosse and Van Eynde, Frank and De Kok, Daniel and Van der Linde, Jelmer and Schuurman, Ineke and Sang, Erik Tjong Kim and Vandeghinste, Vincent}, booktitle={Essential Speech and Language Technology for Dutch}, pages={147--164}, year={2013}, publisher={Springer} } | No (click continue to submit) | ||||||||||||||||
13 | 20/05/2014 10:12:47 | Alpino Treebank | http://www.let.rug.nl/~vannoord/trees/ | Treebank with MWE annotations | Dutch | 2704 | 11 | Only contiguous | Available, unrestricted use | no licence | The Alpino treebank contains syntactically annotated Dutch sentences. The treebank (more than 150,000 words) includes the full cdbl (newspaper) part of the Eindhoven corpus. The Alpino Treebank was released in 2002. In the meantime, our treebanking efforts have led to various corrections of the actual annotations, improvements of the various tools we use, and differences in the actual XML-format that we use for the annotations. | Yes (click continue to fill in more information) | Gertjan van Noord | g.j.m.van.noord@rug.nl | Robert Malouf, Gertjan van Noord. Wide Coverage Parsing with Stochastic Attribute Value Grammars. In: IJCNLP-04 Workshop Beyond Shallow Analyses - Formalisms and statistical modeling for deep analyses. Leonoor van der Beek, Gosse Bouma, Robert Malouf, Gertjan van Noord. The Alpino Dependency Treebank. In: Computational Linguistics in the Netherlands CLIN 2001. Rodopi 2002. Leonoor van der Beek, Gosse Bouma, and Gertjan van Noord. Een brede computationele grammatica voor het Nederlands. Nederlandse Taalkunde, 2002. Gosse Bouma and Geert Kloosterman. Querying dependency treebanks in XML. In Proceedings of the Third international conference on Language Resources and Evaluation (LREC), Gran Canaria, 2002. Gosse Bouma, Gertjan van Noord, Robert Malouf. Alpino: Wide Coverage Computational Analysis of Dutch. In: Computational Linguistics in the Netherlands CLIN 2000. Rodopi 2001. | treebank | named entities idiomatic expressions foreign language | dependency treebank | Corpus | |||||||||
14 | 20/05/2014 10:27:11 | DuELME | http://tst-centrale.org/nl/producten/lexica/duelme/7-35?cf_product_name=DuELME | MWE dictionary or lexicon (MWEs only) | Dutch | 5000 | Also non-contiguous | Available, restricted use | academic free, fee for commercial use | TST-Centrale | The paper describes a 5.000 entry corpus-based multi-word expression lexical database for Dutch developed using these methods. The database has been externally validated, and its usability has been evaluated in NLP-systems for Dutch. The MWE database developed fills a gap in existing lexical resources for Dutch. The generic methods and tools for MWE identification and lexical representation focus on Dutch, but they are largely language-independent and can also be used for other languages, new domains, and beyond this project. The research results and data described in this paper have therefore significantly contributed to strengthening the digital infrastructure for Dutch, and will continue to do so in the context of the CLARIN research infrastructure. | Yes (click continue to fill in more information) | @incollection{odijk2013identification, title={Identification and lexical representation of multiword expressions}, author={Odijk, Jan}, booktitle={Essential Speech and Language Technology for Dutch}, pages={201--217}, year={2013}, publisher={Springer} } | not sure | mostly verbal expressions | LMF version exists | generic | Dictionary, Corpus | ||||||||||
15 | 26/05/2014 17:25:45 | Pattern Dictionary of English Prepositions (PDEP) | http://www.clres.com/db/TPPEditor.html | Dictionary or lexicon with MWEs (also includes MWEs) | English | Approximately 270 English phrasal prepositions | 4 | Only contiguous | Available, unrestricted use | GNU General Public Licence (GPL): http://www.gnu.org/licenses/gpl.html | The Pattern Dictionary of English Prepositions (PDEP) provides a comprehensive inventory of English prepositions, including phrasal prepositions. PDEP provides a sense-annotated corpus for these prepositions and characterizes their behavior in prototypical syntagmatic patterns. Included in this description is the class to which each sense belongs, enabling an examination of properties across prepositions, such as spatial or temporal prepositions. | No (click continue to submit) | ||||||||||||||||
16 | 30/05/2014 11:00:12 | Multilingual Collocation Dictionary | no website | Dictionary or lexicon with MWEs (also includes MWEs) | French, Romanian, German | 250 multilingual entries | 3 | Also non-contiguous | Available, restricted use | CC-BY-NC academic use only, no derivatives | Creative Commons (CC): http://creativecommons.org/examples | yes | The multilingual dictionary contains trilingual entries (verbo-nominal collocations) for French, for Romanian and for German. We represent verbo-nominal collocations, with their morpho-syntactic properties (preference for specific number, case or gender, for voice, for some prepositions). Examples extracted from corpora and their frequency are also available | No (click continue to submit) | ||||||||||||||
17 | 02/06/2014 21:59:15 | Collection of Distributionally Idiosyncratic Items (CoDII) | http://www.english-linguistics.de/codii/ | Multilingual list of MWEs | English, German | English: < 100 German: > 400 | 4 | Also non-contiguous | Available, unrestricted use | yes | The Collection of Distributionally Idiosyncratic Items (CoDII) is a linguistic resource on lexical items which have highly idiosyncratic occurrence patterns, such as bound words. So, rather than being a general MWE resource, only bound words (and the expressions containing them) are documented. The bound words and the corresponding expressions can be downloaded as txt files from: http://multiword.sourceforge.net/PHITE.php?sitesig=FILES&page=FILES_20_Data_Sets Files: German_CE_Trawinski, English_CE_Trawinski | Yes (click continue to fill in more information) | Frank Richter, Beata Trawinski, Manfred Sailer | sailer@em.uni-frankfurt.de | http://www.english-linguistics.de/codii/index.html | Intensional | MWEs with bound words | Various linguistic classifications of the MWE are included | manually, based on the phraseological literature | |||||||||
18 | 06/06/2014 17:35:05 | English DELA e-dictionary | http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html | Dictionary or lexicon with MWEs (also includes MWEs) | English | 296,606 simple word forms for 150,145 different lemmas 132,990 multi-word forms for 69,912 different lemmas | 8 | Only contiguous | Available, unrestricted use | LGPL-LR | GNU General Public Licence (GPL): http://www.gnu.org/licenses/gpl.html | The file contains inflected forms and lemmas for both single and compound words. Example of a compound entry: waves of immigrants,wave of immigrants.N+NPN+z1:p Inflected form: waves of immigrants Lemma: wave of immigrants Category: N (noun) Syntactic structure: NPN (noun preposition noun) Popularity: z1 (frequently used) Morphological features: p (plural) | Yes (click continue to fill in more information) | CHROBOT, A., COURTOIS, B., HAMANI, M., GROSS, M., ZELLAGUI, K. | agata.savary@univ-tours.fr | SAVARY, A. (2000): Recensement et description des mots composés - méthodes et applications. Thèse de doctorat en Informatique Fondamentale (PhD Thesis), Université de Marne-la-Vallée. (in French) | Extensional | 69912 | 132990 | Contiguous general language MWEs, mainly compound nouns and adjectives. | Popularity: z1 (frequently used) | None | Corpus processors: Unitex, NooJ | Dictionary | waves of immigrants,wave of immigrants.N+NPN+z1:p Inflected form: waves of immigrants Lemma: wave of immigrants Category: N (noun) Syntactic structure: NPN (noun preposition noun) Popularity: z1 (frequently used) Morphological features: p (plural) | |||
19 | 06/06/2014 17:37:09 | French DELA | http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html | Dictionary or lexicon with MWEs (also includes MWEs) | French | 683,824 forms of simple words for 102,073 different lemmas 108,436 compound forms for 83,604 different lemmas | Only contiguous | Available, unrestricted use | LGPL-LR | GNU General Public Licence (GPL): http://www.gnu.org/licenses/gpl.html | No (click continue to submit) | |||||||||||||||||
20 | 06/06/2014 17:49:29 | SAWA - a Grammatical Lexicon of Warsaw Urban Proper Names | http://zil.ipipan.waw.pl/SAWA | MWE dictionary or lexicon (MWEs only) | Polish | 300000 | 6 | Only contiguous | Available, unrestricted use | CC BY-SA | Creative Commons (CC): http://creativecommons.org/examples | Contains proper names of the places and institutions related to the Warsaw transportation system. Almost all of the names are multi-word. | Yes (click continue to fill in more information) | IPIPAN Warsaw | Malgorzata.Marciniak@ipipan.waw.pl | SAVARY, A., RABIEGA-WIŚNIEWSKA, J., WOLIŃSKI, M. (2009): Inflection of Polish Multi-Word Proper Names with Morfeusz and Multiflex, in MARCINIAK, M., MYKOWIECKA, A. (eds.) "Aspects of Natural Language Processing", Lecture Notes in Computer Science 5070, Springer Verlag, pp. 111–141. http://www.info.univ-tours.fr/~savary/Papers/savary-et-al-LNAI-2009.pdf MARCINIAK, M., RABIEGA-WIŚNIEWSKA, J., SAVARY, A., WOLIŃSKI, M., HELIASZ, C. (2009): Constructing an Electronic Dictionary of Polish Urban Proper Names, in Recent Advances in Intelligent Information Systems (Proceedings of the Balto-Slavonic Natural Language Processing Workshop, Kraków), Academic Publishing House EXIT, Warsaw, pp. 743–749. http://www.info.univ-tours.fr/~savary/Papers/marciniak-et-al-BSNLP-2009.pdf | both | 9000 | 300000 | 450 | Proper names of places and institutions related to the Warsaw transportation system (streets, squares, bus stops, bridges, people after whom streets are named, etc.) | Includes old variants of street and square names, notably those before 1989. Nested NEs are delimited and factorized. Morphosyntactic variants are represented. | Multiflex (http://www.springerlink.com/content/n265j22n73084433/), Morfeusz (http://sgjp.pl/morfeusz/) | Dictionary, Institutional lists of streets and bus stops | Intensional entry: ulica(ulica:subst:sg:nom:f) {Aleksandra Bardiniego}(Aleksander Bardini:subst:sg:gen:m1),subst(NC-O_N-ulica-OSOBY) 'Aleksander Bardini Street' Extensional entry: ul. A. Bardiniego,ulica Aleksandra Bardiniego:subst:sg:loc:f | |||
21 | 06/06/2014 18:01:00 | SEJFEK - Grammatical Lexicon of Polish Economic Phraseology | http://zil.ipipan.waw.pl/SEJFEK | MWE dictionary or lexicon (MWEs only) | Polish | 146,861 inflected forms | 10 | Only contiguous | Available, unrestricted use | CC BY-SA | Creative Commons (CC): http://creativecommons.org/examples | Yes (click continue to fill in more information) | Filip Makowiecki, Agata Savary | agata.savary@univ-tours.fr | SAVARY, A., ZABOROWSKI, B., KRAWCZYK-WIECZOREK, A., MAKOWIECKI, F. (2012): SEJFEK — a Lexicon and a Shallow Grammar of Polish Economic Multi-Word Units, in Proceedings of Cognitive Aspects of the Lexicon (COGALEX-III), a Workshop at COLING 2012, Mumbai, India. http://aclweb.org/anthology//W/W12/W12-5116.pdf | both | 11212 | 146861 | 305 | Multi-word nominal terms from the domain of economy and finance | Nested MWEs are delimited and factorized. | Multiflex (http://www.springerlink.com/content/n265j22n73084433/), Morfeusz (http://sgjp.pl/morfeusz/), Toposław (http://zil.ipipan.waw.pl/Toposlaw), Unitex (http://igm.univ-mlv.fr/~unitex/) | Dictionary, Internet | Intensional entry: założenie(założenie:subst:sg:nom:n2) {lokaty bankowej}(lokata bankowa:subst:sg:gen:f),subst(NC-O_N-nb-inv-pl) Extensional entry: założenia lokat bankowych,założenie lokaty bankowej:subst:sg:gen:n2 | ||||
22 | 06/06/2014 18:15:30 | Prolexbase | http://zil.ipipan.waw.pl/Prolexbase | multilingual relational database of named entities with MWEs | Polish, English, French | 320000 | 16 | Only contiguous | Available, unrestricted use | CC BY-SA | Creative Commons (CC): http://creativecommons.org/examples | Yes (click continue to fill in more information) | Małgorzata Baron, Béatrice Bouchou Markhoff, Leszek Manicki, Denis Maurel, Agata Savary, Mickaël Tran, Duško Vitas | agata.savary@univ-tours.fr | Maurel, D. (2008): Prolexbase: a Multilingual Relational Lexical Database of Proper Names. In proceedings of LREC 2008, Marrakech, Morocco. http://www.lrec-conf.org/proceedings/lrec2008/summaries/91.html Savary, A., Manicki, L., Baron, M. (2013): Populating a Multilingual Ontology of Proper Names from Open Sources. In Journal of Language Modelling, Vol 2, No. 2, pp. 189-225. http://jlm.ipipan.waw.pl/ojs/index.php/JLM/article/view/63 | both | 173000 | 320000 | Proper names, most of which are multi-word units. | Semantic network with interlingual links. Relations of synonymy, meronymy, etc. between the named objects. Relative adjectives and inhabitant names. All data are manually validated. | Dictionary, Wikipedia, Geonames | |||||||
23 | 13/06/2014 16:44:26 | Szeged TreebankFX | http://www.inf.u-szeged.hu/rgai/mwe | Treebank with MWE annotations | Hungarian | 6734 light verb constructions | 3 | Also non-contiguous | Available, restricted use | academic use only | own licencing | The Szeged Treebank is a morphosyntactically tagged and syntactically annotated database, which is available in both constituency-based and dependency-based versions. All texts in the corpus are manually annotated for LVCs. The corpus contains 6734 occurrences of 1215 LVCs altogether in 82,099 sentences. | Yes (click continue to fill in more information) | University of Szeged, Department of Informatics | vinczev@inf.u-szeged.hu | Vincze, Veronika 2011: Semi-Compositional Noun + Verb Constructions: Theoretical Questions and Computational Linguistic Analyses. PhD thesis, University of Szeged, August 2011. Vincze, Veronika; Csirik, János 2010: Hungarian Corpus of Light Verb Constructions. In: Proceedings of COLING 2010, Beijing, China, pp. 1110-1118. Vincze, Veronika; Zsibrita, János; Nagy T., István 2013: Dependency Parsing for Identifying Hungarian Light Verb Constructions. In: Proceedings of IJCNLP 2013, pp. 207-215. | Extensional | 1215 | light verb constructions | verbal, participial and nominal occurrences are also annotated; non-adjacent LVCs are annotated | constituency and dependency grammar | manually annotated | ||||||
24 | 15/06/2014 15:34:26 | The Grammatical Lexicon of Polish Phraseology (SEJF = Słownik elektroniczny jednostek frazeologicznych) | http://zil.ipipan.waw.pl/SEJF | MWE dictionary or lexicon (MWEs only) | Polish | 3200 multi-word lexemes, 68,000 corresponding inflected forms | 6 | Only contiguous | Available, unrestricted use | CC BY-SA license. | Creative Commons (CC): http://creativecommons.org/examples | Yes (click continue to fill in more information) | Monika Czerepowicka (lexicography) and Agata Savary (automatic inflection and validation) | czerepowicka@gmail.com | GRALIŃSKI, F., SAVARY, A., CZEREPOWICKA, M., MAKOWIECKI, F. (2010): Computational Lexicography of Multi-Word Units: How Efficient Can It Be?, in Proceedings of Multiword Expressions: from Theory to Applications (MWE 2010), Workshop at COLING 2010, Beijing, China, August 28. CZEREPOWICKA, M., KOSEK, I. (2011): Problemy opisu związków frazeologicznych w formalizmie „Multifleks” (na przykładzie rodzaju wyrażeń frazeologicznych), in "Różne formy, różne treści", pp. 117–126, Warszawa 2011. CZEREPOWICKA, M. (2011): „Toposław” jako narzędzie znakowania jednostek wieloczłonowych, in Matusiak-Kempa, I., Przybyszewski, S. (eds.) Nowe zjawiska w języku, tekście, komunikacji. Kontekst a komunikacja, Olsztyn, pp. 28–35. CZEREPOWICKA, M. (2014): Jednostki obce w słowniku języka polskiego na przykładzie "Słownika elektronicznego jednostek frazeologicznych" (SEJF), in LingVaria IX (2014), 1 (17), doi: 10.12797/LV.09.2014.17.04, pp. 59-68. | Extensional | 3200 | 68000 | 160 graph-based inflection paradigms | The Dictionary contains mainly multi-word nouns (2121 lemmas) and adverbs (604), adjectives (446) and others of general (non terminological) Polish language. | SEJF can code nested MWEs. | <CATEGORIES> Nb : sg , pl Case: nom, gen, dat, acc, inst, loc, voc Gen: m1, m2, m3, f, n1, n2, p1, p2, p3 Pers: pri, sec, ter Deg: pos, com, sup Asp: imperf, perf Neg: aff, neg Accent: akc, nakc Postprep : praep, npraep Accom: congr, rec Agglt: nagl, agl Vocal: wok, nwok <EXTRA_CATEGORIES> Usage: <E>,offic, neut, spok <GRAPHICAL_CATEGORIES> LetterCase: same, all_lower, all_upper, first_upper,first_upper_each_word,no_letter_case,other Init:<E>,dot,no_dot,dot2,no_dot2,dot3,no_dot3,dot4,no_dot4,dot5,no_dot5 Dot : pun , npun <CLASSES> subst: (Nb,<var>),(Case,<var>),(Gen,<fixed>),(Usage,<var>) depr: (Nb,<fixed>),(Case,<var>),(Gen,<fixed>) num: (Nb,<fixed>),(Case,<var>),(Gen,<var>),(Accom,<var>) numcol: (Nb,<fixed>),(Case,<var>),(Gen,<fixed>),(Accom,<var>) adj: (Nb,<var>),(Case,<var>),(Gen,<var>),(Deg,<var>) adja: adjc: adjp: adv: (Deg,<var>) ppron12: (Nb,<fixed>),(Case,<var>),(Gen,<var>),(Pers,<fixed>),(Accent,<var>) ppron3: (Nb,<var>),(Case,<var>),(Gen,<var>),(Pers,<fixed>),(Accent,<var>),(Postprep,<var>) siebie: (Case,<var>) fin: (Nb,<var>),(Pers,<var>),(Asp,<fixed>) bedzie: (Nb,<var>),(Pers,<var>),(Asp,<fixed>) aglt: (Nb,<var>),(Pers,<var>),(Asp,<fixed>),(Vocal,<var>) praet: (Nb,<var>),(Gen,<var>),(Asp,<fixed>),(Agglt,<var>) impt: (Nb,<var>),(Pers,<var>),(Asp,<fixed>) imps:(Asp,<fixed>) inf:(Asp,<fixed>) pcon:(Asp,<fixed>) pant:(Asp,<fixed>) ger: (Nb,<var>),(Case,<var>),(Gen,<fixed>),(Asp,<fixed>),(Neg,<var>) pact: (Nb,<var>),(Case,<var>),(Gen,<var>),(Asp,<fixed>),(Neg,<var>) ppas: (Nb,<var>),(Case,<var>),(Gen,<var>),(Asp,<fixed>),(Neg,<var>) winien: (Nb,<var>),(Gen,<var>),(Asp,<fixed>) pred: prep:(Case,<fixed>) conj: qub: xxs: (Nb,<var>),(Case,<var>),(Gen,<fixed>) xxx: ign: interp: sp: burk: brev:(Dot,<fixed>) | Dictionary, Corpus, collected manually | |||||
25 | 18/06/2014 08:26:35 | List of Hungarian light verb constructions | http://www.inf.u-szeged.hu/rgai/mwe | Monolingual list of MWEs | Hungarian | 3 | Also non-contiguous | Available, unrestricted use | Light verb constructions were collected from the manually annotated corpora Szeged TreebankFX and SzegedParalellFX. The list contains their base forms. | Yes (click continue to fill in more information) | University of Szeged, Department of Informatics | vinczev@inf.u-szeged.hu | Vincze, Veronika 2012: Light Verb Constructions in the SzegedParalellFX English-Hungarian Parallel Corpus. In: Proceedings of the Eighth Conference on International Language Resources and Evaluation (LREC 2012). Istanbul, Turkey, pp. 2381-2388. Vincze, Veronika; Csirik, János 2010: Hungarian Corpus of Light Verb Constructions. In: Proceedings of COLING 2010, Beijing, China, pp. 1110-1118. | light verb constructions | Corpus | |||||||||||||
26 | 18/06/2014 08:28:51 | List of English light verb constructions | http://www.inf.u-szeged.hu/rgai/mwe | Monolingual list of MWEs | English | 3 | Also non-contiguous | Available, unrestricted use | Light verb constructions were collected from the manually annotated corpora Wiki50 and SzegedParalellFX. Their base forms are included in the list. | Yes (click continue to fill in more information) | University of Szeged, Department of Informatics | vinczev@inf.u-szeged.hu | Vincze, Veronika; Nagy T., István; Berend, Gábor 2011: Multiword expressions and Named Entities in the Wiki50 corpus. In: Proceedings of RANLP 2011. Hissar, Bulgaria, pp. 289-295. Vincze, Veronika 2012: Light Verb Constructions in the SzegedParalellFX English-Hungarian Parallel Corpus. In: Proceedings of the Eighth Conference on International Language Resources and Evaluation (LREC 2012). Istanbul, Turkey, pp. 2381-2388. | light verb constructions | Corpus | |||||||||||||
27 | 18/06/2014 08:33:25 | Bilingual list of English-Hungarian light verb constructions | http://www.inf.u-szeged.hu/rgai/mwe | Multilingual parallel list of MWEs | English, Hungarian | 3 | Also non-contiguous | Available, unrestricted use | Light verb constructions from the manually annotated SzegedParalellFX corpus were collected and the English and Hungarian equivalents were matched. Their verbal counterparts are also provided (if any). | Yes (click continue to fill in more information) | University of Szeged, Department of Informatics | vinczev@inf.u-szeged.hu | Vincze, Veronika 2012: Light Verb Constructions in the SzegedParalellFX English-Hungarian Parallel Corpus. In: Proceedings of the Eighth Conference on International Language Resources and Evaluation (LREC 2012). Istanbul, Turkey, pp. 2381-2388. | light verb constructions | Corpus | observe make an observation megfigyelést tesz megfigyel | ||||||||||||
28 | 19/06/2014 11:19:09 | SALDO | http://spraakbanken.gu.se/resurs/saldo | Dictionary or lexicon with MWEs (also includes MWEs) | Swedish | about 7500 | 10 | Also non-contiguous | Available, unrestricted use | CC-BY | Creative Commons (CC): http://creativecommons.org/examples | SALDO is a lexical-semantic resource, differently organized from a wordnet. It treats single-word items and MWEs in the same way, in the sense that MWEs have a part of speech and an inflectional paradigm but no internal structure. See the following references for more information: @article{Borin-Lars2013-9, title = "SALDO: a touch of yin to WordNet's yang", journal = "Language resources and evaluation", author = "Borin, Lars and Forsberg, Markus and Lönngren, Lennart", year = "2013", volume = "47", number = "4", url = "http://dx.doi.org/10.1007/s10579-013-9233-4", pages = "1191--1211", } @article{Borin-Lars2013-6, title = "Close encounters of the fifth kind: Some linguistic and computational aspects of the Swedish FrameNet++ project", journal = "Veredas", author = "Borin, Lars and Forsberg, Markus and Lyngfelt, Benjamin", year = "2013", volume = "17", number = "1", url = "http://www.ufjf.br/revistaveredas/files/2013/11/2-BORIN-FORSBERG-LINGFELT-FINAL.pdf", pages = "28--43", } | ||||||||||||||||
29 | 19/06/2014 11:24:43 | Swedish FrameNet (SweFN) | http://spraakbanken.gu.se/eng/swefn | Dictionary or lexicon with MWEs (also includes MWEs) | Swedish | a few thousand | Also non-contiguous | Available, unrestricted use | CC-BY | Creative Commons (CC): http://creativecommons.org/examples | SweFN is a framenet for Swedish. It reuses Berkeley FrameNet frames as much as possible and also adds new frames. The word sense inventory used for identifying lexical units is that of SALDO, hence any MWE found in SALDO is a candidate for a lexical unit in SweFN. | |||||||||||||||||
30 | 19/06/2014 11:35:27 | Swedish FrameNet++ (SweFN++) | http://spraakbanken.gu.se/eng/swefn | Dictionary or lexicon with MWEs (also includes MWEs) | Swedish English a number of South Asian languages Finnish | about 8000 | 10 | Also non-contiguous | Available, unrestricted use | CC-BY | Creative Commons (CC): http://creativecommons.org/examples | Swedish FrameNet++ is a lexical macroresource created by interlinking a number of freely available digital lexical resources. As opposed to most such endeavors (e.g. BabelNet, UBY, Etymological WordNet, etc.), SweFN++ is not only based on automatic processing of the resources, but a considerable amount of manual post-correction and qualified linguistic and lexicographic work has gone into this effort. The resources are interlinked using the sense and form-unit PIDs of SALDO, the pivot resource of SweFN++. Part of the sense inventory is linked to other languages through WordNet synsets and IDS/LWT identifiers. See the following references: @article{Borin-Lars2013-6, title = "Close encounters of the fifth kind: Some linguistic and computational aspects of the Swedish FrameNet++ project", journal = "Veredas", author = "Borin, Lars and Forsberg, Markus and Lyngfelt, Benjamin", year = "2013", volume = "17", number = "1", url = "http://www.ufjf.br/revistaveredas/files/2013/11/2-BORIN-FORSBERG-LINGFELT-FINAL.pdf", pages = "28--43", } @incollection{Borin-Lars2013-15, title = "The Intercontinental Dictionary Series – a rich and principled database for language comparison", booktitle = "Approaches to Measuring Linguistic Differences / ed. by Lars Borin ; Anju Saxena ", author = "Borin, Lars and Comrie, Bernard and Saxena, Anju", year = "2013", publisher = "De Gruyter Mouton", address = "Berlin", isbn = "978-3-11-030525-8", pages = "285--302", } | ||||||||||||||||
31 | 19/06/2014 14:43:44 | Reference data for Collocation Extraction | http://ufal.mff.cuni.cz/~pecina/resources.html | MWE dictionary or lexicon (MWEs only) | Czech | approx. 12,000 | 2 | Also non-contiguous | Available, unrestricted use | CC-BY-NC | Creative Commons (CC): http://creativecommons.org/examples | yes | Annotated list of dependency bigrams occurring in the PDT more than five times and having part-of-speech patterns that can possibly form a collocation. Each bigram is assigned to one of the six MWE categories described below by three annotators. | Yes (click continue to fill in more information) | Pavel Pecina | pecina@ufal.mff.cuni.cz | Pavel Pecina. Lexical association measures and collocation extraction. Language Resources and Evaluation, 44, pages 137-158, 2010. Pavel Pecina. Reference Data for Czech Collocation Extraction. In Proceedings of the LREC 2008 Workshop Towards a Shared Task for Multiword Expressions, pages 11-14, Marrakech, Morocco, 2008. Pavel Pecina and Pavel Schlesinger: Combining Association Measures for Collocation Extraction. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL 2006), Sydney, Australia, July 2006. | Intensional | 12232 | 1. stock phrases 2. names of persons, organizations, geographical locations, and 3. support verb constructions 4. technical terms 5. idiomatic expressions | dependency | Corpus | geometrický AI1A Atr prostor NI-A Head | |||||
32 | 19/06/2014 18:24:24 | SEJF - The Grammatical Lexicon of Polish Phraseology (SEJF = Słownik Elektroniczny Jednostek Frazeologicznych) | http://zil.ipipan.waw.pl/SEJF | MWE dictionary or lexicon (MWEs only) | Polish | The lexicon contains about 3200 multi-word lexemes, 68,000 corresponding inflected forms. | 6 | Only contiguous | Available, unrestricted use | The lexicon contains about 3200 multi-word lexemes, 68,000 corresponding inflected forms | Creative Commons (CC): http://creativecommons.org/examples | Yes (click continue to fill in more information) | Monika Czerepowicka (lexicography) and Agata Savary (automatic inflection and validation) | czerepowicka@gmail.com | GRALIŃSKI, F., SAVARY, A., CZEREPOWICKA, M., MAKOWIECKI, F. (2010): Computational Lexicography of Multi-Word Units: How Efficient Can It Be?, in Proceedings of Multiword Expressions: from Theory to Applications (MWE 2010), Workshop at COLING 2010, Beijing, China, August 28. CZEREPOWICKA, M., KOSEK, I. (2011): Problemy opisu związków frazeologicznych w formalizmie „Multifleks” (na przykładzie rodzaju wyrażeń frazeologicznych), in "Różne formy, różne treści", pp. 117–126, Warszawa 2011. CZEREPOWICKA, M. (2011): „Toposław” jako narzędzie znakowania jednostek wieloczłonowych, in Matusiak-Kempa, I., Przybyszewski, S. (eds.) Nowe zjawiska w języku, tekście, komunikacji. Kontekst a komunikacja, Olsztyn, pp. 28–35. CZEREPOWICKA, M. (2014), Jednostki obce w słowniku języka polskiego na przykładzie "Słownika elektronicznego jednostek frazeologicznych" (SEJF), LingVaria 2014 (IX), z. 1 (17), s. 59-68 [doi: 10.12797/LV.09.2014.17.04]. | Extensional | about 3200 | about 68000 | 160 graph-based inflection paradigms | SEJF contains mainly nominal (2121 units) and also adjectival (446) and adverbial (604) compounds of the general (non terminological) Polish language. | The morphosyntactic tagset uses the following categories: Nb : sg , pl Case: nom, gen, dat, acc, inst, loc, voc Gen: m1, m2, m3, f, n1, n2, p1, p2, p3 Pers: pri, sec, ter Deg: pos, com, sup Asp: imperf, perf Neg: aff, neg Accent: akc, nakc Postprep : praep, npraep Accom: congr, rec Agglt: nagl, agl Vocal: wok, nwok Each unit is annotated as one of the following classes: subst: (Nb,<var>),(Case,<var>),(Gen,<fixed>),(Usage,<var>) depr: (Nb,<fixed>),(Case,<var>),(Gen,<fixed>) num: (Nb,<fixed>),(Case,<var>),(Gen,<var>),(Accom,<var>) numcol: (Nb,<fixed>),(Case,<var>),(Gen,<fixed>),(Accom,<var>) adj: (Nb,<var>),(Case,<var>),(Gen,<var>),(Deg,<var>) adja: adjc: adjp: adv: (Deg,<var>) ppron12: (Nb,<fixed>),(Case,<var>),(Gen,<var>),(Pers,<fixed>),(Accent,<var>) ppron3: (Nb,<var>),(Case,<var>),(Gen,<var>),(Pers,<fixed>),(Accent,<var>),(Postprep,<var>) siebie: (Case,<var>) fin: (Nb,<var>),(Pers,<var>),(Asp,<fixed>) bedzie: (Nb,<var>),(Pers,<var>),(Asp,<fixed>) aglt: (Nb,<var>),(Pers,<var>),(Asp,<fixed>),(Vocal,<var>) praet: (Nb,<var>),(Gen,<var>),(Asp,<fixed>),(Agglt,<var>) impt: (Nb,<var>),(Pers,<var>),(Asp,<fixed>) imps:(Asp,<fixed>) inf:(Asp,<fixed>) pcon:(Asp,<fixed>) pant:(Asp,<fixed>) ger: (Nb,<var>),(Case,<var>),(Gen,<fixed>),(Asp,<fixed>),(Neg,<var>) pact: (Nb,<var>),(Case,<var>),(Gen,<var>),(Asp,<fixed>),(Neg,<var>) ppas: (Nb,<var>),(Case,<var>),(Gen,<var>),(Asp,<fixed>),(Neg,<var>) winien: (Nb,<var>),(Gen,<var>),(Asp,<fixed>) pred: prep:(Case,<fixed>) conj: qub: xxs: (Nb,<var>),(Case,<var>),(Gen,<fixed>) xxx: ign: interp: sp: burk: brev:(Dot,<fixed>) | Dictionary, Corpus, collected manually | aleja(aleja:subst:sg:nom:f) sztywnych(sztywny:subst:pl:gen:m1),subst(NC-O_N) entry: aleja sztywnych morphosyntactic tag of the entry: subst [noun] morphosyntactic disambiguation in the brackets after a word: (lexeme : morphosyntactic tag of a proper form : value of the Number category : value of the Case category : value of the Gender category) information in the brackets after a morphosyntactic tag of the entry, e.g. (NC-O_N) - type of the graph which is used to inflect the unit list of all inflected forms of the unit (MWE): aleja sztywnych,aleja sztywnych:subst:sg:nom:f aleje sztywnych,aleja sztywnych:subst:pl:nom:f alei sztywnych,aleja sztywnych:subst:sg:gen:f alej sztywnych,aleja sztywnych:subst:pl:gen:f alei sztywnych,aleja sztywnych:subst:pl:gen:f alei sztywnych,aleja sztywnych:subst:sg:dat:f alejom sztywnych,aleja sztywnych:subst:pl:dat:f aleję sztywnych,aleja sztywnych:subst:sg:acc:f aleje sztywnych,aleja sztywnych:subst:pl:acc:f aleją sztywnych,aleja sztywnych:subst:sg:inst:f alejami sztywnych,aleja sztywnych:subst:pl:inst:f alei sztywnych,aleja sztywnych:subst:sg:loc:f alejach sztywnych,aleja sztywnych:subst:pl:loc:f alejo sztywnych,aleja sztywnych:subst:sg:voc:f aleje sztywnych,aleja sztywnych:subst:pl:voc:f | |||||
33 | 21/06/2014 11:17:29 | WICOL | http://www.vronk.net/wicol/index.php/Main_Page | MWE dictionary or lexicon (MWEs only) | Slovak, German | not specified | 3 | Only contiguous | Available, restricted use | yes | Collocation profiles of 250 Slovak nouns; collocation profiles of 700 Slovak adjectives; collocation profiles of 500 German nouns with Slovak equivalents; collocation profiles of 250 German adjectives with Slovak equivalents | No (click continue to submit) | ||||||||||||||||
34 | 22/06/2014 19:53:21 | WordNet-Affect translated into Romanian and Russian | http://lilu.fcim.utm.md/resourcesRoRuWNA.html | Dictionary or lexicon with MWEs (also includes MWEs) | English, Romanian, Russian | 348 | 4 | Also non-contiguous | Available, unrestricted use | no licence | WordNet-Affect is a lexical resource that contains information about the emotions that words convey. It has been developed from the lexical knowledge base WordNet, through a selection and labelling of the affective concepts represented by sets of synonyms. Affective labels (a-labels) were manually assigned to WordNet synsets of nouns, adjectives, verbs and adverbs which convey affective meaning. Words labelled with the Emotion tag were further reannotated into six emotional categories: joy, fear, anger, sadness, disgust, surprise. WordNet-Affect is freely available for research purposes at http://wndomains.itc.it. The collection of WordNet-Affect synsets used in our work was provided as a resource in the SemEval-2007 Affective Text task, which focused on text annotation with affective tags. WordNet-Affect is organised in six files: anger.txt, disgust.txt, fear.txt, joy.txt, sadness.txt, surprise.txt. We keep the same data organisation. Please cite the following reference in the publications or presentations containing research results obtained through the use of this resource: "Emotions in words: developing a multilingual WordNet-Affect". CICLING 2010, Iasi, Romania, 2010. | No (click continue to submit) | ||||||||||||||||
35 | 24/06/2014 15:01:41 | Oxford Arabic Dictionary | http://www.oxforddictionaries.com/words/arabic | Dictionary or lexicon with MWEs (also includes MWEs) | Arabic, English | Also non-contiguous | Yes (click continue to fill in more information) | Oxford University Press | tressy.arts@gmail.com | http://ukcatalogue.oup.com/product/9780199580330.do | Intensional | compound nouns, compound adverbs, preposition + nouns, compound terminology, named entries, phrasal verbs, verbal expressions, collocations, etc. | All MWEs are written in both languages | Corpus, manually | ||||||||||||||
36 | 24/06/2014 16:51:39 | Unified Medical Language System (UMLS) SPECIALIST Lexicon | http://specialist.nlm.nih.gov/lexicon | Dictionary or lexicon with MWEs (also includes MWEs) | English | 301,345 MWE base forms; 417,755 including all inflectional variants | Only contiguous | Available, unrestricted use | For terms and conditions of use, please see http://lexsrv3.nlm.nih.gov/LexSysGroup/docs/termsAndConditions.html | Terms and conditions; link given above. | The SPECIALIST Lexicon has been built since 1994 at the U.S. National Library of Medicine, National Institutes of Health. It is intended to be a general English lexicon that includes many biomedical terms. It provides comprehensive coverage of biomedical vocabulary as well as commonly occurring English words. The lexicon entry for each word or term records the syntactic categorization, variant forms (morphological information), and specification of acronyms. | No (click continue to submit) | ||||||||||||||||
37 | 29/07/2014 09:38:15 | Multilingual Collocation Dictionary system Centre Tesniere (MultiCoDiCT) | http://tesniere.univ-fcomte.fr/multicodict_eng.html | MWE dictionary or lexicon (MWEs only) | Arabic < > French Chinese < > English < > French French < > Portuguese < > Spanish Korean < > English < > French | Unknown | Multilingual collocation dictionaries of specialised domains, exploiting inherent mathematical properties by means of formal specification techniques. A software engineering approach to multilingual terminology management. Applications in multilingual terminology, standards and safety-critical domains (e.g. clinical medicine). | No (click continue to submit) | ||||||||||||||||||||
38 | 29/07/2014 14:27:12 | Stanford Multiword Expression Resources | http://mwe.stanford.edu/resources/ | MWE dictionary or lexicon (MWEs only) | English, Russian | The following is a list of resources relevant to the LinGO Multiword Expression Project, along with a basic description of each resource, the date of release and a description of the author(s). In the instance that a reference is listed for the resource, we ask that any published results which make use of the given data set cite that reference appropriately. English and Russian Prepositional Phrases; Verb particle constructions with compositionality judgements; BNC verb particle construction frequency list; Verb particle constructions with Levin verb classes and Google frequencies | No (click continue to submit) | |||||||||||||||||||||
39 | 29/07/2014 14:30:42 | MWE resources listed in http://multiword.sourceforge.net | http://multiword.sourceforge.net/PHITE.php?sitesig=FILES&page=FILES_20_Data_Sets | list of resources | English, Chinese, Czech, German, Portuguese, Greek, French, Estonian | No (click continue to submit) | ||||||||||||||||||||||
40 | 19/08/2014 11:57:00 | LEX-MWE-PT: Word Combination in Portuguese Language | http://metashare.ilsp.gr:8080/repository/browse/lex-mwe-pt-word-combination-in-portuguese-language/8c13600ccd0711e1a404080027e73ea2f9cfd28f51d5437b8f5827c516c348fe/, http://www.clul.ul.pt/en/research-teams/187-combina-pt-word-combinations-in-portuguese-language | MWE dictionary or lexicon (MWEs only) | Portuguese (European) | 12,753 MWE lemmas | Also non-contiguous | Available, restricted use | Restrictions: Academic - Non Commercial Use Distribution Access/Medium: Downloadable Licensors: Amália Mendes, amalia.mendes@clul.ul.pt (Copied from META-SHARE) | Under negotiation | This lexicon includes multiword expressions (MWE) of European Portuguese extracted from a balanced 50.8M-word written corpus – a subcorpus of the Reference Corpus of Contemporary Portuguese (CRPC). This corpus covers different genres, being mainly constituted by journalistic texts (59%), but it also includes texts from literature (21%), magazines (15%), miscellaneous, supreme court verdicts, parliament sessions and leaflets (5%). The MWE lexicon covers 1,198 lemmas (composed of single words from different POS categories: nouns, adjectives, verbs and adverbs) and a total of 12,753 MWE lemmas (which include inflectional variants of the MWE lemmas) and 242,233 manually verified concordances of those MWEs. <Description from META-SHARE> | Yes (click continue to fill in more information) | IPR holder: Centro de Linguística da Universidade de Lisboa: Amália Mendes | amalia.mendes@clul.ul.pt | ANTUNES, Sandra, Maria Fernanda BACELAR DO NASCIMENTO, João Miguel CASTELEIRO, Amália MENDES, Luísa PEREIRA, Tiago SÁ (2006) "A Lexical Database of Portuguese Multiword Expressions" in VIEIRA, Renata et al. (2006) PROPOR 2006, LNAI 3960, Berlin, Springer-Verlag, pp. 238-243. | Extensional (in the form of corpus concordances) | 12,753 MWE lemmas | 242,233 manually verified concordances of those MWEs. | frozen groups (e.g., patrão fora, dia santo na loja 'while the cat is away, the mice will play'); semi-frozen groups where the meaning of the expression cannot be predicted from the meaning of the parts (e.g., esticar o pernil 'kick the bucket'), that are not subject to syntactic variability (e.g., internal modification *esticar o grande pernil 'kick the big bucket' or passivization *o pernil foi esticado 'the bucket was kicked') but allow inflectional variation (e.g., esticaram o pernil 'kicked the bucket'); semi-frozen groups that can be compositional and in some cases semantically idiosyncratic, and that allow for the substitution of one of the collocates by other words associated through a synonymy or hyperonymy/hyponymy relation (e.g., onda/maré/vaga de assaltos 'wave of robberies'; países/estados membros 'member states'); sets of favoured co-occurring forms that nevertheless constitute syntactic dependencies. | A detailed description of the lexicon, its structure and content is given at the resource webpage: http://www.clul.ul.pt/en/research-teams/187-combina-pt-word-combinations-in-portuguese-language | Since the extraction of lexical collocations must rely on a large collection of data, a written and balanced corpus of 50 million words, the COMBINA corpus, was designed from the existing corpus CRPC: http://www.clul.ul.pt/en/research-teams/183-reference-corpus-of-contemporary-portuguese-crpc | |||||||
41 | 19/08/2014 12:03:28 | PANACEA Environment SCF MWE merged Italian Lexicon | http://metashare.ilsp.gr:8080/repository/browse/panacea-environment-scf-mwe-merged-italian-lexicon/c4e3084680c211e28763000c291ecfc8d62c1eae3f784dd99671954514e657ce/, http://hdl.handle.net/10230/20173 | MWE dictionary or lexicon (MWEs only) | Italian | Available, restricted use | Creative Commons Attribution-NonCommercial 3.0 Unported License (http://creativecommons.org/licenses/by-nc/3.0/) | Creative Commons (CC): http://creativecommons.org/examples | The Italian PANACEA_ENV_MWE_SCF_merged.lmf.xml lexicon is obtained by merging two automatically extracted lexicons: a domain lexicon (environment) for SCFs, PANACEA_SCF_IT_environment.lmf.xml and a MWE Italian lexicon env-mw.lmf.xml. The lexicon was produced at CNR-ILC, Pisa, Italy as an outcome of the PANACEA EU-FP7 Funded Project under Grant Agreement 248064 (http://www.panacea-lr.eu). <Description from META-SHARE> | Yes (click continue to fill in more information) | Contact person: Monica Monachini | monica.monachini@ilc.cnr.itz | ||||||||||||||||
42 | 19/08/2014 12:06:41 | PANACEA Labour SCF MWE merged Italian Lexicon | http://metashare.ilsp.gr:8080/repository/browse/panacea-labour-scf-mwe-merged-italian-lexicon/c903fc2880c211e28763000c291ecfc84a99d41ec468410985a8d1ebfc06de71/, http://hdl.handle.net/10230/20174 | MWE dictionary or lexicon (MWEs only) | Italian | Available, restricted use | Creative Commons Attribution-NonCommercial 3.0 Unported License (http://creativecommons.org/licenses/by-nc/3.0/) | Creative Commons (CC): http://creativecommons.org/examples | The Italian PANACEA_LAB_SCF_MWE_merged.lmf.xml lexicon is obtained by merging two automatically extracted lexicons: a domain lexicon (labour) for SCFs, PANACEA_SCF_IT_labour.lmf.xml and a MWE Italian lexicon lab-mw.lmf.xml. The lexicon was produced at CNR-ILC, Pisa, Italy as an outcome of the PANACEA EU-FP7 Funded Project under Grant Agreement 248064 (http://www.panacea-lr.eu). <Description from META-SHARE> | Yes (click continue to fill in more information) | Contact person: Monica Monachini | monica.monachini@ilc.cnr.itz | MENDES, Amália, Sandra ANTUNES, Maria Fernanda BACELAR DO NASCIMENTO, João Miguel CASTELEIRO, Luísa PEREIRA, Tiago SÁ (2006) "COMBINA-PT: a Large Corpus-extracted and Hand-checked Lexical Database of Portuguese Multiword Expressions", Proceedings of the V International Conference on Language Resources and Evaluation - LREC2006, Genoa, May 22-28 2006, pp. 1900-1905. | |||||||||||||||
43 | 20/08/2014 10:39:10 | BioLexicon | http://metashare.ilsp.gr:8080/repository/browse/biolexicon/37c86584de6c11e2b1e400259011f6ead5fa82f93c0544b29b1b61526cd7c87f/, http://catalog.elra.info/product_info.php?products_id=1113 | Dictionary or lexicon with MWEs (also includes MWEs) | English | Available, restricted use | Several licenses for different uses (academic/commercial) and users (ELRA members/non-members). | ELRA | BioLexicon is a large-scale English terminological resource which has been developed to address the needs emerging in text mining efforts in the biomedical domain. It contains information on: - terminological nouns, including nominalised verbs and proper names (e.g., gene names) - terminological adjectives - terminological adverbs - terminological verbs - general English words frequently used in the biology domain Existing information on terms was integrated, augmented, complemented and linked, through processing of massive amounts of biomedical text, to yield inter alia over 2.2M lexical entries (over 3.3M semantic relations), and information on over 1.8M variants and on over 2M synonymy relations. Moreover, extensive information is provided on how verbs and nominalised verbs in the domain behave at both syntactic and semantic levels, thus supporting applications aiming at discovery of relations and events involving biological entities in text. It contains domain-specific verbs (658) and includes both automatically extracted syntactic subcategorization frames (1710) and semantic event frames (850) that are based on corpus annotation by domain experts. Once populated with terms from existing repositories, BioLexicon was augmented with term variants extracted from the scientific literature and complemented with manually selected lexical items, such as biologically relevant verbs and multiword token expressions. BioLexicon is available in a relational database format (MySQL dump format) and it adheres to the EAGLES/ISO standards for lexical resources. <Description from META-SHARE> | Yes (click continue to fill in more information) | Contact person: Mapelli Valérie | mapelli@elda.org | ||||||||||||||||
44 | 20/08/2014 10:47:19 | LABEL-LEX (MW) | http://metashare.ilsp.gr:8080/repository/browse/label-lex-mw/86090e98de7011e2b1e400259011f6ea56b6b33c48c549568791c59a76545065/, http://catalog.elra.info/product_info.php?products_id=700 | MWE dictionary or lexicon (MWEs only) | Portuguese | 88 619 | Available, restricted use | Several licenses for different uses (academic/commercial) and users (ELRA members/non-members). | ELRA | LABEL-LEX (MW) is a Portuguese formalized lexicon, containing 88 619 inflected multiword lexical units (formally, sequences of simple words). The units are distributed as follows: - 85,881 nouns, with information about type, gender, number, inflected forms, irregular inflected forms and subcategorisation frames - 2,204 adverbs - 409 adjectives, with information about degree, gender, number, comparison, position, inflected forms, irregular inflected forms and subcategorisation frames - 125 pronouns, prepositions/postpositions and conjunctions <Description from META-SHARE> | Yes (click continue to fill in more information) | Contact person: Mapelli Valérie | mapelli@elda.org | Extensional | 88619 | nouns, adverbs, adjectives, pronouns, prepositions/postpositions and conjunctions | ||||||||||||
45 | 20/08/2014 10:51:49 | OpenLogos Bilingual Dictionaries | http://metashare.ilsp.gr:8080/repository/browse/openlogos-bilingual-dictionaries/c27d98c2e0d511e3a462080027f903f2d1cca27a783a451bae80dfe14cc90043/ | Dictionary or lexicon with MWEs (also includes MWEs) | English>French, English>German, English>Italian | Available, unrestricted use | Restrictions: Academic - Non Commercial Use | GPL | The OpenLogos bilingual dictionaries (English-French, English-German and English-Italian) contain the following linguistic information: part-of-speech (POS), gender (GEN), number (NUM), morphological paradigms (PAT) for source and target words, head word (HEAD) in multiwords, homographs (HOMO), auxiliary (AUX), alternate word (ALT), causative verb (CAUS), reflexive verb (REFL), and aspectual verb (ASP). In addition, they contain semantico-syntactic knowledge (SAL), a three-level interlingua-style hierarchical taxonomy with over 1,000 elements, embracing all POS. SAL represents the conceptual formalization of things, ideas, relationships, dispositions, conditions, processes, etc., as described in the SAL Tutorial of the Learn Logos application, available with the OpenLogos software. Each bilingual dictionary contains over 80,000 entries. Verbs, nouns and adjectives are the most represented classes. We believe that they are useful for machine translation and other natural language processing applications. <Description from META-SHARE> | Yes (click continue to fill in more information) | Anabela Barreiro | anabela.barreiro@inesc-id.pt | ||||||||||||||||
46 | 20/08/2014 10:56:01 | The CINTIL Corpus – International Corpus of Portuguese | http://metashare.ilsp.gr:8080/repository/browse/the-cintil-corpus-international-corpus-of-portuguese/99a51c1ade6d11e2b1e400259011f6eabab9a2512cd6404c8f828ed94885c413/, http://catalog.elra.info/product_info.php?products_id=1102 | Dictionary or lexicon with MWEs (also includes MWEs) | Portuguese | Available, restricted use | Several licenses for different uses (academic/commercial) and users (ELRA members/non-members). | ELRA | CINTIL-Corpus Internacional do Português is a linguistically interpreted written and spoken corpus of European Portuguese. It is composed of one million annotated tokens, each one of which verified by human expert annotators. The annotation comprises information on part-of-speech, open class lemma and inflection, multi-word expressions pertaining to the class of adverbs and to the closed POS classes, and multi-word proper names (for named entity recognition). Multiword Lexical Units (MWU) for Named Entity Recognition (NER): Delimitation and classification of multi-word expressions for Named Entities following the usual IOB tagging schema for NER, and the typical classes of Number, Date, Person, Location, etc.<Description from META-SHARE> | Yes (click continue to fill in more information) | Contact person: Mapelli Valérie | mapelli@elda.org | ||||||||||||||||
47 | 25/08/2014 14:58:24 | bgMWE – tool for MWE recognition | http://metashare.ilsp.gr:8080/repository/browse/bgmwe-tool-for-mwe-recognition/c51ec6406afd11e281b65cf3fcb88b70b4b3bc3889ed462581042dea4cb48a06/, http://dcl.bas.bg/en/bgMWE_en.html | tool | Bulgarian (language dependent) | Available, restricted use | CC-BY-NC | Creative Commons (CC): http://creativecommons.org/examples | bgMWE is a tool for corpus processing and MWE recognition and tagging. It is developed in Java and is thus platform independent. bgMWE comprises a set of modules which can be applied for particular NLP tasks. It is largely language independent and can work either in resource-light mode, or its performance can be boosted by employing lexical resources. The system includes the following modules: Web crawler for Wikipedia; Extraction of lexical data – lists of words and MWEs; Converter between formats – vertical format, XML, etc.; Preprocessing module – applying a chunker, a tagger, etc.; Collection of frequency data; MWE recognition and tagging. Further improvement of bgMWE is planned in the following directions: improving efficiency; implementing various methods for MWE recognition; developing a visualisation module or integrating existing open source visualisation methods; module for extensive evaluation. <Description from META-SHARE> | Yes (click continue to fill in more information) | IPR holder: Institute for Bulgarian Language. Contact person: Ivelina Stoyanova | dcltools@dcl.bas.bg | ||||||||||||||||
48 | 25/08/2014 15:01:45 | Bulgarian MWE dictionary | http://metashare.ilsp.gr:8080/repository/browse/bulgarian-mwe-dictionary/50f06bb26afc11e281b65cf3fcb88b703aab1e9d754e40c88e16d771d08c1842/, http://dcl.bas.bg/en/mweDictionary_en.html | MWE dictionary or lexicon (MWEs only) | Bulgarian | 27744 | Available, restricted use | CC-BY-NC | Creative Commons (CC): http://creativecommons.org/examples | The Bulgarian dictionary of MWEs includes 27,744 MWEs altogether, which are divided into 13 categories based on their idiomaticity, which is evaluated with respect to the following features: whether the MWE is a named entity; whether the MWE contains a reference to a named entity; the degree to which the meaning of the MWE is compositional and transparent. The MWEs are extracted from several sources: Wikipedia, the Thesaurus of Bulgarian (1994) and other printed dictionaries and electronic corpora. The MWEs are manually verified and classified into categories. <Description from META-SHARE>. Further improvement of bgMWE is planned in the following directions: improving efficiency; implementing various methods for MWE recognition; developing a visualisation module or integrating existing open source visualisation methods; module for extensive evaluation. <From http://dcl.bas.bg/en/bgMWE_en.html> | Yes (click continue to fill in more information) | IPR holder: Institute for Bulgarian Language. Contact person: Ivelina Stoyanova | iva@dcl.bas.bg | |||||||||||||||
49 | 25/08/2014 15:08:13 | Chooser - annotation tool | http://metashare.ilsp.gr:8080/repository/browse/chooser-annotation-tool/5f603b6c6a5911e281b65cf3fcb88b7080220eacf93548d7a4b6d2c8f15960db/, http://dcl.bas.bg/en/Chooser.html | tool | Language independent | Also non-contiguous | Available, restricted use | GPL (share alike) | GNU General Public Licence (GPL): http://www.gnu.org/licenses/gpl.html | Chooser is an OS independent multi-functional system for linguistic annotation, adaptable to different annotation schemata. The basic annotation functionalities of the tool are: (i) fast and easy-to-perform selection; (ii) run-time access to information for the candidate senses such as definition, frequency, the associated wordnet synsets with all the pertaining info – synonyms, gloss, semantic relations, notes on usage, form, etc.; (iii) identification of MWEs with contiguous and non-contiguous constituents and supplying information for them at run-time. <Description from META-SHARE> | Yes (click continue to fill in more information) | IPR holder: Department of Computational Linguistics, Institute for Bulgarian Language. Contact person: Borislav Rizov | boby@dcl.bas.bg | |||||||||||||||
50 | 25/08/2014 15:14:08 | Lists of Bulgarian Multiword Expressions (BulMWEs) | http://metashare.ilsp.gr:8080/repository/browse/lists-of-bulgarian-multiword-expressions/bc41fa6266cd11e281b65cf3fcb88b70b5af49e6365f4bee903c9527bd6d1e4a/, http://dcl.bas.bg/en/dictionaries_en.html | MWE dictionary or lexicon (MWEs only) | Bulgarian | Available, restricted use | Restrictions: Academic - Non Commercial Use Fee: free of charge | The lists of Multiword expressions are the result of automatic and semi-automatic tagging and classification of the corpus Wiki1000+ (13.4 million tokens): Non-decomposable - 700, Idiosyncratically decomposable - 3,156, Simple decomposable (NEs without connection between elements - 36,932, NEs with a meaningful element(s) - 11,248, Non-NEs with a vague connection between components - 1,46, NEs with meaningful components but connection difficult to restore - 1,086, NEs with descriptor and additional element - 18,962, Non-NEs with a NE as one of the components - 27,373, Non-NEs with a standard, easy to restore connection between components- 140,394, NEs with a standard, easy to restore connection between components - 16,653, Non-NEs with explicit connection between components - 1,468), “Free collocations” - 49,651, Free phrases- 1,197,762. <Description from META-SHARE> | Yes (click continue to fill in more information) | Department of Computational Linguistics, Institute for Bulgarian Language. Contact person: Ivelina Stoyanova | iva@dcl.bas.bg | |||||||||||||||||
51 | 25/08/2014 15:18:17 | Mutilingual dictionaries | http://metashare.ilsp.gr:8080/repository/browse/mutilingual-dictionaries/0dbc01f46afb11e281b65cf3fcb88b70f5819e245fc3484b8b7fafb21b1bd291/, http://dcl.bas.bg/en/multilingualDictionary_en.html | Dictionary or lexicon with MWEs (also includes MWEs) | Bulgarian, English, German, Romanian, Greek, Polish | Available, restricted use | CC-BY-NC | Creative Commons (CC): http://creativecommons.org/examples | The set of multilingual dictionaries covers all pairs of languages among the following: Bulgarian, English, German, Romanian, Greek, and Polish. The main source of the dictionaries is Wikipedia – translations of article titles and category labels. The dictionaries include single words, MWEs and phrases but are predominantly phrase-to-phrase. The following sets of dictionaries are included in the pack: General bilingual dictionaries for each pair of languages; Bilingual dictionaries of personal names for each pair of languages; Bilingual dictionaries of organisations for each pair of languages; Bilingual dictionaries of toponyms for each pair of languages. The dictionaries are stored in plain text format for easy and flexible storage and processing. <Description from META-SHARE> | Yes (click continue to fill in more information) | IPR holder: Institute for Bulgarian Language. Contact person: Svetla Koeva | svetla@dcl.bas.bg | ||||||||||||||||
52 | 25/08/2014 15:29:25 | Wiki1000+ corpus with annotated MWEs | http://metashare.ilsp.gr:8080/repository/browse/wiki1000-corpus-with-annotated-mwes/a2038f0a6af411e281b65cf3fcb88b704919deb94d474ffc825290985f395f46/, http://dcl.bas.bg/en/wikiCorpus_en.html | Corpus with annotated MWEs | Bulgarian | Available, restricted use | Licence: CC-BY Restrictions: Academic - Non Commercial Use, Attribution Distribution Access/Medium: Downloadable | Creative Commons (CC): http://creativecommons.org/examples | Wiki1000+ is a corpus of articles from Wikipedia, compiled for the purposes of the study of multiword expressions (MWEs) in Bulgarian. The Wiki1000+ corpus contains 6,311 text samples with at least 1,000 tokens each, amounting to 13.4 million tokens. The corpus is a part of the Bulgarian National Corpus. Wiki1000+ is annotated with the following linguistic information: sentence boundaries, tokenisation, lemmatisation, POS tagging, and MWE annotation. MWE annotation includes MWE id, labelling the components of the MWE and determining the type of the MWE according to a classification based on idiomaticity. <Description from META-SHARE> | Yes (click continue to fill in more information) | Contact person: Svetla Koeva | svetla@dcl.bas.bg | ||||||||||||||||
53 | 26/08/2014 09:03:11 | Bulgarian Sense-Annotated Corpus | http://metashare.ilsp.gr:8080/repository/browse/bulgarian-sense-annotated-corpus/b7d5478666cd11e281b65cf3fcb88b705fc4c009156a4a9499794778d015eaa8/, http://dcl.bas.bg/semcor/en/ | Sense-annotated corpus with MWEs | Bulgarian | 5797 | Available, restricted use | Restrictions: Academic - Non Commercial Use Distribution Access/Medium: Accessible Through Interface | The Bulgarian Sense-annotated Corpus (BulSemCor) contains sense-disambiguated lexical items defined in the context of occurrence. The Bulgarian Sense-annotated Corpus follows the methodology of the Princeton University SemCor. As BulSemCor it consists of excerpts from the Brown Corpus of Bulgarian. Each lexical item (simple word, compound word or multiword expression) is manually assigned the unique semantic or grammatical meaning from the Bulgarian wordnet (BulNet) in the particular context. Unlike other sense-annotated corpora, BulSemCor covers both open and closed class words and all occurrences of multiword expressions and named entities. The annotated lexical units inherit all the information from the synonym sets in BulNet, incl. explanatory definition, PoS, usage examples, notes on grammatical, stylistic, and pragmatic properties, and all relations (semantic, morpho-syntactic and extra-linguistic) pertaining to the synset, as well as the semantic and derivational relations pertaining to the literal. BulSemCor contains 101 062 tokens, 99 480 annotated lexical units - 86 842 single words, and 5797 multiword expressions. <Description from META-SHARE> | Yes (click continue to fill in more information) | IPR holder: Institute for Bulgarian Language. Contact person: Ivelina Stoyanova | dcltools@dcl.bas.bg | ||||||||||||||||
54 | 26/08/2014 09:07:34 | Collocation and Term Extractor (CollTerm) | http://metashare.ilsp.gr:8080/repository/browse/collocation-and-term-extractor/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75/, http://www.nljubesic.net/resources/tools/collterm/ | tool | Language independent | Available, unrestricted use | Apache Licence 2.0 Restrictions: Inform Licensor Execution location: hidden Distribution Access/Medium: Downloadable | Apache | CollTerm is a language independent tool for collocation and term extraction. It is an application that collects collocation and term candidates based on five different co-occurrence measures for multiword units (i.e. collocations) or distributional differences from a large representative corpus by application of the TF-IDF measurement on single-word units. The language dependent part consists of a stop-word list and a list of MWU MSD-patterns that can be coded with regular expressions as well. The application is described in the paper presented at TKE2012 by Pinnis, M., Ljubešić, N., Ştefănescu, D., Skadiņa, I, Tadić, Gornostay, T. Term Extraction, Tagging, and Mapping Tools for Under-Resourced Languages. The first version of this application is available as an integral part of ACCURAT Toolkit that is available under Apache 2.0 license (http://www.accurat-project.eu/index.php?p=accurat-toolkit). In this version of the tool a calibration of MWU MSD-patterns has been provided for Croatian thus enhancing the usability of the tool. The plan is to provide calibration for other CESAR languages as well. | Yes (click continue to fill in more information) | IPR holder: University of Zagreb, Faculty of Humanities and Social Sciences, c/o Marko Tadić. Contact person: Nikola Ljubešić | nljubesi@ffzg.hr | ||||||||||||||||
55 | 26/08/2014 09:10:38 | Dictionary of Neologisms in Bulgarian Language | http://metashare.ilsp.gr:8080/repository/browse/dictionary-of-neologisms-in-bulgarian-language/7ad446f268ad11e281b65cf3fcb88b70dd4a3a216cb34a998c25fda3d4e70b2a/, http://infolex.ibl.bas.bg/PhrasThes/searchNeologPage.seam?cid=17 | Dictionary or lexicon with MWEs (also includes MWEs) | Bulgarian | 160 | Available, restricted use | CC - BY - NC - ND Restrictions: Academic - Non Commercial Use, No Redistribution Download location: hidden Distribution Access/Medium: Accessible Through Interface | Creative Commons (CC): http://creativecommons.org/examples | The Dictionary of Neologisms in Bulgarian Language contains over 2,200 new words and 160 new multiword units (compounds and terminological units) that have entered the Bulgarian language in the past 20 years. Each entry contains information about: part-of-speech (for lexemes); origin (for borrowed words); stylistic and grammatical notes; lexical meaning of the unit; synonyms and antonyms (if available). If necessary, short examples (phrases or sentences) are given to illustrate the use of the neologism in context. <Description from META-SHARE> | Yes (click continue to fill in more information) | IPR holder: Institute for Bulgarian Language. Contact person: Diana Blagoeva | d.blagoeva@ibl.bas.bg | |||||||||||||||
56 | 26/08/2014 09:13:13 | Java version of NooJ (JavaNooJ) | http://metashare.ilsp.gr:8080/repository/browse/java-version-of-nooj/2f8caa506aff11e2aedc000423bfd61c0a125e4434514b43ba542943a6108ec7/, http://www.nooj4nlp.net/pages/download.html | tool | Language independent | Available, restricted use | GPL Restrictions: Academic - Non Commercial Use Fee: no price Download location: hidden Distribution Access/Medium: Downloadable | GNU General Public Licence (GPL): http://www.gnu.org/licenses/gpl.html | NooJ is a linguistic development environment that allows linguists to formalize several levels of linguistic phenomena: typography and spelling; lexicons of simple words, multiword units and discontinuous expressions; inflectional, derivational and productive morphology; local and structural syntax, transformational and semantic analysis and generation. For each of these levels NooJ provides linguists with one formal framework specifically designed to facilitate the description of each phenomenon, as well as parsing/development/debugging tools designed to be as computationally efficient as possible, from Finite-State machines to Turing machines. This approach distinguishes NooJ from other computational linguistic frameworks which provide a unique formalism based on a compromise between power and efficiency. As a corpus processing tool, NooJ allows all researchers and professionals to extract information from general or technical corpora by applying sophisticated queries based on concepts rather than word forms and build indices, add semantic annotations, perform statistical analyses, etc. The Java version of NooJ is open source software and works on all operating systems. <Description from META-SHARE> | Yes (click continue to fill in more information) | IPR holder: Max Silberztein | elliadd@univ-fcomte.fr | ||||||||||||||||
57 | 26/08/2014 09:16:40 | MONO version of NooJ (MONONooJ) | http://metashare.ilsp.gr:8080/repository/browse/mono-version-of-nooj/fc91787a6b7f11e29f6e000423bfd61cad17bb05bcbd470da8cec4ebdda3481e/, http://www.nooj4nlp.net/pages/download.html | tool | language independent | Available, restricted use | MS - NC - No ReD - ND Restrictions: Academic - Non Commercial Use, No Derivatives, No Redistribution Fee: no price Download location: hidden Distribution Access/Medium: Downloadable | META-SHARE: http://www.meta-net.eu/meta-share/licenses | NooJ is a linguistic development environment that allows linguists to formalize several levels of linguistic phenomena: typography and spelling; lexicons of simple words, multiword units and discontinuous expressions; inflectional, derivational and productive morphology; local and structural syntax, transformational and semantic analysis and generation. For each of these levels NooJ provides linguists with one formal framework specifically designed to facilitate the description of each phenomenon, as well as parsing/development/debugging tools designed to be as computationally efficient as possible, from Finite-State machines to Turing machines. This approach distinguishes NooJ from other computational linguistic frameworks which provide a unique formalism based on a compromise between power and efficiency. As a corpus processing tool, NooJ allows all researchers and professionals to extract information from general or technical corpora by applying sophisticated queries based on concepts rather than word forms and build indices, add semantic annotations, perform statistical analyses, etc. The MONO version of NooJ runs on all platforms that support MONO. <Description from META-SHARE> | Yes (click continue to fill in more information) | IPR holder: Max Silberztein | Max.Silberztein@univ-fcomte.fr | ||||||||||||||||
58 | 26/08/2014 09:20:13 | PANACEA Environment Bilingual Glossary French-to-English (ENV Glossary FR-EN) | http://metashare.ilsp.gr:8080/repository/browse/panacea-environment-bilingual-glossary-french-to-english/f333e5a6bbb611e28763000c291ecfc880c9eb1f3a94470eb387e97674e2bcac/, http://hdl.handle.net/10230/19969 | Dictionary or lexicon with MWEs (also includes MWEs) | French, English | Available, unrestricted use | CC - BY Download location: hidden Distribution Access/Medium: Downloadable | Creative Commons (CC): http://creativecommons.org/examples | This glossary contains terminology in French-to-English, with a focus on environmental terms, resulting from PANACEA research. It contains about 3846 entries, both single words and multiwords, with part-of-speech information, manually validated. <Description from META-SHARE> | Yes (click continue to fill in more information) | Contact persons: Gregor Thurmair, Vera Aleksić | info@linguatec.de | ||||||||||||||||
59 | 26/08/2014 09:23:47 | PANACEA Environment Multi Word Italian Lexicon | http://metashare.ilsp.gr:8080/repository/browse/panacea-environment-multi-word-italian-lexicon/f8769888bbb611e28763000c291ecfc8297387836d4e4c379114c193ecd3cc85/, http://hdl.handle.net/10230/20182 | MWE dictionary or lexicon (MWEs only) | Italian | Available, restricted use | CC - BY - NC Download location: hidden Distribution Access/Medium: Downloadable | Creative Commons (CC): http://creativecommons.org/examples | The Environment MW Italian Lexicon is a lexicon of noun-noun multiword expressions automatically extracted from a 36Mio word web crawled corpus in the environmental domain. The lexicon was produced at CNR-ILC, Pisa, Italy as an outcome of the PANACEA EU-FP7 Funded Project under Grant Agreement 248064 (http://www.panacea-lr.eu). <Description from META-SHARE> | Yes (click continue to fill in more information) | Contact person: Monica Monachini | monica.monachini@ilc.cnr.it | Frontini F., Quochi V., Rubino F. (2012) “Automatic Creation of quality Multi-word Lexica from noisy text data” Proceedings of the Sixth Workshop on Analytics for Noisy Unstructured Text Data. COLING 2012. Mumbai, India. Quochi, Valeria, Frontini, Francesca and Rubino Francesco (2012). A MWE Acquisition and Lexicon Builder Web Service. Proceedings of the COLING 2012. Mumbai, India. | |||||||||||||||
60 | 26/08/2014 09:27:47 | PANACEA Labour Legislation Bilingual Glossary French-to-English (LAB Glossary FR-EN) | http://metashare.ilsp.gr:8080/repository/browse/panacea-labour-legislation-bilingual-glossary-french-to-english/f6aea810bbb611e28763000c291ecfc8dc31bd362d6b43dbb6c42a1b69cb1c0f/, http://hdl.handle.net/10230/19988 | Dictionary or lexicon with MWEs (also includes MWEs) | French, English | Available, unrestricted use | CC - BY Download location: hidden Distribution Access/Medium: Downloadable | Creative Commons (CC): http://creativecommons.org/examples | This glossary contains terminology in French-to-English, with a focus on labour legislation terms, resulting from PANACEA research. It contains about 2441 entries, both single words and multiwords, with part-of-speech information, manually validated. <Description from META-SHARE> | Yes (click continue to fill in more information) | Contact persons: Gregor Thurmair, Vera Aleksić | info@linguatec.de | Thurmair; Gr., Aleksić, V., 2012: Creating Term and Lexicon Entries From Phrase Tables. Proc. EAMT Trento | |||||||||||||||
61 | 26/08/2014 09:30:56 | PANACEA Labour Multi Word Italian Lexicon | http://metashare.ilsp.gr:8080/repository/browse/panacea-labour-multi-word-italian-lexicon/f8d9d876bbb611e28763000c291ecfc853a5588bfab84b39a330e1b4220b2d83/, http://hdl.handle.net/10230/20177 | MWE dictionary or lexicon (MWEs only) | Italian | Available, restricted use | CC - BY - NC Download location: hidden Distribution Access/Medium: Downloadable | Creative Commons (CC): http://creativecommons.org/examples | The Labour MW Italian Lexicon is a lexicon of noun-noun multiword expressions automatically extracted from a 70Mio word web crawled corpus in the labour law domain. The lexicon was produced at CNR-ILC, Pisa, Italy as an outcome of the PANACEA EU-FP7 Funded Project under Grant Agreement 248064 (http://www.panacea-lr.eu). <Description from META-SHARE> | Yes (click continue to fill in more information) | Contact person: Monica Monachini | monica.monachini@ilc.cnr.it | Frontini F., Quochi V., Rubino F. (2012) “Automatic Creation of quality Multi-word Lexica from noisy text data” Proceedings of the Sixth Workshop on Analytics for Noisy Unstructured Text Data. COLING 2012. Mumbai, India. Quochi, Valeria, Frontini, Francesca and Rubino Francesco (2012). A MWE Acquisition and Lexicon Builder Web Service. Proceedings of the COLING 2012. Mumbai, India. | |||||||||||||||
62 | 26/08/2014 09:35:52 | Serbian NooJ module (SrpNooJ) | http://metashare.ilsp.gr:8080/repository/browse/serbian-nooj-module/0d68b2f28b3411e2ab9f001517144592e9978ff1de0d4abebd4d6c8935fcb9af/, http://www.nooj4nlp.net/pages/resources.html | Lexical conceptual resource with examples of MWEs | Serbian | Available, unrestricted use | CC - BY Restrictions: Attribution Fee: no price Download location: hidden Distribution Access/Medium: Downloadable | Creative Commons (CC): http://creativecommons.org/examples | Serbian NooJ module (SrpNooJ) was produced in the scope of the EU-funded CESAR project. It consists of a set of resources in both alphabets that are in use for Serbian: Cyrillic and Latin. Each set consists of: the dictionary properties’ definition file (metadata); one text, the novel “Dva carstva” (Two empires) by the Serbian author Branimir Ćosić, comprising 106684 tokens; a sample dictionary in readable form with 35 lemmas that belong to 9 grammatical classes, with examples of multiword units and derivational morphology; a sample of morphological grammars used for lemmas from the sample dictionary (three for simple nouns, two for adjectives, two for verbs, and one for a multiunit noun); a readable sample dictionary of inflected forms automatically produced from the sample dictionary of lemmas and the sample morphological grammars; a syntactic grammar for recognition of one class of named entities (full personal names with their roles or functions); a full compiled dictionary (divided into three files: nouns, verbs, and other). It comprises 85868 entries: nouns (40886), adjectives (25558), verbs (15366), and other (4058). <Description from META-SHARE> | Yes (click continue to fill in more information) | Contact person: Miloš Utvić | misko@matf.bg.ac.rs | Samples location: http://www.nooj4nlp.net/pages/resources.html | |||||||||||||||
63 | 26/08/2014 09:38:42 | Shallow Grammar for the National Corpus of Polish (NKJPGrammar) | http://metashare.ilsp.gr:8080/repository/browse/shallow-grammar-for-the-national-corpus-of-polish/f95d762a6aff11e284b6000423bfd61c5891270890b246d88f606717f0ce6ea7/, http://clip.ipipan.waw.pl/LRT?action=AttachFile&do=view&target=gramatyka_Spejd_NKJP_1.0.zip | grammar | Polish | Available, restricted use | GPL Restrictions: Share Alike Fee: free of charge Download location: hidden Distribution Access/Medium: Downloadable | GNU General Public Licence (GPL): http://www.gnu.org/licenses/gpl.html | Shallow Grammar for the National Corpus of Polish is a set of rules which was used for the automatic pre-annotation of the National Corpus of Polish at the syntactic level. It was constructed manually and encoded in the shallow parsing system Spejd (http://nlp.ipipan.waw.pl/Spejd/). It consists of 1187 rules for multiword entities, abbreviations, syntactic words, and syntactic groups. <Description from META-SHARE> | Yes (click continue to fill in more information) | Contact person: Katarzyna Głowińska | k.glowinska@gmail.com | Głowińska K., Przepiórkowski A. The Design of Syntactic Annotation Levels in the National Corpus of Polish. In: LREC 2010 proceedings. Waszczuk, J., Głowińska, K., Savary, A., Przepiórkowski, A. Tools and Methodologies for Annotating Syntax and Named Entities in the National Corpus of Polish. In: Proceedings of Computational Linguistics – Applications (CLA 2010), Workshop at IMCSIT 2010, Wisła, Poland, October 18-20. | |||||||||||||||
64 | 26/08/2014 09:44:02 | CST Tokeniser (rtfreader) | http://metashare.ilsp.gr:8080/repository/browse/cst-tokeniser/fc95a26642cf11e28d2f0050569b00008e521dc7f3a24ee48b252a6001e61201/, https://github.com/kuhumcst/rtfreader | tool | Danish | Available, restricted use | GPL Download location: hidden Distribution Access/Medium: Downloadable | GNU General Public Licence (GPL): http://www.gnu.org/licenses/gpl.html | Sentence segmenter. Optional tokenisation, MWU-recognition and recognition of abbreviations. Input from RTF (rich text) or flat text. In the case of RTF, layout and style info is used to recognise and properly treat e.g. head lines and bulleted lists. <Description from META-SHARE> | Yes (click continue to fill in more information) | IPR holder: University of Copenhagen. Contact person: Bart Jongejan | bartj@hum.ku.dk | https://github.com/kuhumcst/rtfreader/blob/master/README.md | |||||||||||||||
65 | 26/08/2014 17:48:59 | Oxford Collocations Dictionary | http://abloz.com/huzheng/stardict-dic/dict.org/stardict-OxfordCollocationsDictionary-2.4.2.tar.bz2 | MWE dictionary or lexicon (MWEs only) | English | 8378 | Available, unrestricted use | GNU General Public Licence (GPL): http://www.gnu.org/licenses/gpl.html | No (click continue to submit) | |||||||||||||||||||
66 | 28/08/2014 20:16:48 | British English Source Lexicon (BESL) version 2.2 | http://metashare.ilsp.gr:8080/repository/browse/british-english-source-lexicon-besl-version-22/dc410e62de6811e2b1e400259011f6eaff8112b159c346f8a910378af93ece2a/, http://catalog.elra.info/product_info.php?products_id=834 | Dictionary or lexicon with MWEs (also includes MWEs) | English (British) | 58,000 multi-word compound nouns | Only contiguous | Available, restricted use | ELRA END USER Restrictions: Academic - Non Commercial Use For Members of ELRA ELRA END USER Restrictions: Academic - Non Commercial Use For Non Members of ELRA | ELRA | BESL is a complete database of the English lexicon. It consists of over 230,000 lemmas, over 350,000 word forms, 60,000 proper nouns, 3,000 abbreviations, and 58,000 multi-word compound nouns. Each headword is provided with a full listing of all inflected forms and other morphological variation. Every word form is marked for part of speech (using Penn TreeBank notation). Most single-word forms include a representation of IPA pronunciation. BESL covers both British and American English, and other spelling variants, with cross-references between corresponding forms. Each lemma is graded on a scale between 1 and 9 to indicate frequency, based on corpus evidence. Lemmas are also classified by domain, where appropriate (e.g. Computing, Religion). Obscene or offensive lemmas are marked using a 2-grade system. Proper name lemmas in BESL include personal names, surnames, place names, and brand names. BESL is provided in XML. <Description from META-SHARE> | Yes (click continue to fill in more information) | Contact person: Mapelli Valérie | mapelli@elda.org | compound nouns | |||||||||||||
67 | 28/08/2014 20:23:57 | CINTIL-Corpus Internacional do Português | http://metashare.ilsp.gr:8080/repository/browse/cintil-corpus-internacional-do-portugues/fe32ebf2485511e2a2aa782bcb074135aa0fdcd287ac45e7b67de9c36d8d2890/, http://catalog.elra.info/product_info.php?products_id=1102, http://cintil.ul.pt/ | Corpus with annotated MWEs | Portuguese | Unknown | Not Available Through Meta Share ELRA END USER Restrictions: Academic - Non Commercial Use | ELRA | CINTIL-Corpus Internacional do Português is a linguistically interpreted corpus of Portuguese. At present it is composed of 1 Million annotated tokens, verified by human expert annotators. The annotation comprises information on part-of-speech, open classes lemma and inflection, multi-word expressions pertaining to the class of adverbs and to the closed POS classes, and multi-word proper names (for named entity recognition). The corpus has been developed at the University of Lisbon by the NLX group at the Faculty of Sciences and the Anagrama group at the Centro de Linguística da Universidade de Lisboa. <Description from META-SHARE> | Yes (click continue to fill in more information) | IPR holders: António Branco, Amália Mendes | antonio.branco@di.fc.ul.pt | Florbela Barreto, António Horta Branco, Aida Cardoso, Amália Mendes, Fernanda Bacelar Nascimento, Raïssa Gillier and João Silva, CINTIL Corpus Internacional do Português: Annotation Manual, v. 7.0, 2012 | multi-word expressions pertaining to the class of adverbs and to the closed POS classes, and multi-word proper names (for named entity recognition). | ||||||||||||||
68 | 01/09/2014 13:04:01 | Italian Syntactic-Semantic Treebank (ISST) | http://metashare.ilsp.gr:8080/repository/browse/italian-syntactic-semantic-treebank-isst/ccc16e0ede7311e2b1e400259011f6eafc6f8055ac6343659ae911e80a008400/, http://catalog.elra.info/product_info.php?products_id=887 | Treebank with MWE annotations | Italian | Available, restricted use | Several licenses for different uses (academic/commercial) and users (ELRA members/non-members). | ELRA | ISST comprises 89,941 tokens for the financial-domain part and 215,606 tokens for the general part. It is formatted in XML. ISST has a five-level structure covering orthographic, morpho-syntactic, syntactic and semantic levels of linguistic description. Syntactic annotation is distributed over two different levels: the constituent structure level and the functional relations level. The fifth level deals with lexico-semantic annotation, which is carried out in terms of sense tagging of lexical heads (nouns, verbs and adjectives) augmented with other types of semantic information: ItalWordNet (see ELRA-M0018) is the reference lexical resource used for the sense tagging task. Both syntactic and lexico-semantic annotations refer to the morpho-syntactically annotated text, which in turn is linked to the orthographic file with the text and mark-up of macrotextual organisation (e.g. titles, subtitles, summary, body of article, paragraphs). <Description from META-SHARE> | Yes (click continue to fill in more information) | Contact person: Mapelli Valérie | mapelli@elda.org | http://www.aclweb.org/anthology/W00-1903 | Several. (The ones mentioned in the article added in the "Relevant publications" field are compounds, support verb constructions and idioms.) | The description field of the META-SHARE record contains detailed information about the features of the treebank. | The adopted morpho-syntactic annotation scheme conforms to the EAGLES international standard. The ISST functional annotation scheme is based on FAME (Lenci et al. 1999, 2000). | ISST corpus consists of about 300,000 word tokens reflecting contemporary language use. It includes two different sections: 1) a "balanced" corpus, testifying general language usage, for a total of about 210,000 tokens; 2) a specialised corpus, amounting to 90,000 tokens, with texts belonging to the financial domain. The balanced corpus contains a selection of articles from different types of Italian texts, namely newspapers (La Repubblica and Il Corriere della Sera) and a number of different periodicals which were selected to cover a high variety of topics (politics, economy, culture, science, health, sport, leisure, etc.). The financial corpus includes articles taken from Il Sole-24 Ore. All in all, they cover a 10 year time period (1985-1995). (Copied from http://www.aclweb.org/anthology/W00-1903) | |||||||||||
69 | 01/09/2014 13:24:25 | LX-Stopwords | http://metashare.ilsp.gr:8080/repository/browse/lx-stopwords/29892e16a35a11e1a404080027e73ea22e53349e39f348a7944b0b5bef6e9c41/, http://nlx.di.fc.ul.pt/ | List of stopwords with MWEs | Portuguese | 173 | Available, unrestricted use | Restrictions: Academic - Non Commercial Use, Commercial Use User Nature: Academic, Commercial Distribution Access/Medium: Downloadable | The LX-Stopwords resource is a manually compiled list of Portuguese words, composed of 2631 words of 51 types. The words are grouped in three big classes, arranged according to their morpho-syntactic category and inflectional feature value (closed classes, open classes, and multi-word units). This list was created as a support resource to develop the CRIVO/EtiFac tool (see Branco & Silva, 2001), a tool for the semiautomatic annotation of corpora. With this in mind, the list seeks to be as exhaustive a repository as possible of all word forms that belong to closed classes, items typically with high frequency and fixity. Taking into account the ambiguity between words of different categories, which means that some words from closed classes (1866 words) can be part of other categories, two classes were added to the list: open classes (592 words) and multi-word units (173 words), including only the words already contained in closed classes. <Description from META-SHARE> | Yes (click continue to fill in more information) | IPR holder: University of Lisbon, Faculty of Sciences. Licensor, distribution rights holder and contact person: António Branco | antonio.branco@di.fc.ul.pt | Catarina Carvalheiro, LX-Stopwords Narrative Description: http://194.117.45.196:2000/LX-Stopwords.pdf. Branco & Silva, EtiFac: A Facilitating Tool for Manual Tagging, pp. 81-90, Proceedings of XVII Encontro Anual da APL, 2001: http://www.di.fc.ul.pt/~ahb/nexing/main.htm | Samples location: http://194.117.45.196:2000/stopwordssample.txt <entries> <sub-class>_QNT#ms</sub-class> <list> <stopword>algum</stopword> <stopword>certo</stopword> <stopword>imenso</stopword> <stopword>muito</stopword> <stopword>nenhum</stopword> <stopword>numeroso</stopword> <stopword>pouco</stopword> <stopword>tanto</stopword> <stopword>todo</stopword> </list> </entries> <entries> <sub-class>_REL</sub-class> <list> <stopword>como</stopword> <stopword>onde</stopword> <stopword>que</stopword> <stopword>quem</stopword> <stopword>quê</stopword> </list> </entries> | ||||||||||||||
70 | 01/09/2014 13:32:23 | New Oxford Dictionary of English, 2nd Edition (NODE) | http://metashare.ilsp.gr:8080/repository/browse/new-oxford-dictionary-of-english-2nd-edition/9460637ede6b11e2b1e400259011f6ea58609ecf25e1458f8e72077ed6ad7a70/, http://catalog.elra.info/product_info.php?products_id=679 | Dictionary or lexicon with MWEs (also includes MWEs) | English | More than 10 000 | Available, restricted use | ELRA END USER Restrictions: Academic - Non Commercial Use For Members of ELRA User Nature: Academic ELRA END USER Restrictions: Academic - Non Commercial Use For Non Members of ELRA User Nature: Academic | ELRA | This is Oxford University Press's most comprehensive single-volume dictionary, with 170,000 entries covering all varieties of English worldwide. The NODE data set constitutes a fully integrated range of formal data types suitable for language engineering and NLP applications: It is available in XML or SGML. - Phrases and idioms. The NODE data set provides a rich and flexible codification of over 10,000 phrasal verbs and other multi-word phrases. It features comprehensive lexical resources enabling applications to identify a phrase not only in the form listed in the dictionary but also in a range of real-world variations, including alternative wording, variable syntactic patterns, inflected verbs, optional determiners, etc. <Description from META-SHARE> | Yes (click continue to fill in more information) | Contact person: Mapelli Valérie | mapelli@elda.org | phrasal verbs and other multi-word phrases | ||||||||||||||
71 | 01/09/2014 13:37:28 | Oxford English phonetics files | http://metashare.ilsp.gr:8080/repository/browse/oxford-english-phonetics-files/e986bb8ede6911e2b1e400259011f6eacf808bda74be4dc4879f8d2cf624cc4a/, http://catalog.elra.info/product_info.php?products_id=845 | Lists of word forms together with a representation of their IPA pronunciation (with MWEs) | English | "a large number" | Available, restricted use | ELRA END USER Restrictions: Academic - Non Commercial Use For Members of ELRA User Nature: Academic ELRA END USER Restrictions: Academic - Non Commercial Use For Non Members of ELRA User Nature: Academic | ELRA | Derived from a range of Oxford Dictionaries, these files list word forms together with a representation of their IPA pronunciation. It contains 250,000 words. Pronunciation is based on standard British English. Word forms include dictionary lemmas and inflections or other morphological variations, plus a wide range of proper name and encyclopedic material. The data also includes a large number of common multi-word phrases and compound nouns. The files are provided in XML. <Description from META-SHARE> | Yes (click continue to fill in more information) | Contact person: Mapelli Valérie | mapelli@elda.org | common multi-word phrases and compound nouns | Dictionary, Derived from a range of Oxford Dictionaries | |||||||||||||
72 | 01/09/2014 13:50:11 | SEJFEK4Spejd | http://metashare.ilsp.gr:8080/repository/browse/sejfek4spejd/07c31f266b0011e284b6000423bfd61c7e50e1c0b2b74065adaf52ade4365eeb/, http://zil.ipipan.waw.pl/SEJFEK4Spejd | Shallow grammar of multi-word economic terms | Polish | 11,270 automatically generated rules | Available, restricted use | CC - BY - SA Restrictions: Attribution, Share Alike Fee: free of charge Download locations: hidden Distribution Access/Medium: Downloadable GPL Restrictions: Share Alike Fee: free of charge Download location: hidden Distribution Access/Medium: Downloadable | CC and GPL | SEJFEK4Spejd is the SEJFEK lexicon (Grammatical Lexicon of Polish Economical Phraseology) converted into a lexicalized Spejd shallow grammar. It contains 11,270 automatically generated rules which recognize inflected, case-insensitive multi-word economic terms from the lexicon. Recognized multi-word terms are combined into syntactic words. During the analysis disambiguation (unification and POS-based selection of interpretations) of terms is also performed. <Description from META-SHARE> | Yes (click continue to fill in more information) | IPR and distribution rights holders: Institute of Computer Science, Polish Academy of Sciences. Contact persons: Bartosz Zaborowski, Aleksandra Wieczorek | bartosz.zaborowski@ipipan.waw.pl | SAVARY, A., ZABOROWSKI, B., KRAWCZYK-WIECZOREK, A., MAKOWIECKI, F. (2012): SEJFEK — a Lexicon and a Shallow Grammar of Polish Economic Multi-Word Units, in Proceedings of Cognitive Aspects of the Lexicon (COGALEX-III), a Workshop at COLING 2012, Mumbai, India. | ||||||||||||||
73 | 02/09/2014 08:16:19 | Slovene Lexical Database (SLD) | http://eng.slovenscina.eu/spletni-slovar/leksikalna-baza | Dictionary or lexicon with MWEs (also includes MWEs) | Slovene | 2,500 headwords with 10.946 lexical units, including: - 2.053 multi-word lexical units (non-idiomatic lexemes) - 1.446 phraseological units (idiomatic lexemes) - 44.626 collocations (2 content words) - 4.602 extended collocations (more than 2 content words) - 8.298 syntactic combinations | 14 | Also non-contiguous | Available, unrestricted use | CC BY-NC-SA 2.5 | Creative Commons (CC): http://creativecommons.org/examples | SLD is a lexical database with a comprehensive syntactic and semantic description of some of the most common Slovene nouns, verbs, adjectives and adverbs. It was designed as a reference dictionary for the general public, the school population and linguists; however, the encoded syntactic structures and patterns for each registered sense of the word also present an important lexical resource for NLP applications. The database is conceptualized as a network of interrelated lexico-grammatical information (sense, syntax, collocations, examples), with the lemma (headword) representing the top hierarchical level and functioning as the umbrella for all lexical units placed under it. MWEs are included either as independent lexical units (multi-word units, phraseological units) or as an integral part of other lexical units (collocations, extended collocations, syntactic combinations), depending on their degree of semantic compositionality. | Yes (click continue to fill in more information) | Ministry of Education, Science and Sport | info@slovenscina.eu | GANTAR, Polona, KREK, Simon, 2011: Slovene lexical database. In: Majchraková, D., Garabík, R. (eds.). Natural language processing, multilinguality: sixth international conference, Modra, Slovakia, 20-21 October 2011. pp. 72-80. KREK, Simon, 2012: New Slovene sketch grammar for automatic extraction of lexical data. 
SKEW3, 3rd International Sketch Engine Workshop, 21-22 March 2012, Brno, Czech Republic. | both | 2.053 multi-word units (non-idiomatic lexemes) 1.446 phraseological units (idiomatic lexemes) 44.626 collocations (2 content words) 4.602 extended collocations (more than 2 content words) 8.298 syntactic combinations | 2.386 multi-word units 2.175 phraseological units 102.292 collocations 8.420 extended collocations 10.789 syntactic combinations | 1. collocations (2 content words) 1.1. extended collocations (more than 2 content words) 2. syntactic combinations (semantically transparent, structurally fixed), i.e. combinations with numeric elements, combinations with proper nouns, combinations with prepositions, coordinate structures, similes and analogies 3. multi-word lexical units (at least partly semantically opaque, structurally fixed, not idiomatic) 4. phraseological units (completely semantically opaque, idiomatic meaning is marked) | Automatically extracted from the reference 1-billion-word Gigafida corpus of written Slovene (http://eng.slovenscina.eu/korpusi/gigafida) and manually validated. | Corpus | 0. headword: davek (tax, n) 1. collocations: [prometni, vstopni, plačani] davek, [visok] davek; [odmera, uvedba, plačevanje] davka, [utaja] davkov; [plačevati, pobirati, znižati] davke etc. 2. syntactic combinations: biti oproščen davka; cena brez davka; davek po odbitku etc. 3. multi-word lexical units: davek na dodano vrednost 4. phraseological units: krvni davek | |||||
74 | 02/09/2014 08:22:16 | ssj500k dependency treebank | http://eng.slovenscina.eu/tehnologije/ucni-korpus | Treebank with MWE annotations | Slovene | - 500.000 tokens with lemmas and POS tags - 235.000 tokens with dependency tree links, i.e. 11.000 dependency parsed sentences - 4.398 named entities (including multi-word NEs) | 23 | Only contiguous | Available, unrestricted use | CC BY-NC-SA 2.5 | Creative Commons (CC): http://creativecommons.org/examples | The ssj500k treebank was built as a training corpus for machine-learning NLP applications and includes balanced sampled texts from the reference corpus of written Slovene. All texts have been manually segmented, tokenized and annotated in terms of lemmatization, morphosyntactic tagging, dependency parsing (approx. 1/2) and named entity identification (approx. 1/5). Currently, only multi-word named entities (personal names, place names, organisation names and proper names) are explicitly annotated. | Yes (click continue to fill in more information) | Ministry of Education, Science and Sport | info@slovenscina.eu | 1483 | 1709 | 1. "geo": place name (215) 2. "org": name of organisation (392) 3. "person": name of person (814) 4. "other": proper names (288) | Corpus | <name type="other"> Grand ssj4.15.54.t16 grand Grand Npmsn Slmei National ssj4.15.54.t17 national National Npmsn Slmei </name> | ||||||||
75 | 02/09/2014 10:52:18 | Serbian Wordnet (SrpWN) | http://metashare.ilsp.gr:8080/repository/browse/serbian-wordnet/e3c4ffae8bde11e288f7001517144592cf4cb1f92d7644319d6c1d339f4d0229/, http://korpus.matf.bg.ac.rs/SrpWN | Wordnet with MWEs | Serbian | 10164 | Available, restricted use | CC - BY - NC Restrictions: Academic - Non Commercial Use Fee: no price Download location: hidden Distribution Access/Medium: Downloadable | Creative Commons (CC): http://creativecommons.org/examples | Serbian WordNet (SrpWN) represents a lexical semantic network, containing synsets with glosses and various semantic relations, such as antonymy, meronymy, causation, category domain, etc. The initial version of the Serbian Wordnet was produced in the scope of the EU-funded Balkanet project and it contains all synsets from basic concept sets 1 and 2, and two thirds of synsets from basic concept set 3. Through interlingual relations it is connected to English Wordnet (versions 2.0 and 3.0) and wordnets of many other languages. Currently the Serbian Wordnet contains 18,366 synsets (literals 31,274): 1380 adjectives (literals 1887), 2104 verbs (literals 3918), 14,765 nouns (literals 25,298), other 117. 706 synsets are not connected to the PWN, being either Balkan specific concepts (532) or Serbian specific concepts (174). 18,310 synsets have definitions in Serbian, and 1,274 have examples of usage. Semantic relations in SrpWN: hypernym - 16,590; holo_part - 1,298; holo_member - 3,831; holo_portion - 118; near_antonym - 736; be_in_state - 252; causes - 63. From 31,274 literals in SrpWN 10,164 are multi-word units. <Description from META-SHARE> | Yes (click continue to fill in more information) | IPR holder: Cvetana Krstev | cvetana@matf.bg.ac.rs | C. Krstev and B. Djordjević and S. Antonić and N. Ivković-Berček and Z. Zorica and V. Crnogorac and L. Macura, "Cooperative Work in Further Development of Serbian Wordnet," INFOtheca, vol. 9, pp. 59a-78a, May 2008. 
Cvetana Krstev, Ivan Obradović, Duško Vitas, “An Approach to the Development of Language Specific Concepts in Wordnets”, In Southern Journal of Linguistics, Special Theme: South Slavic and Balkan Languages, Mila Dimitrova-Vulchanova (ed.), Vo. 29, No. 1/2, pp. 106-118, Department of Modern Linguistics, University of Mississippi, 2008. (More references listed in META-SHARE) | ||||||||||||||
76 | 02/09/2014 11:19:55 | The database of Estonian multi-word expressions (ESTMWE) | http://metashare.ilsp.gr:8080/repository/browse/the-database-of-estonian-multi-word-expressions/4d8252e8463411e2a6e4005056b400243ed5ec91ec5044bbb0e85b2ce16f472b/, https://metashare.ut.ee/repository/browse/the-database-of-estonian-multi-word-expressions/4d8252e8463411e2a6e4005056b400243ed5ec91ec5044bbb0e85b2ce16f472b/, http://www.cl.ut.ee/ressursid/pysiyhendid/index.php?lang=en | MWE dictionary or lexicon (MWEs only) | Estonian | 12500 | Available, unrestricted use | Proprietary Restrictions: Academic - Non Commercial Use Distribution Access/Medium: Accessible Through Interface CC - BY Restrictions: Attribution Distribution Access/Medium: Downloadable | Creative Commons (CC): http://creativecommons.org/examples | This database contains a subtype of multi-word expressions, namely those consisting of a verb and a particle or a verb and its complements. | Yes (click continue to fill in more information) | IPR holder: Tartu Ülikool, University of Tartu. Contact person: Kadri Muischnek | Documentation of information recorded in the database, references, etc: http://www.cl.ut.ee/ressursid/pysiyhendid/index.php?lang=en | This database contains a subtype of multi-word expressions, namely those consisting of a verb and a particle or a verb and its complements. The expressions consisting of a verb and its subject are not included. The multi-word units consisting of a verb and a infinite form of a verb are included irregularly. Subtypes: yv – particle verb nv – expression consisting of a noun (phrase) and a verb; could be divided further into idiomatic expressions and collocations tv – support verb construction av – catenative verb construction | ||||||||||||||
77 | 02/09/2014 12:12:00 | THAMUS lexicons | http://metashare.ilsp.gr:8080/repository/search/?q=thamus | Monolingual and bilingual lexicons with MWEs | Italian, German >Italian, Italian>German, English>Italian, Italian>English | Available, restricted use | Several licenses for different uses (academic/commercial) and users (ELRA members/non-members). | ELRA | 28 generic and technical (domain-specific) mono- and bilingual dictionaries. Multi-word terms contain morphological coding for the head word. A full list of the dictionaries, including dictionary name, (the name reflects both domain and linguality), ELRA ID, ELRA catalogue URL, a specification of domain and whether the entries are in canonical or inflected form, the number of entries and language(s), will be made available at the PARSEME WG1 wiki. | Yes (click continue to fill in more information) | Contact person: Mapelli Valérie | mapelli@elda.org | ||||||||||||||||
78 | 04/09/2014 06:59:31 | English NN compounds | http://www.csse.unimelb.edu.au/research/lt/resources/ncompound/ncompound.tgz | Monolingual list of MWEs | English | Total instances: 2169. Test instances: 1081 (file name: test). Training instances: 1088 (file name: train). Semantic relations: 20 | 2 | Only contiguous | Available, unrestricted use | This dataset is made available under the terms of the Creative Commons Attribution 3.0 Unported licence (http://creativecommons.org/licenses/by/3.0/), with attribution via citation of the following paper, which describes the dataset in full detail: Kim, Su Nam and Timothy Baldwin (2008) Standardised Evaluation of English Noun Compound Interpretation, In Proceedings of LREC 2008 Workshop: Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, pp. 39-42. The paper can be found in the PDF at: http://www.lrec-conf.org/proceedings/lrec2008/workshops/W20_Proceedings.pdf | Creative Commons (CC): http://creativecommons.org/examples | This tarball contains the set of noun-noun compounds annotated for semantic relation originally presented in: Kim, Su Nam and Timothy Baldwin (2005) Automatic Interpretation of Noun Compounds using WordNet Similarity, In Proceedings of the Second International Joint Conference on Natural Language Processing (IJCNLP-05), Jeju, South Korea, pp. 945-56. <From README.txt in tar folder> | Yes (click continue to fill in more information) | Tim Baldwin and Su Nam Kim | tb@ldwin.net | Kim, Su Nam and Timothy Baldwin (2008) Standardised Evaluation of English Noun Compound Interpretation, In Proceedings of LREC 2008 Workshop: Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, pp. 39-42. 
The paper can be found in the PDF at: http://www.lrec-conf.org/proceedings/lrec2008/workshops/W20_Proceedings.pdf http://people.eng.unimelb.edu.au/tbaldwin/pubs/nlpke2008.pdf | Noun-noun compounds | Semantic relations between the components of the compound | FORMAT: Format for each instance in the test and training data: NOUN1 NOUN2 RELATION (e.g. "apple pie material" => NC = "apple pie", with semantic relation "material", indicating that the "pie" is made of "apple") SEMANTIC RELATIONS: The semantic relations are as defined in: Barker, Ken and Stan Szpakowicz (1998) Semi-automatic recognition of noun modifier relationships. In Proceedings of the 17th International Conference on Computational Linguistics (COLING 1998), Montreal, Canada, pp. 96-102. | |||||||||
79 | 04/09/2014 08:17:48 | English compound nominalisation interpretation dataset | http://www.csse.unimelb.edu.au/research/lt/resources/nominalisation/nominalisation.tgz | Monolingual list of MWEs | English | 464 compounds in total | Available, unrestricted use | This dataset is made available under the terms of the Creative Commons Attribution 3.0 Unported licence (http://creativecommons.org/licenses/by/3.0/), with attribution via citation of the following paper, which describes the dataset in full detail: Nicholson, Jeremy and Timothy Baldwin (2008) Interpreting Compound Nominalisations, In Proceedings of LREC 2008 Workshop: Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, pp. 43-45. The paper can be found in the PDF at: http://www.lrec-conf.org/proceedings/lrec2008/workshops/W20_Proceedings.pdf | Creative Commons (CC): http://creativecommons.org/examples | The dataset is based on a random sample of 1000 sentences from the BNC. Three annotators independently identified all binary compound nouns in the dataset, and got together to resolve any disagreements. This led to a total of 464 compound nouns, of which 119 consisted of one or more proper nouns and were excluded (and are tagged as "PN"), leaving 345 compound nouns. Each of these was then again multiply annotated according to the 5-way classification of SUB, DOB, POB, NA or NV, as described below: SUB: the head noun is deverbal, and the modifier corresponds to the subject of the base verb (e.g. "student demonstration", interpreted as "_student(s)_ _demonstrate_") DOB: the head noun is deverbal, and the modifier corresponds to the object of the base verb (e.g. "eye irritation", interpreted as "[SOMETHING] _irritates_ the _eye_") POB: the head noun is deverbal, and the modifier corresponds to a prepositional argument of the base verb (e.g. 
"bird cage", interpreted as "_cage_ for _bird_") NA: the head noun is deverbal, but the modifier is not an argument of the base verb in an acceptable paraphrase (e.g. "memory size", where "size" can be interpreted as being deverbal, but not meaningfully in this context) NV: the head noun is not deverbal (e.g. "scout hut") In the case that the head noun is (potentially) deverbal, the base verb is provided. | Yes (click continue to fill in more information) | Jeremy Nicholson, Tim Baldwin | tb@ldwin.net | Nicholson, Jeremy and Timothy Baldwin (2008) Interpreting Compound Nominalisations, In Proceedings of LREC 2008 Workshop: Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, pp. 43-45. http://www.lrec-conf.org/proceedings/lrec2008/workshops/W20_Proceedings.pdf | The dataset is based on a random sample of 1000 sentences from the British National Corpus (BNC: Burnard (2000)). 32% of the sentences were found to contain at least one compound noun, with 464 compounds in total. About a quarter (119) of these were identified as containing one or more proper nouns. | Compound nominalisations | SAMPLE ANNOTATION: Annotated examples include: <doc> Demand for the new car is strongest in large urban areas like New <cn rel="PN" hvf="">York city</cn> , Los Angeles and Miami , where bomb ings , riots and car-jackings fill the <cn rel="NA" hvf="bulletin">news bulletins</cn> . </doc> where "news bulletin" has been identified as a compound noun, the head noun has been identified as deverbal (base verb = "bulletin"), and the noun compound type has been tagged as "NA"; <doc> Demand for the new car is strongest in large urban areas like New <cn rel="PN" hvf="">York city</cn> , Los Angeles and Miami , where bomb ings , riots and car-jackings fill the <cn rel="NA" hvf="bulletin">news bulletins</cn> . 
</doc> where "York city" has been identified as a compound noun but tagged as incorporating a proper noun ("PN"), and "news bulletin" has also been identified as a compound noun, with "bulletin" being derived from the base verb "bulletin" but the compound type again being "NA"; and <doc> During my first attack I experienced some very inaccurate <cn rel="POB" prep="in" com="ADJ" hvf="fire">return fire</cn> which ceased just before I broke away . </doc> where "return fire" is a compound noun, "fire" is the base verb of the head noun, and the modifier is a prepositional object ("POB") of the head noun, where the preposition is "in" (i.e. the interpretation is of the form "fire in return"). | |||||||||||
80 | 04/09/2014 08:25:46 | Deep lexical acquisition of English verb-particle constructions | http://www.csse.unimelb.edu.au/research/lt/resources/vpc/vpc.tgz | Monolingual list of MWEs | English | Available, unrestricted use | This dataset is made available under the terms of the Creative Commons Attribution 3.0 Unported licence (http://creativecommons.org/licenses/by/3.0/), with attribution via citation of the following paper, which describes the dataset in full detail: Baldwin, Timothy (2008) A Resource for Evaluating the Deep Lexical Acquisition of English Verb-Particle Constructions, In Proceedings of LREC 2008 Workshop: Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, pp. 1-2. The paper can be found in the PDF at: http://www.lrec-conf.org/proceedings/lrec2008/workshops/W20_Proceedings.pdf | Creative Commons (CC): http://creativecommons.org/examples | This is a sample of VPC token instances identified by the (various) POS tagger-, chunker-, chunk grammar-, and parser-based extraction methods of: Baldwin, Timothy (2005) The Deep Lexical Acquisition of English Verb-particle Constructions, Computer Speech and Language, Special Issue on Multiword Expressions, Volume 19, Issue 4, pp. 398-414. as having high confidence of being evidence of either an intransitive VPC or (simple) transitive VPC for a given verb--preposition combination. The data is separated into individual sets of instances for each verb--preposition combination, with up to 50 (putative) token instances each of the two valences. In addition, there is a gold-standard set of intransitive and transitive VPCs generated by hand-checking the sets of evidence to check that there is at least one true positive VPC instance, and further filtering out simple adverbial VPCs (e.g. "walk in"). Full details of the different files can be found in readme.file, and full details of the different tasks can be found in readme.task. 
<From README.txt> | Yes (click continue to fill in more information) | Tim Baldwin | tb@ldwin.net | Baldwin, Timothy (2008) A Resource for Evaluating the Deep Lexical Acquisition of English Verb-Particle Constructions, In Proceedings of LREC 2008 Workshop: Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, pp. 1-2. The paper can be found in the PDF at: http://www.lrec-conf.org/proceedings/lrec2008/workshops/W20_Proceedings.pdf | particle verbs (VPC) | ||||||||||||||
81 | 04/09/2014 08:31:29 | List of MWE resources (part of a chapter on MWEs in the Handbook of Natural Language Processing, second edition) | http://handbookofnlp.cse.unsw.edu.au/?n=Chapter12.Chapter12 | List of MWE resources, tools, workshops and bibliographic references | several | A list of MWE resources that Tim Baldwin and Su Nam Kim put together as part of a chapter on MWEs in the Handbook of Natural Language Processing, second edition: http://handbookofnlp.cse.unsw.edu.au/?n=Chapter12.Chapter12 | Yes (click continue to fill in more information) | Contact person: Tim Baldwin | tb@ldwin.net | @incollection{baldwin-handbook10, author = {Timothy Baldwin and Su Nam Kim}, title = {Multiword Expressions}, booktitle = {Handbook of Natural Language Processing, Second Edition}, editor = {Nitin Indurkhya and Fred J. Damerau}, publisher = {CRC Press, Taylor and Francis Group}, address = {Boca Raton, FL}, year = {2010}, note = {ISBN 978-1420085921} } | Several | |||||||||||||||||
82 | 04/09/2014 08:39:26 | Bilingual Spanish-English and English-Spanish lexicons (INCYTA) | http://metashare.ilsp.gr:8080/repository/search/?q=INCYTA, | Bilingual lexicon with MWEs | Spanish > English, English > Spanish | Available, restricted use | Several licenses for different uses (academic/commercial) and users (ELRA members/non-members). | ELRA | Collection of bilingual lexicons from several domains. The metadata is collected from the META-SHARE catalogue. All lexicons have the (MWE survey) category "Bilingual lexicon with MWEs", and it seems like all of them are bidirectional English>Spanish and Spanish>English (this must be checked and verified). | Yes (click continue to fill in more information) | Contact person: Mapelli Valérie | mapelli@elda.org | ||||||||||||||||
83 | 04/09/2014 12:02:12 | SIGLEX-MWE Software & Resources for MWE | http://multiword.sourceforge.net/PHITE.php?sitesig=FILES, http://sourceforge.net/projects/multiword/ | data sets with MWEs | several | 25 data sets, 6 tools | Available, unrestricted use | GNU General Public License version 2.0 (GPLv2) | GPL | The central forum for the MWE community. Share your open-source data sets and MWE extraction tools, exchange ideas on evaluation strategies and further development of the tools, and discuss theoretical definitions and linguistic properties of MWEs. <From http://sourceforge.net/projects/multiword/> | No (click continue to submit) | |||||||||||||||||
84 | 12/09/2014 09:52:55 | Algemeen Nederlands Woordenboek (ANW, Dictionary of Contemporary Dutch) | https://catalog.clarin.eu/vlo/record?4&docId=hdl_58_10032_47_056f1e3bdb30c3ac022916421452e7f0&q=multiword+expressions&index=0&count=13, http://anw.inl.nl/search | Dictionary or lexicon with MWEs (also includes MWEs) | Dutch | Available, restricted use | free for academic use; non applicable for commercial parties | unknown | The Algemeen Nederlands Woordenboek (ANW, Dictionary of Contemporary Dutch) is a corpus-based, scholarly dictionary of contemporary standard Dutch in the Netherlands and in Flanders, describing the Dutch vocabulary from 1970 onwards. The dictionary provides information on form, content and use of words belonging to the general vocabulary of Dutch and it focuses on written language. It provides semasiological and onomasiological access to the dictionary and is meant to be useful for a wide range of users. The ANW can be characterised as an online dictionary under construction. | Yes (click continue to fill in more information) | Creator: Institute of Dutch Lexicology (Instituut voor Nederlandse Lexicologie, INL), Description and Production (Descriptie en Productie) | servicedesk@inl.nl | "Lexical subtypes": proper names, terminology, multi-word expressions | |||||||||||||||
85 | 14/09/2014 08:00:40 | MkdComp | no website | MWE dictionary or lexicon (MWEs only) | Macedonian | 784 | 6 | Only contiguous | Unknown | yes | Yes (click continue to fill in more information) | Aleksandar Petrovski | a.petrovski.sise@gmail.com | intensional and extensional | 784 | 6273 | 42 | Compound nouns, compound adjectives, adverbs, conjunctions, prepositions, compound terminology, named entities | Dictionary, Corpus | |||||||||
86 | 23/09/2014 13:09:34 | JRC-Names | https://ec.europa.eu/jrc/en/language-technologies/jrc-names | Multilingual parallel list of MWEs | dozens of languages written in over twenty different scripts | about 280,000 distinct names (first name and last name) plus about 320,000 spelling variants (status September 2014), growing daily | Only contiguous | Available, restricted use | European Commission End-user Licence Agreement (EULA), mostly free see http://optima.jrc.it/Resources/LICENCE-EULA_JRC-Names_2011.pdf | JRC-Names is a highly multilingual named entity resource for person and organisation names (called 'entities'). It consists of large lists of names and their many spelling variants (up to hundreds for a single person), including across scripts (Latin, Greek, Arabic, Cyrillic, Japanese, Chinese, etc.). The named entity resource file with the list of spelling variants is accompanied by Java-implemented demonstrator software that (a) allows to produce - for any input name - a list of known spelling variants, and that (b) analyses UTF8-encoded text files to find known entity mentions, returning the name variant found, the preferred display name for that entity, the unique name identifier for that name, the position of the entity name in the text, and its length in characters. All entity variants were found in real-life text. Spelling mistakes are included on purpose as these occur in real life and they help retrieve intended name mentions. | Yes (click continue to fill in more information) | European Commission - Joint Research Centre (JRC) | Ralf.Steinberger@jrc.ec.europa.eu | Steinberger Ralf, Bruno Pouliquen, Mijail Kabadjov, Jenya Belyaeva & Erik van der Goot (2011). JRC-Names: A freely available, highly multilingual named entity resource. Proceedings of the 8th International Conference Recent Advances in Natural Language Processing (RANLP). Hissar, Bulgaria, 12-14 September 2011. 
| 280000 | 320000 | named entities | Corpus | 24 P Vladimir+Putin 24 P Владимр+Путин 24 P วลาดิมีร์+ปูติน 24 P Влади́мир+Влади́мирович+Пу́тин 24 P Vadimir+Poutine 24 P 普京 24 P 弗拉基米尔普京 24 P Vladimir+Putin+Владимир+Путин 24 P فلاديمير+بوتين 24 P ولادمير+پوتين 24 P ვლადიმირ+პუტინი 24 P فلادمير+بوتين 24 P 弗拉基米尔•普京 24 P Vladimir+Vladimirovitch+Putin 24 P Владимиры+фырт+Владимир+Путин 24 P Vladimir+Putín 24 P Vlagyimír+Putyin 24 P Vladìmir+Putin 24 P 弗拉基米尔•弗拉基米罗维奇•普京 24 P Vladimir+Puttin 24 P Vladimir+Vladimorovich+Putin 24 P ウラジーミルプーチン 24 P Vladimir+Poutin 24 P Вадимир+Путин 24 P Βλαντίμιρ+Πούτιν 24 P Władimir+Putin 24 P Vladimira+Putina 24 P Valdimir+Poetin 24 P Владмир+Путин 24 P Władimira+Putin 24 P וולאדימיר+פוטין 24 P Vladimr+Poutine 24 P Valdímir+Putin 24 P فلاديمير+جيريرو 24 P Владимир+Владимирович+Путин 24 P Vladimir+Poutine 24 P Vladmir+Putin 24 P Vladimir+Putin-Владимир+Путин 24 P Vladimirju+Putinu 24 P Владимиир+Путин 24 P ލަޑިމިއަރ+ޕޫޓިން 24 P Vladimir+Vladimirovic+Putin 24 P Vladimir+Vladimorovitsj+Poetin 24 P Vladimir+Vladimirovich+Putin 24 P Vladimirus+Putin 24 P Vladimir+Vladimirovic+Poutine 24 P Путін+Володимир 24 P Vladimir+Vladimirovič+Putin 24 P Vlidamir+Putin 24 P Vládimir+Putin 24 P 弗拉基米尔+普京 24 P Vladimír+Putin 24 P Wladimr+Putin 24 P Vladamir+Poutine 24 P Уладзімір+Пуцін 24 P Vladimir+Vladimirovitsj+Poetin 24 P Vladimir+Poetin 24 P Vladamir+Putin 24 P Vladimir+Ptin 24 P Վլադիմիր+Պուտին 24 P Vladímir+Vladímirovich+Putin 24 P Vladimiras+Putinas 24 P Vladímir+Putin 24 P Wladimir+Poetin 24 P ウラジーミル・プーチン 24 P Vladimirjem+Putinom 24 P Vladimirr+Putin 24 P Vladimier+Poetin 24 P Vladimir+Vladimirovitj+Putin 24 P 弗拉基米尔+弗拉基米罗维奇+普京 24 P Vladimirja+Putina 24 P Βλαντιμίρ+Πούτιν 24 P Vladímir+Ptin 24 P Vadimir+Putin 24 P Vladimir+Pekhtin 24 P Vlagyimir+Vlagyimirovics+Putyin 24 P Waldimir+Putin 24 P Putin+Vladimir 24 P Valadimir+Poutine 24 P Vladmir+Poutine 24 P Vladimir+Putyin 24 P 弗拉基米尔弗拉基米罗维奇普京 24 P Vlagyimir+Putyin 24 P 블라디미르+푸틴 24 P 
Wladimir+Wladimirowitsch+Putin 24 P Vladimir+Ptuin 24 P Wladimir+Poutine 24 P Wlaidimir+Putin 24 P விளாடிமிர்+பூட்டின் 24 P Vladimir+PUTIN 24 P Vladimir+Putin+Vladimir+Yakovlev 24 P Vlaidimir+Putin 24 P Valdiimir+Putin 24 P Путін+Володимир+Володимирович 24 P ولادیمیر+پوتین 24 P Владимиръ+Пѹтинъ 24 P Владимир+Путин 24 P Владамир+Путин 24 P Vladimir+Pútin 24 P Vladimin+Putin 24 P Wiladimir+Putin 24 P Vladimir+Vladimirovici+Putin 24 P ולדימיר+פוטין 24 P Władymir+Putin etc. ... | |||||||||
87 | 23/09/2014 13:29:34 | Parallel English-French split phrasal verbs | http://cameleon.imag.fr/xwiki/bin/view/Main/Phrasal_verbs_annotation | Multilingual parallel list of MWEs | English, French | 750 | 2 | Also non-contiguous | Available, unrestricted use | We evaluated the difficulty of translating English phrasal verbs (e.g. give up, take off) into French using a standard Moses SMT system. We focused on transitive, split occurrences (e.g. take my shirt off) and compared hierarchical and phrase-based models. The resource contains English sentences with split phrasal verbs marked, and the corresponding automatic SMT translations in French with the translations of the source phrasal verbs also marked. Reference translations are not provided but can be easily retrieved from the WIT3 TED Corpus. | No (click continue to submit) | |||||||||||||||||
88 | 23/09/2014 22:13:51 | ITU Web2.0 Treebank | no website yet | Treebank with MWE annotations | Turkish | 2860 MWEs, 5K sentences | 3 | Also non-contiguous | Available, restricted use | academic use only, commercial use for a fee | not specified yet | yes | The ITU Web2.0 Treebank is a recent effort to create a web treebank for Turkish. It has annotations in multiple layers: normalization, morphology, MWEs and syntax. | No (click continue to submit) | ||||||||||||||
89 | 23/09/2014 22:16:26 | ITU-METU-Sabancı Turkish Dependency Treebank | no | Treebank with MWE annotations | Turkish | 3531 | 3 | Also non-contiguous | Available, restricted use | academic use only | not specified yet | yes | A reannotation of the METU-Sabancı Turkish Treebank with new dependency annotation schemes and MWEs | No (click continue to submit) | ||||||||||||||
90 | 29/09/2014 08:59:46 | WikiMwe | www.ukp.tu-darmstadt.de/data/wikimwe/ | Monolingual list of MWEs | English | > 350,000 | 4 | Only contiguous | Available, restricted use | CC-BY-SA | Creative Commons (CC): http://creativecommons.org/examples | WikiMwe is a large resource of English multiword expressions mined from Wikipedia. It contains over 350,000 multiword units of size 2-4, including technical terminology, non-compositional multiword expressions, and collocations. For each entry, POS and frequency information and pointwise mutual information (PMI) scores are included. Additionally, we provide definitional and category information for many entries, in order to facilitate the application of the resource in theoretical (semantic similarity, domain disambiguation) and applied (terminology extraction) natural language processing research. Details on WikiMwe can be found in the following publication: S. Hartmann, G. Szarvas, and I. Gurevych (2011). Mining Multiword Terms from Wikipedia, in M.T. Pazienza & A. Stellato (Eds.): Semi-Automatic Ontology Development: Processes and Resources, pp. 226-258, Hershey, PA, USA: IGI Global. | Yes (click continue to fill in more information) | Silvana Hartmann, UKP Lab, Technische Universität Darmstadt | hartmann@ukp.informatik.tu-darmstadt.de | http://www.ukp.tu-darmstadt.de/publications/details/?no_cache=1&pub_id=TUD-CS-2011-0204&type=99&bibtex=yes | no inflection patterns | English Wikipedia corpus | ||||||||||
91 | 01/10/2014 06:33:10 | Ontology of Rhetorical Figures for Serbian (RetFig) | http://resursi.mmiljana.com/RetFigS.aspx | ontology | Serbian | 98 figures | Unknown | yes | The RetFig page http://resursi.mmiljana.com/RetFigS.aspx contains a classification of rhetorical figures in Serbian. Clicking on the + sign shows an example for each rhetorical figure. If you wish to download the ontology in XML or OWL format, you first need to send an authentication request to the moderator (Kontakt form) and once you get your username and password, you can sign up (Prijava form) and you will see a link for download at the bottom of the page. If you wish to find out a bit more about the ontology, this paper gives more details: Ontology of Rhetorical Figures for Serbian Miljana Mladenović, Jelena Mitrović, Text, Speech, and Dialogue, Lecture Notes in Computer Science Volume 8082, 2013, pp 386-393 http://link.springer.com/chapter/10.1007%2F978-3-642-40585-3_49 From the paper introduction: "Natural language texts are not always ”flat” with unique, ordinary, untwisted literal meaning. On the contrary, texts written in a natural language almost always have more than one meaning, due to the usage of various linguistic operations over words, phrases, sentences, et cetera. Without taking these facts into consideration, we can get incomplete and imprecise results in some NLP tasks. This especially holds true in areas of opinion mining, sentiment analysis and discourse analysis. For example, if we say ”He is as fast as light”, this statement will be marked as a positive opinion statement. On the other hand, if we say ”He is as fast as a turtle”, opinion mining techniques will not show the correct result unless we include the process of detection of rhetorical figures. 
Our first task, in this direction, is to create the very first formal and comprehensive domain ontology of rhetorical figures in Serbian that will lead us, primarily, towards an ontology based semantic tool for annotation of rhetorical figures and implementations in other NLP tasks." | Yes (click continue to fill in more information) | Miljana Mladenovic, Jelena Mitrovic | jmitrovic@gmail.com | Mladenović, Miljana, and Jelena Mitrović. "Ontology of Rhetorical Figures for Serbian." Text, Speech, and Dialogue. Springer Berlin Heidelberg, 2013. | |||||||||||||||
92 | 02/10/2014 08:27:42 | Pattern Dictionary of English Verbs | http://pdev.org.uk/ | Pattern Dictionary of English Verbs | English | 5793 verbs | No (click continue to submit) | |||||||||||||||||||||
93 | 08/10/2014 13:55:44 | BabelNet | http://babelnet.org | Dictionary or lexicon with MWEs (also includes MWEs) | 50 languages. | 49 million lemmas | Only contiguous | Available, restricted use | CC-BY-NC | Creative Commons (CC): http://creativecommons.org/examples | BabelNet is both a multilingual encyclopedic dictionary, with lexicographic and encyclopedic coverage of terms, and a semantic network which connects concepts and named entities in a very large network of semantic relations, made up of more than 9 million entries, called Babel synsets. Each Babel synset represents a given meaning and contains all the synonyms which express that meaning in a range of different languages. | No (click continue to submit) | ||||||||||||||||
94 | 13/10/2014 17:24:18 | SemLex | http://ufal.mff.cuni.cz/lexemann | MWE dictionary or lexicon (MWEs only) | Czech | almost 9,000 MWEs | 12 | Only contiguous | Available, unrestricted use | CC-BY | Creative Commons (CC): http://creativecommons.org/examples | The SemLex lexicon was compiled during the annotation of MWEs in the Prague Dependency Treebank and it should contain all MWEs occurring in it. There are almost 9,000 MWEs in the lexicon and they are connected to the text data (800,000 words). Each entry includes its basic form, lemmas, frequency of the MWE in annotated corpora, syntactic structure (i.e. the topology of lemmas in the dependency tree) and deep syntactic structure (analogously, with the tectogrammatical tree and lemmas). Most entries carry the part of speech of the whole phrase (i.e. the phrase can be used in place of, e.g., a noun in the sentence). Some of them have a gloss, an example, or synonyms. There is no categorization according to, for instance, the PoS of the MWE components -- but this information can be obtained from the corpus. | Yes (click continue to fill in more information) | Charles University in Prague, ÚFAL (Pavel Straňák, Eduard Bejček) | bejcek@ufal.mff.cuni.cz | Bejček Eduard, Straňák Pavel: Annotation of Multiword Expressions in the Prague Dependency Treebank. In: Language Resources and Evaluation, Vol. 44, No. 1-2, Copyright © Springer Netherlands, ISSN 1574-020X, pp. 7-21, Apr 2010 Straňák Pavel: Annotation of Multiword Expressions in The Prague Dependency Treebank. Ph.D. thesis, Univerzita Karlova v Praze, Prague, Czech Republic, 79 pp., Sep 2010 | 8800 | Part of speech of the whole MWE together. Each MWE is linked to (several) occurrences in the data. | Functional Generative Description | Corpus | BASIC_FORM: bezpečnostní pás (= safety belt) LEMMATIZED: bezpečnostní pás GLOSS: dlouhý pruh n. předmět v růz. 
zařízeních (brief description) POS: N PDT25_FREQ: 1 TREE_STRUCT: pás — [head] bezpečnostní → 1 BASIC_FORM: zákon o dani z nemovitostí (= real estate tax law) LEMMATIZED: zákon o daň z nemovitost POS: N PDT25_FREQ: 0 (due to an inter-annotator disagreement) TREE_STRUCT: zákon — [head] daň → 1 nemovitost → 2 BASIC_FORM: Jihovýchodní Asie (= Southeast Asia) LEMMATIZED: jihovýchodní Asie POS: N PDT25_FREQ: 4 TREE_STRUCT: Asie — [head] jihovýchodní → 1 BASIC_FORM: držet hubu (= shut up) LEMMATIZED: držet huba POS: V PDT25_FREQ: 1 TREE_STRUCT: držet — [head] huba → 1 BASIC_FORM: tím pádem (= consequently) LEMMATIZED: ten pád POS: D PDT25_FREQ: 7 TREE_STRUCT: pád — [head] tím → 1 | |||||||
95 | 13/10/2014 23:38:10 | VALLEX | http://ufal.mff.cuni.cz/vallex | Valency lexicon | Czech | 4250 verbs, 6460 entries | Also non-contiguous | Available, unrestricted use | CC 3.0 BY-NC-SA | Creative Commons (CC): http://creativecommons.org/examples | The Valency Lexicon of Czech Verbs, Version 2 (VALLEX 2.x), is a collection of linguistically annotated data and documentation, resulting from an attempt at formal description of valency frames of Czech verbs. VALLEX 2.x has been developed at the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague. VALLEX 2.x is a successor of VALLEX 1.0, extended in both theoretical and quantitative aspects. VALLEX 2.x provides information on the valency structure (combinatorial potential) of verbs in their particular senses. VALLEX is closely related to the Prague Dependency Treebank project: both of them use Functional Generative Description (FGD), being developed by Petr Sgall and his collaborators since the 1960s, as the background theory. In VALLEX 2.x, there are roughly 2,730 lexeme entries containing together around 6,460 lexical units ("senses"). Note that VALLEX 2.x - according to FGD, but unlike traditional dictionaries and also unlike VALLEX 1.0 - treats a pair of perfective and imperfective aspectual counterparts as a single lexeme (if perfective and imperfective verbs would be counted separately, the size of VALLEX 2.x would virtually grow to 4,250 verb entries). To ensure high quality of the data, all VALLEX entries have been created manually, using several previously existing lexicons as well as corpus evidence from the Czech National Corpus. | Yes (click continue to fill in more information) | Charles University in Prague, ÚFAL (Markéta Lopatková, Zdeněk Žabokrtský, Václava Kettnerová, Eduard Bejček) | bejcek@ufal.mff.cuni.cz | Lopatková, M., Žabokrtský, Z., Kettnerová, V.: Valenční slovník českých sloves. 
Praha: Karolinum, 382 p., 2008 (ve spolupráci se Skwarskou, K., Bejčkem, E., Hrstkovou, K., Novou, M., Tichým, M.) Žabokrtský Zdeněk, Lopatková Markéta: Valency Information in VALLEX 2.0: Logical Structure of the Lexicon. The Prague Bulletin of Mathematical Linguistics, No. 87, pp. 41-60, 2007. | Intensional | verbal valency | Functional Generative Description | Dictionary, Corpus | angažovat {biasp} [ 1 ] ≈ zaměstnat / zaměstnávat -frame: ACT(1){obl} PAT(4){obl} EFF(jako+4){opt} DIR3(){typ} -example: angažoval otce jako vyjednávače; angažovali herce do nové revue -rfl: pass: do nové hry se nakonec angažovali jen osvědčení herci -class: appoint verb [ 2 ] ≈ učinit/činit účastným -frame: ACT(1){obl} PAT(4){obl} LOC(){typ} -example: angažovat občany v boji za lepší zítřky -rfl: pass: občané se angažovali v boji za lepší zítřky | ||||||||
96 | 13/10/2014 23:46:02 | PDT-Vallex | http://ufal.mff.cuni.cz/PDT-Vallex/ | Valency lexicon | Czech | over 11000 valency frames for more than 7000 verbs | Also non-contiguous | Available, unrestricted use | CC BY-NC-SA 3.0 | Creative Commons (CC): http://creativecommons.org/examples | The valency lexicon PDT-Vallex has been built in close connection with the annotation of the Prague Dependency Treebank project (PDT) and its successors (mainly the Prague Czech-English Dependency Treebank project, PCEDT). It contains over 11000 valency frames for more than 7000 verbs which occurred in the PDT or PCEDT. It is available in an electronically processable format (XML) together with the aforementioned treebanks (to be viewed and edited by TrEd, the PDT/PCEDT main annotation tool), and also in a more human-readable form (see the links above and below). The main feature of the lexicon is its linking to the annotated corpora - each occurrence of each verb is linked to the appropriate valency frame with additional (generalized) information about its usage and surface morphosyntactic form alternatives. | Yes (click continue to fill in more information) | uresova@ufal.mff.cuni.cz | 1. Urešová Zdeňka: PDT-Vallex - trochu jiný valenční slovník. In: Slovo – Tvorba – Dynamickosť. Na počesť Kláry Buzássyovej, Copyright © Veda, Bratislava, Slovakia, ISBN 978-80-224-1107-3, pp. 278-286, 2010 2. Urešová Zdeňka: Building the PDT-VALLEX valency lexicon. In: On-line Proceedings of the fifth Corpus Linguistics Conference, http://ucrel.lancs.ac.uk/publications/cl2009, University of Liverpool, UK. 2009 3. Hajič Jan, Panevová Jarmila, Urešová Zdeňka, Bémová Alevtina, Kolářová Veronika, Pajas Petr: PDT-VALLEX: Creating a Large-coverage Valency Lexicon for Treebank Annotation. In: Proceedings of The Second Workshop on Treebanks and Linguistic Theories, Copyright © Vaxjo University Press, Vaxjo, Sweden, ISBN 91-7636-394-5, ISSN 1651-0267, pp. 57-68, Nov. 
2003 | Intensional | valency of verbs, nouns, adjectives and adverbs | Functional Generative Description | Dictionary, Corpus | angažovat angažovat-1 ACT(1) PAT(4) ?EFF(.4[{jako,jakožto}:/AuxY];za+4) (zaměstnat) angažoval neherce jako herce angažovat-2 (1x) ACT(1) PAT(4) (motivovat) akce angažovala lidi | |||||||||
97 | 07/01/2015 21:24:18 | Serbian DELA e-dictionary | no website | Dictionary or lexicon with MWEs (also includes MWEs) | Serbian | 4,581,657 simple word forms for 133,361 different lemmas; 262,686 multi-word forms for 13,717 different lemmas | 7 | Only contiguous | Available, restricted use | Restricted use: attribution, academic use only, commercial use for a fee, no derivatives, no redistribution | yes | The dictionary contains inflected forms and lemmas for both single and compound words. Example of a compound entry: švedsku pelenu,švedska pelena.N:fs4q Inflected form: švedsku pelenu Lemma: švedska pelena 'Swedish diapers' Semantic marker: Concrete Category: N (noun) Morphological features: - Gender: f (feminine) - Number: s (singular) - Case: 4 (accusative) - Animacy: q (non-animate) | Yes (click continue to fill in more information) | Cvetana Krstev, Duško Vitas | cvetana@matf.bg.ac.rs | Cvetana Krstev, Processing of Serbian – Automata, Texts and Electronic dictionaries Faculty of Philology, University of Belgrade, Belgrade, 2008. Cvetana Krstev, Duško Vitas, Agata Savary, “Prerequisites for a Comprehensive Dictionary of Serbian”, in Proceedings of the 5th International Conference on NLP, FinTAL 2006, Turku, Finland, August, 2006, eds. Tapio Salakoski et al., LNAI, pp. 552-564, Springer, Berlin, Heidelberg, 2006 Cvetana Krstev, Ivan Obradović, Ranka Stanković, Duško Vitas, “An Approach to Efficient Processing of Multi-word Units”, in Computational Linguistics - Applications, eds. Adam Przepiórkowski et al, Studies in Computational Intelligence 458, Springer-Verlag, Berlin Heidelberg, DOI 10.1007/978-3-642-34399-5_6, pp. 109-229, 2013. | Extensional | 13717 | 262686 | 108 | Contiguous general language MWEs, mainly compound nouns and adjectives, prepositions, conjunctions, adverbs and interjections. Also contains terminology (mainly from Library and Information Science) | All MWEs have additional information in the form of markers: semantic (e.g. 
+Hum for human), pronunciation (e.g. +Ek for Ekavian), domain (e.g. +DoM=Culinary). | None | Corpus processor Unitex | MWEs are extracted from corpora, from traditional dictionaries and also added manually | žutog kao limun,žut kao limun.A+Col:adms4v Inflected form: žutog kao limun Lemma: žut kao limun 'yellow as a lemon' Category: A (adjective) Semantic: +Col (color) Morphological features: - Degree: a (positive) - Definiteness: d (yes) - Gender: m (masculine) - Number: s (singular) - Case: 4 (accusative) - Animacy: v (animate) | ||
98 | 08/01/2015 13:51:45 | Verne80days_MSD+MWU+NE_Serbian | no website | annotated text | Serbian | 54,899 units | Only contiguous | Available, restricted use | Restricted use (attribution, academic use only, commercial use for a fee, no derivatives, no redistribution) | yes | Jules Verne's novel "Around the World in 80 Days" lemmatized and morphologically annotated (simple words, MWE, Named Entities). Multiword units include conjunctions, interjections, prepositions, adverbs, nouns and adjectives. Named Entities include persons, organizations, geo-political names, time expressions and amount expressions. | Yes (click continue to fill in more information) | Cvetana Krstev, Duško Vitas | cvetana@matf.bg.ac.rs | Cvetana Krstev, Processing of Serbian – Automata, Texts and Electronic dictionaries Faculty of Philology, University of Belgrade, Belgrade, 2008. Duško Vitas, Svetla Koeva, Cvetana Krstev, Ivan Obradović, “Tour du monde through the dictionaries”, Actes du 27eme Colloque International sur le Lexique et la Grammaire, L'Aquila, 10-13 septembre 2008, eds. M. Constant, T. Nakamura, M. De Gioia, S. Vecchiato, pp.249-256, Universite Paris-Est, Institut Gaspard-Monge, 2008. | Text contains 54,899 units. Out of this number, there are 954 MWUs and 3,036 NEs. Among MWUs there are: 391 nouns, 1 adjective, 6 numerals, 141 conjunctions, 279 adverbs, 122 prepositions, 2 interjections. Among NEs, 2049 are MWUs. There are 56 (37 MWUs) organization names, 644 (543 MWUs) temporal expressions, 1144 (165 MWUs) geo-political names, 555 (534 MWUs) amount expressions and 1123 (770 MWUs) personal names. | none | Unitex corpus processor | {Električni sat,električni sat.N+Comp+Conc:ms1q} {iznad,iznad.PREP+p2} {kamina,kamin.N+Sr:ms2q} {bio,biti.V+Imperf+Tr+Iref+Aux:Gsm} {je,jesam.V+Imperf+It+Iref+Aux:Pzsi} {spojen,spojiti.V+Perf+Tr+Iref+Ref:Tms} {sa,sa.PREP+p6} {satom,sat.N:ms6q} {u,u.PREP+p7} {spavaćoj sobi,spavaća soba.N+Comp:fs7q} {Fileasa Foga,.NE+persName+full:ms2v} , | |||||||||
99 | 27/01/2015 17:57:56 | MILA Lexicon | http://www.mila.cs.technion.ac.il/resources_lexicons_mila.html | Dictionary or lexicon with MWEs (also includes MWEs) | Hebrew | 3000 | 4 | Only contiguous | Available, restricted use | For non-commercial research purposes, this resource is licensed under the GNU General Public License (GPL) | GNU General Public Licence (GPL): http://www.gnu.org/licenses/gpl.html | No (click continue to submit) | ||||||||||||||||
100 | 27/01/2015 18:01:46 | Hebrew Verb Complements Lexicon | http://www.mila.cs.technion.ac.il/resources_lexicons_verbcomplements.html | MWE dictionary or lexicon (MWEs only) | Hebrew | 5600 | 2 | Also non-contiguous | Available, unrestricted use | For non-commercial research purposes, this resource is licensed under the GNU General Public License (GPL) | GNU General Public Licence (GPL): http://www.gnu.org/licenses/gpl.html | Statistics on the likelihood of a verb co-occurring with any of the six most frequent prepositions in Hebrew. | No (click continue to submit) | |||||||||||||||
101 | 27/01/2015 21:07:07 | MWUEI | no website | Dictionary or lexicon with MWEs (also includes MWEs) | English-Italian | approx. 14,000 | 5 | Also non-contiguous | Unknown | no | still under development | No (click continue to submit) | ||||||||||||||||
102 | 28/01/2015 16:17:45 | WICOL | http://www.vronk.net/wicol/index.php/Main_Page | MWE dictionary or lexicon (MWEs only) | Slovak, German | collocational profiles of: for Slovak 255 nouns 730 adjectives 10 adverbs 8 verbs for German only 49 verbs for German-Slovak 500 nouns 285 verbs 262 adjectives 287 proverbs for German and Slovak | 5 | Also non-contiguous | Available, restricted use | yes | Yes (click continue to fill in more information) | Prof. Dr. Peter Ďurčo | durco@vronk.net | Ďurčo, Peter – Banášová, Monika – Hanzlíčková, Astrid: Feste Wortverbindungen im Kontrast. Trnava: UCM 2010, 128 s. ISBN 978-80-8105-197-5 Ďurčo, Peter: Zum Konzept eines zweisprachigen Kollokationswörterbuchs. Prinzipien der Erstellung am Beispiel Deutsch – Slowakisch. In: F. Hausmann (Hrsg.): Collocations in European lexicography and dictionary research. Lexicographica, Vol. 24. Tübingen: Niemeyer Verlag 2008, 69-89. ISSN 0175-6206 Ďurčo, P. – Garabík, R. – Majchráková, D. – Ďurčo, M.: Contrastive Dictionary of German and Slovak Collocations. In: Cognitive Studies/Études Cognitives, Vol. 9, 2009, Warsaw: Institute of Slavic Studies, Polish Academy of Sciences, 101-115. | Extensional | Dictionary, Corpus | ||||||||||||
103 | 14/02/2015 14:19:41 | CombiNet | http://combinet.humnet.unipi.it/ | MWE dictionary or lexicon (MWEs only) | Italian | Also non-contiguous | CombiNet is an ongoing project funded by the Italian Ministry of Education, University and Research (MIUR) that aims at developing a corpus-based online dictionary of Italian Word Combinations, i.e. MWEs of various kinds as well as distributional profiles of single words (argument structure patterns, subcategorization frames, and selectional preferences). | |||||||||||||||||||||
104 | 17/02/2015 13:12:49 | Jakob-Lexikon | http://www.jakoblexikon.ch | Dictionary or lexicon with MWEs (also includes MWEs) | German | Around 1200 verbal MWEs | 10 | Also non-contiguous | Unknown | The lexicon was built for psycho-semantic analysis of texts. For further information contact Mark Luder via info@jakoblexikon.ch | No (click continue to submit) | |||||||||||||||||
105 | 03/09/2015 09:31:00 | Greek MWEs DB | to be provided end of October 2015 | DB with MWEs only | Greek | around 300 verbal MWEs | 8 | Also non-contiguous | Unknown | it will be made available at the end of October 2015 when the terms of availability will be specified | An ongoing research project. The DB is intended to serve both as an NLP resource and as a dictionary. To this end it provides an exhaustive morphological description of fixed parts using the PAROLE tagset, structural description (free XPs, possible word order permutations, binding and control phenomena), variant forms of the MWE, relations between variants if any, syntactic alternations (causative-inchoative, passivisation, dative genitive alternation). Structural description is theory neutral. On the lexicographic front it provides a glossing of the MWE, an English translation, a fully glossed usage example retrieved from corpora or the web, corpus examples, incorrect usages of the MWE testing structural properties, synonymous MWEs and MWEs with the opposite meaning, and pairs of verb MWEs that stand in a causative-inchoative relation but have a different verbal head. The XML output of the DB is formatted according to LMF; this is ongoing work. | Stella Markantonatou, Panagiotis Minos, Erasmia Koletti, Elpiniki Margariti, Aimilia Stripelli, George Zakis | Contact person: Stella Markantonatou, email: marks@ilsp.athena-innovation.gr | (1) Stella Markantonatou, Erasmia Koletti, Elpiniki Margariti, Panagiotis Minos, Aimilia Stripeli, Georgios Zakis, Niki Samaridi. 2015. Lexical Resource for free subject verb MWEs. Parseme 4th general meeting (2) Stella Markantonatou, Erasmia Koletti, Elpiniki Margariti, Panagiotis Minos, Aimilia Stripeli, Georgios Zakis, Niki Samaridi. 2015. Lexical resource for free subject verb MWEs. Modern Greek MWE 2015’ in the framework of the 12th International Conference on Greek Linguistics, 16th September 2015. 
| both | Free subject verb MWEs | HNC http://hnc.ilsp.gr/ and the web |