Free corpora download
HipparchiaServer: front end to the Hipparchia corpora, providing searching, browsing, concordances, texts, dictionaries, and parsing.
An R package that implements the Corpus Query Protocol for the R statistical environment. It lets users query linguistic corpora and manipulate the data as native R objects. It is based on the CWB software.
Arabic Stemming Corpora.
Corpus Manager. Yet another corpus manager. It allows HTTP access to annotated text corpora; the client does not need to install any special software to access the server, as any browser with JavaScript support will do.
GloVe: a model for distributed word representation. GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. The links provided contain word vectors obtained from the respective corpora. If you want word vectors trained on massive web datasets, you need only download one of these text files.
Pre-trained word vectors.
This project aims at building an efficient indexer and search engine for natural language corpora with multilevel annotations.
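As an illustration of the GloVe entry above, here is a minimal sketch of reading word vectors from a text file and probing the "linear substructure" with vector arithmetic. It assumes the common plain-text layout of one token per line followed by space-separated floats. The function names `load_glove` and `cosine` and the four tiny vectors are made up for the example and are not part of GloVe itself.

```python
import io
import math


def load_glove(lines):
    """Parse lines of the assumed form: token v1 v2 ... vn (space-separated)."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors standing in for a downloaded file.
sample = io.StringIO(
    "king 0.5 0.7 0.1\n"
    "queen 0.5 0.7 0.9\n"
    "man 0.4 0.1 0.1\n"
    "woman 0.4 0.1 0.9\n"
)
vectors = load_glove(sample)

# Analogy by vector arithmetic: king - man + woman.
query = [k - m + w for k, m, w in
         zip(vectors["king"], vectors["man"], vectors["woman"])]

# Nearest remaining word by cosine similarity.
candidates = [w for w in vectors if w not in {"king", "man", "woman"}]
best = max(candidates, key=lambda w: cosine(query, vectors[w]))
print(best)
```

With a real file, the `io.StringIO` object would be replaced by an opened file handle, for example `open(path, encoding="utf-8")`. Real vocabularies are large, so a NumPy array would normally be used instead of Python lists.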
BioNLP-Corpora is a repository of biomedically and linguistically annotated corpora and biomedical data sources. There are many resources available in separate packages in this project.
Arabic business corpora: Arabic business and management corpus.
The Text Variation Explorer (TVE) is a tool for exploring the effect of window size on various common linguistic measures.
TextDirectory is a tool for aggregating text files based on various filters and transformation functions.
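To make the idea of "the effect of window size on linguistic measures" concrete, the sketch below computes a type-token ratio over fixed-size windows of a token list. The type-token ratio is chosen here as one common measure, and non-overlapping windows as one possible scheme. Both are assumptions for illustration, not a description of how the Text Variation Explorer works internally. The function name `windowed_ttr` is hypothetical.

```python
def windowed_ttr(tokens, window):
    """Type-token ratio for each consecutive, non-overlapping window.

    A trailing chunk shorter than `window` is dropped so that every
    ratio is computed over the same number of tokens.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    ratios = []
    for start in range(0, len(tokens) - window + 1, window):
        chunk = tokens[start:start + window]
        ratios.append(len(set(chunk)) / window)
    return ratios


tokens = "the cat sat on the mat and the dog sat on the log".split()

# Smaller windows tend to give higher ratios because fewer words repeat.
for size in (4, 6):
    print(size, windowed_ttr(tokens, size))
```

Running the same text through several window sizes and comparing the resulting lists shows how strongly the measure depends on the window.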
A tool for mapping a document into a network of terms in order to visualize the topic structure.
The Great American Word Mapper.
A tweet tokenizer, POS tagger, hierarchical word clusters, and a dependency parser for tweets, along with annotated corpora and web-based annotation tools.
A text annotation tool and statistics for various types of linguistic analysis and multilayer annotation.
A tool for searching, studying, and analyzing digital texts and corpora.
A word cloud generator with dynamic filters, links to images, and KWIC capabilities.
YEDDA is a Python-based collaborative text span annotation tool with support for a very wide variety of languages, including Chinese.
Corpus Text Processor is a downloadable application that provides batched operations for common corpus processing tasks such as encoding or standardization.
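Several of the tools listed above offer concordance or KWIC (keyword in context) views. As a rough illustration of what such a view contains, the sketch below lists every occurrence of a keyword with a few tokens of context on each side. It uses whitespace tokenization and case-insensitive matching, which are simplifying assumptions. The function name `kwic` is hypothetical and is not taken from any of the tools above.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, match, right context) for each keyword hit."""
    rows = []
    target = keyword.lower()
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


tokens = "the cat sat on the mat and the dog sat on the log".split()

# Print a simple aligned concordance for the word "sat".
for left, match, right in kwic(tokens, "sat", width=2):
    print(f"{left:>10} [{match}] {right}")
```

Production concordancers work from an index rather than scanning the token list, which matters for corpora of millions of words.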
Only user corpora can be downloaded from Sketch Engine. Preloaded corpora in Sketch Engine cannot be downloaded, but word embeddings computed from these corpora for language modelling and similar applications are available for download from our word embeddings page. Users can also download word lists, n-gram lists and other language data generated from these corpora.
CorpusReader manages very large corpora and corpora containing milestone annotation. It provides tools for enriching corpora with the output of linguistic parsers, and for extracting quantitative information.
ViewXML v.
It contains almost 15 m. The spoken part consists mainly of the telephone-based Switchboard corpus.
The OANC comes in versions with different annotation schemes.