- published: 25 Mar 2015
In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). Corpora are used for statistical analysis and hypothesis testing, for checking occurrences, and for validating linguistic rules within a specific language territory.
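Checking occurrences in a corpus, as described above, amounts in the simplest case to counting word forms across the collected texts. A minimal sketch (the sample sentences below are invented for illustration):

```python
from collections import Counter

# A tiny monolingual "corpus": in practice this would be millions of
# electronically stored texts; these sentences are made up for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a corpus is a structured set of texts",
]

# Tokenise naively on whitespace and count occurrences across all texts.
tokens = [tok for text in corpus for tok in text.split()]
freq = Counter(tokens)

print(freq.most_common(3))   # most frequent word forms in the corpus
print(freq["cat"])           # occurrence check for a single word form
```

Real corpus tools replace the naive whitespace tokeniser with proper tokenisation and lemmatisation, but the frequency-table idea is the same.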
A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus).
Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora. There are two main types of parallel corpora which contain texts in two languages. In a translation corpus, the texts in one language are translations of texts in the other language. In a comparable corpus, the texts are of the same kind and cover the same content, but they are not translations of each other. To exploit a parallel text, some kind of text alignment identifying equivalent text segments (phrases or sentences) is a prerequisite for analysis. Machine translation algorithms are often trained on parallel fragments: a corpus in one language paired with a corpus in another language that is an element-for-element translation of the first.
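The simplest form of the text alignment mentioned above is one-to-one sentence alignment in a translation corpus. A minimal sketch, assuming both sides have already been sentence-split and are an element-for-element translation of each other (the toy sentence pairs are invented):

```python
# 1-to-1 sentence alignment sketch for a translation corpus.
# Assumption: both sides are already sentence-split and parallel.
en_sentences = ["The house is small.", "The garden is large."]
de_sentences = ["Das Haus ist klein.", "Der Garten ist groß."]

assert len(en_sentences) == len(de_sentences), "1-to-1 alignment assumed"

# Aligned segment pairs of the kind used to train machine translation systems.
bitext = list(zip(en_sentences, de_sentences))
for src, tgt in bitext:
    print(f"{src}\t{tgt}")
```

Real aligners (e.g. length-based methods in the Gale–Church tradition) also handle 1-to-2 and 2-to-1 sentence mappings; the strict zip above is only the idealised case.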
A brief description of how to handle different text formats when building a corpus in corpus linguistics. Feel free to use in your own teaching of corpus linguistics.
The latest version 4 of PoolParty Thesaurus Server (http://www.poolparty.biz/) offers a full-blown text corpus analysis module. By using this, taxonomists and thesaurus managers can analyze large text collections and identify gaps between a thesaurus and the content base. One can glean candidate terms and identify parts of the taxonomy which do not occur in the actual content. PoolParty's text mining capabilities are outstanding: highly performant, precise, multilingual, and applicable across various industries.
See how PoolParty's taxonomy management methodology (https://www.poolparty.biz) is now supported even more efficiently by PoolParty's latest release. We demonstrate how PoolParty 5.2 makes use of deep text mining including corpus analysis and co-occurrence analysis. We show an example based on UNESCO world heritage sites and demonstrate how automatic classification can be extended step-by-step. An immediate feedback is given by PoolParty's faceted GraphSearch. Initial taxonomies can be built by using PoolParty's linked data harvester to fetch data from DBpedia.
PyData London 2016 Deep Boltzmann machines (DBMs) are exciting for a variety of reasons, principal among which is the fact that they are able to learn probabilistic representations of data in an entirely unsupervised manner. This allows DBMs to leverage large quantities of unlabelled data which are often available. The resulting representations can then be fine-tuned using limited labelled data or studied to obtain a more comprehensive understanding of the data at hand. This talk will begin by providing a high level description of DBMs and the training algorithms involved in learning such models. A topic modelling example will be used as a motivating example to discuss practical aspects of fitting DBMs and potential pitfalls. The entire code for this project is written in python using on...
PyData Madrid 2016 Most of the talks and workshop tutorials can be found here: https://github.com/PyDataMadrid2016/Conference-Info Deep Boltzmann machines (DBMs) are exciting for a variety of reasons, principal among which is the fact that they are able to learn probabilistic representations of data in an entirely unsupervised manner. This allows DBMs to leverage large quantities of unlabelled data which are often available. The resulting representations can then be fine-tuned using limited labelled data or studied to obtain a more comprehensive understanding of the data at hand. This talk will begin by providing a high level description of DBMs and the training algorithms involved in learning such models. A topic modelling example will be used as a motivating example to discuss practica...
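The DBM talks above describe stacking layers that learn representations in an unsupervised manner. A minimal sketch of the single-layer building block (a restricted Boltzmann machine trained with one step of contrastive divergence, CD-1) in NumPy — the toy binary data and hyperparameters are invented, and a real DBM would stack and jointly train several such layers:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy binary data (e.g. bag-of-words rows); invented for illustration.
data = rng.integers(0, 2, size=(100, 6)).astype(float)

n_visible, n_hidden = 6, 4
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_v = np.zeros(n_visible)   # visible biases
b_h = np.zeros(n_hidden)    # hidden biases

lr = 0.1
for epoch in range(50):
    # Positive phase: hidden activations given the data.
    p_h = sigmoid(data @ W + b_h)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # Negative phase: one Gibbs step (contrastive divergence, CD-1).
    p_v = sigmoid(h @ W.T + b_v)
    p_h_recon = sigmoid(p_v @ W + b_h)
    # Update parameters from the difference of data and model correlations.
    W += lr * (data.T @ p_h - p_v.T @ p_h_recon) / len(data)
    b_v += lr * (data - p_v).mean(axis=0)
    b_h += lr * (p_h - p_h_recon).mean(axis=0)

# The learned hidden probabilities serve as an unsupervised representation,
# which can later be fine-tuned with limited labelled data.
representation = sigmoid(data @ W + b_h)
print(representation.shape)  # (100, 4)
```

This is only the unsupervised pre-training idea in miniature; proper DBM training additionally uses mean-field inference and persistent Markov chains, as discussed in the talk.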
A brief screencast about the difference between looking at linguistic data as 'text' or as 'corpus'. Feel free to use in your own teaching of corpus linguistics.
Original version is http://togotv.dbcls.jp/20150321.html Biomedical Linked Annotation Hackathon (BLAH) 2015 was held in The University of Tokyo Kashiwa Campus Station Satellite in Kashiwa, Chiba, Japan. On the last day of the Hackathon (Feb. 27), a public symposium of BLAH 2015 was held. In this talk, Karin Verspoor (University of Melbourne) gives a presentation entitled "Interoperability of Text Corpus Annotations with the Semantic Web". (21:05)
Corpus linguistics is the study of language as expressed in corpora of "real world" text. The text-corpus method is an approach for deriving, from a text, a set of abstract rules governing a natural language, and for studying how that language relates to other languages; originally derived manually, corpora are now automatically derived from the source texts. Corpus linguistics proposes that reliable language analysis is more feasible with corpora collected in the field, in their natural contexts, and with minimal experimental interference. The field of corpus linguistics features divergent views about the value of corpus annotation, ranging from John McHardy Sinclair, who advocates minimal annotation so that texts may speak for themselves, to the Survey of English Usage team w...
This screencast shows how you can use searching in Nederlab to define your own research corpora.
http://www.birmingham.ac.uk/elal The 11th Sinclair Lecture was delivered by Professor Michaela Mahlberg, Chair in Corpus Linguistics and Director of the Centre for Corpus Research. This year’s Sinclair lecture was also Professor Mahlberg’s inaugural lecture to mark her new position at the University of Birmingham. John Sinclair was one of the founding fathers of corpus linguistics - a discipline that has radically changed theories about language and approaches to the study of language. Research in corpus linguistics uses computer-assisted methods to identify and quantify patterns in naturally occurring language data. Since the early days of corpus linguistics, developments in computing power and the increasing availability of data have contributed to pushing the boundaries of corpus lingui...
In this PoolParty Academy (https://www.poolparty.biz/academy/) learning tutorial we introduce you to the text corpus management functionalities of PoolParty Semantic Suite. You learn how to analyse different input sources and use them to extend your existing taxonomies.
The concept of the semantic web has been present in information science discourse for more than 13 years and was conceived as a kind of counterweight to the statistical methods for analysing web corpora that stood behind the success of companies such as Google or Yahoo. In contrast to statistical approaches, it was meant to enable better machine processing of data on the web through semantic layers (and formalization) added by humans. In recent years, however, we are witnessing a gradual convergence between the semantic and the "statistical" web. Thanks to the efforts of initiatives such as Schema.org, there is gradual progress not only in the standardization of micro-data, but also in their massive deployment. Presentation from Semantic Web in Business Conference (an international conference on...
We show how to build a machine learning document classification system from scratch in less than 30 minutes using R. We use a text mining approach to identify the speaker of unmarked presidential campaign speeches. Applications in brand management, auditing, fraud detection, electronic medical records, and more.
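The talk above builds its classifier in R; the same speaker-identification idea can be sketched in a few lines of Python with a multinomial Naive Bayes over word counts (the toy "speeches" and candidate labels below are invented):

```python
import math
from collections import Counter, defaultdict

# Toy labelled training speeches (invented for illustration).
train = [
    ("jobs economy growth jobs", "candidate_a"),
    ("economy taxes jobs growth", "candidate_a"),
    ("healthcare education reform", "candidate_b"),
    ("education healthcare schools", "candidate_b"),
]

# Count words per class to train a multinomial Naive Bayes classifier.
word_counts = defaultdict(Counter)
class_counts = Counter()
vocab = set()
for text, label in train:
    class_counts[label] += 1
    for w in text.split():
        word_counts[label][w] += 1
        vocab.add(w)

def predict(text):
    scores = {}
    for label in class_counts:
        # Log prior + log likelihood with add-one (Laplace) smoothing.
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Classify an unmarked speech by its most probable speaker.
print(predict("jobs and the economy"))
```

A production pipeline would add proper tokenisation, TF-IDF weighting, and held-out evaluation, but the count-and-score core is the same.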
Justin Grimmer, Stanford University New Directions in Computational Social Science & Data Science https://simons.berkeley.edu/talks/Justin-Grimmer-2016-04-25
Presented by Ute Römer & Stefanie Wulff at the University of Michigan English Language Institute on Dec 6th, 2007. For more info see: http://micase.elicorpora.info/
Alan Akbik November 10, 2014 Title: Open and Exploratory Extraction of Relations (and Common Sense) from Large Text Corpora Abstract: The use of deep syntactic information such as typed dependencies has been shown to be very effective in Information Extraction (IE). Despite this potential, the process of manually creating rule-based information extractors that operate on dependency trees is not intuitive for persons without an extensive NLP background. In this talk, I present an approach and a graphical tool that allows even novice users to quickly and easily define extraction patterns over dependency trees and directly execute them on a very large text corpus. This enables users to explore a corpus for structured information of interest in a highly interactive and data-guided fashion, an...
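The extraction patterns over dependency trees described in the abstract can be illustrated in miniature. A sketch, assuming sentences have already been parsed into (head, relation, dependent) triples — the parser output below is hand-written and the relation names follow the common nsubj/dobj convention, not any specific tool from the talk:

```python
# Hand-written typed-dependency triples standing in for parser output.
parsed = [
    [("acquired", "nsubj", "Google"), ("acquired", "dobj", "YouTube")],
    [("founded", "nsubj", "Gates"), ("founded", "dobj", "Microsoft")],
]

# Extraction pattern: a verb with both a subject and a direct object
# yields a (subject, verb, object) relation tuple.
def extract(triples):
    subj = {h: d for h, rel, d in triples if rel == "nsubj"}
    obj = {h: d for h, rel, d in triples if rel == "dobj"}
    return [(subj[v], v, obj[v]) for v in subj if v in obj]

relations = [rel for sent in parsed for rel in extract(sent)]
print(relations)
```

The graphical tool in the talk lets users define such patterns interactively instead of writing them by hand, and runs them over very large corpora rather than two toy sentences.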