Generative Oscillation - A Cognitive Model for the Emergence of Language
Research Material for a discontinued PhD
DRAFT COPY ONLY
NOT READY FOR PRINT PUBLICATION
The GO model proposes a co-generative view of the emergence of language. Most conventional linguistics models conceive of language as a representational system of symbols which refer to events, either mental or external to the organism. This representational function is said to motivate the linguistic system and (depending upon the linguistic model) largely control its form. The GO (Generative Oscillation) model proposed here recognizes the representational role of language. However, it notes that as the mental linguistic system itself becomes efficiently organized, it creates an internal logic and drive of its own. To some extent this internally motivated linguistic system is conceived to override the external motivation to represent another reality. Since the internal linguistic system is dynamic and generative, it may give rise to linguistic output which seems strange in an inter-human communicative context (or even within the reflective mind of the creator). Thus while the external communicative context can become a constraint on unmotivated non-representational "internal language", it might not eliminate it. The Generative Oscillation model proposes that actual language production is an oscillating compromise between the representational function of language and the mental "language bot" itself (i.e. an internal self-organizing system), which is generating language strings just because that is what language bots do. As far as I know, the Generative Oscillation Model, or anything like it, had not been suggested before in linguistics at the time of writing. Some conventional linguists may find it a bit "off the wall".
TED-LIUM: an Automatic Speech Recognition dedicated corpus
URL for downloading the corpus: http://www-lium.univ-lemans.fr/TED-LIUM
This paper presents the corpus developed by the LIUM for Automatic Speech Recognition (ASR), based on the TED Talks. This corpus was built during the IWSLT 2011 Evaluation Campaign, and is composed of 118 hours of speech with its accompanying automatically aligned transcripts. We describe the content of the corpus, how the data was collected and processed, how it will be publicly available and how we built an ASR system using this data leading to a WER score of 17.4%. The official results we obtained at the IWSLT 2011 evaluation campaign are also discussed.
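For readers unfamiliar with the WER figure quoted above, the following is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer's hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the scoring tools and text normalization actually used for the TED-LIUM evaluation are not described in this abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative sketch only: tokenization is plain whitespace splitting,
    which is an assumption and not the corpus authors' scoring setup.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

On this reading, a WER of 17.4% means roughly 17 word-level errors per 100 reference words.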
Hambye, P., Simon, A.-C., Bardiaux, A. " French in Belgium: a speaker from Liège " in Varieties of Spoken French: a source book
co-authored with Detey, S., Durand, J., Lacks, B., Lyche, C. (eds)
Work Notes on the Phrygian texts
by Mel Copeland
This is a PDF file from our website covering Etruscan and Phrygian texts, with images compiled from the Etruscan Phrases website, http://www.maravot.com/Etruscan_Phrases_a.html and http://www.maravot.com/Phrygian.html. We found that both the Etruscan and Phrygian texts are in a language close to Latin. Differences between the Etruscan and Phrygian writings are small. For instance, Etruscan “o” omega is rendered as “V.” The Phrygian texts render the character as an “o,” usually much smaller than the other letters. The Phrygian words blend well into the Etruscan GlossaryA, and thus we found no need to create a separate glossary for the Phrygian language seen in the texts, which come primarily from a site called Midas City. Midas City is built on a plateau with its principal monument facing east. The mountain has many rock-cut altars, most of which are step altars like those found in Armenia, which appear to be dedicated to the rising of the sun on special days. The Phrygian texts are not only similar to the Etruscan texts; they also give us more understanding of the Etruscan texts. For instance, an inscription on the base of a hawk helped confirm the name of THALNA, the mother of Helen of Troy, who was the Greek goddess Nemesis. THALNA relates to the Latin word for retaliation (talio-onis), which is what Nemesis represents. The Etruscan word for retaliation is THALIO (THALIV).
In contrast to offerings from the British Museum and University of Bologna, where their analyses, following Pallottino, are generally speculation based on guesswork relating to short funerary inscriptions, the Etruscan Phrases work is supported by a strong grammar and vocabulary based on all texts, small and large. Thus, to clear the mystery of the Etruscan language alleged by such esteemed institutions, it is imperative that the Etruscan Phrases GlossaryA.xls be audited. We mention this since the only prospect of clearing up the Etruscan Mystery is through a verifiable audit of the Etruscan Grammar recorded in Etruscan Phrases. The British Museum, University of Bologna and other "Pallottino School" works have not produced a vocabulary or grammar that can be audited, since their theory is that the Etruscan language is unlike any other known to man, not Indo-European. Etruscan Phrases claims that the Etruscan Language is similar to Latin, French, Italian and Romanian, an Indo-European language. It offers a grammar, declension patterns and regular, measurable shifts between Etruscan and these languages; ergo the work can be easily audited. The audit of the Etruscan GlossaryA will, of course, be also an audit of the Phrygian texts.
Much of the confirmation of our work comes through Etruscan mirrors that record stories of Greek heroes, such as that of the Trojan War. Because the story is familiar and linking the genesis of Greek heroes and gods, containing their names and actions, we have comparative texts to use in analyzing the Etruscan language, its shifts from Greek and Latin to Etruscan. For instance the heroes of the story follow a regular shift, of dropping vowels and final consonants, etc. Heracles (L. Hercules) is Hercle (almost like the French, Hercule). Helen’s name declines: Helenai and Helenei, leading us to the declension of other nouns. Her father was Zeus who transformed into a swan and raped the goddess Nemesis THALNA (retribution) who had transformed into a goose. She laid an egg or two eggs, one of which was Helen which was found by shepherds near Sparta and taken to Tyndareüs and Leda to bring up. From the egg came Helen, the most beautiful woman in the world.
Etruscan mirrors and murals carry amazing details never before known to modern man. The images, names and texts associated with the mirrors and murals set the baseline for understanding Etruscan Grammar and the words recorded in Etruscan Phrases GlossaryA.pdf. (The most current version is available at http://www.maravot.com/Etruscan_Phrases_a.html.)
We should hope, therefore, that there will be many linguists / scholars who will jump at the chance to clear up the Etruscan Mystery and rewrite the histories so clearly overshadowed by the Pallottino School theories, to help even the museums containing Etruscan artifacts explain a bit more about the items in their displays.
Etruscan GlossaryA.pdf is an index to about 2,500 Etruscan words which are similar to Latin, French, Italian and Romanian. Declension patterns follow those in Latin. The 2,500 words equal the repeated words in 6,000 words of the major extant texts. The texts have been frozen in time, covering ~700-400 B.C., representing a lens to understanding the early formation of Indo-European languages, particularly the early Italic-Latin-Celtic languages, such as Italian, French & Romanian / Dacian. (By 45 B.C. the language was dead: no one understood or could write Etruscan.)
This GlossaryA works together with Indo-European Table 1 which refutes theories by the Pallottino school of thought that the Etruscan language is not Indo-European and an isolate, unlike any other language. It is very close to Latin and, curiously, Romanian, Italian and French. The Latin suffix, "us" shifts to "o" as in Italian (Titus vs Tito); first person conjugation patterns are similar to French and Romanian. This GlossaryA provides a quick look at the grammatical structure of the Etruscan language, how closely it coincides with Latin. A more detailed Declension Table can be seen on the Etruscan Phrases website. These PDF documents facilitate independent confirmation of the words in GlossaryA.xls , the Grammar and Declension Table. All words can be examined from actual images of texts on the Etruscan Phrases website. Over 150 texts, with about 6,000 words can be examined at Etruscan Phrases.
The Etruscans surfaced in Italy about 1,000 B.C., reputed to have arrived from Lydia / Phrygia. The Phrygians originated near Macedonia in Thrace, according to Herodotus. One may therefore inquire whether the ancient Thracians (Dacians, Gettae, modern Romanians), spoke a language common to the Phrygians, at the time of the Trojan War and after (~1180 B.C.). The Thracians, Phrygians and Lydians (also dead languages) were allies of the Trojans, according to the Iliad. Etruscan Phrases finds a common vocabulary among Latin, Italian, French, Romanian, Etruscan and Phrygian. While French, Spanish, Italian and Romanian are considered Romance languages, showing a similar Latin heritage, Etruscan is not, of course, a Romance language, as it preceded Latin, at least in the written form (giving Rome its alphabet).
Resolution of the Etruscan Mystery may be likened to Michael Ventris' decipherment of Linear B and Jean-François Champollion's decipherment of Egyptian hieroglyphics using the Rosetta Stone, which is written in Egyptian hieroglyphics, Demotic and Greek. The decipherment of Etruscan is a bit more challenging, since we have no multilingual Rosetta Stone, but we do have enough vocabulary and grammar to establish that Etruscan is similar to Latin, French, Italian and Romanian. (Certainly far more vocabulary and a more extensive grammar are provided in Etruscan Phrases than were used by Ventris to claim translation of Linear B as an old form of Greek.)
The mirrors with the Devotional Plates may be an easy entry into an audit, for those who are hesitant to examine the larger texts, such as the Zagreb Mummy (Script Z).
The Influence of Typological Features on Stylometric Text Classification
draft only
This work aims to establish whether the features of morphological typology that a particular language exhibits affect parameters such as accuracy and precision in stylometric measurements conducted for the purpose of text classification. This work provides insights for such fields as plagiarism detection, authorship attribution, automatic essay scoring, and sentiment classification.
Work Notes on Etruscan Mirrors and Murals III
by Mel Copeland
This is a PDF file from our website covering Etruscan Mirrors and Murals, with images compiled from the Etruscan Phrases website http://www.maravot.com/Etruscan_Phrases_a.html.
In contrast to offerings from the British Museum and University of Bologna, where their analyses, following Pallottino, are generally speculation based on guesswork relating to short funerary inscriptions, the Etruscan Phrases work is supported by a strong grammar and vocabulary based on all texts, small and large. Thus, to clear the mystery of the Etruscan language alleged by such esteemed institutions, it is imperative that the Etruscan Phrases GlossaryA.xls be audited. We mention this since the only prospect of clearing up the Etruscan Mystery is through a verifiable audit of the Etruscan Grammar recorded in Etruscan Phrases. The British Museum, University of Bologna and other "Pallottino School" works have not produced a vocabulary or grammar that can be audited, since their theory is that the Etruscan language is unlike any other known to man, not Indo-European. Etruscan Phrases claims that the Etruscan Language is similar to Latin, French, Italian and Romanian, an Indo-European language. It offers a grammar, declension patterns and regular, measurable shifts between Etruscan and these languages; ergo the work can be easily audited.
Most important to the work are the Etruscan mirrors and murals that contain known Classical stories and the names of the principal characters in the stories. The star of the mirrors is Helen of Troy, who was the young daughter of King Tyndareüs of Sparta and was abducted by the equally beautiful son of King Priam of Troy, thereby causing the Trojan War. While the entire story has captured the hearts and imaginations of generations since that event (Troy was destroyed ~1180 B.C.) we can presume through Etruscan mirrors that the event was part of their history – and they had a somewhat different recollection of it than the Greek version passed down to us. Here, in Part II of our work notes on Etruscan mirrors, we address two other curious gods that seem to be planted in stories not heretofore known to include them. Heracles is part of the Etruscan Helen of Troy story. Here in Part III he is shown suckling Hera's breast as an adult. Another hero/god, Adonis, is related to an Asiatic theme, appearing to be consulting Sinar, a goddess of Lebanon/Mt. Hermon.
Because the story is familiar and linking the genesis of Greek heroes and gods, containing their names and actions, we have comparative texts to use in analyzing the Etruscan language, its shifts from Greek and Latin to Etruscan. For instance the heroes of the story follow a regular shift, of dropping vowels and final consonants, etc. Heracles (L. Hercules) is Hercle (almost like the French, Hercule). Helen’s name declines: Helenai and Helenei, leading us to the declension of other nouns. Her father was Zeus who transformed into a swan and raped the goddess Nemesis THALNA (retribution) who had transformed into a goose. She laid an egg or two eggs, one of which was Helen which was found by shepherds near Sparta and taken to Tyndareüs and Leda to bring up. From the egg came Helen, the most beautiful woman in the world.
The most beautiful man at the time was Alexander, spelled ELCHSENTRE and he abducted Helen from her husband Menelaus, MENLE, the brother of King Agamemnon: ACHMEMNVN. His wife Clytemnestra is CLVTHVMVSTHA who murdered her husband in the bath upon returning from the Trojan War, and their son, Orestes (VRSTE) killed her and her lover in revenge. Athena (L. Minerva) is MENRFA; Hera (L. Juno) is VNI, her consort is Zeus (L. Jupiter) Etr. TINIA. Thetis is THETIS and THETHIS, she was a dangerous shape-changer and compelled by the gods to wed her husband Peleus, PELE; they produced the Greek hero of the Trojan War, Achilles who the Etruscans call ACHLE. The mother of Helen, Leda, is LATFA and her brothers, Castor and Polydeukes (Pollux) are CASTVR and PVLTVCEI. Their father Tyndareüs is TVNTLE. Aphrodite (Etr. TVRAN) was a cause of the Trojan War when she was judged by Alexander as “The Fairest” as written on an apple thrown into the wedding of Thetis and Peleus by Eris (Etr. ERIS). Aphrodite’s son was Eros (Etr. ERVS) – appearing in many texts. Another popular figure in Etruscan mirrors is Hermes (L. Mercury) TVRMS.
Apollo (APLV) and Artemis are represented frequently in the texts. Ajax Telemonos EIFAS TELMVNVS committed suicide after Achilles was killed, because he did not deserve Achilles’ armor. Apollo (APLV) and his sister the virgin huntress Artemis (ARTVMES) were highly active in the Trojan War. The Etruscans introduce a new character like Artemis called MEAN who crowns Alexander, awarding him the hand of Helen, though we understand from the Greek version that it was Aphrodite (Etr. TVRAN) that awarded Alexander the hand of Helen in the Judgment of Paris. MEAN appears to be a goddess of the hunt like Artemis from Lydia, recalling the old name of Lydia, Maionia (Μαιονία). This is just a tease, for the mirrors and murals carry amazing details never before known to modern man. The images, names and texts associated with the mirrors and murals set the baseline for understanding Etruscan Grammar and the words recorded in Etruscan Phrases GlossaryA.pdf. (The most current version is available at http://www.maravot.com/Etruscan_Phrases_a.html.)
We should hope, therefore, that there will be many linguists / scholars who will jump at the chance to clear up the Etruscan Mystery and rewrite the histories so clearly overshadowed by the Pallottino School theories, to help even the museums containing Etruscan artifacts explain a bit more about the items in their displays.
Etruscan GlossaryA.pdf is an index to about 2,500 Etruscan words that are similar to Latin, French, Italian and Romanian. Declension patterns follow those in Latin. The 2,500 words equal the repeated words in 6,000 words of the major extant texts. The texts have been frozen in time, covering ~700-400 B.C., representing a lens to understanding the early formation of Indo-European languages, particularly the early Italic-Latin-Celtic languages, such as Italian, French & Romanian / Dacian. (By 45 B.C. the language was dead: no one understood or could write Etruscan.)
This GlossaryA works together with Indo-European Table 1 which refutes theories by the Pallottino school of thought that the Etruscan language is not Indo-European and an isolate, unlike any other language. It is very close to Latin and, curiously, Romanian, Italian and French. The Latin suffix, "us" shifts to "o" as in Italian (Titus vs Tito); first person conjugation patterns are similar to French and Romanian. This GlossaryA provides a quick look at the grammatical structure of the Etruscan language, how closely it coincides with Latin. A more detailed Declension Table can be seen on the Etruscan Phrases website. These PDF documents facilitate independent confirmation of the words in GlossaryA.xls , the Grammar and Declension Table. All words can be examined from actual images of texts on the Etruscan Phrases website. Over 150 texts, with about 6,000 words can be examined at Etruscan Phrases.
The Etruscans surfaced in Italy about 1,000 B.C., reputed to have arrived from Lydia / Phrygia. The Phrygians originated near Macedonia in Thrace, according to Herodotus. One may therefore inquire whether the ancient Thracians (Dacians, Gettae, modern Romanians), spoke a language common to the Phrygians, at the time of the Trojan War and after (~1180 B.C.). The Thracians, Phrygians and Lydians (also dead languages) were allies of the Trojans, according to the Iliad. Etruscan Phrases finds a common vocabulary among Latin, Italian, French, Romanian, Etruscan and Phrygian. While French, Spanish, Italian and Romanian are considered Romance languages, showing a similar Latin heritage, Etruscan is not, of course, a Romance language, as it preceded Latin, at least in the written form (giving Rome its alphabet).
Resolution of the Etruscan Mystery may be likened to Michael Ventris' decipherment of Linear B and Jean-François Champollion's decipherment of Egyptian hieroglyphics using the Rosetta Stone, which is written in Egyptian hieroglyphics, Demotic and Greek. The decipherment of Etruscan is a bit more challenging, since we have no multilingual Rosetta Stone, but we do have enough vocabulary and grammar to establish that Etruscan is similar to Latin, French, Italian and Romanian. (Certainly far more vocabulary and a more extensive grammar are provided in Etruscan Phrases than were used by Ventris to claim translation of Linear B as an old form of Greek.)
The mirrors with the Devotional Plates may be an easy entry into an audit, for those who are hesitant to examine the larger texts, such as the Zagreb Mummy (Script Z).
Techniques and tools. Corpus methods and statistics for semantics
by Dylan Glynn
An overview of the corpus methods and statistical techniques in Cognitive Semantics
Korpusbasierte Online-Dialoganalyse am Beispiel Twitter
by Agnes Mainka
DGI `12. Proceedings of the 2. DGI 2012 Conference: Social Media und Web Science - Das Web als Lebensraum. (pp. 331-344). Frankfurt a.M.: DGI
This article discusses the procedure and results of a dialogue analysis on the microblogging platform Twitter. Dialogues are identified on the one hand through metadata from the Twitter API and on the other through corpus-linguistic annotation by Connexor's Machinese Phrase Tagger. The results of the investigation show that Twitter's meta-information can make conversations findable, but additional information is needed to filter a thematic dialogue out of these conversations. For this problem, the comparison of noun phrases is examined here as a possible solution.
“It takes a Yorkshireman to talk Yorkshire”: towards a framework for the historical study of enregisterment
by Paul Cooper
Presented at the 14th Methods in Dialectology conference, August 2011. This version submitted for publication in the conference volume.
In this paper I consider the phenomenon of enregisterment and whether it can be studied in historical contexts. Following Johnstone et al’s definition of enregisterment as an instance where a feature has ‘become associated with a style of speech and can be used to create a context for that style’ (2006:82), I am investigating whether their notions of second and third-order indexicality can be applied to historical texts. I am specifically focussing on a stereotypical feature of the Yorkshire dialect: the phenomenon of Definite Article Reduction; as this feature is, to some extent, enregistered.
The historical context of this paper is the nineteenth century, due to the evolution of a strong interest in dialects in that century (Milroy in Watts & Trudgill (eds) 2002:14); the role that the resulting dialect dictionaries played in enregistering dialect features (Beal 2009:141-145); and the sheer quantity of examples of DAR in nineteenth-century Yorkshire dialect literature (a pilot study showed that around 80% of all definite articles were reduced).
My data for this paper comes from a corpus of dialect literature, literary dialect (Shorrocks 1999), and texts which discuss dialect such as Wright’s English Dialect Dictionary (1905) and Hunter’s Hallamshire Glossary (1888). I shall also consider data from contemporary newspapers such as The York Herald (October 25 1889), which mentions ‘the abbreviation...of the definite article’.
I am attempting to answer the following questions: (1) are comments like: ‘The absence of þ or th in the definite article is remarkable in the Sheffield dialect’ (Addy 1888:xviii) and ‘it is said the ghost of a t' is always to be recognised’ (Easther 1883:134) evidence for the nineteenth-century enregisterment of DAR?; (2) do textual representations of DAR highlight the feature’s enregisterment?; (3) is it possible to create a framework for the historical study of enregisterment?
References
Addy, S. O. (1888). A Glossary of Words used in the Neighbourhood of Sheffield including a Selection of local names, and some Notices of Folk-lore, Games and Customs. London: Published for the English Dialect Society by Trubner & Co. Ludgate Hill.
Beal, J. C. (2009) ‘Enregisterment, Commodification, and Historical Context: “Geordie” versus “Sheffieldish”’. American Speech 84 (2): 138-156.
Easther, Alfred (1883). A Glossary of the Dialect of Almondbury and Huddersfield. London: Published for the English Dialect Society by Trubner & Co, Ludgate Hill.
Hunter, Joseph (1888). The Hallamshire Glossary. William Pickering.
Johnstone, B, Andrus, J, and Danielson, A. E. (2006). ‘Mobility, Indexicality and the Enregisterment of “Pittsburghese”’. Journal of English Linguistics 34 (2): 77-104.
Milroy, J. 2002. ‘The Legitimate Language’ in Watts and Trudgill (eds). Alternative Histories of English. pp 6-27. Routledge.
Morris M.C.F. (1892). Yorkshire Folk-Talk with characteristics of those who speak it in the North and East Ridings. London: Henry Frowde.
Shorrocks, G (1999) ‘Working-Class Literature in Working-Class Language: The North of England’ in Hoenselaars, T and Buning, M (eds). English Literature and Other Languages. pp.87-96. Amsterdam: Rodopi.
Wright, Joseph (1905). The English Dialect Dictionary. Published by Henry Frowde, Amen Corner, E.C.
http://newspapers.bl.uk/blcs/ - accessed 10/11/2010 16:50 (York Herald acquired here)
Computer Assisted Semantic Annotation in the DutchSemCor Project
by Attila Görög
Görög, A. and Vossen, P. (2010). Computer Assisted Semantic Annotation in the DutchSemCor Project. In:Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10),Valletta, Malta.
The goal of this paper is to describe the annotation protocols and the Semantic Annotation Tool (SAT) used in the DutchSemCor project. The DutchSemCor project is aiming at aligning the Cornetto lexical database with the Dutch language corpus SoNaR. 250K corpus occurrences of the 3,000 most frequent and most ambiguous Dutch nouns, adjectives and verbs are being annotated manually using the SAT. This data is then used for bootstrapping 750K extra occurrences which in turn will be checked manually. Our main focus in this paper is the methodology applied in the project to attain the envisaged Inter-annotator Agreement (IA) of ≥80%. We will also discuss one of the main objectives of DutchSemCor i.e. to provide semantically annotated language data with high scores for quantity, quality and diversity. Sample data with high scores for these three features can yield better results for co-training WSD systems. Finally, we will take a brief look at our annotation tool.
DutchSemCor: building a semantically annotated corpus for Dutch
by Attila Görög
Vossen, P., Görög, A., Laan, F., Van Gompel, M. (2011). DutchSemCor: building a semantically annotated corpus for Dutch. In: Proceedings of Electronic Lexicography in the 21st century: New Applications for new users (eLEX2011), Bled, Slovenia.
State of the art Word Sense Disambiguation (WSD) systems require large sense-tagged corpora along with lexical databases to reach satisfactory results. The number of English language resources for developed WSD increased in the past years, while most other languages are still under-resourced. The situation is no different for Dutch. In order to overcome this data bottleneck, the DutchSemCor project will deliver a Dutch corpus that is sense-tagged with senses from the Cornetto lexical database. Part of this corpus (circa 300K examples) is manually tagged. The remainder is automatically tagged using different WSD systems and validated by human annotators. The project uses existing corpora compiled in other projects; these are extended with Internet examples for word senses that are less frequent and do not (sufficiently) appear in the corpora. We report on the status of the project and the evaluations of the WSD systems with the current training data.
DutchSemCor: Targeting the ideal sense-tagged corpus.
by Attila Görög
Vossen, P., Görög, A., Izquierdo, R., Van den Bosch, A. (2012). DutchSemCor: Targeting the ideal sense-tagged corpus. In: Proceedings of the Eighth conference on International Language Resources and Evaluation (LREC'12), Istanbul, Turkey.
Word Sense Disambiguation (WSD) systems require large sense-tagged corpora along with lexical databases to reach satisfactory results. The number of English language resources for developed WSD increased in the past years while most other languages are still under-resourced. The situation is no different for Dutch. In order to overcome this data bottleneck, the DutchSemCor project will deliver a Dutch corpus that is sense-tagged with senses from the Cornetto lexical database. In this paper, we discuss the different conflicting requirements for a sense-tagged corpus and our strategies to fulfill them. We report on a first series of experiments to support our semi-automatic approach to build the corpus.
Review of Herbst, Faulhaber & Uhrig (2011) (eds): The phraseological view of language: A tribute to John Sinclair
Lin, P.M.S. (2012). Review of "Herbst, Faulhaber & Uhrig (2011)(eds): The phraseological view of language: A tribute to John Sinclair" for The Linguist List
Type-token and Hapax-token Relation: A Combinatorial Model
by Jiří Milička
Published in Glottotheory 2/1 2009
Contains an exact formula for computing Type-token relation curve from a frequency distribution of types of a text (or from rank-frequency distribution). The formula is generalized to compute not only the number of the types, but also the number of the types of a certain frequency.
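The abstract does not reproduce the formula itself. As a hedged illustration of what such a computation can look like, the sketch below uses a standard combinatorial model that is assumed here and may differ from the paper's own derivation: if n tokens are drawn without replacement from a text of N tokens, a type with frequency f is missing from the sample with probability C(N−f, n)/C(N, n), so the expected number of types is the sum of 1 − C(N−f, n)/C(N, n) over all types. The expected number of types occurring exactly k times in the sample follows from the hypergeometric distribution. The function name `expected_types` is hypothetical.

```python
from math import comb
from typing import Optional, Sequence


def expected_types(freqs: Sequence[int], n: int, k: Optional[int] = None) -> float:
    """Expected number of types in a sample of n tokens drawn without
    replacement from a text whose types have the given frequencies.

    With k=None, counts every type that appears at least once.
    With an integer k, counts only types appearing exactly k times
    (k=1 gives the expected number of hapax legomena in the sample).

    Assumed model for illustration; not taken from the paper.
    """
    total_tokens = sum(freqs)
    if not 0 <= n <= total_tokens:
        raise ValueError("n must lie between 0 and the text length")
    all_samples = comb(total_tokens, n)

    if k is None:
        # 1 - P(type absent from the sample), summed over types.
        return sum(1 - comb(total_tokens - f, n) / all_samples for f in freqs)

    if k < 0 or k > n:
        return 0.0
    # Hypergeometric probability that a type of frequency f occurs k times.
    return sum(
        comb(f, k) * comb(total_tokens - f, n - k) / all_samples for f in freqs
    )


if __name__ == "__main__":
    frequencies = [2, 1, 1]  # toy text: a a b c
    for sample_size in range(1, 5):
        print(sample_size, expected_types(frequencies, sample_size))
```

Evaluating the function for n = 1 … N traces out a type-token curve for the toy frequency list; passing k = 1 gives the corresponding hapax-token curve under the same assumption.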
Valency and Information Structure: A quantitative approach to from – to juxtaposition in Arabic
by Jiří Milička
Presented at CL 2011, Birmingham
In Arabic, mutual order of prepositional phrases syntactically dependent on one head is neither fixed nor random. This paper explores the factors affecting the order of prepositions from and to. Many factors related to syntax, morphology and phonology are taken into account and analysed with a corpus driven approach.
A Combinatorial Method for a Context Comparison
by Jiří Milička
Published in Issues in Quantitative Linguistics 2 2011
When comparing the use of two word types within one text, we can do it by comparing the contexts in which they occur. We pick all the tokens that occur e.g. immediately to the right of the word A and immediately to the right of the word B, thus getting two multiple subsets of text. This paper offers a method for comparing such subsets (and its use is not limited only to the field of linguistics). The method is based on comparing the cardinality of the intersection of the two multiple subsets and a model which characterizes the average cardinality of all possible subsets of a given length from the given text. The model is derived algebraically.
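The observed half of this comparison can be sketched in a few lines. The example below collects the tokens immediately to the right of two words and counts the size of the intersection of the two collections, treating the "multiple subsets" as multisets, which is an interpretation assumed here. The helper names `right_contexts` and `intersection_cardinality` are hypothetical, and the algebraic model of the average cardinality that the paper derives is not reproduced.

```python
from collections import Counter
from typing import Sequence


def right_contexts(tokens: Sequence[str], word: str) -> Counter:
    """Multiset of tokens that occur immediately to the right of `word`."""
    return Counter(
        tokens[i + 1] for i in range(len(tokens) - 1) if tokens[i] == word
    )


def intersection_cardinality(a: Counter, b: Counter) -> int:
    """Size of the multiset intersection: each shared token is counted
    as many times as it occurs in both contexts (the minimum count)."""
    return sum((a & b).values())


if __name__ == "__main__":
    text = "the cat sat the dog sat the cat ran".split()
    cat_contexts = right_contexts(text, "cat")  # sat, ran
    dog_contexts = right_contexts(text, "dog")  # sat
    print(intersection_cardinality(cat_contexts, dog_contexts))
```

In the method described, an observed count of this kind would then be set against the modelled average for subsets of the same length drawn from the same text.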
Argumentation across L1 and L2 Writing: Exploring Cultural Influences and Transfer issues
Published in Vial Vigo International Journal of Applied Linguistics (2012)
