Ideas for a 100K Word Data Set for Human and Machine Learning
Lori Levin, Alon Lavie, Jaime Carbonell
Language Technologies Institute
Carnegie Mellon University
The data set should support
Machine learning
Machine learning from small data can work if the data is structured.
Analysis by humans
Humans can learn a lot from a small data set if the form-function mappings are clear.
Concrete Suggestions
1. Hand align a portion of the corpus.
2. Include parse trees and feature structures for a portion of the corpus.
3. Include a representative sample of diversity of phrase structures.
4. Include a representative sample of diversity in function/meaning.
5. Include some simple, single sentences.
6. Include some full texts.
7. Look for well-known divergences.
8. Conduct an evaluation to be sure that the corpus elicits what you want it to elicit.
Hand align a portion of the corpus
Automatic alignment algorithms can be bootstrapped from the hand alignments.
A lexicon can be created from the alignments.
Humans can study word usage.
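As a minimal sketch of how a lexicon could be read off hand alignments, the Python below counts aligned word pairs. The data format (source tokens, target tokens, and a list of index links), the function name `build_lexicon`, and the toy English-Spanish sentences are all assumptions made for illustration, not part of the proposal.

```python
from collections import Counter, defaultdict

def build_lexicon(aligned_pairs):
    """Count how often each source word is hand-aligned to each target word."""
    lexicon = defaultdict(Counter)
    for src_tokens, tgt_tokens, links in aligned_pairs:
        for i, j in links:
            lexicon[src_tokens[i].lower()][tgt_tokens[j].lower()] += 1
    return lexicon

# Toy hand-aligned sentence pairs, invented for illustration.
# Each link (i, j) says: source token i is aligned to target token j.
aligned_pairs = [
    (["the", "dog", "sleeps"], ["el", "perro", "duerme"], [(0, 0), (1, 1), (2, 2)]),
    (["the", "dog", "runs"], ["el", "perro", "corre"], [(0, 0), (1, 1), (2, 2)]),
    (["the", "house"], ["la", "casa"], [(0, 0), (1, 1)]),
]

lexicon = build_lexicon(aligned_pairs)
for word in sorted(lexicon):
    # Candidate translations, most frequent first.
    print(word, lexicon[word].most_common())
```

Keeping the counts rather than a single translation per word lets a human see at a glance which words have several renderings, here "the" as both "el" and "la".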
Provide parse trees for a portion of the corpus
Parse trees plus alignments can be input to
Avenue-style rule learning
Automatic treebanking of the minor language
Humans can study the translation of specific structures.
There should be semantic and functional information in addition to structural information. See below.
Include a representative example of structural diversity
Part of the corpus can be structured to include simple, common sub-trees from the English Penn TreeBank.
Learn a collection of structural mappings that is compositional
A lot of mileage from small data
Preliminary work with Katharina Probst
Raw WSJ data requires editing
Need redundant examples of each structure
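The need for redundant examples can be checked mechanically once each corpus item carries a structure label. The sketch below is one hedged way to do it: the label strings, the function name `under_represented`, and the threshold of three examples per structure are assumptions chosen for illustration, not values from this work.

```python
from collections import Counter

def under_represented(structure_labels, min_examples=3):
    """Return structures that occur fewer than min_examples times."""
    counts = Counter(structure_labels)
    return {label: n for label, n in counts.items() if n < min_examples}

# One hypothetical structure label per corpus item.
labels = [
    "NP -> DT NN",
    "NP -> DT NN",
    "NP -> DT NN",
    "NP -> DT JJ NN",
    "PP -> IN NP",
    "PP -> IN NP",
]

gaps = under_represented(labels, min_examples=3)
for label, n in sorted(gaps.items()):
    print(f"{label}: only {n} example(s)")
```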
Include a representative example of function or meaning
Finding out how English structures translate into minor language structures is not enough.
For example, finding out how to translate English auxiliary verbs is not useful because they have many functions: tense, aspect, epistemics, evidentials, etc.
Finding out how to express tense, aspect, epistemics, evidentials, etc. is useful.
Include some multi-sentence texts
In order to observe
Temporal sequencing of events
Causation
Rhetorical relations
Contrast, elaboration, etc.
Given and new information
Co-reference
Look for well-known divergences
E.g., run across the street vs cross the street running
But see below for our view of divergences.
Include some simple sentences
So that the form-function mapping is clear to a human without confounding factors
As a seed for machine learning
Evaluation
Test the corpus on a few languages to be sure that the intended structures and functions are elicited.
Need to watch out for idiosyncrasies, lexical gaps, special constructions, etc.
For example, if you want to elicit a noun modified by a prepositional phrase, "the person in the room" will work better than "a bottle of wine".
Hard problems
Body of common phenomena with a tail of phenomena that are individually rare, but collectively massive.
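The arithmetic behind "individually rare, but collectively massive" can be made concrete with a toy frequency table. The counts below are invented purely for illustration (they are not measurements from any corpus), and `tail_share` is a hypothetical helper name.

```python
def tail_share(counts, rare_threshold):
    """Fraction of all occurrences that belong to phenomena seen at most rare_threshold times."""
    total = sum(counts.values())
    rare_total = sum(n for n in counts.values() if n <= rare_threshold)
    return rare_total / total

# Invented counts: two common phenomena plus thirty phenomena seen once each.
counts = {"common_1": 40, "common_2": 30}
counts.update({f"rare_{i}": 1 for i in range(30)})

share = tail_share(counts, rare_threshold=1)
print(f"{len(counts)} phenomena, tail share = {share:.0%}")
```

In this toy table each rare phenomenon accounts for 1% of the occurrences, yet together they account for 30%, so a corpus that covers only the common body would still miss a large part of the data.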
Extra slides
Our view of translation divergences
Elaboration on the different roles of structure and function
Our view of divergences, which is divergent from some other views of divergences
Divergences arise when the same function is expressed by a different structure.
Many functions are expressed by specialized constructions that do not translate literally into other languages.
Divergences cannot be neatly grouped into a few classes.
Typological differences between languages are relevant:
Embedding vs serialization
Synthetic vs analytic causative constructions
Coverage: Structure and Function
Structural Diversity
Appositives, adjuncts, embedded clauses, coordinate structures, ellipsis, etc.
Functional/Meaning Diversity
Temporal relations, rhetorical relations, modality, negation, tense, aspect, etc.
Structure and Function
The way you understand a text is by knowing which structure has which function.
The same function is expressed by different structures in different languages.
What a human needs to know (function)
Who did what to who when?
What happened before/after what?
What caused what?
Is it first hand knowledge, hearsay, or inference?
Is it certain, probable, or improbable?
Did it happen or not?
What do these words mean?
How a human knows these things (structure/grammar)
Who did what to who when?
Grammatical relations, coreference, time expressions, pronouns/pro-drop, nominalizations, subordinate clauses, case marking, word order, agreement, tense, aspect
What happened before/after what?
Time expressions, temporal connectives, tense and aspect morphemes
What caused what?
Markers of rhetorical relations between sentences
Is it first hand knowledge, hearsay, or inference? Is it certain, probable, or improbable?
Markers of modality and epistemics
Did it happen or not?
Markers of negation and counterfactuals
What do these words mean?
Vocabulary
Other
Questions, existentials, possessives, coordinate structures
How to make sure the corpus captures what a human needs to know
Organize the corpus by function and then a human can observe the corresponding structure.
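One simple way to organize a corpus by function is to tag each sentence with the functions it expresses and then index the sentences by tag. The sketch below assumes a hypothetical record format with `sentence` and `functions` fields; the tags and sentences are invented for illustration.

```python
from collections import defaultdict

def group_by_function(records):
    """Index sentences by each function tag they carry."""
    by_function = defaultdict(list)
    for record in records:
        for function in record["functions"]:
            by_function[function].append(record["sentence"])
    return dict(by_function)

# Hypothetical annotated corpus items.
records = [
    {"sentence": "The dog did not bark.", "functions": ["negation", "past tense"]},
    {"sentence": "The dog barked.", "functions": ["past tense"]},
    {"sentence": "Apparently the dog barked.", "functions": ["evidentiality", "past tense"]},
]

grouped = group_by_function(records)
for function in sorted(grouped):
    print(function)
    for sentence in grouped[function]:
        print("   ", sentence)
```

With the translations stored alongside each sentence, a human could then read down one function at a time and observe which structures the other language uses to express it.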
Coverage of data for human analysis: basics
Closed Class and Special Constructions
Dates, names, numbers, prices, etc.
Pronouns, prepositions, etc.
Encoding of grammatical relations and/or semantic roles. How do you know who did what to who?
Word order, case marking, agreement
Encoding of old and new information
Word order, special constructions (e.g., clefts), etc.
Questions
Negation
Modification
Possession
Coordination
Indirect speech
Coverage of data for human analysis: multi-sentence and multi-clause
Rhetorical relations
Cause, elaboration, contrast, etc.
Temporal relations
Before, after, during, etc.
Same subject and obviation phenomena
Subordination
As subject or object
As complement
As adjunct
Other grammatically encoded meanings
Modality and Epistemics
Certainty, source of information (first hand, second hand, inference), etc.
Conditionals
Comparatives
Existentials
Tense and aspect
Definiteness