It has long been assumed that a verb’s syntactic distribution is determined by at least two kinds of lexical information: (i) the verb’s semantic type signatures and (ii) its morphosyntactic features. The first of these is often termed S(emantic)-selection; the second goes under various names, though perhaps the most neutral term is subcategorization. Standard distributional analyses in the theoretical literature have had tremendous success in uncovering the nature of S-selection and its relationship to the syntax—i.e. projection rules. But as theories scale to the entire lexicon, these approaches hit a limit, imposed by the sheer size of lexica and by bounds on human analysts’ memory and processing power. This challenge suggests the need for lexicon-scale datasets.
For a detailed description of the datasets associated with this project, the item construction and collection methods, and discussion of how to use a dataset on this scale to address questions in linguistic theory, please see the references below.
Data
Sentences | Predicates | Frames | Download | Citation |
---|---|---|---|---|
50000 | 1000 | 50 | v1 (zip) | White & Rawlins 2016, 2020 |
74830 | 1007 | 150 | v2 (zip) | White & Rawlins 2016, 2020; An & White 2020 |
50 | 50 | 50 | linking (zip) | White & Rawlins 2020 |
1850 | 37 | 50 | single verb (zip) | White & Rawlins 2020 |
1380 | 30 | 46 | replication (zip) | White & Rawlins 2020 |
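
As a quick sanity check on the table above, the following Python sketch computes how many sentences each dataset contains per possible predicate-frame pair. The dataset labels and counts are copied from the table; the names `DATASETS`, `coverage`, and `coverage_table` are hypothetical helpers introduced here for illustration. Reading a ratio of 1.0 as "one sentence for every predicate-frame pair" is an assumption consistent with the counts, not something the table itself states.

```python
# Counts copied from the Data table: (sentences, predicates, frames).
# The dictionary keys reuse the labels in the Download column.
DATASETS = {
    "v1": (50000, 1000, 50),
    "v2": (74830, 1007, 150),
    "linking": (50, 50, 50),
    "single verb": (1850, 37, 50),
    "replication": (1380, 30, 46),
}


def coverage(sentences, predicates, frames):
    """Return the number of sentences per possible predicate-frame pair."""
    return sentences / (predicates * frames)


def coverage_table(datasets=DATASETS):
    """Map each dataset label to its sentences-per-pair ratio."""
    return {name: coverage(*counts) for name, counts in datasets.items()}


if __name__ == "__main__":
    for name, ratio in coverage_table().items():
        print(f"{name:12s} {ratio:.3f}")
```

For v1, single verb, and replication the ratio is exactly 1.0, since the sentence count equals predicates times frames. For v2 and linking it is below 1.0, which suggests that not every predicate is paired with every frame in those datasets; the cited papers are the place to confirm how the items were actually constructed.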
References
Researchers
Aaron Steven White
Kyle Rawlins
Hannah Youngeun An