It has long been assumed that a verb’s syntactic distribution is determined by at least two kinds of lexical information: (i) the verb’s semantic type signatures and (ii) its morphosyntactic features. The first of these is often termed S(emantic)-selection; the second goes under various names, though perhaps the most neutral term is subcategorization. Standard distributional analyses in the theoretical literature have had tremendous success in uncovering the nature of S-selection and its relationship to the syntax—i.e. projection rules. But as theories scale to the entire lexicon, these approaches hit a limit, imposed by the sheer size of lexica and by bounds on human analysts’ memory and processing power. This challenge suggests the need for lexicon-scale datasets.

For a detailed description of the datasets associated with this project, the item construction and collection methods, and a discussion of how to use a dataset on this scale to address questions in linguistic theory, please see the references cited in the table below.


| Sentences | Predicates | Frames | Download | Citation |
|---|---|---|---|---|
| 50000 | 1000 | 50 | v1 (zip) | White & Rawlins 2016, 2020 |
| 74830 | 1007 | 150 | v2 (zip) | White & Rawlins 2016, 2020; An & White 2020 |
| 50 | 50 | 50 | linking (zip) | White & Rawlins 2020 |
| 1850 | 37 | 50 | single verb (zip) | White & Rawlins 2020 |
| 1380 | 30 | 46 | replication (zip) | White & Rawlins 2020 |


