The MegaAcceptability Dataset

It has long been assumed that a verb’s syntactic distribution is determined by at least two kinds of lexical information: (i) the verb’s semantic type signatures and (ii) its morphosyntactic features. The first of these is often termed S(emantic)-selection; the second goes under various names, though perhaps the most neutral term is subcategorization. Standard distributional analyses in the theoretical literature have had tremendous success in uncovering the nature of S-selection and its relationship to the syntax (i.e. projection rules). But as theories scale to the entire lexicon, these approaches hit a limit imposed by the sheer size of lexica and by bounds on human analysts’ memory and processing power. This challenge motivates lexicon-scale datasets such as the ones collected here.

For a detailed description of the datasets associated with this project, the item construction and collection methods, and discussion of how to use a dataset on this scale to address questions in linguistic theory, please see the references below.

Data

Sentences	Predicates	Frames	Download	Citation
50000	1000	50	v1 (zip)	White & Rawlins 2016, 2020
74830	1007	150	v2 (zip)	White & Rawlins 2016, 2020; An & White 2020
50	50	50	linking (zip)	White & Rawlins 2020
1850	37	50	single verb (zip)	White & Rawlins 2020
1380	30	46	replication (zip)	White & Rawlins 2020
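
After downloading one of the archives, it can be useful to check its size against the table above. The following is a minimal sketch, assuming the archive contains a tab-delimited text file with a header row; the function name `summarize_archive` and the column names `verb` and `frame` are hypothetical placeholders, not taken from the dataset documentation, so inspect the header of the downloaded file and pass the actual names.

```python
import csv
import io
import zipfile


def summarize_archive(zip_path, verb_col="verb", frame_col="frame", delimiter="\t"):
    """Count rows, distinct predicates, and distinct frames in the first
    delimited text file found inside a zip archive.

    The default column names are hypothetical; replace them with the names
    found in the header of the file you actually downloaded.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Use the first member that is not a directory entry.
        member = next(name for name in archive.namelist() if not name.endswith("/"))
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.DictReader(text, delimiter=delimiter)
            rows = 0
            verbs, frames = set(), set()
            for row in reader:
                rows += 1
                verbs.add(row[verb_col])
                frames.add(row[frame_col])
    return {"rows": rows, "predicates": len(verbs), "frames": len(frames)}
```

Note that the row count need not equal the Sentences column: if a file stores one row per judgment rather than one row per sentence, it will have more rows than sentences, while the distinct predicate and frame counts should still be comparable to the table.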

References

Kim, Gene Louis, and Aaron Steven White. To appear. Montague Grammar Induction. Semantics and Linguistic Theory 30. [pdf]

White, Aaron Steven, and Kyle Rawlins. 2020. Frequency, Acceptability, and Selection: A Case Study of Clause-Embedding. Glossa 5(1): 105. 1–41. [pdf, code, doi]

An, Hannah Youngeun, and Aaron Steven White. 2020. The Lexical and Grammatical Sources of Neg-Raising Inferences. In Proceedings of the Society for Computation in Linguistics 3: 220–233. [pdf, doi]

White, Aaron Steven, and Kyle Rawlins. 2016. A Computational Model of S-Selection. Edited by Mary Moroney, Carol-Rose Little, Jacob Collard, and Dan Burgdorf. Semantics and Linguistic Theory 26: 641–663. [pdf, doi]

Researchers

Aaron Steven White

Kyle Rawlins

Hannah Youngeun An