It has long been assumed that a verb’s syntactic distribution is determined by at least two kinds of lexical information: (i) the verb’s semantic type signatures and (ii) its morphosyntactic features. The first of these is often termed S(emantic)-selection; the second goes under various names, though perhaps the most neutral term is subcategorization. Standard distributional analyses in the theoretical literature have had tremendous success in uncovering the nature of S-selection and its relationship to the syntax (i.e., projection rules). But as theories scale to the entire lexicon, these approaches hit a limit, imposed by the sheer size of lexica and by bounds on human analysts’ memory and processing power. This challenge suggests the need for lexicon-scale datasets.

The MegaAcceptability dataset consists of ordinal acceptability judgments for 1,000 clause-embedding verbs of English in 50 surface-syntactic frames and with three different matrix tense-aspect combinations. For a detailed description of the dataset, the item construction and collection methods, and discussion of how to use a dataset on this scale to address questions in linguistic theory, please see the following papers:

An, H. Y. & A. S. White. 2019. The lexical and grammatical sources of neg-raising inferences. arXiv:1908.05253 [cs.CL].

White, A. S. & K. Rawlins. 2016. A computational model of S-selection. In M. Moroney, C-R. Little, J. Collard & D. Burgdorf (eds.), Semantics and Linguistic Theory 26, 641-663. Ithaca, NY: CLC Publications.

If you make use of this dataset in a presentation or publication, we ask that you please cite these papers.


