Who did What: A Large-Scale Person-Centered Cloze Dataset
We have constructed a new ``who did what'' (WDW) dataset of over 200,000 fill-in-the-gap (cloze) multiple-choice reading comprehension problems drawn from the LDC English Gigaword newswire corpus. The WDW dataset has several novel features. First, we avoid using article summaries for question formation. Instead, each problem is formed from two independent articles: one article is given as the passage to be read, and a separate article on the same events is used to form the question. Second, we avoid anonymization: each choice is a person named entity. Third, the problems have been filtered to remove a fraction that are easily solved by simple baselines, while the filtered dataset remains 90% solvable by humans. This gives a newswire cloze dataset that differs significantly from existing ones and also from story completion datasets.
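To make the problem format and the baseline filtering concrete, the following is a minimal Python sketch, not the construction pipeline itself. The record layout (`ClozeProblem` and its field names), the `XXX` blank marker, the frequency baseline, and the toy articles are all assumptions introduced for illustration. The sketch also drops every problem the baseline solves, whereas the dataset removes only a fraction of them.

```python
from dataclasses import dataclass
from typing import Callable, List

BLANK = "XXX"  # assumed placeholder for the removed person name


@dataclass
class ClozeProblem:
    """Hypothetical record layout; field names are illustrative assumptions."""
    passage: str        # article given as the passage to be read
    question: str       # sentence from a separate article, with BLANK for the person
    choices: List[str]  # candidate answers, each a person named entity
    answer: str         # the correct choice


def most_frequent_baseline(problem: ClozeProblem) -> str:
    """Pick the choice mentioned most often in the passage (ties go to the earliest choice)."""
    counts = [problem.passage.count(choice) for choice in problem.choices]
    return problem.choices[counts.index(max(counts))]


def filter_easy(problems: List[ClozeProblem],
                baseline: Callable[[ClozeProblem], str]) -> List[ClozeProblem]:
    """Keep only the problems that the baseline answers incorrectly (a simplification)."""
    return [p for p in problems if baseline(p) != p.answer]


# Toy problems with invented text, used only to exercise the functions above.
easy = ClozeProblem(
    passage="Alice Smith met Bob Jones. Alice Smith later spoke to reporters.",
    question=f"{BLANK} spoke to reporters after the meeting.",
    choices=["Alice Smith", "Bob Jones"],
    answer="Alice Smith",
)
hard = ClozeProblem(
    passage="Alice Smith met Bob Jones on Monday. Alice Smith left early, and Jones addressed reporters.",
    question=f"{BLANK} addressed reporters after the meeting.",
    choices=["Alice Smith", "Bob Jones"],
    answer="Bob Jones",
)

kept = filter_easy([easy, hard], most_frequent_baseline)
```

In this toy case the baseline answers `easy` correctly, so only `hard` survives the filter. Keeping the choices as plain name strings mirrors the decision to avoid anonymization.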