Download

Download Script!

What is this?

This script generates the Who-did-What (WDW) dataset from Gigaword articles distributed by the Linguistic Data Consortium.

Prerequisites :

English Gigaword v5
Java version "1.8.0_25" or higher
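
The Java requirement can be checked before running anything. The following is a minimal sketch, not part of the download: java_version_ok is a hypothetical helper name, and it assumes the version string has the usual form printed by "java -version" (for example 1.8.0_25 or 11.0.2).

```shell
#!/bin/sh
# Sketch (hypothetical helper): succeed if a Java version string is
# at least 1.8.0_25, the minimum listed under Prerequisites.
java_version_ok() {
    v="${1%%-*}"                      # drop suffixes such as "-ea"
    case "$v" in
        1.*) rest="${v#1.}" ;;        # legacy scheme: 1.8.0_25 -> 8.0_25
        *)   rest="$v" ;;             # newer scheme: 11.0.2, 17, ...
    esac
    major="${rest%%[!0-9]*}"          # leading digits of what remains
    tail="${rest#"$major"}"
    tail="${tail#.}"
    minor="${tail%%[!0-9]*}"
    case "$v" in
        *_*) update="${v##*_}" ;;     # digits after the underscore
        *)   update=0 ;;
    esac
    update="${update%%[!0-9]*}"
    major="${major:-0}"; minor="${minor:-0}"; update="${update:-0}"

    if [ "$major" -gt 8 ]; then return 0; fi
    if [ "$major" -lt 8 ]; then return 1; fi
    if [ "$minor" -gt 0 ]; then return 0; fi
    [ "$update" -ge 25 ]
}

# Example use (needs java on the PATH; "java -version" writes to stderr):
#   java_version_ok "$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')" \
#       || echo "Java 1.8.0_25 or higher is required"
```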

Contents :

Readme.txt : Please read this before you run the script.
start.sh : Main script.
targeted.sh : Another main script to flag candidate answers in the passage.
FindPassage.jar : Java program that finds passages in Gigaword.
FindPassage_lib : Library files to run the above .jar file.
keys : .xml files with the question sentences and multiple-choice answers (passages not included).
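
To confirm that an unpacked download is complete, the items listed above can be checked by name. This is only a sketch: check_wdw_contents is a hypothetical helper, and it assumes all items sit directly in one directory. It tests for existence only, since the list does not say which entries are files and which are directories.

```shell
#!/bin/sh
# Sketch (hypothetical helper): report any listed item that is missing
# from the given directory (default: current directory).
check_wdw_contents() {
    dir="${1:-.}"
    missing=0
    for item in Readme.txt start.sh targeted.sh FindPassage.jar FindPassage_lib keys; do
        if [ ! -e "$dir/$item" ]; then
            echo "missing: $item"
            missing=1
        fi
    done
    return "$missing"      # 0 if everything was found, 1 otherwise
}
```

For example, "check_wdw_contents ." run from the unpacked directory prints nothing when every item is present.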

What you will get

The script reproduces the Who-did-What dataset as .xml files in a directory named "who_did_what".

who_did_what/
├ Strict/
│　├ train.xml
│　├ valid.xml
│　└ test.xml
└ Relaxed/
　└ train.xml

Strict : Baselines are suppressed to exactly chance level.
Relaxed : Baselines are less strongly suppressed. For training only.
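
After a run, the layout shown above can be verified with a short check. This is a sketch rather than part of the download: check_wdw_output is a hypothetical helper, and it assumes a file counts as produced only if it exists and is non-empty.

```shell
#!/bin/sh
# Sketch (hypothetical helper): verify that the four expected .xml files
# exist and are non-empty under the output directory.
check_wdw_output() {
    root="${1:-who_did_what}"
    missing=0
    for f in Strict/train.xml Strict/valid.xml Strict/test.xml Relaxed/train.xml; do
        if [ ! -s "$root/$f" ]; then
            echo "missing or empty: $root/$f"
            missing=1
        fi
    done
    if [ "$missing" -eq 0 ]; then
        echo "all expected files present in $root"
    fi
    return "$missing"
}
```

For example, "check_wdw_output who_did_what" can be called from the directory that contains the output.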

Running the script

If you already have Gigaword data, then you can simply run the script with the following command.

$ start.sh [path/to/gigaword]

You can also flag candidate answers in the passages with the following script. (Experimental: the script appears to miss some person names in passages.)

$ targeted.sh [path/to/gigaword]

NOTE : Generating the dataset takes up to 20 minutes, and flagging candidate answers takes a few hours. The resulting dataset is about 1GB.