Download
This script generates the Who-did-What (WDW) dataset from Gigaword articles distributed by the Linguistic Data Consortium.
Prerequisites :
- English Gigaword v5
- Java version "1.8.0_25" or higher
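The Java prerequisite can be checked before running anything. The following is a minimal sketch, not part of the distributed scripts: the function names java_major and check_java are hypothetical, and the check compares only the major version (8 or higher), so it does not distinguish 1.8.0_25 from an earlier 1.8.0 update.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the package):
# extract the major version from a Java version string.
java_major() {
    v="$1"
    case "$v" in
        1.*) v="${v#1.}" ;;   # legacy scheme: 1.8.0_25 -> 8.0_25
    esac
    echo "${v%%[!0-9]*}"      # keep only the leading digits
}

# Hypothetical helper: report whether the java on PATH looks new enough.
# Assumption: only the major version is compared, not the update number.
check_java() {
    if ! command -v java >/dev/null 2>&1; then
        echo "java not found on PATH" >&2
        return 1
    fi
    # java -version writes to stderr; the version is the first quoted string
    ver=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    major=$(java_major "$ver")
    if [ "${major:-0}" -ge 8 ]; then
        echo "Java $ver looks sufficient"
    else
        echo "Java 1.8.0_25 or higher is required (found: ${ver:-unknown})" >&2
        return 1
    fi
}
```

Calling check_java with no arguments prints a one-line verdict and returns non-zero when Java is missing or too old.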
Contents :
- Readme.txt : Please read this before you run the script.
- start.sh : Main script.
- targeted.sh : Another main script to flag candidate answers in the passage.
- FindPassage.jar : Java program to find passages in Gigaword.
- FindPassage_lib : Library files to run the above .jar file.
- keys : .xml files with question sentences and multiple choices (not passage).
Running the script reproduces the Who-did-What dataset as .xml files in a directory named "who_did_what".
who_did_what/
├ Strict/
│ ├ train.xml
│ ├ valid.xml
│ └ test.xml
└ Relaxed/
  └ train.xml
- Strict : Baselines are suppressed to exactly chance-level performance.
- Relaxed : Baselines are suppressed less strictly. For training only.
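After a run, the directory layout described above can be verified with a short check. This is a sketch under the assumption that the output contains exactly the four files listed in the tree; check_wdw_layout is a hypothetical helper name and is not part of start.sh or targeted.sh.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the package):
# print every expected output file that is missing under the given root.
# Returns 0 when all four files exist, 1 otherwise.
check_wdw_layout() {
    root="${1:-who_did_what}"
    missing=0
    for f in Strict/train.xml Strict/valid.xml Strict/test.xml Relaxed/train.xml; do
        if [ ! -f "$root/$f" ]; then
            echo "missing: $root/$f"
            missing=1
        fi
    done
    return "$missing"
}
```

Run from the directory that contains "who_did_what", check_wdw_layout prints nothing when the layout is complete; a different output location can be passed as the first argument.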
If you already have the Gigaword data, you can run the script with the following command.
- $ start.sh [path/to/gigaword]
You can also flag candidate answers in the passage with the following script. (Experimental: the script appears to miss some person names in passages.)
- $ targeted.sh [path/to/gigaword]
NOTE : Generating the dataset takes up to 20 minutes, and flagging candidate answers takes a few hours. The resulting dataset is about 1 GB.