What is this ?

This is a script to generate the Who-did-What (WDW) dataset using Gigaword articles from the Linguistic Data Consortium.

Prerequisites :

Contents :

  • Readme.txt : Please read this before you run the script.
  • : Main script.
  • : Another main script to flag candidate answers in the passage.
  • FindPassage.jar : Script to find passages in Gigaword.
  • FindPassage_lib : Library files needed to run the above .jar file.
  • keys : .xml files with the question sentences and multiple choices (passages not included).

What you will get

You will have a reproduced Who-did-What dataset as .xml files in a directory named "who_did_what".

├ Strict/
│ ├ train.xml
│ ├ valid.xml
│ └ test.xml
└ Relaxed/
  └ train.xml

  • Strict : Baselines are suppressed to exactly chance-level accuracy.
  • Relaxed : Baselines are less suppressed. For training only.

Running the script

If you already have the Gigaword data, you can run the script with the following command.

  • $ [path/to/gigaword]

You can also flag candidate answers in the passages with the following script. (Experimental: the script seems to miss some person names in the passages.)

  • $ [path/to/gigaword]

NOTE : Generating the dataset takes up to 20 minutes, flagging takes a few hours, and the resulting dataset is about 1 GB.
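
Checking the output

Once a run has finished, it may be worth confirming that the directory layout listed under "What you will get" was actually produced. The following is a minimal Python sketch, not part of the distributed scripts. The helper name missing_outputs is hypothetical, and the check assumes only the four file paths shown in the directory tree; it does not validate the XML contents.

```python
from pathlib import Path

# Expected layout, taken from the directory tree in "What you will get".
EXPECTED_FILES = [
    "Strict/train.xml",
    "Strict/valid.xml",
    "Strict/test.xml",
    "Relaxed/train.xml",
]


def missing_outputs(root):
    """Return the expected files that are absent or empty under root.

    Hypothetical helper for illustration; not part of the WDW scripts.
    """
    root = Path(root)
    missing = []
    for rel in EXPECTED_FILES:
        path = root / rel
        # Treat a zero-byte file as missing, since it suggests an aborted run.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    # "who_did_what" is the output directory name stated in this README;
    # adjust the path if the script was run from another working directory.
    missing = missing_outputs("who_did_what")
    if missing:
        print("Missing or empty files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected files are present.")
```

An empty result from missing_outputs only means the files exist and are non-empty; it says nothing about whether every passage was found in Gigaword.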