MultiWOZ 2.2

This dataset consists of a schema file, schema.json, which describes the ontology, and dialogue files, dialogues_*.json, stored under the train, dev, and test folders.

Notes:

  • Compared to MultiWOZ 2.1, we remove SNG01862.json as it's an invalid dialogue.
  • MultiWOZ 2.2 is also available on Hugging Face and ParlAI.

Schema file

schema.json defines the new ontology using the schema representation in the schema-guided dialogue dataset.

The table below shows the categorical slots, non-categorical slots and intents defined for each domain.

Domain | Categorical slots | Non-categorical slots | Intents
Restaurant | pricerange, area, bookday, bookpeople | food, name, booktime, address, phone, postcode, ref | find, book
Attraction | area, type | name, address, entrancefee, openhours, phone, postcode | find
Hotel | pricerange, parking, internet, stars, area, type, bookpeople, bookday, bookstay | name, address, phone, postcode, ref | find, book
Taxi | - | destination, departure, arriveby, leaveat, phone, type | book
Train | destination, departure, day, bookpeople | arriveby, leaveat, trainid, ref, price, duration | find, book
Bus | day | departure, destination, leaveat | find
Hospital | - | department, address, phone, postcode | find
Police | - | name, address, phone, postcode | find
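
As a rough sketch of how the table relates to schema.json, the snippet below splits each domain's slots into categorical and non-categorical groups. It assumes the schema-guided layout (a list of services, each with `service_name` and a `slots` list whose entries carry `name` and `is_categorical`). The inline sample is an illustrative excerpt rather than the full file, and `split_slots` is a hypothetical helper.

```python
import json

# Illustrative excerpt in the schema-guided layout (field names are assumed).
# The real file would be loaded with json.load(open("schema.json")).
SAMPLE_SCHEMA = json.loads("""
[
  {
    "service_name": "bus",
    "slots": [
      {"name": "bus-day", "is_categorical": true},
      {"name": "bus-departure", "is_categorical": false},
      {"name": "bus-destination", "is_categorical": false},
      {"name": "bus-leaveat", "is_categorical": false}
    ],
    "intents": [{"name": "find_bus"}]
  }
]
""")


def split_slots(schema):
    """Map each service name to (categorical, non_categorical) slot-name lists."""
    result = {}
    for service in schema:
        categorical, non_categorical = [], []
        for slot in service["slots"]:
            target = categorical if slot["is_categorical"] else non_categorical
            target.append(slot["name"])
        result[service["service_name"]] = (categorical, non_categorical)
    return result


print(split_slots(SAMPLE_SCHEMA))
```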

Of the 61 slots in the schema, the following 35 slots are tracked in the dialogue state:

{
  "attraction-area",
  "attraction-name",
  "attraction-type",
  "bus-day",
  "bus-departure",
  "bus-destination",
  "bus-leaveat",
  "hospital-department",
  "hotel-area",
  "hotel-bookday",
  "hotel-bookpeople",
  "hotel-bookstay",
  "hotel-internet",
  "hotel-name",
  "hotel-parking",
  "hotel-pricerange",
  "hotel-stars",
  "hotel-type",
  "restaurant-area",
  "restaurant-bookday",
  "restaurant-bookpeople",
  "restaurant-booktime",
  "restaurant-food",
  "restaurant-name",
  "restaurant-pricerange",
  "taxi-arriveby",
  "taxi-departure",
  "taxi-destination",
  "taxi-leaveat",
  "train-arriveby",
  "train-bookpeople",
  "train-day",
  "train-departure",
  "train-destination",
  "train-leaveat"
}

Dialogue files

Dialogues are formatted following the data presentation of the schema-guided dialogue dataset.

Because the state value of a slot can be mentioned in different ways in a dialogue (e.g. 8pm and 20:00), the ground truth state values are presented as a list of values that covers such variants. Predicting any one of them is considered correct in the evaluation. Specifically, the state values of each turn are represented as:

{
  "state":{
    "active_intent": String. User intent of the current turn.
    "requested_slots": List of strings naming the slots whose values the user is requesting.
    "slot_values": Dict of state values. Each key is a slot name (string). Each value is a list of values.
  }
}
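
The "any listed value counts as correct" rule can be sketched as follows. This is a minimal illustration that assumes exact string matching and a joint comparison over all slots; the official evaluation may normalize values or aggregate differently, and both function names are hypothetical.

```python
def slot_is_correct(predicted_value, gold_values):
    """A prediction is correct if it matches any of the listed gold values."""
    return predicted_value in gold_values


def state_is_correct(predicted_state, gold_slot_values):
    """Joint check: same set of slots, and every slot matches one gold variant.

    predicted_state:  dict mapping slot name -> single predicted string
    gold_slot_values: dict mapping slot name -> list of acceptable strings
    """
    if set(predicted_state) != set(gold_slot_values):
        return False
    return all(
        slot_is_correct(predicted_state[slot], gold_slot_values[slot])
        for slot in gold_slot_values
    )


gold = {"restaurant-booktime": ["8pm", "20:00"], "restaurant-area": ["centre"]}
print(state_is_correct({"restaurant-booktime": "20:00", "restaurant-area": "centre"}, gold))
print(state_is_correct({"restaurant-booktime": "19:00", "restaurant-area": "centre"}, gold))
```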

In addition, for non-categorical slots we add span annotations that identify where the slot values are mentioned in the utterances. These span annotations are represented as follows:

{
  "slots": [
    {
      "slot": String of slot name.
      "start": Int denoting the index of the starting character in the utterance corresponding to the slot value.
      "exclusive_end": Int denoting the index of the character just after the last character corresponding to the slot value in the utterance. In Python, utterance[start:exclusive_end] gives the slot value.
      "value": String of value. It equals utterance[start:exclusive_end], where utterance is the current utterance as a string.
    }
  ]
}
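
The relationship between `start`, `exclusive_end`, and `value` can be checked with a short sketch like the one below. The utterance and annotation are made up for illustration, and `check_spans` is a hypothetical helper, not part of the dataset tooling.

```python
def check_spans(utterance, slots):
    """Return the annotations whose offsets do not reproduce their value."""
    mismatches = []
    for annotation in slots:
        if "start" not in annotation:
            continue  # annotations without character offsets are skipped
        extracted = utterance[annotation["start"]:annotation["exclusive_end"]]
        if extracted != annotation["value"]:
            mismatches.append((annotation["slot"], annotation["value"], extracted))
    return mismatches


utterance = "I need a train leaving after 10:15."
start = utterance.index("10:15")
slots = [
    {
        "slot": "train-leaveat",
        "start": start,
        "exclusive_end": start + len("10:15"),
        "value": "10:15",
    }
]
print(check_spans(utterance, slots))  # an empty list means every span is consistent
```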

There are some non-categorical slots whose values are carried over from another slot in the dialogue state. Their values don't explicitly appear in the utterances.

For example, a user utterance can be "I also need a taxi from the restaurant to the hotel.", in which the state values of "taxi-departure" and "taxi-destination" are carried over from those of "restaurant-name" and "hotel-name", respectively. For these slots, instead of annotating them as spans, we use a "copy from" annotation to identify the slot the value is copied from. This annotation is formatted as follows:

{
  "slots": [
    {
      "slot": Slot name string.
      "copy_from": The slot to copy from.
      "value": A list of slot values being copied. It corresponds to the state values of the "copy_from" slot.
    }
  ]
}
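
A minimal sketch of resolving a "copy from" annotation against the dialogue state is shown below. The slot values are invented for illustration, and `resolve_copy` is a hypothetical helper.

```python
def resolve_copy(annotation, slot_values):
    """Look up the values of the source slot named by "copy_from".

    slot_values is the "slot_values" dict of the dialogue state, mapping
    slot name -> list of values. Returns an empty list if the source slot
    is not present in the state.
    """
    return slot_values.get(annotation["copy_from"], [])


# Illustrative state and annotation for
# "I also need a taxi from the restaurant to the hotel."
state_slot_values = {
    "restaurant-name": ["pizza hut city centre"],
    "hotel-name": ["acorn guest house"],
}
annotation = {
    "slot": "taxi-departure",
    "copy_from": "restaurant-name",
    "value": ["pizza hut city centre"],
}
print(resolve_copy(annotation, state_slot_values) == annotation["value"])
```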

Action annotation

There are 8,333 turns missing dialogue action annotations in MultiWOZ 2.1. We used a fine-tuned T5 model to annotate actions for these turns, then manually verified and corrected the results. Note that 749 turns still lack dialogue action annotations because their semantics cannot be appropriately expressed with the dialogue actions defined by ConvLab, for example "Sure. Just a moment." and "said to skip.".

The action annotations are in "dialog_acts.json". The file is formatted in the same style as MultiWOZ 2.1, except that we use character-level indexing instead of token-level indexing for the action values.

{
  "$dialogue_id": {
    "$turn_id": {
      "dialogue_acts": {
        "$act_name": [
          [
            "$slot_name",
            "$action_value"
          ]
        ]
      },
      "span_info": [
        [
          "$act_name",
          "$slot_name",
          "$action_value",
          "$start_character_index",
          "$exclusive_end_character_index"
        ]
      ]
    }
  }
}
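
As an illustration of the character-level indexing, the sketch below reads the `span_info` entries of one turn and slices the utterance with the given offsets. The turn content is invented, the act name is only an example, and `spans_from_acts` is a hypothetical helper.

```python
def spans_from_acts(utterance, span_info):
    """Return (act, slot, annotated_value, text_at_offsets) for each span entry."""
    rows = []
    for act_name, slot_name, action_value, start, exclusive_end in span_info:
        rows.append((act_name, slot_name, action_value, utterance[start:exclusive_end]))
    return rows


utterance = "There are several hotels in the north."
start = utterance.index("north")
turn = {
    "dialogue_acts": {"Hotel-Inform": [["area", "north"]]},
    "span_info": [["Hotel-Inform", "area", "north", start, start + len("north")]],
}
for row in spans_from_acts(utterance, turn["span_info"]):
    print(row)
```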

Conversion to the data format of MultiWOZ 2.1

To apply the corrections from MultiWOZ 2.2 to MultiWOZ 2.1 while keeping the MultiWOZ 2.1 data format, download the MultiWOZ 2.1 zip file, unzip it, and run:

python convert_to_multiwoz_format.py --multiwoz21_data_dir=<multiwoz21_data_dir> --output_file=<output json file>

Please refer to our paper for more details about the dataset.

Questions

We are continuously working to improve this dataset. If you have any questions, please feel free to contact us at schema-guided-dst@google.com or xiaoxuez@google.com.