QG-Bench consists of question generation datasets in 8 different languages and 11 diverse domains.
The datasets were proposed in "Generative Language Models for Paragraph-Level Question Generation" (EMNLP 2022 main conference),
and all of them are shared on Hugging Face via the links below.
To use a dataset, first install the `datasets` library (`pip install datasets`) and load the dataset:
```python
from datasets import load_dataset

dataset = load_dataset("lmqg/qg_squad")
```
An example instance of the dataset looks as follows:

```json
{
  "question": "What is heresy mainly at odds with?",
  "paragraph": "Heresy is any provocative belief or theory that is strongly at variance with established beliefs or customs. A heretic is a proponent of such claims or beliefs. Heresy is distinct from both apostasy, which is the explicit renunciation of one's religion, principles or cause, and blasphemy, which is an impious utterance or action concerning God or sacred things.",
  "answer": "established beliefs or customs",
  "sentence": "Heresy is any provocative belief or theory that is strongly at variance with established beliefs or customs .",
  "paragraph_sentence": "<hl> Heresy is any provocative belief or theory that is strongly at variance with established beliefs or customs . <hl> A heretic is a proponent of such claims or beliefs. Heresy is distinct from both apostasy, which is the explicit renunciation of one's religion, principles or cause, and blasphemy, which is an impious utterance or action concerning God or sacred things.",
  "paragraph_answer": "Heresy is any provocative belief or theory that is strongly at variance with <hl> established beliefs or customs <hl>. A heretic is a proponent of such claims or beliefs. Heresy is distinct from both apostasy, which is the explicit renunciation of one's religion, principles or cause, and blasphemy, which is an impious utterance or action concerning God or sacred things.",
  "sentence_answer": "Heresy is any provocative belief or theory that is strongly at variance with <hl> established beliefs or customs <hl> ."
}
```
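Assuming the standard `datasets` API, `load_dataset` returns a `DatasetDict` keyed by split, so the splits and an instance like the one above can be inspected as follows:

```python
from datasets import load_dataset

dataset = load_dataset("lmqg/qg_squad")
print(dataset)              # DatasetDict with the available splits and features
print(dataset["train"][0])  # an instance like the one shown above
```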
Each feature contains the following information:

- `question`: a `string` feature.
- `paragraph`: a `string` feature.
- `answer`: a `string` feature.
- `sentence`: a `string` feature.
- `paragraph_answer`: a `string` feature, which is the same as the paragraph but with the answer highlighted by the special token `<hl>`.
- `paragraph_sentence`: a `string` feature, which is the same as the paragraph but with the sentence containing the answer highlighted by the special token `<hl>`.
- `sentence_answer`: a `string` feature, which is the same as the sentence but with the answer highlighted by the special token `<hl>`.
The `paragraph_answer`, `paragraph_sentence`, and `sentence_answer` features are each intended for training a question generation model, but they expose different information:
the `paragraph_answer` and `sentence_answer` features are for answer-aware question generation, while the `paragraph_sentence` feature is for sentence-aware question generation.
See our paper for more details.
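As a concrete illustration, the sketch below (our own, not part of any official training script) turns a dataset instance into an input/target text pair for answer-aware question generation; the `generate question: ` prefix follows the convention of the T5-based checkpoints shown later in this document and may not apply to other models:

```python
from datasets import load_dataset

dataset = load_dataset("lmqg/qg_squad")

def to_training_pair(example):
    # Answer-aware QG: the model sees the paragraph with the answer span
    # marked by <hl> tokens and learns to produce the question.
    source = "generate question: " + example["paragraph_answer"]
    target = example["question"]
    return source, target

source, target = to_training_pair(dataset["train"][0])
print(source)
print(target)
```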
- QG-Bench (multilingual): The multilingual subset of QG-Bench from Wikipedia in each language.
| Dataset | Data size (train/valid/test) | Average character length (paragraph/sentence/question/answer) |
|---|---|---|
| English (`lmqg/qg_squad`) | 75,722/10,570/11,877 | 757/179/59/20 |
| French (`lmqg/qg_frquad`) | 17,543/3,188/3,188 | 797/160/57/23 |
| Japanese (`lmqg/qg_jaquad`) | 27,809/3,939/3,939 | 424/72/32/6 |
| Korean (`lmqg/qg_koquad`) | 54,556/5,766/5,766 | 521/81/34/6 |
| Russian (`lmqg/qg_ruquad`) | 40,291/5,036/5,036 | 754/174/64/26 |
| Italian (`lmqg/qg_itquad`) | 46,550/7,609/7,609 | 807/124/66/16 |
| Spanish (`lmqg/qg_esquad`) | 77,025/10,570/10,570 | 781/122/64/21 |
| German (`lmqg/qg_dequad`) | 9,314/2,204/2,204 | 1,577/165/59/66 |
IMPORTANT: `lmqg/qg_frquad` is private because the original FQuAD requires filling in a form first; please see here.
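Every language subset shares the schema described earlier, so the same loading code works across languages; a small sketch (dataset names are taken from the table above, and `lmqg/qg_frquad` additionally requires access as noted):

```python
from datasets import load_dataset

# Any dataset name from the table above works the same way.
for name in ["lmqg/qg_jaquad", "lmqg/qg_koquad"]:
    dataset = load_dataset(name)
    print(name, dataset["train"].num_rows)
```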
- QG-Bench (multidomain): The multidomain subset of QG-Bench in English.
| Dataset | Data size (train/valid/test) | Average character length (paragraph/sentence/question/answer) |
|---|---|---|
| SubjQA/Book (`lmqg/qg_subjqa`) | 637/92/191 | 1,514/146/28/83 |
| SubjQA/Elec (`lmqg/qg_subjqa`) | 697/99/238 | 1,282/129/26/66 |
| SubjQA/Grocery (`lmqg/qg_subjqa`) | 687/101/379 | 896/107/25/49 |
| SubjQA/Movie (`lmqg/qg_subjqa`) | 724/101/154 | 1,746/146/27/72 |
| SubjQA/Restaurant (`lmqg/qg_subjqa`) | 823/129/136 | 1,006/104/26/51 |
| SubjQA/Trip (`lmqg/qg_subjqa`) | 875/143/397 | 1,002/108/27/51 |
| SQuADShifts/Amazon (`lmqg/qg_squadshifts`) | 3,295/1,648/4,942 | 773/111/43/18 |
| SQuADShifts/Wiki (`lmqg/qg_squadshifts`) | 2,646/1,323/3,969 | 773/184/58/26 |
| SQuADShifts/News (`lmqg/qg_squadshifts`) | 3,355/1,678/5,032 | 781/169/51/20 |
| SQuADShifts/Reddit (`lmqg/qg_squadshifts`) | 3,268/1,634/4,901 | 774/116/45/19 |
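Since several domains share a single dataset repository, each domain is presumably exposed as a dataset configuration; a minimal sketch, where the configuration name `"books"` is an assumption (check the dataset card on the Hub for the exact per-domain config names):

```python
from datasets import load_dataset

# NOTE: "books" is an assumed configuration name for the SubjQA/Book domain;
# the actual config names are listed on the dataset card.
subjqa_books = load_dataset("lmqg/qg_subjqa", "books")
print(subjqa_books)
```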
We release QG models fine-tuned on every dataset in QG-Bench. The following models are available via the Transformers model hub and can be used as shown below.
We recommend using the models via the `lmqg` library, but they are compatible with `transformers` too.
- With the `lmqg` library
```python
from lmqg import TransformersQG

# initialize model
model = TransformersQG(language='en', model='lmqg/t5-large-squad-qg')

# a list of paragraphs
context = [
    "William Turner was an English painter who specialised in watercolour landscapes",
    "William Turner was an English painter who specialised in watercolour landscapes"
]
# a list of answers (same size as the contexts)
answer = [
    "William Turner",
    "English"
]
# model prediction
question = model.generate_q(list_context=context, list_answer=answer)
print(question)
```

Output:

```
[
    "Who was an English painter who specialised in watercolour landscapes?",
    "What nationality was William Turner?"
]
```
- With the `transformers` library
```python
from transformers import pipeline

pipe = pipeline("text2text-generation", 'lmqg/t5-large-squad-qg')

# model prediction
input_text = 'generate question: <hl> Beyonce <hl> further expanded her acting career, starring as blues singer Etta James in the 2008 musical biopic, Cadillac Records.'
pipe(input_text)
```

Output:

```
[{'generated_text': 'Who starred as Etta James in Cadillac Records?'}]
```
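The pipeline input above is exactly the `paragraph_answer` feature format with a `generate question: ` prefix, so dataset instances can be fed in directly; a minimal sketch (the prefix convention applies to the T5-based checkpoints, while other models may expect a different input format):

```python
from datasets import load_dataset
from transformers import pipeline

dataset = load_dataset("lmqg/qg_squad")
pipe = pipeline("text2text-generation", "lmqg/t5-large-squad-qg")

# paragraph_answer already contains the <hl>-highlighted answer span,
# so it only needs the task prefix before being passed to the pipeline.
example = dataset["test"][0]
print(pipe("generate question: " + example["paragraph_answer"]))
```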
English QG models fine-tuned on `lmqg/qg_squad`. The data split follows Du et al., 2017 and Du et al., 2018.
| Model | LM | Training Data | Test Data | BLEU4 | METEOR | ROUGE-L | BERTScore | MoverScore |
|---|---|---|---|---|---|---|---|---|
| UniLM | UniLM (340M parameters) | `lmqg/qg_squad` | `lmqg/qg_squad` | 22.78 | 25.49 | 51.57 | - | - |
| UniLM-v2 | UniLM-v2 (110M parameters) | `lmqg/qg_squad` | `lmqg/qg_squad` | 24.70 | 26.33 | 52.13 | - | - |
| ProphetNet | ProphetNet (340M parameters) | `lmqg/qg_squad` | `lmqg/qg_squad` | 23.91 | 26.60 | 52.26 | - | - |
| ERNIE-GEN | ERNIE-GEN (340M parameters) | `lmqg/qg_squad` | `lmqg/qg_squad` | 25.40 | 26.92 | 52.84 | - | - |
| `lmqg/t5-small-squad-qg` | `t5-small` | `lmqg/qg_squad` | `lmqg/qg_squad` | 24.40 | 25.84 | 51.43 | 90.20 | 63.89 |
| `lmqg/t5-base-squad-qg` | `t5-base` | `lmqg/qg_squad` | `lmqg/qg_squad` | 26.13 | 26.97 | 53.33 | 90.60 | 64.74 |
| `lmqg/t5-large-squad-qg` | `t5-large` | `lmqg/qg_squad` | `lmqg/qg_squad` | 27.21 | 27.70 | 54.13 | 91.00 | 65.29 |
| `lmqg/bart-base-squad-qg` | `facebook/bart-base` | `lmqg/qg_squad` | `lmqg/qg_squad` | 24.68 | 26.05 | 52.66 | 90.87 | 64.47 |
| `lmqg/bart-large-squad-qg` | `facebook/bart-large` | `lmqg/qg_squad` | `lmqg/qg_squad` | 26.17 | 27.07 | 53.85 | 91.00 | 64.99 |
The results of UniLM/UniLM-v2/ProphetNet/ERNIE-GEN are taken from their papers.
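For reference, here is a minimal sketch of how comparable n-gram metrics could be computed with the Hugging Face `evaluate` library; this is not the paper's exact evaluation pipeline, and the toy predictions/references below are hypothetical, so scores will not match the table:

```python
import evaluate

# Hypothetical toy data; in practice these would be model outputs and the
# gold questions from the lmqg/qg_squad test split.
predictions = ["Who starred as Etta James in Cadillac Records?"]
references = [["Who did Beyonce star as in Cadillac Records?"]]

sacrebleu = evaluate.load("sacrebleu")  # corpus-level BLEU (4-gram by default)
rouge = evaluate.load("rouge")

print(sacrebleu.compute(predictions=predictions, references=references)["score"])
print(rouge.compute(predictions=predictions,
                    references=[r[0] for r in references])["rougeL"])
```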
Non-English QG models fine-tuned on QG-Bench (multilingual).
Please cite the following paper if you use any of these resources:
```bibtex
@inproceedings{ushio-etal-2022-generative,
    title = "{G}enerative {L}anguage {M}odels for {P}aragraph-{L}evel {Q}uestion {G}eneration",
    author = "Ushio, Asahi  and
      Alva-Manchego, Fernando  and
      Camacho-Collados, Jose",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, U.A.E.",
    publisher = "Association for Computational Linguistics",
}
```