MuLD: The Multitask Long Document Benchmark [dataset]

MuLD (Multitask Long Document Benchmark) is a set of 6 NLP tasks in which every input consists of at least 10,000 words. The benchmark covers a wide variety of task types, including translation, summarization, question answering, and classification. Output lengths range from a single-word classification label to an output longer than the input text.

Resource type: Collection
Total Items: 14
Size: 12.1 GB
Contributors: Creator: Hudson, G Thomas ¹
Creator: Al Moubayed, Noura ¹

¹ Durham University
Research methods: MuLD consists of 6 tasks chosen to span a variety of dataset sizes, genres, and formulations. The tasks were created by filtering, extending, or modifying existing NLP datasets.
Other description: Code published on GitHub: https://github.com/ghomasHudson/muld
Keyword: NLP
Long documents
Multitask
Subject: Natural language processing (Computer science)
Machine learning
Language: German
English
Cited in: arxiv:2202.07362
Identifier: ark:/32150/r102870v95b
Publisher: Durham University
Date Created: 2022-01-01

Items in this Collection

Title	File Name	File Format	Depositor	Date Uploaded	Visibility
Style Change - test [dataset]	style_change_test.json.bz2	x-bzip2 (bzip2 compressed data, block size = 900k)	G.T. Hudson	26 April 2022	Open Access
OpenSubtitles - test [dataset]	opensubtitles_test.json.bz2	x-bzip2 (bzip2 compressed data, block size = 900k)	G.T. Hudson	26 April 2022	Open Access
Character ID - validation [dataset]	character_id_validation.json.bz2	x-bzip2 (bzip2 compressed data, block size = 900k)	G.T. Hudson	26 April 2022	Open Access
Character ID - train [dataset]	character_id_train.json.bz2	x-bzip2 (bzip2 compressed data, block size = 900k)	G.T. Hudson	26 April 2022	Open Access
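
Each listed file is a bzip2-compressed JSON file, so it can be read without unpacking it to disk first. The following is a minimal sketch using only the Python standard library. The helper name iter_records is hypothetical, and the record layout is not documented in this listing: the sketch assumes the file is either JSON Lines (one JSON value per line) or a single JSON document spread over several lines, and makes no assumption about field names.

```python
import bz2
import json
from pathlib import Path
from typing import Any, Iterator, Union


def iter_records(path: Union[str, Path]) -> Iterator[Any]:
    """Yield records from a bzip2-compressed JSON file.

    Assumption (not confirmed by the listing): the file is either
    JSON Lines or one multi-line JSON document.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        first = handle.readline()
        try:
            record = json.loads(first)
        except json.JSONDecodeError:
            # The first line is not a complete JSON value, so treat the
            # whole file as a single JSON document (loaded into memory).
            document = json.loads(first + handle.read())
            if isinstance(document, list):
                yield from document
            else:
                yield document
            return

        # JSON Lines: stream one record at a time, skipping blank lines.
        yield record
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Example (requires a downloaded file from the collection):
# for record in iter_records("character_id_validation.json.bz2"):
#     print(type(record))
#     break
```

Streaming line by line keeps memory use low in the JSON Lines case, which matters given the 12.1 GB collection size. Inspecting the first record is a reasonable way to discover the actual field names before writing task-specific code.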
