Google Research Datasets

visage Public
Visage is a dataset of images with human annotations on whether certain attributes are present or depicted in each image. An attribute may be either stereotypical or non-stereotypical with respect to the identity group in the image. It also contains a list of attributes in English along with annotations about whether they are visual.

6 Apache-2.0 1 0 0 Updated Jul 16, 2024
dices-dataset Public
This repository contains two datasets with multi-turn adversarial conversations generated by human agents interacting with a dialog model and rated for safety by two corresponding diverse rater pools.

23 2 1 0 Updated Jul 16, 2024
wit Public
WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

975 40 3 1 Updated Jul 12, 2024
cube Public
CUBE is a benchmark for evaluating the cultural competence of text-to-image (T2I) models.

2 CC-BY-4.0 0 0 0 Updated Jul 11, 2024
rico_semantics Public
Consists of ~500k human annotations on the RICO dataset identifying various icons based on their shapes and semantics, and associations between selected general UI elements and their text labels. Annotations also include human-annotated bounding boxes, which are more accurate and cover more UI elements.

19 CC-BY-SA-4.0 1 1 0 Updated Jun 27, 2024
tpu_graphs Public

C++ 120 Apache-2.0 43 2 1 Updated Jun 25, 2024
MISeD Public
MISeD (Meeting Information Seeking Dialogs dataset) is an information-seeking dialog dataset focused on meeting transcripts. It includes 432 dialogs over transcripts from the QMSum dataset. MISeD is described in detail in the paper: Efficient Data Generation for Source-grounded Information-seeking Dialogs: A Use Case for Meeting Transcripts.

7 3 0 0 Updated Jun 25, 2024
richhf-18k Public
The RichHF-18K dataset contains the rich human feedback labels we collected for our CVPR'24 paper (https://arxiv.org/pdf/2312.10240), along with the file names of the associated labeled images (no URLs or images are included in this dataset).

80 2 7 0 Updated Jun 25, 2024
web-images Public
Images gathered from the Internet in 2023 and some metadata

HTML 0 0 0 0 Updated Jun 24, 2024
GeniL Public
The GeniL dataset is an effort to detect various types of generalization in language. This multilingual dataset covers sentences in EN, FR, ES, PT, AR, HI, BN, MS, and ID and is annotated by native speakers of each language. Each sentence is collected from public language corpora and contains at least one identity group name and an attribute.

0 CC-BY-4.0 0 0 0 Updated Jun 18, 2024