Hateful Memes Classification

Introduction

This repo contains the source code of my bachelor thesis project. It is based on and forked from vladsandulescu/hatefulmemes. As the title says, it presents my solution to the Hateful Memes Challenge. To come up with the solution, I read the publications of all winning solutions; the first-place and fifth-place solutions influenced my work the most. Besides those influences, I also proposed a novel mechanism called Multiple Directional Attention (MDA) to help UNITER utilize different data channels at once. MDA is a generalization of the bidirectional cross-attention used in the fifth-place solution. A data channel here is a pair of an image and a text: the text can be the meme text, a caption, a paraphrased meme text, or context, and the image is represented by image features (including detected objects). Unlike the fifth-place solution, which uses only 2 data channels ([[img, meme text], [img, caption]]), my solution uses 3 data channels to improve model performance. As a result, UNITER with an MDA variant achieved 0.8026 AUC ROC and 0.7510 accuracy, which is above the 5th place in the challenge.
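To make the data-channel idea concrete, here is a small, illustrative PyTorch sketch of cross-attention between data channels. It is not the MDA implementation from the thesis; the module names, dimensions, and the fusion step are assumptions for illustration only, and it needs a recent PyTorch (for batch_first support).

```python
# Illustrative sketch only -- NOT the MDA implementation from the thesis.
# Each data channel is an (image features + text) token sequence; every channel
# attends to the other channels before the results are fused and classified.
# All names, dimensions, and the fusion step here are assumptions.
import torch
import torch.nn as nn

class ChannelCrossAttention(nn.Module):
    """Cross-attention of one data channel (query) over another (key/value)."""
    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, query_channel, context_channel):
        out, _ = self.attn(query_channel, context_channel, context_channel)
        return out

if __name__ == '__main__':
    hidden = 768
    # Three toy data channels: [img + meme text], [img + caption], [img + context]
    channels = [torch.randn(2, 50, hidden) for _ in range(3)]
    cross = ChannelCrossAttention(hidden)
    classifier = nn.Linear(hidden, 2)  # binary hateful / not-hateful head
    fused = []
    for i, query in enumerate(channels):
        # Every channel attends to every other channel, then the results are averaged
        attended = [cross(query, ctx) for j, ctx in enumerate(channels) if j != i]
        fused.append(torch.stack(attended).mean(dim=0))
    pooled = torch.cat(fused, dim=1).mean(dim=1)  # naive mean pooling over all tokens
    logits = classifier(pooled)
    print(logits.shape)  # torch.Size([2, 2])
```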

Please read my bachelor thesis and the fifth-place solution's publication to learn more.

If you use my generated data, my model's source code, my model's weights, or my bachelor thesis, please cite it with the following BibLaTeX entry:

@phdthesis{vu_2021, 
    title={Hateful memes classification}, 
    author={VU, Dinh Anh},
    institution={University of Science and Technology of Hanoi},
    url={https://drive.google.com/file/d/1_5aZCVhIbBs5yrkcJQJ3oLz-nFAALkcb/view?usp=sharing},
    year={2021}
}

Environment

One should read the installation scripts and know the hardware information (what kind of GPU, which driver that GPU needs, which package versions are mentioned in the scripts, etc.). Don't worry if it doesn't work the first time.

This project was carried out on a machine with:

  • a Tesla K80 GPU with 12 GiB of memory
  • an Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz

On Debian 10.9. (This is just another way of saying I use Linux.)

And with the following tools:

  • conda
  • Python 3.6.13 (installed from conda)

This project is divided into subprojects with different conda environments. Therefore, one should not put everything in a single conda environment.

Project workflow and structure

The project is generally structured as follows:

root/
├─ data/
├─ model_asset/
├─ notebooks/
│  ├─ graph.ipynb
├─ py-scripts/
│  ├─ README.md
├─ py-bottom-up-attention/
│  ├─ conda/
│  │  ├─ setup_bua.sh
├─ UNITER/
│  ├─ conda/
│  │  ├─ setup_uniter.sh
│  ├─ storage/
│  │  ├─ pretrained/

root/ is the folder YOU clone this repo into, so you may rename root/ to whatever you want; I use root/ here for convenience. There are 3 folders ignored by git because they contain large files:

  • data/ - original dataset and generated dataset
  • model_asset/ - models checkpoints and logs
  • UNITER/storage/pretrained/ - pretrained UNITER core models

Therefore, you should create these directories first for later convenience.
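If it helps, here is a small optional sketch (assuming it is run from the root/ folder) that creates those git-ignored directories:

```python
# Optional helper (a sketch; assumes it is run from the root/ folder) to create
# the three git-ignored directories mentioned above.
from pathlib import Path

for d in ('data', 'model_asset', 'UNITER/storage/pretrained'):
    Path(d).mkdir(parents=True, exist_ok=True)
```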

And there are 3 foreign repos used as subprojects:

Note: You must read the files shown in the general structure above and the README files of the foreign repos.

Dataset preparation

Shortcut: go to the release section of this repo, download data.zip, and extract the files to the data/ folder. There is a README.md; please read it carefully.

Hateful memes challenge dataset

Download the original dataset from this, then extract the files to the data/ folder.

Get the dev_seen_unseen.jsonl file and place it in the same folder as the other jsonl files.
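If you prefer to rebuild dev_seen_unseen.jsonl locally, here is a minimal sketch assuming it is simply the union of dev_seen.jsonl and dev_unseen.jsonl (the "combined development set" listed in the Double check section), deduplicated on the id field. Run it from the data/ folder.

```python
# Sketch only: rebuild dev_seen_unseen.jsonl as the union of the Phase 1 and
# Phase 2 development sets, deduplicated on the id field (an assumption about
# how the combined file is constructed).
import json

seen_ids = set()
with open('dev_seen_unseen.jsonl', 'w', encoding='utf8') as out:
    for name in ('dev_seen.jsonl', 'dev_unseen.jsonl'):
        with open(name, 'r', encoding='utf8') as f:
            for line in f:
                d = json.loads(line)
                if d['id'] not in seen_ids:
                    seen_ids.add(d['id'])
                    out.write(json.dumps(d) + '\n')
```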

Image feature extraction

Setup

# start at root folder
cd py-bottom-up-attention/conda
bash setup_bua.sh

Download

Before running the generation commands below, download the pretrained model for py-bottom-up-attention:

wget --no-check-certificate http://nlp.cs.unc.edu/models/faster_rcnn_from_caffe_attr_original.pkl -P ~/.torch/fvcore_cache/models/

Generate

# In conda environment: bua
# In folder: py-bottom-up-attention

# This extracts the image features; it takes about 3 hours
python demo/detectron2_mscoco_proposal_maxnms_hm.py --split img --data_path ../data/ --output_path ../data/imgfeat/ --output_type tsv --min_boxes 10 --max_boxes 100

# This splits the extracted features per jsonl file; it takes under 30 minutes
python demo/hm.py --split img --split_json_file train.jsonl --d2_file_suffix d2_10-100_vg --data_path ../data/ --output_path ../data/imgfeat/
python demo/hm.py --split img --split_json_file dev_seen_unseen.jsonl --d2_file_suffix d2_10-100_vg --data_path ../data/ --output_path ../data/imgfeat/
python demo/hm.py --split img --split_json_file test_seen.jsonl --d2_file_suffix d2_10-100_vg --data_path ../data/ --output_path ../data/imgfeat/ 
python demo/hm.py --split img --split_json_file test_unseen.jsonl --d2_file_suffix d2_10-100_vg --data_path ../data/ --output_path ../data/imgfeat/ 

Split more

# In a conda environment that has: python 3.6, pandas, tqdm
# In folder: data

# This simply splits the big tsv files into smaller npy files
# It takes under 30 minutes
python spliter.py
Click to see spliter.py
```python
import csv
import numpy as np
import base64
import sys
import os

from tqdm import tqdm
from pathlib import Path
csv.field_size_limit(sys.maxsize)

def read_ff(feature_file, test_mode=False):
    TRAIN_VAL_FIELDNAMES = ["id", "img", "label", "text", "img_id", "img_h", "img_w", "objects_id", "objects_conf",
                        "attrs_id", "attrs_conf", "num_boxes", "boxes", "features"]
    TEST_FIELDNAMES = ["id", "img", "text", "img_id", "img_h", "img_w", "objects_id", "objects_conf",
                "attrs_id", "attrs_conf", "num_boxes", "boxes", "features"]

    with open(feature_file, mode='r', encoding='utf8') as f:
        tsv_reader = csv.DictReader(f, delimiter='\t',
                                    fieldnames=TRAIN_VAL_FIELDNAMES if not test_mode else TEST_FIELDNAMES)
        data = []
        for item in tsv_reader:
            try:
                idb = {'img_id': str(item['img_id']),
                        'img': str(item['img']),
                        'text': str(item['text']),
                        'label': int(item['label']) if not test_mode else None,
                        'img_h': int(item['img_h']),
                        'img_w': int(item['img_w']),
                        'num_boxes': int(item['num_boxes']),
                        'boxes': np.frombuffer(base64.decodebytes(item['boxes'].encode()),
                                                dtype=np.float32).reshape((int(item['num_boxes']), -1)),
                        
                        'features': np.frombuffer(base64.decodebytes(item['features'].encode()),
                                                    dtype=np.float32).reshape((int(item['num_boxes']), -1))}
                data.append(idb)
            except Exception as e:
                print(f"Some error occurred reading img id {item['img_id']}: {e}")

        return data

def split(data, folder_name, test_mode=False):
    # Make sure the output folder exists before saving npy files into it
    os.makedirs(folder_name, exist_ok=True)
    with open(f"map_{folder_name}.tsv", mode='w', encoding='utf8') as f:
        TRAIN_VAL_FIELDNAMES = ["img", "label", "text", "img_id", "img_h", "img_w", "num_boxes", "npy"]
        TEST_FIELDNAMES =      ["img",          "text", "img_id", "img_h", "img_w", "num_boxes", "npy"]

        tsv_writer = csv.DictWriter(f, fieldnames=TRAIN_VAL_FIELDNAMES if not test_mode else TEST_FIELDNAMES, delimiter='\t')

        tsv_writer.writeheader()
        for d in tqdm(data):
            if test_mode:
                tsv_writer.writerow({'img_id': d['img_id'], 
                                        'img': d['img'],
                                        'text': d['text'],
                                        'img_h': d['img_h'],
                                        'img_w': d['img_w'],
                                        'num_boxes': d['num_boxes'], 
                                        'npy': f"{folder_name}/{d['img_id']}.npy"})
            else:
                tsv_writer.writerow({'img_id': d['img_id'], 
                                        'img': d['img'],
                                        'text': d['text'],
                                        'label': d['label'],
                                        'img_h': d['img_h'],
                                        'img_w': d['img_w'],
                                        'num_boxes': d['num_boxes'], 
                                        'npy': f"{folder_name}/{d['img_id']}.npy"})

            np.save(f"{folder_name}/{d['img_id']}.npy", d)

    return len(os.listdir(folder_name))

if '__main__' == __name__:
    feature_files = ['data_train_d2_10-100_vg.tsv', 
                        'data_dev_seen_unseen_d2_10-100_vg.tsv', 
                        'data_test_seen_d2_10-100_vg.tsv',
                        'data_test_unseen_d2_10-100_vg.tsv']
    for ff in feature_files:
        print(Path(ff).exists())

    data_ff = []
    for ff in feature_files:
        if ff == 'data_test_unseen_d2_10-100_vg.tsv':
            data_ff.append(read_ff(ff, True))
        else:
            data_ff.append(read_ff(ff))

    folders = ['data_train_d2_10-100_vg', 
                'data_dev_seen_unseen_d2_10-100_vg', 
                'data_test_seen_d2_10-100_vg',
                'data_test_unseen_d2_10-100_vg']

    for data, folder in zip(data_ff, folders):
        if 'data_test_unseen_d2_10-100_vg' == folder or 'data_test_seen_d2_10-100_vg' == folder:
            print(split(data, folder, True))
        else: 
            print(split(data, folder, False))
```

Image captioning

Download all 3 csv files from dinhanhx/imgcap/hm_inf_out. If you want to reproduce these files, please read this and that.

Then place those files into the root/data/imgcap/ folder.
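A quick, optional sanity check (a sketch run from the root/ folder; it does not assume any particular filenames or columns) that the caption csv files are in place and readable:

```python
# Optional sanity check: confirm the caption csv files are present and load with pandas.
# Run from the root/ folder; no specific filenames or columns are assumed.
from pathlib import Path
import pandas as pd

for csv_path in sorted(Path('data/imgcap').glob('*.csv')):
    df = pd.read_csv(csv_path)
    print(csv_path.name, df.shape)
```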

Text paraphrasing

Download all jsonl files starting with data_test_paraphrased_nlpaug_ from dinhanhx/paraphrased_text_for_hm.

Then place those files into the root/data/textaug/ folder.

Context addition

Download the annotation files from the first-place solution.

Then place those jsonl files into the root/data/HimariO_annotations/ folder.

Then place preprocess.py into the same folder, read it, and run it (in a conda environment that has python 3.6 and pandas).

Click to see preprocess.py
```python
import pandas as pd
from pandas.core.common import flatten
import json

def load_jsonl(filename):
    data = []
    with open(filename, 'r') as fobj:
        for line in fobj:
            d = json.loads(line)
            data.append({'id': d['id'],
                            'img': d['img'],
                            'partition_description': ' '.join(list(flatten(d['partition_description'])))
                        })
        return pd.DataFrame.from_records(data)


if '__main__' == __name__:
    train_dev_all_df = load_jsonl('train_dev_all.entity.jsonl')
    test_seen_df = load_jsonl('test_seen.entity.jsonl')
    test_unseen_df = load_jsonl('test_unseen.entity.jsonl')

    data_test_df = train_dev_all_df.merge(test_seen_df, how='outer').merge(test_unseen_df, how='outer')
    data_test_df['id'] = data_test_df['id'].apply(lambda x: str(x).zfill(5))
    data_test_df.to_json('data_test.jsonl', orient='records', lines=True)
```

Double check

The data/ folder should look like this:

imgfeat/
data_dev_seen_unseen_d2_10-100_vg.tsv
data_test_seen_d2_10-100_vg.tsv
data_test_unseen_d2_10-100_vg.tsv
data_train_d2_10-100_vg.tsv
tiny_data_dev_seen_unseen_d2_10-100_vg.tsv
tiny_data_test_seen_d2_10-100_vg.tsv
tiny_data_test_unseen_d2_10-100_vg.tsv
tiny_data_train_d2_10-100_vg.tsv

spliter.py
map_data_dev_seen_unseen_d2_10-100_vg.tsv
map_data_test_seen_d2_10-100_vg.tsv
map_data_test_unseen_d2_10-100_vg.tsv
map_data_train_d2_10-100_vg.tsv
data_dev_seen_unseen_d2_10-100_vg/
data_test_seen_d2_10-100_vg/
data_test_unseen_d2_10-100_vg/
data_train_d2_10-100_vg/

HimariO_annotations/

imgcap/

textaug/

img/                  - the PNG images
train.jsonl           - the training set
dev_seen.jsonl        - the development set for Phase 1
dev_unseen.jsonl      - the development set for Phase 2
dev_seen_unseen.jsonl - the combined development set
test_seen.jsonl       - the test set for Phase 1
test_unseen.jsonl     - the test set for Phase 2
README.md
LICENSE.txt
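To verify this quickly, here is a small optional sketch (run from the data/ folder) that checks the main entries of the listing above:

```python
# Optional sanity check (a sketch; run from the data/ folder) that the main
# files and folders from the listing above exist.
from pathlib import Path

expected = [
    'imgfeat', 'spliter.py', 'HimariO_annotations', 'imgcap', 'textaug', 'img',
    'train.jsonl', 'dev_seen.jsonl', 'dev_unseen.jsonl', 'dev_seen_unseen.jsonl',
    'test_seen.jsonl', 'test_unseen.jsonl', 'README.md', 'LICENSE.txt',
]
for name in expected:
    status = 'OK     ' if Path(name).exists() else 'MISSING'
    print(status, name)
```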

Result reproduction

Setup

# start at root folder
cd UNITER/conda
bash setup_uniter.sh

Config files for all models in all experiments

Model names are the same as the model names in my bachelor thesis.

Phase 1

Phase 2

Phase 3

Phase 4

Reproduce the best model

Shortcut: go to the release section of this repo, download output_quadruple_dev_seen_unseen_0_imgcap_HimariO.zip, extract it to output_quadruple_dev_seen_unseen_0_imgcap_HimariO, and place that folder in model_asset/.

# In conda environment: uniter
# In folder: UNITER
python train_hm.py --config config/dax/quadruple_attn/train-hm-base-quadruple-hpc_0_imgcap_HimariO.json

Test the best model

# In a conda environment that has python 3.6, pandas, click, pretty_errors, sklearn
# In folder: py-scripts
python calc_test.py --test_jsonl test_seen.jsonl --result_csv ../model_asset/output_quadruple_dev_seen_unseen_0_imgcap_HimariO/hm/base/results/test_results_3420_rank0_final.csv

Note: this will test against test_seen.jsonl.
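For reference, here is a minimal sketch of the kind of metrics calc_test.py reports, computed with sklearn. The result csv column names ('id', 'proba', 'label') are assumptions for illustration, not taken from the repo.

```python
# Sketch of the kind of evaluation calc_test.py performs (AUC ROC and accuracy
# against test_seen.jsonl). The result csv column names 'id', 'proba', and
# 'label' are assumptions here; run from the py-scripts folder.
import pandas as pd
from sklearn.metrics import roc_auc_score, accuracy_score

truth = pd.read_json('../data/test_seen.jsonl', lines=True)
pred = pd.read_csv('../model_asset/output_quadruple_dev_seen_unseen_0_imgcap_HimariO'
                   '/hm/base/results/test_results_3420_rank0_final.csv')
merged = truth.merge(pred, on='id', suffixes=('_true', '_pred'))
print('AUC ROC :', roc_auc_score(merged['label_true'], merged['proba']))
print('Accuracy:', accuracy_score(merged['label_true'], merged['label_pred']))
```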

Run inference with the best model

To produce result.csv for test_unseen.jsonl, do the following:

# In conda environment: uniter
# In folder: UNITER
python inf_hm.py --root_path ./ --dataset_path ../data --test_image_set test --train_dir ../model_asset/output_quadruple_dev_seen_unseen_0_imgcap_HimariO --ckpt 3420 --output_dir path/to/folder/to/store/ --fp16

Note: remember to change path/to/folder/to/store/ to your desired output folder.