Hateful Memes Classification

Introduction

This repo contains the source code of my bachelor thesis project. It is based on and forked from vladsandulescu/hatefulmemes. As the title says, it presents my solution to the Hateful Memes Challenge. To come up with the solution, I read the publications of all winning solutions; the first-place and fifth-place solutions influenced my work the most. Besides those influences, I also proposed a novel mechanism called Multiple Directional Attention (MDA) to help UNITER utilize different data channels at once. MDA is a generalization of the bidirectional cross-attention used in the fifth-place solution. A data channel here is a pair of an image and a text: the text can be the meme text, a caption, a paraphrased meme text, or context, and the image is represented by image features (including detected objects). Unlike the fifth-place solution, which uses only 2 data channels ([[img, meme text], [img, caption]]), my solution uses 3 data channels to improve model performance. As a result, UNITER with an MDA variant achieved 0.8026 AUC ROC and 0.7510 accuracy, which is above the 5th place in the challenge.
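To make the data-channel idea concrete, here is a small, illustrative PyTorch sketch of cross-attention between data channels. It is not the MDA implementation from the thesis; the module names, dimensions, and the fusion step are assumptions for illustration only, and it needs a recent PyTorch (for batch_first support).

```python
# Illustrative sketch only -- NOT the MDA implementation from the thesis.
# Each data channel is an (image features + text) token sequence; every channel
# attends to the other channels before the results are fused and classified.
# All names, dimensions, and the fusion step here are assumptions.
import torch
import torch.nn as nn

class ChannelCrossAttention(nn.Module):
    """Cross-attention of one data channel (query) over another (key/value)."""
    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, query_channel, context_channel):
        out, _ = self.attn(query_channel, context_channel, context_channel)
        return out

if __name__ == '__main__':
    hidden = 768
    # Three toy data channels: [img + meme text], [img + caption], [img + context]
    channels = [torch.randn(2, 50, hidden) for _ in range(3)]
    cross = ChannelCrossAttention(hidden)
    classifier = nn.Linear(hidden, 2)  # binary hateful / not-hateful head
    fused = []
    for i, query in enumerate(channels):
        # Every channel attends to every other channel, then the results are averaged
        attended = [cross(query, ctx) for j, ctx in enumerate(channels) if j != i]
        fused.append(torch.stack(attended).mean(dim=0))
    pooled = torch.cat(fused, dim=1).mean(dim=1)  # naive mean pooling over all tokens
    logits = classifier(pooled)
    print(logits.shape)  # torch.Size([2, 2])
```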

Please read my bachelor thesis and the fifth-place solution's publication to learn more.

If you use my generated data, my model's source code, my model's weights, or my bachelor thesis, please cite it with the following BibLaTeX entry:

@phdthesis{vu_2021, 
    title={Hateful memes classification}, 
    author={VU, Dinh Anh},
    institution={University of Science and Technology of Hanoi},
    url={https://drive.google.com/file/d/1_5aZCVhIbBs5yrkcJQJ3oLz-nFAALkcb/view?usp=sharing},
    year={2021}
}

Environment

One should read the installation scripts and know the hardware information (what kind of GPU, which driver that GPU needs, which package versions are mentioned in the scripts, etc.). Don't worry if it doesn't work the first time.

This project was carried out on a machine with:

  • a Tesla K80 GPU with 12 GiB of memory
  • an Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz

On Debian 10.9. (This is just another way of saying I use Linux.)

And with the following tools:

  • conda
  • Python 3.6.13 (installed from conda)

This project is divided into subprojects with different conda environments. Therefore, one should not put everything in a single conda environment.

Project workflow and structure

The project is generally structured as follows:

root/
├─ data/
├─ model_asset/
├─ notebooks/
│  ├─ graph.ipynb
├─ py-scripts/
│  ├─ README.md
├─ py-bottom-up-attention/
│  ├─ conda/
│  │  ├─ setup_bua.sh
├─ UNITER/
│  ├─ conda/
│  │  ├─ setup_uniter.sh
│  ├─ storage/
│  │  ├─ pretrained/

root/ is the folder YOU clone this repo into, so you may rename root/ to whatever you want; I use root/ here for convenience. There are 3 folders ignored by git because they contain large files:

  • data/ - original dataset and generated dataset
  • model_asset/ - models checkpoints and logs
  • UNITER/storage/pretrained/ - pretrained UNITER core models

Therefore, you should create these directories first for later convenience.
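If it helps, here is a small optional sketch (assuming it is run from the root/ folder) that creates those git-ignored directories:

```python
# Optional helper (a sketch; assumes it is run from the root/ folder) to create
# the three git-ignored directories mentioned above.
from pathlib import Path

for d in ('data', 'model_asset', 'UNITER/storage/pretrained'):
    Path(d).mkdir(parents=True, exist_ok=True)
```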

And there are 3 foreign repos used as subprojects:

Note: You must read the files shown in the general structure above and the README files of the foreign repos.

Dataset preparation

Shortcut: go to the release section of this repo, download data.zip, and extract the files to the data/ folder. There is a README.md; please read it carefully.

Hateful memes challenge dataset

Download the original dataset from this, then extract the files to the data/ folder.

Get the dev_seen_unseen.jsonl file and place it in the same folder as the other jsonl files.
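If you prefer to rebuild dev_seen_unseen.jsonl locally, here is a minimal sketch assuming it is simply the union of dev_seen.jsonl and dev_unseen.jsonl (the "combined development set" listed in the Double check section), deduplicated on the id field. Run it from the data/ folder.

```python
# Sketch only: rebuild dev_seen_unseen.jsonl as the union of the Phase 1 and
# Phase 2 development sets, deduplicated on the id field (an assumption about
# how the combined file is constructed).
import json

seen_ids = set()
with open('dev_seen_unseen.jsonl', 'w', encoding='utf8') as out:
    for name in ('dev_seen.jsonl', 'dev_unseen.jsonl'):
        with open(name, 'r', encoding='utf8') as f:
            for line in f:
                d = json.loads(line)
                if d['id'] not in seen_ids:
                    seen_ids.add(d['id'])
                    out.write(json.dumps(d) + '\n')
```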

Image feature extraction

Setup

# start at root folder
cd py-bottom-up-attention/conda
bash setup_bua.sh

Download

Before running the generation commands below, download the pretrained model for py-bottom-up-attention:

wget --no-check-certificate http://nlp.cs.unc.edu/models/faster_rcnn_from_caffe_attr_original.pkl -P ~/.torch/fvcore_cache/models/

Generate

# In conda environment: bua
# In folder: py-bottom-up-attention

# This extracts the image features; it takes about 3 hours
python demo/detectron2_mscoco_proposal_maxnms_hm.py --split img --data_path ../data/ --output_path ../data/imgfeat/ --output_type tsv --min_boxes 10 --max_boxes 100

# This splits the extracted features per jsonl file; it takes under 30 minutes
python demo/hm.py --split img --split_json_file train.jsonl --d2_file_suffix d2_10-100_vg --data_path ../data/ --output_path ../data/imgfeat/
python demo/hm.py --split img --split_json_file dev_seen_unseen.jsonl --d2_file_suffix d2_10-100_vg --data_path ../data/ --output_path ../data/imgfeat/
python demo/hm.py --split img --split_json_file test_seen.jsonl --d2_file_suffix d2_10-100_vg --data_path ../data/ --output_path ../data/imgfeat/ 
python demo/hm.py --split img --split_json_file test_unseen.jsonl --d2_file_suffix d2_10-100_vg --data_path ../data/ --output_path ../data/imgfeat/ 

Split more

# In a conda environment that has: python 3.6, pandas, tqdm
# In folder: data

# This simply splits the big tsv files into smaller npy files
# It takes under 30 minutes
python spliter.py
Click to see spliter.py
```python
import csv
import numpy as np
import base64
import sys
import os

from tqdm import tqdm
from pathlib import Path
csv.field_size_limit(sys.maxsize)

def read_ff(feature_file, test_mode=False):
    TRAIN_VAL_FIELDNAMES = ["id", "img", "label", "text", "img_id", "img_h", "img_w", "objects_id", "objects_conf",
                        "attrs_id", "attrs_conf", "num_boxes", "boxes", "features"]
    TEST_FIELDNAMES = ["id", "img", "text", "img_id", "img_h", "img_w", "objects_id", "objects_conf",
                "attrs_id", "attrs_conf", "num_boxes", "boxes", "features"]

    with open(feature_file, mode='r', encoding='utf8') as f:
        tsv_reader = csv.DictReader(f, delimiter='\t',
                                    fieldnames=TRAIN_VAL_FIELDNAMES if not test_mode else TEST_FIELDNAMES)
        data = []
        for item in tsv_reader:
            try:
                idb = {'img_id': str(item['img_id']),
                        'img': str(item['img']),
                        'text': str(item['text']),
                        'label': int(item['label']) if not test_mode else None,
                        'img_h': int(item['img_h']),
                        'img_w': int(item['img_w']),
                        'num_boxes': int(item['num_boxes']),
                        'boxes': np.frombuffer(base64.decodebytes(item['boxes'].encode()),
                                                dtype=np.float32).reshape((int(item['num_boxes']), -1)),
                        
                        'features': np.frombuffer(base64.decodebytes(item['features'].encode()),
                                                    dtype=np.float32).reshape((int(item['num_boxes']), -1))}
                data.append(idb)
            except Exception as e:
                print(f"Some error occurred reading img id {item['img_id']}: {e}")

        return data

def split(data, folder_name, test_mode=False):
    # Make sure the output folder exists before saving npy files into it
    os.makedirs(folder_name, exist_ok=True)
    with open(f"map_{folder_name}.tsv", mode='w', encoding='utf8') as f:
        TRAIN_VAL_FIELDNAMES = ["img", "label", "text", "img_id", "img_h", "img_w", "num_boxes", "npy"]
        TEST_FIELDNAMES =      ["img",          "text", "img_id", "img_h", "img_w", "num_boxes", "npy"]

        tsv_writer = csv.DictWriter(f, fieldnames=TRAIN_VAL_FIELDNAMES if not test_mode else TEST_FIELDNAMES, delimiter='\t')

        tsv_writer.writeheader()
        for d in tqdm(data):
            if test_mode:
                tsv_writer.writerow({'img_id': d['img_id'], 
                                        'img': d['img'],
                                        'text': d['text'],
                                        'img_h': d['img_h'],
                                        'img_w': d['img_w'],
                                        'num_boxes': d['num_boxes'], 
                                        'npy': f"{folder_name}/{d['img_id']}.npy"})
            else:
                tsv_writer.writerow({'img_id': d['img_id'], 
                                        'img': d['img'],
                                        'text': d['text'],
                                        'label': d['label'],
                                        'img_h': d['img_h'],
                                        'img_w': d['img_w'],
                                        'num_boxes': d['num_boxes'], 
                                        'npy': f"{folder_name}/{d['img_id']}.npy"})

            np.save(f"{folder_name}/{d['img_id']}.npy", d)

    return len(os.listdir(folder_name))

if '__main__' == __name__:
    feature_files = ['data_train_d2_10-100_vg.tsv', 
                        'data_dev_seen_unseen_d2_10-100_vg.tsv', 
                        'data_test_seen_d2_10-100_vg.tsv',
                        'data_test_unseen_d2_10-100_vg.tsv']
    for ff in feature_files:
        print(Path(ff).exists())

    data_ff = []
    for ff in feature_files:
        if ff == 'data_test_unseen_d2_10-100_vg.tsv':
            data_ff.append(read_ff(ff, True))
        else:
            data_ff.append(read_ff(ff))

    folders = ['data_train_d2_10-100_vg', 
                'data_dev_seen_unseen_d2_10-100_vg', 
                'data_test_seen_d2_10-100_vg',
                'data_test_unseen_d2_10-100_vg']

    for data, folder in zip(data_ff, folders):
        if 'data_test_unseen_d2_10-100_vg' == folder or 'data_test_seen_d2_10-100_vg' == folder:
            print(split(data, folder, True))
        else: 
            print(split(data, folder, False))
```

Image captioning

Download all 3 csv files from dinhanhx/imgcap/hm_inf_out. If you want to reproduce these files, please read this and that.

Then place those files into the root/data/imgcap/ folder.
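A quick, optional sanity check (a sketch run from the root/ folder; it does not assume any particular filenames or columns) that the caption csv files are in place and readable:

```python
# Optional sanity check: confirm the caption csv files are present and load with pandas.
# Run from the root/ folder; no specific filenames or columns are assumed.
from pathlib import Path
import pandas as pd

for csv_path in sorted(Path('data/imgcap').glob('*.csv')):
    df = pd.read_csv(csv_path)
    print(csv_path.name, df.shape)
```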

Text paraphrasing

Download all jsonl files starting with data_test_paraphrased_nlpaug_ from dinhanhx/paraphrased_text_for_hm.

Then place those files into the root/data/textaug/ folder.

Context addition

Download the annotation files from the first-place solution.

Then place those jsonl files into the root/data/HimariO_annotations/ folder.

Then place preprocess.py into the same folder, read it, and run it (in a conda environment that has python 3.6 and pandas).

Click to see preprocess.py
```python
import pandas as pd
from pandas.core.common import flatten
import json

def load_jsonl(filename):
    data = []
    with open(filename, 'r') as fobj:
        for line in fobj:
            d = json.loads(line)
            data.append({'id': d['id'],
                            'img': d['img'],
                            'partition_description': ' '.join(list(flatten(d['partition_description'])))
                        })
        return pd.DataFrame.from_records(data)


if '__main__' == __name__:
    train_dev_all_df = load_jsonl('train_dev_all.entity.jsonl')
    test_seen_df = load_jsonl('test_seen.entity.jsonl')
    test_unseen_df = load_jsonl('test_unseen.entity.jsonl')

    data_test_df = train_dev_all_df.merge(test_seen_df, how='outer').merge(test_unseen_df, how='outer')
    data_test_df['id'] = data_test_df['id'].apply(lambda x: str(x).zfill(5))
    data_test_df.to_json('data_test.jsonl', orient='records', lines=True)
```

Double check

The data/ folder should look like this:

imgfeat/
data_dev_seen_unseen_d2_10-100_vg.tsv
data_test_seen_d2_10-100_vg.tsv
data_test_unseen_d2_10-100_vg.tsv
data_train_d2_10-100_vg.tsv
tiny_data_dev_seen_unseen_d2_10-100_vg.tsv
tiny_data_test_seen_d2_10-100_vg.tsv
tiny_data_test_unseen_d2_10-100_vg.tsv
tiny_data_train_d2_10-100_vg.tsv

spliter.py
map_data_dev_seen_unseen_d2_10-100_vg.tsv
map_data_test_seen_d2_10-100_vg.tsv
map_data_test_unseen_d2_10-100_vg.tsv
map_data_train_d2_10-100_vg.tsv
data_dev_seen_unseen_d2_10-100_vg/
data_test_seen_d2_10-100_vg/
data_test_unseen_d2_10-100_vg/
data_train_d2_10-100_vg/

HimariO_annotations/

imgcap/

textaug/

img/                  - the PNG images
train.jsonl           - the training set
dev_seen.jsonl        - the development set for Phase 1
dev_unseen.jsonl      - the development set for Phase 2
dev_seen_unseen.jsonl - the combined development set
test_seen.jsonl       - the test set for Phase 1
test_unseen.jsonl     - the test set for Phase 2
README.md
LICENSE.txt
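To verify this quickly, here is a small optional sketch (run from the data/ folder) that checks the main entries of the listing above:

```python
# Optional sanity check (a sketch; run from the data/ folder) that the main
# files and folders from the listing above exist.
from pathlib import Path

expected = [
    'imgfeat', 'spliter.py', 'HimariO_annotations', 'imgcap', 'textaug', 'img',
    'train.jsonl', 'dev_seen.jsonl', 'dev_unseen.jsonl', 'dev_seen_unseen.jsonl',
    'test_seen.jsonl', 'test_unseen.jsonl', 'README.md', 'LICENSE.txt',
]
for name in expected:
    status = 'OK     ' if Path(name).exists() else 'MISSING'
    print(status, name)
```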

Result reproduction

Setup

# start at root folder
cd UNITER/conda
bash setup_uniter.sh

Config files for all models in all experiments

Model names are the same as the model names in my bachelor thesis.

Phase 1

Phase 2

Phase 3

Phase 4

Reproduce the best model

Shortcut: go to the release section of this repo, download output_quadruple_dev_seen_unseen_0_imgcap_HimariO.zip, extract it to output_quadruple_dev_seen_unseen_0_imgcap_HimariO, and place that folder in model_asset/.

# In conda environment: uniter
# In folder: UNITER
python train_hm.py --config config/dax/quadruple_attn/train-hm-base-quadruple-hpc_0_imgcap_HimariO.json

Test the best model

# In a conda environment that has python 3.6, pandas, click, pretty_errors, sklearn
# In folder: py-scripts
python calc_test.py --test_jsonl test_seen.jsonl --result_csv ../model_asset/output_quadruple_dev_seen_unseen_0_imgcap_HimariO/hm/base/results/test_results_3420_rank0_final.csv

Note: this will test against test_seen.jsonl.
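For reference, here is a minimal sketch of the kind of metrics calc_test.py reports, computed with sklearn. The result csv column names ('id', 'proba', 'label') are assumptions for illustration, not taken from the repo.

```python
# Sketch of the kind of evaluation calc_test.py performs (AUC ROC and accuracy
# against test_seen.jsonl). The result csv column names 'id', 'proba', and
# 'label' are assumptions here; run from the py-scripts folder.
import pandas as pd
from sklearn.metrics import roc_auc_score, accuracy_score

truth = pd.read_json('../data/test_seen.jsonl', lines=True)
pred = pd.read_csv('../model_asset/output_quadruple_dev_seen_unseen_0_imgcap_HimariO'
                   '/hm/base/results/test_results_3420_rank0_final.csv')
merged = truth.merge(pred, on='id', suffixes=('_true', '_pred'))
print('AUC ROC :', roc_auc_score(merged['label_true'], merged['proba']))
print('Accuracy:', accuracy_score(merged['label_true'], merged['label_pred']))
```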

Run inference with the best model

To produce result.csv for test_unseen.jsonl, do the following:

# In conda environment: uniter
# In folder: UNITER
python inf_hm.py --root_path ./ --dataset_path ../data --test_image_set test --train_dir ../model_asset/output_quadruple_dev_seen_unseen_0_imgcap_HimariO --ckpt 3420 --output_dir path/to/folder/to/store/ --fp16

Note: remember to change path/to/folder/to/store/ to your desired output folder.