The Multimodal Model for Vietnamese Visual Question Answering (ViVQA)

If you find our project useful, we would be grateful if you gave it a star ⭐ on GitHub.

Getting Started

This repository currently includes our code and models.

Installation

git clone https://github.com/ngocson1042002/ViVQA.git
cd ViVQA/beit3/HCMUS
pip install salesforce-lavis
pip install torchscale timm underthesea efficientnet_pytorch
pip install --upgrade transformers
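
As an optional sanity check after installation, the short snippet below simply imports the main dependencies (note that the pip package salesforce-lavis is imported as lavis). Treat it as an assumption-based convenience check, not part of the official setup.

import lavis                  # installed via the salesforce-lavis package
import torchscale
import timm
import underthesea
import efficientnet_pytorch
import transformers

print("transformers version:", transformers.__version__)
print("timm version:", timm.__version__)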

Run model

Our model is available on the Hugging Face Hub as ngocson2002/vivqa-model.

from transformers import AutoModel
from processor import Processor
from PIL import Image
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model (including its custom remote code) and the image/question processor.
model = AutoModel.from_pretrained("ngocson2002/vivqa-model", trust_remote_code=True).to(device)
processor = Processor()

image = Image.open('./ViVQA/demo/1.jpg').convert('RGB')
question = "màu áo của con chó là gì?"  # "What color is the dog's shirt?"

# Preprocess the image-question pair and add a batch dimension to the image tensor.
inputs = processor(image, question, return_tensors='pt')
inputs["image"] = inputs["image"].unsqueeze(0)

# Move the inputs to the same device as the model.
inputs = {k: v.to(device) if torch.is_tensor(v) else v for k, v in inputs.items()}

model.eval()
with torch.no_grad():
    output = model(**inputs)
    logits = output.logits
    idx = logits.argmax(-1).item()

print("Predicted answer:", model.config.id2label[idx])  # prints: màu đỏ ("red")

Authors

Affiliations

  • Faculty of Mathematics and Computer Science, University of Science, Ho Chi Minh City, Vietnam
  • Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam
  • Vietnam National University, Ho Chi Minh City, Vietnam

Contact

Contact Ngoc-Son Nguyen ([email protected]) if you have any questions.
