🌟 Super Rapid Annotator 🌟

Red Hen Lab's Super Rapid Annotator Powered By Large Language Models

Welcome to the Super Rapid Annotator project! This tool is designed for video annotation by leveraging advanced multimodal vision and language models. 🚀

📚 Problem Statement

Annotating videos, especially identifying specific entities and their temporal relationships, is a complex and time-consuming task. Traditional methods lack efficiency and accuracy, particularly in handling multiple videos simultaneously. Our Super Rapid Annotator addresses these challenges by integrating state-of-the-art multimodal models with sophisticated spatial-temporal analysis, streamlining the annotation process.

🛠️ Tech Stack

Python · PyTorch · Hugging Face · Gradio · Pydantic · OpenCV · FFmpeg · FastAPI · Node.js

Prerequisites

A GPU and sufficient disk storage for the downloaded models.

Setup

Clone the Repository

First, clone the repository and navigate into the project directory:

git clone https://github.com/manishkumart/Super-Rapid-Annotator-Multimodal-Annotation-Tool.git
cd Super-Rapid-Annotator-Multimodal-Annotation-Tool

Create Virtual Environment

Create a virtual environment and install the required packages:

conda create -n env python=3.10 -y
conda activate env
pip install -r requirements.txt
npm i cors-anywhere
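
Since a GPU is a prerequisite, an optional quick check that PyTorch can see it (run inside the activated environment):

python -c "import torch; print(torch.cuda.is_available())"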

Model Downloads

Download the necessary models using the command below:

python models/download_models.py --m1 ./models/ChatUniVi --m2 ./models/Phi3

Update Model Paths

Open chat_uni.py and update the model path at line 77, and do the same in struct_phi3.py at line 90.
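
For illustration only, the edited lines might look like the sketch below; the variable names are assumptions, so adapt them to whatever chat_uni.py and struct_phi3.py actually use:

# chat_uni.py, around line 77 (hypothetical variable name)
MODEL_PATH = "./models/ChatUniVi"

# struct_phi3.py, around line 90 (hypothetical variable name)
MODEL_PATH = "./models/Phi3"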

Backend Servers

You need three terminals for this. Run each of the following commands in a separate terminal, either from the backend directory or with the full path specified:

cd backend
  1. Start the chat_uni server responsible for video annotation.

    uvicorn chat_uni:app --reload --port 8001

    This server will run on port 8001. If the port is busy, use another port and update it in script.js under src.

  2. Start the struct_phi3 server:

     uvicorn struct_phi3:app --reload --port 8002

    This server will run on port 8002. If the port is busy, use another port and update it in script.js under src.

  3. Start the Node.js server:

    node backend/server.js

    This proxy server will run on port 8080.
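
Once both FastAPI servers are running, a quick smoke test is to request their interactive docs pages, which FastAPI serves at /docs by default (this assumes the default ports above and that the requests package is installed):

import requests

# Confirm the chat_uni (8001) and struct_phi3 (8002) servers are reachable.
for port in (8001, 8002):
    response = requests.get(f"http://localhost:{port}/docs")
    print(port, response.status_code)  # 200 means the server is up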

Frontend Server

Open a new terminal and run the command below:

python frontend/serve_html.py

The frontend server can be accessed at http://localhost:5500.

Steps to Follow through the UI:


  1. Upload a Video

    • Click on the "Upload Video" button to select and upload your video file.
  2. Select Annotation Options

    • Choose any combination of the available annotation options:
      • Standing/Sitting
      • Screen Interactions or not
      • Hands free or not
      • Indoor/Outdoor
  3. Start the Annotation Process

    • Click the "Start" button. This will display the selected options and the name of the uploaded video.
  4. Annotate the Video

    • Click the "Video Annotate" button. This will use the prompt and the uploaded video to generate annotations.
  5. View the Prompt

    • Click on the "Prompt" button to see the prompt used in the background based on the selected options.
  6. Get the Output

    • Click the "Output" button to receive the structured output of the annotations.

📖 Blog Posts

  • An Experiment to Unlock Ollama’s Potential in Video Question Answering: Read here
  • Vertical Scaling in Video Annotation with Large Language Models: A Journey with GSoC’24 @ Red Hen Labs: Read here
  • My Journey with Red Hen Labs at GSoC ’24: Read here
  • Why Google Summer Of Code?: Read here

🌟 Features

  • Automatic Video Annotation: Uses state-of-the-art vision-language models for rapid and accurate annotation.
  • Multimodal Capabilities: Combines vision and language models to enhance understanding and entity detection.
  • Concurrent Processing: Efficiently processes multiple videos at once (see the sketch after this list).
  • CSV Output: Annotations are compiled into a user-friendly CSV format.
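
The sketch below illustrates only the concurrent-processing and CSV-output ideas; annotate_video is a hypothetical placeholder, not the repository's actual pipeline:

import csv
from concurrent.futures import ThreadPoolExecutor

def annotate_video(path: str) -> dict:
    # Hypothetical placeholder: the real pipeline would call the chat_uni
    # annotation server and the struct_phi3 structuring server instead.
    return {"video": path, "standing": "unknown", "hands_free": "unknown",
            "screen_interaction": "unknown", "indoors": "unknown"}

videos = ["clip1.mp4", "clip2.mp4", "clip3.mp4"]  # example inputs

# Process several videos concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    rows = list(pool.map(annotate_video, videos))

# Compile the annotations into a CSV file.
with open("annotations.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)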

📑 Findings and Insights

Motivation

At Red Hen Labs, through Google Summer of Code, I am contributing to vertical growth by developing an annotation product for the video space using large language models. This approach ensures that we build effective, domain-specific applications rather than generic models.

Importance of Structuring Models

We cannot always use models out of the box; we must structure their outputs to achieve the desired results. Following my mentors' recommendations, my first step is to test the capabilities of video large language models by annotating the following four key entities, among many others (a structured-output sketch follows the list):

  1. Screen Interaction: Determine if the subject in the video is interacting with a screen in the background.
  2. Hands-Free: Check if the subject’s hands are free or if they are holding anything.
  3. Indoors: Identify whether the subject is indoors or outdoors.
  4. Standing: Observe if the subject is sitting or standing.
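
Since the tech stack includes Pydantic, a minimal sketch of how these four entities could be captured as structured output is shown below; the field names are illustrative and may not match the actual schema in struct_phi3.py (the example assumes Pydantic v2):

from pydantic import BaseModel

class VideoAnnotation(BaseModel):
    # Illustrative field names; the real schema may differ.
    screen_interaction: bool  # subject interacts with a screen in the background
    hands_free: bool          # subject's hands are free (not holding anything)
    indoors: bool             # subject is indoors rather than outdoors
    standing: bool            # subject is standing rather than sitting

# Example: parse a model's JSON response into the schema.
annotation = VideoAnnotation.model_validate_json(
    '{"screen_interaction": false, "hands_free": true, "indoors": true, "standing": false}'
)
print(annotation.standing)  # False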

The Journey Ahead

We are in an era where new open-source models emerge monthly and improve continuously. This progress makes it necessary to focus on building great products around these models, which involves vertical scaling, such as fine-tuning models for specific domains. This approach not only optimizes the use of existing models but also accelerates the development of practical and effective solutions.

Dataset Preview

Here is a glimpse of the news dataset that we will be annotating, showcasing the real-world application of our annotation models.

Video Frames and Key Entities

All of the video frames we analyzed are sourced from news segments, each lasting approximately 4–5 seconds. To accurately capture the key entities with these models, I experimented extensively with prompt engineering, trying multiple variations and different models. The most effective prompt, which yielded outstanding results, is provided below.

The Golden Prompt

For each question, analyze the given video carefully and base your answers on the observations made.

  1. Examine the subject’s right and left hands in the video to check if they are holding anything like a microphone, book, paper (white color), object, or any electronic device, try segmentations and decide if the hands are free or not.
  2. Evaluate the subject’s body posture and movement within the video. Are they standing upright with both feet planted firmly on the ground? If so, they are standing. If they seem to be seated, they are seated.
  3. Assess the surroundings behind the subject in the video. Do they seem to interact with any visible screens, such as laptops, TVs, or digital billboards? If yes, then they are interacting with a screen. If not, they are not interacting with a screen.
  4. Consider the broader environmental context shown in the video’s background. Are there signs of an open-air space, like greenery, structures, or people passing by? If so, it’s an outdoor setting. If the setting looks confined with furniture, walls, or home decorations, it’s an indoor environment.

By taking these factors into account when watching the video, please answer the questions accurately.
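
As an illustration only, the UI's selected options could map onto the numbered instructions above roughly as follows; the repository's actual prompt-building logic may differ:

# Hypothetical mapping from UI options to the numbered prompt instructions above
# (instruction texts truncated here for brevity).
PROMPT_PARTS = {
    "hands_free": "1. Examine the subject's right and left hands ...",
    "standing": "2. Evaluate the subject's body posture and movement ...",
    "screen_interaction": "3. Assess the surroundings behind the subject ...",
    "indoors": "4. Consider the broader environmental context ...",
}

def build_prompt(selected_options):
    header = ("For each question, analyze the given video carefully and "
              "base your answers on the observations made.")
    body = "\n".join(PROMPT_PARTS[option] for option in selected_options)
    footer = ("By taking these factors into account when watching the video, "
              "please answer the questions accurately.")
    return "\n".join([header, body, footer])

print(build_prompt(["standing", "indoors"]))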

🙏 Acknowledgment

Special thanks to Raúl Sánchez Sánchez for his continuous support and guidance throughout this project.

📄 License

This project is licensed under the MIT License.

🤝 Contributing

Contributions are welcome! Please feel free to submit a pull request.

🌎 Connect with me

For any questions, please reach out to me on LinkedIn.
