
Quantitative Evaluation Framework for Video-based Conversational Models

This README provides a detailed walkthrough of our proposed quantitative benchmarking framework. The framework enables an in-depth evaluation of video-based conversational models through two types of assessments:

  1. Video-based Generative Performance Benchmarking
  2. Zero-Shot Question-Answer Evaluation

Video-based Generative Performance Benchmarking

Our framework introduces a benchmark designed to assess the text generation performance of video-based conversational models. We leverage a test set of 500 samples curated from the ActivityNet-200 videos for this purpose.

You can download the videos from here and the corresponding human-generated detailed descriptions from here.

Our benchmark covers five key aspects:

  1. Correctness of Information
  2. Detail Orientation
  3. Contextual Understanding
  4. Temporal Understanding
  5. Consistency

Evaluation Aspect            Video Chat   LLaMA Adapter   Video LLaMA   Video-ChatGPT
Correctness of Information   2.23         2.03            1.96          2.40
Detail Orientation           2.50         2.32            2.18          2.52
Contextual Understanding     2.53         2.30            2.16          2.62
Temporal Understanding       1.94         1.98            1.82          1.98
Consistency                  2.24         2.15            1.79          2.37

We generate task-specific question-answers by querying the GPT-3.5-Turbo model using the human-generated detailed video descriptions. The generated question-answer pairs are available for download here.
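
As a rough illustration of this generation step, the sketch below asks GPT-3.5-Turbo to turn one human-written description into a question-answer pair. It is a minimal, hypothetical example: it assumes the legacy openai<1.0 Python SDK, and the prompt text and file names are placeholders rather than the actual ones; the real scripts are referenced under benchmark_dataset_generation below.

import json
import openai

# Rough sketch only: prompt wording and file names are illustrative, not the
# actual ones used in benchmark_dataset_generation. Assumes the legacy
# openai<1.0 Python SDK.
openai.api_key = "<openai-api-key>"

def generate_qa(description: str) -> dict:
    """Ask GPT-3.5-Turbo for one question-answer pair grounded in a video description."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Given a detailed video description, write one question about "
                        'the video and its answer. Reply as JSON: {"Q": "...", "A": "..."}.'},
            {"role": "user", "content": description},
        ],
    )
    # Assumes the model followed the instruction and returned valid JSON.
    return json.loads(response["choices"][0]["message"]["content"])

# descriptions.json (hypothetical layout): {"<video_id>": "<detailed description>", ...}
with open("descriptions.json") as f:
    descriptions = json.load(f)
qa_pairs = {vid: generate_qa(desc) for vid, desc in descriptions.items()}
with open("generated_qa_pairs.json", "w") as f:
    json.dump(qa_pairs, f, indent=2)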

Follow the steps below to perform the quantitative benchmarking:

Step 1: Run the inference using the provided question-answer pairs for each criterion.

python video_chatgpt/eval/run_inference_benchmark_general.py \
    --video_dir <path-to-directory-containing-videos> \
    --gt_file <ground-truth-file-containing-question-answer-pairs> \
    --output_dir <output-dir-path> \
    --output_name <output-file-name> \
    --model-name <path-to-LLaVA-Lightening-7B-v1-1> \
    --projection_path <path-to-Video-ChatGPT-weights>
  • Note that the question-answer pairs (gt_file) are the same for correctness, detail orientation, and contextual understanding.

  • For temporal understanding and consistency, separate question-answer pairs are provided.
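
For reference, a filled-in invocation might look like the following; every path here is a hypothetical placeholder, not a file shipped with the repository.

python video_chatgpt/eval/run_inference_benchmark_general.py \
    --video_dir data/activitynet_test_videos \
    --gt_file data/generic_qa_pairs.json \
    --output_dir outputs/benchmark \
    --output_name generic_preds \
    --model-name checkpoints/LLaVA-Lightening-7B-v1-1 \
    --projection_path checkpoints/video_chatgpt-7B.bin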

Step 2: Execute the corresponding evaluation script to perform benchmarking.

For example, for the correctness criterion:

python quantitative_evaluation/evaluate_benchmark_1_correctness.py \
    --pred_path <path-to-prediction-file-generated-using-inference-script> \
    --output_dir <output-directory-path> \
    --output_json <path-to-save-annotation-final-combined-json-file> \
    --api_key <openai-api-key-to-access-GPT3.5-Turbo-model>

For evaluation on all five criteria, you can use:

bash quantitative_evaluation/evaluate_benchmark.sh

Note: To further understand how the question-answer annotations are prepared for the benchmarking, refer to: benchmark_dataset_generation.
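
To give a sense of what these evaluation scripts do, here is a minimal sketch of the GPT-assisted scoring idea: for each criterion, GPT-3.5-Turbo is shown the question, the ground-truth answer, and the model's prediction, and is asked for a 1-5 rating against a criterion-specific instruction. The instruction texts, function names, and JSON layout below are illustrative assumptions, not the exact prompts used in quantitative_evaluation.

import json
import openai

# Sketch only: criterion instructions, prompt wording, and the predictions-file
# layout are assumptions; see quantitative_evaluation/evaluate_benchmark_*.py
# for the actual scripts.
openai.api_key = "<openai-api-key>"

# One illustrative instruction per benchmark criterion.
CRITERIA = {
    "correctness": "Rate how factually correct the predicted answer is.",
    "detail_orientation": "Rate how detailed and complete the predicted answer is.",
    "context": "Rate how well the predicted answer fits the overall context of the video.",
    "temporal": "Rate how well the predicted answer captures the order of events.",
    "consistency": "Rate how consistent the predicted answer is with the ground truth when the question is rephrased.",
}

def score(criterion: str, question: str, answer: str, prediction: str) -> int:
    """Ask GPT-3.5-Turbo for a 1-5 rating of the prediction under one criterion."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": CRITERIA[criterion] + ' Reply as JSON: {"score": <integer 1-5>}.'},
            {"role": "user",
             "content": f"Question: {question}\n"
                        f"Correct answer: {answer}\n"
                        f"Predicted answer: {prediction}"},
        ],
    )
    return json.loads(response["choices"][0]["message"]["content"])["score"]

# predictions.json (hypothetical layout): a list of {"Q": ..., "A": ..., "pred": ...} entries.
with open("predictions.json") as f:
    predictions = json.load(f)
scores = [score("correctness", p["Q"], p["A"], p["pred"]) for p in predictions]
print(f"Average correctness score: {sum(scores) / len(scores):.2f}")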


Zero-Shot Question-Answer Evaluation

Our framework facilitates zero-shot evaluation on four standard open-ended question-answer datasets: MSRVTT-QA, MSVD-QA, TGIF-QA, and ActivityNet-QA. For the sake of brevity, we present the evaluation method on ActivityNet-QA. The evaluation protocol remains the same for all datasets, except for some dataset-specific changes related to videos and annotations.

Model           MSVD-QA            MSRVTT-QA          TGIF-QA            ActivityNet-QA
                Accuracy   Score   Accuracy   Score   Accuracy   Score   Accuracy   Score
FrozenBiLM      32.2       --      16.8       --      41.0       --      24.7       --
Video Chat      56.3       2.8     45.0       2.5     34.4       2.3     26.5       2.2
LLaMA Adapter   54.9       3.1     43.8       2.7     -          -       34.2       2.7
Video LLaMA     51.6       2.5     29.6       1.8     -          -       12.4       1.1
Video-ChatGPT   64.9       3.3     49.3       2.8     51.4       3.0     35.2       2.7

Follow these steps to conduct the evaluation:

Step 1: Run the inference. You'll need the following:

a) Videos: Download the videos for ActivityNet-QA from here.

b) Question and answer annotations: You can obtain these from the official GitHub repository, or download them from here.

Run the command:

python video_chatgpt/eval/run_inference_activitynet_qa.py \
    --video_dir <path-to-video-dir> \
    --gt_file_question <test_q.json> \
    --gt_file_answers <test_a.json> \
    --output_dir <path-to-out-dir> \
    --output_name video_chatgpt_activitynet_qa_preds \
    --projection_path <path-to-video-chat-gpt-checkpoint>

This will generate a JSON file containing the model's predicted responses.

Step 2: Evaluate the predicted responses. The evaluation computes the accuracy and assigns a score on a scale of 1-5. This step requires the predictions from Step 1, the question-answer pair annotations, and an OpenAI API key.

Run the command:

python quantitative_evaluation/evaluate_activitynet_qa.py \
    --pred_path <video_chatgpt_activitynet_qa_preds> \
    --output_dir <path-to-out-dir> \
    --output_json <video_chatgpt_activitynet_qa_results> \
    --api_key <your-openai-api_key> \
    --num_tasks 1
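
For intuition, the sketch below illustrates the kind of GPT-assisted judging this step performs: GPT-3.5-Turbo compares each predicted answer with the ground-truth answer, returns a yes/no verdict plus a 1-5 quality score, and the per-sample verdicts are aggregated into accuracy and an average score. The prompt wording, field names, and file layout are illustrative assumptions, not the exact ones used by evaluate_activitynet_qa.py.

import json
import openai

# Sketch only: the prompt, JSON fields, and predictions-file layout are assumptions;
# the actual logic lives in quantitative_evaluation/evaluate_activitynet_qa.py.
openai.api_key = "<your-openai-api-key>"

def judge(question: str, answer: str, prediction: str) -> dict:
    """Ask GPT-3.5-Turbo whether the prediction matches the answer and how well (1-5)."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You judge answers to video questions. Reply as JSON: "
                        '{"pred": "yes" or "no", "score": <integer 1-5>}.'},
            {"role": "user",
             "content": f"Question: {question}\n"
                        f"Correct answer: {answer}\n"
                        f"Predicted answer: {prediction}"},
        ],
    )
    return json.loads(response["choices"][0]["message"]["content"])

# preds.json (hypothetical layout): a list of {"question": ..., "answer": ..., "pred": ...}.
with open("preds.json") as f:
    samples = json.load(f)

verdicts = [judge(s["question"], s["answer"], s["pred"]) for s in samples]
accuracy = sum(v["pred"] == "yes" for v in verdicts) / len(verdicts)
average_score = sum(v["score"] for v in verdicts) / len(verdicts)
print(f"Accuracy: {accuracy:.3f}   Average score: {average_score:.2f}")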