Recent advances in Multimodal Large Language Models (MLLMs) have introduced a paradigm shift in Image Quality Assessment (IQA), from unexplainable quality scoring to explainable IQA, with practical applications such as quality control and optimization guidance. However, current explainable IQA methods not only inappropriately apply the same distortion criteria to both User-Generated Content (UGC) and AI-Generated Content (AIGC) images, but also lack the detailed quality analysis needed for monitoring image quality and guiding image restoration.
In this study, we establish the first large-scale Visual Distortion Assessment Instruction Tuning Dataset for UGC images, termed ViDA-UGC, which comprises 11K images with fine-grained quality grounding, detailed quality perception, and reasoning quality description data. The dataset is constructed through a distortion-oriented pipeline that combines human subject annotation with a Chain-of-Thought (CoT) assessment framework. This framework guides GPT-4o to generate quality descriptions by identifying and analyzing UGC distortions, which helps capture rich low-level visual features that inherently correlate with distortion patterns. Moreover, we carefully select 476 images with 6,149 corresponding question-answer pairs from ViDA-UGC and invite a professional team to verify the accuracy and quality of the GPT-generated information. The selected and revised data further form the first UGC distortion assessment benchmark, termed ViDA-UGC-Bench.
Experimental results demonstrate the effectiveness of ViDA-UGC and the CoT framework: they consistently enhance various image quality analysis abilities across multiple base MLLMs on both ViDA-UGC-Bench and Q-Bench, even surpassing GPT-4o.
Overview of our CoT assessment framework. (a) Direct inference leads to inaccurate low-level information and superficial reasoning. (b) Standard CoT decomposes the task into reasoning steps but suffers from hallucination and inconsistent quality analysis. (c) Our approach integrates human expertise and ground-truth annotations to guide reliable step-wise reasoning and accurate quality prediction.
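As a rough, hypothetical illustration of (c), the sketch below injects human-verified distortion annotations into a step-wise GPT-4o prompt so that each reasoning step is anchored to ground-truth evidence. The message wording, step names, and the `annotations` structure are assumptions for illustration, not the exact prompts used in ViDA-UGC; the image input itself is also omitted for brevity.

```python
from openai import OpenAI  # assumes the official openai>=1.0 Python client

client = OpenAI()

def build_guided_cot_messages(annotations: dict) -> list[dict]:
    """Assemble a step-wise prompt in which human-verified distortion
    annotations act as ground truth, so each reasoning step stays anchored
    to verified low-level evidence rather than free-form guesses."""
    evidence = "\n".join(
        f"- {d['type']} at box {d['box']} (severity: {d['severity']})"
        for d in annotations["distortions"]
    )
    system = (
        "You are an image quality expert. Reason step by step, but every "
        "step must stay consistent with the ground-truth evidence provided."
    )
    user = (
        f"Ground-truth distortion evidence:\n{evidence}\n"
        f"Human-rated overall quality score: {annotations['mos']}\n\n"
        "Step 1: Describe each listed distortion and its visual impact.\n"
        "Step 2: Analyze how the distortions interact with the image content.\n"
        "Step 3: Conclude with an overall quality assessment consistent with the score."
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

# Hypothetical annotation record for one image.
example = {
    "mos": 2.1,
    "distortions": [
        {"type": "motion blur", "box": [120, 80, 340, 260], "severity": "severe"},
        {"type": "underexposure", "box": [0, 0, 512, 512], "severity": "moderate"},
    ],
}
response = client.chat.completions.create(
    model="gpt-4o", messages=build_guided_cot_messages(example)
)
print(response.choices[0].message.content)
```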
An overview of the ViDA-UGC construction pipeline, which consists of four steps. In (a), an MILP sampling strategy ensures an approximately uniform distribution of the sampled images across low-level features; human subjects then annotate distortion bounding boxes and image quality scores. In (b), we mark each box with a number and integrate IQA expertise into the GPT prompt; GPT-4o then outputs textual descriptions of the distortions and their visual attributes. In (c), the generated textual descriptions, together with the human annotations and IQA expertise, serve as ground-truth information in the proposed CoT framework, which GPT-4o uses to generate quality description data. In (d), the quality description data are transformed into distortion-related VQA and MCQ data via GPT-4o, and the distortion attributes generated in (b) are converted into grounding data using pre-defined chat templates.
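As a loose sketch of the sampling idea in step (a), the snippet below formulates selection as a small mixed-integer linear program with PuLP: binary selection variables, a fixed sampling budget, and a minimized worst-case deviation from uniform per-bin counts for each low-level feature. The feature binning, budget, and variable names are illustrative assumptions rather than the paper's exact formulation.

```python
import random
import pulp

# Hypothetical setup: each candidate image falls into one bin (0..NUM_BINS-1)
# for each low-level feature (e.g., brightness, contrast, colorfulness, blur).
NUM_IMAGES, NUM_FEATURES, NUM_BINS, BUDGET = 200, 4, 5, 50
random.seed(0)
bins = [[random.randrange(NUM_BINS) for _ in range(NUM_FEATURES)]
        for _ in range(NUM_IMAGES)]

prob = pulp.LpProblem("uniform_image_sampling", pulp.LpMinimize)
x = [pulp.LpVariable(f"x_{i}", cat="Binary") for i in range(NUM_IMAGES)]
dev = pulp.LpVariable("max_deviation", lowBound=0)

prob += dev  # objective: minimize the worst-case deviation from uniformity
prob += pulp.lpSum(x) == BUDGET  # select exactly BUDGET images

target = BUDGET / NUM_BINS  # ideal number of selected images per (feature, bin) cell
for f in range(NUM_FEATURES):
    for b in range(NUM_BINS):
        count = pulp.lpSum(x[i] for i in range(NUM_IMAGES) if bins[i][f] == b)
        prob += count - target <= dev
        prob += target - count <= dev

prob.solve(pulp.PULP_CBC_CMD(msg=False))
selected = [i for i in range(NUM_IMAGES) if x[i].value() > 0.5]
print(f"selected {len(selected)} images, worst per-bin deviation = {dev.value():.2f}")
```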
To address the limitations of existing benchmarks, which often lack challenging questions and diversity in IQA tasks, we introduce ViDA-UGC-Bench, the first UGC distortion assessment benchmark. We carefully select 476 images together with their distortion triplet samples, 476 overall quality analyses from ViDA-Description, 2,567 multi-choice questions from ViDA-Perception, and 3,106 grounding samples from ViDA-Grounding. A professional team of image-processing researchers then revises the question-answer pairs and verifies the accuracy of the low-level information, minimizing GPT-4o biases. In the figure above, we present one excluded sample and several test questions retained in ViDA-UGC-Bench. The resulting ViDA-UGC-Bench covers all ten UGC distortion types and evaluates MLLMs on three core IQA tasks: distortion grounding, low-level perception, and quality description.
| Model (variant) | Training Dataset | Q-Bench: Dist | Q-Bench: I-C Dist | Q-Bench: Overall | ViDA-UGC-Bench: Type | ViDA-UGC-Bench: Position | ViDA-UGC-Bench: Severity | ViDA-UGC-Bench: Significance | ViDA-UGC-Bench: Overall |
|---|---|---|---|---|---|---|---|---|---|
| Qwen-VL-Chat | no (Baseline) | 50.78% | 53.62% | 56.39% | 31.91% | 37.83% | 31.72% | 36.64% | 34.84% |
| Qwen-VL-Chat | Q-Instruct | 70.43% | 74.34% | 71.51% | 28.50% | 35.79% | 31.72% | 47.12% | 35.88% |
| Qwen-VL-Chat | ViDA-UGC | 78.60% | 79.93% | 73.98% | 60.27% | 54.44% | 74.37% | 69.49% | 63.34% |
| Qwen2-VL-7B | no (Baseline) | 75.49% | 75.66% | 77.19% | 43.43% | 43.53% | 51.26% | 54.15% | 47.53% |
| Qwen2-VL-7B | Q-Instruct | 75.68% | 77.63% | 76.92% | 34.12% | 42.00% | 51.47% | 52.08% | 44.14% |
| Qwen2-VL-7B | ViDA-UGC | 87.16% | 85.20% | 80.60% | 76.37% | 61.17% | 80.46% | 72.20% | 71.45% |
| InternVL2.5-8B | no (Baseline) | 69.07% | 70.07% | 73.98% | 37.08% | 43.78% | 44.96% | 49.04% | 43.51% |
| InternVL2.5-8B | Q-Instruct | 76.65% | 76.64% | 76.45% | 29.99% | 38.32% | 68.07% | 45.53% | 43.40% |
| InternVL2.5-8B | ViDA-UGC | 87.74% | 84.54% | 79.33% | 81.24% | 62.06% | 82.14% | 78.27% | 74.80% |
| InternVL3-8B | no (Baseline) | 70.43% | 71.38% | 74.72% | 43.57% | 47.08% | 47.06% | 52.08% | 47.37% |
| InternVL3-8B | Q-Instruct | 70.43% | 75.99% | 73.18% | 29.10% | 34.90% | 53.15% | 47.28% | 39.77% |
| InternVL3-8B | ViDA-UGC | 82.49% | 79.93% | 77.19% | 82.87% | 61.80% | 78.15% | 72.52% | 73.00% |
| GPT-4o-2024-11-20 | no (Zero-shot) | 75.14% | 78.10% | 78.60% | 46.38% | 60.28% | 53.57% | 59.58% | 55.20% |
Comparison of the low-level Perception ability between baseline MLLMs, their Q-Instruct-tuned versions, and their ViDA-UGC-tuned versions on Q-Bench and ViDA-UGC-Bench. For each base model, the ViDA-UGC-tuned version achieves the highest score on both benchmarks.
| Model (variant) | Training Dataset | Q-Bench: Completeness | Q-Bench: Precision | Q-Bench: Relevance | Q-Bench: Overall | ViDA-UGC-Bench: Completeness | ViDA-UGC-Bench: Precision | ViDA-UGC-Bench: Reasoning | ViDA-UGC-Bench: Overall |
|---|---|---|---|---|---|---|---|---|---|
| Qwen-VL-Chat | no (Baseline)* | 0.72 | 0.53 | 1.86 | 3.11 | 0.20 | 0.58 | 1.56 | 2.34 |
| Qwen-VL-Chat | no (Baseline)† | 0.78 | 0.55 | 1.99 | 3.32 | 0.50 | 0.77 | 2.13 | 3.40 |
| Qwen-VL-Chat | Q-Instruct* | 0.97 | 0.76 | 1.95 | 3.68 | 0.58 | 0.93 | 1.98 | 3.49 |
| Qwen-VL-Chat | ViDA-UGC† | 1.07 | 0.74 | 1.99 | 3.80 | 1.33 | 1.14 | 2.89 | 5.36 |
| Qwen2-VL-7B | no (Baseline)* | 1.30 | 1.06 | 2.00 | 4.36 | 0.70 | 0.73 | 2.78 | 4.21 |
| Qwen2-VL-7B | no (Baseline)† | 1.36 | 1.28 | 2.00 | 4.64 | 0.74 | 0.92 | 2.88 | 4.54 |
| Qwen2-VL-7B | Q-Instruct* | 1.15 | 1.35 | 2.00 | 4.50 | 0.69 | 1.14 | 1.78 | 3.61 |
| Qwen2-VL-7B | ViDA-UGC† | 1.40 | 1.30 | 2.00 | 4.70 | 1.49 | 1.34 | 2.95 | 5.78 |
| InternVL2.5-8B | no (Baseline)* | 0.96 | 0.72 | 1.83 | 3.51 | 0.74 | 0.74 | 2.76 | 4.24 |
| InternVL2.5-8B | no (Baseline)† | 1.15 | 1.03 | 1.94 | 4.12 | 0.75 | 1.25 | 2.90 | 4.90 |
| InternVL2.5-8B | Q-Instruct* | 0.97 | 1.22 | 1.93 | 4.12 | 0.63 | 1.28 | 1.63 | 3.54 |
| InternVL2.5-8B | ViDA-UGC† | 1.22 | 1.23 | 1.99 | 4.44 | 1.46 | 1.32 | 2.94 | 5.72 |
| InternVL3-8B | no (Baseline)* | 1.10 | 0.93 | 1.86 | 3.89 | 0.93 | 0.96 | 2.95 | 4.84 |
| InternVL3-8B | no (Baseline)† | 1.27 | 1.34 | 2.00 | 4.61 | 0.96 | 1.28 | 2.98 | 5.22 |
| InternVL3-8B | Q-Instruct* | 0.98 | 1.32 | 1.95 | 4.25 | 0.64 | 1.25 | 1.63 | 3.52 |
| InternVL3-8B | ViDA-UGC† | 1.34 | 1.35 | 1.99 | 4.68 | 1.51 | 1.36 | 3.00 | 5.87 |
Comparison of the Quality Description ability between baseline MLLMs, their Q-Instruct-tuned versions, and their ViDA-UGC-tuned versions. For rows marked with *, results are obtained under the prompt from Q-Instruct: 'Describe and evaluate the quality of the image. Think step by step.' For rows marked with †, results are obtained under our proposed CoT framework. For each base model, the ViDA-UGC-tuned version achieves the highest score, and the baseline prompted with our CoT framework achieves the second-best score on both benchmarks.
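As a reading aid for the table above, each overall score is the sum of the three per-dimension means (e.g., 0.72 + 0.53 + 1.86 = 3.11 in the first row). The minimal sketch below reproduces that aggregation under the assumption that a judge assigns per-sample scores for each dimension; the variable names and toy ratings are illustrative.

```python
from statistics import mean

# Hypothetical per-sample judge ratings for one model on one benchmark:
# each entry holds the scores assigned to a single generated description.
ratings = [
    {"completeness": 1, "precision": 1, "relevance": 2},
    {"completeness": 0, "precision": 1, "relevance": 2},
    {"completeness": 2, "precision": 0, "relevance": 1},
]

def aggregate(ratings, dims=("completeness", "precision", "relevance")):
    """Per-dimension mean over samples; 'overall' is the sum of those means,
    matching how the overall column above relates to its sub-columns."""
    per_dim = {d: mean(r[d] for r in ratings) for d in dims}
    per_dim["overall"] = sum(per_dim.values())
    return per_dim

print(aggregate(ratings))  # per-dimension means plus 'overall' as their sum
```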
| Method | Referring Grounding: Response Rate | Referring Grounding: Acc@0.5 | Referring Grounding: mIoU | Distortion Grounding: COCO mAP | Region Perception: COCO mAP |
|---|---|---|---|---|---|
| Qwen-VL-Chat | 1.00 | 32.4 | 37.3 | - | - |
| Qwen-VL-Chat-ViDA | 1.00 | 41.3 (+8.9) | 43.4 (+6.1) | 16.8 | 27.1 |
| Qwen2-VL-7B | 0.79 | 24.9 | 29.8 | - | - |
| Qwen2-VL-7B-ViDA | 0.99 | 42.1 (+17.2) | 45.2 (+15.4) | 17.7 | 29.4 |
| InternVL2.5-8B | 0.98 | 29.0 | 37.0 | - | - |
| InternVL2.5-8B-ViDA | 1.00 | 43.3 (+14.3) | 46.4 (+9.4) | 20.0 | 32.3 |
| InternVL3-8B | 0.95 | 25.8 | 33.5 | - | - |
| InternVL3-8B-ViDA | 1.00 | 44.2 (+18.4) | 47.0 (+13.5) | 19.7 | 30.4 |
Comparison of the Referring Grounding, Distortion Grounding, and Region Perception abilities between baselines and ViDA-UGC-tuned versions.
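For clarity on the referring grounding columns, the sketch below computes box IoU, then Acc@0.5 (share of answered queries with IoU ≥ 0.5), mIoU, and the response rate. The (x1, y1, x2, y2) box format and the choice to score only answered queries are assumptions for illustration; COCO mAP would typically come from a library such as pycocotools rather than this snippet.

```python
def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_metrics(predictions, ground_truths):
    """predictions[i] is a predicted box, or None when the model returned no
    parsable box; ground_truths[i] is the annotated box for the same query."""
    answered = [(p, g) for p, g in zip(predictions, ground_truths) if p is not None]
    response_rate = len(answered) / len(predictions)
    ious = [iou(p, g) for p, g in answered]
    acc_05 = sum(v >= 0.5 for v in ious) / len(ious) if ious else 0.0
    miou = sum(ious) / len(ious) if ious else 0.0
    return {"response_rate": response_rate, "acc@0.5": acc_05, "miou": miou}

# Toy usage with two answered queries and one refusal.
preds = [(10, 10, 100, 100), None, (30, 40, 120, 160)]
gts   = [(12, 8, 98, 105), (0, 0, 50, 50), (25, 35, 110, 150)]
print(grounding_metrics(preds, gts))
```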
@misc{liao2025vidaugcdetailedimagequality,
      title={ViDA-UGC: Detailed Image Quality Analysis via Visual Distortion Assessment for UGC Images},
      author={Wenjie Liao and Jieyu Yuan and Yifang Xu and Chunle Guo and Zilong Zhang and Jihong Li and Jiachen Fu and Haotian Fan and Tao Li and Junhui Cui and Chongyi Li},
      year={2025},
      eprint={2508.12605},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.12605},
}
Feel free to contact us at:
jasonliao.21@bytedance.com
fanhaotian@bytedance.com
We are pleased to announce that we have successfully organized the Detailed Image Quality Assessment Challenge in MIPI 2025 using the ViDA-UGC dataset. For more information, please visit the challenge page (https://www.codabench.org/competitions/8156/#/pages-tab) and the official MIPI 2025 website (https://mipi-challenge.org/MIPI2025).