Recent multimodal large language models (MLLMs) have shifted image quality assessment (IQA) from black-box quality scoring toward explainable quality analysis. However, existing datasets and benchmarks are not specifically designed for user-generated content (UGC) distortions, which arise from real-world capture, processing, and sharing pipelines. They also lack a unified framework for detailed IQA.
In this study, we propose detailed IQA for UGC images, formulated around three core abilities: Grounding (A1), Perception (A2), and Description (A3). To support these abilities, we establish the first large-scale Visual Distortion Assessment Instruction Tuning Dataset for UGC images, named ViDA-UGC, containing 11,534 images, 36K distortion bounding boxes, and 534K instruction-tuning samples. The dataset is built through a distortion-oriented pipeline that combines human annotation with Grounding-Perception-Guided Controlled Chain-of-Thought (GPGC CoT) prompting, which guides GPT-4o to progressively identify, analyze, and reason about UGC distortions along an explicit distortion evidence chain.
We further select and verify a subset of ViDA-UGC with a professional team to build ViDA-UGC-Bench, the first UGC distortion assessment benchmark covering all three abilities. Experimental results show that fine-tuning on ViDA-UGC consistently enhances grounding, perception, and description across the Qwen-VL and InternVL series, and that GPGC CoT prompting further improves quality descriptions even without fine-tuning; the best-tuned models surpass GPT-4o. ViDA-UGC-Bench further reveals that current MLLMs remain limited in distortion assessment for detailed IQA. The dataset and code will be released publicly.
Overview of our GPGC CoT prompting. Direct CoT prompting often produces inaccurate low-level observations and superficial reasoning, while step-wise controlled CoT improves reasoning structure but can still hallucinate distortion evidence. GPGC CoT incorporates human expertise and ground-truth distortion annotations to guide reliable reasoning and accurate quality prediction along an explicit distortion-to-judgment evidence chain.
Overview of the ViDA-UGC construction pipeline. The pipeline first samples representative UGC images through a mixed-integer linear programming (MILP) strategy and collects human annotations of distortion regions and quality scores. It then injects IQA expertise into GPT-4o prompts to generate distortion attributes, integrates the human annotations and distortion evidence into GPGC CoT prompting to produce controlled reasoning descriptions, and finally organizes the raw data into the ViDA-Grounding, ViDA-Perception, and ViDA-Description instruction-tuning subsets.
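To make the three subsets more concrete, the sketch below shows what individual instruction-tuning samples could look like. The field names, file names, and answer formats are illustrative assumptions for exposition only, not the released ViDA-UGC schema.

```python
# Hypothetical instruction-tuning samples for the three subsets (illustrative only).
# Grounding returns distortion bounding boxes, perception answers multiple-choice
# questions, and description produces a CoT-style overall quality analysis.
vida_grounding_sample = {
    "image": "ugc_000123.jpg",  # hypothetical file name
    "question": "Locate all regions affected by motion blur and output their bounding boxes.",
    "answer": [{"distortion": "motion blur", "bbox": [120, 88, 412, 356]}],
}

vida_perception_sample = {
    "image": "ugc_000123.jpg",
    "question": "What is the most severe distortion in the region (120, 88, 412, 356)?",
    "choices": ["motion blur", "overexposure", "compression artifacts", "noise"],
    "answer": "motion blur",
}

vida_description_sample = {
    "image": "ugc_000123.jpg",
    "question": "Describe and evaluate the quality of the image. Think step by step.",
    "answer": "The main subject shows noticeable motion blur around the moving hand, "
              "while the background is well exposed... Overall, the image quality is fair.",
}
```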
To rigorously evaluate detailed IQA in the UGC setting, we introduce ViDA-UGC-Bench, the benchmark counterpart to ViDA-UGC. It contains 476 images with distortion triplet samples, 476 overall quality analyses from ViDA-Description, 2,567 multiple-choice questions from ViDA-Perception, and 3,106 grounding samples from ViDA-Grounding. A professional team verifies the low-level distortion information and question-answer pairs to reduce GPT-4o bias. The benchmark covers all ten UGC distortion categories and evaluates MLLMs across grounding, perception, and description abilities.
| Model | Variant | Response Rate | Acc@0.5 | mIoU |
|---|---|---|---|---|
| Qwen-VL-Chat | Baseline | 1.00 | 32.4 | 37.3 |
| | ViDA | 1.00 | 41.3 (+8.9) | 43.4 (+6.1) |
| Qwen2-VL-7B | Baseline | 0.79 | 24.9 | 29.8 |
| | ViDA | 0.99 | 42.1 (+17.2) | 45.2 (+15.4) |
| InternVL2.5-8B | Baseline | 0.98 | 29.0 | 37.0 |
| | ViDA | 1.00 | 43.3 (+14.3) | 46.4 (+9.4) |
| InternVL3-8B | Baseline | 0.95 | 25.8 | 33.5 |
| | ViDA | 1.00 | 44.2 (+18.4) | 47.0 (+13.5) |
Referring grounding results for baseline and ViDA-UGC-tuned MLLMs. Response rate is the fraction of valid box outputs, Acc@0.5 is the accuracy at an IoU threshold of 0.5, and gains over the baseline are shown in parentheses.
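For reference, the grounding metrics in this table can be computed with standard IoU-based evaluation. The minimal sketch below assumes one ground-truth box per query and computes accuracy and mIoU over valid responses only; the paper's exact matching protocol may differ.

```python
def box_iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_metrics(predictions, ground_truths, iou_thr=0.5):
    """predictions: list of boxes or None for invalid (unparsable) responses.
    Returns (response_rate, acc@iou_thr, mIoU), the latter two over valid responses."""
    valid = [(p, g) for p, g in zip(predictions, ground_truths) if p is not None]
    if not valid:
        return 0.0, 0.0, 0.0
    ious = [box_iou(p, g) for p, g in valid]
    response_rate = len(valid) / len(predictions)
    acc = sum(i >= iou_thr for i in ious) / len(ious)
    miou = sum(ious) / len(ious)
    return response_rate, acc, miou
```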
| Type | Method | DES mAP | DG mAP | RP mAP |
|---|---|---|---|---|
| Detector | TOOD | - | 29.2 | 36.2 |
| | Co-DETR | - | 31.4 | 35.8 |
| | Grounding DINO | - | 37.1 | 42.2 |
| MLLM | InternVL3-8B-ViDA | 17.4 | 19.7 | 30.4 |
| | InternVL2.5-8B-ViDA | 15.9 | 20.0 | 32.3 |
| | Qwen2-VL-7B-ViDA | 17.9 | 17.7 | 29.4 |
| | Qwen-VL-Chat-ViDA | 15.5 | 16.8 | 27.1 |
Distortion grounding and region perception results for fine-tuned detectors and ViDA-UGC-tuned MLLMs. DES extracts boxes from quality descriptions, DG performs direct distortion grounding, and RP denotes region perception.
| Model | Training Dataset | Yes-or-No | What | How | Distortion | Other | I-C Distortion | I-C Other | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Qwen-VL-Chat | no (Baseline) | 56.00% | 58.63% | 54.77% | 50.78% | 61.57% | 53.62% | 62.45% | 56.39% |
| | Q-Instruct | 78.36% | 74.56% | 61.05% | 70.43% | 67.59% | 74.34% | 77.14% | 71.51% |
| | ViDA-UGC | 77.09% | 78.32% | 66.53% | 78.60% | 65.97% | 79.93% | 71.02% | 73.98% |
| Qwen2-VL-7B | no (Baseline) | 83.82% | 82.74% | 64.71% | 75.49% | 77.08% | 75.66% | 82.86% | 77.19% |
| | Q-Instruct | 84.00% | 80.97% | 65.31% | 75.68% | 75.69% | 77.63% | 80.82% | 76.92% |
| | ViDA-UGC | 82.55% | 86.95% | 72.62% | 87.16% | 70.83% | 85.20% | 78.37% | 80.60% |
| InternVL2.5-8B | no (Baseline) | 78.91% | 77.21% | 65.52% | 69.07% | 76.85% | 70.07% | 84.08% | 73.98% |
| | Q-Instruct | 81.64% | 83.63% | 64.10% | 76.65% | 72.00% | 76.64% | 83.67% | 76.45% |
| | ViDA-UGC | 81.45% | 85.18% | 71.60% | 87.74% | 66.20% | 84.54% | 78.37% | 79.33% |
| InternVL3-8B | no (Baseline) | 78.91% | 76.99% | 67.95% | 70.43% | 75.93% | 71.38% | 85.71% | 74.72% |
| | Q-Instruct | 76.73% | 79.20% | 63.69% | 70.43% | 70.37% | 75.99% | 80.41% | 73.18% |
| | ViDA-UGC | 80.00% | 85.40% | 66.53% | 82.49% | 70.37% | 79.93% | 74.69% | 77.19% |
| GPT-4o | no (Zero-shot) | 83.59% | 82.40% | 71.81% | 75.14% | 78.76% | 78.10% | 85.00% | 78.60% |
Comparison of the perception ability among baseline MLLMs, Q-Instruct-tuned versions, and ViDA-UGC-tuned versions on Q-Bench (LLVisionQA-dev). Yes-or-No, What, and How are question types; Distortion, Other, In-Context (I-C) Distortion, and I-C Other are question concerns.
| Model | Training Dataset | Clarity | Compress | Exposure | Content | Noise | Type | Position | Severity | Significance | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen-VL-Chat | no (Baseline) | 41.54% | 18.71% | 46.09% | 50.85% | 13.03% | 31.91% | 37.83% | 31.72% | 36.64% | 34.84% |
| | Q-Instruct | 30.27% | 39.94% | 50.22% | 47.46% | 14.23% | 28.50% | 35.79% | 31.72% | 47.12% | 35.88% |
| | ViDA-UGC | 66.89% | 69.34% | 58.71% | 66.10% | 46.86% | 60.27% | 54.44% | 74.37% | 69.49% | 63.34% |
| Qwen2-VL-7B | no (Baseline) | 51.90% | 52.83% | 43.75% | 47.46% | 17.15% | 43.43% | 43.53% | 51.26% | 54.15% | 47.53% |
| | Q-Instruct | 43.17% | 55.82% | 39.06% | 45.76% | 20.92% | 34.12% | 42.00% | 51.47% | 52.08% | 44.14% |
| | ViDA-UGC | 69.45% | 84.12% | 69.64% | 86.44% | 46.86% | 76.37% | 61.17% | 80.46% | 72.20% | 71.45% |
| InternVL2.5-8B | no (Baseline) | 39.75% | 53.77% | 44.64% | 55.93% | 18.41% | 37.08% | 43.78% | 44.96% | 49.04% | 43.51% |
| | Q-Instruct | 34.25% | 59.75% | 41.96% | 61.86% | 32.22% | 29.99% | 38.32% | 68.07% | 45.53% | 43.40% |
| | ViDA-UGC | 73.15% | 85.69% | 71.43% | 84.75% | 60.25% | 81.24% | 62.06% | 82.14% | 78.27% | 74.80% |
| InternVL3-8B | no (Baseline) | 50.76% | 52.52% | 46.65% | 55.93% | 12.13% | 43.57% | 47.08% | 47.06% | 52.08% | 47.37% |
| | Q-Instruct | 31.50% | 55.97% | 38.84% | 52.54% | 28.03% | 29.10% | 34.90% | 53.15% | 47.28% | 39.77% |
| | ViDA-UGC | 71.35% | 87.74% | 70.31% | 83.90% | 46.03% | 82.87% | 61.80% | 78.15% | 72.52% | 73.00% |
| GPT-4o | no (Zero-shot) | 54.93% | 51.26% | 72.54% | 58.47% | 26.36% | 46.38% | 60.28% | 53.57% | 59.58% | 55.20% |
Comparison of the perception ability among baseline MLLMs, Q-Instruct-tuned versions, and ViDA-UGC-tuned versions on ViDA-UGC-Bench. Clarity, Compress, Exposure, Content, and Noise are distortion dimensions, where Compress denotes compression and Content denotes content anomaly; Type, Position, Severity, and Significance are question concerns.
| Model | Training Dataset | Completeness | Precision | Relevance | Overall | Completeness | Precision | Reasoning | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Qwen-VL-Chat | no (Baseline)* | 0.72 | 0.53 | 1.86 | 3.11 | 0.20 | 0.58 | 1.56 | 2.34 |
| | no (Baseline)† | 0.78 | 0.55 | 1.99 | 3.32 | 0.50 | 0.77 | 2.13 | 3.40 |
| | Q-Instruct* | 0.97 | 0.76 | 1.95 | 3.68 | 0.58 | 0.93 | 1.98 | 3.49 |
| | ViDA-UGC† | 1.07 | 0.74 | 1.99 | 3.80 | 1.33 | 1.14 | 2.89 | 5.36 |
| Qwen2-VL-7B | no (Baseline)* | 1.30 | 1.06 | 2.00 | 4.36 | 0.70 | 0.73 | 2.78 | 4.21 |
| | no (Baseline)† | 1.36 | 1.28 | 2.00 | 4.64 | 0.74 | 0.92 | 2.88 | 4.54 |
| | Q-Instruct* | 1.15 | 1.35 | 2.00 | 4.50 | 0.69 | 1.14 | 1.78 | 3.61 |
| | ViDA-UGC† | 1.40 | 1.30 | 2.00 | 4.70 | 1.49 | 1.34 | 2.95 | 5.78 |
| InternVL2.5-8B | no (Baseline)* | 0.96 | 0.72 | 1.83 | 3.51 | 0.74 | 0.74 | 2.76 | 4.24 |
| | no (Baseline)† | 1.15 | 1.03 | 1.94 | 4.12 | 0.75 | 1.25 | 2.90 | 4.90 |
| | Q-Instruct* | 0.97 | 1.22 | 1.93 | 4.12 | 0.63 | 1.28 | 1.63 | 3.54 |
| | ViDA-UGC† | 1.22 | 1.23 | 1.99 | 4.44 | 1.46 | 1.32 | 2.94 | 5.72 |
| InternVL3-8B | no (Baseline)* | 1.10 | 0.93 | 1.86 | 3.89 | 0.93 | 0.96 | 2.95 | 4.84 |
| | no (Baseline)† | 1.27 | 1.34 | 2.00 | 4.61 | 0.96 | 1.28 | 2.98 | 5.22 |
| | Q-Instruct* | 0.98 | 1.32 | 1.95 | 4.25 | 0.64 | 1.25 | 1.63 | 3.52 |
| | ViDA-UGC† | 1.34 | 1.35 | 1.99 | 4.68 | 1.51 | 1.36 | 3.00 | 5.87 |
Description comparison among baseline MLLMs, Q-Instruct-tuned versions, and ViDA-UGC-tuned versions. The first four metric columns are evaluated on Q-Bench and the last four on ViDA-UGC-Bench. Results marked with * use the Q-Instruct prompt ('Describe and evaluate the quality of the image. Think step by step.'), and those marked with † use our proposed GPGC CoT prompting.
@misc{liao2025vidaugcdetailedimagequality,
title={ViDA-UGC: Detailed Image Quality Analysis via Visual Distortion Assessment for User-Generated Content},
author={Wenjie Liao and Jieyu Yuan and Yifang Xu and Chunle Guo and Jihong Li and Zilong Zhang and Jiachen Fu and Haotian Fan and Chongyi Li},
year={2025},
eprint={2508.12605},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2508.12605},
}
Feel free to contact us at:
liaowenjie@mail.nankai.edu.cn
jieyuyuan.cn@gmail.com
lichongyi@nankai.edu.cn
We are pleased to announce that we have successfully organized the Detailed Image Quality Assessment Challenge in MIPI 2025 using the ViDA-UGC dataset. For more information, please visit the challenge page (https://www.codabench.org/competitions/8156/#/pages-tab) and the official MIPI 2025 website (https://mipi-challenge.org/MIPI2025).