Recent advances in Multimodal Large Language Models (MLLMs) have introduced a paradigm shift in Image Quality Assessment (IQA), from unexplainable quality scoring to explainable IQA, with practical applications such as quality control and optimization guidance. However, current explainable IQA methods not only inappropriately apply the same distortion criteria to both User-Generated Content (UGC) and AI-Generated Content (AIGC) images, but also lack the detailed quality analysis needed for monitoring image quality and guiding image restoration.
In this study, we establish the first large-scale Visual Distortion Assessment Instruction Tuning Dataset for UGC images, termed ViDA-UGC, which comprises 11K images with fine-grained quality grounding, detailed quality perception, and reasoning quality description data. The dataset is constructed through a distortion-oriented pipeline that combines human subject annotation with a Chain-of-Thought (CoT) assessment framework. This framework guides GPT-4o to generate quality descriptions by identifying and analyzing UGC distortions, which helps capture rich low-level visual features that inherently correlate with distortion patterns. Moreover, we carefully select 476 images with 6,149 corresponding question-answer pairs from ViDA-UGC and invite a professional team to verify the accuracy and quality of the GPT-generated information. The selected and revised data further form the first UGC distortion assessment benchmark, termed ViDA-UGC-Bench.
Experimental results demonstrate the effectiveness of ViDA-UGC and the CoT framework: they consistently enhance various image quality analysis abilities across multiple base MLLMs on both ViDA-UGC-Bench and Q-Bench, even surpassing GPT-4o.
Overview of our CoT assessment framework. (a) Direct inference leads to inaccurate low-level information and superficial reasoning. (b) Standard CoT decomposes the task into reasoning steps but suffers from hallucination and inconsistent quality analysis. (c) Our approach integrates human expertise and ground-truth annotations to guide reliable step-wise reasoning and accurate quality prediction.
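As a rough, hypothetical illustration of (c), the sketch below injects human-verified distortion annotations into a step-wise GPT-4o prompt so that each reasoning step is anchored to ground-truth evidence. The message wording, step names, and the `annotations` structure are assumptions for illustration, not the exact prompts used in ViDA-UGC; the image input itself is also omitted for brevity.

```python
from openai import OpenAI  # assumes the official openai>=1.0 Python client

client = OpenAI()

def build_guided_cot_messages(annotations: dict) -> list[dict]:
    """Assemble a step-wise prompt in which human-verified distortion
    annotations act as ground truth, so each reasoning step stays anchored
    to verified low-level evidence rather than free-form guesses."""
    evidence = "\n".join(
        f"- {d['type']} at box {d['box']} (severity: {d['severity']})"
        for d in annotations["distortions"]
    )
    system = (
        "You are an image quality expert. Reason step by step, but every "
        "step must stay consistent with the ground-truth evidence provided."
    )
    user = (
        f"Ground-truth distortion evidence:\n{evidence}\n"
        f"Human-rated overall quality score: {annotations['mos']}\n\n"
        "Step 1: Describe each listed distortion and its visual impact.\n"
        "Step 2: Analyze how the distortions interact with the image content.\n"
        "Step 3: Conclude with an overall quality assessment consistent with the score."
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

# Hypothetical annotation record for one image.
example = {
    "mos": 2.1,
    "distortions": [
        {"type": "motion blur", "box": [120, 80, 340, 260], "severity": "severe"},
        {"type": "underexposure", "box": [0, 0, 512, 512], "severity": "moderate"},
    ],
}
response = client.chat.completions.create(
    model="gpt-4o", messages=build_guided_cot_messages(example)
)
print(response.choices[0].message.content)
```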
An overview of the ViDA-UGC construction pipeline, which consists of four steps. In (a), an MILP sampling strategy ensures an approximately uniform distribution of the sampled images across low-level features; human subjects then annotate distortion bounding boxes and image quality scores. In (b), we mark each box with a number and integrate IQA expertise into the GPT prompt; GPT-4o then outputs textual descriptions of the distortions and their visual attributes. In (c), the generated textual descriptions, together with the human annotations and IQA expertise, serve as ground-truth information in the proposed CoT framework, which GPT-4o uses to generate quality description data. In (d), the quality description data are transformed into distortion-related VQA and MCQ data via GPT-4o, and the distortion attributes generated in (b) are converted into grounding data using pre-defined chat templates.
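As a loose sketch of the sampling idea in step (a), the snippet below formulates selection as a small mixed-integer linear program with PuLP: binary selection variables, a fixed sampling budget, and a minimized worst-case deviation from uniform per-bin counts for each low-level feature. The feature binning, budget, and variable names are illustrative assumptions rather than the paper's exact formulation.

```python
import random
import pulp

# Hypothetical setup: each candidate image falls into one bin (0..NUM_BINS-1)
# for each low-level feature (e.g., brightness, contrast, colorfulness, blur).
NUM_IMAGES, NUM_FEATURES, NUM_BINS, BUDGET = 200, 4, 5, 50
random.seed(0)
bins = [[random.randrange(NUM_BINS) for _ in range(NUM_FEATURES)]
        for _ in range(NUM_IMAGES)]

prob = pulp.LpProblem("uniform_image_sampling", pulp.LpMinimize)
x = [pulp.LpVariable(f"x_{i}", cat="Binary") for i in range(NUM_IMAGES)]
dev = pulp.LpVariable("max_deviation", lowBound=0)

prob += dev  # objective: minimize the worst-case deviation from uniformity
prob += pulp.lpSum(x) == BUDGET  # select exactly BUDGET images

target = BUDGET / NUM_BINS  # ideal number of selected images per (feature, bin) cell
for f in range(NUM_FEATURES):
    for b in range(NUM_BINS):
        count = pulp.lpSum(x[i] for i in range(NUM_IMAGES) if bins[i][f] == b)
        prob += count - target <= dev
        prob += target - count <= dev

prob.solve(pulp.PULP_CBC_CMD(msg=False))
selected = [i for i in range(NUM_IMAGES) if x[i].value() > 0.5]
print(f"selected {len(selected)} images, worst per-bin deviation = {dev.value():.2f}")
```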
To address the limitations of existing benchmarks, which often lack challenging questions and diversity in IQA tasks, we introduce ViDA-UGC-Bench, the first UGC distortion assessment benchmark. We carefully select 476 images together with their distortion triplet samples, 476 overall quality analyses from ViDA-Description, 2,567 multi-choice questions from ViDA-Perception, and 3,106 grounding samples from ViDA-Grounding. A professional team of image-processing researchers then revises the question-answer pairs and verifies the accuracy of the low-level information, minimizing GPT-4o biases. In the figure above, we present one excluded sample and several test questions retained in ViDA-UGC-Bench. The resulting ViDA-UGC-Bench covers all ten UGC distortion types and evaluates MLLMs on three core IQA tasks: distortion grounding, low-level perception, and quality description.
| Model (variant) | Training Dataset | Q-Bench: Dist | Q-Bench: I-C Dist | Q-Bench: Overall | ViDA-UGC-Bench: Type | ViDA-UGC-Bench: Position | ViDA-UGC-Bench: Severity | ViDA-UGC-Bench: Significance | ViDA-UGC-Bench: Overall |
|---|---|---|---|---|---|---|---|---|---|
| Qwen-VL-Chat | no (Baseline) | 50.78% | 53.62% | 56.39% | 31.91% | 37.83% | 31.72% | 36.64% | 34.84% |
| Qwen-VL-Chat | Q-Instruct | 70.43% | 74.34% | 71.51% | 28.50% | 35.79% | 31.72% | 47.12% | 35.88% |
| Qwen-VL-Chat | ViDA-UGC | 78.60% | 79.93% | 73.98% | 60.27% | 54.44% | 74.37% | 69.49% | 63.34% |
| Qwen2-VL-7B | no (Baseline) | 75.49% | 75.66% | 77.19% | 43.43% | 43.53% | 51.26% | 54.15% | 47.53% |
| Qwen2-VL-7B | Q-Instruct | 75.68% | 77.63% | 76.92% | 34.12% | 42.00% | 51.47% | 52.08% | 44.14% |
| Qwen2-VL-7B | ViDA-UGC | 87.16% | 85.20% | 80.60% | 76.37% | 61.17% | 80.46% | 72.20% | 71.45% |
| InternVL2.5-8B | no (Baseline) | 69.07% | 70.07% | 73.98% | 37.08% | 43.78% | 44.96% | 49.04% | 43.51% |
| InternVL2.5-8B | Q-Instruct | 76.65% | 76.64% | 76.45% | 29.99% | 38.32% | 68.07% | 45.53% | 43.40% |
| InternVL2.5-8B | ViDA-UGC | 87.74% | 84.54% | 79.33% | 81.24% | 62.06% | 82.14% | 78.27% | 74.80% |
| InternVL3-8B | no (Baseline) | 70.43% | 71.38% | 74.72% | 43.57% | 47.08% | 47.06% | 52.08% | 47.37% |
| InternVL3-8B | Q-Instruct | 70.43% | 75.99% | 73.18% | 29.10% | 34.90% | 53.15% | 47.28% | 39.77% |
| InternVL3-8B | ViDA-UGC | 82.49% | 79.93% | 77.19% | 82.87% | 61.80% | 78.15% | 72.52% | 73.00% |
| GPT-4o-2024-11-20 | no (Zero-shot) | 75.14% | 78.10% | 78.60% | 46.38% | 60.28% | 53.57% | 59.58% | 55.20% |
Comparison of the low-level Perception ability between baseline MLLMs, their Q-Instruct-tuned versions, and their ViDA-UGC-tuned versions on Q-Bench and ViDA-UGC-Bench. For each base model, the ViDA-UGC-tuned version achieves the highest score on both benchmarks.
| Model (variant) | Training Dataset | Q-Bench: Completeness | Q-Bench: Precision | Q-Bench: Relevance | Q-Bench: Overall | ViDA-UGC-Bench: Completeness | ViDA-UGC-Bench: Precision | ViDA-UGC-Bench: Reasoning | ViDA-UGC-Bench: Overall |
|---|---|---|---|---|---|---|---|---|---|
| Qwen-VL-Chat | no (Baseline)* | 0.72 | 0.53 | 1.86 | 3.11 | 0.20 | 0.58 | 1.56 | 2.34 |
| Qwen-VL-Chat | no (Baseline)† | 0.78 | 0.55 | 1.99 | 3.32 | 0.50 | 0.77 | 2.13 | 3.40 |
| Qwen-VL-Chat | Q-Instruct* | 0.97 | 0.76 | 1.95 | 3.68 | 0.58 | 0.93 | 1.98 | 3.49 |
| Qwen-VL-Chat | ViDA-UGC† | 1.07 | 0.74 | 1.99 | 3.80 | 1.33 | 1.14 | 2.89 | 5.36 |
| Qwen2-VL-7B | no (Baseline)* | 1.30 | 1.06 | 2.00 | 4.36 | 0.70 | 0.73 | 2.78 | 4.21 |
| Qwen2-VL-7B | no (Baseline)† | 1.36 | 1.28 | 2.00 | 4.64 | 0.74 | 0.92 | 2.88 | 4.54 |
| Qwen2-VL-7B | Q-Instruct* | 1.15 | 1.35 | 2.00 | 4.50 | 0.69 | 1.14 | 1.78 | 3.61 |
| Qwen2-VL-7B | ViDA-UGC† | 1.40 | 1.30 | 2.00 | 4.70 | 1.49 | 1.34 | 2.95 | 5.78 |
| InternVL2.5-8B | no (Baseline)* | 0.96 | 0.72 | 1.83 | 3.51 | 0.74 | 0.74 | 2.76 | 4.24 |
| InternVL2.5-8B | no (Baseline)† | 1.15 | 1.03 | 1.94 | 4.12 | 0.75 | 1.25 | 2.90 | 4.90 |
| InternVL2.5-8B | Q-Instruct* | 0.97 | 1.22 | 1.93 | 4.12 | 0.63 | 1.28 | 1.63 | 3.54 |
| InternVL2.5-8B | ViDA-UGC† | 1.22 | 1.23 | 1.99 | 4.44 | 1.46 | 1.32 | 2.94 | 5.72 |
| InternVL3-8B | no (Baseline)* | 1.10 | 0.93 | 1.86 | 3.89 | 0.93 | 0.96 | 2.95 | 4.84 |
| InternVL3-8B | no (Baseline)† | 1.27 | 1.34 | 2.00 | 4.61 | 0.96 | 1.28 | 2.98 | 5.22 |
| InternVL3-8B | Q-Instruct* | 0.98 | 1.32 | 1.95 | 4.25 | 0.64 | 1.25 | 1.63 | 3.52 |
| InternVL3-8B | ViDA-UGC† | 1.34 | 1.35 | 1.99 | 4.68 | 1.51 | 1.36 | 3.00 | 5.87 |
Comparison of the Quality Description ability between baseline MLLMs, their Q-Instruct-tuned versions, and their ViDA-UGC-tuned versions. For rows marked with *, results are obtained under the prompt from Q-Instruct: 'Describe and evaluate the quality of the image. Think step by step.' For rows marked with †, results are obtained under our proposed CoT framework. For each base model, the ViDA-UGC-tuned version achieves the highest score, and the baseline prompted with our CoT framework achieves the second-best score on both benchmarks.
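As a reading aid for the table above, each overall score is the sum of the three per-dimension means (e.g., 0.72 + 0.53 + 1.86 = 3.11 in the first row). The minimal sketch below reproduces that aggregation under the assumption that a judge assigns per-sample scores for each dimension; the variable names and toy ratings are illustrative.

```python
from statistics import mean

# Hypothetical per-sample judge ratings for one model on one benchmark:
# each entry holds the scores assigned to a single generated description.
ratings = [
    {"completeness": 1, "precision": 1, "relevance": 2},
    {"completeness": 0, "precision": 1, "relevance": 2},
    {"completeness": 2, "precision": 0, "relevance": 1},
]

def aggregate(ratings, dims=("completeness", "precision", "relevance")):
    """Per-dimension mean over samples; 'overall' is the sum of those means,
    matching how the overall column above relates to its sub-columns."""
    per_dim = {d: mean(r[d] for r in ratings) for d in dims}
    per_dim["overall"] = sum(per_dim.values())
    return per_dim

print(aggregate(ratings))  # per-dimension means plus 'overall' as their sum
```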
| Method | Referring Grounding: Response Rate | Referring Grounding: Acc@0.5 | Referring Grounding: mIoU | Distortion Grounding: COCO mAP | Region Perception: COCO mAP |
|---|---|---|---|---|---|
| Qwen-VL-Chat | 1.00 | 32.4 | 37.3 | - | - |
| Qwen-VL-Chat-ViDA | 1.00 | 41.3 (+8.9) | 43.4 (+6.1) | 16.8 | 27.1 |
| Qwen2-VL-7B | 0.79 | 24.9 | 29.8 | - | - |
| Qwen2-VL-7B-ViDA | 0.99 | 42.1 (+17.2) | 45.2 (+15.4) | 17.7 | 29.4 |
| InternVL2.5-8B | 0.98 | 29.0 | 37.0 | - | - |
| InternVL2.5-8B-ViDA | 1.00 | 43.3 (+14.3) | 46.4 (+9.4) | 20.0 | 32.3 |
| InternVL3-8B | 0.95 | 25.8 | 33.5 | - | - |
| InternVL3-8B-ViDA | 1.00 | 44.2 (+18.4) | 47.0 (+13.5) | 19.7 | 30.4 |
Comparison of the Referring Grounding, Distortion Grounding, and Region Perception abilities between baselines and ViDA-UGC-tuned versions.
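For clarity on the referring grounding columns, the sketch below computes box IoU, then Acc@0.5 (share of answered queries with IoU ≥ 0.5), mIoU, and the response rate. The (x1, y1, x2, y2) box format and the choice to score only answered queries are assumptions for illustration; COCO mAP would typically come from a library such as pycocotools rather than this snippet.

```python
def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_metrics(predictions, ground_truths):
    """predictions[i] is a predicted box, or None when the model returned no
    parsable box; ground_truths[i] is the annotated box for the same query."""
    answered = [(p, g) for p, g in zip(predictions, ground_truths) if p is not None]
    response_rate = len(answered) / len(predictions)
    ious = [iou(p, g) for p, g in answered]
    acc_05 = sum(v >= 0.5 for v in ious) / len(ious) if ious else 0.0
    miou = sum(ious) / len(ious) if ious else 0.0
    return {"response_rate": response_rate, "acc@0.5": acc_05, "miou": miou}

# Toy usage with two answered queries and one refusal.
preds = [(10, 10, 100, 100), None, (30, 40, 120, 160)]
gts   = [(12, 8, 98, 105), (0, 0, 50, 50), (25, 35, 110, 150)]
print(grounding_metrics(preds, gts))
```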
@misc{liao2025vidaugcdetailedimagequality,
      title={ViDA-UGC: Detailed Image Quality Analysis via Visual Distortion Assessment for UGC Images},
      author={Wenjie Liao and Jieyu Yuan and Yifang Xu and Chunle Guo and Zilong Zhang and Jihong Li and Jiachen Fu and Haotian Fan and Tao Li and Junhui Cui and Chongyi Li},
      year={2025},
      eprint={2508.12605},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.12605},
}
Feel free to contact us at:
jasonliao.21@bytedance.com
fanhaotian@bytedance.com
We are pleased to announce that we have successfully organized the Detailed Image Quality Assessment Challenge in MIPI 2025 using the ViDA-UGC dataset. For more information, please visit the challenge page (https://www.codabench.org/competitions/8156/#/pages-tab) and the official MIPI 2025 website (https://mipi-challenge.org/MIPI2025).