ViDA-UGC

Detailed Image Quality Analysis via Visual Distortion Assessment for UGC Images

1 VCIP, CS, Nankai University   2 Bytedance Inc  


ViDA-UGC Data Structure

An illustration of the ViDA-UGC data structure. This dataset focuses on 10 common UGC distortions and is split into three sub-datasets: 1) ViDA-Grounding for fine-grained distortion grounding, which is divided into three tasks: distortion grounding, referring grounding, and region perception. 2) ViDA-Perception for detailed low-level perception; its questions come in two formats and cover five concerns, focusing more on distortions. 3) ViDA-Description for reasoning quality description; besides overall quality analysis, we add a new task, individual distortion assessment, which requires the model to assess a specific distortion in detail.
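To make the structure concrete, the sketch below shows what individual records in the three sub-datasets might look like. All field names and example values are illustrative assumptions, not the released schema.

```python
# Illustrative (assumed) record layouts for the three ViDA-UGC sub-datasets.
# Field names and example values are hypothetical, not the released schema.

vida_grounding_sample = {
    "image": "ugc_00123.jpg",
    "task": "distortion grounding",        # or "referring grounding" / "region perception"
    "distortion": "motion blur",           # one of the 10 common UGC distortions
    "bbox": [120, 64, 512, 380],           # [x1, y1, x2, y2] region affected by the distortion
}

vida_perception_sample = {
    "image": "ugc_00123.jpg",
    "format": "multiple-choice",           # one of the two question formats
    "concern": "severity",                 # one of the five distortion-oriented concerns
    "question": "How severe is the motion blur in the lower half of the image?",
    "options": ["Negligible", "Minor", "Moderate", "Severe"],
    "answer": "Severe",
}

vida_description_sample = {
    "image": "ugc_00123.jpg",
    "task": "individual distortion assessment",  # or "overall quality analysis"
    "target_distortion": "motion blur",
    "description": "The lower half of the frame shows strong directional smearing ...",
}
```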

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have introduced a paradigm shift for Image Quality Assessment (IQA) from unexplainable image quality scoring to explainable IQA, demonstrating practical applications like quality control and optimization guidance. However, current explainable IQA methods not only inappropriately apply the same distortion criteria to both User-Generated Content (UGC) and AI-Generated Content (AIGC) images, but also lack the detailed quality analysis needed to monitor image quality and guide image restoration.

In this study, we establish the first large-scale Visual Distortion Assessment Instruction Tuning Dataset for UGC images, termed ViDA-UGC, which comprises 11K images with fine-grained quality grounding, detailed quality perception, and reasoning quality description data. The dataset is constructed through a distortion-oriented pipeline that involves human subject annotation and a Chain-of-Thought (CoT) assessment framework. This framework guides GPT-4o to generate quality descriptions by identifying and analyzing UGC distortions, which helps capture rich low-level visual features that inherently correlate with distortion patterns. Moreover, we carefully select 476 images, with 6,149 corresponding question-answer pairs, from ViDA-UGC and invite a professional team to ensure the accuracy and quality of the GPT-generated information. The selected and revised data further constitute the first UGC distortion assessment benchmark, termed ViDA-UGC-Bench.

Experimental results demonstrate the effectiveness of ViDA-UGC and the CoT framework in consistently enhancing various image quality analysis abilities across multiple base MLLMs on ViDA-UGC-Bench and Q-Bench, even surpassing GPT-4o.

Data Construction

Chain-of-Thought Assessment Framework

Overview of our CoT assessment framework. (a) Direct inference leads to inaccurate low-level information and superficial reasoning. (b) Standard CoT decomposes the task into reasoning steps but suffers from hallucination and inconsistent quality analysis. (c) Our approach integrates human expertise and ground-truth annotations to guide reliable step-wise reasoning and accurate quality prediction.
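As a rough illustration of panel (c), the sketch below shows how human annotations and IQA expertise could be assembled into a step-wise prompt so that each reasoning step is grounded in ground-truth information rather than free-form guesses. The prompt wording, step breakdown, and field names are our assumptions, not the exact prompts used in the paper.

```python
# A minimal sketch of the guided CoT idea in panel (c): every reasoning step is
# conditioned on human annotations and IQA expertise. Wording is an assumption.

def build_cot_prompt(distortion_boxes, quality_score, iqa_expertise):
    """distortion_boxes: dicts with 'type', 'bbox', and textual 'attributes';
    quality_score: human-annotated score; iqa_expertise: textual criteria."""
    gt_lines = [
        f"- Box {i + 1}: {d['type']} at {d['bbox']}, attributes: {d['attributes']}"
        for i, d in enumerate(distortion_boxes)
    ]
    return "\n".join([
        "You are an image quality expert. Analyze the image step by step.",
        "IQA expertise:",
        iqa_expertise,
        "Ground-truth distortion annotations (do not contradict them):",
        *gt_lines,
        f"Human-annotated quality score: {quality_score}",
        "Step 1: Describe each annotated distortion and its visual attributes.",
        "Step 2: Analyze how each distortion affects perceived quality.",
        "Step 3: Summarize the overall quality, consistent with the score above.",
    ])
```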

Annotation Pipeline

An overview of the ViDA-UGC construction pipeline, which includes four steps. In (a), an MILP sampling strategy ensures an approximately uniform distribution of sampled images across low-level features, and human subjects annotate distortion boxes and image quality scores. In (b), we mark each box with a number and integrate IQA expertise into the GPT prompt; GPT-4o then outputs textual descriptions of the distortions and their visual attributes. In (c), the generated textual descriptions, together with the human annotations and IQA expertise, serve as ground-truth information in the proposed CoT framework, which GPT-4o exploits to generate quality description data. In (d), the quality description data are transformed into distortion-related VQA and MCQ data via GPT-4o, and the distortion attributes generated in (b) are converted into grounding data using pre-defined chat templates.
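For step (a), one way such an MILP sampling strategy could be formulated is sketched below with PuLP: select a fixed number of images while minimizing the worst deviation from a uniform per-bin count over quantized low-level features. The binning, variable names, and objective are our assumptions, not the authors' exact formulation.

```python
# A minimal MILP sketch (assumed formulation) for sampling images so that the
# selected set is spread approximately uniformly over low-level feature bins.
import pulp

def milp_sample(feature_bins, n_select):
    """feature_bins[i]: bin ids of image i, one per low-level feature,
    e.g. [("brightness", 2), ("contrast", 0), ("colorfulness", 3)].
    Returns the indices of the selected images."""
    n_images = len(feature_bins)
    all_bins = sorted({b for bins in feature_bins for b in bins})
    # Ideal number of selected images per bin if coverage were perfectly uniform.
    target = n_select * len(feature_bins[0]) / len(all_bins)

    prob = pulp.LpProblem("uniform_sampling", pulp.LpMinimize)
    x = [pulp.LpVariable(f"x_{i}", cat="Binary") for i in range(n_images)]
    dev = pulp.LpVariable("max_deviation", lowBound=0)

    prob += dev                                   # minimize the worst per-bin deviation
    prob += pulp.lpSum(x) == n_select             # select exactly n_select images
    for b in all_bins:
        count_b = pulp.lpSum(x[i] for i in range(n_images) if b in feature_bins[i])
        prob += count_b - target <= dev
        prob += target - count_b <= dev

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [i for i in range(n_images) if x[i].value() == 1]
```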

Benchmark

ViDA-UGC-Bench Workflow

To address the limitations of existing benchmarks, which often lack challenging questions and diversity in IQA tasks, we introduce ViDA-UGC-Bench, the first UGC distortion assessment benchmark. We carefully select 476 images together with their distortion triplet samples, 476 overall quality analyses from ViDA-Description, 2,567 multiple-choice questions from ViDA-Perception, and 3,106 grounding samples from ViDA-Grounding. Based on the collected images and data, a professional team of image-processing researchers revises the question-answer pairs and verifies the accuracy of the low-level information, aiming to minimize GPT-4o biases. In the figure above, we present one excluded sample and several test questions retained in ViDA-UGC-Bench. The resulting ViDA-UGC-Bench covers all ten UGC distortions and tests MLLMs on three core IQA tasks: distortion grounding, low-level perception, and quality description.
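As an illustration, the multiple-choice portion of the benchmark can be scored per concern with a loop like the one below; the record fields and the exact answer-matching rule are assumptions for illustration, not the official evaluation script.

```python
# A minimal sketch (assumed record fields) of scoring ViDA-UGC-Bench
# multiple-choice questions per concern (e.g. type / position / severity /
# significance) plus an overall accuracy.
from collections import defaultdict

def score_mcq(samples, predict):
    """samples: dicts with 'image', 'question', 'options', 'answer', 'concern';
    predict: callable returning the model's chosen option as a string."""
    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        pred = predict(s["image"], s["question"], s["options"]).strip().lower()
        hit = pred == s["answer"].strip().lower()
        for key in (s["concern"], "overall"):
            total[key] += 1
            correct[key] += int(hit)
    return {k: correct[k] / total[k] for k in total}
```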

Comparison

Performance on the low-level perception ability

| Model (variant) | Training Dataset | Q-Bench: Dist | Q-Bench: I-C Dist | Q-Bench: Overall | ViDA-UGC-Bench: Type | ViDA-UGC-Bench: Position | ViDA-UGC-Bench: Severity | ViDA-UGC-Bench: Significance | ViDA-UGC-Bench: Overall |
|---|---|---|---|---|---|---|---|---|---|
| Qwen-VL-Chat | no (Baseline) | 50.78% | 53.62% | 56.39% | 31.91% | 37.83% | 31.72% | 36.64% | 34.84% |
| Qwen-VL-Chat | Q-Instruct | 70.43% | 74.34% | 71.51% | 28.50% | 35.79% | 31.72% | 47.12% | 35.88% |
| Qwen-VL-Chat | ViDA-UGC | 78.60% | 79.93% | 73.98% | 60.27% | 54.44% | 74.37% | 69.49% | 63.34% |
| Qwen2-VL-7B | no (Baseline) | 75.49% | 75.66% | 77.19% | 43.43% | 43.53% | 51.26% | 54.15% | 47.53% |
| Qwen2-VL-7B | Q-Instruct | 75.68% | 77.63% | 76.92% | 34.12% | 42.00% | 51.47% | 52.08% | 44.14% |
| Qwen2-VL-7B | ViDA-UGC | 87.16% | 85.20% | 80.60% | 76.37% | 61.17% | 80.46% | 72.20% | 71.45% |
| InternVL2.5-8B | no (Baseline) | 69.07% | 70.07% | 73.98% | 37.08% | 43.78% | 44.96% | 49.04% | 43.51% |
| InternVL2.5-8B | Q-Instruct | 76.65% | 76.64% | 76.45% | 29.99% | 38.32% | 68.07% | 45.53% | 43.40% |
| InternVL2.5-8B | ViDA-UGC | 87.74% | 84.54% | 79.33% | 81.24% | 62.06% | 82.14% | 78.27% | 74.80% |
| InternVL3-8B | no (Baseline) | 70.43% | 71.38% | 74.72% | 43.57% | 47.08% | 47.06% | 52.08% | 47.37% |
| InternVL3-8B | Q-Instruct | 70.43% | 75.99% | 73.18% | 29.10% | 34.90% | 53.15% | 47.28% | 39.77% |
| InternVL3-8B | ViDA-UGC | 82.49% | 79.93% | 77.19% | 82.87% | 61.80% | 78.15% | 72.52% | 73.00% |
| GPT-4o-2024-11-20 | no (Zero-shot) | 75.14% | 78.10% | 78.60% | 46.38% | 60.28% | 53.57% | 59.58% | 55.20% |

Comparison of low-level perception ability between baseline MLLMs, Q-Instruct-tuned versions, and ViDA-UGC-tuned versions on both Q-Bench and ViDA-UGC-Bench. For each baseline model, the ViDA-UGC-tuned version achieves the highest scores on both benchmarks.

Performance on the quality description ability

| Model (variant) | Training Dataset | Q-Bench: Completeness | Q-Bench: Precision | Q-Bench: Relevance | Q-Bench: Overall | ViDA-UGC-Bench: Completeness | ViDA-UGC-Bench: Precision | ViDA-UGC-Bench: Reasoning | ViDA-UGC-Bench: Overall |
|---|---|---|---|---|---|---|---|---|---|
| Qwen-VL-Chat | no (Baseline)* | 0.72 | 0.53 | 1.86 | 3.11 | 0.20 | 0.58 | 1.56 | 2.34 |
| Qwen-VL-Chat | no (Baseline)† | 0.78 | 0.55 | 1.99 | 3.32 | 0.50 | 0.77 | 2.13 | 3.40 |
| Qwen-VL-Chat | Q-Instruct* | 0.97 | 0.76 | 1.95 | 3.68 | 0.58 | 0.93 | 1.98 | 3.49 |
| Qwen-VL-Chat | ViDA-UGC | 1.07 | 0.74 | 1.99 | 3.80 | 1.33 | 1.14 | 2.89 | 5.36 |
| Qwen2-VL-7B | no (Baseline)* | 1.30 | 1.06 | 2.00 | 4.36 | 0.70 | 0.73 | 2.78 | 4.21 |
| Qwen2-VL-7B | no (Baseline)† | 1.36 | 1.28 | 2.00 | 4.64 | 0.74 | 0.92 | 2.88 | 4.54 |
| Qwen2-VL-7B | Q-Instruct* | 1.15 | 1.35 | 2.00 | 4.50 | 0.69 | 1.14 | 1.78 | 3.61 |
| Qwen2-VL-7B | ViDA-UGC | 1.40 | 1.30 | 2.00 | 4.70 | 1.49 | 1.34 | 2.95 | 5.78 |
| InternVL2.5-8B | no (Baseline)* | 0.96 | 0.72 | 1.83 | 3.51 | 0.74 | 0.74 | 2.76 | 4.24 |
| InternVL2.5-8B | no (Baseline)† | 1.15 | 1.03 | 1.94 | 4.12 | 0.75 | 1.25 | 2.90 | 4.90 |
| InternVL2.5-8B | Q-Instruct* | 0.97 | 1.22 | 1.93 | 4.12 | 0.63 | 1.28 | 1.63 | 3.54 |
| InternVL2.5-8B | ViDA-UGC | 1.22 | 1.23 | 1.99 | 4.44 | 1.46 | 1.32 | 2.94 | 5.72 |
| InternVL3-8B | no (Baseline)* | 1.10 | 0.93 | 1.86 | 3.89 | 0.93 | 0.96 | 2.95 | 4.84 |
| InternVL3-8B | no (Baseline)† | 1.27 | 1.34 | 2.00 | 4.61 | 0.96 | 1.28 | 2.98 | 5.22 |
| InternVL3-8B | Q-Instruct* | 0.98 | 1.32 | 1.95 | 4.25 | 0.64 | 1.25 | 1.63 | 3.52 |
| InternVL3-8B | ViDA-UGC | 1.34 | 1.35 | 1.99 | 4.68 | 1.51 | 1.36 | 3.00 | 5.87 |

Comparison of quality description ability between baseline MLLMs, Q-Instruct-tuned versions, and ViDA-UGC-tuned versions. For rows marked with *, results are obtained with the prompt from Q-Instruct: 'Describe and evaluate the quality of the image. Think step by step.' For rows marked with †, results are obtained under our proposed CoT framework. For each baseline model, the ViDA-UGC-tuned version achieves the highest score and our proposed CoT framework achieves the second-best score on both benchmarks.
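For reference, description scores of this kind are typically obtained with an LLM judge that compares each generated description against the reference analysis. The sketch below shows one such setup using the OpenAI client; the rubric wording, score parsing, and dimension scales are our assumptions, not the paper's exact protocol.

```python
# A minimal sketch (not the paper's exact protocol) of LLM-assisted scoring of a
# generated quality description along completeness, precision, and reasoning.
from openai import OpenAI

client = OpenAI()

def judge_description(reference, candidate):
    rubric = (
        "You compare a candidate image quality description against a reference "
        "analysis. Rate the candidate on completeness, precision, and reasoning, "
        "and reply with exactly three numbers separated by commas."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": f"Reference:\n{reference}\n\nCandidate:\n{candidate}"},
        ],
    )
    return [float(x) for x in resp.choices[0].message.content.split(",")]
```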

Performance on the quality grounding ability

| Method | Referring Grounding: Response Rate | Referring Grounding: Acc@0.5 | Referring Grounding: mIoU | Distortion Grounding: COCO mAP | Region Perception: COCO mAP |
|---|---|---|---|---|---|
| Qwen-VL-Chat | 1.00 | 32.4 | 37.3 | - | - |
| Qwen-VL-Chat-ViDA | 1.00 | 41.3 (+8.9) | 43.4 (+6.1) | 16.8 | 27.1 |
| Qwen2-VL-7B | 0.79 | 24.9 | 29.8 | - | - |
| Qwen2-VL-7B-ViDA | 0.99 | 42.1 (+17.2) | 45.2 (+15.4) | 17.7 | 29.4 |
| InternVL2.5-8B | 0.98 | 29.0 | 37.0 | - | - |
| InternVL2.5-8B-ViDA | 1.00 | 43.3 (+14.3) | 46.4 (+9.4) | 20.0 | 32.3 |
| InternVL3-8B | 0.95 | 25.8 | 33.5 | - | - |
| InternVL3-8B-ViDA | 1.00 | 44.2 (+18.4) | 47.0 (+13.5) | 19.7 | 30.4 |

Comparison of the Referring Grounding, Distortion Grounding, and Region Perception abilities between baselines and ViDA-UGC-tuned versions.
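For clarity, the referring-grounding metrics in the table can be computed from predicted and ground-truth boxes as sketched below (boxes assumed to be [x1, y1, x2, y2]). The convention of computing mIoU and Acc@0.5 only over answered queries is an assumption, and COCO mAP would normally be computed with pycocotools, so it is omitted here.

```python
# A minimal sketch of the box-level metrics: per-sample IoU, mean IoU,
# accuracy at IoU >= 0.5, and response rate (fraction of queries with a box).

def box_area(r):
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def referring_grounding_metrics(predictions, ground_truths):
    """predictions: boxes, or None when no parsable box was returned;
    ground_truths: boxes of the same length."""
    answered = [(p, g) for p, g in zip(predictions, ground_truths) if p is not None]
    response_rate = len(answered) / len(ground_truths)
    ious = [iou(p, g) for p, g in answered]
    miou = sum(ious) / len(ious) if ious else 0.0
    acc_05 = sum(v >= 0.5 for v in ious) / len(ious) if ious else 0.0
    return {"response_rate": response_rate, "acc@0.5": acc_05, "mIoU": miou}
```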

Open Source Resources

GitHub Repository

Complete implementation code and usage examples

Visit GitHub

Datasets

Download links for training and evaluation datasets

Download Data

Pre-trained Models

ViDA-UGC-tuned model checkpoint and backbone

Download Models

Citation

BibTeX
@misc{liao2025vidaugcdetailedimagequality,
      title={ViDA-UGC: Detailed Image Quality Analysis via Visual Distortion Assessment for UGC Images}, 
      author={Wenjie Liao and Jieyu Yuan and Yifang Xu and Chunle Guo and Zilong Zhang and Jihong Li and Jiachen Fu and Haotian Fan and Tao Li and Junhui Cui and Chongyi Li},
      year={2025},
      eprint={2508.12605},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.12605}, 
}

Contact

Feel free to contact us at:

jasonliao.21@bytedance.com

fanhaotian@bytedance.com

Academic Services

We are pleased to announce that we have successfully organized the Detailed Image Quality Assessment Challenge in MIPI 2025 using the ViDA-UGC dataset. For more information, please visit the challenge page (https://www.codabench.org/competitions/8156/#/pages-tab) and the official MIPI 2025 website (https://mipi-challenge.org/MIPI2025).