	---
	license: apache-2.0
	language:
	- en
	base_model:
	- Qwen/Qwen2-VL-7B-Instruct
	pipeline_tag: visual-question-answering
	datasets:
	- DiagramAgent/DiagramGenBenchmark
	---

	[📑paper link](https://arxiv.org/abs/2411.11916)



	## Model Card: DiagramAgent/Diagram_to_Code_Agent

	### 1. Model Overview

	- Name: DiagramAgent/Diagram_to_Code_Agent
	- Description: This agent converts a given diagram (a visual representation) into its corresponding structured code.

	### 2. Intended Use

	- Primary Tasks:
	  - Convert existing diagrams into structured code representations.
	  - Support diagram editing workflows by providing a reliable code basis for modifications.
	  - Capture and preserve the implicit logical structure and visual details of diagrams.
	- Application Scenarios:
	  - Automated diagram editing: transforming a diagram into code to enable subsequent modifications.
	  - Reverse engineering of visual diagrams for analysis and reuse.
	  - Enhancing data visualization tools by integrating code-based diagram representations.

	### 3. Architecture and Training Details

	- Base Model: Qwen2-VL-7B-Instruct (`Qwen/Qwen2-VL-7B-Instruct`), a vision-language model.
	- Training Process:
	  - Trained on diverse diagram samples from the DiagramGenBenchmark dataset.
	  - Aims to generate code that is highly consistent with the reference code, so that all diagram elements are accurately captured.
	  - Uses a specialized loss function to reduce the edit distance between the generated and reference code.
	- Module Interaction: Works closely with the Check Agent, which validates the generated code and provides feedback for further refinement.
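
The edit distance mentioned above can be illustrated with a minimal sketch. The card does not describe the loss function itself, so this is not the training objective; it only shows the quantity being reduced, computed here as a character-level Levenshtein distance. The function names and the sample strings are assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn `a` into `b`."""
    # Keep the shorter string in `b` so the rows stay as small as possible.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(generated: str, reference: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(generated), len(reference))
    return 0.0 if longest == 0 else edit_distance(generated, reference) / longest


# Illustrative strings only; they are not taken from DiagramGenBenchmark.
reference_code = "A -> B;"
generated_code = "A -> C;"
print(edit_distance(generated_code, reference_code))
```

A lower value means the generated code is closer to the reference. The normalized variant makes scores comparable across diagrams whose code differs in length.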

	### 4. Usage Examples

	```py
	from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
	from qwen_vl_utils import process_vision_info

	# Load the model on the available device(s)
	model = Qwen2VLForConditionalGeneration.from_pretrained(
	    "DiagramAgent/Diagram_to_Code_Agent", torch_dtype="auto", device_map="auto"
	)

	# Default processor
	processor = AutoProcessor.from_pretrained("DiagramAgent/Diagram_to_Code_Agent")

	# One user turn: the diagram image followed by the text instruction
	messages = [
	    {
	        "role": "user",
	        "content": [
	            {
	                "type": "image",
	                "image": "image path",  # path or URL of the diagram image
	            },
	            {"type": "text", "text": "your input"},  # your instruction
	        ],
	    }
	]

	# Preparation for inference
	text = processor.apply_chat_template(
	    messages, tokenize=False, add_generation_prompt=True
	)
	image_inputs, video_inputs = process_vision_info(messages)
	inputs = processor(
	    text=[text],
	    images=image_inputs,
	    videos=video_inputs,
	    padding=True,
	    return_tensors="pt",
	)
	# Move the inputs to the device the model was loaded on
	inputs = inputs.to(model.device)

	# Inference: generate the output, then drop the prompt tokens from each sequence
	generated_ids = model.generate(**inputs, max_new_tokens=8192)
	generated_ids_trimmed = [
	    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
	]
	output_text = processor.batch_decode(
	    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
	)
	print(output_text)

	```
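
When converting several diagrams, it can help to build the `messages` payload with a small helper. The sketch below is a hypothetical convenience function, not part of the model's API. The card does not specify a prompt, so the instruction text and file names here are assumptions.

```python
from typing import Any


def build_messages(image_path: str, instruction: str) -> list[dict[str, Any]]:
    """Build a single-turn chat payload in the layout used in the example above.

    `image_path` is a local path or URL to the diagram image;
    `instruction` is the text prompt sent alongside it.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": instruction},
            ],
        }
    ]


# Hypothetical file names and instruction, for illustration only.
instruction = "Convert this diagram into code."
batches = [build_messages(path, instruction) for path in ["flow.png", "arch.png"]]
print(len(batches))
```

Each element of `batches` can be passed to `processor.apply_chat_template` and `process_vision_info` in place of `messages`.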

	### 5. Citation

	If you find our work helpful, please cite our paper:


	```
	@inproceedings{wei2024wordsstructuredvisualsbenchmark,
	title={From Words to Structured Visuals: A Benchmark and Framework for Text-to-Diagram Generation and Editing},
	author={Jingxuan Wei and Cheng Tan and Qi Chen and Gaowei Wu and Siyuan Li and Zhangyang Gao and Linzhuang Sun and Bihui Yu and Ruifeng Guo},
	booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
	year={2025}
	}
	```