GUI-ReWalk: Massive Data Generation for GUI Agent via Stochastic Exploration and Intent-Aware Reasoning

Musen Lin*, Minghao Liu*, Taoran Lu*,†, Lichen Yuan*, Yiwei Liu, Haonan Xu, Yu Miao, Yuhao Chao, Zhaojian Li
ByteDance, UCAS
*Indicates Equal Contribution, †Indicates Corresponding Author

Illustration of GUI-ReWalk Characteristics: Multi-Platform Coverage, Long-Tail Patterns, Reflective Learning, and Multi-Stride Workflows.

Abstract

Graphical User Interface (GUI) Agents, powered by large language and vision-language models, hold promise for enabling end-to-end automation in digital environments. However, their progress is fundamentally constrained by the scarcity of scalable, high-quality trajectory data. Existing data collection strategies either rely on costly and inconsistent manual annotations or on synthetic generation methods that trade off between diversity and meaningful task coverage. To bridge this gap, we present GUI-ReWalk: a reasoning-enhanced, multi-stage framework for synthesizing realistic and diverse GUI trajectories. GUI-ReWalk begins with a stochastic exploration phase that emulates human trial-and-error behaviors, and progressively transitions into a reasoning-guided phase where inferred goals drive coherent and purposeful interactions. Moreover, it supports multi-stride task generation, enabling the construction of long-horizon workflows across multiple applications. By combining randomness for diversity with goal-aware reasoning for structure, GUI-ReWalk produces data that better reflects the intent-aware, adaptive nature of human-computer interaction. We further train Qwen2.5-VL-7B on the GUI-ReWalk dataset and evaluate it across multiple benchmarks, including Screenspot-Pro, OSWorld-G, UI-Vision, AndroidControl, and GUI-Odyssey. Results demonstrate that GUI-ReWalk enables superior coverage of diverse interaction flows, higher trajectory entropy, and more realistic user intent. These findings establish GUI-ReWalk as a scalable and data-efficient framework for advancing GUI agent research and enabling robust real-world automation.

Pipeline


Overview of GUI-ReWalk Framework. Starting from a random app, GUI-ReWalk performs Random Walk by selecting actions and interacting with elements step by step; it then transitions to Task-Guided Completion to complete minimal-step tasks forming a stride, followed by Cross-Application Task Initiation to propose and execute new tasks in related apps. After each sub-stage, Retrospective Annotation records executed actions and GUI states. This cycle repeats across multiple strides to generate complete trajectories and overall task objectives.
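The cycle described above can be sketched as a toy simulation. This is only an illustrative sketch, not the authors' implementation: the `APPS` and `RELATED` tables, the `Step` and `Trajectory` types, and every function name are hypothetical. The reasoning model that infers goals is replaced by a trivial stand-in that finishes whatever actions the random walk has not yet tried.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Step:
    stage: str   # "random_walk", "task_guided", or "cross_app"
    app: str
    action: str


@dataclass
class Trajectory:
    steps: list = field(default_factory=list)
    annotations: list = field(default_factory=list)  # one entry per sub-stage


# Hypothetical toy environment: each app exposes a few named actions.
APPS = {
    "mail": ["open_inbox", "click_compose", "type_subject"],
    "calendar": ["open_day", "click_new_event", "type_title"],
    "files": ["open_folder", "click_upload", "rename_file"],
}
# Hypothetical "related app" map standing in for the model's task proposal.
RELATED = {"mail": "calendar", "calendar": "files", "files": "mail"}


def random_walk(app, n_steps, rng):
    """Stochastic exploration: pick actions uniformly at random."""
    return [Step("random_walk", app, rng.choice(APPS[app])) for _ in range(n_steps)]


def task_guided_completion(app, walked):
    """Stand-in for goal inference: finish with the actions not yet tried."""
    seen = {s.action for s in walked}
    return [Step("task_guided", app, a) for a in APPS[app] if a not in seen]


def annotate(traj, steps):
    """Retrospective annotation: record what was executed in a sub-stage."""
    traj.steps.extend(steps)
    traj.annotations.append(" -> ".join(f"{s.app}.{s.action}" for s in steps))


def generate_trajectory(n_strides=3, walk_len=2, seed=0):
    rng = random.Random(seed)
    app = rng.choice(sorted(APPS))  # start from a random app
    traj = Trajectory()
    for i in range(n_strides):
        walked = random_walk(app, walk_len, rng)
        annotate(traj, walked)
        annotate(traj, task_guided_completion(app, walked))
        if i < n_strides - 1:  # hop to a related app to start the next stride
            app = RELATED[app]
            annotate(traj, [Step("cross_app", app, "launch")])
    return traj


if __name__ == "__main__":
    for line in generate_trajectory().annotations:
        print(line)
```

Each stride contributes one annotation per sub-stage, so a three-stride run yields eight annotation entries (two per stride plus two cross-application hops). In a real system the annotations would also capture GUI states such as screenshots, which this sketch omits.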

Data Statistics

GUI-ReWalk Dataset Composition Across Application Domains.


Comparison of GUI-ReWalk and Other GUI Datasets

Dataset            Env.              Ann.   Dom/AxT.  Thoughts  Tasks  Avg. Step
AndroidControl     Mobile            Human            Short     15283  5.5
AMEX               Mobile            Human                      2991   11.9
AitW               Mobile            Human                      2346   8.1
AitZ               Mobile            Human            Short     1987   6.0
GUI-Odyssey        Mobile            Human                      7735   15.3
OS-Genesis         Mobile & Web      Model            Short     2451   6.4
WonderBread        Web               Human                      598    8.4
AgentTrek          Web               Model            Short     10398  12.1
Mind2Web           Web               Human                      2350   7.3
GUIAct             Web               Human                      2482   6.7
AgentNet           Desktop           Human            Long      22625  18.6
GUI-ReWalk (Ours)  Mobile & Desktop  Model            Long      50k+   22.5

Experimental Results

▶ Grounding Capability

Results on Screenspot-Pro benchmark.

                          CAD         DEV       Creative   Scientific    Office        OS            Avg
Model                   Text  Icon  Text  Icon  Text  Icon  Text  Icon  Text  Icon  Text  Icon  Text  Icon   Avg
GPT-4o                   2.0   0.0   1.3   0.0   1.0   0.0   2.1   0.0   1.1   0.0   0.0   0.0   1.3   0.0   0.8
SeeClick-9.6B            2.5   0.0   0.6   0.0   1.0   0.0   3.5   0.0   1.1   0.0   2.8   0.0   1.8   0.0   1.1
OS-Atlas-7B             12.2   4.7  33.1   1.4  28.8   2.8  37.5   7.3  33.9   5.7  27.1   4.5  28.1   4.0  18.9
UGround-7B              14.2   1.6  26.6   2.1  27.3   2.8  31.9   2.7  31.6  11.3  17.8   0.0  25.0   2.8  16.5
UI-TARS-1.5-7B          49.2  17.2  56.5  15.9  60.1  14.7  74.3  24.5  81.4  43.4  55.1  18.0  62.7  20.0  46.4
Qwen2.5-VL-7B           17.2   3.1  35.1   2.1  23.2   6.3  36.1   6.4  41.8  11.3  28.0  13.5  29.7   6.5  20.8
GUI-ReWalk-7B (ours)    35.0  17.9  46.8  11.0  40.9   9.8  60.4  28.2  56.5  28.3  39.2  19.1  46.2  17.2  35.1

Results on OSWorld-G benchmark.

Model                   Text Matching  Element Recognition  Layout Understanding  Fine-grained Manipulation  Refusal   Avg
UGround-7B                       51.3                 40.3                  43.5                       24.8        -  36.4
UI-TARS-1.5-7B                   59.8                 43.0                  50.6                       37.6        -  47.5
Qwen2.5-VL-7B                    23.0                 15.5                  19.0                       11.4        -  16.8
GUI-ReWalk-7B (ours)             35.2                 30.0                  31.2                       16.1        -  27.5

▶ Navigation Capability

Results on AndroidControl and GUI-Odyssey benchmarks.

                        AndroidControl-Low    AndroidControl-High       GUI-Odyssey
Model                   Type Acc.         SR  Type Acc.         SR  Type Acc.         SR
GPT-4o                       74.3       19.4       66.3       20.8       34.3        3.3
SeeClick-9.6B                93.0       75.0       82.9       59.1       71.0       53.9
OS-Atlas-7B                  93.6       85.2       85.2       71.2       84.5       62.0
OS-Genesis-7B                90.7       74.2       66.2       44.5          -          -
Qwen2.5-VL-7B                91.8       85.0       70.9       69.8       59.5       46.3
GUI-ReWalk-7B (ours)         91.7       96.3       73.1       66.2       69.6       64.2

Notes:

  • Type Acc.: Type Accuracy
  • SR: Step Success Rate
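
The page does not spell out how the two metrics are computed. As a rough sketch, assume the common convention for step-level GUI navigation benchmarks: Type Accuracy counts a step as correct when the predicted action type matches the reference, and Step Success Rate additionally requires the action arguments to match. The dictionary layout (`type`, `args`) and the exact-match rule for arguments are assumptions made for illustration. Real benchmarks typically allow some tolerance, for example on click coordinates.

```python
def type_accuracy(preds, golds):
    """Fraction of steps whose predicted action type matches the reference."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(p["type"] == g["type"] for p, g in zip(preds, golds))
    return hits / len(golds)


def step_success_rate(preds, golds):
    """Fraction of steps where both action type and arguments match.

    Assumption: arguments are compared by exact equality.
    """
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(
        p["type"] == g["type"] and p.get("args") == g.get("args")
        for p, g in zip(preds, golds)
    )
    return hits / len(golds)


# Hypothetical four-step episode.
golds = [
    {"type": "click", "args": {"target": "Compose"}},
    {"type": "type", "args": {"text": "Hello"}},
    {"type": "scroll", "args": {"direction": "down"}},
    {"type": "click", "args": {"target": "Send"}},
]
preds = [
    {"type": "click", "args": {"target": "Compose"}},  # fully correct
    {"type": "type", "args": {"text": "Helo"}},        # right type, wrong text
    {"type": "click", "args": {"target": "Send"}},     # wrong type
    {"type": "click", "args": {"target": "Send"}},     # fully correct
]

print(type_accuracy(preds, golds))      # 0.75
print(step_success_rate(preds, golds))  # 0.5
```

Under this convention every successful step also has the correct type, so on the same set of steps SR cannot exceed Type Acc.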

BibTeX


@misc{lin2025guirewalkmassivedatageneration,
  title={GUI-ReWalk: Massive Data Generation for GUI Agent via Stochastic Exploration and Intent-Aware Reasoning},
  author={Musen Lin and Minghao Liu and Taoran Lu and Lichen Yuan and Yiwei Liu and Haonan Xu and Yu Miao and Yuhao Chao and Zhaojian Li},
  year={2025},
  eprint={2509.15738},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2509.15738},
}