World Action Models: A Survey

Dream Less, Act More

Qiuhong Shen, Shihua Zhang, Yue Liao, Qi Li, Zhenxiong Tan, Shizun Wang,
Shuicheng Yan, Xinchao Wang

National University of Singapore

A structured survey of predictive-action models, which make a forecast of the future available to the action path. This homepage mirrors the paper’s two central views: a philosophy-level taxonomy and a component-level anatomy.

Community

GitHub Discussions and the WeChat group are open.

Use GitHub Discussions to ask questions, suggest missing WAM papers, debate taxonomy placement, and connect with researchers following World Action Models, video world models, VLAs, robot learning, and embodied predictive-action methods.

GitHub Discussions

Join the paper-list discussion

Introduce yourself, ask questions, or suggest updates to the WAM paper list and taxonomy.

WeChat Group

Join the WAM Survey WeChat group

Scan the QR code in the discussion thread to join the Chinese community chat for WAM survey readers.

Definition

A WAM is not just a video generator with an action head.

In the survey, a World Action Model begins when a predicted future becomes action-facing. The future may be rendered pixels, latents, features, flow, affordance maps, audio, or tokens. What matters is that the predicted future helps produce, score, or train the action path.

Notation: o = observation; l = language or goal; a = action; o′ = future observation; p(x | y) = conditional model.

Vision-Language-Action

acts from the current context — no predicted future

p(a | o, l)

World model

a what-if simulator: action in, future out — emits no action

p(o′ | o, a, l)

World Action Model

the predicted future helps produce, score, or train the action

p(o′, a | o, l)

Three different shapes, three different things: a camera frame = what the model observes now (o), a thought bubble = the future it predicts (o′), a joystick = the action it outputs (a). A world model stops at the bubble; a WAM routes that bubble forward, highlighted in the accent color, into the action.

Boundary: the predicted future must help produce, score, or train action.

1 · Predict then act: future first, action second. p(o′ | o, l), then p(a | o, o′, l)

2 · Score actions: candidate action, then its predicted consequence (a → o′). p(o′ | o, a, l)

3 · Joint prediction: future and action in one coupled model, sharing one action-facing path. p(o′, a | o, l)

Not WAM: a future head discarded before action use, or a simulator used only outside the policy path.

A simplified Section 2 definition: VLA predicts action, world model predicts future, and WAM links future prediction to action.
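
One way to relate the three patterns above, offered as a reading aid rather than as the survey's own derivation: the chain rule factorizes the joint distribution p(o′, a | o, l) in two orders. The first line matches the predict-then-act pattern. The second pairs an action proposal with a world model, which is loosely the shape of the score-actions pattern. How any given WAM actually parameterizes or trains these terms is not specified here.

```latex
\begin{align}
p(o', a \mid o, l) &= p(o' \mid o, l)\; p(a \mid o, o', l) \\
                   &= p(a \mid o, l)\; p(o' \mid o, a, l)
\end{align}
```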

Neighbouring models

VLA: maps observation and instruction directly to action. World model: predicts a future observation or state. Either can be useful without being a WAM.

WAM contract

A WAM keeps the predicted future in the action path. The action may come after prediction, be scored by prediction, or be generated jointly with prediction.
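
The three routes in the contract can be sketched with toy stand-ins. This is a minimal illustration, not code from the survey: the 1-D observation, the numeric goal standing in for language l, and every function name below are assumptions chosen so the example runs on its own.

```python
from typing import Callable, Sequence, Tuple

# Toy stand-ins (assumptions): a 1-D observation, a numeric goal in place of
# the language instruction l, and a displacement as the action.
Obs = float
Goal = float
Action = float


def predict_then_act(
    obs: Obs,
    goal: Goal,
    predict_future: Callable[[Obs, Goal], Obs],
    inverse_dynamics: Callable[[Obs, Obs], Action],
) -> Tuple[Obs, Action]:
    """Pattern 1: predict o' from (o, l), then decode a from (o, o')."""
    future = predict_future(obs, goal)
    action = inverse_dynamics(obs, future)
    return future, action


def score_actions(
    obs: Obs,
    goal: Goal,
    candidates: Sequence[Action],
    world_model: Callable[[Obs, Action], Obs],
    score: Callable[[Obs, Goal], float],
) -> Tuple[Obs, Action]:
    """Pattern 2: predict the consequence of each candidate a, keep the best."""
    best = max(candidates, key=lambda a: score(world_model(obs, a), goal))
    return world_model(obs, best), best


def joint_prediction(
    obs: Obs,
    goal: Goal,
    joint_model: Callable[[Obs, Goal], Tuple[Obs, Action]],
) -> Tuple[Obs, Action]:
    """Pattern 3: one coupled model emits (o', a) together."""
    return joint_model(obs, goal)


# Hypothetical toy components, chosen only to make the sketch executable.
def toy_predict_future(obs: Obs, goal: Goal) -> Obs:
    return obs + 0.5 * (goal - obs)  # imagine moving halfway to the goal


def toy_inverse_dynamics(obs: Obs, future: Obs) -> Action:
    return future - obs  # the displacement that realizes the imagined future


def toy_world_model(obs: Obs, action: Action) -> Obs:
    return obs + action  # action in, future out


def toy_score(future: Obs, goal: Goal) -> float:
    return -abs(goal - future)  # a future closer to the goal scores higher


def toy_joint_model(obs: Obs, goal: Goal) -> Tuple[Obs, Action]:
    step = 0.5 * (goal - obs)
    return obs + step, step  # future and action share one computation


if __name__ == "__main__":
    print(predict_then_act(0.0, 1.0, toy_predict_future, toy_inverse_dynamics))
    print(score_actions(0.0, 1.0, [-0.5, 0.25, 0.5, 1.0], toy_world_model, toy_score))
    print(joint_prediction(0.0, 1.0, toy_joint_model))
```

In all three functions the predicted future is consumed before the action is returned, which is the property the contract asks for. A policy that computed a future and then ignored it when choosing the action would fall outside this sketch.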

Boundary

A direct VLA with an auxiliary future loss, a simulator used only for RL training, or a future head discarded before action use does not satisfy the WAM definition.

Chronological map

How the WAM literature unfolds.

The timeline groups representative works by design philosophy. Render-and-Decode appears first, followed by Latent-Only shortcuts and Video-Generation-Free methods that carry predictive supervision outside video generation.

Generated from the paper collection — each mark is a paper at its first arXiv date, placed in its design-philosophy lane.

Fine-grained paper list

Explore papers by taxonomy and component anatomy.

The list is generated from the current Section 4 paper table, then enriched with arXiv links, first-version dates, and citation counts that refresh weekly.

Taxonomy × substrate

Orthogonal map

Living metadata

Citation counts refresh weekly.

Self-contained repository

The GitHub Pages repository includes the page, assets, paper data, and the weekly update workflow.

Survey-aligned data

The paper list is generated from the active survey paper table and bibliography, not from the older slide deck.

Metadata sources

Exact first-version dates come from arXiv. Citation counts refresh from Semantic Scholar with OpenAlex fallback.
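
The primary-then-fallback lookup described above could look roughly like the sketch below. It is not the repository's actual workflow: the fetchers are injected as plain callables so that no real Semantic Scholar or OpenAlex API call is assumed, and the function name, stub fetchers, placeholder ID, and stub count are all hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple

# A fetcher takes an arXiv ID and returns a citation count, or None if the
# source has no record. Real fetchers would wrap HTTP calls; they are left
# out here because their exact request shape is not given in the source.
Fetcher = Callable[[str], Optional[int]]


def citation_count(
    arxiv_id: str,
    sources: Iterable[Tuple[str, Fetcher]],
) -> Tuple[Optional[int], Optional[str]]:
    """Try each source in order; return (count, source_name) from the first
    one that answers, or (None, None) if every source fails."""
    for name, fetch in sources:
        try:
            count = fetch(arxiv_id)
        except Exception:
            continue  # treat any error as "source unavailable", try the next
        if count is not None:
            return count, name
    return None, None


# Hypothetical stubs that simulate an outage and a working fallback.
def stub_primary_down(arxiv_id: str) -> Optional[int]:
    raise TimeoutError("simulated outage")


def stub_fallback_ok(arxiv_id: str) -> Optional[int]:
    return 7  # made-up value for illustration only


if __name__ == "__main__":
    sources = [
        ("semantic_scholar", stub_primary_down),
        ("openalex", stub_fallback_ok),
    ]
    # "0000.00000" is a placeholder ID, not a real paper.
    print(citation_count("0000.00000", sources))
```

Returning the source name alongside the count keeps it visible which service supplied each number, which can matter when the two services disagree.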