World Action Model: A Survey

Dream less, act more

Qiuhong Shen, Shihua Zhang, Yue Liao, Qi Li, Zhenxiong Tan, Shizun Wang,
Shuicheng Yan, Xinchao Wang

National University of Singapore

A structured survey of predictive-action models: models in which a forecast of the future is made available to the action path. The homepage mirrors the paper’s two central views: a philosophy-level taxonomy and a component-level anatomy.

Definition

A WAM is not just a video generator with an action head.

In the survey, a World Action Model begins when a predicted future becomes action-facing. The future may be rendered pixels, latents, features, flow, affordance maps, audio, or tokens. What matters is that the predicted future helps produce, score, or train the action path.

Notation: o = observation; l = language or goal; a = action; o′ = future observation; p(x | y) = conditional model.

Vision-Language-Action

acts from the current context — no predicted future

o, l → a

p(a | o, l)

World model

a what-if simulator: action in, future out — emits no action

o, l, a → o′

p(o′ | o, a, l)

World Action Model

the predicted future helps produce, score, or train the action

o, l → o′ → a

p(o′, a | o, l)

Three icons, three quantities: a camera frame is what the model observes now (o), a thought bubble is the future it predicts (o′), and a joystick is the action it outputs (a). A world model stops at the bubble; a WAM routes that bubble forward (drawn in the accent colour) into the action.

Boundary: the predicted future must help produce, score, or train action.

1 · Predict then act: future first, action second. p(o′ | o, l), then p(a | o, o′, l)
2 · Score actions: candidate action a, predicted consequence o′. p(o′ | o, a, l)
3 · Joint prediction: future and action in one coupled model, with one shared action-facing path. p(o′, a | o, l)

Not WAM: a future head discarded before action use, or a simulator used only outside the policy path.

A simplified Section 2 definition: VLA predicts action, world model predicts future, and WAM links future prediction to action.
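The three conditionals on this page are related by the ordinary chain rule of probability, which may help when reading the patterns above. The identity below is standard; the labels on the right are an informal reading and are not taken from the survey itself.

```latex
\begin{align}
p(o', a \mid o, l)
  &= p(o' \mid o, l)\; p(a \mid o, o', l)
  && \text{(future first, then action)} \\
  &= p(a \mid o, l)\; p(o' \mid o, a, l)
  && \text{(action proposal, then its predicted consequence)}
\end{align}
```

Read this way, the first line matches the predict-then-act pattern, and the second pairs a VLA-style factor p(a | o, l) with a world-model factor p(o′ | o, a, l). The identity alone does not make a system a WAM: the page's boundary still requires that the predicted future actually be used to produce, score, or train the action.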

Neighbouring models

VLA: maps observation and instruction directly to action. World model: predicts a future observation or state. Either can be useful without being a WAM.

WAM contract

A WAM keeps the predicted future in the action path. The action may come after prediction, be scored by prediction, or be generated jointly with prediction.
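The three ways a predicted future can stay in the action path can be illustrated with a toy example. The sketch below is not from the survey: it assumes a 1-D world where the observation is a position, the goal is a target position, and the action is a displacement. All function names are hypothetical stand-ins for learned models.

```python
from typing import Sequence, Tuple

Obs = float     # toy observation o: a 1-D position (assumption)
Goal = float    # toy stand-in for language or goal l (assumption)
Action = float  # toy action a: a 1-D displacement (assumption)


def predict_future(o: Obs, l: Goal) -> Obs:
    """Toy p(o' | o, l): imagine moving halfway toward the goal."""
    return o + 0.5 * (l - o)


def inverse_dynamics(o: Obs, o_next: Obs, l: Goal) -> Action:
    """Toy p(a | o, o', l): the displacement that reaches the imagined future."""
    return o_next - o


def world_model(o: Obs, a: Action, l: Goal) -> Obs:
    """Toy p(o' | o, a, l): apply the action to the current position."""
    return o + a


def predict_then_act(o: Obs, l: Goal) -> Action:
    """Pattern 1: predict the future first, then decode an action from it."""
    o_next = predict_future(o, l)
    return inverse_dynamics(o, o_next, l)


def score_actions(o: Obs, l: Goal, candidates: Sequence[Action]) -> Action:
    """Pattern 2: keep the candidate whose predicted consequence is closest to the goal."""
    return min(candidates, key=lambda a: abs(world_model(o, a, l) - l))


def joint_prediction(o: Obs, l: Goal) -> Tuple[Obs, Action]:
    """Pattern 3: one shared computation emits both the future and the action."""
    a = 0.5 * (l - o)
    return o + a, a


if __name__ == "__main__":
    print(predict_then_act(0.0, 4.0))
    print(score_actions(0.0, 4.0, [1.0, 3.0, 6.0]))
    print(joint_prediction(0.0, 4.0))
```

In each function the predicted future o′ is consumed on the way to the action. Dropping `o_next` before computing the action, or calling `world_model` only in an offline training loop, would fall on the non-WAM side of the boundary described on this page.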

Boundary

A direct VLA with an auxiliary future loss, a simulator used only for RL training, or a future head discarded before action use does not satisfy the WAM definition.

Chronological map

How the WAM literature unfolds.

The timeline groups representative works by design philosophy. Render-and-Decode appears first, followed by Latent-Only shortcuts and Video-Generation-Free methods that carry predictive supervision outside video generation.

Generated from the paper collection — each mark is a paper at its first arXiv date, placed in its design-philosophy lane.

Fine-grained paper list

Explore papers by taxonomy and component anatomy.

The list is generated from the current Section 4 paper table, then enriched with arXiv links, first-version dates, and citation counts that update weekly.

Taxonomy × substrate

Orthogonal map


Living metadata

Citation counts refresh weekly.

Self-contained repository

The GitHub Pages repository includes the page, assets, paper data, and the weekly update workflow.

Survey-aligned data

The paper list is generated from the active survey paper table and bibliography, not from the older slide deck.

Metadata sources

Exact first-version dates come from arXiv. Citation counts refresh from Semantic Scholar with OpenAlex fallback.
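A primary-source-with-fallback lookup of the kind described above could be structured as follows. This is a minimal sketch and not the repository's actual workflow: the fetchers are injected as plain callables, so no real Semantic Scholar or OpenAlex endpoint, parameter, or response format is assumed, and every name here is hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# A fetcher takes a paper identifier and returns a citation count,
# or None when the source has no record. (Hypothetical interface.)
Fetcher = Callable[[str], Optional[int]]


def citation_count(
    paper_id: str,
    sources: Sequence[Tuple[str, Fetcher]],
) -> Tuple[Optional[int], Optional[str]]:
    """Try each source in order; return (count, source_name) from the first that answers."""
    for name, fetch in sources:
        try:
            count = fetch(paper_id)
        except Exception:
            # Treat any failure (timeout, rate limit, bad response) as a miss
            # and move on to the next source.
            continue
        if count is not None:
            return count, name
    return None, None


if __name__ == "__main__":
    # Stand-in fetchers for demonstration only; real ones would call the services.
    def primary_down(paper_id: str) -> Optional[int]:
        raise TimeoutError("primary source unavailable")

    def fallback_ok(paper_id: str) -> Optional[int]:
        return 12

    sources = [("primary", primary_down), ("fallback", fallback_ok)]
    print(citation_count("placeholder-id", sources))
```

Returning the source name alongside the count makes it possible to record which service supplied each number, which can matter because different indexes may report different counts for the same paper.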