Dream less Act more
National University of Singapore
A structured survey of predictive-action models that make a forecast of the future available to action. The homepage mirrors the paper’s two central views: a philosophy-level taxonomy and a component-level anatomy.
Definition
In the survey, a World Action Model begins when a predicted future becomes action-facing. The future may be rendered pixels, latents, features, flow, affordance maps, audio, or tokens. What matters is that the predicted future helps produce, score, or train the action path.
Notation: o = observation; l = language or goal; a = action; o′ = future observation; p(x | y) = conditional model.
Vision-Language-Action (VLA)
acts from the current context, with no predicted future
p(a | o, l)
World model
a what-if simulator: action in, future out; emits no action
p(o′ | o, a, l)
World Action Model (WAM)
the predicted future helps produce, score, or train the action
p(o′, a | o, l)
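The three conditionals can also be read as three call signatures. The following is a minimal sketch, not code from the survey: the function names, the integer encoding of observations and goals, and the toy dynamics are all assumptions made for illustration. It shows only one WAM route, the one where a predicted future scores candidate actions.

```python
from typing import List, Tuple

Obs = int      # toy observation: a position on a line (assumption)
Goal = int     # toy stand-in for the language/goal l: a target position
Action = int   # toy action: a step of -1, 0, or +1

ACTIONS: List[Action] = [-1, 0, 1]


def vla_policy(o: Obs, l: Goal) -> Action:
    """p(a | o, l): acts from the current context, no predicted future."""
    if l > o:
        return 1
    if l < o:
        return -1
    return 0


def world_model(o: Obs, a: Action, l: Goal) -> Obs:
    """p(o' | o, a, l): action in, future out; emits no action."""
    return o + a  # toy deterministic dynamics; l is unused in this sketch


def wam_policy(o: Obs, l: Goal) -> Tuple[Obs, Action]:
    """p(o', a | o, l): the predicted future stays in the action path."""
    # Predict a future for every candidate action, then keep the action
    # whose predicted future lands closest to the goal.
    scored = [(abs(world_model(o, a, l) - l), a) for a in ACTIONS]
    _, best = min(scored)
    return world_model(o, best, l), best
```

In this toy setting `vla_policy` and `wam_policy` pick the same step; the difference the sketch is meant to show is structural: only `wam_policy` returns a predicted future together with the action, and only there does the prediction decide which action is chosen.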
Three icons, three roles: a camera frame is what the model observes now (o), a thought bubble is the future it predicts (o′), and a joystick is the action it outputs (a). A world model stops at the bubble; a WAM routes that bubble forward, highlighted in the accent color, into the action.
Boundary: the predicted future must help produce, score, or train action.
Not WAM: a future head discarded before action use, or a simulator used only outside the policy path.
VLA: maps observation and instruction directly to action. World model: predicts a future observation or state. Either can be useful without being a WAM.
A WAM keeps the predicted future in the action path. The action may come after prediction, be scored by prediction, or be generated jointly with prediction.
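One way to relate these routes to the joint p(o′, a | o, l) is the ordinary chain rule. The labels on the right are an interpretive reading offered here as an assumption, not a derivation taken from the survey.

```latex
\begin{align}
p(o', a \mid o, l)
  &= p(o' \mid o, l)\, p(a \mid o, o', l)
  && \text{(predict first, then act on the prediction)} \\
  &= p(a \mid o, l)\, p(o' \mid o, a, l)
  && \text{(propose an action, then score it by its predicted future)}
\end{align}
```

Both factorizations are exact for any joint distribution. Joint generation corresponds to modeling the left-hand side directly, without committing to either ordering. The second line reuses the VLA term p(a | o, l) and the world-model term p(o′ | o, a, l), which makes explicit that a WAM combines the two rather than replacing either.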
A direct VLA with an auxiliary future loss, a simulator used only for RL training, or a future head discarded before action use does not satisfy the WAM definition.
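The boundary can be written as a small membership check. This is only a sketch of the rule as stated in the text: the enum and its member names are hypothetical labels, and deciding which label a real system deserves (for example, when a future loss counts as training the action path rather than as an auxiliary loss) remains a judgment the code does not make.

```python
from enum import Enum


class FutureUse(Enum):
    """How a system uses its predicted future (labels are illustrative)."""
    # Action-facing uses named in the definition.
    PRODUCES_ACTION = "produces"
    SCORES_ACTION = "scores"
    TRAINS_ACTION_PATH = "trains"
    # Cases the text explicitly places outside the definition.
    NO_PREDICTED_FUTURE = "none"
    AUXILIARY_LOSS_ON_DIRECT_VLA = "aux_loss"
    SIMULATOR_FOR_RL_TRAINING_ONLY = "rl_sim"
    FUTURE_HEAD_DISCARDED = "discarded"


ACTION_FACING = {
    FutureUse.PRODUCES_ACTION,
    FutureUse.SCORES_ACTION,
    FutureUse.TRAINS_ACTION_PATH,
}


def is_wam(use: FutureUse) -> bool:
    """True when the predicted future stays in the action path."""
    return use in ACTION_FACING
```

For example, `is_wam(FutureUse.SCORES_ACTION)` is true, while `is_wam(FutureUse.FUTURE_HEAD_DISCARDED)` is false.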
Chronological map
The timeline groups representative works by design philosophy. Render-and-Decode appears first, followed by Latent-Only shortcuts and Video-Generation-Free methods that carry predictive supervision outside video generation.
Fine-grained paper list
The list is generated from the current Section 4 paper table, then enriched with arXiv links, first-version dates, and citation counts that can be refreshed weekly.
Taxonomy × substrate