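LingBot-World · Full Inference Data Flow & Function Call Diagram
First-frame image → text encoding → camera conditioning → diffusion denoising loop → video output and action interpretation (480P · 81-frame example)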
PHASE 1 · INPUT CONDITIONING
PHASE 2 · FLOW MATCHING DENOISING LOOP ×40 STEPS · UNIPC SOLVER
PHASE 3 · OUTPUT GENERATION & ACTION INTERPRETATION
🖼️
Input Image
PIL.Image.open(image_path)
H×W×3 JPEG/PNG
📝
Text Prompt
input_prompt: str (max 256 tokens)
"A soaring journey through jungle…"
🕹️
Control Signals
poses.npy / action_string DSL
c2ws[81,4,4] · intrinsics[81,4]
TF.to_tensor() · sub_(0.5) · div_(0.5)
bicubic resize → target resolution
[3, 480, 832] range(−1, 1)
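The normalization in this box maps pixel values into the range (−1, 1). A minimal NumPy sketch of the same arithmetic, assuming an 8-bit RGB array as input; the bicubic resize is omitted, and `normalize_image` is a hypothetical helper name, not a function from the repository:

```python
import numpy as np

def normalize_image(img_u8: np.ndarray) -> np.ndarray:
    """Map an H x W x 3 uint8 image to a 3 x H x W float32 array in [-1, 1]."""
    x = img_u8.astype(np.float32) / 255.0   # TF.to_tensor() scales uint8 input to [0, 1]
    x = np.transpose(x, (2, 0, 1))          # HWC -> CHW
    return (x - 0.5) / 0.5                  # sub_(0.5) followed by div_(0.5)

img = np.zeros((480, 832, 3), dtype=np.uint8)
img[..., 0] = 255                           # pure red test image
out = normalize_image(img)
print(out.shape, out.min(), out.max())
```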
🎞️ VAE Encoder
wan/modules/vae2_1.py
CausalConv3d ↓4× temporal ↓8× spatial
img[None]+zeros(F-1) → latent
[16, 21, 60, 104]
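The latent shape follows from the strides listed in the diagram (4× temporal, 8× spatial). A small sketch of that arithmetic, assuming the causal VAE keeps the first frame and then emits one latent frame per four input frames; `latent_shape` is a hypothetical helper used only for illustration:

```python
def latent_shape(frames, height, width, z_dim=16, stride=(4, 8, 8)):
    """Return (C, F, H, W) of the VAE latent for a video of the given size."""
    st, sh, sw = stride
    lat_f = (frames - 1) // st + 1   # first frame, then one latent per 4 frames
    return (z_dim, lat_f, height // sh, width // sw)

shape = latent_shape(81, 480, 832)
print(shape)  # (16, 21, 60, 104)
```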
🎭 Frame Mask + y = concat
msk = ones(1,F,lat_h,lat_w); msk[:,1:]=0
repeat_interleave(msk[:,0:1], 4) → [4,21,60,104]
y = concat([msk, vae_latent])
y: [20, 21, 60, 104]
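The mask construction can be sketched in NumPy as follows. The reshape and transpose that regroup the 84 mask frames into 4 channels × 21 latent frames are an assumption chosen to reproduce the shapes shown in the diagram, and the latent is a zero placeholder:

```python
import numpy as np

F, lat_h, lat_w = 81, 60, 104
vae_latent = np.zeros((16, 21, lat_h, lat_w), dtype=np.float32)  # placeholder latent

msk = np.ones((1, F, lat_h, lat_w), dtype=np.float32)
msk[:, 1:] = 0                                     # only the first frame is given
first = np.repeat(msk[:, 0:1], 4, axis=1)          # first frame fills a full group of 4
msk = np.concatenate([first, msk[:, 1:]], axis=1)  # [1, 84, lat_h, lat_w]
msk = msk.reshape(1, msk.shape[1] // 4, 4, lat_h, lat_w)
msk = msk.transpose(0, 2, 1, 3, 4)[0]              # [4, 21, lat_h, lat_w]

y = np.concatenate([msk, vae_latent], axis=0)      # [20, 21, lat_h, lat_w]
print(msk.shape, y.shape)
```

Only latent frame 0 carries a mask value of 1, which marks it as the conditioning frame.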
🔤 HuggingfaceTokenizer
add_special_tokens=True · whitespace clean
ids [1, 256] · mask [1, 256]
📝 T5 Encoder (UMT5-XXL)
wan/modules/t5.py · encoder_only=True
① token_embedding(ids) → [1,256,4096]
② pos_embedding(L,L) → RelPos bias [1,64,L,L]
③ × 24 T5SelfAttention blocks
norm→Attn(Q,K,V)→res + FFN(gate×fc1)→fc2→res
④ T5LayerNorm + Dropout
context: [u[:v] for u,v in zip(out, seq_lens)]
context: [1, ≤256, 4096]
context_null (negative prompt)
sample_neg_prompt → same T5 pipeline
context_null: [1, ≤N, 4096]
🕹️ Action→Cam · wasd_ijkl_to_c2ws.py
Path A: parse DSL → segments_to_wasd_ijkl → generate_trajectory
Path B: np.load("poses.npy") · np.load("wasd_action.npy")
c2ws [81, 4, 4] · wasd [81, 3]
get_Ks_transformed(Ks, 480, 832, h, w)
scale fx·fy·cx·cy to target resolution
Ks [21, 4] (fx, fy, cx, cy)
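Scaling intrinsics to a new resolution amounts to multiplying the horizontal terms by the width ratio and the vertical terms by the height ratio. The sketch below assumes a plain resize with no cropping; `scale_intrinsics` is a hypothetical stand-in, and the real `get_Ks_transformed` may handle additional cases:

```python
def scale_intrinsics(K, src_hw, dst_hw):
    """K = (fx, fy, cx, cy) in pixels at src_hw -> the same intrinsics at dst_hw."""
    fx, fy, cx, cy = K
    sy = dst_hw[0] / src_hw[0]   # height ratio
    sx = dst_hw[1] / src_hw[1]   # width ratio
    return (fx * sx, fy * sy, cx * sx, cy * sy)

K_scaled = scale_intrinsics((1000.0, 1000.0, 960.0, 540.0), (1080, 1920), (480, 832))
print(K_scaled)
```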
interpolate_camera_poses() · SLERP
linspace(0,80,21) → interp1d(trans) + Slerp(rot)
c2ws_infer [21, 4, 4] (1 of every 4 frames)
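Because `linspace(0, 80, 21)` lands on whole frame indices, interpolating at those times coincides with taking every 4th pose. A NumPy-only sketch of the translation part, with `np.interp` standing in for `interp1d`; rotation interpolation (Slerp in the diagram) is not reproduced here, and the trajectory is a toy example:

```python
import numpy as np

num_frames, num_latent = 81, 21
times = np.linspace(0, num_frames - 1, num_latent)   # 0, 4, 8, ..., 80

# Toy trajectory: the camera moves 0.1 units per frame along +z.
trans = np.zeros((num_frames, 3))
trans[:, 2] = 0.1 * np.arange(num_frames)

# Per-axis linear interpolation of the translation at the sample times.
frame_idx = np.arange(num_frames)
trans_infer = np.stack(
    [np.interp(times, frame_idx, trans[:, k]) for k in range(3)], axis=1
)
print(times[:4], trans_infer.shape)
```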
compute_relative_poses(framewise=True)
SE3_inv(c2ws[0])@c2ws → Δpose; /max_norm
relative [21, 4, 4] (normalized translation)
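The relative-pose step can be sketched as below: every pose is expressed in the frame of the first camera, then translations are divided by the largest translation norm. This follows the formula in the box; the `framewise` option is not modelled, and the function names are illustrative:

```python
import numpy as np

def se3_inverse(T):
    """Invert a 4x4 rigid transform using R^T and -R^T t."""
    R, t = T[:3, :3], T[:3, 3]
    Ti = np.eye(4)
    Ti[:3, :3] = R.T
    Ti[:3, 3] = -R.T @ t
    return Ti

def relative_poses(c2ws, eps=1e-8):
    rel = se3_inverse(c2ws[0]) @ c2ws                   # broadcasts over frames
    max_norm = np.linalg.norm(rel[:, :3, 3], axis=1).max()
    rel[:, :3, 3] /= max(max_norm, eps)                 # normalize translation scale
    return rel

c2ws = np.tile(np.eye(4), (21, 1, 1))
c2ws[:, :3, 3] = np.array([1.0, 2.0, 3.0])              # shared world offset
c2ws[:, 2, 3] += 0.5 * np.arange(21)                    # forward motion along z
rel = relative_poses(c2ws)
print(rel[0, :3, 3], rel[-1, :3, 3])
```

After this step the first pose is the identity and the farthest camera sits at unit distance from it.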
💫 get_plucker_embeddings(c2ws, Ks, h, w)
create_meshgrid → back-projection → rays_d = R@d · rays_o = -R.T@t
Plücker = [d×o, d] → [21, H, W, 6]
rearrange('f (h c1) (w c2) c → b c f h w') → DiT cond
c2ws_plucker_emb: [1, 6, 21, 60, 104]
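A per-pixel Plücker ray embedding can be sketched in NumPy as below. Several conventions here are assumptions rather than facts about the repository: the poses are treated as camera-to-world (so the ray origin is the translation column), pixel centres are offset by 0.5, and the moment is computed as o × d, which differs from the d×o written in the diagram only in sign. The final rearrange to the DiT layout is not included.

```python
import numpy as np

def plucker_embeddings(c2ws, Ks, h, w):
    """c2ws: [F,4,4] camera-to-world, Ks: [F,4] as (fx, fy, cx, cy) -> [F,h,w,6]."""
    v, u = np.meshgrid(np.arange(h) + 0.5, np.arange(w) + 0.5, indexing="ij")
    out = np.empty((len(c2ws), h, w, 6))
    for i, (T, (fx, fy, cx, cy)) in enumerate(zip(c2ws, Ks)):
        # Back-project each pixel to a direction in the camera frame.
        d_cam = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u)], axis=-1)
        d = d_cam @ T[:3, :3].T                      # rotate into the world frame
        d /= np.linalg.norm(d, axis=-1, keepdims=True)
        o = np.broadcast_to(T[:3, 3], d.shape)       # camera centre for every pixel
        out[i] = np.concatenate([np.cross(o, d), d], axis=-1)
    return out

c2ws = np.tile(np.eye(4), (2, 1, 1))
c2ws[1, :3, 3] = [1.0, 0.0, 0.0]
Ks = np.tile([50.0, 50.0, 8.0, 6.0], (2, 1))
emb = plucker_embeddings(c2ws, Ks, h=12, w=16)
print(emb.shape)  # (2, 12, 16, 6)
```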
[3, 480, 832] → latent
ids[1,256], mask[1,256]
🎲 Noise Initialization
torch.randn(16, lat_f, lat_h, lat_w)
seed_g.manual_seed(base_seed)
noise [16, 21, 60, 104]
⏱ Timestep + Model Select
set_timesteps(40, shift=5.0) → timesteps
boundary = 0.5 × 1000 = 500
t ≥ 500 → 🔥 high_noise_model
t < 500 → ❄️ low_noise_model
guide_scale = (lo_cfg, hi_cfg) tuple
latent = noise (init for t₀)
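The timestep schedule and expert selection can be sketched as follows. The shift formula is the time shift commonly used by flow-matching schedulers and is assumed, not confirmed, to match `set_timesteps` here; the sigma endpoints are simplified to 1 and 0, and `pick_expert` is a hypothetical helper:

```python
import numpy as np

def shifted_timesteps(num_steps=40, shift=5.0, num_train_timesteps=1000):
    sigmas = np.linspace(1.0, 0.0, num_steps + 1)[:-1]        # simplified endpoints
    sigmas = shift * sigmas / (1.0 + (shift - 1.0) * sigmas)  # time shift
    return sigmas * num_train_timesteps

def pick_expert(t, boundary=500.0, guide_scale=(5.0, 5.0)):
    """Return (model_name, cfg_scale); guide_scale is (low_noise, high_noise)."""
    if t >= boundary:
        return "high_noise_model", guide_scale[1]
    return "low_noise_model", guide_scale[0]

timesteps = shifted_timesteps()
names = [pick_expert(t)[0] for t in timesteps]
print(len(timesteps), names.count("high_noise_model"), names.count("low_noise_model"))
```

A larger shift keeps more of the 40 steps at high noise levels, so more of them are routed to the high-noise model.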
🤖 DiT Forward (conditioned)
modules/model.py · WanModel.forward()
arg_c = {context, seq_len, y, dit_cond_dict}
① patch_embed(latent) → tokens [B, 32760, 5120]
② time_emb(t) + rope_params(3D: T+H+W)
③ ×40 blocks: SelfAttn(RoPE+Flash) + CrossAttn(T5) + AdaLN+FFN
④ camera_proj(plucker) injected as extra tokens
noise_pred_cond [16, 21, 60, 104]
🤖 DiT Forward (unconditioned)
arg_null = {context_null, seq_len, y, dit_cond}
Same model, same timestep t, different text condition
noise_pred_uncond [16, 21, 60, 104]
⚖️ Classifier-Free Guidance
guide_scale = 5.0 (lo) / 5.0 (hi) by default
v = v_uncond
+ scale × (v_cond − v_uncond)
noise_pred [16, 21, 60, 104]
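The guidance formula in this box is a linear extrapolation from the unconditional prediction toward the conditional one. A minimal NumPy sketch with random stand-in tensors (shapes reduced for brevity):

```python
import numpy as np

def cfg_combine(v_cond, v_uncond, scale):
    """Classifier-free guidance: v_uncond + scale * (v_cond - v_uncond)."""
    return v_uncond + scale * (v_cond - v_uncond)

rng = np.random.default_rng(0)
v_cond = rng.standard_normal((16, 3, 6, 10))
v_uncond = rng.standard_normal((16, 3, 6, 10))
noise_pred = cfg_combine(v_cond, v_uncond, scale=5.0)
print(noise_pred.shape)
```

With `scale=1` the result equals the conditional prediction, and with `scale=0` it equals the unconditional one.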
🔄 FlowUniPCMultistepScheduler.step()
fm_solvers_unipc.py · solver_order=2
① x0 = xt − σ_t × noise_pred (flow → x0)
② UniPC predictor: poly extrap from history
③ UniC corrector: higher-order correction
④ model_outputs.append(noise_pred)
temp_x0 [1, 16, 21, 60, 104]
latent = temp_x0.squeeze(0) ← input to the next iteration
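Step ① and the overall update can be illustrated with a first-order Euler step. This is a deliberately simpler solver swapped in for illustration, not the UniPC predictor-corrector used by the scheduler, and it assumes the linear flow-matching path x_t = (1 − σ)·x0 + σ·noise with velocity v = noise − x0:

```python
import numpy as np

def flow_to_x0(x_t, sigma_t, v_pred):
    """Step 1 of the box: recover the clean-sample estimate from the velocity."""
    return x_t - sigma_t * v_pred

def euler_step(x_t, sigma_t, sigma_next, v_pred):
    """First-order update along the flow from sigma_t to sigma_next."""
    return x_t + (sigma_next - sigma_t) * v_pred

rng = np.random.default_rng(0)
x0 = rng.standard_normal((16, 3, 4, 4))
noise = rng.standard_normal(x0.shape)
sigma_t, sigma_next = 0.8, 0.6
x_t = (1 - sigma_t) * x0 + sigma_t * noise
v_true = noise - x0                                   # exact velocity for this path

x0_hat = flow_to_x0(x_t, sigma_t, v_true)
x_next = euler_step(x_t, sigma_t, sigma_next, v_true)
print(np.abs(x0_hat - x0).max())
```

With the exact velocity, one Euler step lands exactly on the path at `sigma_next`; the multistep solver exists because the predicted velocity is only approximate.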
init
arg_c
arg_null
v_cond
v_uncond
noise_pred
↺ next t
y [20,21,60,104]
context [1,≤256,4096]
[1,6,21,60,104] plucker
x0 [16, 21, 60, 104]
🎞️ VAE Decoder
wan/modules/vae2_1.py · CausalDecoder
CausalConv3d(16→512)
ResBlock×N + upsample3d ×3 (↑8× spatial, ↑4× temporal)
CausalConv3d(→3) → tanh
video [3, 81, 480, 832]
🔄 Coord Transform
(C,F,H,W) → (F,H,W,C) · (-1,1)→(0,1)
videos_np = (videos + 1) / 2 · uint8
[81, 480, 832, 3] range(0, 1)
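The layout change and range mapping can be sketched as below. The clip and the separate 8-bit conversion helper are additions for illustration; the diagram does not say at which point the frames become `uint8`:

```python
import numpy as np

def to_frames(video_cfhw):
    """(C, F, H, W) in [-1, 1] -> (F, H, W, C) in [0, 1]."""
    x = np.transpose(video_cfhw, (1, 2, 3, 0))
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)    # clip guards against overshoot

def to_uint8(frames01):
    """Optional 8-bit conversion for drawing or saving."""
    return (frames01 * 255.0).round().astype(np.uint8)

video = np.full((3, 2, 4, 5), -1.0)
video[0] = 1.0                                   # red channel at full intensity
frames = to_frames(video)
frames_u8 = to_uint8(frames)
print(frames.shape, frames_u8.dtype)
```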
🖼️ Vis Utils · UI Overlay
utils/vis_utils.py · visualize_wasd_ui
draw_rounded_rectangle → key backgrounds
draw_chevron → direction arrows
cv2.addWeighted → alpha-blend the UI layer
[81, 480, 832, 3] w/ UI
💾 save_video(tensor, fps=16)
wan/utils/utils.py · torchvision encode
normalize=True · value_range=(-1,1)
output.mp4 @ 16 fps
🌍 Action Interpretation (World Model)
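🎯 Camera trajectory c2ws = the agent's "action" in the world
🎬 Generated video frames = the world model's predicted response to that action
🤖 Robot Learning: use predicted frames as a reward signal
🎮 Game AI: real-time generation of an interactive world, <1 s latency
🎨 Content Creation: first-person exploration video generation
WASD = translation
IJKL = rotation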
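To make the WASD/IJKL legend concrete, the sketch below accumulates per-frame key presses into camera-to-world poses. Everything here is hypothetical: the key-to-axis mapping, the step size, the turn angle, and the function names are illustrative choices and do not reproduce `wasd_ijkl_to_c2ws.py`.

```python
import numpy as np

# Hypothetical mapping: WASD translates in the camera frame, IJKL rotates.
MOVE = {"w": (0, 0, 1), "s": (0, 0, -1), "a": (-1, 0, 0), "d": (1, 0, 0)}
TURN = {"i": (-1, 0), "k": (1, 0), "j": (0, -1), "l": (0, 1)}  # (pitch, yaw) signs

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def keys_to_c2ws(keys_per_frame, step=0.05, angle=np.deg2rad(1.0)):
    """Integrate key presses into a [len + 1, 4, 4] pose sequence."""
    T = np.eye(4)
    poses = [T.copy()]
    for keys in keys_per_frame:
        for k in keys:
            if k in MOVE:
                T[:3, 3] += T[:3, :3] @ (step * np.array(MOVE[k], dtype=float))
            elif k in TURN:
                p, y = TURN[k]
                T[:3, :3] = T[:3, :3] @ rot_y(y * angle) @ rot_x(p * angle)
        poses.append(T.copy())
    return np.stack(poses)

c2ws_fwd = keys_to_c2ws(["w"] * 80)       # hold "forward" for 80 frames
c2ws_turn = keys_to_c2ws(["wl"] * 80)     # move forward while turning
print(c2ws_fwd.shape, c2ws_fwd[-1, :3, 3])
```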
video
np arr
+UI
KEY TENSOR SHAPES (480P · 81 frames) · vae_stride=[4,8,8] · patch_size=[1,2,2] · ulysses_size=8
Image: [3, 480, 832]
VAE latent: [16, 21, 60, 104]
mask: [4, 21, 60, 104]
y: [20, 21, 60, 104]
noise: [16, 21, 60, 104]
context: [1, ≤256, 4096]
plucker: [1, 6, 21, 60, 104]
noise_pred: [16, 21, 60, 104]
video: [3, 81, 480, 832]
c2ws: [81, 4, 4]
c2ws_infer: [21, 4, 4]
max_seq_len = 21×30×52 = 32,760 tokens
timesteps: [T₃₉…T₀] shift=5.0 (720P)
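The token count in the legend follows from the latent shape and `patch_size=[1,2,2]`. The sketch below reproduces that arithmetic; rounding up to a multiple of the sequence-parallel size is an assumption about how `ulysses_size` interacts with the sequence length, and `max_seq_len` here is an illustrative function, not the repository's:

```python
import math

def max_seq_len(lat_f, lat_h, lat_w, patch=(1, 2, 2), sp_size=1):
    """Number of DiT tokens for a latent of size (lat_f, lat_h, lat_w)."""
    tokens = (lat_f // patch[0]) * (lat_h // patch[1]) * (lat_w // patch[2])
    return math.ceil(tokens / sp_size) * sp_size   # pad up to a multiple of sp_size

n_tokens = max_seq_len(21, 60, 104)
n_tokens_sp = max_seq_len(21, 60, 104, sp_size=8)
print(n_tokens, n_tokens_sp)  # 32760 32760
```

For this resolution the count is already divisible by 8, so no padding is needed.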