PHASE 1 · INPUT CONDITIONING

🖼️ Input Image
    PIL.Image.open(image_path)
    H×W×3, JPEG/PNG

📝 Text Prompt
    input_prompt: str (max 256 tokens)
    "A soaring journey through jungle…"

🕹️ Control Signals
    poses.npy / action_string DSL
    c2ws [81, 4, 4] · intrinsics [81, 4]

TF.to_tensor() · sub_(0.5) · div_(0.5)
    bicubic resize → target resolution
    output: [3, 480, 832], range (−1, 1)

🎞️ VAE Encoder
    wan/modules/vae2_1.py
    CausalConv3d, ↓4× temporal, ↓8× spatial
    img[None] + zeros(F-1) → latent
    output: [16, 21, 60, 104]

🎭 Frame Mask + y = concat
    msk = ones(1, F, lat_h, lat_w); msk[:, 1:] = 0
    repeat_interleave(msk[:, 0:1], 4) → [4, 21, 60, 104]
    y = concat([msk, vae_latent])
    y: [20, 21, 60, 104]

🔤 HuggingfaceTokenizer
    add_special_tokens=True · whitespace clean
    ids [1, 256] · mask [1, 256]

📝 T5 Encoder (UMT5-XXL)
    wan/modules/t5.py · encoder_only=True
    ① token_embedding(ids) → [1, 256, 4096]
    ② pos_embedding(L, L) → RelPos bias [1, 64, L, L]
    ③ ×24 T5SelfAttention blocks: norm → Attn(Q,K,V) → res + FFN(gate×fc1) → fc2 → res
    ④ T5LayerNorm + Dropout
    context: [u[:v] for u, v in zip(out, seq_lens)]
    context: [1, ≤256, 4096]

context_null (negative prompt)
    sample_neg_prompt → same T5 pipeline
    context_null: [1, ≤N, 4096]

🕹️ Action→Cam · wasd_ijkl_to_c2ws.py
    Path A: parse DSL → segments_to_wasd_ijkl → generate_trajectory
    Path B: np.load("poses.npy") · np.load("wasd_action.npy")
    c2ws [81, 4, 4] · wasd [81, 3]

get_Ks_transformed(Ks, 480, 832, h, w)
    scale fx, fy, cx, cy to target resolution
    Ks [21, 4] (fx, fy, cx, cy)

interpolate_camera_poses() · SLERP
    linspace(0, 80, 21) → interp1d(trans) + Slerp(rot)
    c2ws_infer [21, 4, 4] (1 of every 4 frames)

compute_relative_poses(framewise=True)
    SE3_inv(c2ws[0]) @ c2ws → Δpose; / max_norm
    relative [21, 4, 4] (normalized translation)

💫 get_plucker_embeddings(c2ws, Ks, h, w)
    create_meshgrid → back-projection → rays_d = R@d, rays_o = -R.T@t
    Plücker = [d×o, d] → [21, H, W, 6]
    rearrange('f(hc1)(wc2)c → b c f h w') → DiT cond
    c2ws_plucker_emb: [1, 6, 21, 60, 104]

Data flow: image [3, 480, 832] → latent · tokenizer → ids [1, 256], mask [1, 256]

PHASE 2 · FLOW MATCHING DENOISING LOOP ×40 STEPS · UNIPC SOLVER

🎲 Noise Initialization
    torch.randn(16, lat_f, lat_h, lat_w)
    seed_g.manual_seed(base_seed)
    noise [16, 21, 60, 104]

⏱ Timestep + Model Select
    set_timesteps(40, shift=5.0) → timesteps
    boundary = 0.5 × 1000 = 500
    t ≥ 500 → 🔥 high_noise_model
    t < 500 → ❄️ low_noise_model
    guide_scale = (lo_cfg, hi_cfg) tuple
    latent = noise (init for t₀)

🤖 DiT Forward (conditioned)
    modules/model.py · WanModel.forward()
    arg_c = {context, seq_len, y, dit_cond_dict}
    ① patch_embed(latent) → tokens [B, 32760, 5120]
    ② time_emb(t) + rope_params(3D: T+H+W)
    ③ ×40 blocks: SelfAttn(RoPE+Flash) + CrossAttn(T5) + AdaLN + FFN
    ④ camera_proj(plucker) injected as extra tokens
    noise_pred_cond [16, 21, 60, 104]

🤖 DiT Forward (unconditioned)
    arg_null = {context_null, seq_len, y, dit_cond}
    same model, same timestep t, different text condition
    noise_pred_uncond [16, 21, 60, 104]

⚖️ Classifier-Free Guidance
    guide_scale = 5.0 (lo) / 5.0 (hi) by default
    v = v_uncond + scale × (v_cond − v_uncond)
    noise_pred [16, 21, 60, 104]

🔄 FlowUniPCMultistepScheduler.step()
    fm_solvers_unipc.py · solver_order=2
    ① x0 = xt − σ_t × noise_pred (flow → x0)
    ② UniPC predictor: polynomial extrapolation from history
    ③ UniC corrector: higher-order correction
    ④ model_outputs.append(noise_pred)
    temp_x0 [1, 16, 21, 60, 104]
    latent = temp_x0.squeeze(0) ← input to the next iteration

Data flow into the loop: y [20, 21, 60, 104] · context [1, ≤256, 4096] · plucker [1, 6, 21, 60, 104]
Data flow inside the loop: init → arg_c / arg_null → v_cond / v_uncond → noise_pred → ↺ next t
Data flow out of the loop: x0 [16, 21, 60, 104]

PHASE 3 · OUTPUT GENERATION & ACTION INTERPRETATION

🎞️ VAE Decoder
    wan/modules/vae2_1.py · CausalDecoder
    CausalConv3d(16→512)
    ResBlock×N + upsample3d ×3 (↑8× ↑4×)
    CausalConv3d(→3) → tanh
    video [3, 81, 480, 832]

🔄 Coord Transform
    (C,F,H,W) → (F,H,W,C) · (-1,1) → (0,1)
    videos_np = (videos + 1) / 2 · uint8
    [81, 480, 832, 3], range (0, 1)

🖼️ Vis Utils · UI Overlay
    utils/vis_utils.py · visualize_wasd_ui
    draw_rounded_rectangle → key backgrounds
    draw_chevron → direction arrows
    cv2.addWeighted → alpha-blend the UI layer
    [81, 480, 832, 3] w/ UI

💾 save_video(tensor, fps=16)
    wan/utils/utils.py · torchvision encode
    normalize=True · value_range=(-1,1)
    output.mp4 @ 16 fps

🌍 Action Interpretation (World Model)
    🎯 Camera trajectory c2ws = the agent's "action" in the world
    🎬 Generated video frames = the world model's predicted response to that action
    🤖 Robot Learning: use predicted frames as a reward signal
    🎮 Game AI: real-time generation of an interactive world, <1 s latency
    🎨 Content Creation: first-person exploration video generation
    WASD = translation · IJKL = rotation

Data flow: video → np arr → +UI

KEY TENSOR SHAPES (480P · 81 frames) · vae_stride=[4,8,8] · patch_size=[1,2,2] · ulysses_size=8

    Image:       [3, 480, 832]
    VAE latent:  [16, 21, 60, 104]
    mask:        [4, 21, 60, 104]
    y:           [20, 21, 60, 104]
    noise:       [16, 21, 60, 104]
    context:     [1, ≤256, 4096]
    plucker:     [1, 6, 21, 60, 104]
    noise_pred:  [16, 21, 60, 104]
    video:       [3, 81, 480, 832]
    c2ws:        [81, 4, 4]
    c2ws_infer:  [21, 4, 4]
    max_seq_len = 21×30×52 = 32,760 tokens
    timesteps:   [T₃₉…T₀], shift=5.0 (720P)
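
The KEY TENSOR SHAPES summary can be cross-checked with plain arithmetic. The sketch below is not taken from the repository: the function name `latent_shapes` is hypothetical, and it assumes that the latent frame count follows the causal-VAE rule (F − 1) / 4 + 1 and that the mask channel count equals the temporal stride. Both assumptions are consistent with the shapes listed in the diagram.

```python
def latent_shapes(frames=81, height=480, width=832,
                  vae_stride=(4, 8, 8), patch_size=(1, 2, 2), z_dim=16):
    """Derive the diagram's tensor shapes from the 480P / 81-frame settings.

    Hypothetical helper for illustration only; assumes a causal VAE where the
    first frame is encoded alone, so lat_f = (frames - 1) // stride_t + 1.
    """
    st, sh, sw = vae_stride
    pt, ph, pw = patch_size

    lat_f = (frames - 1) // st + 1   # 81 frames -> 21 latent frames
    lat_h = height // sh             # 480 -> 60
    lat_w = width // sw              # 832 -> 104

    latent = (z_dim, lat_f, lat_h, lat_w)
    # Assumption: the frame mask carries one channel per temporal-stride slot.
    mask = (st, lat_f, lat_h, lat_w)
    # y = concat([msk, vae_latent]) along the channel axis.
    y = (st + z_dim, lat_f, lat_h, lat_w)
    # One DiT token per (pt, ph, pw) patch of the latent volume.
    max_seq_len = (lat_f // pt) * (lat_h // ph) * (lat_w // pw)

    return {"latent": latent, "mask": mask, "y": y, "max_seq_len": max_seq_len}


if __name__ == "__main__":
    for name, value in latent_shapes().items():
        print(name, value)
```

With the default arguments this gives the latent, mask, and y shapes from the summary and the 21×30×52 token count.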
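
The model-selection rule and the classifier-free guidance formula from Phase 2 can be sketched on plain Python lists. This is an illustration only, not the repository's code: the helper names `select_expert`, `select_scale`, and `cfg_combine` are hypothetical, the `(lo, hi)` tuple order follows the diagram's `guide_scale = (lo_cfg, hi_cfg)` label, and pairing `hi` with the high-noise branch is an assumption.

```python
def select_expert(t, boundary=0.5, num_train_timesteps=1000):
    """Pick the expert by timestep: t >= 500 is high-noise, t < 500 is low-noise."""
    threshold = boundary * num_train_timesteps  # 0.5 x 1000 = 500
    return "high_noise_model" if t >= threshold else "low_noise_model"


def select_scale(t, guide_scale=(5.0, 5.0), boundary=0.5, num_train_timesteps=1000):
    """Pick the guidance scale from the (lo, hi) tuple.

    Assumption: hi applies on the high-noise side of the boundary.
    """
    lo, hi = guide_scale
    return hi if t >= boundary * num_train_timesteps else lo


def cfg_combine(v_cond, v_uncond, scale):
    """v = v_uncond + scale * (v_cond - v_uncond), element-wise."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


if __name__ == "__main__":
    for t in (999, 500, 499, 0):
        print(t, select_expert(t), select_scale(t))
    print(cfg_combine([1.0, 0.5], [0.0, 0.5], 5.0))
```

When the conditional and unconditional predictions agree, the guided output equals both of them regardless of scale, which is a quick sanity check for any implementation of the formula.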
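
The pose subsampling step, `linspace(0, 80, 21)` over 81 camera poses, can be illustrated for the translation part alone. The sketch below is a stand-in under stated assumptions: it uses hand-written linear interpolation in place of `interp1d`, it omits the Slerp rotation branch entirely, and the names `linspace` and `interp_translations` are hypothetical helpers, not functions from the repository.

```python
import math


def linspace(start, stop, num):
    """Evenly spaced sample times, endpoints included (stdlib stand-in)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def interp_translations(translations, num_out=21):
    """Linearly resample per-frame translations to num_out latent frames.

    translations: list of [x, y, z], one per video frame (at least 2 entries).
    Rotation would be handled separately with Slerp; it is not shown here.
    """
    n = len(translations)
    out = []
    for t in linspace(0, n - 1, num_out):
        i0 = min(int(math.floor(t)), n - 2)  # left neighbour, clamped at the end
        w = t - i0                           # interpolation weight in [0, 1]
        a, b = translations[i0], translations[i0 + 1]
        out.append([(1 - w) * p + w * q for p, q in zip(a, b)])
    return out


if __name__ == "__main__":
    # Synthetic straight-line trajectory: 81 frames moving along x and z.
    traj = [[float(i), 0.0, 2.0 * i] for i in range(81)]
    sampled = interp_translations(traj)
    print(len(sampled), sampled[1], sampled[-1])
```

For 81 input frames and 21 outputs the sample times land exactly on frames 0, 4, 8, …, 80, which matches the diagram's note that one of every four frames is kept.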