I've been messing with AI video generation for a while now. Tried the usual suspects — ComfyUI workflows with 47 nodes, cloud APIs that charge per clip, tutorials that assume you already know your CFG scale from your scheduler. The common thread? Everything assumes you either have an RTX 4090 or you're willing to pay someone who does.
Then I found Wan2GP. It's a GitHub project by a solo dev (deepbeepmeep) that runs Wan 2.2, HunyuanVideo, LTX-Video, and a bunch of other models on hardware you probably already own. An RTX 3060 12GB. A 2070 Super. Some 10-series cards if you're patient.
I've been running it for a few weeks now and figured out a bunch of things that aren't in the docs or are buried so deep nobody finds them. This is a writeup of what actually works, what doesn't, and how to set it up if you don't have a local GPU.
Deepy — The Built-in AI Assistant Nobody Turns On
So there's this thing called Deepy inside Wan2GP. It's an offline AI assistant that chains together different generation steps. Instead of manually running an image edit, then taking that output and feeding it into the video generator, you just describe what you want and Deepy handles the pipeline.
How to enable it:
Go to Config > Prompt Enhancer / Deepy tab. In the model dropdown, pick Qwen3.5VL Abliterated 4B — it's the lighter version, won't eat all your VRAM. Toggle "Enable Deepy" on. Hit "Save Deepy Settings" — seriously, hit save, otherwise it resets next time you launch. Then click "Ask Deepy" in the left dock.
Now you just type. Something like:
"take this image, make the sky stormy, then animate it with a slow zoom"
Deepy figures out you want the Image Editor first, then the Video Generator. It grabs the image from your gallery, runs the edit, feeds the result into the video pipeline. All using your existing model settings and LoRAs. You don't wire anything together manually.
It has six tools: Video Generator, Video With Speech, Image Generator, Image Editor, Speech From Description, Speech From Sample. They chain automatically when your prompt implies multiple steps. You can also reference specific files — "use image_3" or "animate the last video" — and it tracks context.
If you're a terminal person, there's python deepy.py --cli. Commands are /add <path> to load media, /select <ref> to pick a target, /size WxH to override resolution, /template <tool> <variant> for presets. Submit with Shift+Enter, Ctrl+Enter for multiline.
One thing to watch: Deepy doesn't always preserve aspect ratios between steps. I told it to edit an image and then animate it. The edit came out at 4:3 (the image's native ratio), but the video template defaulted to 16:9. The final video stretched the sky horizontally. The render took 14 minutes and I only noticed at the end. Always check the intermediate output before committing to a long queue.
The whole thing lives under a tab called "Prompt Enhancer / Deepy." The name sounds like it just expands your prompt into a longer paragraph. It actually runs a full multi-step pipeline.
TeaCache — Cuts Render Time in Half With One Checkbox
This is the biggest quality-of-life improvement I've found. TeaCache is basically a step-skipper. It caches intermediate denoising states and when the change between two consecutive steps is small enough, it skips ahead instead of recomputing.
How to use it:
UI: Expand Advanced Settings in the generation panel, find TeaCache, set threshold to 2.0.
CLI: Add --teacache 2.0 to your launch command: python wgp.py --teacache 2.0
The numbers matter here. At 1.0 it's conservative — skips maybe 20% of steps, barely any quality impact but also not much time saved. At 2.0 it skips about 50%. On Wan 2.2 14B at 480p, I went from 28 minutes down to 14 minutes on a 4070 Ti Super 16GB. Side by side I couldn't tell the difference. At 3.0 it gets aggressive — skips 70%+ and you start seeing motion artifacts. I tested it with a car driving down a road and it looked like the car was teleporting in 3-frame jumps instead of moving smoothly. Not great.
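To make the mechanism concrete, here's a rough sketch of the step-skipping idea in generic PyTorch. This is my paraphrase, not Wan2GP's actual code, and the UI's 1.0/2.0/3.0 values don't map one-to-one onto the cutoff below; the point is just that a higher setting lets more accumulated drift pass before forcing a full forward pass.

```python
import torch

# Rough sketch of the TeaCache idea (my paraphrase, not Wan2GP's code):
# cache the last denoising residual and reuse it while consecutive steps
# barely change, instead of re-running the full transformer every step.
def denoise_with_cache(model, latents, timesteps, skip_threshold=0.1):
    prev_latents = None
    cached_residual = None
    accumulated_change = 0.0

    for t in timesteps:
        if prev_latents is not None:
            # Relative drift of the input since the previous step
            change = (latents - prev_latents).abs().mean() / prev_latents.abs().mean()
            accumulated_change += float(change)
        prev_latents = latents

        if cached_residual is not None and accumulated_change < skip_threshold:
            residual = cached_residual           # skip: reuse the cached result
        else:
            residual = model(latents, t)         # expensive full forward pass
            cached_residual = residual
            accumulated_change = 0.0

        latents = latents + residual             # simplified scheduler update

    return latents

# Stand-in "model" just to show the call shape; the real one is the video DiT
fake_model = lambda x, t: -0.01 * x
out = denoise_with_cache(fake_model, torch.randn(1, 4, 8, 8), timesteps=range(30))
```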
One more thing: TeaCache doesn't play well with LoRA accelerators at very low step counts. If you're running a LoRA accelerator at 4-6 steps AND TeaCache at 2.0, the cache skips too much because there aren't enough steps to begin with. Use one or the other at aggressive settings, not both. I run TeaCache at 2.0 with the base model, or LoRA accelerator at 8 steps without TeaCache.
There's no hype around TeaCache because it's not a feature launch — it's just a checkbox in Advanced Settings that the docs mention in a single sentence.
Headless CLI Queue — Batch Process While You Sleep
The UI is nice but it eats RAM. Chromium on top of a 14B model is rough on a 16GB card. And you can't really queue more than a few prompts before the interface starts getting sluggish.
The solution is headless mode. Here's the workflow:
- Set up your model, resolution, steps in the UI like normal
- Enter your prompt and click "Add to Queue" (not Generate)
- Tweak settings for the next prompt if needed — different seed, different model, whatever
- Add more prompts
- Click "Save Queue" — it exports a
.zipfile with all your prompts and their settings
Then close the browser and run:
python wgp.py --process my_queue.zip
That's it. It processes the entire queue without any UI overhead. No Gradio, no browser, nothing. I queue 20 prompts before bed; most mornings 12-15 finished videos are waiting in the output folder.
Watch out for this: headless mode doesn't stop when a queue item references a model checkpoint you haven't downloaded yet. It logs a warning to the terminal and moves on to the next item. I woke up once to 8 videos instead of 20 because 12 of them used a Hunyuan variant I hadn't downloaded. Check the terminal output; don't just count files in the output folder.
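If you don't want to scroll back through a night's worth of terminal output, a small wrapper that writes everything to a log file helps. This is a sketch: it assumes the skipped-item warning contains a recognizable keyword, so adjust the filter to whatever your terminal actually prints.

```python
# Run the queue headless and keep the full log for the morning.
import subprocess

with open("queue_run.log", "w") as log:
    subprocess.run(
        ["python", "wgp.py", "--process", "my_queue.zip"],
        stdout=log,
        stderr=subprocess.STDOUT,
    )

# Next morning: scan the log for anything that was skipped or warned about.
# The exact message text may differ from these keywords.
with open("queue_run.log") as log:
    for line in log:
        if "warn" in line.lower() or "skip" in line.lower():
            print(line.rstrip())
```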
It's documented in docs/CLI.md — a separate file. The UI is so polished people assume it's the whole product.
LoRA Accelerators — Drop From 30 Steps to 8
The default recommendation for Wan 2.2 is 30-50 steps. That's what the docs say, that's what every tutorial uses. But there are LoRA adapters specifically designed to speed up sampling.
How to set it up:
- Download a LoRA accelerator for your exact model (check docs/LORAS.md or the project's HuggingFace)
- Put the .safetensors file in the loras/ directory inside Wan2GP
- In the UI, go to the LoRA section and select your accelerator
- Set the LoRA weight to 1.0 — don't crank it higher, it starts introducing artifacts (see the sketch after this list)
- Drop your step count from 30 to 8
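For intuition on why the weight setting matters: a LoRA is a low-rank delta added onto the base weights, scaled by that number. Here's the standard LoRA math as a generic sketch, nothing Wan2GP-specific:

```python
import torch

# Generic LoRA application (illustrative, not Wan2GP's loader): the adapter
# stores two small matrices whose product is a low-rank delta for a weight.
def apply_lora(base_weight, lora_down, lora_up, weight=1.0):
    # base_weight: (out, in), lora_down: (rank, in), lora_up: (out, rank)
    delta = lora_up @ lora_down
    return base_weight + weight * delta

# Example shapes: a rank-16 adapter patched onto a 1024x1024 projection
base = torch.randn(1024, 1024)
down = torch.randn(16, 1024) * 0.01
up = torch.randn(1024, 16) * 0.01
patched = apply_lora(base, down, up, weight=1.0)
# Weights above 1.0 over-apply the delta, which is where artifacts creep in.
```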
These LoRAs are small — usually under 200MB — and they guide the sampler toward good outputs faster. On LTX-2 Dev with an accelerator, I went from 45 minutes at 30 steps to 8 minutes at 8 steps for a 10-second clip.
The quality tradeoff is real but manageable. If you zoom in on fine textures — hair strands, fabric patterns — you'll see less detail. At normal viewing size it's hard to spot. For social media or quick drafts it's completely fine.
The usual caveat: LoRA accelerators are model-specific. I loaded a Wan 2.1 accelerator on Wan 2.2 and everything came out with a greenish tint, like a cheap filter. The architectures are similar but not identical. Always check the LoRA targets your exact model version.
The LoRA docs are in a separate file and the UI just lists filenames — no labels telling you which are speed boosters vs aesthetic changes. Accelerators usually have "speed," "accel," or "fast" in the name.
LTX-Video — 60-Second Clips With Audio Sync
Wan models max out at 4-8 second clips. If you need longer, people usually stitch them together in an editor. But Wan2GP also supports LTX-Video, which handles long-form generation natively.
How it works:
Switch the model dropdown from Wan to LTX-Video. Two things LTX does that Wan doesn't:
Sliding window generation — Generates long videos by processing overlapping chunks and blending the boundaries. You don't manually storyboard or stitch. Set frame count to 1000+ and it handles transitions internally. On a 10GB card I ran a 30-second generation (roughly 720 frames at 24fps) in one shot.
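My mental model of the sliding window, as a toy sketch rather than LTX-Video's actual implementation: generate overlapping chunks, cross-fade the overlap, and keep stepping forward until you hit the target frame count. The chunk and overlap numbers below are arbitrary.

```python
import numpy as np

# Toy sketch of sliding-window generation (my mental model, not LTX-Video's
# actual code): produce overlapping chunks and cross-fade the overlap region.
def generate_long(generate_chunk, total_frames, chunk=97, overlap=16):
    frames = []
    start = 0
    while start < total_frames:
        n = min(chunk, total_frames - start)
        new = generate_chunk(start, n)          # returns a list of frame arrays
        if frames and overlap:
            blend = min(overlap, len(new), len(frames))
            for i in range(blend):
                w = (i + 1) / (blend + 1)       # linear cross-fade weight
                frames[-blend + i] = (1 - w) * frames[-blend + i] + w * new[i]
            new = new[blend:]
        frames.extend(new)
        start += chunk - overlap
    return frames

# Hypothetical stand-in for the sampler, just to show the call shape
def fake_chunk(start, n):
    return [np.full((8, 8, 3), float(start + i), dtype=np.float32) for i in range(n)]

video = generate_long(fake_chunk, total_frames=720)
print(len(video))  # 720 frames with these chunk/overlap numbers, ~30 seconds at 24fps
```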
Audio sync via MMAudio — Generates audio matching the video content. But this isn't automatic. You have to trigger it as a separate step.
For audio:
- Generate your video with LTX-Video
- Open Deepy
- Use "Video With Speech" or "Speech From Sample"
- For Speech From Description: type what audio you want — "rain sounds, distant thunder, light wind"
- For Speech From Sample: load an audio file and it transfers the characteristics
Audio sync is NOT a one-click thing though. I assumed "native audio sync" meant it happened during video generation. It doesn't. You generate video first, then generate audio separately. And the audio quality depends entirely on how specific your description is. Vague prompts give you generic soundscapes.
LTX-Video is just another option in the model dropdown. Most people pick Wan because it's the project namesake. LTX sits there quietly doing the one thing people think Wan2GP can't do.
How to Run This on RunPod (If You Don't Have a Local GPU)
Not everyone has a compatible Nvidia card. Here's the RunPod path:
- Go to RunPod, deploy a GPU instance. An RTX 4090 or A100 works well. Even an A6000 is fine.
- Use the community Wan2GP template if it's available, or start from a base Ubuntu + CUDA image
- Clone the repo: git clone https://github.com/deepbeepmeep/Wan2GP.git
- Run the install script: cd Wan2GP && bash scripts/install.sh
- The install script handles conda, PyTorch, and all the acceleration kernels — SageAttention, TeaCache, GGUF support, everything
- Access the UI at http://localhost:7860 or the RunPod custom port URL
- Download model checkpoints — they're large, Wan 2.2 14B is around 28GB
Cost-wise: an A100 hour on RunPod is roughly $1-2 depending on the instance type. A 14-minute Wan 2.2 generation on an A100 costs under $0.50. Compare that to dedicated AI video platforms charging $0.10-$0.50 per clip with zero settings control.
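The arithmetic, using those numbers:

```python
# Back-of-envelope cost per clip at the A100 rates quoted above
gen_minutes = 14
for rate_per_hour in (1.0, 2.0):
    cost = rate_per_hour * gen_minutes / 60
    print(f"${rate_per_hour:.2f}/hr -> ${cost:.2f} per generation")
# roughly $0.23 to $0.47 per clip, not counting time spent downloading checkpoints
```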
If you want to try it: https://runpod.io?ref=bnbg8jdt
The Signal
Pick one setting. Just one. Flip TeaCache to 2.0 and see if your render time actually halves. Or enable Deepy and ask it to chain two steps together instead of doing them manually.
The gap between people who make AI video and people who talk about making AI video isn't GPU budget. It's knowing which settings to change and which defaults to ignore. The default settings in these tools are tuned for maximum quality at maximum cost. Nobody's job is to make them usable on a gaming card.
Wan2GP is built by someone whose job is exactly that. And the hidden features aren't secret. They're just not advertised.