StableDiffusion

1
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/AI_Characters on 2025-07-26 09:30:14+00:00.


CivitAI article link: https://civitai.com/articles/17385

I keep getting asked how I train my WAN2.1 text2image LoRAs, and I am kinda burned out right now, so I'll just post this TLDR of my workflow here. I won't explain anything more than what I write here, and I won't explain why I do what I do. The answer is always the same: I tested a lot and this is what I found to be most optimal. Perhaps there is a more optimal way to do it; I don't care right now. Feel free to experiment on your own.

I use Musubi-Tuner instead of AI-toolkit or something else because I am used to training with Kohya's SD-scripts, and it usually has the most customization options.

Also, this isn't perfect. I find that it works very well in 99% of cases, but there is still the 1% that doesn't work well, or sometimes most things in a model will work well except for a few prompts for some reason. E.g. I have had a Rick and Morty style model on the backburner for a week now because, while it generates perfect representations of the style in most cases, in a few cases it for whatever reason does not get the style through, and I have yet to figure out why after 4 different retrains.

  1. Dataset

18 images. Always. No exceptions.

Styles are by far the easiest, followed by concepts and characters.

Diversity is important to avoid overtraining on a specific thing. That includes both what is depicted and the style it is depicted in (the style part obviously does not apply to style LoRAs).

With 3D-rendered characters or concepts I find it very hard to force through a real photographic style. For some reason, datasets that are mostly 3D renders struggle with that a lot, while datasets of only photos, anime, and other styles usually work fine. So make sure to include many cosplay photos (ones that look very close to the character) or img2img/Kontext/ChatGPT photo versions of the character in question. The same issue exists, to a lesser extent, with anime/cartoon characters. Photo characters (e.g. celebrities) seem to work just fine though.

  2. Captions

I use ChatGPT-generated captions. I find that they work well enough. I use the following prompt for them:

please individually analyse each of the images that i just uploaded for their visual contents and pair each of them with a corresponding caption that perfectly describes that image to a blind person. use objective, neutral, and natural language. do not use purple prose such as unnecessary or overly abstract verbiage. when describing something more extensively, favour concrete details that stand out and can be visualised. conceptual or mood-like terms should be avoided at all costs.

some things that you can describe are:

- the style of the image (e.g. photo, artwork, anime screencap, etc)
- the subjects appearance (hair style, hair length, hair colour, eye colour, skin color, etc)
- the clothing worn by the subject
- the actions done by the subject
- the framing/shot types (e.g. full-body view, close-up portrait, etc...)
- the background/surroundings
- the lighting/time of day
- etc…

write the captions as short sentences.

three example captions:

1. "early 2010s snapshot photo captured with a phone and uploaded to facebook. three men in formal attire stand indoors on a wooden floor under a curved glass ceiling. the man on the left wears a burgundy suit with a tie, the middle man wears a black suit with a red tie, and the man on the right wears a gray tweed jacket with a patterned tie. other people are seen in the background."
2. "early 2010s snapshot photo captured with a phone and uploaded to facebook. a snowy city sidewalk is seen at night. tire tracks and footprints cover the snow. cars are parked along the street to the left, with red brake lights visible. a bus stop shelter with illuminated advertisements stands on the right side, and several streetlights illuminate the scene."
3. "early 2010s snapshot photo captured with a phone and uploaded to facebook. a young man with short brown hair, light skin, and glasses stands in an office full of shelves with files and paperwork. he wears a light brown jacket, white t-shirt, beige pants, white sneakers with black stripes, and a black smartwatch. he smiles with his hands clasped in front of him."

consistently caption the artstyle depicted in the images as “cartoon screencap in rm artstyle” and always put it at the front as the first tag in the caption. also caption the cartoonish bodily proportions as well as the simplified, exaggerated facial features with the big, round eyes with small pupils, expressive mouths, and often simplified nose shapes. caption also the clean bold black outlines, flat shading, and vibrant and saturated colors.

put the captions inside .txt files that have the same filename as the images they belong to. once you're finished, bundle them all up together into a zip archive for me to download.

Keep in mind that for some reason it often fails to number the .txt files correctly, so you will likely need to correct that, or else the wrong captions will end up assigned to the wrong images.
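A quick way to catch such mismatches before training is to check that every image has a caption file with the same stem and vice versa. This is just a minimal sketch, assuming the images are .png/.jpg/.jpeg files sitting next to their .txt captions (later this will be /workspace/musubi-tuner/dataset/):

# run inside the folder holding your images and captions
for img in *.png *.jpg *.jpeg; do
  [ -e "$img" ] || continue                                 # skip extensions with no matches
  [ -f "${img%.*}.txt" ] || echo "missing caption for $img"
done
for txt in *.txt; do
  [ -e "$txt" ] || continue
  stem="${txt%.txt}"
  [ -e "$stem.png" ] || [ -e "$stem.jpg" ] || [ -e "$stem.jpeg" ] || echo "caption without image: $txt"
done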

  3. VastAI

I use VastAI for training. I rent H100s.

I use the following template:

Template Name: PyTorch (Vast)
Version Tag: 2.7.0-cuda-12.8.1-py310-22.04

I use 200 GB of storage space.

I run the following terminal command to install Musubi-Tuner and the necessary dependencies:

git clone --recursive https://github.com/kohya-ss/musubi-tuner.git
cd musubi-tuner
git checkout 9c6c3ca172f41f0b4a0c255340a0f3d33468a52b
apt install -y libcudnn8=8.9.7.29-1+cuda12.2 libcudnn8-dev=8.9.7.29-1+cuda12.2 --allow-change-held-packages
python3 -m venv venv
source venv/bin/activate
pip install torch==2.7.0 torchvision==0.22.0 xformers==0.0.30 --index-url https://download.pytorch.org/whl/cu128
pip install -e .
pip install protobuf
pip install six
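
A quick sanity check worth running at this point (not from the original workflow) is to confirm that the venv's PyTorch can see the GPU and that xformers imports cleanly:

python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"
python -c "import xformers; print(xformers.__version__)"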

Use the following command to download the necessary models:

huggingface-cli login

<your HF token>

huggingface-cli download Comfy-Org/Wan_2.1_ComfyUI_repackaged split_files/diffusion_models/wan2.1_t2v_14B_fp8_e4m3fn.safetensors --local-dir models/diffusion_models
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P models_t5_umt5-xxl-enc-bf16.pth --local-dir models/text_encoders
huggingface-cli download Comfy-Org/Wan_2.1_ComfyUI_repackaged split_files/vae/wan_2.1_vae.safetensors --local-dir models/vae
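
Note that huggingface-cli keeps each repository's folder structure under --local-dir, which is why the caching and training commands below reference the nested split_files/... paths. A quick check (assuming you are still in /workspace/musubi-tuner) that everything landed where those commands expect:

ls -lh models/diffusion_models/split_files/diffusion_models/wan2.1_t2v_14B_fp8_e4m3fn.safetensors
ls -lh models/text_encoders/models_t5_umt5-xxl-enc-bf16.pth
ls -lh models/vae/split_files/vae/wan_2.1_vae.safetensors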

Put your images and captions into /workspace/musubi-tuner/dataset/

Create the following dataset.toml and put it into /workspace/musubi-tuner/dataset/

# resolution, caption_extension, batch_size, num_repeats, enable_bucket, bucket_no_upscale should be set in either general or datasets
# otherwise, the default values will be used for each item

# general configurations
[general]
resolution = [960 , 960]
caption_extension = ".txt"
batch_size = 1
enable_bucket = true
bucket_no_upscale = false

[[datasets]]
image_directory = "/workspace/musubi-tuner/dataset"
cache_directory = "/workspace/musubi-tuner/dataset/cache"
num_repeats = 1 # optional, default is 1. Number of times to repeat the dataset. Useful to balance the multiple datasets with different sizes.

# other datasets can be added here. each dataset can have different configurations
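
With that config in place, the dataset folder should end up looking roughly like this (the image filenames are just placeholders; the cache/ folder gets created by the caching commands in the next section):

/workspace/musubi-tuner/dataset/
    dataset.toml
    image_01.png
    image_01.txt
    ...
    image_18.png
    image_18.txt
    cache/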

  4. Training

Run the following whenever you open a new terminal window and need to do something (to get into the correct folder and activate the venv):

cd /workspace/musubi-tuner
source venv/bin/activate

Run the following command to create the necessary latents for the training (you need to rerun this every time you change the dataset/captions):

python src/musubi_tuner/wan_cache_latents.py --dataset_config /workspace/musubi-tuner/dataset/dataset.toml --vae /workspace/musubi-tuner/models/vae/split_files/vae/wan_2.1_vae.safetensors

Run the following command to cache the necessary text encoder outputs for the training (you need to rerun this every time you change the dataset/captions):

python src/musubi_tuner/wan_cache_text_encoder_outputs.py --dataset_config /workspace/musubi-tuner/dataset/dataset.toml --t5 /workspace/musubi-tuner/models/text_encoders/models_t5_umt5-xxl-enc-bf16.pth

Run accelerate config once before training (answer "no" to everything).

Final training command (aka my training config):

accelerate launch --num_cpu_threads_per_process 1 --mixed_precision bf16 src/musubi_tuner/wan_train_network.py \
  --task t2v-14B \
  --dit /workspace/musubi-tuner/models/diffusion_models/split_files/diffusion_models/wan2.1_t2v_14B_fp8_e4m3fn.safetensors \
  --vae /workspace/musubi-tuner/models/vae/split_files/vae/wan_2.1_vae.safetensors \
  --t5 /workspace/musubi-tuner/models/text_encoders/models_t5_umt5-xxl-enc-bf16.pth \
  --dataset_config /workspace/musubi-tuner/dataset/dataset.toml \
  --xformers --mixed_precision bf16 --fp8_base \
  --optimizer_type adamw --learning_rate 3e-4 \
  --gradient_checkpointing --gradient_accumulation_steps 1 \
  --max_data_loader_n_workers 2 \
  --network_module networks.lora_wan --network_dim 32 --network_alpha 32 \
  --timestep_sampling shift --discrete_flow_shift 1.0 \
  --max_train_epochs 100 --save_every_n_epochs 100 --seed 5 \
  --optimizer_args weight_decay=0.1 --max_grad_norm 0 \
  --lr_scheduler polynomial --lr_scheduler_power 4 --lr_scheduler_min_lr_ratio="5e-5" \
  --output_dir /workspace/musubi-tuner/output \
  --output_name WAN2.1_RickAndMortyStyle_v1_by-AI_Characters \
  --metadata_title WAN2.1_RickAndMortyStyle_v1_by-AI_Characters \
  --metadata_author AI_Characters
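
For reference, with the 18-image dataset above, batch_size = 1, num_repeats = 1, and gradient_accumulation_steps 1, this config works out to a fixed step count (assuming bucketing doesn't drop any images):

18 images x 1 repeat / batch size 1 = 18 optimizer steps per epoch
18 steps per epoch x 100 epochs = 1,800 steps total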

I always use this same config every time for everything. But it's well tuned for m...


Content cut off. Read original on https://old.reddit.com/r/StableDiffusion/comments/1m9p481/my_wan21_lora_training_workflow_tldr/

2
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/ninja_cgfx on 2025-07-26 05:17:51+00:00.

3
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/CQDSN on 2025-07-26 02:22:27+00:00.


This demo was created with the same workflow I posted a couple of weeks ago. It's the opposite of the previous demo: here I am using Kontext to generate an anime style from a live-action movie and using VACE to animate it.

4
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/coopigeon on 2025-07-25 13:58:48+00:00.

5
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/RecentTwo544 on 2025-07-25 18:53:37+00:00.


For those thinking "what in the 1984 are you on about?" here in the UK we've just come under the new Online Safety Act, after years of it going through parliament, which means you need to verify your age for a lot of websites, Reddit included for many subs, and indeed many that are totally innocent because the filter is broken.

However, so that not everyone has to hand over personal details, many websites are offering a verification method whereby you show your face on camera and it tells you if it thinks you're old enough. It's probably quite a flawed system - it's using AI to determine how old you are, so there will be lots of errors - but that got me thinking:

Could you trick the AI, by using AI?

A few mates and I have tried making a "man in his 30s" face using Stable Diffusion and a few different models. Fortunately one mate already has quite a few models downloaded, as CivitAI is now totally blocked in the UK - there's no way to even prove your age; the legislation is simply too much for their small dedicated team to handle, so the whole country is locked out.

It does work for the front view, but then it asks you to turn your head slightly to one side, then the other. None of us are advanced enough to know how to make a video AI face/head that turns like this. But it would be interesting to know if anyone has managed this?

If you've got a VPN, sales of which are rocketing in the UK right now, and aren't in the UK but want to try this, set your location to the UK and try any "adult" site. Most now have this system in place if you want to check it out.

Yes, I could use a VPN, but a) I don't want to pay for a VPN unless I really have to (most porn sites haven't bothered with the verification tools, they simply don't care, and nothing I use on a regular basis is blocked), and b) I'm very interested in AI and the ways it can be used, and indeed in its flaws.

(posted this yesterday but only just realised it was in a much smaller AI sub with a very similar name! Got no answers as yet...)

6
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/More_Bid_2197 on 2025-07-25 23:17:32+00:00.

Original Title: WAN is a very powerful model for generating images, but it has some limitations. While its performance is exceptional in close-ups (e.g., a person inside a house), the model struggles with landscapes, outdoor scenes, and wide shots. The first two photos are WAN, the last is Flux+samsung lora

7
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/FortranUA on 2025-07-25 22:40:21+00:00.

8
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/realtimevideoai on 2025-07-25 16:52:14+00:00.

9
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/FrontalSteel on 2025-07-25 19:52:52+00:00.

10
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/homemdesgraca on 2025-07-25 18:07:37+00:00.


https://reddit.com/link/1m96f4y/video/jmz6gtbo82ff1/player

https://reddit.com/link/1m96f4y/video/ybwz3meo82ff1/player

https://reddit.com/link/1m96f4y/video/ak21w9oo82ff1/player

All of the videos are 1280x720, 30 FPS, 5s.

Original Post (Twitter/X): https://x.com/Alibaba_Wan/status/1948802926194921807

11
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/mrgreaper on 2025-07-25 17:47:18+00:00.

12
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/zer0int1 on 2025-07-25 17:14:14+00:00.

13
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/jurely_you_jestin on 2025-07-25 12:45:15+00:00.

14
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/diStyR on 2025-07-25 12:12:08+00:00.

15
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/leyermo on 2025-07-25 11:23:25+00:00.


Hey everyone!

I'm compiling a list of the most-loved realism models—both SFW and NSFW—for Flux and SDXL pipelines.

If you’ve been generating high-quality realism—be it portraits, boudoir, cinematic scenes, fashion, lifestyle, or adult content—drop your top one or two models from each:

🔹 Flux:

🔹 SDXL:

Please limit to two models max per category to keep things focused. Once we have enough replies, I'll create a poll featuring the most recommended models to help the community discover the best realism models across both SFW and NSFW workflows.

Excited to see what everyone's using!

16
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/NoAerie7064 on 2025-07-25 06:22:16+00:00.

17
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/masslevel on 2025-07-24 22:59:11+00:00.

18
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/bigGoatCoin on 2025-07-24 21:35:49+00:00.

19
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/ilzg on 2025-07-24 20:40:48+00:00.


https://preview.redd.it/dwj6x8ysuvef1.png?width=700&format=png&auto=webp&s=28bf9191836669cd6ef3f947822866fcdf2db095

https://preview.redd.it/pabdq5ysuvef1.png?width=688&format=png&auto=webp&s=e54d80b4c8392dfdfa7f5d2fc6bd8d54f5d7ff1a

Instantly place tattoo designs on any body part (arms, ribs, legs, etc.) with natural, realistic results. Prompt it with “place this tattoo on [body part]” and keep the LoRA scale at 1.0 for best output.

Hugging face: huggingface.co/ilkerzgi/Tattoo-Kontext-Dev-Lora ↗

Use in FAL: https://fal.ai/models/fal-ai/flux-kontext-lora?share=0424f6a6-9d5b-4301-8e0e-86b1948b2859

Use in Civitai: https://civitai.com/models/1806559?modelVersionId=2044424

Follow for more: x.com/ilkerigz

20
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/pheonis2 on 2025-07-24 17:21:58+00:00.

21
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/homemdesgraca on 2025-07-24 18:07:56+00:00.

22
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/More_Bid_2197 on 2025-07-24 16:57:43+00:00.

23
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/looksnicelabs on 2025-07-24 15:58:36+00:00.

24
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/diStyR on 2025-07-24 14:05:52+00:00.

25
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/stablediffusion by /u/Fast-Visual on 2025-07-24 11:43:22+00:00.


I noticed that the official HuggingFace repository for Chroma uploaded a new model yesterday named chroma-unlocked-v46-flash.safetensors. They never did this for previous iterations of Chroma; this is a first. The name "flash" perhaps implies that it should work faster with fewer steps, but it seems to be the same file size as the regular and detail-calibrated Chroma. I haven't tested it yet, but perhaps somebody has insight into what this model is and how it differs from regular Chroma?

Link to the model
