Tutorial: Creating a Consistent Character as a Textual Inversion Embedding #3
16 comments · 18 replies
-
Hey, thanks for putting this together. It really helped me understand SD a lot more. I do have a question about step 4 (the training step), though. When I trained for 150 steps, I got no images that looked like my samples. Not even close. Of the previews generated every 5 steps, most were just very weird; maybe 3 looked normal but were still completely different from the samples. I noted some differences between what you trained against and what I had to use. You said to "train against the default pack that came with your SD install"; for me that was v1-5-pruned-emaonly.safetensors [6ce0161689]. I'm wondering if this affected my training results somehow. Any ideas?
-
That checkpoint should be fine, I think. Could it be that you're hitting a known issue with A1111 and xformers on some 30xx / 40xx cards? AUTOMATIC1111/stable-diffusion-webui#7264
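If you want to quickly check whether your setup matches that issue (an RTX 30xx/40xx card with xformers enabled), here's a rough diagnostic sketch to run from the webui's Python environment. It's only a check, not a fix; the actual workaround in the linked issue is to try training without the `--xformers` flag.

```python
# Rough diagnostic: are we on an RTX 30xx/40xx card with xformers installed?
# Run this from the same Python environment that the webui uses.
import importlib.util

import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    print(f"GPU: {name}")
    if any(series in name for series in ("RTX 30", "RTX 40")):
        print("This is one of the card families mentioned in the linked issue.")
else:
    print("No CUDA GPU detected.")

# xformers only matters if the package is importable and --xformers is in your
# COMMANDLINE_ARGS; if so, try one training run with that flag removed.
if importlib.util.find_spec("xformers") is not None:
    print("xformers is installed in this environment.")
else:
    print("xformers is not installed here.")
```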
-
Hi, thanks for the work. When I generate the grid, I can only download a single PNG containing the whole grid. Is there a way to also download every individual image that was generated? Thanks!
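One way to recover the individual images from just the grid PNG is to split it yourself. Below is a minimal Pillow sketch, assuming the grid is a plain tiling of equal-sized 512x512 cells with no padding; the filename is a placeholder.

```python
# Split a txt2img grid back into individual images.
# Assumes a plain tiling of equal-sized cells with no padding between them.
from PIL import Image

CELL_W, CELL_H = 512, 512           # size of each generated image
grid = Image.open("grid-0001.png")  # placeholder filename, use your own grid

cols = grid.width // CELL_W
rows = grid.height // CELL_H

for row in range(rows):
    for col in range(cols):
        box = (col * CELL_W, row * CELL_H, (col + 1) * CELL_W, (row + 1) * CELL_H)
        grid.crop(box).save(f"tile_r{row}_c{col}.png")
```

It's also worth checking the webui's output folders first; if I remember right, A1111 saves the individual images alongside the grid by default, so you may already have them on disk.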
-
Thanks for the tutorial! When trying to train the embedding, no matter what I do, it always just says "Training finished at 0 steps" and that's it. I've been trying to resolve this for 3 hours now without any luck. Any idea why I might be running into this problem? Edit: After a long night I came to the painful realisation that it is indeed my GPU's fault; it seems 4 GB of VRAM isn't enough for training. Well, I've been wanting to replace my GPU anyway, and this seems like a reasonably good time.
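For anyone else hitting "Training finished at 0 steps", it's worth confirming how much VRAM the card actually has before digging through settings. A quick sketch, run from the webui's Python environment:

```python
# Quick check of total VRAM; textual inversion training in A1111 generally
# wants more than a 4 GB card can offer.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / (1024 ** 3)
    print(f"{props.name}: {total_gb:.1f} GB VRAM")
else:
    print("No CUDA GPU detected.")
```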
-
Thanks for this in-depth tutorial, really interesting.
-
Any chance of getting back to us on this? I was really excited to try this method, but when training I just get weird random output, nothing that looks like my dataset. I've been training LoRAs and Dreambooth models with no problem. What's going on here?
-
Firstly, thanks very much for the tutorial. A lot of the techniques covered were new to me and are extremely useful. In the few days since I started learning about textual inversion (amazing stuff), I've gone from using exclusively img2img to exclusively txt2img, and I've made several inversions I'm pretty happy with. That said, I needed extra resources beyond this tutorial, as well as a lot of patience, to get there.

The main issue I have is where the tutorial ends: there seem to be extra steps between your workflow and the results you achieved in the inversions posted to CivitAI. For one thing, the tutorial suggests the end result will be achieved within 150 its, but on CivitAI it's mentioned they were trained for 5,000 its. It's almost as if the tutorial means epochs where it says iterations: for 25 images, 150 epochs would be 3,750 its. Then I checked the stats for inversions made by others and found many of them were trained for 20k its or more. That's over 100x more iterations than the point at which the tutorial suggests stopping because the embedding will be approaching overtraining or already overtrained.

So I started experimenting with more settings and source image sets, and eventually ended up with much better results. In my opinion, 8 vectors per token for 25 images is too many vectors. I agree with your explanation of the parameter, and that it should be set according to the number of source images, but that many vectors is too much information storage for too little source data. For 25 images I would set it much lower, at around 3 or 4. In that case, as you said, it will be harder for the model to retain as much info per iteration, but that isn't really a problem if you just do more iterations. That way the model progresses through training more slowly and accurately, and is less likely to derail or become overtrained before converging on a faithful and accurate representation of the subject. You want to give it as much information as possible about the vectors it is pursuing, via your input (and batch size), but if you don't limit the scope of what it is pursuing, it will start imagining undesirable details before it has a chance to properly capture the desirable ones.

Re: batch size, I think your good results in the tutorial have a lot to do with good input: the model can more or less copy the likeness and call it a day. But for me (and, I think, the majority of us with less-than-perfect input; what even counts as perfect at 512x512 anyway?), I'd just set the batch size to the maximum your GPU can handle. For me that's 4. I found it helps a lot with convergence, where the model becomes able to interpolate angles that don't exist in the input, e.g. for a shot that isn't quite a medium shot or a closeup. Again, you want it to imagine the good details slowly and accurately, and to do so before it starts taking the path of least resistance, pursuing a bunch of vectors you don't want and over-stylizing the image.

What has given me the best results is a set of around 75 input images, including some that are deliberately imperfect: eyes closed, a weird expression, tongue poking out, and so on. It helps the model learn how to deform the embedded face in response to a prompt, and leads to more varied and more realistic expressions being possible with the final embedding.
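To make the epoch/iteration arithmetic above concrete, here's a rough sketch; the numbers are just the tutorial's 25 images and my own batch size, so swap in your own.

```python
# Rough arithmetic for A1111 textual inversion training:
# one "step" (iteration) processes one batch, and one epoch is one pass
# over the whole source image set.
import math

num_images = 25   # images in the source set
batch_size = 1    # whatever your GPU can handle
epochs = 150      # e.g. if the tutorial's "150" were read as epochs

steps_per_epoch = math.ceil(num_images / batch_size)
total_steps = epochs * steps_per_epoch
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps for {epochs} epochs")
# -> 25 steps per epoch, 3750 steps for 150 epochs
```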
When I create the embedding, I typically use 1 vector per 6 or 7 images, so 9 or 10 for 75 input images. It depends on the nature of the input, though; if there are a lot of potentially undesirable components in it, I'll limit the vectors a bit more to keep the model on the rails.

I start training with just [filewords] as the prompt template and a learning rate of 0.005 for 1 epoch, or 75 iterations in my case. Then I drop it to 0.0006 for 200-400 its. When the previews start to get the general likeness, I drop it to 0.0001 and leave it there for up to several thousand its. I progressively drop it to around 0.00004 at around 10k its and finally to 0.000025 around 20k its. My best inversions are trained for 20-25k its, and I have a couple of good ones with fewer source images and vectors that converged well by around 15k.

The whole time I'm training, I'm monitoring the output. If I'm not watching live previews, I'll check the folder of previously generated previews every now and then for indications that it's time to drop the learning rate. The thresholds described above are just where, in my experience, the model has usually made good progress toward the goals appropriate for that learning rate. I'll also introduce the [subject] prompt template and some other prompt templates around the halfway mark, so that earlier on there's no funny business where it treats [subject], [filewords] as representing identical twins or something.

Other than that, I follow a similar process to yours, plotting particular milestone embeddings together in different models to determine the best candidate. I found that some models affect the face significantly and can completely change the look of the embedded character; these include Realistic Vision and Deliberate. My best results were with F222, which was the most consistent at providing photorealism without overwriting the essence of the character.

Overall, textual inversions are pretty sweet; thanks for highlighting them with your tutorial and bringing them to my attention this way. There's clearly some kind of misinterpretation on my end or yours regarding some of the finer details, or perhaps there's just more than one way to skin this cat, because my results are very good even though what I'm doing isn't consistent with your instructions in a few key areas. But for anyone reading this after not getting the results they wanted from the tutorial: stick with it. If your input is consistent and good, you will get a good result once you have the settings figured out.
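For what it's worth, instead of hand-editing the learning rate at each stage, I believe A1111's learning rate field accepts a stepped schedule of "rate:until_step" pairs. Here's a sketch that builds the kind of schedule I described above; the step numbers are just my usual ballpark figures, not anything from the tutorial.

```python
# Build an A1111-style stepped learning-rate string from (rate, until_step)
# pairs; the final entry has no step and applies for the rest of training.
# The step numbers are just ballpark figures from my own runs.
schedule = [
    (0.005, 75),       # 1 epoch of 75 images at a high rate
    (0.0006, 400),     # until the previews start to get the general likeness
    (0.0001, 10000),   # long, slow convergence
    (0.00004, 20000),  # fine detail
    (0.000025, None),  # final polish until I stop training
]

parts = []
for rate, until_step in schedule:
    parts.append(f"{rate}" if until_step is None else f"{rate}:{until_step}")
print(", ".join(parts))
# -> 0.005:75, 0.0006:400, 0.0001:10000, 4e-05:20000, 2.5e-05
```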
-
Hey, thanks for this great tutorial; it's easy to read and understand. Following these steps I created my first TI that doesn't produce complete crap.
Thanks in advance!
-
Great tutorial!
-
Outstanding tutorial. It was very easy to follow and generated great results. I didn't get any long shots in the 400, so I outpainted a couple and then scaled them back to 512. Even so, anything longer than a medium shot looks pretty bad, so I may try creating more long-shot source images and rerun it. Maybe 3 will do, along with 3 cowboy shots and 3 wide shots.
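If anyone else needs to scale outpainted images back down for the training set, here's a minimal Pillow sketch; the filenames are just placeholders.

```python
# Downscale (and centre-crop, if needed) an outpainted source image back
# to 512x512 for the training set.
from PIL import Image, ImageOps

img = Image.open("outpainted_long_shot.png")     # placeholder filename
img = ImageOps.fit(img, (512, 512), Image.LANCZOS)  # crop to square, then resize
img.save("long_shot_512.png")
```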
-
As great as this tutorial is, generating images of a specific shot type isn't that easy, because SD doesn't understand it most of the time (except for well-trained ones like closeups).
-
Great tutorial! It took a couple of days, but I made my first "Nobody" and I'm extremely pleased. A couple of things I noticed:
-
This is the best tutorial on training embeddings I've seen. Regarding your method of choosing characters, I think generating them using random names gives quicker results: https://www.reddit.com/r/StableDiffusion/comments/158cv9k/actor_casting_consistent_characters/
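As a tiny sketch of the random-name idea: the name lists and prompt wording below are just examples I made up, not anything from the tutorial or that thread.

```python
# Generate prompts with random name pairings to "cast" candidate characters.
# Name lists and the prompt template are purely illustrative.
import random

first_names = ["Clara", "Mateo", "Ingrid", "Kenji", "Amara"]
last_names = ["Lindqvist", "Okafor", "Marchetti", "Delacroix", "Tanaka"]

for _ in range(5):
    name = f"{random.choice(first_names)} {random.choice(last_names)}"
    print(f"photo of {name}, 30 years old, closeup, studio lighting")
```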
-
By the way, the embedding link in the article doesn't work: https://github.com/BelieveDiffusion/tutorials/blob/main/consistent_character_embedding/fr3nchl4dysd15.pt
-
Hi, any idea why A1111 stops working for me at the "Generating permutations" step? It removes all of my generated images and doesn't even start generating the 400 poses. Thanks in advance 😇
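For context on what that step is producing: the hundreds of images come from multiplying out a handful of prompt categories. A purely illustrative sketch of that kind of permutation; the category lists here are made up, not the tutorial's exact ones.

```python
# Illustrative only: a few prompt categories multiply out into a few hundred
# permutations, which is where a count like "400 poses" comes from.
from itertools import product

shots = ["closeup", "medium shot", "cowboy shot", "full body shot"]
angles = ["front view", "side view", "three-quarter view", "from behind", "from above"]
expressions = ["smiling", "neutral", "laughing", "surprised", "angry"]

prompts = [
    f"photo of a woman, {shot}, {angle}, {expression}"
    for shot, angle, expression in product(shots, angles, expressions)
]
print(len(prompts))  # 4 * 5 * 5 = 100; more categories multiply it further
```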
-
Getting this error while training:
-
I've written up how I create my LastName character embeddings as a complete walkthrough tutorial. Give it a try, and please do let me know how you get on!
https://github.com/BelieveDiffusion/tutorials/tree/main/consistent_character_embedding#readme
If you have any suggestions, issues, or tips, feel free to use this discussion to ask for help or share your discoveries 👍