Tutorial: Creating a Consistent Character as a Textual Inversion Embedding #3
16 comments · 18 replies
-
Hey, thanks for putting this together. It really helped me understand SD a lot more. I do have a question about step 4 (the training step), though. When I trained for 150 steps, I got no images that looked like my samples. Not even close. Of the previews generated every 5 steps, most were just very weird; maybe 3 looked normal but were still completely different from the samples. I noted some differences between what you trained against and what I had to use. You said to "train against the default pack that came with your SD install"; for me that was v1-5-pruned-emaonly.safetensors [6ce0161689]. I'm wondering if this affected my training results somehow. Any ideas?
-
That checkpoint should be fine, I think. Could it be that you're hitting a known issue with A1111 and xformers on some 30xx / 40xx cards? AUTOMATIC1111/stable-diffusion-webui#7264
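If you want to quickly check whether your setup matches that issue (an RTX 30xx/40xx card with xformers enabled), here's a rough diagnostic sketch to run from the webui's Python environment. It's only a check, not a fix; the actual workaround in the linked issue is to try training without the `--xformers` flag.

```python
# Rough diagnostic: are we on an RTX 30xx/40xx card with xformers installed?
# Run this from the same Python environment that the webui uses.
import importlib.util

import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    print(f"GPU: {name}")
    if any(series in name for series in ("RTX 30", "RTX 40")):
        print("This is one of the card families mentioned in the linked issue.")
else:
    print("No CUDA GPU detected.")

# xformers only matters if the package is importable and --xformers is in your
# COMMANDLINE_ARGS; if so, try one training run with that flag removed.
if importlib.util.find_spec("xformers") is not None:
    print("xformers is installed in this environment.")
else:
    print("xformers is not installed here.")
```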
-
Hi, thanks for the work. When I generate the grid, I can only download a single PNG containing the whole grid. Is there a way to also download every individual image that was generated? Thanks!
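One way to recover the individual images from just the grid PNG is to split it yourself. Below is a minimal Pillow sketch, assuming the grid is a plain tiling of equal-sized 512x512 cells with no padding; the filename is a placeholder.

```python
# Split a txt2img grid back into individual images.
# Assumes a plain tiling of equal-sized cells with no padding between them.
from PIL import Image

CELL_W, CELL_H = 512, 512           # size of each generated image
grid = Image.open("grid-0001.png")  # placeholder filename, use your own grid

cols = grid.width // CELL_W
rows = grid.height // CELL_H

for row in range(rows):
    for col in range(cols):
        box = (col * CELL_W, row * CELL_H, (col + 1) * CELL_W, (row + 1) * CELL_H)
        grid.crop(box).save(f"tile_r{row}_c{col}.png")
```

It's also worth checking the webui's output folders first; if I remember right, A1111 saves the individual images alongside the grid by default, so you may already have them on disk.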
-
Thanks for the tutorial! When trying to train the embedding, no matter what I do, it always just says "Training finished at 0 steps" and that's it. I've been trying to resolve this for 3 hours now without any luck. Any idea why I might be running into this problem? Edit: After a long night I came to the painful realisation that it is indeed my GPU's fault; it seems 4 GB of VRAM isn't enough for training. Well, I've been wanting to replace my GPU anyway, and this seems like a reasonably good time.
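For anyone else hitting "Training finished at 0 steps", it's worth confirming how much VRAM the card actually has before digging through settings. A quick sketch, run from the webui's Python environment:

```python
# Quick check of total VRAM; textual inversion training in A1111 generally
# wants more than a 4 GB card can offer.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / (1024 ** 3)
    print(f"{props.name}: {total_gb:.1f} GB VRAM")
else:
    print("No CUDA GPU detected.")
```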
-
Thanks for this in-depth tutorial, really interesting.
-
Any chance of getting back to us on this? I was really excited to try this method, but when training I just get weird random output, nothing that looks like my dataset. I've been training LoRAs and Dreambooth models with no problem. What's going on here?
-
Firstly, thanks very much for the tutorial. A lot of the techniques covered were new to me and are extremely useful. In the few days since I started learning about textual inversion (amazing stuff), I've gone from using exclusively img2img to exclusively txt2img, and I've made several inversions I'm pretty happy with. That said, I needed extra resources beyond this tutorial, as well as a lot of patience, to get there.

The main issue I have is where the tutorial ends: there seem to be extra steps between your workflow and the results you achieved in the inversions posted to CivitAI. For one thing, the tutorial suggests the end result will be achieved within 150 its, but on CivitAI it's mentioned they were trained for 5,000 its. It's almost as if the tutorial means epochs where it says iterations: for 25 images, 150 epochs would be 3,750 its. Then I checked the stats for inversions made by others and found many of them were trained for 20k its or more. That's over 100x more iterations than the point at which the tutorial suggests stopping because the embedding will be approaching overtraining or already overtrained.

So I started experimenting with more settings and source image sets, and eventually ended up with much better results. In my opinion, 8 vectors per token for 25 images is too many vectors. I agree with your explanation of the parameter, and that it should be set according to the number of source images, but that many vectors is too much information storage for too little source data. For 25 images I would set it much lower, at around 3 or 4. In that case, as you said, it will be harder for the model to retain as much info per iteration, but that isn't really a problem if you just do more iterations. That way the model progresses through training more slowly and accurately, and is less likely to derail or become overtrained before converging on a faithful and accurate representation of the subject. You want to give it as much information as possible about the vectors it is pursuing, via your input (and batch size), but if you don't limit the scope of what it is pursuing, it will start imagining undesirable details before it has a chance to properly capture the desirable ones.

Re: batch size, I think your good results in the tutorial have a lot to do with good input: the model can more or less copy the likeness and call it a day. But for me (and, I think, the majority of us with less-than-perfect input; what even counts as perfect at 512x512 anyway?), I'd just set the batch size to the maximum your GPU can handle. For me that's 4. I found it helps a lot with convergence, where the model becomes able to interpolate angles that don't exist in the input, e.g. for a shot that isn't quite a medium shot or a closeup. Again, you want it to imagine the good details slowly and accurately, and to do so before it starts taking the path of least resistance, pursuing a bunch of vectors you don't want and over-stylizing the image.

What has given me the best results is a set of around 75 input images, including some that are deliberately imperfect: eyes closed, a weird expression, tongue poking out, and so on. It helps the model learn how to deform the embedded face in response to a prompt, and leads to more varied and more realistic expressions being possible with the final embedding.
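To make the epoch/iteration arithmetic above concrete, here's a rough sketch; the numbers are just the tutorial's 25 images and my own batch size, so swap in your own.

```python
# Rough arithmetic for A1111 textual inversion training:
# one "step" (iteration) processes one batch, and one epoch is one pass
# over the whole source image set.
import math

num_images = 25   # images in the source set
batch_size = 1    # whatever your GPU can handle
epochs = 150      # e.g. if the tutorial's "150" were read as epochs

steps_per_epoch = math.ceil(num_images / batch_size)
total_steps = epochs * steps_per_epoch
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps for {epochs} epochs")
# -> 25 steps per epoch, 3750 steps for 150 epochs
```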
When I create the embedding, I typically use 1 vector per 6 or 7 images, so 9 or 10 for 75 input images. It depends on the nature of the input, though; if there are a lot of potentially undesirable components in it, I'll limit the vectors a bit more to keep the model on the rails.

I start training with just [filewords] as the prompt template and a learning rate of 0.005 for 1 epoch, or 75 iterations in my case. Then I drop it to 0.0006 for 200-400 its. When the previews start to get the general likeness, I drop it to 0.0001 and leave it there for up to several thousand its. I progressively drop it to around 0.00004 at around 10k its and finally to 0.000025 around 20k its. My best inversions are trained for 20-25k its, and I have a couple of good ones with fewer source images and vectors that converged well by around 15k.

The whole time I'm training, I'm monitoring the output. If I'm not watching live previews, I'll check the folder of previously generated previews every now and then for indications that it's time to drop the learning rate. The thresholds described above are just where, in my experience, the model has usually made good progress toward the goals appropriate for that learning rate. I'll also introduce the [subject] prompt template and some other prompt templates around the halfway mark, so that earlier on there's no funny business where it treats [subject], [filewords] as representing identical twins or something.

Other than that, I follow a similar process to yours, plotting particular milestone embeddings together in different models to determine the best candidate. I found that some models affect the face significantly and can completely change the look of the embedded character; these include Realistic Vision and Deliberate. My best results were with F222, which was the most consistent at providing photorealism without overwriting the essence of the character.

Overall, textual inversions are pretty sweet; thanks for highlighting them with your tutorial and bringing them to my attention this way. There's clearly some kind of misinterpretation on my end or yours regarding some of the finer details, or perhaps there's just more than one way to skin this cat, because my results are very good even though what I'm doing isn't consistent with your instructions in a few key areas. But for anyone reading this after not getting the results they wanted from the tutorial: stick with it. If your input is consistent and good, you will get a good result once you have the settings figured out.
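For what it's worth, instead of hand-editing the learning rate at each stage, I believe A1111's learning rate field accepts a stepped schedule of "rate:until_step" pairs. Here's a sketch that builds the kind of schedule I described above; the step numbers are just my usual ballpark figures, not anything from the tutorial.

```python
# Build an A1111-style stepped learning-rate string from (rate, until_step)
# pairs; the final entry has no step and applies for the rest of training.
# The step numbers are just ballpark figures from my own runs.
schedule = [
    (0.005, 75),       # 1 epoch of 75 images at a high rate
    (0.0006, 400),     # until the previews start to get the general likeness
    (0.0001, 10000),   # long, slow convergence
    (0.00004, 20000),  # fine detail
    (0.000025, None),  # final polish until I stop training
]

parts = []
for rate, until_step in schedule:
    parts.append(f"{rate}" if until_step is None else f"{rate}:{until_step}")
print(", ".join(parts))
# -> 0.005:75, 0.0006:400, 0.0001:10000, 4e-05:20000, 2.5e-05
```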
-
Hey, thanks for this great tutorial; it's easy to read and understand. Following these steps I created my first TI that doesn't produce complete crap.
Thanks in advance!
-
Great tutorial!
-
Outstanding tutorial. It was very easy to follow and generated great results. I didn't get any long shots in the 400, so I outpainted a couple and then scaled them back to 512. Even so, anything longer than a medium shot looks pretty bad, so I may try creating more long-shot source images and rerun it. Maybe 3 will do, along with 3 cowboy shots and 3 wide shots.
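If anyone else needs to scale outpainted images back down for the training set, here's a minimal Pillow sketch; the filenames are just placeholders.

```python
# Downscale (and centre-crop, if needed) an outpainted source image back
# to 512x512 for the training set.
from PIL import Image, ImageOps

img = Image.open("outpainted_long_shot.png")     # placeholder filename
img = ImageOps.fit(img, (512, 512), Image.LANCZOS)  # crop to square, then resize
img.save("long_shot_512.png")
```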
-
As great as this tutorial is, generating images of a specific shot type isn't that easy, because SD doesn't understand it most of the time (except for well-trained ones like closeups).
-
Great tutorial! It took a couple of days, but I made my first "Nobody" and I'm extremely pleased. A couple of things I noticed:
-
This is the best tutorial on training embeddings I've seen. Regarding your method of choosing characters, I think generating them using random names gives quicker results: https://www.reddit.com/r/StableDiffusion/comments/158cv9k/actor_casting_consistent_characters/
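As a tiny sketch of the random-name idea: the name lists and prompt wording below are just examples I made up, not anything from the tutorial or that thread.

```python
# Generate prompts with random name pairings to "cast" candidate characters.
# Name lists and the prompt template are purely illustrative.
import random

first_names = ["Clara", "Mateo", "Ingrid", "Kenji", "Amara"]
last_names = ["Lindqvist", "Okafor", "Marchetti", "Delacroix", "Tanaka"]

for _ in range(5):
    name = f"{random.choice(first_names)} {random.choice(last_names)}"
    print(f"photo of {name}, 30 years old, closeup, studio lighting")
```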
-
By the way, the embedding link in the article doesn't work: https://github.com/BelieveDiffusion/tutorials/blob/main/consistent_character_embedding/fr3nchl4dysd15.pt
-
Hi, any idea why A1111 stops working for me at the "Generating permutations" step? It removes all of my generated images and doesn't even start generating the 400 poses. Thanks in advance 😇
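For context on what that step is producing: the hundreds of images come from multiplying out a handful of prompt categories. A purely illustrative sketch of that kind of permutation; the category lists here are made up, not the tutorial's exact ones.

```python
# Illustrative only: a few prompt categories multiply out into a few hundred
# permutations, which is where a count like "400 poses" comes from.
from itertools import product

shots = ["closeup", "medium shot", "cowboy shot", "full body shot"]
angles = ["front view", "side view", "three-quarter view", "from behind", "from above"]
expressions = ["smiling", "neutral", "laughing", "surprised", "angry"]

prompts = [
    f"photo of a woman, {shot}, {angle}, {expression}"
    for shot, angle, expression in product(shots, angles, expressions)
]
print(len(prompts))  # 4 * 5 * 5 = 100; more categories multiply it further
```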
-
Getting this error while training:
-
I've written up how I create my LastName character embeddings as a complete walkthrough tutorial. Give it a try, and please do let me know how you get on!
https://github.com/BelieveDiffusion/tutorials/tree/main/consistent_character_embedding#readme
If you have any suggestions, issues, or tips, feel free to use this discussion to ask for help or share your discoveries 👍