Proficient in generating images with Stable Diffusion, and understanding the differences between the four training methods: Dreambooth, Textual Inversion, LoRA, and Hypernetworks.
As generative AI grows more capable, more and more practitioners are looking to AI models to improve research and development efficiency. There are many popular AI models in the industry, such as the drawing tool Midjourney, the versatile Stable Diffusion, and OpenAI's recently updated DALL-E 2. While the latter has seen limited adoption, many developers have tried the first two.
However, for an R&D team, although Midjourney is powerful and requires no local installation, it demands more of the hardware, and the same prompt can produce different results each run. In contrast, Stable Diffusion, which is feature-rich, open-source, fast, and light on power and memory, has become the more practical choice.
Recently, someone even used Stable Diffusion and Dreambooth to train an AI that imitates the style of a human illustrator: with only 32 of her works, it produced art that closely matches the style of illustrator Hollie Mengert.
Currently, there are four methods for training Stable Diffusion models: Dreambooth, Textual Inversion, LoRA, and Hypernetworks. What are the characteristics of these models? Which one is more suitable for developers to use?
The Four Mainstream Training Methods for Stable Diffusion
Dreambooth
1. What is DreamBooth?
DreamBooth is a subject-driven generation technique from Google that fine-tunes a text-to-image diffusion model on a handful of images of a subject. DreamBooth can do things that other diffusion models cannot do or do poorly, such as placing a specific subject into new contexts, something models like DALL-E 2, Midjourney, and Stable Diffusion struggle with out of the box.
Dreambooth produces personalized results: given a few user-supplied images of a subject, the fine-tuned text-to-image model can render that subject in new scenes.
2. The working principle of Dreambooth.
With just a few images as input (usually 3-5), Dreambooth fine-tunes a diffusion model (Imagen in the original paper; Stable Diffusion in most community use) so that a unique identifier becomes linked to the subject. During inference, that unique identifier is used to synthesize the subject in various contexts.
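To make the mechanism concrete, here is a rough, hedged sketch of that fine-tuning loop. ToyDenoiser stands in for the real text-conditioned UNet, the tensors are random placeholders, and real noise schedulers scale noise by timestep; this is illustrative, not Google's actual Dreambooth code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDenoiser(nn.Module):
    # Stands in for the real text-conditioned diffusion UNet.
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 2, 128), nn.ReLU(), nn.Linear(128, dim))
    def forward(self, noisy_image, text_emb):
        return self.net(torch.cat([noisy_image, text_emb], dim=-1))

model = ToyDenoiser()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

text_emb = torch.randn(1, 64)   # placeholder embedding of "a photo of sks corgi"
image = torch.randn(1, 64)      # placeholder for one of the 3-5 subject photos

for step in range(100):
    noise = torch.randn_like(image)
    noisy = image + noise                   # real schedulers scale noise by timestep
    pred = model(noisy, text_emb)           # the model predicts the added noise
    loss = F.mse_loss(pred, noise)          # denoising loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                        # Dreambooth updates ALL model weights
The point of the sketch is the last line: because every weight is trainable, the output of Dreambooth is an entire new checkpoint rather than a small add-on file.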
3. Instructions for use:
1) Prepare input pictures: If you want to turn yourself into AI art, prepare at least five clear photos of yourself and upload them to the Colab notebook in the steps that follow. The more input pictures, the better; if you provide only a few, the notebook will generate class (regularization) images on its own to supplement training.
There is no hard limit on the number of images you can upload, so feel free to add as many as you like. Include medium shots and full-body shots from different angles and under different lighting, and avoid pictures that are poorly lit or too dark. Of course, you can also train Dreambooth on celebrity photos.
2) Go to a Google Colab notebook: Currently, there are three Colab notebooks that can run Dreambooth on Stable Diffusion: from Hugging Face, ShivamShrirao, and TheLastBen.
Considering speed and the usage of VRAM, we will use TheLastBen's Colab notebook to train and generate images. Open TheLastBen's Colab notebook on your computer, then click "File" and "Save a copy in Drive".
3) Obtain Access Token from Hugging Face: To use any Dreambooth related Google Colab, you need to obtain an access token from Hugging Face.
Go to the Hugging Face website and register with your email address. Using a company email can help you find colleagues and join teams. Then, on the settings page accessed through the “Profile icon,” click “Access Token,” and create your access token by clicking “New Token.”
When creating the token, you must select the "Write" role. You can name the token anything you like, though it's a good idea to use a name related to where you'll use it, in this case the Colab notebook. Finally, copy the token you created.
4) Running the Colab notebook: After opening the copied Colab notebook from TheLastBen, in the “Downloading the model” section, click on the Hugging Face link, accept the terms, and then click “Access repository”.
Now, you can find the “Huggingface_Token” section under “Downloading the model”. Paste the token you copied in step three. Then, you need to run each cell one by one, meaning you should run the first cell and wait for the green check mark before starting the next cell.
After running the first cell, you will see a permission request from Colab to access your Google Drive files. Just click on “Connect to Google Drive” to proceed.
Before running the "Setting up" cell, make sure to enter the subject name and instance name, and specify the number of images you want to upload for training.
Then run the cell. If you are uploading a small number of images, click the "Choose files" button; if you are uploading many, put the folder URL in the "Instance_DIR_Optional" field.
The seventh cell is optional; then run the eighth cell, "Start Dreambooth". This last cell takes 30 to 90 minutes to complete.
5) Check the output image in Google Drive: Finally, check the AI-generated image in your Google Drive.
Images generated with Dreambooth by community users.
Textual Inversion
Textual Inversion is a technique for capturing new concepts from a small set of example images by learning new "words" in the embedding space of the pipeline's text encoder. These special words can then be used in text prompts to achieve fine-grained control over the generated images.
1. How does it work?
Before a text prompt can condition a diffusion model, it must first be processed into a numerical representation. This typically involves tokenizing the text, converting each token into an embedding, and feeding those embeddings through a model (often a transformer), whose output is used as the conditioning for the diffusion model.
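To make that concrete, here is a minimal sketch of the prompt-to-conditioning step using the CLIP text encoder from the transformers library; the checkpoint name is the one commonly paired with Stable Diffusion v1, stated here as an assumption.
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# tokenize: text -> token ids, padded to the model's 77-token context
tokens = tokenizer("a photo of a cat toy", padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")
# encode: token ids -> one 768-dim embedding per position, shape (1, 77, 768);
# this tensor conditions the diffusion model through cross-attention
conditioning = text_encoder(tokens.input_ids).last_hidden_state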
Textual Inversion learns a new token embedding (v* in the diagram above). A prompt containing the token that will be mapped to this new embedding is combined with a noised version of a training image and fed to the generator model, which attempts to predict a less noisy version of the image. The embedding is optimized based on how well the model does at this task: an embedding that better captures the object or style shown in the training images provides more useful conditioning to the diffusion model, and therefore yields a lower denoising loss. After many steps (typically several thousand) with varied prompts and image variants, the learned embedding should capture the essence of the new concept.
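A hedged sketch of what is (and is not) optimized, using a toy embedding table; the denoising loss is replaced by a placeholder, since the full diffusion pipeline is out of scope here.
import torch
import torch.nn as nn

vocab_size, dim = 49408 + 1, 768        # toy sizes: a CLIP-like vocab plus one new token
embedding = nn.Embedding(vocab_size, dim)
embedding.weight.requires_grad_(False)  # the text encoder stays frozen

# the single new row for the placeholder token is the only trainable tensor
v_star = embedding.weight[-1].clone().requires_grad_(True)
optimizer = torch.optim.AdamW([v_star], lr=5e-4)

for step in range(3000):
    # in the real textual_inversion.py, v_star is spliced into the prompt
    # embeddings, run through the frozen diffusion model, and the denoising
    # loss is backpropagated into this one 768-dim vector
    loss = (v_star ** 2).mean()         # placeholder for the denoising loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()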
In addition to using your own trained concepts, the new Stable Diffusion public concept library also has community-created textual inversion training models that you can use. As time goes on and more examples are added, it will become a very useful resource.
2. Example: Running Locally.
The textual_inversion.py script (from the diffusers examples) shows how to implement the training process and make it compatible with Stable Diffusion.
Before running the script, make sure to install the training dependencies for this library.
pip install diffusers[training] accelerate transformers
Then use “accelerate config” to initialize an Accelerate environment.
3. Example: a cat toy.
Before downloading or using the weights, you need to accept the model license. This example uses Stable Diffusion v1-5, so you need to visit its model card, read the license, and tick the box to agree to the terms.
You must be a registered user on the Hugging Face Hub and have an access token for the code to work. Run the following command to log in with your token:
huggingface-cli login
Download three to four images as training data, and then train using the following code:
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export DATA_DIR="path-to-dir-containing-images"
accelerate launch textual_inversion.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--train_data_dir=$DATA_DIR \
--learnable_property="object" \
--placeholder_token="<cat-toy>" --initializer_token="toy" \
--resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--max_train_steps=3000 \
--learning_rate=5.0e-04 --scale_lr \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--output_dir="textual_inversion_cat"
Running a complete training session on a V100 GPU usually takes about an hour.
Napoleon images generated by a community user with Textual Inversion.
Inference: Once you have trained a model with the command above, inference with StableDiffusionPipeline is straightforward. Make sure to include the placeholder_token (here, <cat-toy>) in your prompt.
import torch
from diffusers import StableDiffusionPipeline

model_id = "path-to-your-trained-model"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "A <cat-toy> backpack"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("cat-backpack.png")
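Alternatively, if you only want to share the learned embedding (a few kilobytes) rather than the whole pipeline folder, recent diffusers releases can load it into a stock model. This sketch assumes a version that provides load_textual_inversion:
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
# reads the learned embedding saved by the training script's output directory
pipe.load_textual_inversion("textual_inversion_cat", weight_name="learned_embeds.bin")
image = pipe("A <cat-toy> backpack", num_inference_steps=50).images[0]
image.save("cat-backpack.png")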
LoRA
The full name of LoRA is Low-Rank Adaptation, a technique originally developed for efficiently adapting large language models.
LoRA freezes the original weights and reduces the number of trainable parameters by learning pairs of rank-decomposition matrices. This greatly reduces the storage needed to adapt a large model to specific tasks and enables efficient task switching at deployment without adding inference latency. LoRA also performs on par with or better than other adaptation methods such as adapters, prefix tuning, and full fine-tuning.
Witch images generated with LoRA by a community user.
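Before the quick start below, the core trick fits in a few lines. This is a minimal sketch (not loralib's actual implementation) of a linear layer with a frozen weight W plus a trainable low-rank update B·A; for a 768×768 layer at rank 16, that is about 24K trainable parameters instead of roughly 590K.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=16, alpha=16):
        super().__init__()
        self.frozen = nn.Linear(in_features, out_features)
        for p in self.frozen.parameters():
            p.requires_grad_(False)                 # original weights stay fixed
        # rank-decomposition matrices: delta_W = B @ A
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))  # zero init: a no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.frozen(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(768, 768, r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 768 * 16 = 24576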
Quick Start:
1. Installing loralib is very simple:
pip install loralib
# Alternatively
# pip install git+https://github.com/microsoft/LoRA
2. You can adapt a model by replacing some of its layers with the corresponding layers implemented in loralib. Currently only nn.Linear, nn.Embedding, and nn.Conv2d are supported. For cases where a single nn.Linear represents more than one layer, such as some implementations of the attention qkv projection, MergedLinear is also supported.
# ===== Before =====
# layer = nn.Linear(in_features, out_features)
# ===== After =====
import loralib as lora
# Add a pair of low-rank adaptation matrices with rank r=16
layer = lora.Linear(in_features, out_features, r=16)
3. Before training begins, mark only the LoRA parameters as trainable.
import loralib as lora
model = BigModel()
# This sets requires_grad to False for all parameters without the string “lora_” in their names
lora.mark_only_lora_as_trainable(model)
# Training loop
for batch in dataloader:
…
4. When saving a checkpoint, generate a state_dict that contains only the LoRA parameters.
# ===== Before =====
# torch.save(model.state_dict(), checkpoint_path)
# ===== After =====
torch.save(lora.lora_state_dict(model), checkpoint_path)
5. When using load_state_dict to load a checkpoint, make sure to set strict=False.
# Load the pretrained checkpoint first
model.load_state_dict(torch.load('ckpt_pretrained.pt'), strict=False)
# Then load the LoRA checkpoint
model.load_state_dict(torch.load('ckpt_lora.pt'), strict=False)
Training can then proceed as usual.
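One convenient loralib detail, assuming the layers were built with the default merge_weights=True: switching the model's mode merges or un-merges the low-rank update into the frozen weight, so inference pays no extra latency.
model.eval()    # folds W + B@A into a single weight matrix before inference
model.train()   # splits them again so LoRA training can continue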
Hypernetwork
1. What is a Hypernetwork?
The hypernetwork was originally developed by NovelAI as a fine-tuning technique. It is a small neural network attached to a Stable Diffusion model to modify its style, and it attaches at the most critical part of the model: the cross-attention module of the noise-predictor UNet.
The hypernetwork is usually a simple neural network: a fully connected linear network with dropout and activation, just like what you learn in an introductory course on neural networks. It hijacks the cross-attention module by inserting two networks that transform the key and value vectors. Compare the original model architecture and the hijacked architecture below:
During training, the Stable Diffusion model itself is frozen, but the attached hypernetwork is allowed to change. Because the hypernetwork is small, training is fast and requires limited resources; it can be done on an ordinary computer.
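A hedged sketch of that structure (the real AUTOMATIC1111 modules differ in detail; the residual form and sizes here are illustrative): two small fully connected networks are the only trainable parts, and each transforms the text conditioning before it reaches the cross-attention projections.
import torch
import torch.nn as nn

def hypernet_block(dim=768, mult=2):
    # a small linear network with activation and dropout, as described above
    return nn.Sequential(
        nn.Linear(dim, dim * mult), nn.ReLU(), nn.Dropout(0.1),
        nn.Linear(dim * mult, dim))

hyper_k, hyper_v = hypernet_block(), hypernet_block()

context = torch.randn(1, 77, 768)        # frozen text-encoder output
# the hijacked cross-attention feeds transformed tensors to its key/value
# projections; the residual form keeps the network near-identity at the start
key_input = context + hyper_k(context)
value_input = context + hyper_v(context)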
Fast training and relatively small files are the main attractions of Hypernetworks.
Note that "hypernetwork" here means something different from the hypernetworks of the machine-learning literature, where the term (introduced around 2016) refers to a network that generates the weights of another network.
A hypernetwork file is usually under 200MB, and it cannot work alone: it needs a checkpoint model to generate images.
Hypernetworks are similar to LoRA in that both are small and modify only the cross-attention module. The difference is how: LoRA changes that module's weights, while a hypernetwork inserts additional networks into it. Also, LoRA is a data-storage format and does not define the training process, whereas a hypernetwork does.
2. How to use Hypernetwork.
Here, we introduce the method of using Hypernetwork in the AUTOMATIC1111 Stable Diffusion GUI. You can use this GUI on Windows, Mac, or Google Colab.
1) Installing a Hypernetwork model: To install a Hypernetwork model on the AUTOMATIC1111 webui, place the model file in the following folder:
stable-diffusion-webui/models/hypernetworks
2) Use a Hypernetwork model: To use a hypernetwork, include the following phrase in your prompt:
<hypernet:filename:multiplier>
"filename" is the hypernetwork's file name, without the extension (.pt, .bin, etc.).
The multiplier is a weight applied to the Hypernetwork model, with a default value of 1. Setting it to 0 will disable the model.
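For example, a prompt such as the following (where watercolor-style is a hypothetical watercolor-style.pt file in the hypernetworks folder) applies that hypernetwork at 80% strength:
a portrait of an old fisherman, detailed face <hypernet:watercolor-style:0.8>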
How do you determine the correct file name? Instead of typing the phrase by hand, click the model button under the "Generate" button.
Then click the Hypernetworks tab to see a list of the installed hypernetworks. Click the one you want to use, and the hypernet phrase will be inserted into the prompt.
Note that the hypernet phrase is not treated as part of the prompt; it only indicates which hypernetwork to use. It is removed once the hypernetwork is applied, so you cannot use prompt syntax such as [keyword1:keyword2:0.5] on it.
3) Test the model and generate art: To improve the odds of getting the intended artistic style, start by using the hypernetwork together with the model it was trained on. But don't stop there: some hypernetworks require specific prompts or only suit certain subjects, so check the prompt examples on the model's page to see what works best.
One suggestion: if your images look over-saturated, the fix may be as simple as lowering the multiplier. Stable Diffusion sometimes pushes color saturation to achieve its goal, and reducing the multiplier restores the balance.
Images generated by Hypernetwork
Which one should you use?
A developer compared the four methods in depth, discussing their differences, advantages, and disadvantages in order to determine which one to use.
The following is the complete transcription:
Between LoRA, Dreambooth, Textual Inversion, and Hypernetworks, which one should you use? To answer this question, I read all the papers and understood people’s likes and dislikes about these models. I made a table and a pretty chart, and then I answered this question.
For each method, we need to answer two questions: what is it and how does it work? And given its advantages and disadvantages, what trade-offs does it make?
The four methods work quite similarly, but let's start with Dreambooth, since it is perhaps the most direct: it actually changes the structure of the model itself. In Dreambooth you have two inputs. The first is the concept you want to train; here we'll use a photo of a Corgi, though in practice you might use five or more images. The second is a sentence containing a unique identifier, in this case "SKS".
The whole idea behind Dreambooth is to teach the model to associate the unique identifier SKS with the concept of a Corgi. The sentence is first converted into text embeddings, where each word is represented by a vector (a sequence of floating-point numbers) carrying that word's unique semantic information.
We won’t delve deep into embeddings here, but in essence, some vectors contain information related to art, some are about photos, and others are very random.
We take the text embedding and apply noise to the sample image at two adjacent levels: for example, 10 steps of noise to one copy and 9 steps to another. We want the model, given the 10-step copy, to output the 9-step copy; this is how Stable Diffusion denoises images back toward their original appearance.
At first, the model doesn't know what image you're showing it, so it may do poorly and produce a very different result. You compare its output against the 9-step-noise image it should have produced and define a loss, then update the model through gradient updates: a high loss penalizes the model, a low loss rewards it. After many iterations of this process, the model learns how to handle these prompts.
In this way you end up with a model where you input the prompt with SKS plus a noisy image, and it transforms it into a clean image of the Corgi, which is the final result.
This is how Dreambooth works. It’s a bit complicated to explain, but I’ll try to keep it simple so that it’s easier to understand in comparison to other techniques. In Dreambooth, you’re essentially creating a completely new model by modifying the internal structure of the initial model until it understands the concept you’re aiming for. As a result, this may be the most effective training method for specific concepts in Stable Diffusion.
However, its storage efficiency is not high because every time Dreambooth is used, a brand new model is created. For example, training a Corgi model may produce 2GB of data and then training a cat model will require another 2GB of space. Sharing such large amounts of data is inconvenient. It is possible to train multiple concepts with the same model, but sometimes this can be confusing. Nonetheless, Dreambooth remains the most effective method.
Next, let's talk about Textual Inversion. At first glance the setup is almost the same: we still have SKS, we still have the Corgi, and we're still trying to produce the Corgi as the final output. We still denoise and compare, but the difference is that when the result is wrong, Textual Inversion doesn't update the model's weights through gradient penalties. Instead, it updates a vector until you get the desired output.
Interestingly, the diffusion model is already incredibly complex and understands thousands of concepts, making it very intelligent. For Textual Inversion, we simply craft one very particular vector that tells this model about the concept of Corgis, and it turns out the results from Textual Inversion are very good.
The advantage of Textual Inversion is that you don’t have to create a new model from scratch. It’s just a tiny 12KB embedding that you can upload online, and anyone can download it and apply it to their own model, getting the same outcome.
Next up is LoRA, which stands for Low-rank Adaptation. However, in order to understand how it works, we need to have an understanding of the internal workings of the diffusion model itself and how it operates.
A neural network is a series of consecutive layers; in this diagram there are three, but in real applications there are usually hundreds. The input, usually a large numerical matrix, is passed to the first layer, which performs some computation on it and produces another matrix. That new matrix is passed to the next layer, which produces yet another transformed matrix, and so on, until the final layer produces the output.
The idea is that as data passes through these layers, the network extracts more and more of the structure of the input, until it fully understands what the input is and gives you the desired result. This is the basic way neural networks operate.
So where does LoRA fit into this process? It's really trying to solve Dreambooth's problem: training a model to understand a concept creates a complete new model each time, which consumes enormous storage. LoRA aims to let the model learn the concept without making a full copy of it.
As Stable Diffusion is not a super large model, Dreambooth is still acceptable and can be used. However, LoRA was originally designed for large language models, which typically have billions of parameters, so making a copy every time it is trained is not practical.
What LoRA does is insert new layers into the model. Initially the model looked like the one at the bottom of the picture; now it has two extra layers. The first layer's output is no longer passed directly to the second layer: it goes through the LoRA layer first, which produces a second output that is then passed to the second layer. These layers are very small and, when LoRA training starts, they essentially don't affect the model at all.
As training progresses, these intermediate layers are updated, and gradually they develop their own perspectives. With enough training, they can often achieve results similar to Dreambooth. So while their approach is somewhat similar, they simply update existing weights and incorporate new ones until the same effect is achieved.
The training process of LoRA is similar to that of Dreambooth, but in comparison, LoRA trains much faster and uses very little memory. The LoRA model is also very small, and you can add it to different models. Usually, its size is around 150MB.
Finally, there's Hypernetwork. It's essentially similar to LoRA. There's no official paper on it yet, but by reading the AUTOMATIC1111 code base I worked out how it operates.
This method does not directly update and optimize the intermediate layers. Instead, a hypernetwork outputs them: just as a diffusion model outputs a numerical matrix that is decoded into an image, the hypernetwork outputs several numerical matrices, which are then used inside the diffusion model as the intermediate layers.
The idea is exactly the same as LoRA: insert intermediate layers, keep updating and improving them, and eventually get the result you want. The only difference is that you don't update the layers themselves; you update a network that learns how to create those layers, and keep updating that network until you achieve the desired result.
Although there is plenty of related research on hypernetworks, my intuition (not necessarily correct) is that this is simply a worse version of LoRA, because LoRA's clever math makes it easy to optimize and train, whereas training indirectly through another network is likely less efficient and gives worse results. Nevertheless, it shares LoRA's advantage of taking only about 150MB of space.
After the qualitative analysis, let's look at the quantitative one. This table contains the key facts I researched for each training technique, such as how much VRAM it needs and how long it takes. Surprisingly, they use almost the same amount of VRAM for training, but the training times differ significantly, and the output sizes vary: Textual Inversion takes up the least storage space.
The next table contains a large amount of data pulled from Civitai about people's preferences for the different model types. Analyzing it, I found that Dreambooth is the most popular, with the most downloads, ratings, and likes. That doesn't necessarily mean Dreambooth is the best method, but it does mean many people use it, which means more related resources are available.
If I were to teach a model a concept, I would also use Dreambooth, because it has better beginner tutorials, you spend less time digging through forums, and the results seem better, given how many people use it.
In terms of ratings, Dreambooth and Textual Inversion score the same, though in conversations with people Dreambooth seems to have a slight edge; according to the Civitai data, people like both very much. The remaining two score much lower, which is clearly bad news for Hypernetwork: combined with its relatively low download count, it should perhaps be avoided unless you have no other choice.
The statistical results are not very favorable for LoRA, but it is relatively new and there are only 11 LoRA models included in the statistics. Therefore, these data may not completely represent the potential of LoRA.
Overall, using Dreambooth directly is perhaps the best choice: it's widely used and the feedback is very positive. Just keep two things in mind. First, the models are large, so if storage space is a concern, Textual Inversion may be the better choice. And of course LoRA is also good, since it has the shortest training time.