Stable Diffusion

Updated for version: SD3 Medium 

Accessible via: https://huggingface.co/stabilityai/stable-diffusion-3-medium 

Technical paper: https://arxiv.org/pdf/2403.03206 

Ratings

Accuracy / Quality ★★★★☆ 

Flexibility / Features ★★★☆☆ 

Data security / Privacy ★★★★☆ 

Open source model

Pros/cons

Pros 

  • Open source and locally installed, so your data remains private. 
  • Highly customizable. 
  • Workflow adaptable to similar models. 
  • Choice between accuracy and speed. 

Cons 

  • Requires some technical experience and programming. 
  • Ease of installation and operation dependent on hardware. 
  • GPU with sufficient VRAM required (see ‘Prerequisites’). 
  • Training data may include copyrighted images. 

The following guide is written for users of WUR devices. If you use a personal device, the guide remains applicable and will require fewer administrative rights. 

Prerequisites

Before this model can be used the following prerequisites have to be met: 

  1. The device used should have access to a GPU (from either NVIDIA or AMD) with at least 4 GB of VRAM, though 8 GB is recommended.  
    To find out how much VRAM you have available, open the ‘dxdiag’ program on your device, navigate to the ‘Display’ tab and look for the value next to ‘Display Memory (VRAM)’. For this guide this should be at least 4000 MB (8000 MB recommended). A Python-based check is sketched after this list. 
  2. Have Python 3.7 or a more recent version installed on your device. 
    If you don’t have Python installed yet, you can install it via the WUR Software Center or via the Python website.  
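If you prefer to check your GPU and VRAM from Python rather than via dxdiag, the sketch below reports both. It assumes an NVIDIA GPU and a CUDA-enabled PyTorch installation (PyTorch is not itself a prerequisite for this guide; AMD users would need a ROCm build of PyTorch).

    # Minimal sketch: report the detected GPU and its VRAM from Python.
    # Assumes an NVIDIA GPU and a CUDA-enabled PyTorch installation.
    import torch

    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        vram_gb = props.total_memory / 1024**3
        print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
    else:
        print("No CUDA-capable GPU detected; check dxdiag instead.")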

Description

Besides text generation, the best-known application of GenAI is image generation. Generating images via online tools often subjects you to strict restrictions on format or frequency of use. It is also harder to document the use of such online tools, as their availability is not guaranteed over time, which can call the reproducibility of your work into question. An alternative is therefore to use a locally installed image generation model. The best known of these are the Stable Diffusion models developed by Stability AI.  

For this guide we will use Stable Diffusion 3 as an example. However, alternative versions of this model are available and are listed at the end of this guide. These alternatives may require different workflows and supporting programs, which can often be downloaded from the model repositories on the HuggingFace website.  
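For reference, the model can also be driven from a Python script using the Hugging Face diffusers library. The sketch below only illustrates that alternative route and is not part of the ComfyUI workflow used in the rest of this guide; it assumes the diffusers and torch packages are installed, that you are logged in to HuggingFace with the model licence accepted (see the installation guide), and that the diffusers-formatted repository name shown is still current.

    # Minimal sketch: generating an image with SD3 Medium via the diffusers library.
    # Assumes `pip install diffusers transformers torch`, a login via
    # `huggingface-cli login`, and an accepted SD3 Medium licence (gated repository).
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        "stabilityai/stable-diffusion-3-medium-diffusers",  # diffusers-formatted weights
        torch_dtype=torch.float16,
    ).to("cuda")  # requires a GPU with sufficient VRAM

    image = pipe(
        prompt="a watercolour painting of a greenhouse at sunrise",
        negative_prompt="blurry, low quality",
        num_inference_steps=28,   # 'Steps' in the ComfyUI workflow
        guidance_scale=7.0,       # 'CFG' in the ComfyUI workflow
    ).images[0]
    image.save("sd3_example.png")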

Installation guide

  1. Go to the Stable Diffusion 3 Medium (SD3 Medium) page on HuggingFace.
  2. Log in with your HuggingFace account (or create one if needed). Once logged in, scroll down and click the ‘Accept’ button to agree to the licence terms. Use of this model is allowed for research and educational purposes and for limited commercial use. 
  3. Go to the ‘Files and Versions’ tab on the SD3 Medium page and scroll down to the file ‘sd3_medium.safetensors’. Download this file, as well as the following files from the ‘text_encoders’ folder: 
    • clip_l.safetensors 
    • clip_g.safetensors 
    • t5xxl_fp8_e4m3fn.safetensors 

You may also want to download the example workflow to make your initial use of the model easier. It is found in the ‘comfy_example_workflows’ folder. The ‘sd3_medium_example_workflow_basic.json’ file will be used for the remainder of this guide. 
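Should you prefer to script the downloads from step 3 rather than click through the website, the sketch below uses the huggingface_hub package. It assumes the package is installed, that you are logged in with an access token, and that the licence has been accepted (the repository is gated); the download folder name is only an example.

    # Minimal sketch: downloading the SD3 Medium files with huggingface_hub.
    # Assumes `pip install huggingface_hub` and a prior `huggingface-cli login`
    # with the SD3 Medium licence accepted on the website.
    from huggingface_hub import hf_hub_download

    repo = "stabilityai/stable-diffusion-3-medium"
    files = [
        "sd3_medium.safetensors",
        "text_encoders/clip_l.safetensors",
        "text_encoders/clip_g.safetensors",
        "text_encoders/t5xxl_fp8_e4m3fn.safetensors",
        "comfy_example_workflows/sd3_medium_example_workflow_basic.json",
    ]
    for filename in files:
        # File paths follow the layout shown on the 'Files and Versions' tab.
        path = hf_hub_download(repo_id=repo, filename=filename, local_dir="sd3_downloads")
        print("Downloaded:", path)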

  4. Install ComfyUI from the ComfyUI repository on GitHub.
  5. Move the model downloaded in step 3 to the folder ‘ComfyUI_windows_portable\ComfyUI\models\checkpoints’. Also move the downloaded text encoders to the folder ‘C:\ComfyUI_windows_portable\ComfyUI\models\clip’ (a scripted alternative is sketched after this list).
  6. Launch ComfyUI. For NVIDIA devices, for example, this can be done by opening ‘run_nvidia_gpu.bat’. 
  7. In the bottom right of ComfyUI, click ‘Load’ and open the example workflow downloaded in step 3. 
  8. Once the workflow is open, go to the top-left block called ‘Load Models’ and select the appropriate checkpoint (the downloaded model) and clips using the arrows. 
  9. In the block below, called ‘Input’, enter the positive prompt (what you do want to see in the image) and the negative prompt (what you don’t want to see in the image). This block also contains settings for the width and height (in pixels; these should be multiples of 64) and the ‘batch_size’, which defines how many images the model should generate (starting from the ‘Seed’ number indicated at the top and adding 1 to the seed for each subsequent image needed to satisfy the ‘batch_size’ setting).
  10. Once all these settings are in place, click the ‘Queue Prompt’ button. The image will then appear on the far-right side of the workflow in the ‘Output’ window. You can track progress throughout the workflow: the active section is highlighted, and the generation process is indicated with a loading bar in the ‘KSampler’ box next to the Output.
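As a scripted alternative to moving the files by hand in step 5, the sketch below copies them into the ComfyUI model folders named above. Both folder paths are examples only and should be adjusted to your own download location and ComfyUI installation.

    # Minimal sketch: placing the downloaded files into the ComfyUI model folders.
    # The paths below are examples; adjust them to your own setup.
    import shutil
    from pathlib import Path

    downloads = Path("sd3_downloads")                     # example download folder
    comfy = Path(r"C:\ComfyUI_windows_portable\ComfyUI")  # adjust to your installation

    # The main model goes into models/checkpoints, the text encoders into models/clip.
    shutil.copy2(downloads / "sd3_medium.safetensors", comfy / "models" / "checkpoints")
    for encoder in ("clip_l.safetensors", "clip_g.safetensors", "t5xxl_fp8_e4m3fn.safetensors"):
        shutil.copy2(downloads / "text_encoders" / encoder, comfy / "models" / "clip")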

If the resulting image is one large blur or an empty image, the prompt contained keywords that have been restricted by the developers, and the image has therefore been censored.

Further details on model parameters

In the Stable Diffusion 3 Medium basic workflow there are several other parameters that may be modified to change the model behaviour.

Conditioning Timestep start/end

These parameters can range from 0.0 to 1.0 and indicate when the model should start and stop incorporating the information from either the positive or negative prompt. In the default settings of the basic workflow, the positive prompt is incorporated throughout the generation of the image, whereas the negative prompt is only incorporated once the first 10% of the generation steps have been completed.

Model sampling shift

This parameter (3.0 by default) affects the overall structure of the image. A higher value places greater emphasis on the larger structures, whereas a smaller value places greater emphasis on details. For ‘upscaling’ (more advanced) a smaller value is recommended. The parameter may be set to at most 100, but values above 10 are known to result in significantly worse images.

Steps

Stable Diffusion creates images through a diffusion process that gradually removes noise from an initially noisy image. The number of steps indicates how many times the model should perform this denoising. A lower number of steps results in faster image generation but blurrier images; a higher number of steps can increase detail, but also takes more time. When you are just starting out with an image concept, a smaller number of steps (10-15) can help you find the right type of image. Once you are happy with the concept, you can increase this to greater values.

CFG

How much do you want the model to adhere to the prompt, and how much creative freedom do you want to give it? The lower the value, the more creative the model is allowed to become, whereas a higher value makes the model adhere more strictly to the prompt. Stricter adherence to the prompt is not always a good thing, as it can affect the quality and coherence of the image. A value of 2-6 is usable for more creative images, 7-10 for more balance, and 10-15 for more specific prompts. Above that, the quality degrades significantly.
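To see how Steps and CFG trade off against each other outside ComfyUI, the sketch below runs a small parameter sweep with the diffusers pipeline shown in the Description section. The prompt, step counts, CFG values and file names are placeholders chosen for illustration, not recommended settings.

    # Minimal sketch: sweeping Steps (num_inference_steps) and CFG (guidance_scale)
    # with the diffusers pipeline from the Description section.
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
    ).to("cuda")

    prompt = "a detailed botanical illustration of a sunflower"
    for steps in (10, 28):               # low for quick drafts, higher for detail
        for cfg in (4.0, 7.0, 12.0):     # creative, balanced, strict prompt adherence
            image = pipe(prompt=prompt, num_inference_steps=steps, guidance_scale=cfg).images[0]
            image.save(f"sunflower_steps{steps}_cfg{cfg}.png")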

Denoise

How closely should the model stick to the initial image? A value of 1.0 means the model has full creative freedom to change the initial image, whereas a value of 0.0 means the original image remains the same. As this basic workflow has no initial image, the model should have full creative freedom; when using more advanced workflows where an image is uploaded for modification, this setting should be changed accordingly.
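In scripted image-to-image workflows the same idea usually appears as a ‘strength’ argument. The sketch below assumes the diffusers library provides an SD3 image-to-image pipeline (StableDiffusion3Img2ImgPipeline) and that an input image exists; both are illustrative assumptions rather than part of the basic ComfyUI workflow.

    # Minimal sketch: the Denoise setting corresponds to `strength` in an
    # image-to-image pipeline. The pipeline class, input file and prompt are
    # illustrative assumptions.
    import torch
    from diffusers import StableDiffusion3Img2ImgPipeline
    from diffusers.utils import load_image

    pipe = StableDiffusion3Img2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
    ).to("cuda")

    init_image = load_image("input.png")   # placeholder input image
    image = pipe(
        prompt="the same scene, but in autumn colours",
        image=init_image,
        strength=0.6,   # like 'Denoise': 1.0 = full freedom to change, 0.0 = keep the input
    ).images[0]
    image.save("modified.png")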

Sampler name and scheduler

These settings define which sampling algorithm is used to create the image (sampler) and how the denoising steps are scheduled over time (scheduler). For the purposes of this guide we recommend keeping these on the default settings (‘dpmpp_2m’ and ‘sgm_uniform’), but should you wish to experiment with alternative methods of generation, these are settings that could be changed.

Alternative model versions

This guide used Stable Diffusion 3 Medium as an example, but other models are available. These may require different workflows, hardware specifications, or understanding of the methodology.