Stable Diffusion | Zoy's Blog Home

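What's it good for?

Stable Diffusion is a latent diffusion model. Put plainly, it doesn't grind away on the full-resolution pixel image. It works in a much smaller latent space that a VAE maps to and from images: noise is removed step by step in that space, and only at the end is the result decoded into a picture.

A simplified pipeline looks roughly like this: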

prompt -> text encoder -> latent noise -> denoise loop -> VAE decoder -> image

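The main players:

The text encoder turns the prompt into embedding vectors so the model knows what you're asking for.

The U-Net (or whichever denoising network the model uses) does the retouching in latent space, nudging random noise toward the target image step by step.

The scheduler decides how each step is taken, which affects speed, detail, and stability.

The VAE converts between latent space and actual images.

Figure: Stable Diffusion latent-space pipeline

The payoff of this architecture is clear: generation is much cheaper than operating directly in high-resolution pixel space, and it's easier to plug in control signals such as prompts, masks, reference images, edge maps, and pose maps.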
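
To put a rough number on that saving, the sketch below compares how many values a pixel image holds with how many its latent holds. It assumes the 8x spatial downsampling and 4 latent channels used by the Stable Diffusion 1.x/2.x and SDXL VAEs; other model families may differ, and `latent_shape` and `compression_ratio` are helper names made up for this illustration.

```python
def latent_shape(width, height, downscale=8, channels=4):
    """Return (channels, latent_height, latent_width) for a pixel-space size."""
    if width % downscale or height % downscale:
        raise ValueError("width and height should be multiples of the downscale factor")
    return (channels, height // downscale, width // downscale)


def compression_ratio(width, height, downscale=8, channels=4):
    """How many times fewer values the latent holds than the RGB image."""
    c, h, w = latent_shape(width, height, downscale, channels)
    return (3 * width * height) / (c * h * w)


print(latent_shape(1024, 1024))       # (4, 128, 128)
print(compression_ratio(1024, 1024))  # 48.0
```

Under those assumptions, every denoising step touches about 48 times fewer values than it would in pixel space.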

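Environment and a minimal text-to-image run

Install the usual dependencies:

pip install diffusers transformers accelerate safetensors torch pillow

If you have a CUDA device, torch.float16 cuts VRAM use noticeably, since each weight takes 2 bytes instead of 4. On CPU, everything runs much slower locally, so start by testing with small sizes and few steps.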
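
As a back-of-the-envelope check on the float16 point, weight memory scales with bytes per parameter. The sketch counts weights only (activations and framework overhead are ignored), and the 2.6 billion parameter count is an assumed, illustrative figure rather than a measurement of any particular checkpoint.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(num_params, dtype):
    """Approximate memory taken by the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


params = 2_600_000_000  # assumed parameter count, for illustration only
for dtype in ("float32", "float16"):
    print(f"{dtype}: {weight_memory_gib(params, dtype):.1f} GiB")
```

The minimal script that follows picks float16 only when CUDA is available and falls back to float32 on CPU.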

import torch
from diffusers import AutoPipelineForText2Image

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

pipe = AutoPipelineForText2Image.from_pretrained(
    model_id,
    torch_dtype=dtype,
    use_safetensors=True,
)
pipe = pipe.to(device)

prompt = (
    "a cozy cyberpunk reading room, warm desk lamp, "
    "glass window with city neon, cinematic composition"
)

image = pipe(
    prompt=prompt,
    num_inference_steps=28,
    guidance_scale=7.0,
    width=1024,
    height=1024,
).images[0]

image.save("sd_room.png")

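AutoPipelineForText2Image picks the appropriate pipeline class for the model type. A higher num_inference_steps usually gives more stable detail but costs more time; a higher guidance_scale makes the model follow the prompt more closely, but pushed too far it can leave the image stiff, overexposed, or stylistically over-tight.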
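
What `guidance_scale` does can be made concrete with the classifier-free guidance formula, which, as far as I know, is what the standard Stable Diffusion pipelines apply at each denoising step: the unconditional and prompt-conditioned noise predictions are blended, and the scale controls how far the result is pushed toward the prompt. The sketch uses plain lists in place of tensors, and `apply_guidance` is a name made up for this illustration.

```python
def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Blend two noise predictions: uncond + scale * (cond - uncond)."""
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(noise_uncond, noise_cond)
    ]


uncond = [0.0, 1.0, -1.0]  # toy prediction without the prompt
cond = [1.0, 1.0, 0.0]     # toy prediction with the prompt

print(apply_guidance(uncond, cond, 1.0))  # [1.0, 1.0, 0.0], same as cond
print(apply_guidance(uncond, cond, 7.0))  # [7.0, 1.0, 6.0]
```

Where the two predictions agree, the scale changes nothing; where they differ, the gap is amplified, which is one way to see why very high values can produce harsh, overcooked images.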

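A prompt is more than a scene description

Many beginners write prompts like "a girl, a cat, a room". That runs, but it gives you little control. A better approach is to split the image into layers:

Subject: the core object of the image
Scene: the environment the subject sits in
Style: photography, illustration, cinematic, product render
Lighting: soft light, backlight, neon, natural light
Composition: close-up, wide angle, top-down, symmetrical
Details: material, texture, color, mood

For example: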

prompt = (
    "a handcrafted mechanical keyboard on a dark walnut desk, "
    "macro product photography, warm side light, shallow depth of field, "
    "black and amber color palette, crisp texture, premium commercial style"
)

A negative prompt tells the model "none of this, please". It's no magic cleaner, but it does reduce common defects.

negative_prompt = (
    "low quality, blurry, distorted, extra fingers, broken geometry, "
    "watermark, messy text, overexposed"
)

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=30,
    guidance_scale=7.5,
    width=1024,
    height=1024,
).images[0]

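A tip: don't pile on adjectives from the start. Get the subject and composition stable first, then add style and details one step at a time. Otherwise it's hard to tell which phrase pushed the image off course.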
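
One way to keep that discipline is to build the prompt from named layers, so each run adds exactly one thing. This is only a sketch: `build_prompt` and `LAYER_ORDER` are hypothetical helpers, and the fixed layer order is my own choice, not something the model requires.

```python
LAYER_ORDER = ("subject", "scene", "style", "lighting", "composition", "details")


def build_prompt(**layers):
    """Join the non-empty layers in a fixed order, comma-separated."""
    unknown = set(layers) - set(LAYER_ORDER)
    if unknown:
        raise ValueError(f"unknown layers: {sorted(unknown)}")
    parts = [
        layers[name].strip()
        for name in LAYER_ORDER
        if layers.get(name, "").strip()
    ]
    return ", ".join(parts)


# Round 1: lock down subject and composition first.
base = build_prompt(
    subject="a handcrafted mechanical keyboard on a dark walnut desk",
    composition="macro close-up",
)

# Round 2: add a single layer, so any change in the image is attributable to it.
with_light = build_prompt(
    subject="a handcrafted mechanical keyboard on a dark walnut desk",
    composition="macro close-up",
    lighting="warm side light",
)

print(base)
print(with_light)
```

Paired with a fixed seed, comparing the outputs of `base` and `with_light` shows what the lighting phrase alone contributes.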

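Tuning parameters without twiddling blindly

Stable Diffusion has many parameters, but only a few kinds come up all the time.

Figure: Stable Diffusion parameter tuning

num_inference_steps sets the number of denoising steps. Too few and the image tends to come out blurry; too many and the returns diminish.

guidance_scale sets how strongly the prompt constrains the result. Too low and the model improvises; too high and the image turns harsh.

seed sets the random starting point. With a fixed seed, changes to the prompt or parameters are much easier to compare.

scheduler sets the sampling strategy. Swapping the scheduler often changes texture, speed, and how much detail survives.

Example: generating reproducible images from fixed seeds.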

def generate_with_seed(pipe, prompt, seed, output_path):
    generator = torch.Generator(device=pipe.device).manual_seed(seed)
    image = pipe(
        prompt=prompt,
        negative_prompt="low quality, blurry, watermark",
        generator=generator,
        num_inference_steps=28,
        guidance_scale=7.0,
        width=1024,
        height=1024,
    ).images[0]
    image.save(output_path)
    return image


generate_with_seed(pipe, prompt, 3407, "seed_3407.png")
generate_with_seed(pipe, prompt, 9527, "seed_9527.png")

Switching the scheduler is just as direct:

from diffusers import DPMSolverMultistepScheduler

pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config
)

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=24,
    guidance_scale=7.0,
).images[0]

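Don't think of tuning as "finding the one value that is always best". The right configuration can differ across models, subjects, and sizes. In engineering practice it's more useful to fix a few presets, such as "quick draft", "detail boost", and "style exploration", and let callers pick one by scenario.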
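
A preset table can be as small as a dictionary of frozen dataclasses. The sketch below is one possible layout; the preset names follow the three scenarios mentioned above, and every number in it is a hypothetical starting point to be tuned per model, not a recommendation.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Preset:
    num_inference_steps: int
    guidance_scale: float
    width: int
    height: int


# Hypothetical values; tune them per model and subject.
PRESETS = {
    "quick_draft": Preset(num_inference_steps=16, guidance_scale=6.0, width=768, height=768),
    "detail_boost": Preset(num_inference_steps=40, guidance_scale=7.5, width=1024, height=1024),
    "style_explore": Preset(num_inference_steps=24, guidance_scale=5.0, width=1024, height=1024),
}


def pipeline_kwargs(preset_name):
    """Look up a preset and return keyword arguments for a pipeline call."""
    try:
        return asdict(PRESETS[preset_name])
    except KeyError:
        raise ValueError(f"unknown preset: {preset_name!r}") from None


print(pipeline_kwargs("quick_draft"))
```

Calling code would then do something like `pipe(prompt=prompt, **pipeline_kwargs("quick_draft"))`, where `pipe` is the pipeline loaded earlier. Callers only ever choose a name, so raw sampler settings never leak into the interface.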

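Batch generation: turning ideas into a grid

During visual exploration, clicking through images one at a time gets tiring. You can run prompts in a batch from Python and tile the results into a grid.

from PIL import Image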


def make_grid(images, columns=2, gap=16, bg=(12, 18, 32)):
    w, h = images[0].size
    rows = (len(images) + columns - 1) // columns
    grid = Image.new(
        "RGB",
        (
            columns * w + (columns + 1) * gap,
            rows * h + (rows + 1) * gap,
        ),
        bg,
    )
    for idx, img in enumerate(images):
        x = gap + (idx % columns) * (w + gap)
        y = gap + (idx // columns) * (h + gap)
        grid.paste(img.convert("RGB"), (x, y))
    return grid


styles = [
    "cinematic lighting",
    "isometric game art",
    "editorial illustration",
    "premium product render",
]

images = []
for style in styles:
    styled_prompt = f"{prompt}, {style}"
    img = pipe(
        prompt=styled_prompt,
        negative_prompt=negative_prompt,
        num_inference_steps=24,
        guidance_scale=7.0,
        width=768,
        height=768,
    ).images[0]
    images.append(img)

grid = make_grid(images, columns=2)
grid.save("style_grid.png")

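This routine suits exploring directions for covers, product shots, and illustrations. Finding a direction in bulk first and then fine-tuning the picks is easier than staring at one image and rewriting the prompt over and over.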
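
Once a direction looks promising, crossing the styles with a few fixed seeds keeps the runs comparable and the filenames traceable. The sketch only builds the job list and generates nothing; `build_jobs` and the filename scheme are assumptions made for this illustration.

```python
from itertools import product


def build_jobs(base_prompt, styles, seeds):
    """Return one job dict per (style, seed) pair."""
    jobs = []
    for style, seed in product(styles, seeds):
        slug = style.replace(" ", "_")
        jobs.append(
            {
                "prompt": f"{base_prompt}, {style}",
                "seed": seed,
                "filename": f"{slug}_{seed}.png",
            }
        )
    return jobs


jobs = build_jobs(
    "a cozy cyberpunk reading room",
    ["cinematic lighting", "isometric game art"],
    [3407, 9527],
)
for job in jobs:
    print(job["filename"], "<-", job["prompt"])
```

Each job could then be passed to the `generate_with_seed` helper from earlier, for example `generate_with_seed(pipe, job["prompt"], job["seed"], job["filename"])`.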

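img2img: keep the overall structure, change style and details

Text-to-image starts from pure noise; img2img starts from an existing image. It adds noise to that image and denoises from there, so part of the original structure survives while the details are redrawn according to the prompt.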

from PIL import Image
from diffusers import AutoPipelineForImage2Image

img2img_pipe = AutoPipelineForImage2Image.from_pretrained(
    model_id,
    torch_dtype=dtype,
    use_safetensors=True,
)
img2img_pipe = img2img_pipe.to(device)

init_image = Image.open("sketch.png").convert("RGB").resize((1024, 1024))

prompt = (
    "a polished sci-fi control panel, dark metal material, "
    "blue and amber light, clean industrial design"
)

image = img2img_pipe(
    prompt=prompt,
    image=init_image,
    strength=0.45,
    guidance_scale=7.0,
    num_inference_steps=28,
).images[0]

image.save("img2img_panel.png")

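The lower strength is, the more of the original image is kept; the higher it is, the closer the result gets to generating from scratch. Colorizing sketches, stylizing photos, and redrawing UI concept art all fit this approach.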
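
`strength` also changes how much work is done. To my understanding, the diffusers img2img pipelines skip the early part of the schedule and run roughly `num_inference_steps * strength` denoising steps; the exact rule may vary by version and pipeline, so treat the helper below (a made-up name) as an approximation rather than the library's definition.

```python
def effective_steps(num_inference_steps, strength):
    """Approximate number of denoising steps actually run for img2img."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


for strength in (0.2, 0.45, 0.8, 1.0):
    print(strength, effective_steps(28, strength))
```

With the settings used above (28 steps, strength 0.45) this gives 12 steps, so a low strength is both gentler on the original image and faster to run.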

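inpainting: change one region, leave the rest alone

inpainting uses a mask to mark the region to redraw, which makes it a good fit for repairs, object replacement, and extending local detail.

Figure: Stable Diffusion image editing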

from diffusers import AutoPipelineForInpainting

inpaint_pipe = AutoPipelineForInpainting.from_pretrained(
    model_id,
    torch_dtype=dtype,
    use_safetensors=True,
)
inpaint_pipe = inpaint_pipe.to(device)

base_image = Image.open("room.png").convert("RGB").resize((1024, 1024))
# White mask pixels are repainted; black pixels are kept as they are.
mask_image = Image.open("mask.png").convert("RGB").resize((1024, 1024))

prompt = (
    "a modern amber floor lamp, soft warm glow, "
    "fits naturally with the room, realistic material"
)

image = inpaint_pipe(
    prompt=prompt,
    image=base_image,
    mask_image=mask_image,
    guidance_scale=7.0,
    num_inference_steps=30,
).images[0]

image.save("room_inpaint.png")

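Mask quality matters a lot. Edges that are too hard give a pasted-on look, while edges that are too soft can bleed into regions that shouldn't change. In a product, the front end can offer a brush tool, or a segmentation model can produce a semi-automatic mask first.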
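
Softening a hard edge is usually called feathering. The sketch below shows the idea with a plain box blur over a nested-list mask (0 = keep, 255 = repaint), purely to make the effect visible in numbers; `feather_mask` is a made-up helper, and on real images you would more likely blur the mask with something like Pillow's `ImageFilter.GaussianBlur`.

```python
def feather_mask(mask, radius=1):
    """Box-blur a 2D mask (rows of 0..255 ints) to soften its edges."""
    height, width = len(mask), len(mask[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Average over the neighborhood, clipped at the borders.
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            window = [mask[j][i] for j in ys for i in xs]
            row.append(round(sum(window) / len(window)))
        out.append(row)
    return out


hard = [[0, 0, 255, 255, 255]]
print(feather_mask(hard, radius=1))  # [[0, 85, 170, 255, 255]]
```

A larger radius gives a wider transition band, which is exactly the trade-off described above: smoother blending, but more of the surrounding area becomes eligible for change.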

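Before turning it into a service, settle the boundaries

Stable Diffusion lends itself to being wrapped as a service, but don't drop demo code into a production endpoint as-is. At a minimum, handle the following:

Load the model only once
Limit input sizes
Keep the concurrency queue under control
Whitelist generation parameters
Have a cleanup policy for output files
Make failures traceable
Put sensitive content through a review process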
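
Two items on that list, the size limit and the parameter whitelist, can be sketched without touching the model at all. The field names, limits, and default negative prompt below are assumptions for illustration; the multiple-of-8 snapping reflects the requirement in diffusers Stable Diffusion pipelines that width and height be divisible by 8.

```python
ALLOWED_PARAMS = {"prompt", "negative_prompt", "width", "height"}
MIN_SIDE, MAX_SIDE, SIDE_MULTIPLE = 512, 1024, 8
MAX_PROMPT_CHARS = 500  # hypothetical limit


def clamp_side(value):
    """Clamp a side length to the allowed range and snap it down to a multiple of 8."""
    clamped = min(max(int(value), MIN_SIDE), MAX_SIDE)
    return clamped - clamped % SIDE_MULTIPLE


def sanitize_request(raw):
    """Return a cleaned copy of a request dict, rejecting unknown keys."""
    unknown = set(raw) - ALLOWED_PARAMS
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    prompt = str(raw.get("prompt", "")).strip()
    if not prompt:
        raise ValueError("prompt is required")
    return {
        "prompt": prompt[:MAX_PROMPT_CHARS],
        "negative_prompt": raw.get("negative_prompt") or "low quality, blurry",
        "width": clamp_side(raw.get("width", 768)),
        "height": clamp_side(raw.get("height", 768)),
    }


print(sanitize_request({"prompt": "a tiny bookstore", "width": 4000, "height": 515}))
```

Rejecting unknown keys outright, instead of silently dropping them, makes it obvious to callers that settings like step count are fixed on the server side.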

A simple service class could look like this:

from pathlib import Path
from uuid import uuid4


class ImageGenerator:
    def __init__(self, model_id, output_dir="outputs"):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)

        device = "cuda" if torch.cuda.is_available() else "cpu"
        dtype = torch.float16 if device == "cuda" else torch.float32

        self.pipe = AutoPipelineForText2Image.from_pretrained(
            model_id,
            torch_dtype=dtype,
            use_safetensors=True,
        ).to(device)

    def generate(self, prompt, negative_prompt=None, width=768, height=768):
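        # Clamp to a safe range, then snap down to a multiple of 8,
        # which diffusers pipelines require for width and height.
        safe_width = min(max(width, 512), 1024) // 8 * 8
        safe_height = min(max(height, 512), 1024) // 8 * 8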

        image = self.pipe(
            prompt=prompt,
            negative_prompt=negative_prompt or "low quality, blurry",
            width=safe_width,
            height=safe_height,
            num_inference_steps=24,
            guidance_scale=7.0,
        ).images[0]

        path = self.output_dir / f"{uuid4().hex}.png"
        image.save(path)
        return path


generator = ImageGenerator(model_id)
result_path = generator.generate(
    "a tiny bookstore hidden in a futuristic alley, warm light"
)
print(result_path)
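
The class above writes every result into `output_dir` and never deletes anything, so the cleanup item from the checklist still needs an answer. A minimal age-based sweep might look like this; `cleanup_outputs`, the 24-hour default, and the PNG-only filter are all assumptions, and a real service might prefer object storage with lifecycle rules.

```python
import time
from pathlib import Path


def cleanup_outputs(output_dir, max_age_seconds=24 * 3600, now=None):
    """Delete PNG files older than max_age_seconds; return the removed paths."""
    now = time.time() if now is None else now
    removed = []
    for path in Path(output_dir).glob("*.png"):
        # Compare the file's last-modified time against the cutoff.
        if now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed.append(path)
    return removed
```

It could be run from a scheduled job, for example `cleanup_outputs("outputs")` once an hour, so disk usage stays bounded without the request path having to care.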

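Not writing a full Web API here is deliberate. Once the core concerns of model loading, parameter lockdown, and output management are sealed off, wrapping the class in FastAPI, Celery, or a message queue becomes much easier.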
