DDPM

The network takes the image $x_t$ as input and outputs the noise $z_t$ at the corresponding timestep; the two have the same shape.
Earlier models used a U-Net with an attention module at each layer for noise prediction (a minimal sketch follows the list below).
- Self-attention
  - Q, K, V: the output of each U-Net layer.
- Cross-attention
  - Q: the output of each U-Net layer.
  - K, V: the outputs of the input-condition encoder.


Forward Process
$\alpha_t = 1 - \beta_t$, where $\beta_t$ increases with $t$ (linearly from 0.0001 to 0.02 in the paper), so $\alpha_t$ decreases accordingly.
$x_t = \sqrt{\alpha_t}x_{t-1} + \sqrt{1- \alpha_t} z$
where $z \sim \mathcal{N}(0, \mathbf{I})$ is Gaussian noise.

Applying this step recursively and defining $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ gives the closed form $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} z$, i.e.
\[q({x}_t \mid {x}_0) = \mathcal{N}({x}_t; \sqrt{\bar{\alpha}_t} {x}_0, (1 - \bar{\alpha}_t)\mathbf{I})\]
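A minimal sketch of this closed-form noising step in PyTorch, assuming the linear $\beta_t$ schedule from the paper (the helper name `q_sample` and `T = 1000` are conventional but illustrative here):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_t grows with t
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alphas_bar = torch.cumprod(alphas, dim=0)    # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) in a single step."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)  # broadcast over (B, C, H, W)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```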
Reverse Process
\[q({x}_{t-1} \mid {x}_t, {x}_0) = q({x}_t \mid {x}_{t-1}, {x}_0) \frac{q({x}_{t-1} \mid {x}_0)}{q({x}_t \mid {x}_0)}\]
\[\propto \exp\left( -\frac{1}{2}\left[ \frac{\left(x_t - \sqrt{\alpha_t} x_{t-1}\right)^2}{\beta_t} + \frac{\left(x_{t-1} - \sqrt{\bar{\alpha}_{t-1}} x_0\right)^2}{1 - \bar{\alpha}_{t-1}} - \frac{\left(x_t - \sqrt{\bar{\alpha}_t} x_0\right)^2}{1 - \bar{\alpha}_t} \right] \right)\]
\[= \exp\left( -\frac{1}{2} \left[ \left(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}\right)x_{t-1}^2 - \left(\frac{2\sqrt{\alpha_t}}{\beta_t}x_t + \frac{2\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}} x_0\right)x_{t-1} + C(x_t, x_0) \right] \right)\]
where $C(x_t, x_0)$ collects the terms that do not involve $x_{t-1}$ and therefore does not affect the result.
Since $\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) = \exp\left( -\frac{1}{2}\left( \frac{1}{\sigma^2}x^2 - \frac{2\mu}{\sigma^2}x + \frac{\mu^2}{\sigma^2} \right) \right)$, matching coefficients gives the mean and variance (the variance is fixed and does not need to be learned):
\[\tilde{\mu}_t({x}_t, {x}_0) = \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t + \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} x_0, \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t\]
From the forward process, $x_t$ can be written in terms of $x_0$; inverting that relation gives $\mathbf{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( \mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \mathbf{z}_t \right)$, so substituting back:
\[\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}} \left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} z\right)\]
Everything here is known except the noise $z$, which is exactly what the neural network is trained to predict.
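A sketch of one reverse step built from $\tilde{\mu}_t$ above; it reuses the `betas`/`alphas`/`alphas_bar` tensors from the forward-process sketch and assumes the network is callable as `model(x_t, t)` returning the predicted noise (both are assumptions of this sketch, not a fixed API):

```python
def p_sample_step(model, x_t, t):
    """One reverse step: compute x_{t-1} from x_t using the predicted noise."""
    t_batch = torch.full((x_t.shape[0],), t, device=x_t.device, dtype=torch.long)
    z_pred = model(x_t, t_batch)                              # predicted noise z
    beta_t, alpha_t, a_bar_t = betas[t], alphas[t], alphas_bar[t]
    mean = (x_t - beta_t / (1.0 - a_bar_t).sqrt() * z_pred) / alpha_t.sqrt()
    if t == 0:
        return mean                                           # no noise at the final step
    # the variance is fixed; the simple choice sigma_t^2 = beta_t is used here
    return mean + beta_t.sqrt() * torch.randn_like(x_t)
```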
Training & Sampling

Training
x_noisy has the same shape as the $x_t$ that is fed into the network.

t
In the U-Net, t is injected together with x_noisy via a positional-embedding-like encoding (see the sketch below).
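A sketch of the training step and of a sinusoidal timestep embedding of the kind a U-Net could use to inject t; `model`, `q_sample`, and `T` are taken from the sketches above, and all names here are illustrative:

```python
import math
import torch
import torch.nn.functional as F

def timestep_embedding(t, dim):
    """Sinusoidal embedding of t, analogous to a transformer positional encoding."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # (B, dim)

def train_step(model, optimizer, x0):
    """One DDPM training step: predict the noise added at a random timestep t."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)                   # the target z
    x_noisy = q_sample(x0, t, noise)               # same shape as x0 / x_t
    loss = F.mse_loss(model(x_noisy, t), noise)    # model predicts z from (x_noisy, t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```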
Sampling
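A sketch of the full sampling loop, starting from pure Gaussian noise and repeatedly applying the illustrative `p_sample_step` helper defined above:

```python
@torch.no_grad()
def sample(model, shape):
    """Generate samples by iterating x_T -> x_{T-1} -> ... -> x_0."""
    x = torch.randn(shape)               # x_T ~ N(0, I)
    for t in reversed(range(T)):
        x = p_sample_step(model, x, t)   # posterior-mean step from the reverse process
    return x
```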

