Separable Convolution in Deep Learning: Splitting a Large Convolution into Two Lightweight Steps

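The idea is to break one expensive convolution into two lighter steps. This is depthwise separable convolution, commonly written as: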

Depthwise Convolution + Pointwise Convolution

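Where a standard convolution gets expensive

Suppose the input feature map of some layer is:

20 × 20 × 100

That is, height and width are both 20, and there are 100 input channels.

If we apply a standard convolution with 50 kernels of size 3×3 (stride 1, "same" padding), the output is:

20 × 20 × 50

The cost, counted in multiply-accumulate operations, is roughly:

20 × 20 × 100 × 3 × 3 × 50

The expensive part of this expression is the trailing product:

input channels × kernel area × output channels

Every output channel has to look at every input channel, and at every position it also performs a 3×3 spatial convolution. As the number of kernels grows, the cost grows with it.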
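As a quick check on the arithmetic above, here is a minimal sketch that counts multiply-accumulate operations for a standard convolution. The function name `standard_conv_macs` is made up for this illustration, and it assumes stride 1 with "same" padding, so the output keeps the input's height and width.

```python
def standard_conv_macs(h, w, c_in, k, c_out):
    """Approximate multiply-accumulate count of a standard convolution.

    Assumes stride 1 and "same" padding, so the output is h x w x c_out.
    Bias additions are ignored.
    """
    # Every output value reads all input channels over a k x k window.
    per_output_value = c_in * k * k
    # One output value per spatial position per output channel.
    output_values = h * w * c_out
    return output_values * per_output_value


macs = standard_conv_macs(h=20, w=20, c_in=100, k=3, c_out=50)
print(macs)  # 18000000
```

Doubling the number of output channels doubles the count, which is exactly the "more kernels, more cost" effect described above.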

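How separable convolution splits it

Depthwise separable convolution splits a standard convolution into two steps.

The first step is called Depthwise.

Instead of mixing all input channels at once, each channel gets its own spatial convolution.

If the input is:

20 × 20 × 100

then after a 3×3 depthwise convolution the output is still:

20 × 20 × 100

The cost is roughly:

20 × 20 × 100 × 3 × 3

The second step is called Pointwise.

It uses a 1×1 convolution to mix the 100 channels into the desired 50 channels.

The cost is roughly:

20 × 20 × 100 × 1 × 1 × 50

So the total cost is:

20 × 20 × 100 × 3 × 3
+ 20 × 20 × 100 × 50

Compared with the standard convolution:

20 × 20 × 100 × 3 × 3 × 50

The gap is large: about 2.36 million multiply-accumulates versus 18 million, roughly 7.6 times fewer.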
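To see the two steps add up, the sketch below recomputes the numbers for the 20 × 20 × 100 example. The helper names `depthwise_macs` and `pointwise_macs` are hypothetical, and the counts again assume stride 1, "same" padding, and no bias.

```python
def depthwise_macs(h, w, c_in, k):
    # Each channel is filtered by its own k x k kernel; channels are not mixed.
    return h * w * c_in * k * k


def pointwise_macs(h, w, c_in, c_out):
    # A 1x1 convolution: each output value is a weighted sum over the input channels.
    return h * w * c_in * c_out


h, w, c_in, k, c_out = 20, 20, 100, 3, 50

standard = h * w * c_in * k * k * c_out
separable = depthwise_macs(h, w, c_in, k) + pointwise_macs(h, w, c_in, c_out)

print(standard)                        # 18000000
print(separable)                       # 2360000
print(round(standard / separable, 2))  # 7.63
```

Note that the pointwise step (2,000,000) dominates the depthwise step (360,000): after the split, most of the remaining work is channel mixing.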

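Figure: Standard convolution vs. depthwise separable convolution (https://zoyblogs.oss-cn-guangzhou.aliyuncs.com/separable_conv_compare_mtnp.svg)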

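Division of labor between Depthwise and Pointwise

A crude but memorable way to think about it:

Depthwise: each channel looks at its own spatial texture
Pointwise: a 1×1 convolution gets the channels talking to each other

Figure: Depthwise and Pointwise flow (https://zoyblogs.oss-cn-guangzhou.aliyuncs.com/depthwise_pointwise_flow_lxok.svg)

A standard convolution does both jobs in one fused operation, which is flexible but expensive.

Separable convolution does them separately. That trades away some expressive power, but it is usually friendlier in speed and parameter count.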
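The division of labor can be made concrete with a tiny pure-Python sketch. This is an illustration under stated assumptions, not how any framework implements it: feature maps are nested lists indexed `[channel][row][col]`, stride is 1 with zero padding, and, as in deep learning frameworks, the kernel is applied without flipping. The function names are made up for this example.

```python
def depthwise_conv(x, kernels):
    """x: [C][H][W] nested lists; kernels: [C][K][K], one kernel per channel.

    Stride 1 with zero padding, so every output channel keeps the H x W size.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        height, width, k = len(channel), len(channel[0]), len(kernel)
        pad = k // 2
        result = [[0.0] * width for _ in range(height)]
        for i in range(height):
            for j in range(width):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        # Positions outside the image count as zero.
                        if 0 <= ii < height and 0 <= jj < width:
                            acc += channel[ii][jj] * kernel[di][dj]
                result[i][j] = acc
        out.append(result)
    return out


def pointwise_conv(x, weights):
    """x: [C][H][W]; weights: [M][C]. Mixes channels independently at each pixel."""
    height, width = len(x[0]), len(x[0][0])
    return [
        [[sum(wc * x[c][i][j] for c, wc in enumerate(row)) for j in range(width)]
         for i in range(height)]
        for row in weights
    ]


# Two 3x3 channels: the first is all ones, the second is all zeros.
x = [[[1.0] * 3 for _ in range(3)], [[0.0] * 3 for _ in range(3)]]
box = [[1.0] * 3 for _ in range(3)]    # 3x3 kernel that sums the neighborhood
dw = depthwise_conv(x, [box, box])
pw = pointwise_conv(dw, [[1.0, 1.0]])  # mix 2 channels into 1

print(dw[0][1][1], dw[1][1][1])  # 9.0 0.0
print(len(pw), pw[0][1][1])      # 1 9.0
```

The all-zero channel stays all zero after the depthwise step, which shows that no information crosses channels there; only the pointwise step combines them.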

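Another look at the cost comparison

Assume:

the input feature map is H × W
the number of input channels is C
the number of output channels is M
the kernel size is K × K

A standard convolution costs roughly:

H × W × C × K × K × M

A depthwise separable convolution costs roughly:

H × W × C × K × K + H × W × C × M

Figure: Cost of separable convolution (https://zoyblogs.oss-cn-guangzhou.aliyuncs.com/separable_compute_bruj.svg)

The larger M and K are, the more the split saves.

This is why so many lightweight CNNs favor it.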
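Dividing the second formula by the first, H, W and C cancel and the ratio becomes 1/M + 1/(K × K). The sketch below checks that numerically for a few settings; `cost_ratio` is a hypothetical helper, and the (K, M) pairs are arbitrary values picked for illustration.

```python
def cost_ratio(h, w, c, k, m):
    """Separable cost divided by standard cost, using the two formulas above."""
    standard = h * w * c * k * k * m
    separable = h * w * c * k * k + h * w * c * m
    return separable / standard


# H, W and C cancel, so the ratio should equal 1/M + 1/(K*K).
for k, m in [(3, 16), (3, 50), (3, 256), (5, 256)]:
    ratio = cost_ratio(h=20, w=20, c=100, k=k, m=m)
    closed_form = 1 / m + 1 / (k * k)
    print(k, m, round(ratio, 4), round(closed_form, 4))
```

One consequence: with 3×3 kernels the ratio can never drop below 1/9, however many output channels there are. Larger kernels lower that floor.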

SeparableConv2D in Keras

In Keras you can use SeparableConv2D directly.

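import tensorflow as tf
from tensorflow.keras import layers, models

# A small classifier for 32x32 RGB inputs built from separable convolutions.
model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),

    # 3x3 depthwise convolution followed by a 1x1 pointwise convolution to 32 channels.
    layers.SeparableConv2D(
        filters=32,
        kernel_size=3,
        padding="same",
        activation="relu"
    ),

    # Same pattern again, widening from 32 to 64 channels.
    layers.SeparableConv2D(
        filters=64,
        kernel_size=3,
        padding="same",
        activation="relu"
    ),

    # Average each channel over space, then classify into 10 classes.
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation="softmax")
])

model.summary()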

Internally, SeparableConv2D does:

DepthwiseConv2D + 1×1 Pointwise Conv2D

You don't need to split it into two layers by hand; Keras already wraps them for you.
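To get a feel for what `model.summary()` should report, the sketch below estimates the trainable parameters of the two SeparableConv2D layers by hand. It assumes each layer holds one depthwise kernel, one pointwise kernel, and a single bias vector of length `filters` (my understanding of the defaults, with `use_bias=True` and `depth_multiplier=1`). The helper `separable_conv2d_params` is hypothetical, so treat the numbers as an estimate to compare against the real summary.

```python
def separable_conv2d_params(in_channels, filters, kernel_size,
                            depth_multiplier=1, use_bias=True):
    # Depthwise kernel: one k x k filter per input channel (times depth_multiplier).
    depthwise = kernel_size * kernel_size * in_channels * depth_multiplier
    # Pointwise kernel: a 1x1 convolution from the depthwise outputs to `filters` channels.
    pointwise = in_channels * depth_multiplier * filters
    # Assumption: a single bias vector, applied after the pointwise step.
    bias = filters if use_bias else 0
    return depthwise + pointwise + bias


first = separable_conv2d_params(in_channels=3, filters=32, kernel_size=3)
second = separable_conv2d_params(in_channels=32, filters=64, kernel_size=3)

print(first)   # 155  = 27 + 96 + 32
print(second)  # 2400 = 288 + 2048 + 64
```

For comparison, a standard 3×3 convolution from 32 to 64 channels with bias would need 3 × 3 × 32 × 64 + 64 = 18,496 parameters.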

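DepthwiseConv2D in Keras

If you only want the first step, that is, an independent spatial convolution per channel, use DepthwiseConv2D. The example below then adds a 1×1 Conv2D by hand to complete the separable convolution.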

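import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(32, 32, 3))

# Step 1: per-channel 3x3 spatial convolution (no channel mixing).
x = layers.DepthwiseConv2D(
    kernel_size=3,
    padding="same",
    depth_multiplier=1
)(inputs)

# Step 2: 1x1 convolution that mixes the 3 channels into 32.
x = layers.Conv2D(
    filters=32,
    kernel_size=1,
    padding="same",
    activation="relu"
)(x)

model = tf.keras.Model(inputs, x)
model.summary()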

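Here DepthwiseConv2D handles the spatial convolution, and the following Conv2D(kernel_size=1) handles channel mixing.

depth_multiplier is the number of depthwise output channels produced per input channel. The default, 1, is the easiest to reason about:

one input channel -> one output channel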
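For values other than 1, the bookkeeping can be sketched as follows. Both helper names are hypothetical, and the parameter count ignores bias.

```python
def depthwise_output_channels(in_channels, depth_multiplier=1):
    # Each input channel produces `depth_multiplier` output channels.
    return in_channels * depth_multiplier


def depthwise_kernel_params(in_channels, kernel_size, depth_multiplier=1):
    # One k x k filter per produced channel; bias is ignored here.
    return kernel_size * kernel_size * in_channels * depth_multiplier


for dm in (1, 2, 4):
    print(dm, depthwise_output_channels(3, dm), depthwise_kernel_params(3, 3, dm))
# 1 3 27
# 2 6 54
# 4 12 108
```

Both the channel count and the depthwise parameter count scale linearly with depth_multiplier, so raising it widens the input to the pointwise step as well.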

How to write it in PyTorch

PyTorch has no built-in layer named SeparableConv2D, but you can implement the depthwise step with the groups argument of nn.Conv2d.

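import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution, then 1x1 pointwise convolution, BatchNorm and ReLU."""

    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()

        # Depthwise step: groups=in_channels gives every channel its own kernel.
        # Note: padding=1 preserves the spatial size only for kernel_size=3.
        self.depthwise = nn.Conv2d(
            in_channels=in_channels,
            out_channels=in_channels,
            kernel_size=kernel_size,
            padding=padding,
            groups=in_channels,
            bias=False
        )

        # Pointwise step: a 1x1 convolution that mixes channels.
        self.pointwise = nn.Conv2d(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=1,
            bias=False
        )

        # The biases above are omitted because BatchNorm has its own learnable shift.
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.depthwise(x)
        x = self.pointwise(x)
        x = self.bn(x)
        x = self.act(x)
        return x


layer = DepthwiseSeparableConv(3, 32)
x = torch.randn(8, 3, 32, 32)  # batch of 8 three-channel 32x32 inputs
y = layer(x)

print(y.shape)  # torch.Size([8, 32, 32, 32])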

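The key is:

groups=in_channels

This makes each input channel be convolved only with its own kernel, without mixing with other channels.

Channel mixing is left to the 1×1 convolution that follows.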
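One way to see what groups does is to look at the weight tensor it implies. The sketch below mirrors the layout PyTorch documents for the weight of nn.Conv2d, (out_channels, in_channels / groups, kernel height, kernel width), without importing torch; the helper names are made up for this illustration.

```python
def conv2d_weight_shape(in_channels, out_channels, kernel_size, groups=1):
    """Weight shape of a grouped 2D convolution: (out, in // groups, k, k)."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("in_channels and out_channels must both be divisible by groups")
    return (out_channels, in_channels // groups, kernel_size, kernel_size)


def num_weights(shape):
    # Product of all dimensions.
    total = 1
    for dim in shape:
        total *= dim
    return total


standard = conv2d_weight_shape(100, 100, 3, groups=1)
depthwise = conv2d_weight_shape(100, 100, 3, groups=100)

print(standard, num_weights(standard))    # (100, 100, 3, 3) 90000
print(depthwise, num_weights(depthwise))  # (100, 1, 3, 3) 900
```

With groups equal to the channel count, the second dimension collapses to 1: each output channel owns a single 3×3 kernel and sees exactly one input channel.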

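A rough parameter count

Parameters of a standard convolution (ignoring bias):

K × K × C × M

Parameters of a depthwise separable convolution (ignoring bias):

K × K × C + C × M

Computed in Python: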

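def conv_params(k, in_channels, out_channels):
    # Standard convolution: one k x k x in_channels kernel per output channel.
    return k * k * in_channels * out_channels


def separable_params(k, in_channels, out_channels):
    # Depthwise: one k x k kernel per input channel.
    depthwise = k * k * in_channels
    # Pointwise: a 1x1 kernel over all input channels per output channel.
    pointwise = in_channels * out_channels
    return depthwise + pointwise


k = 3
c = 100
m = 50

normal = conv_params(k, c, m)
sep = separable_params(k, c, m)

print("normal:", normal)      # 45000
print("separable:", sep)      # 5900
print("ratio:", sep / normal) # about 0.131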

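The result is easy to read: 5,900 parameters versus 45,000, so the separable version needs only about 13% of the parameters of the standard convolution.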

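How it relates to Inception and MobileNet

The original text mentions Inception. One of Inception's core ideas is "width first": convolution branches at different scales extract features in parallel, and the results are concatenated.

Separable convolution is a different way to save computation: don't make one large convolution responsible for both space and channels; split the work.

This idea later became especially common in lightweight networks. In MobileNet-style architectures, for example, depthwise separable convolution is one of the main building blocks.

Its advantages are straightforward:

fewer parameters
less computation
better suited to mobile or lightweight models
still performs well on many vision tasks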

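What it costs

Separable convolution is not a free lunch.

A standard convolution models spatial and channel relationships jointly in a single operation, which makes it very expressive.

Once separable convolution splits the two apart, computation gets lighter, but some capacity for modeling channels and space jointly may be lost.

So it is a good fit for:

lightweight models
mobile models
settings with limited compute
tasks sensitive to speed and model size

But if you are chasing the last bit of accuracy and have compute to spare, standard convolution or more elaborate convolution modules are still worth considering.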

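Common misconceptions

DepthwiseConv2D is not a complete separable convolution.

It is only the first step: a spatial convolution on each channel by itself.

A complete depthwise separable convolution also needs the 1×1 pointwise convolution to mix channels.

The 1×1 convolution is not decoration.

It does not look at the spatial neighborhood, but it recombines information across channels, and that is the whole point of the pointwise stage.

Separable convolution does not only reduce parameters.

It reduces computation as well, and in many cases the speed gain matters more than the parameter savings.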
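To make the point about the 1×1 convolution concrete: at a single pixel it is just a small matrix-vector product over the channel values. The sketch below uses a hypothetical helper `pointwise_pixel` and made-up weights.

```python
def pointwise_pixel(channels_at_pixel, weights):
    """Apply a 1x1 convolution at a single pixel.

    channels_at_pixel: list of C values; weights: M rows of C weights.
    Returns M output values, each a weighted sum over the input channels.
    """
    return [sum(w * v for w, v in zip(row, channels_at_pixel)) for row in weights]


pixel = [1.0, 2.0, 4.0]   # three input channels at one position
weights = [
    [1.0, 0.0, 0.0],      # output 0 copies channel 0
    [0.0, 0.5, 0.5],      # output 1 averages channels 1 and 2
]
print(pointwise_pixel(pixel, weights))  # [1.0, 3.0]
```

The same weights are applied at every position, which is why the pointwise step changes the number of channels without touching the spatial layout.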

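Summary

Depthwise separable convolution fits in one sentence:

First convolve each channel on its own, then mix the channels with a 1×1 convolution.

Standard convolution:

spatial features + channel mixing, done in one go

Separable convolution:

Depthwise handles space, Pointwise handles channels

This split clearly reduces both computation and parameter count, which is why it has become such a common building block in lightweight CNNs.

In Keras, just use:

layers.SeparableConv2D(...)

In PyTorch, use:

groups=in_channels

to implement the depthwise step, then follow it with a 1×1 Conv2d.

Once you understand this split, MobileNet, lightweight detection networks, and other efficient CNN architectures become much easier to read.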
