Yangqing Jia 2014-09-05 17:03:29 -07:00
Parent f12a74af95
Commit f15fc36440
1 changed file: 30 additions and 47 deletions

@@ -18,7 +18,7 @@ A typical "image" in the real-world may have one color channel ($c = 1$), as in
 But in this context, the distinguishing characteristic of an image is its spatial structure: usually an image has some non-trivial height $h > 1$ and width $w > 1$.
 This 2D geometry naturally lends itself to certain decisions about how to process the input.
 In particular, most of the vision layers work by applying a particular operation to some region of the input to produce a corresponding region of the output.
-In contrast, other layers (with few exceptions) ignore the spatial structure of the input, treating it as "one big vector" with dimension $$ c h w $$.
+In contrast, other layers (with few exceptions) ignore the spatial structure of the input, effectively treating it as "one big vector" with dimension $$ c h w $$.
 #### Convolution
@@ -27,17 +27,17 @@ In contrast, other layers (with few exceptions) ignore the spatial structure of
 * CPU implementation: `./src/caffe/layers/convolution_layer.cpp`
 * CUDA GPU implementation: `./src/caffe/layers/convolution_layer.cu`
 * Options (`ConvolutionParameter convolution_param`)
-- Required: `num_output` ($c_o$), the number of filters
+- Required: `num_output` (`c_o`), the number of filters
 - Required: `kernel_size` or (`kernel_h`, `kernel_w`), specifies height & width of each filter
 - Strongly recommended (default `type: 'constant' value: 0`): `weight_filler`
 - Optional (default `true`): `bias_term`, specifies whether to learn and apply a set of additive biases to the filter outputs
 - Optional (default 0): `pad` or (`pad_h`, `pad_w`), specifies the number of pixels to (implicitly) add to each side of the input
 - Optional (default 1): `stride` or (`stride_h`, `stride_w`), specifies the intervals at which to apply the filters to the input
-- Optional (default 1): `group` ($g$) if $>1$, restricts the connectivity of each filter to a subset of the input. In particular, the input to the $i^{th}$ group of $n_f / g$ filters is the $i^{th}$ group of $c_i / g$ input channels.
+- Optional (default 1): `group` (g). If g > 1, we restrict the connectivity of each filter to a subset of the input. Specifically, the input and output channels are separated to g groups separately, and the i-th output group channels will be only connected to the i-th input group channels.
 * Input
-- $n \times c_i \times h_i \times w_i$ (repeated $K \ge 1$ times)
+- `n * c_i * h_i * w_i`
 * Output
-- $n \times c_o \times h_o \times w_o$ (repeated $K$ times)
+- `n * c_o * h_o * w_o`, where `h_o = (h_i + 2 * pad_h - kernel_h) / stride_h + 1` and `w_o` likewise.
 * Sample (as seen in `./examples/imagenet/imagenet_train_val.prototxt`)
 layers {
@@ -66,58 +66,41 @@ In contrast, other layers (with few exceptions) ignore the spatial structure of
 The `CONVOLUTION` layer convolves the input image with a set of learnable filters, each producing one feature map in the output image.
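
As a rough illustration of the output-size rule `h_o = (h_i + 2 * pad_h - kernel_h) / stride_h + 1` listed under Output, here is a minimal Python sketch. The function name `conv_output_shape`, the batch size of 1, the 227 x 227 input, and the reading of `/` as integer (floor) division are assumptions made for illustration and are not part of the Caffe API. The 96 filters, kernel size 11, and stride 4 are the `conv1` values quoted in this diff.

```python
def conv_output_shape(n, c_o, h_i, w_i, kernel_h, kernel_w,
                      pad_h=0, pad_w=0, stride_h=1, stride_w=1):
    """Return (n, c_o, h_o, w_o) for a convolution output blob.

    Assumes '/' in the documented formula means integer (floor) division.
    """
    h_o = (h_i + 2 * pad_h - kernel_h) // stride_h + 1
    w_o = (w_i + 2 * pad_w - kernel_w) // stride_w + 1
    return (n, c_o, h_o, w_o)


# conv1 from the sample: 96 filters, 11x11 kernel, stride 4.
# The 1 x 3 x 227 x 227 input is a hypothetical value chosen for the example.
shape = conv_output_shape(n=1, c_o=96, h_i=227, w_i=227,
                          kernel_h=11, kernel_w=11,
                          stride_h=4, stride_w=4)
print(shape)  # (1, 96, 55, 55)
```

Note that the number of input channels `c_i` does not appear in the output shape: each filter spans all of its input channels, so only `num_output` decides `c_o`.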
-#### Pooling**
-`POOLING`
-#### Local Response Normalization
-* LayerType: `LRN`
-* CPU implementation: `./src/caffe/layers/lrn_layer.cpp`
-* CUDA GPU implementation: `./src/caffe/layers/lrn_layer.cu`
-* Options (`ConvolutionParameter convolution_param`)
-- Required: `num_output` ($c_o$), the number of filters
+#### Pooling
+* LayerType: `POOLING`
+* CPU implementation: `./src/caffe/layers/pooling_layer.cpp`
+* CUDA GPU implementation: `./src/caffe/layers/pooling_layer.cu`
+* Options (`PoolingParameter pooling_param`)
+- Optional (default MAX): `pool`, the pooling method. Currently MAX, AVE, or STOCHASTIC
 - Required: `kernel_size` or (`kernel_h`, `kernel_w`), specifies height & width of each filter
-- Strongly recommended (default `type: 'constant' value: 0`): `weight_filler`
-- Optional (default `true`): `bias_term`, specifies whether to learn and apply a set of additive biases to the filter outputs
 - Optional (default 0): `pad` or (`pad_h`, `pad_w`), specifies the number of pixels to (implicitly) add to each side of the input
 - Optional (default 1): `stride` or (`stride_h`, `stride_w`), specifies the intervals at which to apply the filters to the input
-- Optional (default 1): `group` ($g$) if $>1$, restricts the connectivity of each filter to a subset of the input. In particular, the input to the $i^{th}$ group of $n_f / g$ filters is the $i^{th}$ group of $c_i / g$ input channels.
 * Input
-- $n \times c_i \times h_i \times w_i$ (repeated $K \ge 1$ times)
+- `n * c * h_i * w_i`
 * Output
-- $n \times c_o \times h_o \times w_o$ (repeated $K$ times)
+- `n * c * h_o * w_o`, where h_o and w_o are computed in the same way as convolution.
 * Sample (as seen in `./examples/imagenet/imagenet_train_val.prototxt`)
 layers {
-name: "conv1"
-type: CONVOLUTION
-bottom: "data"
-top: "conv1"
-blobs_lr: 1 # learning rate multiplier for the filters
-blobs_lr: 2 # learning rate multiplier for the biases
-weight_decay: 1 # weight decay multiplier for the filters
-weight_decay: 0 # weight decay multiplier for the biases
-convolution_param {
-num_output: 96 # learn 96 filters
-kernel_size: 11 # each filter is 11x11
-stride: 4 # step 4 pixels between each filter application
-weight_filler {
-type: "gaussian" # initialize the filters from a Gaussian
-std: 0.01 # distribution with stdev 0.01 (default mean: 0)
-}
-bias_filler {
-type: "constant" # initialize the biases to zero (0)
-value: 0
-}
-}
-}
-The `CONVOLUTION` layer convolves the input image with a set of learnable filters, each producing one feature map in the output image.
+name: "pool1"
+type: POOLING
+bottom: "conv1"
+top: "pool1"
+pooling_param {
+pool: MAX
+kernel_size: 3
+stride: 2
+}
+}
+#### Local Response Normalization (LRN)
+`LRN`
 #### im2col
-`IM2COL` is a helper for doing the image-to-column transformation that you most likely do not need to know about.
+`IM2COL` is a helper for doing the image-to-column transformation that you most likely do not need to know about. This is used in Caffe's original convolution to do matrix multiplication by laying out all patches into a matrix.
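
To make "laying out all patches into a matrix" concrete, here is a minimal pure-Python sketch of the idea for a single image. It is an illustration under stated assumptions, not Caffe's implementation: the function names `im2col` and `conv_via_im2col` are hypothetical, padding is omitted, the image is a nested list indexed `[channel][row][column]`, and each filter is given already flattened in (channel, kernel row, kernel column) order.

```python
def im2col(image, kernel_h, kernel_w, stride=1):
    """Lay out every kernel-sized patch of `image` as one column of a matrix.

    Returns a list of rows: one row per (channel, kernel row, kernel col)
    offset, one column per output location. No padding is applied.
    """
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    h_o = (height - kernel_h) // stride + 1
    w_o = (width - kernel_w) // stride + 1
    cols = []
    for ch in range(channels):
        for kh in range(kernel_h):
            for kw in range(kernel_w):
                cols.append([image[ch][y * stride + kh][x * stride + kw]
                             for y in range(h_o) for x in range(w_o)])
    return cols


def conv_via_im2col(image, filters, kernel_h, kernel_w, stride=1):
    """Apply flattened filters as a plain matrix product: filters x cols."""
    cols = im2col(image, kernel_h, kernel_w, stride)
    n_locations = len(cols[0])
    return [[sum(w * row[j] for w, row in zip(flat_filter, cols))
             for j in range(n_locations)]
            for flat_filter in filters]


# Hypothetical 1-channel 3x3 image and one 2x2 filter that adds the
# top-left and bottom-right value of each patch.
image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
filters = [[1, 0, 0, 1]]
print(conv_via_im2col(image, filters, 2, 2))  # [[6, 8, 12, 14]]
```

Each output row is one flattened feature map. The sketch applies the filter without flipping it, so strictly speaking it computes a cross-correlation.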
 ### Loss Layers
@@ -152,9 +135,9 @@ Loss drives learning by comparing an output to a target and assigning cost to mi
 In general, activation / Neuron layers are element-wise operators, taking one bottom blob and producing one top blob of the same size. In the layers below, we will ignore the input and out sizes as they are identical:
 * Input
-- $n \times c \times h \times w$
+- n * c * h * w
 * Output
-- $n \times c \times h \times w$
+- n * c * h * w
 #### ReLU / Rectified-Linear and Leaky-ReLU