Dropout
Dropout function.
Dropout (x)
- x: the input to apply the dropout function to
Note: the dropout rate is not a parameter to this function, but instead specified in the SGD
section.
Dropout() will return the result of the dropout operation applied to the input.
The result has the same tensor dimensions as the input.
Dropout is a popular technique to improve generalizability of models. It sets values to 0 with a given probability called the dropout rate.
In CNTK's implementation, the values that are not set to 0 are instead multiplied by (1 / (1 - dropout rate)). This way, the model parameters learned with dropout are directly applicable during inference. (If this were not done, the user would have to scale them manually before inference.)
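For example, with a dropout rate of 0.5, each surviving value is multiplied by 1 / (1 - 0.5) = 2, so the expected value of each activation is the same with and without dropout. The following minimal NumPy sketch (an illustration only, not CNTK's actual implementation) shows this "inverted dropout" scaling:
import numpy as np

def dropout(x, dropout_rate, training=True):
    # Illustration of inverted-dropout scaling; not CNTK code.
    if not training or dropout_rate == 0.0:
        return x                                         # inference: pass input through unchanged
    mask = np.random.rand(*x.shape) >= dropout_rate      # keep each value with probability (1 - rate)
    return x * mask / (1.0 - dropout_rate)               # scale survivors by 1 / (1 - dropout rate)

x = np.ones((2, 3))
print(dropout(x, dropout_rate=0.5))                      # kept entries become 2.0, dropped entries 0.0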
In addition, you need to add a dropoutRate parameter to the SGD
section to define the dropout rate.
It is specified in the SGD section, rather than as a parameter to Dropout()
itself, so that training can be started without dropout and dropout enabled after a few epochs,
which is a common scenario.
For this, the dropoutRate
is specified as a vector, where each value applies to a specific epoch.
When running inference, the Dropout()
operation passes its input unmodified (it is a no-op).
The following is a simple convolutional network with a dropout layer towards the end:
features = Input{...}
c = ConvolutionalLayer {32, (5:5), pad=true, activation=ReLU,
                        init="gaussian", initValueScale=0.0043} (features)
p = MaxPoolingLayer {(3:3), stride = (2:2)} (c)
h = DenseLayer {64, activation = ReLU, init = "gaussian", initValueScale = 12} (p)
d = Dropout (h) #####
z = LinearLayer {10, init = "gaussian", initValueScale = 1.5} (d)
and this is a corresponding entry in the SGD
section, which specifies no dropout for the first 3 epochs
and a dropout rate of 50% thereafter. This example uses the asterisk (*
) syntax to denote repetition:
SGD = {
    ...
    dropoutRate = 0*3:0.5
    ...
}
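Here 0*3:0.5 expands to 0:0:0:0.5, i.e. a dropout rate of 0 for the first three epochs and 0.5 from epoch 4 on; as with other per-epoch SGD parameters, the last value is used for all remaining epochs.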