Actually, there is no good answer to your question. Most architectures are carefully designed and fine-tuned over many experiments. I can share some of the rules of thumb one should apply when designing one’s own architecture:
- Avoid a dimension collapse in the first layer. Let’s assume that your input filter has an (n, n) spatial shape for an RGB image. In this case, it is good practice to set the number of filters to be greater than n * n * 3, as this is the dimensionality of the input to a single filter. If you set a smaller number, you could lose many useful pieces of information about the image, because the initialization may drop informative dimensions. Of course, this is not a general rule: e.g. for texture recognition, where image complexity is lower, a small number of filters might actually help.
- Think more about volume than about the number of filters. When setting the number of filters, it’s important to think about the change in volume rather than the change in the number of filters between consecutive layers. E.g. in VGG, even though the number of filters doubles after a pooling layer, the actual feature map volume decreases by a factor of 2, because pooling shrinks the feature map by a factor of 4. Decreasing the volume by a factor of more than 3 should usually be considered bad practice. Most modern architectures use a volume drop factor in the range between 1 and 2. Still, this is not a general rule: e.g. in the case of a narrow hierarchy, a greater volume drop might actually help.
- Avoid bottlenecking. As one may read in this milestone paper, bottlenecking might seriously harm your training process. It occurs when the drop in volume is too severe. Of course, a large drop can still be achieved, but then you should use intelligent downsampling, used e.g. in Inception v>2.
- Check 1×1 convolutions. It’s believed that filter activations are highly correlated. One may take advantage of this by using 1×1 convolutions, namely convolutions with a filter size of 1. This makes it possible e.g. to drop volume with them instead of pooling or intelligent downsampling (see example here). You could e.g. build twice as many filters and then cut 25% of them by using 1×1 convs as a consecutive layer.
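To make the arithmetic behind the hints above concrete, here is a minimal Python sketch. The function names are my own, and the 56×56×256 shape in the last example is made up for illustration; only the 224×224×64 to 112×112×128 step is taken from VGG.

```python
def first_layer_input_dim(kernel_size, channels=3):
    """Number of input values seen by one first-layer filter: n * n * channels."""
    return kernel_size * kernel_size * channels


def volume(height, width, filters):
    """Feature-map volume: spatial size times the number of filters."""
    return height * width * filters


def volume_drop_factor(before, after):
    """How many times smaller the volume gets between two layers.

    `before` and `after` are (height, width, filters) tuples.
    """
    return volume(*before) / volume(*after)


# First hint: a 3x3 filter on an RGB image sees 3 * 3 * 3 = 27 values,
# so the rule of thumb suggests more than 27 filters in the first layer.
print(first_layer_input_dim(3))  # 27

# Second hint: 2x2 pooling shrinks the spatial area by 4 while the
# number of filters doubles, so the volume only halves.
print(volume_drop_factor((224, 224, 64), (112, 112, 128)))  # 2.0

# Fourth hint (hypothetical shapes): a widened layer followed by a 1x1
# convolution that keeps 75% of the filters; spatial size is unchanged.
wide = (56, 56, 256)
cut = (56, 56, int(256 * 0.75))
print(volume_drop_factor(wide, cut))
```

A factor between 1 and 2, as in the last two examples, is in the range the second hint describes as typical.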
As you can see, there is no easy way to choose the number of filters. Besides the hints above, I’d like to share one of my favorite sanity checks on the number of filters. It takes 2 easy steps:
- Try to overfit on 500 random images with regularization.
- Try to overfit on the whole dataset without any regularization.
Usually, if the number of filters is too low (in general), these two tests will show you that. If, during your training process with regularization, your network severely overfits, this is a clear indicator that your network has way too many filters.
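The way I read the outcome of these checks can be sketched as a small helper. This is only an illustration: the function name, its arguments and both thresholds are my own assumptions, and you would plug in accuracies measured with whatever framework you train in.

```python
def interpret_sanity_checks(subset_train_acc, full_train_acc,
                            reg_train_acc, reg_val_acc,
                            fit_threshold=0.99, gap_threshold=0.2):
    """Turn the results of the sanity checks into a rough verdict.

    subset_train_acc: training accuracy on the 500 random images (step 1).
    full_train_acc:   training accuracy on the whole dataset, trained
                      without regularization (step 2).
    reg_train_acc, reg_val_acc: training and validation accuracy of the
                      normal, regularized training run.
    Both thresholds are arbitrary placeholders, not recommended values.
    """
    # Steps 1 and 2: the network should be able to memorize the data.
    if subset_train_acc < fit_threshold or full_train_acc < fit_threshold:
        return "too few filters"
    # Regularized run: a large train/validation gap means severe overfitting.
    if reg_train_acc - reg_val_acc > gap_threshold:
        return "too many filters"
    return "no problem detected"


print(interpret_sanity_checks(0.80, 0.85, 0.80, 0.78))
print(interpret_sanity_checks(1.0, 1.0, 0.99, 0.60))
```

The first call describes a network that cannot even memorize the data, the second one a network that memorizes everything but generalizes poorly despite regularization.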
Cheers.