From Dropout to Winner — Advancing Neural Network Sparsity
In this essay, I will first give a detailed exposition of dropout and the issues it introduces. I will then describe a newer solution that has demonstrated better robustness and accuracy than dropout. After that, I will connect the motivations behind these methods to biological neurons. Finally, I will discuss the advantageous implications of sparsity for machine learning as a whole.
Understanding Dropout
Dropout is a regularization technique for neural networks. It randomly “kills” a percentage of the neurons (in practice usually 50%) on every training input presentation, thus introducing random sparse representations during learning. The idea is to prevent co-adaptation, where the network becomes too reliant on particular connections/patterns, which can be symptomatic of overfitting. Intuitively, dropout can be thought of as creating an implicit ensemble of neural networks.
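As a concrete illustration (PyTorch assumed, values chosen only for the example), the standard nn.Dropout layer zeroes a random fraction of activations during training and passes everything through at inference:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)                 # drop rate commonly set to 50%
x = torch.randn(1, 100)

drop.train()                             # training mode: a random half is zeroed
print((drop(x) == 0).float().mean())     # ~0.5

drop.eval()                              # inference mode: nothing is dropped
print((drop(x) == 0).float().mean())     # ~0.0
```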
A more generalized form of dropout is called DropConnect, which sets a randomly selected subset of weights within the network to zero, thus also introducing random sparse representations during learning. Dropout can be regarded as a special case of DropConnect in which a whole column of the weight matrix is dropped at once.
Dropout and DropConnect both introduce sparsity in activations and connectivity during training.
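A minimal sketch of the DropConnect idea, assuming PyTorch; the class name and the rescaling choice are mine, not a reference implementation. Individual weights are masked with a fresh random mask on every training step, and the full weight matrix is used at inference:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropConnectLinear(nn.Module):
    """Linear layer whose individual weights are randomly masked during training."""

    def __init__(self, in_features, out_features, drop_prob=0.5):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.training:
            # Sample a fresh binary mask over individual weights each step.
            mask = (torch.rand_like(self.linear.weight) > self.drop_prob).float()
            weight = self.linear.weight * mask / (1.0 - self.drop_prob)
        else:
            # Inference uses the full (dense) weight matrix.
            weight = self.linear.weight
        return F.linear(x, weight, self.linear.bias)
```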
Intrinsic Problems in Dropout
Dropout introduces random sparse representations during learning and has been shown to be an effective regularizer in many contexts. However, it still has some major problems:
Patterns Intersected Noise
Dropout can’t utilize sparsity during inference to resist noise in the inputs. In fact, using a dropout layer can make the model more susceptible to input noise than using a dense layer.
By training the model with a dropout layer, we essentially force each neuron to learn multiple sparse pattern detectors from its inputs separately. During inference, however, all of that neuron’s connections/weights are active, which means all of its sparse pattern detectors are active at once. This causes an issue: the neuron can fire because several of its sparse pattern detectors each have a few active connections, even though none of the learned sparse patterns is actually present. Many noise inputs can therefore cause the neuron to fire (i.e., false positives). I call this type of noise patterns intersected noise.
In comparison, a dense layer doesn’t force a neuron to learn multiple separate sparse pattern detectors (although, depending on the statistics of its inputs, it may). Neurons in a dense layer do have an incentive to learn a single unified dense pattern detector, using all of their connections/weights across all of their inputs, that is best for the task. Thus, many neurons in a dense layer are immune to this patterns intersected noise, albeit at the expense of a higher likelihood of overfitting.
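A toy numeric example (the indices, weights, and threshold are entirely hypothetical) of patterns intersected noise: a neuron whose weights are the union of two sparse detectors fires for an input that contains only fragments of each pattern.

```python
import numpy as np

# A single neuron that has effectively learned two separate sparse pattern
# detectors: pattern A on inputs {0, 1, 2} and pattern B on inputs {5, 6, 7}.
w = np.zeros(10)
w[[0, 1, 2]] = 1.0        # detector for pattern A
w[[5, 6, 7]] = 1.0        # detector for pattern B
threshold = 2.5           # fires when one full pattern (sum = 3) is present

pattern_a = np.zeros(10); pattern_a[[0, 1, 2]] = 1.0
noise     = np.zeros(10); noise[[1, 2, 6, 7]] = 1.0   # fragments of A and B

print(w @ pattern_a > threshold)  # True  -- genuine detection
print(w @ noise > threshold)      # True  -- false positive: no full pattern present
```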
Patterns Conflicted Weights
When training the model with a dropout layer, the dropout mask is random for each input, so the set of active connections for each neuron is also random. Since different active connections mean different input statistics, every neuron is inclined to learn a different sparse pattern detector for different inputs. Problems then tend to occur for any pair of inputs whose active connections overlap on a given neuron (with the common 50% dropout rate, this is almost always the case, as the sketch below suggests): the two different learned sparse pattern detectors fight over those shared weights (i.e., patterns conflicted weights). As a consequence, a non-trivial portion of the training steps is spent on those conflicts, and the learned model can be less accurate on the task it was trained for. Note that I said the training time is spent, not wasted, because the dropout mechanism relies on this fighting to settle on a set of sparse pattern detector weights for each neuron. But as we will see shortly, it is not necessary.
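A quick sanity check of that overlap claim, assuming independent 50% dropout masks: on average, about a quarter of a neuron’s connections end up active for both of two inputs.

```python
import torch

# Two independent 50% dropout masks over the same 1000 incoming connections.
torch.manual_seed(0)
mask_a = torch.rand(1000) > 0.5
mask_b = torch.rand(1000) > 0.5

shared = (mask_a & mask_b).float().mean()
print(shared)   # ~0.25: about a quarter of the connections are active for both inputs
```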
k-Winner Solution
We can overcome these issues with dropout by explicitly leveraging sparsity in activations and connectivity during both training and inference.
The paper How Can We Be So Dense? The Benefits of Using Highly Sparse Representations formulates a version of the Spatial Pooler, called the k-winners layer, that is designed to be a drop-in layer for neural networks trained with back-propagation.
The k-winners layer has two main components: Sparse Weights and a k-Winner-Take-All (kWTA) function.
- Sparse Weights means the weights of each unit are initialized from a sparse random distribution, so only a fraction of the weights contain non-zero values. This introduces sparsity in connectivity. It addresses both Patterns Intersected Noise and Patterns Conflicted Weights, because each neuron has a fixed, pre-defined set of active connections, which constrains it to learn a single unified sparse pattern detector. That removes the root cause of both problems: a single neuron trying to learn multiple sparse pattern detectors for a given task. (A minimal sketch follows this list.)
- kWTA ensures that only the top-k active neurons within each layer are kept, and the rest are set to zero. This introduces sparsity in activations.
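To make the first component concrete, here is a minimal sketch of Sparse Weights, assuming PyTorch; the class name, the masking approach, and the 40% weight density are my assumptions rather than the paper’s exact implementation. Each unit is permanently connected to only a random subset of its inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseLinear(nn.Module):
    """Linear layer in which each unit keeps only a fixed random subset of its weights."""

    def __init__(self, in_features, out_features, weight_density=0.4):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Fixed binary mask: each output unit connects to roughly `weight_density`
        # of its inputs, chosen once at initialization and never changed.
        mask = (torch.rand(out_features, in_features) < weight_density).float()
        self.register_buffer("mask", mask)
        with torch.no_grad():
            self.linear.weight.mul_(self.mask)

    def forward(self, x):
        # Re-apply the mask so pruned weights stay exactly zero during training.
        return F.linear(x, self.linear.weight * self.mask, self.linear.bias)
```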
By explicitly choosing the top-k active neurons, the same connections of the same neurons are activated, and thus reinforced, across similar inputs. This is much more efficient than what dropout does, which is to activate a random subset of connections and neurons even for similar inputs.
Also, by capping the number of neurons that can fire in each layer, kWTA helps prevent noise in the input from propagating into the deeper layers.
Algorithm: k-winners layer
The target duty cycle, â^l, is a constant reflecting the percentage of neurons that are expected to be active.
The boost factor, β, is a positive parameter that controls the strength of boosting.
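A minimal sketch of the kWTA-plus-boosting idea described above, assuming PyTorch; the parameter names and the running duty-cycle update are my assumptions, not the paper’s exact algorithm. Units that have been winning less often than the target duty cycle â^l have their scores boosted by exp(β · (â^l − duty cycle)) so that they still get a chance to learn, and only the top-k units pass their activations through.

```python
import torch
import torch.nn as nn

class KWinners(nn.Module):
    """k-Winner-Take-All layer with duty-cycle boosting (a sketch of the idea)."""

    def __init__(self, n_units, k, boost_strength=1.0, duty_cycle_period=1000):
        super().__init__()
        self.k = k
        self.boost_strength = boost_strength          # beta in the text
        self.duty_cycle_period = duty_cycle_period
        self.register_buffer("duty_cycle", torch.zeros(n_units))
        self.target_duty = k / n_units                # target duty cycle, a-hat

    def forward(self, x):
        if self.training:
            # Boost units that have been active less often than the target,
            # so that all units get a chance to learn.
            boost = torch.exp(self.boost_strength * (self.target_duty - self.duty_cycle))
            scores = x * boost
        else:
            scores = x

        # Keep the top-k units per sample (by boosted score), zero out the rest,
        # but pass through the original, unboosted activation values.
        topk_idx = scores.topk(self.k, dim=1).indices
        mask = torch.zeros_like(x).scatter(1, topk_idx, 1.0)
        out = x * mask

        if self.training:
            # Running estimate of how often each unit wins.
            batch_mean = mask.mean(dim=0)
            alpha = 1.0 / self.duty_cycle_period
            self.duty_cycle.mul_(1.0 - alpha).add_(alpha * batch_mean)
        return out
```

In use, a SparseLinear followed by a KWinners layer would take the place of a dense Linear + ReLU + Dropout stack, giving sparsity in both connectivity and activations at training and inference time.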
Comparison Results
The results show that the k-winners layer is much more robust against noise in the inputs than a dropout layer, while having little to no accuracy loss compared to using only fully dense layers, a setting in which a dropout layer can have a non-trivial negative effect on accuracy.
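For reference, a noise-robustness check along these lines can be sketched as follows (a hypothetical helper; the corruption scheme of replacing a random fraction of input features with uniform noise is my assumption, not necessarily the paper’s exact noise model):

```python
import torch

def accuracy_under_noise(model, loader, noise_fraction, device="cpu"):
    """Evaluate a classifier on inputs where a random fraction of features is corrupted."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            flat = x.view(x.size(0), -1)
            # Replace a random subset of input features with random values.
            noise_mask = torch.rand_like(flat) < noise_fraction
            noisy = torch.where(noise_mask, torch.rand_like(flat), flat)
            preds = model(noisy.view_as(x)).argmax(dim=1)
            correct += (preds == y).sum().item()
            total += y.size(0)
    return correct / total
```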
Sparsity in Biology
From the paper Going Beyond the Point Neuron: Active Dendrites and Sparse Representations for Continual Learning:
Biological circuits and neocortical neurons exhibit sparsity in terms of both their 1) activations and 2) connectivity. In regards to activation sparsity, previous studies showed that in the neocortex relatively few neurons spike in response to a sensory stimulus, and that this is consistent across multiple sensory modalities, i.e., somatosensory, olfactory, visual and auditory (Attwell and Laughlin, 2001, Barth and Poulet, 2012, Liang et al., 2019). While the mechanisms maintaining sparsity, and the exact sparsity at the level of the individual neuron, remain to be answered fully, sparsity is a well-documented cortical phenomenon. Sparsity is also present in neural connections: cortical pyramidal neurons show sparse connectivity to each other and receive only a few excitatory inputs from most surrounding neurons (Holmgren et al., 2003).
Benefits of Sparsity
Sparsity in activations and connectivity helps a layer of neurons form minimally overlapping representations, which, compared to a fully connected dense layer, has the following benefits:
Mitigate Overfitting
The model can learn a more structured solution for a given task by focusing on learning modular, fundamental features.
Robust Against Noise
A model that has formed a more structured solution has a better, more generalized solution, which by definition makes it easier for the model to identify and ignore the noise within the inputs.
Circumvent Catastrophic Forgetting
Forgetting in neural networks is a consequence of two aspects acting in unison. First, roughly half of the neurons “fire” for any given input that the network processes. Second, the back-propagation learning algorithm, which enables learning in neural networks, modifies the connections of all firing neurons. (This is a consequence of the mathematics of back-propagation.)
Putting these two aspects together, the implication is that learning to predict the output for a single input causes roughly half of all connections to change! (In the cartoon illustration below, that would be all the green connections.) The “knowledge” of a neural network exists solely in the connections between neurons, hence previously-acquired knowledge is rapidly erased as the network attempts to learn new things. This is a terrible outcome from a continual learning perspective since a large portion of the network is essentially changing all the time.
So if, in a back-propagation update, the connections between neurons are updated sparsely, then by definition the majority of the “knowledge” stored in the connections is left unchanged and thus not forgotten.
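This is easy to verify directly, assuming PyTorch: when a layer’s input activations are sparse, the gradient columns for the inactive inputs are exactly zero, so only the weights attached to the active units can change in that update.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(100, 10)

# A sparse input: only 5 of the 100 upstream units are active.
x = torch.zeros(1, 100)
x[0, torch.randperm(100)[:5]] = 1.0

out = layer(x)
out.sum().backward()

# Gradient columns for inactive inputs are exactly zero, so only the weights
# attached to the 5 active units receive any update.
nonzero_cols = (layer.weight.grad.abs().sum(dim=0) > 0).sum().item()
print(nonzero_cols)   # 5
```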
References
- How Can We Be So Dense? The Benefits of Using Highly Sparse Representations
- Going Beyond the Point Neuron: Active Dendrites and Sparse Representations for Continual Learning
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting
- The HTM Spatial Pooler — A Neocortical Algorithm for Online Sparse Distributed Coding