A great explanation of why Dropout is really good for training large neural networks, and why it's actually the same thing your brain is doing:
Recent advances in machine learning cast new light on two puzzling biological phenomena. Neurons can use the precise time of a spike to communicate a real value very accurately, but it appears that cortical neurons do not do this. Instead they send single, randomly timed spikes. This seems like a clumsy way to perform signal processing, but a recent advance in machine learning shows that sending stochastic spikes actually works better than sending precise real numbers for the kind of signal processing that the brain needs to do. A closely related advance in machine learning provides strong support for a recently proposed theory of the function of sexual reproduction. Sexual reproduction breaks up large sets of co-adapted genes and this seems like a bad way to improve fitness. However, it is a very good way to make organisms robust to changes in their environment because it forces important functions to be achieved redundantly by multiple small sets of genes and some of these sets may still work when the environment changes. For artificial neural networks, complex co-adaptations between learned feature detectors give good performance on training data but not on new test data. Complex co-adaptations can be reduced by randomly omitting each feature detector with a probability of a half for each training case. This random "dropout" makes the network perform worse on the training data but the number of errors on the test data is typically decreased by about 10%. Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever and Ruslan Salakhutdinov have shown that this leads to large improvements in speech recognition and object recognition.
Essentially, dropout prevents over-fitting by simultaneously averaging many models and adding noise to input / hidden layers.
Additionally, sending a real valued signal of strength x with probability 0.5 is equivalent to sending a signal of strength 0.5 with probability x. Dropout does the former, brains do the latter. This can also be used to reduce neuron output to single bits, saving bandwidth when scaling to large numbers of interconnected machines.