Update weight initialization scheme in mlp.py #106
rfeinman wants to merge 1 commit into lisa-lab:master from rfeinman:patch-1
Conversation
The sparse initialization scheme is considered state-of-the-art in random weight initialization for MLPs. In this scheme, we hard-limit the number of non-zero incoming connection weights to each unit (we used 15 in our experiments) and set the biases to 0 (or 0.5 for tanh units). Doing this allows the units to be both highly differentiated and unsaturated, avoiding the problem in dense initializations where the connection weights must all be scaled very small to prevent saturation, leading to poor differentiation between units.
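A minimal NumPy sketch of the scheme described above (illustrative only, not the exact patch to mlp.py; the function name `sparse_init` and its parameters are assumptions): each hidden unit receives at most 15 non-zero incoming weights drawn from a Gaussian, the rest stay zero, and biases are 0, or 0.5 for tanh units.

```python
import numpy
import random


def sparse_init(n_in, n_out, num_nonzero=15, activation='tanh', rng=None):
    """Weights with at most `num_nonzero` non-zero incoming connections per unit."""
    rng = rng if rng is not None else numpy.random.RandomState(1234)
    W = numpy.zeros((n_in, n_out))
    for j in range(n_out):
        # choose which inputs connect to unit j with a non-zero weight
        idx = random.sample(range(n_in), min(num_nonzero, n_in))
        W[idx, j] = rng.randn(len(idx))
    # biases: 0.5 for tanh units, 0 otherwise, as described above
    b = numpy.full(n_out, 0.5 if activation == 'tanh' else 0.0)
    return W, b
```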
Thanks for the PR. There is a bug that makes our auto tests fail with errors like this: Also, there is another part that I think would need updating, as there is still a reference to the previous paper. I didn't read the full paper; which section says that this initialization is better than the previous one with 1st-order optimization? Before merging, this would need some testing and acceptance by other people on the machine learning side (I'm on the software side).
Sorry, I forgot to add "import random" at the top of the code file. As for why it is better, please see section 3.1 of Hinton's paper, where the idea is explained very well: "The intuitive justification is that the total amount of input to each unit will not depend on the size of the previous layer and hence they will not as easily saturate. Meanwhile, because the inputs to each unit are not all randomly weighted blends of the outputs of many 100s or 1000s of units in the previous layer, they will tend to be qualitatively more “diverse” in their response to inputs." I referenced Martens 2010 because this is where the initialization scheme was first described.
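A quick numerical illustration of the quoted intuition (my own sketch, not part of the PR's patch): with a dense Gaussian initialization the pre-activation scale grows with the fan-in, while with only 15 non-zero incoming weights it stays roughly constant as the previous layer gets larger.

```python
import numpy

rng = numpy.random.RandomState(0)
for n_in in (100, 1000, 10000):
    x = rng.randn(n_in)                    # previous layer's activations
    dense_w = rng.randn(n_in)              # dense init: every weight non-zero
    sparse_w = numpy.zeros(n_in)
    sparse_w[rng.permutation(n_in)[:15]] = rng.randn(15)
    # pre-activation magnitude grows with n_in for the dense init,
    # but stays roughly constant for the sparse scheme
    print(n_in, abs(numpy.dot(dense_w, x)), abs(numpy.dot(sparse_w, x)))
```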
Hi, sorry about the late reply.