Before switching to LSTM models, I decided to first brush up on my general NN skills. Neural nets are … different. A lot of the exciting advancements in image and speech recognition are fueled by NNs. If you have ever worked with neural nets, you probably know they can be a pain. Lately I have been getting some new inspiration though, thanks to the capable guys at Scyfer.
Summing up: neural nets really start to shine with larger and more complex datasets. The key factor is the ability to shape the net to resemble the data. By contrast, with a small number of features, XGBoost is generally the beast to defeat. But what if you would like to model time series, or have a lot of categorical data? I have spent a lot of time coming up with time-related features, for example the number of positives in the last week, the last 3 months, and the last year. A lot of manual guessing. Or take an overload of categorical features. Just expand these using dummies, right?
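To make the hand-made time features mentioned above concrete, here is a minimal sketch in plain Python. The event data, the `positives_in_window` helper and the window lengths (7, 90 and 365 days) are all made up for illustration; they are not taken from any real dataset.

```python
from datetime import date, timedelta


def positives_in_window(events, as_of, days):
    """Count positive events in the `days` days before `as_of` (as_of itself excluded)."""
    start = as_of - timedelta(days=days)
    return sum(1 for day, positive in events if positive and start <= day < as_of)


# Hypothetical history for one customer: (date, was the outcome positive?)
events = [
    (date(2016, 1, 10), True),
    (date(2016, 9, 20), True),
    (date(2016, 10, 28), False),
    (date(2016, 10, 30), True),
]

as_of = date(2016, 11, 1)

# One feature per look-back window; picking the windows is the "manual guessing".
features = {
    "pos_last_week": positives_in_window(events, as_of, 7),
    "pos_last_3_months": positives_in_window(events, as_of, 90),
    "pos_last_year": positives_in_window(events, as_of, 365),
}
```

Every extra window is another guess and another column, which is exactly the kind of work a model shaped to the data could take over.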
Diving into the Kaggle Allstate competition, expanding the categorical variables (which come with no description whatsoever, by the way) produced around 800 extra features. Standard dummy-fying, by the way, gives two separate features for every binary feature. (In ordinary regression with an intercept these two columns are perfectly collinear, which leaves the coefficients indeterminate.) Building the model amounted to riding a bull of a likelihood space. I was not building models, I was trying to crank out grid searches at more than 1 hour per model. No fun; it takes a lot of staring into the sunset.
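The redundancy in dummy-fying a binary feature is easy to see in a few lines. This is a plain-Python sketch with a hypothetical `dummies` helper and toy values, not the competition data. In pandas the same effect can be had with the `drop_first` argument of `get_dummies`.

```python
def dummies(values, drop_first=False):
    """Expand a list of category labels into 0/1 columns, one per level."""
    levels = sorted(set(values))
    if drop_first:
        # Drop the first level; it is implied when all remaining columns are 0.
        levels = levels[1:]
    return {f"is_{lvl}": [1 if v == lvl else 0 for v in values] for lvl in levels}


binary = ["A", "B", "B", "A"]  # toy binary feature

full = dummies(binary)                      # two columns: is_A and is_B
reduced = dummies(binary, drop_first=True)  # one column: is_B

# The two full columns always add up to 1, so one of them carries no extra information.
row_sums = [a + b for a, b in zip(full["is_A"], full["is_B"])]
```

With hundreds of such columns the redundancy adds up quickly, both in parameters and in how hard the optimizer has to work.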
So where is the victory? Well, here it comes. Rethinking the brutal dummy approach, I decided to model each categorical feature separately by adding an extra layer with one sub-net per categorical variable. These sub-nets are shown in orange in the picture below.
On top of the blue one there are two continuous features. From the three levels of a categorical variable, a synthetic continuous variable is now created. Instead of having a first layer of five nodes, with very diffuse concepts, there is now a ‘first’ layer with three clear concepts. Additional layers require far fewer parameters. The first try at this model beat most of the other models in 15s: optimization was clean and crisp. Keras (a tip from Scyfer) really starts to shine here. (See source examples: https://github.com/spdrnl/keras-categorical)
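The idea of the picture can be sketched as a forward pass in plain Python. This is not the Keras code from the repository above. The weights are made-up numbers rather than trained values, the sub-net is assumed to be a single linear node, and the hidden layer size of 10 is a hypothetical choice used only to compare weight counts.

```python
def one_hot(level, n_levels):
    """One-hot encode an integer level in [0, n_levels)."""
    return [1.0 if i == level else 0.0 for i in range(n_levels)]


def category_subnet(level, weights, bias=0.0):
    """A single linear node over the one-hot input: one synthetic continuous value."""
    x = one_hot(level, len(weights))
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def merged_first_layer(continuous, level, weights):
    """Concatenate the continuous features with the sub-net output."""
    return list(continuous) + [category_subnet(level, weights)]


cat_weights = [-0.5, 0.25, 1.5]  # hypothetical learned value per category level

# Two continuous features plus one categorical feature at level 2.
features = merged_first_layer([0.7, 1.2], level=2, weights=cat_weights)

# Weights needed to feed a hidden layer of 10 nodes (biases ignored):
hidden = 10
dummy_weights = (2 + 3) * hidden       # five raw inputs go straight in
subnet_weights = 3 + (2 + 1) * hidden  # three sub-net weights, then three inputs
```

Here `features` has three entries instead of five, and the single-node sub-net acts as a lookup of one learned number per level. In this toy case the saving is modest (33 weights against 50), but it grows with the number of levels per categorical variable.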
Smooth surfing of the loss space. Victory. Not big, but a nice move up from buckshot XGBoost.