PCA is a common way to reduce numerical features. Factor analysis, Lisrel anyone?, is a related method involving a latent factor model and fit measures to confirm a certain factor structure: it uses the covariance matrix as input. Although it is possible to compute a covariance matrix on binary features using tetrachoric correlations, I was always warned against using factor analysis on binary features.
Modelling categorical features as little subnets is already a step ahead for blatantly dummy-fying categorical variables. RBM’s might offer a way to detect patterns in categorical feature vectors. RBM’s basically extract patterns from binary vectors, compressing the vector to a lower dimension. This is sort of no surprise, RBM’s are the basic building blocks of deep belief networks, or how to discover patterns of patterns of … all the way down. Still nice I nailed that one.
RBM are deemed a bit old fashioned and surpassed by more modern deep learning approaches. Still it could give a good insight into the order of magnitude of things. There are a lot of hyper-parameters in neural nets, any guidance it welcome AFAIK.
Before switching to LSTM models, I decided to first brush up my general NN skills. Neural nets are … different. A lot of the exciting advancements in image and speech recognition are fueled by NN’s. If you have ever worked with neural nets, you probably know they can be a pain. Lately I have been getting some new inspiration though, thanks to the capable guys of Scyfer.
Summing up: neural nets really start to shine with larger and more complex datasets. The key factor being the ability to shape the net to resemble the data. On the contrary: with a small amount of features, XGBoost is generally the beast to defeat. But what if you would like to model time series, or have a lot of categorical data? I have spent a lot of time coming up with time related features. Examples being: the number of positives in the last week, the last 3 months, and the last year. A lot of manual guessing. Or take an overload of categorical features. Just expand these using dummies, right?
Diving into the Kaggle Allstate competition, the expansion of the categorical variables (no description what so ever by the way), got to around 800 extra features. Standard dummy-fying by the way gives two separate features for every binary feature. (In normal regression this would give a lot of indefiniteness.) Building the model amounted to riding a bull like likelihood space. I was not building models, I was trying to crank out grid searches at more than 1 hour per model. No fun, it takes a lot of staring into the sunset.
So where is the victory? Well here it comes. Rethinking the brutal dummy approach, I decided to model each categorical feature separately by adding a an extra layer with one sub-net per categorical variable. Orange in the picture below.
On top of the blue one there are two continuous features. Use three levels of a categorical variable, a synthetic continuous variable is created now. Instead of having a first layer of five nodes, with very diffuse concepts, there is now a ‘first’ layer with three clear concepts. Additional layers require far less parameters. The first model try beat most of the other models in 15s: optimization was clean and crisp. Keras (a tip from Scyfer) really starts to shine here. (See source examples: https://github.com/spdrnl/keras-categorical)
Smooth surfing of the loss space. Victory. Not big, but a nice move up from buckshot XGBoost.
Some LSTM like questions have sparked my neural network curiosity again. Time to dust off that GPU.
Quite a disappointment: my 1.5 years old laptop (tank style HP Zbook) cannot support tensorflow with CUDA 8.0, the CUDA compute capability is 3.0 and it requires 3.5.
HP-Z800 to the rescue! This machine was quite loud, but I managed to install some water cooling, making it sound less like a server. Compiling tensorflow sucked the live out of this dual cpu machine. O.k., it is only twice six cores, but still: 1042s.
Instructions can be found at: https://alliseesolutions.wordpress.com/2016/09/08/install-gpu-tensorflow-from-sources-w-ubuntu-16-04-and-cuda-8-0-rc/
A restart fixed some scary kernel messages, and presto we have flow: