A long long time ago in deep-learning time, and about two years ago in human years, I created a convolutional neural net for the Kaggle National Data Science Bowl. In a brave effort I ported a convolutional net for MNIST to the plankton dataset using Python and Theano. More or less a mano in Theano. It worked, and I ended up somewhere halfway down the field.
I remember being somewhat proud and somewhat confused. I spent a lot of time learning deep learning concepts, and a lot of time coding in Theano. There was not really any time or grit left to improve the working model. Deep learning proved to be a lot of work.
Flash forward to today. Within a few hours I am running a better convolutional net on the same dataset. Nothing fancy yet, but it works; and I have high hopes because of VGG16. VGG16 is a one-time winner of the ImageNet competition. As it turns out, one can create a Franken-VGG16 by chopping off layers and retraining parts of the model specifically for the plankton dataset. Ergo, the feature learning based on approximately 1.5M images is reused. Just like word embeddings are available for download, feature filters will become available for all types of datasets. Progress.
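The Franken-VGG16 recipe can be sketched in a few lines of Keras. This is a minimal illustration of the idea, not my actual model: freeze the convolutional feature filters and retrain only a new top for the plankton classes. The input size and layer widths here are assumptions; in practice you would pass `weights="imagenet"` to reuse the pre-learned filters.

```python
# Transfer-learning sketch: reuse VGG16's convolutional layers, retrain a new top.
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

NUM_CLASSES = 121  # the plankton dataset has 121 classes

# weights=None keeps this sketch offline; use weights="imagenet" for real reuse
base = VGG16(weights=None, include_top=False, input_shape=(64, 64, 3))
base.trainable = False  # freeze the pre-learned feature filters

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

From here it is just `model.fit` on the plankton images, optionally unfreezing the last convolutional block for fine-tuning.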
I added one of the plankton images as an example. The contest is aimed at identifying plankton species to assess the biodiversity of the oceans.
As mentioned before, Keras is just kick-ass for creating neural nets. The Keras API allows you to quickly create deep neural nets and just get on with it. Very impressive. Also many thanks to Jeremy Howard for democratizing deep learning.
Credit for the title of this post goes to dancinghands.com. I'm learning how to play the bongos, and rolling down the hill describes one of the many Latin rhythms. It also applies to data science.
In the quest for real-time processes, some unexpected help came from an enterprise architecture model. These days everything is agile, and architecture is seen as the work of charlatans. I do agree that architecture can go terribly wrong. At the same time though, not all quality stems from weekly iterations. The model I am referring to is depicted below. It gave me insight into what I saw happening.
Once you decide that real-time processes are important (see also: how real-time is real-time?), requirements start to trickle down. It is likely that the information architecture needs adjusting in the direction of events and micro-services (more on this later). This in turn will impact the systems and component designs. Real-time systems require high uptime, so application management and infrastructure are likely to change too. This is in essence what I have seen happening in the last year. Rolling down the hill.
Summarizing, the following qualities are required in the layers:
- Business processes – real-time or as fast as needed (might be weekly)
- Information architecture – avoiding hotness of databases and schema changes, focus on integrating using key indexes like contract number, address, postal code etc.
- Application layer – systems are placed in highways of events; keep data streaming; state becomes localized; resilience, idempotency
- Infrastructure layer – scalability and fault-tolerance; don't let success become your future problem
I will leave details to your imagination.
Currently I am involved in transitioning a larger corporation to more real-time processes. One of the questions that recurs is the following: how real-time is real-time? Are microsecond response times really adding to the bottom line? After all, consumers have contracts for a year or so. So why do it?
I have to admit, this question confused me quite a lot for some time. Gradually the mist has lifted and a clear answer has formed. It goes like this.
Listen, batch is really great. Most algorithms can be optimized in batch mode and put online. It works, and fulfills quite a few needs. But still, you are going to lose out. These batches are produced once a month. By working hard, for quite a long time, you could probably run the batches every three or even two weeks. But here it comes.
The cost and complexity of speeding up batching is just going to be really high. Basically you are going to make data move more, and faster, between databases. It will break. For every company, there is a breaking point for batch. Beyond this breaking point the only paradigm that is going to save you is real-time. It might be that your algorithms only need updating once a week. Beyond your breaking point, once a week is real-time for you. Simple.
Did I mention that a lot of batch-type algorithms find it really hard to model sequences through time? And that deep learning allows you to combine convolutional and LSTM networks? Trust me, the future is streaming; real-time.
Well I guess there is a first time for everything. So I confess: I created my first LSTM using Keras. Yay!
The coolness of Keras keeps amazing me. Creating embeddings is easy. Performance seems unreasonable. I have checked some trial code several times and used a test set; I may just have to admit that it is just darn awesome.
Two of the best posts on LSTMs:
In new parlance: I have got to be honest with you, it is unreasonably effective, I have got to be honest with you.
So yes, absolutely. Of course it depends on your needs.
Word2vec is known for creating word embeddings from text sentences. Words are represented as vectors in an n-dimensional space. More remarkably, vectors in this space can be added and subtracted, the famous:
king – man + woman = queen
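The arithmetic can be illustrated with a toy example. The vectors below are made up for illustration (real ones come out of a trained word2vec model); the point is that the nearest neighbor of king – man + woman, by cosine similarity, is queen.

```python
# Toy illustration of word-vector analogy arithmetic with hand-made embeddings.
import numpy as np

vectors = {  # invented 3-d embeddings encoding "royalty" and "gender"
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.1, 0.9, 0.0]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.9, 0.0, 1.0]),
    "apple": np.array([0.0, 0.2, 0.2]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max((w for w in vectors if w not in ("king", "man", "woman")),
           key=lambda w: cosine(vectors[w], target))
print(best)  # -> queen
```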
Edit: Pinterest applied the word2vec model to the pins in a session: pin2vec. Using this technique, similar pins can be identified. More information can be found here: https://engineering.pinterest.com/blog/applying-deep-learning-related-pins (It is categorized under deep learning: this seems debatable.)
It turns out that web visits can also be seen as sentences, with each URL as a word. Applying word2vec to web visits this way, and then using t-SNE for plotting, shows that similar URLs are indeed clustered near each other (sorry, no picture disclosure). URL2vec? It is like Uber, but then for …
Less intuitive, though, is the subtraction and addition of URL vectors. Subtracting one URL from another gives …. Well, I will have to find that out later. For now, the word2vec vector could act as a condensed input for a neural network instead of a large bag of URLs.
PCA is a common way to reduce numerical features. Factor analysis (LISREL, anyone?) is a related method involving a latent factor model and fit measures to confirm a certain factor structure; it uses the covariance matrix as input. Although it is possible to compute a covariance matrix on binary features using tetrachoric correlations, I was always warned against using factor analysis on binary features.
Modelling categorical features as little sub-nets is already a step up from blatantly dummy-fying categorical variables. RBMs might offer a way to detect patterns in categorical feature vectors. RBMs basically extract patterns from binary vectors, compressing the vector to a lower dimension. This is not really a surprise: RBMs are the basic building blocks of deep belief networks, or how to discover patterns of patterns of … all the way down. Still nice that I nailed that one.
RBMs are deemed a bit old-fashioned and have been surpassed by more modern deep learning approaches. Still, they could give good insight into the order of magnitude of things. There are a lot of hyper-parameters in neural nets; any guidance is welcome.
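The compression idea is easy to try with scikit-learn's BernoulliRBM. A small sketch on synthetic data: two underlying binary patterns plus noise, compressed to two hidden units. All the numbers here are made up for illustration.

```python
# Compress noisy binary feature vectors with an RBM.
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)
# two underlying binary "patterns" the RBM should rediscover
patterns = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 0, 0, 1, 1, 1]])
X = patterns[rng.integers(0, 2, size=500)]          # 500 samples of the patterns
X = np.abs(X - (rng.random(X.shape) < 0.05))        # flip 5% of the bits as noise

rbm = BernoulliRBM(n_components=2, learning_rate=0.05,
                   n_iter=30, random_state=0)
H = rbm.fit_transform(X)  # hidden-unit activations: the compressed representation
print(H.shape)  # (500, 2)
```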
Before switching to LSTM models, I decided to first brush up my general NN skills. Neural nets are … different. A lot of the exciting advancements in image and speech recognition are fueled by NNs. If you have ever worked with neural nets, you probably know they can be a pain. Lately I have been getting some new inspiration though, thanks to the capable guys at Scyfer.
Summing up: neural nets really start to shine with larger and more complex datasets, the key factor being the ability to shape the net to resemble the data. By contrast, with a small number of features, XGBoost is generally the beast to beat. But what if you would like to model time series, or have a lot of categorical data? I have spent a lot of time coming up with time-related features, examples being the number of positives in the last week, the last 3 months, and the last year. A lot of manual guessing. Or take an overload of categorical features. Just expand these using dummies, right?
Diving into the Kaggle Allstate competition, the expansion of the categorical variables (no description whatsoever, by the way) got to around 800 extra features. Standard dummy-fying, by the way, gives two separate features for every binary feature. (In normal regression this would cause a lot of indefiniteness.) Building the model amounted to riding a bull-like likelihood space. I was not building models; I was trying to crank out grid searches at more than 1 hour per model. No fun, and a lot of staring into the sunset.
So where is the victory? Well, here it comes. Rethinking the brutal dummy approach, I decided to model each categorical feature separately by adding an extra layer with one sub-net per categorical variable. Orange in the picture below.
On top of that there are two continuous features (blue in the picture). Given three levels of a categorical variable, a synthetic continuous variable is now created for it. Instead of having a first layer of five nodes with very diffuse concepts, there is now a ‘first’ layer with three clear concepts. Additional layers require far fewer parameters. The first model try beat most of the other models in 15s: optimization was clean and crisp. Keras (a tip from Scyfer) really starts to shine here. (See source examples: https://github.com/spdrnl/keras-categorical)
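The sub-net idea translates directly to the Keras functional API. This is a sketch of the idea, not the code from the repo linked above; the layer widths, variable names, and the two example categoricals are assumptions.

```python
# One small sub-net per categorical variable, merged with continuous features.
from tensorflow.keras import layers, Model, Input

def categorical_subnet(levels, width=1, name=None):
    """Compress a one-hot categorical into `width` synthetic continuous units."""
    inp = Input(shape=(levels,), name=name)
    out = layers.Dense(width, activation="tanh")(inp)
    return inp, out

cat_a_in, cat_a = categorical_subnet(3, name="cat_a")  # 3-level categorical
cat_b_in, cat_b = categorical_subnet(5, name="cat_b")  # 5-level categorical
cont_in = Input(shape=(2,), name="continuous")         # two continuous features

merged = layers.Concatenate()([cat_a, cat_b, cont_in])
hidden = layers.Dense(8, activation="relu")(merged)
output = layers.Dense(1, activation="sigmoid")(hidden)

model = Model(inputs=[cat_a_in, cat_b_in, cont_in], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```

The later layers now see a handful of learned continuous concepts instead of hundreds of raw dummies, which is why the parameter count drops so sharply.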
Smooth surfing of the loss space. Victory. Not big, but a nice move up from buckshot XGBoost.
Some LSTM like questions have sparked my neural network curiosity again. Time to dust off that GPU.
Quite a disappointment: my 1.5-year-old laptop (tank-style HP ZBook) cannot support TensorFlow with CUDA 8.0; its CUDA compute capability is 3.0, and 3.5 is required.
HP Z800 to the rescue! This machine was quite loud, but I managed to install some water cooling, making it sound less like a server. Compiling TensorFlow sucked the life out of this dual-CPU machine. O.k., it is only two times six cores, but still: 1042s.
Instructions can be found at: https://alliseesolutions.wordpress.com/2016/09/08/install-gpu-tensorflow-from-sources-w-ubuntu-16-04-and-cuda-8-0-rc/
A restart fixed some scary kernel messages, and presto, we have flow:
There are not a lot of idols in IT. (I’m not referring to people coding in front of a panel of judges by the way.) This can make it hard to climb the scales. Which way to go? Who should one emulate? Or is newer always better?
There is a lot of talk about ‘good code’, ‘quality’ and craftsmanship. These words are general and fuzzy, and result in a lot of mutual back-patting. They should add up to ‘coding, motherf*cker’ (I’m slightly paraphrasing Zed A. Shaw here) but could end up in some discussion-less system where one is ‘digging it’ or ‘part of it’ or not. Which amounts to some form of, to put it politely, political movement.
Anyway, Peter Norvig is the exception. This of course is an opinion, and not even a subtle one. Peter Norvig, among other things, heads research at Google. To keep it short: the guy knows stuff. To avoid opinions without facts (since I have put myself in that corner now), and mindless idolization, here are three ways to actually verify this fact and learn quite a lot at the same time:
- Check out Peter Norvig’s ‘Design of Computer Programs’ at Udacity
- Try out his 21-line spelling corrector at http://norvig.com/spell-correct.html
- Read some of Artificial Intelligence: A Modern Approach
Part of the Great American Coding Book.
No, no, not that elephant. Or, well maybe.
Some big data opportunities are obvious. A lot are internet-related. The elephant in the room that is not seen (o.k., normally it is only not mentioned) is enterprise configuration. Standard ERP and relational database technologies often bake interpretations of the data into different contexts. This makes changing enterprise systems very hard. If the process model changes, these interpretations stop making sense. Read my lips:
Data interpretation makes concrete out of software. Anonymous (well not now)
Enterprises are gearing up to store only events, and to continuously generate views that match the current processes. Think Lambda architecture, think CQRS. Once interpretation and its naughty cousin locking have left the building, computers become the amazing data-processing machines we imagined them to be. Therefore:
Only continuous data re-interpretation allows for flexible enterprise configuration. Anonymous (well not now)
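The store-events, regenerate-views idea fits in a few lines. A toy sketch, with invented event types and contract numbers: the log is never interpreted at write time, and a new process model just means writing a new fold over the same log.

```python
# Event sourcing in miniature: an immutable log plus a replayable view.
events = [
    {"type": "ContractSigned",    "contract": "C1", "amount": 100},
    {"type": "AddressChanged",    "contract": "C1", "address": "Main St 1"},
    {"type": "ContractSigned",    "contract": "C2", "amount": 250},
    {"type": "ContractCancelled", "contract": "C1"},
]

def active_contracts_view(log):
    """One possible read model; a changed process means a new fold, not a migration."""
    view = {}
    for e in log:
        if e["type"] == "ContractSigned":
            view[e["contract"]] = {"amount": e["amount"], "active": True}
        elif e["type"] == "ContractCancelled":
            view[e["contract"]]["active"] = False
    return {k: v for k, v in view.items() if v["active"]}

print(active_contracts_view(events))  # {'C2': {'amount': 250, 'active': True}}
```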