Mmmm, there seems to be a short-cut to valuable insights. A lot faster than deep-learning, and potentially very powerful. Its origins lie not in cognitive biology and neural nets, but rather in politics.
The short cut is called “alternative facts”: https://www.theguardian.com/us-news/2017/jan/22/donald-trump-kellyanne-conway-inauguration-alternative-facts
Making facts great again? Better not post too much on this competing practice.
Back pressure is often mentioned in the context of data streams (TCP for example.) Back pressure refers to a mechanism where a data receiving component is signalling that the incoming data stream should slow down, or speed up. Apache Flink for example handles back pressure well. But that is not what this post is about. It is about adding extra loss functions to neural nets, and how that can improve overall performance.
Consider the picture at the end of the post. This picture is part of a Kaggle competition. (Note: I heavily cropped and blurred the image to FUBAR it and avoid any rights or recognition issues.) The goal of the competition is to classify the fish that is being caught using machine learning.
Here is the remarkable thing. One could straight-forward classify the picture as is using deep-learning. The net would figure out that the fish is the relevant part, and it would then figure out what type of fish it is, ie. classify it. Alternatively one could first extract the fish from the picture, and then classify the extracted hopefully fishy image segment using a separate neural net.
Here it comes. Adding extra bounding box parameters to the initial fish classification net actually improves the classification rate. Yes, a net that only classifies fish, classifies less accurate than a net that classifies a fish and learns the bounding box.
By providing richer feedback about the subject, the internal structure of the net seems to become better. For lack of a better word, I call it back-pressure. Counter intuitive. More is less; error that is.
A long long time ago in deep learning time and about two years ago in human years I created a convolutional neural net for the Kaggle National Data Science bowl. In a brave efffort I ported a convolitional net for MNIST to the plankton dataset using python and Theano. More or less a mano in Theano. It worked, I ended up somewhere halfway down the field.
I remember being somewhat proud an somewhat confused. I spent a lot of time learning deep learning concepts, and a lot of time coding in Theano. There was not really time or grit left to improve the working model. Deep learning proved a lot of work.
Flash forward to today. Within a few hours I am running a better convolutional net on the same dataset. Nothing fancy yet, but it works; and I have good hopes because of VGG16. VGG16 is a one time winning model for the ImageNet competition. As it turns out one can create a Franken-VGG16 by chopping of layers and retraining parts of the model specifically for the the plankton dataset. Ergo, the feature learning based on approximately 1.5M images is reused. Just like word embeddings are available for download, feature filters will become available for all types of datasets. Progress.
I added one of the plankton images as an example. The contest is aimed at identifying plankton species to assess the bio diversity in oceans.
As mentioned before, Keras is just kick-ass for creating neural nets. The Keras API allows for the quick creation of deep neural nets and just get on with it. Very impressive. Also many thanks to Jeremy Howard, for democratizing deep learning.
Credit for the title of this post can be taken by dancinghands.com. I’m learning how to play the bongo, and rolling down the hill describes one of the many Latin rhythms. It also applies to data science.
In the quest for real-time processes, some unexpected help came from an enterprise architecture model. These days everything is agile, and architecture is seen as the work of charlatans. I do agree that architecture can go terribly wrong. At the same time though, not all quality stems from weekly iterations. The model I am referring to is depicted below. It gave me insight to what I saw happening.
Once you decide that real-time processes are important (also look at how real-time is real-time), requirements start to trickle down. It is likely that the information architecture needs adjusting in the direction of events and micro-services (more on this later). This in turn will impact the systems and component designs. Real-time systems require high uptime, so application management and infrastructure is likely to change to. This is in essence what I have seen happening in the last year. Rolling down the hill.
Summarizing, the following qualities are required in the layers:
- Business processes – real-time or as fast as needed (might be weekly)
- Information architecture – avoiding hotness of databases and schema changes, focus on integrating using key indexes like contract number, address, postal code etc.
- Application layer – systems are placed in highways of events, keep data streaming, state becomes localized, resilience, idem potency
- Infrastructure layer – Scalability and fault-tolerance, don’t let success become you future problem
I will leave details to your imagination.
Currently I am involved in transitioning a larger corporation to more real-time processes. One of the questions that re-occurs is the following: How real-time is real-time? Are micro-second response times really adding to the bottom line. After all, consumers have contracts for a year or so. So why do it?
I have to admit. This question confused me quite a lot for some time. Gradually the mist has disappeared and a clear answer has formed. It goes like this.
Listen batch is really great. Most algorithms can be optimized in batch mode, and put online. Works, fulfills quite some needs. But still, you are going to lose out. These batches are produced once a month. By working hard, for quite a long time, you could probably run the batches every three of even two weeks. But here it comes.
The cost and complexity of speeding up batching is just going to be really high. Basically you are going to make data move more and faster between databases. It will break. For every company, there is a breaking point for batch. Beyond this breaking point the only paradigm that is going to save you is real-time. It might be that your algorithms only need updating once week. Beyond your breaking point, once a week is real-time for you. Simple.
Did I mention that a lot of batch type of algorithms find it really hard to model sequences through time? And that deep-learning allows to combine convolutional and LSTM networks? Trust me, the future is streaming; real-time.
Well I guess there is a first time for everything. So I confess: I created my first LSTM using Keras. Yay!
The coolness of Keras keeps on amazing me. Creating embeddings is easy. Performance seems unreasonable. I have checked some trial code several times, used a test set; I may just have to admit that it is just darn awesome.
Two of the best posts on LSTMs:
In new parlance: I have got to be honest with you, it is unreasonably effective, I have got to be honest with you.
So yes, absolutely. Of course it depends on your needs.
Word2vec is known for creating word embeddings using text sentences. Words are represented as vectors in an n-dimensional space. More special, vectors in this space can be added and substracted, the famous:
king – man + woman = queen
Edit: Pinterest applied the word2vec model to the pins in a session: pin2vec. Using this technique similar pins could be identified. More information can be found here: https://engineering.pinterest.com/blog/applying-deep-learning-related-pins (It is categorized under deep-learning: this seems discutable.)
It turns out that web visits can also be seen as sentences, and each URL as a word. Applying word2vec on web-visits this way, and then using t-sne for plotting shows that indeed similar URL’s are clustered near each other (sorry, no picture disclosure). URL2vec? It is like Uber, but then for …
Less intuitive though is the substraction and addition of URL vectors. Substracting one URL from another gives …. Well I will have to find out later. For now, the word2vec vector could act as a condensed input for a neural network instead of a large bag of URL’s.
PCA is a common way to reduce numerical features. Factor analysis, Lisrel anyone?, is a related method involving a latent factor model and fit measures to confirm a certain factor structure: it uses the covariance matrix as input. Although it is possible to compute a covariance matrix on binary features using tetrachoric correlations, I was always warned against using factor analysis on binary features.
Modelling categorical features as little subnets is already a step ahead for blatantly dummy-fying categorical variables. RBM’s might offer a way to detect patterns in categorical feature vectors. RBM’s basically extract patterns from binary vectors, compressing the vector to a lower dimension. This is sort of no surprise, RBM’s are the basic building blocks of deep belief networks, or how to discover patterns of patterns of … all the way down. Still nice I nailed that one.
RBM are deemed a bit old fashioned and surpassed by more modern deep learning approaches. Still it could give a good insight into the order of magnitude of things. There are a lot of hyper-parameters in neural nets, any guidance it welcome AFAIK.
Before switching to LSTM models, I decided to first brush up my general NN skills. Neural nets are … different. A lot of the exciting advancements in image and speech recognition are fueled by NN’s. If you have ever worked with neural nets, you probably know they can be a pain. Lately I have been getting some new inspiration though, thanks to the capable guys of Scyfer.
Summing up: neural nets really start to shine with larger and more complex datasets. The key factor being the ability to shape the net to resemble the data. On the contrary: with a small amount of features, XGBoost is generally the beast to defeat. But what if you would like to model time series, or have a lot of categorical data? I have spent a lot of time coming up with time related features. Examples being: the number of positives in the last week, the last 3 months, and the last year. A lot of manual guessing. Or take an overload of categorical features. Just expand these using dummies, right?
Diving into the Kaggle Allstate competition, the expansion of the categorical variables (no description what so ever by the way), got to around 800 extra features. Standard dummy-fying by the way gives two separate features for every binary feature. (In normal regression this would give a lot of indefiniteness.) Building the model amounted to riding a bull like likelihood space. I was not building models, I was trying to crank out grid searches at more than 1 hour per model. No fun, it takes a lot of staring into the sunset.
So where is the victory? Well here it comes. Rethinking the brutal dummy approach, I decided to model each categorical feature separately by adding a an extra layer with one sub-net per categorical variable. Orange in the picture below.
On top of the blue one there are two continuous features. Use three levels of a categorical variable, a synthetic continuous variable is created now. Instead of having a first layer of five nodes, with very diffuse concepts, there is now a ‘first’ layer with three clear concepts. Additional layers require far less parameters. The first model try beat most of the other models in 15s: optimization was clean and crisp. Keras (a tip from Scyfer) really starts to shine here. (See source examples: https://github.com/spdrnl/keras-categorical)
Smooth surfing of the loss space. Victory. Not big, but a nice move up from buckshot XGBoost.
Some LSTM like questions have sparked my neural network curiosity again. Time to dust off that GPU.
Quite a disappointment: my 1.5 years old laptop (tank style HP Zbook) cannot support tensorflow with CUDA 8.0, the CUDA compute capability is 3.0 and it requires 3.5.
HP-Z800 to the rescue! This machine was quite loud, but I managed to install some water cooling, making it sound less like a server. Compiling tensorflow sucked the live out of this dual cpu machine. O.k., it is only twice six cores, but still: 1042s.
Instructions can be found at: https://alliseesolutions.wordpress.com/2016/09/08/install-gpu-tensorflow-from-sources-w-ubuntu-16-04-and-cuda-8-0-rc/
A restart fixed some scary kernel messages, and presto we have flow: