Coding with attitude

There are not a lot of idols in IT. (I’m not referring to people coding in front of a panel of judges, by the way.) That makes it hard to know how to move up: which way to go? Who should one emulate? Or is newer always better?

There is a lot of talk about ‘good code’, ‘quality’ and craftsmanship. These words are general and fuzzy and result in a lot of mutual back-patting. They should add up to ‘coding, motherf*cker’ (I’m slightly paraphrasing Zed A. Shaw here) but can end up in some discussion-free system where one is either ‘digging it’, ‘part of it’, or not. Which amounts to some form of, to put it politely, political movement.

Anyway, Peter Norvig is the exception. This is of course an opinion, and not even a subtle one. Peter Norvig, among other things, heads research at Google. To keep it short: the guy knows stuff. To avoid opinions without facts (since I have just painted myself into that corner), and mindless idolization, here are three ways to actually verify this claim and learn quite a lot at the same time:

  • Check out Peter Norvig’s ‘Design of Computer Programs’ at Udacity
  • Try out his 21-line spelling corrector at http://norvig.com/spell-correct.html
  • Read some of Artificial Intelligence: A Modern Approach

Part of the Great American Coding Book.

Elephant in the room

No, no, not that elephant. Or, well, maybe.

Some big data opportunities are obvious. A lot are internet-related. The elephant in the room that is not seen (OK, normally it is just not mentioned) is enterprise configuration. Standard ERP and relational database technologies bake interpretations of the data into specific contexts. This makes changing enterprise systems very hard: if the process model changes, these interpretations stop making sense. Read my lips:

Data interpretation turns software into concrete. Anonymous (well, not anymore)

Enterprises are gearing up to store only events, and to continuously generate views that match the current processes. Think the Lambda Architecture, think CQRS. Once interpretation and its naughty cousin locking have left the building, computers become the amazing data processing machines we imagined them to be. Therefore:

Only continuous data re-interpretation allows for flexible enterprise configuration. Anonymous (well, not anymore)
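To make that a bit more tangible, here is a toy sketch in R (my own illustration, nothing enterprise-grade): events are stored as immutable facts, and any number of views can be derived from them and re-derived when the process model changes.

# A tiny event store: immutable account events, no interpretation at write time
events <- data.frame(
  account = c("A", "A", "B"),
  amount  = c(100, -30, 250)
)

# View 1: current balance per account
balances <- aggregate(amount ~ account, data = events, FUN = sum)

# View 2, after a process change: only deposits count
deposits <- aggregate(amount ~ account, data = subset(events, amount > 0), FUN = sum)

Both views come from the same untouched events; when the rules change, you simply generate a new view.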

Back to the Fortran

A lot of numerical programs are still written in Fortran. When I programmed Fortran 18 years ago, it felt too much like Basic. Soon I switched to C and later to C++, only to forget all about that again later. Thinking about it now: Fortran manages to get the performance of C with the simplicity of Basic. That is not a bad deal.

I have wanted to do some Fortran or C programming for R for quite a while. A simple start is given on this page: http://www.r-bloggers.com/fortran-and-r-speed-things-up/ Here I want to add a little Fortran help to it.

Diving into C/C++ or Fortran can be a bit awkward. A fast way to start turns out to be NetBeans. NetBeans supports C/C++ and Fortran; there even is a special C/C++ bundle at netbeans.org, which uses make to drive the build process. Download and install NetBeans C/C++ and install gfortran. Choose New Project -> C/C++ Dynamic Library, name it ‘facto’ and choose Fortran as the language in the next step. Paste the source from the article mentioned above as facto.f in the source files section. Switch to ‘Release’ in the toolbar and right-click the ‘facto’ project node to create a build. Presto, you can see the build running in the bottom right. A dynamically linked library is built.

[Screenshot: NetBeans building the ‘facto’ dynamic library]

Open RStudio, create a function that calls the library, and the factorial of 5 is …

[Screenshot: calling the Fortran routine from RStudio]
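For reference, here is a minimal sketch of such a function. I am assuming the subroutine from the linked article is called facto and takes an integer n plus an integer result argument, and that NetBeans dropped the library under the project’s dist/Release folder; the exact path and file extension depend on your platform and build configuration.

# Load the shared library built by NetBeans; adjust the path to your own build output
dyn.load("~/NetBeansProjects/facto/dist/Release/GNU-Linux-x86/libfacto.so")

# Wrap the Fortran subroutine; .Fortran passes all arguments by reference
facto <- function(n) {
  .Fortran("facto", n = as.integer(n), answer = as.integer(1))$answer
}

facto(5)   # should return 120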

Unlimited possibilities!

Darn Yarn

Nothing wrong with YARN!

YARN is a resource negotiation framework that allows Hadoop to become the ‘big data appserver’ or ‘big data Tomcat’ of the future. YARN lets applications be deployed on Hadoop while taking the resource limits of the machines in the cluster into account. What kind of resources? CPU, memory and disk, for example.

Thanks to Vagrant and VirtualBox it is easy to set up a development cluster that resembles a production cluster. Vagrant automates the configuration of virtual machines. I set up a four-machine Hadoop 2.3.0 cluster with YARN using this Vagrant recipe:

https://github.com/Cascading/vagrant-cascading-hadoop-cluster

I also found that there was some tweaking to be done.

At first, Hadoop jobs would be accepted but would not run; later, jobs would run but fail with a ‘Heap size too small’ error. This has everything to do with resource provisioning. In the first case it turned out that the jobs asked for more resources than any of the datanodes could offer. In the second case the resource settings were OK, but the JVM heap was too small to start a map task. Besides the memory for the map task itself, some extra memory for the container has to be added (about 512 MB). Get the settings wrong, and things will not move.
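To put some rough numbers on that, using the values I ended up with below: a map task gets a 1024 MB container (mapreduce.map.memory.mb), of which 768 MB is JVM heap (-Xmx768m), leaving roughly 256 MB of headroom for the rest of the process. And a node manager has to advertise enough container memory (yarn.nodemanager.resource.memory-mb) to fit at least the 1024 MB application master container, otherwise the job is accepted but never starts.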

The tweaking is as follows. Get the Vagrant setup using

git clone https://github.com/Cascading/vagrant-cascading-hadoop-cluster.git

Edit the Vagrantfile, and give the virtual machines a bit more memory and possibly an extra CPU:

  config.vm.provider :virtualbox do |vb|
    vb.customize ["modifyvm", :id, "--cpus", "2", "--memory", "3082"]
  end

In the folder modules/hadoop/code I changed yarn-site.xml and mapred-site.xml. The respective changes are marked below. First yarn-site.xml:

<configuration>
  <property>
      <name>yarn.resourcemanager.address</name>
      <value>master.local:8032</value>
  </property>
  <property>
      <name>yarn.resourcemanager.scheduler.address</name>
      <value>master.local:8030</value>
  </property>
  <property>
      <name>yarn.resourcemanager.resource-tracker.address</name>
      <value>master.local:8031</value>
  </property>
  <property>
      <name>yarn.resourcemanager.admin.address</name>
      <value>master.local:8033</value>
  </property>
  <property>
      <name>yarn.acl.enable</name>
      <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
    <description>shuffle service that needs to be set for Map Reduce to run </description>
  </property>
  <property>
      <name>yarn.web-proxy.address</name>
      <value>master.local:8100</value>
  </property>

  <!-- Changes added below this line -->

  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
    <description>Minimum limit of memory to allocate to each container request at the Resource Manager.</description>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>1548</value>
    <description>Maximum limit of memory to allocate to each container request at the Resource Manager.</description>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
    <description>The minimum allocation for every container request at the RM, in terms of virtual CPU cores. Requests lower than this won't take effect, and the specified value will get allocated the minimum.</description>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>2</value>
    <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>3096</value>
    <description>Physical memory, in MB, to be made available to running containers</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>2</value>
    <description>Number of CPU cores that can be allocated for containers.</description>
  </property>
</configuration>
And mapred-site.xml:

<configuration>
 <property>
  <name>mapred.job.tracker</name>
  <value>master.local:9001</value>
  <description>The host and port that the MapReduce job tracker runs at.</description>
 </property>
 <property>
     <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
 </property>
 <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
 </property>
 <property>
    <name>mapreduce.jobhistory.address</name>
  <value>master.local:10020</value>
 </property>
 <property>
    <name>mapreduce.jobhistory.webapp.address</name>
  <value>master.local:19888</value>
 </property>
 <property>
     <name>mapreduce.framework.name</name>
     <value>yarn</value>
 </property>

  <!-- Changes added below this line -->

    <property>
        <name>yarn.app.mapreduce.am.resource.mb</name>
        <value>1024</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.command-opts</name>
        <value>-Xmx768m</value>
    </property>
    <property>
        <name>mapreduce.map.cpu.vcores</name>
        <value>1</value>
        <description>The number of virtual cores required for each map task.</description>
    </property>
    <property>
        <name>mapreduce.reduce.cpu.vcores</name>
        <value>1</value>
        <description>The number of virtual cores required for each reduce task.</description>
    </property>
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>1024</value>
        <description>Larger resource limit for maps.</description>
    </property>
    <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx768m</value>
        <description>Heap-size for child jvms of maps.</description>
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>1024</value>
        <description>Larger resource limit for reduces.</description>
    </property>
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx768m</value>
        <description>Heap-size for child jvms of reduces.</description>
    </property>
</configuration>
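With these edits in place, vagrant up from the repository root should bring the four machines up and provision them with the new settings; see the repository’s README for the remaining steps, such as formatting HDFS and starting the Hadoop services.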

That should do the trick. If you want to test the installation, go to http://docs.cascading.org/lingual/1.1/ to see some test cases with Cascading and Lingual.

Shouldn’t have, but glad I did

So sometimes you know that you shouldn’t do stuff. Like buy a big PC. I mean, that is totally not online and not virtual. So why did I do it? Well, I usually work on a 13″ PowerBook. Good for a lot of stuff, but not if you want to run VMs. Apple brought out a new Mac Pro, which is kind of cool. But then I would somewhat miss Linux, or more precisely the Debian package system. Installing binaries on a Mac is not always that smooth. MacPorts, Homebrew, by hand: sometimes there are chains of dependencies that require a lot of work. So what to do?

I say: bring the beast. And for those who do not know, that is: bring the HP Z800.

Of course it looks and sounds like Darth Vader. But then tool-less. This workhorse is the minion of many a quant or video editor. With a bit of haggling, this is what I got for the price of a big iPad.

[Photo: the HP Z800 workstation]

Yes, you are right: there is no spaghetti there (I will post on that later). Would I not rather have the iPad? Nah, I have real needs that need satisfying. To cut a long story short: I can now run a four-VM Hadoop cluster with no sweat. To be honest, it is a bit disappointing. If you have a Lamborghini, you want to hear the engine roar. Starting up a cluster of 4 nodes: not a glitch. Nothing, just some spaghetti if you ask for it. Plenty of room for Eclipse, NetBeans, databases, Chrome, whatever.

I am going to say it: 24 cores will always be enough. Well, for development that is.

The cola and ham thing

This cooking post is a bit obscure. There is a recipe for cooking a ham in cola and then grilling it in the oven. I think it is from Nigella Lawson. Now, who would do such a thing to a nice ham? Meat is sort of ‘sacred’: an animal died for it. And then you mix it with cola, the essence of, well, the industrial? Yesterday I realised how this might have come about.

In the UK, glazed ham is made by cooking down apple vinegar and adding honey, Worcestershire sauce and some thyme butter. About half a cup of apple vinegar is cooked down to a spoonful of a darker, sour substrate. Adding the honey gives the Western variety of sweet and sour. Well, there you have it. Cola is known for being very sweet and very sour (it polishes copper coins very well).

Unfortunately I ended up roasting the ham without adding a cup of water, which made the marinade burn a bit. Still, all in all it was a nice dinner. Served with baked potatoes and mushrooms in a garlic and wine sauce.

M/R, the EJB of 201x?

The current Hadoop ecosystem is very reminiscent of the early J2EE days. EJB was cumbersome: only with the stamina of a stevedore could one pack a decent application. I never managed, to be honest. Around 2003 Spring came along and alternative O/R mappers were appearing, saving Java’s ass in a big way.

Of course having a choice is great; it is easy to get lost in the forest though. And sometimes I get the impression that programming is like being a boy scout: finding trails in a dark forest of JARs. The downside of this is that as a practitioner one can very much be sucked into the darkness instead of focusing on what your client’s business is about.

M/R seems to be the EJB of 201x. The difference, and the good news, of course being that M/R allows for abstractions: Pig and Hive, to name a few. Two days ago I saw a presentation on Cascading at a pre-Hadoop Summit BOF. Cascading is an impressive piece of work: it abstracts out M/R jobs completely, allowing Hadoop to be used as a means, not an end.

The framework is focused on creating functional data flows. It also allows you to connect R to HDFS using a JDBC driver (Lingual). I’m going to check that out.
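As a teaser, the connection from R would look roughly like this with the RJDBC package. This is a sketch under assumptions: the driver class, jar name, JDBC URL and table name below are placeholders based on my reading of the Lingual docs, so check http://docs.cascading.org/lingual/1.1/ for the exact values.

library(RJDBC)

# Assumed Lingual JDBC driver class and jar; verify both against the Lingual documentation
drv  <- JDBC("cascading.lingual.jdbc.Driver", "lingual-jdbc-1.1.0-jdbc.jar")

# Assumed JDBC URL for a Lingual connection to the Hadoop cluster
conn <- dbConnect(drv, "jdbc:lingual:hadoop")

# Query a (hypothetical) table that Lingual maps onto files in HDFS
head(dbGetQuery(conn, "SELECT * FROM example.employees"))

dbDisconnect(conn)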