Darn Yarn

Nothing wrong with YARN!

YARN is a resource negotiation framework that allows Hadoop to become the ‘big data appserver’ or ‘big data Tomcat’ of the future. YARN allows applications to be deployed on Hadoop while taking into account the resource limits of the machines in the cluster. What kind of resources? CPU, memory, and disk, for example.
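
Once the cluster described below is running, you can ask YARN what each node has to offer. A quick way to inspect this with the standard YARN command line:

yarn node -list
yarn node -status <node-id>

The first command lists the registered nodemanagers; the second shows a node's memory and vcore capacity and current usage.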

Thanks to Vagrant and VirtualBox it is easy to set up a development cluster that resembles a production cluster. Vagrant automates the configuration of the virtual machines. I set up a four-machine Hadoop 2.3.0 cluster with YARN using this Vagrant recipe:

https://github.com/Cascading/vagrant-cascading-hadoop-cluster

I also found that there was some tweaking to be done.

Two problems turned up. At first, Hadoop jobs would be accepted but would not run; later, jobs would run but fail with a ‘Heap size too small’ error. Both have everything to do with resource provisioning. In the first case it turned out that the jobs asked for more resources than any of the datanodes could offer. In the second case the resource settings were fine, but the JVM heap was too small to start a map task: besides the memory for the map task itself, some extra memory for the container has to be added (about 512 MB). Get the settings wrong, and nothing will move.
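
To make the numbers concrete, here is how the values configured below fit together (a sketch using this post's settings):

# per-task container request (mapred-site.xml)
#   mapreduce.map.memory.mb = 1024      container size asked from YARN
#   mapreduce.map.java.opts = -Xmx768m  JVM heap inside that container
# the request only gets scheduled if it fits the limits (yarn-site.xml)
#   yarn.scheduler.maximum-allocation-mb = 1548  1024 <= 1548, so the request is accepted
#   yarn.nodemanager.resource.memory-mb  = 3096  each node can hold up to three such containers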

The tweaking is as follows. Get the Vagrant set-up using

git clone https://github.com/Cascading/vagrant-cascading-hadoop-cluster.git

Edit the Vagrantfile and give the virtual machines a bit more memory, and possibly an extra CPU:

  config.vm.provider :virtualbox do |vb|
    vb.customize ["modifyvm", :id, "--cpus", "2", "--memory", "3082"]
  end
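
If the virtual machines are already running, vagrant reload restarts them so the new CPU and memory settings take effect.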

In the folder modules/hadoop/code I changed yarn-site.xml and mapred-site.xml. Find the respective changes marked below, first in yarn-site.xml:

<configuration>
  <property>
      <name>yarn.resourcemanager.address</name>
      <value>master.local:8032</value>
  </property>
  <property>
      <name>yarn.resourcemanager.scheduler.address</name>
      <value>master.local:8030</value>
  </property>
  <property>
      <name>yarn.resourcemanager.resource-tracker.address</name>
      <value>master.local:8031</value>
  </property>
  <property>
      <name>yarn.resourcemanager.admin.address</name>
      <value>master.local:8033</value>
  </property>
  <property>
      <name>yarn.acl.enable</name>
      <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
    <description>Shuffle service that needs to be set for MapReduce to run.</description>
  </property>
  <property>
      <name>yarn.web-proxy.address</name>
      <value>master.local:8100</value>
  </property>

  <!-- Changes added below this line -->

  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
    <description>Minimum limit of memory to allocate to each container request at the Resource Manager.</description>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>1548</value>
    <description>Maximum limit of memory to allocate to each container request at the Resource Manager.</description>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
    <description>The minimum allocation for every container request at the RM, in terms of virtual CPU cores. Requests lower than this won't take effect, and the specified value will get allocated the minimum.</description>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>2</value>
    <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>3096</value>
    <description>Physical memory, in MB, to be made available to running containers</description>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>2</value>
    <description>Number of CPU cores that can be allocated for containers.</description>
  </property>
</configuration>

And in mapred-site.xml:

<configuration>
 <property>
  <name>mapred.job.tracker</name>
  <value>master.local:9001</value>
  <description>The host and port that the MapReduce job tracker runs at.</description>
 </property>
 <property>
     <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
 </property>
 <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
 </property>
 <property>
    <name>mapreduce.jobhistory.address</name>
  <value>master.local:10020</value>
 </property>
 <property>
    <name>mapreduce.jobhistory.webapp.address</name>
  <value>master.local:19888</value>
 </property>
 <property>
     <name>mapreduce.framework.name</name>
     <value>yarn</value>
 </property>

  <!-- Changes added below this line -->

    <property>
        <name>yarn.app.mapreduce.am.resource.mb</name>
        <value>1024</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.command-opts</name>
        <value>-Xmx768m</value>
    </property>
    <property>
        <name>mapreduce.map.cpu.vcores</name>
        <value>1</value>
        <description>The number of virtual cores required for each map task.</description>
    </property>
    <property>
        <name>mapreduce.reduce.cpu.vcores</name>
        <value>1</value>
        <description>The number of virtual cores required for each reduce task.</description>
    </property>
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>1024</value>
        <description>Larger resource limit for maps.</description>
    </property>
    <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx768m</value>
        <description>Heap-size for child jvms of maps.</description>
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>1024</value>
        <description>Larger resource limit for reduces.</description>
    </property>
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx768m</value>
        <description>Heap-size for child jvms of reduces.</description>
    </property>
</configuration>
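
With the configuration in place, bring the cluster up and try a small job. This is a minimal sketch: see the recipe's README for the exact steps to format HDFS and start the daemons, and note that the path of the examples jar below is an assumption that depends on where Hadoop is installed.

cd vagrant-cascading-hadoop-cluster
vagrant up

vagrant ssh master
hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar pi 2 10

If the pi job completes, the containers fit the resource settings above.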

That should do the trick. If you want to test the installation further, have a look at http://docs.cascading.org/lingual/1.1/ for some test cases with Cascading and Lingual.