Hadoop for Machine Learning Guide

Page history last edited by Kurt 15 years, 3 months ago

Hadoop (hadoop.apache.org/core/) is a tool that makes it easy to run programs on clusters.  It uses the MapReduce framework: a map step applies a computation to individual records (such as data points) in parallel across the cluster, and a reduce step combines the results of that computation.  There is a very good tutorial at hadoop.apache.org/core/docs/current/mapred_tutorial.html that covers the basics of Hadoop operation.


In order to use Hadoop, you need to either connect to a machine that has it installed or install it on your machine.  Once installed, the main executable can be run by changing to the installation directory and running




This will list all the different options for running Hadoop.  See the README in the code linked below for example usages.


Writing Hadoop Programs for ML


A large number of programs in ML look like:


1.   Initialize parameters

2.   For each data point

    2a.  Do something (compute gradient, sufficient statistics, etc)

    2b.  Combine it with the results on previous data points (add to the gradient, etc)

3.  Update parameters based on the results of step 2.

4.  Go to step 2.


Step 2 usually dominates the compute time, and it is easy to parallelize because each data point's contribution can be computed independently of the others.  This is where Hadoop comes in.  It distributes your data and the per-data-point computation across the cluster, allowing you to compute and sum gradients in parallel.
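The loop above can be sketched on a single machine in plain Python, with step 2a written as a map function and step 2b as a reduce function.  This is only an illustration under stated assumptions, not the code from the demo: the function names, the learning rate, and the use of averaged gradient ascent on the logistic regression log-likelihood are all choices made for this sketch.  On a cluster, Hadoop would run the map calls on different machines.

```python
import math
from functools import reduce


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def map_gradient(weights, point):
    """Step 2a: log-likelihood gradient contributed by one (features, label) pair."""
    features, label = point
    prediction = sigmoid(sum(w * x for w, x in zip(weights, features)))
    return [(label - prediction) * x for x in features]


def reduce_sum(a, b):
    """Step 2b: combine two partial gradients by adding them element-wise."""
    return [x + y for x, y in zip(a, b)]


def train(data, dim, steps=200, learning_rate=0.5):
    """Hypothetical driver; steps and learning_rate are illustrative defaults."""
    weights = [0.0] * dim                                    # step 1
    for _ in range(steps):                                   # step 4
        partials = [map_gradient(weights, p) for p in data]  # step 2a (parallelizable)
        gradient = reduce(reduce_sum, partials, [0.0] * dim) # step 2b
        weights = [w + learning_rate * g / len(data)         # step 3
                   for w, g in zip(weights, gradient)]
    return weights


# Toy data: the first feature is a constant bias term, labels are 0 or 1.
toy_data = [([1.0, -2.0], 0), ([1.0, -1.0], 0), ([1.0, 1.0], 1), ([1.0, 2.0], 1)]
toy_weights = train(toy_data, dim=2)
```

Because the map calls share nothing except the current weights, the list comprehension in step 2a is the only part that needs to be distributed; steps 1, 3, and 4 stay in a small driver program.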


The code presented at the ML Tea can be downloaded from hadoop_example.tar.gz.  It is a simple demonstration of how to parallelize logistic regression, and we hope you will adapt it to write your own programs.  NOTE: you need Java 1.6 to run this demo.  There is also a separate page on getting Hadoop to run on a 32-bit Mac.


You can also use Hadoop with other languages.  David Rosenberg wrote a HadoopStreaming R Library.  You can also just use the text-based streaming interface, in which your program reads records from stdin and writes results to stdout.
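As a rough sketch of what a program for the text-based streaming interface might look like, the Python script below computes logistic regression gradient components line by line.  Everything specific here is an assumption made for illustration: the input format (label, a tab, then comma-separated features), the hard-coded WEIGHTS, and the "map"/"reduce" command-line switch are hypothetical and are not part of the demo code.

```python
import math
import sys
from itertools import groupby

# Hypothetical current parameters. A real job would ship these to every
# mapper (for example in a side file) instead of hard-coding them.
WEIGHTS = [0.0, 0.0]


def mapper(lines, weights=WEIGHTS):
    """For each data point, emit one 'feature_index<TAB>partial_gradient' line per feature."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        label_text, feature_text = line.split("\t")
        label = float(label_text)
        features = [float(v) for v in feature_text.split(",")]
        score = sum(w * x for w, x in zip(weights, features))
        prediction = 1.0 / (1.0 + math.exp(-score))
        for index, x in enumerate(features):
            yield "%d\t%r" % (index, (label - prediction) * x)


def reducer(lines):
    """Sum the partial gradients for each key; lines must arrive grouped by key."""
    pairs = (line.strip().split("\t") for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%r" % (key, sum(float(value) for _, value in group))


# Run as a stdin/stdout filter only when asked to explicitly.
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = mapper if sys.argv[1] == "map" else reducer
    for output_line in step(sys.stdin):
        print(output_line)
```

If the script were saved under a hypothetical name such as lr_stream.py, the whole pipeline could be imitated locally with `cat data.tsv | python lr_stream.py map | sort | python lr_stream.py reduce`, where `sort` stands in for the grouping by key that happens between the map and reduce phases.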


Where to Run Hadoop?


Once you have a working Hadoop program that you have tested on your own machine, where can you run it?  If you have access to a cluster, it is fairly easy to set up Hadoop.  Check out the cluster setup guide from Hadoop for more information.


If you do not have access to a cluster, Amazon's EC2 may be the way to go.  EC2 lets you rent many machines relatively cheaply.  You pay only for the machine time you use, so running one machine for 1000 hours costs about as much as running 1000 machines for 1 hour.  This is a great way to get started with Hadoop.


More Questions?


In addition to the tutorial available from Hadoop, there are also several great tutorial videos available from Cloudera.  If you have checked all that and still have more questions, feel free to get in touch with us.


Percy <pliang AT cs> or Kurt <tadayuki AT cs>


