Hadoop for Machine Learning Guide

Saved by Kurt
on March 13, 2009 at 11:40:43 am
 

Hadoop (hadoop.apache.org/core/) is a tool that makes it easy to run programs on clusters.  It uses the MapReduce framework: a map step distributes computation over individual records (such as data points) across the cluster, and a reduce step combines the results of that computation.  There is a tutorial at hadoop.apache.org/core/docs/current/mapred_tutorial.html that covers the basics of Hadoop operation.
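As a minimal illustration of the map/reduce pattern (simulated in plain Python rather than actual Hadoop; the records and the map and reduce functions here are invented for the example):

```python
from functools import reduce

# Simulated map-reduce over a list of records (no cluster involved):
# the map step transforms each record independently, and the reduce
# step combines the per-record results pairwise.
records = [1, 2, 3, 4]

def map_step(record):
    # Per-record computation (here: square each data point).
    return record * record

def reduce_step(a, b):
    # Combine two partial results (here: sum them).
    return a + b

mapped = [map_step(r) for r in records]   # map phase
result = reduce(reduce_step, mapped)      # reduce phase: 1 + 4 + 9 + 16 = 30
```

On a real cluster, Hadoop runs the map step on the machines holding the data and ships only the partial results to the reducers.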

 

In order to use Hadoop, you need to either install it on your machine or connect to a machine that has it installed.  Once installed, the main executable can be run by changing to the installation directory and running

 

bin/hadoop

Writing Hadoop Programs for ML

 

A large number of ML programs look like this:

 

1.   Initialize parameters

2.   For each data point

    2a.  Do something (compute gradient, sufficient statistics, etc)

    2b.  Combine it with the results on previous data points (add to the gradient, etc)

3.  Update parameters based on the results of step 2.

4.  Goto step 2.
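The steps above can be sketched as serial gradient descent on a toy 1-D least-squares problem.  The data, learning rate, and iteration count below are illustrative assumptions, not part of the original guide:

```python
# Fit y ~= w*x by gradient descent on sum((w*x - y)^2).
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]  # (x, y) pairs, y roughly 2x

w = 0.0                       # 1. initialize parameters
lr = 0.01                     # learning rate (illustrative)
for _ in range(200):          # 4. goto 2 (fixed iteration count here)
    grad = 0.0
    for x, y in data:         # 2. for each data point
        g = 2 * (w * x - y) * x   # 2a. per-point gradient
        grad += g                 # 2b. combine with previous points
    w -= lr * grad            # 3. update parameters
```

After the loop, w is close to the least-squares solution (about 1.99 for this data).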

 

Step 2 usually takes the most compute time and can be easily parallelized.  This is where Hadoop comes in: it distributes your data and the per-point computation across the cluster, allowing you to compute and sum gradients in parallel.
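Step 2 in map-reduce form might look like the sketch below: the data is split into shards (as Hadoop would split it across machines), a partial gradient is computed independently per shard (the map step), and the partials are summed (the reduce step).  The shards, data, and parameter value are illustrative assumptions:

```python
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 1.5  # current parameter value

def shard_gradient(shard):
    # Map: partial gradient of sum((w*x - y)^2) over one shard of data.
    return sum(2 * (w * x - y) * x for x, y in shard)

shards = [data[:2], data[2:]]                   # Hadoop does this split for you
partials = [shard_gradient(s) for s in shards]  # map: runs in parallel on a cluster
grad = sum(partials)                            # reduce: combine partial sums
```

The result equals the serial gradient over the full dataset; only the work of computing it has been split up.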

 

 

 

This guide will be updated to contain the code presented at the ML Tea and more details about how to use Hadoop.

 
