Hadoop (hadoop.apache.org/core/) is a tool that makes it easy to run programs on clusters. It uses the Map-Reduce framework: it distributes the computation over individual records (such as data points) over a cluster and then allows the results of that computation to be combined in a reduce step. There is a tutorial at hadoop.apache.org/core/docs/current/mapred_tutorial.html that goes over the basics of Hadoop operation.
In order to use Hadoop, you need to either install it on your machine or connect to a machine that has it installed. Once installed, the main executable can be run by changing to the installation directory and running
bin/hadoop
Writing Hadoop Programs for ML
A large number of programs in ML look like:
1. Initialize parameters
2. For each data point
2a. Do something (compute gradient, sufficient statistics, etc)
2b. Combine it with the results on previous data points (add to the gradient, etc)
3. Update parameters based on the computation of 2.
4. Goto 2.
Comments (0)
You don't have permission to comment on this page.