Tuesday, March 1, 2011

Parallelizing the "Reduce" in "MapReduce"

I understand how Map is easily parallelizable - each computer/CPU can just operate on a small portion of the array.

Is Reduce/foldl parallelizable? It seems like each computation depends on the previous one. Is it just parallelizable for certain types of functions?

From stackoverflow
  • Not sure what platform/language you're thinking of, but you can parallelize reduce operators like this:

    // Original
    result = null;
    foreach(item in map) {
        result += item;
    }
    
    // Parallel
    resultArray = array();
    mapParts = map.split(numThreads);
    foreach(thread) {
        result = null;
        foreach(item in mapParts[thread]) {
            result += item;
        }
        resultArray += result;    // Lock this!
    }
    waitForThreads();
    reduce(resultArray);
    

    As you can see, a parallel implementation is easily recursive. You split the map up, operate on each part in its own thread, then perform another reduce once those threads are done to bring the pieces together.

    (This is the programmatic reasoning behind Piotr Lesnick's answer.)

  • If your reduction underlying operation is associative*, you can play with the order of operations and locality. Therefore you often have a tree-like structure in the 'gather' phase, so you can do it in several passes in logarithmic time:

    a  +  b  +  c  +  d
     \   /       \   /
     (a+b)       (c+d)
         \       /
       ((a+b)+(c+d))
    

    instead of (((a+b)+c)+d)

    If your operation is commutative, further optimization are possible as you can gather in different order (it may be important for data alignment when those operations are vector operations for example)

    [*] your real desired mathematical operations, not those on effective types like floats of course.

    Patrick McElhaney : Do you mean "associative" rather than "commutative"?
    Piotr Lesnicki : You're right, thanks, I meant associative, corrected! But in fact it also helps if the operation is commutative, so that you can gather your chunks in any order (you do that for data alignment issues for example)
  • Yes, if the operator is associative. For example, you can parallelise summing a list of numbers:

    step 1: 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8
    step 2:   3   +   7   +   11  +   15
    step 3:       10      +       26
    step 4:               36
    

    This works because (a+b)+c = a+(b+c), i.e. the order in which the additions are performed doesn't matter.

  • Check out the combine phase in Hadoop

    http://wiki.apache.org/hadoop/HadoopMapReduce

  • It depends on your Reduce step. In a Hadoop-style implementation of MapReduce, your Reducer is getting called once per key, with all the rows relevant to that key.

    So, for example, your Mapper might be taking in a lot of unordered web server logs, adding some metadata (e.g., geocoding), and emitting [key, record] pairs with a cookie ID as the key. Your Reducer would then be called once per cookie ID and would be fed all the data for that cookie, and could compute aggregate info such as visit frequency or average pages viewed per visit. Or you could key on geocode data, and gather aggregate stats based on geography.

    Even if you're not doing per-key aggregate analysis - indeed, even if you're computing something over the whole set - it might be possible to break your computation into chunks, each of which could be fed to a Reducer.

  • Technically a reduce is not the same as a foldl (fold-left) which can also be described as an accumulate.

    The example given by Jules illustrates a reduce operation very well:

    step 1: 1 + 2 + 3 + 4 
    step 2:   3   +   7   
    step 3:       10      
    

    Note that at each step the result is an array, including the final result which is an array of one item.

    A fold-left is like the following:

    step 0: a = 0
    step 1: a = a + 1 
    step 2: a = a + 2 
    step 3: a = a + 3
    step 4: a = a + 4
    step 5: a
    

    Now obviously these both produce the same results, but a foldl has a well-defined result when given a non-associative operator (like subtraction) whereas a reduce operator doesn't.

    Frank Shearar : Subtraction's non-associative but it is _left_ associative (because 5 - 3 - 2 yields the same result as (5 - 3) - 2). But what happens if you give foldl a right-associative operator, or foldr a left-associative one, I wonder?

0 comments:

Post a Comment