I have actually used the MapReduce programming paradigm before (check out my GitHub repo), so I already had a rough idea of how MR works. But since I am so passionate about programming paradigms, I decided to dig into the details of how MR works, what problems it solves, and how it solves them.
My point of view regarding the workflow of MR
How do we determine which of the R partitions a key goes to?
Users can define a partitioning function to customize how keys are assigned to the R partitions, e.g. hash(key) mod R, where the key is the intermediate key generated by the map phase. Sometimes users want keys matching a specific pattern to land in the same partition, and a custom partitioning function makes that possible, as in the sketch below.
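Here is a minimal sketch in Python (the function names are mine, not the paper's): the default partitioner just hashes the intermediate key, while a custom one can route related keys together, e.g. the paper's own example of sending all URLs from the same host to the same partition.

```python
from urllib.parse import urlparse

R = 4  # number of reduce tasks / output partitions

def default_partition(key, R):
    # Default scheme from the paper: hash(key) mod R.
    # A real system would use a stable hash, not Python's
    # per-process-salted hash().
    return hash(key) % R

def hostname_partition(url_key, R):
    # The paper's custom example: hash(Hostname(urlkey)) mod R,
    # so all URLs from the same host land in one partition.
    return hash(urlparse(url_key).netloc) % R

# Both pages end up in the same partition, so a single reduce
# task sees every URL for example.com together.
assert hostname_partition("http://example.com/a", R) == \
       hostname_partition("http://example.com/b", R)
```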
Within each partition, the key/value pairs generated from the map phase are sorted in increasing key order, so that all occurrences of the same key are grouped together for the reduce function.
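A rough sketch of what that buys the reduce worker, assuming the intermediate pairs for one partition fit in memory (the paper falls back to an external sort when they do not):

```python
from itertools import groupby

# Intermediate pairs from the map phase, for one partition.
pairs = [("the", 1), ("a", 1), ("the", 1), ("map", 1)]

# Sort by key, then group, so each key's values arrive
# together at the user's reduce function.
pairs.sort(key=lambda kv: kv[0])
for key, group in groupby(pairs, key=lambda kv: kv[0]):
    values = [v for _, v in group]
    print(key, values)   # e.g. the [1, 1]
```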
How to achieve fault tolerance?
- How to deal with master failure, since the master is so important?
The master writes periodic checkpoints of its data structures so that if it dies, a new master copy can recover from the last checkpoint.
Below is quoted from the MapReduce paper:
It is easy to make the master write periodic checkpoints of the master data structures described above. If the master task dies, a new copy can be started from the last checkpointed state. However, given that there is only a single master, its failure is unlikely; therefore our current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.
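A toy sketch of that checkpointing idea (my own illustration, not the paper's code; the path and state shape are hypothetical): periodically serialize the master's task-state tables to durable storage, and have a new master copy load the last checkpoint on startup.

```python
import pickle

CHECKPOINT = "/tmp/mr-master.ckpt"  # hypothetical location

def write_checkpoint(master_state):
    # Periodically persist the master's data structures:
    # task states, worker assignments, intermediate file locations.
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(master_state, f)

def recover():
    # A new master copy starts from the last checkpointed state.
    with open(CHECKPOINT, "rb") as f:
        return pickle.load(f)
```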
- What if a worker fails? Quoted from the MapReduce paper:
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling. Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system. When a map task is executed first by worker A and then later executed by worker B (because A failed), all workers executing reduce tasks are notified of the re-execution. Any reduce task that has not already read the data from worker A will read the data from worker B.
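In sketch form, the master's re-scheduling rules above might look like this (my own illustration; the types and names are not from the paper):

```python
from dataclasses import dataclass

@dataclass
class Task:
    kind: str    # "map" or "reduce"
    state: str   # "idle", "in_progress", or "completed"

def on_worker_failure(tasks_on_worker):
    # Reset tasks per the paper's rules when a worker stops
    # responding to the master's pings.
    for task in tasks_on_worker:
        if task.kind == "map":
            # Map output lives on the failed worker's local disk,
            # so even *completed* map tasks must rerun elsewhere.
            task.state = "idle"
        elif task.state == "in_progress":
            # Only in-progress reduce tasks rerun; completed reduce
            # output is already safe in the global file system.
            task.state = "idle"
    return tasks_on_worker
```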
- What if the user's map or reduce function itself fails? Not so sure yet, though the paper does describe an optional "skipping bad records" mode: if user code deterministically crashes on a particular record, the master notices repeated failures on that record and tells workers to skip it on re-execution.
Combiner function
This happens between the map phase and the reduce phase, and it is executed on the machine that runs the map task. Imagine the word count program, where <the, 1> shows up over and over. A combiner function can first merge all of those into a single <the, n> record before it is sent over the network, which reduces bandwidth. In a nutshell, it performs a partial reduce so that map output can be transferred to the reduce function more efficiently, as in the sketch below.
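A minimal word-count combiner in Python (my sketch, not the paper's code): it runs on the map worker and collapses repeated keys before anything crosses the network.

```python
from collections import Counter

def map_fn(document):
    # Map phase of word count: emit <word, 1> for every word.
    return [(word, 1) for word in document.split()]

def combine(pairs):
    # Combiner: partially reduce on the map worker, so <the, 1>
    # emitted n times becomes a single <the, n> record.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

pairs = map_fn("the cat and the hat and the bat")
print(combine(pairs))
# [('the', 3), ('cat', 1), ('and', 2), ('hat', 1), ('bat', 1)]
```

Note that this only works because word count's reduce function is associative and commutative; a combiner is an optimization, not something every MR job can use.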