What is Hadoop - Overview


Overview of Hadoop : Hadoop MapReduce is a software framework for writing applications that process vast amounts of data (multi-petabyte datasets) in parallel on large clusters consisting of thousands of nodes of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input dataset into independent chunks which are processed by the map tasks in a parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a distributed filesystem.

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (HDFS) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

0 comments:

Post a Comment