Data Storage and Analysis in Hadoop


The problem is simple: while the storage capacities of hard drives have increased massively
over the years, access speeds—the rate at which data can be read from drives—
have not kept up. One typical drive from 1990 could store 1,370 MB of data and had
a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around
five minutes. Over 20 years later, one terabyte drives are the norm, but the transfer
speed is around 100 MB/s, so it takes more than two and a half hours to read all the
data off the disk.
This is a long time to read all the data on a single drive—and writing is even slower. The
obvious way to reduce the time is to read from multiple disks at once. Imagine if we
had 100 drives, each holding one hundredth of the data. Working in parallel, we could
read the data in under two minutes.
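To make the arithmetic concrete (figures are approximate, treating 1 TB as 1,000,000 MB):

    1990 drive:              1,370 MB ÷ 4.4 MB/s   ≈ 310 s    ≈ 5 minutes
    1 TB drive:              1,000,000 MB ÷ 100 MB/s = 10,000 s ≈ 2.8 hours
    100 drives in parallel:  10,000 MB each ÷ 100 MB/s = 100 s  ≈ 1.7 minutes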
Only using one hundredth of a disk may seem wasteful. But we can store one hundred
datasets, each of which is one terabyte, and provide shared access to them. We can
imagine that the users of such a system would be happy to share access in return for
shorter analysis times, and, statistically, that their analysis jobs would be likely to be
spread over time, so they wouldn’t interfere with each other too much.
Being able to read and write data in parallel to or from multiple disks brings problems
of its own, though.
The first problem to solve is hardware failure: as soon as you start using many pieces
of hardware, the chance that one will fail is fairly high. A common way of avoiding data
loss is through replication: redundant copies of the data are kept by the system so that
in the event of failure, there is another copy available. This is how RAID works, for
instance, although Hadoop’s filesystem, the Hadoop Distributed Filesystem (HDFS),
takes a slightly different approach, as you shall see later.
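As a small, concrete illustration of how replication surfaces in HDFS: the number of copies kept of each block is controlled by the dfs.replication property, normally set in hdfs-site.xml. The snippet below is a minimal sketch that simply pins it at the usual default of three; it is not a complete configuration.

    <!-- hdfs-site.xml (sketch): keep three replicas of every HDFS block -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>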
The second problem is that most analysis tasks need to be able to combine the data in
some way; data read from one disk may need to be combined with the data from any
of the other 99 disks. Various distributed systems allow data to be combined from
multiple sources, but doing this correctly is notoriously challenging. MapReduce provides
a programming model that abstracts the problem from disk reads and writes,
transforming it into a computation over sets of keys and values. We will look at the
details of this model in later chapters, but the important point for the present discussion
is that there are two parts to the computation, the map and the reduce, and it’s the
interface between the two where the “mixing” occurs. Like HDFS, MapReduce has
built-in reliability.
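To give a flavour of this key-value model, here is a minimal sketch of the classic word count example, written against Hadoop's org.apache.hadoop.mapreduce API. The class and variable names are illustrative, and the driver code that configures input and output paths and submits the job is omitted.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: emit a (word, 1) pair for every word in an input line.
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
          if (!token.isEmpty()) {
            word.set(token);
            context.write(word, ONE);   // key = word, value = 1
          }
        }
      }
    }

    // Reduce phase: by the time reduce() runs, the framework has grouped all
    // values by key (this shuffle is the "mixing" between map and reduce),
    // so the reducer just sums the counts for each word.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
          sum += count.get();
        }
        context.write(word, new IntWritable(sum));   // key = word, value = total
      }
    }

Note that the mapper and reducer never touch disks or other nodes directly: they only consume and produce key-value pairs, and the framework handles reading the input splits, moving intermediate data between machines, and retrying failed tasks.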
This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis
system. The storage is provided by HDFS and analysis by MapReduce. There are other
parts to Hadoop, but these capabilities are its kernel.
