Number of InputSplits in MapReduce:
An InputSplit is a chunk of the input data assigned to a map task for processing. FileInputFormat
generates InputSplits (and divides each split into records): one InputSplit per file, unless the
file spans more than one HDFS block, in which case it factors the configured minimum split size,
maximum split size, and block size into the split size calculation.
Here's the formula, from Hadoop: The Definitive Guide:
Split size = max(minimumSplitSize, min(maximumSplitSize, HDFSBlockSize))
So, with the default values, the split size equals the HDFS block size for files spanning more
than one HDFS block.
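
To make the calculation concrete, here is a minimal sketch mirroring the computeSplitSize() logic in Hadoop's FileInputFormat; the 128 MB block size and the min/max defaults shown are those of Hadoop 2.x, so treat the exact numbers as assumptions for your version:

public class SplitSizeDemo {

    // Mirrors the formula used by FileInputFormat.computeSplitSize():
    // max(minimumSplitSize, min(maximumSplitSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // 128 MB HDFS block (Hadoop 2.x default)
        long minSize   = 1L;                 // default minimum split size
        long maxSize   = Long.MAX_VALUE;     // default maximum split size

        // With the defaults, the split size comes out equal to the block size.
        System.out.println(computeSplitSize(blockSize, minSize, maxSize)); // 134217728
    }
}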
Problem with MapReduce processing of small files:
Hadoop works best with a small number of large files, but in practice we still have to deal with
many small files. When a MapReduce job processes many small files, by default each file is
processed by its own map task (so 1,000 small files means 1,000 map tasks). Having that many
tasks, each finishing in a matter of seconds, is inefficient because of the per-task startup and
scheduling overhead.
Increasing the minimum split size to reduce the number of map tasks is not the right solution:
once a split is larger than a block, it can contain blocks stored on different nodes, so the map
task loses data locality (see the fragment below).
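
For reference, raising the minimum split size looks like this; a hypothetical driver fragment using the new MapReduce API, shown only to make the trade-off concrete, not as a recommendation:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MinSplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "min-split-size-demo");
        // Force splits of at least 256 MB: fewer map tasks, but a split this
        // large can span blocks stored on different nodes, costing locality.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
        // Equivalent configuration property:
        // mapreduce.input.fileinputformat.split.minsize
    }
}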
Solution
CombineFileInputFormat packs many files into each split, giving a map task more data to process.
It takes node and rack locality into account when forming splits, so performance is not
compromised.
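
Here is a minimal driver sketch wiring this up with CombineTextInputFormat, the ready-made text subclass of CombineFileInputFormat in Hadoop 2.x; the class name SmallFilesDriver, the input/output paths, and the 128 MB cap are illustrative assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(SmallFilesDriver.class);

        // Pack many small files into each split instead of one split per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at 128 MB; without a cap, all the input
        // could end up in a single split (and a single map task).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Mapper/Reducer setup omitted for brevity; with the defaults, the
        // identity mapper sees (offset, line) records as with TextInputFormat.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}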