Tuesday, 18 July 2017

Hadoop Common Things


Number of InputSplits in MapReduce:


An input split is a chunk of the input data assigned to a single map task for processing.  FileInputFormat
generates the input splits (and divides each split into records): by default it creates one split per file,
unless a file spans more than one HDFS block, in which case it uses the configured values of minimum split
size, maximum split size and block size to determine the split size.

Here's the formula, from Hadoop: The Definitive Guide -

Split size = max(minimumSplitSize, min(maximumSplitSize, HDFSBlockSize))

So, with the default values, the split size equals the HDFS block size for files that span more than one
HDFS block.
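
As a worked example (assuming the common 128 MB default block size; older clusters may default to 64 MB),
the default minimum and maximum split sizes are 1 byte and Long.MAX_VALUE respectively, so:

Split size = max(1, min(Long.MAX_VALUE, 128 MB)) = 128 MB

A 1 GB file stored as eight 128 MB blocks is therefore processed by eight map tasks, one per block.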



The problem with MapReduce processing of small files:


Hadoop works best with large files, but in reality we still have to deal with small files.  When you process
many small files in a MapReduce job, by default each file is handled by its own map task (so 1,000 small files
mean 1,000 map tasks).  Having that many tasks, each finishing in a matter of seconds, is inefficient, because
the overhead of scheduling and starting each task dominates the actual work.

Increasing the minimum split size to reduce the number of map tasks is not the right solution in this
situation, because splits that span multiple blocks come at the potential cost of data locality.

Solution

CombineFileInputFormat packs many files into each split, giving a map task more data to process.
It takes node and rack locality into account, so performance is not compromised.
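
As a sketch of how this looks in practice when the small files are read through Hive running on MapReduce:
Hive's CombineHiveInputFormat plays the same role, and the combined split size can be capped with the
standard MapReduce max split size property (the 256 MB value below is only an illustrative choice, not a
recommendation):

SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- Upper bound on how much data is packed into one combined split (example value: 256 MB).
SET mapreduce.input.fileinputformat.split.maxsize=268435456;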


Layers of Big Data

Six Layers of Big Data:



1: First, the Data Ingestion layer takes in all kinds of data including: structured data; unstructured data; data from devices, sensors, logs, click streams, and applications; and data from both cloud and on-premises sources.


2: Next, the Processing and Persistence layer stores and processes the data, using systems such as Hadoop and Spark, whether they run in the cloud or on-premises.

3: The Orchestration layer handles transformation and cleansing.

4: The Data Discovery layer is the critical next step, because it solves the silo problem, and it does that using a mixture of data modeling, data preparation, data curation, and data virtualization. Data virtualization creates a combined, virtual view of the data across two or more silos, which can be accessed by consumers in real-time as if the disparate silos were part of the same dataset.


5: The Data Management and Intelligence layer provides security and governance across the other five layers.

6: Finally, the Data Access layer delivers the data directly to analysts or to applications, tools, and dashboards.

Tuesday, 4 July 2017

Hive Conditional Functions

Hive supports three conditional functions, listed below:


IF( Test Condition, True Value, False Value ) 

The IF function evaluates the test condition; if the condition is true, it returns the true value, otherwise it returns the false value.

Example: IF(1=1, 'working', 'not working') returns 'working'
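
A query-level sketch (the employees table and its name and salary columns are hypothetical names used only for illustration):

SELECT name, IF(salary > 50000, 'high', 'low') AS salary_band FROM employees;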


COALESCE( value1,value2,... )

The COALESCE function returns the first non-NULL value from the list of values. If all the values in the list are NULL, then it returns NULL.

Example: COALESCE(NULL,NULL,5,NULL,4) returns 5
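
A query-level sketch (the customers table and its mobile and phone columns are hypothetical names):

SELECT COALESCE(mobile, phone, 'no contact number') AS contact FROM customers;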

CASE Statement

All the conditions must be of the same data type. Conditions are evaluated in the order listed; once a condition is found to be true, the CASE expression returns the corresponding result and does not evaluate any further conditions. The syntax looks like this:

CASE
       WHEN Fruit = 'APPLE' THEN 'The owner is APPLE'
       WHEN Fruit = 'ORANGE' THEN 'The owner is ORANGE'
       ELSE 'It is another Fruit'
END
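
A complete query using the CASE expression above (the fruits table is a hypothetical name used only for illustration):

SELECT Fruit,
       CASE
            WHEN Fruit = 'APPLE' THEN 'The owner is APPLE'
            WHEN Fruit = 'ORANGE' THEN 'The owner is ORANGE'
            ELSE 'It is another Fruit'
       END AS owner
FROM fruits;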

Hive Functions

Hive is mostly used for structured data processing.

Below are some general functions used in day-to-day work.

LENGTH( string str )

The LENGTH function returns the number of characters in a string.
Example: LENGTH('Bharat') returns 6
Query: SELECT LENGTH(name) FROM table1;

LTRIM( string str )

The LTRIM function removes all the leading spaces from the string.
Example: LTRIM('   Bharat') returns 'Bharat'

RTRIM( string str )

The RTRIM function removes all the trailing spaces from the string.
Example: RTRIM('Bharat    ') returns 'Bharat'

TRIM( string str )

The TRIM function removes all the leading and trailing spaces from the string.
Example: TRIM('       Bharat    ') returns 'Bharat'


SPLIT( string str, string pat ) 

The SPLIT function splits the string around the pattern pat and returns an array of strings. You can specify regular expressions as patterns.
Example: SPLIT('Bharat:Manoj:Amol', ':') returns ["Bharat","Manoj","Amol"]
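
Individual elements of the returned array can be accessed by index (Hive arrays are zero-based), for example:
SPLIT('Bharat:Manoj:Amol', ':')[1] returns 'Manoj'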

SUBSTR( string source_str, int start_position [,int length]  )

The SUBSTR function returns a part of the source string from the start position with the specified length of characters. If the length is not given, then it returns from the start position to the end of the string.

Example: SUBSTR('Bharatkumar', 7) returns 'kumar'
Example: SUBSTR('Bharatkumar', 6, 5) returns 'tkuma'
Note: string positions start at 1. The first example returns the substring from position 7 to the end of the string; the second returns the 5 characters starting at position 6.


REGEXP_EXTRACT( string subject, string pattern, int index )

The REGEXP_EXTRACT function returns the part of the subject string matched by the capture group at the given index of the regular expression pattern (index 0 returns the whole match). Note that backslashes must be doubled inside a Hive string literal.
Example: REGEXP_EXTRACT('1a 2b 14m', '(\\d+)', 1) returns '1'


REGEXP_REPLACE( string initial_string, string pattern, string replacement )

The REGEXP_REPLACE function replaces every substring of the initial string that matches the regular expression pattern with the replacement string.
Example: REGEXP_REPLACE('1a 2b 14m', '\\d+[ab] ', '') returns '14m'
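
The index argument of REGEXP_EXTRACT selects which capture group is returned.
Example: REGEXP_EXTRACT('2000-01-01', '(\\d+)-(\\d+)-(\\d+)', 2) returns '01'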

ROUND( double value [, int n] )

The ROUND function returns the value rounded to n decimal places.
Example: ROUND(123.456, 2) returns 123.46

UNIX_TIMESTAMP( string date, string pattern )

This function parses the date string using the specified pattern and returns the number of seconds between that date and the Unix epoch, interpreted in the server's default time zone. If parsing fails, it returns 0.
Example: UNIX_TIMESTAMP('2000-01-01 10:20:30', 'yyyy-MM-dd') returns 946713600 (the exact value depends on the time zone)


FROM_UNIXTIME( bigint number_of_seconds  [, string format] )

The FROM_UNIXTIME function converts the given number of seconds since the Unix epoch into a date string, by default in the format 'yyyy-MM-dd HH:mm:ss'.
Example: FROM_UNIXTIME( UNIX_TIMESTAMP() ) returns the current date and time. This is equivalent to SYSDATE in Oracle.
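
Combining the two functions converts a timestamp string into epoch seconds and back, for example:
FROM_UNIXTIME( UNIX_TIMESTAMP('2000-01-01 10:20:30', 'yyyy-MM-dd HH:mm:ss'), 'yyyy-MM-dd' ) returns '2000-01-01'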


TO_DATE( string timestamp )

The TO_DATE function returns the date part of the timestamp in the format 'yyyy-MM-dd'.
Example: TO_DATE('2000-01-01 10:20:30') returns '2000-01-01'


DATEDIFF( string date1, string date2 )

The DATEDIFF function returns the number of days between the two given dates (date1 minus date2).
Example: DATEDIFF('2000-03-01', '2000-01-10') returns 51


DATE_ADD( string date, int days ) 

The DATE_ADD function adds the given number of days to the specified date.
Example: DATE_ADD('2000-03-01', 5) returns '2000-03-06'

DATE_SUB( string date, int days )

The DATE_SUB function subtracts the given number of days from the specified date.
Example: DATE_SUB('2000-03-01', 5) returns '2000-02-25'
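
A query-level sketch putting the date functions together (the orders table and its order_id and order_date columns are hypothetical names; TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())) stands in for today's date, as described above):

SELECT order_id,
       DATEDIFF( TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP())), order_date ) AS age_in_days,
       DATE_ADD( order_date, 30 ) AS follow_up_date
FROM orders;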