Hive CLI
- Supports SQL for ad-hoc queries
- Supports MapReduce with custom mappers and reducers, and user-defined functions (UDFs)
- Limitation: supports only a single user at a time
- No authentication support
- The Hive CLI is a simple, widely used interface and is still in production use
HiveServer1
- HiveServer1 is a client-server service
- Allows users to connect via the Hive CLI interface or a Thrift client
- Supports remote client connections, but only one client can connect at a time
- No session management support
- The Thrift API it uses provides no concurrency control
HiveServer2
- HiveServer2 also follows a client-server model
- It allows many different clients to connect via Thrift
- HiveServer2 provides multi-client support, so many clients can connect at the same time
- Authentication is much improved, with Kerberos support
- Supports JDBC and ODBC driver connections
- The Beeline CLI is used for connecting to HiveServer2
Beeline CLI
- Beeline is a command-line interface for HiveServer2
- It is based on the SQLLine CLI
- It provides better support for JDBC/ODBC
- It is not compatible with the old HiveServer1
- Connecting Beeline to HiveServer2 requires some extra configuration
Beeline Modes:
Embedded mode – Beeline starts an embedded Hive instance, which behaves much like the regular Hive CLI (remember: the Hive CLI and HiveServer1 are two different things)
Remote mode – Beeline connects over the Thrift API to a HiveServer2 process running on a separate machine/server.
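As a rough sketch, connecting Beeline in each mode looks like this (the hostname and username below are placeholders; 10000 is HiveServer2's default Thrift port):

```shell
# Embedded mode – Beeline starts a local, embedded Hive (no HiveServer2 needed)
beeline -u jdbc:hive2://

# Remote mode – connect over Thrift/JDBC to a HiveServer2 on another host
# (hs2-host.example.com and the username are placeholders)
beeline -u jdbc:hive2://hs2-host.example.com:10000 -n username
```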
Hive Auto Map Joins:
Auto map-join is a very useful feature when joining a big table with a small table. If this feature is enabled, the small table is saved in a local cache on each node and then joined with the big table in the map phase. Enabling auto map-join provides two advantages. First, loading the small table into cache saves read time on each data node. Second, it avoids skew in the Hive query, since the join has already been done in the map phase for each block of data.
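A typical configuration sketch, set either in the Hive shell or in hive-site.xml (the size threshold shown is illustrative):

```sql
-- Convert eligible common joins into map joins automatically
set hive.auto.convert.join = true;
-- Tables smaller than this many bytes are treated as the "small" side
-- (25 MB here is an illustrative value)
set hive.mapjoin.smalltable.filesize = 25000000;
```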
Hive Skew Joins
We can enable optimization of skew joins, i.e. imbalanced joins, by setting the hive.optimize.skewjoin property to true, either via the SET command in the Hive shell or in hive-site.xml. Related properties, such as hive.skewjoin.key (the number of rows with the same key above which a join key is treated as skewed), can be tuned to further optimize skew joins.
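A minimal configuration sketch (the threshold shown is the usual default):

```sql
-- Enable runtime skew join handling
set hive.optimize.skewjoin = true;
-- A join key with more than this many rows is treated as skewed
set hive.skewjoin.key = 100000;
```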
Example:
A join of two large tables is done by a set of MapReduce jobs that first sort the tables on the join key and then join them. The mappers send all rows with a particular key to the same reducer.
For example, suppose we have table A with a key column "id" that has values 1, 2, 3, and 4, and table B with a similar column that has values 1, 2, and 3.
We want to do the join corresponding to the following query:
select A.id from A join B on A.id = B.id
A set of mappers read the tables and send rows to reducers based on the keys; e.g., rows with key 1 go to reducer R1, rows with key 2 go to reducer R2, and so on. Each reducer does a cross product of the values from A and B and writes the output. Reducer R4 gets rows with key 4 from A only, so it produces no results.
Now let's assume that A is highly skewed in favor of id = 1. Reducers R2 and R3 will complete quickly, but R1 will run for a long time, thus becoming the bottleneck. If the user has information about the skew, the bottleneck can be avoided manually as follows:
Run two separate queries:
select A.id from A join B on A.id = B.id where A.id <> 1;
select A.id from A join B on A.id = B.id where A.id = 1 and B.id = 1;
The first query will not have any skew, so all the reducers will finish at roughly the same time. If we assume that B has only a few rows with B.id = 1, they will fit into memory, so the join can be done efficiently by storing the B values in an in-memory hash table. This way, the join is done by the mapper itself and the data does not have to go to a reducer. The partial results of the two queries can then be merged to get the final result.
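To see why the two partial queries together are equivalent to the original join, here is a small Python sketch (values follow the running example, with A heavily skewed toward id = 1):

```python
A = [1] * 1000 + [2, 3, 4]   # table A's id column, skewed toward id = 1
B = [1, 2, 3]                # table B's id column

# Original query: select A.id from A join B on A.id = B.id
full_join = [a for a in A for b in B if a == b]

# Query 1: the non-skewed keys (A.id <> 1) -- handled by normal reducers
part1 = [a for a in A if a != 1 for b in B if b != 1 and a == b]

# Query 2: the skewed key (A.id = 1 and B.id = 1) -- map-side hash join,
# using a tiny in-memory "hash table" of the matching B rows
b_hash = {b for b in B if b == 1}
part2 = [a for a in A if a == 1 and a in b_hash]

# Merging the partial results reproduces the full join
assert sorted(part1 + part2) == sorted(full_join)
```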
Advantages
If a small number of skewed keys account for a significant percentage of the data, they will not become bottlenecks.
Disadvantages
The tables A and B have to be read and processed twice.
Because the results are produced in two parts, they also have to be written and read twice.
The user needs to be aware of the skew in the data and manually do the above process.
We can improve this further by reducing the processing of skewed keys. First read B and store the rows with key 1 in an in-memory hash table. Then run a set of mappers to read A and do the following:
If a row has key 1, use the hashed version of B to compute the result.
For all other keys, send the row to a reducer that does the join. This reducer will also get the rows of B from a mapper.
This way, we end up reading only B twice. The skewed keys in A are read and processed only by the mappers and are not sent to the reducer. The rest of the keys in A go through only a single map/reduce.
The assumption is that B has few rows with the keys that are skewed in A, so these rows can be loaded into memory.
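The improved plan above can be sketched in Python for illustration (the skewed key is handled map-side; everything else is shuffled to reducers; values follow the running example):

```python
from collections import defaultdict

A = [1] * 1000 + [2, 2, 3, 4]   # table A, skewed toward key 1
B = [1, 2, 3]                   # table B

# Step 1: read B once and hash only the rows with the skewed key
skewed_b = {b for b in B if b == 1}

map_side_output = []                      # joins done inside the mappers
shuffle = defaultdict(lambda: ([], []))   # key -> (rows of A, rows of B)

# Step 2: mappers read A; rows with the skewed key are joined immediately
for a in A:
    if a in skewed_b:
        map_side_output.append(a)  # join done in the mapper, no shuffle
    elif a != 1:
        shuffle[a][0].append(a)    # everything else goes to a reducer

# Mappers also read B (the second read of B) and ship the non-skewed rows
for b in B:
    if b != 1:
        shuffle[b][1].append(b)

# Step 3: reducers join the non-skewed keys (cross product per key)
reduce_output = [a for a_rows, b_rows in shuffle.values()
                 for a in a_rows for _ in b_rows]

result = map_side_output + reduce_output
# The merged output matches the plain join of A and B
assert sorted(result) == sorted(a for a in A for b in B if a == b)
```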
Hive Vectorization
Vectorization was first introduced in the Hive 0.13 release. Vectorized query execution is a Hive feature that greatly reduces CPU usage for typical query operations like scans, filters, aggregates, and joins. A standard query execution engine processes one row at a time; this involves long code paths and significant metadata interpretation in the inner loop of execution. Vectorized query execution streamlines operations by processing a block of 1,024 rows at a time. Within the block, each column is stored as a vector (an array of a primitive data type). Simple operations like arithmetic and comparisons are done by quickly iterating through the vectors in a tight loop, with no or very few function calls or conditional branches inside the loop. These loops compile into relatively few instructions and finish each instruction in fewer clock cycles, on average, by effectively using the processor pipeline and cache memory.
To use vectorized query execution, you must store your data in ORC format and set the following variable in Hive SQL (see Configuring Hive):
set hive.vectorized.execution.enabled = true;
Vectorized execution is off by default, so your queries only use it if this variable is turned on. To disable vectorized execution and go back to standard execution, do the following:
set hive.vectorized.execution.enabled = false;
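For example, a table can be stored in ORC format at creation time so that queries over it are eligible for vectorization (the table and column names here are illustrative):

```sql
-- Illustrative table; vectorized execution requires ORC storage
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING,
  ts      TIMESTAMP
)
STORED AS ORC;

set hive.vectorized.execution.enabled = true;
-- Queries over page_views can now run in vectorized mode
SELECT COUNT(*) FROM page_views WHERE user_id > 0;
```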