Hive Query Language
HiveQL is an SQL-like query language for Hive.
It mostly mimics SQL syntax for creation of tables, loading data into tables
and querying the tables. HiveQL also allows users to embed their custom
map-reduce scripts. These scripts can be written in any language using a simple
row-based streaming interface – read rows from standard input and write out
rows to standard output. This flexibility comes at the cost of a performance hit
caused by converting rows to and from strings. However, we have seen that users
do not mind this, given that they can implement their scripts in the language of
their choice. Another feature unique to HiveQL is the multi-table insert. In this
construct, users can perform multiple queries on the same input data using a
single HiveQL statement. Hive optimizes these queries to share the scan of the
input data, increasing their throughput by several orders of
magnitude. We omit further details for lack of space.
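As a rough sketch of these two features (the table, column and script names below are hypothetical, not from the original text), the first statement streams rows through a user-supplied script over the row-based streaming interface, and the second scans the input table once while populating two output tables:

    ADD FILE word_count.py;   -- ship the user script to the cluster
    FROM docs
    SELECT TRANSFORM (line) USING 'python word_count.py' AS (word STRING, cnt INT);

    -- Multi-table insert: one scan of page_views feeds two tables
    FROM page_views pv
    INSERT OVERWRITE TABLE short_urls SELECT pv.page_url WHERE length(pv.page_url) < 20
    INSERT OVERWRITE TABLE long_urls  SELECT pv.page_url WHERE length(pv.page_url) >= 20;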
Phases of Hive
Compiler
· Parser – Transform a query string to a parse tree representation.
· Semantic Analyser – Transform the parse tree to an internal query representation, which is still block based and not an operator tree. As part of this step, the column names are verified and expansions like * are performed. Type-checking and any implicit type conversions are also performed at this stage. If the table under consideration is a partitioned table, which is the common scenario, all the expressions for that table are collected so that they can later be used to prune the partitions that are not needed. If the query has specified sampling, that is also collected for later use.
· Logical Plan Generator – Convert the internal query representation to a logical plan, which consists of a tree of operators. Some of the operators are relational algebra operators like 'filter', 'join' etc., but some are Hive specific and are used later on to convert this plan into a series of map-reduce jobs. One such operator is the reduceSink operator, which occurs at the map-reduce boundary. This step also includes the optimizer, which transforms the plan to improve performance – some of those transformations include: converting a series of joins into a single multi-way join, performing map-side partial aggregation for a group-by, and performing a group-by in two stages to avoid the scenario where a single reducer becomes a bottleneck in the presence of skewed data for the grouping key. Each operator comprises a descriptor, which is a serializable object.
· Query Plan Generator – Convert the logical plan to a series of map-reduce tasks. The operator tree is recursively traversed and broken up into a series of serializable map-reduce tasks, which can later be submitted to the map-reduce framework running over the Hadoop Distributed File System. The reduceSink operator marks the map-reduce boundary, and the reduction keys in its descriptor become the reduction keys of the corresponding map-reduce job. The plan also carries the required samples/partitions if the query specified them. The plan is serialized and written to a file (see the EXPLAIN sketch after this list).
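As a rough way to see these phases in action (the table, column and partition names below are hypothetical), Hive's EXPLAIN statement prints the operator tree and the map-reduce tasks that the compiler produces for a query. For the query below, the predicate on the partition column dt would be collected for partition pruning, and the group-by would typically compile into a map-side partial aggregation followed by a reduce stage:

    EXPLAIN
    SELECT pv.page_url, count(*)
    FROM page_views pv
    WHERE pv.dt = '2011-01-01'
    GROUP BY pv.page_url;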
Optimizer
More plan transformations are performed by the optimizer. The optimizer is an evolving component. As of 2011 it was rule-based and performed two transformations: column pruning and predicate pushdown. However, the infrastructure was in place, and work was under way to include other optimizations such as map-side join.
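A minimal sketch of what these two rules buy (the tables and columns are hypothetical; hive.optimize.ppd is the configuration property that controls predicate pushdown, and it is on by default):

    SET hive.optimize.ppd=true;
    -- Predicate pushdown: the filter on pv.dt is applied while scanning
    -- page_views rather than after the join. Column pruning: only the
    -- columns actually referenced (userid, page_url, dt, id, name) are read.
    SELECT u.name, pv.page_url
    FROM page_views pv JOIN users u ON (pv.userid = u.id)
    WHERE pv.dt = '2011-01-01';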
Metastore
The Metastore provides two important but often overlooked features of a data warehouse: data abstraction and data discovery. Without the data abstractions provided in Hive, a user has to supply information about data formats, extractors and loaders along with the query. In Hive, this information is given during table creation and reused every time the table is referenced. This is very similar to traditional warehousing systems. The second feature, data discovery, enables users to discover and explore relevant and specific data in the warehouse.
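As a small sketch of data discovery (the table name is hypothetical), the Metastore contents can be browsed directly from HiveQL:

    SHOW TABLES 'page*';            -- find tables matching a pattern
    DESCRIBE EXTENDED page_views;   -- inspect columns, SerDe and storage metadata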
Metadata Objects
· Database – a namespace for tables. It can be used as an administrative unit in the future. The database 'default' is used for tables with no user-supplied database name.
· Table – Metadata for a table contains the list of columns, owner, storage and SerDe information. It can also contain any user-supplied key and value data. Storage information includes the location of the underlying data, file input and output formats and bucketing information. SerDe metadata includes the implementation class of the serializer and deserializer and any supporting information required by the implementation. All of this information can be provided during creation of the table.
· Partition – Each partition can have its own columns and SerDe and storage information. This facilitates schema changes without affecting older partitions.
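A sketch of how this metadata is supplied at creation time (the table name, columns and location are hypothetical; the SerDe class shown is Hive's built-in LazySimpleSerDe):

    CREATE TABLE page_views (viewtime INT, userid BIGINT, page_url STRING)
    COMMENT 'page view logs'                    -- user-supplied key/value metadata
    PARTITIONED BY (dt STRING)                  -- partition column
    CLUSTERED BY (userid) INTO 32 BUCKETS       -- bucketing information
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    STORED AS TEXTFILE                          -- file input/output formats
    LOCATION '/user/hive/warehouse/page_views'; -- location of the underlying data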
External table in HIVE (stores data on HDFS)
· An external table stores its files on the HDFS server, but the table is not completely linked to the source file (what "completely" means is explained below).
· If you delete an external table, the file still remains on the HDFS server.
· The link between the table and the file is there, but it is read only: Hive uses the file without taking ownership of it.
· As an example, if you create an external table called "amandeep_test" in HIVE using HIVE-QL and link the table to file "flat_file.txt", then deleting "amandeep_test" from HIVE will not delete "flat_file.txt" from HDFS (see the sketch after this list).
· External table files are accessible to anyone who has access to the HDFS file structure, so security needs to be managed at the HDFS file/folder level.
· Metadata is maintained on the master node, and deleting an external table from HIVE deletes only the metadata, not the data/file.
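A minimal sketch of the "amandeep_test" example above (the column names, delimiter and directory are assumptions; note that LOCATION points at a directory containing flat_file.txt rather than at the file itself):

    CREATE EXTERNAL TABLE amandeep_test (col1 STRING, col2 STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/amandeep_test';   -- directory holding flat_file.txt

    DROP TABLE amandeep_test;         -- removes only the metadata;
                                      -- flat_file.txt stays on HDFS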
Use external table if you:
· Want to manage the data outside HIVE, e.g. you are planning to use an ETL tool to load/merge data files etc.
· Want to load the latest information into the table but still want to retain the old dataset in a file on HDFS for regulatory/legal purposes.
· Are not planning to create a table from another table's schema, e.g. CREATE TABLE table1 AS SELECT * FROM table2 (see the note after this list).
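For reference, a CTAS statement looks as follows; in the Hive versions discussed here, CTAS could not create external tables, which is why this pattern implies an internal table:

    CREATE TABLE table1 AS SELECT * FROM table2;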
Internal table in HIVE (stores data on HDFS, but in a kind of restricted area)
Internal tables are stored in a directory controlled by the configuration property hive.metastore.warehouse.dir (set in the Hive configuration, e.g. hive-site.xml). By default, internal tables are stored under "/user/hive/warehouse"; you can change this by updating that property.
e.g. an internal table "tbl_batting" is stored as files on HDFS under the warehouse folder (/user/hive/warehouse/tbl_batting).
· Deleting the table deletes the metadata and the data from the master node and HDFS respectively (see the sketch after this list).
· Security needs to be managed within HIVE, probably at the schema level (this varies from organisation to organisation). HDFS security is out of scope in this case.
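A hedged sketch of this lifecycle, using the "tbl_batting" table mentioned above (the column names are hypothetical):

    CREATE TABLE tbl_batting (player_id STRING, runs INT)
    STORED AS TEXTFILE;     -- data lands under /user/hive/warehouse/tbl_batting

    DROP TABLE tbl_batting; -- removes the metadata AND the files on HDFS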
Use internal table if you:
· Want to store the data temporarily.
· Want to use HIVE to manage the lifecycle of tables and data.