What is the difference between Pig and SQL?
Answer: Pig Latin is a procedural counterpart to SQL; the two have some similarities but more differences. SQL is a declarative query language: the user states what answer is wanted, not how to compute it. If a user wants to perform multiple operations on tables, they must write multiple queries and often store intermediate results in temporary tables. SQL supports subqueries, but many users find subqueries confusing and difficult to form properly, and using them creates an inside-out design where the first step in the data pipeline is the innermost query. Pig, by contrast, is designed with a long series of data operations in mind, so there is no need to write the data pipeline as an inverted set of subqueries or to worry about storing data in temporary tables.
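For example, a multi-step transformation that would need nested subqueries or temporary tables in SQL reads as a straight top-to-bottom pipeline in Pig Latin. A minimal sketch (the file name and fields here are hypothetical):

```pig
-- each step names its result, which the next step consumes
users  = LOAD 'users' AS (name:chararray, age:int);
adults = FILTER users BY age >= 18;
grpd   = GROUP adults BY age;
counts = FOREACH grpd GENERATE group AS age, COUNT(adults) AS n;
srtd   = ORDER counts BY n DESC;
STORE srtd INTO 'adults_by_age';
```

Each statement reads like a step in the pipeline, in the order the data flows, rather than inside-out.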
Explain different execution modes available in Pig.
Pig provides three different execution modes:
- Interactive mode or Grunt mode.
- Batch mode or Script mode.
- Embedded mode
Interactive mode or Grunt mode: Pig’s interactive shell is known as the Grunt shell. It starts automatically when Pig is invoked without a script file. From the Grunt shell you can also run a script file:
grunt> run scriptfile.pig
grunt> exec scriptfile.pig
Batch mode or Script mode : Pig executes the specified commands in the script file.
Embedded mode : We can embed Pig programs in Java and run them from a Java program.
Differentiate between the physical plan and logical plan in Pig script.
Both plans are created during the execution of a Pig script.
Physical plan : A series of physical operators, produced when Pig compiles the script into MapReduce jobs. It introduces physical operators such as Local Rearrange, Global Rearrange, and Package. It describes the physical operators Pig will use to execute the script, without specifying how they will run in MapReduce. Loading and storing functions are resolved in the physical plan.
Logical plan : The logical plan is created for each line in the Pig script, and is produced after basic parsing and semantic checking. With every line, the logical plan for the program grows, because each statement has its own logical plan. Loading and storing functions are not resolved in the logical plan.
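Both plans can be inspected with Pig's EXPLAIN operator. A minimal sketch (the file name and fields are hypothetical):

```pig
daily = LOAD 'daily' AS (exchange:chararray, stock:chararray);
grpd  = GROUP daily BY stock;
cnts  = FOREACH grpd GENERATE group, COUNT(daily);
-- prints the logical, physical, and MapReduce plans for cnts
EXPLAIN cnts;
```

This is a convenient way to see how a single Pig Latin statement expands into the different plan layers.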
What are the advantages of the Pig language?
Answer: Pig is easy to learn: it overcomes the need for writing complex MapReduce programs to some extent. Pig works in a step by step manner, so it is easy to write and, even better, easy to read.
It can handle heterogeneous data: Pig can handle all types of data – structured, semi-structured, or unstructured.
Pig is Faster: Pig’s multi-query approach combines certain types of operations together in a single pipeline, reducing the number of times data is scanned.
Pig does more with less: Pig provides the common data operations (filters, joins, ordering, etc.) and nested data types (e.g. tuples, bags, and maps) which can be used in processing data.
Pig is Extensible: Pig is easily extended with UDFs written in languages including Java, Python, JavaScript, and Ruby, so you can use them to load, aggregate, and analyze data. Pig insulates your code from changes to the Hadoop Java API.
Why do we need Pig?
Answer: Pig is a high-level scripting language that is used with Apache Hadoop. Pig excels at describing data analysis problems as data flows. Pig is complete in that you can do all the required data manipulations in Apache Hadoop with Pig. In addition, through the User Defined Functions (UDF) facility in Pig, you can have Pig invoke code in many languages like JRuby, Jython, and Java. Conversely, you can execute Pig scripts from other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.
What is the difference between MapReduce and Pig?
Answer:
- In MapReduce you need to write the entire logic for operations like join, group, filter, sum, etc.
- In Pig, built-in functions are available for these operations.
- In MapReduce the number of lines of code required is large even for simple functionality.
- In Pig, roughly 10 lines of Pig Latin can replace about 200 lines of Java.
- In MapReduce the coding effort is high.
- In Pig, what takes about 4 hours to write in Java can take about 15 minutes in Pig Latin (approximately).
- MapReduce therefore means lower productivity; Pig means higher productivity.
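As an illustration of the brevity claim, the classic word count — which takes dozens of lines as a raw Java MapReduce job — is only a handful of lines of Pig Latin. A sketch (the input and output paths are hypothetical):

```pig
lines  = LOAD 'input.txt' AS (line:chararray);
-- split each line into words, one word per record
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 'wordcount_out';
```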
Explain the uses of MapReduce in Pig.
- Apache Pig programs are written in the Pig Latin query language, which is similar to SQL. To execute these queries, an execution engine is required. The Pig engine converts the queries into MapReduce jobs; MapReduce thus acts as the execution engine and runs the programs as required.
- Pig’s operators use Hadoop’s API, and depending on the configuration the job is executed in local mode or on a Hadoop cluster. Pig never passes its own outputs to Hadoop; instead it sets the inputs and data locations for MapReduce.
- Pig Latin provides a set of standard data-processing operations, such as join, filter, group by, order by, union, etc., which are mapped to MapReduce tasks. A Pig Latin script describes a directed acyclic graph (DAG), where the edges are data flows and the nodes are operators that process the data.
State the usage of the ‘filter’, ‘group’, ‘order’, and ‘distinct’ keywords in Pig scripts.
Filter : A filter has similar functionality to the WHERE clause in SQL. A filter contains a predicate; if the predicate evaluates to true for a given record, that record is passed down the pipeline, otherwise it is dropped. Predicates can use comparison operators such as ==, !=, >=, and <=.
A = load 'inputs' as (symbol, address);
B = filter A by symbol matches 'CM.*';
Group : The group statement collects records with the same key. In SQL, GROUP BY creates groups that feed directly into one or more aggregate functions, but Pig Latin has no direct connection between group and aggregate functions.
input2 = load 'daily' as (exchanges, stocks);
grpds = group input2 by stocks;
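Because grouping and aggregation are decoupled in Pig, an aggregate is applied in a separate FOREACH step. A sketch continuing the same (hypothetical) example:

```pig
input2 = LOAD 'daily' AS (exchanges, stocks);
grpds  = GROUP input2 BY stocks;
-- count the records in each group's bag
counts = FOREACH grpds GENERATE group, COUNT(input2);
```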
Order : The order statement sorts the data, producing a total order on the output. The order syntax is similar to group: give a key or set of keys on which to order your data. For example:
input2 = load 'daily' as (exchanges, stocks);
srtd = order input2 by exchanges;
Distinct : The distinct statement is very simple to understand and implement. It removes duplicate records while leaving the original data intact. It operates only on entire records, not on individual fields. Consider the example below:
input2 = load 'daily' as (exchanges, stocks);
uniq = distinct input2;
What is the difference between GROUP and COGROUP?
The GROUP and COGROUP operators are identical. Both operators can work with one or more relations. For readability, GROUP is used in statements involving one relation and COGROUP is used in statements involving two or more relations. We can COGROUP up to, but no more than, 127 relations at a time.
COGROUP Two Tables
Where COGROUP gets fancy is that you can COGROUP on two tables at once. Pig will group the two tables and then join the two tables on the grouped column. For example, assume we also had a data set of pet names:
Given this table, we could compare for example all the people with a given animal to all the names of that animal. The COGROUP command is:
owners = LOAD 'owners.csv'
USING PigStorage(',')
AS (owner:chararray,animal: chararray);
pets = LOAD 'pets.csv'
USING PigStorage(',')
AS (name:chararray,animal: chararray);
grouped = COGROUP owners BY animal, pets BY animal;
DUMP grouped;
This will group each table based on the animal column. For each animal, it will create a bag of matching rows from both tables. For this example, we get:
group | owners                   | pets
------+--------------------------+---------------------------
cat   | {(adam,cat),(alice,cat)} | {(paws,cat),(wiskers,cat)}
dog   | {(adam,dog),(steve,dog)} | {(fido,dog),(rex,dog)}
fish  | {(alex,fish)}            | {(nemo,fish)}
In summary, you can use COGROUP when you need to group two tables by a column and then join on the grouped column.
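To complete the join, the bags can be flattened in a FOREACH step. A sketch continuing the example above (it pairs every owner with every pet name of the same animal):

```pig
owners  = LOAD 'owners.csv' USING PigStorage(',')
          AS (owner:chararray, animal:chararray);
pets    = LOAD 'pets.csv' USING PigStorage(',')
          AS (name:chararray, animal:chararray);
grouped = COGROUP owners BY animal, pets BY animal;
-- cross the two bags within each group
joined  = FOREACH grouped GENERATE FLATTEN(owners), FLATTEN(pets.name);
DUMP joined;
```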
What is bag?
Answer: A bag is one of the data models present in Pig. It is an unordered collection of tuples, possibly with duplicates. Bags are used to store collections while grouping. A bag need not fit entirely into memory: when a bag grows too large, Pig spills it to local disk and keeps only part of it in memory, so a bag’s size is limited only by the local disk. We represent bags with “{}”.
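A sketch showing how GROUP produces a bag per key (the file name and fields are hypothetical):

```pig
daily = LOAD 'daily' AS (exchange:chararray, stock:chararray);
grpd  = GROUP daily BY exchange;
-- each tuple of grpd holds a key plus a bag of the matching daily tuples:
--   grpd: {group: chararray, daily: {(exchange: chararray, stock: chararray)}}
DESCRIBE grpd;
```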
What is UDF in Pig?
Answer: Pig has a wide range of built-in functions, but occasionally we need to write complex business logic that cannot be implemented using the primitive functions. For this, Pig supports User Defined Functions (UDFs) as a way to specify custom processing.
Pig UDFs can presently be implemented in Java, Python, JavaScript, Ruby, and Groovy. The most far-reaching support is provided for Java functions: you can customize all parts of the processing, including data load/store, column transformation, and aggregation. Java functions are also more efficient because they are implemented in the same language as Pig, and because additional interfaces are supported, such as the Algebraic interface and the Accumulator interface. More limited support is provided for Python, JavaScript, Ruby, and Groovy functions.
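Once a Java UDF is packaged into a jar, it is registered and invoked from Pig Latin. A minimal sketch (the jar name, class name, file name, and field are all hypothetical):

```pig
REGISTER 'myudfs.jar';
-- give the UDF class a short alias
DEFINE UPPER com.example.pig.Upper();
A = LOAD 'names' AS (name:chararray);
B = FOREACH A GENERATE UPPER(name);
```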
What is the difference between the store and dump commands?
Answer: The dump command processes the data and displays the result on the terminal, but does not store it anywhere. The store command writes the output to the local file system or HDFS, into a specified folder. In a production environment, Hadoop developers most often use the store command to save data in HDFS.
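A sketch showing both commands side by side (the file and directory names are hypothetical):

```pig
A = LOAD 'daily' AS (exchanges, stocks);
-- print the relation to the terminal; nothing is persisted
DUMP A;
-- write the relation to a directory on HDFS (or the local FS in local mode)
STORE A INTO 'output_dir' USING PigStorage(',');
```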