May 28

When to use Pig Latin versus Hive SQL?

Once your big data is loaded into Hadoop, what’s the best way to put that data to use?  You’ll need some way to filter and aggregate the data, and then apply the results to something useful.  Collecting terabytes and petabytes of web traffic data is not useful until you have a way to extract meaningful insights from it.

Image credit: gerard79 at stock.xchng

That’s where MapReduce comes in.  MapReduce lets you filter and aggregate data from HDFS so that you can gain insights from the big data.  However, writing MapReduce code in plain Java is laborious: even simple jobs can require many lines of code, with additional time needed for code review and QA.

So instead of writing plain Java code to use MapReduce, you now have the option of using either the Pig Latin or Hive SQL language to construct MapReduce programs.  (There’s also another option of using the Hadoop Streaming protocol, which works over STDIN and STDOUT with any language, such as Python or even a Bash shell script, but we’ll explore that option another day.)  The benefit is that you write far fewer lines of code, which reduces overall development and testing time.  The rule of thumb is that writing a Pig script takes about 5% of the time of writing the equivalent MapReduce program in Java, while giving up only about 50% in runtime performance.  Although Pig and Hive scripts generally don’t run as fast as native Java MapReduce programs, they are vastly superior in boosting productivity for data engineers and analysts.
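To make that line-count contrast concrete, here is a sketch of the classic word-count job in Pig Latin.  The input and output paths and the alias names are illustrative assumptions, not from any particular cluster:

```pig
-- Word count sketched in Pig Latin (paths and alias names are hypothetical).
lines  = LOAD '/data/input' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO '/data/wordcount';
```

The same job in raw Java MapReduce typically needs a mapper class, a reducer class, and driver boilerplate, which is where the savings in lines of code come from.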

[See Top 10 Presentations for Learning Hadoop on SlideShare.]

When should you use Pig Latin and when should you use Hive?

Depending on where you work, you may need to simply use whatever standards your company has established.

For example, Hive is commonly used at Facebook for analytical purposes.  Facebook promotes the Hive language and their employees frequently speak about Hive at Big Data and Hadoop conferences.

However, Yahoo! is a big advocate for Pig Latin.  Yahoo! has one of the biggest Hadoop clusters in the world.  Their data engineers use Pig for data processing on their Hadoop clusters.

Alternatively, you may have a choice of Pig or Hive at your organization, especially if no standards have yet been established, or perhaps multiple standards have been set up.

If you know SQL, then Hive will be very familiar to you.  Since Hive uses SQL, you will feel at home with all the familiar SELECT, WHERE, GROUP BY, and ORDER BY clauses you know from relational databases.  You do, however, give up some ability to optimize the query yourself by relying on the Hive optimizer.  This seems to be the case for any implementation of SQL on any platform, Hadoop or traditional RDBMS, where hints are sometimes ironically needed to teach the automatic optimizer how to optimize properly.
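For instance, a typical ad-hoc Hive query reads almost exactly like the SQL you would write against an RDBMS.  The table and column names below are illustrative assumptions:

```sql
-- Hypothetical Hive query; web_logs and its columns are assumed for illustration.
SELECT page, COUNT(*) AS hits
FROM web_logs
WHERE status = 200
GROUP BY page
ORDER BY hits DESC
LIMIT 10;
```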

However, compared to Hive, Pig requires some mental adjustment for SQL users.  Pig Latin has many of the usual data processing concepts that SQL has, such as filtering, selecting, grouping, and ordering, but the syntax is a little different from SQL (particularly the GROUP and FLATTEN operators!).  Pig code is also more verbose, although still a fraction of what a straight Java MapReduce program requires.  In exchange, Pig gives you more control over the flow of the data, and more opportunities to optimize it, than Hive does.
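As a rough sketch, the same kind of filter-group-order query shown for Hive looks like this in Pig Latin.  Note how grouping is its own GROUP step whose result is then projected with FOREACH … GENERATE, rather than a GROUP BY clause inside one statement (paths and alias names are hypothetical):

```pig
-- A filter/group/order pipeline sketched in Pig Latin (names are hypothetical).
logs  = LOAD '/data/web_logs' AS (page:chararray, status:int);
ok    = FILTER logs BY status == 200;
grpd  = GROUP ok BY page;
hits  = FOREACH grpd GENERATE group AS page, COUNT(ok) AS cnt;
srtd  = ORDER hits BY cnt DESC;
top10 = LIMIT srtd 10;
STORE top10 INTO '/data/top_pages';
```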

Personally, I use both Pig Latin and Hive, although for different purposes.  I learned Pig Latin first, and have used it to construct dataflows, where I typically have a scheduled job to periodically crunch the massive data from HDFS and to transfer the summarized data into a relational database for reporting, dashboarding, and ad-hoc analyses.  I also use Hive for some simple ad-hoc analytical queries into the data in HDFS, as Hive queries are a lot faster to write for those types of queries.  However, I don’t use Hive for the automated batch jobs that move data between HDFS and other systems.  I find that I can tune the dataflow process better using Pig than with Hive.  Additionally, some of the datasets that I need in Hadoop have not yet been structured with metadata schemas for use with Hive.  In those cases, Pig is much more flexible in reading those datasets than Hive is.

Hadoop expert Alan Gates has an excellent write-up comparing the differences between Pig Latin and Hive and when to use each of them.

If you are a data engineer, then you’ll likely find that you have better control over the dataflow (ETL) processes when you use Pig Latin, especially if you come from a procedural language background.  If you are a data analyst, however, you will likely ramp up on Hadoop faster by using Hive, especially if your previous experience is more with SQL than with a procedural programming language.  If you really want to become a Hadoop expert, then you should learn both Pig Latin and Hive for the ultimate flexibility.



Programming Pig

by Alan Gates

Programming Hive

by Edward Capriolo

Hadoop in Practice

by Alex Holmes



    • anagha on September 10, 2013 at 2:17 pm

    Great article!

    • Sreeni on September 19, 2013 at 6:02 am

    Good to know stuff

    • prasad on February 19, 2014 at 8:53 pm

    Nice Article and Thanks for sharing.

    • lanng on May 12, 2014 at 8:13 pm

    What is the difference between Pig LOAD and Hive Select?

    1. Hi lanng,

Pig LOAD is used more like the Hive FROM clause within a Hive SELECT statement.  Both are used to identify which data sources in HDFS are being read.  Typically in Hive, however, the metadata schema for the data source needs to be predefined so that the data source can be used like a SQL table.  In Pig, it is unnecessary to predefine the metadata schema ahead of time, so Pig can be used to query arbitrary data, including non-relational data and recursive data structures; the trade-off is that the burden of data design shifts from the person providing the data in HDFS to the person consuming it.  In Hive, the admin needs to predefine the metadata schema so that Hive users can select from well-architected tables and columns with SQL.

      In Pig, the FOREACH GENERATE statement is used to select specific “columns” of data from the data source, as well as aggregated data such as COUNT() and SUM(), similar to how the SELECT clause in Hive is used.

      I hope this answers your question.
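A minimal side-by-side sketch of the correspondence described above (relation, path, and field names are hypothetical):

```pig
-- Pig: LOAD plays the role of Hive's FROM, and FOREACH ... GENERATE
-- plays the role of Hive's SELECT list (including aggregates).
visits  = LOAD '/data/visits' AS (user:chararray, url:chararray);
by_user = GROUP visits BY user;
counts  = FOREACH by_user GENERATE group AS user, COUNT(visits) AS n;

-- Roughly equivalent Hive, assuming a predefined visits table:
--   SELECT user, COUNT(*) FROM visits GROUP BY user;
```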

