Dear Santa: What I Want for Christmas from Hadoop

Dear Santa,

How do you keep track of all the wishlists and presents for the billions of boys and girls around the world?  I bet you have a lot of transactional data collected from all your photo sessions at the shopping mall, from all the letters you receive, and from all the sourcing, manufacturing, and packaging of presents that you and your elves need to do to get ready for Christmas.  Do you use Hadoop and predictive analytics to make your operations more efficient?  If you have some data scientist elves on your payroll, you probably already know what I will be asking you for this Christmas.  I’m going to ask anyway.  The list is a bit long, but here goes…

  1. Santa, please bring me faster Hadoop!  Even if not used for real-time streaming use cases, I’d be happy with queries which take only a few seconds to run rather than several minutes or hours.
  2. Please make the syntax for regular expressions in Hive and Pig behave the same way.  Having different conventions for escaping backslashes in Hive, Pig, and SQL makes it hard to switch between the different query languages.
  3. Please make a full-featured GUI IDE for business analysts to run SQL on Hadoop.  This would open up analysis of big data to many more business analysts without needing an engineer each time.
  4. Please make easier-to-read actionable error messages instead of Java stack traces.  It’s not user-friendly when big data generates big error messages.
  5. Please make Pig batch and interactive modes behave the same.  At the very least, please make the “%declare” command work in the grunt interactive shell so that we can interactively develop and test Pig scripts line-by-line more productively.
  6. Please make Pig and Hive query tuning automatic.  For a query that should take around 15 minutes to run, I’m willing to let an automatic query tuner take 30 seconds to optimize the query even if it only saves 60 seconds of run time.  At the very least, the query optimizer should give me recommendations on how I can improve my query, such as making sure to filter on partitioned fields or to create indexes.
  7. Please make my Pig and Hive queries have the option of rerunning from right before the point of failure.  Why should the query need to rerun from the very beginning if I can quickly fix the simple failure and resume where the query left off?  Hopefully the temp files are still available and valid for a short period of time after the query failure before being purged.
  8. Please give me a Hadoop grep command.  Yes, I can probably take a few minutes to write a query in Pig or Hive to find the records that I’m interested in.  However, it would be so cool to have a simple command available, something like “hadoop fs -grep needle haystack”, which would make it almost trivial to locate the specific records.
  9. Please make Hadoop installations, especially those running in the cloud, safe and secure from hackers.  I need to reassure our CEO that the customer and employee data are 100% safe and secure.
  10. Please make schema administration and discovery easier to do for Hive and HBase.  Anything that helps with schema evolution would be much appreciated.
  11. Please give me Spark!
  12. Please give me a bigger HDFS cluster.  A few thousand additional nodes would always be much appreciated!
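
Santa, while I wait for wish 8, here is roughly what I have in mind for the grep command. This is only a minimal sketch in Python, not anything that ships with Hadoop: the function name `grep_lines` is my own invention, and I am assuming the records arrive as plain lines of text on standard input.

```python
import re
import sys
from typing import Iterable, Iterator


def grep_lines(pattern: str, lines: Iterable[str]) -> Iterator[str]:
    """Yield every line that contains a match for the regular expression."""
    needle = re.compile(pattern)
    for line in lines:
        if needle.search(line):
            yield line


# When run as a script with exactly one argument (the pattern),
# filter standard input and write the matching lines to standard output.
if __name__ == "__main__" and len(sys.argv) == 2:
    for hit in grep_lines(sys.argv[1], sys.stdin):
        sys.stdout.write(hit)
```

For a small file, piping the output of “hadoop fs -cat” into a filter like this (or into plain grep) already does the job.  For a big haystack, a script like this could presumably serve as a Hadoop Streaming mapper, since it only reads lines from standard input and writes the matches to standard output.  A built-in command would still be much nicer!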

Santa, thank you for providing date datatypes in Hive already. Thank you also for the Hive and Pig cheat sheets.  I haven’t yet asked for ACID-compliant operations in Hive, but I hear you will be providing them soon anyway!  Santa, you are probably already using a more advanced version of Hadoop to carry out your massive global operations.  Anything you can share with us via open source would be absolutely awesome!

Merry Christmas Santa Claus!  Happy new year in big data and cloud computing!
