Which companies use Hadoop for analyzing big data? How big are their clusters? I thought it would be fun to compare companies by the size of their Hadoop installations. The size indicates a company's investment in Hadoop, and consequently its appetite to buy big data products and services from vendors, as well as its hiring needs to support its analytics infrastructure. See the wordcloud visualization below.
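For readers who want to reproduce something like this figure, here is a minimal sketch using the third-party Python wordcloud package, sizing each company name in proportion to its node count. This is my own illustration, not the exact code behind the figure, and the node counts shown are just a handful of sample values from the appendix below.

```python
# A minimal sketch, assuming the third-party "wordcloud" package
# is installed (pip install wordcloud). Sample node counts only.
from wordcloud import WordCloud

node_counts = {
    "Yahoo!": 42000,
    "Facebook": 2000,
    "Quantcast": 750,
    "Powerset / Microsoft": 400,
    "FOX Audience Network": 140,
}

# Word size is proportional to relative frequency (here, node count).
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(node_counts)
wc.to_file("hadoop_clusters.png")
```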
Of the companies listed, Yahoo! has by far the largest number of nodes in its massive Hadoop clusters, at over 42,000 as of July 2011. Facebook, however, may have the largest publicized Hadoop data size; it runs 2,000 nodes as of July 2011, far fewer than Yahoo!, but still a massive cluster nonetheless. LinkedIn is getting up there in node count as well, as are a few others like Quantcast. Many of the smaller companies on the list run their Hadoop clusters in the cloud, on Amazon EC2 for example, and thus can scale up quickly as needed. For our sharp-eyed readers wondering why a few notable companies are missing from this list, see further below.
As the source data, I used the Apache Hadoop Wiki "Powered By Hadoop" page, last updated December 20, 2012. The entries on that wiki page were submitted by the respective companies and might not reflect the latest information; if anyone knows of a better data source, please let me know and I'll update this visualization. For Yahoo! and Facebook, I found other information about their Hadoop cluster sizes and used that instead for those specific data points. Given the fast growth of big data adoption among internet companies, we'll likely see rapid shifts in the leaderboard over the next year or so.
Sizes are hard to compare without a standard unit of measure. I compared cluster sizes by node count, rather than by data volume (up to petabytes!) or CPU cores. If a company didn't indicate its number of nodes, I made some assumptions: I counted one server as one node, regardless of the number of CPU cores, and I assumed a typical node to be a quad-core server, so, for example, I treated Quantcast's 3,000-core installation as equivalent to 750 nodes. If a company listed a range of sizes, I used the high end of the range.
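To make those normalization rules concrete, here is a small Python sketch; the function name and the string parsing are my own illustration, not part of the original analysis.

```python
# Sketch of the normalization rules described above.
# Assumptions: 4 cores per node; a range resolves to its high end.

CORES_PER_NODE = 4

def to_nodes(entry: str) -> int:
    """Convert a raw cluster-size entry to an estimated node count."""
    text = entry.lower().replace(",", "")
    # Pull out all numbers; a range like "10-20 nodes" yields [10, 20].
    numbers = [int(tok) for tok in text.replace("-", " ").split() if tok.isdigit()]
    size = max(numbers)  # use the high end of any range
    if "core" in text:
        return size // CORES_PER_NODE  # e.g. "3000 cores" -> 750 nodes
    return size  # entry is already in nodes/servers

print(to_nodes("3000 cores"))   # 750 (Quantcast)
print(to_nodes("10-20 nodes"))  # 20
```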
Notably, some well-known companies, like Google and Walmart, are not listed here. Google uses its own proprietary versions of MapReduce and the Google File System instead of the open-source Hadoop. Google also keeps its data center architecture a closely held secret, though occasionally journalists get a behind-the-scenes glimpse and estimate the full size of its infrastructure. The other missing companies simply did not disclose their Hadoop cluster sizes on the Apache Hadoop Wiki. For all the companies in this visualization, see the appendix below.
Additionally, see the Top Big Data Companies Using Hadoop for a more thorough list of companies.
Appendix: List of Companies with Number of Hadoop Nodes
| Company | Nodes |
| --- | --- |
| Brockmann Consult GmbH | 20 |
| ETH Zurich Systems Group | 16 |
| FOX Audience Network | 140 |
| Hadoop Korean User Group | 50 |
| Hotels & Accommodation | 3 |
| Information Sciences Institute | 18 |
| Lineberger Comprehensive Cancer Center | 8 |
| Media 6 Degrees | 20 |
| Powerset / Microsoft | 400 |
| Quantcast | 3,000 cores (est. 750 nodes) |
| SLC Security Services LLC | 18 |
| The Lydia News Analysis Project | 120 |
| Technical Analysis and Stock Research | 23 |
| Universidad Distrital Francisco Jose de Caldas | 5 |
| University of Freiburg | 10 |
| University of Glasgow | 30 |
| University of Twente | 16 |
| Visible Measures Corporation | 128 cores (est. 32 nodes) |
Sources:
- Apache Hadoop Wiki, "Powered By Hadoop" page (last updated December 20, 2012)
- "Facebook has the world's largest Hadoop cluster!" (Facebook at 2,000 nodes in 2010)