Which companies use Hadoop for analyzing big data? How big are their clusters? I thought it would be fun to compare companies by the size of their Hadoop installations. The size indicates the company’s investment in Hadoop and, by extension, its appetite to buy big data products and services from vendors, as well as its hiring needs to support its analytics infrastructure. See the wordcloud visualization below.
Of the companies listed, Yahoo! has by far the most nodes in its massive Hadoop clusters, at over 42,000 nodes as of July 2011. Facebook, however, may have the largest publicized Hadoop data size, although it runs at 2,000 nodes as of July 2011, far fewer than Yahoo! but still a massive cluster. LinkedIn is getting up there in node count as well, along with a few others such as Quantcast. Many of the smaller companies on the list run their Hadoop clusters in the cloud, on Amazon EC2 for example, and thus can scale up quickly as needed. For our sharp-eyed readers wondering why a few notable companies are missing from this list, see further below.
Methodology
For the source data, I used the Apache Hadoop Wiki “Powered By Hadoop” page, last updated December 20, 2012. The entries on that wiki page were submitted by the respective companies and might not contain the latest information. If anyone knows of a better data source, please let me know and I’ll update this visualization. For Yahoo! and Facebook, I found other figures for the sizes of their Hadoop clusters and used those instead of the wiki entries. Given the fast growth of big data adoption in internet companies, we’ll likely see rapid shifts in the leaderboard in the next year or so.
Sizes are hard to compare without a standard unit of measure. I compared the cluster sizes by number of nodes, rather than by data volume (up to petabytes!) or CPU cores. If the company didn’t indicate the number of nodes, I made some assumptions. I considered one server to be the same as one node, regardless of the number of CPU cores. Where only a core count was given, I assumed a typical node to be a quad-core server, so, for example, I treated Quantcast’s 3000-core installation as equivalent to 750 nodes. If the company listed a range of sizes, I used the higher end of the range.
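The normalization rules above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the script used for the visualization: the helper name `estimate_nodes` and the input phrasings (such as "10-15 nodes") are hypothetical, and the sketch assumes each description mentions only one kind of unit.

```python
import re

# Assumption taken from the methodology: a typical node is a quad-core server.
CORES_PER_NODE = 4


def estimate_nodes(description: str) -> int:
    """Estimate a node count from a free-text cluster size description.

    Hypothetical helper: the input phrasing is an assumed format,
    not the exact wording used on the wiki page.
    """
    # Collect every integer in the text, allowing thousands separators.
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", description)]
    if not numbers:
        raise ValueError(f"no size found in {description!r}")

    # If a range was given, use the higher end.
    size = max(numbers)

    # Core counts are converted to nodes; servers and nodes count one-to-one.
    if "core" in description.lower():
        return size // CORES_PER_NODE
    return size


quantcast_nodes = estimate_nodes("3000 cores")       # cores divided by 4
range_nodes = estimate_nodes("10-15 nodes")          # higher end of the range
```

With these rules, a 3000-core installation comes out as 750 nodes and a 128-core installation as 32 nodes, matching the estimates shown in the appendix.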
Other Companies
Notably, some well-known companies such as Google and Walmart are not listed here. Google uses its own proprietary MapReduce implementation and the Google File System instead of open-source Hadoop. Additionally, Google keeps its data center architecture a closely held secret, although journalists occasionally get a glimpse behind the scenes and take a guess at the full size of its infrastructure. Other companies are missing simply because they did not disclose their Hadoop cluster sizes on the Apache Hadoop Wiki. For all the companies in this visualization, see the appendix below.
See Also
Additionally, see the Top Big Data Companies Using Hadoop for a more thorough list of companies.
Also see Up and Coming Big Data Vendors in 2013.
Appendix: List of Companies with Number of Hadoop Nodes
| Company | Nodes |
| --- | --- |
| A9.com | 100 |
| Accela Communications | 10 |
| Adobe | 30 |
| adyard | 12 |
| Able Grape | 2 |
| Adknowledge | 200 |
| Aguja | 3 |
| Alibaba | 15 |
| AOL | 150 |
| ARA.COM.TR | 100 |
| Archive.is | 3 |
| BabaCar | 4 |
| Basenfasten | 4 |
| Benipal Technologies | 35 |
| Beebler | 14 |
| Bixo Labs | 20 |
| Brilig | 10 |
| Brockmann Consult GmbH | 20 |
| Caree.rs | 15 |
| Charleston | 15 |
| Contextweb | 50 |
| Cooliris | 15 |
| Cornell | 100 |
| CRS4 | 400 |
| crowdmedia | 5 |
| Datagraph | 20 |
| Deepdyve | 80 |
| Detektei Berlin | 3 |
| Detikcom | 9 |
| devdaily.com | 3 |
| EBay | 532 |
| eCircle | 120 |
| Enet | 5 |
| Enormo | 4 |
| ESPOL University | 4 |
| ETH Zurich Systems Group | 16 |
| Explore.To | 80 |
|  | 1400 |
| FOX Audience Network | 140 |
| Forward3D | 24 |
| GBIF | 18 |
| GIS.FCU | 3 |
| Gruter. Corp. | 30 |
| Gewinnspiele | 6 |
| GumGum | 9 |
| Hadoop Korean User Group | 50 |
| Hotels & Accommodation | 3 |
| Hulu | 13 |
| Hundeshagen | 6 |
| Hosting Habitat | 6 |
| IIIT | 30 |
| IMVU | 4 |
| Information Sciences Institute | 18 |
| Infochimps | 30 |
| Inmobi | 150 |
| Iterend | 10 |
| Kalooga | 20 |
| Clic | 10 |
| Last.fm | 100 |
| Lineberger Comprehensive Cancer Center | 8 |
| MicroCode | 18 |
| Media 6 Degrees | 20 |
| Mercadolibre.com | 20 |
| MobileAnalytic.TV | 2 |
| MyLife | 18 |
| Neptune | 200 |
| NetSeer | 1050 |
| Openstat | 50 |
| PCPhase | 4 |
| Powerset / Microsoft | 400 |
| Pronux | 4 |
| PokerTableStats | 2 |
| Portabilité | 50 |
| PSG Tech | 10 |
| Quantcast | 3000 cores (est. 750 nodes) |
| Rackspace | 30 |
| Rakuten | 69 |
| Rapleaf | 80 |
| Recruit | 50 |
| Redpoll | 35 |
| Resu.me | 5 |
| RightNow Technologies | 16 |
| Rovi Corporation | 40 |
| Search Wikia | 125 |
| SLC Security Services LLC | 18 |
| Sling Media | 10 |
| Socialmedia.com | 14 |
| Specific Media | 138 |
| Spotify | 120 |
| Taragana | 16 |
| The Lydia News Analysis Project | 120 |
| Tailsweep | 8 |
| Technical analysis and Stock Research | 23 |
| Tegatai | 32 |
| Telefonica Research | 6 |
| Telenav | 60 |
| Tepgo | 3 |
| Tynt | 94 |
| Universidad Distrital Francisco Jose de Caldas | 5 |
| University of Freiburg | 10 |
| University of Glasgow | 30 |
| University of Twente | 16 |
| Visible Measures Corporation | 128 cores (est. 32 nodes) |
| Webmaster Site | 4 |
| WorldLingo | 44 |
| Yahoo! | 42,000 |
| Zvents | 10 |
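
Since the appendix is alphabetical, a short script can turn it into a ranking by node count. The following is a minimal sketch, assuming rows in the same `| Company | Nodes |` form as the table above; the `parse_row` helper is hypothetical, and only a handful of rows are included for brevity.

```python
import re


def parse_row(row: str) -> tuple[str, int]:
    """Split a '| Company | Nodes |' row into (company, node_count)."""
    company, nodes = [cell.strip() for cell in row.strip().strip("|").split("|")]
    # Entries given in cores carry an estimate such as "(est. 32 nodes)";
    # prefer that estimate when it is present.
    estimate = re.search(r"est\.?\s*([\d,]+)\s*nodes", nodes)
    digits = estimate.group(1) if estimate else nodes
    return company, int(digits.replace(",", ""))


# A few rows copied from the appendix table.
rows = [
    "| EBay | 532 |",
    "| NetSeer | 1050 |",
    "| Quantcast | 3000 cores (est 750 nodes) |",
    "| Spotify | 120 |",
    "| Yahoo! | 42,000 |",
]

# Sort by node count, largest cluster first.
ranked = sorted((parse_row(r) for r in rows), key=lambda pair: pair[1], reverse=True)

for company, nodes in ranked:
    print(f"{company}: {nodes}")
```

Rows with a blank company or a non-numeric node cell would need extra handling before being passed to `parse_row`.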
Sources:
Apache Hadoop Wiki last updated December 20, 2012
Facebook has the world’s largest Hadoop cluster! Facebook has 2000 nodes in 2010
Yahoo! has up to 42,000 nodes in its Hadoop grids in July 2011, from the Hortonworks Hadoop Summit 2011 keynote and Petabyte-scale Hadoop clusters

2 comments
Jimmy Wong
February 4, 2013 at 10:59 pm (UTC -7)
The LinkedIn cluster size was updated on the Hadoop Wiki from 1900 to 4100 on Jan. 25, 2013. I updated the table in this article, but not the wordcloud.
Source: http://wiki.apache.org/hadoop/PoweredBy?action=diff&rev2=415&rev1=413
Jimmy Wong
April 5, 2013 at 11:53 pm (UTC -7)
Sears uses Hadoop. Its subsidiary MetaScale provides Hadoop hosting services. http://www.hadoopwizard.com/sears-now-sells-big-data-services-via-metascale/