

Which Big Data Company has the World’s Biggest Hadoop Cluster?

Which companies use Hadoop to analyze big data?  How big are their clusters?  I thought it would be fun to compare companies by the size of their Hadoop installations.  Cluster size indicates a company’s investment in Hadoop, and, by extension, its appetite for big data products and services from vendors, as well as its hiring needs to support its analytics infrastructure.  See the wordcloud visualization below.

Wordcloud of Companies by their Relative Hadoop Cluster Sizes, Number of Nodes, Dec. 2012
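For the curious, a similar chart can be reproduced in a few lines of Python. This is a minimal sketch assuming the open-source wordcloud and matplotlib packages (the original image may well have been made with a different tool); it scales each company name by its node count:

```python
# Minimal sketch: render company names scaled by Hadoop node count.
# Requires: pip install wordcloud matplotlib
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# A few of the larger entries from the appendix table below;
# extend this dict with the full list for the complete picture.
nodes = {
    "Yahoo!": 42000,
    "LinkedIn": 4100,
    "Facebook": 1400,
    "NetSeer": 1050,
    "Quantcast": 750,
    "EBay": 532,
}

wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(nodes)  # word size tracks node count

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```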

Of the companies listed, Yahoo! has by far the most nodes, with over 42,000 across its massive Hadoop clusters as of July 2011.  Facebook, however, may have the largest publicized Hadoop data volume, even though it ran only about 2,000 nodes in 2010, far fewer than Yahoo!, but still a massive cluster.  LinkedIn is getting up there in node count too, as are a few others like Quantcast.  Many of the smaller companies on the list run their Hadoop clusters in the cloud, for example on Amazon EC2, and can therefore scale up quickly as needed.  Sharp-eyed readers wondering why a few notable companies are missing from this list should see further below.

Methodology

As the source data, I used the Apache Hadoop Wiki “Powered By Hadoop” page, last updated December 20, 2012.  The entries on that wiki page were submitted by the respective companies and might not reflect the latest information.  If anyone knows of a better data source, please let me know and I’ll update this visualization.  I did obtain other figures for the sizes of Yahoo!’s and Facebook’s Hadoop clusters, which I used instead for those specific data points.  Given the fast growth of big data adoption among internet companies, we’ll likely see rapid shifts in the leaderboard over the next year or so.

Sizes are hard to compare without a standard unit of measure.  I compared cluster sizes by number of nodes rather than by data volume (up to petabytes!) or CPU cores.  Where a company didn’t indicate a node count, I made some assumptions: I counted one server as one node regardless of its number of CPU cores, and I took a typical node to be a quad-core server, so, for example, Quantcast’s 3,000-core installation counts as 750 nodes.  If a company listed a range of sizes, I used the higher end of the range.  The sketch below makes these rules concrete.
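This is a minimal illustration of the estimation rules above; the function names and code are my own, not from any tool used for the article, and the quad-core figure is the assumption stated in the text:

```python
# Minimal sketch of the node-estimation rules described above.
CORES_PER_NODE = 4  # assumption: a typical node is a quad-core server

def estimate_nodes(count, unit="nodes"):
    """One server counts as one node; core counts are divided by 4."""
    if unit == "cores":
        return count // CORES_PER_NODE
    return count

def pick_from_range(low, high):
    """When a company lists a range of sizes, use the higher end."""
    return max(low, high)

# Worked examples from the article:
print(estimate_nodes(3000, "cores"))  # Quantcast -> 750 nodes
print(estimate_nodes(128, "cores"))   # Visible Measures -> 32 nodes
print(pick_from_range(1900, 4100))    # LinkedIn -> 4100 nodes
```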

Other Companies

Notably, some well-known companies, such as Google and Walmart, are not listed here.  Google uses its own proprietary MapReduce implementation and the Google File System instead of open-source Hadoop.  Google also keeps its data center architecture a closely held secret, though journalists occasionally get a behind-the-scenes glimpse and venture a guess at the full size of its infrastructure.  The other missing companies simply did not disclose their Hadoop cluster sizes on the Apache Hadoop Wiki.  For all the companies in this visualization, see the appendix below.

See Also

Additionally, see the Top Big Data Companies Using Hadoop for a more thorough list of companies.

Also see Up and Coming Big Data Vendors in 2013.

Appendix: List of Companies with Number of Hadoop Nodes

Company Nodes
A9.com 100
Accela Communications 10
Adobe 30
adyard 12
Able Grape 2
Adknowledge 200
Aguja 3
Alibaba 15
AOL 150
ARA.COM.TR 100
Archive.is 3
BabaCar 4
Basenfasten 4
Benipal Technologies 35
Beebler 14
Bixo Labs 20
Brilig 10
Brockmann Consult GmbH 20
Caree.rs 15
Charleston 15
Contextweb 50
Cooliris 15
Cornell 100
CRS4 400
crowdmedia 5
Datagraph 20
Deepdyve 80
Detektei Berlin 3
Detikcom 9
devdaily.com 3
EBay 532
eCircle 120
Enet 5
Enormo 4
ESPOL University 4
ETH Zurich Systems Group 16
Explore.To 80
Facebook 1400
FOX Audience Network 140
Forward3D 24
GBIF 18
GIS.FCU 3
Gruter. Corp. 30
Gewinnspiele 6
GumGum 9
Hadoop Korean User Group 50
Hotels & Accommodation 3
Hulu 13
Hundeshagen 6
Hosting Habitat 6
IIIT 30
IMVU 4
Information Sciences Institute 18
Infochimps 30
Inmobi 150
Iterend 10
Kalooga 20
Clic 10
Last.fm 100
Lineberger Comprehensive Cancer Center 8
LinkedIn 4100
MicroCode 18
Media 6 Degrees 20
Mercadolibre.com 20
MobileAnalytic.TV 2
MyLife 18
Neptune 200
NetSeer 1050
Openstat 50
PCPhase 4
Powerset / Microsoft 400
Pronux 4
PokerTableStats 2
Portabilité 50
PSG Tech 10
Quantcast 3000 cores (est. 750 nodes)
Rackspace 30
Rakuten 69
Rapleaf 80
Recruit 50
Redpoll 35
Resu.me 5
RightNow Technologies 16
Rovi Corporation 40
Search Wikia 125
SLC Security Services LLC 18
Sling Media 10
Socialmedia.com 14
Specific Media 138
Spotify 120
Taragana 16
The Lydia News Analysis Project 120
Tailsweep 8
Technical analysis and Stock Research 23
Tegatai 32
Telefonica Research 6
Telenav 60
Tepgo 3
Tynt 94
Universidad Distrital Francisco Jose de Caldas 5
University of Freiburg 10
University of Glasgow 30
University of Twente 16
Visible Measures Corporation 128 cores (est. 32 nodes)
Webmaster Site 4
WorldLingo 44
Yahoo! 42,000
Zvents 10

Sources:

Apache Hadoop Wiki “Powered By Hadoop” page, last updated December 20, 2012

Facebook has the world’s largest Hadoop cluster! (Facebook ran 2,000 nodes in 2010)

Yahoo! ran up to 42,000 nodes in its Hadoop grids as of July 2011, per the Hortonworks Hadoop Summit 2011 keynote and “Petabyte-scale Hadoop clusters”

About the author

Jimmy Wong

Jimmy crunches massive amounts of big data using Hadoop for online advertising and marketing at a public social networking company. He enjoys helping newbies learn more about applying technology to solve business problems. He can be found in the San Francisco Bay Area. For more info, see his http://about.me/jimmy.wong page.

(The views expressed by Jimmy on this blog are his personal opinions and do not represent his employer or other organizations.)

Permanent link to this article: http://www.hadoopwizard.com/which-big-data-company-has-the-worlds-biggest-hadoop-cluster/

2 comments

  1. Jimmy Wong

    The LinkedIn cluster size was updated on the Hadoop Wiki from 1900 to 4100 on Jan. 25, 2013. I updated the table in this article, but not the wordcloud.

    Source: http://wiki.apache.org/hadoop/PoweredBy?action=diff&rev2=415&rev1=413

  2. Jimmy Wong

    Sears uses Hadoop. Its subsidiary MetaScale provides Hadoop hosting services. http://www.hadoopwizard.com/sears-now-sells-big-data-services-via-metascale/
