Tag Archives: facebook

Want To Scale Like Google, Amazon? Design Your Own Data Center

Google is joining the Open Compute Project (OCP), the community-driven open project founded by Facebook for standardizing IT infrastructure. OCP’s mission statement is to

“break open the black box of proprietary IT infrastructure to achieve greater choice, customization, and cost savings”

Google strategically announced joining the OCP at last week’s OCP Summit, together with its first contribution: a new energy-efficient rack specification that includes 48V power distribution. According to Google, the new rack design is at least 30% more energy efficient and more cost-effective in supporting its higher-performance systems.

In addition to Facebook and Google, the OCP includes other big names such as Intel, Goldman Sachs, Microsoft and Deutsche Telekom. The member list also includes traditional server and networking manufacturers such as Ericsson, Cisco, HP and Lenovo, which are expected to be seriously disrupted by the new open-standards initiative as it undermines their domination of this $140B industry.


Google already made an important move last year, sharing its next-generation data center network architecture. In last week’s announcement, Google hinted at additional upcoming contributions to OCP, such as better disk solutions for cloud-based applications. In his post, John Zipfel shared Google’s longer-term vision for OCP:

And we think that we can work with OCP to go even further, looking up the software stack to standardize server and networking management systems.

Google and Facebook are among the “big guys” running massive data centers and infrastructure, whose sheer scale drove them to drop commodity IT infrastructure and start developing their own in-house optimized infrastructure to reduce costs and improve performance.

Amazon is another such big guy, especially with the massive infrastructure required to power Amazon Web Services, which holds the lion’s share of the public cloud market, followed by Microsoft and Google (both OCP members). In an interview last week, Amazon’s CTO Werner Vogels said:

“To be able to operate at scale like we do it makes sense to start designing your own server infrastructure as well as your network. There is great advantages in [doing so].”

With the growing popularity of cloud computing, many of the “smaller guys” (even enterprises and banks) will migrate their IT to a cloud hosting service, sparing them from buying and managing their own infrastructure, which in turn means even more of the world’s IT will sit with the “big guys”. To add to this, the public cloud market is undergoing consolidation, with big names such as HP, Verizon and Dell dropping out of the race, which would leave most of the world’s IT in the hands of a few top-tier cloud vendors and Facebook-scale giants. These truly “big guys” will not settle for anything short of the best for their IT.

Follow Dotan on Twitter!

Update: At the GCP Next conference the following week, Google released a 360° virtual tour of its data center. See more here.



Filed under Cloud, IT

Live Video Streaming At Facebook Scale

Operating at Facebook scale is far from trivial. With 1.49 billion monthly active users (growing 13 percent yearly), every 60 seconds on Facebook 510 comments are posted, 293,000 statuses are updated, and 136,000 photos are uploaded. Therein lies the challenge of serving the masses efficiently and reliably, without outages.

For serving offline content, whether text (updates, comments, etc.), photos or videos, Facebook developed a sophisticated architecture that includes state-of-the-art data center technology and a search engine to traverse and fetch content quickly and efficiently.

But now comes a new type of challenge: a few months ago Facebook rolled out a new live-streaming service called Live for Facebook Mentions, which allows celebs to broadcast live video to their followers. This service is quite similar to Twitter’s Periscope (acquired by Twitter at the beginning of this year) and the popular Meerkat app, which offer their live video streaming services to everyone, not just celebs. In fact, Facebook announced this month that it is piloting a new service that will offer live streaming to the general public as well.


While offline photos and videos are uploaded in full and then distributed and made accessible to followers and friends, serving live video streams is much more challenging to implement at scale. To make things even harder, the viral nature of social media (and of celeb content in particular) often creates spikes where thousands of followers demand the same popular content at the same time, a phenomenon the Facebook team calls the “thundering herd” problem.

An interesting post by Facebook engineering shares these challenges and the approaches taken. Facebook’s system uses a Content Delivery Network (CDN) architecture with two-layer caching of the content: the edge cache sits closest to the users and serves 98 percent of the content. This design aims to reduce the load on the backend server processing the incoming live feed from the broadcaster. Another useful optimization for further reducing backend load is request coalescing: when many followers (in the case of celebs it could reach millions!) ask for content that is missing from the cache (a cache miss), only one request proceeds to the backend to fetch the content on behalf of all the others, avoiding a flood.
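To illustrate, here is a minimal, hypothetical sketch of request coalescing in Python. The names and structure are my own, not Facebook’s implementation: on a cache miss, the first request becomes the “leader” and fetches from the backend, while concurrent requests for the same key simply wait for its result.

```python
import threading

class CoalescingCache:
    """Toy request-coalescing cache (hypothetical, for illustration only).

    On a cache miss, the first caller for a key fetches from the backend;
    concurrent callers for the same key block on an Event instead of
    issuing duplicate backend requests."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin   # backend fetch function (assumed)
        self._cache = {}                  # key -> cached value
        self._inflight = {}               # key -> Event for a pending fetch
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            if key in self._cache:        # cache hit: serve from cache
                return self._cache[key]
            event = self._inflight.get(key)
            if event is None:             # first miss: become the leader
                event = threading.Event()
                self._inflight[key] = event
                is_leader = True
            else:                         # coalesce: wait for the leader
                is_leader = False
        if is_leader:
            value = self._fetch(key)      # only one backend request per key
            with self._lock:
                self._cache[key] = value
                del self._inflight[key]
            event.set()                   # wake up all waiting followers
            return value
        event.wait()
        with self._lock:
            return self._cache[key]
```

With twenty concurrent readers asking for the same missing key, the backend should see exactly one fetch, which is the whole point of the optimization.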


It’s interesting to note that the celebs’ service and the newer public service involve different considerations and trade-offs between throughput and latency, which led Facebook’s engineering team to adapt the architecture to the new service:

Where building Live for Facebook Mentions was an exercise in making sure the system didn’t get overloaded, building Live for people was an exercise in reducing latency.

The content itself is broken down into tiny segments of multiplexed audio and video for more efficient distribution and lower latency. The new Live service (for the general public) even called for changing the underlying streaming protocol to improve latency further, reducing the lag between broadcaster and viewer by 5x.
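As a toy illustration of the segmenting idea (not Facebook’s actual pipeline or protocol), breaking a continuous byte stream into small, independently distributable segments lets viewers start receiving data long before the broadcast ends:

```python
def segment_stream(av_stream, segment_bytes=4096):
    """Toy segmenter (hypothetical): slice an incoming stream of byte
    chunks into fixed-size segments that can be cached and distributed
    independently, instead of waiting for the full recording."""
    buf = b""
    for chunk in av_stream:
        buf += chunk
        while len(buf) >= segment_bytes:  # emit each full segment as soon as ready
            yield buf[:segment_bytes]
            buf = buf[segment_bytes:]
    if buf:
        yield buf                          # flush the trailing partial segment
```

Because each segment is emitted as soon as it fills, downstream caches can begin serving the first segments while the broadcaster is still producing later ones.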

This is a fascinating exercise in scalable architecture for live streaming, which is said to scale effectively to millions of broadcasters. Such open discussions can pave the way for smaller companies in social media, the Internet of Things (IoT) and the ever-more-connected world. You can read the full post here.

Follow Horovits on Twitter!

Leave a comment

Filed under Solution Architecture, Uncategorized

A Tale of Two (More) Outages, Featuring Facebook, Instagram and Poor Tinder

Last week we started off with a tale of two outages by Amazon’s cloud and Microsoft’s Skype, showing us what it’s like when you can’t make Skype calls, watch your favorite show on Netflix, or command your smart-home personal assistant.

This week, however, we got a taste of what it’s like to be cut off from the social network, with both Facebook and Instagram suffering outages of around an hour. No formal explanation from either as of yet. It’s interesting to note that Tinder was hit by both last week’s and this week’s outages, despite the very different sources (more on that below).


This was Facebook’s 2nd outage in less than a week, and the 3rd this month. Not the best record, even compared to last year. For Instagram it’s not the first outage either. In fact, both Facebook and Instagram suffered an outage together at the beginning of this year, which shows how tightly Instagram’s system was coupled with Facebook’s services (and vulnerabilities) following the acquisition. The coupling stirred up the user community around the globe:


Last week’s Facebook outage took down the service for 2.5 hours, in what Facebook described as “the worst outage we’ve had in over four years”. Facebook later published a detailed technical post explaining that the root cause of the failure was a configuration issue:

An automated system for verifying configuration values ended up causing much more damage than it fixed.

Although this automated system is designed to prevent configuration problems, this time it caused them. This just shows that even the most rigorous safeguards have limitations and no system is immune, not even the major cloud vendors. We saw configuration problems take down Amazon’s cloud last week and Microsoft’s cloud late last year, to recall just a few.

Applications that rely on this infrastructure are repeatedly affected by these outages. One good example is Tinder, which was affected last week by Amazon’s outage, as it runs on Amazon Web Services, and again this week, this time probably due to its use of Facebook services. The good news is that although outages are bound to happen, there are things you can do to reduce the impact on your system. If you find that interesting, I highly recommend having a look at last week’s post.

Follow Dotan on Twitter!

1 Comment

Filed under Cloud

Facebook Shares Open Networking Switch Design, Part of its Next Gen Networking

Facebook’s enormous scale comes with enormous technological challenges, which go beyond conventionally available solutions. For example, Facebook decided to abandon Microsoft’s Bing search engine and instead develop its own revamped search capabilities. Another important area is Facebook’s massive networking needs, which called for a whole new paradigm, code-named “data center fabric”.


The next step in Facebook’s next-gen networking architecture is “6-pack”, a new open and modular switch announced just a few days ago. It’s interesting to note that Facebook chose to announce the new switch the same day Cisco reported its earnings. This is more than a hint at the networking equipment giant, which represents “traditional networking”. As Facebook says in its announcement, it started the quest for next-gen networking due to

the limits of traditional networking technologies, which tend to be too closed, too monolithic, and too iterative for the scale at which we operate and the pace at which we move.

The new “6-pack” is a modular high-volume switch built on merchant-silicon-based hardware. It enables building a switch of any size from a simple set of common building blocks. The design uses a hybrid Software-Defined Networking (SDN) approach: while classic SDN separates the control plane from the forwarding plane and centralizes control decisions, in Facebook’s hybrid architecture each switching element contains a full local control plane on a microserver that communicates with a centralized controller.
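To make the hybrid idea concrete, here is a deliberately simplified Python model (all names are hypothetical and this is not Facebook’s design): each switch element keeps its own local forwarding state and makes purely local decisions where it can, while syncing with a centralized controller for the global view.

```python
class CentralController:
    """Toy central controller: collects reachability reports from switch
    elements and computes a global view (illustrative model only)."""

    def __init__(self):
        self.reports = {}                 # switch_id -> destinations it can reach

    def report(self, switch_id, local_destinations):
        self.reports[switch_id] = set(local_destinations)

    def global_routes(self, switch_id):
        # For each destination known elsewhere, name a switch that reaches it.
        routes = {}
        for peer, dests in self.reports.items():
            if peer == switch_id:
                continue
            for dest in dests:
                routes.setdefault(dest, peer)
        return routes

class SwitchElement:
    """Toy switch element with a full local control plane: it forwards to
    directly attached hosts on its own, and consults routes synced from
    the central controller for everything else."""

    def __init__(self, switch_id, controller):
        self.switch_id = switch_id
        self.controller = controller
        self.local_ports = {}             # destination -> local port (local decisions)
        self.remote_routes = {}           # destination -> next-hop switch (synced)

    def attach(self, destination, port):
        self.local_ports[destination] = port

    def sync(self):
        # Report local reachability up; pull the global view back down.
        self.controller.report(self.switch_id, self.local_ports)
        self.remote_routes = self.controller.global_routes(self.switch_id)

    def next_hop(self, destination):
        if destination in self.local_ports:       # purely local control plane
            return ("port", self.local_ports[destination])
        if destination in self.remote_routes:     # learned via the controller
            return ("switch", self.remote_routes[destination])
        return None                               # unknown destination
```

The design point this sketch tries to capture is that a switch element keeps working on local decisions even between syncs, while the controller only supplies the global routes no single element could compute alone.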

Facebook made the design of “6-pack” open as part of the Open Compute Project, together with all the other components of its data center fabric. This is certainly not good news for Cisco and the other vendors, but great news for the community. You can find the full technical design details in Facebook’s post.

Facebook is not the only one on the front line of scaling challenges. The open cloud community OpenStack, as well as the leading public cloud vendors Google and Amazon, have also shared networking strategies to meet the new challenges that come with new workloads in modern cloud computing environments.

Cloud and Big Data innovations were born out of necessity in IT, driven by companies with the most challenging use cases and backed by open communities. The same innovation is now happening in networking, paving the way to simpler, scalable, virtual and programmable networking based on merchant silicon.

Follow Dotan on Twitter!


Filed under Cloud, IT, SDN

Facebook Dumps Bing, Focuses on Its Revamped Big Data Search Capabilities

“Facebook dumps Microsoft Web search results”, that was the juicy headline for Reuters’ scoop a couple of days ago:

The move, confirmed by a company spokesperson, comes as Facebook has revamped its own search offerings, introducing a tool on Monday that allows users to quickly find past comments and other information posted by their friends on Facebook.

In my post after Facebook’s announcement of the new search tool, I speculated that these Google-like search capabilities were Facebook’s aim. Now it appears that this was indeed the case, with Facebook realizing that the true value lies not in searching generic web content but rather in searching its valuable user history (posts, likes, comments, etc.), which reflects the things that truly interest people:

We’re not currently showing web search results in Facebook Search because we’re focused on helping people find what’s been shared with them on Facebook

Facebook has been focusing on big data management in general, and search capabilities in particular, for a while now, investing heavily both in internal research and in collaboration with the open-source community and academia. The result is an elaborate big data analytics engine under the hood of the new search tool.

Building a big data search engine is hardly trivial. Microsoft has gained a lot of mileage with Bing, and Google reportedly holds 67.6 percent of the U.S. search engine market. However, Facebook’s database of its users’ interactions may very well be the asset that tips the balance in favor of the newcomer. As Facebook’s CEO Zuckerberg said last July:

There is more than a trillion posts, which some of the search engineers on the team like to remind me, is bigger than any Web search corpus out there

Having this valuable big data, combined with Facebook’s aggressive innovation and collaboration with the community, will enable it to close the gap, and perhaps take the lead.

Follow Dotan on Twitter!


Filed under Big Data, Real Time Analytics

Facebook’s Big Data Analytics Boosts Search Capabilities

A few days ago Facebook announced its new search capabilities. These are Google-like capabilities for searching your history, the feature that was the crown jewel of Google+, Google’s attempt to fight off Facebook. Want to find that funny thing you posted when you took the ice bucket challenge a few months ago? It’s now easier than ever. And it’s now supported on your phone as well.

facebook ice bucket challenge search

You may think this is a simple (yet highly useful) feature. But when you come to think of it, it’s quite a challenge, considering the 1.3 billion active users generating millions of events per second. The likes of Facebook, Google and Twitter cannot settle for traditional processing capabilities, and need to develop innovative approaches to stream processing at high volume.

A challenge just as big comes with queries: Facebook’s big data stores hold tens of petabytes and serve hundreds of thousands of queries per day. Serving such volumes while keeping most response times under 1 second is hardly the type of challenge traditional databases were built for.

These challenges called for an innovative approach. For example, Facebook’s Data Infrastructure Team developed and open-sourced Hive, the popular Hadoop-based software framework for Big Data queries. Facebook also took an innovative approach to building its data centers, both in the design of the servers and in its next-gen networking, designed to meet the high and constantly increasing traffic volumes within its data centers.


Facebook is taking its data challenge very seriously, investing in internal research as well as in collaboration with academia and the open-source community. In a data faculty summit hosted by Facebook a few months ago, Facebook shared its top open data problems, raising many interesting challenges in managing Small Data, Big Data and the related hardware. With the announced release of Facebook Search for mobile, I was reminded in particular of the challenges raised at that summit about adapting their systems to the mobile realm: where the network is flaky, where much of the content is pre-fetched rather than pulled on demand, and where privacy checks need to be done much earlier in the process. The recent release may indicate new solutions to these challenges; I look forward to hearing some insights from the technical team.

Facebook, Twitter and the like face these Big Data challenges early on. As I said before:

These volumes challenge the traditional paradigms and trigger innovative approaches. I would keep a close eye on Facebook as a case study for the challenges we’d all face very soon.


Follow Dotan on Twitter!


Filed under Big Data, Real Time Analytics, Solution Architecture

Facebook Shares Its Next Gen Networking

In this age of cloud-based services, social media and the Internet of Things, when everyone and everything is connected and even our once-local assets such as our documents, spreadsheets and photos are now stored and edited online, network connectivity has become more precious than gold. Naturally, the biggest players with the biggest workloads face the challenges first, and pave the way beyond current technologies, protocols and methodologies. Recently we got great case studies when Amazon and Google shared their next-gen networking strategies.

Another major player that recently shared its next-gen networking strategy is Facebook. In a detailed blog post, Alexey Andreyev, a Facebook network engineer, shared a detailed technical overview of their new “data center fabric” that was piloted in their Altoona data center. This caught the attention of GigaOm, which last week invited Facebook’s Director of Network Engineering Najam Ahmad to a dedicated podcast to gain some more insight.


Facebook moved away from the old cluster-based architecture to a modern fabric-based one. This helped it escape the endless race for bleeding-edge, high-end networking equipment and the associated vendor lock-in:

To build the biggest clusters we needed the biggest networking devices, and those devices are available only from a limited set of vendors.

Another interesting point was the move to a bottom-up Software-Defined Networking (SDN) approach:

The only difference is that we’re essentially saying that we don’t want to build the networks in the traditional way. We want to build them in more of the SDN philosophy, and the vendors need to catch up, and so whoever provides the solution will be part of the system overall.

We see the trend of SDN and virtual networking with vendors such as Amazon and Google as well, and in the cloud community, as was evident at the last OpenStack Summit. I expect network virtualization and software-defined methodologies will become even more prominent in Facebook’s architecture as it evolves and as Facebook’s volumes and complexity grow.

Facebook is a great example of an online company at the largest scale, with more than 1.35 billion users around the globe, a diverse set of services, applications and workloads, and an ever-increasing traffic volume (the vast majority of which is machine-to-machine). These volumes challenge the traditional paradigms and trigger innovative approaches. I would keep a close eye on Facebook as a case study for the challenges we’d all face very soon.


Update: In February 2015, Facebook shared details of “6-pack”, a new open and modular switch at the heart of its data center networking architecture. You can read more about it in this post.

Follow Dotan on Twitter!


Filed under Cloud, SDN