Live Video Streaming At Facebook Scale

Operating at Facebook scale is far from trivial. With 1.49 billion monthly active users (and growing 13 percent yearly), every 60 seconds on Facebook 510 comments are posted, 293,000 statuses are updated, and 136,000 photos are uploaded. Therein lies the challenge: serving the masses efficiently and reliably, without outages.

To serve offline content, whether text (updates, comments, etc.), photos or videos, Facebook has developed a sophisticated architecture that includes state-of-the-art data center technology and a search engine to traverse and fetch content quickly and efficiently.

But now comes a new type of challenge: a few months ago Facebook rolled out a new live-streaming service called Live for Facebook Mentions, which allows celebrities to broadcast live video to their followers. The service is quite similar to Twitter’s Periscope (acquired by Twitter at the beginning of this year) and the popular Meerkat app, both of which offer live video streaming to everyone, not just celebrities. In fact, Facebook announced this month that it is piloting a new service that will offer live streaming to the general public as well.


While offline photos and videos are uploaded in full and only then distributed and made accessible to followers and friends, serving live video streams is much more challenging to implement at scale. To make things harder, the viral nature of social media (and of celebrity content in particular) often creates spikes in which thousands of followers demand the same popular content at the same time, a phenomenon the Facebook team calls the “thundering herd” problem.

An interesting post by Facebook Engineering shares information on these challenges and the approaches the team took. Facebook’s system uses a Content Delivery Network (CDN) architecture with two layers of caching, the edge cache sitting closest to the users and serving 98 percent of the content. This design reduces the load on the backend server that processes the incoming live feed from the broadcaster. Another useful optimization for further reducing backend load is request coalescing: when many followers (for celebrities it could reach millions!) ask for content that is missing from the cache (a cache miss), only one request proceeds to the backend to fetch the content on behalf of all of them, avoiding a flood.
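To make the idea concrete, here is a minimal sketch of request coalescing in Python. It is a toy in-process cache, not Facebook’s actual implementation, and the class and function names are made up for illustration: concurrent readers of the same missing key block on a single backend fetch instead of each hitting the origin.

```python
import threading

class CoalescingCache:
    """Toy edge cache: on a cache miss, the first caller fetches from the
    backend while concurrent callers for the same key wait and reuse that
    single result, avoiding a "thundering herd" on the origin."""

    def __init__(self, fetch_from_backend):
        self._fetch = fetch_from_backend  # slow call to the origin/backend
        self._cache = {}                  # key -> cached value
        self._inflight = {}               # key -> Event for a pending fetch
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            if key in self._cache:            # cache hit: serve from the edge
                return self._cache[key]
            event = self._inflight.get(key)
            if event is None:                 # first miss: this caller leads
                event = threading.Event()
                self._inflight[key] = event
                leader = True
            else:                             # a fetch is already in flight
                leader = False
        if leader:
            value = self._fetch(key)          # the one request to the backend
            with self._lock:
                self._cache[key] = value
                del self._inflight[key]
            event.set()                       # wake up all the waiters
            return value
        event.wait()                          # follower: wait for the leader
        with self._lock:
            return self._cache[key]
```

In a real CDN the same pattern is applied at both the edge and origin cache layers, so even millions of simultaneous viewers translate into only a handful of requests to the stream’s backend.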

[Diagram: Facebook’s live stream caching architecture]

It’s interesting to note that the celebrity service and the newer public service involve different considerations and trade-offs between throughput and latency, which led Facebook’s engineering team to adapt the architecture for the new service:

Where building Live for Facebook Mentions was an exercise in making sure the system didn’t get overloaded, building Live for people was an exercise in reducing latency.

The content itself is broken down into tiny segments of multiplexed audio and video for more efficient distribution and lower latency. The new Live service (for the general public) even called for changing the underlying streaming protocol to achieve even lower latency, reducing the lag between broadcaster and viewer by 5x.
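For illustration, here is a rough sketch of how a viewer-side client might consume such a segmented stream. The manifest URL and plain-text manifest format are invented for the example (real players use HLS/DASH-style playlists), but it shows why small segments keep the viewer close behind the broadcaster.

```python
import time
import urllib.request

# Hypothetical manifest URL, for illustration only.
MANIFEST_URL = "https://cdn.example.com/live/stream-123/manifest.txt"

def poll_live_stream(on_segment, poll_interval=1.0):
    """Toy live client: repeatedly re-fetch the manifest and download any
    newly published segments, handing each one to the decoder callback.
    Because each segment holds only a couple of seconds of multiplexed
    audio and video, the playback lag stays small."""
    seen = set()
    while True:
        with urllib.request.urlopen(MANIFEST_URL) as resp:
            segment_urls = resp.read().decode().splitlines()
        for url in segment_urls:
            if url not in seen:
                seen.add(url)
                with urllib.request.urlopen(url) as seg:
                    on_segment(seg.read())   # one small audio+video chunk
        time.sleep(poll_interval)
```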

This is a fascinating exercise in scalable architecture for live streaming, which is said to scale effectively to millions of broadcasters. Such open discussions can pave the way for smaller players in social media, the Internet of Things (IoT) and the ever-more-connected world. You can read the full post here.

Follow Horovits on Twitter!

HP Quits Public Cloud Race, Focusing On Hybrid Cloud For Enterprises

If you don’t see how the IT world is changing, just follow the recent tectonic shifts: while some tectonic plates merge (see Dell & EMC), others split (see the HP split). The big players are assessing their play in this new world of IT, where people and companies consume services rather than products, and where businesses run entire operations without owning “stuff” (think of the biggest taxi company not owning a single vehicle…). In this world, the game shifts from selling boxes and licenses to cloud-based services and open-source software and standards. And that’s a shift the big guys are now facing.

In its recent evaluation of the company’s future, HP (soon to be HP Enterprise) realized it cannot compete in the global public cloud arena, and decided to shut down its HP Helion Public Cloud offering on January 31, 2016. This arena is heavily dominated by Amazon, followed by Google and Microsoft; it requires a lot of upfront investment to gain significant global coverage, and it is locked in a fierce war over price and performance.


Instead, HP will focus on hybrid cloud, helping its traditional enterprise customers combine their on-premises data centers with different public cloud vendors. In effect, HP plans to partner with the big public cloud vendors. In a recent blog post, Bill Hilf, SVP and GM of HP Cloud, stated:

To support this new model, we will continue to aggressively grow our partner ecosystem and integrate different public cloud environments.

HP’s strategic choice to focus on hybrid cloud should come as no surprise. With the agenda of bringing hybrid cloud to enterprises, HP acquired Stackato from ActiveState three months ago. Late last year HP also acquired Eucalyptus, the open-source cloud software vendor, to “accelerate hybrid cloud adoption in the enterprise“, which paved HP’s way to offering compatibility with Amazon’s AWS cloud. On the Microsoft front, HP has been working to support the Azure cloud and the Office 365 SaaS offering. This may compete with Microsoft’s own hybrid cloud offering announced earlier this year, and Amazon is debating its position on hybrid cloud as well, so these partnerships will be interesting to watch. If formed well, they could lead HP to a true multi-cloud offering.


The big players are all eyeing how to bring the hybrid model to enterprises, where the big money lies and where complex environments, systems and constraints mandate such hybrid models and enterprise-grade tooling. We’ll also be seeing more use of open source, such as HP’s adoption of Eucalyptus, Cloud Foundry (for PaaS) and OpenStack. In fact, the OpenStack Summit opened today in Tokyo; it will be interesting to hear what HP executives have to say about the recent and expected moves for Helion.

You can read more about HP’s recent moves around cloud, containers, open source, the HP company split and more in this post.


Biggest Tech Takeover of All Time: Dell Acquires EMC For $67B

Can two veteran giants reinvent themselves by joining forces? That’s what Dell is betting on in acquiring EMC for a staggering $67 billion, the largest tech acquisition of all time. You heard it right: Dell is trying to swallow EMC, a serial acquirer that has bought over 20 companies in the past five years alone, in areas such as security, virtualization and of course storage, and that has recently been trying to make sense of that mammoth portfolio as part of its EMC Federation.

What’s the purpose of all that? I would look at the trends popping out of the press release:

The transaction… brings together strong capabilities in the fastest growing areas of the industry, including digital transformation, software-defined data center, hybrid cloud, converged infrastructure, mobile and security.

All the hot trends and buzzwords are there: the same trends that have made much of the two companies’ traditional businesses irrelevant in the modern age.

Beyond technology, there is also a cultural gap. Both companies have painfully discovered that the modern age is much less tolerant of vendor lock-in and black-box products, formats and protocols, and expects a more open approach. For example, EMC’s VMware, which dominated the enterprise virtualization realm, has become less relevant with the rise of cloud and containers. With this lesson learned, EMC teamed up with emerging open standards around containers, while VMware adopted OpenStack, the open-source cloud platform. Dell, too, has made its move to team up with open standards around the Internet of Things.

It’s not trivial for such giants to join forces and reinvent themselves. I would expect to see more modern approaches to their products and services, with more openness and a collaborative mindset as a means of regaining relevance, while they expand their offerings into the new technologies.

Follow Horovits on Twitter!

A Tale of Two (More) Outages, Featuring Facebook, Instagram and Poor Tinder

Last week we started off with a tale of two outages, at Amazon’s cloud and Microsoft’s Skype, showing what it’s like when you can’t make Skype calls, watch your favorite show on Netflix, or command your smart-home personal assistant.

This week, however, we got a taste of what it’s like to be cut off from the social network, with both Facebook and Instagram suffering outages of around an hour. There has been no formal explanation from either as of yet. It is interesting to note that Tinder was hit by both last week’s and this week’s outages, despite their very different sources (more on that below).

[Screenshot: Facebook down, 28 September 2015]

This was Facebook’s second outage in less than a week, and its third this month. Not the best record, even compared to last year. For Instagram it’s not the first outage either. In fact, Facebook and Instagram went down together at the beginning of this year, which shows how tightly Instagram’s system is coupled with Facebook’s services (and vulnerabilities) following the acquisition. The coupling stirred up the user community around the globe:

[Screenshot: Instagram down, 28 September 2015]

Last week’s Facebook outage took the service down for 2.5 hours, in what Facebook described as “the worst outage we’ve had in over four years”. Facebook later published a detailed technical post explaining that the root cause of the failure was a configuration issue:

An automated system for verifying configuration values ended up causing much more damage than it fixed.

Although this automated system is designed to prevent configuration problems, this time it caused them. It just shows that even the most rigorous safeguards have limitations and no system is immune, not even those of the major cloud vendors. We saw configuration problems take down Amazon’s cloud last week and Microsoft’s cloud late last year, to recall just a few.

Applications that rely on this infrastructure are repeatedly affected by the outages. One good example is Tinder, which was hit last week by Amazon’s outage because it runs on Amazon Web Services, and again this week, this time probably due to its use of Facebook services. The good news is that while outages are bound to happen, there are things you can do to reduce their impact on your system. If you find that interesting, I highly recommend having a look at last week’s post.

Follow Dotan on Twitter!

A Tale of Two Outages Featuring Amazon, Microsoft And An Un-Smart Home

Update: Following the subsequent official announcements from Amazon and Microsoft, I have updated the post with more information on the outages and relevant links.

Here it is again: a major outage in Amazon’s AWS data center in Northern Virginia took down the cloud service in Amazon’s biggest region, and with it a multitude of cloud-based services such as Netflix, Tinder, Airbnb and Wink. This is not the first time it has happened, and not even the worst; at least this time it didn’t last for days. This time it was DynamoDB that went down and took a host of other services with it, as Amazon describes in a lengthy blog post.

And Amazon is not alone. Microsoft also suffered a major outage today in its Skype service, rendering the popular VoIP service unusable. In its update, Skype reported that the root cause was a bad configuration change:

We released a larger-than-usual configuration change, which some versions of Skype were unable to process correctly therefore disconnecting users from the network. When these users tried to reconnect, heavy traffic was created and some of you were unable to use Skype’s free services …

This time it was Microsoft’s Skype service, but we have already seen how Microsoft’s Azure cloud can suffer a major outage as well, all on account of a configuration update.

This recent outage also exposed an interesting effect worth noting: until now, the impact was limited to online cloud services such as our movie or dating services. But now, with the penetration of the Internet of Things (IoT) into our homes, the effects of such a cloud outage reach far beyond, into our own homes and daily utilities, as nicely narrated in David Gewirtz’s piece on ZDNet: he tried voice-commanding his Amazon Echo (nicknamed “Alexa”) to turn on the lights and perform other home tasks, and was left unanswered during the outage. The loss of faith in the Alexas (they have two of them) that David described goes beyond the technology realm and into psychological effects that extend beyond my field of expertise.

One conclusion could be that cloud computing is bad and should not be used. That would of course be the wrong conclusion, certainly when compared to outages in private data centers. As I have highlighted in the past, following simple guidelines can significantly reduce the exposure of your cloud service to such infrastructure outages. If you are running a mission-critical system, you may find that relying on a single cloud provider is not enough and may wish to adopt a multi-cloud strategy to spread the risk, with disaster recovery policies between the providers. This will become increasingly important as the Internet of Things becomes ubiquitous in our homes and businesses, heavily promoted by Amazon, Google, Samsung and the like, which combine IoT with their own cloud services.
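As a minimal illustration of that idea, here is a sketch of application-level failover between two providers. The endpoints are hypothetical placeholders; a production multi-cloud setup would also rely on health checks, DNS or load-balancer failover and data replication rather than this naive loop.

```python
import urllib.request

# Hypothetical endpoints of the same service deployed on two independent clouds.
ENDPOINTS = [
    "https://api.primary-cloud.example.com",    # main deployment
    "https://api.secondary-cloud.example.com",  # standby on another provider
]

def call_with_failover(path, timeout=2):
    """Try each provider in order and fall back to the next one if the
    request fails or times out, so a single-provider outage does not
    take the whole service down."""
    last_error = None
    for base in ENDPOINTS:
        try:
            with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                return resp.read()
        except OSError as err:       # covers connection errors and timeouts
            last_error = err         # this provider is down, try the next
    raise RuntimeError(f"all providers failed: {last_error}")

# Example: check a smart-home device through whichever cloud is up.
# status = call_with_failover("/devices/front-door/status")
```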

One thing is for sure: if you connect your door locks to a cloud-based service, make sure you keep a copy of the good old physical key.

Follow Dotan on Twitter!

Google Unveils Its Next Gen Datacenter Network Architecture

The sheer size, scale and data distribution of organizations such as Google, Amazon and Facebook pose a new class of networking challenges, one that traditional networking vendors cannot meet. According to Google’s team:

Ten years ago, we realized that we could not purchase, at any price, a datacenter network that could meet the combination of our scale and speed requirements.

Facebook’s engineering team ran into very similar problems. Late last year Facebook published its data center networking architecture, called “data center fabric”, which is meant to address this exact challenge, and it has continued expanding the architecture this year.

Now Google is joining the game, sharing its in-house data center network architecture in a new paper published this week. The current (fifth) generation of Google’s architecture, called Jupiter, can deliver more than 1 petabit/sec of total bisection bandwidth, which means that each of 100,000 servers can communicate with any other in an arbitrary pattern at 10 Gb/s. The new architecture also brings substantially improved efficiency of the compute and storage infrastructure, and ultimately much higher utilization in job scheduling.
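The headline number is simply the per-server figure multiplied out; a quick back-of-the-envelope check of Google’s stated figures:

```python
servers = 100_000
per_server_gbps = 10                      # each server at 10 Gb/s
total_gbps = servers * per_server_gbps    # 1,000,000 Gb/s in aggregate
print(total_gbps / 1_000_000, "Pb/s")     # -> 1.0 Pb/s, matching "more than 1 petabit/sec"
```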

Google based its new networking architecture on the principles of Software-Defined Networking (SDN). Using the SDN approach, Google was able to move away from traditional distributed networking protocols, with their slow dissemination, high bandwidth overhead and manual switch configuration, to a single global configuration for the entire network that is pushed to all switches, with each switch taking its part of the scheme.
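As a toy illustration of that centralized model (not Google’s actual control plane; the names and data structures are invented), the sketch below keeps one global route map and derives each switch’s forwarding table from it, which is what gets pushed out:

```python
# Global view held by the controller: destination prefix -> owning switch.
GLOBAL_ROUTES = {
    "10.0.1.0/24": "switch-a",
    "10.0.2.0/24": "switch-b",
    "10.0.3.0/24": "switch-b",
}

# Physical topology: for each switch, the egress port towards every peer.
NEXT_HOP = {
    "switch-a": {"switch-b": "port-7"},
    "switch-b": {"switch-a": "port-3"},
}

def compile_flow_table(switch):
    """Derive one switch's forwarding entries from the single global config."""
    table = {}
    for prefix, owner in GLOBAL_ROUTES.items():
        if owner == switch:
            table[prefix] = "deliver-locally"
        else:
            table[prefix] = NEXT_HOP[switch][owner]   # forward towards the owner
    return table

def push_all():
    """The controller pushes each switch only its slice of the global scheme."""
    return {switch: compile_flow_table(switch) for switch in NEXT_HOP}

print(push_all())
```

The point of the sketch is the direction of information flow: configuration is computed centrally from a global view and pushed down, instead of being negotiated hop by hop by distributed protocols.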

Google has been an advocate of SDN for quite some time and is a member of the Open Networking Foundation (ONF), a consortium of industry leaders such as Facebook, Microsoft, Deutsche Telekom, Verizon and of course Google, which promotes open standards for SDN, primarily the OpenFlow project, which Google has fully adopted.

SDN and network virtualization have been major trends in the networking realm, especially in cloud-based deployments with their highly distributed, scalable and dynamic environments. All the major cloud vendors have been innovating in next-gen networking. Most notably, Google has been actively competing with Amazon on driving cloud networking to the next generation, presenting its Andromeda project for network virtualization.

The big players will continue to be at the forefront of the networking and scalability challenges of the new cloud and distributed era, and will lead innovation in the field. The open approach they have adopted, with open standards, open source and sharing with the community, will enable smaller players to benefit from this innovation and push the industry forward.

You can read Google’s paper on Jupiter here.

Follow Dotan on Twitter!

New ‘Cloud Native Computing Foundation’ Trying to Standardize on Cloud and Containers

The Cloud Native Computing Foundation (CNCF) is a new open standardization initiative recently formed under the Linux Foundation, with the mission of providing a standard reference architecture for cloud-native applications and services based on open-source software (OSS). The first such project is Google’s Kubernetes, which reached v1.0 the same day and was donated by Google to the foundation.

Google is one of the 22 founding members, together with big names such as IBM, Intel, Red Hat, VMware, AT&T, Cisco and Twitter, as well as important names in the containers realm such as Docker, Mesosphere, CoreOS and Joyent.

The announcement of the new foundation came only a few weeks after the announcement of the Open Container Initiative (OCI), also formed under the Linux Foundation. It is even more interesting to note that almost half of the founding companies of the CNCF are also among the founders of the OCI. According to the founders, the two initiatives are complementary: while the OCI focuses on standardizing the image and runtime format for containers, the CNCF targets the bigger picture of how to assemble components to address a comprehensive set of container application infrastructure needs, starting with the orchestration level, based on Kubernetes. This is the same bottom-up dynamic we see in most other initiatives and projects: standardize the infrastructure first, then continue upwards. Cloud computing evolved the same way, from IaaS to PaaS to SaaS, and Network Function Virtualization (NFV) evolved from the NFV Infrastructure to Management and Orchestration (MANO).

An open strategy has become the name of the game, and all the big companies realize that to take the technology out of its infancy and enable its adoption in large-scale production deployments in enterprises, they need to take the lead in the open field. Google’s Kubernetes and its recent contribution to the CNCF is one example. Now we’ll wait to see which other open-source ingredients will be incorporated, which blueprint will emerge, and how well it meets the industry’s varying use cases.

Follow Dotan on Twitter!