Preventing Netflix from crashing

Anyone who watches Netflix on their TV has probably had the app hang or quit on them more than once. This year, I was fortunate to work with the amazing people at Netflix and develop machine learning-based technology for a rather unconventional application: predicting when the app will crash on your TV set.

At the forefront of technological advancement, the Bay Area prides itself on developments in content, streaming, compression, memory management, and distributed systems, among other technologies. Back in 2009, companies started migrating to microservices architectures to cope with scaling applications and to keep up with the move to cloud services. Netflix was one of the first companies to do so, with Google, Airbnb, Amazon, and others soon following suit. Case studies of these architectures still echo through the halls of the Gates and Hillman centers, where CMU’s School of Computer Science has actively contributed to several state-of-the-art compression algorithms, database innovations, and microservices architectures. Courses like Advanced Cloud Computing (15-719) and Distributed Systems (15-640) cover these technologies in great detail.

When Netflix crashes, what has happened behind the screen is that the TV has run out of memory. It is a peculiar problem to solve on TV sets, which hold many video buffers (portions of memory that store video data) but have minimal computing power and need to react quickly. Streaming services and other companies care about this because a bad user experience may keep the user from returning to the app for the rest of that session, and the company may lose a potential customer as a result.

Working with a mix of partners and engineers at Netflix on devices like Roku and Samsung TVs, I was able to draw on the domain knowledge of these teams to create a solution that is deployed locally on your TV set and predicts when the Netflix app is about to run out of memory and crash. Our solution used a type of machine learning model called a random forest. This random forest model is unique because it is trained on a mammoth dataset of users watching Netflix on their devices, and it leverages the Netflix Big Data Platform to create a labeled dataset. By examining the history of crashes, we train the model to classify major and minor crashes before they occur and clear the buffer accordingly, so that you have a seamless experience watching Netflix. The problem is made even harder because most of the streaming history is uneventful (errors may occur only once in several weeks), which gives rise to a sparse and highly skewed dataset.
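To see why a skewed dataset makes this hard, consider a minimal Python sketch. Everything in it is an assumption for illustration: the labels are synthetic (1,000 hypothetical sessions, 5 of which crash), the function names are made up, and the inverse-frequency class weighting shown is one common way to counter skew, not necessarily what was used in the Netflix work.

```python
# Illustrative only: synthetic labels, not Netflix data.
# Label 1 = session ended in an out-of-memory crash, 0 = session was fine.


def confusion_counts(y_true, y_pred):
    """Return (true_pos, false_pos, false_neg, true_neg) for binary labels."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of sessions labeled correctly."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of real crashes that were predicted."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


def inverse_frequency_weights(y_true):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = {}
    for label in y_true:
        counts[label] = counts.get(label, 0) + 1
    n_samples, n_classes = len(y_true), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


# Hypothetical skewed dataset: 5 crashes in 1,000 sessions.
labels = [1] * 5 + [0] * 995

# A useless "model" that always predicts that nothing will crash.
always_fine = [0] * len(labels)

print("accuracy:", accuracy(labels, always_fine))  # 0.995
print("recall:  ", recall(labels, always_fine))    # 0.0
print("weights: ", inverse_frequency_weights(labels))
```

A predictor that never warns of a crash still scores 99.5 percent accuracy here while catching none of the crashes, which is why accuracy alone says little on data like this. With the weighting above, each crash example counts 199 times as much as a normal session during training, so the rare class is not drowned out.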

Despite the challenges, the work was innovative enough for Netflix to file for a patent on the product, based on the novelty of using machine learning to predict TV-based app crashes. The results were published on the Netflix Tech Blog, which has earned great respect in Silicon Valley for the innovations that it publishes and open sources. As a computer science student at one of the best CS programs in the world here at Carnegie Mellon, I think such opportunities should reassure the Tartan community that every single concept we learn here, from distributed systems and data processing to microservices and machine learning, is applied out there in the real world.