Netflix is a video streaming service that holds a wealth of information about its users: their likes and dislikes, general viewing habits, retention lengths and much more.

Netflix uses this big data to commission original programming that it expects to succeed and be well received in the markets where it is released (O'Neill, 2016).

Netflix also runs A/B tests to determine which of several similar variants performs better. For example, when showing cover images for series or movies, it shows alternative images to users at random to determine which draws the stronger response from its user base.
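The cover image test described above can be sketched in a few lines of Python. This is a minimal illustration and not Netflix's actual method: the variant names, the function names and the use of a hash to split users pseudo-randomly are all assumptions made for the example.

```python
import hashlib

# Hypothetical artwork variants for one title; the names are illustrative only.
VARIANTS = ["cover_a", "cover_b"]


def assign_variant(user_id, title_id, variants=VARIANTS):
    """Pseudo-randomly map a user to one variant.

    Hashing (an assumption of this sketch) keeps the choice stable, so the
    same user sees the same image on every visit.
    """
    digest = hashlib.sha256(f"{title_id}:{user_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]


def click_through_rates(events):
    """Take (variant, clicked) pairs and return {variant: clicks / impressions}."""
    impressions, clicks = {}, {}
    for variant, clicked in events:
        impressions[variant] = impressions.get(variant, 0) + 1
        clicks[variant] = clicks.get(variant, 0) + (1 if clicked else 0)
    return {v: clicks[v] / impressions[v] for v in impressions}
```

Comparing the click-through rate of each variant indicates which image performs better, although a real test would also check that the difference is statistically significant before acting on it.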

As of Q4 2017, Netflix has around 120 million subscribed users and counting (Statista, 2017). With a steady growth rate year on year, it is important that the company uses its immense data aggregation and analytics to drive new business and support investment into the platform.

The number of titles in Netflix's database varies widely from country to country. A recent report on the largest licence zone, the United States, indicated that its database alone holds over 1000 television shows and almost 5000 movies.

With millions of users watching the service globally, this reportedly amounts to 400 billion titles a day and 8 million titles every second (MindSight, n.d.).

While a lot of the data is structured, such as show categorisations, actor information or user ratings, there is a massive amount of unstructured information that needs to be processed, such as general analytics, A/B test results, and per-user play and resume times for each title.

Even though the concept of a streaming service is quite simple, the many enhancements layered on top of it make the implementation much more complex.

HDFS is a distributed file system that handles large data sets running on commodity hardware (IBM, 2018). Each cluster has a single NameNode, which manages the file system namespace and regulates client access to files. A cluster can contain any number of DataNodes, usually one per machine in the cluster, and each DataNode manages the storage attached to the machine it runs on. Files are split into blocks, and each block is replicated across several DataNodes to guarantee a high level of fault tolerance and availability.
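To illustrate why replication gives fault tolerance, the following toy Python model splits a file into blocks and records which DataNodes hold a copy of each, much as a NameNode tracks block locations. It is a sketch only: the function names and the round-robin placement are assumptions made for the example (real HDFS uses a rack-aware placement policy), while the 128 MB block size and replication factor of 3 are common HDFS defaults.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default block size
REPLICATION = 3                 # a common HDFS default replication factor


def place_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [datanode, ...]}, a toy stand-in for NameNode metadata."""
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    num_blocks = max(1, -(-file_size // block_size))  # ceiling division
    placement = {}
    for block in range(num_blocks):
        # Round-robin placement is an assumption of this sketch,
        # not the policy HDFS actually uses.
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a surviving node."""
    return all(
        any(node != failed_node for node in nodes) for nodes in placement.values()
    )
```

For a 300 MB file on four DataNodes, place_blocks produces three blocks with three replicas each, so readable_after_failure still returns True when any single DataNode is lost.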

Pig provides a high-level language called Pig Latin, which allows the operator to run SQL-like queries on the data without having to write and execute complex Java applications to retrieve meaningful results. Pig translates Pig Latin scripts into MapReduce jobs that are then run automatically against the data.
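The kind of work Pig saves can be shown with a small Python sketch of the map, shuffle and reduce steps behind a simple "views per title" query. The record layout, file name and function names are hypothetical, and the code only mimics the shape of a MapReduce job in memory; it is not what Pig itself generates.

```python
from collections import defaultdict

# A Pig Latin script for the same query might look like this
# (the file name and field names are hypothetical):
#   views   = LOAD 'views.tsv' AS (user:chararray, title:chararray);
#   grouped = GROUP views BY title;
#   counts  = FOREACH grouped GENERATE group, COUNT(views);


def map_phase(records):
    """Emit a (title, 1) pair for every (user, title) view record."""
    for _user, title in records:
        yield title, 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key to get a view count per title."""
    return {key: sum(values) for key, values in groups.items()}


def count_views_per_title(records):
    return reduce_phase(shuffle(map_phase(records)))
```

The three commented lines of Pig Latin express what takes three functions here, and in a real Java MapReduce job would take considerably more code.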

In marketing, big data is providing insights into which content is most effective at each stage of a sales cycle (Columbus, 2016).

Never before has this much information been available for teams of analysts to scrutinise in order to drive feature improvement and customer understanding in business.


O'Neill, E. (2016) 10 companies that are using big data [Online], Available from: (Accessed on 27th January 2018)

Statista (2017) Number of Netflix streaming subscribers worldwide from 3rd quarter 2011 to 4th quarter 2017 (in millions) [Online], Available from: (Accessed on 27th January 2018)

MindSight (n.d.) How Netflix Uses Big Data To Drive Big Business [Online], Available from: (Accessed on 27th January 2018)

IBM (2018) What is HDFS? [Online], Available from: (Accessed on 28th January 2018)

Columbus, L. (2016) Ten Ways Big Data Is Revolutionizing Marketing And Sales [Online], Available from: (Accessed on 26th January 2018)