Learnings From the High-Profile Amazon Prime Video Refactor 

31 / 08 / 2023

Read Time 10 minutes

Amazon Prime Video’s engineering team faced a critical challenge when scaling up their audio/video monitoring service. Initially built on a distributed microservices architecture, the team encountered cost and scalability bottlenecks as they attempted to handle thousands of concurrent video streams. 

Their high-profile journey from microservices to a monolith has not only polarised parts of the industry (mainly because opinions were formed without proper context), but also provides valuable insights into the complexities of architectural decisions. You can read the original post on their website.

The team’s experience highlights the importance of considering context, maintaining logical boundaries, and leveraging refactoring for performance gains. Architectural choices should be driven by the specific requirements of the application, and trends should be balanced with real-world results. In the ever-changing landscape of software architecture, learning from real-world experiences like Amazon Prime Video’s can lead to more effective and scalable solutions. 

Let’s delve into the specific scenario at Amazon Prime Video, explore the insights derived from their experience, and highlight the implications of their architectural choices. 

The Initial Microservices Architecture 

The Video Quality Analysis (VQA) team at Amazon Prime Video developed a tool to automatically identify perceptual quality issues in video streams viewed by customers. This tool was designed as a distributed system using serverless components, such as AWS Step Functions (Azure Logic Apps) and AWS Lambda (Azure Function Apps), to handle the media conversion, defect detection, and orchestration. 

The microservices architecture allowed for individual components to be worked on and scaled independently, which was beneficial for building the service quickly. However, as the team scaled the service to handle a larger volume of video streams, they encountered several challenges: 

Scaling Bottlenecks: The orchestration was implemented with AWS Step Functions, whose state transition limits hindered further scaling.

High Cost of Data Transfer: Passing video frames (images) between distributed components required frequent Tier-1 calls to an Amazon Simple Storage Service (Amazon S3) bucket (Hot-tier Azure Storage), leading to significant costs (see the sketch after this list).

Complex Orchestration: Managing the interactions between numerous microservices and components introduced complexity and added to operational overhead. 
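To make the data-transfer cost concrete, here is a minimal sketch (ours, not Amazon's; the bucket name, event shape, and detector logic are assumptions) of what a detector Lambda in the distributed design could have looked like. Every frame it inspects is a separate network round trip to S3, and every detector in the fleet repeats those downloads.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name; the real service's identifiers were not published.
FRAME_BUCKET = "vqa-frames"

def lambda_handler(event, context):
    """One defect detector, deployed as its own Lambda.

    Every detector repeats this download for every frame, so N detectors
    inspecting M frames means N*M Tier-1 S3 GETs per stream.
    """
    frame_keys = event["frame_keys"]  # written earlier by the media converter
    defects = []
    for key in frame_keys:
        # Each frame is fetched across a network boundary from S3.
        body = s3.get_object(Bucket=FRAME_BUCKET, Key=key)["Body"].read()
        if looks_defective(body):
            defects.append(key)
    return {"defects": defects}

def looks_defective(frame_bytes: bytes) -> bool:
    # Placeholder for a real perceptual-quality check
    # (block corruption, frozen frames, audio artifacts, etc.).
    return False
```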

The Decision to Rearchitect: Monolith Approach 

To address the bottlenecks and cost issues, the VQA team decided to rearchitect their infrastructure. They made a bold decision to move away from the distributed microservices approach and consolidate all components into a single process, effectively creating a monolithic architecture. 

The updated architecture ran within a single Amazon Elastic Container Service (Amazon ECS) task (Azure Container Instance), reducing the need for data transfer across network boundaries. 

Maintaining Logical Boundaries 

Despite adopting a monolithic deployment, the team carefully maintained logical boundaries within the service. The service’s main components – media conversion, defect detectors, and orchestration – were preserved, allowing for code reuse and a smooth migration to the new architecture.  

Consolidating Components: In the initial microservices setup, a separate media conversion service processed video streams, and detectors ran as individual microservices, downloading and processing frames independently from an S3 bucket (Azure Storage). 

However, in the monolith approach, all components were housed within a single container, eliminating the need for external data transfer. This allowed for faster processing and reduced the overall cost. 
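As an illustration (a minimal sketch with assumed names, not code from the original post), the consolidated service can keep conversion and detection as separate classes, preserving the logical boundaries, while frames move between them as plain in-memory objects:

```python
from typing import Callable, Iterator

class MediaConverter:
    """Logical boundary: media conversion, kept as its own component."""

    def frames(self, stream_chunk: bytes, frame_size: int = 1024) -> Iterator[bytes]:
        # Stand-in for real decoding: slice the chunk into fixed-size "frames".
        for i in range(0, len(stream_chunk), frame_size):
            yield stream_chunk[i:i + frame_size]

class DefectDetector:
    """Logical boundary: one defect detector."""

    def __init__(self, name: str, check: Callable[[bytes], bool]):
        self.name = name
        self.check = check

class MonitoringService:
    """All components in one process (one ECS task): frames are passed
    as in-memory objects, never serialised to S3 between steps."""

    def __init__(self, converter: MediaConverter, detectors: list[DefectDetector]):
        self.converter = converter
        self.detectors = detectors

    def analyse(self, stream_chunk: bytes) -> dict[str, int]:
        hits = {d.name: 0 for d in self.detectors}
        for frame in self.converter.frames(stream_chunk):
            for d in self.detectors:
                if d.check(frame):  # a plain function call, not a network hop
                    hits[d.name] += 1
        return hits
```

The important property is that the hand-off between components is now a function call rather than an S3 round trip, while each component keeps its own boundary in the code.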

Vertical Scaling: While the microservices architecture scaled detectors horizontally by spinning up new microservices, the monolith scaled vertically; once the capacity of a single instance was exceeded, the team replicated the entire service.

Each clone of the service was parametrised with a different subset of detectors, enabling the team to handle thousands of streams efficiently. 
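One plausible way to parametrise the clones (the environment variable, registry, and detector names below are our assumptions, not details from Amazon's post) is to give each ECS task a subset of a detector registry at launch:

```python
import os

# Hypothetical detector registry; the real detector names were not published.
ALL_DETECTORS = {
    "block_corruption": lambda frame: False,  # placeholder checks
    "frozen_frame":     lambda frame: False,
    "audio_artifacts":  lambda frame: False,
}

def detectors_for_this_clone() -> dict:
    """Select this clone's subset from an environment variable: a task
    launched with DETECTOR_SUBSET=block_corruption,frozen_frame runs only
    those two. Together, the clones cover the full detector set."""
    names = os.environ.get("DETECTOR_SUBSET", "")
    return {n: ALL_DETECTORS[n] for n in names.split(",") if n in ALL_DETECTORS}
```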

Insights on Architecture Choices 

  1. Context-Driven Decisions

The move from microservices to a monolith was not a blanket endorsement of one architecture over the other but a context-driven decision based on the specific use case. 

Microservices and serverless components excel at a high scale, but the team found that their requirements demanded a different approach. 

In the case of video monitoring, where thousands of video streams needed real-time analysis, the monolith approach proved more efficient and cost-effective. However, this might not be the case for all applications, and the choice of architecture should be carefully evaluated based on context. 

  2. Logical Boundaries vs. Physical Boundaries

A crucial lesson from Amazon Prime Video’s experience is the distinction between logical and physical boundaries in defining a service. Logical boundaries represent the capabilities of a service and should not change based on the deployment approach. 

In Amazon Prime Video’s case, the logical boundary was the service for audio/video quality inspection. This boundary remained intact despite the shift from microservices to a monolith. The physical boundary, on the other hand, changed as components were consolidated into a single process. 
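In code, a logical boundary can be expressed as a contract that survives redeployment. The sketch below is illustrative only, not code from the original post: the `Detector` interface stays identical whether each implementation is deployed behind a network hop or imported into the monolith.

```python
from abc import ABC, abstractmethod

class Detector(ABC):
    """The logical boundary: what a detector does.

    This contract does not change whether an implementation runs as its
    own Lambda (physical boundary = a network hop) or as a class inside
    the monolith (physical boundary = a function call).
    """

    @abstractmethod
    def check(self, frame: bytes) -> bool:
        """Return True if this frame shows a perceptual-quality defect."""
```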

  3. Refactoring for Performance Gains

The decision to rearchitect the service was essentially a refactor, not a complete overhaul. By consolidating components, optimising data transfer, and reusing code, the team achieved significant performance gains without rewriting the entire application. 

The team decided to move the computationally expensive media conversion process closer to the detectors. While it may have been tempting to cache the conversion results, the in-memory approach proved more cost-effective and faster. 
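A brief sketch of that trade-off (hypothetical names; the original post does not show code):

```python
def convert_and_detect(raw_frame, convert, detectors):
    """Run the expensive conversion once, immediately before detection.

    The tempting alternative is to persist `converted` to an external
    cache (e.g. S3) and have detectors re-read it later, but keeping the
    result in memory avoids both the storage cost and the extra latency.
    """
    converted = convert(raw_frame)  # expensive step, done exactly once
    return [name for name, check in detectors.items() if check(converted)]
```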

  4. Architectural Trends and Real-World Results

Architectural trends in the industry may change over time. In many cases they are cyclical, and what works best in one scenario may not hold true in another. It is crucial to evaluate architectural decisions based on real-world results and the specific needs of the application. 

Microservices, despite their popularity, are not a silver bullet for every use case. They excel in certain scenarios but may introduce unnecessary complexity when applied outside their original context. 

How to architect a system is always a big decision for CTOs and technical leaders. That said, the initial architecture should be based on current needs, not on what those needs are imagined to become. This should be followed by defining logical boundaries, so that physical boundaries don't become a problem later when they need to change.
