System Design Interview: Design FB Post Search w/ ex-Meta Interviewer

Summary notes created by Deciphr AI

https://www.youtube.com/watch?v=l38XL9914fs
Abstract

Stefan from Hello Interview delves into designing a Facebook post search system, using his firsthand interview experience at Facebook. He outlines a structured framework for system design interviews, emphasizing the importance of defining both functional and non-functional requirements. Key elements include creating an inverted index for efficient search, handling high write volumes via async processing with Kafka, and optimizing read performance through caching and CDNs. He also discusses managing storage with hot and cold data partitions, using probabilistic data structures like Count-Min Sketch for efficient indexing, and addressing multi-keyword search challenges. This comprehensive breakdown aims to prepare candidates for software engineering interviews.

Summary Notes

Introduction to System Design Problem

  • The podcast discusses the system design for Facebook post search, an infrastructure challenge involving data organization and scalability for both read and write paths.
  • The speaker, Stefan, shares personal experience with this problem from a Facebook interview.
  • The Hello Interview Delivery Framework is introduced as a structured approach to solving system design problems in interviews.

"Today we're going to be talking about how to design Facebook post search. This is a really interesting infrastructure question around organizing data, enabling scalability on both the read and the write path."

  • The podcast focuses on the design of Facebook post search, emphasizing the challenges of data organization and scalability.

Hello Interview Delivery Framework

  • The framework is designed to help candidates maintain progress and avoid getting stuck during interviews.
  • It includes steps like establishing requirements, discussing core entities, setting up APIs, and high-level design.
  • Deep dives into non-functional requirements and additional functionality are crucial for comprehensive understanding.

"First, we're going to start out with our requirements. We're going to clarify with our interviewer what the scale is, what our functional requirements, so on and so forth."

  • The framework begins with clarifying requirements, ensuring a clear understanding of the problem scope.

Establishing Functional Requirements

  • Functional requirements involve understanding what the system needs to do from a user perspective.
  • Key functionalities include creating posts, sorting by recency and engagement, and enabling keyword search.
  • Important to consider both read and write paths in the system design.

"Functional requirements are usually best tackled by putting yourself in the shoes of a user or a client and then trying to figure out what are the functionalities that they may need."

  • Understanding user needs and functionalities is crucial for defining functional requirements.

Exploring Search Functionality

  • Search functionality includes keyword search, sorting by likes and time, and exploring additional features like fuzzy search and personalization.
  • Non-personalized search results are preferred for caching efficiency.
  • Engagement metrics and recency are key sorting criteria.

"In the case of Facebook posts, we probably care a lot about how recently the post was created and then also how engaged it is."

  • Recency and engagement are prioritized in sorting search results for Facebook posts.

Non-Functional Requirements Discussion

  • Non-functional requirements focus on how the system performs, including latency, throughput, and consistency.
  • Low latency is targeted at 500 milliseconds to balance user experience and system load.
  • The system should be eventually consistent, with posts appearing in search results within a minute.

"When you do a search, we would expect it to be low latency. It's not really useful if you just say low latency because a big question is how low."

  • Low latency is crucial for user satisfaction, requiring specific targets for performance.

Considerations for System Design

  • The system should handle a large number of users and searches, with high availability and fault tolerance.
  • Estimates for system throughput and user interactions are important for understanding system demands.
  • The design should accommodate eventual consistency and high recall for search results.

"If the functional requirements are what the system ought to do, the non-functional requirements are how the system does this."

  • Non-functional requirements define the performance and reliability aspects of the system.

Conclusion and Moving Forward

  • The discussion covers a comprehensive set of requirements, both functional and non-functional, for Facebook post search.
  • Estimates and assumptions play a role in shaping the system design and addressing potential bottlenecks.
  • The podcast emphasizes the importance of a structured approach to system design in interviews.

"So here we've got some good functional requirements. We've left the TBD in terms of our scale, which we'll get into in a second."

  • A structured approach helps in addressing all aspects of system design, ensuring a thorough understanding and preparation for interviews.

Estimating System Requirements

  • The system is expected to handle a billion posts a day, with a rough calculation leading to 10,000 posts per second.
  • User engagement through likes is significantly higher than post creation, estimated at 100,000 likes per second.
  • Reading or searching is less frequent than writing, with an estimated 10,000 searches per second.
  • The system is primarily write-heavy, contrary to initial assumptions of it being read-heavy.

"Generally speaking, the right volume of our system is dominated by engagement or likes as opposed to the post creation."

  • The system's design should focus more on handling write operations efficiently.

"This actually is a write-heavy system. We're doing more writes than we're doing reads."

  • The storage requirement for posts is substantial, with an estimated 3.6 petabytes needed for 3.6 trillion posts over ten years.

"We have 3.6 trillion times 1 kilobyte per post I think is a pretty reasonable estimate."

  • It's critical to consider scalable storage solutions to manage such vast data effectively.
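
The arithmetic behind these estimates is easy to sanity-check; a quick script (rounding a day to ~100,000 seconds is a common interview shortcut, not an exact figure):

```python
# Back-of-envelope math for the estimates above.
SECONDS_PER_DAY = 100_000                          # actual: 86,400, rounded for easy math

posts_per_day = 1_000_000_000                      # 1 billion posts/day
posts_per_sec = posts_per_day // SECONDS_PER_DAY   # ~10,000 posts/sec

likes_per_sec = 10 * posts_per_sec                 # ~100,000 likes/sec (10x engagement)
searches_per_sec = 10_000                          # assumed on par with post creation

# Storage: 1B posts/day over ~10 years (~3,600 days) at ~1 KB per post.
total_posts = posts_per_day * 3_600                # ~3.6 trillion posts
storage_bytes = total_posts * 1_000                # ~3.6e15 bytes
print(storage_bytes / 1e15, "PB")                  # ~3.6 PB
```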

Defining Core Entities and API Design

  • Core entities include users, posts, likes, and searches.
  • API design should cover all functional requirements, ensuring the system's completeness.
  • Designing APIs is crucial, as it impacts the system's usability and maintainability.

"In the ideal state, our API is going to cover all of our functional requirements."

  • The API should allow for post creation, liking/unliking posts, and searching posts (a sample shape is sketched at the end of this section).

"Our first requirement is that we must be able to create posts."

  • Good API design is essential, though the level of detail may vary depending on the interviewer.

"Some interviewers can't wait for you to get to the high-level design. They really want to talk about scalability."

High-Level System Design

  • The initial focus should be on creating a functional system before optimizing for performance.
  • A simple load-balanced service behind an API gateway can handle post creation and storage.
  • The search functionality requires a separate service due to its read-heavy nature.

"What we're trying to do is basically make our system work before we optimize it."

  • The search service should not query the post database directly: scanning large datasets with SQL LIKE queries is far too slow.

"This LIKE keyword is actually going to be very slow as well."

  • An inverted index is recommended for efficient searching, mapping keywords to post IDs.

"An inverted index is basically asking what would be the ideal data structure that I could read from in order to get the fastest possible searches."

Handling Post and Like Events

  • Post creation should involve updating search indexes to facilitate efficient searching.
  • A likes service should record likes in a database and inform the search service to update rankings.

"It is important that our like service is letting our search know when a like event happens."

  • An ingestion service can streamline the process by reading changes from the post database via change data capture (CDC) and updating the search indexes.

"A really clever way for us to set this up is actually have our ingestion service read via CDC from our post DB."

Search Functionality and Optimization

  • Sorting search results by likes or time is essential for meeting user expectations.
  • Avoid querying metadata for sorting by precomputing and storing necessary information in the search index.

"We need some way of answering those queries effectively."

  • Handling large sets of post IDs efficiently is crucial to avoid performance bottlenecks.

"If we have millions of matches for a particular keyword, that means we need to pass millions of post IDs back to the search service."

Efficient Data Sorting and Retrieval in System Design

  • The challenge involves handling large volumes of data efficiently, particularly when needing to sort posts by likes or creation time.
  • An ideal solution would involve pre-sorting post IDs to minimize the need for sorting during data retrieval.
  • The concept of using separate indexes for likes and creation time is introduced to facilitate quick data access.

"If our list of post ids was sorted by creation time, then when our search service queried that keyword, we could grab the top 100 items and then return those back to the user."

  • Sorting post IDs by creation time allows for immediate retrieval of the most recent posts without additional sorting.

"What if instead we have an index which is likes for the keyword and we have one which is time for the keyword."

  • By maintaining separate indexes for likes and time, the system can efficiently handle different sorting requirements.
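
Redis sorted sets are one natural fit for this pattern; a sketch, with key names as assumptions (scores are the creation timestamp in one index and the like count in the other, so both orderings are precomputed at write time):

```python
import redis

r = redis.Redis()

def index_post(keyword: str, post_id: str, created_at: float, likes: int) -> None:
    # Two sorted sets per keyword: one ordered by time, one by likes.
    r.zadd(f"idx:{keyword}:time",  {post_id: created_at})
    r.zadd(f"idx:{keyword}:likes", {post_id: likes})

def top_recent(keyword: str, limit: int = 100) -> list:
    # Highest score first = newest first; no sorting at read time.
    return r.zrevrange(f"idx:{keyword}:time", 0, limit - 1)

def top_liked(keyword: str, limit: int = 100) -> list:
    return r.zrevrange(f"idx:{keyword}:likes", 0, limit - 1)
```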

System Adjustments and Event Handling

  • The system needs to adjust to handle like events efficiently, ensuring that updates are reflected in the search index without direct writes.
  • A caching mechanism is suggested to manage like counts, reducing the need for permanent storage.

"Whenever a like count changes for a given post ID, our ingestion service receives that notice and then what it will do is it will look up that post."

  • The ingestion service updates indexes based on like events, ensuring the system reflects the most current data.

Optimizing Non-Functional Requirements

  • Non-functional requirements focus on maintaining low latency and high throughput, especially given the high volume of users.
  • Caching strategies, both local and distributed (e.g., CDN), are proposed to handle frequent, non-personalized search queries.

"Oftentimes, when you think about read volume, your initial instinct should be, is there some way for me to cache this?"

  • Caching helps manage read volume by storing frequently requested data, reducing the load on the search service.

"A CDN acts like a globally distributed cache. You might have a CDN edge node, basically a server that might be down the street or downtown in your city."

  • CDN usage allows for faster data retrieval by caching data closer to the user's location.
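
Closer to the application tier, a minimal read-through cache sketch for the search service (names and the 60-second TTL are assumptions; the TTL lines up with the "searchable within a minute" consistency target):

```python
import time

CACHE_TTL_SECONDS = 60
_cache: dict[str, tuple[float, list]] = {}

def cached_search(query: str, do_search) -> list:
    key = query.strip().lower()        # normalize so identical queries share entries
    hit = _cache.get(key)
    if hit and time.time() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]                  # cache hit: skip the search index entirely
    results = do_search(key)
    _cache[key] = (time.time(), results)
    return results
```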

Handling High Write Throughput

  • The system faces challenges with high write throughput, particularly with like and post creation events.
  • Asynchronous processing using Kafka is suggested to manage high write loads and ensure the system can scale effectively.

"Instead of having our ingestion service directly read from the CDC stream, what we'll do is we will have our post database and our like counts write to a Kafka topic."

  • Kafka allows for scalable, fault-tolerant processing of write events, ensuring the system can handle bursts in activity.
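
A sketch of that decoupling using kafka-python (topic and group names are illustrative):

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: post/like services fire events and move on.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)
producer.send("post-events", {"post_id": 42, "op": "insert"})

def update_search_index(event: dict) -> None:
    ...  # stand-in for the index-update logic (see earlier sketches)

# Consumer side: the ingestion service drains events at its own pace,
# absorbing bursts; consumer groups let us add instances to scale out.
consumer = KafkaConsumer(
    "post-events", "like-events",
    bootstrap_servers="localhost:9092",
    group_id="ingestion-service",
    value_deserializer=lambda b: json.loads(b.decode()),
)
for event in consumer:
    update_search_index(event.value)
```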

Efficient Like Event Processing

  • Like events are frequent but often do not impact the order of search results significantly.
  • Strategies such as batching and logarithmic recording of like counts are proposed to reduce processing overhead.

"One thing that we can do, and this would be what many candidates will propose at first, is we can batch the like events."

  • Batching like events can reduce the number of updates processed, though it may not significantly impact less viral posts.

"Another solution that we can do is we can take the logarithm of the total number of likes."

  • Using logarithmic scales for like counts reduces the number of updates needed, focusing on significant changes in popularity.
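
A sketch of the logarithmic idea: only push a like count to the index when it crosses an order-of-magnitude boundary, so a post with a million likes causes a handful of index writes instead of a million (the base-10 bucketing is an assumption; powers of two work equally well):

```python
import math

def should_update_index(old_count: int, new_count: int) -> bool:
    # Update the index only when the like count changes order of magnitude.
    if old_count == 0:
        return True
    return math.floor(math.log10(old_count)) != math.floor(math.log10(new_count))

assert should_update_index(9, 10)        # crosses into the 10-99 bucket: write
assert not should_update_index(11, 12)   # same bucket: skip the index write
```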

Conclusion and Future Optimizations

  • The current system provides a foundational solution to the problem, with room for future optimizations.
  • Focus shifts to identifying and addressing critical bottlenecks in the system, particularly those related to non-functional requirements.

"The key to deep dives in an infra style interview is to look at your system holistically and then identify those things that stand out as critical bottlenecks."

  • A holistic approach helps in identifying areas for improvement, ensuring the system meets both functional and non-functional requirements.

Write Volume Challenges and Solutions

  • Managing write volume involves maintaining a balance between correct indexing and handling the sheer volume of data writes.
  • Tokenizing posts and writing to multiple keywords increases write volume significantly.
  • Using Redis for in-memory data storage offers high throughput but requires careful partitioning to prevent overload on individual nodes.

"The volume of writes in this instance is actually a little bit higher than you might expect...tokenizing these posts and writing to multiple keywords for every post."

  • Tokenization increases write volume by creating multiple entries for each post, which can strain the system.

"With Reddus, your choice of key is basically your choice of partitioning or sharding strategy."

  • Proper partitioning strategy is crucial to manage write volume and prevent overload on Redis nodes.

Partitioning and Sharding Strategies

  • Using keywords as partitioning keys can lead to uneven distribution if certain keywords become overly popular.
  • Appending a shard number (e.g. the post ID modulo the shard count) to the key distributes writes more evenly, at the cost of fanning reads out across all shards.

"If one keyword is getting a ton of writes...we are going to be making a bunch of writes to the same Reddus node."

  • Popular keywords can cause bottlenecks if not properly managed through effective partitioning.

"Instead of having one single key...we might have four of those keys spread throughout our cluster."

  • Distributing writes across multiple nodes reduces load on any single node but complicates read operations.
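
A sketch of the key scheme (shard count and key format are assumptions): each write lands on exactly one shard chosen by post ID, while reads fan out to every shard and merge the results:

```python
NUM_SHARDS = 4

def write_key(keyword: str, post_id: int) -> str:
    # One write goes to exactly one shard of this keyword's index.
    return f"idx:{keyword}:{post_id % NUM_SHARDS}"

def read_keys(keyword: str) -> list[str]:
    # Reads query all shards and merge, trading read cost for write spread.
    return [f"idx:{keyword}:{shard}" for shard in range(NUM_SHARDS)]
```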

Storage Requirements and Optimization

  • Limiting the number of post IDs stored per keyword can control the size of search indexes.
  • Separating hot data from cold data optimizes memory usage and system performance.

"We can restrict the amount of space that's required here by limiting the number of post IDs that we have for every keyword."

  • Capping the number of stored post IDs helps manage memory usage effectively (see the sketch at the end of this section).

"What we need to do is basically have a way to have these search indexes moved into a colder storage."

  • Moving less frequently accessed data to cold storage optimizes system resources and performance.
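
The cap mentioned above can be enforced at write time with Redis sorted sets; a sketch assuming a 10,000-post limit per keyword (both the limit and the key name are illustrative):

```python
MAX_IDS_PER_KEYWORD = 10_000

def add_and_cap(r, keyword: str, post_id: str, score: float) -> None:
    key = f"idx:{keyword}:time"
    r.zadd(key, {post_id: score})
    # Trim everything below the top K entries (rank 0 is the lowest score),
    # bounding the memory each keyword's index can consume.
    r.zremrangebyrank(key, 0, -(MAX_IDS_PER_KEYWORD + 1))
```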

Data Structures for Efficiency

  • Count-Min Sketch provides an efficient way to approximate item counts and manage storage.
  • This probabilistic data structure helps identify which data can be moved to cold storage.

"Countman sketch is a cool data structure that provides an effective way to approximate the count of a set of items."

  • Count-Min Sketch allows for efficient data management by providing approximate counts for storage decisions.

"We can look at our search index and scan through all the keywords and find out which are accessed the least often."

  • Identifying less frequently accessed keywords helps determine which data to move to cold storage.
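
A minimal Count-Min Sketch, for flavor: a small 2D array of counters where each row hashes the item differently, and the estimate is the minimum across rows (never an undercount, occasionally an overcount from collisions):

```python
import hashlib

class CountMinSketch:
    def __init__(self, width: int = 1024, depth: int = 4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item: str, row: int) -> int:
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item: str) -> None:
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += 1

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the min is the best estimate.
        return min(self.table[row][self._hash(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for keyword in ["taylor", "taylor", "swift"]:
    cms.add(keyword)
print(cms.estimate("taylor"))   # 2
```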

Handling Multi-Keyword Queries

  • Multi-keyword queries are handled by breaking the query into individual keyword lookups and intersecting the results.
  • Bigrams (adjacent word pairs) can be added to the search index to make multi-word queries more efficient.

"We would go look up all the post IDs that contain Taylor and all of the post IDs that contain Swift and return them back to the search service."

  • Breaking down multi-keyword queries into individual searches helps manage search complexity.

"By adding biograms to our search index, we would theoretically only increase the search index size by 2x."

  • Adding bigrams improves search efficiency but requires careful management of index size.
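
A sketch of bigram tokenization (the function name is illustrative): indexing adjacent word pairs alongside single words means a phrase like "Taylor Swift" resolves with one index lookup instead of an intersection of two large sets:

```python
def tokens_with_bigrams(content: str) -> list[str]:
    words = content.lower().split()
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams

print(tokens_with_bigrams("Taylor Swift announces tour"))
# ['taylor', 'swift', 'announces', 'tour',
#  'taylor swift', 'swift announces', 'announces tour']
```

A post of n words yields n single tokens plus n-1 bigrams, which is where the "only increase the index size by 2x" figure comes from.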

System Design Considerations

  • Balancing storage and lookup time is crucial for efficient system design.
  • Using Count-Min Sketch and bigrams helps optimize storage and search efficiency.

"We can use countman sketch to find search keywords that are uncommon or not searched very often."

  • Probabilistic data structures like Count-Min Sketch help optimize system design by managing storage efficiently.

"This is a place where we're going to end up making a space versus lookup time tradeoff."

  • System design often involves trade-offs between storage space and search efficiency for optimal performance.
