100X Scaling: How Figma Scaled its Databases



Figma, a collaborative design platform, has been on a wild growth ride for the last few years. Its user base has grown by almost 200% since 2018, attracting around 3 million monthly users.

As more and more users hopped on board, the infrastructure team found themselves in a tough spot: they needed a way to scale their databases quickly to keep up with the growing demand.

The database stack is like the backbone of Figma. It stores and manages all the important metadata, like permissions, file info, and comments. And it ended up growing a whopping 100x since 2020! 

That's a good problem to have, but it also meant the team had to get creative.

In this article, we'll dive into Figma's database scaling journey. We'll explore the challenges they faced, the decisions they made, and the innovative solutions they came up with. By the end, you'll better understand what it takes to scale databases for a rapidly growing company like Figma.

The Initial State of Figma’s Database Stack

In 2020, Figma still used a single, large Amazon RDS database to persist most of the metadata. While it handled things quite well, one machine had its limits. 

During peak traffic, CPU utilization climbed above 65%, resulting in unpredictable database latencies.

While complete saturation was far away, the infrastructure team at Figma wanted to proactively identify and fix any scalability issues. They started with a few tactical fixes such as:

These fixes gave them an additional year of runway but there were still limitations:

It was clear that they needed a longer-term solution.

The First Step: Vertical Partitioning

When Figma's infrastructure team realized they needed to scale their databases, they couldn't just shut everything down and start from scratch. They needed a solution to keep Figma running smoothly while they worked on the problem. 

That's where vertical partitioning came in.

Think of vertical partitioning as reorganizing your wardrobe. Instead of having one big pile of mess, you split things into separate sections. In database terms, it means moving certain tables to separate databases.

For Figma, vertical partitioning was a lifesaver. It allowed them to move high-traffic, related tables, such as those for “Figma Files” and “Organizations”, into their own separate databases. This provided some much-needed breathing room.
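
To make that concrete, here’s a minimal sketch of what table-level routing can look like once tables live in different databases. The table names, hostnames, and helper function are purely illustrative and not Figma’s actual setup.

```python
# Illustrative only: map each vertically partitioned table to the database
# that now owns it, and fall back to the original database for everything else.
import psycopg2

TABLE_TO_DSN = {
    "files": "postgresql://files-db.internal/figma",          # hypothetical hosts
    "organizations": "postgresql://orgs-db.internal/figma",
}
DEFAULT_DSN = "postgresql://main-db.internal/figma"

def connect_for_table(table_name: str):
    """Open a connection to whichever database the table was moved to."""
    return psycopg2.connect(TABLE_TO_DSN.get(table_name, DEFAULT_DSN))
```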

To identify the tables for partitioning, Figma considered two factors:

  1. Impact: Moving the tables should shift a significant portion of the workload.

  2. Isolation: The tables should not be strongly connected to other tables.

For measuring impact, they looked at average active sessions (AAS) for queries. This metric describes the average number of active threads dedicated to a given query at a point in time.
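
As a rough illustration of the metric itself (not Figma’s tooling), AAS can be approximated by periodically sampling Postgres’s pg_stat_activity view and averaging the per-query counts of active sessions:

```python
# Simplified sketch: estimate average active sessions (AAS) per query by sampling
# pg_stat_activity. Grouping by raw query text is a simplification; real tooling
# would normalize queries and sample far more carefully.
import time
from collections import defaultdict
import psycopg2

def estimate_aas(dsn: str, samples: int = 60, interval_s: float = 1.0):
    totals = defaultdict(int)
    conn = psycopg2.connect(dsn)
    with conn, conn.cursor() as cur:
        for _ in range(samples):
            cur.execute(
                "SELECT query, count(*) FROM pg_stat_activity "
                "WHERE state = 'active' GROUP BY query"
            )
            for query, active in cur.fetchall():
                totals[query] += active
            time.sleep(interval_s)
    # AAS for a query = active sessions summed over all samples / number of samples
    return {query: total / samples for query, total in totals.items()}
```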

Measuring isolation was a little trickier. They used runtime validators hooked into ActiveRecord, their Ruby ORM. The validators sent production query and transaction information to Snowflake for analysis, helping them identify tables that were ideal for partitioning based on query patterns and table relationships.

Once the tables were identified, Figma needed to migrate them between databases without downtime. They set the following goals for their migration solution:

Since they couldn’t find a pre-built solution that could meet these requirements, Figma built an in-house solution. At a high level, it worked as follows:

To make the migration to partitioned databases smoother, they created separate PgBouncer services to split the traffic virtually. Security groups were implemented to ensure that only PgBouncers could directly access the database.

Partitioning the PgBouncer layer first gave clients some room for error: because every PgBouncer instance initially pointed at the same target database, a query routed to the wrong PgBouncer still reached the right data. During this window, the team could detect routing mismatches and make the necessary corrections before the data itself was split.

The below diagram shows this process of migration.




Implementing Replication

Data replication is a great way to scale the read operations for your database. When it came to replicating data for vertical partitioning, Figma had two options in Postgres: streaming replication or logical replication.

They chose logical replication for 3 main reasons:

However, logical replication was slow. The initial data copy could take days or even weeks to complete. 

Figma desperately wanted to avoid this lengthy process, not only to minimize the window for replication failure but also to reduce the cost of restarting if something went wrong.

But what made the process so slow?

The culprit was how Postgres maintains indexes in the destination database. While the replication process copies rows in bulk, it also updates the indexes one row at a time. By removing indexes in the destination database and rebuilding them after the data copy, Figma reduced the copy time to a matter of hours.
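
The resulting pattern looks roughly like the sketch below: record the destination’s secondary index definitions, drop them, let a logical-replication subscription perform the bulk copy, and rebuild the indexes once the copy has caught up. The CREATE SUBSCRIPTION syntax is standard Postgres, but the helper functions, names, and the primary-key filter are assumptions; Figma’s actual tooling isn’t public.

```python
# Sketch of "drop indexes, bulk copy, rebuild" for a faster initial
# logical-replication copy. Table, publication, and subscription names are placeholders.
import psycopg2

def start_fast_copy(dest_dsn: str, source_dsn: str, table: str, publication: str):
    """Drop secondary indexes on the destination, then start the initial copy."""
    conn = psycopg2.connect(dest_dsn)
    conn.autocommit = True  # CREATE SUBSCRIPTION cannot run inside a transaction block
    cur = conn.cursor()

    # Save and drop secondary indexes so the bulk copy avoids per-row index updates.
    cur.execute(
        "SELECT indexname, indexdef FROM pg_indexes "
        "WHERE tablename = %s AND indexname NOT LIKE '%%_pkey'",  # crude PK filter
        (table,),
    )
    saved_indexes = cur.fetchall()
    for name, _ in saved_indexes:
        cur.execute(f'DROP INDEX IF EXISTS "{name}"')

    # copy_data = true makes the subscription bulk-copy the existing rows first.
    cur.execute(
        f"CREATE SUBSCRIPTION {table}_sub CONNECTION %s "
        f"PUBLICATION {publication} WITH (copy_data = true)",
        (source_dsn,),
    )
    return saved_indexes

def rebuild_indexes(dest_dsn: str, saved_indexes):
    """Recreate the saved indexes in bulk once replication has caught up."""
    conn = psycopg2.connect(dest_dsn)
    conn.autocommit = True
    cur = conn.cursor()
    for _, indexdef in saved_indexes:
        cur.execute(indexdef)  # indexdef is a full CREATE INDEX statement
```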

Need for Horizontal Scaling

As Figma's user base and feature set grew, so did the demands on their databases. 

Despite their best efforts, vertical partitioning had limitations, especially for Figma’s largest tables. Some tables contained several terabytes of data and billions of rows, making them too large for a single database.

A couple of problems were especially prominent:

For a better perspective, imagine a library with a rapidly growing collection of books. Initially, the library might cope by adding more shelves (vertical partitioning). But eventually, the building itself will run out of space. No matter how efficiently you arrange the shelves, you can’t fit an infinite number of books in a single building. That’s when you need to start thinking about opening branch libraries.

This is the idea behind horizontal sharding.

For Figma, horizontal sharding was a way to split large tables across multiple physical databases, allowing them to scale beyond the limits of a single machine. 

The below diagram shows this approach:
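
As a toy illustration of the routing this implies, a sharding function maps each row’s key to one of several physical databases. The hash scheme, shard count, and hostnames below are made up for the example and are not Figma’s layout.

```python
# Illustrative only: spread rows of one logical table across physical databases
# by hashing the sharding key (here, a file ID).
import hashlib

PHYSICAL_SHARDS = [
    "postgresql://files-shard-0.internal/figma",
    "postgresql://files-shard-1.internal/figma",
    "postgresql://files-shard-2.internal/figma",
    "postgresql://files-shard-3.internal/figma",
]

def shard_for(file_id: int) -> str:
    """Return the DSN of the physical database that owns this file ID."""
    digest = hashlib.md5(str(file_id).encode()).hexdigest()
    return PHYSICAL_SHARDS[int(digest, 16) % len(PHYSICAL_SHARDS)]
```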

However, horizontal sharding is a complex process that comes with its own set of challenges:

Exploring Alternative Solutions

The engineering team at Figma evaluated alternative SQL options such as CockroachDB, TiDB, Spanner, and Vitess, as well as NoSQL databases.

Eventually, however, they decided to build a horizontally sharded solution on top of their existing vertically partitioned RDS Postgres infrastructure. 

There were multiple reasons behind this decision:

Figma’s Unique Sharding Implementation

Figma’s approach to horizontal sharding was tailored to their specific needs as well as the existing architecture. They made some unusual design choices that set their implementation apart from other common solutions.

Let’s look at the key components of Figma’s sharding approach:

Colos (Colocations) for Grouping Related Tables

Figma introduced the concept of “colos” (colocations): groups of related tables that share the same sharding key and physical sharding layout.

To create the colos, they selected a handful of sharding keys like UserId, FileId, or OrgID. Almost every table at Figma could be sharded using one of these keys.

This provided a friendly abstraction for developers to interact with horizontally sharded tables.

Tables within a colo support cross-table joins and full transactions when restricted to a single sharding key. Most application code already interacted with the database in a similar way, which minimized the work required by applications to make a table ready for horizontal sharding.
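
A hypothetical sketch of how a colo might be described in application code is shown below; the table groupings are invented, but the point is that every table in a colo resolves to the same sharding key, so single-key joins and transactions stay on one shard.

```python
# Hypothetical colo definitions: tables that share a sharding key are grouped so
# that joins and transactions scoped to one key never cross shards.
from dataclasses import dataclass

@dataclass(frozen=True)
class Colo:
    name: str
    sharding_key: str
    tables: tuple

COLOS = (
    Colo("files", sharding_key="FileId", tables=("files", "file_comments")),
    Colo("orgs", sharding_key="OrgID", tables=("organizations", "org_members")),
)

def colo_for_table(table: str) -> Colo:
    """Look up which colo, and therefore which sharding key, a table belongs to."""
    return next(c for c in COLOS if table in c.tables)
```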

The below diagram shows the concept of colos:

Logical Sharding vs Physical Sharding

Figma separated the concept of “logical sharding” at the application layer from “physical sharding” at the Postgres layer.

Logical sharding involves creating multiple views per table, each corresponding to a subset of data in a given shard. All reads and writes to the table are sent through these views, making the table appear horizontally sharded even though the data is physically located on a single database host.

This separation allowed Figma to decouple the two parts of their migration and implement them independently. They could perform a safer and lower-risk logical sharding rollout before executing the riskier distributed physical sharding. 

Rolling back logical sharding was a simple configuration change, whereas rolling back physical shard operations would require more complex coordination to ensure data consistency.
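
As a rough sketch of the idea (the hash scheme, shard count, and view definitions are assumptions, not Figma’s actual DDL), the per-shard views over a single physical table could be generated like this:

```python
# Simplified sketch: create one view per logical shard over a table that still
# lives on a single host. Reads and writes go through these views.
import psycopg2

NUM_LOGICAL_SHARDS = 16

def create_logical_shard_views(dsn: str, table: str, shard_key: str):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True
    cur = conn.cursor()
    for shard_id in range(NUM_LOGICAL_SHARDS):
        # Each view exposes only the rows whose hashed sharding key maps to this shard.
        cur.execute(
            f"CREATE VIEW {table}_shard_{shard_id} AS "
            f"SELECT * FROM {table} "
            f"WHERE abs(hashtext({shard_key}::text)) % {NUM_LOGICAL_SHARDS} = {shard_id}"
        )
```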

DBProxy Query Engine for Routing and Query Execution

To support horizontal sharding, the Figma engineering team built a new service named DBProxy that sits between the application and connection pooling layers such as PgBouncer.

DBProxy includes a lightweight query engine capable of parsing and executing horizontally sharded queries. It consists of three main components:

  1. A query parser that reads the SQL sent by the application and transforms it into an Abstract Syntax Tree (AST).

  2. A logical planner that parses the AST and extracts the query type (insert, update, etc.) and the logical shard IDs from the query plan.

  3. A physical planner that maps the query from logical shard IDs to physical databases and rewrites queries to execute on the appropriate physical shard.

The below diagram shows the practical use of these three components within the query processing workflow.
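
To illustrate the flow, here is a deliberately toy version of the three stages. The real DBProxy builds a proper AST from the SQL; this sketch fakes the parser with a regex, and the shard counts, table names, and database hosts are all invented.

```python
# Toy pipeline mirroring DBProxy's three stages: parse -> logical plan -> physical plan.
# Everything here is simplified; the regex stands in for a real SQL parser and AST.
import re
from dataclasses import dataclass

NUM_LOGICAL_SHARDS = 16
LOGICAL_TO_PHYSICAL = {i: f"postgresql://shard-{i % 4}.internal/figma"
                       for i in range(NUM_LOGICAL_SHARDS)}

@dataclass
class LogicalPlan:
    query_type: str
    logical_shard_id: int

def parse(sql: str) -> dict:
    """Stage 1: 'parse' the SQL (a stand-in for building a real AST)."""
    match = re.search(r"file_id\s*=\s*(\d+)", sql, re.IGNORECASE)
    return {"type": sql.split()[0].lower(),
            "shard_key": int(match.group(1)) if match else None}

def plan_logical(ast: dict) -> LogicalPlan:
    """Stage 2: extract the query type and the logical shard ID."""
    if ast["shard_key"] is None:
        raise ValueError("no sharding key: this query needs a scatter-gather plan")
    return LogicalPlan(ast["type"], ast["shard_key"] % NUM_LOGICAL_SHARDS)

def plan_physical(plan: LogicalPlan, sql: str) -> tuple:
    """Stage 3: map the logical shard to a physical database and rewrite the query."""
    rewritten = sql.replace("files", f"files_shard_{plan.logical_shard_id}")
    return LOGICAL_TO_PHYSICAL[plan.logical_shard_id], rewritten
```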

There are always trade-offs when it comes to queries in a horizontally sharded world. Queries for a single shard key are relatively easy to implement. The query engine just needs to extract the shard key and route the query to the appropriate physical database.

However, if the query does not contain a sharding key, the query engine has to perform a more complex “scatter-gather” operation. This operation is similar to a hide-and-seek game where you send the query to every shard (scatter), and then piece together answers from each shard (gather). 

The below diagram shows how single-shard queries work when compared to scatter-gather queries.

As you can see, this increases the load on the database, and having too many scatter-gather queries can hurt horizontal scalability.
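
Here’s a minimal sketch of the two paths; the route() helper and the DSN list are hypothetical stand-ins for DBProxy’s routing logic.

```python
# Simplified contrast between single-shard routing and scatter-gather.
import psycopg2

def run_query(sql: str, shard_key, route, all_shard_dsns):
    """route(shard_key) returns the DSN of the shard that owns the key."""
    if shard_key is not None:
        # Single-shard query: go straight to the one database that owns the key.
        with psycopg2.connect(route(shard_key)) as conn, conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()

    # No sharding key: scatter the query to every shard, then gather the results.
    rows = []
    for dsn in all_shard_dsns:
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(sql)
            rows.extend(cur.fetchall())
    return rows  # every shard did work, which is why too many of these hurt scalability
```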

To manage things better, DBProxy also provides load shedding, transaction support, database topology management, and improved observability.

Shadow Application Readiness Framework

Figma added a “shadow application readiness” framework capable of predicting how live production traffic would behave under different potential sharding keys. 

This framework helped them keep DBProxy simple while reducing the work application developers had to do to rewrite unsupported queries.

All queries and their associated plans were logged to a Snowflake database, where the team could run offline analysis. Based on the data collected, they were able to pick a query language that supported the most common 90% of queries while avoiding worst-case complexity in the query engine.
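
A simplified picture of that offline analysis: for each candidate sharding key, measure what share of logged production query volume already filters on it and would therefore remain single-shard. The log record shape below is invented for illustration.

```python
# Hypothetical offline analysis: estimate how much query volume each candidate
# sharding key would keep single-shard.
from collections import Counter

def shard_key_coverage(query_logs, candidate_keys):
    """query_logs: iterable of dicts like {"filter_columns": ["file_id"], "count": 120}."""
    totals, covered = Counter(), Counter()
    for record in query_logs:
        for key in candidate_keys:
            totals[key] += record["count"]
            if key in record["filter_columns"]:
                covered[key] += record["count"]
    return {key: covered[key] / totals[key] for key in candidate_keys if totals[key]}

logs = [
    {"filter_columns": ["file_id"], "count": 900},
    {"filter_columns": ["org_id"], "count": 80},
    {"filter_columns": [], "count": 20},   # no key: would become scatter-gather
]
print(shard_key_coverage(logs, ["file_id", "org_id"]))   # file_id keeps 90% single-shard
```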

Conclusion

Figma’s infrastructure team shipped their first horizontally sharded table in September 2023, marking a significant milestone in their database scaling journey.

The rollout was a success: the team observed no regressions in latency or availability after the sharding operation.

Figma’s ultimate goal is to horizontally shard every table in their database and achieve near-infinite scalability. They have identified several challenges that still need to be solved, such as:

Lastly, once they have built up sufficient runway, they also plan to reassess their current approach of in-house RDS horizontal sharding against open-source or managed alternatives.


