Last updated 13 days ago by Dap Dapsql
As heavy users of PostgreSQL since 2012, we’ve learned quite a bit about operating PostgreSQL at scale. Our Manta object storage system uses a large fleet of sharded, highly-available, replicated PostgreSQL clusters at the heart of the metadata tier. When an end user requests their object, say http://us-east.manta.joyent.com/dap/public/kartlytics/videos/2012-09-06_0000-00.mov, Manta winds up looking in this PostgreSQL cluster for the metadata for that object in order to find the storage servers hosting copies of the object (along with the size, checksum, and other useful information).
From 2012 right up through the present, one of our biggest challenges has been managing PostgreSQL’s vacuum process. I won’t rehash the details here, as there’s plenty of useful information about it. Indeed, managing vacuum is such a well-known challenge in operating PostgreSQL that major efforts are underway to fundamentally change the on-disk storage system to eliminate the underlying problem. Since these efforts appear to be years away (especially for deployments where major PostgreSQL version upgrades will take time to test and orchestrate), it remains important to understand this problem today.
This post assumes that you’re familiar enough with PostgreSQL to understand:
In our production deployments, vacuum operations often take at least 5-10 days. For various reasons, it’s not uncommon for them to take 15-20 days, and we’ve seen some take upwards of 40 days. Vacuum has historically been a somewhat opaque process: it runs for a long time until it finishes, at which point it logs some basic information about what happened. Fortunately, recent versions of PostgreSQL include a view describing the progress of ongoing vacuum operations. This is fantastic! The docs are a bit light on interpreting this view, and while you can find posts explaining a bit about how to use the new view, I was left with a lot of questions. Which of the several sets of counters should we be looking at to determine overall progress? Are there even counters for all the parts we care about? Are all the counters expected to reach 100%? How do we know when certain phases will repeat and what can we do about this? This post documents how we’ve visualized the data in the view and what we’ve learned about the underlying process.
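As a rough illustration of what reading those counters can look like, here is a minimal Python sketch. It assumes the view in question is `pg_stat_progress_vacuum` (available in PostgreSQL 9.6 and later) and uses its `phase`, `heap_blks_total`, `heap_blks_scanned`, `heap_blks_vacuumed`, and `index_vacuum_count` columns. The helper names `percent` and `summarize` and the sample row are hypothetical, made up for this example. It does not settle which counter best reflects overall progress; it only turns the raw block counts into percentages.

```python
# Query text only; running it requires a live PostgreSQL 9.6+ connection
# and a client library, neither of which this sketch depends on.
PROGRESS_QUERY = """
SELECT pid, phase, heap_blks_total, heap_blks_scanned,
       heap_blks_vacuumed, index_vacuum_count
FROM pg_stat_progress_vacuum
"""


def percent(done, total):
    """Return done/total as a percentage, or None when total is zero or missing."""
    if not total:
        return None
    return 100.0 * done / total


def summarize(row):
    """Summarize one row of the view, given as a dict keyed by column name."""
    total = row["heap_blks_total"]
    return {
        "pid": row["pid"],
        "phase": row["phase"],
        # Two separate counters: blocks scanned so far and blocks vacuumed so far.
        "scanned_pct": percent(row["heap_blks_scanned"], total),
        "vacuumed_pct": percent(row["heap_blks_vacuumed"], total),
        # Number of index vacuum cycles completed so far.
        "index_vacuum_count": row["index_vacuum_count"],
    }


# Hypothetical row, invented for illustration (not real production data).
sample_row = {
    "pid": 12345,
    "phase": "scanning heap",
    "heap_blks_total": 1000,
    "heap_blks_scanned": 500,
    "heap_blks_vacuumed": 250,
    "index_vacuum_count": 1,
}

summary = summarize(sample_row)
print(summary)
```

In practice the rows would come from running `PROGRESS_QUERY` through whatever PostgreSQL client you already use, and the two percentages can disagree substantially, which is part of why the questions above are worth asking.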