How an OOM Issue With BIND 9 Led to Discovering a Memory Allocator Glitch

We recently dealt with an interesting case involving a fleet of BIND 9.16.27 instances running on FreeBSD 12.3. It was reported to us that over a period of a few days, named processes were exhausting all available memory on the hosts, causing them to start swapping memory to disk. The interesting part was that the servers in question were neither performing recursion (so the ever-increasing memory footprint could not be attributed to cache bloat) nor subject to a particularly significant query load. This blog post describes the steps we took to determine the root cause of this issue and what we found out during that journey.

Off We Go

We started the investigation by checking whether the increase in memory use was tracked by BIND 9’s internal memory contexts. These can record every allocation made by our own code (down to file:line tuples), so we could periodically collect dumps of allocation data and then look for differences between them, which would probably allow us to at least narrow down the part of the code to take a closer look at. Unfortunately, in this case the graph of total memory use tracked by libisc memory contexts turned out to plateau after an initial build-up. This meant that the problem lay either in the code of one of the external libraries that BIND 9 uses or in the way BIND 9 interacts with those libraries.

Since we did not get similar reports for Linux environments, our first suspicion was that the root cause was a BIND 9 bug caused by a missing destructor for some synchronization primitive. One important distinction between libthr (FreeBSD’s POSIX threads library) and other implementations such as glibc’s pthreads is that libthr allocates its primitives (mutexes, condition variables, etc.) on the heap. This means that failing to invoke a destructor for some primitive (e.g. pthread_mutex_destroy()) at the end of its intended lifetime causes memory allocated for that primitive to leak. This may ultimately lead to an out-of-memory condition, particularly if the problem exists on a hot path. This was exactly what was happening in the case of a bug we fixed recently. However, that particular problem was already addressed in the version of BIND 9 running on the affected servers, so we had to look for other memory leaks of this type in the environment in question to ensure we had not missed anything.
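
To illustrate the class of bug we were looking for, here is a minimal, hypothetical sketch of the leak pattern (the function is made up for illustration; it is not BIND 9 code):

#include <pthread.h>

/*
 * On libthr, pthread_mutex_init() allocates the mutex state on the heap,
 * so every call to this function leaks one such allocation because the
 * matching pthread_mutex_destroy() call is missing.  On glibc, where the
 * mutex state lives entirely inside pthread_mutex_t, the same omission
 * usually goes unnoticed.
 */
void
hypothetical_hot_path(void) {
	pthread_mutex_t lock;

	pthread_mutex_init(&lock, NULL);
	pthread_mutex_lock(&lock);
	/* ... critical section ... */
	pthread_mutex_unlock(&lock);

	/* BUG: pthread_mutex_destroy(&lock) is never called. */
}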

Quick experiments with Valgrind did not uncover anything suspicious. Since using Valgrind incurs a significant performance penalty, we also tried finding memory leaks with a work-in-progress approach that applies jemalloc’s leak-checking capabilities to named. The idea was that memory could be leaking due to some specific ordering of events which never happened when named was running under Valgrind, because of the latter’s significant overhead. This was slightly tricky to implement, for two reasons:

  1. While jemalloc is the system allocator on FreeBSD, the default system build does not include support for profiling (--enable-prof).

  2. Autoconf-based build-time detection of jemalloc is available only in BIND 9.18+.

To work around these obstacles, we built jemalloc from source with profiling enabled and used LD_PRELOAD to make the runtime linker use our custom profiling-enabled jemalloc build for memory management, instead of the one baked into FreeBSD’s libc. This allowed us to generate call graphs of the functions responsible for creating all outstanding (not free()d) memory allocations at named exit time. Unfortunately, analyzing these graphs did not uncover a promising path forward, either: all reported memory allocations stemmed from early named initialization and therefore could not contribute to a memory leak which was growing over time.

Homing In

While conducting the seemingly fruitless experiments with a profiling-enabled jemalloc build, we noticed one peculiarity: a ps auxw loop launched in the background was occasionally reporting a weird sequence of memory-use statistics:

                             USER     PID  %CPU  %MEM       VSZ      RSS  TT  STAT  STARTED     TIME  COMMAND
Thu Mar 31 14:29:53 UTC 2022 named  76113   0.0   1.6   3722832  1074476   0  I+    13:31    0:16.89  /usr/local/sbin/named -g -c /var/named/named.conf
Thu Mar 31 14:29:58 UTC 2022 named  76113   0.0   1.6   3722832  1074476   0  I+    13:31    0:16.89  /usr/local/sbin/named -g -c /var/named/named.conf
Thu Mar 31 14:30:03 UTC 2022 named  76113  28.2   6.4  19557584  4273736   0  I+    13:31    0:20.24  /usr/local/sbin/named -g -c /var/named/named.conf
Thu Mar 31 14:30:08 UTC 2022 named  76113  23.2   1.6   3844268  1100952   0  I+    13:31    0:21.65  /usr/local/sbin/named -g -c /var/named/named.conf
Thu Mar 31 14:30:13 UTC 2022 named  76113  12.5   1.6   3844268  1101004   0  I+    13:31    0:21.65  /usr/local/sbin/named -g -c /var/named/named.conf

Around 14:30:03, over a period of just a few seconds, VSZ jumped from about 3.5 GB to about 18.5 GB, RSS jumped from about 1 GB to about 4 GB, and then both values dropped down to almost their previous values; yet memory use in the “new stable state” was visibly higher than before that event.

Looking closer at the ps output history, we quickly noticed that such events were happening at 15-minute intervals, which we tracked down to a cron job collecting named statistics from http://localhost:8053/json. We ran the same ps output collection process without LD_PRELOAD affecting named (i.e. using the stock OS allocator) and the periodic jumps in memory use were still happening:

                             USER     PID  %CPU  %MEM       VSZ      RSS  TT  STAT  STARTED     TIME  COMMAND
Thu Mar 31 14:44:57 UTC 2022 named   5831   0.0   0.6    616092   417880   0  I+    14:36    0:00.72  /usr/local/sbin/named -g -c /var/named/named.conf
Thu Mar 31 14:44:59 UTC 2022 named   5831   0.0   0.6    616092   417880   0  I+    14:36    0:00.72  /usr/local/sbin/named -g -c /var/named/named.conf
Thu Mar 31 14:45:01 UTC 2022 named   5831   2.2   0.8    711020   503780   0  I+    14:36    0:00.96  /usr/local/sbin/named -g -c /var/named/named.conf
Thu Mar 31 14:45:03 UTC 2022 named   5831   1.5   0.8    711020   503780   0  I+    14:36    0:00.96  /usr/local/sbin/named -g -c /var/named/named.conf
Thu Mar 31 14:45:05 UTC 2022 named   5831   1.1   0.8    711020   503780   0  I+    14:36    0:00.96  /usr/local/sbin/named -g -c /var/named/named.conf

So what was happening during the statistics collection process that caused such shifts in memory use? Flame graphs to the rescue!

We used a slightly tweaked dtrace invocation suggested by Brendan Gregg for collecting on-CPU stack traces for a process of choice:

dtrace -x ustackframes=100 -n 'profile-1000 /execname == "named" && arg1/ { @[ustack()] = count(); } tick-2s { exit(0); }' -o out.stacks

We ran the above command while a curl fetch collecting JSON statistics was running. The resulting out.stacks file was post-processed using stackcollapse.pl and passed to flamegraph.pl, which produced the following image:

Chart titled 'named activity during a JSON statistics fetch,' with one unusually high spike

This immediately pointed at isc_taskmgr_renderjson() as the primary suspect. That function was present in 40% of all named stack traces collected during the profiling period.

Another thing that drew our attention was how prominent memory-management functions were in the flame graph. Freeing a JSON object consuming 15% of all CPU time assigned to named? Something had to be wrong there. Before moving forward, we decided to simplify the graph a bit by removing shared object names from the labels, cutting off the stack traces at all libc functions related to memory management, and marking such cuts with contrasting colors:

Chart titled 'named activity during a JSON statistics fetch,' with functions labeled

This emphasized how omnipresent memory management functions were in the profile: in total, about 30% of all CPU time assigned to named during the profiling period was spent in either malloc(), calloc(), or free(). That is a lot of time to spend purely on memory management.

-ETOOMUCHSTATISTICS

We resumed our investigations by going back to isc_taskmgr_renderjson(), the function responsible for producing task manager statistics. A quick glance at the JSON object returned by named on a host with 32 logical CPUs confirmed that the task manager statistics were indeed huge - by far the largest component of the statistical data:

$ curl -s http://localhost:8053/json | jq -c | wc -c
4106028
$ curl -s http://localhost:8053/json | jq -c ".taskmgr" | wc -c
3599300

Task manager statistics are nothing more than a list of task objects, each of which has just a few integer/string fields attached to it. So… what is using so much space? Exactly how many tasks are there in a named instance running on a 32-CPU machine?

$ curl -s http://localhost:8053/json | jq ".taskmgr.tasks | length"
36897

Whoa, that’s a lot of tasks. Since each of them has a name which describes its purpose and therefore also identifies the piece of code which created it, we took a look at which types of tasks were the most prevalent ones:

$ curl -s http://localhost:8053/json | jq ".taskmgr.tasks[].name" | sed -E "s|res[0-9]+|res<number>|" | sort | uniq -c | sort -n
      1 "ntatable"
      1 "server"
      1 "zmgr"
      2 "ADB"
      2 "cache_dbtask"
      2 "resolver_task"
      2 "statchannel"
      2 "view"
     10 "loadzone"
     10 "zone"
   2048 null
   2048 "res<number>"
  32768 "udpdispatch"

The high count of udpdispatch tasks (which scales linearly with the number of CPUs available on the host) obviously stands out. It is a known issue in BIND 9.16 and older versions; in BIND 9.18, the dispatch code was reworked from the ground up. One of the results of that huge refactoring effort was dropping dedicated udpdispatch tasks altogether, as their high counts were not translating into improved performance in any way.

Let’s estimate the amount of memory-related work required to produce JSON statistics on a 32-CPU host: each of the ~37,000 task objects included in the JSON statistics has three string fields and three integer fields. Each of these fields is a JSON object in its own right, so a quick back-of-the-envelope calculation suggests that there are at least 200,000 memory allocations happening for every JSON statistics fetch on a 32-CPU machine. That explained why a profiling-enabled jemalloc build needed tons of space to store the stack trace for every single allocation. But that’s not all - after being allocated, the primitive objects then have to be put together into one large object, which in turn needs to be serialized. And finally, all of those objects have to be freed, which also does not happen instantly.
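
For a sense of where those allocations come from, here is a rough, illustrative sketch of what rendering a single task entry with json-c involves (the field names are our guess at the layout, not necessarily the exact ones used by BIND 9):

#include <stdint.h>

#include <json-c/json.h>

/*
 * Illustrative only: one container object plus one heap-allocated
 * json_object per field - around seven allocations per task before even
 * counting the string contents, multiplied by ~37,000 tasks for every
 * statistics fetch.
 */
static json_object *
render_task(const char *id, const char *name, const char *state,
	    int64_t quantum, int64_t references, int64_t events) {
	json_object *task = json_object_new_object();

	json_object_object_add(task, "id", json_object_new_string(id));
	json_object_object_add(task, "name", json_object_new_string(name));
	json_object_object_add(task, "state", json_object_new_string(state));
	json_object_object_add(task, "quantum", json_object_new_int64(quantum));
	json_object_object_add(task, "references",
			       json_object_new_int64(references));
	json_object_object_add(task, "events", json_object_new_int64(events));

	return (task);
}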

The simplest workaround for this problem was not to use the “catch-all” /json URL, instead relying on other URLs allowing more granular access to JSON statistics (to avoid generating task manager statistics altogether). We could end the story here, but “just do not use this feature” is not the best response to give to a user reporting a problem. Also, if the JSON objects are properly destroyed after they are sent to the client, how could this be the root cause of a long-term problem?

Pop the Hood!

While some of the techniques we used to investigate the problem up to this point (e.g. building a library from source) may not be possible to employ in every production environment, all of them could at least be applied to a live named instance serving real traffic. To get to the bottom of what was happening, we needed to be able to reproduce the problem in an environment allowing free experimentation (using debuggers, experimental code changes, etc.).

Interestingly enough, we quickly managed to reproduce the problem in a FreeBSD 12.3 QEMU VM with 32 virtual CPU cores, but not on Linux (where the BIND 9.16 binaries were forcibly linked with jemalloc using a configure.ac hack) - not even when we forced the jemalloc build used on Linux to use the same build-time options as FreeBSD’s default system build. The behavior we observed on Linux was that as named processed numerous requests for JSON statistics, its memory use increased over time, but not as quickly as on FreeBSD and, most importantly, not indefinitely.

To examine more detailed data than just ps output, we added malloc_stats_print() calls to named and tried to determine which counters reflected the ps-reported VSZ/RSS values. It quickly turned out that the number of dirty pages for a specific jemalloc arena was correlated with the values reported by ps. Over time, the value of that counter was increasing in a seemingly unbounded manner on FreeBSD, but not on Linux.
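
The instrumentation itself is trivial; a minimal sketch of what we added looks roughly like this (jemalloc exposes malloc_stats_print() via <malloc_np.h> on FreeBSD and via <jemalloc/jemalloc.h> on other systems):

#include <malloc_np.h>	/* FreeBSD; use <jemalloc/jemalloc.h> elsewhere */

/*
 * Dump jemalloc's statistics (including per-arena dirty page counts) to
 * stderr; passing NULL as the write callback makes jemalloc use its
 * default output routine.  We called this at points of interest, e.g.
 * right after a JSON statistics fetch had been served.
 */
static void
dump_allocator_stats(void) {
	malloc_stats_print(NULL, NULL, NULL);
}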

Even without knowing what was happening behind the scenes, this discovery was substantial because it led us to the first practically feasible solution for the problem. Our line of thinking was: if the infinite memory use growth is caused by unbounded amassing of dirty pages, can we somehow prevent it from happening? Browsing the jemalloc documentation, we came across the dirty_decay_ms option, which looked promising:

Approximate time in milliseconds from the creation of a set of unused dirty pages until an equivalent set of unused dirty pages is purged (…) and/or reused. Dirty pages are defined as previously having been potentially written to by the application, and therefore consuming physical memory, yet having no current use. (…) A decay time of 0 causes all unused dirty pages to be purged immediately upon creation. (emphasis added)

“Job Done, off to the Pub”?

We confirmed that starting named with the MALLOC_CONF environment variable set to dirty_decay_ms:0 causes memory use to remain stable. This solution could be successfully deployed with the stock FreeBSD system allocator and precompiled BIND 9 binaries - no patching was required anywhere. It looked elegant, but… if it was really as gorgeous as our complacent selves thought it was, why would dirty_decay_ms not be set to 0 by default by jemalloc itself?

The catch is that not retaining any dirty pages (in other words, immediately returning all dirty pages to the operating system upon their creation) hinders performance because it prevents dirty pages from being recycled for subsequent allocations, increasing the allocator’s system call footprint. So while this fix was acceptable in our specific use case, it was not necessarily feasible in the general case (e.g. under more significant load). That conclusion posed a different question: why were the dirty pages in named not purged over time, as promised by jemalloc’s documentation? A bug like this just had to have been triggered somewhere, somehow, by someone before, right?

Keep It Simple, Stupid

Since named is a non-trivial piece of software, we wanted to strip the problem down to the bare basics, hoping to be able to better understand jemalloc’s behavior in a simpler application. To that end, we wrote a C program which naïvely mimics what named does in isc_taskmgr_renderjson(): each time a signal is received, one of the threads spawned by the program creates a JSON object comprised of tens of thousands of other JSON objects, serializes that object, and then frees it; jemalloc statistics are output after every iteration.
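
A condensed sketch of the reproducer’s core routine looks roughly like this (the real reproducer.c differs in details; thread and signal handling are omitted here and the constant and key names are illustrative):

#include <assert.h>
#include <stdio.h>

#include <json-c/json.h>
#include <malloc_np.h>	/* for malloc_stats_print() on FreeBSD */

#define OBJECTS_PER_ITERATION 40000

/*
 * Build a JSON object consisting of tens of thousands of smaller JSON
 * objects, serialize it, free the whole tree, and print jemalloc's
 * statistics - mimicking what isc_taskmgr_renderjson() does during a
 * statistics fetch.
 */
void
generate_json(void) {
	json_object *root = json_object_new_object();
	assert(root);

	for (int i = 0; i < OBJECTS_PER_ITERATION; i++) {
		char key[32];
		snprintf(key, sizeof(key), "object-%d", i);
		json_object_object_add(root, key, json_object_new_int64(i));
	}

	/* Serialization triggers large realloc() calls on the output buffer. */
	(void)json_object_to_json_string(root);

	/* Drop the entire tree, then report the allocator's state. */
	json_object_put(root);
	malloc_stats_print(NULL, NULL, NULL);
}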

Unfortunately, this program was unable to trigger unbounded memory-use growth on any operating system we tried it on. This meant that there was some difference between the way jemalloc behaved for this program and for named. Following this lead, we once again resorted to comparing malloc_stats_print() outputs for our sample application and for named. On both Linux and FreeBSD, one difference became striking after a few dozen iterations. This is a sample excerpt from the malloc_stats_print() statistics for a single jemalloc arena in named running on Linux:

decaying:  time       npages       sweeps     madvises       purged
   dirty: 10000        17625            0            0            0
   muzzy: 10000            0            0            0            0

and this is a corresponding sample from our attempt at a reproducer:

decaying:  time       npages       sweeps     madvises       purged
   dirty: 10000        10715           17           53        19299
   muzzy: 10000           27            7           17         9542

As you can see, in our sample program, dirty pages were clearly being purged over time, as expected. But in named, even on Linux, where the dirty page count was not growing indefinitely, it seemed that jemalloc was not returning any dirty pages to the operating system.

Comic with two identical panels of a frustrated person staring at a computer, with the caption on one panel 'It doesn't work...why' and on the other 'It works...why'

rr, or There and Back Again

To learn which jemalloc code paths are taken when dirty pages are purged, we captured an rr recording of our sample program and used a data watchpoint set on the purged counter, combined with the power of reverse execution, to understand the logic behind the process. That allowed us to determine that in jemalloc 5.1.0 (the version shipped with FreeBSD 12.3), in order for arena decay (i.e. dirty page purging) to be triggered, a so-called “ticker” has to fire. A ticker is a simple construct that jemalloc uses to ensure that certain actions are taken only “once every N times” (e.g. “try purging some dirty pages once every 1,000 (de)allocations”). Apparently these tickers were not firing during JSON statistics generation in named. Why? rr to the rescue again!
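
Conceptually, a ticker behaves like the following sketch (a simplification for illustration, not jemalloc’s actual code):

#include <stdbool.h>
#include <stdint.h>

/* Simplified illustration of jemalloc's ticker concept. */
typedef struct ticker {
	int32_t tick;	/* countdown until the next firing */
	int32_t nticks;	/* period, e.g. 1000 (de)allocations */
} ticker_t;

/*
 * Called on every (de)allocation; returns true once every `nticks`
 * calls, at which point the caller may e.g. attempt to purge some dirty
 * pages.
 */
static bool
ticker_tick(ticker_t *ticker) {
	if (--ticker->tick <= 0) {
		ticker->tick = ticker->nticks;
		return (true);
	}
	return (false);
}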

To our astonishment, while we were capturing an rr recording of named on Linux, the counter of dirty pages started significantly growing with each curl request, just like it originally did on FreeBSD! Hooray? We had no idea why that happened, but we first focused on figuring out why the tickers were not firing - fortunately, the number of purged dirty pages remained at 0 even under rr. After all, with a captured rr recording of what looked like a successful reproduction of the original issue, we could just get back to exploring that angle later on (which we did).

Going back and forth through both rr recordings, we ultimately reached a surprising conclusion: some ticker structures were never initialized by jemalloc in named. Further analysis of the code paths involved led us to an explanation of that phenomenon: the threads that did not have their ticker structures initialized were the ones which called free() before issuing any *lloc() call. This can be the case with e.g. named’s worker threads: they loop over event queues, executing callbacks as events are posted to them, and it is common for an event callback to free the event structure associated with it. That happens e.g. during named startup, when the so-called task-exclusive mode (which, by the way, is also being removed from BIND 9) is entered early in the process in order to load DNS zones; doing that involves pausing most worker threads, which the network manager achieves by posting “pause” events to worker threads’ event queues. Each worker thread then processes the “pause” event and frees it afterward.
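
The problematic pattern boils down to the following self-contained sketch (illustrative only - the structure and function names are made up and this is not the actual BIND 9 network manager code): the worker thread’s very first heap operation is a free() of a block allocated by another thread.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical event structure, for illustration only. */
struct event {
	void (*callback)(void *arg);
	void *arg;
};

static char pause_name[] = "pause";

static void
handle_pause(void *arg) {
	printf("handling event: %s\n", (const char *)arg);
}

/*
 * The worker's first interaction with the heap is free()ing an event
 * allocated by the posting thread; with jemalloc 5.1.0, this leaves the
 * thread's decay ticker uninitialized for its entire lifetime.
 */
static void *
worker_thread(void *arg) {
	struct event *event = arg;

	event->callback(event->arg);
	free(event);

	return (NULL);
}

int
main(void) {
	pthread_t thread;
	struct event *event = malloc(sizeof(*event));

	event->callback = handle_pause;
	event->arg = pause_name;

	pthread_create(&thread, NULL, worker_thread, event);
	pthread_join(&thread, NULL);

	return (0);
}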

Drumroll…

In short, we learned that if a thread starts with a free(), it becomes incapable of ever triggering dirty-page purging during its lifetime. Note that purging (returning dirty pages to the operating system) is not the same thing as recycling (reusing dirty pages for subsequent allocations before returning them to the operating system). On Linux, dirty-page recycling was efficient enough to keep memory use in check in the long term. On FreeBSD, it was not (more on that below). However, jemalloc was affected by the broken dirty-page purging issue on both of these systems.

To confirm our findings, we tweaked our sample program slightly, so that each of its threads started with a free():

diff --git a/reproducer.c b/reproducer.c
index 494d05e..666f9a1 100644
--- a/reproducer.c
+++ b/reproducer.c
@@ -14,8 +14,12 @@
 #define THREADS_TO_CREATE 4
 
 static bool done;
+void *dummy;
 
 void generate_json(void) {
+	free(dummy);
+	dummy = malloc(sizeof(dummy));
+
 	json_object *root = json_object_new_object();
 	assert(root);
 
@@ -108,6 +112,8 @@ int main(void) {
 
 	fprintf(stderr, "PID: %d\n", getpid());
 
+	dummy = malloc(sizeof(dummy));
+
 	setup_threads(threads);
 	setup_signals();
 

With that change in place, lo and behold, the malloc_stats_print() statistics reported by our sample program started looking like this:

decaying:  time       npages       sweeps     madvises       purged
   dirty: 10000        11253            0            0            0
   muzzy: 10000            0            0            0            0

This issue was reported to jemalloc developers, but not before we found out that it was inadvertently fixed in jemalloc’s dev branch by moving the ticker state to a different structure. However, that fix is not yet present in any numbered jemalloc release (the latest one being 5.2.1 as of this writing), though it will be included in the upcoming 5.3 version.

Knowing that an upstream fix is in the pipeline is good, but it is not much consolation to all the existing deployments out there in the wild that are affected by this issue. To help with that, we prepared a workaround which forces each worker thread to perform a dummy memory allocation at the beginning of its lifetime. That workaround will be included in BIND 9.16.29.
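
The idea behind the workaround can be sketched as follows (a simplified illustration, not the literal BIND 9 patch):

#include <stdlib.h>

/*
 * Ensure that the first heap operation performed by a newly started
 * worker thread is an allocation, so that jemalloc fully sets up its
 * per-thread state (including the decay ticker) before the thread ever
 * gets a chance to call free() first.
 */
static void
jemalloc_thread_workaround(void) {
	void *dummy = malloc(8);
	free(dummy);
}

Calling something like this at the very beginning of each worker thread’s start routine is enough to sidestep the uninitialized-ticker problem in affected jemalloc versions.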

Note that while this problem was discovered in BIND 9.16 running on FreeBSD, analyzing its nature revealed that it also affects BIND 9.18+ binaries linked with jemalloc on other platforms. We have not yet received any reports of similar issues with BIND 9.18 - and given that this problem will also be worked around in BIND 9.18.3+, hopefully we never will!

Tying Up Loose Ends

The one burning question that was left unanswered was this: how come we managed to trigger the issue on Linux under rr, but not without it? One of the limitations of rr is that it emulates a single-core machine, which caused jemalloc to allocate just a single arena for all named worker threads. Knowing this, we were able to reproduce FreeBSD-like jumps in memory use after each curl request on Linux without rr by setting the MALLOC_CONF environment variable to narenas:1, emulating what happens when named is run under rr.

But… if this change is necessary to trigger the problem, then why was the issue originally noticed on FreeBSD, where named uses multiple jemalloc arenas? It’s… complicated. The answer lies in the nitty-gritty, OS-specific details of how jemalloc recycles dirty pages when realloc() calls are issued for large extents. (This is what the json-c library does when it converts an in-memory representation of a JSON object to a string: it resizes the target buffer to twice its current size each time it runs out of space. Given that serialized JSON statistics are several megabytes in size on hosts with numerous logical CPUs, the serialization process causes realloc() to be called with multi-megabyte target sizes.) One example difference between Linux and FreeBSD is that on the former, jemalloc prefers to retain dirty pages where it would normally give them back to the kernel on other operating systems. But there are more subtleties involved in the process that could easily take a separate article to describe; suffice it to say that we did manage (albeit very rarely) to observe multi-megabyte, FreeBSD-like jumps in memory use after a single curl request on Linux - they just do not happen as often there as on FreeBSD.
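
The buffer-growth pattern described above looks roughly like the following sketch (illustrative; json-c’s actual printbuf code differs in details):

#include <stdlib.h>
#include <string.h>

struct outbuf {
	char *data;
	size_t size;	/* bytes allocated */
	size_t used;	/* bytes written so far */
};

/*
 * Double the buffer every time it runs out of space: serializing a
 * multi-megabyte JSON document therefore issues a series of
 * progressively larger realloc() calls on a single large extent, which
 * is exactly the operation whose dirty-page handling differs between
 * Linux and FreeBSD.
 */
static int
outbuf_append(struct outbuf *buf, const char *str, size_t len) {
	while (buf->used + len > buf->size) {
		size_t newsize = (buf->size > 0) ? buf->size * 2 : 64;
		char *newdata = realloc(buf->data, newsize);

		if (newdata == NULL) {
			return (-1);
		}
		buf->data = newdata;
		buf->size = newsize;
	}

	memcpy(buf->data + buf->used, str, len);
	buf->used += len;

	return (0);
}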

Where We’re Going, We Won’t Need 32768 udpdispatch Tasks

While investigating this issue, we employed a variety of techniques for debugging memory-use issues in BIND 9. We also learned quite a bit about how jemalloc works and how to interpret its arena statistics, which will certainly pay off the next time we have to investigate a similar issue. Finally, we found out that one of the factors contributing to the problem (the huge number of tasks created on certain types of hosts) is already gone from newer BIND 9 versions, which reassures us that our continuous refactoring efforts are worth their non-trivial cost in the long term. Back to work, then!
