The BIND 9 core development team includes three people who focus on quality assurance. Given the size of the BIND legacy codebase, and the activity level of the larger development team, ensuring the quality of our monthly BIND releases is akin to the task given to Hercules: cleaning out the Augean stables in a day.
BIND 9 Statistics
| Statistic | Value |
| --- | --- |
| Code base | 680,000 lines of code (approximately) |
| Changes per month | 50-150 MRs merged |
| Core developers | 9 engineers, including 3 QA specialists |
The Synopsys Black Duck Open Hub site rates BIND 9 as “Very High Activity,” which means the rate of change in the repo is unusually high compared to other open source projects. This high rate of change, along with the overall size of the codebase, makes ensuring quality a particular challenge.
To get an update on what this team is doing, and how they are managing this task, I interviewed the team leader, Michał Kępień, via email. Clearly an above-average ability to prioritize tasks is important in this role: the answers below came back about five months later.
BIND 9 QA Overview
What are the main responsibilities of the BIND QA team? (I am thinking about release operations, packaging, maintaining the build farm, monitoring performance lab, triaging bugs, CVE processes, etc.)
We try to help everywhere we are needed, but our “official” day-to-day duties revolve around:
- Improving existing tools (and developing new ones) which help the developers make informed decisions and ensure the code being committed is not broken in one way or another,
- Overseeing the monthly release preparation process (which includes enforcing the schedule, looking for missing bits, polishing documentation, examining final test results, packaging, and more),
- Maintaining the CI environment (keeping the list of operating systems we run tests on up-to-date, monitoring capacity, tweaking settings for optimal resource usage, and more).
However, we also help developers reproduce bugs, review merge requests, carry out ad hoc tests on request, and do many other things, depending on current needs.
What things make BIND QA challenging?
The same things which make maintaining and improving BIND 9 code challenging: the DNS protocol itself is fairly complex, the deployment base is huge (which means the number of use cases out there is practically unlimited), every deployment environment is different (both in terms of the hardware/software platform used and ever-changing network conditions), and there is a lot of source code which was not written with testability in mind. This means we have to prioritize to at least cover the most typical scenarios.
Continuous Integration (CI) Tests and Release Operations
What kinds of tests do we do on every commit as part of our CI?
The code is built and tested (unit tests + system tests) on several
popular Linux distributions, FreeBSD, and Windows (where applicable).
Some of those builds employ various sanitizers (ASAN, UBSAN, TSAN).
Both GCC and Clang are used; compilation warnings are treated as errors
on the supported platforms.
Apart from the above, other tools are also run to ensure consistency of
coding style (clang-format, Coccinelle for C code; flake8, PyLint for
test code written in Python), enforce the development process we follow
(Danger), and detect the more obvious bugs early (Clang Static
Analyzer).
All in all, about 70 jobs are run for every revision of each merge
request. On top of that, scheduled pipelines are started for each
maintained branch on a daily basis; these include a few extra jobs whose
purpose is to either fill the gaps in platform coverage or run tests
which take too long to be invoked for every merge request (e.g. some
performance tests, respdiff).
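Since all of those jobs live in GitLab, getting a quick overview of a pipeline's matrix is mostly a matter of querying the CI API. The following is only a minimal sketch, not part of ISC's tooling: the project ID, pipeline ID, and job-name keywords are placeholder assumptions, and it simply tallies how many jobs in one pipeline mention a given compiler or sanitizer in their name.

```python
#!/usr/bin/env python3
"""Summarize the jobs in a single GitLab CI pipeline.

Illustrative sketch only: the project ID, pipeline ID, and keywords are
placeholders, not ISC's real values. Uses GitLab's documented REST
endpoint GET /api/v4/projects/:id/pipelines/:pipeline_id/jobs.
"""

import collections
import os

import requests

GITLAB_URL = "https://gitlab.isc.org"  # assumption: the public ISC GitLab
PROJECT_ID = 1                         # placeholder project ID
PIPELINE_ID = 123456                   # placeholder pipeline ID
KEYWORDS = ("gcc", "clang", "asan", "tsan", "ubsan", "system", "unit")


def fetch_jobs(project_id: int, pipeline_id: int) -> list[dict]:
    """Return all jobs of one pipeline (paginated, 100 per page)."""
    token = os.environ.get("GITLAB_TOKEN")
    headers = {"PRIVATE-TOKEN": token} if token else {}
    jobs, page = [], 1
    while True:
        resp = requests.get(
            f"{GITLAB_URL}/api/v4/projects/{project_id}"
            f"/pipelines/{pipeline_id}/jobs",
            params={"per_page": 100, "page": page},
            headers=headers,
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            return jobs
        jobs.extend(batch)
        page += 1


def summarize(jobs: list[dict]) -> None:
    """Print how many job names mention each keyword, plus the total."""
    counts = collections.Counter()
    for job in jobs:
        for keyword in KEYWORDS:
            if keyword in job["name"].lower():
                counts[keyword] += 1
    print(f"total jobs: {len(jobs)}")
    for keyword, count in counts.most_common():
        print(f"  {keyword}: {count}")


if __name__ == "__main__":
    summarize(fetch_jobs(PROJECT_ID, PIPELINE_ID))
```

With real IDs substituted (and a token exported as GITLAB_TOKEN if the project is not public), this prints a rough breakdown of the job matrix for one merge-request revision.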
What other QA processes do we do on a release candidate?
In terms of code testing, it is not so much a question of what extra
tests we run, but what test results we look at closely before signing off on a release.
All of the tests run for release tarballs are
automated and therefore also run (at least periodically) for preceding
revisions of the source tree. At release preparation time, we compare
current performance results with those obtained for previous releases
and analyze intermittent test failures to ensure they are not
manifestations of lurking bugs (usually they turn out to be test
code deficiencies, but it is not always immediately obvious).
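As an illustration of that intermittent-failure triage, here is a hypothetical sketch (not ISC's actual analysis code; the test names and outcomes are invented): it takes per-test outcome histories from recent pipelines and flags the tests that fail only some of the time, since those are the ones worth a closer look.

```python
"""Flag intermittently failing tests across recent CI pipelines.

Hypothetical sketch: the outcome data below is invented; real results
would come from the CI system rather than a hard-coded dict.
"""

# Outcome history per test, newest run last: True = passed, False = failed.
RESULTS = {
    "system:dnssec": [True, True, True, True, True],
    "system:statschannel": [True, False, True, True, False],
    "unit:rbtdb": [False, False, False, False, False],
}


def classify(outcomes: list[bool]) -> str:
    """Label a test as stable, consistently failing, or intermittent."""
    failures = outcomes.count(False)
    if failures == 0:
        return "stable"
    if failures == len(outcomes):
        return "consistently failing"
    return "intermittent"


for name, outcomes in sorted(RESULTS.items()):
    label = classify(outcomes)
    if label != "stable":
        rate = outcomes.count(False) / len(outcomes)
        print(f"{name}: {label} ({rate:.0%} failure rate) - needs triage")
```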
We also clean up the release notes and verify whether the documentation
changes introduced since the previous release are accurate, correct, and
complete.
How much time do you require between code freeze and release of a maintenance version? What happens during that time?
It varies case by case. In a typical release, it takes about two days to do
the things listed in my response to the previous question. That may sound
like a lot, but note that in certain months, we have five releases to
prepare: 9.17, 9.16, 9.11, plus Subscription Editions: 9.16-S and
9.11-S. Sometimes we wrap up within a day, but then other times some
nasty bug is found at the very last minute and that throws a spanner
into the works.
What tasks take the most time for the team?
It really is a mixture of all of the things listed above. While we try
to make sure the infamous bus factor is above 1 for critical work, each
one of us has their own niche of a sort in terms of what we spend most
of our time on. Scheduling and prioritizing can be tricky at times
because on the one hand, an innocent-looking OS update might trigger
issues which take days to solve, while on the other hand a task we
anticipated would take weeks to complete can sometimes be finished
sooner.
How much has our increased effort at packaging added to the workload?
Most of the work required to make packaging work was a one-off effort of
setting up a build system and integrating it with Cloudsmith. It took a
while, but things are pretty stable these days and ongoing maintenance
boils down to bumping version numbers and applying occasional tweaks to
the packaging recipes or the testing scripts when something breaks in a
new release.
Assessing Our QA Effectiveness
Of the different types of testing - static analysis, packet fuzzing, unit tests, system tests, build tests, performance tests, security testing, and so on - where do you think we have good coverage/effective testing, and where could we improve?
I think we are doing pretty well in terms of testing the build process.
We used to get non-trivial numbers of reports about broken builds after
each public release. This effect seems to have subsided in the past
months and I think GitLab CI running on a reasonably broad spectrum of
platforms, combined with pairwise testing of build options, played a major
role in that.
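Pairwise (all-pairs) testing means that every combination of values for any two build options appears in at least one CI build, without building the full cartesian product of configurations. Below is a minimal sketch of such a coverage check, assuming made-up option names and a made-up build matrix rather than BIND 9's real ./configure flags.

```python
"""Check that a set of build configurations covers all option pairs.

Illustrative sketch: the options and configurations are invented and do
not reflect BIND 9's actual build options or CI matrix.
"""

from itertools import combinations

# Each build option and the values it can take.
OPTIONS = {
    "crypto": ["openssl", "pkcs11"],
    "json-stats": [True, False],
    "lmdb": [True, False],
}

# The configurations actually built in CI (one dict per CI job).
BUILDS = [
    {"crypto": "openssl", "json-stats": True, "lmdb": True},
    {"crypto": "pkcs11", "json-stats": False, "lmdb": True},
    {"crypto": "openssl", "json-stats": False, "lmdb": False},
    {"crypto": "pkcs11", "json-stats": True, "lmdb": False},
]


def uncovered_pairs(options, builds):
    """Return every (option=value, option=value) pair no build exercises."""
    missing = []
    for opt_a, opt_b in combinations(options, 2):
        for val_a in options[opt_a]:
            for val_b in options[opt_b]:
                if not any(b[opt_a] == val_a and b[opt_b] == val_b
                           for b in builds):
                    missing.append((f"{opt_a}={val_a}", f"{opt_b}={val_b}"))
    return missing


gaps = uncovered_pairs(OPTIONS, BUILDS)
print("all option pairs covered" if not gaps else f"uncovered pairs: {gaps}")
```

With three two-valued options, four well-chosen builds cover every pair, whereas the full factorial would need eight; the savings grow quickly as the number of options increases.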
The development team also managed to eradicate all known issues reported
by various sanitizers (ASAN, UBSAN, TSAN) in BIND 9.16. The recent
refactoring of the dispatch code opened up some new code paths, leading
to new warnings which need to be addressed, but I am confident we will
get these sorted out over time.
We are continuously improving the scope of our internal performance
tests. The goal here is to be able to make informed design choices
based on solid data rather than just gut feelings and/or educated
guesses.
Fuzzing tests these days seem to have reached a point of diminishing
returns in terms of issues discovered in existing code, but they allow
us to sleep better at night, knowing that any issues with new code
will be detected in due course.
As for unit and system tests, the challenge here is that writing them is a
retroactive effort in the case of BIND 9: while we are writing tests
whenever possible for new code, there is still a large volume of code
which was committed in the past and not accompanied by tests. In terms
of the ratio of lines of code covered by unit and system tests, we are
currently just shy of 80%, but this is only one of the applicable
metrics.
How do you evaluate the effectiveness of our internal QA efforts (e.g., do we track how many bugs we find in internal testing vs external testing? do you have a sense of whether we are finding a healthy proportion internally?)?
Tongue-in-cheek: as long as we are getting any external reports about
actual bugs, it means there is room for improvement in our internal
testing.
Given the above, I am afraid we do not do any kind of tracking or
statistics. With the resources we have, we try to prioritize fixing
problems, of which there is an abundance.
Do you feel good about our ability to prevent the recurrence of previously discovered and fixed bugs?
Yes, definitely. I am more concerned about our ability to predict
future problems and/or catch mistakes in new code before it gets
released to the public than I am about our regression suite. Given the
number of test jobs we run, even rarely occurring but known problems
should become exposed over time.
What about performance testing and preventing performance regressions - what do we do as far as that, and to what extent is this ad hoc vs. a regular automated process?
Performance evaluation is currently not fully automatic: the tests are
run automatically on a regular basis, but their results are examined by
humans. Resolver performance in particular is a multi-dimensional subject
and there is no single metric that would allow one to look at it and say
“this is unequivocally better/worse than before.” We are, however,
exploring possible solutions for automatic flagging of drastic shifts in
performance numbers.
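A straightforward building block for that kind of automation is a threshold check against the previous release's numbers. The sketch below is hypothetical: the scenario names, QPS figures, and the 10% threshold are assumptions for illustration, not ISC's actual data or policy; it only shows the idea of flagging drastic shifts for human review.

```python
"""Flag large shifts between two sets of performance measurements.

Hypothetical sketch: the scenario names, numbers, and 10% threshold are
invented for illustration.
"""

THRESHOLD = 0.10  # flag changes larger than 10% in either direction

# Queries per second measured for the previous and the candidate release.
BASELINE = {"authoritative-udp": 950_000, "resolver-cold-cache": 68_000}
CANDIDATE = {"authoritative-udp": 962_000, "resolver-cold-cache": 54_000}

for scenario, before in BASELINE.items():
    after = CANDIDATE[scenario]
    change = (after - before) / before
    status = "FLAG for review" if abs(change) > THRESHOLD else "ok"
    print(f"{scenario}: {before} -> {after} QPS ({change:+.1%}) {status}")
```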
What are the BIND QA accomplishments in the past year or two that you are most proud of?
The most significant accomplishment was Petr Špaček’s work leveraging the CZ.NIC resolver performance tools for benchmarking BIND 9. (Note: this is a realistic test bed for benchmarking resolver operations, which Petr describes in his talk at RIPE 79.)
The other thing I am happy about is that we have managed to establish and maintain a
monthly release cadence, which is more challenging than it may sound.
Looking Forward
If you had more time or resources, what are some projects you would love to tackle?
Given the current state of the worldwide software industry, starting an
alpaca ranch crossed my mind more than one time in the past–
…oh, you mean for BIND 9? Due to the nature of our work, we discover
new research and experimentation opportunities almost on a daily basis and
each one of us has some ideas on what we could follow up on if there were
time. There is room for improvement in the way we write tests, store
their results, visualize them, and track trends over time. Sometimes,
we allow ourselves to go down a rabbit hole or two for the fun of it as
it helps prevent burnout and sometimes even results in new automated
tests being implemented. But usually there is enough priority work in
the queue to force us to defer “greenfield research” until some
undetermined point in the future.
If you could order some new tools to be “magicked into existence,” what would you like to have?
If I could ask for a pony, I would like to have a tool that would give us
deeper insight into how people use our software: what configuration
options they use, how that relates to the nature of their user base,
what platforms they run our software on, etc. Given our huge deployment
base, it is challenging to assess whether certain changes we make are
good for at least the majority of our users - and we have to resort to
educated guesses. For some changes, we consult the community through
our mailing lists, but even though the subscriber lists are of
substantial size, it is still just a small fraction of the entire user
population.
For years already, we have had internal discussions about potential
solutions that would allow such “reporting” in a secure and anonymous
way, but finding the sweet spot between “spyware” and “white noise
generator” is quite challenging. But hey, you said magic was on the
table.
Are there free open source tools we are using that you would recommend to other open source projects?
It may not be the type of thing you asked about, but for bug reproduction
and troubleshooting I can certainly recommend rr. Seriously, if
you have not tried it, drop everything and try it now as it might
revolutionize the way you approach troubleshooting.
What can users do to help improve BIND quality? How important are open source user bug reports to our overall quality process?
Bug reports are always much appreciated as long as the reporter provides
us with actionable information (we provide a GitLab issue template to
indicate what we consider useful information) and/or is willing to
cooperate when asked for more information (or experiments). Some
classes of bugs are pretty much impossible to track down without
extensive help from the reporter, either due to the platform used, the
specific network conditions in effect at a given time, or other dynamic
factors.
Please also remember that while we would love to make our software 100%
bug-free, we have to prioritize and we do not have the resources to fix
every single problem reported. That does not mean we do not appreciate
the reports and users’ cooperation, though.
This sounds like a good opportunity to remind people that
kindness goes a long way in the open source world. There is no better
way to make sure your problem will be ignored than through
obnoxiousness.
How do you keep the QA staff motivated, and how do you maintain your own motivation in the face of a fairly large volume of ongoing issues?
I touched upon that in one of the responses above: going down a
rabbit hole that fascinates you in one way or another from time to time
tends to be good for morale. Rabbit holes can be
opportunities in disguise because you never know when exploring new
areas of knowledge or experimenting with new tools (even seemingly
unrelated) will make you more effective at your day job and/or make the
product you are working on better. Every job has bits that you
hate - the important part is to make sure those ugly bits do
not take up most of your work time.