For reasons of security, reliability, and growth, organizations constantly upgrade their software to newer releases. Some upgrades are incremental and minor. Others, like the upgrade from Django REST Framework (DRF) 2 to 3, require code changes because the releases are incompatible. This article describes BitSight's upgrade experience, the lessons we learned, and how we improved because of it.
Django REST Framework (DRF)
DRF is a toolkit for building REST APIs on Django: it handles authentication, serialization of Object-Relational Mapping (ORM) data, and REST endpoint management. As such, it's a critical part of the customer portal, which is our product’s user interface. Instead of detailing all the changes required to upgrade from DRF 2.4 to DRF 3.6, we refer readers to the DRF 3.0 Announcement page.
There were three specific reasons why BitSight needed to upgrade to DRF 3. The first was the immutable DRF request object, which we saw as a potential security enhancement: if requests cannot be altered, they cannot be spoofed. The second was the difficulty of working with documentation that spans multiple releases with differing specifications. Consider a Google search for how to validate field data. Both DRF versions use a method named validate_<field_name>, but its signature changed between DRF 2 and DRF 3. The third was the need for newer functionality, such as pagination of large amounts of data returned to the customer portal.
There were two less obvious reasons to upgrade. The first was a decrease in developer productivity: to get some required DRF 3 functionality into our product, we had created our own ad-hoc implementations of that functionality. The second was that developing on older releases might hurt our ability to recruit engineers; it is certainly more attractive to work with the latest tools than with out-of-date releases.
How did we approach the upgrade?
As with any project, we needed estimates before making code changes, but estimating posed several challenges. We estimated the time and effort needed to migrate to DRF 3 knowing that, until the upgrade actually began, those estimates might be suspect. From previous upgrade efforts, we also knew that the estimates would become less accurate as functionality continued to be added to the code base. That inaccuracy would affect resource planning, and it would make it difficult during the project retrospective to attribute any difference between the estimated and actual time required to complete the upgrade.
Our plan of attack for the code changes centered on our large suite of unit and integration tests, which run constantly. The test suite is the knowledge base of the original development intent, and we could leverage it to ensure that the majority of the customer portal's functionality remained unchanged under the newer DRF. Before we could execute the tests, we needed a customer portal image that had DRF 3.6 and all its prerequisite software installed and was otherwise deployable. Building such an initial image can be costly because prerequisites and interdependencies have to be resolved first. Luckily, the time required for the initial deployment was minimal.
Since serialization, the process of translating, storing, and then reconstructing data, was a major change in DRF 3, we categorized our tests into groups by the kind of serializers used. We planned to first port the simpler serializers and views that used only built-in field types and ensure that these simple tests passed. This would give us some experience and provide a project completion metric: the percentage of tests passing. The second group to be ported was the more complex serializers with custom fields. The third group was the most complex serializers, which required the most code changes.
DRF 2 had a two-step object creation process in which the serializer validated the data and instantiated the object in one step and then saved it in a separate step. DRF 3 moved to a three-step process in which the serializer validates the data, instantiates the object, and then saves it to the database. We defined a fourth and last group for serializers with nested serializers or list serializers that would benefit from the new object creation process. We also placed in this last group any serializers with code that violated the request object immutability of DRF 3 or that contained ad-hoc code. The ad-hoc code consisted of code that added DRF 3 functionality like pagination to DRF 2, or custom packages like JSON that are now part of DRF 3.
Once our test suite executed successfully, we had confidence that the upgrade handled the majority of the functionality provided in the customer portal. We then conducted a series of sniff tests in which four or five people tried out the new customer portal. Each sniff test had a different group of testers chosen from teams outside of customer portal development: project office personnel, testers from diverse projects, developers of yet-to-be-released features, and DevOps personnel. Sniff tests brought the experience of the entire team to bear on the quality of the upgrade. From the problem report tickets created during the sniff tests, we created new test cases, additional process documentation, and enhancements to our regular regression testing.
What are the lessons learned?
Allocate more time for discovery and planning. Even though we created tasks for reading documentation and doing code discovery at the outset of the effort, we believe that one reason the upgrade took longer than estimated is that we did not spend enough time up front on discovery and planning. Once the time allotted for reading and discovery was used up, developers felt that they should move on to the actual upgrade. The upgrade might have been more efficient if the tasks for upfront work had included exit criteria, such as confirming the content or schedule of the remaining tasks.
Additionally, although we thought that the immutable request object used in DRF 3 might provide some security enhancement, generally it seems to have merely enforced good coding practices. If we had spent more time in discovery and planning, we might have determined that there was no explicit security enhancement because the request is contained within our system. The immutable request object would only protect our systems against an injection attack.
Sniff tests were very effective. Because we had a large test suite, we assumed that sniff tests would not be effective and pushed our sniff testing out until later in the project. Sniff tests were actually surprisingly effective at finding untested functionality. The fixes from the sniff tests improved the documentation and drove additions to the automated test suite. In hindsight, we should have run sniff tests early and often, with teams selected for their diversity. That diversity would have ensured that problems found in one area of code did not limit the other testers' investigation during the sniff test. In other words, if all the sniff testers were DevOps personnel testing the addition and deletion of users, any problem found there would effectively end that particular sniff test. Preparing sniff testers with assigned areas of focus, briefing them on the upgrade changes, and alerting them to potential problems to look for might also have been helpful. Lastly, performance-focused sniff tests might have provided additional insight.
Automated performance tests were very important. We never observed performance bottlenecks in the DRF serializer code, but our suite of automated performance tests did show differences in customer portal performance relative to the last major deliverable tested with that suite. Regular performance runs or early performance-focused sniff tests might have been helpful. Once we began running the performance tests regularly, we found that removing ad-hoc code in favor of functionality now available in DRF 3 produced a slight, unexpected performance improvement. Removing the ad-hoc code, and having all the serializers follow a similar standard data process, also made the customer portal easier to maintain. We have added two new engineers recently and believe that the standardized serialization has helped their onboarding.
How have we improved?
We expect that upgrading code critical to our customer portal will make us more efficient in delivering new functionality. We have subject matter experts and other personnel who are more knowledgeable about our code base. We have an even more robust test suite. We’ve been able to add new functionality, like pagination, provided in the latest DRF toolkit. We are now poised to deliver exciting new features to our customers faster.