The new Mass.gov site launched after roughly a month in pilot mode, and we saw an increase in traffic which corresponded with a small response time bump. The site initially launched with a cache lifetime of over an hour for both the CDN and the Varnish cache. This meant that the site was stable (well insulated from traffic spikes), but that editors had to wait a relatively long time to see the content they were publishing.
We rolled out the Purge module, which we used to clear Varnish whenever content was updated. Editors now knew that it would take less than an hour for their content changes to go out, but at this point, we still weren’t clearing the CDN, which also had an hour lifetime. Site response time spiked up to about two and a quarter seconds as a result of this work; introducing “freshness” was slowing things down on the back end.
We realized that we had a cache-tagging problem. Authors were updating content and not seeing their changes reflected everywhere they expected. This was fixed by “linking up” all the site cache tags so that they were propagating to the pages that they should be. We continued to push in the direction of content freshness, at the expense of backend performance.
To address the growing performance problem, we increased the Drupal cache lifetime to three hours, meaning Varnish would hold onto things for up to three hours, so long as the content didn’t get purged out. As a result of our Purge work, any content updates would be pushed up to Varnish, so if a page was built and then immediately updated, Varnish would show that update right away. However, we saw very little performance improvement as a result of this.
Careful study of the cache data revealed that each time an editor touched a piece of content, the majority of the site’s pages were being cleared from Varnish. This explained the large spike in the response time when the Purge work was rolled out, and why raising the Drupal cache lifetime really didn’t affect our overall response time. We found the culprit to be the node_list cache tag, and so we replaced it with a system that does what we called “relationship clearing.” Relationship clearing means that when any piece of content on the site is updated, we reach out to any “related” content, and clear the “cache tag” for that content as well. This let us replace the overly-broad node_list cache tag with a more targeted and efficient system, while retaining the ability to show fresh content on “related” pages right away. The system was backed by a test suite that ensured that we did not have node_list usages creep back in the future. This earned us a massive performance boost, cutting our page load time in half.
We found that the metatag module was generating tokens for the metatags on each page twice. The token generation on this site was very heavy, so we patched that issue and submitted the patch back to Drupal.org.
We had another backend disruption due to some heavy editor traffic hitting on admin view; our backend response time spiked up suddenly by about 12 seconds. A pre-existing admin view had been modified to add a highly desired new search feature. While the search feature didn’t actually change the performance of the view, it did make it much more usable for editors, and as a result, editors were using it much more heavily than before. This was a small change, but it took what we already knew was a performance bottleneck, and forced more traffic through it. It demonstrates the value of being proactive about fixing bottlenecks, even if they aren’t causing immediate stability issues. It also taught us a valuable lesson—that traffic profile changes (for example, as a result of a highly desired new feature) can have a large impact on overall performance.
We got a free performance win just by upgrading to PHP 7.1, bringing our backend response time from about 500 milliseconds down to around 300.
We used New Relic for monitoring, but the transaction table it gave us presented information in a relatively obtuse way. We renamed the transactions so that they made more sense to us, and had them broken down by the specific buckets that we wanted them in, which just required a little bit of custom PHP code on the backend. This gave us the ability to get more granular about what was costing us on the performance side, and changed how we started thinking about performance overall.
We added additional metadata to our New Relic transactions so we could begin answering questions like “What percentage of our anonymous page views are coming from the dynamic page cache?” This also gave us granular insight on the performance effects of changes to particular types of content.
We performed a deep analysis of the cache data in order to figure out how we could improve the site’s efficiency. We broke down all the cache bins that we had by the number of reads, the number of writes, and the size. We looked for ways to make the dynamic page cache table, cache entity table, and the render cache bin a little bit more efficient.
We replaced usages of the url.path “cache context” with “route” to make sure that we were generating data based on the Drupal route, not the URL path.
On the feedback form at the bottom of each page on the site, the form takes a node ID parameter, and that’s the only thing that changes when it’s generated from page to page. We were able to use “the lazy builder” to inject that node ID after it was already cached, and we were able to generate this once, cache it everywhere, and just inject the node ID in right as it was used.
We took a long hard look at the difference between the dynamic page cache and the static page cache. Without using the Drupal page caching, our average response time was 477 milliseconds. When we flipped on the dynamic page cache, we ended up with a 161 millisecond response, and with the addition of the static page cache, we had a 48 millisecond response. Closer analysis showed that since Varnish already handled the same use case as the Drupal page cache (caching per exact URL), the dynamic page cache was the most performant option.
We automated a nightly deployment and subsequent crawl of site pages in a “Continuous Delivery” environment. While this was originally intended as a check for fatal errors, it gave us a very consistent snapshot of the site’s performance, since we were hitting the same pages every night. This allowed us to predict the performance impact of upcoming changes, which is critical to catching performance-killers before they go to production.
As a result of all the work done over the previous 5 months, we were able improve our content freshness (cache lifetime) from 60 minutes to 30 minutes.
We enabled HTTP2, an addition to the HTTP protocol that allows you to stream multiple static assets over the same pipeline.
We discovered that the HTML response was coming across the wire with, in some cases, up to a megabyte of data. That entire chunk of data had to be downloaded first before the page could proceed onto the static assets. We traced this back to the embedded SVG icons. Any time an icon appeared, the XML was being embedded in the page. In some cases, we were ending up with the exact same XML SVG content embedded in the page over 100 times. Our solution for this was to replace the embedded icon with an SVG “use” statement pointing to the icon’s SVG element. Each “used” icon was embedded in the page once. This brought pages that were previously over a megabyte down to under 80 kilobytes, and cut page load time for the worst offenders from more than 30 seconds to less than three seconds.
We reformulated the URL of the emergency alerts we’d added previously to specify exactly the fields that we wanted to receive in that response, and we were able to cut it down from 781 kilobytes to 16 kilobytes for the exact same data, with no change for the end users.
We switched from WOFF web fonts to WOFF2 for any browsers that would support it.
We used preloading to make those fonts requested immediately after the HTML response was received, shortening the amount of time it took for the first render of the page pretty significantly.
We added the ImageMagick toolkit contrib module, and enabled the “Optimize Images” option. This reduced the weight of our content images, with some of the hero images being cut by over 100 kilobytes.
The Mass.gov logo was costing the site over 100 kilobytes, because it existed as one large image. We broke it up so that the image of the seal would be able to be reused between the header and the footer, and then utilized the site web font as live text to create the text to the right of the seal.