Illustration of Yeti character writing with a pencil.

Improving Our Production Drupal Deployments on Pantheon

Jeff Landfried
Lead Developer

We work with Pantheon a lot. We love their platform, and how it allows us to focus on developing Drupal projects while leaving the system administration to them. We have a reliable CI pipeline that allows us to develop in GitHub and push a production-ready artifact to Pantheon’s downstream repository - it’s a great developer experience. We’re happy with this portion of our workflow, but once our work is merged to the main branch and deployed to the dev environment on Pantheon, things begin to get a little more dicey. Deploying to test and live seems like it should be the easiest part, since Pantheon has their drag & drop UI that everyone reading this is probably already familiar with. The issues that we bump into tend to come when configuration changes are made directly to a production environment.

How we used to deploy

First, let’s take a look at how we have historically deployed to these environments (a sketch of the corresponding pantheon.yml configuration follows the list):

  1. Deploy production-ready code to the target environment using Pantheon’s drag & drop UI.
  2. Use a Quicksilver script to run drush cim.
  3. Use a Quicksilver script to run database updates with drush updb.
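
For context, steps 2 and 3 were registered as Quicksilver hooks in the project’s pantheon.yml. A minimal sketch of that configuration looks something like the following; the script paths and descriptions are placeholders rather than our actual files, and each referenced script is a small PHP wrapper that shells out to the corresponding Drush command:

```yaml
# Sketch of the Quicksilver portion of pantheon.yml for the old workflow.
# Script paths are placeholders; each PHP script simply runs the corresponding
# Drush command (drush cim / drush updb) on the target environment.
api_version: 1

workflows:
  deploy:
    after:
      - type: webphp
        description: Import configuration after deploy
        script: private/scripts/config_import.php
      - type: webphp
        description: Run database updates after deploy
        script: private/scripts/database_updates.php
```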

This workflow is great, but it makes the big assumption that there are no config overrides on the target environment. Sure, we like to imagine that our code is the only source of truth for production configuration, but that is not always the case. Sometimes there’s a legitimate reason for a client to make a quick change to production config. When we deploy to an environment with overridden configuration using the above workflow, the client’s configuration changes get reverted unless the developer catches the overridden config prior to initiating the deployment. While there are many approaches we as developers can and should take to help prevent configuration overrides on production - like setting up appropriate roles, using config_ignore for certain special cases, and using core’s config_exclude_modules setting - they can still happen from time to time.

We’ve had a lot of success using Pantheon’s Quicksilver Hooks to automate our deployment steps (listed above), but what are we to do when we deploy to an environment that has overridden configuration? Should we skip importing our new configuration? Or should we blindly import our config changes and revert the existing overrides? Clearly, neither option is ideal. Along with this dilemma, relying solely on Quicksilver hooks presented a few other challenges that we wanted to improve on:

  • Limited reporting: Unless you are running terminus workflow:watch or looking at terminus workflow:info:logs for every deployment, it’s not clear what’s actually taking place during a deployment.
  • Lack of clarity: Without reading about them in a project’s docs or checking the project’s pantheon.yml, a developer initiating a deployment may not even be aware that Quicksilver hooks exist and are going to execute on the target environment!
  • Inflexibility: Quicksilver hooks do the same thing every deployment, and don’t ask questions. Without resorting to something like keywords in commit messages, there’s no way for a step to be skipped or altered per deployment.
  • Lack of an escape hatch: Once a deployment is initiated, there’s no pre-flight check that can give the option to abort.

Our new approach

These are the reasons we started investigating a new method for handling deployments to test and live on Pantheon, and to address them we defined a few hard requirements:

  • As a developer, I should be able to abort a deployment if there are configuration overrides on the target environment.
  • As a developer, I should be able to easily know what steps are executed during a deployment. There should be no surprises.
  • As a developer, I should be able to easily see logs from all deployments initiated by our team.
  • As a developer, I should be able to update our deployment workflow in one place across all of our projects. (This one was a nice-to-have.)

As a developer, I should be able to abort a deployment if there are configuration overrides on the target environment.

To start, we looked at how we could create a deployment process that could self-abort if there were configuration overrides. This was our highest-priority requirement: we needed to avoid blindly reverting configuration changes that had been made on production. Since telling our development team to “just check config on prod prior to deployment” was not an acceptable solution for us, we created a new terminus plugin to help with this: lastcallmedia/terminus-safe-deploy. This plugin adds a new command, terminus safe-deploy:deploy SITE.ENV, that runs through all of the steps of our traditional deployment (along with a few more optional ones). Before initiating the deployment on Pantheon, the plugin checks for overridden configuration on the target environment and aborts if it finds any. If the --force-deploy flag is set, it will still check for overridden configuration and output what it finds, but then continue the deployment.
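
The plugin handles this check for us, but to make the idea concrete, here is a rough approximation of the pre-flight step, written as a step of the kind we wire into our deployment workflow later in this post. The site name is a placeholder, and this is not the plugin’s actual implementation:

```yaml
# Rough approximation of the pre-flight check (not the plugin's actual code).
# It asks Drush for config that differs between the database and the exported
# config on the target environment, and fails the job if anything is reported.
# "my-site.live" is a placeholder for SITE.ENV.
- name: Check for configuration overrides
  run: |
    OVERRIDES="$(terminus drush my-site.live -- config:status --state=Different --format=list)"
    if [ -n "$OVERRIDES" ]; then
      echo "Overridden configuration found on the target environment:"
      echo "$OVERRIDES"
      echo "Aborting the deployment."
      exit 1
    fi
```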

As a developer, I should be able to easily know what steps are executed during a deployment. There should be no surprises.

We added several other flags to the safe-deploy:deploy command that would allow us to explicitly state which operations we wanted to perform during a deployment:

  • --with-cim: triggers a configuration import post-deploy
  • --with-updates: triggers database updates post-deploy
  • --clear-env-caches: clears the target environment’s CDN and Redis caches. This is something we didn’t typically include in our Quicksilver scripts, but we saw value in making it easily accessible for the times we need it.

Each flag must be passed explicitly, so including a step in a deployment is always a conscious decision.
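
Putting the flags together, a deployment that imports configuration, runs database updates, and clears the environment caches might be invoked like this (a sketch; the site and environment are placeholders, and in practice we run the command from the GitHub Actions workflow described in the next section):

```yaml
# Sketch of a deployment step with every optional operation opted into
# explicitly. "my-site.test" is a placeholder for SITE.ENV.
- name: Safe deploy to the test environment
  run: |
    terminus safe-deploy:deploy my-site.test \
      --with-cim \
      --with-updates \
      --clear-env-caches
```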

As a developer, I should be able to easily see logs from all deployments initiated by our team.

We preferred not to rely on terminus workflow:info:logs to see what happened during each deployment. Our developers were already visiting the GitHub repository to review and merge pull requests, so GitHub seemed like the perfect place to initiate our deployments and store the logs as well. We decided to use GitHub Actions to trigger the deployments: we use the workflow_dispatch event to initiate them manually, and as a bonus it provides an interface for workflow inputs, which we map to the flags on the terminus command. We also included the ability to post success/failure messages to Slack with a link to each job, so the team can easily see when deployments run, whether they passed or failed, and view the logs without having to search.

To use Slack alerts, the command accepts a --slack-alert flag and a --slack-url argument (or SLACK_URL can be set as an environment variable).
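
As a sketch of how these pieces fit together, a self-contained, manually dispatched workflow might look roughly like the following. The input names, job layout, and the assumption that terminus and the plugin are already available and authenticated in the job are ours for illustration; only the command and its flags come from the plugin itself. (In practice we moved the deployment job into a shared reusable workflow, covered next.)

```yaml
# Hypothetical manually-triggered deployment workflow. Input names and job
# layout are assumptions; the flags mirror the terminus safe-deploy:deploy
# options described above. Assumes terminus, the safe-deploy plugin, and
# authentication are already set up in the job (omitted for brevity).
name: Deploy to Pantheon
on:
  workflow_dispatch:
    inputs:
      environment:
        description: Target Pantheon environment
        type: choice
        options: [test, live]
        default: test
      with_cim:
        description: Import configuration after deploying
        type: boolean
        default: true
      with_updates:
        description: Run database updates after deploying
        type: boolean
        default: true
      clear_env_caches:
        description: Clear the environment CDN and Redis caches
        type: boolean
        default: false

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Safe deploy
        env:
          SLACK_URL: ${{ vars.SLACK_URL }}  # optional; used by --slack-alert
        run: |
          # "my-site" is a placeholder for the Pantheon site machine name.
          terminus safe-deploy:deploy "my-site.${{ inputs.environment }}" \
            ${{ inputs.with_cim && '--with-cim' || '' }} \
            ${{ inputs.with_updates && '--with-updates' || '' }} \
            ${{ inputs.clear_env_caches && '--clear-env-caches' || '' }} \
            --slack-alert
```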

As a developer, I should be able to update our deployment workflow in one place across all of our projects.

This was a bonus requirement that we’re really excited about. GitHub Actions allows workflows from public repositories to be reused in other workflows, so we built a single terminus-safe-deploy workflow and reference it from individual project workflows, as seen in this gist. This lets us merge changes into the shared workflow (like updating the Docker image it uses, or adding another step when needed) without having to update each individual project’s workflow files. In the example above we call the workflow from the main branch, but you can reference a specific tag or commit if you prefer to lock the workflow for a particular project.
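
We won’t reproduce the gist here, but the shape of a project-level caller workflow is roughly the following; the repository path, workflow file name, and input names are hypothetical:

```yaml
# Hypothetical project workflow that delegates the deployment job to a shared
# reusable workflow. The repository path, file name, and inputs are placeholders.
name: Deploy to Pantheon
on:
  workflow_dispatch:
    inputs:
      environment:
        type: choice
        options: [test, live]
        default: test

jobs:
  deploy:
    # @main picks up improvements to the shared workflow automatically; pin to
    # a tag or commit SHA instead to freeze the workflow for this project.
    uses: example-org/shared-workflows/.github/workflows/terminus-safe-deploy.yml@main
    with:
      environment: ${{ inputs.environment }}
    secrets: inherit  # or pass individual secrets explicitly
```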

The End Result

Initializing a deployment from GitHub Actions

The time spent on this investigation was well worth the effort for our team. As we remove the Quicksilver hooks from our projects and replace them with GitHub Actions workflows, we feel a new sense of confidence, knowing that if a deployment to test or live is going to impact overridden configuration, it will abort itself unless we explicitly tell it to continue. Having a user interface that allows us to explicitly choose which steps we run (with the most common options set by default) gives us the control we wanted over these deployments, while still being as simple as using the drag & drop UI. An added benefit of this approach is that it doesn’t require any institutional knowledge: if another team gets involved, or the client pushes code but isn’t familiar with GitHub Actions, there’s no harm in them using the drag & drop UI within Pantheon, and they don’t have to worry about any unexpected operations taking place in the background once their code is deployed to the target environment.

Setting it up yourself

We chose to implement this as a single terminus command that gets invoked by a reusable GitHub Actions workflow to keep setup easy. Adding the deployment workflow takes just a few steps (a sketch of how the workflow consumes these values follows the list):

  1. Copy the contents of this gist workflow file to .github/workflows/pantheon_deploy.yml
  2. Add required secrets to your repository:
    • Visit https://github.com/{ORGANIZATION}/{REPO_NAME}/settings/secrets/actions
    • Add:
      1. PANTHEON_PRIVATE_SSH_KEY: A private key associated with an account that has access to run Drush commands on Pantheon.
      2. TERMINUS_MACHINE_TOKEN: A Pantheon machine token associated with a user who has access to deploy to the current project on Pantheon.
    • Note: At LCM, since we use this workflow across multiple projects, we store these as organization secrets. This makes them available to any repositories we specify, and we only have to create one set.
  3. Add required actions variables:
    • Visit https://github.com/{ORGANIZATION}/{REPO_NAME}/settings/variables/actions
    • Add:
      1. PANTHEON_SITE_NAME: The machine name of the Pantheon site. This is the name used as part of Pantheon-specific domains such as https://dev-{SITE_NAME}.pantheonsite.io
      2. SLACK_URL (optional): A URL provided by Slack that you can post to, which will send messages to your channel of choice. Talk to your Slack admin to set one up if needed.
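
Once those are in place, the workflow can reference them through the secrets and vars contexts. The sketch below shows one common way the deployment job might consume them; the exact authentication steps are an assumption, not necessarily what the shared workflow does:

```yaml
# Sketch of steps consuming the secrets and variables described above.
# The authentication approach is one common option, shown for illustration.
- name: Authenticate terminus
  run: terminus auth:login --machine-token="${{ secrets.TERMINUS_MACHINE_TOKEN }}"

- name: Install the Pantheon SSH key
  run: |
    mkdir -p ~/.ssh
    echo "${{ secrets.PANTHEON_PRIVATE_SSH_KEY }}" > ~/.ssh/id_rsa
    chmod 600 ~/.ssh/id_rsa

- name: Safe deploy to live
  env:
    SLACK_URL: ${{ vars.SLACK_URL }}  # optional; enables Slack notifications
  run: terminus safe-deploy:deploy "${{ vars.PANTHEON_SITE_NAME }}.live" --with-cim --with-updates --slack-alert
```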

One last thing

We still love Quicksilver hooks. We continue to use them for other types of background tasks such as creating deployment markers in New Relic and notifying Slack when certain operations are performed. They offer great functionality, but we prefer to keep our mission-critical deployment steps elsewhere.