Dev, Staging, CI, and Prod (dis)parity

May 9, 2024

Hi there y'all!

Remember when some of our teammates argued about "Production Parity"? That Staging must be just like Production? Or shouted "it worked on my machine"?

Have you heard this one: "it worked on my machine, CI is green, the test coverage report is >80%, it worked on staging, manual smoke/regression tests found nothing... but it still doesn't work in Production"?

So (for the context of this post) this is about parity [or, if you prefer, (dis)parity]! It is about having the same underlying infrastructure and configuration across environments, so the software's behavior becomes predictable when you deploy to your users.

There are also those folks who prefer to test in Production, which in several cases is a valid (and sometimes the only possible) strategy, but in other cases it may just be because everything works in their Staging environment! 🤣

On a more serious note, what is important here is neither parity nor testing in production; it is understanding what is acceptable and expected for your Company, Organization, Startup, Team, and your Users! If you work on a team that will shame you for a failed production deployment (when it works on local, CI, and staging, remember?) then you need to learn to ignore it, plan to move on, or else.

If you work on a team where there aren't strong and mature rollback procedures, or there isn't a status page with uptime alerts, monitoring, and observability tools in place, then you have some work to do, and points to make. If a production deployment fails while CI, Staging, and Dev are all fine (but on 🔥)... then you've got some more work to do!

But if you work on a team that is like: "Oh yeah, I am glad you found that! I don't think anyone is looking at that"... You still have work to do, but it is a lot more fun and rewarding! 🤩

Side note: do you know the sound of "Yo... if you get those 2 things done it will be amazing!"? 🤞

Not too long ago, in a galaxy not too far away, I ran into this specific problem while updating a 9½-year-old Rails app from v6.1 to v7.0.

We made 2 failed attempts to deploy it to production, and we would have made a 3rd if I hadn't buried my head in the logs to find the root cause of the 2nd failure.

The issues I ran into all fall under Dev, Staging, CI, and Production (dis)parity, which caused them to only show up when deploying to production. Some have already been fixed in this (soon to be 10-year-old) Rails app, but others are still on the TODO list.

    With that said, let me dive into 3 specific issues I had to deal with:
    1. Rails cache_store using redis in production, and null_store everywhere else.
    2. Staging environment on Heroku using RAILS_ENV=staging.
    3. A strange Rails 7.0 autoloading behavior (or a bug, or a feature...).

The first failed attempt to deploy Rails 7.0 was due to the following error:

You're using a cache store that doesn't support native cache versioning.
Your best option is to upgrade to a newer version of ActiveSupport::Cache::RedisStore
that supports cache versioning (ActiveSupport::Cache::RedisStore.supports_cache_versioning? #=> true).
  
Next best, switch to a different cache store that does support cache versioning:
https://guides.rubyonrails.org/caching_with_rails.html#cache-stores.
  
To keep using the current cache store, you can turn off cache versioning entirely:
  
    config.active_record.cache_versioning = false

Side note: I feel like this should be documented in the official Rails upgrade guide, but I know that sometimes what you see at https://guides.rubyonrails.org/upgrading_ruby_on_rails.html is not the latest (meaning: it doesn't include all the contributions to date).

A few years back, I made some contributions to the Rails upgrade guide, and I recall something like my contributions not being backported to another branch, or having to wait until a certain release (to show up on guides.rubyonrails.org). In the weeks to come, I will circle back on this and see if it makes sense to open a PR. 🤞

This is a specific error message somewhere in Rails, related to cache versioning: when the configured cache store doesn't support it, a Rails 7.0 app fails to start.

Now back to Dev/Staging/CI/Prod (dis)parity...

The issue only showed up in Production because in production.rb we had:

if ENV['REDIS_URL']
  # :redis_store is provided by the redis-rails / redis-store gems
  config.cache_store = :redis_store, { ... }
end

While in Dev and CI, we had this:

config.cache_store = :null_store

And Staging (commented out!):

# Use a different cache store in production.
# config.cache_store = :mem_cache_store

Interesting detail: because of Sidekiq, we already have Redis available across the board!

In production, the app was still using the redis-rails gem, which doesn't support the versioning feature required by Rails 7.0.
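
A quick way to see this (dis)parity for yourself is to ask each environment's cache store class whether it supports native cache versioning. A minimal console sketch, assuming the stores are configured as shown above:

Rails.cache.class.supports_cache_versioning?
# Dev & CI   (ActiveSupport::Cache::NullStore)               #=> true
# Staging    (nothing set, so Rails' default file store)     #=> true
# Production (redis-rails' ActiveSupport::Cache::RedisStore) #=> false, and Rails 7.0 refuses to boot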

To be fair, there was a ticket in the Team's backlog about the need to get rid of the redis-rails gem, so someone in the Dev Team was on the right track. This would never have happened in Production if that tech debt ticket had been taken care of, but it had been forgotten by the time I started working with this team to help upgrade Ruby and Rails (dependencies, some tech debt, etc).

All this analysis and work was done the next day. I took care of that task: removed the redis-rails gem, changed the cache_store in production to :redis_cache_store, and changed Dev and Staging to also use it, so all the environments are more like each other... except CI.
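
For the curious, the end state looked roughly like this; a sketch only, since the exact options we pass to the store are omitted here (the url option is an assumption):

# config/environments/production.rb (with a similar line now in development.rb and staging.rb)
if ENV['REDIS_URL']
  # :redis_cache_store ships with Rails 5.2+, so no extra gem is needed
  config.cache_store = :redis_cache_store, { url: ENV['REDIS_URL'] }
end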

This work was merged to the main branch and deployed to production, and synced with the Rails 7.0 branch afterwards.

The second failed deployment happened right then and there with some classic cowboy coding!

You know when your production deployment fails but you want it to succeed? Or at least you had zero expectation of having a failure?! :-D

While I was hammering myself hard trying to understand why in the world I was seeing this for the first time in production (remember: it works in Dev, Staging, and CI), and trying to research the error message, with the clock ticking, the very end of the error message started to glow, and glow, and glow some more... and blink in sparkling rainbow colors! 🌈

Read it again below, and see it for yourself:

To keep using the current cache store, you can turn off cache versioning entirely:

  config.active_record.cache_versioning = false

The error message that aborted the production deployment had a way out: you can ignore this new behavior and move on. And it provided the exact line: copy, paste, copy, paste, copy, paste, copy...

Can you please pause reading for a second and try to remember the last time you were deploying something to production and it failed (should I say unexpectedly?!?!)?!?! If you are lucky, you won't have many such events in memory and it may be hard to remember the exact details! As for myself, for a good few years now I have preferred not to trust my memory too much! :-D

After some time reading, thinking, and trying to figure out what to do next, my teammate and I agreed to give it a try (the way out: disabling cache_versioning). Once we agreed on it, my teammate proceeded with the changes, and I refocused on thinking and reading about what the side effects of doing that could be, while waiting for another production deployment attempt...

Reading through the error message and trying to think logically (after a failed production deployment), I was leaning towards that being a safe bet!

    Here is the gist:
    1. Rails 6.1 doesn't need cache versioning.
    2. Rails 7.0 fails with a specific message about it, but provides a way out.
    3. To my knowledge, this app doesn't rely heavily on caching.
    4. Nobody else in the Dev Team ever mentioned anything caching-related.
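
For context on what that flag actually changes, here is a rough sketch with a hypothetical Post model (the exact key/version formats may differ between Rails versions):

post = Post.find(123)

# With cache_versioning enabled ("recyclable" cache keys), the key stays stable
# and the version (derived from updated_at) is stored alongside the cache entry:
post.cache_key      #=> "posts/123"
post.cache_version  #=> "20240509120000000000"

# With config.active_record.cache_versioning = false, the timestamp goes back
# into the key itself, like in older Rails versions:
post.cache_key      #=> "posts/123-20240509120000000000"

In other words, disabling it mostly means expiring cache entries through key churn instead of through versions, which felt acceptable for an app that (to my knowledge) barely leans on caching.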

But my head was still wandering around: how can this show up for the first time only in the Production deployment?

I had a dedicated Heroku Staging environment for the Rails 7.0 branch. I had done numerous deployments of this branch to the Staging env... but still, it worked on Staging (🔥) and it failed in Production!

As I was going through all that in my head, another notification popped up. It read (something along these lines): "I deployed the cache_versioning = false change and it worked, but we are getting HTTP 503 from the uptime/status check."

Disabling cache versioning worked and the Rails 7.0 branch was deployed successfully, only for us to stumble upon another disparity between Dev/Staging/CI & Production!

Given we were getting 503s, and this was the 2nd attempt, we decided to roll back the deployment and go back to the drawing board! Back to Rails 6.1 in Production, and time to figure out where the 503s were coming from.

The next day I executed the exact same steps in Staging to try to replicate the 503s, but, TL;DR, I could not replicate them.

It was easy to replicate the failed deployment related to the cache_store and the redis-rails gem, but after setting cache_versioning to false, the app worked just fine in Staging with no sign of the 503s! I tested the endpoint used for the uptime check and all was good (but on 🔥).

The 503s were a mystery for quite a while and I started getting suspicious about Heroku Staging using RAILS_ENV=staging and not production.

Looking at staging.rb and production.rb, it was easy to realize that I would not be able to change RAILS_ENV in Staging before the Rails 7.0 deployment. Things like bucket names, API keys, and server hosts are not fully & consistently configured. I was using a Heroku 'Staging-2' app, which shares buckets with the Heroku Staging app... so... this was not a battle I would be able to pick...
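
For reference, the direction I would love to push towards eventually is something like the sketch below: Staging boots with RAILS_ENV=production, and the few real differences live in env vars rather than in a separate staging.rb (the settings and env var names here are hypothetical, not this app's actual config):

# config/environments/production.rb — shared by Staging and Production
config.active_storage.service = ENV.fetch('STORAGE_SERVICE', 'amazon').to_sym
config.action_mailer.default_url_options = { host: ENV.fetch('APP_HOST') }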

As I was gearing up for another Rails 7.0 deployment, I was still concerned about not knowing the root cause of the 503s! So I decided to ask if we could set up a log management tool, just to find out that logs were already being pushed over to AppSignal! GREAT! 🦾

I was not very optimistic that the 503s would not happen again! Even though we had found and changed some things, the root cause of the 503s was still unknown, so I dove into the logs and found the following error message:

app web.4 - [37] ! Unable to load application: 
NameError: uninitialized constant WickedPdf::OpenStruct

The missing root cause was finally found! The web dynos were failing to load the app because it could not find WickedPdf::OpenStruct.

During the load process, the app on Rails 7.0 was failing to find WickedPdf::OpenStruct. OpenStruct comes with Ruby, and the WickedPdf gem (in its latest version) has been in the app for quite some time. It works just fine with Rails 6.1, and it works just fine in Staging with Rails 7.0! But production...

The Rails 6.1 app had already had Zeitwerk enabled in production for at least a week, so I was not expecting any issues loading the app... It is possible that WickedPdf needs to refer to '::OpenStruct' instead of just 'OpenStruct', or that 'ostruct' should be declared as a dependency in the WickedPdf gemspec... but still, Rails 7.0 loaded everything just fine in the Heroku Staging env (which uses RAILS_ENV=staging), and the PDF generation feature was also working...
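
My current theory (and it is just a theory) is plain old Ruby constant lookup plus load order. A minimal illustration, not WickedPdf's actual code:

# Inside a namespace, a bare constant is looked up in that namespace first and
# then at the top level, and a miss is reported relative to the namespace.
module WickedPdf
  def self.build_options
    OpenStruct.new   # tries WickedPdf::OpenStruct, then ::OpenStruct
  end
end

WickedPdf.build_options
# => NameError: uninitialized constant WickedPdf::OpenStruct
#    ...unless something already ran `require 'ostruct'` earlier in the boot process

Whether it blows up would then depend entirely on which gem or initializer happened to require 'ostruct' first, which would fit the "fine in Staging, broken in Production" pattern... but again, this is a guess until I dig into it.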

In the coming weeks I will circle back on this and ask in the WickedPdf repo, or perhaps in the Rails discussion forums, whether this could be an edge case in Zeitwerk or somewhere else in Rails! 🤞

Once it was determined the 503s were due to WickedPdf preventing the app from loading, I discovered a work-around [even though I still could not replicate the failure in Staging (remember: in Staging all was good and on 🔥)], and I was now confident enough for another attempt at deploying the Rails 7.0 branch.

And this time it worked! The app booted just fine, and Rails 7.0 was finally in production.

Rails 7.0 is in production, some tech debt was paid, new tech debt was uncovered, all of it is documented in my head, some of it already in the task management tool being used!

Next time you do a Rails deployment, please check config.cache_store, pay closer attention to the differences between the environments, and raise the point that Staging apps should run RAILS_ENV=production (BTW: Heroku warns you about this during deployment), but keep in mind that perfect parity is either unlikely to be achievable or expensive enough that it becomes cheaper to test in Production! 🦾

And last but not least: don't practice cowboy coding! At least not very often! Do understand that, when not used in excess, it can save you a failed deployment or two! Like testing in Production, or deploying code to Production on a Friday... 🤣

If you liked it, say Hi on social media or by (the good old-fashioned) email! :-D