A Caching Journey

How to take back control over caching with Rails’ Active Storage and Amazon S3.

TL;DR

Add an upload key to your configuration in config/storage.yml to set the Cache-Control header for uploaded files:

amazon:
  service: S3
  upload:
    cache_control: 'max-age=86400, private'

I just had a memorable TIL moment with Ruby on Rails’ Active Storage and caching.

It happened while working on a React app with a small Rails API backend. The app was displaying loads of images of various types, dimensions, and file sizes, and one would expect the browser to cache them after their initial retrieval from the backend. Caching these images on the client means sending less data over the wire on subsequent interactions and a smoother user experience when re-rendering components or re-loading the page. Unfortunately, in our case browsers were not caching any images, which resulted in a noticeable performance impact and a flickering UI for users. With this caching issue, the app was wasting resources and sucking up its users’ network traffic 🚨.

Inspecting image requests and responses in the “network” tab of the browser’s devtools, we see images being retrieved freshly with a 200 OK status and this Cache-Control response header:

max-age=0, must-revalidate

The max-age=<seconds> directive specifies the amount of time the browser can treat a resource as “fresh” after its last retrieval. Within that period it may serve a cached copy before actually checking back with the server again. In our case this is set to zero seconds, so images are instantly considered “stale”. The must-revalidate directive instructs the browser to never serve a stale, cached version of the resource but first “validate” whether the resource is up to date by asking the server again.
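To make those directives concrete, here is a small Ruby sketch (the helper name is my own, not part of Rails) that splits a Cache-Control header value into its individual directives:

```ruby
# Parse a Cache-Control header value into a hash of directives.
# Value-less directives like `must-revalidate` map to true.
def parse_cache_control(header)
  header.split(',').map(&:strip).to_h do |directive|
    key, value = directive.split('=', 2)
    [key, value || true]
  end
end

parse_cache_control('max-age=0, must-revalidate')
# => {"max-age"=>"0", "must-revalidate"=>true}
```

Applied to our header: zero seconds of freshness plus a hard revalidation requirement — effectively no client-side caching at all.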

With this response header, the browser is essentially instructed to not cache images. Why are we doing this?

Detective Game

In our setup, images are uploaded to and served by the backend API. It uses Active Storage under the hood, a framework offered by Rails for uploading files to various cloud services like Amazon S3 or Microsoft Azure and attaching them to Active Record models. Active Storage adds an extra layer of abstraction between the file upload in the app and its actual storage location on a remote cloud service: Rails only stores the upload’s metadata and a reference to identify it on the storage service. It does not store the actual file content. Requests for file uploads go through the Rails server first, which then returns the URL of the actual file location on the cloud service. So in our app, retrieving an image actually consists of two requests:

  1. Request the file from the backend API
  2. Request the original file from Amazon S3 (the cloud service we use)

The response for the first request is a redirect containing the actual file URL for the second request in its Location response header.
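The two-hop flow can be sketched in Ruby like this — the URLs are made up for illustration, and the HTTP call is injectable so the sketch doesn’t need a live server:

```ruby
require 'net/http'
require 'uri'

# Fetch an Active Storage image: the first request hits the Rails app,
# which answers with a redirect whose Location header points at the
# actual file. `fetch` defaults to a real HTTP call but can be swapped out.
def fetch_image(url, fetch: ->(u) { Net::HTTP.get_response(URI(u)) })
  response = fetch.call(url)
  # Follow the 3xx redirect to the actual file on the cloud service.
  response = fetch.call(response['Location']) if response.code.start_with?('3')
  response
end
```

The second response comes straight from the cloud service — so its headers are set there, not by the Rails app.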

Active Storage image request • Own work Bitcrowd

Looking at the network tab, we see the 302 Found redirect response for the first request already being cached correctly. The devtools show a “(from disk cache)” note, and the Cache-Control header matches as well:

max-age=10800, private

It instructs the browser to treat the resource as “fresh” for 10800 seconds (three hours) and as “private”, which means it’s intended for a single user and may not be stored in a cache shared among multiple users. Rails lets you configure the expiry time of these redirect URLs generated by Active Storage. It defaults to five minutes, but our API already had it set to:

ActiveStorage::Service.url_expires_in = 3.hours

This is what we want: spare extra network requests by storing responses to frequently requested resources on disk. But why do we only cache the redirect responses, not the resource-heavier one for the actual image on S3?

With caching enabled in development, I started an investigation around our setup and configuration for Active Storage and direct uploads. After hitting walls for hours and still being stuck with no-cache or max-age=0, private, must-revalidate Cache-Control response headers for image requests, I realized: I was looking at the wrong thing! Besides integrating with cloud providers, Active Storage also has a local disk-based storage backend, which we use in development. Intended for the test and development environments, this backend does not even support caching. And even worse: I had been looking at image responses from our Rails server for hours, while in production the requests target Amazon S3. It’s Amazon that puts together the response headers, not our servers 🙈.

So I set up a bucket on S3 and changed my setup to target Amazon for file uploads. Now image requests were served by Amazon S3 (success!). Still, the Cache-Control header was wrong and nothing was cached…

Looking at the S3 docs and blindly clicking around, I found out I can set the Cache-Control header by editing the “metadata” of my uploads directly on S3 🎉

Edit upload metadata on S3 • Own work Bitcrowd

Since our app doesn’t offer any means of updating existing images, I set the value to:

max-age=86400, private

It instructs the browser to consider the resource “fresh” for 24 hours and store it only for the single user.

And - 🍾 - it worked! The response from S3 had the right header and the browser started caching images! The network tab now showed 200 OK (from disk cache) as the status.

So all done, problem solved 👋.

But… new uploads of course didn’t have the header and weren’t cached. Editing metadata on S3 also felt too weird to be an actual solution. The information should be provided by Active Storage instead. A wise person on Stack Overflow then pointed me to this line in the Active Storage source code:

def initialize(bucket:, upload: {}, public: false, **options)
  @client = Aws::S3::Resource.new(**options)
  @bucket = @client.bucket(bucket)

  @multipart_upload_threshold = upload.fetch(:multipart_threshold, 100.megabytes)
  @public = public

  @upload_options = upload
  @upload_options[:acl] = "public-read" if public?
end

Initializing the S3Service allows passing an optional upload hash. The service uses the aws-sdk-s3 gem under the hood and passes the contents of this upload hash as additional arguments to the put or upload_stream methods called on the underlying Aws::S3::Object when uploading files 💡.

def upload_with_single_part(key, io, checksum: nil, content_type: nil, content_disposition: nil)
  object_for(key).put(body: io, content_md5: checksum, content_type: content_type, content_disposition: content_disposition, **upload_options)
rescue Aws::S3::Errors::BadDigest
  raise ActiveStorage::IntegrityError
end
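In other words, whatever we put into the upload hash gets double-splatted into the put call. A toy recreation of that merge (variable names and the placeholder file data are mine, not Active Storage’s):

```ruby
# Options configured under `upload:` in config/storage.yml…
upload_options = { cache_control: 'max-age=86400, private' }

# …ride along with the per-file arguments on every upload,
# exactly like the **upload_options splat in the service code above.
put_args = {
  body: 'binary image data',
  content_type: 'image/png',
  **upload_options
}

put_args[:cache_control]
# => "max-age=86400, private"
```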

With this knowledge, I checked the docs for put's arguments and found:

:cache_control (String) — Can be used to specify caching behavior along the request/reply chain

Jackpot 🎰!

So we put something like this into config/storage.yml:

amazon:
  service: S3
  access_key_id: <%= ENV.fetch('AWS_ACCESS_KEY') %>
  secret_access_key: <%= ENV.fetch('AWS_SECRET_KEY') %>
  region: eu-central-1
  bucket: <%= ENV.fetch('S3_BUCKET') %>
  upload:
    cache_control: 'max-age=86400, private'

Now the Cache-Control metadata is set when uploading (also for direct uploads) and files are served with the right response headers, allowing the browser to happily do its caching work 🤗

© 2020 bitcrowd GmbH.