S3 (Serverless 3, not the other thing)

Before I kick off, here's what I'm most proud of right now:

If that looks too horrific, here's the same info from Postman:

And if that's too much... this might not be the writeup for you.

So the story so far: I'd gotten rid of all my S3 bucket and IAM policies and whatnot in favour of using Travis to deploy to Lambda directly, only to find out that Travis wasn't playing particularly nice: see Serverless 2 for the specifics, but I hacked around with it for a bit and couldn't find the root cause.

After trying for a bit longer, and having a bit of a sleep to approach the problem with a fresh mind, I made the call to switch back to S3 deployment. I'd read up on Lambda some more, and knew that as well as changing the code in S3, I'd need to run UpdateFunctionCode pointing at that particular zip file (its bucket and key).

First things first -- I actually needed a zip file. Travis doesn't make one by default, so I couldn't use the defaults when deploying to S3. I could, however, specify a local directory to limit the upload to: that meant I could run a pre-deploy script to bundle the current working directory up into a zip, and put that zip in a build directory to push to S3.
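A minimal sketch of what such a pre-deploy bundling step could look like, written in Python. The function name make_build_zip, the "build" directory and the decision to skip ".git" are assumptions for illustration, not the script actually used here:

```python
import os
import zipfile


def make_build_zip(source_dir=".", build_dir="build", zip_name="build.zip"):
    """Zip everything under source_dir into build_dir/zip_name.

    The build directory itself and .git are skipped so the archive
    never tries to include itself or the repository history.
    """
    os.makedirs(build_dir, exist_ok=True)
    zip_path = os.path.join(build_dir, zip_name)
    skip = {
        os.path.abspath(build_dir),
        os.path.abspath(os.path.join(source_dir, ".git")),
    }
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, dirs, files in os.walk(source_dir):
            # Prune skipped directories in place so os.walk never enters them.
            dirs[:] = [
                d for d in dirs
                if os.path.abspath(os.path.join(root, d)) not in skip
            ]
            for name in files:
                full_path = os.path.join(root, name)
                # Store paths relative to source_dir so the handler sits at the zip root.
                archive.write(full_path, os.path.relpath(full_path, source_dir))
    return zip_path
```

Calling make_build_zip() from the repository root before the deploy stage would leave build/build.zip ready for the S3 upload, with the S3 deploy limited to the build directory.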

I also knew I needed permission to actually run UpdateFunctionCode, so I added that permission to the IAM user I was already using to let Travis upload to S3. I think I've done a good job of keeping permissions to a minimum: instead of following what most tutorials seem to recommend, I've used inline policies to make sure that the 'travis-ci' user can only upload to the code bucket, and can only update the function code of this specific Lambda function.
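To make "minimum permissions" concrete, here is a rough sketch of what such an inline policy could look like, built as a Python dict so it can be dumped to JSON. The bucket name, account ID and function ARN are placeholders, and the exact set of actions is an assumption: depending on how the S3 deploy and UpdateFunctionCode are configured, extra actions (for example s3:GetObject or s3:PutObjectAcl) may be needed.

```python
import json


def travis_policy(bucket, function_arn):
    """Build a least-privilege policy document for a CI deploy user.

    Assumed minimum: write objects into one bucket, and update the code
    of one specific Lambda function. Nothing else is allowed.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # Object-level action, so the resource is the bucket's objects.
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                "Effect": "Allow",
                "Action": ["lambda:UpdateFunctionCode"],
                "Resource": function_arn,
            },
        ],
    }


# Placeholder names, not the real bucket or function.
example = travis_policy(
    "my-code-bucket",
    "arn:aws:lambda:us-east-1:123456789012:function:pluralize",
)
print(json.dumps(example, indent=2))
```

The point of scoping both statements to a single resource is that a leaked key can, at worst, replace the code of this one function.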

Still, I didn't want to be passing around credentials to that account in plaintext -- even if I trusted Travis CI, my plan is to make the repository currently holding the pluralize code public. So I used the inbuilt travis encrypt tool to add an encrypted copy of the secret access key to the S3 deploy stage, and to add the same key as an encrypted environment variable under env.global. I'll need that environment variable in the second stage of deployment: updating the function code for our Lambda.

To actually update the function code, I've written a small script to be run after the "deploy" section (i.e. after the code is pushed to S3). It grabs credentials from the environment, then uses the AWS CLI to update the function code using the hardcoded path to the zip file. Yeah, I know, hardcoded=bad, but my build zip is always gonna be called "build.zip", so I don't feel too guilty about that one.
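A sketch of that post-deploy step, written in Python rather than as the original script. The function and bucket names passed in are hypothetical; the only parts taken from the text are the use of the AWS CLI, credentials coming from the environment, and the fixed "build.zip" key. It assumes the AWS CLI is installed on the build machine and that a region is configured (for example via AWS_DEFAULT_REGION).

```python
import os
import subprocess


def update_function_command(function_name, bucket, key="build.zip"):
    """Return the AWS CLI argv that points a Lambda function at a zip in S3."""
    return [
        "aws", "lambda", "update-function-code",
        "--function-name", function_name,
        "--s3-bucket", bucket,
        "--s3-key", key,
    ]


def update_function_code(function_name, bucket, key="build.zip"):
    """Run the update, refusing to start if credentials are missing."""
    # The AWS CLI picks these up from the environment on its own;
    # checking first just gives a clearer error message.
    for var in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
        if var not in os.environ:
            raise RuntimeError(f"{var} is not set")
    # check=True raises CalledProcessError if the CLI exits non-zero.
    subprocess.run(update_function_command(function_name, bucket, key), check=True)
```

Something like update_function_code("pluralize", "my-code-bucket") would then run once the S3 upload has finished.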

The one limitation with this system is that it doesn't fail the build if UpdateFunctionCode doesn't go as planned. In practice that's fairly minor (after all, there's no reason why it should fail, now that it's been set up), but it still bothers me that a failed update could go unnoticed. I experimented with extra deploy steps (which broke with some VERY weird errors), but the only way I can think of doing it reliably is by using "script" as a deploy method and rewriting the inbuilt S3 deploy. Possible, but not something I'm particularly keen on doing.

So after playing around with this Travis configuration for a while, I finally got my repository deploying to Lambda on each successful build. But I wasn't done yet -- API Gateway, which I'd thought would be the simplest piece of the puzzle, actually turned out to be quite difficult to configure.

I got correct responses back fairly quickly. But what I wasn't managing to get were 400 (Bad Request) responses: even the most heinous of errors were being sent along with a status of 200.

A rough "mental checklist" I used while debugging this:

  • 400 exists as an HTTP Status on the Method Response page
  • Bad Request: .* is being mapped to 400 on the Integration Response page
  • The Lambda function is returning something of the form {'errorMessage': "Bad Request: ..."}

After changing many, many things, I'm still uncertain of the exact cause, but I'm happy to report that I'm now getting a correct 400 status on bad requests. A couple of things that may have caused the change:

  • Changing the Lambda function to raise Exception("Bad Request: ...") instead of just returning a dict. (It's now a subclass of Exception called BadRequest).
  • Using a trivially different request: at one point I got some strange behaviour which made it seem like my requests were being cached, despite leaving the option to cache API requests OFF.
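For reference, the exception-based approach from the first bullet can be sketched roughly like this. The handler body, the "word" parameter and the placeholder pluralisation are all made up for illustration; only the BadRequest name and the "Bad Request: ..." message format come from the notes above. The regex check at the end assumes the Integration Response pattern has to match the whole error message.

```python
import re


class BadRequest(Exception):
    """Client error; the message prefix is what API Gateway's regex looks for."""

    def __init__(self, detail):
        super().__init__(f"Bad Request: {detail}")


def handler(event, context):
    """Hypothetical Lambda handler: raise on bad input instead of returning a dict."""
    word = event.get("word")
    if not isinstance(word, str) or not word:
        # Raising makes Lambda report an error with an errorMessage field,
        # rather than a successful invocation that merely contains one.
        raise BadRequest("missing 'word' parameter")
    return {"plural": word + "s"}  # placeholder logic, not the real pluralizer


def matches_mapping(message, pattern=r"Bad Request: .*"):
    """Check a message against the Integration Response pattern (whole-string match)."""
    return re.fullmatch(pattern, message) is not None
```

Returning a dict with an errorMessage key looks similar from the outside, but the invocation still counts as a success, which would be consistent with everything coming back as a 200.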

I also used API Gateway's ability to "template-map" Lambda responses onto HTTP responses in order to remove the "stack trace" information that was being sent, and only leave the error message. The body mapping template for "bad request" is very simple, and looks like this:

#set($inputRoot = $input.path('$'))
{
  "message" : "$input.path('$.errorMessage')"
}

(Essentially, it's the "Error" template, plus a lookup of the errorMessage field to pass through with the message key.)
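To show the effect of that template, here is a small Python analogue of the transformation. This is not how API Gateway runs it (the real mapping is the template itself); it only mimics the input and output shapes, assuming the usual errorMessage / errorType / stackTrace fields that a failed Python Lambda reports.

```python
def map_error_payload(payload):
    """Mimic the body mapping: keep only the error message, drop everything else."""
    return {"message": payload["errorMessage"]}


# Example of what a failed invocation might hand to the mapping step.
lambda_error = {
    "errorMessage": "Bad Request: missing 'word' parameter",
    "errorType": "BadRequest",
    "stackTrace": ["...frames omitted..."],
}

print(map_error_payload(lambda_error))
```

The client only ever sees the "message" key, so the stack trace never leaves AWS.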

I'm hoping to test this more thoroughly, and re-release this project in the form of a tutorial. I think that'll reinforce my understanding, and give me a chance to revisit a few of the concepts I've come across here for the first time.