Streaming upload to S3

Here’s a short recipe for transmitting files from an external source to an S3 bucket without downloading the whole source first and hence allocating memory unnecessarily.

It takes advantage of the requests library’s streaming capability.
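The original snippet isn’t inlined here, so below is a minimal sketch of the idea, assuming requests and boto3; the URL, bucket, and key names are placeholders.

```python
import boto3
import requests

s3 = boto3.client("s3")

# stream=True keeps the response body out of memory; it is read lazily.
with requests.get("https://example.com/big-file.bin", stream=True) as response:
    response.raise_for_status()
    # upload_fileobj consumes the file-like object in chunks (it uses
    # multipart upload under the hood), so memory usage stays flat.
    s3.upload_fileobj(response.raw, "my-bucket", "big-file.bin")
```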

Even with files over 2 GB in size, the Lambda container consumed only about 120 MB of memory. Pretty sweet. Of course, this approach is applicable to any platform, not just Lambda.

Overseeing a long-running process with Lambda and Step Functions

In my day job, we’re using Lambda and Step Functions to create data processing pipelines. This combo works great for a lot of our use cases. However, for some specific long-running tasks (e.g. web scrapers), we “outsource” the computing from Lambda to Fargate.

This poses an issue – how to plug that part of the pipeline into the Step Function orchestrating it. Using an Activity does not work when the processing is distributed among multiple workers.

A solution I came up with is creating a gatekeeper loop in the Step Function to oversee the progress of the workers via a Lambda function. This is how it looks:

[Diagram: the gatekeeper loop in the Step Function state machine]

The gatekeeper function (triggered by the GatekeeperState) checks if the external workers have finished yet. This can be done by waiting until an SQS queue is empty, counting the number of objects in an S3 bucket, or by any other signal indicating that the processing can move on to the next state.

If the processing is not done yet, the gatekeeper function raises a NotReadyError. This is caught by the Retry block in the Step Function, pausing the execution for a certain period of time, as defined by its parameters. Afterwards, the gatekeeper is called again.

Eventually, if the work is not done even after MaxAttempts retries, the ForceGatekeeperState is triggered. It adds a "force": true parameter to the invocation event and calls the gatekeeper right back again. Notice that the gatekeeper function checks for this force parameter as the very first thing when executed. Since it’s present from the ForceGatekeeperState, it returns immediately and the Step Function moves on to the DoneState.
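A minimal sketch of such a gatekeeper handler, assuming the workers drain an SQS queue; the queue URL is a placeholder, and the exception name matches the one above (the Retry block matches it by class name):

```python
import boto3

class NotReadyError(Exception):
    """Workers are still running; the Step Function's Retry block
    catches this error and re-invokes the gatekeeper later."""

sqs = boto3.client("sqs")

# Placeholder; in practice this would come from an environment variable.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"

def handler(event, context):
    # ForceGatekeeperState sets "force": true so the pipeline can move
    # on with partial results once MaxAttempts is exhausted.
    if event.get("force"):
        return event

    # One possible completion check: wait until the SQS queue feeding
    # the workers is empty (no visible or in-flight messages).
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )["Attributes"]
    pending = (int(attrs["ApproximateNumberOfMessages"])
               + int(attrs["ApproximateNumberOfMessagesNotVisible"]))
    if pending > 0:
        raise NotReadyError("%d messages still in flight" % pending)

    return event
```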

For our use case, it was better to have partial results than no results at all; that’s why the ForceGatekeeperState is present. You can also leave it out altogether and have the Step Function execution fail after MaxAttempts retries of the gatekeeper.

A better way to package Python functions for AWS Lambda

The default way of creating a zip package that’s to be deployed to AWS Lambda is to place everything – your source code and any libraries you are using – in the service root directory and compress it. I don’t like this approach: the flat hierarchy can lead to naming conflicts, it is harder to manage packaging of isolated functions, and it creates a mess in the source directory.

What I do instead is install all dependencies into a lib directory (which is as simple as a pip install -r requirements.txt -t lib step in the deployment pipeline) and set the PYTHONPATH environment variable to /var/runtime:/var/task/lib when deploying the Lambda functions.
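One way to set that variable during deployment, sketched with boto3; the function name is a placeholder:

```python
import boto3

lambda_client = boto3.client("lambda")
lambda_client.update_function_configuration(
    FunctionName="my-pipeline-function",
    Environment={
        # /var/runtime keeps the AWS SDK importable; /var/task/lib is
        # where the bundled dependencies end up after extraction.
        "Variables": {"PYTHONPATH": "/var/runtime:/var/task/lib"}
    },
)
```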

This works because the zip package is extracted into /var/task in the Lambda container. While it might seem like a fragile solution, I’ve been using it for over a year now without any problems.

Python pre-commit hook

This is a pre-commit hook I use in my Python projects.

https://gist.github.com/3849310

Never mind my weak bash-fu; in the end the script does what I want it to – the following three things:

  • First, it checks that I haven’t forgotten to add a new module to the requirements.txt file. Most of the time this works like a charm with virtualenv and pip. The only drawback is installing modules in local experimental branches – these modules are not necessary in upstream branches, so they don’t belong in requirements.txt yet. When you switch back and want to commit in an upstream branch, the pre-commit hook fails. However, this is easily avoidable by using the --no-verify option of git commit.
  • Second, it runs pyflakes on all the .py files in the repository. If there’s something pyflakes doesn’t like, the pre-commit hook fails and shows the output of pyflakes. There’s one case which is ignored, and that is using the _ (underscore) function from the gettext module, as gettext.install() makes it available everywhere. Pyflakes documentation is non-existent and I guess there’s no way to create a configuration profile for it, so I had to resort to this hack.
  • Finally, since I have deployed code with forgotten set_trace() calls a couple of times, the third thing the script does is check for these and print the file and line number of any it finds (a rough Python sketch of this check follows below).
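The linked gist is a bash script; here is roughly the same set_trace() check expressed in Python, as an illustration:

```python
#!/usr/bin/env python
import subprocess
import sys

# All tracked .py files in the repository.
files = subprocess.check_output(["git", "ls-files", "*.py"]).decode().split()

found = False
for path in files:
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if "set_trace()" in line:
                print("%s:%d: leftover set_trace()" % (path, lineno))
                found = True

# A non-zero exit status makes git abort the commit.
sys.exit(1 if found else 0)
```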

I keep this file as part of the repository, making a symbolic link to it in .git/hooks/pre-commit. Make sure the file is executable.

Do you have similar stuff in your VCS hooks? Is there anything I could improve in mine? I’ll be glad to see your tips in the comments.

Server-side verification of Google Play subscriptions

TL;DR To programmatically verify Google Play subscriptions, you have to use the OAuth 2.0 for web server applications auth flow, not service accounts. This is a massive FUBAR on Google’s side and makes developers’ lives very painful.

Lately, I’ve been working on the backend part of an upcoming app we’re developing for one of our clients. This app offers monthly and yearly subscriptions, so I had to implement a check that the recurring payment happened, the credit card got billed, and the app provider got the money. Of course, for multiple reasons, this has to be done server-side, completely automatically and without any intervention from the app user or provider.

Google provides an API called android-publisher for this. To use any API from Google, you first have to enable it in the Console and then authenticate with it. The authentication is done via OAuth 2.0. As Google offers API access to many of their services, which are used on different occasions, they also offer different OAuth 2.0 authentication flows.

The flow/mechanism for server-to-server communication is called Service accounts in Google terminology. This is precisely what I needed. However, for reasons beyond my understanding, this is not the one used for the android-publisher API. Instead, they chose the Web server applications flow, which for this use case is absurd.

(Sidenote: When we started to build the aforementioned app, recurring transactions were not even available for Android. We planned to use PayPal, as we did for the BlackBerry version. However, during development, Google introduced subscriptions for Android, which made us happy.

I started reading the docs and implementing the whole auth and check code, but it didn’t work; I was getting a “This developer account does not own the application.” HTTP 401 error. Googling for this didn’t help – at that time, the only search results were two questions on Stack Overflow that were a couple of hours old. I would swear the docs at that time said to use Service accounts for authentication and Google changed it later. I had to re-read the docs from the beginning to debug this infuriating error.)

Using the Web server applications flow is ridiculous because human interaction is involved. At least once, you (in this case our client!) need to press an “Allow” button in your web browser. Palm, meet face.

Here are the instructions you need to follow to achieve automated subscription verification. The code is in Python but it’s easy to adapt.

First of all, in the Console, you need to create a Client ID for web applications. You can use http://localhost as the redirect hostname. As you’ll see in a minute, it doesn’t matter much. You mostly need the Client ID and Client secret.

Next, fire up the Python REPL and enter this:

https://gist.github.com/3450666
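(The gist is not inlined here; the following is a sketch of what such code looks like with the oauth2client library of that era. The client ID and secret values are placeholders.)

```python
from oauth2client.client import OAuth2WebServerFlow

flow = OAuth2WebServerFlow(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    scope="https://www.googleapis.com/auth/androidpublisher",
    redirect_uri="http://localhost",
)
# The generated URL includes access_type=offline by default, which is
# what makes Google return a refresh token later on.
print(flow.step1_get_authorize_url())
```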

Use the Client ID and Client secret you obtained from the Console. This piece of code will give you an authentication URL; by default, it will contain the access_type=offline parameter. This is very important, so make sure it’s there. Open the URL in your browser and log in with the Google account that you will be using for publishing the Android application. After a successful login and authorization, you’ll be redirected to localhost in your browser. Unless you’re running a webserver locally, this will probably fail, but it doesn’t matter. The address you are redirected to will contain a code parameter. Copy its value and go back to the REPL again:

https://gist.github.com/3450785
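(Again a sketch of the linked gist, continuing the REPL session above; the argument is the code value copied from the redirect URL.)

```python
# Exchange the one-time code for credentials.
credentials = flow.step2_exchange("THE_CODE_FROM_THE_REDIRECT")
# This is the piece you want to keep around.
print(credentials.refresh_token)
```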

Finally, you’ve got an instance of the oauth2client.client.OAuth2Credentials class. It contains a couple of properties, but the only one that’s really interesting is the refresh_token. Store the refresh token in your server configuration; you can use it forever – meaning until someone revokes the access to the API. Then you would have to go through this whole process again.

Basically, thanks to this refresh token, you will be able to obtain a new access token on each call to the API. To do that, you create an instance of OAuth2Credentials and use that to authorize an httplib2.Http object:

https://gist.github.com/3451039
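(Another sketch of the linked gist; the refresh token is the one stored earlier, and the token URI is Google’s standard OAuth 2.0 token endpoint.)

```python
import httplib2
from oauth2client.client import OAuth2Credentials

credentials = OAuth2Credentials(
    access_token=None,  # refreshed automatically on the first request
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    refresh_token="YOUR_STORED_REFRESH_TOKEN",
    token_expiry=None,
    token_uri="https://accounts.google.com/o/oauth2/token",
    user_agent=None,
)
http = credentials.authorize(httplib2.Http())
```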

You can now build a service object and call the purchases get API method.
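(A sketch of that call, assuming the google-api-python-client of the time; the package name, subscription ID, and purchase token are placeholders.)

```python
from apiclient.discovery import build

service = build("androidpublisher", "v1", http=http)
result = service.purchases().get(
    packageName="com.example.app",
    subscriptionId="monthly_subscription",
    token="PURCHASE_TOKEN_FROM_THE_APP",
).execute()
# The response tells you whether and until when the subscription is valid.
```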

The following gist summarizes the whole blog post:

https://gist.github.com/3451509

As long as the API access is not revoked, you should be fine using this method.

Programming for iOS from the point of view of a Python developer

I gave a talk at Prague’s Python user group meetup. It was about my experience of learning and using Obj-C to develop iOS apps as a Python developer. You can check out the recorded video below. Slides are on Speakerdeck (I tried to embed them, but Posterous doesn’t play nicely with Speakerdeck).

A huge thanks to Jiří, Aleš, and Jakub for inviting me, and to everyone else who attended.

Introducing Myngo: A web admin for MongoDB

Myngo is a web administration interface for MongoDB. It is written in Python, runs on Tornado, and uses jQuery on the front-end. It is a fresh, new project, so there’s no package yet. If you want to try it out, follow the instructions.

So far, Myngo can display a list of available databases and collections, and some server info. You can also perform some actions on the DBs and collections. Check the screenshots for more details.

[Screenshots: databases, collections, server info, and a modal dialog]

There are a lot of features planned for Myngo, the most significant being:
* querying (or an interactive console)
* user auth and a permission system
* a slick UI
* some kind of test suite

I hope you’ll like what you see. Please report bugs and feature requests to the project’s bug tracker or directly to me. I’ll also greatly appreciate any kind of help (especially with design and layout), so feel free to fork and hack away.

If you find Myngo valuable, please consider donating so I can spend more time improving it.