Heroku and Git submodules
TL;DR: We developed a new buildpack which fetches the Git object tree and uses Git natively to fetch submodules: https://github.com/SectorLabs/heroku-buildpack-git-submodule
We currently host our websites on Heroku. Heroku is a great platform for getting started. It takes the pain out of managing infrastructure and gets you up and running quickly, although it might not be a good long-term solution. This isn't to say that the experience is completely pain-free. From time to time, you run into a problem that requires some engineering to solve. We ran into one.
We host all our code on GitHub, in a private organisation. Heroku deploys straight from our GitHub repository, automatically, every time we push to a certain branch. This is very convenient and gives us continuous deployment without a lot of effort. Recently, we started modularizing our code base to allow code reuse. This meant moving some code into a separate repository that is now included as a Git submodule. Git submodules aren't perfect, but they are good enough for us for now.
Git submodules
Our repository looks like this now:
- duck
- @paddle
- index.html
- somegreatfile.txt
- somefile.txt
- .gitmodules
Where duck is the main repository that gets deployed and paddle is another repository that is included in duck as a Git submodule. A Git submodule is tied to a specific commit. This allows you to pin it to a specific version. When you check out the parent repository, you get exactly the version you committed. If you want to update it, you have to make a commit that stores a new reference point (a commit hash). This commit hash is stored in the Git object tree.
On top of the commit hash of the referenced repository being stored in your Git object tree, you'll also have a .gitmodules file. This file describes where the referenced repository is stored:
[submodule "paddle"]
path = paddle
url = https://github.com/SectorLabs/paddle.git
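To see where that commit hash actually lives, here is a small sketch that builds two throwaway repositories locally and inspects the tree entry for the submodule. The temporary directory, the demo identity and the local paths are assumptions made for illustration; the protocol.file.allow override is only needed because the submodule URL is a local path here, which recent Git versions refuse by default.

```shell
set -eu

work="$(mktemp -d)"
# Identity for the throwaway commits (illustrative values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# A stand-in for the paddle repository.
git init -q "$work/paddle"
echo "hello" > "$work/paddle/index.html"
git -C "$work/paddle" add index.html
git -C "$work/paddle" commit -q -m "Initial paddle commit"
paddle_commit="$(git -C "$work/paddle" rev-parse HEAD)"

# A stand-in for duck that includes paddle as a submodule.
git init -q "$work/duck"
git -C "$work/duck" -c protocol.file.allow=always \
  submodule --quiet add "$work/paddle" paddle
git -C "$work/duck" commit -q -m "Add paddle submodule"

# The tree entry for a submodule is a "gitlink": mode 160000, type commit,
# followed by the pinned commit hash.
gitlink="$(git -C "$work/duck" ls-tree HEAD paddle)"
echo "$gitlink"
pinned_commit="$(echo "$gitlink" | awk '{print $3}')"
echo "duck pins paddle at $pinned_commit"
```

The hash printed by git ls-tree is the one Git uses when it checks out the submodule; the .gitmodules file only contributes the path and the URL.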
When you clone a Git repository that has submodules, you won't automatically see their contents. You have to either instruct git clone to also clone the submodules, or check them out manually after cloning:
git clone --recursive [repo url]
OR
git submodule update --init --recursive
You could use Git subtree to truly embed a Git repository into another, but this can dramatically increase your repository's size, as this would store a copy of the other repository. It would also complicate actively working on the submodule.
Read more about Git submodules and Git subtree here:
- https://git-scm.com/book/en/v2/Git-Tools-Submodules
- https://git-scm.com/book/en/v1/Git-Tools-Subtree-Merging
The problem
When we thought about introducing Git submodules into our project we also researched how Heroku would deal with this. We found: https://devcenter.heroku.com/articles/git-submodules and life was good. It was already taken care of. The joy!
Our joy was short-lived when we saw the following warning:
That's a problem. We're using GitHub Sync to do our deploys. This means that Heroku pulls from our repository when deploying (as a result of some event, like a push) rather than us pushing the changes to them. Heroku only supports Git submodules natively when you deploy with a Git push, not when it pulls from your repository.
We figured that somebody else must have encountered this problem. Google to the rescue! Of course, somebody else did encounter this problem and came up with a solution. The solution is a custom buildpack that takes care of checking out the submodule. Brilliant!
Again, our joy was short-lived (a bit longer-lived this time). Although the custom buildpack worked fine, it came with two major disadvantages:
1. Does not support SSH authentication. This means you have to hard-code credentials in your repository. Obviously, this doesn't apply if your submodule is a public repository which can be cloned without credentials.
2. Does not actually read the Git object tree to check out the submodule at the stored commit. It instead looks at the branch setting specified in the .gitmodules file and simply checks out that branch, completely ignoring the hash stored in the Git object tree.
Number 2 is a big disadvantage. The great advantage of storing the commit hash to which a submodule is tied is that it allows versioning the dependency. When you upgrade the submodule to a newer version, you make a commit. By simply tying it to a branch, you'd have to make sure that branch always contains the right version. Such a system breaks easily and makes our deployment model more complicated.
At first, we wondered why the buildpack we found doesn't simply do git submodule update --init --recursive. That would have been much easier. Here's why it doesn't:
- Buildpacks don't have access to the Git tree. Heroku simply clones your repository at the start of the build, zips up the result (without the .git directory) and starts executing the configured buildpacks.
Ouch. This makes things a bit more complicated. In order to solve this problem, we'd have to somehow develop a buildpack that can access the Git object tree and check out the submodules properly.
The solution
We developed a new buildpack to solve our problem: https://github.com/SectorLabs/heroku-buildpack-git-submodule
This new buildpack should be secure and simple to use. On top of that, it fetches Git submodules exactly like Git does instead of relying on a hack or trick.
Making a simple buildpack
Developing a Heroku buildpack is quite easy. You simply create a public Git repository with the following structure:
- bin/
- detect
- compile
Where detect and compile are marked as executable (chmod +x). This means we can simply write a Bash script to do all the magic.
Read more about creating buildpacks for Heroku here: https://devcenter.heroku.com/articles/buildpack-api
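As a rough illustration of that structure, the sketch below scaffolds a buildpack skeleton in a temporary directory and runs its detect script against a fake app. Per the buildpack API, Heroku calls bin/detect with the build directory and bin/compile with the build, cache and env directories. The check for a .gitmodules file, the printed name and the placeholder compile step are our assumptions for this example, not necessarily what the published buildpack does.

```shell
set -eu

buildpack="$(mktemp -d)"
mkdir -p "$buildpack/bin"

# bin/detect BUILD_DIR: exit 0 and print a name if this buildpack applies.
cat > "$buildpack/bin/detect" <<'EOF'
#!/usr/bin/env bash
# Assumption: the buildpack only applies to apps that declare submodules.
if [ -f "$1/.gitmodules" ]; then
  echo "Git Submodule"
  exit 0
fi
exit 1
EOF

# bin/compile BUILD_DIR CACHE_DIR ENV_DIR: does the actual work.
cat > "$buildpack/bin/compile" <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
build_dir="$1"
# Placeholder: the real work (restoring .git, fetching submodules) goes here.
echo "-----> Would fetch submodules in $build_dir"
EOF

# Heroku executes these files directly, so they must be executable.
chmod +x "$buildpack/bin/detect" "$buildpack/bin/compile"

# Try detect against a fake app that has a .gitmodules file.
app="$(mktemp -d)"
touch "$app/.gitmodules"
detected="$("$buildpack/bin/detect" "$app")"
echo "detect says: $detected"
```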
Getting the object tree
The first step is to get the Git object tree back, because Heroku strips it. We do not, however, want to check out the whole repository. Not only would this be time-consuming for large repositories, it would also overwrite the files Heroku already packed up. We also don't need the entire Git history, just the version that is being deployed.
Git is a very powerful version control system. If you know where to look, you can pull off quite a bit of magic. We can take advantage of the following two Git features to achieve what we want:
- A sparse checkout will allow us to get only the files we're interested in. This feature is usually used to save disk space and checkout time when you don't need every file of a repository in your working tree.
- A shallow clone (--depth 1) will prevent pulling in the entire Git history and get only the last commit.
In order to clone the Git repository (or at least part of it), we need to know where it is located. The Heroku app is connected to our GitHub repository, but inside the buildpack we don't have access to the Git URL. Therefore, we're forced to ask the user of the buildpack to add a setting to their Heroku app:
heroku config:set GIT_REPO_URL=https://github.com/SectorLabs/duck.git
This allows the buildpack to figure out where to clone the repository from. Just cloning the repository is not enough. We also need to check out the right version. When Heroku deploys, it checks out a specific branch (the one you configured). We have to replicate this behavior. Luckily, Heroku sets an environment variable for buildpacks, SOURCE_VERSION, which contains the commit hash of the version that is being deployed. This allows us to check out the right version.
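One detail worth spelling out: per Heroku's buildpack API, config vars such as GIT_REPO_URL are not exported into the environment of bin/compile. They arrive as one file per variable in the ENV_DIR directory passed as the third argument. A minimal sketch of reading one, where read_config_var is a hypothetical helper name and the temporary directory stands in for the real ENV_DIR:

```shell
set -eu

# read_config_var ENV_DIR NAME: print the value of a config var,
# or fail with a message if it is not set on the app.
read_config_var() {
  if [ ! -f "$1/$2" ]; then
    echo "Missing config var: $2" >&2
    return 1
  fi
  cat "$1/$2"
}

# Simulate the ENV_DIR Heroku would pass to bin/compile.
env_dir="$(mktemp -d)"
printf '%s' "https://github.com/SectorLabs/duck.git" > "$env_dir/GIT_REPO_URL"

GIT_REPO_URL="$(read_config_var "$env_dir" GIT_REPO_URL)"
echo "Cloning from $GIT_REPO_URL"
```

Failing loudly when the variable is missing is a design choice of this sketch; it makes a misconfigured app obvious in the build log instead of producing a confusing Git error later.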
The sparse checkout is a little more complicated than it seems. It allows you to check out part of the repository, but we actually want none of it. Unfortunately, when you try to do that, Git refuses with:
error: Sparse checkout leaves no entry on working directory
Our work-around is simple:
rm .gitmodules
echo ".gitmodules" > .git/info/sparse-checkout
This instructs Git to check out only the .gitmodules file. We have to remove the original before we do that. This is not a problem because the version that we're checking out should be exactly the same.
Putting it all together, we get something like this:
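# Recreate an empty repository around the files Heroku unpacked.
git init
# Restrict the checkout to the .gitmodules file only.
git config core.sparseCheckout true
echo ".gitmodules" > .git/info/sparse-checkout
git remote add origin "$GIT_REPO_URL"
# Shallow fetch, then check out the exact commit being deployed.
git fetch -q --depth 1 origin -a
git checkout -q "$SOURCE_VERSION"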
That leaves us with a .git directory containing our Git object tree. We can now use normal Git commands to accomplish the rest.
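To try these steps without deploying anything, the sketch below replays them against a local stand-in. It creates a small repository with two commits, exports its files without the .git directory (roughly what a buildpack receives), and then restores the object tree in place. The file:// URL, the demo identity, the file contents and the hand-written .gitmodules are assumptions for illustration; no real submodule is fetched here.

```shell
set -eu

work="$(mktemp -d)"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in for the deployed repository; two commits make the shallow
# fetch observable afterwards.
git init -q "$work/duck"
cd "$work/duck"
echo "v1" > somefile.txt
printf '[submodule "paddle"]\n\tpath = paddle\n\turl = https://github.com/SectorLabs/paddle.git\n' > .gitmodules
git add .
git commit -q -m "First commit"
echo "v2" > somefile.txt
git commit -q -am "Second commit"
SOURCE_VERSION="$(git rev-parse HEAD)"
GIT_REPO_URL="file://$work/duck"

# Simulate what the buildpack receives: the files, but no .git directory.
mkdir "$work/build"
git archive HEAD | tar -xf - -C "$work/build"
cd "$work/build"

# The steps from the article.
rm .gitmodules
git init -q
git config core.sparseCheckout true
mkdir -p .git/info   # normally created by git init; kept for robustness
echo ".gitmodules" > .git/info/sparse-checkout
git remote add origin "$GIT_REPO_URL"
git fetch -q --depth 1 origin
git checkout -q "$SOURCE_VERSION"

restored_head="$(git rev-parse HEAD)"
history_length="$(git rev-list --count HEAD)"
echo "HEAD is $restored_head, history length $history_length"
```

After the checkout, the files that were already on disk are left alone, .gitmodules is back, and the repository knows about exactly one commit.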
Dealing with authentication
As mentioned, another problem was that the solutions we found required hard-coding the credentials as part of the Git URL:
[submodule "paddle"]
path = paddle
url = https://username:password@github.com/SectorLabs/paddle.git
This is a bit of a security risk. It would be much nicer if we could specify this as part of the Heroku app. Not only would this be more secure, it would also make it easier to use different credentials per environment.
We could simply get the username/password from an environment variable set on the Heroku app. However, we'd rather use SSH keys for authentication, partly because this doesn't require us to pay an additional $9 a month for a user on GitHub. SSH keys can be specified as GitHub deploy keys.
For this, our buildpack requires the GIT_SSH_KEY setting to be set on the Heroku app, specifying a private SSH key to use to authenticate to both the repository and its submodules:
heroku config:set GIT_SSH_KEY="$(cat ~/.ssh/id_rsa)"
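Inside the buildpack, the key then has to reach Git somehow. One way to do that, sketched below, is to write it to a file with owner-only permissions and point Git at it through GIT_SSH_COMMAND (available since Git 2.3). The install_ssh_key helper, the id_buildpack file name and the fake key contents are hypothetical, and this is not necessarily how the published buildpack does it. Host key verification (known_hosts) is deliberately left out of the sketch.

```shell
set -eu

# install_ssh_key ENV_DIR KEY_DIR: write the private key from the
# GIT_SSH_KEY config var to disk and make Git use it.
install_ssh_key() {
  key_file="$2/id_buildpack"
  mkdir -p "$2"
  # Create the file with owner-only permissions from the start; ssh
  # refuses private keys that other users can read.
  ( umask 077 && cat "$1/GIT_SSH_KEY" > "$key_file" )
  # Every git command that talks over SSH, including
  # `git submodule update`, will now use this key.
  # Assumes the key path contains no spaces.
  GIT_SSH_COMMAND="ssh -i $key_file -o IdentitiesOnly=yes"
  export GIT_SSH_COMMAND
}

# Simulate ENV_DIR with a fake key (not a real private key).
env_dir="$(mktemp -d)"
printf '%s\n' "-----BEGIN FAKE KEY-----" "not-a-real-key" "-----END FAKE KEY-----" \
  > "$env_dir/GIT_SSH_KEY"

key_dir="$(mktemp -d)"
install_ssh_key "$env_dir" "$key_dir"
key_mode="$(ls -l "$key_dir/id_buildpack" | cut -c1-10)"
echo "Key installed with mode $key_mode"
```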
Open-source
As usual, we release our solution to the community under the liberal MIT license:
https://github.com/SectorLabs/heroku-buildpack-git-submodule
We look forward to hearing your feedback, positive or negative.
Come work for us
Does all of this excite you and would you like to work for us? Don't wait, check out our job listings here: https://www.sectorlabs.ro/jobs/
Yes, we do have rubber ducks.
🦆