Using git repository as a database backend

Answering my own question is not the best thing to do, but, as I ultimately dropped the idea, I’d like to share on the rationale that worked in my case. I’d like to emphasize that this rationale might not apply to all cases, so it’s up to architect to decide.

Generally, the first main point my question misses is that I’m dealing with multi-user system that work in parallel, concurrently, using my server with a thin client (i.e. just a web browser). This way, I have to maintain state for all of them. There are several approaches to this one, but all of them are either too hard on resources or too complex to implement (and thus kind of kill the original purpose of offloading all the hard implementation stuff to git in the first place):

  • “Blunt” approach: 1 user = 1 state = 1 full working copy of a repository that server maintains for user. Even if we’re talking about fairly small document database (for example, 100s MiBs) with ~100K of users, maintaining full repository clone for all of them makes disc usage run through the roof (i.e. 100K of users times 100MiB ~ 10 TiB). What’s even worse, cloning 100 MiB repository each time takes several seconds of time, even if done in fairly effective maneer (i.e. not using by git and unpacking-repacking stuff), which is non acceptable, IMO. And even worse — every edit that we apply to a main tree should be pulled to every user’s repository, which is (1) resource hog, (2) might lead to unresolved edit conflicts in general case.

    Basically, it might be as bad as O(number of edits × data × number of users) in terms of disc usage, and such disc usage automatically means pretty high CPU usage.

  • “Only active users” approach: maintain working copy only for active users. This way, you generally store not a full-repo-clone-per-user, but:

    • As user logs in, you clone the repository. It takes several seconds and ~100 MiB of disc space per active user.
    • As user continues to work on the site, he works with the given working copy.
    • As user logs out, his repository clone is copied back to main repository as a branch, thus storing only his “unapplied changes”, if there are any, which is fairly space-efficient.

    Thus, disc usage in this case peaks at O(number of edits × data × number of active users), which is usually ~100..1000 times less than number of total users, but it makes logging in/out more complicated and slower, as it involves cloning of a per-user branch on every login and pulling these changes back on logout or session expiration (which should be done transactionally => adds another layer of complexity). In absolute numbers, it drops 10 TiBs of disc usage down to 10..100 GiBs in my case, that might be acceptable, but, yet again, we’re now talking about fairly small database of 100 MiBs.

  • “Sparse checkout” approach: making “sparse checkout” instead of full-blown repo clone per active user doesn’t help a lot. It might save ~10x of disc space usage, but at expense of much higher CPU/disc load on history-involving operations, which kind of kills the purpose.

  • “Workers pool” approach: instead of doing full-blown clones every time for active person, we might keep a pool of “worker” clones, ready to be used. This way, every time a users logs in, he occupies one “worker”, pulling there his branch from main repo, and, as he logs out, he frees the “worker”, which does clever git hard reset to become yet again just a main repo clone, ready to be used by another user logging in. Does not help much with disc usage (it’s still pretty high — only full clone per active user), but at least it makes logging in/out faster, as expense of even more complexity.

That said, note that I intentionally calculated numbers of fairly small database and user base: 100K users, 1K active users, 100 MiBs total database + history of edits, 10 MiBs of working copy. If you’d look at more prominent crowd-sourcing projects, there are much higher numbers there:

│              │ Users │ Active users │ DB+edits │ DB only │
├──────────────┼───────┼──────────────┼──────────┼─────────┤
│ MusicBrainz  │  1.2M │     1K/week  │   30 GiB │  20 GiB │
│ en.wikipedia │ 21.5M │   133K/month │    3 TiB │  44 GiB │
│ OSM          │  1.7M │    21K/month │  726 GiB │ 480 GiB │

Obviously, for that amounts of data/activity, this approach would be utterly unacceptable.

Generally, it would have worked, if one could use web browser as a “thick” client, i.e. issuing git operations and storing pretty much the full checkout on client’s side, not on the server’s side.

There are also other points that I’ve missed, but they’re not that bad compared to the first one:

  • The very pattern of having “thick” user’s edit state is controversial in terms of normal ORMs, such as ActiveRecord, Hibernate, DataMapper, Tower, etc.
  • As much as I’ve searched for, there’s zero existing free codebase for doing that approach to git from popular frameworks.
  • There is at least one service that somehow manages to do that efficiently — that is obviously github — but, alas, their codebase is closed source and I strongly suspect that they do not use normal git servers / repo storage techniques inside, i.e. they basically implemented alternative “big data” git.

So, bottom line: it is possible, but for most current usecases it won’t be anywhere near the optimal solution. Rolling up your own document-edit-history-to-SQL implementation or trying to use any existing document database would be probably a better alternative.

Leave a Comment