Removing Large Files from a Repository's Git History

I consider myself to be a grug-brained developer. I enjoy building and maintaining software systems that eschew complexity in favor of readability and ease of understanding. This website is one of those systems: the entire thing is built as a static site from a bunch of loose HTML and Markdown files all jammed together using a dead simple templating system in a Bash script. All pieces of media content (images, videos, etc.) are stored directly in the Git repository alongside the textual content. As a result of this grug-brained structure, the website repository has ended up with some rather large binary files in its Git history, many of which correspond to media content no longer accessible on the actual deployed website.

I put together a tool to scrape this repository's commit history and dump the largest files from that history to the terminal, sorted by size:

#!/bin/sh
#
# Usage: git-history-largest-files.sh [NUMBER-OF-FILES]
set -eu

if [ $# -ge 2 ]; then
    echo "Usage: $0 [NUMBER-OF-FILES]" >&2
    exit 1
fi

COUNT=${1:-10}
BYTES_PER_GB=1073741824
BYTES_PER_MB=1048576
BYTES_PER_KB=1024

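# List every object reachable from any ref, keep only blobs that have a path,
# and emit "SIZE<TAB>PATH" (the whole path, even if it contains spaces).
git rev-list --objects --all | \
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | \
    awk '/^blob/ && NF >= 4 {size = $3; sub(/^[^ ]+ [^ ]+ [^ ]+ /, ""); print size "\t" $0}' | \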
    sort -rn | \
    head -n "$COUNT" | \
    awk -F'\t' \
        -v gb="$BYTES_PER_GB" \
        -v mb="$BYTES_PER_MB" \
        -v kb="$BYTES_PER_KB" \
        '{
            size = $1
            path = $2

            if (size >= gb) {
                printf "%.2f GB %s\n", size/gb, path
            }
            else if (size >= mb) {
                printf "%.2f MB %s\n", size/mb, path
            }
            else if (size >= kb) {
                printf "%.2f KB %s\n", size/kb, path
            }
            else {
                printf "%d B %s\n", size, path
            }
        }'
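
The formatting stage at the end of that pipeline can be tried in isolation. Below is a minimal sketch, assuming a hypothetical helper named human_size (it is not part of the original tool) that applies the same thresholds to a single byte count:

```shell
# Hypothetical helper (not part of the original script): format one byte
# count using the same 1024-based thresholds as the awk stage above.
human_size() {
    awk -v size="$1" 'BEGIN {
        gb = 1073741824; mb = 1048576; kb = 1024
        if (size >= gb)      printf "%.2f GB\n", size / gb
        else if (size >= mb) printf "%.2f MB\n", size / mb
        else if (size >= kb) printf "%.2f KB\n", size / kb
        else                 printf "%d B\n", size
    }'
}

human_size 49807360   # 47.5 * 1048576 bytes, prints "47.50 MB"
```

Since the divisors are powers of 1024, the values labelled MB here are on the same scale as the MiB figures that git count-objects -vH reports.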

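Running the tool over this repo, we can see that the two largest blobs in the history are both versions of this Braille Apple video, which I had actually moved from this site to Vimeo a while back. Together they account for roughly 76 MB of blob data in a repository whose pack is only about 109 MiB: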

$ git gc >/dev/null 2>&1 && git count-objects -vH
count: 0
size: 0 bytes
in-pack: 1523
packs: 2
size-pack: 109.24 MiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes
$ sh tools/git-history-largest-files.sh
47.50 MB src/misc/2021-10-18-braille-apple.webm
28.72 MB src/misc/2021-10-18-braille-apple.webm
3.93 MB src/blog/2022-03-19-breaking-cookie-clicker/living-room-pi.jpg
3.82 MB src/blog/2022-03-19-breaking-cookie-clicker/supplies.jpg
3.76 MB src/blog/2022-03-19-breaking-cookie-clicker/living-room-pi-and-tv.jpg
2.13 MB src/blog/2022-03-19-breaking-cookie-clicker/cookie-clicker-final-stats.png
2.08 MB src/misc/2020-07-04-disco-descent-1-1.mp3
1.87 MB src/blog/2024-09-02-scripting-with-value-semantics-using-lumpy/lumpy-game.mp4
1.57 MB src/blog/2022-03-19-breaking-cookie-clicker/cookie-clicker-auto-clicker.png
1.18 MB src/misc/2020-08-16-dvd-screensaver/2020-08-16-dvd-screensaver.mp4
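
To put a number on how much of the history a single path accounts for, the sizes of every blob recorded under that path can be summed. This is a rough sketch, assuming a hypothetical helper named history_bytes_for_path. It reports uncompressed blob sizes, so the total will not match the on-disk pack size exactly, and git rev-list lists each blob under only one path, so content shared between paths is counted once:

```shell
# Hypothetical helper: sum the uncompressed sizes (in bytes) of all blobs
# that git rev-list reports under the given path, across every ref.
history_bytes_for_path() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        WANT_PATH="$1" awk '
            $1 == "blob" {
                size = $2
                sub(/^[^ ]+ [^ ]+ /, "")   # leave only the path in $0
                if ($0 == ENVIRON["WANT_PATH"]) total += size
            }
            END { printf "%d\n", total }'
}
```

It would be invoked from inside the repository, for example as history_bytes_for_path src/misc/2021-10-18-braille-apple.webm.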

I don't actually need to keep this src/misc/2021-10-18-braille-apple.webm video around in the repository's Git history; the content is stored elsewhere, and this site is much more of a living document than it is a historical archive. So I am fine removing this file if it shrinks the repository size and gives me more leeway to jam different media blobs in later.

Okay so now that we have a file path, let's use git filter-branch [1] to replace the binary file with a placeholder across the entire Git history, and then yeet remaining references to the old objects/commits:

#!/bin/sh
#
# Usage: git-history-elide-file.sh FILE
set -eu

if [ $# -ne 1 ]; then
    echo "Usage: $0 FILE" >&2
    exit 1
fi

FILE="$1"

export FILE
export FILTER_BRANCH_SQUELCH_WARNING=1
git filter-branch -f --tree-filter '
if git ls-files | grep -qxF -- "${FILE}"; then
    mkdir -p "$(dirname "${FILE}")"
    echo "original file removed from git history" > "${FILE}"
fi
' -- --all
rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --prune=now --aggressive
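
The rm -rf on .git/refs/original/ only removes backup refs that are stored as loose files. As an alternative sketch, the backup refs can be deleted through Git itself, which also covers refs that have been packed. The function name below is hypothetical:

```shell
# Hypothetical helper: delete every backup ref that git filter-branch
# leaves under refs/original/, whether it is loose or packed.
drop_filter_branch_backups() {
    git for-each-ref --format='%(refname)' refs/original/ |
        while read -r ref; do
            git update-ref -d "$ref"
        done
}
```

The loop simply does nothing when no backup refs exist, so it should be safe to run more than once.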

Running this tool, we can see that it removed the bloat from that video:

$ sh tools/git-history-elide-file.sh src/misc/2021-10-18-braille-apple.webm
Rewrite e326cb816c72b2986ab9d6a7472ee3999860936b (287/294) (9 seconds passed, remaining 0 predicted)
WARNING: Ref 'refs/heads/main' is unchanged
WARNING: Ref 'refs/remotes/github/main' is unchanged
WARNING: Ref 'refs/remotes/github/main' is unchanged
WARNING: Ref 'refs/remotes/sourcehut/main' is unchanged
Enumerating objects: 1513, done.
Counting objects: 100% (1513/1513), done.
Delta compression using up to 16 threads
Compressing objects: 100% (1495/1495), done.
Writing objects: 100% (1513/1513), done.
Total 1513 (delta 973), reused 387 (delta 0), pack-reused 0
$ git count-objects -vH
count: 0
size: 0 bytes
in-pack: 1513
packs: 1
size-pack: 33.01 MiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes
$ sh tools/git-history-largest-files.sh 5
3.93 MB src/blog/2022-03-19-breaking-cookie-clicker/living-room-pi.jpg
3.82 MB src/blog/2022-03-19-breaking-cookie-clicker/supplies.jpg
3.76 MB src/blog/2022-03-19-breaking-cookie-clicker/living-room-pi-and-tv.jpg
2.13 MB src/blog/2022-03-19-breaking-cookie-clicker/cookie-clicker-final-stats.png
2.08 MB src/misc/2020-07-04-disco-descent-1-1.mp3

After getting rid of the src/misc/2021-10-18-braille-apple.webm file, we observe that we reduced our pack size from ~109 MiB to ~33 MiB. That size reduction is good enough for now, but there are plenty of other now-deleted files that I could remove in the future!
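
For finding those other candidates, one option is to list only the history blobs whose path no longer exists in the current HEAD. The following is a minimal sketch under that assumption, with a hypothetical function name. Because git rev-list records a single path per blob, a file that was merely renamed may still show up under its old name:

```shell
# Hypothetical helper: print "SIZE<TAB>PATH" (largest first) for every blob
# in history whose path is absent from the tree at HEAD.
history_blobs_missing_from_head() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk '/^blob/ && NF >= 3 {size = $2; sub(/^[^ ]+ [^ ]+ /, ""); print size "\t" $0}' |
        sort -rn |
        while IFS="$(printf '\t')" read -r size path; do
            # git cat-file -e exits non-zero when HEAD has no such path.
            if ! git cat-file -e "HEAD:$path" 2>/dev/null; then
                printf '%s\t%s\n' "$size" "$path"
            fi
        done
}
```

Each path it prints could then be handed to tools/git-history-elide-file.sh one at a time.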

Footnotes

1. Yes, we probably should be using the more modern git-filter-repo. And yes, I was too lazy to install it on my machine for this blog post; I am grug-brained and git filter-branch was already on my machine and perfectly usable.