Sometimes you’re unsatisfied with the past - and sometimes you would like to change it… but how would you do that? Changing history would need a time machine, but when using Git you can start rewriting history today!

We had some Git repositories with credentials and other confidential data in their history and wanted to migrate from an internal Git server to GitHub. Before uploading our repositories to an external platform we needed to remove any private content. Here is how we did it.

A note in advance: before rewriting history, tell your team to push any unpublished changes and then stop adding commits. They should also be aware that they have to clean up their local clones after you’ve modified the history. See the section Aftermath at the end of the article for details.

Theory: git filter-branch

With filter-branch, Git lets us apply arbitrary changes to every commit in a repository. Looking at the example taken from the GitHub docs, you might get an idea why no one wants to remember the syntax:

$ git filter-branch --force --index-filter \
'git rm --cached --ignore-unmatch PATH-TO-YOUR-FILE-WITH-SENSITIVE-DATA' \
--prune-empty --tag-name-filter cat -- --all
> Rewrite 48dc599c80e20527ed902928085e7861e6b3cbe6 (266/266)
> Ref 'refs/heads/master' was rewritten

Under the hood, the git rm ... command is applied to every commit and the result replaces the original commit. Because a commit’s hash is derived from its content and its parents, every rewritten commit and all of its descendants get a new hash. For big repositories such a rewrite can take some time, the syntax is non-trivial, and rewriting history is not something you do every day. So this approach doesn’t feel very safe.

There’s help, though. The BFG Repo Cleaner can hide the technical details for you and also streamlines many possible use cases when you need to rewrite a Git repository.

Practice: BFG Repo Cleaner

The BFG Repo Cleaner (BFG) improves usability and also performs much better on bigger repositories than Git’s built-in feature. Let’s start with installing BFG. BFG is written for the Java VM and is available as a .jar file from its homepage. You can also download it like this:

wget -q -O bfg.jar https://repo1.maven.org/maven2/com/madgag/bfg/1.13.0/bfg-1.13.0.jar

Before rewriting the history you have to ensure that your repository’s HEAD reflects the desired final state. In other words: you have to create a commit where you delete passwords and other confidential data. BFG treats that commit as protected and leaves its contents untouched, so you don’t need to be afraid of losing anything.

The BFG operates on a bare repository which you can clone from a remote repo by adding --mirror to the familiar git clone command:

git clone --mirror https://git.example.com/username/dirty-repo.git

For our use case of removing confidential text we used the --replace-text option. You only need to provide a text file with a line-separated list of patterns you’d like to remove. By default BFG replaces matching patterns with the text ***REMOVED***. BFG also allows you to fine-tune expressions and replacements, but we’re going to keep it simple.

The following example creates a file patterns.txt with our highly secure passwords:

cat << EOF > patterns.txt
super-secret
foo
password1234
EOF
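
If the plain list is too coarse, the fine-tuning mentioned earlier can be sketched with a second pattern file. As far as we recall from the BFG documentation, a line of the form PATTERN==>REPLACEMENT overrides the default ***REMOVED*** text, and a regex: prefix turns the pattern into a regular expression. The file name patterns-advanced.txt and the sample values are assumptions made up for illustration, so check the syntax against the BFG docs before relying on it:

```shell
# Sketch only: file name and sample values are assumptions for illustration.
# The delimiter is quoted ('EOF') so the shell performs no expansion in the body.
cat << 'EOF' > patterns-advanced.txt
super-secret
password1234==>changeme
regex:api_key=\w+==>api_key=
EOF
```

The first line behaves like the simple list, the second should swap in a custom replacement, and the third should keep the key name while dropping its value.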

Now we have everything in place:

  • A local copy of the bfg.jar
  • A locally installed JRE, too… yep, Docker images with JRE and BFG are also available!
  • A bare clone of our dirty repo (located at ./dirty-repo.git/)
  • The latest commit in our repo reflects the desired state
  • A text file ./patterns.txt containing our blacklist
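
The fourth item on this list can be checked mechanically before running BFG. The following sketch is our own helper, not part of BFG: head_is_clean is a hypothetical name, and it assumes the patterns are literal strings, one per line, with no empty lines (an empty line would match everything).

```shell
# Hypothetical helper, not part of BFG: succeeds only if none of the
# literal patterns occurs in the files of the repository's HEAD commit.
head_is_clean() {
  repo=$1       # path to the bare repository, e.g. dirty-repo.git
  patterns=$2   # pattern file, one literal string per line

  # --git-dir keeps the current directory, so a relative pattern path works.
  # -F treats patterns as fixed strings, -f reads them from the file,
  # -q suppresses output. Searching a tree-ish also works in a bare repo.
  git --git-dir="$repo" grep -q -F -f "$patterns" HEAD
  case $? in
    0) echo "HEAD still contains at least one pattern" >&2; return 1 ;;
    1) return 0 ;;   # git grep exits 1 when nothing matched
    *) echo "git grep failed" >&2; return 2 ;;
  esac
}
```

With the files from this article, the call would be head_is_clean dirty-repo.git patterns.txt. A non-zero exit status means the cleanup commit is still missing something, or that the check itself failed.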

Let’s perform the cleanup:

java -jar ./bfg.jar --replace-text patterns.txt dirty-repo.git/

BFG will run through the complete history and rewrite matching patterns. Only the latest commit won’t be touched.

You’ll see some stats printed by BFG, and you can also verify its success by manually looking at the rewritten history. For example, you can verify that the diff between the two latest commits doesn’t contain confidential data anymore:

cd dirty-repo.git/
git log --pretty=oneline --abbrev-commit
git diff <second-commit-hash> <first-commit-hash>
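
Looking at a single diff only covers two commits. To scan the whole rewritten history, something like the following sketch might help. It is our own hypothetical helper (commits_with_patterns is not a Git or BFG command), it assumes literal patterns with no empty lines, and it only inspects file contents, not commit messages. On large repositories it will be slow, since it runs one git grep per commit.

```shell
# Hypothetical helper: print every commit reachable from any ref whose
# files still contain one of the literal patterns.
commits_with_patterns() {
  repo=$1       # path to the bare repository, e.g. dirty-repo.git
  patterns=$2   # pattern file, one literal string per line

  # rev-list --all walks all commits reachable from all refs.
  git --git-dir="$repo" rev-list --all | while read -r commit; do
    # git grep exits 0 only if the commit's tree contains a match.
    if git --git-dir="$repo" grep -q -F -f "$patterns" "$commit"; then
      echo "$commit"
    fi
  done
}
```

Run from the directory that holds the mirror, commits_with_patterns dirty-repo.git patterns.txt should print nothing once the cleanup has worked.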

When everything looks good, you’ll need to expire Git’s reflog, garbage-collect the old, now unreachable objects, and make the new history the new truth for everyone else:

git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push

Aftermath

Because the commit hashes have changed, other clones still point at the old commits, and a simple git pull would try to merge the old history with the new one. Everyone should make a fresh clone so that they don’t mix their old, dirty commits with your cleaned ones. Alternatively, they can reset their local history like this. Beware that this discards all uncommitted changes to tracked files, staged or not, as well as any local commits that are not on origin/master:

cd /path/to/old/clones/dirty-repo/
git fetch
git reset origin/master --hard
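
Since a hard reset is unforgiving, the same steps can be wrapped in a small guard. This is a minimal sketch under our own assumptions: reset_to_rewritten_origin is a hypothetical helper name, the remote is called origin, and the branch defaults to master as in the commands above.

```shell
# Hypothetical helper: move a local clone onto the rewritten history,
# but refuse to do so while tracked files have uncommitted changes.
reset_to_rewritten_origin() {
  clone=$1               # path to the old working clone
  branch=${2:-master}    # branch to follow, master by default

  # --porcelain gives stable, script-friendly output. Untracked files are
  # normally left alone by a hard reset, so they are excluded from the check.
  if [ -n "$(git -C "$clone" status --porcelain --untracked-files=no)" ]; then
    echo "uncommitted changes in $clone, save them elsewhere first" >&2
    return 1
  fi

  # Fetch the rewritten refs, then point the current branch at them.
  git -C "$clone" fetch -q origin &&
    git -C "$clone" reset -q --hard "origin/$branch"
}
```

The guard only protects uncommitted work. Local commits that were never pushed are still dropped by the reset, which is why the team should push before the rewrite.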

Now everyone should have a beautiful repo free of confidential details and the road is clear for transferring the repo to a provider like GitHub.
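
For the final transfer, one option is to push the cleaned mirror straight to the new remote. This is a minimal sketch, assuming an empty target repository already exists. publish_mirror is a hypothetical helper of ours, and the URL in the usage comment is a placeholder.

```shell
# Hypothetical helper: push every ref of a cleaned mirror clone to a new remote.
publish_mirror() {
  repo=$1     # cleaned bare repository, e.g. dirty-repo.git
  target=$2   # URL or path of the new, empty remote

  # --mirror pushes all refs (branches, tags, ...) and deletes refs on the
  # target that do not exist locally, so only use it on an empty target.
  git --git-dir="$repo" push -q --mirror "$target"
}

# Usage (placeholder URL, replace with your own repository):
#   publish_mirror dirty-repo.git https://github.com/username/clean-repo.git
```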

Update 2019-08-01

Please note that you’ll have to perform additional steps when you’re trying to clean a repository that has already been pushed to GitHub. On GitHub you’ll most likely work with pull requests, which affects how your history is stored there. Merged pull requests won’t be cleaned by the process described above, because GitHub creates read-only synthetic branches for them. You should verify that those branches don’t contain any sensitive details. If they do, please contact GitHub Support to have the data removed for you. In other words: please follow the steps described at GitHub: Removing sensitive data from a repository, especially the following one:

Contact GitHub Support or GitHub Premium Support, asking them to remove cached views and references to the sensitive data in pull requests on GitHub.