This repository has been archived by the owner on Oct 29, 2023. It is now read-only.

Scrub binary files from git history #162

Open: wants to merge 580 commits into base: master


jiridanek

Before: Receiving objects: 100% (6743/6743), 121.52 MiB | 210.00 KiB/s, done.
After: Receiving objects: 100% (6421/6421), 36.37 MiB | 210.00 KiB/s, done.

This change has to be force-pushed. A regular merge does not do the trick, because it would keep the old commits, and the binaries in them, reachable. I am including the exact commands I ran. It might be best if you just run them yourself.

Fixes #101

List all files ever in the repository

# https://git-scm.com/docs/git-log
# http://stackoverflow.com/a/13547351/1047788
git log --name-only --pretty=format: | sort | uniq

List all deleted files ever in the repository

# http://stackoverflow.com/a/21871377/1047788
git log --name-only --diff-filter=D --pretty=format: | sort | uniq

Get changelog

git log --name-status > changelog.txt

Decide what to scrub

# http://www.tldp.org/LDP/abs/html/here-docs.html
cat << EOF > filenamestoscrub.txt
contigs.fasta
google-genomics-dataflow.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140604/dataflow-sdk-1.0.140604.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140604/dataflow-sdk-1.0.140604.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140604/dataflow-sdk-1.0.140604.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140604/dataflow-sdk-1.0.140604-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140604/dataflow-sdk-1.0.140604-javadoc.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140604/dataflow-sdk-1.0.140604-javadoc.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140604/dataflow-sdk-1.0.140604.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140604/dataflow-sdk-1.0.140604.pom.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140604/dataflow-sdk-1.0.140604.pom.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617-javadoc.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617-javadoc.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.pom.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.pom.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140709/dataflow-sdk-1.0.140709.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140709/dataflow-sdk-1.0.140709.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140709/dataflow-sdk-1.0.140709.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140709/dataflow-sdk-1.0.140709-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140709/dataflow-sdk-1.0.140709-javadoc.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140709/dataflow-sdk-1.0.140709-javadoc.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140709/dataflow-sdk-1.0.140709.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140709/dataflow-sdk-1.0.140709.pom.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140709/dataflow-sdk-1.0.140709.pom.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140801/dataflow-sdk-1.0.140801.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140801/dataflow-sdk-1.0.140801.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140801/dataflow-sdk-1.0.140801.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140801/dataflow-sdk-1.0.140801-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140801/dataflow-sdk-1.0.140801-javadoc.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140801/dataflow-sdk-1.0.140801-javadoc.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140801/dataflow-sdk-1.0.140801.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140801/dataflow-sdk-1.0.140801.pom.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140801/dataflow-sdk-1.0.140801.pom.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140808/dataflow-sdk-1.0.140808.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140808/dataflow-sdk-1.0.140808.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140808/dataflow-sdk-1.0.140808.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140808/dataflow-sdk-1.0.140808-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140808/dataflow-sdk-1.0.140808-javadoc.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140808/dataflow-sdk-1.0.140808-javadoc.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140808/dataflow-sdk-1.0.140808.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140808/dataflow-sdk-1.0.140808.pom.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140808/dataflow-sdk-1.0.140808.pom.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140818/dataflow-sdk-1.0.140818.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140818/dataflow-sdk-1.0.140818.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140818/dataflow-sdk-1.0.140818.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140818/dataflow-sdk-1.0.140818-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140818/dataflow-sdk-1.0.140818-javadoc.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140818/dataflow-sdk-1.0.140818-javadoc.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140818/dataflow-sdk-1.0.140818.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140818/dataflow-sdk-1.0.140818.pom.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140818/dataflow-sdk-1.0.140818.pom.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140828/dataflow-sdk-1.0.140828.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140828/dataflow-sdk-1.0.140828.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140828/dataflow-sdk-1.0.140828.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140828/dataflow-sdk-1.0.140828-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140828/dataflow-sdk-1.0.140828-javadoc.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140828/dataflow-sdk-1.0.140828-javadoc.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140828/dataflow-sdk-1.0.140828.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140828/dataflow-sdk-1.0.140828.pom.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140828/dataflow-sdk-1.0.140828.pom.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140915/dataflow-sdk-1.0.140915.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140915/dataflow-sdk-1.0.140915.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140915/dataflow-sdk-1.0.140915.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140915/dataflow-sdk-1.0.140915-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140915/dataflow-sdk-1.0.140915-javadoc.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140915/dataflow-sdk-1.0.140915-javadoc.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140915/dataflow-sdk-1.0.140915.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140915/dataflow-sdk-1.0.140915.pom.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140915/dataflow-sdk-1.0.140915.pom.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140924/dataflow-sdk-1.0.140924.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140924/dataflow-sdk-1.0.140924.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140924/dataflow-sdk-1.0.140924.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140924/dataflow-sdk-1.0.140924-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140924/dataflow-sdk-1.0.140924-javadoc.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140924/dataflow-sdk-1.0.140924-javadoc.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140924/dataflow-sdk-1.0.140924.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140924/dataflow-sdk-1.0.140924.pom.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.140924/dataflow-sdk-1.0.140924.pom.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141013/dataflow-sdk-1.0.141013.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141013/dataflow-sdk-1.0.141013.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141013/dataflow-sdk-1.0.141013.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141013/dataflow-sdk-1.0.141013-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141013/dataflow-sdk-1.0.141013-javadoc.jar.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141013/dataflow-sdk-1.0.141013-javadoc.jar.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141013/dataflow-sdk-1.0.141013.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141013/dataflow-sdk-1.0.141013.pom.md5
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141013/dataflow-sdk-1.0.141013.pom.sha1
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141027/dataflow-sdk-1.0.141027.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141027/dataflow-sdk-1.0.141027.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141120/dataflow-sdk-1.0.141120.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141120/dataflow-sdk-1.0.141120-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141120/dataflow-sdk-1.0.141120.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141120/dataflow-sdk-1.0.141120-sources.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141206/dataflow-sdk-1.0.141206.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141206/dataflow-sdk-1.0.141206-javadoc.jar
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141206/dataflow-sdk-1.0.141206.pom
jars/com/google/cloud/dataflow/dataflow-sdk/1.0.141206/dataflow-sdk-1.0.141206-sources.jar
jars/com/google/cloud/dataflow/dataflow-sdk/maven-metadata-local.xml
jars/com/google/cloud/dataflow/dataflow-sdk/maven-metadata-local.xml.md5
jars/com/google/cloud/dataflow/dataflow-sdk/maven-metadata-local.xml.sha1
jars/org/broadinstitute/sting/gatk/gatk/3.1-1/gatk-3.1-1.jar
jars/org/sf/picard/picard/1.115/picard-1.115.jar
lib/bwa-0.7.9a/bamlite.c
lib/bwa-0.7.9a/bamlite.h
lib/bwa-0.7.9a/bntseq.c
lib/bwa-0.7.9a/bntseq.h
lib/bwa-0.7.9a/bwa.1
lib/bwa-0.7.9a/bwa.c
lib/bwa-0.7.9a/bwa.h
lib/bwa-0.7.9a/bwa-helper.js
lib/bwa-0.7.9a/bwamem.c
lib/bwa-0.7.9a/bwamem_extra.c
lib/bwa-0.7.9a/bwamem.h
lib/bwa-0.7.9a/bwamem_pair.c
lib/bwa-0.7.9a/bwape.c
lib/bwa-0.7.9a/bwase.c
lib/bwa-0.7.9a/bwase.h
lib/bwa-0.7.9a/bwaseqio.c
lib/bwa-0.7.9a/bwtaln.c
lib/bwa-0.7.9a/bwtaln.h
lib/bwa-0.7.9a/bwt.c
lib/bwa-0.7.9a/bwtgap.c
lib/bwa-0.7.9a/bwtgap.h
lib/bwa-0.7.9a/bwt_gen.c
lib/bwa-0.7.9a/bwt.h
lib/bwa-0.7.9a/bwtindex.c
lib/bwa-0.7.9a/bwt_lite.c
lib/bwa-0.7.9a/bwt_lite.h
lib/bwa-0.7.9a/bwtsw2_aux.c
lib/bwa-0.7.9a/bwtsw2_chain.c
lib/bwa-0.7.9a/bwtsw2_core.c
lib/bwa-0.7.9a/bwtsw2.h
lib/bwa-0.7.9a/bwtsw2_main.c
lib/bwa-0.7.9a/bwtsw2_pair.c
lib/bwa-0.7.9a/ChangeLog
lib/bwa-0.7.9a/COPYING
lib/bwa-0.7.9a/example.c
lib/bwa-0.7.9a/fastmap.c
lib/bwa-0.7.9a/is.c
lib/bwa-0.7.9a/kbtree.h
lib/bwa-0.7.9a/khash.h
lib/bwa-0.7.9a/kopen.c
lib/bwa-0.7.9a/kseq.h
lib/bwa-0.7.9a/ksort.h
lib/bwa-0.7.9a/kstring.c
lib/bwa-0.7.9a/kstring.h
lib/bwa-0.7.9a/ksw.c
lib/bwa-0.7.9a/ksw.h
lib/bwa-0.7.9a/kthread.c
lib/bwa-0.7.9a/kvec.h
lib/bwa-0.7.9a/main.c
lib/bwa-0.7.9a/Makefile
lib/bwa-0.7.9a/malloc_wrap.c
lib/bwa-0.7.9a/malloc_wrap.h
lib/bwa-0.7.9a/NEWS.md
lib/bwa-0.7.9a/pemerge.c
lib/bwa-0.7.9a/QSufSort.c
lib/bwa-0.7.9a/QSufSort.h
lib/bwa-0.7.9a/qualfa2fq.pl
lib/bwa-0.7.9a/README.md
lib/bwa-0.7.9a/utils.c
lib/bwa-0.7.9a/utils.h
lib/bwa-0.7.9a/xa2multi.pl
lib/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.jar
lib/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.jar.md5
lib/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.jar.sha1
lib/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617-javadoc.jar
lib/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617-javadoc.jar.md5
lib/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617-javadoc.jar.sha1
lib/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.pom
lib/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.pom.md5
lib/com/google/cloud/dataflow/dataflow-sdk/1.0.140617/dataflow-sdk-1.0.140617.pom.sha1
lib/com/google/cloud/dataflow/dataflow-sdk/maven-metadata-local.xml
lib/com/google/cloud/dataflow/dataflow-sdk/maven-metadata-local.xml.md5
lib/com/google/cloud/dataflow/dataflow-sdk/maven-metadata-local.xml.sha1
lib/org/broadinstitute/sting/gatk/gatk/3.1-1/gatk-3.1-1.jar
lib/org/sf/picard/picard/1.115/picard-1.115.jar
README.md~
EOF
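
When deciding what to scrub, it can also help to rank the blobs in history by size. The following is a minimal sketch, not part of the commands I ran for this pull request; `largest_blobs` is a hypothetical helper name, and it assumes a POSIX shell inside a clone of the repository.

```shell
# Hypothetical helper: print the N largest blobs reachable from any ref,
# one per line as "<size-in-bytes> <path>", largest first.
# Usage (inside a clone): largest_blobs 30
largest_blobs() {
    n="${1:-20}"
    # rev-list prints "<sha> <path>"; cat-file looks up each object's type
    # and size and echoes the path back through %(rest).
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        sed -n 's/^blob //p' |
        sort -k1,1 -n -r |
        head -n "$n"
}
```

Paths that show up near the top and are no longer needed are candidates for filenamestoscrub.txt.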

Scrub the files from history

# DO NOT DO THIS
# http://stackoverflow.com/a/1521498/1047788
# read -r keeps backslashes in filenames intact
while read -r filename; do
    # https://help.github.com/articles/remove-sensitive-data/
    git filter-branch --force --index-filter \
    "git rm --cached --ignore-unmatch $filename" \
    --prune-empty --tag-name-filter cat -- --all
done < filenamestoscrub.txt

Wait for this to complete. It takes a very long time, because every pass of the loop rewrites the whole history again, once per filename. Scrubbing the files one by one was a bad idea.

# DO THIS INSTEAD
# http://stackoverflow.com/a/4229151/1047788
git filter-branch --force --index-filter \
"git rm --cached --ignore-unmatch -- $(tr '\n' ' ' < filenamestoscrub.txt)" \
--prune-empty --tag-name-filter cat -- --all
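
Before pushing, it may be worth confirming that none of the listed paths survive in the rewritten branches and tags. This is a sketch under assumptions, not something I ran as part of this pull request: `remaining_paths` is a hypothetical helper name, and it deliberately looks only at branches and tags, because `git filter-branch` keeps backups of the old history under `refs/original/`, which `--all` would also match.

```shell
# Hypothetical helper: print every path from the list file that still appears
# in a commit reachable from a branch or tag. Prints nothing and returns 0
# when the history is clean; returns non-zero when a listed path remains.
remaining_paths() {
    list="${1:-filenamestoscrub.txt}"
    patterns=$(mktemp)
    # Strip trailing whitespace and blank lines so exact matching works.
    sed -e 's/[[:space:]]*$//' -e '/^$/d' "$list" > "$patterns"
    status=0
    git log --branches --tags --name-only --pretty=format: |
        sort -u |
        grep -F -x -f "$patterns" || status=$?
    rm -f "$patterns"
    # grep exits with 1 when nothing matched, which is the success case here.
    [ "$status" -eq 1 ]
}
```

Running `remaining_paths filenamestoscrub.txt` in the rewritten clone should print nothing if the scrub worked.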

Review and push the result

mvn package

git push origin --force --all
git push origin --force --tags

Local clones

Do steps #8 and #9 from https://help.github.com/articles/remove-sensitive-data/ on each local clone you have.
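
For reference, a commonly used local cleanup after `git filter-branch` looks like the sketch below. Whether it matches the numbered steps in the linked article exactly is an assumption on my part, so check the article before relying on it. `purge_rewritten_leftovers` is a hypothetical helper name. It is destructive: once it has run, the old history cannot be recovered from that clone.

```shell
# Hypothetical helper: drop everything in this clone that still keeps the
# pre-rewrite objects alive, then garbage-collect them.
purge_rewritten_leftovers() {
    # Delete the backup refs that filter-branch keeps under refs/original/.
    git for-each-ref --format='delete %(refname)' refs/original |
        git update-ref --stdin
    # Expire all reflog entries so they no longer pin the old commits.
    git reflog expire --expire=now --all
    # Repack and prune every object that is now unreachable.
    git gc --quiet --prune=now
}
```

A clone that has only ever fetched the old history also holds it under its remote-tracking refs, so re-cloning after the force-push is often the simpler option.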

deflaux and others added 30 commits March 11, 2015 14:03
Sharded BAM reader and a sample read counting pipeline
It's now in the codelabs repository.
--machineType does not have a default of n1-standard-4 in all contexts.
Now that we are doing client-side filtering for strict shard boundaries, we need to ensure that we are requesting the field that the filter will check.
Allow start/end when reading from the API in CountReads, improve script documentation.
Add Genomics API counters for Dataflow UI display.
…tation

Implement sample variant annotation dataflow pipeline
Adding a more comprehensive test for ReadConverter
Added gat-tools-java as a dependency to access the Read->SAMRecord converter
Fixing a few bugs in ReadConverter and adding a new test for Sam -> Read -> Sam conversion.
Update to latest version of DataflowSDK.
Also add validation to --output which fixes googlegenomics#31
deflaux and others added 28 commits November 13, 2015 17:27
Implement fallback CoderProvider.
Proto2Coder is preferred because it is deterministic.
Switch VerifyBamId to v1 Position objects.
Also default to use gRPC for VariantSimilarity.
Also remove obsolete version of JoinNonVariantSegmentsWithVariants.
@googlebot

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for the commit author(s). If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.

@deflaux
Contributor

deflaux commented Feb 1, 2016

@jirkadanek Thanks so much for these detailed instructions!!! We will make it so.
