Find data files in subdirectories in archives
See dogsheep#54 for discussion. This also ignores files in the new "assets"
directory, which appear to be some stuff for a browser interface
Twitter's created.
jacobian committed Jan 5, 2021
1 parent 21fc1ca commit b1936fb
Showing 1 changed file with 6 additions and 2 deletions.
8 changes: 6 additions & 2 deletions twitter_to_sqlite/utils.py
@@ -641,8 +641,12 @@ def read_archive_js(filepath):
     "Open zip file, return (filename, content) for all .js"
     zf = zipfile.ZipFile(filepath)
     for zi in zf.filelist:
-        if zi.filename.endswith(".js"):
-            yield zi.filename, zf.open(zi.filename).read()
+        # Ignore files in a assets dir -- these are for Twitter's archive
+        # browser thingie -- and only use final filenames since some archives
+        # appear to put data in a data/ subdir, which can screw up the filename
+        # -> importer mapping.
+        if zi.filename.endswith(".js") and not zi.filename.startswith("assets/"):
+            yield pathlib.Path(zi.filename).name, zf.open(zi.filename).read()
 
 
 def extract_and_save_source(db, source):
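
For anyone who wants to see the effect of this change in isolation, here is a minimal sketch that copies the patched function and runs it against a small in-memory zip. The archive layout and file contents are made up for illustration; only the `data/` and `assets/` directory names come from the commit itself.

```python
import io
import pathlib
import zipfile


def read_archive_js(filepath):
    "Open zip file, return (filename, content) for all .js"
    zf = zipfile.ZipFile(filepath)
    for zi in zf.filelist:
        # Skip anything under a top-level assets/ directory and yield only
        # the final path component, so data/tweet.js is reported as tweet.js.
        if zi.filename.endswith(".js") and not zi.filename.startswith("assets/"):
            yield pathlib.Path(zi.filename).name, zf.open(zi.filename).read()


# Hypothetical archive: member names and contents are placeholders.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    archive.writestr("data/tweet.js", "placeholder tweet data")
    archive.writestr("account.js", "placeholder account data")
    archive.writestr("assets/js/runtime.js", "placeholder browser UI code")
    archive.writestr("Your archive.html", "<html></html>")
buf.seek(0)

results = list(read_archive_js(buf))
names = [name for name, _ in results]
print(names)
```

Two things follow from the code as written: the `startswith("assets/")` check only matches an `assets` directory at the root of the zip, and two `.js` files with the same basename in different directories would be yielded under the same name.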
