Implement archive:upload:delta

Ok great, we can now run the delta archive process!

It'd be nice to get this running on cron on the impress-2020 server, to a temporary folder? I *do* want to remember to run something regularly on my personal machine too, though, to keep my own copy up-to-date…
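As a sketch of the cron idea above, a daily entry on the server might look like the following. The schedule, the working directory, and the assumption that the script is invoked as `yarn archive:upload:delta` are all placeholders, not part of this commit:

```
# Hypothetical crontab entry: run the delta upload nightly at 03:00,
# appending output to a temporary log. All paths here are made up.
0 3 * * * cd /srv/impress-2020 && yarn archive:upload:delta >> /tmp/archive-delta.log 2>&1
```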
This commit is contained in:
Emi Matchu 2022-11-05 02:15:31 -07:00
parent 12b5a56694
commit 88511d3dc6
2 changed files with 21 additions and 2 deletions


@@ -8,7 +8,7 @@ yarn aws s3 ls --recursive s3://dti-archive/ \
 sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2}\s+[0-9]{2}:[0-9]{2}:[0-9]{2}\s+[0-9]+\s+/https:\/\//' \
 | \
 # Hacky urlencode; the only % value in URLs list today is %20, so...
-sed -E 's/ /%20/' \
+sed -E 's/ /%20/g' \
 | \
 # Output to manifest-remote.txt, and print to the screen.
 tee $(dirname $0)/../manifest-remote.txt
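For illustration, here's the behavior difference the added `g` flag makes. Without it, `sed` replaces only the first match on each line; with it, every match. The URL below is a made-up example, not from the real manifest:

```shell
# A hypothetical URL containing two spaces:
url='https://images.example.com/my folder/my file.png'

printf '%s\n' "$url" | sed -E 's/ /%20/'
# Only the first space is encoded: .../my%20folder/my file.png

printf '%s\n' "$url" | sed -E 's/ /%20/g'
# Every space is encoded: .../my%20folder/my%20file.png
```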


@@ -1 +1,20 @@
-echo 'archive:upload:delta -- TODO!'
+cat "$(dirname "$0")/../manifest-delta.txt" \
+| \
+# Remove the URL scheme to convert it to a folder path in our archive
+sed -E 's/^https?:\/\///' \
+| \
+# Hacky urldecode; the only % value in the URLs list today is %20, so...
+sed -E 's/%20/ /g' \
+| \
+# Upload each URL to the remote archive!
+# NOTE: This is slower than I'd hoped, probably because each command has to
+# set up a new connection? If we needed to be faster, we could refactor
+# the `create` step to download to a temporary delta folder, then `cp`
+# that into the main archive, but run `aws s3 sync` on just the delta
+# folder (with care not to delete keys that are present in the remote
+# archive but not in the delta folder!). But this seems to run at an
+# acceptable speed (i.e. a few hours) when it's run daily.
+while read -r path; do
+  yarn aws s3 cp "$ARCHIVE_DIR/$path" "s3://$ARCHIVE_STORAGE_BUCKET/$path";
+done
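The batched alternative the NOTE above describes might be sketched roughly as follows. This is an untested sketch under stated assumptions, not the committed code: the paths are placeholders, and it assumes the `create` step has been refactored to download into a separate delta folder first:

```shell
# Placeholder paths for illustration; the real script would already have these.
DELTA_DIR=/tmp/dti-archive-delta
ARCHIVE_DIR=/tmp/dti-archive
ARCHIVE_STORAGE_BUCKET=dti-archive
mkdir -p "$DELTA_DIR" "$ARCHIVE_DIR"

# Merge the freshly-downloaded delta into the local archive copy...
cp -R "$DELTA_DIR/." "$ARCHIVE_DIR/"

# ...then sync just the (small) delta folder to S3 in one batched operation.
# Deliberately no --delete flag, so keys present in the remote archive but
# absent from the delta folder are left alone, as the NOTE warns.
yarn aws s3 sync "$DELTA_DIR" "s3://$ARCHIVE_STORAGE_BUCKET/"
```

The win is that `sync` batches many small uploads over one long-lived connection pool, instead of paying connection setup per file as the `cp`-per-line loop does.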