Delta version of archive scripts
I like running the full `archive:create` to help us be _confident_ we've got the whole darn thing, but it takes multiple days to run on my machine and its slow HDD, which… I'm willing to do _sometimes_, but not frequently.
But if we had a version of the script that ran faster, and only on URLs we still _need_, we could run that more regularly and keep our live archive relatively up-to-date. This would enable us to build reliable fallback infra for when images.neopets.com isn't responding (like today lol)!
Anyway, I stopped early in this process because images.neopets.com isn't responding today, which means I can't really run updates, lol :p But the delta-ing stuff seems to work, and it takes closer to 30min to get the full state from the live archive. That's, y'know, still slow, but it'll make for a MUCH faster process than multiple days, lol
2022-10-13 15:08:29 -07:00
# List all the files in our bucket. (The CLI handles pagination, thank you!)
yarn aws s3 ls --recursive s3://dti-archive/ \
| \
# Filter out unnecessary lines; just give us lines formatted like results.
grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}\s+[0-9]{2}:[0-9]{2}:[0-9]{2}\s+[0-9]+\s+' \
| \
# Replace all the extra info like time and size with "https://".
sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2}\s+[0-9]{2}:[0-9]{2}:[0-9]{2}\s+[0-9]+\s+/https:\/\//' \
| \
# Hacky urlencode; the only % value in URLs list today is %20, so...
sed -E 's/ /%20/g' \
| \
# Output to manifest-remote.txt, and print to the screen.
tee "$(dirname "$0")/../manifest-remote.txt"
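Once manifest-remote.txt exists, the "only URLs we still need" part could be a plain set difference against the full list of URLs we want archived. Here's a minimal sketch of that idea, not the actual delta script. The file name `manifest-needed.txt`, the sample URLs, and the scratch directory are all made up for illustration; only `manifest-remote.txt` comes from the script above.

```shell
#!/bin/sh
set -eu

# Use the C locale so sort and comm agree on ordering.
LC_ALL=C
export LC_ALL

# Work in a scratch directory so nothing real gets touched.
workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT

# Hypothetical full list of URLs we want in the archive (sample data).
cat > "$workdir/manifest-needed.txt" <<'EOF'
https://images.neopets.com/items/a.gif
https://images.neopets.com/items/b%20c.gif
https://images.neopets.com/items/d.gif
EOF

# Stand-in for the manifest-remote.txt produced by the listing script.
cat > "$workdir/manifest-remote.txt" <<'EOF'
https://images.neopets.com/items/a.gif
https://images.neopets.com/items/b%20c.gif
EOF

# comm needs sorted input, so sort (and dedupe) both lists first.
sort -u "$workdir/manifest-needed.txt" > "$workdir/needed.sorted"
sort -u "$workdir/manifest-remote.txt" > "$workdir/remote.sorted"

# -23 drops lines unique to the remote list and lines common to both,
# leaving only the URLs we need that aren't in the bucket yet.
delta="$(comm -23 "$workdir/needed.sorted" "$workdir/remote.sorted")"
printf '%s\n' "$delta"
```

With the sample data, the only line printed is `https://images.neopets.com/items/d.gif`, the one URL missing from the remote list. Sorting both sides up front keeps the comparison cheap even for a very large manifest, since `comm` just walks the two files once.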