impress-2020/scripts/archive/create/download-urls.sh
Matchu b9b0db8b3a Some archive:create tweaks
I'm looking into what it would take to update the archive on a regular basis. The commands right now *are* pretty good at avoiding duplicate work… but the S3 upload still seems to take a very long time even just to validate what's already in the archive. We might have to build our own little cache rather than using `aws s3 sync`, if we want faster incremental updates?

Here, I make a few quality-of-life changes to add an `archive:create` command that runs everything in a straight line. That way, I can let it run and see how much wall time it takes, to decide whether speeding it up feels necessary. (vs whether it's a few-hours task I can just set a reminder to manually run every week or something)
2022-10-02 07:08:40 -07:00


#!/bin/bash
echo 'Starting! (Note: If many of the URLs are already downloaded, it will take some time for wget to quietly check them all and find the new ones.)'
xargs --arg-file="$(dirname "$0")/urls-cache.txt" -P 8 \
  wget --directory-prefix="${ARCHIVE_DIR=$(dirname "$0")}" \
    --force-directories --no-clobber --timeout=10 --retry-connrefused \
    --retry-on-host-error --no-cookies --compression=auto --https-only \
    --no-verbose
# It's expected that xargs will exit with code 123 if wget failed to load some
# of the URLs. So, if it exited with 123, exit this script with 0 (success).
# Otherwise, exit with the code that xargs exited with.
# (It would be nice if we could tell wget or xargs that a 404 isn't a failure?
# And have them succeed instead? But I couldn't find a way to do that!)
XARGS_EXIT_CODE=$?
if [ "$XARGS_EXIT_CODE" -eq 123 ]; then
  exit 0
else
  exit "$XARGS_EXIT_CODE"
fi
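One possible workaround for the 404 problem the comments mention, sketched here as an assumption rather than anything the original script does: wget exits with code 8 ("server issued an error response", which covers 404s), so a tiny hypothetical wrapper could remap that code to 0 before xargs sees it, leaving xargs's aggregate exit status to reflect only genuine failures (network errors, I/O errors, and so on).

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script: remap wget's
# exit code 8 ("server issued an error response", e.g. a 404) to 0,
# so that xargs does not count missing URLs as failures.
remap_wget_status() {
  local status=$1
  if [ "$status" -eq 8 ]; then
    echo 0
  else
    echo "$status"
  fi
}

# Sketch of how a wrapper script (invoked by xargs instead of wget
# directly) might use it:
#   wget "$@"
#   exit "$(remap_wget_status $?)"
```

With a wrapper like this in place of bare `wget` in the xargs invocation, the `-eq 123` special case above could likely be dropped, since xargs would then only return 123 for real failures.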