impress-2020

OpenNeo/impress-2020

Author	SHA1	Message	Date
Matchu	35713069fa	Delta version of archive scripts I like running the full `archive:create` to help us be _confident_ we've got the whole darn thing, but it takes multiple days to run on my machine and its slow HDD, which… I'm willing to do _sometimes_, but not frequently. But if we had a version of the script that ran faster, and only on URLs we still _need_, we could run that more regularly and keep our live archive relatively up-to-date. This would enable us to build reliable fallback infra for when images.neopets.com isn't responding (like today lol)! Anyway, I stopped early in this process because images.neopets.com is bad today, which means I can't really run updates today, lol :p but the delta-ing stuff seems to work, and takes closer to 30min to get the full state from the live archive, which is, y'know, still slow, but will make for a MUCH faster process than multiple days, lol	2022-10-13 15:08:29 -07:00
Matchu	b9b0db8b3a	Some archive:create tweaks I'm looking into what it would take to update the archive on a regular basis. The commands right now are pretty good at avoiding duplicate work… but the S3 upload still seems like it's taking very long even to just validate what's in the archive already. We might have to build our own little cache rather than using `aws s3 sync`, if we want faster incremental updates? Here, I make a few quality-of-life changes to add a `archive:create` command that runs everything in a straight line. That way, I can let it run and see how much wall-time it takes, to be able to decide whether speeding it up feels necessary. (vs whether it's a few-hours task I can just set a reminder to manually run every week or something)	2022-10-02 07:08:40 -07:00
Matchu	d9ca07c9b0	Use wget for archive:create:download-urls Hey this is an exciting development! A list of URLs, that we want to clone onto our hard drive, turns out to be something `wget` is already very good at! Originally I used `wget`'s `--input-file` option to process the `urls-cache.txt` file, but then I learned how to parallelize it from this StackOverflow answer: https://stackoverflow.com/a/11850469/107415. (Following the guidance in the comments, I removed `-n 1`, to avoid the overhead of extra processes and allow `wget` instances to keep using shared connections over time. Idk why it was in there, maybe the author didn't know `wget` accepts multiple args?) Anyway yeah, it's working great, except for the weird images.neopets.com downtime! 😅 Specifically I'm noticing that all the item thumbnail images came back really fast, but the customization images are taking for-EV-er. I wonder if that's just caching properties, or if there's a different backing server for it and it's responding much more slowly? Who's to say! In any case, I'm keeping the timeout in this script pretty low (10 seconds), and just letting failures fail. We can try re-running it again sometime when the downtime is resolved or the cache is warmed up.	2022-09-12 21:47:28 -07:00
Matchu	ea8715cd90	Sanitize URLs saved by archive:create:list-urls Especially in our item thumbnails, there's a lot of messiness about what the URL protocol is. There are also some SWF assets whose "URLs" are just saved as paths. In this change, we start processing all our outputted URLs through a `sanitizeUrl` function, which tries to massage it into an `https://images.neopets.com` URL, and warns if it cannot. This also warns on some intentionally-different URLs, like our April Fools prank item lol Anyway, I love functions like this, because the warnings always help me discover the data problems! I wasn't aware of the path-only SWF URLs, for example, until this script started warning about the URL parse errors!	2022-09-12 20:52:45 -07:00
Matchu	ef9958c11e	Add asset URLs to archive:create:list-urls script Here, we read URLs out from the swf_assets table, including SWFs, manfests, and everything referenced by the manifests. There are a few data-polishing tricks we needed to do to get this to work! Most notably, newer manfests reference themselves, but older ones don't; so we try to infer the manifest URL from the other URLs. (Our database caches the manifest content, but not the manifest URL it came from.)	2022-09-12 17:26:11 -07:00
Matchu	3ce895d52f	Start building archive:create:list-urls script Just working on making an images.neopets.com mirror, just in case! To start, I'm extracting all the URLs we need to back up; and then I'll make a separate script whose job is to mirror all of the URLs in the list.	2022-09-12 15:53:22 -07:00

6 commits