Commit graph

5 commits

Author SHA1 Message Date
95949da6f9 Create swf_assets:remove_duplicates task
I'm not sure where these duplicate records have been coming from over
the years (I checked the timestamps and it's been happening
occasionally since 2013 up to late last year, there were ~1,600
instances), but for now let's just get rid of them!

This is related to the issues we've been addressing lately where some
biology assets have manifests but no PNG specified in them: the older
copies of the assets would have our generated PNG as a fallback, but
the newer copies would get served as part of the pet appearance *in
addition to* the older copies, and the newer copies would be marked as
having no DTI-generated image, which our system wasn't always able to
handle.

We've primarily been addressing this by leaning into more graceful
failure modes of skipping certain layers, but… these layers *shouldn't
be here*, and are cluttering up support tools and such; let's be rid of
them!

I ran this today seemingly without issue, but I kept a backup of the
`yarn db:export:public-data` task in `impress-2020` to be able to check
and rollback if we discover a mistake.

One last note: the `ORDER BY` clause in the `GROUP_CONCAT` call was a
late addition, *after* I ran this in production. Scanning the console
output, it seems like ordering by ID was MySQL's default behavior here
anyway (makes sense!), so I'm not gonna bother to rollback and re-run,
but I think specifying this is helpful to ensure we're not depending on
unspecified behavior and to be really clear about our intentions of
which record to keep (the one with the smallest DTI ID number).
2024-02-09 09:53:41 -08:00
0845881aba Add TIMEOUT parameter to swf_assets:manifests task
At this point, I've gone through all the assets, and the only ones
without manifests are:

1. The ones that truly have no manifest yet (that we know of)
2. The ones where execution happened to time out

I think the 5-second timeout is a very reasonable default for starting
the backfill, in a way that prioritizes moving forward; but now that we
have most things, I'd rather be able to re-run it with a more generous
timeout. So here we are!
2023-11-11 11:04:53 -08:00
eb4a3ce0d9 More gracefully handle batches that fail to save
I noticed a thing with like, an asset that I think referenced an item that
doesn't exist, which caused an error in the `body_specific?` validation
step?

Tbh that validation step needs fixed up in a number of ways, but I'm
scared to, since it's hard to know what will break modeling lol.

But in any case, more graceful handling is nice! If something happens,
I'd rather leave it null and try again later than have the job crash!
2023-11-10 17:42:56 -08:00
80bd229bc6 Clarify an error message in swf_assets:manifests task
It's not just that none of them were 200 OK, it's that they were all 404.
In the event that something returns not-200 and not-404, we immediately
abort, so we shouldn't get to this case unless they were all 404!
2023-11-10 17:27:35 -08:00
dc22a458bf Move manifest backfill to swf_assets:manifests task
Okay, I've simplified the migration to *just* add the column, and
instead added a task to find assets without manifest URLs and backfill
them.

Performance is a lot better now, using the `async-http` library, which
as I understand it supports both persistent connections when invoked
like this, and maybe also HTTP/2 multiplexing?? (Though I'm not
actually sure images.neopets.com does lol)

I'm not sure about the number of concurrent tasks I picked here, 100
seems okay for an internet thing and for such small requests, but I
worry that the CDN is gonna get annoyed or something. Well, we'll see!
This task is very resumable if it turns out we get frozen out or
something.
2023-11-10 16:52:50 -08:00