Create swf_assets:remove_duplicates task

I'm not sure where these duplicate records have been coming from over
the years. I checked the timestamps, and it's been happening
occasionally from 2013 up to late last year, ~1,600 instances in all.
But for now, let's just get rid of them!

This is related to the issues we've been addressing lately where some
biology assets have manifests but no PNG specified in them: the older
copies of the assets would have our generated PNG as a fallback, but
the newer copies would get served as part of the pet appearance *in
addition to* the older copies, and the newer copies would be marked as
having no DTI-generated image, which our system wasn't always able to
handle.

We've primarily been addressing this by leaning into more graceful
failure modes, like skipping certain layers, but… these layers
*shouldn't be here*, and are cluttering up support tools and such;
let's be rid of them!

I ran this today, seemingly without issue, but I kept a backup from the
`yarn db:export:public-data` task in `impress-2020`, so we can check
and roll back if we discover a mistake.

One last note: the `ORDER BY` clause in the `GROUP_CONCAT` call was a
late addition, *after* I ran this in production. Scanning the console
output, it seems like ordering by ID was MySQL's default behavior here
anyway (makes sense!), so I'm not gonna bother to roll back and re-run,
but I think specifying this is helpful to ensure we're not depending on
unspecified behavior and to be really clear about our intentions of
which record to keep (the one with the smallest DTI ID number).
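
To illustrate the "keep the smallest ID" rule, here is a minimal sketch in plain Ruby, with no Rails or MySQL involved. The `Row` struct, the sample rows, and the `plan_duplicate_removal` helper are all hypothetical stand-ins made up for this example; the real task does the grouping in SQL. The explicit `sort` plays the same role as `ORDER BY id ASC` in the `GROUP_CONCAT` call: without it, which ID comes first would depend on the input order.

```ruby
# Hypothetical in-memory stand-ins for SwfAsset rows; the values are made up.
Row = Struct.new(:id, :type, :remote_id)

ROWS = [
  Row.new(30, "biology", 100),
  Row.new(7,  "biology", 100),
  Row.new(12, "biology", 100),
  Row.new(8,  "object",  100),
  Row.new(9,  "object",  200),
  Row.new(41, "object",  200),
]

# Returns { [type, remote_id] => { keep: id, delete: [ids] } } for every
# group with more than one row, keeping the smallest ID in each group.
def plan_duplicate_removal(rows)
  rows.group_by { |row| [row.type, row.remote_id] }
      .select { |_key, group| group.size > 1 }
      .transform_values do |group|
        ids = group.map(&:id).sort # same role as ORDER BY id ASC
        { keep: ids.first, delete: ids[1..] }
      end
end

plan = plan_duplicate_removal(ROWS)
plan.each do |(type, remote_id), ids|
  puts "#{type}/#{remote_id}: keep #{ids[:keep]}, delete #{ids[:delete].join(", ")}"
end
# Prints:
#   biology/100: keep 7, delete 12, 30
#   object/200: keep 9, delete 41
```

Note that the `biology/100` rows are deliberately listed out of order: taking the first element without sorting would have kept ID 30 instead of 7.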
Emi Matchu 2024-02-09 09:53:41 -08:00
parent 355297d977
commit 95949da6f9

@@ -2,6 +2,39 @@ require 'async/barrier'
require 'async/http/internet/instance'
namespace :swf_assets do
  # NOTE: I'm not sure how these duplicate records enter our database, probably
  # a bug in the modeling code somewhere? For now, let's just remove them, and
  # be ready to run it again if needed!
  # NOTE: Run with DRY_RUN=1 to see what it would do first!
  desc "Remove duplicate SwfAsset records"
  task remove_duplicates: [:environment] do
    # Find each (type, remote_id) pair that has more than one record, along
    # with a comma-separated list of its record IDs in ascending order.
    duplicate_groups = SwfAsset.group(:type, :remote_id).
      having("COUNT(*) > 1").
      pluck(:type, :remote_id, Arel.sql("GROUP_CONCAT(id ORDER BY id ASC)"))

    total = duplicate_groups.size
    puts "Found #{total} groups of duplicate records"

    SwfAsset.transaction do
      duplicate_groups.each_with_index do |(type, remote_id, ids_str), index|
        # Keep the first (smallest) ID in each group; the rest are duplicates.
        ids = ids_str.split(",")
        duplicate_ids = ids[1..]
        duplicate_records = SwfAsset.find(duplicate_ids)

        if ENV["DRY_RUN"]
          puts "[#{index + 1}/#{total}] #{type}/#{remote_id}: " +
            "Would delete #{duplicate_records.size} records " +
            "(#{duplicate_records.map(&:id).join(", ")})"
        else
          puts "[#{index + 1}/#{total}] #{type}/#{remote_id}: " +
            "Deleting #{duplicate_records.size} records " +
            "(#{duplicate_records.map(&:id).join(", ")})"
          duplicate_records.each(&:destroy)
        end
      end
    end
  end

  desc "Backfill manifest_url for SwfAsset models"
  task manifests: [:environment] do
    timeout = ENV.fetch("TIMEOUT", "5").to_i