Create `rails public_data:commit` task, to share public data dumps

I'm starting to port over the functionality that was previously just me
running `yarn db:export:public-data` in `impress-2020` and committing it
to Git LFS every time.

My immediate motivation is that the `impress-2020` git repository is
getting weirdly large?? Idk how these 40MB files have blown up to a
solid 16GB of Git LFS data (we don't have THAT many!!!), but I guess
there's something about Git LFS's architecture and disk usage that I'm
not understanding.

So, let's move to a simpler system in which we don't bind the public
data to the codebase, but instead just regularly dump it in production
and make it available for download.

This change adds the `rails public_data:commit` task, which when run in
production will make the latest dump available at
`https://impress.openneo.net/public-data/latest.sql.gz`, and will also
store a running log of previous dumps, viewable at
`https://impress.openneo.net/public-data/`.

Things left to do:

1. Create a `rails public_data:pull` task, to download `latest.sql.gz`
   and import it into the local development database.
2. Set up a cron job to dump this out regularly, idk maybe weekly? (A
   rough sketch follows this list.) That will grow, but not very fast
   (about 2GB per year), and we can add logic to rotate out old ones if
   it starts to grow too far. (If we wanted to get really intricate, we
   could do like, daily for the past week, then weekly for the past 3
   months, then monthly for the past year, idk. There must be tools
   that do this!)
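
For the record, here's a minimal sketch of what that weekly job could
look like using the `whenever` gem (hypothetical: `whenever` isn't part
of this codebase, and the day/time are arbitrary):

    # config/schedule.rb (hypothetical)
    every :sunday, at: "4:00 am" do
      # Runs the same task defined below, with "scheduled" in the filename.
      rake "public_data:commit[scheduled]"
    end

A plain crontab entry that runs the same task would work just as well.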

require "open-uri"
require "open3"
require "shell" # provides `Shell` (stdlib in older Rubies; the `shell` gem on Ruby 3+)
desc "Tools to save and import DTI's public modeling data"
|
|
|
|
namespace :public_data do
|
|
|
|
desc "Save the local database's public data to a local file"
|
|
|
|
task :commit, [:name] => :environment do |_, args|
|
|
|
|
if Rails.env.development?
|
|
|
|
puts "NOTE: The `public_data:commit` task is primarily meant to be " +
|
|
|
|
"run in production, to create public data files we can copy to our " +
|
|
|
|
"development machines via `public_data:pull`. I'll still run it " +
|
|
|
|
"locally and save to #{Rails.configuration.public_data_root}, though!"
|
|
|
|
end
|
|
|
|
|
|
|
|
# Generate a filename from the current time, and the option name argument
|
|
|
|
# provided to the command (e.g. `rails public_data:commit[scheduled]`).
|
|
|
|
timestamp = Time.now.utc.iso8601.gsub(':', '_')
|
|
|
|
name = args.fetch(:name, "manual")
|
|
|
|
filename = "#{timestamp}-#{name}.sql.gz"
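    # e.g. "2024-03-01T21_18_58Z-manual.sql.gz" (illustrative values)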
    dest_path = Rails.configuration.public_data_root / filename

    args = []

    # The connection details for our database!
    config = ApplicationRecord.connection_db_config.configuration_hash
    args << "--host=#{config[:host]}" if config[:host]
    args << "--user=#{config[:username]}" if config[:username]
    args << "--password=#{config[:password]}" if config[:password]

    # Don't lock the database to do it!
    args << "--single-transaction"

    # Skip dumping tablespaces, so this requires fewer privileges.
    args << "--no-tablespaces"

    # Dump the public data tables from the primary database.
    args << config.fetch(:database)
    args += %w(species colors zones) # manual constants
    args += %w(alt_styles items parents_swf_assets pet_states pet_types
      swf_assets) # from modeling

    # Set up a shell, and register the commands we need.
    Shell.def_system_command("mysqldump")
    Shell.def_system_command("gzip")
    sh = Shell.new

    # Ensure the output directory exists.
    dest_path.dirname.mkpath

    # Run mysqldump, pipe it into gzip, and output to the destination file.
    sh.transact do
      sh.mysqldump(*args) | sh.gzip("-c") > dest_path.to_s
    end
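
    # Roughly equivalent to running:
    #   mysqldump [connection opts] --single-transaction --no-tablespaces \
    #     <database> <tables...> | gzip -c > <dest_path>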

    puts "Saved dump to #{dest_path}"

    # Link this latest dump as `latest.sql.gz`.
    latest_path = Rails.configuration.public_data_root / "latest.sql.gz"
    File.unlink(latest_path) if File.exist?(latest_path)
    File.symlink(dest_path, latest_path)
    puts "Linked dump to #{latest_path}"
  end

  desc "Pull and import the latest public data from production (dev only)"
  task :pull => :environment do
    unless Rails.env.development?
      raise "Can only pull public data in development mode! This helps us " +
        "ensure we won't overwrite the production database accidentally."
    end

    args = []

    # The connection details for our database!
    config = ApplicationRecord.connection_db_config.configuration_hash
    args << "--host=#{config[:host]}" if config[:host]
    args << "--user=#{config[:username]}" if config[:username]
    args << "--password=#{config[:password]}" if config[:password]
    args << "--database=#{config.fetch(:database)}"

    # Set up a shell, and register the commands we need.
    Shell.def_system_command("mysql")
    Shell.def_system_command("gunzip")
    sh = Shell.new

    URI.open("https://impress.openneo.net/public-data/latest.sql.gz") do |file|
      # Pipe the latest public data SQL into `gunzip` to unpack it, then pipe
      # it into mysql to execute it.
      #
      # NOTE: We need `open(file)` to wrap it in a plain `File` object, so the
      # `Shell` will recognize it correctly! It doesn't accept `Tempfile`.
      sh.transact do
        (sh.gunzip("-c") < open(file)) | sh.mysql(*args)
      end
    end
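
    # Roughly equivalent to:
    #   curl https://impress.openneo.net/public-data/latest.sql.gz \
    #     | gunzip -c | mysql [connection opts] --database=<database>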
  end
end
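
# Usage (as referenced in the commit message above):
#   rails public_data:commit              # save a new public data dump
#   rails "public_data:commit[scheduled]" # same, with a custom name in the filename
#   rails public_data:pull                # development only: download and import
#                                         #   the latest production dump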