I don't think this is actually relevant in-app right now, but I figured sending it is More Correct, and is likely to prevent future bugs if anything (and prevent future questions about why we're _not_ sending it).
I also removed the `maxAge: 0` on `currentUser`, now that I've updated Fastly to no longer default to 5-minute caching when no cache time is specified. I can see why that's a reasonable default for Fastly, but we've been pretty careful about specifying Cache-Control headers when relevant, so the extra caching is mostly incorrect.
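(For context, here's a minimal sketch of what I mean by specifying Cache-Control ourselves, using a hypothetical Next.js API route; the route name and the max-age value are just illustrative.)

```js
// pages/api/example.js (hypothetical route, just to show the pattern)
export default async function handler(req, res) {
  // Set an explicit Cache-Control header whenever we *do* want caching,
  // instead of relying on Fastly's default TTL for header-less responses.
  res.setHeader("Cache-Control", "public, max-age=3600");
  res.status(200).json({ ok: true });
}
```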
We had previously configured the client not to bother trying a GET request for GraphQL queries, and to jump straight to POST instead, because the `vercel dev` server for create-react-app reloaded the backend code for every request anyway, which doubled the dev response time.
The Next.js server is more efficient than this, and keeps some state in memory between requests, so GET requests now work similarly in dev and prod! (i.e. the first request fails, but the second one succeeds)
In this change, we remove the code to skip `createPersistedQueryLink` in development, and instead always call it. We simplify the code accordingly, too.
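Here's roughly what the simplified client setup looks like now, as a sketch (the `/api/graphql` URI and variable names are illustrative, not necessarily our exact code):

```js
import { HttpLink } from "@apollo/client";
import { createPersistedQueryLink } from "@apollo/client/link/persisted-queries";
import { sha256 } from "crypto-hash";

// Sketch: always use persisted queries, in dev and prod alike.
// `useGETForHashedQueries` is what sends hashed queries as GET requests,
// which the CDN can then cache.
const httpLink = new HttpLink({ uri: "/api/graphql" });
const link = createPersistedQueryLink({
  sha256,
  useGETForHashedQueries: true,
}).concat(httpLink);
```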
If the user is searching for things they own or want, make sure we don't CDN cache it!
For many queries, this is taken care of in practice, because the search result includes `currentUserOwnsThis` and `currentUserWantsThis`. But I noticed in testing that, if the search result has no items, so those fields aren't actually part of the _response_, then the private header doesn't get set. So this mainly makes sure we don't accidentally cache an empty result from a user who didn't have anything they owned/wanted yet!
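As a sketch of the fix (the resolver and argument names here are hypothetical, not our exact schema), the idea is to set the hint on the query resolver itself, using Apollo Server's `info.cacheControl.setCacheHint`, so it applies even when zero items come back:

```js
const resolvers = {
  Query: {
    itemSearch: async (_, { query, currentUserOwnsOrWants }, context, info) => {
      // If the search is filtered to the current user's Own/Want lists, mark
      // the whole response as private, even if the result set is empty.
      // Otherwise the hint only gets set by the currentUserOwnsThis /
      // currentUserWantsThis field resolvers, which never run for an empty result.
      if (currentUserOwnsOrWants != null) {
        info.cacheControl.setCacheHint({ scope: "PRIVATE" });
      }
      return context.db.searchItems(query, currentUserOwnsOrWants);
    },
  },
};
```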
Some queries, like on `/your-outfits`, had the cache hint `max-age=0, private` set. In this case, our cache code sent no cache header, on the assumption that no header would result in no caching.
This was true on Vercel, but isn't true on our new Fastly setup! (Which makes sense, Vercel was a bit more aggressive here I think.)
This was causing an arbitrary user's data to be cached by Fastly as the result for `/your-outfits`. (We found this bug before launching the Fastly cache though, don't worry! No actual user data leaked!)
Now, as of this change, the `/your-outfits` query correctly sends a header of `Cache-Control: max-age=0, private`. This directs Fastly not to cache the result.
To fix this, we made a change to our HTTP header code, which is forked from Apollo's stuff.
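The gist, as a sketch (the function name is hypothetical, and the real forked code handles more cases than this):

```js
// Sketch: build the Cache-Control header from the computed cache policy.
// The key change: a fully-uncacheable policy now sends an explicit
// "max-age=0, private" header, instead of sending no header and hoping
// the CDN treats silence as "don't cache".
function cacheControlHeaderFromPolicy(policy) {
  if (policy == null) {
    return null;
  }
  const scope = policy.scope === "PRIVATE" ? "private" : "public";
  return `max-age=${policy.maxAge}, ${scope}`;
}

// e.g. { maxAge: 0, scope: "PRIVATE" } -> "max-age=0, private"
```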
Comments explain most of this! Vercel changed around the Cache-Control headers a bit, to always essentially apply `max-age=0` when `scope: PRIVATE` was true.
I'm noticing this isn't *fully* working yet though, because we're not getting a `Cache-Control: private` header, we're just getting no header at all. Fastly might aggressively choose to cache it anyway with etag stuff! I bet that's the fault of our caching middleware plugin thing, so I'll check on that!
Hmm, I see, Vercel chews on Cache-Control headers a bit more than I'm used to, so anything marked `scope: PRIVATE` would not be cached at all.
But on a more standard server, this came out as privately cacheable, but for an actual amount of time (1 hour in the homepage case), because of the `maxAge` on other fields. That meant the device's browser cache would hold onto the result, and not always reflect Own/Want changes upon page reload.
In this change, we set `maxAge: 0`, because we want this field to be very responsive. I also left `scope: PRIVATE`, even though I think it doesn't really matter if we're saying the field isn't cacheable anyway, because I want to set the precedent that `currentUser` fields need it, to avoid a potential gotcha if someone creates a cacheable `currentUser` field in the future. (That's important to be careful with though, because is it even okay for logouts to not clear it? TODO: Can we clear the private HTTP cache somehow? I guess we would need to include the current user ID in the URL?)
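Concretely, the hint in the schema looks something like this (a sketch; the `User` type name is illustrative, and this assumes Apollo Server's `@cacheControl` directive is enabled):

```js
import gql from "graphql-tag";

const typeDefs = gql`
  type Query {
    # Never cacheable, and never shared across users. We keep scope: PRIVATE
    # even with maxAge: 0, to set the precedent for currentUser fields.
    currentUser: User @cacheControl(maxAge: 0, scope: PRIVATE)
  }
`;
```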
Yeah ok, let's just run one browser instance and one pool.
I feel like I observed that, when I killed chromium in prod, pm2 noticed the abrupt loss of a child process and restarted the whole app process? which is rad? so maybe let's just try relying on that and see how it goes
We used Playwright in the first place to try to work around a Vercel deploy issue, and I'm not sure it really ended up mattering lol :p
But yeah, I'm putting the new Puppeteer code through the same prod stress test, and it just doesn't seem to be getting into the same broken state that Playwright was. I'm guessing it's just that Puppeteer has more investment in edge-case handling? (There's also the fact that we're no longer running things as root, which could have been a fucky problem, too?)
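The "one browser instance" part looks roughly like this (a sketch; `getBrowser` and `withPage` are hypothetical helper names, not our exact pool code):

```js
const puppeteer = require("puppeteer");

// Sketch: one lazily-launched shared browser. If Chromium dies, we drop the
// reference so the next caller relaunches it (or, worst case, pm2 restarts
// the whole app process, like observed above).
let browserPromise = null;

function getBrowser() {
  if (browserPromise == null) {
    browserPromise = puppeteer.launch({ headless: true }).then((browser) => {
      browser.on("disconnected", () => {
        browserPromise = null;
      });
      return browser;
    });
  }
  return browserPromise;
}

async function withPage(fn) {
  const browser = await getBrowser();
  const page = await browser.newPage();
  try {
    return await fn(page);
  } finally {
    await page.close();
  }
}
```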
Oh, I made a typo that caused pm2 to be running our processes as `root` instead of `matchu`! Let's very much fix that!! 😳
I noticed this because I'm trying Puppeteer, and it got upset about running in sandboxed mode as root, and I'm like "as root??"
So yeah, good, fixed, lol 😬
The motivation is that I want VERCEL_URL and local net requests outta here :p and we were doing some cutesiness with leveraging the CDN cache to back the GQL fields. No more of that, folks! lol
Previously we were using HTTP queries to keep individual function bundle sizes small, but that doesn't matter in a server where all the code is shared!
The immediate motivation is that I want /api/outfitImage requesting against the same server, not impress-2020.openneo.net. For other stuff I'm probably gonna fix this by replacing VERCEL_URL with something else, but here I figured this was a change worth making anyway.
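The shape of the change, as a sketch: instead of `fetch`-ing our own GraphQL endpoint via `VERCEL_URL`, run the query against our own ApolloServer instance in-process. (The `loadOutfitData` helper, the `outfit` query fields, and the `server` import path are all hypothetical here; `executeOperation` is Apollo Server's API for executing a query directly.)

```js
const gql = require("graphql-tag");
// Hypothetical import: wherever the app constructs its ApolloServer instance.
const { server } = require("./graphql-server");

async function loadOutfitData(outfitId) {
  // Run the query in-process: no HTTP round-trip, and no VERCEL_URL, needed.
  const result = await server.executeOperation({
    query: gql`
      query OutfitImageData($outfitId: ID!) {
        outfit(id: $outfitId) {
          id
          # ...whatever fields the image endpoint needs
        }
      }
    `,
    variables: { outfitId },
  });
  if (result.errors) {
    throw new Error(`GraphQL errors: ${JSON.stringify(result.errors)}`);
  }
  return result.data.outfit;
}
```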
Now that we're not on Vercel's AWS Lambda deployment, we can switch to something a bit more standard!
I also tweaked up our version of Playwright, because, hey, why not?
Getting the package list was a bit tricky, but we got there! Left a comment to explain where it's from.
I noticed this when Playwright was trying to draw cute ASCII art and it wasn't showing up right! Not a big deal, but it's a bit more correct to do this, so let's do it!
Oh right, like the SSH stuff, I did this the first time I set up, but didn't add it to the script! I like having things in the script :3 (I also had forgotten to check on the time zone last time, nice to have it with some rigor!)
I noticed that incoming port 3000 connections were being allowed, oops! Not a huge deal, but I don't want to allow connections without HTTPS, and I don't want surprise surface area even if I'm not currently aware of attacks on it. Close it out!
As an exercise, I've wiped the box clean, and I'm reinstalling from the scripts! :3
I added the SSH hardening rules to the playbook instead of doing them by hand this time.
I made a mistake with creating `/srv/impress-2020`: right, you need to *say* what it should be created *as* (a directory) for the creation step to work!
I also guess my recent pm2 changes made it no longer willing to start the app, because `/srv/impress-2020/current` doesn't exist or have `node_modules` yet. I'm doing a cute thing where I create a placeholder app during setup, so there's always something to run, without introducing the complexities of a real deploy to the setup process.
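The placeholder itself is just a trivial Node server, something like this (a sketch; assuming the real app also listens on port 3000 behind nginx):

```js
// Sketch: the placeholder app the setup playbook drops into
// /srv/impress-2020/current, so pm2 always has *something* to start
// before the first real deploy happens.
const http = require("http");

const server = http.createServer((req, res) => {
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end("Hi, I'm a placeholder! Deploy the real app to replace me.\n");
});

server.listen(3000, () => {
  console.log("Placeholder app listening on port 3000");
});
```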
And right, of course, we need to install nginx before running certbot! But we need to add certbot config *after* running certbot!
And then just some misc cleanups for consistency and correctness!
Okay so you know how in 3f07933f7a I switched the newline stuff here?
Yeah, right: newlines are significant in bash :p I'd forgotten because I've never used them inside a `bash -c` invocation before, but like, of course
Now, I'm still using `|` for clarity and reduced dependence on magic, but getting my lines right :p
I somehow had it in my head that `realpath` would crash if the file didn't exist, but that's super not true! It just returns the tentative path, as if you _were_ gonna create it!
I'm not doing this thoroughly enough for it to matter (e.g. the deployed rsynced versions aren't having the group permissions set).
I think doing this *right* (to be extensible to additional users) is too much complexity to be worth it, and doing it halfway is more confusing than helpful.
I did this because I was anticipating multi-user permissions to be a bit of an issue for like, granting the web server permission to access the source code. But it turns out, since we're running with pm2, it's all working just fine!
Okay well, we added monit to solve a problem that I coincidentally solved within an hour of getting monit working lol!
This also enables us to remove the pm2 pid file, which we were only using to allow monit to track the pm2 app.
Okay huh, while digging a bit into another issue, I found what was wrong with our config and pm2's built-in monitoring! You can't use `yarn start`, because the wrapper script breaks its ability to look inside and see what's happening.
I also removed the compiler flag thing from the `start` script in `package.json`, because I think it's redundant? There's no compilation to be done in a live server.
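So the pm2 config now points at the server entrypoint directly, something like this (a sketch; the exact `script` path is illustrative):

```js
// ecosystem.config.js (sketch)
module.exports = {
  apps: [
    {
      name: "impress-2020",
      cwd: "/srv/impress-2020/current",
      // Run the server directly, rather than via `yarn start`: the yarn
      // wrapper process hides the real app from pm2's built-in monitoring.
      script: "node_modules/.bin/next",
      args: "start",
      instances: 1,
      autorestart: true,
    },
  ],
};
```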
I think I might remove monit after this? It's nice extra resilience in a sense, but it feels like extra complexity when it's doing the job `pm2` is supposed to do. (And tbh I've almost never heard of nginx crashing, and if it does it's probably a scenario worth investigating by hand.)
Oops, I didn't understand the different multiline string formats in YAML! I was using one that chomped through the newlines and converted them to normal spaces (the folded `>` style, I believe).
I think that didn't matter in this context anyway? because more-indented lines are an exception to the folding behavior. What a weird behavior!
Anyway, uhh, yeah, I'll use the simpler multiline string format now 😅 for consistency and clarity!
When I woke up this morning, the app had crashed because the MySQL connection was closed!
I'm not sure why that caused a _crash_? Or why pm2 didn't pick up on it, and said the process was still online? (Maybe the process was running, but the server had stopped?) Those could be good to investigate?…
…but better than diving too far into the details, is to just address the high-level problem: if the app goes down for unexpected reasons, I want it back up!! lol
In this change, we add `monit`, a solid system for monitoring processes (including checking for behavior, like responding to net requests), and configure it to watch the app process and the nginx process.
To test, you can run `pm2 stop impress-2020`, or `systemctl stop nginx`, to see that Monit brings them back up within seconds!
This does add some potential surprise if you're _trying_ to take the processes down. The easiest way is to send the stop command through monit, like `monit stop nginx`. This will disable monitoring until you start it again through monit, I think? (You can also disable/enable monitoring as a direct command, regardless of app state.)
You can see how, instead of the default experience where certbot edits your config for you, I've referenced the certificates in the config in the first place, and set up certbot to just generate them!
Also, I learned about certbot non-interactive mode! At first I wrote this with the Ansible `expect` module lol :p