It was all the seemingly innocuous build_#{name}
…
We have an employer on-boarding flow of 3 fixed steps that can't be bypassed:
Employer
.Employer
record with phone AND creates EmployerProfile
with company name and logo.Vacancy
.Yet every now and then we found employers who had an incomplete step 2. There were signs they had completed it, but the corresponding record was missing from the database. How were they skipping it without being redirected back? Or why wasn't the record persisting?
From logs I could see they'd gone through the flow normally with all the corresponding GET and PATCH and redirect headers. At the point of updating an Employer (when an EmployerProfile is created) we also save an analytics record—this was there as normal with the phone and company name in the metadata.
The Employer was updated with the phone number, but no EmployerProfile was created. It looked like it was but then it wasn't in the DB. How?
I spent a long time pouring over logs and trying to reproduce, but always got redirected back as expected if I didn't create a valid profile. Postgres uses transactions and is ACID compliant, so if something went wrong everything with validations or a crash everything should be rolled back. Moreover, why were they being redirected to new vacancy if they didn't have a profile? The persisted?
check in our code seemed pretty concrete to me:
def update
@employer.update(update_params)
if @employer.profile&.persisted?
redirect_to employer_vacancies_path(@employer)
else
render :edit
end
end
Eventually I managed to reproduce, after a lot of failed attempts at trying to break the image upload with invalid files and other wild goose chases. It was not what I expected.
I'd been focussing on the create
action as that's the only one that touches the database right? But it was actually when clicking back in the browser (I didn't expect users to do this) and visiting the edit
action again (and refreshing the page) that the record was… deleted! A big fat SQL: DELETE FROM "employer_profiles"
in the logs. What‽ Lets look at that edit action:
def edit
@employer = current_employer.tap(&:build_profile)
end
The only thing here is build_profile
, which we added so the profile wouldn't be nil to be able to generate a blank form in the view. All it does is instantiate an EmployerProfile
AR object without touching the database, right? Searching Rails docs for build
you see it's an alias for new
—as expected instantiating an object but not running any SQL. However the build_xyz_relation
methods are dynamically generated so aren't documented.
Searching the Rails source I found where build_xyz
is generated, that calls build, and what ultimately destroys the record for some reason. So this seems deliberate but it's so unexpected and poorly documented. I opened an issue anyway to perhaps get some rationale and aid googleability.
def edit
@employer = current_employer.tap { |e| e.build_profile unless e.profile }
end
So simple, a 1 line change, it wasn't anything complicated just a naming quirk in the Rails API.
I've never seen records just disappear from the DB before. If I'd paid more attention to gaps in sequential IDs, or trusted the persisted?
call I might have found the critical part sooner. Then again, persisted?
looks just as concrete and harmless as build_profile
, so it really was tricky to let go of assumptions about the Rails API and hone in on the line with the bug. Usually I'm a big fan of the Rails API and magic, it's mostly very well considered but this is the first big time it's bitten me.