AMY bugfix release v1.5.2 and bug postmortem
24 hours ago, Maneesha Sane noticed and informed on GitHub that one of the tags in AMY is missing.
This observation led to an investigation on the servers and eventually to fix for a critical bug that caused the data loss.
But before we jump into sysadmin work…
What is a “tag” in AMY terms?
A tag is label that we give to various workshops. For example, all Software
Carpentry workshops will have
SWC tag, and all Data Carpentry workshops will
There are more labels we use, and the one that went missing was
One event can have between zero and all the tags we have in the system, which means it’s a many-to-many relationship between events and tags. This type of relationship requires additional intermediate table in the database.
Contents from that table were missing because they were removed with removal
I started the investigation by narrowing timespan where the event, that led to data loss, occurred.
Then I followed by reading WWW server access logs to find out what happened in that timespan in hope I could find the bug.
After narrowing list of suspects, I was able to reproduce the bug.
Finally I retrieved the lost data from the most recent backup that still had it.
Ways to remove tags from AMY
There’s no interface for removing tags other than Django’s auto-generated admin interface; only a couple of people have access to it.
So the data loss was either human error or it was caused by code bug. This conclusion helped me define what I should be looking for in the WWW server log.
Narrowing event occurence timespan
AMY’s being backed-up by multiple systems; I logged into each of them and run
multiple SQL queries on different databases to find out which backup had the
DC tag and was the newest.
It turned out that backup from 2016-04-06 17:00 UTC-4 was the most recent still
In the meantime I was fighting timezone correction… Our backup systems are in different datacenters and were running on different timezones.
Reading access log
First thing I checked in the access log is if anyone was using the admin panel to remove the tag. Unfortunately this possibility was quickly ruled out; so the loss was caused by code bug.
However, after reading the log no action stood out.
Short, important side story: Software Carpentry website rebuilds every 30 minutes. Each rebuild is shown in the log by multiple requests to AMY’s API:
Website grabs published events tagged by
TTT tags. This
sequence of requests repeats every 30 minutes.
After reading the log over and over I noticed that two consecutive calls to
/api/v1/events/published.yaml?tag=DC yielded results of very different
Apparently then the DC tag disappeared, the API started returning all the
published events, no matter if they were tagged
SWC or something else.
This was a clear indication that the
DC tag disappeared between 15:01 and
That timespan doesn’t look like 17:00. Timezones… programmer’s nightmare.
There was some user activity in this 30-minutes long window and one thing caught my eye:
(The actual URL was slightly changed to remove unnecessary information.)
This was a call to event merge functionality: someone wanted to merge
I started testing different strategies. I had a feeling that the bug had something to do with strategy for event tags. :)
Finally I reproduced the bug by using following strategy:
- base event:
- tags: from event B.
At that point I decided to retrieve the lost data using SQL import/export functionality from the optimal (newest & containing the lost data) backup found earlier.
The only code used in event merge functionality that would trigger accidental removal is:
This code is used for substituting related objects (tags in our case). It works like this:
if some field’s strategy is to switch to objects from the other event, then remove all currently assigned objects and add objects from the other event’s field.
Translated into tags:
if user wants to use event
2016-05-06-RDAP16-Atlantaas base event, but keep tags from the other event (
2016-05-06-asist) then remove current tags from base event and add tags from the other event.
See what’s going on here? Base event’s tags were removed instead of being unassigned.
Django: related manager and assignments
In this section I’m going to talk about how relations work and if they can be unassigned instead of being removed.
For many-to-many relationships (e.g. multiple events can be assigned multiple tags) Django creates an intermediate table that stores assignments. In this case, unassigning event from tag is as simple as removing that stored assignment from the intermediate table.
For one-to-many relationships (e.g. multiple events can have the same
organizer) there’s no need for additional table; storing the organizer looks
event.organizer = SomeOrganizer.
In case of the one-to-many relationships we can unassign the event from
SomeOrganizer if and only if
event.organizer field can store
If it cannot, then we have to remove the event.
So the bug existed because the case of unassignment was not taken into account – only removal of related objects was accounted for.
Fix: need to find out when we can unassign
Long story short: in Django only related manager with
.clear method can
unassign; if this method is not present then the only option is removal.
So fixed code looks like this (minus the comments):
(Yes, it probably should use
try - except block instead of
All in all, I feel good about this bug; if anything, I’d like eliminate the errorneous timezone arithmetics.
Also all backup mechanics and logging worked really nice.
As a result of investigation described above, the bug and the solution to it last night I released AMY v1.5.2.