You might think that data loss could never happen to you, and you'll keep thinking that until it does. It's inevitable: whether through technological failure or human error, data loss can happen to any of us, even when we think we're well protected, as GitLab recently learnt.
The firm recently lost around 300GB of user data from its primary database after one of their engineers made a mistake while trying to remove a PostgreSQL database directory. They intended to wipe the directory on the secondary server, but ran the command against the primary instead. The engineer realised the mistake within seconds, but by then the command had already wiped masses of user data.
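A common safeguard against exactly this class of mistake is to make destructive scripts verify which host they are running on before doing anything. A minimal Python sketch of the idea (the function name and hostnames are illustrative, not anything GitLab use):

```python
import socket


def assert_safe_to_wipe(expected_host: str) -> None:
    """Refuse to proceed unless we are on the expected host.

    A destructive script that calls this first fails loudly when pasted
    into a terminal on the wrong server, instead of deleting data on the
    primary.
    """
    actual = socket.gethostname()
    if actual != expected_host:
        raise RuntimeError(
            f"Refusing to wipe: running on {actual!r}, expected {expected_host!r}"
        )


# Example: guard a wipe script meant only for the secondary.
# assert_safe_to_wipe("db-secondary")  # raises unless run on db-secondary
```

It is a crude check, but it turns "the wrong terminal window" from a catastrophic error into a one-line exception.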
In a blog post detailing the incident, the firm estimate that roughly 5,000 projects, 5,000 comments and 700 user accounts were affected by the data loss. GitLab Enterprise customers, GitHost customers, and self-hosted users were not impacted by the outage or the data loss.
GitLab were wiping the directory because they were suffering an increase in database load, which they initially put down to a spam attack. In fact, the load was also being caused by a background task removing a GitLab employee's account and its associated data. Together, these caused the secondary database's replication process to lag far enough behind that it had to be resynchronised.
The first repair attempts failed, which led the engineer to take the action they did. The company have apologised for the incident, saying that the data loss they experienced was unacceptable.
The main thing they learned from the incident was that “there’s no such thing as a valid backup – there’s only valid restores”. And that’s perfectly true; you can have all the backups in the world, but they’re useless if you can’t restore from them.
GitLab experienced multiple resource failures during the database performance drop, which they weren't aware of until they mounted up into a larger failure. Some backups hadn't been generated properly, and others couldn't be restored, which is what led the operator to perform a manual restore.
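The "backups that weren't generated properly" failure mode is often silent: the backup job runs, reports success, and writes an empty or stale file. One cheap guard is to sanity-check each backup artifact after it is written. A minimal sketch, assuming backups land as plain files on disk (the path and freshness threshold are illustrative):

```python
import os
import time


def backup_looks_valid(path: str, max_age_hours: float = 26.0) -> bool:
    """Sanity-check a backup file: it must exist, be non-empty, and recent.

    This catches the "job ran but wrote an empty file" case. It is no
    substitute for periodically restoring the backup and querying the
    data - the only real proof of a valid restore.
    """
    if not os.path.isfile(path):
        return False
    st = os.stat(path)
    if st.st_size == 0:
        return False
    age_hours = (time.time() - st.st_mtime) / 3600.0
    return age_hours <= max_age_hours
```

Wiring a check like this into monitoring turns a quiet backup failure into an alert the same day, rather than a discovery made mid-incident.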
While GitLab originally started as a side business, the firm say they didn't account for the rapid growth that followed, and so outgrew the IT systems and procedures they had in place. They had become a target for spammers and didn't have the proper mechanisms in place to cope with it.
Though systems play a large part in this, so do the people who operate them. GitLab learnt that they need to prepare their employees better and support them with the correct procedures, and say they are now investing in training and hiring to help their workers stay ahead of the systems improvements that are needed.
One of the best things GitLab did during this incident was communicate: they were open with their users about what had happened and how they were going to improve. Where they fell short was in identifying which users had actually been impacted.
GitLab have learned from their experience, but it should be a clear message to consumers too: don't put all your data into online services and expect it to be safe. Companies are just as susceptible to data loss as anyone else.
GitLab Learn After Their Backup Fail