
Incident Report: November Mail Issues on the Auckland Mail Cluster

If you use our Auckland-based email system, you may have noticed a few extended periods of poor performance over the past few weeks, including an inability to connect to the mail server to download email at peak times on certain days while emergency maintenance was being carried out. This affected approximately half of the users on that network.

First and foremost, we would like to apologise for this. We use this platform ourselves, so we know how disruptive it is to business and communications. Secondly, we are happy to say that as of last week the issues behind this extended poor performance have been fully resolved, and you can now expect mail to be back to normal. This issue was limited to our mail platform and had no impact on web sites, servers or other services.

So what happened?

On the 12th of November, one of our three mail storage zones experienced a hardware failure. This is not uncommon in itself; however, upon replacing the hardware, our providers found that the device's file system was corrupted, turning a simple issue into a significant event. They attempted to recover the data, but decided it would be less disruptive to fail over to the redundant mail storage. They proceeded to fail over, and your mail service was restored with some disruption during the day. The initial issue was resolved relatively quickly on the 12th.

To restore full redundancy, they then needed to replicate your data back onto the repaired storage, a process that took until Tuesday last week because of the immense amount of data that clients store in their email accounts these days. Unfortunately, this is what caused the slow performance over the ensuing weeks. Our priority has always been to ensure data integrity; in this case, that meant focusing on data replication at the cost of performance. Had our providers done it the other way around, there would have been a risk of data loss. One of the engineers has written a technical overview with more specific details, which you can view on our Lounge Network Blog.

Looking forward, our providers are going to ensure redundancy can be restored much more quickly and painlessly. The past two weeks have not been acceptable by our very high standards, and this is the first time in our 12 years of operation that a problem has taken this long to be fully resolved.

Again, we sincerely apologise for any disruption caused.