Downtime

Participant View Outage

Jan 16 at 07:30pm EST
Affected services
Participant View
Crowdpurr Ingress

Resolved
Jan 16 at 08:30pm EST

Last night around 7:30PM EST we experienced a forty-five-minute outage of the Crowdpurr Ingress service. Our team became aware of the issue around 7:45PM EST, and it was resolved around 8:15PM EST.

The root cause of the outage was an Amazon Web Services (AWS) hardware failure in the Participant View's load balancer server (Crowdpurr Ingress). This is the server that distributes the thousands of incoming participants evenly across our underlying set of application servers (which were running fine), which is why the Participant View failed to load entirely. Historically, when a server crashes or hangs it's usually due to a bug in Crowdpurr's own software. This failure, however, was in the underlying cloud hardware provided by AWS.
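
To illustrate the role the ingress layer plays, here is a minimal round-robin routing sketch in Python. The server names and the route_participant helper are hypothetical placeholders, not Crowdpurr's actual infrastructure; the production load balancer does this at the network level, and the sketch only shows the even-distribution idea.

```python
import itertools

# Hypothetical pool of application servers sitting behind the ingress layer.
# The real Crowdpurr hostnames are not public; these are placeholders.
APP_SERVERS = [
    "app-1.internal:8080",
    "app-2.internal:8080",
    "app-3.internal:8080",
]

# Round-robin iterator: each incoming participant connection is handed
# to the next server in the pool, spreading load evenly.
_pool = itertools.cycle(APP_SERVERS)

def route_participant(connection_id: str) -> str:
    """Return the app server that should handle this participant."""
    target = next(_pool)
    print(f"routing {connection_id} -> {target}")
    return target

if __name__ == "__main__":
    for i in range(6):
        route_participant(f"participant-{i}")
```

If the process running this routing step dies, as happened with the hardware failure, no participant can reach the application servers even though those servers are healthy.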

While these AWS hardware incidents are very rare, they do happen; cloud servers are meant to be monitored and replaced quickly. Once we knew the server was down, our team assembled, replaced it quickly, and restored service.

To know quickly when any part of the Crowdpurr server ecosystem goes down, we use a third-party monitoring service that checks every server in the system every sixty seconds from eighteen different locations around the world. If any server goes down for even a minute, our team is notified via email and SMS text to emergency inboxes reserved for critical service events.
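
As a rough illustration of what such a monitor does, here is a minimal Python sketch of a sixty-second health-check loop. The URL and the alert function are placeholders; the actual vendor checks from eighteen locations and notifies multiple channels.

```python
import time
import urllib.request

# Placeholder values: the real health-check URLs and alert inboxes used by
# the monitoring vendor are not public.
CHECK_URL = "https://status-check.example.com/participant-view/health"
CHECK_INTERVAL_SECONDS = 60  # the post describes sixty-second checks

def endpoint_is_up(url: str, timeout: int = 10) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def alert(message: str) -> None:
    """Stand-in for the email/SMS notification step described above."""
    print(f"ALERT: {message}")

if __name__ == "__main__":
    while True:
        if not endpoint_is_up(CHECK_URL):
            alert(f"Health check failed for {CHECK_URL}")
        time.sleep(CHECK_INTERVAL_SECONDS)
```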

A secondary issue also occurred in alerting our team to the server problem. Our monitoring service uses an SMS texting provider that was experiencing issues with west Los Angeles phone numbers on the T-Mobile network, and our alert number happens to be a west Los Angeles number on T-Mobile. Consequently, our Engineering Team didn't receive the immediate SMS text messages about the app's health, which are critical for responding after hours. We did receive alerts via email, and our Customer Support Team was online and quickly alerted our Engineering Team manually.

Once our Engineering Team knew about the outage, we successfully replaced the Crowdpurr Ingress server. We have also dropped that monitoring vendor and moved to a new one that is more capable and offers several additional features. You can even subscribe to our status page now to be updated about incidents in real time.

Additionally, we are adding several extra layers to our notification and response procedures so there is no single point of failure when Crowdpurr experiences an issue. This includes more email inboxes and phone numbers, plus a second, redundant monitoring system in case one fails (see the sketch below). We have also added a redundant load balancing server that we can switch to quickly if something like this happens again.
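
As an illustration of removing that single point of failure, here is a small Python sketch that fans an alert out to several independent channels, so one failing vendor cannot silence the whole notification. The channel functions are hypothetical placeholders for whatever email, SMS, or paging integrations are actually in use.

```python
from typing import Callable, List

# Placeholder channel implementations: each attempts delivery and may raise.
def notify_email(message: str) -> None:
    print(f"[email] {message}")

def notify_sms(message: str) -> None:
    print(f"[sms] {message}")

def notify_pager(message: str) -> None:
    print(f"[pager] {message}")

CHANNELS: List[Callable[[str], None]] = [notify_email, notify_sms, notify_pager]

def alert_all(message: str) -> int:
    """Fan the alert out to every channel; a single failing vendor no longer
    prevents the team from hearing about an incident."""
    delivered = 0
    for channel in CHANNELS:
        try:
            channel(message)
            delivered += 1
        except Exception:
            continue  # keep trying the remaining channels
    return delivered

if __name__ == "__main__":
    alert_all("Crowdpurr Ingress health check failed")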

We understand Crowdpurr is software for live events and it's of utmost importance that it work for every rehearsal and every live production event. We take application uptime very seriously. If you're a seasoned Crowdpurr user, you know that the application rarely has an outage. However, due to the nature of Internet-based software, outages do sometimes occur. This has been the most severe outage for Crowdpurr in about five years; that's how rare downtime is for us. Even with a forty-five-minute outage, Crowdpurr remains well above 99.9% uptime over the last twelve months. But with live events, we're aware that even 99.9% uptime is not good enough. We will always strive for 100% uptime.

Please accept our deepest apologies, and know that we are learning and working hard to prevent something like this from happening again. We deeply respect and appreciate your patronage. You pay for our service, and you expect it to work. Please let us know how we can make this right.

Created
Jan 16 at 07:30pm EST

At 7:32PM EST the Participant View experienced an outage. Our team was able to restore the service around forty minutes later at 8:15PM EST.

We have replaced the problematic server and are investigating the cause.