(EN) Incident Report on November 9th downtime

    Pyvos

    From approximately 0800 to 1700 EET on November 9th, 2017, the Dragonborn Z website and the VPS that hosts it were unresponsive. The downtime was caused by an electrical outage at our provider’s Strasbourg datacenter, which the provider moved quickly to resolve.

    Chronological order of events

    • At 1200 EET, the provider began repairs on its primary 20 kV line and started its generators. Routing rooms began to come back online. We monitored the situation via the provider’s social media accounts, as its status page was not resolving.
    • At 1344 EET, primary and secondary electrical lines were both restored and our provider began spinning up services on SBG2 (Strasbourg 2 datacenter).
    • At 1544 EET, they began bringing services online in SBG1 (which hosts our VPS) and SBG4.
    • At 1640 EET, the provider’s monitoring dashboard began reporting that the VPS monitoring services were down, indicating that our server was starting to be brought back online.
    • At 1647 EET, we were able to SSH into the server and began validating its state, reading logs, and restarting services. Most of this work was done through supervisord, which we use for logging and process management.
    • At 1658 EET, nginx began responding and the database service was running as well.
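
For anyone curious what the validation step at 1647 EET can look like, here is a minimal sketch, not the exact procedure we followed. It assumes the text of `supervisorctl status` has already been captured and lists which programs are not in the RUNNING state. The program names (`web`, `db`, `worker`) and the sample output are hypothetical.

```python
# Hypothetical sample of `supervisorctl status` output; names are made up.
SAMPLE_STATUS = """\
web                              RUNNING   pid 1201, uptime 0:03:12
db                               FATAL     Exited too quickly (process log may have details)
worker                           RUNNING   pid 1207, uptime 0:03:10
"""


def parse_supervisor_status(output):
    """Map each program name to its state (the second whitespace-separated column)."""
    states = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # skip blank or malformed lines
            states[parts[0]] = parts[1]
    return states


def not_running(output):
    """Return the sorted names of programs whose state is not RUNNING."""
    states = parse_supervisor_status(output)
    return sorted(name for name, state in states.items() if state != "RUNNING")


print(not_running(SAMPLE_STATUS))  # prints ['db']
```

Anything the function returns is a candidate for a log check and a restart before the site is considered healthy again.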

    Future resolution

    We’ll be working on an independently hosted status page that reports the status of our services (database, website and MushRaider). We’ll also be investigating options for database redundancy and load balancing to reduce our exposure to a single point of failure.
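
As a rough illustration of what such a status page could run behind the scenes, here is a small sketch. Nothing here is implemented yet: the function names, the status labels, and any hosts or ports passed in are assumptions. It would need to run from a machine outside the Strasbourg datacenter, since a checker hosted on the same VPS would go down with it.

```python
import socket


def check_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or DNS failure
        return False


def overall_status(results):
    """Collapse per-service booleans into a single label for the status page."""
    if all(results.values()):
        return "operational"
    if any(results.values()):
        return "degraded"
    return "outage"
```

A checker could build a dictionary such as `{"website": check_tcp(host, 443), "database": check_tcp(host, db_port)}` on a schedule and publish the result of `overall_status`. A TCP connect only shows that a port accepts connections, so a real page would likely add an HTTP request or a test query on top.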
