What does complete failure smell like at Shadowserver?

August 15, 2015

Introduction

In any corporation there is a fine line between success and failure.  Part of that is how each one is dealt with.  We at Shadowserver are as proud of our successes as we are of our failures.  Usually a failure means that something has grown beyond its originally built capability, and some sort of expansion or a new implementation of some untried technology is at hand.  Sometimes these failures are embarrassing because the cause was something simple, and sometimes it is so complex that it takes us a while to correct what broke.

We try to be upfront when something breaks and explain what occurred.  We failed completely at that this time as well.

Architecture

Actually delivering what Shadowserver does is fairly complex.  After more than a decade of growing our systems and processes, the back end has gotten very large.  The systems are tightly coupled and well integrated with each other.  Of course, this can cause cascading failures if something truly mucks up the back end.  But there are many different relief valves to ensure that the systems cannot be overloaded.  The best-laid plans of mice and men…

The Problem

Take a slight clock drift, add in a slow memory leak, add a dash of new code, and mix it all together.  Bake well with some gunpowder and a new data set.

Timeline

[Day Minus more than One]

  • freed0 – Hey Chief Architect, do you mind if I add in some new DNS REGEXes to the system?  There are just a few.  Wafer-thin ones.
  • Chief Architect – What are these?
  • freed0 – DGA REGEXes.
  • Chief Architect – Do we have to have them?
  • freed0 – Of course!
  • Chief Architect – But there are millions of them…  It might impact performance.
  • freed0 – Nah!

[Day Minus One]

  • Chief Architect – Well I’m off on holiday for the next ten days.  Everything is running fine and well.  You guys should be okay without me.
  • freed0 – Are you sure?  The last time you went on a holiday the Event Cluster shut down until you returned.
  • Chief Architect – That cannot happen.  Everything was fixed.
  • freed0 – And the time before that….
  • Chief Architect – IT WAS ALL FIXED!
  • freed0 – Umm, you’ll have internet access there?
  • Chief Architect – NOPE!  Otherwise it would not be a holiday…
  • freed0 – Err…  you’ll have a mobile phone?
  • Chief Architect – NOPE!  It seems that in the wild and woolly mountains of Hawaii there is a specific volcanic aura that will prevent any mobile phone from working.
  • freed0 – Wow… really?
  • Chief Architect – Yup – see you in two weeks.

[Day One]

  • Kjellchr – Hey, it seems the reports have not come out.
  • freed0 – Odd, let me take a look at that.  Hey System Administrator, where are our reports?
  • System Administrator – hmm, it seems that a bunch of the minions crashed.
  • freed0 – how many?
  • System Administrator – 300 of them
  • freed0 – Well can you bring them back up?
  • System Administrator – sure – not a problem, hold on.  I went ahead and started the report process again as well
  • freed0 – Awesome, great work!
  • Kjellchr – umm, only some of the reports came out…
  • dav3 – Hey – my scan data was not imported!
  • freed0 – … umm, System Administrator?

[Day Two]

  • Kjellchr – Hey, it seems the reports have not come out.
  • freed0 – Sigh.
  • System Administrator – It seems that today 500 minions crashed.  I keep bringing them up and some keep crashing.  It looks like some new code got pushed about a week ago but nothing was rebooted since then.
  • freed0 – …
  • System Administrator – I will keep hacking at this.  I have brought “The Systems Programmer” to assist.
  • freed0 – Great.  Between the two of you it should be something quick then.
  • Kjellchr – Still no reports.
  • freed0 – hmm, only about 100 emails asking about the reports today.

[Day Three]

  • Kjellchr – Hey, it seems the reports have not come out.
  • freed0 – [starts sobbing]
  • System Administrator – Today 700 minions have crashed!
  • freed0 – How many do we have?
  • System Administrator – 701
  • freed0 – …
  • System Administrator – The SP and I have been looking at this and are not sure what keeps causing them to crash.  It seems that almost as soon as we bring up the minions, another set crashes.
  • freed0 – What is not running because of this?
  • System Administrator – Well, without the minions there are no imports.  If there are no imports then nothing goes to the Event Cluster.  Which means that there is nothing to report!
  • freed0 – [sobs some more]
  • Kjellchr – Still no reports.
  • freed0 – I think I received about 400 messages today asking about the reports…

[Day Four]

  • Kjellchr – Hey, it seems the reports have not come out.
  • System Administrator – Wow, all the systems are running so fast now.
  • freed0 – The minions are up?
  • System Administrator – Nope!
  • freed0 – so nothing is really processing
  • System Administrator – Correct!
  • Kjellchr – Still no reports.
  • freed0 – Wow, only another 200 messages today.  It seems people think there is an issue here.

[Day Five]

  • Kjellchr – Hey, it seems the reports have not come out.
  • System Administrator – [hides]
  • Yanceyslide – Maybe we need to sacrifice to the computing gods?
  • freed0 – Feh – silly person, that never helps.
  • Kjellchr – No reports.

[Day Six]

  • Kjellchr – Hey, it seems the reports have not come out.
  • freed0 – Yancey – do you think chickens or goats would be better?
  • Yanceyslide – Chickens!  Works for Voodoo doesn’t it?
  • [Insert scene of feathers and blood]
  • Kjellchr – Hey guys, when will the reports start?

[Day Seven]

  • Kjellchr – Hey, it seems the reports have not come out.
  • freed0 – [starts ignoring email]
  • [Insert amusing picture of “The Oatmeal” showing hordes of zombies attacking some brave defenders]

[Day Eight]

  • Kjellchr – Hey, it seems the reports have not come out.
  • freed0 – [poops in the CA’s chair]

[Day Nine]

  • Kjellchr – Hey, it seems the reports have not come out.
  • freed0 – …

[Day Ten]

  • Kjellchr – Hey, it seems the reports have not come out.
  • freed0 – Well, only one more day before the CA comes back.  Maybe if we just pretend everything is working no one will notice.

[Day of Return]

  • Kjellchr – Hey, it seems the reports have not come out.
  • Chief Architect – Ah, what a relaxing holiday.  I really considered staying longer.  Now, let us look over the systems.  HEY!  Why are there feathers and blood sprayed all over the data center?  Who pooped on my chair?  [Sniff-sniff] Did someone pee in the corners of the data center?  What is going on here….

Conclusion

Our Chief Architect works 12-hour days most days of the week.  He is probably the hardest worker we have as well as the smartest.  Way smarter than the rest of us.  Our systems are complex enough that even those who know them inside and out can have difficulty managing them, let alone debugging the odd issues that creep in.

Being a non-profit organization that relies on donations and sponsorships, we do not always have the latest equipment needed.  There is no specific budget for hardware or anything else, and what we can install and upgrade is completely dependent on what has been given to us recently.  This will sometimes mean that while we can make software updates and changes to our operating processes, we cannot always make the optimal upgrade, which might require spending money.  We do not charge for our services, which will occasionally break and fall over.  For over ten years we have managed to put it all back together, each time better than the last.  We are always looking for more support to assist us.  So reach out and let us know how you can help us.

Yeah…  we could have done it better.  The older reports are still coming out and the most current ones should also be going out.  We will catch up soon enough.

  • Chief Architect – Everything *IS* caught up now.

Enjoy your next holiday.
