Sunday, January 11, 2015

Resilience

re·sil·ience
/rəˈzilyəns/
noun
noun: resilience; plural noun: resiliences
  1. the ability of a substance or object to spring back into shape; elasticity.
     "nylon is excellent in wearability and resilience"
  2. the capacity to recover quickly from difficulties; toughness.
     "the often remarkable resilience of so many British institutions"

My parents were children of the Great Depression here in the U.S.  The experience influenced how they approached their lives: stability over all else.  Even though they never quite realized the goal of stability, it is still one of the values passed down by their generation to those that followed.  That's why society tends to idolize successful risk takers; they broke the self-imposed limitations that come from a focus on stability.

My own experience in working with NASA taught me that stability is not the key to preserving the viability of any system.  The key is resilience.  With all the hooey happening recently both in the IT world and in real life, I thought the idea of resilience might be worth a brief post here.

Let's consider a very narrowly-focused, basic statistic.  The U.S. Navy is the target of, on average, 30 cyber-attacks every second - every minute, every hour, every day over the course of a year.  That's nearly a billion attacks in a 12-month period.  Common sense alone says they can't stop them all.  Some attacks succeed, some data is lost, some damage is done.  Stability can't be preserved.  So, for the U.S. Navy, the focus is on system resilience.
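For the record, the arithmetic behind that figure is simple enough to check:

```python
# 30 attacks per second, sustained for a full (non-leap) year
attacks_per_year = 30 * 60 * 60 * 24 * 365
print(f"{attacks_per_year:,}")  # 946,080,000 -- roughly a billion
```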

Resilience in IT systems essentially embraces the following idea:  bad things will eventually happen to your system, and you can't prevent them all.  Make every effort to defend against hackers.  Build earthquake-ready systems to house your data center.  Keep your patches and maintenance up to date.  In my little corner of NASA, we used a floating iceberg analogy and referred to this as the "above-the-waterline" stuff...things we could see or foresee.

But it's the things below the waterline that hold the most risk: a new hacking approach, a natural disaster of massive proportions, a unique anomaly, etc.  To address these risks, you design system architectures that can bounce back quickly from attacks and damage.  

So it's not a matter of preventing all the bad things from happening (you can't); it's a matter of how quickly your system can adapt and bounce back from the bad things that do happen.
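One of the simplest ways to build that bounce-back into software is automatic retry with exponential backoff.  Here's a minimal sketch - the function name and parameters are my own illustration, and `operation` stands in for any flaky dependency (a network call, a database query, etc.):

```python
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.1):
    """Retry a failing operation with exponential backoff and jitter.

    `operation` is any zero-argument callable that raises an exception
    on a transient failure.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller decide what to do
            # Wait longer after each failure, plus random jitter so many
            # clients don't all retry at the same instant.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

The jitter matters: without it, a thousand clients knocked over by the same outage all come back at the same moment and knock the service over again.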

The following is a direct quote from the Rockefeller Foundation's work on resilience:

Resilient systems, organizations, or individuals possess five characteristics in good times and in times of stress. They are:
  • Aware. Awareness means knowing what your strengths and assets are, what liabilities and vulnerabilities you have, and what threats and risks you face. Being aware is not a static condition; it’s the ability to constantly assess, take in new information, reassess and adjust your understanding of the most critical and relevant strengths and weaknesses and other factors on the fly. This requires methods of sensing and information-gathering, including robust feedback loops, such as community meetings or monitoring systems for a global telecommunications network.
  • Diverse. Diversity implies that a person or system has a surplus of capacity such that it can successfully operate under a diverse set of circumstances, beyond what is needed for every-day functioning or relying on only one element for a given purpose. Diversity includes redundancy, alternatives, and back-ups, so it can call up reserves during a disruption or switch over to an alternative functioning mode. Being diverse also means that the system possesses or can draw upon a range of capabilities, information sources, technical elements, people or groups. 
  • Self-Regulating. This means elements within a system behave and interact in such a way as to continue functioning to the system’s purpose, which means it can deal with anomalous situations and interferences without extreme malfunction, catastrophic collapse, or cascading disruptions. This is sometimes called “islanding” or “de-networking”—a kind of failing safely that ensures failure is discrete and contained. A self-regulating system is more likely to withstand a disruption, less likely to exacerbate the effects of a crisis if it fails, and is more likely to return to function (or be replaced) more quickly once the crisis has passed.
  • Integrated. Being integrated means that individuals, groups, organizations and other entities have the ability to bring together disparate thoughts and elements into cohesive solutions and actions. Integration involves the sharing of information across entities, the collaborative development of ideas and solutions, and transparent communication with people and entities that are involved or affected. It also refers to the coordination of people, groups, and activities. Again, this requires the presence of feedback loops.
  • Adaptive. The final defining characteristic of resilience is being adaptive: the capacity to adjust to changing circumstances during a disruption by developing new plans, taking new actions, or modifying behaviors so that you are better able to withstand and recover from a disruption, particularly when it is not possible or wise to go back to the way things were before. Adaptability also suggests flexibility, the ability to apply existing resources to new purposes or for one thing to take on multiple roles.
Resilience is all about making systems and the components of those systems stronger:  hardware, software, people, communities, etc.
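The "islanding" idea under Self-Regulating above maps neatly onto a well-known software pattern: the circuit breaker.  Here's a toy sketch - the class name, thresholds, and error messages are all my own invention, not from any particular library - showing how a component can fail fast and contain a failing dependency instead of letting it cascade:

```python
import time

class CircuitBreaker:
    """After repeated failures the breaker 'opens' and rejects calls
    immediately, isolating the failing dependency rather than hammering
    it and dragging the rest of the system down with it."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Timeout elapsed: allow one trial call through ("half-open").
            self.opened_at = None
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        else:
            self.failures = 0  # any success resets the failure count
            return result
```

Failure here is discrete and contained - exactly the "failing safely" the quote describes - and after the reset timeout the breaker probes the dependency again, giving the system its path back to normal function.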

From an IT perspective, the next time you design a solution or a system, stop and think about how your solution or system could be designed for greater resilience.  You'd be amazed how simple and inexpensive it can be once you invest a little brain power.

And what I just wrote about the IT perspective?  It applies to real life situations too.  How's that for a pearl of wisdom?

Your thoughts?  Love to hear 'em!
