Agility in Infrastructure Engineering

In infrastructure management, agility represents the need to handle changing demand for service, to repair an absence of service, and to improve and deploy new services in response to changes from users and market requirements. It is tied to economic, business or work-related imperatives by the need to maintain a competitive lead.

The compelling event that our system must respond to might represent danger, or merely a self-imposed deadline. In either case, there is generally a penalty associated with a lack of agility: a blow, a fall or a loss.

1.1 What make agility possible?

To respond to a challenge there are four stages that need attention:

To comprehend the challenge.
To solve the challenge.
To respond to the challenge.
To confirm or verify the response.

Intuitively, we understand agility to be related to our capacity to respond to a situation. Let's try to pin this idea down more precisely.

1.2 The capacity of a system

The capacity of a system is defined to be its maximum rate of change. Most often, this refers to speed of the system response to a single request¹.

In engineering, capacity is measured in changes per second, so it represents the maximum speed of a system within a single thread of activity².

1.3 Speed

Speed is the rate at which change takes place. For a configuration tool like CFEngine, speed can be measured either as

Engineers can try to define an engineering scale of agility as the ratio of available speed to required speed and ratio of number ways a system can be changed to the number of ways imperatives require us to change.

Agility is proportional to both how much speed we can muster compared to what is required, and the number of change-capabilities we possess, compared to what we need to meet a challenge. In other words: how well equipped are we? As engineers, we could write something like this:

                Available speed under control        Changes available
    Agility =~  -----------------------------  *  -----------------------
                      Required speed                 Changes Required

What is required speed? It is is the rate of change we have to be able to enact in order to achieve and maintain a state (keep a promise) that is aligned with our intent. This requires a dependence on technology and human processes.

The weakest link in any chain of dependencies limits the speed. The weakest link might be a human who doesn't understand what to do, or a broken wire or a misconfigured switch, so there are many possible failure modes for agility. An organization is an information rich society with complex interplays between Man and Machine; agility challenges us to think about these weakest links and try to bolster them with CFEngine's technology.

What we call `acceptable' is a subjective judgement, i.e. a matter for policy to decide. So there are many uncertainties and relativisms in such definitions. It would be inconceivable to claim any kind of industry standard for these.

We can write some scaling laws for the dependencies of agility to see where the failure modes might arise.

The speed available to meet a challenge is (on average) the maximum speed we can reliably maintain over time divided by the number of challenges we have to share between.

                                Expected capacity * reliability
    Average available speed =~  -------------------------------
                                    Consumers or challenges

The appearance of reliability in this expression therefore tells us that maintenance of the system, and anticipation of failure will play a key role in agility. Remarkably this is usually unexpected for most practitioners, and most of system planning goes into first time deployment, rather than maintaining operational state.

1.4 Precision

Acting quickly is not enough: we also need to be accurate in responding to change⁴. We need to be able to:

CFEngine is a fault tolerant system – it continues to work on what it can even when some parts of its model don't work out as expected⁶.

1.5 Comprehension

The next challenge is concerns a human limitation. One of the greatest challenges in any organization lies in comprehending the system.

Comprehensibility increases if something is predictable, or steady in its behaviour, but it decreases in proportion to the number of things we need to think about – which includes the many different contexts such as environments, or groups of machines with different purposes or profiles.

                        Predictability (Reliability)    Predictability
  Comprehensibility =~  ---------------------------- = ----------------
                                 Contexts                 Diversity

To keep the number of contexts to a minimum, CFEngine avoids mixing up what policy is being expressed with how the promises are kept. It uses a declarative language to separate the what from the how. This allows ordinary users to see what was intended without having to know the meaning of how, as was the case when scripting was used to configure systems.

1.6 Efficiency

Finally, if we think about the efficiency of a configuration, which is another way of estimating its simplicity, we are interested in how much work it takes to represent our intentions. There are two ways we can think about efficiency: the efficiency of the automated process and the human efficiency in deploying it.

If the technology has a high overhead, the cost of maintaining change is high and efficiency is low:

The efficiency of the technology decreases with the more resources it uses, e.g. like memory and CPU. Resources used to run the technology itself are pure overhead and take away from the real work of your system.

                                      Resources used
         Resource Efficiency =~  1 -  ---------------
                                      Total resources

The efficiency of a model decreases when you put more effort into managing a certain number of things. If you can manage a large number of things with a few simple constraints, that is efficient.

                              Number of objects affected
      Model Efficiency =~  -------------------------------
                           Number of rules and constraints

Efficiency therefore plays a role in agility, because it affects the cost of change. Greater efficiency generally means greater speed, and more greater likelihood for precision.

2 Aspects of CFEngine that bring agility

Ability to express clear intentions about desired outcome (comprehension).
Availability of insight into system performance and state (comprehension).
Ability to manage large numbers of hosts and resources with a few generic patterns (efficiency).
Ability to bundle related details into simple containers (comprehension without loss of adaptability).
Ability to accurately customize policy down to a low level without programming (adaptability).
Ability to recover quickly from faults and failures. The default, parallelized execution framework verifies promises every 5 minutes for rapid fault detection and change deployment (clock speed)⁷ .
A quick system monitoring/sampling rate – every 2.5 minutes (Nyquist frequency), for automated hands-free response to errors.
Ability to recover cheaply. The lightweight resource footprint of CFEngine that consumes few system resources required for actual business (system speed – low overhead, maximum capacity).
Ability to increase number of clients without significant penalty (scalability and easy increase of capacity).
A single framework for all devices and operating systems (ease of migrating from one platform to another).

2.1 What agility means in different environments

Let's examine some example cases where agility plays a role. Agility only has meaning relative to an environment, so in the following sections, we cite the remarks of CFEngine users (in quotes), and their example environments.

Users' expectations for agility can differ dramatically in the present; but if we think just a few years down the line, and follow the trends, it seems clear that limber systems must prevail in IT's evolutionary jungle.

2.1.1 Desktop management

"The desktop space can be a very volatile environment, with multiple platforms."

2.1.2 Web shops

Modern web-based companies often base their entire financial operations around an active web site. Down-time of the web service is mission critical.

2.1.3 Cloud providers

2.1.4 High Performance Computing

High Performance clusters are typically found in the oil and gas industry, in movie, financial, weather and aviation industries, and any other modelling applications where raw computation is used to crunch numbers.

2.1.5 Government

2.1.6 Finance

2.1.7 Manufacturing

SCADA (supervisory control and data acquisition) generally refers to industrial control systems (ICS): computer systems that monitor and control industrial, infrastructure, or facility-based processes, as described below.

2.2 Separating What from How (DevOps)

Think of CFEngine as an active knowledge management system, rather than as a relatively passive programming framework.

For `DevOps': programming is for your application, consider its deployment to be part of the documentation.

CFEngine is a declarative system. In a declarative system, the reverse is true. You begin by writing down What you are trying to accomplish and the How is more implicit. The way this is done is by separating data from algorithm in the model. CFEngine encourages this with its language, but you can go even further by using the tools optimally.

CFEngine allows you to represent raw data as variables, or as strings within your policy. For example:

By separating `what' data like this out of the details of how they are used, it becomes easier to comprehend and locate, and it becomes fast to change, and the accuracy of the change is easily perceived. Moreover, CFEngine can track the impact of such a change by seeing where the data are used.

CFEngine's knowledge management can tell you which system promises depend on which data in a clear manner, so you will know the impact of a change to the data.

For example, reading in data from a system file is very convenient. This is what Unix-like system do for passwords and user management.

What you might lose when making an input matrix is the why. Is there an explanation that fits all these cases, or does each case need a special explanation? We recommend that you include as much information as possible about `why'.

2.3 Packaging limits agility

Atomicity enables agility. Atomicity, or the avoidance of dependency, is a key approach to simplicity. Today this is often used to argue to packaging of software.

Handling software and system configuration as packages of data makes certain processes appear superficially easy, because you get a single object to deal with, that has a name and a version number. However, to maintain flexibility we should not bundle too many features into a package.

A tin of soup or a microwave meal might be a superficially easy way to make dinner, for many scenarios, but the day you get a visitor with special dietary requirements (vegetarian or allergic etc) then the prepackaging is a major obstacle to adapting: the recipe cannot be changed and repackaged without going back to the factory that made it. Thus oversimplification generally tends to end up sending up back to work around the technology.

For example: your system's package model can cooperate with CFEngine make asking CFEngine to promise to work with the package manager:

If you need to change what happens under the covers, it is very simple to do this in CFEngine. You can copy the details of the existing methods, because the details are not hard-coded, and you can make your own custom version quickly.

2.4 How abstraction improves agility

In this case, it is both easy and simple. Let's check why. We have to ask the question: how does this abstraction improve speed and precision in the long run?

Obviously, it provides short term ease by allowing many complex operations to take place with the utterance of just a single word¹⁰. But any software can pull that smoke and mirrors trick. To be agile, it must be possible to understand and change the details of what happens when this services is promised. Some tools hard-code processes for this kind of statement, requiring an understanding of programming in a development language to alter the result. In CFEngine, the definitions underlying this are written in the high-level declarative CFEngine language, using the same paradigm, and can therefore be altered by the users who need the promise, with only a small amount of work.

Thus, simplicity is assured by having consistency of interface and low cost barrier to changing the meaning of the definition.

2.5 Increasing system capacity (by scaling)

Capacity in IT infrastructure is increased by increasing machine power. Today, at the limit of hardware capacity, this typically means increasing the number of machines serving a task. Cloud services have increased the speed agility with which resources can be deployed – whether public or private cloud – but they do not usually provide any customization tools. This is where CFEngine brings significant value.

By scalability we mean the intrinsic capacity of a system to handle growth. Growth in a system can occur in three ways: by the volume of input the system must handle, in the total size of its infrastructure, and by the complexity of the processes within it.

For a system to be called scalable, growth should proceed unhindered, i.e. the size and volume of processing may expand without significantly affecting the average service level per node.

Although most of us have an intuitive notion of what scalability means, a full understanding of it is a very complex issue, mainly because there are so many factors to take into account. One factor that is often forgotten in considering scalability, is the human ability to comprehend the system as it grows. Limitations of comprehension often lead to over-simplification and lowest-common-denominator standardization.

Scalability is addressed in a separate document: Scale and Scalability, so we shall not discuss it further here.

3 Agility in your work

3.1 Easy versus simple

Just as we separate goals from actions, and strategy from tactics, so we can separate what is easy from what is simple. Easy brings short-term gratification, but simple makes the future cost less.

Simple is about what happens next. Once you have started, what happens if you want to change something?

Total cost of ownership is reduced if a design is simple, as there are only a few things to learn in total. Even if those things are hard to learn, it is a one-off investment and everything that follows will be easy.

Unlike some tools, with CFEngine, you do not need to program `how' to do things, only what you want to happen. This is always done by using the same kinds of declarations, based on the same model. You don't need to learn new principles and ideas, just more of the same.

3.2 How does complexity affect agility?

In the past¹¹, it was common to manage change by making everything the same. Today, the individualized custom experience is what today's information-society craves. Being forced into a single mold is a hindrance to adaptability and therefore to agility. To put it another way, in the modern world of commerce, consumers rule the roost, and agility is competitive edge in a market of many more players than before.

Of course, it is not quite that simple. Today, we live in a culture of `ease', and we focus on what can be done easily (low initial investment) rather than worrying about long term simplicity (Total Cost of Ownership).

At CFEngine, we believe that `easy' answers often suffer from the sin of over-simplification, and can lead to risky practices. After all, anyone can make something appear superficially easy by papering over a mess, or applying raw effort, but this will not necessarily scale up cheaply over time. Moreover, making a risky process `too easy' can encourage haste and carelessness.

Any problem has an intrinsic complexity, which can be measured by the smallest amount of information required to manage it, without loss of control.

Ease is the absence of a barrier or cost to action.
Simplicity is a strategy for minimizing Total Cost of Ownership.

At CFEngine, we believe in agility through simplicity, and so we invest continuous research into making our technology genuinely simple for trained users. We know that a novice user will not necessarily find CFEngine easy, but after a small amount of training, CFEngine will be a tool for life, not just a hurried deployment.

3.3 An effective understanding helps agility

Knowledge is an antidote to uncertainty. Insight into patterns, brings simplicity to the information management, and insight into behaviour allows us to estimate impact of change, thus avoiding the risk associated with agility.

In configuration `what' represents transitory knowledge, while `how' is often more lasting and can be absorbed into the infrastructure. The consistency and repairability of `how' makes it simpler to change what without risk.

3.4 Maximizing business imperatives

Agility allows companies and public services to compete and address the needs of continuous service improvement. This requires insight into IT operations from business and vice versa. Recently, the `DevOps' movement in web arenas has emphasized the need for a more streamlined approach to integrating business-driven change and IT operations. Whatever we choose to call this, and in whatever arena, `connecting the dots between business and IT' is a major enabler for agility to business imperatives.

Some business issues are inherently complex, e.g. software customization and security, because they introduce multifaceted conflicts of interest that need to resolved with clear documentation about why.

Be careful about choosing a solution because it has a low initial outlay cost. Look to the long term cost, or the Total Cost of Ownership over the next 5 years.

Many businesses have used the argument: everything is getting cheaper so it doesn't matter if my software is inefficient – I can brute force it in a year's time with more memory and a faster CPU. The flaw in this argument is that complexity and scale are also increasing, and you will need those savings down the line even more than you do now.

In the industrial age, the strategy was to supply sufficient force to a small problem in order to `control' it by brute force. In systems today the scale and complexity are such that no such brute force approach can seriously be expected to work. Thus one is reduced to a more even state of affairs: learning to work with the environment `as is', with clear expectations of what is possible and controlling only certain parts on which crucial things depend.

3.5 What does agility cost?

At a recent deployment in the banking sector, CFEngine replaced an incumbent software solution where 200 machines were required to make the management infrastructure scale to the task.

CFEngine replaced this with 3 machines, and a reduced workforce. After the replacement the clock-time required for system updates went from 45 minutes to 16 seconds.

3.6 Who is responsible for agility?

Competitive edge, response to demands, in both private sector and research, makes agility the actual product of a not-too-distant tomorrow.

CFEngine has been carefully designed to support agile operations for the long term, by investing in knowledge management, speed and efficiency.

Footnotes

[1] Capacity is often loosely referred to as `bandwidth' because of its connection to signal propagation in communication science, but this is not strictly correct, as bandwidth refers to parallel channels.

[2] For example, for a single coding frequency, the capacity of a communications channel is measured in bits per second, and the bandwidth is the number multiplied by the number of parallel frequencies.

[3] If available speed matches need, and we have the capability to make all required changes, then we can claim exactly 100% agility. If we have less than required, then we get a smaller number, and if we have excess speed or changeability then we can even claim a super-efficiency.

[4] In the 20th century, science learned that there is no such thing as determinism – the idea that you can guarantee an outcome absolutely. If you still think in such terms, you will be quickly disappointed. The best we can accomplish is to maximize the likelihood of a predictable result, relative to the kind of environment in which we work.

[5] In some other configuration software, assumptions are hard-coded into the tools themselves, making the outcome undocumented.

[6] Other systems that claim to be deterministic simply stop with error messages. What is the correct behaviour? Clearly, this is a subjective choice. The important thing is that your system for change behaves in a predictable way.

[7] Two related concepts that are frequently referred to are the classic reliability measures: Mean Time Before Failure (MTBF) or proactive health and Mean Time To Repair (MTTR), speed of recovery: (i) If we are proactive or quick at recovering from minor problems, larger outages can be avoided. Recovery agility plays a role in avoiding cascade failure.

(ii) If the time to repair is long, or the repair is inaccurate, this could result if more widespread problems. Inaccurate change or repair often leads to attempts to `roll-back', causing further problems.

[8] See http://www.cfengine.com/blog/sysadmin-3.0-and-the-third-wave

[9] Service promises, as described here, were introduced into version 3.3.0 of CFEngine in 2012.

[10] All good magic stories begin like this.

[11] Perhaps not just in the past. We are emerging from an industrial era of management where mass producing everything the same was the cheapest approach to scaling up services. However, today personal freedom demands variety and will not tolerate such oversimplification.