CFEngine and the Enterprise

CFEngine-Modularization

1 Enterprise Integration

1.1 Business alignment

The goal of most IT installations is to work as a support infrastructure for some other primary activity, such as the running of a business or other organization. Even if the primary activity is the design of computer systems, or the writing of software, the supporting infrastructure is a tool whose management is in principle separate from the main business goals. As organizations become larger, the management of the IT system and other ancillary activities frequently become isolated from “front line” activities.

IT infrastructure is an enabler, so it is important to ensure that it succeeds in this task. How do we do this? This document is about how to make cfengine-management best support primary business or organizational processes.

We write this document in the light of two trends: the demotion of system administration as a job description, and the rise of service oriented thinking, which is replacing both it and the monolithic design philosophy of systems. Service orientation is not so much a technological innovation as it is a different kind of social structure. It is a move away from hierarchy as the main model of organization, toward generalized network structure. In computer parlance, service orientation is essentially a peer to peer structure. There are no automatic kings or commanders in chief, only peers who need help from other peers. If such key positions arise, they emerge naturally by necessity, not by presumption.

For example, in the 1960s factory work in the United Kingdom was organized hierarchically with powerful unions attending to a dutiful “separation of concerns”, much like an idealized object oriented system. To build a ship, one would have to ask the management to ask panel producers for panels, then when they were finished they would send the message back up the hierarchy so that management would schedule the welders to arrive, then the painters and so on. Much delay and inefficiency was caused by this organizational bureaucratic structure.

Although this behaviour persists to a lesser extent, today we use more direct communication between the parts that need to connect and so save much time and overhead. This service oriented thinking can be applied to computing services, their organization and even the support of those computing services. The service model can be applied at all levels.

In the late 1980s it was realized that a service oriented view of management could profitably be formalized so as to be of benefit to all organizations. This was the beginning of the Information Technology Infrastructure Library, which builds on the experience of leaders in government and industry, including organizations such as the British Broadcasting Corporation, the Office of Government Commerce and others.

1.2 ITIL introduced

The IT Infrastructure Library (ITIL) has emerged as a de facto set of ideas about service delivery. It is not based on any theoretical model or design criteria. It is rather a set of self-proclaimed best practices compiled by representatives from government and industry. As such its claims can be discussed, but we shall not do so here. We shall refer to ITIL because it has become a popular set of guidelines for all manner of IT organizations, and because it promotes the idea of IT-business alignment.

ITIL was an important source of concepts and processes documented in the following British and ISO standards:

BS 15000
ISO/IEC 20000 (successor of BS 15000)

ITIL now encompasses various books and courses and has its own qualification scheme allowing for a certification of Service Managers or IT staff.

The key concepts of ITIL include service and process orientation, and service orientation is an important model for system organization because it can encompass everything from the monolithic hierarchical systems of yesteryear to modern day peer to peer architectures which better mirror a free-market economic business interaction. It can be applied to computer-provided services (e.g. web services, or even configuration operations like cfengine) or it can be applied to human services and operations such as help desks and support. This makes it an important centre-piece in the discussion.

ITIL has its own particular terminology for discussing service related matters. To relate these to the use of a technology such as cfengine we need to understand the words and how they are used. ITIL uses many terms and phrases in a different way to system administrators.

The verb “to manage” originally meant “to cope”. Only more recently has strategic thinking changed it into a transitive verb: something that we do to systems, like driving a car, or flying a plane.

Today the term “management” signifies the introduction of a bureaucratic level of governance, to control and verify the workings of a system. This terminology has come about mainly because the people who wrote ITIL live in that kind of world and understand things through these eyes. Ironically, computer engineers now speak of “self-management” and “autonomics” to recover the original idea of systems that can cope.

In this document we have two principal aims:

To explain a number of patterns for using cfengine to allow systems to cope with business needs.
To demystify ITIL for technicians and engineers who do not naturally respond to business-speak, relating cfengine's capabilities (both the technical aspects that are well known and the non-technical aspects of instrumentation and reporting that are less well known) to the goals of ITIL.

1.3 Business processes and goals

What do we need to make a business? Do we need a demand for a “product”, a workflow to implement it, a supply (chain) mechanism for selling it to a market? It turns out that the service abstraction is a paradigm that fits all enterprises without too much shoe-horning.

Businesses probably have many goals in their grand designs: they have high level visions, notions of secure and best practices, sometimes even ethical policies. All of these can be couched in the language of promises to behave in some way.

Now, we can ask: what does it mean to align an IT infrastructure with a business goal, such as the goal of providing some service $S$? First, for IT systems to have any impact on the business goal at all, the business must rely on the IT system in some way. This could either be directly, in the manner of an e-commerce web-site, or it might be indirectly, for instance by providing drawing and modelling software in an architect's office. In either case there is a workflow in which an IT system plays an intermediary role.

In fact, it does not matter whether this is an IT system, a human being or a steam-powered engine. What is key is that there is a technology playing an intermediate role in the performance of a service. We can display this as the workflow diagram shown by the dotted lines in the figure. The business $B$ would like to provide service $S$ to its customer $C$; in actuality this requires the help of intermediary $I$.

Figure: Inserting an intermediate agent into a business process. The dotted lines show a work flow path. The arc shows a promise the business would like to make to the end customer – but promise theory says that it cannot if it does not have direct contact.

Promise theory has several implications, and one of them is that an agent cannot promise something with confidence to an agent it is not directly in contact with. This is because agents can only vouch for their own behaviour. They cannot promise what an intermediate agent would do. This has implications for the business.

Suppose a business wants to make a promise to its customer, but knows that it must rely on intermediaries (the IT department for example) to do so. Promise theory tells us that the business representative making the promise requires promises from every intermediate agent in the chain, and each of the agents in that chain requires promises from further down the chain too.
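
The requirement can be pictured as a simple check over a chain of agents. The following Python sketch is only an illustration of the reasoning: the agent names and the `unbacked_links` helper are hypothetical and are not part of cfengine or of promise theory's formal notation.

```python
# Illustrative sketch only. Agent names and the helper are hypothetical.

def unbacked_links(chain, promises):
    """List the links of a delivery chain that are not backed by a promise.

    chain:    agents ordered from the business down to its last supplier.
    promises: set of (promiser, promisee) pairs; each promise is made by
              the promiser about its own behaviour only.
    A link (upstream, downstream) is backed when the downstream agent has
    promised its help to the upstream agent.
    """
    return [
        (upstream, downstream)
        for upstream, downstream in zip(chain, chain[1:])
        if (downstream, upstream) not in promises
    ]

chain = ["business", "it_department", "hosting_provider"]

# The IT department has promised the business, but nobody backs the IT department.
promises = {("it_department", "business")}
before = unbacked_links(chain, promises)

# Once the hosting provider makes its own promise, every link is backed.
promises.add(("hosting_provider", "it_department"))
after = unbacked_links(chain, promises)

print(before)  # [('it_department', 'hosting_provider')]
print(after)   # []
```

Only when the list of unbacked links is empty can the business make its own promise to the customer with any confidence.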

It is beyond the scope of this document to explain all of those promises. What cfengine allows a business to do is to automate many of those promises – or make them autonomic (self-managing).

1.3.1 Teams and collaboration

Humans are poor at reliable, repetitive work but they are infinitely superior at creative work and decision-making. Modern theory on success in business rejects the classic views of management with militarized or bureaucratized chains of command and control in favour of more human-creative structures. Creative and adaptive workflow requires a high level of decentralization and autonomy, while at the same time protecting the core values of the organization.

Team work is a key element in decentralized organization – both for humans and computers. IT departments are often organized in this way, for instance. Teams do not exist because they maximize production of every individual, nor do they make an organization more predictable or controllable. They exist because humans need continual motivation and emotional support – and indirectly this sustains workflow and adds creativity to a business. One often overlooks the team-aspect of coping when considering computer management, in favour of hierarchical design. CFEngine does not force us into hierarchical systems however, so we should not discard the smaller team idea too soon.

Hierarchy has long traditions but modern thinking favours teams.

CFEngine is complex enough for it to make sense to delegate responsibility for different issues. An organization will generally consist of many groups and teams already, each with their own special needs and each craving its own autonomy. CFEngine and promise theory were designed for precisely this kind of environment. CFEngine allows cooperation and sharing without allowing central managers to ride roughshod over local needs.

Teams thrive by discussion and interaction within the framework of a policy or vision, allowing variation and arriving at a consensus when necessary. Success in a team depends on a combination of abilities working together, not undermining one another. Conflicts in the promises made by team members reveal design problems in the group. An analysis of promises (cfengine's model of collaboration) is a significant tool for understanding and enabling businesses.

M. Belbin, a researcher in teamwork, has identified nine abilities or roles (kinds of promise) to be played in a team collaboration:

Plant – a creative “ideas” person who solves problems.
Shaper – a dynamic member of the team who thrives on pressure and has the drive and courage to overcome obstacles.
Specialist – someone who brings specialist knowledge to the group.
Implementer – a practical thinker who is rooted in reality and can turn ideas into practice (who sometimes frustrates more imaginative high flying visionaries).
Resource Investigator – an enabler, or someone who knows where to find the help the team needs regardless of whether the help is physical, financial or human. This person is good at networking.
Chairman/Co-ordinator – an arbitrator who makes sure that everyone gets their say and can contribute.
Monitor-Evaluator – a dispassionate, discerning member who can judge progress and achievement accurately during the process.
Team Worker – someone concerned with the team's inter-personal relationships and who is sensitive to the atmosphere of the group.
Completer/Finisher – someone critical and analytical who looks after the details of presentation and spots potential flaws and gaps. The completer is a quality control person.

His model has little room for technical workflow arguments. It is entirely concerned with the creative process. This is probably significant. We should ask ourselves: how can we use the freedom to organize into specialized teams to maximize human creativity, while passing hard work over to machines? Solving this problem is what cfengine is about.

1.4 About Promises

1.4.1 A theory for ITIL

ITIL has no theory to back it up, so we have to look elsewhere for a motivation of its practices. Promise theory is an attempt to do just this for a service oriented model in which peers make promises to one another. So it ought to work for ITIL also. The advantage of promise theory is that it helps us to see how cfengine can be used, because promises provide a simple picture of how cfengine works.

Think of cfengine as a general tool for automatically making sure that promises are kept.

The popular service concept fails to capture one thing very clearly, namely the distinction between making a promise and keeping a promise. A service implies that something will be provided but it does not specify when.

Suppose we ask a security company to protect our assets. The company might promise to deploy guards, or alarm technology, or it could simply promise that you will be safe without explaining how the promise will be kept. The promise does not necessarily imply any action required to maintain this state of safety, but we still pay the company for the service to keep this promise anyway. Trust plays an important role, of course.

Promise theory helps us to understand services in all forms by forcing us to think carefully about the concept of autonomy. Autonomy implies several things: for instance, privacy of information, independence of decision and responsibility for one's own behaviour. The concept of autonomy is like a filter that makes us think carefully about things that we often take for granted. It is a good discipline, forcing us to confront what we think we know about systems.

The agents of promises are humans, computers or any entity that can be associated with a promise, even if only by association with its owner or designer. They are said to be autonomous if they cannot be forced to make any promises about their behaviour by an outside agent. A useful principle for understanding systems is the maximal separation of concerns, and promises help us to separate independent issues.

Separation of concerns is only half the story however. Promises are also about describing how the parts of a system work together, just as in team-work. Promises provide the glue that allows completely autonomous parts to form an organization. We are not allowed to think about “control” or “command”, only about voluntary cooperation. Keep these ideas in mind when reading this document.

1.4.2 Basic promise definitions

We can use the language of promises to make clearer definitions.

Service: a promise to act or provide a resource. The promise is made from a “server” agent $S$ to one or more external agents which we call the clients.
Agreement: a mutual acceptance of knowledge by two agents (“the agents agree”). The knowledge that is agreed to is called the body of the agreement. Note that the term “agreement” is sometimes used incorrectly to mean “contract”. Agreement is often signified by signing the body, or some equivalent declaration. In promise theory an agreement is a pair of use-promises between two parties to acknowledge acceptance of the agreement body.
Contract: a bilateral bundle of proposed promises between two agents, intended to serve as the body of an agreement.
Service Level Agreement: an agreement between two parties whose body describes a contract for service delivery and consumption.

Service Level Agreements (SLAs) are now a well-known part of the customer-business scenario. How are promises different from SLAs? Promises are more primitive than agreements. Agreements bind two parties to a collection of bilateral decisions that have been made in advance. An agreement implies an existing infrastructure on which to agree. A promise on the other hand is an entirely autonomous statement about an agent's own behaviour (ad hoc). Showing only the promises in a system does not imply any agreement between the parties, only indications about their likely behaviours.

In other words, seeing the promises that have been made, an external observer could calculate effective service levels that have been promised without any agreement taking place. Promises are therefore more fundamental than agreements to the predictability of the system.

1.5 Is automation worthwhile?

Process automation is an investment which has its own cost. The benefits are not merely saved manpower but improved consistency or certainty of process. Automation provides an automatic quality assurance.

A simple argument against automation goes like this: if I can fix it in five minutes then it is not worth automating, unless the automation takes less time than that.

The argument is simplistic. Before dismissing automation, one should ask questions like this:

How many of these five minute periods occur in the long run?
How much time was needed to diagnose each of them?
Could the problems have been avoided altogether by proactive maintenance?

One of the benefits of automation is in prevention, another is in documenting institutional learning by codifying the processes required for the avoidance of incidents. A tool like cfengine which separates intention (promises) from action makes this kind of documentation highly readable and allows the learning to penetrate the workflow processes directly.
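
The questions above amount to a simple piece of arithmetic. The Python sketch below works through it with invented numbers: the incident counts and the times are purely hypothetical and are chosen only to show the shape of the comparison, not to describe any real installation.

```python
# Illustrative arithmetic only: every number below is a made-up assumption.

def manual_cost_minutes(incidents, fix_minutes, diagnose_minutes):
    """Total time spent handling a recurring problem by hand."""
    return incidents * (fix_minutes + diagnose_minutes)

def automation_pays_off(incidents, fix_minutes, diagnose_minutes, automation_minutes):
    """True if handling the incidents by hand costs more than automating once."""
    manual = manual_cost_minutes(incidents, fix_minutes, diagnose_minutes)
    return manual > automation_minutes

# A "five minute fix" that also needs 20 minutes of diagnosis each time,
# compared with a one-off investment of 8 hours (480 minutes) in automation.
single_incident = automation_pays_off(1, 5, 20, 480)
forty_incidents = automation_pays_off(40, 5, 20, 480)

print(manual_cost_minutes(40, 5, 20))    # 1000
print(single_incident, forty_incidents)  # False True
```

Seen as a single incident, the automation looks too expensive; seen over the long run, the same investment is repaid twice over, before counting any incidents that are prevented altogether.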

2 CFEngine past and present

CFEngine is a free software package for automating the installation and maintenance of networked computers. The project began in 1993 and it has been in widespread use since 1995. CFEngine is available for all major Unix and Unix-like operating systems, and it will also run under NT-derived Windows operating systems via the Cygwin Unix-compatibility environment/libraries.

CFEngine scales easily from a single host to tens of thousands of hosts. As of this writing, the largest installations we know of regulate around 20,000 machines under a common administration. CFEngine can manage many aspects of system configuration and maintenance, including the following:

Performing post-installation tasks such as configuring the network interface.
Editing system configuration files and other files.
Creating symbolic links.
Checking and correcting file permissions and ownership.
Deleting unwanted files.
Compressing selected files.
Distributing files within a network.
Automatically mounting NFS file systems.
Verifying the presence and integrity of important files and file systems.
Executing commands and scripts.
Applying security-related patches and similar system corrections.
Managing system server processes.

Cfengine's purpose is to implement policy-based configuration management. In practical terms, this means that cfengine greatly simplifies the tasks of system configuration and maintenance. For example, to customize a particular system, it is no longer necessary to write a program which performs each required action in a procedural language like Perl or your favorite shell. Instead, you write a much simpler policy description that documents how you want your hosts to be configured. The cfengine software determines what needs to be done in terms of implementation and/or remediation from this specification. Such policy descriptions are also used to ensure that the system remains configured as the system administrator wishes over time.

Here is a brief example of such a policy description which we've annotated:

     control:        # General directives: here, we define a list variable.
         tmpdirs = ( tmp:scratch:scratch2 )

     files:          # File ownership and protection specifications.
         /usr/local/bin owner=root group=bin mode=755 action=fixall

     copy:           # Copy files on/to the local system.
       solaris::     # Applies only to Solaris systems.
         /config/pam/solaris server=pammaster dest=/etc/pam.d
       linux::       # Applies only to Linux systems.
         /config/pam/common-auth server=pammaster
           dest=/etc/pam.d/common-auth

     tidy:           # Manage temporary scratch directories.
         ${tmpdirs} include=* age=7 recurse=inf

This simple configuration is divided into four stanzas, each introduced by a colon-terminated keyword, specifically control:, files:, copy: and tidy:. The control stanza defines a list of directories which we've named tmpdirs which we'll use later (in the tidy stanza).

The files stanza specifies that all of the files in the directory /usr/local/bin should be owned by user root and group bin and have the file mode 755. When cfengine runs with this configuration description it will correct any ownership and/or permissions which deviate from these specifications. Thus, this stanza serves to implement a policy about the proper ownerships and permissions for the executables in the local binaries directory.

The copy stanza prescribes different configurations for Linux and Solaris systems. On Solaris systems, files in /etc/pam.d will be updated with those in the directory /config/pam/solaris on a master server when the latter are newer. On Linux systems, only the file /etc/pam.d/common-auth is updated from the PAM master configuration because the Linux systems in question use the PAM include file mechanism to propagate this file's stacks to all of the PAM-enabled services. Note, however, that both of these specifications implement the same underlying system configuration maintenance policy: update the relevant PAM configuration files from the master server if necessary.

The final, tidy stanza illustrates the use of implicit looping. The single directive in the example applies to each of the directories in the tmpdirs list. For each directory, cfengine will delete all items in the directory or any of its subdirectories which have not been accessed in seven days (including ones where the filename begins with a period). Like the other directives in this sample configuration file, this stanza implements a policy: items in temporary directories which have not been used within a week will be deleted.
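
The effect of implicit looping can be pictured with a few lines of Python. This is only a sketch of the visible result, assuming the colon-separated list from the example; the `expand_list_variable` helper is hypothetical and says nothing about how cfengine implements the expansion internally.

```python
# Illustrative sketch only: models the effect of implicit looping,
# not cfengine's actual implementation.

def expand_list_variable(directive, name, value, separator=":"):
    """Return one directive per list item, substituting ${name} in each."""
    placeholder = "${" + name + "}"
    if placeholder not in directive:
        # Nothing to loop over: the directive stands as written.
        return [directive]
    return [directive.replace(placeholder, item) for item in value.split(separator)]

expanded = expand_list_variable(
    "${tmpdirs} include=* age=7 recurse=inf",
    "tmpdirs",
    "tmp:scratch:scratch2",
)

for line in expanded:
    print(line)
# tmp include=* age=7 recurse=inf
# scratch include=* age=7 recurse=inf
# scratch2 include=* age=7 recurse=inf
```

One directive in the policy thus stands for three, and adding a directory to the list extends the policy without touching the tidy stanza.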

All cfengine configuration descriptions are variations on these and similar themes, albeit more elaborate ones. Before turning to more details about the technical aspects of using cfengine, a brief consideration of the most important underlying and guiding theoretical concepts is in order.

2.1 Fundamental CFEngine Concepts

As we've stated, cfengine operates on hosts in order to bring their configurations in line with the specified policies. We need to define some terms.

Host: A host is a single computer that runs an operating system like Unix, Linux or Windows. We will sometimes talk about machines too, and a host can also be a virtual machine supported by an environment such as VMware or Xen/Linux.
Policy: This is a specification of what we want a host to be like, or how we want it to behave. A policy is essentially a piece of documentation that describes technical details and characteristics. CFEngine implements policies that are specified via directives of the sort we just considered.
Configuration: The configuration of a host is the actual state of its resources, e.g. the permissions and contents of files, the inventory of software installed, etc. It is the “state of affairs” on a particular host at a given time.

What are we aiming for with cfengine? The answer is: policy conformant configuration. We want to formulate a specification of not just one host, but usually many, including how they all interact, perhaps to solve a business problem; then we want to leave the details, implementation and maintenance to a robot agent: cfagent.

Humans are good at understanding input and thinking up solutions, but they are not very reliable at implementation: doing things consistently, time after time. Machines and software agents are good at carrying out tasks reliably, but are not good at understanding or finding actual solutions. With cfengine, you let the distinct parts of your human-computer organization concentrate on what they are each good at.

CFEngine can also produce reports about systems for monitoring the performance and compliance with policies. This is an important aspect of business integration as service providers want to know whether they are delivering what they have promised, and whether their money has been spent wisely.

2.1.1 Promises, Actions and Operations

Cfengine's philosophy fits quite well with the service oriented approach to computing.

A cfengine policy can be thought of as a list of promises which the system makes to some auditor about its configuration. Most of these promises involve the possibility of change to make a host fulfil its policy promises. We call such changes actions or operations. As you probably already guessed, the auditor in this scenario is part of cfengine itself, namely cfagent. Cfagent is also the mechanic or surgeon that performs the operations on the system, if it does not meet its promises.

By describing its operation in this manner, we can think of configuration management as a service that is provided, a service that is intimately connected with monitoring and maintenance, and which can be “bought” on demand without necessarily subordinating a system to a central authority.

Operation: A unit of change is called an operation. CFEngine deals with changes to a system, and operations are embedded into the basic sentences of a cfengine policy. They tell us how policy constrains a host, in other words, how we will prevent a host from running away.

For example, here is a promise about the attributes of a file:

     files:
         /etc/passwd mode=a+r,go-w owner=root group=root action=fixall

There are implicit operations (actions) in this declaration: specifically, the operations that will change the attributes if/when they do not conform to this specification.
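
To make the idea of an implicit operation concrete, here is a small Python sketch of a single promise about a file's permission bits. It is an illustration under stated assumptions, not cfengine's implementation: the `keep_mode_promise` helper is hypothetical, it uses an absolute octal mode rather than the symbolic `a+r,go-w` form shown above, it ignores ownership, and it assumes a POSIX file system.

```python
# Illustrative sketch only: a hypothetical helper, assuming POSIX permissions.
import os
import stat
import tempfile

def keep_mode_promise(path, wanted_mode):
    """Repair the permission bits of path if they deviate from wanted_mode.

    Returns True if an operation (a change) was needed, False otherwise.
    """
    current_mode = stat.S_IMODE(os.stat(path).st_mode)
    if current_mode == wanted_mode:
        return False          # promise already kept: nothing to do
    os.chmod(path, wanted_mode)
    return True               # promise repaired by an implicit operation

with tempfile.TemporaryDirectory() as scratch:
    target = os.path.join(scratch, "example")
    with open(target, "w") as handle:
        handle.write("demo\n")
    os.chmod(target, 0o600)   # start from a state that deviates from the promise

    first_run = keep_mode_promise(target, 0o644)
    second_run = keep_mode_promise(target, 0o644)
    final_mode = stat.S_IMODE(os.stat(target).st_mode)

print(first_run, second_run, oct(final_mode))
```

The declaration states only the desired attributes; whether any change is carried out depends entirely on the state in which the file is found.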

2.1.2 Convergence

A key property of cfengine is convergence. This is an important characteristic that distinguishes it from general computer languages. It is a property that helps to prevent systems from diverging: running away in an uncontrollable fashion.

Convergence: An operation is convergent if it always brings the configuration of a host closer to its ideal, policy-conformant state and has no effect if the host is already in that state. We shall sometimes call it a “correct state” or a “healthy state,” using the metaphor that a badly configured host is suffering from a kind of sickness.

Here is an example used during the editing of an ASCII file:

     editfiles:
         ...
         AppendIfNoSuchLine "Important configuration line"

This operation tells cfengine to append the given text to the end of a file, only if it is not already there. The policy-conformant configuration is therefore that the line is present, and once that is achieved nothing more will be done. We say that the operation AppendIfNoSuchLine is convergent.
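
The same behaviour can be sketched in a few lines of Python. The function below is a hypothetical stand-in that works on a list of lines in memory; it is meant only to show what makes the operation convergent, and it is not how cfengine itself is written.

```python
# Illustrative sketch only: a hypothetical in-memory model of the operation.

def append_if_no_such_line(lines, wanted):
    """Return the lines with `wanted` appended, unless it is already present."""
    if wanted in lines:
        return list(lines)            # already in the desired state: no change
    return list(lines) + [wanted]

def blind_append(lines, wanted):
    """A non-convergent contrast: appends every time it is run."""
    return list(lines) + [wanted]

config = ["first setting", "second setting"]
wanted = "Important configuration line"

once = append_if_no_such_line(config, wanted)
twice = append_if_no_such_line(once, wanted)
blind_twice = blind_append(blind_append(config, wanted), wanted)

print(once == twice)                      # True
print(len(twice), len(blind_twice))       # 3 4
```

Running the convergent operation a second time changes nothing, whereas the blind append keeps adding lines for as long as it is repeated.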

Don't underestimate the value of convergence. It provides you with stability. Because cfengine's language interface strongly discourages you from doing anything non-convergent, it also helps to prevent mistakes. The price is that you will have to learn to think in a convergent way, and that is new for most people who come to cfengine for the first time.

2.1.3 One or Many Hosts

One of the features that makes cfengine policies readable is the ability to hide away all of the complex decision-making that needs to be performed by the agent. To realize this ambition, cfengine uses a declarative language to express policy.

A declarative language is simply a structured list of sentences (in the case of cfengine, it is a list of policy promises). It is stated in no particular order; it describes a final goal that is to be achieved. The details of how one gets there are left implicit: to be evaluated and implemented by the engine that interprets the specification. This is in contrast to procedural or imperative languages, such as shell or Perl, which micro-manage every step along the way.

In an imperative language, one focuses on the procedure. In a declarative language, one focuses on the intention, or the presumed result.

One example of this is the use of classes in cfengine. Classes are a way of making decisions, without writing many “if-then-else” clauses. A class is an identifier which has the value “true” when a particular test is true. It is a Boolean variable; if you like, it caches the result of an “if” test. The benefit of classes is that all of the testing can be hidden away in the bowels of cfengine, and only the results need be visible if or when they are needed.

Classes: A class is a way of slicing up and mapping out the complex environment of one or more hosts into regions that can then be referred to by a symbol or name. They describe scope: where something is to be constrained.

For example, the class debian is true if and only if cfagent is running on a host that has Debian Linux as its operating system.
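
The way classes select which promises apply on a given host can be modelled very simply. In the Python sketch below, the set of defined classes is written out by hand, whereas in cfengine the tests that define them are hidden inside the agent; the `applicable_promises` helper is hypothetical, and the promise strings are taken from the copy stanza of the earlier example.

```python
# Illustrative sketch only: the helper is hypothetical and the set of
# defined classes is hard-coded, where cfengine would discover it itself.

def applicable_promises(policy, defined_classes):
    """Return the promises whose guarding class is true on this host."""
    return [promise for class_name, promise in policy if class_name in defined_classes]

# Each promise is guarded by a class name, as in "solaris::" and "linux::".
policy = [
    ("solaris", "/config/pam/solaris server=pammaster dest=/etc/pam.d"),
    ("linux", "/config/pam/common-auth server=pammaster dest=/etc/pam.d/common-auth"),
]

# On a Debian Linux host, both of these classes would be true.
on_debian_host = applicable_promises(policy, {"linux", "debian"})
on_solaris_host = applicable_promises(policy, {"solaris"})

print(on_debian_host)
print(on_solaris_host)
```

The policy itself contains no “if-then-else” logic at all: the same text is given to every host, and each host keeps only the promises that fall within its own scope.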

2.1.4 Voluntary Cooperation

It is a fundamental property of cfengine components that every host retains its individual autonomy. A host can always opt out of cfengine-based governance if its administrator wants to. This principle leads to a fundamental design and implementation decision:

Autonomy: No cfengine component is capable of receiving information that it has not explicitly asked for itself, nor can it be advised or commanded by an outside agent without requesting such advice.

It is important to understand what this means. It does not mean that centralized control of hosts cannot be achieved. Centralized control is the way that most users choose to use cfengine. Indeed, all you have to do to achieve centralized control is to make a policy decision for all your hosts to fetch policy specifications from a central authority.

Autonomy does mean that if your environment has some small groups or sub-cultures with special needs, it is possible for them to retain their special identity. No one claiming to be their self-appointed authority can ride roughshod over their local decisions.

Where does policy come from then? Each host works from a policy specification that cfengine expects to find in a local directory (usually /var/cfengine/inputs on a Unix-like host). If you want your host to be controlled from some central manager or authority, then your policy must contain bootstrapping specifications that say: “it is my decision that I should download and follow the policy specification located at the central manager.”

Each host can turn this policy decision off at any time. This is a key part of the cfengine security model.

2.1.5 Scalability

Cfengine's scalability is at least as good as that of any other system, because it allows for maximal distribution of workload.

Scalable distributed action: Each host is responsible for carrying out checks and maintenance on/for itself, based on its local copy of policy.

This does not mean that you are immune from making bad decisions. For example, network services can always be a bottleneck if you ask 10,000 hosts to fetch something from one place at the same time.

The fact that each cfengine agent keeps a local copy of policy (regardless of whether it was written locally or inherited from a central authority) means that cfengine will continue to function even if network communications are down.
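
Two of the points above can be illustrated with a short Python sketch: spreading fetches over a time window so that many hosts do not contact one server at the same moment, and falling back to the local copy of policy when the network is down. The function names and the hashing scheme are assumptions made for illustration; they are not taken from the cfengine source.

```python
import hashlib


def splay_delay(hostname: str, window_seconds: int) -> int:
    """Map a host name to a stable delay (in seconds) within the window."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds


def update_policy(cached: str, fetch) -> str:
    """Return fresh policy if the fetch succeeds, otherwise the cached copy."""
    try:
        return fetch()
    except OSError:
        # Network failure: carry on with the local copy of policy.
        return cached


def unreachable() -> str:
    """Simulate a policy server that cannot be contacted."""
    raise ConnectionError("policy server unreachable")


# 1000 hypothetical hosts spread their fetches over a five-minute window.
delays = [splay_delay(f"host{i}.example.org", 300) for i in range(1000)]

# With the network down, the agent keeps working from its cached policy.
policy = update_policy("cached policy", unreachable)
```

Because the delay is derived from the host name, each host always picks the same slot, and no coordination between hosts is needed.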

2.2 CFEngine Components

The cfengine software consists of a number of components: separate programs that work together (see figure). The components differ between version 1 and version 2. We shall only discuss cfengine 2 here, as cfengine version 1 is no longer supported, and you are strongly advised to use version 2. In addition, CFEngine version 3 is being developed at the time of writing, but this will take a number of years before it can fully replace version 2. It will incorporate the state of the art in Network and System Administration research, building on all the lessons learned from versions 1 and 2.

The components of cfengine are:

cfagent: Interprets policy promises and implements them in a convergent manner. The agent can use data generated by the statistical monitoring engine cfenvd and it can fetch data from cfservd running on local or remote hosts.
cfexecd: Is a scheduler and wrapper which executes cfagent and logs its output (optionally sending a summary via email). It can be run in daemon (standalone) mode, or it can be run from cron on a Unix-like system.
cfservd: A server daemon that serves file data. It can also be configured to start cfagent immediately on receipt of a connection from cfrun. No actual data can be passed to this daemon.
cfrun: A helper application that polls hosts and asks them to run cfagent if they agree.
cfenvd: A statistical state monitor that collects statistics about resource usage on each host for anomaly detection purposes. The information is made available to the agent in the form of cfengine classes so that the agent can check for and respond to anomalies dynamically.
cfkey: Generates public-private key pairs on a host. You normally run this program only once, as part of the cfengine software installation process.
cfshow: Displays the cfagent database contents in ASCII format, should you ever become interested in its internal memory.
cfenvgraph: Dumps cfenvd's statistical database contents in a form that can be used to plot graphs showing the normal behavior of a host in its environment.

Figure: CFEngine Components and the Connections Between Them

This figure illustrates the relationships among cfengine components on different hosts. On a given system, cfagent may be started by the cfexecd daemon; the latter also handles logging during cfagent runs. In addition, operations such as file copying between hosts are initiated by cfagent on the local system, and they rely on the cfservd daemon on the remote system to obtain remote data.

3 ITIL past and present

The IT Infrastructure Library (ITIL) is a collection of books, in which “best practices” for IT Service Management (ITSM) are described. Today, ITIL can be seen as a de-facto standard in the discipline of ITSM, for which it provides guidelines by its current core titles Service Strategy, Service Design, Service Transition, Service Operation and Continual Service Improvement. ITIL follows the principle of process-oriented management of IT services.

In effect, the responsibilities for specific IT management decisions can be shared between different organizational units, as the management processes span the entire IT organization independently of its organizational partition. Whether this means a centralization or a decentralization of IT management in the end depends on the concrete instances of ITIL processes in the respective scenario.

3.1 ITIL and its versions

ITIL has its roots in the early 1990s, and has since been subject to numerous improvements and enhancements. Today, the most popular release of ITIL is given by the books of ITIL version 2 (often referred to as ITILv2), while the British OGC (Office of Government Commerce), owner and publisher of ITIL, is currently promoting ITIL version 3 (ITILv3) under the slogan “ITIL Reloaded”.

It is important to understand that ITILv3 is not just an improved version of the ITILv2 books, but rather comes with a completely renewed structure, new sets of processes and a different scope with respect to the issue of IT strategies, IT-business-alignment and continual improvement. That is why, in the following, we run through the basics of both versions, highlighting commonalities and differences.

3.1.1 ITIL: Important Foundations

ITIL is based on the paradigm of process-oriented IT Service Management. In addition, ITIL uses the Deming quality circle as a model for continual quality improvement, where quality relates both to the IT services provided and to the management processes deployed to manage these services. According to ITIL, continual improvement means following the method of Plan-Do-Check-Act:

Plan: Plan the provision of high-quality IT services, set up the required management processes for the delivery and support of these services, define measurable goals and the course of action in order to fulfill them.
Do: Put the plans into action.
Check: Measure all relevant performance indicators, and quantify the achieved quality compared to the quality objectives. Check for potential for improvement.
Act: In response to the measured quality, start activities for future improvements. This step leads into the Plan phase again.

3.1.2 ITILv2 Service Support and Service Delivery

Although ITILv3 was released in the summer of 2007, it is its predecessor that has achieved great acceptance amongst IT service providers all over the world. And because the international ISO/IEC 20000 standard emerged from the basic principles and processes of ITILv2, it is this version that enjoys the widest distribution and popularity.

The core modules of ITILv2 are the books entitled Service Support and Service Delivery. While the Service Support processes (e.g. Incident Management, Change Management) aim at supporting day-to-day IT service operation, the Service Delivery processes (e.g. Service Level Management, Capacity Management, Financial Management) are supposed to cover IT service planning like resource and quality planning, as well as strategies for customer relationships or dealing with unpredictable situations.

3.1.3 ITILv3 Management from the Service Life Cycle Perspective

In 2007, ITILv2 was replaced by its successor ITILv3, which aims to cover the entire service life cycle from a management perspective and strives for a more substantiated idea of IT business alignment. Many of the ITILv2 processes and ideas have been recycled and extended by various additional processes and principles. The five service life cycle stages according to ITILv3 are:

Service Strategy: Common strategies and principles for customer-oriented, business-driven service delivery and management
Service Design: Principles and processes for the stage of designing new or changed IT services
Service Transition: Principles and processes to ensure quality-oriented implementation of new or changed services into the operational environment
Service Operation: Principles and processes for supporting service operation
Continual Service Improvement: Methods for planning and achieving service improvements at regular intervals

3.2 Service orientation and ITIL

Why service and process orientation? What is ITIL trying to do? As we mentioned in the introduction, the “military” control view of human organization fell from favour in business research in the 1980s and service oriented autonomy was identified as a new paradigm for levelling organizations: getting rid of deep hierarchies that hinder communication, and opening up communication directly.

If one is cynical, one can see in ITIL the signs of CEOs nervously trying to put some of the military thinking back into process management, with definitions of authority and chains of responsibility. However, these chains are short, and whenever ITIL says “committee”, promise theory would say that all we need is a single agent (a human or computer) whose internal details do not matter. We should probably not think too literally about ITIL's choice of words, which after all were born from a particular kind of corporate culture and will not appeal to everyone.

If we look at ITIL through the eyeglass of a hierarchical organization, some of its procedures could be seen as restrictive, throttling scalable freedoms. We do not believe that this is their intention. Rather ITIL's guidelines try to make a predictable and reliable face for business and IT operations so that customers feel confidence, without choking the creative process that lies behind the design of new services.

3.2.1 CFEngine in ITIL clothes?

CFEngine users are interested in the ability to manage, i.e. cope with system configuration in a way that enables a business or other organization to do its work effectively. They don't want reams of human management because this is what cfengine is supposed to remove. To be able to use ITIL to help in this task, we have to first think of the process of setting up as a number of services. What services are these? We have to think a little sideways to see the relationship.

Service - providing a sensible configuration policy, responding to discovered problems or the needs of end-users.
Change - an edit of the configuration policy, with appropriate quality controls.
Release - a new configuration policy, consisting of many changes. A new version of cfengine? This could be a major and disruptive change so it should be planned carefully.
Capacity - having enough resources for cfservd to answer all queries in a network. Having enough people to support the processes of deploying and following cfengine's progress.

You should keep this kind of thinking in mind, and train yourself to see every part of a task in “ITIL clothes”.

3.2.2 ITIL processes

The following management processes are in scope of ITILv3:

Service Level Management: Management of Service Level Agreements (SLAs), i.e. service level and quality promises.
Service Catalogue Management: Deciding on the services that will be provided and how they are advertised to users.
Capacity Management: Planning and provision of adequate business, service and resource capacities.
Availability Management: Resource provision and monitoring of service, from a customer viewpoint.
Continuity Management: Development of strategies for dealing with potential disasters.
Information Security Management: Ensuring a minimum level of information security throughout the IT organization.
Supplier Management: Maintaining supplier relationships.
Transition Planning and Support: Ensuring that new or changed services are deployed into the operational environment with minimal impact on existing services.
Asset and Configuration Management: Management of IT assets and Configuration Items.
Release Management: Planning, building, testing and rolling out hardware and software configurations.
Change Management: Assessment of current state, authorization and scheduling of improvements.
Service Validation and Testing: Ensuring that services meet their specifications.
Knowledge Management: Organizing and integrating experience and methodology for future reference.
Incident Management: Responding to deviations from acceptable service.
Event Management: Efficient handling of service requests and complaints.
Problem Management: Problem identification by trend analysis of incidents.
Request Fulfillment: Fulfilling customer service requests.
Access Management: Management of access rights to information, services and resources.

3.2.3 Service Strategy

Service strategy is about deciding what services you want to formalize. In other words, what parts of your system administration tasks can you wrap in procedural formalities to ensure that they are carried out most excellently?

3.2.4 Service Design

Service design is about deciding what will be delivered, when it will be delivered, how quickly the service will respond to the needs of its clients etc. This stage is probably something of a mental barrier to those who are not used to service-oriented thinking.

3.2.5 Service Operation

How shall we support service operation? What resources do we need to provide, both human and computer? Can we be certain of having these resources at all times, or is there resource sharing taking place? If services are chained into “supply chains”, remember that each link of the chain is a possible delay, and a possible misunderstanding. Successfully running services can be a more complex task than we expect, and this is why it is useful to formalize them in an ITIL fashion.

3.2.6 Continual Service Improvement

Continual improvement is quite self-explanatory. We are obviously interested in learning from our mistakes and improving the quality and efficiency with which we respond to service requests. But it is necessary to think carefully about when and where to introduce this aspect of management. How often should we revise our plans and change procedures? If this is too often, the overhead of managing the quality becomes one of the main barriers to quality itself! Continual has to mean regular on a time-scale that is representative for the service being provided, e.g. reviews once per week, or once per month. No one can tell you about your needs. You have to decide this from local needs.

3.3 Tool Support

In the field of tool support for IT Service Management according to ITIL, various white papers and studies have been published. In addition, there are papers available from BMC, HP, IBM and other vendors that describe specific (commercial) solutions. Generally, the market for tools is growing rapidly, since ITIL is gaining increasing attention, especially in large and medium-size enterprises. Today, it is already hard to keep track of the variety of functionality that different tools provide. This makes it even more difficult to approach this topic in a way that satisfies the entire community of researchers, vendors and practitioners.

That is why this document follows a different approach: instead of thinking of ever new tools and computer-aided solutions for ITIL-compliant IT Service Management, it analyses how the existing and well-established technologies used for traditional systems administration can fit into an ITIL-driven IT management environment, and it guides potential practitioners in integrating a respective tool suite (namely cfengine) with ITIL and its processes.

To avoid any misunderstanding: we do not argue that cfengine, originally invented for configuring distributed hosts, may be deployed as a comprehensive solution for automating ITIL. We do believe that cfengine and its more recent innovations can bridge the gap between the technology of distributed systems management and business-driven IT Service Management. To make the case we must show:

How ITIL terminology relates to the terminology of cfengine and hence to a traditional system administrator's language, and
Which parts (processes and activities) of ITIL can be (partially) supported by cfengine, and how.

These are the main goals of the subsequent chapters.

4 ITIL and cfengine comparison

To summarize the results of the previous chapters, it can be said that the goals of ITIL and the purpose of cfengine are quite different: ITIL gives recommendatory guidance in process- and service-oriented IT Service Management, while cfengine provides a powerful solution framework for a variety of common network and systems administration tasks. In other words:

The scope of ITIL is much broader than traditional systems administration, but: Portions of systems administration and configuration management tasks take place in the context of certain ITIL processes.
CFEngine was not designed to replace ITSM tools like trouble ticket systems (TTS), workflow management or CMDBs, but: in the more technical areas of IT Service Management, cfengine is able to support ITIL processes in their activities.

The goal of this document is to give an overview on how cfengine can be used to support selected IT Service Management tasks according to ITIL.

4.1 Which ITIL processes apply to cfengine?

In version 2, ITIL divides itself into service support and service delivery. For instance, service support might mean having a number of cfengine experts who can diagnose problems, or who have sufficient knowledge about cfengine to solve problems using the software. It could also mean having appropriate tools and mechanisms in place to carry out the tasks. Service delivery is about how these people make their knowledge available through formal processes, how available are they and how much work can they cope with? CFEngine enables a few persons to perform a lot of work very cheaply, but we should not forget to track our performance and quality for the process of continual improvement.

Service support is composed of a number of issues:

Incident management: collecting and dealing with incidents.
Problem management: root cause analysis and designing long term countermeasures.
Configuration management: maintaining information about hardware and software and their interrelationships.
Change management: implementing major sequenced changes in the infrastructure.
Release management: planning and implementing major “product” changes.

Although the difference between change management and release management is not completely clear in ITIL, we can think of a release as a change in the nature of the service, while change management deals with alterations possibly still within the scope of the same release. Thus a release is a more major change.

Service delivery, on the other hand, is dissected as follows:

Service Level Management
Capacity management
Availability management
IT service continuity management
Financial management for IT services

These issues are somewhat clearer once we understand the usage of the terms “problem”, “service” and “configuration”. Once again, it is important that we don't mix up configuration management in ITIL with configuration management as used in Unix parlance.

The notion of system administration in the sense of Unix does not exist in ITIL. In the world of business, reinvented through the eyes of ITIL's mentors, system administration and all its functions are wrapped in a model of service provision.

4.1.1 ITIL Configuration Management (CM)

Perhaps the most obvious example is the term configuration management.

Configuration Management: The process (and life-cycle) responsible for maintaining information about configuration items (CI) required to deliver an IT service, including their relationships.

As we see, this is comparable to our intuitive idea of “asset management”, but with “relationships” between the items included. ITIL also defines “Asset Management” as “a process responsible for tracking and reporting the value of financially valuable assets”, which is a component of ITIL Configuration Management.

In the cfengine world, configuration management involves planning, deciding, implementing (“base-lining”) and verifying (“auditing”) the inventory. It also involves maintaining the security and privacy of the data, so that only authorized changes can be made and private assets are not made public.

In this document we shall try not to mix the ITIL concept with the more prosaic system administration notion of a configuration which includes the current state of software configuration on the individual computers and routers in a network.

Since cfengine is a completely distributed system that deals with individual devices on a one-by-one basis, we must interpret this asset management at two levels:

The local assets of an individual device at the level of virtual structures and containers within it: files, attributes, software packages, virtual machines, processes etc. This is the traditional domain of automation for cfengine's autonomic agent.
The collective assets of a network of such devices.

Since a single host can be thought of as a network of assets connected through virtual pathways, it really isn't such a huge leap to see the whole network in a similar light. This is especially true when many of the basic resources are already shared objects, such as shared storage.
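
The idea of configuration items plus their relationships, viewed at both the host level and the network level, can be sketched as a small graph structure. The following Python is a minimal illustration under our own assumptions; the class ConfigurationDB, its methods and the item names are hypothetical and do not correspond to any cfengine or ITIL tool.

```python
from collections import defaultdict


class ConfigurationDB:
    """A toy store of configuration items (CIs) and their relationships."""

    def __init__(self):
        self.items = {}                     # CI name -> attributes
        self.relations = defaultdict(set)   # CI name -> {(relation, other CI)}

    def add_item(self, name, **attributes):
        self.items[name] = attributes

    def relate(self, source, relation, target):
        if source not in self.items or target not in self.items:
            raise KeyError("both configuration items must exist")
        self.relations[source].add((relation, target))

    def dependents_of(self, target):
        """Names of items that have any relationship pointing at target."""
        return sorted(
            source
            for source, rels in self.relations.items()
            if any(other == target for _, other in rels)
        )


cmdb = ConfigurationDB()

# Local assets of an individual device.
cmdb.add_item("host:web1", kind="host")
cmdb.add_item("pkg:apache", kind="package")
cmdb.relate("pkg:apache", "installed_on", "host:web1")

# Collective assets of the network, including a shared resource.
cmdb.add_item("host:web2", kind="host")
cmdb.add_item("storage:nfs1", kind="shared storage")
cmdb.relate("host:web1", "mounts", "storage:nfs1")
cmdb.relate("host:web2", "mounts", "storage:nfs1")

# Which items are affected if the shared storage fails?
affected = cmdb.dependents_of("storage:nfs1")
```

The same relation mechanism serves both levels: a package installed on a host and a host mounting shared storage are simply edges in one graph.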

4.1.2 CMDB Asset Management

Why bother to collect an inventory of this kind? Is it bureaucracy gone mad, or do we need it for insurance purposes? Both of these things are of course possibilities.

The data in an ITIL Configuration Management Database (CMDB) can be used for planning the future and for knowing how to respond to incidents, in other words for service level management (SLM) and for capacity planning. An organization needs to know what resources it has in order to know whether it can deliver on its promises. Moreover, for finance and insurance it is clearly a sound policy to have a database of assets.

For continuity management, risk analysis and redundancy assessment we need to know how much equipment is in use and how much can be brought in at a moment's notice to solve a business problem. These are a few of the reasons why we need to keep track of assets.

4.1.3 Change management in the enterprise

If we make changes to a technical installation, or even a business process, this can affect the service that customers experience. Major changes to service delivery are often written into service level agreements since they could result in major disruptions. Details of changes need to be known by a help-desk and service personnel.

The decision to make a change is usually more than a single person should take alone. It requires consultation at different levels of process. An advisory board for changes takes on this role, whether it is an informal board that communicates electronically or a physical committee “with six or more legs and no brain”.

4.1.4 Change management vs convergence

We should be especially careful here to decide what we mean by change. ITIL assumes a traditional model of change management that cfengine does not need. ITIL's ideas apply to the management of cfengine's configuration, not the way in which cfengine carries out its work.

In the traditional idea of change management you start by “base-lining” a system, or establishing a known starting configuration. Then you assume that things only change when you actively implement a change, such as “rolling out a new version” or committing a release. This, of course, is very optimistic.

In most cases all kinds of things change beyond our control. Items are stolen, things get broken by accident and external circumstances conspire to confound the order we would like to preserve. The idea that only authorized people make changes is nonsense.

CFEngine takes a different view. It thinks that changes in circumstances are part of the picture, as well as changes in inventory and releases. It deals with the idea of “convergence”. In this way of thinking, the configuration details might be changing at random in a quite unpredictable way, and it is our job to continuously monitor and repair general dilapidation. Rather than assuming a constant state in between changes, cfengine assumes a constant “ideal state” or goal to be achieved between changes. An important thing to realize about including changes of external circumstances is that you cannot “roll back” circumstances to an earlier state – they are beyond our control.
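
The notion of convergence described above can be sketched as a repair operation that is safe to run repeatedly: it changes only what deviates from the ideal state, and does nothing when the system is already compliant. The Python below is a minimal sketch of that idea; the function converge and the example settings are hypothetical and are not how cfagent is implemented.

```python
def converge(actual: dict, desired: dict) -> list:
    """Repair actual in place so that it matches desired.

    Returns the list of settings that had to be repaired.
    """
    repaired = []
    for key, value in desired.items():
        if actual.get(key) != value:
            actual[key] = value
            repaired.append(key)
    return repaired


# The constant "ideal state" that policy promises.
desired = {"/etc/motd mode": "644", "sshd running": True}

# The system starts out partly non-compliant.
state = {"/etc/motd mode": "777", "sshd running": True}

first = converge(state, desired)    # one repair is needed
second = converge(state, desired)   # already converged: nothing to do

# An external circumstance changes the system behind our back.
state["sshd running"] = False
third = converge(state, desired)    # the drift is repaired on the next run
```

Note that there is no notion of rolling back here: each run simply moves the system toward the same goal, whatever happened in between.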

4.1.5 Release management

A release is a collection of authorized changes to a system. One part of Change Management is therefore Release Management. A release is generally a larger umbrella under which many smaller changes are made. It is a major change. Changes are assembled into releases and then they are rolled out.

In fact release management, as described by ITIL, has nothing to do with change management. It is rather about the management of designing, testing and scheduling the release, i.e. everything to do with the release process except the explicit implementation of it. Deployment or rollout describes the physical movement of configuration items as part of a release process.

4.1.6 Incident and problem management

ITIL distinguishes between incidents and problems. An incident is an event that might be problematic, but in general one would observe incidents over some length of time and then diagnose problems based on this experience.

Incident: An event or occurrence that demands a response.

One goal of cfengine is to plan pro-actively to handle incidents automatically, thus taking them off the list of things to worry about.

Problem: A pattern of consequence arising from certain incidents that is detrimental to the system. It is often a negative trend that needs to be addressed.

Changes can introduce new incidents. An integrated way to make the tracking of cause and effect easier is clearly helpful. If we are the cause of our own problems, we are in trouble!
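
The distinction between incidents and problems can be illustrated with a short sketch: incidents are individual events, and a problem is diagnosed when a pattern emerges from them over time. The Python below assumes, purely for illustration, that a problem is any kind of incident that recurs at least a given number of times; the function name, the threshold and the sample data are hypothetical.

```python
from collections import Counter


def find_problems(incidents, threshold):
    """Return the kinds of incident that recur at least threshold times."""
    counts = Counter(kind for _, kind in incidents)
    return sorted(kind for kind, n in counts.items() if n >= threshold)


# Each incident is recorded as (host, kind of event).
incidents = [
    ("web1", "disk full"),
    ("web2", "disk full"),
    ("db1", "process died"),
    ("web1", "disk full"),
]

# A single dead process is an incident; a recurring full disk is a problem.
problems = find_problems(incidents, threshold=3)
```

In practice the trend analysis would take time windows and causes into account, but the principle is the same: problems are inferred from the record of incidents.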

4.1.7 Service Level Management (SLM)

Also loosely referred to as Quality of Service. This is the process of making sure that Service Level Promises are kept, or Service Level Agreements (SLA) are adhered to. We must assess the impact of changes on the ability to deliver on promises.

4.2 ITIL terminology

Like many other areas of wishful standardization, ITIL elevates itself to a state of importance by using a multitude of acronyms and specialized terms. Not all of these are as intuitive as one might hope for and many simply seem beyond necessity. However, to understand the writing, we need to know a few of them and also understand how they differ from similar terms in system administration and the world of cfengine. In the appendix, we list the most important of these terms, with comments. The figure shows a scatter-plot of these terms.

5 Using cfengine to implement ITIL objectives

How does cfengine fit into the management of a service organization? There are several ways:

It offers rapid detection and repair of faults, which helps to avoid formal incidents.
It simplifies the deployment (release) of services.
It allows resources to be understood and planned better.

These properties allow for greater predictability of system services and therefore they contribute to customer confidence.

5.1 Infrastructure or management?

Any tool for assisting with change management lies somewhere between ITIL's notion of change management and the infrastructure itself. It must essentially be part of both (see figure). This applies to cfengine too.

CFEngine is both infrastructure and a part responsible for infrastructure.

CFEngine can manage itself as well as other resources: itself, its software, its policy and the resulting plans for the configuration of the system. In other words, cfengine is itself part of the infrastructure that we might change.

5.2 How can cfengine or promises help an enterprise

5.2.1 Traditional IT Management

Traditional methods of managing IT infrastructure involve working from crisis to crisis: waiting for “incidents” to occur and then initiating fire suppression responses or, if there is time, proactive changes. With cfengine, these can be combined and made into a management service, with continuous service quality.

CFEngine can assist with:

Maintenance assurance.
Reporting for auditing.
Change management.
Security verification.

Promise theory comes with a couple of principles:

Separation of concerns.
Fundamental attention to autonomy of parts.

5.2.2 Modelling policy

Other approaches to discussing organization talk about the separation of concerns, so why is promise theory special? Object Orientation (OO) is an obvious example. Promise theory is in fact quite different to object orientation (which is a misnomer).

Object orientation asks users to model abstract classes (roles) long before actual objects with these properties exist. It does not provide a way to model the instantiated objects that later belong to those classes. It is mainly a form of information structure modelling. Object orientation models only abstract patterns, not concrete organizations.

Promise theory on the other hand considers only actual existing objects (which it calls agents) and makes no presumptions that any two of these will be similar. Any patterns that might emerge can be exploited, but they are not imposed at the outset. Promise theory's insistence on autonomy of agents is an extreme viewpoint from which any other can be built (just as atoms are a basic building block from which any substance can be built) so there is no loss of generality by making this assumption.

In other words, OO is a design methodology with a philosophy, whereas promises are a model for an arbitrary existing system.

5.2.3 Uniformity

The traditional production-line paradigm for management of IT systems involves reducing the number of variations – often simply making all systems identical for mass-production. However, as quoted at the beginning of chapter 2, the purpose of advanced technology is to enable us to cope with variation. CFEngine makes managing variations simple. Some organizations might simply want to have a uniform configuration on all their hardware, but what does this mean if the basic hardware is different?

In cfengine we understand that “similar” should be based on how systems behave, not on what their disk images look like. Two systems that make the same promises ought to behave in the same way, if the promises are at a high enough level. But what if two different operating systems promised never to have a file called /etc/passwd? A Windows machine would not care too much, but a Unix system would be paralyzed.

Promises and system configuration are related: configuration affects behaviour and behaviour is what we promise. Clearly we cannot expect very high level promises to be simply translated into configurations however. The fact that we make promises about system configuration says nothing certain about the promise that results from changing it. That depends on many other factors. Thus we must be careful to think about what a promise means.

Fundamental assumption: The basic assumption of configuration management is that a specific configuration determines the resulting behaviour of a system. This assumption is completely unproven, and is sometimes obviously false. At best there is a correlation between configuration and behaviour. This is what makes IT management challenging. The things we can change do not necessarily give us the control we would like.

5.3 What is maintenance?

Maintenance is a process that ITIL does not formally spend any time on explicitly, but it is central to real-world quality control.

Imagine that you decide to paint your house. Release 1 is going to be white and it is going to last for 6 years. Then release 2 is going to be pink. We manage our painting service and produce release 1 with all of the care and quality we expect. Job done? No.

It would be wrong for us to assume that the house will stay this fine colour for 6 years. Wind, rain and sunshine will spoil the paint over time and we shall need to touch up and even repaint certain areas in white to maintain release 1 for the full six years. Then when it is time for release 2, the same kind of maintenance will be required for that too.

Unless we read between the lines, it would seem that ITIL's answer to this is to wait for a crisis to take place (an incident). We then mobilize some kind of response team. But how serious an incident do we require and what kind of incident response is required? A graffiti artist? A lightning strike? A bird anoints the paint-work? CFEngine is like the gardener who patrols the grounds constantly plucking weeds, before the flower beds are overrun. Call it continual improvement if you like: the important thing is that the process be pro-active and not too expensive.

Maintenance is necessary because we do not control all of the changes that take place in a system. There is always some kind of “weather” that we have to work against. CFEngine is about this process of maintenance. We call it “convergence” to the ideal state, where the ideal state is the specified version release. Keep this in mind as you read about ITIL change management.
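
The idea of convergence can be illustrated with a minimal Python sketch. It is not cfengine code: the `converge` function, the dictionaries and their keys are hypothetical names chosen to mirror the house-painting example. The point is only that a repair touches what has drifted, and that repeating it changes nothing further.

```python
def converge(actual, desired):
    """Repair only the settings that deviate from the desired state.

    Returns the list of keys that had to be repaired. Settings that are
    not mentioned in `desired` are left alone.
    """
    repaired = []
    for key, wanted in desired.items():
        if actual.get(key) != wanted:
            actual[key] = wanted
            repaired.append(key)
    return repaired


# Hypothetical "release 1": the house is white.
release_1 = {"walls": "white", "fence": "white"}

# The weather has been at work since the last run.
house = {"walls": "white", "fence": "faded", "garden": "overgrown"}

first_pass = converge(house, release_1)   # touches up the fence only
second_pass = converge(house, release_1)  # nothing left to do
print(first_pass, second_pass)
```

The second pass finds nothing to repair, which is what we mean by having reached the ideal state. The garden is not covered by the release, so it is not touched.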

5.4 Incident Management vs Maintenance

CFEngine employs the idea of continual maintenance (we paint the fence on a regular basis to protect it). ITIL, on the other hand, moves from release to release (this year we paint the fence red, next year green) and does not recognize the effect of gradual entropic decay of state (the fence's colour fades gradually due to the harsh environment). Instead ITIL deals with events (graffiti and tagging of the fence) which must be corrected. While it is true that these incidents are maintenance, the repairs are more costly to initiate if they occur as exceptional events than if we are used to repainting the fence on a regular basis.

An exemplary Event Management process on the basis of ITIL V3

The figures above show ITIL processes for the handling of events and incidents. They show the aspects of dealing with events that are mainly human oriented, and those events in shaded boxes that can be automated using cfengine.

In the figure above we see that there must be a basic monitor at the top of the process chain which is responsible for observing events. This fits well with the view of promise theory in which a neutral observer is required to measure the state of different component agents in the system. Not all events are necessarily relevant or interesting so we can filter these based on a policy. CFEngine's event monitors come from two sources: cfagent (for monitoring the state of promises which are being managed - e.g. the proverbial colour of the fence) and cfenvd (for passively monitoring the environment - e.g. the brightness of the sunshine or the amount of rainfall impacting on the fence).

CFEngine filters events through its class interface. All events observed in cfengine are classified and made available to the environment.
CFEngine logs events by routing messages to email or to syslog (by asking inform=true or syslog=true or audit=true).
The daemon cfenvd auto-correlates events. The tool cfbrain will cross correlate events, further classifying the outcomes as part of the environment.
Events can be triggered by attaching promises to event-driven classes in the cfagent configuration. e.g.
```
          processes:

            www_in_high_anomaly::

              "apache" signal=term

          alerts:

            www_in_high_anomaly::

              ShowState(www.in)
```
For more devastating incidents, we can arrange for more information to be output. An incident is really only an event of some special significance. Diagnosing an incident requires either human intervention or pre-cached insight on the part of the promises we make. If we can make a specific promise then the diagnosis that this promise has not been kept can easily be turned into a specific repair. For example, we might note a sudden burst of smtp traffic, or a sudden decrease in free disk space. These events can be anticipated if one knows a benign cause, such as email having been shut down for maintenance, or the host being a new mail-server that has never seen traffic before.
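
The relationship between measurements, classes and promises can be sketched in Python. This is only an illustration of the filtering idea and makes no claim about how cfenvd actually computes anomalies: the threshold of two standard deviations, the function `classify`, the sample numbers and the class names other than `www_in_high_anomaly` are all assumptions made for the example.

```python
from statistics import mean, stdev


def classify(name, history, value, threshold=2.0):
    """Return a hypothetical class name describing `value` relative to history."""
    average = mean(history)
    spread = stdev(history)
    if value > average + threshold * spread:
        return f"{name}_high_anomaly"
    if value < average - threshold * spread:
        return f"{name}_low_anomaly"
    return f"{name}_normal"


# Promises are attached to classes, not to raw measurements.
promises = {
    "www_in_high_anomaly": ["signal apache with term", "show state of www.in"],
}

history = [100, 110, 95, 105, 90]  # made-up counts of incoming www connections

quiet = classify("www_in", history, 102)
burst = classify("www_in", history, 400)
print(quiet, promises.get(quiet, []))
print(burst, promises.get(burst, []))
```

Only the burst maps to a class that has promises attached, so only the burst leads to any action. Everything else is filtered out as uninteresting.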

An exemplary Incident Management process on the basis of ITIL V3

5.5 Rollout and installation

When setting up hosts, ITIL actually makes a technical recommendation. This is unusual for ITIL as it generally does not get mixed up in the details of management, only the processes. ITIL recommends “base-lining” systems from a gold server, i.e. a system that is thought to be “perfect” enough to act as a model for all other systems. Once a server has been base-lined from the golden image, various customizations can be made relative to this known state. ITIL sees this as a way of achieving consistency.

We believe that ITIL exceeds its technical competence in making this a recommendation. True enough, this has traditionally been a way of performing a rollout, but the approach has been superseded by better technology. The gold server approach is not the recommended cfengine way. In fact a golden-image approach wastes a fundamental flexibility that cfengine offers, namely the possibility to allow variations (see the quote by Alvin Toffler at the start of chapter 2).

When we baseline a system from a gold-server, we are planning to make all hosts basically the same. However, this is neither necessary nor cost-saving if you use cfengine.

CFEngine places no restrictions on the approach used to roll out hosts. Rather than requiring you to start from a known state, it allows you to specify the final state for any initial state. This means you can migrate hosts gradually to a policy state without having to reinstall them. We can consider the end result of a cfengine policy process to be “the release”. In cfengine this is equivalent to a sufficiently comprehensive configuration policy.

The message here is that cfengine allows you to achieve predictable results without the need for a gold server. Nevertheless, it is helpful to begin a system based on a reliable substrate. It is like making a good sandwich: it helps to have a perfect piece of bread to build on, but it's what you put on top that is most important. You just need to know what you are starting with, and then most things can be fixed to satisfaction. We recommend:

Start with some kind of standard image (a predictable substrate).
It does not necessarily matter what it is as long as it behaves predictably, e.g. install from a known DVD, from net-boot or even from a gold server.
Customize the basic working system using cfengine. There are two possible approaches to this:
i) Copy constant “gold” overlays or patches into place from a trusted source to customize the system.
- Add more operating system packages.
- Insert special files (config, data etc).
- Run post-processing scripts.
ii) Edit system directly with cfengine
- Documented automatically by cfagent promises.
- Can always customize after that too (phase 3)!

5.5.1 Customize by constant/fixed “gold” overlay

The first alternative is to install a fixed patch to a system from a known gold-server. The basic pattern is this:

     copy:

       /source/file

           dest=/dest/file
           server=gold_server

In this example, we simply install a new file into a known location. This is the simplest way of customizing a host, but it lacks flexibility.

5.5.2 Overlay an expandable template with cfengine

A more sophisticated approach is to download a parameterized template from a repository or gold server. This template contains context dependent variables that can be expanded in situ by cfengine. There are two stages to this: first we copy the template to a temporary location, then we edit the final file location, insert the template and expand its variables. By following this procedure, the result satisfies cfengine's principles of convergence.

     copy:

     /source/file
           dest=/tmp/file
           server=gold_server

     editfiles:

     { /dest/file

     EmptyEntireFilePlease
     InsertFile "/tmp/file"
     ExpandVariables
     }
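
The two stages can be mimicked in a few lines of Python to show why the result is convergent. This is a sketch, not cfengine's implementation: the helper names, the template text and the choice to leave unknown variables unexpanded are assumptions made for illustration.

```python
import re


def expand_variables(template, variables):
    """Replace each $(name) with its value; leave unknown names as they are."""
    def substitute(match):
        return variables.get(match.group(1), match.group(0))
    return re.sub(r"\$\((\w+)\)", substitute, template)


def install_from_template(template, variables, current):
    """Return (wanted_content, changed) for the destination file."""
    wanted = expand_variables(template, variables)
    return wanted, wanted != current


# Hypothetical template and host-specific variables.
template = "ServerName $(host).$(domain)\nServerAdmin $(admin)\n"
variables = {"host": "www", "domain": "example.org"}

content, changed_first = install_from_template(template, variables, current="")
_, changed_second = install_from_template(template, variables, current=content)
print(content, changed_first, changed_second)
```

Because the destination is always rebuilt from the same template and the same variables, the second run finds nothing to change.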

5.5.3 Direct customization by cfengine

A final approach to customization is to apply direct editing operations to implement the required customization.

     editfiles:

     { /dest/file

     ReplaceAll "X" With "Y"
     AppendIfNoSuchLine "ABC"
     }

This approach is useful for small corrections that require only unsophisticated editing, but it quickly becomes cumbersome for more complex tasks.
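
The property that makes such edits safe to repeat can be shown with a small Python sketch. The functions below only imitate the two operations used above: they treat "X" as a plain substring rather than a pattern, which is a simplifying assumption, and their names are not part of cfengine.

```python
def replace_all(lines, old, new):
    """Replace every occurrence of `old` in every line."""
    return [line.replace(old, new) for line in lines]


def append_if_no_such_line(lines, wanted):
    """Append `wanted` only if no identical line is already present."""
    if wanted in lines:
        return list(lines)
    return list(lines) + [wanted]


def edit(lines):
    """Apply both operations in the order used in the example."""
    return append_if_no_such_line(replace_all(lines, "X", "Y"), "ABC")


once = edit(["aXb", "c"])
twice = edit(once)
print(once, twice == once)
```

Applying the edit a second time leaves the file as it was after the first pass, so the rule can be checked on every run without accumulating changes.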

5.6 Change Management in ITIL

ITIL proposes that there should be an integrated approach to change and configuration management. Clearly changes to a system result in new configurations. However, changes can also be unplanned involuntary faults (ITIL discusses these as incidents).

ITIL does not want unplanned changes; however, we know that they happen. CFEngine does not normally elevate deviations from policy to the level of an incident; it simply fixes problems immediately. However, we do not always have enough information about changes to allow cfengine to make repairs, so we need a way of monitoring for unexpected change.

Change management in cfengine is a subtle topic, because cfengine does not fully subscribe to the model of change that ITIL does. In cfengine's view of the world, all changes are changes no matter how or why they occur. In ITIL's world view, there are planned changes, there are releases and there are “incidents”.

ITIL therefore distinguishes between planned and unplanned changes that affect service delivery. CFEngine on the other hand cares only about what promises have been made about the system and whether or not these have been kept.

CFEngine can detect changes because it effectively performs a constant audit of the system's promises. We should understand cfengine's change detection in two ways, as changes that impact the performance or quality of services:

with respect to the quality of the system configuration service (i.e. cfengine's service)
with respect to the quality of services supported by the system configurations (e.g. other services like web services)

To cfengine, changes only matter if they impact the promises that have been codified as policy. Even events that cfengine calls “anomalies”, detected and classified continuously, are only considered interesting if policy determines them to be restricted, thus every single state change can be considered either “within tolerances” (insignificant) or “out of tolerance” (significant). ITIL is only a heuristic set of guidelines and is not technically sophisticated enough to be able to make this kind of distinction.

Let's make an approximate mapping between ITIL concepts and cfengine change and then comment critically on it.

ITIL: CFEngine
Incident: Promise not kept
Change: Configuration version/content update
Release: Policy change

5.6.1 Software packaging in ITIL

ITIL considers releases to be entire integrated systems that are versioned. Most operating systems work at a smaller level of granularity than this. Software version control uses package managers to version individual software packages. Although such package managers resolve dependencies, they do not version entire conglomerates of software. Software comes in large packages for two main reasons:

Operating system installation (all or nothing).
Functional role adaptation (specialized workstation).

Different organizational roles require different ITIL services to support them, and hence different software to deliver them.

CFEngine deals with versioned data management in two ways:

File copying from master source (by date-stamp or checksum).
Package installation and verification (using local package managers).

Package managers handle the installation and update of packages easily, but they do not always add institutional adaptational control in a way that can be tied into a classification of hosts in an organization's network. CFEngine can use its classification of hosts to customize further. We simply attach relevant clusters of packages to different classes of host to ensure that specific workstations are properly adapted to service their tasks.

Not all software comes in operating system (vendor/provider) approved packages, but cfengine can also handle software that is zipped, tar-ed or bundled in any other manner.

The following example policies illustrate some of the copy rule type's capabilities, including some of the options we just considered:

     control:
         DefaultCopyType = ( mtime )
         SplayTime = ( 15 )
         sourcehost = ( source.cfengine.org )

     copy:
     #   Copy dat/doc files if not too big
         /usr/local/data dest=/archive/data
           include=*.dat include=*.doc exclude=test.*
           recurse=inf backup=false size<500m

     #   Retrieve configuration file from master
         /depot/hosts.deny server=$(sourcehost)
           dest=/etc/hosts.deny owner=root group=0 mode=644
           backup=off force=on timestamps=keep

     #   Transmit shadow password file encrypted
        /depot/shadow server=$(sourcehost) dest=/etc/shadow
           owner=0 group=0 mode=600 encrypt=true

The first rule specifies that .dat and .doc files within the /usr/local/data directory tree be copied to /archive/data, provided that the source files have been modified more recently than their counterparts in the target directory and that they are smaller than 500 MB. Files whose names match test.* are excluded. Existing files will be overwritten without being saved.

The second rule unconditionally replaces the local /etc/hosts.deny file with one from the system source.cfengine.org, retaining the timestamps from the source file. This rule also specifies the ownership and mode for the target file.

The third rule is similar to the second one, retrieving another file from the same remote system. In this case, however, the file will be copied only when the remote file is more recent than the local copy. When the file is copied, the previous version will be retained, and the file contents will be encrypted as they are transmitted across the network.
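
The decision made by the first rule can be written out as a short Python sketch. It is not how cfagent is implemented: the function name and arguments are hypothetical, and reading `500m` as 500 × 1024 × 1024 bytes is an assumption.

```python
from fnmatch import fnmatchcase

MAX_SIZE = 500 * 1024 * 1024  # assumed meaning of size<500m


def should_copy(name, size, src_mtime, dest_mtime,
                include=("*.dat", "*.doc"), exclude=("test.*",)):
    """Decide whether one source file qualifies for copying.

    `dest_mtime` is None when the destination file does not exist yet.
    """
    if not any(fnmatchcase(name, pattern) for pattern in include):
        return False
    if any(fnmatchcase(name, pattern) for pattern in exclude):
        return False
    if size >= MAX_SIZE:
        return False
    # mtime copy type: copy only if the source is newer than the destination.
    return dest_mtime is None or src_mtime > dest_mtime


print(should_copy("results.dat", 1024, src_mtime=200, dest_mtime=100))
print(should_copy("test.dat", 1024, src_mtime=200, dest_mtime=100))
```

A file is copied only if it passes every filter and is newer than its counterpart, so running the rule again immediately afterwards copies nothing.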

CFEngine can also automate software package management and installation. Policies for these items are specified in the packages stanza. Here are some examples:

     control:    # Define package manager & install command
       linux::  DefaultPkgMgr = ( rpm )
       redhat:: RPMInstallCommand = ( "/usr/sbin/up2date %s" )
       suse::   RPMInstallCommand = ( "/usr/sbin/yast2 -i %s" )

     packages:
         nagios version=2.4 cmp=ge
         pstree action=install

The settings in the control section specify the package management software that is in use as well as the command used to install a software package. These directives illustrate the use of operating system-based classes within policies for defining a different installation command for different Linux distributions.

In the packages stanza, the first rule checks whether Nagios is installed. A warning will be generated if the package is not present at all or if the installed version is earlier than version 2.4. The second rule checks for the pstree package, and installs it if it is not present on the system.
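
The comparison behind `version=2.4 cmp=ge` can be sketched in Python. This is an illustration only: the real comparison is made using the local package manager's own rules, and the sketch assumes purely numeric, dot-separated version strings. The function names are hypothetical.

```python
def parse_version(text):
    """Turn '2.10' into (2, 10) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def check_at_least(installed, name, minimum):
    """Return None if the promise is kept, otherwise a warning string."""
    if name not in installed:
        return f"{name} is not installed"
    if parse_version(installed[name]) < parse_version(minimum):
        return f"{name} {installed[name]} is older than {minimum}"
    return None


print(check_at_least({"nagios": "2.10"}, "nagios", "2.4"))
print(check_at_least({"nagios": "2.3"}, "nagios", "2.4"))
print(check_at_least({}, "nagios", "2.4"))
```

Comparing the parts as numbers matters: as plain strings, "2.10" would sort before "2.4" and a newer package would be reported as too old.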

The following parameterized method-promise installs its first argument in the prefixed location given by the second argument. It collects the tar file, unpacks it, configures and compiles it, then tidies its files.

     #
     # Build GNU sources and install
     #

     control:

        actionsequence = ( methods )

     methods:

        InstallTar(cfengine-2.1.0b7,/local/gnu)

           action=cf.install
           returnvars=null
           returnclasses=null
           server=localhost

We must install the method in the trusted modules directory (normally /var/cfengine/modules or WORKDIR/modules).

     #
     # cf.install
     #

     control:

      MethodName       = ( InstallTar )
      MethodParameters = ( filename prefix )
      path = ( /usr/local/gnu/bin )
      TrustedWorkDir = ( /tmp )
      TrustedSources = ( /depot )
      TrustedSourceServer = ( localhost )

      actionsequence = ( copy editfiles shellcommands tidy )

     copy:

      $(TrustedSources)/$(filename).tar.gz
         dest=$(TrustedWorkDir)/$(filename).tar.gz
         server=$(TrustedSourceServer)

     shellcommands:

      "$(path)/tar zxf $(filename).tar.gz"  chdir=$(TrustedWorkDir)

      "$(TrustedWorkDir)/$(filename)/configure --prefix=$(prefix)"
         chdir=$(TrustedWorkDir)/$(filename)
         define=okay

      okay::

      "$(path)/make"
          chdir=$(TrustedWorkDir)/$(filename)

     tidy:

       $(TrustedWorkDir) pattern=$(filename) r=inf rmdirs=true age=0

5.6.2 Rollback or remediation

The ability to go back to an earlier “release” or state is often referred to as rollback. ITIL calls it remediation. The notion is closely connected with process management, and both ITIL and traditional management techniques value this by default. It is assumed practice.

CFEngine does not encourage rollback however. Why not? Because it requires destructive intervention and cfengine's model is based on on-the-fly change. To go back to a previous state, a system must be stopped, reinitialized (perhaps from backup) and restarted. This requires service to stop and all run-time state is lost.

CFEngine's approach to this would be to revert policy to its previous state. The system would then roll into its desired state (as if going forwards). Nothing would be restored from separate backup media (see figure).

Move from one fixed point to another and back.

The difference here is the assumption about how and when changes occur. A sequence of step by step transitions sounds innocent, but it is unstable to unexpected changes. ITIL and many other change management models assume that no unauthorized changes occur between releases. If they do, they are handled as incidents. By separating releases from incidental changes, we get led into thinking that we can in fact revert by destructive intervention.

In fact reversion has inevitable consequences. We must make a choice.

Revert entire state, except we lose runtime state.
In this case, we essentially revert the entire system from a backup of its saved state. (Some virtual machines can save runtime state for resumption, but this can become stale, e.g. for network connections, as it is meaningless to roll back part of a dialogue.) This operation results in catastrophic change.
Revert managed state, on the fly.
This is cfengine's default behaviour. To go back to a previous state, simply change the policy back to a previous version. This will not necessarily revert the entire state of the system, but everything that is covered by policy will be reverted.

Some tools allow you to rollback without reverting from backup. CFEngine disallows this on principle as it requires human judgement to perform correctly. It cannot be automated without uncertain results. In fact cfengine retains the necessary information to allow managed changes to be reversed to some extent. The point however is that one can only guarantee the content of managed objects, so simply reversing a change will not necessarily take us back to the same state – so we consider this to be fundamentally too risky.
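
The difference between restoring a backup and reverting policy can be made concrete with a small Python sketch. The state dictionary, the two policies and the `converge` helper are all hypothetical. The sketch shows that reverting the policy brings back everything the old policy promised, and nothing else.

```python
def converge(state, policy):
    """Bring every promised setting to its promised value, in place."""
    for key, wanted in policy.items():
        state[key] = wanted


policy_v1 = {"motd": "v1 banner", "port": 80}
policy_v2 = {"motd": "v2 banner", "port": 8080, "cache": "on"}

# "sessions" stands for run-time state that no policy makes promises about.
state = {"motd": "v1 banner", "port": 80, "sessions": 12}

converge(state, policy_v2)   # roll forward to the new policy
state["sessions"] = 15       # the system keeps running meanwhile
converge(state, policy_v1)   # "roll back" by reverting the policy
print(state)
```

After the reversion, the banner and the port are as promised by the first policy and the run-time state survives. The cache setting introduced by the second policy is still present, because the first policy makes no promise about it: this is why reversing a change does not necessarily return us to the same state.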

5.6.3 Monitoring file changes

CFEngine can monitor absolute and relative states of a system. A simple way to measure relative change is to use a database of checksums.

     control:

       ChecksumUpdates = ( true )
       ChecksumPurge   = ( true )

     files:

       /my/important/files

             recurse=inf
             checksum=md5
             owner=root,daemon
             group=0,1,4

Change monitoring is about detecting when stored data, or other measurable aspects of a computer system, change. A change detection system is not normally concerned with the reason for a change, but if you are monitoring for change then we take it for granted that you expect to find changes that you did not plan yourself.

5.6.4 Hashes and Message Digests

The most important bulk of information on a computer is its filesystem data. Change detection for filesystems uses a technique made famous in the program Tripwire, which collects a “snapshot” of the system in the form of a database of file checksums (cryptographic hashes) and permissions, and rechecks the system against this database at regular intervals. Tripwire examines files, and looks for change in their contents or their attributes. This is a very simple (even simplistic) view of change. If a legitimate change is made to the system, such a system responds to this as a potential threat. Databases must then be altered, or rebuilt.

A cryptographic hash (also called a digest) is an algorithm that reads (digests) a file and computes a single number (the hash value) that is based on the contents. If so much as a single bit in the file changes then the value of the hash will change. You can compute hash values manually, for example:

     host$ openssl md5 cfengine-2.2.4a.tar.gz
     MD5(cfengine-2.2.4a.tar.gz)= 6d2b31c4814354c65cbf780522ba6661

There are several kinds of hash function. The most common ones are MD5 and SHA1. Recently both of these algorithms have been superseded by the newer SHA2. CFEngine supports MD5 and SHA1 and it will support SHA2 as soon as the OpenSSL library supports an interface to the new algorithm.
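
The same computation can be scripted. The following Python sketch uses the standard `hashlib` module rather than cfengine, and the helper name `file_digest`, the file name and its contents are made up for the example. It reads a file in chunks and shows that changing a single bit of the contents produces a different digest.

```python
import hashlib
import os
import tempfile


def file_digest(path, algorithm="md5", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "example.dat")
    with open(path, "wb") as handle:
        handle.write(b"configuration = 1\n")
    before = file_digest(path)
    with open(path, "wb") as handle:
        handle.write(b"configuration = 3\n")  # '1' and '3' differ in one bit
    after = file_digest(path)

print(before)
print(after)
```

Passing `algorithm="sha256"` to the helper selects a SHA2 digest instead.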

5.6.5 Computing hashes or digests

CFEngine has adopted something like the Tripwire model, but with a few provisoes. Tripwire assumes that all change is unauthorized (it makes an incident out of any observed change). CFEngine cannot reasonably take this viewpoint. CFEngine expects systems to change dynamically, so it allows users to define a policy for what changes are considered to be okay.

Integrity checks on files whose contents are supposed to be static are a good way to detect tampering with the system, from whatever source. Running MD5 or SHA1 checksums of files regularly provides us with a way of determining even the smallest changes to file contents.

To use the checksum based change detection we first ask cfengine to collect MD5 hash data for specified files. Here is an excerpt from a cfengine configuration program that would check the /usr/local filesystem for file changes. Note that it excludes files such as log files that we therefore allow to change (log files are supposed to change):

     files:

       /usr/local owner=root,bin,man
                  mode=o-w          # check permissions separately
                  r=inf
                  checksum=best     # switch on change detection
                  action=warnall
                  ignore=logs
                  exclude=*.log

       # repeat for other files or directories

The first time we run this, cfengine collects data and treats all files as “unchanged”. It builds a database of the checksums. The next time the rule is checked, cfagent recomputes the checksums and compares the new values to the `reference' values stored in the database. If no change has occurred, the two should match. If they differ, then the file has changed and a warning is issued.

     cf:nexus: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
     cf:nexus: SECURITY ALERT: Checksum (md5) for /etc/passwd changed!
     cf:nexus: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

This message is designed to be visible. If you do not want the embracing rows of `!' characters, then this control directive turns them off:

     control:

      Exclamation  = ( off )

The next question to ask is: what happens if the change that was detected is actually okay (which is almost always the case in practice)? If you activate this option:

     control:

      ChecksumUpdates = ( on )

Then, as soon as a change has been detected, the database is updated and the message will not be repeated. If this is set to off, which is the default, then warning messages will be printed each time the rule is checked.

New files are automatically detected, as they are not in the database. If you want to be notified when files are deleted, then set the option

     control:

      ChecksumPurge = ( on )
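
The behaviour described in this section can be summarized in a short Python model. It is a sketch under stated assumptions, not cfagent's implementation: files are represented as an in-memory dictionary, the `scan` function and its `update` and `purge` arguments merely imitate ChecksumUpdates and ChecksumPurge, and the warning texts are invented.

```python
import hashlib


def scan(files, database, update=False, purge=False):
    """Compare file contents with stored digests and return the warnings.

    `files` maps path -> bytes, `database` maps path -> hex digest.
    """
    warnings = []
    for path, content in files.items():
        digest = hashlib.md5(content).hexdigest()
        if path not in database:
            database[path] = digest      # first sighting: record as reference
        elif database[path] != digest:
            warnings.append(f"checksum for {path} changed")
            if update:
                database[path] = digest  # accept the new value as reference
    if purge:
        for path in [p for p in database if p not in files]:
            warnings.append(f"{path} no longer exists")
            del database[path]
    return warnings


database = {}
files = {"/etc/passwd": b"root:x:0:0"}
first = scan(files, database)               # builds the database
files["/etc/passwd"] = b"root:x:0:0\nmallory:x:0:0"
second = scan(files, database)              # change reported
third = scan(files, database)               # reported again: no update
fourth = scan(files, database, update=True) # reported, then accepted
fifth = scan(files, database, update=True)  # quiet
print(first, second, third, fourth, fifth)
```

Without updates the same warning is repeated on every run. With updates it is issued once and the new digest becomes the reference value.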

5.6.6 Neighbourhood watch and tampering

Message digests are supposed to be unbreakable, tamperproof technologies, but of course everything can be broken by a sufficiently determined attacker. Suppose someone wanted to edit a file and alter the cfengine checksum database to cover their tracks. If they had broken into your system, this is potentially easy to do. How can we detect whether this has happened or not?

A simple solution to this problem is to use another checksum-based operation to copy the database to a completely different host. By using a copy operation based on a checksum value, we can also remotely detect a change in the checksum database itself.

Consider the following code:

     # Neighbourhood watch

     control:

      allpeers = (
                 SelectPartitionNeighbours(/path/hostlist,#,random,4)
                 )

     copy:

          /var/cfengine/checksum_digests.db

                            dest=/safekeep/chkdb_$(this)
                            type=checksum
                            server=$(allpeers)
                            inform=true          # warn of copy
                            backup=timestamp
                            define=tampering

     alerts:

      tampering::

           'Digest tampering detected on a peer'

It works by building a list of neighbours for each host. The function

     SelectPartitionNeighbours

can be used for this. Using a file which contains a list of all hosts running cfengine (e.g. the cfrun.hosts file), we create a list of hosts to copy databases from. Each host in the network therefore takes on the responsibility to watch over its neighbours.

The copy rule attempts to copy the database to some file in a safekeeping directory. We label the destination file with $(this) which becomes the name of the server from which the file was collected. Finally, we backup any successful copies using a timestamp to retain a complete record of all changes on the remote host. Each time a change is detected, a copy will be kept of the old. The rule contains triggers to issue alerts and warnings also just to make sure the message will be heard.

In theory, all four neighbours should signal this change. If an attacker had detailed knowledge of the system, he or she might be able to subvert one or two of these before the change was detected, but it is unlikely that all four could be covered up. At any rate, this approach maximizes the chances of change detection.

Finally, in order to make this copy, you must, of course, grant access to the database in cfservd.conf.

     # cfservd.conf

     admit:

     any::

       /var/cfengine/checksum_digests.db  mydomain.tld

Let us now consider what happens if an attacker changes a file and edits the checksum database. Each of the four hosts that has been designated a neighbour will attempt to update their own copy of the database. If the database has been tampered with, they will detect a change in the md5 checksums of the remote copy versus the original. The file will therefore be copied.

It is not a big problem that others have a copy of your checksum database. They cannot see the contents of your files from this. A possibly greater problem is that this configuration will unleash an avalanche of messages if a change is detected. This makes messages visible at least.
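
The watching side of this arrangement can be modelled in Python. The sketch does not reproduce SelectPartitionNeighbours or the cfengine copy protocol. It only shows the logic of a checksum-based copy with timestamped backups. The `Watcher` class, the host names, the database contents and the backup naming scheme are assumptions made for illustration.

```python
import hashlib


def digest(data):
    return hashlib.md5(data).hexdigest()


class Watcher:
    """One neighbour that keeps copies of its peers' checksum databases."""

    def __init__(self, name):
        self.name = name
        self.safekeep = {}   # file name -> last copied database
        self.backups = []    # (backup name, old database)

    def check(self, peer, remote_db, timestamp):
        """Copy the peer's database if its checksum differs from our copy.

        Returns True when a copy was made, i.e. when an alert is due.
        """
        key = f"chkdb_{peer}"
        saved = self.safekeep.get(key)
        if saved is not None and digest(saved) == digest(remote_db):
            return False
        if saved is not None:
            self.backups.append((f"{key}.{timestamp}", saved))
        self.safekeep[key] = remote_db
        return True


watchers = [Watcher(f"host{i}") for i in range(1, 5)]
original = b"/etc/passwd 6d2b31c4"
tampered = b"/etc/passwd 0badc0de"

initial = [w.check("nexus", original, 1) for w in watchers]  # first copy
quiet = [w.check("nexus", original, 2) for w in watchers]    # nothing new
alarm = [w.check("nexus", tampered, 3) for w in watchers]    # change seen
print(initial, quiet, alarm)
```

Every watcher copies the database on first contact, stays silent while it is unchanged, and signals as soon as it differs, keeping the previous version as evidence. Note that a legitimate update to the database looks exactly the same to the watchers as tampering does.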

5.7 Release Management in ITIL

Release management, as defined by ITIL (section 9 of BS15000-2), is a management function rather than a machine implementation operation. It includes all aspects of designing, planning and scheduling changes, but does not include the implementation.

CFEngine can help with the final stages of software release management, namely deployment of software components and configuration. However, the bulk of this item concerns the human process of decision-making.

Creating a schedule and policy for releases.
Acquiring or completing the components for release.
Assigning roles for responsibility.
Labelling release items uniquely for tracking.
Documentation updates.
Testing prior to release.

CFEngine is not a tool for assisting in this kind of process. Some kind of process planning tool and revision control system could work for this.

CFEngine has features that can be considered in the context of this work, however.

packages
files
copy

ITIL frequently works with the idea of a baseline state. While cfengine has no problem working with the idea of a baseline configuration, it is designed to exceed this assumption of maintenance from release to release. ITIL does not adequately address the need for on-the-fly maintenance; it only models large-jump changes, not error corrections. CFEngine, on the other hand, makes no distinction between a large and a small change, thus users of cfengine must make a value judgement about the nature of such changes.

5.8 Version control and rollback

CFEngine does not provide specific tools for versioning configuration specifications. We recommend using a tool such as Subversion for this instead.

Subversion maintains its own revision numbers; these are not visible to cfengine, however. It is useful to be able to refer to version numbers in cfengine too. From software release 2.2.2, a version string can be added to the configuration files as follows:

     control:

         # Version label for this set of configuration files
         cfinputs_version = ( 1.2.3 )
         # Switch on audit logging
         Auditing         = ( on )

This defines the version number of a set of configuration files which is referred to in auditing and error messages.

When cfengine saves the current version of a file that it is modifying or replacing, by default such files are given a new extension and remain within the same directory in which they were encountered. Alternatively, one can specify a repository directory to which such files are moved instead. The repository location is specified in the control section:

     control:
         Repository = ( /var/spool/cfengine )

Files moved to the repository are given names reflecting their full path, with slashes replaced by underscore characters. For some, this creates a clearer overview of the changes that have occurred.
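
The naming rule can be illustrated with a few lines of Python. This is only a sketch of the rule as stated above (slashes replaced by underscores); the function names are hypothetical and any further details of how cfengine names saved files, such as extensions, are not modelled.

```python
import posixpath


def repository_name(path: str) -> str:
    """Flatten a full path into one file name by replacing slashes with underscores."""
    return path.replace("/", "_")


def repository_location(repository: str, path: str) -> str:
    """Return where the saved copy of `path` would sit inside the repository directory."""
    return posixpath.join(repository, repository_name(path))
```

With the repository set to /var/spool/cfengine as in the example above, a saved copy of /etc/inet/hosts would be found under the name _etc_inet_hosts in that directory.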

The repository is used by disable, editfiles, links, and copy rule types; copy and disable allow you to override repository use or to specify an alternate repository directory via their repository option.

You should never edit the production version of a policy directly, but rather edit a separate development area and publish the changes once tested. The ITIL change management process is applicable to this human change (much more so than to the machine changes made by cfengine itself).

5.8.1 Delegating responsibility

CFEngine has no meta-access control mechanism which can decide who may write policy rules. To create such a mechanism, there would have to be a monitor which could identify users, and an authority mechanism that would disallow certain users to write rules of certain types about certain objects on certain hosts. Clearly it is possible to create such a system, but it would be technically difficult and very cumbersome to use, and it would add a whole new level of complexity to policy and potential for error to the configuration process.

To keep matters as simple as possible, cfengine avoids this and proposes a different approach. Promise theory allows us to model the security implications of this (see the figure of the bow-tie structure). A simple method of delegating is the following.

Delegate responsibility for different issues to admin teams 1, 2, 3, etc.
Make each of these teams responsible for version control of their own configuration rules.
Make an intermediate agent responsible for collating and vetting the rules, checking for irregularities and conflicts. This agent must promise to disallow rules by one team that are the responsibility of another team. The agent could be a layer of software, but a cheaper and more manageable solution is to make this another group of one or more humans.
Make the resulting collated configuration version controlled. Publish approved promises for all hosts to download from a trusted source.
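
The vetting promise in the third step can be sketched in Python. The document recommends humans for this role, so the sketch is only a model of the rule being enforced; the data layout (team, topic, rule) and all names are assumptions made for illustration.

```python
def vet_rules(proposals, ownership):
    """Split proposed rules into approved and rejected lists.

    proposals: list of (team, topic, rule) tuples submitted by the admin teams.
    ownership: dict mapping each topic to the team responsible for it.
    A rule is approved only if the submitting team owns its topic.
    """
    approved, rejected = [], []
    for team, topic, rule in proposals:
        if ownership.get(topic) == team:
            approved.append((team, topic, rule))
        else:
            rejected.append((team, topic, rule))
    return approved, rejected
```

Only the approved list would then be committed to the collated, version-controlled configuration and published; rejected rules go back to the teams for discussion.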

A review procedure for policy promises is a good solution if you want to delegate responsibility for different parts of a policy to different sources. Human judgement is irreplaceable, and tools can be added to make conflicts easier to detect.

Promise theory underlines that, if a host or computing device accepts policy from any source, then it is alone and entirely responsible for this decision. The ultimate responsibility for the published version of the policy lies with the vetting agent. This creates a shallow hierarchy, but there is no reason why this formal body could not be composed of representatives from the multiple teams.

Delegation of responsibility requires vetting access

5.9 Availability and Capacity Management

CFEngine records all manner of information about the behaviour of computers during its efforts to keep promises. These data can be mined to build up a picture of the behaviour of an entire datacentre or organization, perhaps even multiple domains.

Cfengine's environment daemon also collects patterns of environmental influence on hosts, using few resources to do so. These data contain much information that is useful for capacity planning.

We should add a warning however. Capacity planning requires a considerable amount of data and analysis, as well as a sound and critical judgement of the data. Resource and performance management are such complex issues that no simple recipe or checklist can replace the judgement of an experienced engineer. However, cfengine can supply data to such an engineer.

Performance measurements (cfshow -p) allow the average throughput of a server to be estimated in terms of time to completion of service. If service times are too long, this is an indication (but not proof) that hardware should be upgraded.
Activity levels are graphed per service. These indicate the level of traffic coming into the different servers. Evidence of a ceiling limit on the throughput (clipping in the time-series) can show insufficient capacity.
Distribution graphs of fluctuations about the mean can also show evidence of ceiling limits. Asymmetric distributions show when the majority of service requests tend to bunch at a high level (probable stress on server) or at a low level (over-dimensioned server).
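
The last point, asymmetry of the distribution, can be quantified with the sample skewness. The Python sketch below is not a cfengine feature; the function names, the threshold of 0.5 and the wording of the labels are assumptions chosen for illustration. Values bunched at a high level with a tail of low values give a negative skewness, and the reverse gives a positive one.

```python
from statistics import fmean, pstdev


def skewness(samples):
    """Return the population skewness (third standardized moment) of the samples."""
    mean = fmean(samples)
    spread = pstdev(samples)
    if spread == 0:
        return 0.0  # all values identical: no asymmetry to report
    return fmean([((x - mean) / spread) ** 3 for x in samples])


def interpret(samples, threshold=0.5):
    """Label a set of activity measurements by the direction of its asymmetry."""
    s = skewness(samples)
    if s < -threshold:
        return "bunched high"   # probable stress on the server
    if s > threshold:
        return "bunched low"    # possibly over-dimensioned server
    return "roughly symmetric"
```

Such a label is an indication only; as noted above, it does not replace the judgement of an experienced engineer looking at the full graphs.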

The level of technical understanding needed to make sound judgements based on these data goes somewhat beyond the scope of this document. This motivates us to create better tools for cfengine that can make these analyses more accessible to users. However, this must be deferred to another occasion.

6 Summary

We have described the basics of cfengine and ITIL and shown a number of areas where the two can be integrated.

CFEngine users can benefit from the disciplines that ITIL brings.
ITIL can benefit from the predictability that cfengine brings.

6.1 How we wrote this document

So, if ITIL is so great, did we use it to manage the process of writing this document? Authoring a document and authoring a policy have much in common, so let us spend a moment to examine the process of checks and balances that we have used to produce this text.

The answer to our question is both yes and no, and while this might sound rather unhelpful, we suggest that it is in fact a significant answer; indeed it is the right answer in response to any question about best practices because such recipes must always be applied to a specific context.

There are sensible and ridiculous ways to implement a set of recommendations. ITIL users should expect to adapt its generalized ideas to each set of special circumstances. To do this here, we have used the parts of ITIL that make particular sense for authoring, and we have also used cfengine's model of promises or voluntary cooperation to understand how to implement them.

For example, ITIL suggests forming committees for discussing and deciding change. A committee is a cumbersome device when the total number of people involved in the entire process is two. Nevertheless, the role of the committee is relevant (i.e. the promises it makes to bring the process to completion), and this is where promise theory helps us to make sense of the “dumb rules”. We have multiple opinions and multiple pairs of eyes for quality control as well as for inspiration.

6.1.1 ITIL concepts for authoring

Several parts of ITIL are quite relevant to authoring.

Service management. A document provides an information service to its clients (the readers). It promises to be accurate to within reasonable limits.
Release management. Each version of our document can be considered a release which undergoes a continual improvement cycle, constantly being evaluated and changed in accordance with events and incidents that occur.
Incidents. An incident is something that impacts on the service. An incident could be the discovery of an error in the text. It could be a disagreement between the authors or a misunderstanding on the part of the reader. There have been many incidental changes based on discussions in our teams.
Impact. The impact of the incident is the potential damage caused by the incident, or the usefulness of the discovery. Incidents are not necessarily negative events. They can be events which point out improvements.
Request for change. One of the authors asks to make changes to the text.
Change management. Each identified change can be evaluated for its potential impact (benefit or confusion). If there are many changes to be made, priority can be assigned to them. When should the changes be implemented?

6.1.2 Promising voluntary cooperation

What does promise theory say about collaborative authoring?

First of all, it begins by saying that each individual in the process of authoring has independent knowledge and should be represented as a separate agent. It tells us that promises to cooperate will be needed to integrate the information.

However, more than that, promises tell us that each section of the text is an “agent” which can change or behave independently. In other words, we can manage the parts independently, but again we need to promise to coordinate those parts. So promise theory asks us first to identify the agents (the topics in the document) that will be interacting and then find out what promises they need to make to carry out their function.

Because of the individual nature of the parts, we can associate an individual author to each. To bring them together we need a further agent or individual to collate independent ideas and policy sources into a single coherent whole. Thus promise theory shows us a basic “bow-tie” structure for integrating and correlating independent sources and then making the results available to independent users (see figure). This is not the only solution to the problem of vetting that promise theory predicts, but it is the simplest one. It is also the approach that ITIL approves: making someone responsible for the job.

We emphasize that promise theory does not tell us the specifics of how to implement solutions; it only tells us what elements are needed and how they should interact. So we might implement agents as people, as different computers, or as different user accounts within the same computer. As long as the elements can keep the necessary promises, it does not matter.

So how did we write our document? In fact we did not use a very strict ITIL-like change management process when writing the first versions of our document. Such a process could have strangled our work in the creative stage and doubled the time it took to write. Rather, we worked in an ad hoc way by voluntary cooperation. Each of us promised to write about certain topics and work on the text independently. We worked as autonomous agents, and we used Subversion (a version control and sharing system) to keep the working document. Subversion is itself a third agent which promises to accept changes one at a time from either of the two authors and then make these changes available again to both authors. This agent performs no vetting or control other than ordering the changes.

The authors have to promise one another to resolve any conflicts or disagreements, but promises do not suggest how this might take place. (ITIL, on the other hand, does offer suggestions for this resolution process.)

ITIL seems to work best once a service is up and running, or once a basic version of a document exists. It does not say so much about the creative act, except to think of it as a release.

What ITIL is weak at is parallelization of effort. ITIL's processes are serialized processing models. In our first creative versions, we converged in parallel onto an approximate result, each working separately. This is very efficient but it can lead to duplication of work or inconsistency. Serialization is needed to resolve consistency issues precisely, but it leads to unnecessary waiting in some cases.

6.2 Road-map for adoption

Below we indicate a checklist of ITIL compliant steps for using cfengine in a machine life-cycle.

Set up cfagent running at scheduled interval X. This is the Service Level Agreement.
Set up versioning of policy.
Set up delegation of authorship.
Run cfenvd for passive monitoring. Run cfagent for active monitoring.

Release:

Select installation medium e.g. DVD, net-boot with hooks to cfengine.
Start with essential promises, and formulate the configuration policy.
Use ITIL processes for deciding and refining configuration promises.
Evaluate and monitor promises using cfagent and cfenvd.
Use cfagent to monitor changes using cryptographic checksums.
Develop recovery plans. Use cfengine to automate backup of data and automate the duplication of servers for load balancing and redundancy.

7 ITIL glossary

This section lists some of the many terms from ITIL, especially the ISO/IEC 20000 version of the text, and offers some comments and translations into common cfengine terminology.

7.1 Active Monitoring

Monitoring of a configuration item or IT service that uses automated regular checks to discover the current status.

CFEngine performs programmed checks of all of its promises each time cfagent is started. Cfagent is, in a sense, an active monitor for a set of promises that are described in its configuration file.

7.2 Availability

The ability of a component or service to perform its required function.

Availability = Hours operational / Agreed service hours

Availability or intermittency in cfengine refers to the responsiveness of hosts in a network when remotely connecting to cfservd.

Intermittency = Successful attempts / Total attempts

This is a measurement that cfagent automatically makes.
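
Both ratios above are simple to compute. The Python sketch below restates them as functions; the function names and the error handling are our own assumptions, and the second function follows the formula exactly as it is written in this section.

```python
def availability(hours_operational: float, agreed_service_hours: float) -> float:
    """ITIL availability: hours operational divided by agreed service hours."""
    if agreed_service_hours <= 0:
        raise ValueError("agreed service hours must be positive")
    return hours_operational / agreed_service_hours


def intermittency(successful_attempts: int, total_attempts: int) -> float:
    """Ratio of successful connection attempts to total attempts, as defined above."""
    if total_attempts <= 0:
        raise ValueError("at least one connection attempt is required")
    return successful_attempts / total_attempts
```

For example, a service that was operational for 84 of 168 agreed hours has an availability of 0.5, and a host that answered 3 of 4 connection attempts gives a ratio of 0.75.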

7.3 Alert

A warning that a threshold has been reached, something has changed or a failure has occurred.

A cfengine alert fits this description quite well. Most alerts are user-defined, but a few are side effects of certain configuration rules.

7.4 Audit

A formal inspection and verification to check whether a standard or set of guidelines is being followed.

Cfengine's notion of an audit is more like the notion from system accounting. However, the data generated by this extra logging information could be collected and used in a more detailed examination of cfengine's operations, suitable for use in a formal inspection (e.g. for compliance).

7.5 Baseline

A snapshot of the state of a service or an individual configuration item at a point in time.

In cfengine parlance, we refer to this as an initial state or configuration. In principle a cfengine initial state does not have to be a known baseline, since the changes we make will not generally be relative to an existing configuration. CFEngine encourages users to define the final state (regardless of initial state).

7.6 Benchmark

The recorded state of something at a specific point in time.

CFEngine does not use this term in any of its documentation, though our general understanding of a “benchmark” is that of a standardized performance measurement under special conditions. CFEngine regularly records state and performance data in a variety of ways, for example when making file copies.

7.7 Capability

The ability of someone or something to carry out an activity.

CFEngine does not use this concept specifically. The notion of a capability is terminology used in role-based access control.

7.8 Change record

A record containing details of which configuration items are affected and how they are affected by an authorized change.

Cfengine's default modus operandi is to not record changes made to a system unless requested by the user. Changes can be written as log entries or audit entries by switching on reporting.

Consider a typical cfengine promise (to ensure that a destination file is a copy of a source). Three levels of change recording can be added in cfengine 2:

     copy:

       /source/file dest=/destination/file
                    inform=true
                    syslog=true
                    audit=true

An “inform” promise means that cfagent promises to report changes on its standard output (which is usually sent by email or printed on a console). A “syslog” promise implies that cfagent will log the message to the system log daemon. Both of these give only a simple message about actual changes. An “audit” promise is a promise to record extensive details about the process that cfagent undergoes in checking its other promises.

7.9 Chronological Analysis

An analysis based on the timeline of recorded events (used to help identify possible causes of problems).

A timeline analysis could easily be carried out based on audit information, system logs and cfenvd behavioural records.

7.10 Configuration

A group of configuration items (CI) that work together to deliver an IT service.

A configuration is the current state of resources on a system. This is, in principle, different from the state we would like to achieve, or what has been promised.

7.11 Configuration Item (CI)

A component of an infrastructure which is or will be under the control of configuration management.

A configuration item is any object making a promise in cfengine. We often speak of the promise object, or “promiser”.

7.12 Configuration Management Database (CMDB)

Database containing all the relevant details of each configuration item and details of the important relationships between them.

CFEngine has no asset database except for its own list of promises. The only relationships it cares about are those which are explicitly coded as promises. In the future, cfengine 3 is likely to extend the notion of promises to allow more general records of the CMDB kind, but only to the extent that they can be verified autonomically.

7.13 Document

Information and its supporting medium.

ITIL originally considered a document to be only a container for information. In version 3 it considers also the medium on which the data are recorded, i.e. both the file and the filesystem on which it resides.

7.14 Emergency Change

A change that must be introduced as soon as possible – for example to solve a major incident or to implement a critical security patch.

CFEngine has no specific concept for this.

7.15 Error

A design flaw or malfunction that causes a failure.

CFEngine often uses the term configuration error to mean a deviation of a configuration from its promised state. The ITIL meaning of the term would translate into “bug in the cfengine software” or “bug in the promised configuration”.

7.16 Event

A change of state that has significance for the management of a configuration item or IT service.

The same basic definition applies to cfengine also, but cfengine makes all such events into classes, since its approach to observing the environment is to measure and then classify it into approximate expected states. CFEngine class attributes (usually from cfenvd) may be considered as event notifications as they change.
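
The idea of measuring and then classifying can be sketched in Python. This is a simplified model and not cfenvd's actual algorithm: the class labels, the band of one standard deviation and the function names are all assumptions made for illustration.

```python
def classify(value: float, mean: float, std_dev: float) -> str:
    """Classify a measurement as 'low', 'normal' or 'high' relative to its usual level."""
    band = max(std_dev, 0.0)  # hypothetical tolerance: one standard deviation
    if value > mean + band:
        return "high"
    if value < mean - band:
        return "low"
    return "normal"


def observe(previous_class: str, value: float, mean: float, std_dev: float):
    """Return the current class and whether the change of class counts as an event."""
    current = classify(value, mean, std_dev)
    return current, current != previous_class
```

In this model an event is simply a change of class between two observations, which mirrors the statement that class attributes act as event notifications as they change.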

7.17 Exception

An event that is generated when a service or device is currently operating abnormally.

A state in which configuration policy is violated (could lead to a warning or an automated correction).

7.18 Failure

Loss of ability to operate to specification or to deliver the required output.

ITIL's idea of a failure is something that prevents a promise from being kept. Cfengine's autonomy model means that it is unlikely for such a failure to occur, since promises are only allowed to be made about resources for which we have all privileges. Occasionally, environmental issues might interfere and lead to failure.

7.19 Incident

Any event that is not expected in normal operations and which might cause a degradation of service quality.

Cfengine's philosophy of convergence gives us only one option for interpreting this term, namely as a temporary deviation from promised behaviour. A deviation must be temporary if cfengine is operating continually, since it will repair any problem on its next invocation round. Events which do not impact promises made by cfengine are of no interest to cfengine, since autonomy means it cannot be responsible for anything beyond its own promises.
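
Convergence can be modelled in a few lines of Python. The sketch treats the state of a host and its promises as plain dictionaries, which is an assumption made purely for illustration; the function name is hypothetical and nothing here reflects cfagent's internals.

```python
def converge(current: dict, promised: dict):
    """Return the repaired state and the list of keys that had to be repaired.

    Only promised keys are examined; anything else in the current state is
    left alone, since it lies outside the agent's own promises.
    """
    repaired = dict(current)
    repairs = []
    for key, wanted in promised.items():
        if key not in repaired or repaired[key] != wanted:
            repaired[key] = wanted
            repairs.append(key)
    return repaired, repairs
```

Running the function on a state that has drifted repairs the drift, and running it again on the result changes nothing. An incident in the sense above is therefore visible only between the drift and the next invocation.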

7.20 Monitoring

Repeated observation of a configuration item, IT service or process in order to detect events and ensure that the current status is known.

CFEngine incorporates a number of different kinds of monitoring, including monitoring of kept configuration-promises and passive monitoring of behaviour.

7.21 Passive Monitoring

Monitoring of a configuration item or IT service that relies on an alert or notification to discover the current status.

Cfenvd is cfengine's passive monitoring component. It observes system related behaviour and learns about it. It assumes that there is likely to be a weekly periodicity in the data in order to best handle its statistical inference.
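
One way to exploit a weekly periodicity is to compare each measurement only with earlier measurements from the same time of the week. The Python sketch below shows that idea; the hourly slot size, the plain running mean and all names are our assumptions, and cfenvd's real data structures and averaging are not described here.

```python
from datetime import datetime

SLOTS_PER_WEEK = 7 * 24  # assumed resolution: one slot per hour of the week


def week_slot(moment: datetime) -> int:
    """Map a timestamp to its hour-of-week slot (Monday 00:00 is slot 0)."""
    return moment.weekday() * 24 + moment.hour


class WeeklyAverage:
    """Keep a running mean of a measurement for every slot of the week."""

    def __init__(self):
        self.count = [0] * SLOTS_PER_WEEK
        self.mean = [0.0] * SLOTS_PER_WEEK

    def update(self, moment: datetime, value: float) -> float:
        """Fold a new measurement into its slot and return the updated mean."""
        slot = week_slot(moment)
        self.count[slot] += 1
        self.mean[slot] += (value - self.mean[slot]) / self.count[slot]
        return self.mean[slot]
```

Measurements taken at 09:00 on two successive Mondays fall into the same slot and are averaged together, while a measurement from Tuesday at 09:00 is kept apart.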

7.22 Policy

Formally documented management expectations and intentions. Policies are used to direct decisions, and to ensure consistent and appropriate development and implementation of processes, standards, roles, activities, IT infrastructures, etc.

Cfengine's configuration policy is an automatable set of promises about the static and runtime state of a computer. Roles are identified by the kinds of behaviour exhibited by resources in a network. We say that a number of resources (hosts or smaller configuration objects) play a specific promised role if they make identical promises. Any resource can play a number of roles. Decisions in cfengine are made entirely on the basis of the result of monitoring a host environment.

7.23 Proactive Monitoring

Monitoring that looks for patterns of events to predict possible future failures.

All cfengine monitoring is proactive in the sense that it can lead to automated follow-up actions.

7.24 Problem

Unknown underlying cause of one or more incidents.

A repeated deviation from policy that suggests a change of policy or specific counter-measures. A promise needs to be reconsidered or new promises are required.

7.25 Promise

ITIL does not define this term, although promises are deployed in various ways – for instance in terms of cooperation, communication interfaces within or between processes or contractual relationships as defined by Service Level Agreements, Operational Level Agreements and Underpinning Contracts.

A promise in cfengine is a single rule in the cfengine language. The promiser is the resource whose properties are described, and the promisee is implicitly the cfengine monitor.

7.26 Reactive Monitoring

Monitoring that takes action in response to an event – for example submitting a batch job when the previous job completes, or logging an incident when an error occurs.

The concept of reactive monitoring is unclear because the duration of an event and the speed of a response are undefined. In a sense, all cfengine monitoring is potentially reactive. It is possible to attach actions which keep promises to any observable condition discernible by cfengine's monitor. CFEngine is not usually considered event driven, however, since it does not react “as soon as possible” but at programmed intervals.

7.27 Record

Information in readable form that is maintained by the service provider about operations.

A log entry or database item.

7.28 Recovery

Returning a Configuration Item or an IT service to a working state. Recovering of an IT service often includes recovering data to a known consistent state.

7.29 Remediation

Recovery to a known state after a failed change or release.

All cfengine promises refer to the state of a system that is desired. The promises are automatically enforced, hence cfengine recovers a system (in principle) on every invocation. CFEngine always returns to a known state, due to the property of “convergence”. There is no distinction between the concepts of repair, recovery or remediation.
However, this concept is like the notion of “rollback” which often involves a more significant restoration of a system from backup. This is discussed later.

7.30 Repair

The replacement or correction of a failed configuration item.

7.31 Release

A collection of new or changed configuration items that are introduced together.

An instantiation of the entire cfengine system under a specific version of a policy, i.e. a specific set of promises.

7.32 Request for Change

A form to be completed requesting the need for change. This is to be followed up.

This has no counterpart in cfengine. It is part of human communication which coordinates autonomous machines. Clearly autonomous computers do not listen to change requests from other computers, but when machines cooperate in clusters or groups they take suggestions from the collaborative process. An RFC in an ITIL sense is part of an organizational process that goes beyond cfengine's level of jurisdiction. This is an example of what ITIL adds to the autonomous cfengine model.

7.32.1 Abandon Autonomy?

Why not simply abandon autonomy of machines if this seems to interfere with the need for organizational change? There are good reasons why autonomy is the correct model for resources. Autonomy reduces the risk to a resource of attack, mistake and error propagation.

ITIL's processes exist precisely to minimize the risk of negative impact of change, so the goals are entirely compatible. When an organization discusses a change it examines information from possibly several autonomous systems and discusses how they will change their pattern of collaboration. There is no point in this process at which it is necessary for one of the systems to give up its autonomy.

7.33 Resilience

The ability of a configuration item or IT service to resist failure or to recover quickly following a failure.

Cfengine's purpose is to make a system resilient to unpredictable change.

7.34 Restoration

Actions taken to return an IT service to the users after repair and recovery from an incident.

All cfengine promises refer to the state of a system that is desired. The promises are automatically enforced, hence cfengine recovers a system (in principle) on every invocation. CFEngine always returns to a known state, due to the property of “convergence”. There is no distinction between the concepts of repair, recovery or remediation.
However, this concept seems to suggest a more catastrophic failure which often involves a more significant restoration of a system from backup. This is discussed later.

7.35 Role

A set of responsibilities, activities and authorities granted to a person or a team. Roles are defined in processes.

A role in cfengine is a class of agents that make the same kind of promise. The type of role played by the class is determined by the nature of the promise they make. e.g. a promise to run a web server would naturally lead to the role “web server”.
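
Deriving roles from promises can be sketched in Python. The host names and promise labels used in the usage note are hypothetical, and the dictionary layout is an assumption made for illustration; cfengine itself expresses this through classes rather than through a data structure like this one.

```python
from collections import defaultdict


def roles(promises_by_host: dict[str, set[str]]) -> dict[str, list[str]]:
    """Group hosts by the promises they make.

    Each promise label names a role, and every host making that promise
    plays the role. A host making several promises plays several roles.
    """
    members = defaultdict(list)
    for host in sorted(promises_by_host):
        for promise in sorted(promises_by_host[host]):
            members[promise].append(host)
    return dict(members)
```

A host that promises to run both a web server and a mail server therefore appears under both roles, in line with the remark under Policy that any resource can play a number of roles.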

7.36 Service desk

Interface between users and service provider.

A help desk. This is not formally part of cfengine's tool set.

7.37 Service Level Agreement

A written agreement between the service provider and the customer that documents agreed services, service levels and penalties for non-compliance.

An agreement assumes a set of promises that propose behaviour and an acceptance of those promises by the client. If we assume that the users are satisfied with our policies, then an SLA can be interpreted as a combination of a configuration policy (configuration service promises) and the cfengine execution schedule.

7.38 Service Management

The management of services.

Same.

7.39 Warning

An event that is generated when a service or device is approaching its threshold.

A message generated in place of a correction to system state when a deviation from policy is detected. Note that cfengine is not based on fixed thresholds. All “thresholds” for action or warning are defined as a matter of policy.
