The goal of most IT installations is to work as a support infrastructure for some other primary activity, such as the running of a business or other organization. Even if the primary activity is the design of computer systems, or the writing of software, the supporting infrastructure is a tool whose management is in principle separate from the main business goals. As organizations become larger, the management of the IT system and other ancillary activities frequently become isolated from “front line” activities.
IT infrastructure is an enabler, so it is important to ensure that it succeeds in this task. How do we do this? This document is about how to make cfengine-management best support primary business or organizational processes.
We write this document in the light of two trends: the demotion of system administration as a job description, and the rise of service oriented thinking to replace both it and the monolithic design philosophy of systems. Service orientation is not so much a technological innovation as it is a different kind of social structure. It is a move away from hierarchy as the main model of organization, toward generalized network structure. In computer parlance, service orientation is essentially a peer to peer structure. There are no automatic kings or commanders in chief, only peers who need help from other peers. If such key positions arise, they emerge naturally by necessity, not by presumption.
For example, in the 1960s factory work in the United Kingdom was organized hierarchically with powerful unions attending to a dutiful “separation of concerns”, much like an idealized object oriented system. To build a ship, one would have to ask the management to ask panel producers for panels, then when they were finished they would send the message back up the hierarchy so that management would schedule the welders to arrive, then the painters and so on. Much delay and inefficiency was caused by this organizational bureaucratic structure.
Although this behaviour persists to a lesser extent, today we use more direct communication between the parts that need to connect and so save much time and overhead. This service oriented thinking can be applied to computing services, their organization and even the support of those computing services. The service model can be applied at all levels.
In the late 1980s it was realized that a service oriented view of management could profitably be formalized so as to be of benefit to all organizations. This began the Information Technology Infrastructure Library, building on the experience of leaders in government and industry, including organizations such as the British Broadcasting Corporation, the Office of Government Commerce and others.
The IT Infrastructure Library (ITIL) has emerged as a de-facto set of ideas about service delivery. It is not based on any theoretical model or design criteria. It is rather a set of self-proclaimed best practices compiled by representatives from government and industry. As such its claims can be discussed, but we shall not do so here. We shall refer to ITIL because it has become a popular set of guidelines for all manner of IT organizations, and because it promotes the idea of IT-business alignment.
ITIL was an important source of concepts and processes documented in the following British and ISO standards:
ITIL now encompasses various books and courses and has its own qualification scheme allowing for a certification of Service Managers or IT staff.
The key concepts of ITIL include service and process orientation. Service orientation is an important model for system organization because it can encompass everything from the monolithic hierarchical systems of yesteryear to modern day peer to peer architectures which better mirror a free-market economic business interaction. It can be applied to computer-provided services (e.g. web services, or even configuration operations like cfengine) or it can be applied to human services and operations such as help desks and support. This makes it an important centre-piece in the discussion.
ITIL has its own particular terminology for discussing service related matters. To relate these to the use of a technology such as cfengine we need to understand the words and how they are used. ITIL uses many terms and phrases in a different way to system administrators.
The verb “to manage” originally meant “to cope”. Only more recently has strategic thinking changed it into a transitive verb: something that we do to systems, like driving a car, or flying a plane.
Today the term “management” signifies the introduction of a bureaucratic level of governance, to control and verify the workings of a system. This terminology has come about mainly because the people who wrote ITIL live in that kind of world and understand things through these eyes. Ironically, computer engineers now speak of “self-management” and “autonomics” to recover the original idea of systems that can cope.
In this document we have two principal aims:
What do we need to make a business? Do we need a demand for a “product”, a workflow to implement it, a supply (chain) mechanism for selling it to a market? It turns out that the service abstraction is a paradigm that fits all enterprises without too much shoe-horning.
Businesses probably have many goals in their grand designs: they have high level visions, notions of secure and best practices, sometimes even ethical policies. All of these can be couched in the language of promises to behave in some way.
Now, we can ask: what does it mean to align an IT infrastructure to this business goal to provide $S$? First, for IT systems to have any impact on the business goal at all, the business must rely on the IT system in some way. This could either be directly, in the manner of an e-commerce web-site, or it might be indirectly, for instance by providing drawing and modelling software in an architect's office. In either case there is a workflow in which an IT system plays an intermediary role in the workflow process.
In fact, it does not matter whether this is an IT system, a human being or a steam-powered engine. What is key is that there is a technology playing an intermediate role in the performance of a service. We can display this as the workflow diagram shown by the dotted lines in the figure. The business $B$ would like to provide service $S$ to its customer $C$; in actuality this requires the help of intermediary $I$.
Inserting an intermediate agent into a business process. The dotted lines show a work flow path. The arc shows a promise the business would like to make to the end customer – but promise theory says that it cannot if it does not have direct contact.
Promise theory has several implications, and one of them is that an agent cannot promise something with confidence to an agent it is not directly in contact with. This is because agents can only vouch for their own behaviour. They cannot promise what an intermediate agent would do. This has implications for the business.
Suppose a business wants to make a promise to its customer, but knows that it must rely on intermediaries (the IT department for example) to do so. Promise theory tells us that the business representative making the promise requires promises from every intermediate agent in the chain, and each of the agents in that chain requires promises from further down the chain too.
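The chain argument can be made concrete with a small sketch. The Python below is purely illustrative and is not part of cfengine. The agent names B, I and C follow the figure; the function name `unbacked_links` and the representation of a promise as a (giver, receiver) pair are assumptions made for this example. It simply lists the hops along a workflow path for which no direct promise has been given.

```python
def unbacked_links(path, promises):
    """Return the hops along `path` that no direct promise covers.

    `path` is the workflow route as a list of agent names.
    `promises` is a set of (giver, receiver) pairs, each standing for a
    promise made between two agents that are in direct contact.
    (Both representations are assumptions of this sketch.)
    """
    return [(a, b) for a, b in zip(path, path[1:]) if (a, b) not in promises]


# Business B reaches customer C only through intermediary I.
workflow = ["B", "I", "C"]

# Every hop is covered by a direct promise: nothing is missing.
print(unbacked_links(workflow, {("B", "I"), ("I", "C")}))

# I has promised nothing to C, so B cannot vouch for delivery to C.
print(unbacked_links(workflow, {("B", "I")}))
```

An empty result means every link in the chain is backed by a promise from the agent that actually controls it; anything else shows where the business is relying on behaviour that nobody has promised.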
It is beyond the scope of this document to explain all of those promises. What cfengine allows a business to do is to automate many of those promises – or make them autonomic (self-managing).
Humans are poor at reliable, repetitive work but they are infinitely superior at creative work and decision-making. Modern theory on success in business rejects the classic views of management with militarized or bureaucratized chains of command and control in favour of more human-creative structures. Creative and adaptive workflow requires a high level of decentralization and autonomy, while at the same time protecting the core values of the organization.
Team work is a key element in decentralized organization – both for humans and computers. IT departments are often organized in this way, for instance. Teams do not exist because they maximize production of every individual, nor do they make an organization more predictable or controllable. They exist because humans need continual motivation and emotional support – and indirectly this sustains workflow and adds creativity to a business. One often overlooks the team-aspect of coping when considering computer management, in favour of hierarchical design. CFEngine does not force us into hierarchical systems however, so we should not discard the smaller team idea too soon.
CFEngine is complex enough for it to make sense to delegate responsibility for different issues. An organization will generally consist of many groups and teams already, each with their own special needs and each craving its own autonomy. CFEngine and promise theory were designed for precisely this kind of environment. CFEngine allows cooperation and sharing without allowing central managers to ride roughshod over local needs.
Teams thrive by discussion and interaction within the framework of a policy or vision, allowing variation and arriving at a consensus when necessary. Success in a team depends on a combination of abilities working together, not undermining one another. Conflicts in the promises made by team members reveal design problems in the group. An analysis of promises (cfengine's model of collaboration) is a significant tool for understanding and enabling businesses.
M. Belbin, a researcher in teamwork, has identified nine abilities or roles (kinds of promise) to be played in a team collaboration:
His model has little room for technical workflow arguments. It is entirely concerned with the creative process. This is probably significant. We should ask ourselves: how can we use the freedom to organize into specialized teams to maximize human creativity, while passing hard work over to machines? Solving this problem is what cfengine is about.
ITIL has no theory to back it up, so we have to look elsewhere for a motivation of its practices. Promise theory is an attempt to do just this for a service oriented model in which peers make promises to one another. So it ought to work for ITIL also. The advantage of promise theory is that it helps us to see how cfengine can be used, because promises provide a simple picture of how cfengine works.
Think of cfengine as a general tool for automatically making sure that promises are kept.
The popular service concept fails to capture one thing very clearly, namely the distinction between making a promise and keeping a promise. A service implies that something will be provided but it does not specify when.
Suppose we ask a security company to protect our assets. The company might promise to deploy guards, or alarm technology, or it could simply promise that you will be safe without explaining how the promise will be kept. The promise does not necessarily imply any action required to maintain this state of safety, but we still pay the company for the service to keep this promise anyway. Trust plays an important role, of course.
Promise theory helps us to understand services in all forms by forcing us to think carefully about the concept of autonomy. Autonomy implies several things: for instance, privacy of information, independence of decision and responsibility for one's own behaviour. The concept of autonomy is like a filter that makes us think carefully about things that we often take for granted. It is a good discipline, forcing us to confront what we think we know about systems.
The agents of promises are humans, computers or any entity that can be associated with a promise, even if by association with its owner or designer. They are said to be autonomous if they cannot be forced to make any promises about their behaviour by an outside agent. A useful principle for understanding systems is the maximal separation of concerns, and promises help us to separate independent issues.
Separation of concerns is only half the story however. Promises are also about describing how the parts of a system work together, just as in team-work. Promises provide the glue that allows completely autonomous parts to form an organization. We are not allowed to think about “control” or “command”, only about voluntary cooperation. Keep these ideas in mind when reading this document.
We can use the language of promises to make clearer definitions.
Service Level Agreements (SLAs) are now a well-known part of the customer-business scenario. How are promises different from SLAs? Promises are more primitive than agreements. Agreements bind two parties to a collection of bilateral decisions that have been made in advance. An agreement implies an existing infrastructure on which to agree. A promise on the other hand is an entirely autonomous statement about agents' behaviour (ad hoc). Showing only the promises in a system does not imply any agreement between the parties, only indications about their likely behaviours.
In other words, seeing the promises that have been made, an external observer could calculate effective service levels that have been promised without any agreement taking place. Promises are therefore more fundamental than agreements to the predictability of the system.
Process automation is an investment which has its own cost. The benefits are not merely saved manpower but improved consistency or certainty of process. Automation provides an automatic quality assurance.
A simple argument against automation goes like this: if I can fix it in five minutes then it is not worth automating, unless the automation takes less time than that.
The argument is simplistic. Before dismissing automation, one should ask questions like this:
CFEngine is a free software package for automating the installation and maintenance of networked computers. The project began in 1993 and it has been in widespread use since 1995. CFEngine is available for all major Unix and Unix-like operating systems, and it will also run under NT-derived Windows operating systems via the Cygwin Unix-compatibility environment/libraries.
CFEngine scales easily from a single host to tens of thousands of hosts. As of this writing, the largest installations we know of regulate around 20,000 machines under a common administration. CFEngine can manage many aspects of system configuration and maintenance, including the following:
Cfengine's purpose is to implement policy-based configuration management. In practical terms, this means that cfengine greatly simplifies the tasks of system configuration and maintenance. For example, to customize a particular system, it is no longer necessary to write a program which performs each required action in a procedural language like Perl or your favorite shell. Instead, you write a much simpler policy description that documents how you want your hosts to be configured. The cfengine software determines what needs to be done in terms of implementation and/or remediation from this specification. Such policy descriptions are also used to ensure that the system remains configured as the system administrator wishes over time.
Here is a brief example of such a policy description which we've annotated:
    control:
       # General directives: here, we define a list variable.
       tmpdirs = ( tmp:scratch:scratch2 )

    files:
       # File ownership and protection specifications.
       /usr/local/bin owner=root group=bin mode=755 action=fixall

    copy:
       # Copy files on/to the local system.
       solaris::
          # Applies only to Solaris systems.
          /config/pam/solaris server=pammaster dest=/etc/pam.d
       linux::
          # Applies only to Linux systems.
          /config/pam/common-auth server=pammaster dest=/etc/pam.d/common-auth

    tidy:
       # Manage temporary scratch directories.
       ${tmpdirs} include=* age=7 recurse=inf
This simple configuration is divided into four stanzas, each introduced by a colon-terminated keyword, specifically control:, files:, copy: and tidy:. The control stanza defines a list of directories which we've named tmpdirs which we'll use later (in the tidy stanza).
The files stanza specifies that all of the files in the directory /usr/local/bin should be owned by user root and group bin and have the file mode 755. When cfengine runs with this configuration description it will correct any ownership and/or permissions which deviate from these specifications. Thus, this stanza serves to implement a policy about the proper ownerships and permissions for the executables in the local binaries directory.
The copy stanza prescribes different configurations for Linux and Solaris systems. On Solaris systems, files in /etc/pam.d will be updated with those in the directory /config/pam/solaris on a master server when the latter are newer. On Linux systems, only the file /etc/pam.d/common-auth is updated from the PAM master configuration because the Linux systems in question use the PAM include file mechanism to propagate this file's stacks to all of the PAM-enabled services. Note, however, that both of these specifications implement the same underlying system configuration maintenance policy: update the relevant PAM configuration files from the master server if necessary.
The final, tidy stanza illustrates the use of implicit looping. The single directive in the example applies to each of the directories in the tmpdirs list. For each directory, cfengine will delete all items in the directory or any of its subdirectories which have not been accessed in seven days (including ones where the filename begins with a period). Like the other directives in this sample configuration file, this stanza implements a policy: items in temporary directories which have not been used within a week will be deleted.
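To see what the implicit loop saves us from writing, here is a rough Python equivalent of the selection step described above. It is a sketch only, not a description of how cfengine is implemented: the function name `stale_items` and its parameters are invented for this illustration, and it merely lists the candidates rather than deleting anything.

```python
import os
import time


def stale_items(directories, age_days=7, now=None):
    """List files under each directory (recursively) whose last access
    time lies more than `age_days` days in the past. Nothing is deleted."""
    now = time.time() if now is None else now
    cutoff = now - age_days * 86400
    stale = []
    for top in directories:                # the loop that the policy leaves implicit
        for root, _dirs, files in os.walk(top):
            for name in files:             # names starting with a period are included
                path = os.path.join(root, name)
                if os.lstat(path).st_atime < cutoff:
                    stale.append(path)
    return stale
```

In the policy description all of this collapses into a single line, because the list variable is expanded and the recursion is requested with one attribute.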
All cfengine configuration descriptions are variations on these and similar themes, albeit more elaborate ones. Before turning to more details about the technical aspects of using cfengine, a brief consideration of the most important underlying and guiding theoretical concepts is in order.
As we've stated, cfengine operates on hosts in order to bring their configurations in line with the specified policies. We need to define some terms.
What are we aiming for with cfengine? The answer is: policy conformant configuration. We want to formulate a specification of not just one host, but usually many, including how they all interact, perhaps to solve a business problem; then we want to leave the details, implementation and maintenance to a robot agent: cfagent.
Humans are good at understanding input and thinking up solutions, but they are not very reliable at implementation, that is, at carrying out the same task consistently. Machines and software agents are good at carrying out tasks reliably, but are not good at understanding or finding actual solutions. With cfengine, you let the distinct parts of your human-computer organization concentrate on what they are each good at.
CFEngine can also produce reports about systems for monitoring the performance and compliance with policies. This is an important aspect of business integration as service providers want to know whether they are delivering what they have promised, and whether their money has been spent wisely.
Cfengine's philosophy fits quite well with the service oriented approach to computing.
A cfengine policy can be thought of as a list of promises which the system makes to some auditor about its configuration. Most of these promises involve the possibility of change to make a host fulfil its policy promises. We call such changes actions or operations. As you probably already guessed, the auditor in this scenario is part of cfengine itself. Cfagent is also the mechanic or surgeon that performs the operations on the system, if it does not meet its promises.
By describing its operation in this manner, we can think of configuration management as a service that is provided, a service that is intimately connected with monitoring and maintenance, and which can be “bought” on demand without necessarily subordinating a system to a central authority.
For example, here is a promise about the attributes of a file:
    files:
       /etc/passwd mode=a+r,go-w owner=root group=root action=fixall
There are implicit operations (actions) in this declaration: specifically, the operations that will change the attributes if/when they do not conform to this specification.
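The following Python sketch spells out the kind of implicit operation that such a promise stands for. It is an illustration under stated assumptions, not cfengine's implementation: the function name `keep_mode_promise` is invented here, only the permission part of the promise (a+r,go-w) is handled, and ownership is left out because changing it normally requires root privileges. A POSIX file system is assumed.

```python
import os
import stat


def keep_mode_promise(path, must_set=0o444, must_clear=0o022):
    """Bring the permission bits of `path` in line with a+r,go-w.

    Returns True if a repair was needed, False if the promise was
    already kept and nothing was touched."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    wanted = (current | must_set) & ~must_clear
    if wanted == current:
        return False           # already compliant: no operation is performed
    os.chmod(path, wanted)     # the implicit operation hidden behind the promise
    return True
```

The declaration states only the desired end state; the comparison and the repair are the parts that the agent supplies on our behalf.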
A key property of cfengine is convergence. This is an important characteristic that distinguishes it from general computer languages. It is a property that helps to prevent systems from diverging: running away in an uncontrollable fashion.
Here is an example used during the editing of an ASCII file:
    editfiles:
       ...
       AppendIfNoSuchLine "Important configuration line"
This operation tells cfengine to append the given text to the end of a file, only if it is not already there. The policy-conformant configuration is therefore that the line is present, and once that is achieved nothing more will be done. We say that the operation AppendIfNoSuchLine is convergent.
Don't underestimate the value of convergence. It provides you with stability. Because cfengine's language interface strongly discourages you from doing anything non-convergent, it also helps to prevent mistakes. The price is that you will have to learn to think in a convergent way, and that is new for most people who come to cfengine for the first time.
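For readers who think more easily in a procedural language, a convergent append might be sketched in Python as follows. This is an analogy, not cfengine code: the function name `append_if_no_such_line` merely echoes the editfiles operation, and the decision to create a missing file is an assumption of this sketch.

```python
def append_if_no_such_line(path, line):
    """Append `line` to the file at `path` unless an identical line is
    already present. Returns True if the file was changed."""
    try:
        with open(path, encoding="utf-8") as fh:
            text = fh.read()
    except FileNotFoundError:
        text = ""                          # assumption: a missing file is created
    if line in text.splitlines():
        return False                       # desired state reached: do nothing
    with open(path, "a", encoding="utf-8") as fh:
        if text and not text.endswith("\n"):
            fh.write("\n")                 # start on a fresh line
        fh.write(line + "\n")
    return True
```

However many times the function is called, the file ends up containing the line exactly once. A plain, unconditional append would instead grow the file on every run, which is the kind of runaway behaviour that convergence rules out.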
One of the features that makes cfengine policies readable is the ability to hide away all of the complex decision-making that needs to be performed by the agent. To realize this ambition, cfengine uses a declarative language to express policy.
A declarative language is simply a structured list of sentences (in the case of cfengine, it is a list of policy promises). It is stated in no particular order; it describes a final goal that is to be achieved. The details of how one gets there are left implicit: to be evaluated and implemented by the engine that interprets the specification. This is in contrast to procedural or imperative languages, such as shell or Perl which micro-manage every step along the way.
In an imperative language, one focuses on the procedure. In a declarative language, one focuses on the intention, or the presumed result.
One example of this is the use of classes in cfengine. Classes are a way of making decisions, without writing many “if-then-else” clauses. A class is an identifier which has the value “true” when a particular test is true. It is a Boolean variable; if you like, it caches the result of an “if” test. The benefit of classes is that all of the testing can be hidden away in the bowels of cfengine, and only the results need be visible if or when they are needed.
For example, the class debian is true if and only if cfagent is running on a host that has Debian Linux as its operating system.
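The idea of a class as a cached test result can be imitated in a few lines of Python. The sketch below is only an analogy: `discover_classes` and `applicable` are hypothetical names, the use of `platform.system()` is this example's own choice of test, and real cfengine discovers far more classes and allows them to be combined into expressions.

```python
import platform


def discover_classes():
    """Run each test once and remember the outcome as a set of names."""
    classes = {"any"}                          # a class that is always true
    classes.add(platform.system().lower())     # e.g. "linux"; naming is this sketch's own
    return classes


def applicable(promises, classes):
    """Select the promises whose guarding class is currently true."""
    return [text for guard, text in promises if guard in classes]


promises = [
    ("solaris", "copy /config/pam/solaris"),
    ("linux", "copy /config/pam/common-auth"),
    ("any", "tidy scratch directories"),
]
print(applicable(promises, {"any", "linux"}))
```

The policy author writes only the guard names; the tests behind them are evaluated once, out of sight, and their results are reused wherever they are needed.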
It is a fundamental property of cfengine components that every host retains its individual autonomy. A host can always opt out of cfengine-based governance if its administrator wants to. This principle leads to a fundamental design and implementation decision:
It is important to understand what this means. It does not mean that centralized control of hosts cannot be achieved. Centralized control is the way that most users choose to use cfengine. Indeed, all you have to do to achieve centralized control is to make a policy decision for all your hosts to fetch policy specifications from a central authority.
Autonomy does mean that if your environment has some small groups or sub-cultures with special needs, it is possible for them to retain their special identity. No one claiming to be their self-appointed authority can ride roughshod over their local decisions.
Where does policy come from then? Each host works from a policy specification that cfengine expects to find in a local directory (usually /var/cfengine/inputs on a Unix-like host). If you want your host to be controlled from some central manager or authority, then your policy must contain bootstrapping specifications that say: “it is my decision that I should download and follow the policy specification located at the central manager.”
Each host can turn this policy decision off at any time. This is a key part of the cfengine security model.
Cfengine's scalability is at least as good as that of any other system, because it allows for maximal distribution of workload.
This does not mean that you are immune from making bad decisions. For example, network services can always be a bottleneck if you ask 10,000 hosts to fetch something from one place at the same time.
The fact that each cfengine agent keeps a local copy of policy (regardless of whether it was written locally or inherited from a central authority) means that cfengine will continue to function even if network communications are down.
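The consequence of keeping a local copy can be shown with a very small Python sketch. It is a thought experiment rather than a description of cfengine's update mechanism: the function `refresh_policy` and the idea of passing the download step in as a callable are assumptions made for illustration.

```python
def refresh_policy(fetch, cached):
    """Return the policy to act on: a fresh copy if the voluntary
    download succeeds, otherwise the local copy already held."""
    try:
        return fetch()
    except OSError:          # network errors such as ConnectionError are OSErrors
        return cached


def unreachable_server():
    # Stand-in for a download attempt while the network is down.
    raise ConnectionError("policy server unreachable")


print(refresh_policy(unreachable_server, "local policy"))   # falls back to the local copy
print(refresh_policy(lambda: "new policy", "local policy"))
```

Because the agent always has something to work from, a network outage delays policy updates but does not stop maintenance.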
The cfengine software consists of a number of components: separate programs that work together (see figure). The components differ between version 1 and version 2. We shall only discuss cfengine 2 here, as cfengine version 1 is no longer supported, and you are strongly advised to use version 2. In addition, CFEngine version 3 is being developed at the time of writing, but it will take a number of years before it can fully replace version 2. It will incorporate the state of the art in Network and System Administration research, building on all the lessons learned from versions 1 and 2.
The components of cfengine are:
cfagent: Interprets policy promises and implements them in a convergent manner. The agent can use data generated by the statistical monitoring engine cfenvd and it can fetch data from cfservd running on local or remote hosts.
cfexecd: A scheduler and wrapper which executes cfagent and logs its output (optionally sending a summary via email). It can be run in daemon (standalone) mode, or it can be run from cron on a Unix-like system.
cfservd: A server daemon that serves file data. It can also be configured to start cfagent immediately on receipt of a connection from cfrun. No actual data can be passed to this daemon.
cfrun: A helper application that polls hosts and asks them to run cfagent if they agree.
cfenvd: A statistical state monitor that collects statistics about resource usage on each host for anomaly detection purposes. The information is made available to the agent in the form of cfengine classes so that the agent can check for and respond to anomalies dynamically.
cfkey: Generates public-private key pairs on a host. You normally run this program only once, as part of the cfengine software installation process.
cfshow: Displays the cfagent database contents in ASCII format, should you ever become interested in its internal memory.
cfenvgraph: Dumps cfenvd's statistical database contents in a form that can be used to plot graphs showing the normal behavior of a host in its environment.
This figure illustrates the relationships among cfengine components on different hosts. On a given system, cfagent may be started by the cfexecd daemon; the latter also handles logging during cfagent runs. In addition, operations such as file copying between hosts are initiated by cfagent on the local system, and they rely on the cfservd daemon on the remote system to obtain remote data.
The IT Infrastructure Library (ITIL) is a collection of books, in which “best practices” for IT Service Management (ITSM) are described. Today, ITIL can be seen as a de-facto standard in the discipline of ITSM, for which it provides guidelines by its current core titles Service Strategy, Service Design, Service Transition, Service Operation and Continual Service Improvement. ITIL follows the principle of process-oriented management of IT services.
In effect, the responsibilities for specific IT management decisions can be shared between different organizational units as the management processes span the entire IT organization independent from its organizational partition. Whether this means a centralization or decentralization of IT management in the end, depends on the concrete instances of ITIL processes in the respective scenario.
ITIL has its roots in the early 1990s, and has since been subject to numerous improvements and enhancements. Today, the most popular release of ITIL is given by the books of ITIL version 2 (often referred to as ITILv2), while the British OGC (Office of Government Commerce), owner and publisher of ITIL, is currently promoting ITIL version 3 (ITILv3) under the slogan “ITIL Reloaded”.
It is important to understand that ITILv3 is not just an improved version of the ITILv2 books, but rather comes with a completely renewed structure, new sets of processes and a different scope with respect to the issue of IT strategies, IT-business-alignment and continual improvement. That is why, in the following, we run through the basics of both versions, highlighting commonalities and differences.
ITIL is based on the paradigm of process-oriented IT Service Management. In addition, ITIL uses the Deming quality circle as a model for continual quality improvement, where quality relates both to the provided IT services and to the management processes deployed to manage these services. According to ITIL, continual improvement means following the method of Plan-Do-Check-Act:
Although ITILv3 was released in the summer of 2007, it is its predecessor that has achieved great acceptance amongst IT service providers all over the world. And because the international ISO/IEC 20000 standard emerged from the basic principles and processes of ITILv2, it is this version that enjoys the widest distribution and popularity.
The core modules of ITILv2 are the books entitled Service Support and Service Delivery. While the Service Support processes (e.g. Incident Management, Change Management) aim at supporting day-to-day IT service operation, the Service Delivery processes (e.g. Service Level Management, Capacity Management, Financial Management) are supposed to cover IT service planning like resource and quality planning, as well as strategies for customer relationships or dealing with unpredictable situations.
In 2007, ITILv2 was replaced by its successor ITILv3, which aims to cover the entire service life cycle from a management perspective and strives for a more substantiated idea of IT-business alignment. Many of the ITILv2 processes and ideas have been recycled and extended by various additional processes and principles. The five service life cycle stages according to ITILv3 are:
Why service and process orientation? What is ITIL trying to do? As we mentioned in the introduction, the ‘military’ control view of human organization fell from favour in business research in the 1980s and service oriented autonomy was identified as a new paradigm for levelling organizations: getting rid of deep hierarchies that hinder communication, and opening up direct communication instead.
If one is cynical, one can read ITIL as a sign of CEOs nervously trying to put some of the military thinking back into process management, with definitions of authority and chains of responsibility. But these chains are short, and whenever ITIL says “committee”, promise theory would say that all we need is a single agent (a human or computer) whose internal details don't matter. We should probably not think too literally about ITIL's choice of words, which after all were born from a particular kind of corporate culture and will not appeal to everyone.
If we look at ITIL through the eyeglass of a hierarchical organization, some of its procedures could be seen as restrictive, throttling scalable freedoms. We do not believe that this is their intention. Rather ITIL's guidelines try to make a predictable and reliable face for business and IT operations so that customers feel confidence, without choking the creative process that lies behind the design of new services.
CFEngine users are interested in the ability to manage, i.e. cope with system configuration in a way that enables a business or other organization to do its work effectively. They don't want reams of human management because this is what cfengine is supposed to remove. To be able to use ITIL to help in this task, we have to first think of the process of setting up as a number of services. What services are these? We have to think a little sideways to see the relationship.
You should keep this kind of thinking in mind, and train yourself to see every part of a task in “ITIL clothes”.
The following management processes are in scope of ITILv3:
Service strategy is about deciding what services you want to formalize. In other words, what parts of your system administration tasks can you wrap in procedural formalities to ensure that they are carried out most excellently?
Service design is about deciding what will be delivered, when it will be delivered, how quickly the service will respond to the needs of its clients etc. This stage is probably something of a mental barrier to those who are not used to service-oriented thinking.
How shall we support service operation? What resources do we need to provide, both human and computer? Can we be certain of having these resources at all times, or is there resource sharing taking place? If services are chained into “supply chains”, remember that each link of the chain is a possible delay, and a possible misunderstanding. Successfully running services can be a more complex task than we expect, and this is why it is useful to formalize them in an ITIL fashion.
Continual improvement is quite self-explanatory. We are obviously interested in learning from our mistakes and improving the quality and efficiency by which we respond to service requests. But it is necessary to think carefully about when and where to introduce this aspect of management. How often should we revise our plans and change procedures? If this is too often, the overhead of managing the quality becomes one of the main barriers to quality itself! Continual has to mean regular on a time-scale that is representative for the service being provided, e.g. reviews once per week, once per month? No one can tell you about your needs. You have to decide this from local needs.
In the field of tool support for IT Service Management according to ITIL, various white papers and studies have been published. In addition, there are papers available from BMC, HP, IBM and other vendors that describe specific (commercial) solutions. Generally, the market for tools is growing rapidly, since ITIL increasingly gains attention especially in large and medium-size enterprises. Today, it is already hard to keep track of the variety of functionalities different tools provide. This makes it even more difficult to approach this topic in a way satisfactory to the entire researchers', vendors' and practitioners' community.
That is why this document follows a different approach: Instead of thinking of ever new tools and computer-aided solutions for ITIL-compliant IT Service Management, this book analyses how the existing and well-established technologies used for traditional systems administration can fit into an ITIL-driven IT management environment, and it guides potential practitioners in integrating a respective tool suite – namely cfengine – with ITIL and its processes.
To avoid any misunderstanding: We do not argue that cfengine – originally invented for configuring distributed hosts – may be deployed as a comprehensive solution for automating ITIL, but we do believe that cfengine and its more recent innovations can bridge the gap between the technology of distributed systems management and business-driven IT Service Management. To make the case we must show:
These are the main goals of the subsequent chapters.
To summarize the results of the previous chapters, it can be said that the goals of ITIL and the purpose of cfengine are quite different: ITIL gives recommendatory guidance in process- and service- oriented IT Service Management, while cfengine provides a powerful solution framework for a variety of common network and systems administration tasks. In other words:
The goal of this document is to give an overview on how cfengine can be used to support selected IT Service Management tasks according to ITIL.
In version 2, ITIL divides itself into service support and service delivery. For instance, service support might mean having a number of cfengine experts who can diagnose problems, or who have sufficient knowledge about cfengine to solve problems using the software. It could also mean having appropriate tools and mechanisms in place to carry out the tasks. Service delivery is about how these people make their knowledge available through formal processes, how available are they and how much work can they cope with? CFEngine enables a few persons to perform a lot of work very cheaply, but we should not forget to track our performance and quality for the process of continual improvement.
Service support is composed of a number of issues:
Service delivery, on the other hand, is dissected as follows:
The notion of system administration in the sense of Unix does not exist in ITIL. In the world of business, reinvented through the eyes of ITIL's mentors, system administration and all its functions are wrapped in a model of service provision.
Perhaps the most obvious example is the term configuration management.
As we see, this is comparable to our intuitive idea of “asset management”, but with “relationships” between the items included. ITIL also defines “Asset Management” as “a process responsible for tracking and reporting the value of financially valuable assets”, which is a component of ITIL Configuration Management.
In the cfengine world, configuration management involves planning, deciding, implementing (“base-lining”) and verifying (“auditing”) the inventory. It also involves maintaining the security and privacy of the data, so that only authorized changes can be made and private assets are not made public.
In this document we shall try not to mix the ITIL concept with the more prosaic system administration notion of a configuration which includes the current state of software configuration on the individual computers and routers in a network.
Since cfengine is a completely distributed system that deals with individual devices on a one-by-one basis, we must interpret this asset management at two levels:
Why bother to collect an inventory of this kind? Is it bureaucracy gone mad, or do we need it for insurance purposes? Both of these things are of course possibilities.
The data in an ITIL Configuration Management Database (CMDB) can be used for planning the future and for knowing how to respond to incidents, in other words for service level management (SLM) and for capacity planning. An organization needs to know what resources it has in order to know whether it can deliver on its promises. Moreover, for finance and insurance it is clearly a sound policy to have a database of assets.
For continuity management, risk analysis and redundancy assessment we need to know how much equipment is in use and how much can be brought in at a moment's notice to solve a business problem. These are a few of the reasons why we need to keep track of assets.
If we make changes to a technical installation, or even a business process, this can affect the service that customers experience. Major changes to service delivery are often written into service level agreements since they could result in major disruptions. Details of changes need to be known by a help-desk and service personnel.
The decision to make a change is usually more than a single person should take alone. It requires consultation at different levels of process. An advisory board for changes takes on this role, whether it is an informal board that communicates electronically or a physical committee “with six or more legs and no brain”.
We should be especially careful here to decide what we mean by change. ITIL assumes a traditional model of change management that cfengine does not need. ITIL's ideas apply to the management of cfengine's configuration, not the way in which cfengine carries out its work.
In the traditional idea of change management you start by “base-lining” a system, or establishing a known starting configuration. Then you assume that things only change when you actively implement a change, such as “rolling out a new version” or committing a release. This, of course, is very optimistic.
In most cases all kinds of things change beyond our control. Items are stolen, things get broken by accident and external circumstances conspire to confound the order we would like to preserve. The idea that only authorized people make changes is nonsense.
CFEngine takes a different view. It thinks that changes in circumstances are part of the picture, as well as changes in inventory and releases. It deals with the idea of “convergence”. In this way of thinking, the configuration details might be changing at random in a quite unpredictable way, and it is our job to continuously monitor and repair general dilapidation. Rather than assuming a constant state in between changes, cfengine assumes a constant “ideal state” or goal to be achieved between changes. An important thing to realize about including changes of external circumstances is that you cannot “roll back” circumstances to an earlier state – they are beyond our control.
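The idea of convergence can be illustrated outside cfengine's own language. The following Python sketch is not cfengine code: the dictionary model of a host and the names `converge` and `desired` are hypothetical, chosen only to show that the same ideal state can be reached from any starting state, and that repeating the operation on a compliant host changes nothing.

```python
# Hypothetical model: a host's configuration is a dict of setting -> value.
def converge(actual, desired):
    """Repair only what deviates from the desired state; return the repairs made."""
    repairs = {}
    for key, want in desired.items():
        if actual.get(key) != want:
            repairs[key] = (actual.get(key), want)
            actual[key] = want
    return repairs

desired = {"sshd_running": True, "motd": "Authorized use only"}

# Two hosts that start in different, unknown states.
host_a = {"sshd_running": False, "motd": "hello", "uptime_days": 12}
host_b = {}

first_a = converge(host_a, desired)
first_b = converge(host_b, desired)
second_a = converge(host_a, desired)   # nothing left to repair

# Unplanned drift ("weather") is repaired on the next pass.
host_a["motd"] = "tampered"
third_a = converge(host_a, desired)
```

Note that the sketch never consults the previous state of the host in order to decide what to do. It only compares the present state with the goal, which is why no base-line is needed.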
A release is a collection of authorized changes to a system. One part of Change Management is therefore Release Management. A release is generally a larger umbrella under which many smaller changes are made. It is a major change. Changes are assembled into releases and then they are rolled out.
In fact release management, as described by ITIL, has nothing to do with change management. It is rather about the management of designing, testing and scheduling the release, i.e. everything to do with the release process except the explicit implementation of it. Deployment or rollout describe the physical movement of configuration items as part of a release process.
ITIL distinguishes between incidents and problems. An incident is an event that might be problematic, but in general one would observe incidents over some length of time and then diagnose problems based on this experience.
One goal of cfengine is to plan pro-actively to handle incidents automatically, thus taking them off the list of things to worry about.
Changes can introduce new incidents. An integrated way to make the tracking of cause and effect easier is clearly helpful. If we are the cause of our own problems, we are in trouble!
Also loosely referred to as Quality of Service. This is the process of making sure that Service Level Promises are kept, or Service Level Agreements (SLA) are adhered to. We must assess the impact of changes on the ability to deliver on promises.
Like many other areas of wishful standardization, ITIL elevates itself to a state of importance by using a multitude of acronyms and specialized terms. Not all of these are as intuitive as one might hope for and many simply seem beyond necessity. However, to understand the writing, we need to know a few of them and also understand how they differ from similar terms in system administration and the world of cfengine. In the appendix, we list, with comments, the most important of these terms. The figure shows a scatter-plot of these terms.
How does cfengine fit into the management of a service organization? There are several ways:
Any tool for assisting with change management lies somewhere between ITIL's notion of change management and the infrastructure itself. It must essentially be part of both (see figure). This applies to cfengine too.
CFEngine can manage itself as well as other resources: itself, its software, its policy and the resulting plans for the configuration of the system. In other words, cfengine is itself part of the infrastructure that we might change.
Traditional methods of managing IT infrastructure involve working from crisis to crisis – waiting for “incidents” to occur and then initiating fire suppression responses or, if there is time, proactive changes. With cfengine, these can be combined and made into a management service, with continuous service quality.
CFEngine can assist with:
Promise theory comes with a couple of principles:
Other approaches to discussing organization talk about the separation of concerns, so why is promise theory special? Object Orientation (OO) is an obvious example. Promise theory is in fact quite different to object orientation (which is a misnomer).
Object orientation asks users to model abstract classes (roles) long before actual objects with these properties exist. It does not provide a way to model the instantiated objects that later belong to those classes. It is mainly a form of information structure modelling. Object orientation models only abstract patterns, not concrete organizations.
Promise theory on the other hand considers only actual existing objects (which it calls agents) and makes no presumptions that any two of these will be similar. Any patterns that might emerge can be exploited, but they are not imposed at the outset. Promise theory's insistence on autonomy of agents is an extreme viewpoint from which any other can be built (just as atoms are a basic building block from which any substance can be built) so there is no loss of generality by making this assumption.
In other words, OO is a design methodology with a philosophy, whereas promises are a model for an arbitrary existing system.
The traditional production-line paradigm for management of IT systems involves reducing the number of variations – often simply making all systems identical for mass-production. However, as quoted at the beginning of chapter 2, the purpose of advanced technology is to enable us to cope with variation. CFEngine makes managing variations simple. Some organizations might simply want to have a uniform configuration on all their hardware, but what does this mean if the basic hardware is different?
In cfengine we understand that “similar” should be based on how systems behave, not what their disk images look like. Two systems that make the same promises ought to behave in the same way, if the promises are at a high enough level. But what if two different operating systems promised to never have a file called /etc/passwd? A Windows machine would not care too much, but a Unix system would be paralyzed.
Promises and system configuration are related: configuration affects behaviour and behaviour is what we promise. Clearly we cannot expect very high level promises to be simply translated into configurations however. The fact that we make promises about system configuration says nothing certain about the promise that results from changing it. That depends on many other factors. Thus we must be careful to think about what a promise means.
Maintenance is a process that ITIL does not formally spend any time on explicitly, but it is central to real-world quality control.
Imagine that you decide to paint your house. Release 1 is going to be white and it is going to last for 6 years. Then release 2 is going to be pink. We manage our painting service and produce release 1 with all of the care and quality we expect. Job done? No.
It would be wrong for us to assume that the house will stay this fine colour for 6 years. Wind, rain and sunshine will spoil the paint over time and we shall need to touch up and even repaint certain areas in white to maintain release 1 for the full six years. Then when it is time for release 2, the same kind of maintenance will be required for that too.
Unless we read between the lines, it would seem that ITIL's answer to this is to wait for a crisis to take place (an incident). We then mobilize some kind of response team. But how serious an incident do we require and what kind of incident response is required? A graffiti artist? A lightning strike? A bird anoints the paint-work? CFEngine is like the gardener who patrols the grounds constantly plucking weeds, before the flower beds are overrun. Call it continual improvement if you like: the important thing is that the process should be pro-active and not too expensive.
Maintenance is necessary because we do not control all of the changes that take place in a system. There is always some kind of “weather” that we have to work against. CFEngine is about this process of Maintenance. We call it “convergence” to the ideal state, where the ideal state is the specified version release. Keep this in mind as you read about ITIL change management.
CFEngine employs the idea of continual maintenance (we paint the fence on a regular basis to protect it). ITIL, on the other hand, moves from release to release (this year we paint the fence red, next year green) and does not recognize the effect of gradual entropic decay of state (the fence's colour fades gradually due to the harsh environment). Instead ITIL deals with events (graffiti and tagging of the fence) which must be corrected. While it is true that these incidents are maintenance, the repairs are more costly to initiate if they occur as exceptional events than if we are used to repainting the fence on a regular basis.
The figures above show ITIL processes for the handling of events and incidents. They show the aspects of dealing with events that are mainly human oriented, and those events in shaded boxes that can be automated using cfengine.
In the figure above we see that there must be a basic monitor at the top of the process chain which is responsible for observing events. This fits well with the view of promise theory in which a neutral observer is required to measure the state of different component agents in the system. Not all events are necessarily relevant or interesting so we can filter these based on a policy. CFEngine's event monitors come from two sources: cfagent (for monitoring the state of promises which are being managed - e.g. the proverbial colour of the fence) and cfenvd (for passively monitoring the environment - e.g. the brightness of the sunshine or the amount of rainfall impacting on the fence).
inform=true or syslog=true or audit=true).
cfenvd auto-correlates events. The tool cfbrain will cross-correlate events, further classifying the outcomes as part of the environment.
    processes:

      www_in_high_anomaly::

        "apache" signal=term

    alerts:

      www_in_high_anomaly::

        ShowState(www.in)
For more devastating incidents, we can arrange for more information to be output. An incident is really only an event of some special significance. Diagnosing an incident requires either human intervention or pre-cached insight on the part of the promises we make. If we can make a specific promise then the diagnosis that this promise has not been kept can easily be turned into a specific repair. For example,
We might note a sudden burst of smtp traffic, or a sudden decrease in free disk space. These events can be anticipated if one knows a benign cause, for example that email was shut down for maintenance, or that the host is a new mail-server that has never seen traffic before.
When setting up hosts, ITIL actually makes a technical recommendation. This is unusual for ITIL as it generally does not get mixed up in the details of management, only the processes. ITIL recommends “base-lining” systems from a gold server, i.e. a system that is thought to be “perfect” enough to act as a model for all other systems. Once a server has been base-lined from the golden image, various customizations can be made relative to this known state. ITIL sees this as a way of achieving consistency.
We believe that ITIL exceeds its technical competence in making this a recommendation. True enough, this has traditionally been a way of performing a rollout, but the approach has been superseded by better technology. The gold server approach is not the recommended cfengine way. In fact a golden-image approach wastes a fundamental flexibility that cfengine offers, namely the possibility to allow variations (see the quote by Alvin Toffler at the start of chapter 2).
When we baseline a system from a gold-server, we are planning to make all hosts basically the same. However, this is neither necessary nor cost-saving if you use cfengine.
CFEngine places no restrictions on the approach used to roll out hosts. Rather than requiring you to start from a known state, it allows you to specify the final state for any initial state. This means you can migrate hosts gradually to a policy state without having to reinstall them. We can consider the end result of a cfengine policy process to be “the release”. In cfengine this is equivalent to a sufficiently comprehensive configuration policy.
The message here is that cfengine allows you to achieve predictable results without the need for a gold server. Nevertheless, it is helpful to begin a system based on a reliable substrate. It is like making a good sandwich: it helps to have a perfect piece of bread to build on, but it's what you put on top that is most important. You just need to know what you are starting with, and then most things can be fixed to satisfaction. We recommend:
It does not necessarily matter what it is as long as it behaves predictably, e.g. install from a known DVD, from net-boot or even from a gold server.
i) Copy constant “gold” overlays or patches into place from a trusted source to customize the system.
ii) Edit system directly with cfengine
The first alternative is to install a fixed patch to a system from a known gold-server. The basic pattern is this:
    copy:

      /source/file dest=/dest/file server=gold_server
In this example, we simply install a new file into a known location. This is the simplest way of customizing a host, but it lacks flexibility.
A more sophisticated approach is to download a parameterized template from a repository or gold server. This template contains context dependent variables that can be expanded in situ by cfengine. There are two stages to this: first we copy the template to a temporary location, then we edit the final file location, insert the template and expand its variables. By following this procedure, the result satisfies cfengine's principles of convergence.
    copy:

      /source/file dest=/tmp/file server=gold_server

    editfiles:

      { /dest/file

      EmptyEntireFilePlease
      InsertFile "/tmp/file"
      ExpandVariables
      }
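To make the two-stage template procedure concrete, here is a minimal Python sketch of the same idea. It is not how cfengine implements ExpandVariables: the helper names `expand_variables` and `install_from_template`, the sample template and the choice to leave unknown variables untouched are all assumptions made for illustration. The point is that rebuilding the file from the template gives the same result however often it is repeated.

```python
import re
import tempfile
from pathlib import Path

def expand_variables(text, variables):
    """Replace each $(name) with its value; leave unknown names untouched."""
    return re.sub(r"\$\((\w+)\)",
                  lambda m: variables.get(m.group(1), m.group(0)),
                  text)

def install_from_template(template_path, dest_path, variables):
    """Rebuild dest from the template; write only if the content differs."""
    wanted = expand_variables(Path(template_path).read_text(), variables)
    dest = Path(dest_path)
    if dest.exists() and dest.read_text() == wanted:
        return False          # already converged, nothing to do
    dest.write_text(wanted)
    return True

# Hypothetical template and target in a scratch directory.
workdir = Path(tempfile.mkdtemp())
template = workdir / "template"
target = workdir / "target"
template.write_text("ServerName $(host)\nListen $(port)\n")

settings = {"host": "www1", "port": "80"}
changed_first = install_from_template(template, target, settings)
changed_again = install_from_template(template, target, settings)
```

The second call reports no change, which is the behaviour we expect of a convergent operation.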
A final approach to customization is to apply direct editing operations to implement the required customization.
    editfiles:

      { /dest/file

      ReplaceAll "X" With "Y"
      AppendIfNoSuchLine "ABC"
      }
This approach is useful for small corrections, that require unsophisticated editing, but it becomes quickly cumbersome for more complex tasks.
ITIL proposes that there should be an integrated approach to change and configuration management. Clearly changes to a system result in new configurations. However, changes can also be unplanned involuntary faults (ITIL discusses these as incidents).
ITIL does not want unplanned changes, however we know that they happen. CFEngine does not normally elevate deviations from policy to the level of an incident, it simply fixes problems immediately. However, we do not always have enough information about changes to allow cfengine to make repairs, so we need a way of monitoring for unexpected change.
Change management in cfengine is a subtle topic, because cfengine does not fully subscribe to the model of change that ITIL does. In cfengine's view of the world, all changes are changes no matter how or why they occur. In ITIL's world view, there are planned changes, there are releases and there are “incidents”.
ITIL therefore distinguishes between planned and unplanned changes that affect service delivery. CFEngine on the other hand cares only about what promises have been made about the system and whether or not these have been kept.
CFEngine can detect changes because it effectively performs a constant audit of the system's promises. We should understand cfengine's change detection in two ways: changes that impact the performance or quality of services
Let's make an approximate mapping between ITIL concepts and cfengine change and then comment critically on it.
ITIL considers releases to be entire integrated systems that are versioned. Most operating systems work at a smaller level of granularity than this. Software version control uses package managers to version individual software packages. Although such package managers resolve dependencies, they do not version entire conglomerates of software. Software comes in large packages for two main reasons:
CFEngine deals with versioned data management in two ways:
Not all software comes in operating system (vendor/provider) approved packages, but cfengine can also handle software that is zipped, tar-ed or bundled in any other manner.
The following example policies illustrate some of the copy rule type's capabilities, including some of the options we just considered:
    control:

      DefaultCopyType = ( mtime )
      SplayTime = ( 15 )
      sourcehost = ( source.cfengine.org )

    copy:

      # Copy dat/doc files if not too big

      /usr/local/data dest=/archive/data
          include=*.dat include=*.doc exclude=test.*
          recurse=inf backup=false size<500m

      # Retrieve configuration file from master

      /depot/hosts.deny server=$(sourcehost)
          dest=/etc/hosts.deny
          owner=root group=0 mode=644
          backup=off force=on timestamps=keep

      # Transmit shadow password file encrypted

      /depot/shadow server=$(sourcehost)
          dest=/etc/shadow
          owner=0 group=0 mode=600 encrypt=true
The first rule specifies that .dat and .doc files within the /usr/local/data directory tree be copied to /archive/data, provided that the source files have been modified more recently than their counterparts in the target directory and that they are smaller than 500 MB. Files whose names match test.* are excluded. Existing files will be overwritten without being saved.
The second rule unconditionally replaces the local /etc/hosts.deny file with one from the system source.cfengine.org, retaining the timestamps from the source file. This rule also specifies the ownership and mode for the target file.
The third rule is similar to the second one, retrieving another file from the same remote system. In this case, however, the file will be copied only when the remote file is more recent than the local copy. When the file is copied, the previous version will be retained, and the file contents will be encrypted as they are transmitted across the network.
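The decision behind the first rule can be written out as a small Python sketch. This is only our reading of the prose above, not cfengine's implementation: the function name `should_copy`, its parameters and the interpretation of 500m as 500 × 1024 × 1024 bytes are assumptions.

```python
from fnmatch import fnmatch

def should_copy(name, src_mtime, src_size, dst_mtime,
                include=("*.dat", "*.doc"), exclude=("test.*",),
                max_size=500 * 1024 * 1024):
    """Decide whether a source file qualifies for an mtime-based copy."""
    if not any(fnmatch(name, pattern) for pattern in include):
        return False                      # not one of the wanted file types
    if any(fnmatch(name, pattern) for pattern in exclude):
        return False                      # explicitly excluded
    if src_size >= max_size:
        return False                      # too big
    # dst_mtime is None when the destination does not exist yet.
    return dst_mtime is None or src_mtime > dst_mtime
```

For instance, `should_copy("report.doc", 200, 1000, 100)` is true because the source is newer than the destination, whereas `should_copy("test.dat", 200, 1000, 100)` is false because the exclude pattern wins over the include pattern.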
CFEngine can also automate software package management and
installation. Policies for these items are specified in the packages
stanza. Here are some examples:
    control:

      # Define package manager & install command

      linux::
        DefaultPkgMgr = ( rpm )

      redhat::
        RPMInstallCommand = ( "/usr/sbin/up2date %s" )

      suse::
        RPMInstallCommand = ( "/usr/sbin/yast2 -i %s" )

    packages:

      nagios version=2.4 cmp=ge

      pstree action=install
The settings in the control section specify the package management software that is in use as well as the command used to install a software package. These directives illustrate the use of operating system-based classes within policies for defining a different installation command for different Linux distributions.
In the packages stanza, the first rule checks whether Nagios is installed. A warning will be generated if the package is not present at all or if the installed version is earlier than version 2.4. The second rule checks for the pstree package, and installs it if it is not present on the system.
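The version check in the first packages rule amounts to a simple comparison. The Python sketch below shows one way to express it. It is not cfengine's own comparison code: the function names are hypothetical, the operator names other than ge are assumed by analogy, and only plain dotted numeric versions are handled.

```python
def parse_version(text):
    """Split a dotted version such as '2.4' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))

def version_promise_kept(installed, required, cmp="ge"):
    """True if the installed version satisfies the comparison.

    installed is None when the package is not present at all.
    """
    if installed is None:
        return False
    a, b = parse_version(installed), parse_version(required)
    outcomes = {"eq": a == b, "ge": a >= b, "gt": a > b,
                "le": a <= b, "lt": a < b}
    return outcomes[cmp]
```

Comparing tuples of integers rather than strings matters here: as a string, "2.10" sorts before "2.4", but as a version it is the later release.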
The following parameterized method-promise installs its first argument in the prefixed location given by the second argument. It collects the tar file, unpacks it, configures and compiles it, then tidies its files.
    #
    # Build GNU sources and install
    #

    control:

      actionsequence = ( methods )

    methods:

      InstallTar(cfengine-2.1.0b7,/local/gnu)

        action=cf.install
        returnvars=null
        returnclasses=null
        server=localhost
We must install the method in the trusted modules directory (normally /var/cfengine/modules or WORKDIR/modules).
    #
    # cf.install
    #

    control:

      MethodName = ( InstallTar )
      MethodParameters = ( filename prefix )

      path = ( /usr/local/gnu/bin )

      TrustedWorkDir = ( /tmp )
      TrustedSources = ( /depot )
      TrustedSourceServer = ( localhost )

      actionsequence = ( copy editfiles shellcommands tidy )

    copy:

      $(TrustedSources)/$(filename).tar.gz

        dest=$(TrustedWorkDir)/$(filename).tar.gz
        server=$(TrustedSourceServer)

    shellcommands:

      "$(path)/tar zxf $(filename).tar.gz"

        chdir=$(TrustedWorkDir)

      "$(TrustedWorkDir)/$(filename)/configure --prefix=$(prefix)"

        chdir=$(TrustedWorkDir)/$(filename)
        define=okay

      okay::

        "$(path)/make"

          chdir=$(TrustedWorkDir)/$(filename)

    tidy:

      $(TrustedWorkDir) pattern=$(filename) r=inf rmdirs=true age=0
The ability to go back to an earlier “release” or state is often referred to as rollback. ITIL calls it remediation. The notion is closely connected with process management, and both ITIL and traditional management techniques value this by default. It is assumed practice.
CFEngine does not encourage rollback however. Why not? Because it requires destructive intervention and cfengine's model is based on on-the-fly change. To go back to a previous state, a system must be stopped, reinitialized (perhaps from backup) and restarted. This requires service to stop and all run-time state is lost.
Cfengine's approach to this would be to revert policy to its previous state. The system would then roll into its desired state (as if going forwards). Nothing would be restored from separate backup media (see figure).
The difference here is the assumption about how and when changes occur. A sequence of step by step transitions sounds innocent, but it is unstable to unexpected changes. ITIL and many other change management models assume that no unauthorized changes occur between releases. If they do, they are handled as incidents. By separating releases from incidental changes, we get led into thinking that we can in fact revert by destructive intervention.
In fact reversion has inevitable consequences. We must make a choice.
In this case, we essentially revert the entire system from a backup of its saved state. (Some virtual machines can save runtime state for resumption, but this can become stale, e.g. for network connections, as it is meaningless to roll back part of a dialogue.) This operation results in catastrophic change.
This is cfengine's default behaviour. To go back to a previous state, simply change the policy back to a previous version. This will not necessarily revert the entire state of the system, but everything that is covered by policy will be reverted.
Some tools allow you to rollback without reverting from backup. CFEngine disallows this on principle as it requires human judgement to perform correctly. It cannot be automated without uncertain results. In fact cfengine retains the necessary information to allow managed changes to be reversed to some extent. The point however is that one can only guarantee the content of managed objects, so simply reversing a change will not necessarily take us back to the same state – so we consider this to be fundamentally too risky.
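The difference between reverting policy and restoring an earlier state can be seen in a toy model. In the Python sketch below, which is purely illustrative, the names `apply_policy`, `web_port` and `sessions` are hypothetical: policy covers one setting, while run-time state outside policy keeps evolving.

```python
def apply_policy(state, policy):
    """Converge only the settings that the policy covers."""
    for key, value in policy.items():
        state[key] = value

policy_v1 = {"web_port": 80}
policy_v2 = {"web_port": 8080}

state = {"web_port": 0, "sessions": 0}
apply_policy(state, policy_v1)
snapshot_v1 = dict(state)       # what the system looked like under release 1

apply_policy(state, policy_v2)
state["sessions"] = 42          # run-time state changes outside policy

apply_policy(state, policy_v1)  # go "back" by reverting the policy
```

After the final step the managed setting is back at its release 1 value, but the system as a whole is not identical to the earlier snapshot, because the unmanaged `sessions` value has moved on. Only what is covered by policy can be guaranteed.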
CFEngine can monitor absolute and relative states of a system. A simple way to measure relative change is to use a database of checksums.
    control:

      ChecksumUpdates = ( true )
      ChecksumPurge = ( true )

    files:

      /my/important/files recurse=inf checksum=md5 owner=root,daemon group=0,1,4
Change monitoring is about detecting when stored data, or other measurable aspects of a computer system change. A change detection system is not normally concerned with the reason for a change, but if you are monitoring for change then we shall take it for granted somehow that you are expecting to find changes that you didn't plan for yourself.
The most important bulk of information on a computer is its filesystem data. Change detection for filesystems uses a technique made famous in the program Tripwire, which collects a “snapshot” of the system in the form of a database of file checksums (cryptographic hashes) and permissions and rechecks the system against this database at regular intervals. Tripwire examines files, and looks for change in their contents or their attributes. This is a very simple (even simplistic) view of change. If a legitimate change is made to the system, such a system responds to this as a potential threat. Databases must then be altered, or rebuilt.
A cryptographic hash (also called a digest) is an algorithm that reads (digests) a file and computes a single number (the hash value) that is based on the contents. If so much as a single bit in the file changes then the value of the hash will change. You can compute hash values manually, for example:
    host$ openssl md5 cfengine-2.2.4a.tar.gz
    MD5(cfengine-2.2.4a.tar.gz)= 6d2b31c4814354c65cbf780522ba6661
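The same computation can be sketched in Python using the standard `hashlib` module. The helper names `md5_hex` and `file_digest` and the sample data are ours, introduced only for illustration. The last lines flip a single bit of the input to show that the digest changes.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


def file_digest(path: str, algorithm: str = "md5") -> str:
    """Digest a file in chunks so that large files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


original = b"root:x:0:0:root:/root:/bin/sh\n"
# Flip the lowest bit of the first byte ('r' becomes 's').
tampered = bytes([original[0] ^ 0x01]) + original[1:]

print(md5_hex(original))
print(md5_hex(tampered))
print(md5_hex(original) != md5_hex(tampered))
```

Calling `file_digest` on a file should give the same hexadecimal string that `openssl md5` prints for it.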
There are several kinds of hash function. The most common ones are MD5 and SHA1. Recently both of the algorithms that create these hashes have been superseded by the newer SHA2. CFEngine supports MD5 and SHA1 and it will support SHA2 as soon as the OpenSSL library supports an interface to the new algorithm.
CFEngine has adopted something like the Tripwire model, but with a few provisoes. Tripwire assumes that all change is unauthorized (it makes an incident out of any observed change). CFEngine cannot reasonably take this viewpoint. CFEngine expects systems to change dynamically, so it allows users to define a policy for what changes are considered to be okay.
Integrity checks on files whose contents are supposed to be static are a good way to detect tampering with the system, from whatever source. Running MD5 or SHA1 checksums of files regularly provides us with a way of determining even the smallest changes to file contents.
To use the checksum based change detection we first ask cfengine to collect MD5 hash data for specified files. Here is an excerpt from a cfengine configuration program that would check the /usr/local filesystem for file changes. Note that it excludes files such as log files that we therefore allow to change (log files are supposed to change):
    files:
       /usr/local
          owner=root,bin,man
          mode=o-w            # check permissions separately
          r=inf
          checksum=best       # switch on change detection
          action=warnall
          ignore=logs
          exclude=*.log

       # repeat for other files or directories

The first time we run this, cfengine collects data and treats all files as “unchanged”. It builds a database of the checksums. The next time the rule is checked, cfagent recomputes the checksums and compares the new values to the “reference” values stored in the database. If no change has occurred, the two should match. If they differ, then the file has changed and a warning is issued.

    cf:nexus: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    cf:nexus: SECURITY ALERT: Checksum (md5) for /etc/passwd changed!
    cf:nexus: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
This message is designed to be visible. If you do not want the embracing rows of `!' characters, then this control directive turns them off:
    control:
       Exclamation = ( off )

The next question to ask is: what happens if the change that was detected is actually okay (which is almost always the case in practice)? If you activate this option:

    control:
       ChecksumUpdates = ( on )

Then, as soon as a change has been detected, the database is updated and the message will not be repeated. If this is set to off, which is the default, then warning messages will be printed each time the rule is checked.
New files are automatically detected, as they are not in the database. If you want to be notified when files are deleted, then set the option
    control:
       ChecksumPurge = ( on )
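The behaviour described above can be summarized in a short Python model. This is a sketch of the logic only, not of cfengine's implementation: the `check` function, its `update` and `purge` flags (standing in for ChecksumUpdates and ChecksumPurge), the message texts and the in-memory dictionaries are all assumptions made for illustration.

```python
import hashlib


def digest(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def check(files, database, update=False, purge=False):
    """Compare current file contents against stored reference digests.

    files    -- mapping of path -> current contents (bytes)
    database -- mapping of path -> reference digest; modified in place
    update   -- if True, accept a detected change as the new reference
    purge    -- if True, report and forget files that have disappeared
    Returns a list of warning strings.
    """
    warnings = []
    for path, data in sorted(files.items()):
        current = digest(data)
        if path not in database:
            database[path] = current      # first sighting: record, no alert
        elif database[path] != current:
            warnings.append(f"ALERT: checksum for {path} changed!")
            if update:
                database[path] = current
    if purge:
        for path in sorted(set(database) - set(files)):
            warnings.append(f"ALERT: {path} no longer exists")
            del database[path]
    return warnings


db = {}
files = {"/etc/passwd": b"root:x:0:0\n", "/etc/hosts": b"127.0.0.1 localhost\n"}
first = check(files, db)                  # builds the database
files["/etc/passwd"] = b"root:x:0:0\nmallory:x:0:0\n"
second = check(files, db)                 # change reported
third = check(files, db)                  # reported again (no update)
fourth = check(files, db, update=True)    # reported, then accepted
fifth = check(files, db, update=True)     # silent from now on
del files["/etc/hosts"]
sixth = check(files, db, purge=True)      # deletion reported
print(first, second, sixth, sep="\n")
```

With `update` left off, the same warning recurs on every check until someone intervenes. With it on, a change is reported once and then becomes the new reference.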
Message digests are supposed to be unbreakable, tamperproof technologies, but of course everything can be broken by a sufficiently determined attacker. Suppose someone wanted to edit a file and alter the cfengine checksum database to cover their tracks. If they had broken into your system, this is potentially easy to do. How can we detect whether this has happened or not?
A simple solution to this problem is to use another checksum-based operation to copy the database to a completely different host. By using a copy operation based on a checksum value, we can also remotely detect a change in the checksum database itself.
Consider the following code:
    # Neighbourhood watch

    control:
       allpeers = ( SelectPartitionNeighbours(/path/hostlist,\#,random,4) )

    copy:
       /var/cfengine/checksum_digests.db
          dest=/safekeep/chkdb_$(this)
          type=checksum
          server=$(allpeers)
          inform=true         # warn of copy
          backup=timestamp
          define=tampering

    alert:
       tampering::
          'Digest tampering detected on a peer'

It works by building a list of neighbours for each host. The function SelectPartitionNeighbours can be used for this. Using a file which contains a list of all hosts running cfengine (e.g. the cfrun.hosts file), we create a list of hosts to copy databases from. Each host in the network therefore takes on the responsibility to watch over its neighbours. The copy rule attempts to copy the database to some file in a safekeeping directory. We label the destination file with $(this), which becomes the name of the server from which the file was collected. Finally, we back up any successful copies using a timestamp to retain a complete record of all changes on the remote host. Each time a change is detected, a copy of the old version will be kept. The rule also contains triggers to issue alerts and warnings, just to make sure the message will be heard.
In theory, all four neighbours should signal this change. If an attacker had detailed knowledge of the system, he or she might be able to subvert one or two of these before the change was detected, but it is unlikely that all four could be covered up. At any rate, this approach maximizes the chances of change detection.
Finally, in order to make this copy, you must, of course, grant access to the database in cfservd.conf.

    # cfservd.conf

    admit:
       any::
          /var/cfengine/checksum_digests.db mydomain.tld

Let us now consider what happens if an attacker changes a file and edits the checksum database. Each of the four hosts that has been designated a neighbour will attempt to update their own copy of the database. If the database has been tampered with, they will detect a change in the md5 checksums of the remote copy versus the original. The file will therefore be copied.
It is not a big problem that others have a copy of your checksum database. They cannot see the contents of your files from this. A possibly greater problem is that this configuration will unleash an avalanche of messages if a change is detected. This makes messages visible at least.
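The neighbourhood watch idea can be modelled in a few lines of Python. This is only a sketch of the reasoning: the `Neighbour` class, the peer names, the integer timestamps and the byte strings standing in for the database file are hypothetical, and no network copy is performed.

```python
import hashlib


def digest(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


class Neighbour:
    """A peer that keeps a safekeeping copy of another host's database."""

    def __init__(self, name):
        self.name = name
        self.copy = None      # last database image fetched from the peer
        self.backups = []     # (timestamp, old image) kept on every change

    def watch(self, remote_db: bytes, now: int) -> bool:
        """Fetch the remote database if its digest differs from our copy.

        Returns True when a change (possible tampering) was detected.
        """
        if self.copy is None:
            self.copy = remote_db             # first copy: nothing to compare
            return False
        if digest(self.copy) == digest(remote_db):
            return False
        self.backups.append((now, self.copy))  # keep the old version
        self.copy = remote_db
        return True


watchers = [Neighbour(f"peer{i}") for i in range(4)]
database = b"/etc/passwd aaaa\n"   # stand-in for the real database file

for w in watchers:
    w.watch(database, now=0)

# An attacker edits a file and rewrites the database to hide the change.
database = b"/etc/passwd bbbb\n"
alerts = [w.name for w in watchers if w.watch(database, now=1)]
print(alerts)
```

All four watchers raise an alert and each retains the previous image, so an attacker would have to subvert every neighbour to hide the edit.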
Release management, as defined by ITIL (section 9 of BS15000-2), is a management function rather than a machine implementation operation. It includes all aspects of designing, planning and scheduling changes, but does not include the implementation.
CFEngine can help with the final stages of software release management, namely deployment of software components and configuration. However, the bulk of this item concerns the human process of decision-making.
CFEngine has features that can be considered in the context of this work, however.
ITIL frequently works with the idea of a baseline state. While cfengine has no problem working with the idea of a baseline configuration, it is designed to exceed this assumption of maintenance from release to release. ITIL does not adequately address the need for on-the-fly maintenance; it only models large-jump changes, not error corrections. CFEngine, on the other hand, makes no distinction between a large and a small change, thus users of cfengine must make a value judgement about the nature of such changes.
CFEngine does not provide specific tools for versioning configuration specifications. It is rather recommended to use a tool such as subversion for this.
Subversion maintains its own revision numbers that are not visible to cfengine however. It is useful to be able to refer to version numbers also in cfengine. From software release 2.2.2 a version string can be added to files as follows:
    control:
       cfinputs_version = ( 1.2.3 )
       Auditing = ( on )
This defines the version number of a set of configuration files which is referred to in auditing and error messages.
When cfengine saves the current version of a file that it is modifying or replacing, by default such files are given a new extension and remain within the same directory in which they were encountered. Alternatively, one can specify a repository directory to which such files can be moved instead. The repository location is specified in the control section:

    control:
       Repository = ( /var/spool/cfengine )
Files moved to the repository are given names reflecting their full path, with slashes replaced by underscore characters. For some, this creates a clearer overview of the changes that have occurred.
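As an illustration of this naming convention, the following Python sketch flattens a path in the way just described. The function `repository_name` is hypothetical, and details such as the leading underscore or any suffix that cfengine may append are assumptions rather than a description of the actual implementation.

```python
def repository_name(repository: str, original_path: str) -> str:
    """Build a flat file name inside the repository from a full path.

    Every slash in the original path is replaced by an underscore, so
    files from different directories cannot collide by base name alone.
    """
    flat = original_path.replace("/", "_")
    return repository.rstrip("/") + "/" + flat


print(repository_name("/var/spool/cfengine", "/etc/inetd.conf"))
```

Under these assumptions, the saved copy of /etc/inetd.conf would appear as /var/spool/cfengine/_etc_inetd.conf.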
The repository is used by disable, editfiles, links, and copy rule types; copy and disable allow you to override repository use or to specify an alternate repository directory via their repository option.
You should never edit the production version of a policy directly, but rather edit a separate development area and publish the changes once tested. The ITIL change management process is applicable to this human change (much more relevant than the machine changes made by cfengine itself).
CFEngine has no meta-access control mechanism which can decide who may write policy rules. To create such a mechanism, there would have to be a monitor which could identify users, and an authority mechanism that would disallow certain users to write rules of certain types about certain objects on certain hosts. Clearly it is possible to create such a system, but it would be technically difficult and very cumbersome to use, and it would add a whole new level of complexity to policy and potential for error to the configuration process.
To keep matters as simple as possible, cfengine avoids this and proposes a different approach. Promise theory allows us to model the security implications of this (see the figure of the bow-tie structure). A simple method of delegating is the following.
A review procedure for policy promises is a good solution if you want to delegate responsibility for different parts of a policy to different sources. Human judgement is irreplaceable, and tools can be added to make conflicts easier to detect.
Promise theory underlines that, if a host or computing device accepts policy from any source, then it is alone and entirely responsible for this decision. The ultimate responsibility for the published version of the policy lies with the vetting agent. This creates a shallow hierarchy, but there is no reason why this formal body could not be comprised of representatives from the multiple teams.
CFEngine records all manner of information about the behaviour of computers during its efforts to keep promises. These data offer the potential of mining for building up a picture of the behaviour of an entire datacentre or organization, perhaps even multiple domains.
Cfengine's environment daemon further collects patterns of environmental influence of hosts in a resource non-intensive manner. These data contain much information to enable capacity planning.
We should add a warning however. Capacity planning requires a considerable amount of data and analysis, as well as a sound and critical judgement of the data. Resource and performance management are such complex issues that no simple recipe or checklist can replace the judgement of an experienced engineer. However, cfengine can supply data to such an engineer.
Performance records (for example, the output of cfshow -p) allow the average throughput of a server to be estimated in terms of time to completion of service. If service times are too long, this is an indication (but not proof) that hardware should be upgraded.
The level of technical understanding to make sound judgements based on these data goes somewhat beyond the scope of this document. This motivates us to create better tools for cfengine that can make these analyses more accessible to users. However, this must be deferred for another occasion.
We have described the basics of cfengine and ITIL and shown a number of areas where the two can be integrated.
So, if ITIL is so great, did we use it to manage the process of writing this document? Authoring a document and authoring a policy have much in common, so let us spend a moment to examine the process of checks and balances that we have used to produce this text.
The answer to our question is both yes and no, and while this might sound rather unhelpful, we suggest that it is in fact a significant answer; indeed it is the right answer in response to any question about best practices because such recipes must always be applied to a specific context.
There are sensible and ridiculous ways to implement a set of recommendations. ITIL users should expect to adapt its generalized ideas to each set of special circumstances. To do this here, we have used the parts of ITIL that make particular sense for authoring, and we have also used cfengine's model of promises or voluntary cooperation to understand how to implement them.
For example, ITIL suggests forming committees for discussing and deciding change. A committee is a cumbersome device when the total number of people involved in the entire process is two. Nevertheless, the role of the committee is relevant (i.e. the promises it makes to bring the process to completion), and this is where promise theory helps us to make sense of the “dumb rules”. We have multiple opinions and multiple pairs of eyes for quality control as well as for inspiration.
Several parts of ITIL are quite relevant to authoring.
What does promise theory say about collaborative authoring?
First of all, it begins by saying that each individual in the process of authoring has independent knowledge and should be represented as a separate agent. It tells us that promises to cooperate will be needed to integrate the information.
However, more than that, promises tell us that each section of the text is an “agent” which can change or behave independently. In other words, we can manage the parts independently, but again we need to promise to coordinate those parts. So promise theory asks us first to identify the agents (the topics in the document) that will be interacting and then find out what promises they need to make to carry out their function.
Because of the individual nature of the parts, we can associate an individual author to each. To bring them together we need a further agent or individual to collate independent ideas and policy sources into a single coherent whole. Thus promise theory shows us a basic “bow-tie” structure for integrating and correlating independent sources and then making the results available to independent users (see figure). This is not the only solution to the problem of vetting that promise theory predicts, but it is the simplest one. Also it is the approach that ITIL approves: making someone responsible for the job.
We emphasize that promise theory does not tell us the specifics of how to implement solutions, it only tells us what elements are needed and how they should interact. So we might implement agents as people, as different computers, or as different user accounts within the same computer. As long as the elements can keep the necessary promises, it does not matter.
So how did we write our document? In fact we did not use a very strict ITIL-like change management process when writing the first versions of our document. Such a process could have strangled our work in the creative stage and doubled the time it took to write. Rather, we worked in an ad hoc way by voluntary cooperation. Each of us promised to write about certain topics and work on the text independently. We worked as autonomous agents, and we used Subversion (a version control and sharing system) to keep the working document. Subversion is itself a third agent which promises to accept changes one at a time from either of the two authors and then make these changes available again to both authors. This agent performs no vetting or control other than ordering the changes.
The authors have to promise to one another to resolve any conflicts or disagreements, but promises do not suggest how this might take place. (ITIL, on the other hand, does offer suggestions for this resolution process).
ITIL seems to work best once a service is up and running, or once a basic version of a document exists. It does not say so much about the creative act, except to think of it as a release.
What ITIL is weak at is parallelization of effort. ITIL's processes are serialized processing models. In our first creative versions, we converged in parallel onto an approximate result, each working separately. This is very efficient but it can lead to duplication of work or inconsistency. Serialization is needed to resolve consistency issues precisely, but it leads to unnecessary waiting in some cases.
Below we indicate a checklist of ITIL compliant steps for using cfengine in a machine life-cycle.
Release:
This section lists some of the many terms from ITIL, especially the ISO/IEC 20000 version of the text, and offers some comments and translations into common cfengine terminology.
Monitoring of a configuration item or IT service that uses automated regular checks to discover the current status.
CFEngine performs programmed checks of all of its promises each time cfagent is started. Cfagent is, in a sense, an active monitor for a set of promises that are described in its configuration file.
The ability of a component or service to perform its required function.
Availability = Hours operational / Agreed service hours
Availability or intermittency in cfengine refers to the responsiveness of hosts in a network when remotely connecting to cfservd.
Intermittency = Successful attempts / Total attempts
A warning that a threshold has been reached, something has changed or a failure has occurred.
A cfengine alert fits this description quite well. Most alerts are user-defined, but a few are side effects of certain configuration rules.
A formal inspection and verification to check whether a standard or set of guidelines is being followed.
Cfengine's notion of an audit is more like the notion from system accounting. However, the data generated by this extra logging information could be collected and used in a more detailed examination of cfengine's operations, suitable for use in a formal inspection (e.g. for compliance).
A snapshot of the state of a service or an individual configuration item at a point in time
In cfengine parlance, we refer to this as an initial state or configuration. In principle a cfengine initial state does not have to be a known baseline, since the changes we make will not generally be relative to an existing configuration. CFEngine encourages users to define the final state (regardless of initial state).
The recorded state of something at a specific point in time.
CFEngine does not use this term in any of its documentation, though our general understanding of a “benchmark” is that of a standardized performance measurement under special conditions. CFEngine regularly records state and performance data in a variety of ways, for example when making file copies.
The ability of someone or something to carry out an activity.
CFEngine does not use this concept specifically. The notion of a capability is terminology used in role-based access control.
A record containing details of which configuration items are affected and how they are affected by an authorized change.
Cfengine's default modus operandi is to not record changes made to a system unless requested by the user. Changes can be written as log entries or audit entries by switching on reporting.
Consider a typical cfengine promise (to ensure that a destination file is a copy of a source). Three levels of change recording can be added in cfengine 2:
    copy:
       /source/file
          dest=/destination/file
          inform=true
          syslog=true
          audit=true
An “inform” promise means that cfagent promises to notify the changes to its standard output (which is usually sent by email or printed on a console output). A “syslog” promise implies that cfagent will log the message to the system log daemon. Both of the foregoing messages give only a simple message of actual changes. An “audit” promise is a promise to record extensive details about the process that cfagent undergoes in its checking of other promises.
An analysis based on the timeline of recorded events (used to help identify possible causes of problems).
A timeline analysis could easily be carried out based on audit information, system logs and cfenvd behavioural records.
A group of configuration items (CI) that work together to deliver an IT service.
A configuration is the current state of resources on a system. This is, in principle, different from the state we would like to achieve, or what has been promised.
A component of an infrastructure which is or will be under the control of configuration management.
A configuration item is any object making a promise in cfengine. We often speak of the promise object, or “promiser”.
Database containing all the relevant details of each configuration item and details of the important relationships between them.
CFEngine has no asset database except for its own list of promises. The only relationships it cares about are those which are explicitly coded as promises. In the future, cfengine 3 is likely to extend the notion of promises to allow more general records of the CMDB kind, but only to the extent that they can be verified autonomically.
Information and its supporting medium.
ITIL originally considered a document to be only a container for information. In version 3 it considers also the medium on which the data are recorded, i.e. both the file and the filesystem on which it resides.
A change that must be introduced as soon as possible – for example to solve a major incident or to implement a critical security patch.
CFEngine has no specific concept for this.
A design flaw or malfunction that causes a failure.
CFEngine often uses the term configuration error to mean a deviation of a configuration from its promised state. The ITIL meaning of the term would translate into “bug in the cfengine software” or “bug in the promised configuration”.
A change of state that has significance for the management of a configuration item or IT service.
The same basic definition applies to cfengine also, but cfengine makes all such events into classes, since its approach to observing the environment is to measure and then classify it into approximate expected states. CFEngine class attributes (usually from cfenvd) may be considered as event notifications as they change.
An event that is generated when a service or device is currently operating abnormally.
A state in which configuration policy is violated (could lead to a warning or an automated correction).
Loss of ability to operate to specification or to deliver the required output.
ITIL's idea of a failure is something that prevents a promise from being kept. Cfengine's autonomy model means that it is unlikely for such a failure to occur, since promises are only allowed to be made about resources for which we have all privileges. Occasionally, environmental issues might interfere and lead to failure.
Any event that is not expected in normal operations and which might cause a degradation of service quality.
Cfengine's philosophy of convergence gives us only one option for interpreting this term, namely as a temporary deviation from promised behaviour. A deviation must be temporary if cfengine is operating continually, since it will repair any problem on its next invocation round. Events which do not impact promises made by cfengine are of no interest to cfengine, since autonomy means it cannot be responsible for anything beyond its own promises.
Repeated observation of a configuration item, IT service or process in order to detect events and ensure that the current status is known.
CFEngine incorporates a number of different kinds of monitoring, including monitoring of kept configuration-promises and passive monitoring of behaviour.
Monitoring of a configuration item or IT service that relies on an alert or notification to discover the current status.
Cfenvd is cfengine's passive monitoring component. It observes system related behaviour and learns about it. It assumes that there is likely to be a weekly periodicity in the data in order to best handle its statistical inference.
Formally documented management expectations and intentions. Policies are used to direct decisions, and to ensure consistent and appropriate development and implementation of processes, standards, roles, activities, IT infrastructures, etc.
Cfengine's configuration policy is an automatable set of promises about the static and runtime state of a computer. Roles are identified by the kinds of behaviour exhibited by resources in a network. We say that a number of resources (hosts or smaller configuration objects) play a specific promised role if they make identical promises. Any resource can play a number of roles. Decisions in cfengine are made entirely on the basis of the result of monitoring a host environment.
Monitoring that looks for patterns of events to predict possible future failures.
All cfengine monitoring is pro-active in the sense that it can lead to automated follow-up actions.
Unknown underlying cause of one or more incidents.
A repeated deviation from policy that suggests a change of policy or specific counter-measures. A promise needs to be reconsidered or new promises are required.
ITIL does not define this term, although promises are deployed in various ways – for instance in terms of cooperation, communication interfaces within or between processes or contractual relationships as defined by Service Level Agreements, Operational Level Agreements and Underpinning Contracts.
A promise in cfengine is a single rule in the cfengine language. The promiser is the resource whose properties are described, and the promisee is implicitly the cfengine monitor.
Monitoring that takes action in response to an event – for example submitting a batch job when the previous job completes, or logging an incident when an error occurs.
The concept of reactive monitoring is unclear because the duration of an event and the speed of a response are undefined. In a sense, all cfengine monitoring is potentially reactive. It is possible to attach actions which keep promises to any observable condition discernible by cfengine's monitor. CFEngine is not usually considered event driven however, since it does not react “as soon as possible” but at programmed intervals.
Information in readable form that is maintained by the service provider about operations.
A log entry or database item.
Returning a Configuration Item or an IT service to a working state. Recovering of an IT service often includes recovering data to a known consistent state.
All cfengine promises refer to the state of a system that is desired. The promises are automatically enforced, hence cfengine recovers a system (in principle) on every invocation. CFEngine always returns to a known state, due to the property of “convergence”. There is no distinction between the concepts of repair, recovery or remediation.
Recovery to a known state after a failed change or release.
All cfengine promises refer to the state of a system that is desired. The promises are automatically enforced, hence cfengine recovers a system (in principle) on every invocation. CFEngine always returns to a known state, due to the property of “convergence”. There is no distinction between the concepts of repair, recovery or remediation.
However, this concept is like the notion of “rollback” which often involves a more significant restoration of a system from backup. This is discussed later.
The replacement or correction of a failed configuration item.
All cfengine promises refer to the state of a system that is desired. The promises are automatically enforced, hence cfengine recovers a system (in principle) on every invocation. CFEngine always returns to a known state, due to the property of “convergence”. There is no distinction between the concepts of repair, recovery or remediation.
A collection of new or changed configuration items that are introduced together.
An instantiation of the entire cfengine system under a specific version of a policy, i.e. a specific set of promises.
A form to be completed requesting the need for change. This is to be followed up.
This has no counterpart in cfengine. It is part of human communication which coordinates autonomous machines. Clearly autonomous computers do not listen to change requests from other computers, but when machines cooperate in clusters or groups they take suggestions from the collaborative process. An RFC in an ITIL sense is part of an organizational process that goes beyond cfengine's level of jurisdiction. This is an example of what ITIL adds to the autonomous cfengine model.
Why not simply abandon autonomy of machines if this seems to interfere with the need for organizational change? There are good reasons why autonomy is the correct model for resources. Autonomy reduces the risk to a resource of attack, mistake and error propagation.
ITIL's processes exist precisely to minimize the risk of negative impact of change, so the goals are entirely compatible. When an organization discusses a change it examines information from possibly several autonomous systems and discusses how they will change their pattern of collaboration. There is no point in this process at which it is necessary for one of the systems to give up its autonomy.
The ability of a configuration item or IT service to resist failure or to recover quickly following a failure.
Cfengine's purpose is to make a system resilient to unpredictable change.
Actions taken to return an IT service to the users after repair and recovery from an incident.
All cfengine promises refer to the state of a system that is desired. The promises are automatically enforced, hence cfengine recovers a system (in principle) on every invocation. CFEngine always returns to a known state, due to the property of “convergence”. There is no distinction between the concepts of repair, recovery or remediation.
However, this concept seems to suggest a more catastrophic failure which often involves a more significant restoration of a system from backup. This is discussed later.
A set of responsibilities, activities and authorities granted to a person or a team. Roles are defined in processes.
A role in cfengine is a class of agents that make the same kind of promise. The type of role played by the class is determined by the nature of the promise they make. For example, a promise to run a web server would naturally lead to the role “web server”.
Interface between users and service provider.
A help desk. This is not formally part of cfengine's tool set.
A written agreement between the service provider and the customer that documents agreed services, levels and penalties for non-compliance.
An agreement assumes a set of promises that propose behaviour and an acceptance of those promises by the client. If we assume that the users are satisfied with our policies, then an SLA can be interpreted as a combination of a configuration policy (configuration service promises), and the cfengine execution schedule.
The management of services.
Same.
An event that is generated when a service or device is approaching its threshold.
A message generated in place of a correction to system state when a deviation from policy is detected. Note that cfengine is not based on fixed thresholds. All “thresholds” for action or warning are defined as a matter of policy.