Pattern Matching and Referencing
One of the strengths of CFEngine 3 is the ability to recognize and exploit patterns. All string patterns in CFEngine 3 are matched using PCRE regular expressions.
CFEngine has the ability to extract back-references from pattern matches. This makes sense in two cases. Back references are fragments of a string that match parenthetic expressions. For instance, suppose we have the string:
Mary had a little lamb ...
and apply the regular expression
"Mary ([^l]+)little (.*)"
The pattern matches the entire string, and it contains two parenthesized
subexpressions, which respectively match the fragments had a
and lamb
...
. The regular expression libraries assign three matches to this result,
labelled 0, 1 and 2.
The zeroth value is the entire string matched by the total expression. The first value is the fragment matched by the first parenthesis, and so on.
Each time CFEngine matches a string, these values are assigned to a special
variable context $(match.n)
. The fragments can be referred to in the remainder
of the promise. There are two places where this makes sense. One is in pattern
replacement during file editing, and the other is in searching for files.
Consider the examples below:
bundle agent testbundle
{
files:
# This might be a dangerous pattern - see explanation in the next section
# on "Runaway change warning"
"/home/mark/tmp/cf([23])?_(.*)"
edit_line => myedit("second backref: $(match.2)");
}
There are other filenames that could match this pattern, but if, for example,
there were to exist a file /home/mark/tmp/cf3_test
, then we would have:
‘$(match.0)’
equal to `/home/mark/tmp/cf3_test'
‘$(match.1)’
equal to `3'
‘$(match.2)’
equal to `test'
Note that because the pattern allows for an optional '2' or '3' to follow the
letters cf
, it is possible that $(match.1)
would contain the empty string.
For example, if there was a file named /home/mark/tmp/cf_widgets
, then we
would have
‘$(match.0)’
equal to `/home/mark/tmp/cf_widgets'
‘$(match.1)’
equal to `'
‘$(match.2)’
equal to `widgets'
Now look at the edit bundle. This takes a parameter (which is the
back-reference from the filename match), but it also uses back references to
replace shell comment lines with C comment lines (the same approach is used to
hash-comment lines in files). The back-reference variables $(match.n)
refer
to the most recent pattern match, and so in the C_comment
body, they do not
refer to the filename components, but instead to the hash-commented line in
the replace_patterns
promise.
bundle edit_line myedit(parameter)
{
vars:
"edit_variable" string => "private edit variable is $(parameter)";
insert_lines:
"$(edit_variable)";
replace_patterns:
# replace shell comments with C comments
"#(.*)"
replace_with => C_comment,
select_region => MySection("New section");
}
########################################
# Bodies
########################################
body replace_with C_comment
{
replace_value => "/* $(match.1) */"; # backreference from replace_patterns
occurrences => "all"; # first, last, or all
}
########################################################
body select_region MySection(x)
{
select_start => "\[$(x)\]";
select_end => "\[.*\]";
}
Try this example on the file
[First section]
one
two
three
[New section]
four
#five
six
[final]
seven
eleven
The resulting file is edited like this:
[First section]
one
two
three
[New section]
four
/* five */
six
[final]
seven
eleven
private edit variable is second backref: test
Runaway change warning
Be careful when using patterns to search for files that are altered by
CFEngine if you are not using a file repository. Each time CFEngine makes a
change it saves an old file into a copy like cf3_test.cf-before-edit
. These
new files then get matched by the same expression above – because it ends in
the generic.*
), or does not specify a tail for the expression. Thus CFEngine
will happily edit backups of the edit file too, and generate a recursive
process, resulting in something like the following:
cf3_test cf3_test.cf-before-edit
cf3_test~ cf3_test~.cf-before-edit.cf-before-edit
cf3_test~.cf-before-edit cf3_test~.cf-before-edit.cf-before-edit.cf-before-edit
Always try to be as specific as possible when specifying patterns. A lazy approach will often come back to haunt you.
Commenting lines
The following example shows how you would hash-comment lines in a file using CFEngine.
######################################################################
#
# HashCommentLines implemented in CFEngine 3
#
######################################################################
body common control
{
version => "1.2.3";
bundlesequence => { "testbundle" };
}
########################################################
bundle agent testbundle
{
files:
"/home/mark/tmp/comment_test"
create => "true",
edit_line => comment_lines_matching;
}
########################################################
bundle edit_line comment_lines_matching
{
vars:
"regexes" slist => { "one.*", "two.*", "four.*" };
replace_patterns:
"^($(regexes))$"
replace_with => comment("# ");
}
########################################
# Bodies
########################################
body replace_with comment(c)
{
replace_value => "$(c) $(match.1)";
occurrences => "all";
}
Regular expressions in paths
When applying regular expressions in paths, the path will first be split at
the path separators, and each element matched independently. For example, this
makes it possible to write expressions like /home/.*/file
to match a single
file inside a lot of directories — the .*
does not eat the whole string.
Note that whenever regular expressions are used in paths, the /
is always
used as the path separator, even on Windows. However, on Windows, if the
pathname is interpreted literally (no regular expressions), then the backslash
is also recognized as the path separator. This is because the backslash has a
special (and potentially ambiguous) meaning in regular expressions (a \d
means the same as [0-9]
, but on Windows it could also be a path separator
and a directory named d
).
The pathtype
attribute allows you to force a specific behavior when
interpreting pathnames. By default, CFEngine looks at your pathname and makes
an educated guess as to whether your pathname contains a regular expression.
The values literal
and regex
explicitly force CFEngine to interpret the
pathname either one way or another. (see the pathtype
attribute).
body common control
{
bundlesequence => { "wintest" };
}
########################################
bundle agent wintest
{
files:
"c:/tmp/file/f.*" # "best guess" interpretation
delete => nodir;
"c:\tmp\file"
delete => nodir,
pathtype => "literal"; # force literal string interpretation
"C:/windows/tmp/f\d"
delete => nodir,
pathtype => "regex"; # force regular expression interpretation
}
########################################
body delete nodir
{
rmdirs => "false";
}
Note that the path /tmp/gar.*
will only match filenames like /tmp/gar
,
/tmp/garbage
and /tmp/garden
. It will not match filename like
/tmp/gar/baz
(because even though the .*
in a regular expression means
"zero or more of any character", CFEngine restricts that to mean "zero or more
of any character in a path component"). Correspondingly, CFEngine also
restricts where you can use the /
character (you can't use it in a character
class like [^/]
or in a parenthesized or repeated regular expression
component.
This means that regular expressions which include "optional directory
components" won't work. You can't have a files promise to tidy the directory
(/usr)?/tmp
. Instead, you need to be more verbose and specify
/usr/tmp|/tmp
, or even better, think declaratively and create an slist
that contains both the strings /tmp
and /usr/tmp
, and then allow CFEngine
to iterate over the list!
This also means that the path /tmp/.*/something
will match files like
/tmp/abc/something
or /tmp/xyzzy/something
. However, even though the
pattern .*
means "zero or more of any character (except /
)", CFEngine
matches files bounded by directory separators. So even though the pathname
/tmp//something
is technically the same as the pathname /tmp/something
,
the regular expression /tmp/.*/something
will not match on the degenerate
case of /tmp//something
(or /tmp/something
).
Anchored vs. unanchored regular expressions
CFEngine uses the full power of regular expressions, but there are two “flavors” of regex. Because they behave somewhat differently (while still utilizing the same syntax), it is important to know which one is used for a particular component of CFEngine:
An “anchored” regular expression will only successfully match an entire
string, from start to end. An anchored regular expression behaves as if it
starts with ^
and ends with $
, whether you specify them yourself or not.
Furthermore, an anchored regular expression cannot have these automatic
anchors removed.
An “unanchored” regular expression may successfully match anywhere in a
string. An unanchored regex may use anchors (such as ^
, $
, \A
, \Z
,
\b
, etc.) to restrict where in the string it may match. That is, an
unanchored regular expression may be easily converted into a partially- or
fully-anchored regex.
For example, the comment parameter in readstringarray()
is an unanchored regex. If you specify the regular expression as #.*
, then
on any line which contains a pound sign, everything from there until the end
of the line will be removed as a comment. However, if you specify the regular
expression as ^#.*
(note the ^
anchor at the start of the regex), then
only lines which start with a #
will be removed as a comment! If you want to
ignore C-style comment in a multi-line string, then you have to a bit more
clever, and use this regex: (?s)/\*.*?\*/
Conversely, delete_lines
promises use anchored regular expressions to delete
lines. If our promise uses bob:\d*
as a line-matching regex, then only the
second line of this file will be deleted (because only the second line starts
with bob:
and is then followed exclusively by digits, all the way to the end
of the string).
bobs:your:uncle
bob:111770
thingamabob:1234
robert:bob:xyz
i:am:not:bob
If CFEngine expects an unanchored regular expression, then finding every line
that contains the letters bob
is easy. You just use the regex bob
. But if
CFEngine expects an anchored regular expression, then you must use .*bob.*
.
If you want to find every line that has a field which is exactly bob
with no
characters before or after, then it is only a little more complicated if
CFEngine expects an unanchored regex: (^|:)bob(:|$)
. But if CFEngine expects
an anchored regular expression, then it starts getting ugly, and you'd need to
use bob:.*|.*:bob:.*|.*:bob
.
Special topics on Regular Expressions
Regular expressions are a complicated subject, and really are beyond the scope of this document. However, it is worth mentioning a couple of special topics that you might want to know of when using regular expressions.
The first is how to not get a back reference. If you want to have a
parenthesized expression that does not generate a back reference, there is a
special PCRE syntax to use. Instead of using ()
to bracket the piece of a
regular expression, use (?:)
instead. For example, this will match the
filenames foolish, foolishly, bearish, bearishly, garish, and garishly in the
/tmp
directory. The variable $match.0
will contain the full filename, and
$match.1
will either contain the string ly
or the empty string. But the
(?:expression)
which matches foo, bear, or gar does not create a
back-reference:
files:
"/tmp/(?:foo|bear|gar)ish(ly)?"
Note that sometimes multi-line strings are subject to be matched by regular
expressions. CFEngine internally matches all regular expressions using
PCRE_DOTALL option, so .
matches newlines. If you want to match any
character except newline you could use \N
escape sequence.
Another thing you might want to do is ignore capitalization. CFEngine is
case-sensitive (in all things), so the files promise /tmp/foolish
will not
match the files /tmp/Foolish
or /tmp/fOoLish
, etc. There are two ways to
achieve case-insensitivity. The first is to use character classes:
files:
"/tmp/[Ff][Oo][Oo][Ll][Ii][Ss][Hh]"
While this is certainly correct, it can also lead to unreadability. The PCRE patterns in CFEngine have another way of introducing case-insensitivity into a pattern:
files:
"/tmp/(?i:foolish)"
The (?i:)
brackets impose case-insensitive matching on the text that it
surrounds, without creating a sub-expression. You could also write the regular
expression like this (but be aware that the two expressions are different, and
work slightly differently, so check the documentation for the specifics):
files:
"/tmp/(?i)foolish"
The /s
, /m
, and /x
switches from PCRE are also available, but use them
with great care!