C preprocessor

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by 82.236.187.110 (talk) at 00:23, 7 July 2011 (→‎Token concatenation: identifiers with leading underscores are reserved by the implementation). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

The C preprocessor (cpp) is the preprocessor for the C programming language. In many C implementations, it is a separate program invoked by the compiler as the first part of translation. The preprocessor handles directives for source file inclusion (#include), macro definitions (#define), and conditional inclusion (#if). The language of preprocessor directives is agnostic to the grammar of C, so the C preprocessor can also be used independently to process other types of files.

The transformations it makes on its input form the first four of C's so-called Phases of Translation. Though an implementation may choose to perform some or all phases simultaneously, it must behave as if it performed them one-by-one in order.

Phases

The following are the first four (of eight) phases of translation specified in the C Standard:

  1. Trigraph replacement — The preprocessor replaces trigraph sequences with the characters they represent.
  2. Line splicing — Physical source lines that are continued with escaped newline sequences are spliced to form logical lines.
  3. Tokenization — The preprocessor breaks the result into preprocessing tokens and whitespace. It replaces comments with whitespace.
  4. Macro expansion and directive handling — Preprocessing directive lines, including file inclusion and conditional compilation, are executed. The preprocessor simultaneously expands macros and, in the 1999 version of the C standard, handles _Pragma operators.
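
The effect of phases 2 and 3 can be seen in a small sketch such as the following. The names GREETING, spliced_number and after_comment are made up for illustration; they are not part of any standard header.

```c
/* Phase 2 (line splicing): each backslash-newline pair is deleted, so the
   two physical lines below form one logical #define line. */
#define GREETING "Hello, " \
                 "world"

/* Splicing happens before tokenization, so the digits on either side of
   the splice end up in a single token: 42. */
static int spliced_number = 4\
2;

/* Phase 3: the comment is replaced by one space, so "int" and
   "after_comment" remain two separate tokens. */
static int/* comment */after_comment = 7;
```

The adjacent string literals in GREETING are joined later, in a translation phase that follows preprocessing.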

Including files

The most common use of the preprocessor is to include another file:

#include <stdio.h>

int main (void)
{
    printf("Hello, world!\n");
    return 0;
}

The preprocessor replaces the line #include <stdio.h> with the system header file of that name, which declares the printf() function among other things. More precisely, the entire text of the file 'stdio.h' replaces the #include directive.

This can also be written using double quotes, e.g. #include "stdio.h". If the filename is enclosed within angle brackets, the file is searched for in the standard compiler include paths. If the filename is enclosed within double quotes, the search path is expanded to include the current source directory. C compilers and programming environments all have a facility which allows the programmer to define where include files can be found. This can be introduced through a command line flag, which can be parameterized using a makefile, so that a different set of include files can be swapped in for different operating systems, for instance.

By convention, include files are given a .h extension, and files not included by others are given a .c extension. However, there is no requirement that this be observed. Occasionally you will see files with other extensions included: files with a .def extension may denote files designed to be included multiple times, each time expanding the same repetitive content; #include "icon.xbm" is likely to refer to an XBM image file (which is at the same time a C source file).

#include often compels the use of #include guards or #pragma once to prevent double inclusion.
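
An #include guard might look like the following sketch. To keep it in one file, the contents of a hypothetical header point.h are written out twice, as if the header had been included twice; POINT_H, struct point and point_sum are illustrative names only.

```c
/* First "inclusion" of the hypothetical header point.h:
   POINT_H is not yet defined, so the body is processed. */
#ifndef POINT_H
#define POINT_H
struct point { int x, y; };
#endif

/* Second "inclusion": POINT_H is now defined, so everything up to the
   #endif is skipped and struct point is not defined a second time. */
#ifndef POINT_H
#define POINT_H
struct point { int x, y; };
#endif

static int point_sum(struct point p) { return p.x + p.y; }
```

Without the guard, the second copy of the struct definition would be a redefinition error.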

Conditional compilation

The #if, #ifdef, #ifndef, #else, #elif and #endif directives can be used for conditional compilation.

#ifdef __unix__
# include <unistd.h>
#elif defined _WIN32 // _WIN32 is defined by most compilers available for the Windows operating system (but not by all).
# include <windows.h>
#endif
#if VERBOSE >= 2
  print("trace message");
#endif

Note that comparison operations in #if work only with integer constant expressions; string and floating-point operands are not allowed:

#if VERBOSE == "on" // NOT ALLOWED
  print("trace message");
#endif
#if VERBOSE >= 2.0 // NOT ALLOWED
  print("trace message");
#endif
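
By contrast, integer comparisons are accepted, as in the sketch below. VERBOSE is defined in the file here only to keep the example self-contained (it would more typically come from the command line), and TRACE_LEVEL, NOT_DEFINED_ANYWHERE and UNDEFINED_IS_ZERO are hypothetical names.

```c
#define VERBOSE 2            /* hypothetical setting, e.g. from -DVERBOSE=2 */

#if VERBOSE >= 2             /* integer comparison: allowed */
#define TRACE_LEVEL 2
#elif VERBOSE == 1
#define TRACE_LEVEL 1
#else
#define TRACE_LEVEL 0
#endif

/* An identifier that is not defined as a macro evaluates to 0 inside #if. */
#if NOT_DEFINED_ANYWHERE == 0
#define UNDEFINED_IS_ZERO 1
#else
#define UNDEFINED_IS_ZERO 0
#endif

static int trace_level(void) { return TRACE_LEVEL; }
```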

Most compilers targeting Windows implicitly define _WIN32[1]. This allows code, including preprocessor commands, to compile only when targeting Windows systems. Note that a few compilers define WIN32 instead. For such compilers that do not implicitly define the _WIN32 macro, it can be specified on the compiler's command line, using -D_WIN32.

The example code tests if a macro __unix__ is defined. If it is, the file <unistd.h> is then included. Otherwise, it tests if a macro _WIN32 is defined instead. If it is, the file <windows.h> is then included.

A more complex #if example can combine the defined operator with logical operators, for example:

#if !(defined __LP64__ || defined __LLP64__) || defined _WIN32 && !defined _WIN64
	// we are compiling for a 32-bit system
#else
	// we are compiling for a 64-bit system
#endif

You can also cause compilation to halt by using the #error directive:

#if RUBY_VERSION == 190
# error 1.9.0 not supported
#endif

Macro definition and expansion

There are two types of macros, object-like and function-like. Object-like macros do not take parameters; function-like macros do. The generic syntax for declaring an identifier as a macro of each type is, respectively,

#define <identifier> <replacement token list>
#define <identifier>(<parameter list>) <replacement token list>

Note that the function-like macro declaration must not have any whitespace between the identifier and the first, opening, parenthesis. If whitespace is present, the macro will be interpreted as object-like with everything starting from the first parenthesis added to the token list.

Whenever the identifier appears in the source code it is replaced with the replacement token list, which can be empty. For an identifier declared to be a function-like macro, it is only replaced when the following token is also a left parenthesis that begins the argument list of the macro invocation. The exact procedure followed for expansion of function-like macros with arguments is subtle.
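
The rule that a function-like macro is replaced only when its name is followed by a left parenthesis can be illustrated with a function and a macro that share a name. This is a sketch with made-up names (identify, via_macro, via_function, function_pointer), not a recommended practice.

```c
/* An ordinary function and a function-like macro that share one name.
   The function is defined first, so its own definition is not expanded. */
static int identify(void) { return 1; }   /* the real function */
#define identify() 2                      /* hypothetical shadowing macro */

/* Name followed by '(' : the macro is expanded. */
static int via_macro(void) { return identify(); }

/* Name followed by ')' : no expansion, so the function is called. */
static int via_function(void) { return (identify)(); }

/* Name followed by ';' : no expansion, so this is the function's address. */
static int (*const function_pointer)(void) = identify;
```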

Object-like macros were conventionally used as part of good programming practice to create symbolic names for constants, e.g.

#define PI 3.14159

... instead of hard-coding those numbers throughout one's code. An alternative in both C and C++ is to apply the const qualifier to a global variable.

An example of a function-like macro is:

#define RADTODEG(x) ((x) * 57.29578)

This defines a radians to degrees conversion which can be written subsequently, e.g. RADTODEG(34) or RADTODEG (34). This is expanded in-place, so the caller does not need to litter copies of the multiplication constant all over the code. The macro here is written as all uppercase to emphasize that it is a macro, not a compiled function. Note the second x is enclosed in its own pair of parentheses. This avoids calculations in an undesired order of operations if an expression instead of a single value is passed.

Standard predefined positioning macros

Certain symbols are required to be defined by an implementation during preprocessing. These include __FILE__ and __LINE__, predefined by the preprocessor itself, which expand into the current file and line number. For instance the following:

// debugging macros so we can pin down message origin at a glance
#define WHERESTR  "[file %s, line %d]: "
#define WHEREARG  __FILE__, __LINE__
#define DEBUGPRINT2(...)       fprintf(stderr, __VA_ARGS__)
#define DEBUGPRINT(_fmt, ...)  DEBUGPRINT2(WHERESTR _fmt, WHEREARG, __VA_ARGS__)
//...

  DEBUGPRINT("hey, x=%d\n", x);

prints the value of x, preceded by the file and line number to the error stream, allowing quick access to which line the message was produced on. Note that the WHERESTR argument is concatenated with the string following it.

The first C Standard specified that the macro __STDC__ be defined to 1 if the implementation conforms to the ISO Standard (some compilers define it to 0 in non-conforming modes). A later amendment added the macro __STDC_VERSION__, defined as a numeric literal specifying the version of the Standard supported by the implementation. Standard C++ compilers support the __cplusplus macro. Compilers running in non-standard mode, with extended or reduced language features that may conflict with the standard, might not set these macros, or might define others to signal the differences.

Other Standard macros include __DATE__ and __TIME__, which expand to the date and time of translation respectively.

C99, the second edition of the C Standard, added __func__, which holds the name of the function in whose definition it appears. Because the preprocessor is agnostic to the grammar of C, __func__ cannot be a macro; the compiler itself provides it as a variable local to the function.
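
A minimal sketch of these predefined names in use follows, assuming a C99 or later compiler. The three function names are invented for the example.

```c
#include <string.h>

/* __LINE__ expands to the number of the source line on which it appears. */
static int line_of_this_statement(void) { return __LINE__; }

/* __func__ (C99) behaves like a local array holding the function's name. */
static int func_name_is_correct(void)
{
    return strcmp(__func__, "func_name_is_correct") == 0;
}

/* __FILE__ expands to a string literal, so sizeof can measure it. */
static size_t file_name_length(void) { return sizeof(__FILE__) - 1; }
```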

Precedence

Note that the example macro RADTODEG(x) given above uses seemingly superfluous parentheses both around the argument and around the entire expression. Omitting either of these can lead to unexpected results. For example:

  • Macro defined as
#define RADTODEG(x) (x * 57.29578)

will expand

RADTODEG(a + b)

to

(a + b * 57.29578)
  • Macro defined as
#define RADTODEG(x) (x) * 57.29578

will expand

1 / RADTODEG(a)

to

1 / (a) * 57.29578

neither of which gives the intended result.

However, the fully parenthesized definition

#define RADTODEG(x) ((x) * 57.29578)

expands 1 / RADTODEG(a) to

1 / ((a) * 57.29578)

which gives the correct result.
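
The same two pitfalls can be reproduced with integer arithmetic, where the results are exact and easy to check. The TIMES3 macros below are hypothetical stand-ins for the RADTODEG variants.

```c
#define TIMES3_NO_ARG_PARENS(x)   (x * 3)     /* argument not parenthesized */
#define TIMES3_NO_OUTER_PARENS(x) (x) * 3     /* whole expression not parenthesized */
#define TIMES3(x)                 ((x) * 3)   /* both parenthesized */

/* Expands to (1 + 2 * 3), which is 7 rather than 9. */
static int bad_argument(void) { return TIMES3_NO_ARG_PARENS(1 + 2); }

/* Expands to 12 / (2) * 3, which is 18 rather than 2. */
static int bad_outer(void) { return 12 / TIMES3_NO_OUTER_PARENS(2); }

/* Expands to ((1 + 2) * 3) and 12 / ((2) * 3). */
static int good_argument(void) { return TIMES3(1 + 2); }
static int good_outer(void) { return 12 / TIMES3(2); }
```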

Multiple lines

A macro can be extended over as many lines as required using a backslash escape character at the end of each line. The macro ends after the first line which does not end in a backslash.
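
As an illustration, a macro spread over several lines might be written as in this sketch. CLAMP is a made-up name; like any function-like macro of this kind, it may evaluate its arguments more than once.

```c
/* Every line except the last ends in a backslash, so the four physical
   lines form a single logical #define line. */
#define CLAMP(value, low, high)     \
    ((value) < (low)  ? (low)       \
     : (value) > (high) ? (high)    \
     : (value))

static int clamp_to_range(int v) { return CLAMP(v, 0, 3); }
```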

The extent to which multi-line macros enhance or reduce the size and complexity of the source of a C program, or its readability and maintainability is open to debate (there is no experimental evidence on this issue). Techniques such as X-Macros are occasionally used to address these potential issues.

Multiple evaluation of side effects

Another example of a function-like macro is:

#define MIN(a,b) ((a)>(b)?(b):(a))

Notice the use of the ternary conditional ?: operator. This illustrates one of the dangers of using function-like macros. One of the arguments, a or b, will be evaluated twice when this "function" is called. So, if the expression MIN(++firstnum,secondnum) is evaluated, then firstnum may be incremented twice, not once as would be expected.
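
The double evaluation can be made visible by passing a function call with a side effect instead of an increment. In this sketch the counter evaluations and the helpers next_value and min_via_macro are hypothetical.

```c
#define MIN(a,b) ((a)>(b)?(b):(a))

static int evaluations;                        /* counts calls to next_value() */
static int next_value(void) { return ++evaluations; }

static int min_via_macro(void)
{
    evaluations = 0;
    /* Expands to ((next_value())>(10)?(10):(next_value())):
       one call in the comparison and a second one in the selected branch. */
    return MIN(next_value(), 10);
}
```

Here the macro returns the value from the second call, so the result is 2 and the counter ends at 2, although the source text contains a single call.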

A safer way to achieve the same would be to use a typeof-construct:

#define max(a,b) \
       ({ typeof (a) _a = (a); \
           typeof (b) _b = (b); \
         _a > _b ? _a : _b; })

This will cause the arguments to be evaluated only once, and it will not be type-specific anymore. This construct is not legal ANSI C; both the typeof keyword, and the construct of placing a compound statement within parentheses, are non-standard extensions implemented in the popular GNU C compiler (GCC).

Sometimes, the problem can be avoided with additional parameters like so:

#define max(a, b, type, max) \
  do { \
    type _a = (a); \
    type _b = (b); \
    max = _a > _b ? _a : _b; \
  } while (0)

But this raises the complexity of both the macro and the calling code significantly.

The general problem can also be solved using a static inline function, which is standard since C99 and was available earlier as an extension in compilers such as GCC; it is typically as efficient as a #define. The inline function allows the compiler to check and coerce parameter types. In this particular example this appears to be a disadvantage, since the 'max' macro as shown works equally well with different parameter types, but in general having the type coercion is an advantage.
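
A sketch of the inline-function approach, assuming a C99 or later compiler, is given below. The names min_int, evaluations, next_value and min_via_function are illustrative.

```c
/* Arguments of a function are evaluated exactly once, before the call. */
static inline int min_int(int a, int b) { return a < b ? a : b; }

static int evaluations;                        /* counts calls to next_value() */
static int next_value(void) { return ++evaluations; }

static int min_via_function(void)
{
    evaluations = 0;
    return min_int(next_value(), 10);          /* next_value() runs once */
}
```

The price is that min_int works only for int; a separate function is needed for each argument type.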

Within ANSI C, there is no reliable general solution to the issue of side-effects in macro arguments.

Token concatenation

Token concatenation, also called token pasting, is one of the most subtle, and most easily abused, features of the C macro preprocessor. Two tokens can be 'glued' together using the ## preprocessor operator, producing a single token in the preprocessed code. This can be used to construct elaborate macros which act like a crude version of C++ templates.

For instance:

#define MYCASE(item,id) \
case id: \
  item##_##id = id;\
break

switch(x) {
    MYCASE(widget,23);
}

The line MYCASE(widget,23); gets expanded here into

case 23:
  widget_23 = 23;
break;

(The semicolon following the invocation of MYCASE becomes the semicolon that completes the break statement.)
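
The "crude templates" use of ## can be sketched as a macro that stamps out one function per type. DEFINE_MAX and the generated names max_int and max_double are hypothetical.

```c
/* Pasting "max_" and the type name yields a distinct function name for
   each instantiation. The type must be a single identifier for this to work. */
#define DEFINE_MAX(type) \
    static type max_##type(type a, type b) { return a > b ? a : b; }

DEFINE_MAX(int)      /* defines max_int(int, int) */
DEFINE_MAX(double)   /* defines max_double(double, double) */
```

No semicolon follows each invocation because the expansion is a complete function definition.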

The ## operator is mostly applied to the parameters of function-like macros. An argument that is an operand of ## is not macro-expanded before pasting, so the following somewhat non-intuitive behavior occurs:

enum {
    OlderSmall = 0,
    NewerLarge = 1
};

#define Older Newer
#define Small Large

#define replace_1(Older, Small) Older##Small
#define replace_2(Older, Small) replace_1(Older, Small)

void printout()
{
        // replace_1(Older, Small) becomes OlderSmall (not NewerLarge),
        // despite the #define calls above.
    printf("Check 1: %d\n", replace_1(Older, Small));

        // The parameters to replace_2 are substituted before the call
        // to replace_1, so we get NewerLarge.
    printf("Check 2: %d\n", replace_2(Older, Small));
}

results in

Check 1: 0
Check 2: 1

Semicolons

One stylistic note about the above macro is that the semicolon on the last line of the macro definition is omitted so that the macro looks 'natural' when written. It could be included in the macro definition, but then there would be lines in the code without semicolons at the end which would throw off the casual reader. Worse, the user could be tempted to include semicolons anyway; in most cases this would be harmless (an extra semicolon denotes an empty statement) but it would cause errors in control flow blocks:

#define PRETTY_PRINT(msg) printf(msg);

  if (n < 10)
    PRETTY_PRINT("n is less than 10");
  else
    PRETTY_PRINT("n is at least 10");

This expands to give two statements – the intended printf and an empty statement – in each branch of the if/else construct, which will cause the compiler to give an error message similar to:

error: expected expression before ‘else’

— gcc 4.1.1

Multiple statements

Inconsistent use of multiple-statement macros can result in unintended behaviour. The code

#define CMDS \
   a = b; \
   c = d;

  if (var == 13)
    CMDS;
  else
    return;

will expand to

  if (var == 13)
    a = b;
  c = d;
  ;
  else
    return;

which is a syntax error (the else is lacking a matching if).

The macro can be made safe by replacing the internal semicolon with the comma operator, since two operands connected by a comma form a single expression, and therefore a single statement once the semicolon is added. The comma operator is the lowest precedence operator. In particular, its precedence is lower than the assignment operator's, so that a = b, c = d does not parse as a = (b,c) = d. Therefore,

#define CMDS a = b, c = d

  if (var == 13)
    CMDS;
  else
    return;

will expand to

  if (var == 13)
    a = b, c = d;
  else
    return;

The problem can also be fixed without using the comma operator:

#define CMDS \
  do { \
    a = b; \
    c = d; \
  } while (0)

expands to

  if (var == 13)
    do {
      a = b;
      c = d;
    } while (0);
  else
    return;

The do and while (0) are needed to allow the macro invocation to be followed by a semicolon; if they were omitted the resulting expansion would be

  if (var == 13) {
      a = b;
      c = d;
  }
  ;
  else
    return;

The semicolon in the macro's invocation above becomes an empty statement, causing a syntax error at the else by preventing it matching up with the preceding if.
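
The do/while (0) idiom also works for macros that take arguments and declare temporaries, as in this sketch. SWAP_INT, tmp_ and order_pair are made-up names.

```c
/* The block is one statement that still requires a trailing semicolon,
   so the macro can be used like a function call in an if/else. */
#define SWAP_INT(a, b)      \
    do {                    \
        int tmp_ = (a);     \
        (a) = (b);          \
        (b) = tmp_;         \
    } while (0)

static void order_pair(int *low, int *high)
{
    if (*low > *high)
        SWAP_INT(*low, *high);
    else
        return;
}
```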

A cleaner way, using the non-standard GNU C compiler (GCC) extension that allows a compound statement within parentheses:

#define CMDS \
  ({ \
    a = b; \
    c = d; \
  })

expands to

  if (var == 13)
    ({
      a = b;
      c = d;
    });
  else
    return;

Quoting macro arguments

Although macro expansion does not occur within a quoted string, the text of the macro arguments can be quoted and treated as a string literal by using the "#" operator (also known as the "stringizing operator"). For example, with the macro

#define QUOTEME(x) #x

the code

printf("%s\n", QUOTEME(1+2));

will expand to

printf("%s\n", "1+2");

This capability can be used with automatic string literal concatenation to make debugging macros. For example, the macro in

#define dumpme(x, fmt) printf("%s:%d: %s=" fmt, __FILE__, __LINE__, #x, x)

int some_function(void) {
    int foo = 0;
    /* [a lot of complicated code goes here] */
    dumpme(foo, "%d");
    /* [more complicated code goes here] */
    return foo;
}

would print the name of an expression and its value, along with the file name and the line number.

Indirectly quoting macro arguments

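The "#" operator can also be used indirectly, in order to quote the "value" of a macro instead of the name of that macro. Used directly, it always quotes the name. For example, with the single-level macro:

#define QUOTEME(x) #x

the code

printf("FOO=%s\n", QUOTEME(FOO));

will expand to

printf("FOO=%s\n", "FOO");

whether or not FOO is itself defined as a macro, because the operand of # is not macro-expanded first. Passing the argument through a second macro lets it be expanded before it is quoted.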

One common use for this technique is to convert the __LINE__ macro to a string. E.g.:

#define QUOTEME_(x) #x
#define QUOTEME(x) QUOTEME_(x)

Now

QUOTEME(__LINE__);

is converted to:

"34"

if __LINE__ happens to have the value 34 when QUOTEME() is called. On the other hand, QUOTEME_(__LINE__) will expand to "__LINE__".
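
The difference between the one-level and two-level forms can be checked with an ordinary object-like macro. In this sketch VERSION_MAJOR, STR_, STR and the two accessor functions are hypothetical names.

```c
#define VERSION_MAJOR 3          /* hypothetical macro to be quoted */
#define STR_(x) #x               /* quotes the argument as written */
#define STR(x)  STR_(x)          /* expands the argument, then quotes it */

/* The operand of # is not expanded: yields "VERSION_MAJOR". */
static const char *quoted_name(void) { return STR_(VERSION_MAJOR); }

/* The argument is expanded to 3 before STR_ sees it: yields "3". */
static const char *quoted_value(void) { return STR(VERSION_MAJOR); }
```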

Brainteaser

The "#" operator is also used to solve the following preprocessor brainteaser (involving characters, as opposed to strings): Define a macro, CHAR(), which takes a single input character X in the source program text and converts it into the C-language character value of X; that is, such that

printf("%c\n", CHAR(a));
printf("%c\n", CHAR(b));

yields

a
b

Solution:

#define CHAR(X)  #X[0]

Variadic macros

Macros that can take a varying number of arguments (variadic macros) are not allowed in C89, but were introduced by a number of compilers and standardised in C99. Variadic macros are particularly useful when writing wrappers to functions that take a variable number of arguments, such as printf, for example when logging warnings and errors.
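
A C99 variadic wrapper around snprintf might look like the following sketch. FORMAT_WARNING is a made-up name, and the design assumes that the first variadic argument is always a string literal, since it is joined to the prefix by string literal concatenation.

```c
#include <stdio.h>

/* __VA_ARGS__ stands for everything passed in place of the "...".
   The format string must be a literal so it can be glued to the prefix. */
#define FORMAT_WARNING(buf, size, ...) \
    snprintf((buf), (size), "warning: " __VA_ARGS__)

/* Writes "warning: 3 retries" into out and returns the length written. */
static int describe_retries(char *out, size_t size, int retries)
{
    return FORMAT_WARNING(out, size, "%d retries", retries);
}
```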

X-Macros

One little-known usage pattern of the C preprocessor is known as "X-Macros".[2][3][4] An X-Macro is a header file. Commonly these use the extension ".def" instead of the traditional ".h". This file contains a list of similar macro calls, which can be referred to as "component macros". The include file is then referenced repeatedly.
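
The idea can be sketched in a single file by keeping the list in a macro rather than in a separate ".def" file; this is a common variant of the pattern, not the file-based form described above. COLOR_LIST, AS_ENUM, AS_STRING and the colour names are hypothetical.

```c
/* The list is written once; X is the "component macro" applied to each item. */
#define COLOR_LIST(X) \
    X(RED)            \
    X(GREEN)          \
    X(BLUE)

/* First pass: generate enumerators COLOR_RED, COLOR_GREEN, COLOR_BLUE. */
#define AS_ENUM(name) COLOR_##name,
enum color { COLOR_LIST(AS_ENUM) COLOR_COUNT };
#undef AS_ENUM

/* Second pass: generate a matching table of names from the same list. */
#define AS_STRING(name) #name,
static const char *const color_names[] = { COLOR_LIST(AS_STRING) };
#undef AS_STRING
```

Adding a colour to COLOR_LIST updates the enumeration and the name table together.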

Compiler-specific predefined macros

Compiler-specific predefined macros are usually listed in the compiler documentation, although this is often incomplete. The Pre-defined C/C++ Compiler Macros project lists "various pre-defined compiler macros that can be used to identify standards, compilers, operating systems, hardware architectures, and even basic run-time libraries at compile-time".

Some compilers can be made to dump at least some of their useful predefined macros, for example:

GNU C Compiler
gcc -dM -E - < /dev/null
HP-UX ansi C compiler
cc -v fred.c (where fred.c is a simple test file)
SCO OpenServer C compiler
cc -## fred.c (where fred.c is a simple test file)
Sun Studio C/C++ compiler
cc -## fred.c (where fred.c is a simple test file)
IBM AIX XL C/C++ compiler
cc -qshowmacros -E fred.c (where fred.c is a simple test file)

User-defined compilation errors and warnings

The #error directive outputs a message through the error stream.

#error "Gaah!"

This prints "Gaah!" in the compiler's diagnostic output and halts the compilation at that point. This is extremely useful for determining whether a given line is being compiled or not. It is also useful if you have a heavily parameterized body of code and want to make sure a particular #define has been introduced from the makefile, e.g.:

#ifdef _WIN32
    ... /* Windows specific code */
#elif defined __unix__
    ... /* Unix specific code */
#else
    #error "What's your operating system?"
#endif

Although the text following the #error directive does not have to be quoted, it is good practice to do so. Otherwise, there may be problems with apostrophes and other characters that the preprocessor tries to interpret.
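
A typical configuration check built on #error might look like this sketch. BUFFER_SIZE is a hypothetical setting; it is defined in the file only to keep the example self-contained, whereas in practice it would more likely come from the makefile.

```c
#define BUFFER_SIZE 256   /* hypothetical; normally supplied as -DBUFFER_SIZE=256 */

/* Stop the build early, with a clear message, if the setting is missing
   or out of range. Neither #error is reached with the value above. */
#ifndef BUFFER_SIZE
#error "BUFFER_SIZE must be defined"
#elif BUFFER_SIZE < 16
#error "BUFFER_SIZE must be at least 16"
#endif

static char buffer[BUFFER_SIZE];
```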

Compiler-specific preprocessor features

The #pragma directive is a compiler specific directive which compiler vendors may use for their own purposes. For instance, a #pragma is often used to allow suppression of specific error messages, manage heap and stack debugging, etc.

C99 introduced a few standard #pragma directives, taking the form #pragma STDC …, which are used to control the floating-point implementation.

Nonstandard extensions

  • Many implementations do not support trigraphs or do not replace them by default.
  • Many implementations (including, e.g., the C-compilers by GNU, Intel, and IBM) provide a non-standard #warning directive to print out a warning message in the output, but not stop the compilation process. A typical use is to warn about the usage of some old code, which is now deprecated and only included for compatibility reasons, e.g.:
#warning "Do not use ABC, which is deprecated. Use XYZ instead."
  • Some Unix preprocessors traditionally provided "assertions", which have little similarity to assertions used in programming. The syntax is described at GCC Obsolete features.
  • GCC provides #include_next for chaining headers of the same name.
  • Objective-C preprocessors have #import which is like #include, but only includes the file once.

As a general-purpose preprocessor (GPP)

Since the C preprocessor can be invoked independently to process files other than those containing to-be-compiled source code, it can also be used as a "general purpose preprocessor" (GPP) for other types of text processing. One particularly notable example is the now-deprecated imake system; more examples are listed at General purpose preprocessor.

GPP does work acceptably with most assembly languages. GNU mentions assembly as one of the target languages among C, C++ and Objective-C in the documentation of its implementation of the preprocessor. This requires that the assembler syntax not conflict with GPP syntax, which means no lines starting with # and that double quotes, which cpp interprets as string literals and thus ignores, don't have syntactical meaning other than that.

However, since the C preprocessor does not have features of other preprocessors, such as recursive macros, selective expansion according to quoting, string evaluation in conditionals, and Turing completeness, it is very limited in comparison to a more modern, true GPP such as m4. For instance, the inability to define macros using other macros requires code to be broken into more sections than would otherwise be required.

Consider the following constructs, which are popularly known as "define-guards":

#ifndef HAVE_STRF
#define HAVE_STRF 1
#endif

#ifndef HAVE_PUTS
#define HAVE_PUTS 1
#endif

#ifndef HAVE_STDS
#define HAVE_STDS 1
#endif

Seeing that surrounding the #define lines in conditionals is repetitive, one could attempt to overcome that with a macro, yet the C preprocessor is limited in that it cannot define macros that define other macros, which invalidates the code below:

/* Invalid macro definition */
#define def x \
#ifndef x \
#define x \
#endif

The m4 macro processor has fewer restrictions:

define(`def',
        `ifdef(`$1',
                `dnl',
                `define($@)')')

With the new macro, the long list of define-guards is significantly shortened:

def(`HAVE_STRF', 1)
def(`HAVE_PUTS', 1)
def(`HAVE_STDS', 1)

References

  1. List of predefined ANSI C and Microsoft C++ implementation macros.
  2. Wirzenius, Lars. "C Preprocessor Trick For Implementing Similar Data Types". Retrieved 9 January 2011.
  3. Meyers, Randy (2001). "The New C: X Macros". Dr. Dobb's Journal. Retrieved 1 May 2008.
  4. Beal, Stephan (2004). "Supermacros". Retrieved 27 October 2008.