Jump to content

Obfuscation (software): Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
copyedit, cap
No edit summary
Line 2: Line 2:
'''Obfuscated code''' is [[source code]] that is very hard to read and understand, often intentionally. Some languages are more prone to obfuscation than others. [[C (programming language)|C]], [[C++]] and [[Perl]] are most often cited as easily obfuscatable languages. Macro preprocessors are often used to create hard-to-read code by masking the standard language syntax and grammar from the main body of code. The term '''''shrouded code''''' has also been used.
'''Obfuscated code''' is [[source code]] that is very hard to read and understand, often intentionally. Some languages are more prone to obfuscation than others. [[C (programming language)|C]], [[C++]] and [[Perl]] are most often cited as easily obfuscatable languages. Macro preprocessors are often used to create hard-to-read code by masking the standard language syntax and grammar from the main body of code. The term '''''shrouded code''''' has also been used.


There are also programs known as obfuscators that may operate on source code, [[object code]], or both, mainly for the purpose of deterring [[reverse engineering]], [[reassembling]], or [[recompiling]].
There are also programs known as obfuscators that may operate on source code, [[object code]], or both, mainly for the purpose of deterring [[reverse engineering]], [[disassembler|disassembly]], or [[decompiler|decompilation]].


Obfuscating code to prevent reverse engineering is typically done to manage risks that stem from unauthorized access to source code. These risks include loss of intellectual property, ease of probing for application vulnerabilities and loss of revenue that can result when applications are reverse engineered, modified to circumvent metering or usage control and then recompiled. Obfuscating code is, therefore, also a compensating control to manage these risks. The risk is greater in computing environments such as Java and Microsoft's .NET which take advantage of [[just-in-time compilation]] technology that allow developers to deploy an application as [[intermediate code]] rather than code which has been compiled into machine language before being deployed.
Obfuscating code to prevent reverse engineering is typically done to manage risks that stem from unauthorized access to source code. These risks include loss of intellectual property, ease of probing for application vulnerabilities and loss of revenue that can result when applications are reverse engineered, modified to circumvent metering or usage control and then recompiled. Obfuscating code is, therefore, also a compensating control to manage these risks. The risk is greater in computing environments such as Java and Microsoft's .NET which take advantage of [[just-in-time compilation]] technology that allow developers to deploy an application as [[intermediate code]] rather than code which has been compiled into machine language before being deployed.

Revision as of 13:35, 19 March 2008

Obfuscated code is source code that is very hard to read and understand, often intentionally. Some languages are more prone to obfuscation than others. C, C++ and Perl are most often cited as easily obfuscatable languages. Macro preprocessors are often used to create hard-to-read code by masking the standard language syntax and grammar from the main body of code. The term shrouded code has also been used.

There are also programs known as obfuscators that may operate on source code, object code, or both, mainly for the purpose of deterring reverse engineering, disassembly, or decompilation.

Obfuscating code to prevent reverse engineering is typically done to manage risks that stem from unauthorized access to source code. These risks include loss of intellectual property, ease of probing for application vulnerabilities and loss of revenue that can result when applications are reverse engineered, modified to circumvent metering or usage control and then recompiled. Obfuscating code is, therefore, also a compensating control to manage these risks. The risk is greater in computing environments such as Java and Microsoft's .NET which take advantage of just-in-time compilation technology that allow developers to deploy an application as intermediate code rather than code which has been compiled into machine language before being deployed.

Obfuscators may be used to compact object code or interpreted code without affecting its behaviour when size is important. Common cases include MIDlets – in which the meaningful identifiers embedded in the Java class files are be replaced with shorter ones – and Javascript code on the web.

Recreational obfuscation

Code is sometimes obfuscated deliberately for recreational purposes. There are programming contests which reward the most creatively obfuscated code: the International Obfuscated C Code Contest, Obfuscated Perl Contest, International Obfuscated Ruby Code Contest, and Obfuscated PostScript Contest.

There are many varieties of interesting obfuscations ranging from simple keyword substitution, use/non-use of whitespace to create artistic effects, clever self-generating or heavily compressed programs, or programs that are valid and operate similarly in multiple programming languages.

Short obfuscated Perl programs printing "Just another Perl hacker" or something similar are often found in signatures of Perl programmers. These are informally known as JAPHs, and the origin of this practice is generally credited to Randal Schwartz.

Examples

This is an infamous example from Internet lore:

#include <stdio.h>
main(t,_,a)char *a;{return!0<t?t<3?main(-79,-13,a+main(-87,1-_,
main(-86,0,a+1)+a)):1,t<_?main(t+1,_,a):3,main(-94,-27+t,a)&&t==2?_<13?
main(2,_+1,"%s %d %d\n"):9:16:t<0?t<-72?main(_,t,
"@n'+,#'/*{}w+/w#cdnr/+,{}r/*de}+,/*{*+,/w{%+,/w#q#n+,/#{l,+,/n{n+,/+#n+,/#\
;#q#n+,/+k#;*+,/'r :'d*'3,}{w+K w'K:'+}e#';dq#'l \
q#'+d'K#!/+k#;q#'r}eKK#}w'r}eKK{nl]'/#;#q#n'){)#}w'){){nl]'/+#n';d}rw' i;# \
){nl]!/n{n#'; r{#w'r nc{nl]'/#{l,+'K {rw' iK{;[{nl]'/w#q#n'wk nw' \
iwk{KK{nl]!/w{%'l##w#' i; :{nl]'/*{q#'ld;r'}{nlwb!/*de}'c \
;;{nl'-{}rw]'/+,}##'*}#nc,',#nw]'/+kd'+e}+;#'rdq#w! nr'/ ') }+}{rl#'{n' ')# \
}'+}##(!!/")
:t<-50?_==*a?putchar(31[a]):main(-65,_,a+1):main((*a=='/')+t,_,a+1)
  :0<t?main(2,2,"%s"):*a=='/'||main(0,main(-61,*a,
"!ek;dc i@bK'(q)-[w]*%n+r3#l,{}:\nuwloca-O;m .vpbks,fxntdCeghiry"),a+1);}

Although unintelligible at first glance, it is a legal C program that when compiled and run will generate the 12 verses of The 12 Days of Christmas. It actually contains all the strings required for the poem in an encoded form inlined in the code. The code iterates through the 12 days displaying what it needs to.

Another example is a program's source listing that was formatted to resemble an empty tic-tac-toe board. Each pass through the program modified the sourcecode to show a turn in the game, to be executed for the next move.

Another example is this short program that generates mazes of arbitrary length:

char*M,A,Z,E=40,J[40],T[40];main(C){for(*J=A=scanf(M="%d",&C);
--            E;             J[              E]             =T
[E   ]=  E)   printf("._");  for(;(A-=Z=!Z)  ||  (printf("\n|"
)    ,   A    =              39              ,C             --
)    ;   Z    ||    printf   (M   ))M[Z]=Z[A-(E   =A[J-Z])&&!C
&    A   ==             T[                                  A]
|6<<27<rand()||!C&!Z?J[T[E]=T[A]]=E,J[T[A]=A-Z]=A,"_.":" |"];}

Note the shape of the corridors in the program. Modern C compilers don't allow constant strings to be overwritten, which can be avoided by changing the first line to

char M[2],A,Z,E=40,J[40],T[40];main(C){for(*J=A=scanf("%d",&C);

Versions of gcc (the GNU Compiler for C) prior to 4.0 can compile the program in its original form if the -fwritable-strings flag is used.

A famous example in Perl is:

@P=split//,".URRUU\c8R";@d=split//,"\nrekcah xinU / lreP rehtona tsuJ";sub p{
@p{"r$p","u$p"}=(P,P);pipe"r$p","u$p";++$p;($q*=2)+=$f=!fork;map{$P=$P[$f^ord
($p{$_})&6];$p{$_}=/ ^$P/ix?$P:close$_}keys%p}p;p;p;p;p;map{$p{$_}=~/^[P.]/&&
close$_}%p;wait until$?;map{/^r/&&<$_>}%p;$_=$d[$q];sleep rand(2)if/\S/;print

This slowly displays the text "Just another Perl / Unix hacker", multiple characters at a time, with delays. An explanation can be found here.

Some Python examples can be found in the official Python programming FAQ.

Another famous C example is:

_(__,___,____){___/__<=1?_(__,___+1,____):!(___%__)?_(__,___+1,0):___%__==___/
__&&!____?(printf("%d\t",___/__),_(__,___+1,0)):___%__>1&&___%__<___/__?_(__,1+
___,____+!(___/__%(___%__))):___<__*__?_(__,___+1,____):0;}main(){_(100,0,0);}

This program prints out the prime numbers less than 100. While this program is extremely difficult to follow, it is the result of several simple transformations from the following elementary program:

void primes(int cap) {
  int i, j, composite;
  for(i = 2; i < cap; i++) {
    composite = 0;
    for(j = 2; j < i; j++) 
      composite += !(i % j);
    if(!composite)
      printf("%d\t", i);
  }
}

int main() { 
  primes(100);
}

The first set of transformations aim to reduce the primes function into a single statement: (1) the meat of the function is rewritten as a single while loop with a sequence of cascading if-else structures inside of it; (2) the while loop is then rewritten as recursion; (3) the cascading if-else statements is rewritten as a single conditional statement. These transformations serve to obfuscate the code considerably. The remaining transformations purposefully hide the details of the function by: (4) removing all intermediate variables; (5) renaming the few variables left to _, ___, etc.; and, finally, (6) removing all whitespace and unnecessary parentheses.

The first transformation takes advantage of the well-known property that any simple function can be transformed into a single while loop followed by a series of cascading if-else statements. The program above is first transformed by rewriting the primes function in this form:

void primes(int cap) { 
  int i, j, composite, t = 0;
  while(t < cap * cap) {
    i = t / cap;
    j = t++ % cap;
    if(i <= 1);
    else if(j == 0)
      composite = 0;
    else if(j == i && !composite)
      printf("%d\t",i);
    else if(j > 1 && j < i)
      composite += !(i % j);  
  }
}

int main() {
  primes(100);
}

Another well-known property is that any while loop can be replaced by a series of recursive calls. The program is further transformed by replacing the while loop with recursion. Note that this requires adding two parameters to the primes function's header. Also note the consolidation of two statements into a list in the third if block. Finally note that one additional if-else block must be added to capture the situation in the non-recursive version where program flow fails all of the if-conditions and the while loop would simply continue with t incremented by one:

void primes(int cap, int t, int composite) {
  int i,j;
  i = t / cap;
  j = t % cap;
  if(i <= 1)
    primes(cap,t+1,composite);
  else if(j == 0)
    primes(cap,t+1,0);
  else if(j == i && !composite)
    (printf("%d\t",i), primes(cap,t+1,0));
  else if(j > 1 && j < i)
    primes(cap,t+1, composite + !(i % j));
  else if(t < cap * cap)
    primes(cap,t+1,composite);
}

int main() {
  primes(100,0,0);
}

The program is next transformed by changing the variable names to single letters and replacing the if-else structure with conditional statements (e.g. if(A) B else if(C) D else E becomes A ? B : C ? D : E):

void primes(int m, int t, int c) {
  int i,j;
  i = t / m;
  j = t % m;
  (i <= 1) ? primes(m,t+1,c) : (j == 0) ? primes(m,t+1,0) : (j == i && !c) ? 
  (printf("%d\t",i), primes(m,t+1,0)) : (j > 1 && j < i) ? 
  primes(m,t+1,c + !(i % j)) : (t < m * m) ? primes(m,t+1,c) : 0;
}

int main() {
  primes(100,0,0);
}

Next the program is transformed by removing the intermediate variables i and j and replacing them with (t / m) and (t % m), respectively:

void primes(int m, int t, int c) {
  ((t / m) <= 1) ? primes(m,t+1,c) : !(t % m) ? primes(m,t+1,0) : 
  ((t % m)==(t / m) && !c) ? (printf("%d\t",(t / m)), primes(m,t+1,0)) : 
  ((t % m)> 1 && (t % m) < (t / m)) ? primes(m,t+1,c + !((t / m) % (t % m))) : 
  (t < m * m) ? primes(m,t+1,c) : 0;
}

int main() {
  primes(100,0,0); 
}

Next the program is transformed by renaming the function primes and the variables m, t, and c to _, __, ___, and ____, respectively:

void _(int __, int ___, int ____) {
  ((___ / __) <= 1) ? _(__,___+1,____) : !(___ % __) ? _(__,___+1,0) : 
  ((___ % __)==(___ / __) && !____) ? (printf("%d\t",(___ / __)), 
  _(__,___+1,0)) : ((___ % __) > 1 && (___ % __) < (___ / __)) ? 
  _(__,___+1,____ + !((___ / __) % (___ % __))) : (___ < __ * __) ? 
  _(__,___+1,____) : 0;
} 

int main() {
  _(100,0,0); 
}

Finally, the program is transformed by removing white space, type declarations, and unambiguous parentheses to come up with the fully obfuscated version:

_(__,___,____){___/__<=1?_(__,___+1,____):!(___%__)?_(__,___+1,0):___%__==___/
__&&!____?(printf("%d\t",___/__),_(__,___+1,0)):___%__>1&&___%__<___/__?_(__,1+
___,____+!(___/__%(___%__))):___<__*__?_(__,___+1,____):0;}main(){_(100,0,0);}

This program runs on most systems. The limiting factor generally will be stack overflow because setting the equivalent of the cap parameter to 100 generally results in 100^2 recursive calls. This sort of obfuscation by program transformation is relatively easy to apply and can be performed on many simple programs.

Obfuscation by code morphing

Obfuscation by code morphing refers to obfuscating machine language or object code rather than obfuscating the source code.

This is achieved by completely replacing a section of the compiled code with an entirely new block that expects the same machine state when it begins execution as the previous section, and will leave with the same machine state after execution as the original. However, a number of additional operations will be completed as well as some operations with an equivalent effect.

Code morphing makes disassembly of a distributed program more difficult. However, by adding unnecessarily complicated operations and hindering compiler-made optimizations, the execution time of the program is increased. For that reason, code morphing should be limited to critical portions of a program and not be used on an entire application.

Code morphing is often used in obfuscating the copy protection or other checks that a program makes to determine whether it is a valid, authentic installation, or a pirated copy.

Code morphing is multilevel technology containing hundreds of unique code transformation patterns, as well as a special layer that transforms some commands into virtual machine commands. Code morphing turns binary code into an undecipherable mess that is not similar to normal compiled code, and completely hides execution logic of the protected code.

Obfuscation in malicious software

Spammers frequently use obfuscated JavaScript or HTML code in spam messages. The obfuscated message, when displayed by an HTML-capable e-mail client, appears as a reasonably normal message—albeit with obnoxious JavaScript behaviors such as spawning pop-up windows. However, when the source is viewed, the obfuscations make it far more difficult for investigators to discern where the links go, or what the JavaScript code does. For this same reason JavaScript obfuscation is also often used by malware authors to conceal parts of code that run browser exploits, or that redirect to pages containing exploits.

Automated JavaScript obfuscators are currently available on the market. Although the effectiveness of these tools may vary, and the channels through which they are exchanged range from the legal to the manifestly malicious, they are all created for the purpose of confounding casual viewers or inexperienced investigators.

The techniques use JavaScript's dynamic nature—a piece of code is stored as an encrypted string, which is decrypted and evaluated. This may be done several times. Other techniques include insertion of dummy code, as well as dummy HTML links to legitimate pages.

Obfuscation for VM migration

Third-party obfuscators are used to recompile old classes to meet new Java Development Kit definitions avoiding ClassFormatError while migrating from Microsoft Virtual Machine to Java Virtual Machine.

Disadvantages of obfuscation

When used alone

No obfuscator known today provides any guarantees on the difficulty of reverse engineering, and this seems to be an inherent issue (see for example, "Can We Obfuscate Programs?" by Boaz Barak). Thus, obfuscators do not provide security of a level similar to modern encryption schemes, and should be used with other measures in tandem, in cases where security is of high importance.

Additionally, the most common software reverse engineering attacks target copy protection schemes. These schemes generally rely heavily on existing operating system procedure calls, making basic code obfuscation easily bypassed using the same tools used with unobfuscated code.

Debugging

Obfuscated code is extremely difficult to debug. Variable names will no longer make sense, and the structure of the code itself will likely be modified beyond recognition. This fact generally forces developers to maintain two builds: One with the original, unobfuscated source code that can be easily debugged, and another for release. While both builds should be tested to make sure they perform identically, the second build is generally reliably constructed from the first by an obfuscator.

This limitation does not apply to intermediate language (e.g., Java, C#) obfuscators, which generally work on compiled assemblies rather than on source code.

Portability

Obfuscated code often depends on the particular characteristics of the platform and compiler, making it difficult to manage if either change.

Conflicts with Reflection APIs

Reflection is a set of APIs in various languages that allow an object to be examined or created just by knowing its classname at run-time. Many obfuscators allow specified classes to be exempt from renaming; and it is also possible to let a class be renamed and call it by its new name. However, the former option places limits on the dynamism of code, while the latter adds a great deal of complexity and inconvenience to the system.

Obfuscating software

A vast variety of tools exists to perform or assist with code obfuscation. These include experimental research tools created by academics, hobbyist tools, commercial products written by professionals, and Open-source software. There are also deobfuscators, which do the reverse.

Software obfuscation tools include specialized obfuscators to demonstrate a relatively limited technique, more general obfuscators which attempt a more thorough obfuscation, and combined-function tools which obfuscate code as part of a larger goal such as software licensing enforcement.

Explanation

Programs written in languages such as C++ or Pascal are compiled into the machine language of a given computer before they become a program. Programmers write "source code" while computers run "machine code" so this conversion is necessary. There is (generally) a one way transformation from source code to machine code. Machine code is not encrypted and is easy for anyone to see, but the format is so tedious for humans that reverse-engineering efforts are slow and painful.

Java and .NET languages (e.g., C#, Visual Basic) take a different approach to compilation. They are far easier to reverse engineer because they do not compile to machine code, they compile into intermediate code.

Microsoft recommends using the Script Encoder to obfuscate the ASP files, so in case the web server is compromised, the cracker would be unable to find out how your ASP applications work. The Script Encoder works also on JScript and VBScript files. Note that the encoded JScript is only functional in Internet Explorer. However, in the documentation, it states that "Note that this encoding only prevents casual viewing of your code; it will not prevent the determined hacker from seeing what you've done and how." It is indeed possible, as shown by the Windows Script Decoder, created on August 1, 2000 by Mr. Brownstone.

See also