JDATATRANS for Array Obfuscation in Java Source Code to Defeat Reverse Engineering from Decompiled Codes

Software obfuscation or obscuring a software is an approach to defeat the practice of reverse engineering a software for using its functionality illegally in the development of another software. Java applications are more amenable to reverse engineer…

Authors: Praveen Sivadasan – School of Computer Sciences, Mahatma Gandhi University

Praveen Sivadasan
School of Computer Sciences, Mahatma Gandhi University, Kerala, India
+91-9946063774
praveen_sivadas@yahoo.com

P Sojan Lal
School of Computer Sciences, Mahatma Gandhi University, Kerala, India
sojanlal@gmail.com

Naveen Sivadasan
TCS Innovation Labs Hyderabad, Tata Consultancy Services, Madhapur, Hyderabad, India
s.naveen@atc.tcs.com

ABSTRACT

Software obfuscation, or obscuring software, is an approach to defeat the practice of reverse engineering a program in order to use its functionality illegally in the development of another program. Java applications are more amenable to reverse engineering and re-engineering attacks through methods such as decompilation because Java class files store the program in a semi-compiled form called 'byte codes'. The existing obfuscation systems obfuscate the Java class files. Obfuscated source code produces obfuscated byte code, and hence two-level obfuscation (at the source code and byte code levels) makes the program more resilient to reverse engineering attacks. But source code obfuscation is much more difficult because of the richer set of programming constructs and the scope of the different variables used in the program, and only very little progress has been made on this front. Hence programmers resort to ad hoc manual ways of obscuring their programs, which makes the programs difficult to maintain and use. To address this issue partially, we developed a user-friendly tool, JDATATRANS, to obfuscate Java source code by obscuring the array usages. Using various array restructuring techniques such as 'array splitting', 'array folding' and 'array flattening', in addition to constant hiding, our system obfuscates the input Java source code and produces an obfuscated Java source code that is functionally equivalent to the input program.
We also perform a number of experiments to measure the potency, resilience and cost incurred by our tool.

Categories and Subject Descriptors
[Information Security]: Java Virtual Machine, Platform Independence, Network Mobility, Java Class File

General Terms
Software Security

Keywords
Reverse Engineering, Restructured Arrays, Source Code Obfuscation.

1. Introduction

Java-based web applications gained popularity because of Java's Architecture Neutral Distribution Format (ANDF) [12]. During compilation, the Java source code is translated to Java class files that contain Java Virtual Machine (JVM) code called 'byte code', retaining most or all of the information present in the original source code [5]. This is because the translation to real machine instructions happens in the browser on the user's machine, by the JIT (Just-In-Time) compiler. Also, Java programs are small in size because of the vast functionality provided by the Java standard libraries.

Decompilation is the process of generating source code from machine code or intermediate byte code. JAD, Mocha and Decaf are some of the well-known decompilers [22]. Though decompilation is in general hard for most programming languages, the semi-compiled nature of Java class files makes them more amenable to reverse engineering and re-engineering attacks through decompilation [10, 14, 15, 16]. This makes it easier for competitors to extract the proprietary algorithms and data structures from Java applications and incorporate them into their own programs in order to cut down their development time and cost. Such cases of intellectual property theft [6, 18, 19] are difficult to detect and pursue legally. Recent statistics [7] show that four out of every ten software programs are pirated worldwide, and that over the years global software piracy has increased by over 40% and has caused a loss of more than 11 billion USD [7].
Over the years, a number of software protection methods have been proposed. The remote service based methods provide the maximum protection against piracy because the application resides on a remote server and only the results of the computation are returned to the client application, without exposing the algorithmic details of the server application. But such methods suffer from limited network bandwidth and latency. Even the approach of running only the crucial software components on a remote server suffers from similar drawbacks [5]. The approach of encrypting the executable is effective only if the entire decryption/execution process takes place in hardware [5]. Furthermore, there is a dramatic difference in the cost of encryption and decryption in any public key encryption system [5].

Software obfuscation [6, 9, 13, 17, 21] is a popular approach in which the program is transformed into an obfuscated program using an 'obfuscator' in such a way that the functionality and the input/output behavior are preserved in the obfuscated program, whereas it is much more difficult to reverse engineer the obfuscated program. Though obfuscation is a more economical method for preventing reverse engineering [5], there are 'deobfuscators' [23, 24] available to defeat some of the less sophisticated obfuscation strategies. The popular transformation techniques employed for obfuscation are (i) layout transformation, which makes the structure of the transformed program difficult to comprehend, (ii) data transformation, which obscures the crucial data and data structures, and (iii) control transformation, which obscures the flow of execution [5, 6, 25]. The obfuscation can be performed on the source code [4, 5], the intermediate code or the machine executable code.
The effectiveness of obfuscation is usually measured in terms of a) the potency, that is, the degree to which the reader is confused, b) the resilience, that is, the degree to which attacks on the obfuscation are resisted, and finally c) the cost, which measures the execution time/space penalty suffered by the program due to obfuscation [5, 26].

The existing Java byte code obfuscators are primarily based on lexical transformations, where the class names, variable names and function names are replaced by less comprehensible strings. In source code obfuscators, the commonly applied transformations are (i) replacing symbol names with non-meaningful ones, (ii) substitution of constant values with arithmetic expressions, (iii) removing source code formatting, and (iv) exploiting the preprocessor. We refer the reader to [4, 7, 27, 28, 29] for a survey of the different Java obfuscation tools that are available. Little progress has been made so far in successfully applying more sophisticated obfuscation strategies for either byte code or source code obfuscation. This is primarily because of the difficulty in handling issues such as the scope of the variable names, dynamic binding of variable names to objects, polymorphism etc.

Data transformation and constant hiding are the two well-studied obfuscation techniques. In data transformation, array transformation in particular is popular. Array splitting, array folding and array flattening are the three well-known array transformation methods [20, 30, 31]. As shown in Figure 1, in array splitting, a one-dimensional array 'A', for example, is split into, say, 'k' arrays A1 ... Ak such that array Ai holds the elements of 'A' whose indices are congruent to i modulo k. In array folding, a one-dimensional array 'D' is transformed into a multidimensional array, for example a two-dimensional array 'D1', using a transformation operation which is a bijection between the indices of D and D1.
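As a minimal sketch of the splitting and folding index mappings just described (not the paper's implementation), the Java fragment below splits a one-dimensional array into k parts and folds a one-dimensional array into rows of a fixed width. The class and method names (ArrayRestructuringSketch, split, splitGet, fold, foldGet) and the use of int elements are assumptions made for illustration only.

```java
// Illustrative sketch only; names and element type are hypothetical.
final class ArrayRestructuringSketch {

    // Array splitting: element i of 'a' is stored in part (i % k) at offset (i / k).
    static int[][] split(int[] a, int k) {
        int[][] parts = new int[k][];
        for (int p = 0; p < k; p++) {
            // Part p holds the indices p, p + k, p + 2k, ... that are below a.length.
            parts[p] = new int[(a.length - p + k - 1) / k];
        }
        for (int i = 0; i < a.length; i++) {
            parts[i % k][i / k] = a[i];
        }
        return parts;
    }

    // Reads logical index i back from the split representation.
    static int splitGet(int[][] parts, int i) {
        int k = parts.length;
        return parts[i % k][i / k];
    }

    // Array folding: element i of 'd' is stored at row (i / cols), column (i % cols).
    static int[][] fold(int[] d, int cols) {
        int rows = (d.length + cols - 1) / cols;
        int[][] d1 = new int[rows][cols];
        for (int i = 0; i < d.length; i++) {
            d1[i / cols][i % cols] = d[i];
        }
        return d1;
    }

    // Reads logical index i back from the folded representation (non-empty arrays only).
    static int foldGet(int[][] d1, int i) {
        int cols = d1[0].length;
        return d1[i / cols][i % cols];
    }
}
```

With a ten-element array, split(a, 2) yields two parts of five elements each, and fold(a, 5) places a[5] at row 1, column 0; in both cases the logical index i still reaches the same element through splitGet or foldGet.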
Array flattening, on the other hand, does the reverse: a multidimensional array is transformed into a one-dimensional array using the bijective mapping.

In [7], Ertaul et al. proposed a novel constant hiding technique using y-factors. The y-factors are essentially a predefined increasing sequence of 'm' prime numbers y[0], y[1], y[2], ..., y[m]. The y-factors can be used to transform a non-negative number 'x' which is less than y[0] as follows. Let the function 'F(A, k)' be defined as

    F(A, k) = ((....((A mod y[k]) mod y[k-1]) mod y[k-2]) .... mod y[0]).

Now replace 'x' by the expression F(A, k) such that F(A, k) evaluates to 'x'. To hide any large positive constant, say 'c', in the program, first 'c' is replaced with a simple expression of the form 2*d + r, where 'r' is 0 if 'c' is even and 'r' is 1 if 'c' is odd. The constants 2 and 'r' in the resulting expression can then be hidden by replacing them with the corresponding F( ) function.

Our Contribution

We developed a source code obfuscation tool, JDATATRANS, that obfuscates arrays in Java source code. Our tool has the following two major components.

JDATATRANS-CoBS (Classes for oBfuscating Source codes)
An extensible repository of generic array classes, which we refer to as CoBS (Classes for oBfuscating Source codes). The internal implementation of each array class is highly obfuscated. At present the repository has three separate array implementations using the well-known obfuscation methods: array folding, array flattening and array splitting. The programmer has the choice of using any of these array implementations for each of the crucial arrays in the program, by instantiating the array object to the respective CoBS array class.
JDATATRANS-Obfuscator (for obfuscating the CoBS arrays in the Java program)
We have developed an obfuscator that identifies the usage of the CoBS arrays in the Java program and obfuscates the corresponding statements, hiding the constants and array indices in them using the F( ) functions. This provides an additional level of obfuscation on top of the obfuscation provided by the CoBS implementation.

Our source code obfuscator produces a functionally equivalent Java source program, and additional levels of obfuscation can be obtained by applying any of the existing byte code obfuscators to the target class files. We also perform a number of experiments to evaluate the effectiveness of our obfuscation system. Our experiments reveal that the system is able to make the program sufficiently incomprehensible, even for decompiler-assisted reverse engineering, without much overhead in terms of the increase in code size or execution time.

To the best of our knowledge, there are no existing array obfuscators for either Java source code or byte code. Furthermore, we believe that our approach of developing a library of highly obfuscated data structures as CoBS classes, together with an obfuscator that reads the program and obfuscates the statements where CoBS objects are accessed, is novel and is a significant step towards building high-quality obfuscators for Java applications.

2. Implementation

In this section we give an overview of the various implementation aspects of JDATATRANS. As mentioned in the introduction, the two major components of the JDATATRANS tool are a) the JDATATRANS-CoBS (Classes for oBfuscating Source codes) repository, which contains generic classes for various obfuscated array implementations, and b) the JDATATRANS-Obfuscator, which obfuscates the Java programs that use the CoBS arrays.
Before we discuss the implementation details of these two components, we give an outline of the ConstHide module that hides the constants in the source code. The ConstHide module is used by both CoBS and the Obfuscator. The tool is built using Java 5.0 with user-friendly GUI support based on Swing.

2.1 The ConstHide Module

To compute F( ), we use an array Y[m] of 'm' pairs, where Y[i] = (Pi, Qi) denotes the pair at the i-th index of Y. These pairs have the following properties: a) for any pair Y[i] = (Pi, Qi), Pi + Qi is a prime number, and b) if i < j then Pi + Qi < Pj + Qj. That is, the sum of the numbers in any pair is a prime number, and the pairs are stored in the Y array in increasing order of their sums. The following sequence of pairs, for example, can be the contents of the Y array: (2,3), (5,6), (11,12), (23,24), (47,48), (95,96), (191,192), ….. (12287,12288). The following is the algorithm to compute the F( ) function.

    int F(A, k)
    {
        // k is a number between 1 and m which
        // denotes the depth of the obfuscation.
        Y[m] = {(P1,Q1), (P2,Q2), ..., (Pm,Qm)}
        r = A;
        for (i : k ... 1) { r = r mod (Pi + Qi); }
        return r;
    }

The input constant is transformed into a corresponding F( ) expression using the following Hide( ) function.

    String Hide(c)
    {
        // This function returns an expression
        // of the form F(A, k) which evaluates to c
        Let c = 2d + r;   // Note that r = 0 if c is even, else r = 1
        // Now we will hide the first integer
        // 2 in the above expression
        Choose k randomly from {1 ... m}
        Let A be such that 2 = F(A, k)
        Choose two numbers B and C such that A = B mod C
        return the expression F(B mod C, k)*d + r;
    }

ConstHide would, for example, hide the constant '2' by replacing it with any of the following expressions: F(41%23,2), F(374%191,5), F(757%383,6). Though most compilers simplify expressions of the form 374%191, we still use these expressions to ensure that the source code itself is difficult to comprehend.
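The F( ) and Hide( ) pseudocode above can be sketched in Java roughly as follows. This is an illustrative sketch under stated assumptions, not the tool's source: the class name ConstHideSketch, the helper preimageOfTwo and the choice of passing k in as a parameter (rather than drawing it at random) are hypothetical, and A, B and C are chosen by following the pattern of the examples above (B = A + C, with C the sum of the next pair). The pair sums elided by the "….." in the Y array are filled in from the moduli that appear in the paper's own examples (383, 767, 1535, 3071, 6143). Note that some of the listed sums, such as 95 = 5 × 19, are composite, so the stated primality property does not hold for every pair; the sketch uses the sums exactly as they appear.

```java
// Illustrative sketch only; names and the selection of A, B, C are assumptions.
final class ConstHideSketch {

    // SUMS[i] = P_i + Q_i for the i-th Y pair; index 0 is unused padding.
    static final int[] SUMS = {
        0, 5, 11, 23, 47, 95, 191, 383, 767, 1535, 3071, 6143, 12287, 24575
    };

    // F(A, k): reduce A modulo the k-th sum, then the (k-1)-th, down to the first.
    static int F(int a, int k) {
        int r = a;
        for (int i = k; i >= 1; i--) {
            r = r % SUMS[i];
        }
        return r;
    }

    // Builds a value A with F(A, k) == 2 by adding the first k sums to 2.
    // This works for this table because each sum exceeds 2 plus all earlier sums.
    static int preimageOfTwo(int k) {
        int a = 2;
        for (int i = 1; i <= k; i++) {
            a += SUMS[i];
        }
        return a;
    }

    // Returns an expression string of the form F(B%C,k)*d+r that evaluates to c.
    static String hide(int c, int k) {
        if (c < 0 || k < 1 || k > SUMS.length - 2) {
            throw new IllegalArgumentException("need c >= 0 and 1 <= k <= " + (SUMS.length - 2));
        }
        int d = c / 2;               // c = 2*d + r
        int r = c % 2;               // r is 0 for even c, 1 for odd c
        int a = preimageOfTwo(k);    // F(a, k) == 2
        int modulus = SUMS[k + 1];   // assumption: use the next pair sum as C
        int b = a + modulus;         // b % modulus == a because a < modulus
        return "F(" + b + "%" + modulus + "," + k + ")*" + d + "+" + r;
    }
}
```

For instance, hide(7, 5) returns the string F(374%191,5)*3+1, which reuses one of the expressions for '2' listed above and evaluates to 2*3 + 1 = 7.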
2.2 JDATATRANS-CoBS

The repository at present holds the generic class implementations for split arrays, folded arrays and flattened arrays. The following shows the public methods for split arrays.

    public class SplitArray<E>
    {
        public SplitArray(int size);
        public void setArray(int pos, E elem);
        public E getArray(int pos);
        public int lengthArray();
    }

The methods 'setArray( )', 'getArray( )' and 'lengthArray( )' are used to store an element at a given index, to retrieve the element stored at a given index and to get the length of the array, respectively. Implementation of these three methods is mandatory for all the array classes in the CoBS repository. If the programmer wishes to use an obfuscated array for one of the crucial arrays, say 'X', then he/she first needs to decide which obfuscation technique to use (say the split array mechanism is chosen) and then simply needs to include the following array declaration statement in the program for 'X' (assume that X is an array of type Integer and size 1000).

    SplitArray<Integer> X = new SplitArray<Integer>(1000);

When the programmer imports the CoBS package, the CoBS arrays initially have only dummy (stub) implementations for all the public methods. This is done for two reasons. First, it ensures that the program that uses the CoBS arrays compiles. Second, if the CoBS repository holds multiple implementations of SplitArray itself (with differing internal implementations), the CoBS handler, in the next phase, replaces the dummy implementation with one chosen at random from the available implementations. This ensures an additional level of obfuscation.

    Array splitting: One-dimensional array A is split into A1 and A2.
        (1) int A[10];         (1) int A1[5], A2[5];
        (2) A[i] = …;          (2) if ((i%2) == 0) A1[i/2] = …; else A2[i/2] = …;

    Array folding: One-dimensional array D is folded into a two-dimensional array D1.
        (1) int D[10];         (1) int D1[2][5];
        (2) D[0] = …;          (2) D1[0][0] = …;
        (3) D[5] = …;          (3) D1[1][0] = …;
        (4) D[i] = …;          (4) D1[(i-(i%5))/5][i%5] = …;

    Array flattening: Two-dimensional array E is flattened into a one-dimensional array E1.
        (1) int E[3][3];       (1) int E1[9];
        (2) E[i][j] = …;       (2) E1[3*i+j] = …;

Figure 1. The array restructuring techniques: array splitting, array folding and array flattening.

The following is a sample obfuscated implementation of getArray( ), setArray( ) and lengthArray( ) for SplitArrays.

    public class SplitArray<E> extends obfuscate
    {
        // Even positions are stored in iObj1, odd positions in iObj2.
        E[] iObj1; E[] iObj2;

        public SplitArray(int size)
        {
            // Every F(...) expression in this class evaluates to 2.
            if ((size % F(41%23,2)) == 0) {
                iObj1 = (E[]) new Object[(int)(size / F(1524%767,7))];
                iObj2 = (E[]) new Object[(int)(size / F(88%47,3))];
            } else {
                int temp = (int)(size/2) + 1;
                iObj1 = (E[]) new Object[temp];
                iObj2 = (E[]) new Object[size - temp];
            }
        }

        public void setArray(int pos, E elem)
        {
            if ((pos % F(183%95,4)) == 0)
                iObj1[(int) pos / F(374%191,5)] = elem;
            else
                iObj2[(int) pos / F(757%383,6)] = elem;
        }

        public E getArray(int pos)
        {
            if ((pos % F(1524%767,7)) == 0)
                return (iObj1[(int) pos / F(3059%1535,8)]);
            else
                return (iObj2[(int) pos / F(6130%3071,9)]);
        }

        public int lengthArray()
        {
            return (iObj1.length + iObj2.length);
        }
    }

Note that all the constants are hidden using the F( ) expressions returned by the ConstHide module. The internal implementation can be obfuscated to any level of sophistication. As mentioned earlier, the system supports multiple implementations of split arrays with varying levels of obfuscation, and the CoBS handler makes the choice when the programmer uses the CoBS classes.

2.3 JDATATRANS-Obfuscator

The JDATATRANS-Obfuscator scans the program and identifies those statements, called candidate statements, in which either a CoBS-based array is declared or an instance of the array is accessed using any of its public methods. To do this, the obfuscator first preprocesses the input program.
In the preprocessing phase, the statement boundaries are detected, the comments are stripped off and the tokens in each statement are identified. Then, in each candidate statement, the obfuscator identifies the constants used, including the array indices. If no constants are used, the array index variable is multiplied by '1' (which clearly does not alter its value) and the constant '1' is hidden using ConstHide. If only the lengthArray( ) function is invoked, it is similarly replaced by lengthArray( ) * 1, where the '1' is later hidden by ConstHide. For non-candidate statements, the first integer constant is hidden, avoiding alphanumeric strings. We remark that the resulting program can be re-obfuscated by the obfuscator to obtain further levels of obfuscation.

To illustrate the method, consider the following snippet from the program 'myprog.java', which the programmer wishes to obfuscate. The programmer decides to obfuscate the array 'ar' using SplitArray.

    SplitArray ar = new SplitArray(100000);
    ar.setArray(i, (3*i + 1000) % n);
    y = ar.getArray(i);

After obfuscation, it is transformed into the following code.

    SplitArray ar = new SplitArray(50000*F(49135%24575,12));
    ar.setArray(i*(4*F(3059%1535,8)-(F(49135%24575,12)*F(35%27,2)+F(33%21,2))), (3*i + 1000) % n);
    y = ar.getArray(i*(F(35%27,2)-F(12273%6143,10)));

After one more iteration of obfuscation, it is further transformed into the following code.

    SplitArray ar = new SplitArray(50000*F((F(49135%24575,12)*24567+F(33%21,2))%24575,12));
    ar.setArray(i*(4*F((F(49135%24575,12)*1529+F(33%21,2))%1535,8)-(F(49135%24575,12)*F(35%27,2)+F(33%21,2))), (3*i + 1000) % n);
    y = ar.getArray(i*(F((F(49135%24575,12)*17+F(33%21,2))%27,2)-F(12273%6143,10)));

But this creates additional overhead in execution time, as the number of F( ) expressions that need to be computed at run time increases with each iteration of the obfuscation.
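The hidden constants in the example above can be checked by direct evaluation. The sketch below assumes the same pair sums as the Y array of Section 2.1 (with the elided entries filled in from the moduli used in the examples) and uses hypothetical names (HiddenIndexCheck, arraySize, setIndex, getIndex, arraySizeSecondPass). The subtraction between 4*F(3059%1535,8) and the parenthesised term in setIndex is an assumption: it mirrors the form of the getArray expression and makes the index multiplier equal to 1.

```java
// Illustrative check only; class and method names are hypothetical.
final class HiddenIndexCheck {

    // SUMS[i] = P_i + Q_i for the i-th Y pair; index 0 is unused padding.
    static final int[] SUMS = {
        0, 5, 11, 23, 47, 95, 191, 383, 767, 1535, 3071, 6143, 12287, 24575
    };

    // F(A, k) as defined in Section 2.1.
    static int F(int a, int k) {
        int r = a;
        for (int i = k; i >= 1; i--) {
            r = r % SUMS[i];
        }
        return r;
    }

    // Array size after the first obfuscation pass.
    static int arraySize() {
        return 50000 * F(49135 % 24575, 12);
    }

    // Index passed to setArray after the first pass (subtraction is assumed).
    static int setIndex(int i) {
        return i * (4 * F(3059 % 1535, 8)
                - (F(49135 % 24575, 12) * F(35 % 27, 2) + F(33 % 21, 2)));
    }

    // Index passed to getArray after the first pass.
    static int getIndex(int i) {
        return i * (F(35 % 27, 2) - F(12273 % 6143, 10));
    }

    // Array size after the second obfuscation pass.
    static int arraySizeSecondPass() {
        return 50000 * F((F(49135 % 24575, 12) * 24567 + F(33 % 21, 2)) % 24575, 12);
    }
}
```

Under these assumptions, arraySize() and arraySizeSecondPass() both evaluate to 100000, and setIndex(i) and getIndex(i) both return i, which is what functional equivalence with the original snippet requires.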
Figure 2 shows how the various JDATATRANS components that we discussed so far interact in order to obfuscate the input Java program.

[Flowchart. Boxes: Java Source File, CoBS Repository, CoBS Parser, CoBS Handler, Class Generator, Obfuscator for hiding constants in the candidate/non-candidate statements, Obfuscated Source File. Decision: Obfuscate n times (Yes).]

Figure 2. High-level view of the JDATATRANS components

3. Experimental Results

The following plot shows the tool performance, where the analysis is performed on a sample code 'myprog.java', denoted by A, and its obfuscated versions using SplitArray, FoldedArray and FlattenedArray, denoted by B, C and D respectively. The algorithm section of 'myprog.java' is as follows.

    Set 'n' elements to an array of size 100000
    Print n array elements

The execution time analysis is performed on a system with an Intel Core Duo processor, 1.66 GHz, with 1 GB of RAM.

[Plot: Execution time analysis. X-axis: No. of array elements (10000 to 100000). Y-axis: Execution Time (in Seconds). Series: A, B, C, D.]

The graph shows no major variations in execution time for A, B, C and D for different numbers of array elements.

For 100000 elements, the execution time analysis of A, B, C, D and their obfuscated codes is performed. Let P2, P3, P4, P5 correspond to different obfuscated versions of codes A, B, C, D. For code B (myprog_SplitArray.java), say, the obfuscated versions are B2 (myprog_SplitArray_mod123.java), B3 (myprog_SplitArray_mod123123.java), B4 (myprog_SplitArray_mod123123123.java) and B5 (myprog_SplitArray_mod123123123123.java). The tool output B1 (myprog_SplitArray_mod.java) is a formatted non-obfuscated version of B. Let P, P1, P2, P3, P4, P5 corresponding to codes A, B, C and D be represented on the X-axis and the execution time (in seconds) on the Y-axis.
[Plot: Execution Time Analysis. X-axis: Source codes and Obfuscated codes (P, P1, P2, P3, P4, P5). Y-axis: Execution Time (Seconds). Series: A and B, C and D.]

The above graph shows that there is no considerable variation between the execution times of the original code and the obfuscated codes.

The storage cost of the obfuscated files is measured in terms of file size.

[Plot: Storage Cost Analysis. X-axis: Source codes and Obfuscated codes (P, P1, P2, P3, P4, P5). Y-axis: File Size (in Kilo Bytes). Series: A, B, C, D.]

The graph shows that for code D, the file size grows more for the obfuscated versions.

The execution time is analyzed for the Case 1 program with random indices. The code for the following algorithm is denoted by 'E', and the obfuscated versions using SplitArray, FoldedArray and FlattenedArray are denoted by F, G and H respectively.

    For n > 0, set n different elements to array 'A' of size 100000
    for (i : 0 .... n-1) {
        Generate a random number, say 'num' < n
        Access A[num]
    }

[Plot: Random index access Execution Time Analysis - Case 1. X-axis: No. of array elements (10000 to 100000). Y-axis: Execution Time (Seconds). Series: E, F, G, H.]

The graph shows no major variation in the execution times.

For Case 1, the execution times for 100000 elements are analyzed for the codes E, F, G and H and their obfuscated versions.

[Plot: Execution Time Analysis - For random access of 100000 array elements - Case 1. X-axis: Source codes and Obfuscated codes (P, P2, P3, P4, P5). Y-axis: Execution Time (Seconds). Series: E and F, G and H.]

In the graph, no considerable variation appears in the execution times of E, F and G, H.

Again, the execution time is examined for the following code of Case 2 with random indices, denoted by 'I'.

    Initialize array 'A' of size 100000, to 0
    Read n
    for (i : 0 .... n-1) {
        Generate a random number, say 'num'
