A simple and efficient explicit parallelization of logic programs using low-level threading primitives
Authors: Diptikalyan Saha, Paul Fodor
This is an unpublished work. Please contact the authors for further details and results.

Diptikalyan Saha, IBM Research Labs, Delhi, India, diptsaha@in.ibm.com
Paul Fodor, State University of New York at Stony Brook, USA, pfodor@cs.sunysb.edu

Abstract

In this work, we present an automatic way to parallelize logic programs for finding all the answers to queries using a transformation to low-level threading primitives. Although much work was done on the parallelization of logic programming more than a decade ago (e.g., Aurora, Muse, YapOR), the current state of parallelizing logic programs is still poor. This work presents a way to parallelize tabled logic programs in XSB Prolog under the well-founded semantics. An important contribution of this work lies in merging answer tables from multiple children threads without incurring copying or full sharing and synchronization of data structures. The implementation of the parent-children shared answer tables surpasses in efficiency all the other data structures currently implemented for the completion of answers in multi-threaded parallelization. The transformation and its lower-level answer-merging predicates were implemented as an extension to the XSB system.

Categories and Subject Descriptors: D.3.3 [Programming Languages]: Language Constructs and Features
General Terms: Logic Programming
Keywords: parallelization, threads, tabled logic programming

1. Introduction

In the last few years, many applications have been turning toward logic programming as a better way to represent knowledge and as a logical semantics for huge amounts of data.
Unfortunately, although logic programming works well for medium-size programs, it still lacks the capability to handle huge programs from areas like the semantic web, data mining, program analysis, and security policy frameworks, so most researchers agree that we need to better understand the parallelization of logic programs. We believe that the area of logic programming was better studied more than a decade ago, when special WAM engines [3] were developed to run on multiple processors (see Aurora [4], Muse [1], and YapOR [5]). All these approaches extended the WAM in some way for shared-memory architectures: Aurora kept a trail tree ("derivation tree") and used "binding arrays" in each process to point to nodes in the trail tree, while Muse copied process bindings for each spawned thread. Unfortunately, multi-core processors were extremely expensive at the time, and those approaches were never adopted by large communities, nor did they evolve into full-fledged systems. Lately, there is a higher interest in parallelization on common multi-core processors for answer-set programming. However, the work for well-founded models is still weak, and we are aware of only one deductive database that allows the parallelization of computation: OntoBroker. Its improvements in efficiency are not very high (only about 10% improvement for the parallel computation of the transitive closure of a very sparse random graph on a dual-core processor, compared to a single-process computation), due to its bottom-up computation and lower potential for parallelization. Multi-core is one of the current buzzwords, along with related areas (e.g., cloud computing). It seems that the era of Moore's law is coming to an end, and processor speeds are not going to increase the way they did in the pre-2002 era.
Instead, future performance gains are going to come from exploiting the multiple CPUs available in a single processor through thread-level parallelism. The main question here is the effect of this change in computing paradigm on the logic programming model. The WAM (Warren Abstract Machine) emulator and its tabled extensions, such as the SLG-WAM, were designed primarily for single-threaded execution. The main problem is whether a tabled Prolog program can be compiled to a set of sets of instructions, where each set runs on a single CPU, while still preserving the soundness and completeness of tabled execution. An important question for the programming-language community is how to cope with and exploit the power of parallel execution. In theory, we should not need to write our programs with explicit parallel-programming constructs: either the user, who knows the intention of the program, expresses that intent so that the compiler can generate efficient code that exploits parallelism, or the compiler takes the burden of generating efficient parallel code from any program; after all, the compiler knows more about the system and architecture. We believe this is a new opportunity for the declarative community, where the programmer, by definition, writes down their intent in a declarative language, and the compiler can then take the burden of generating efficient code that exploits the parallelism of the executing machine. Many LP systems support the use of POSIX threads to perform separable computations with a simple and clear API (e.g., Yap, SWI, XSB, and Mercury). Our work describes the parallelization of the computation of tabled predicates using a transformation into a parallel version that distributes the computation over multiple threads. This transformation for multi-core processors can be extended to a multi-computer MapReduce scheme [2] (i.e., "map" to distribute the data and "reduce" to combine the results).
Due to its simple join for the "reduce" operation, such a transformation has very good performance (tested on multiple computers in our network).

2. Parallelization through explicit threading and answer-tries merging

In this section we describe the parallelization of the computation of models for tabled predicates. Our work was inspired by the map-reduce algorithm, which, although not a new paradigm, is a simple and efficient way to parallelize computations. Intuitively, our method can be described as follows: we split the query data and distribute it to multiple computers/threads (in map-reduce terms, the "map" phase), keep the result sets disjoint, and then join and combine the results (the "reduce" phase in map-reduce terms).

The motivating example for this work is the right-recursive transitive closure, and we consider the simplest query to this program: reach(bound,Y), with only the first argument bound. In most cases this is the query most used by programmers. This program is easily transformed into a parallel multi-threaded version, as shown in Program 2 below.

    :- table(reachr/2).
    reachr(X,Y) :- edge(X,Y).
    reachr(X,Y) :- edge(X,Z), reachr(Z,Y).

Program 1: Right recursive transitive closure

    :- table(reachr/2).
    reachr(X,Y) :-
        thread_create((edge1(X,Z1), reachr1(Z1,_),
            merge_ans_tables(reachr(X,_), reachr1(Z1,_))), Id1),
        thread_create((edge2(X,Z2), reachr2(Z2,_),
            merge_ans_tables(reachr1(X,_), reachr2(Z2,_))), Id2),
        ...
        thread_join(Id1),
        thread_join(Id2),
        ...
        reachr(X,Y).  % tabling suspension to collect the results

    :- table reachr1/2.
    reachr1(X,Y) :- edge(X,Y).
    reachr1(X,Y) :- edge(X,Z), check_if_not_tried(reachr(X,Z)), reachr1(Z,Y).
    ...
    :- thread_shared_constant edge/2.
    :- import check_if_not_tried/1, merge_ans_tables/3 from increval.

Program 2: Right recursive transitive closure, parallel transformation

The first step of our algorithm is the initial distribution of tasks to the children threads.
Considering Program 1, the transformation looks for a distribution of the computation in the first step, so it looks for two or more edges starting from the initial bound node in the graph. If no such node is found, the algorithm continues to compute the transitive closure, looking for two or more edges from the same node, before handing those options to the children threads.

The second step is to run the children threads in parallel on the complete data set (i.e., all edges). One essential condition for efficiency is that the computations do not overlap (i.e., the threads do not duplicate their tasks) and that their answer tables have no results in common, while ensuring that if one thread stops, another thread takes up its work. This is done by building the Predicate Dependency Graph (PDG) and adding, in front of each literal L that participates in a cycle in the PDG, a safe version of a semaphore, check_if_not_tried(L), which checks whether the query was already started by another child thread. If it was not started, the query is marked as started and the program proceeds; otherwise the computation of the thread fails and backtracks. As a consequence, the computation of each thread is sound, but its result set is not complete. However, the union of all the answer sets of the children threads is complete.

The third step is the merging of the answers from all the children threads (for completion). After all the children threads have completed their respective answer tables, we merge the answer tables to return the answers in the parent thread's table. This is a very common operation for programs that find all answers to queries or compute various aggregate functions. Using fully shared data structures (i.e., shared tries in XSB) is impractical because it induces a significant overhead for copying results from all threads.
For the transitive closure example in XSB, this copying overhead is over 30% even for a single tabled answer table.

We implemented a method that merges answer tables in constant time. On completion of the query in the parent thread (i.e., after the join with all its children and the end of querying), we remove the trie-based return of the answers. In a built-in call, each thread's answer-table pointer_next is set to point to the head_pointer of the next thread's answer table. Finally, we return the answers through the answer-table pointer when answers are requested in the parent table.

3. Evaluation Results

We evaluated the performance of the parallelization method for transitive closure programs and for real examples from program analysis, using Andersen's points-to analysis, which uses two flavors of transitive closure: left-recursive and right-recursive.

First, we compared the performance of transitive closure for three configurations in XSB: single-process computation (i.e., un-threaded); parallelized multi-threaded computation without lower-level answer-table merging (i.e., using shared predicates and, basically, copying of answers); and parallelized multi-threaded computation with lower-level answer-table merging. On a dual-core Dell Optiplex 755 with 3GB of main memory running Ubuntu 9.10, we computed the transitive closures of complete graphs with sizes between 100 and 10,000 vertexes, obtaining an average improvement of 40.51% in wall time with two threads. For instance, for 10K vertexes, we got 4.89 sec. for the computation on two threads vs. 8.22 sec. for the single-threaded computation.

The points-to analysis experiment shows an even better improvement (see Table 1 for the Vasy59 data set).

                                                            Total time
    Single-threaded computation                             2.28 sec
    2 threads with shared predicates for merging results    1.64 sec
    2 threads with answer-table merging                     1.19 sec
        (Thread1: 0.83, Thread2: 1.05, Merging: 0.14)

Table 1: Execution times for points-to analysis
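The low merging cost reported above comes from the constant-time merge described in Section 2: splicing a child's answer table onto the parent's is a pointer redirect, not a copy. The C sketch below shows the idea on a plain linked list; the structures and names (answer, ans_table, splice_tables) are illustrative stand-ins, not XSB's actual answer-trie machinery.

```c
#include <stddef.h>

/* Illustrative stand-ins for an answer table's return list. */
typedef struct answer {
    const char *term;        /* the answer itself */
    struct answer *next;     /* next answer in this table's return list */
} answer;

typedef struct ans_table {
    answer *head;            /* first answer */
    answer *tail;            /* last answer, kept so splicing is O(1) */
} ans_table;

/* Splice `child` onto `parent` in constant time: no answers are copied;
 * the parent's return list simply continues into the child's list. */
void splice_tables(ans_table *parent, ans_table *child) {
    if (child->head == NULL) return;        /* nothing to merge */
    if (parent->head == NULL)
        parent->head = child->head;
    else
        parent->tail->next = child->head;   /* the O(1) pointer redirect */
    parent->tail = child->tail;
}

/* Walk a table's return list, counting answers (used to check a splice). */
int count_answers(const ans_table *t) {
    int n = 0;
    for (const answer *a = t->head; a != NULL; a = a->next) n++;
    return n;
}
```

After splicing every child, the parent enumerates the union of all answer sets in a single walk, which is why the merge cost is constant per child thread rather than proportional to the number of answers.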
4. Related Work

The work by Marques and Swift [8] is the closest research to our own and is the basis for our implementation in XSB Prolog. However, their work does not actually address the parallelization of tabled computations, but rather the correctness and completeness of SLG resolution for shared tabled predicate calls in multiple threads. Their algorithms deal with isolating the computation space for repetitive computations, and they show how repetitive calls to the same tabled calls are "usurped" (i.e., stopped) until the first thread to call the query completes the computation. Our implementation, in contrast, also deals with the merging of answer tables, which is a way to "promote" answers computed by all children threads, and with adjusting the space-reclamation routines to account for the space that is "promoted" concurrently by all children threads. An alternative to promoting answers is to use private message queues, but we are confident that our method is more efficient due to its constant cost.

[6] and [7] share the same basic ideas with [8], that is, multiple similar tabled calls are computed in a single thread. This again differs from our approach, because we can still distribute different parts of the computation for the specific case of transitive closure. Other approaches to the parallelism of logic programs (such as [1,2,4,5] and the works cited in [3]) are only distantly related to our approach, although their main goal was the same as ours. The main difference is that all those approaches modify the Warren Abstract Machine (copying environments, binding arrays, backtracking, etc.), while we use an explicit transformation with multi-threading.

5. Conclusions and Future Work

We presented a novel approach that transforms logic programs under the well-founded semantics into parallelized versions; it has been implemented in the increval package of the XSB Prolog system (available for download from [10]).
The primary advantages of this approach are its good efficiency and its clear transformation, while preserving soundness and completeness.

References

[1] K. A. M. Ali and R. Karlsson. The Muse approach to or-parallel Prolog. International Journal of Parallel Programming, 1994.
[2] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Sixth Symposium on Operating System Design and Implementation, 2004.
[3] G. Gupta, K. A. M. Ali, M. Carlsson, and M. V. Hermenegildo. Parallel execution of Prolog programs: A survey. ACM Transactions on Programming Languages and Systems, 1995.
[4] E. Lusk, R. Butler, T. Disz, R. Olson, R. Overbeek, R. Stevens, D. H. Warren, A. Calderwood, P. Szeredi, S. Haridi, P. Brand, M. Carlsson, A. Ciepielewski, and B. Hausman. The Aurora or-parallel Prolog system. New Generation Computing, 7(2-3), 243-271, 1990.
[5] R. Rocha, F. Silva, V. S. Costa, and R. R. Fern. YapOr: An or-parallel Prolog system based on environment copying. In Progress in Artificial Intelligence, 1999.
[6] J. Freire, R. Hu, T. Swift, and D. S. Warren. Parallelizing tabled evaluation. In 7th International PLILP Symposium, pages 115-132. Springer-Verlag, 1995.
[7] R. Hu. Efficient Tabled Evaluation of Normal Logic Programs in a Distributed Environment. PhD thesis, Department of Computer Science, State University of New York, Stony Brook, USA, 1997.
[8] R. Marques and T. Swift. Concurrent and local evaluation of normal programs. Theoretical Computer Science, 2008.
[9] R. Rocha, F. Silva, and V. S. Costa. On applying or-parallelism and tabling to logic programs. Theory and Practice of Logic Programming, 4(6), 2004.
[10] mXSB multi-threaded with answer merging and parallelism predicates (increval package): http://www.cs.sunysb.edu/~pfodor/mXSB.tgz