Hadoop is a popular MapReduce framework for developing parallel applications in distributed environments. Several advantages of MapReduce, such as ease of programming and the ability to use commodity hardware, make it easier than before to apply soft computing methods in parallel and distributed systems. In this paper, we present the results of an experimental study on running soft computing algorithms using Hadoop. The study shows how a simple genetic algorithm running on Hadoop can be used to produce solutions for high-dimensional optimization problems. In addition, a simple but effective technique that does not need MapReduce chains is proposed.
Hadoop gives a new impulse to parallel and distributed systems, which are among the most interesting study areas in information technology [1,2]. It provides an easy-to-use programming API, which can be used to create simple solutions for many complex problems. Hadoop uses the MapReduce (MR) programming model, the popular Google approach for big data analysis [2,3]. Hadoop can also be used to apply soft computing methods to big data. The so-called Hadoop Ecosystem also provides a comprehensive machine-learning library called Mahout [4].
In this paper we describe our approach to solving a high-dimensional optimization problem by employing a soft computing method on Hadoop. In short, a simple genetic algorithm (GA) that uses very large populations was programmed with the Hadoop MapReduce API and applied to a high-dimensional optimization problem. We first surveyed the literature for different MapReduce-Genetic Algorithm (MRGA) models and then proposed a simple but effective approach. In the experiments, the sphere function, a well-known benchmark function in the literature, was selected (Fig. 1).
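As a concrete reference, the sphere function is simply the sum of the squared decision variables and is minimized at the origin. A minimal Java sketch of such a fitness evaluation is given below; the class and method names are illustrative and are not taken from the original implementation.

```java
/**
 * Sphere benchmark: f(x) = sum_i x_i^2, minimized at x = 0.
 * Illustrative fitness evaluator only; names are hypothetical,
 * not those of the original study's code.
 */
public class SphereFitness {

    /** Returns the sphere function value of a candidate solution. */
    public static double evaluate(double[] chromosome) {
        double sum = 0.0;
        for (double gene : chromosome) {
            sum += gene * gene;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] candidate = {0.5, -1.2, 3.0};
        System.out.println("f(x) = " + evaluate(candidate)); // prints 10.69
    }
}
```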
To date, many different soft computing approaches have been proposed for various high-dimensional problems [5,6,7,8]. These are either novel or hybrid methods and are generally run on a single PC. In this context, parallel and distributed systems can be utilized as an alternative for high-dimensional problems. Among them, two types of approaches generally stand out. The first type performs the calculations in the Map phase; the second uses Map-Map chains or Map-Reduce chains. The first approach is quick because it produces the population and runs the GA entirely in memory, but the population size is limited by the size assigned to a map. The second can produce a large HDFS (Hadoop Distributed File System) [5,6] output, which causes a performance loss, but it is more flexible and enables the use of additional components such as combiners.
The model used in this study (Fig. 2) can be considered a hybrid of the two aforementioned models. Here, however, the initial populations were produced on HDFS in order to evaluate read-write performance and to allow the populations to be reused in different experiments. The GA process is first performed in the map phase. In the first experiments, we tried a standard MapReduce process for each population, but, as can be seen in the next section, successful values were not obtained. We then proposed that a few elite individuals, selected from each final population in the maps with a very low selection rate, be passed to a new population in the reduce phase. Thus a small mixed-elite population was created. Since the number of elite individuals passed on was small, the performance loss was not too high in comparison with the chain model.

In this study we used the open source Apache Hadoop framework, which was developed in accordance with the paper [3] in which the MapReduce model was first described. MapReduce is a programming model developed by Google for analyzing big data on distributed computer clusters. In this model, a two-phase operation called map and reduce is performed on the data. Mappers list the data as key-value pairs, which are distributed in blocks on the distributed file system. Let k1 be the input key, v1 the input value, k2 the output key, v2 the intermediate value and v3 the output value:
Map (k1,v1) → list (k2,v2)
After the intermediate key/value lists are produced, the reducers perform a grouping operation on the key/value pairs: Reduce (k2, list (v2)) → list (v3). In order to run a program written in the MapReduce programming model, an execution environment containing the classes written according to this model is required. In this study we used Hadoop version 2.6.0; it should be noted that the information regarding the Hadoop framework given here might not apply to other versions.
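The following Java sketch illustrates, rather than reproduces, the proposed scheme in Hadoop's MapReduce API: each mapper evolves its portion of the population and emits only a small set of elite individuals under a single key, and the reducer merges them into the mixed-elite population. The helper isElite and the key name are hypothetical placeholders, not names from the original study.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Mapper: runs the GA on the individuals of one input split and emits only the elites. */
class GaEliteMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final Text ELITE_KEY = new Text("elite");

    @Override
    protected void map(LongWritable offset, Text individual, Context context)
            throws IOException, InterruptedException {
        // In the real job the GA evolves a whole sub-population held in memory;
        // here only the output contract is shown: emit elite individuals under one key.
        if (isElite(individual.toString())) {
            context.write(ELITE_KEY, individual);
        }
    }

    /** Hypothetical low-rate elite test (placeholder; a real check would evaluate fitness). */
    private boolean isElite(String encodedIndividual) {
        return true;
    }
}

/** Reducer: gathers the elites from all mappers into one small mixed-elite population. */
class GaEliteReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> elites, Context context)
            throws IOException, InterruptedException {
        // All elites arrive under the single "elite" key and are written out together,
        // forming the mixed-elite population used in the final step.
        for (Text elite : elites) {
            context.write(key, elite);
        }
    }
}
```

Because only the elites cross the map/reduce boundary, the intermediate output stays small, which is what keeps the performance loss lower than in the chain model.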
Another important component for running a MapReduce program is a distributed file system. Computers that can communicate with each other over a network connection are needed to establish a distributed file system, which is necessary for distributing the data onto the machines, processing it, and writing the results. The file system of Hadoop, called the Hadoop Distributed File System (HDFS) [9,10], was developed based on the Google File System (GFS). HDFS allows us to work with big data on distributed clusters.
In the Hadoop system, data is split into blocks. The size of each block is set to 128 MB (64 MB for versions older than 2.2.0) by default [16]. Different block sizes can be configured using the dfs.block.size parameter. These blocks are replicated and then distributed over the computers in the Hadoop cluster. The replication factor can be specified by changing the parameters in the hdfs-site.xml file. Thus, each piece of data has a backup at the block level, and when there is a failure on a machine where some part of the data is stored, the program execution is not interrupted.
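As an illustration, the same parameters that are normally set in hdfs-site.xml can also be overridden from Java through Hadoop's Configuration object; the values below are arbitrary examples, not the settings used in the experiments.

```java
import org.apache.hadoop.conf.Configuration;

public class HdfsBlockConfigExample {
    public static void main(String[] args) {
        // Example only: override block size and replication factor for jobs
        // that use this Configuration (values are arbitrary).
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 64L * 1024 * 1024); // 64 MB blocks
        conf.setInt("dfs.replication", 3);                 // keep 3 copies of each block
        System.out.println("block size = " + conf.get("dfs.block.size")
                + ", replication = " + conf.get("dfs.replication"));
    }
}
```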