Don't Wait to be Breached! Creating Asymmetric Uncertainty of Cloud Applications via Moving Target Defenses
Cloud applications expose - besides service endpoints - also potential or actual vulnerabilities. Therefore, cloud security engineering efforts focus on hardening the fortress walls but seldom assume that attacks may be successful. At least against z…
Authors: Kennedy A. Torkura, Christoph Meinel, Nane Kratzke
(Invited Paper)

Kennedy A. Torkura, Hasso Plattner Institute, University of Potsdam, Germany, kennedy.torkura@hpi.de
Christoph Meinel, Hasso Plattner Institute, University of Potsdam, Germany, christoph.meinel@hpi.de
Nane Kratzke, Lübeck University of Applied Sciences, Lübeck, Germany, nane.kratzke@th-luebeck.de

Abstract: Cloud applications expose, besides service endpoints, also potential or actual vulnerabilities. Therefore, cloud security engineering efforts focus on hardening the fortress walls but seldom assume that attacks may be successful. At least against zero-day exploits, this approach is often toothless. Unlike most security approaches, and comparable to biological systems, we accept that defensive “walls” can be breached at several layers. Instead of hardening the “fortress” walls, we propose to make use of an (additional) active and adaptive defense system to attack potential intruders: an immune system that is inspired by the concept of a moving target defense. This “immune system” works on two layers. On the infrastructure layer, virtual machines are continuously regenerated (cell regeneration) to wipe out even undetected intruders. On the application level, the vertical and horizontal attack surface is continuously modified to circumvent successful replays of formerly scripted attacks. Our evaluations with two common cloud-native reference applications in popular cloud service infrastructures (Amazon Web Services, Google Compute Engine, Azure and OpenStack) show that it is technically possible to limit the time of attackers acting undetected down to minutes. Further, more than 98% of an attack surface can be changed automatically and minimized, which makes it hard for intruders to replay formerly successful scripted attacks. So, even if intruders get a foothold in the system, it is hard for them to maintain it.
Keywords: zero-day; exploit; moving target defense; microservice; cloud-native; application; security; asymmetric

I. INTRODUCTION

This paper extends ideas presented in [1] to improve cloud application security in the context of unknown zero-day exploits and reports on ongoing research in this field. Cloud computing enables a variety of innovative IT-enabled business and service models, and many research studies and programs focus on responsibly developing systems to ensure the security and privacy of users. But compliance with standards, audits, and checklists does not automatically equal security [2], and a fundamental issue remains. Zero-day vulnerabilities are computer-software vulnerabilities that are unknown to those who would be interested in mitigating the vulnerability (including the entity responsible for operating a cloud application). Until a vulnerability is mitigated, hackers can exploit it to adversely affect computer programs, data, additional computers or a network. For zero-day exploits, the probability that vulnerabilities are patched is zero, so the exploit should always succeed. Therefore, zero-day attacks are a severe threat, and we have to draw a scary conclusion: In principle, attackers can establish footholds in our systems whenever they want.

TABLE I. Some popular open source elastic platforms

Platform    Contributors         URL
Kubernetes  Cloud Native Found.  http://kubernetes.io
Swarm       Docker               https://docker.io
Mesos       Apache               http://mesos.apache.org/
Nomad       Hashicorp            https://nomadproject.io/

This contribution deals with the question of how to build “unfair” cloud systems that permanently jangle attackers' nerves. We present the latest results from our ongoing research that applies Moving Target Defense (MTD) principles to the cloud runtime environment and the cloud application layer.
Recent research [3], [4] successfully made use of elastic container platforms (see Table I) and their “designed for failure” capabilities to realize transferability of cloud-native applications at runtime. By transferability, that research means that a cloud-native application can be moved from one IaaS provider infrastructure to another without any downtime. These platforms are more and more used as distributed and elastic runtime environments for cloud-native applications [5] and can be understood as a kind of cloud infrastructure unifying middleware [6]. It should be possible to make use of the same features to immunize cloud applications simply by moving an application within the same provider infrastructure. To move anything from A to A makes no sense at first glance. However, let us be paranoid and aware that with some probability and at a given time, an attacker will be successful and compromise at least one virtual machine [7]. A transfer from A to A would be an effective countermeasure, because the intruder immediately loses any hijacked machine that is moved. To understand that, the reader must know that our approach does not actually move a machine; it regenerates it. To move a machine means to launch a compensating machine unknown to the intruder and to terminate the former (hijacked) machine. Whenever an application is moved, its virtual machines are regenerated. Moreover, this would effectively eliminate undetected hijacked machines.

However, attackers can run automated attacks against regenerated machines that will incorporate the same set of vulnerabilities.

Figure 1. The cyber attack life cycle model. Adapted from the cyber attack lifecycle used by the M-Trends reports, see Table II.
Therefore, this extended paper shows how we can further improve the regenerating security measure by employing MTD at the application layer to change the attack surface of the application itself, so that even automated and formerly successful attack scripts fail (at least partly). Primarily, this is achieved by diversifying the application in a way that its containerized components are dynamically transformed at runtime. The two abstraction layers that compose microservice applications (application layer and the container image layers) are dynamically changed by changing the programming languages of the applications, and consequently, the container images are built to conform to the requirements of the corresponding applications. This combined approach is enforced at runtime to transform the attack surface of cloud-native applications, thereby reducing the possibility of successful attacks.

The remainder of this paper is organized as follows: Section II presents a cyber-attack lifecycle model to show where our approach intends to break the continuous workflow of security breaches. Section III presents an approach on how MTD can be applied on the cloud runtime environment (infrastructure) level to regenerate the “infrastructure cells” of a system continuously, leveraging the inherent “designed-for-failure” capabilities of modern container platforms like Kubernetes, Swarm, or Mesos. This continuing regeneration will wipe out even undetected attackers in a system. However, attackers might recognize that they periodically lose their foothold in a hijacked system and might try to automate their breaches. To overcome this, Section IV presents how even the attack surface of an application can be continuously changed and therefore extends our ideas shown in [1]. We have to consider that our approach has some limitations. We will discuss these limitations in Section V and present corresponding related work in Section VI.
We conclude our findings in Section VII.

II. CYBER ATTACK REFERENCE MODEL

Figure 1 shows the cyber attack life cycle model which is used by the M-Trends reports¹ to report developments in cyber attacks over the years. According to this model, an attacker passes through different stages to complete a cyber attack mission. It starts with initial reconnaissance and compromising of access means. Social engineering methodologies [8] and phishing attacks [9] very often support these steps. Intruders aim to establish a foothold near the target. All these steps are not covered by this paper, because technical solutions are not able to harden the weakest point in security: the human being. The following steps of this model are more important for this paper. According to the life cycle model, the attacker's goal is to escalate privileges to get access to the target system. Because this leaves trails on the system which could reveal a security breach, the attacker is motivated to compromise this forensic trail. According to security reports, attackers make more and more use of counter-forensic measures to hide their presence and impair investigations. These reports refer to batch scripts used to clear event logs and securely delete arbitrary files. The technique is simple, but the intruders' knowledge of forensic artifacts demonstrates increased sophistication, as well as their intent to persist in the environment. With a barely detectable foothold, the internal reconnaissance of the victim's network is carried out to allow the lateral movement to the target system. This process is complex and lengthy and may even take weeks. So, infiltrated machines and application components are valuable to attackers and tend to be used for as long as possible. Table II shows for how astonishingly many days (median) an intruder has access to a victim system.
So, basically, there is the requirement that (1) an undetected attacker should lose access to compromised nodes of a system as fast as possible. Furthermore, there is the requirement that (2) it must be hard for an attacker to regain a foothold in a system by automating successful attacks. However, how? Section III deals with requirement (1), showing that it is possible to regenerate possibly compromised infrastructure continuously, even to get rid of undetected attackers. Section IV deals with requirement (2) and demonstrates that it is possible to change attack surfaces of applications in a way that successful attacks cannot be repeated 1:1.

TABLE II. Undetected days on victim systems reported by M-Trends. External and internal discovery data is reported since 2015. No data could be found for 2011.

Year  External notification  Internal discovery  Median
2010  -                      -                   416
2011  -                      -                   ?
2012  -                      -                   243
2013  -                      -                   229
2014  -                      -                   205
2015  320                    56                  146
2016  107                    80                  99

¹ http://bit.ly/2m7UAYb (visited 9th Nov. 2017)

III. MOVING TARGET DEFENSE MECHANISMS ON THE CONTAINER RUNTIME ENVIRONMENT LEVEL

Our recent research [10] dealt mainly with vendor lock-in and the question of how to design cloud-native applications that are transferable between different cloud service providers. One aspect that can be learned from this is that there is no common understanding of what a cloud-native application is. A kind of software that is “intentionally designed for the cloud” is an often heard but empty phrase. However, noteworthy similarities exist between various viewpoints on cloud-native applications (CNA) [5]. A common approach is to define maturity levels in order to categorize different kinds of cloud applications (see Table III). [11] proposed the IDEAL model for CNAs.
A CNA should strive for an isolated state, be distributed, provide elasticity through horizontal scaling, and be operated on automated deployment machinery. Finally, its components should be loosely coupled.

[12] stress that these properties are addressed by cloud-specific architecture and infrastructure approaches like microservices [13], API-based collaboration, adaption of cloud-focused patterns [11], and self-service elastic platforms that are used to deploy and operate these microservices via self-contained deployment units (containers). Table I lists some of these platforms that provide additional operational capabilities on top of IaaS infrastructures, like automated and on-demand scaling of application instances, application health management, dynamic routing and load balancing, as well as aggregation of logs and metrics [5].

A. Regenerating cloud application runtime environments continuously

If the reader understands and accepts the commonality that cloud-native applications are operated (more and more often) on elastic, often container-based, platforms, it is an obvious idea to delegate the responsibility to immunize cloud applications to these platforms. Recent research showed that the operation of these elastic container platforms and the design of applications running on top of them should be handled as two different engineering problems. This point of view often solves several issues in modern cloud-native application engineering [4]. This is not just true for the transferability problem but might also be an option to tackle zero-day exploits. These kinds of platforms could be an essential part of the immune system of modern cloud-native applications.

Furthermore, self-service elastic platforms are really “bulletproofed” [15]. Apache Mesos [16] has been successfully operated for years by companies like Twitter or Netflix to consolidate hundreds of thousands of compute nodes.
Elastic container platforms are designed for failure and provide self-healing capabilities via auto-placement, auto-restart, auto-replication, and auto-scaling features. They will identify lost containers (for whatever reasons, e.g., process failure or node unavailability) and will restart containers and place them on remaining nodes. These features are necessary to operate large-scale distributed systems resiliently. However, the same features can be used intentionally to purge “compromised nodes”.

[3] demonstrated a software prototype that provides the control process shown in Figure 2 and Figure 3. This process relies on an intended state $\rho$ and a current state $\sigma$ of a container cluster. If the intended state differs from the current state ($\rho \neq \sigma$), necessary adaption actions are deduced (creation and attachment/detachment of nodes, creation and termination of security groups) and processed by an execution pipeline fully automatically (see Figure 3) to reach the intended state $\rho$. With this kind of control process, a cluster can be resized simply by changing the intended number of nodes in the cluster. If the cluster is shrinking and nodes have to be terminated, affected containers of running applications will be rescheduled to other available nodes.

The downside of this approach is that it will only work for Level 2 (cloud resilient) or Level 3 (cloud-native) applications (see Table III), which, by design, can tolerate dependent service failures (due to node failures and container rescheduling). However, for that kind of Level 2 or Level 3 application, we can use the same control process to regenerate nodes of the container cluster. The reader shall consider a cluster with $\sigma = N$ nodes. If we want to regenerate one node, we change the intended state to $\rho = N + 1$ nodes, which will add one new node to the cluster ($\sigma' = N + 1$).
In a second step, we decrease the intended size of the cluster to $\rho' = N$ again, which causes one node of the cluster to be terminated ($\sigma'' = N$). So, a node is regenerated simply by adding one node and deleting one node.

TABLE III. Cloud Application Maturity Model, adapted from OPEN DATA CENTER ALLIANCE Best Practices [14]

Level  Maturity         Criteria
3      Cloud native     - Transferable across infrastructure providers at runtime and without interruption of service.
                        - Automatically scale out/in based on stimuli.
2      Cloud resilient  - State is isolated in a minimum of services.
                        - Unaffected by dependent service failures.
                        - Infrastructure agnostic.
1      Cloud friendly   - Composed of loosely coupled services.
                        - Services are discoverable by name.
                        - Components are designed to cloud patterns.
                        - Compute and storage are separated.
0      Cloud ready      - Operated on virtualized infrastructure.
                        - Instantiable from image or script.

Figure 2. The control theory inspired execution control loop compares the intended state $\rho$ of an elastic container platform with the current state $\sigma$ and derives necessary scaling actions. These actions are processed by the execution pipeline explained in Figure 3. So, platforms can be operated elastically in a set of synchronized IaaS infrastructures. Explained in detail in [3].

Figure 3. The execution pipeline processes necessary actions to transfer the current state $\sigma$ into the intended state $\rho$. See [4] for more details.

We could even regenerate the complete cluster by changing the cluster size in the following way: $\sigma = N \mapsto \sigma' = 2N \mapsto \sigma'' = N$. However, this would consume much more resources because the cluster would double its size for a limited amount of time. A more resource-efficient way would be to regenerate the cluster in $N$ steps: $\sigma = N \mapsto \sigma' = N + 1 \mapsto \sigma'' = N \mapsto \dots \mapsto \sigma^{(2N-1)} = N + 1 \mapsto \sigma^{(2N)} = N$.
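The step-wise regeneration can be illustrated with a minimal Python sketch. It is a toy model and not the prototype from [3], [4]: the `Cluster` class, the node names, and the `reconcile` helper are hypothetical, and the sketch assumes that the node to be retired is named explicitly, since the text does not say how the terminated node is chosen.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Cluster:
    """Toy model of a container cluster; node names are hypothetical."""
    nodes: list = field(default_factory=list)
    _ids: count = field(default_factory=count, repr=False)

    def add_node(self):
        # Stands in for: request VM, adjust security group, join cluster.
        name = f"node-{next(self._ids)}"
        self.nodes.append(name)
        return name

    def remove_node(self, name):
        # Stands in for: drain and deregister the node, terminate the VM.
        self.nodes.remove(name)


def reconcile(cluster, intended_size, retire_first=()):
    """Drive the current state (sigma) toward the intended state (rho)."""
    retire = list(retire_first)
    while len(cluster.nodes) < intended_size:
        cluster.add_node()
    while len(cluster.nodes) > intended_size:
        # Assumption: retire the named node, else the oldest one.
        victim = retire.pop(0) if retire else cluster.nodes[0]
        cluster.remove_node(victim)


def rolling_regeneration(cluster):
    """Replace every node one at a time; return the observed cluster sizes."""
    n = len(cluster.nodes)
    sizes = [n]
    for old in list(cluster.nodes):
        reconcile(cluster, n + 1)                  # rho = N + 1
        sizes.append(len(cluster.nodes))
        reconcile(cluster, n, retire_first=[old])  # rho' = N
        sizes.append(len(cluster.nodes))
    return sizes


cluster = Cluster()
reconcile(cluster, 3)
before = set(cluster.nodes)
sizes = rolling_regeneration(cluster)
print(sizes)                         # cluster size after each step
print(before & set(cluster.nodes))   # nodes that survived the regeneration
```

The cluster never grows beyond N + 1 nodes, which is the resource argument made above for preferring the N-step variant over doubling the cluster.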
The reader is referred to [4] for more details, especially if the reader is interested in the multi-cloud capabilities, which are not covered by this paper due to page limitations.

Whenever such a regeneration is triggered, all hijacked machines, even undetected ones, would be terminated and replaced by other machines, but the applications would be unaffected. For an attacker, this means losing their foothold in the system entirely. Imagine this were done once a day or even more frequently.

B. Evaluation

The execution pipeline presented in Figure 3 was evaluated by operating and transferring two elastic platforms (Swarm Mode of Docker 17.06 and Kubernetes 1.7). The platforms operated a reference “sock-shop” application, one of the most complete reference applications for microservices architecture research [17]. Table IV lists the machine types that show a high similarity across different providers [18].

The evaluation of [4] demonstrated that most time is spent on the IaaS level (creation and termination of nodes and security groups) and not on the elastic platform level (joining, draining nodes). The measured differences on infrastructures provided by different providers are shown in Figure 4. For the current use case, the reader can ignore the times to create and delete a security group (because that is a one-time action). However, there will be many node creations and terminations. According to our execution pipeline shown in Figure 3, a node creation ($\sigma = N \mapsto \sigma' = N + 1$) involves the durations to create a node (request of the virtual machine including all installation and configuration steps), to adjust the security group the cluster is operated in, and to join the new node into the cluster.
The shutdown of a node ($\sigma = N \mapsto \sigma' = N - 1$) involves the termination of the node (this includes the platform draining and deregistering of the node and the request to terminate the virtual machine) and the necessary adjustment of the security group. So, for a complete regeneration of a node ($\sigma = N \mapsto \sigma' = N + 1 \mapsto \sigma'' = N$) we have to add these runtimes. Table V lists these values per infrastructure. The totals in Table V are consistent with counting the security group adjustment twice, once on creation and once on termination; for AWS, for example, 70 + 7 + 2 + 2 × 1 = 81 s.

TABLE IV. Used machine types and regions for evaluation

Provider  Region          Master type    Worker type
AWS       eu-west-1       m4.xlarge      m4.large
GCE       europe-west1    n1-standard-4  n1-standard-2
Azure     europewest      Standard A3    Standard A2
OS        own datacenter  m1.large       m1.medium

TABLE V. Durations to regenerate a node (median values)

Provider  Creation  Secgroup  Joining  Term.  Total
AWS       70 s      1 s       7 s      2 s    81 s
GCE       100 s     8 s       9 s      50 s   175 s
Azure     380 s     17 s      7 s      180 s  600 s
OS        110 s     2 s       7 s      5 s    126 s

Figure 4. Infrastructure specific runtimes of IaaS operations, see [4]. Panels (in seconds, per provider AWS, OS, GCE, Azure): time to create a security group, time to adjust a security group, time to delete a security group, time to create a node, time to join a node into a cluster, time to terminate a node.

Even on the “slowest” infrastructure, a node can be regenerated in about 10 minutes. In other words, one can regenerate six nodes every hour, up to 144 nodes a day, or a cluster of 432 nodes every 72 h (which is the breach notification period required by the EU General Data Protection Regulation). If the reader compares a 72 h regeneration time of a cluster with more than 400 nodes (most systems are not so large) with the median value of 99 days that attackers were present on a victim system in 2016 (see Table II), the benefit of the proposed approach should become apparent.

IV. MOVING TARGET DEFENSE MECHANISMS ON THE MICROSERVICE ARCHITECTURE LEVEL

MTD techniques introduce methods for improving the security of protected assets by applying security-by-diversity tactics and security diversification concepts. While most MTD techniques do not have formal requirements for diversifying, i.e. when, how and why to diversify, we employ a cyber risk-based technique as the primary diversification decision making factor on the application level. Our motivation for this is to overcome the high number of vulnerabilities found in container images, as shown by recent research [19], [20]. Therefore, our MTD techniques are designed to improve this state of insecurity by reducing the window of vulnerability exposure via diversification and commensurate attack surface randomization.

A. Cyber Risk Analysis for Microservice Diversification

Larsen et al. [21] assert that a common challenge when employing diversification strategies is deciding on when, how and where to diversify. We present a cyber risk procedure to support this decision making. We leverage security metrics to design a cyber risk-based mechanism, since security metrics are useful tools for risk assessment. These metrics are computed by deriving security risks per microservice and then employing vulnerability prioritization, such that diversification is a function of microservice risk assessment, i.e. microservices are diversified in order of risk severity. We introduce the notion of the Diversification Index $D_i$ as an expression of the depth of diversification to be implemented. $D_i$ defines whether microservices are to be globally or selectively diversified. Diversifying 2 out of 4 microservices can be expressed as 2:4.
$D_i$ is formally defined as:

$$D_i = \frac{m_d}{m} \qquad (1)$$

where $m_d$ = number of microservices to be diversified and $m$ = total number of microservices in the application. For the risk assessment, we adopt two approaches:

1) Risk Analysis Using CVSS: The Common Vulnerability Scoring System (CVSS) [22] is a widely adopted vulnerability metrics standard. It provides vulnerability base scores which express the severity of damage the referred vulnerability might inflict upon a system if exploited. In order to derive the microservice security state (Security Risk, $S_R$), base scores of all the vulnerabilities detected can be summed and averaged as expressed below:

$$S_R = \frac{1}{N} \sum_{i=1}^{N} V_i \qquad (2)$$

where $S_R$ is the Security Risk, $V_i$ is the CVSS base score of vulnerability $i$, and $N$ is the total number of vulnerabilities detected in microservice $m$. However, averaging vulnerabilities to obtain a single metric to signify a system's security state is not optimal. Derived values are not sufficiently representative of other factors such as the public availability of exploits. Therefore, we employ another scoring technique called the shrinkage estimator, an approach which has been popularly used for online rating systems, e.g. IMDB. The shrinkage estimator considers the average rating and the number of votes. Hence, it provides a more precise value for $S_R$ than mere averaging (Equation 2). Leveraging the shrinkage estimator, we can derive a more precise $S_R$ as follows:

$$S_R = \frac{v}{v + a} R + \frac{a}{v + a} C \qquad (3)$$

where
$v$ = the total number of vulnerabilities detected in a microservice,
$a$ = the minimum number of vulnerabilities to be detected in a microservice assessment before it is included in the risk analysis,
$C$ = the mean severity score of vulnerabilities detected in a microservice,
$R$ = the average severity score of all vulnerabilities infecting a microservice-based application.

Pearson's correlation coefficient is derived to determine the dependence relationship between the microservices.
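Equations (1) to (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and all input values are hypothetical, and `R` and `C` are simply passed in as the symbols defined above.

```python
from statistics import mean


def diversification_index(m_d, m):
    """Equation (1): share of microservices to be diversified."""
    if m <= 0 or not 0 <= m_d <= m:
        raise ValueError("need m > 0 and 0 <= m_d <= m")
    return m_d / m


def mean_security_risk(base_scores):
    """Equation (2): plain average of the CVSS base scores V_i."""
    return mean(base_scores)


def shrinkage_security_risk(v, a, R, C):
    """Equation (3): blend of R and C with weights v/(v+a) and a/(v+a)."""
    return (v / (v + a)) * R + (a / (v + a)) * C


# Hypothetical inputs, chosen only to exercise the formulas.
d_i = diversification_index(2, 4)              # the 2:4 example from the text
s_r_mean = mean_security_risk([9.8, 7.5, 5.3])
s_r_shrunk = shrinkage_security_risk(v=3, a=5, R=6.0, C=8.0)
print(d_i, s_r_mean, s_r_shrunk)
```

With few detected vulnerabilities (small $v$ relative to $a$), Equation (3) leans toward $C$; as $v$ grows, the weight shifts toward $R$.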
2) Risk Analysis Using OWASP Risk Rating Methodology: The risk assessment method described in the previous subsection is limited to vulnerabilities contained in the Common Vulnerabilities and Exposures (CVE) dictionary. CVE is a public dictionary for publishing known vulnerabilities. These vulnerabilities are analyzed and assigned vulnerability security metrics using the CVSS. However, the CVE contains only a handful of web application vulnerabilities. Thus, we need another risk assessment methodology for application layer vulnerabilities. This additional step is necessary since microservices are essentially web/REST-based applications. We opt for the OWASP Risk Rating Methodology (ORRM), which is specifically designed for web applications [23]. This methodology is based on two core risk components, Likelihood and Impact, formally expressed as:

$$\mathit{Risk} = \mathit{Likelihood} \times \mathit{Impact} \qquad (4)$$

In order to derive these metrics, risk assessors are required to consider the threat vector, the attacks to be used, and the impacts of successful attacks.

Figure 5. Typical Microservice Attack Surfaces illustrated with the PetClinic Application [24]

B. Dissecting Microservice Attack Surfaces

An important aspect of our security-by-diversity tactics is to manipulate microservice attack surfaces against possible attackers through random architectural transformations. The attack surfaces are altered by randomizing the entry and exit points, which are commonly used for identifying attack surfaces [25], [26]. A detailed understanding of these attack surfaces is imperative. Therefore, we categorize the microservice attack surface into horizontal and vertical attack surfaces and thereafter employ vulnerability correlation to identify vulnerability similarities.

1) Horizontal Vulnerability Correlation: The objective of correlating vulnerabilities horizontally is to analyze the relationship of vulnerabilities along the horizontal attack surface, i.e.
the parts of the applications users directly interact with. Figure 5 illustrates the multi-layered attack surface of the PetClinic application [24]. The application layer horizontal attack surface consists of the interactions and exit/entry points from the API gateway to the Vets, Visits and Customer services application layers. Requests and responses traverse this layer, providing attack opportunities for attackers. The vulnerability correlation process is similar to security event correlation techniques [27], though rather than clustering similar attributes, e.g., malicious IP addresses, we focus on Common Weakness Enumeration (CWE) Ids. The CWE is a standardized classification system for application weaknesses². For example, CWE 89 categorizes all vulnerabilities related to Improper Neutralization of Special Elements used in an SQL Command (SQL Injection)³ and can be mapped to several CVEs, e.g. CVE-2016-6652⁴, a SQL injection vulnerability in Spring Data JPA. If this vulnerability exists in all of PetClinic's microservices, an attacker could easily conduct a correlated attack (Attack Paths 2, 4, 5, and 6 of Figure 5), resulting in correlated failures and eventual application failure, since each microservice ultimately contributes to the successful functioning of the PetClinic application.

2) Vertical Vulnerability Correlation: The vertical correlation technique is similar to the horizontal correlation. However, the interactions across application-image layers are analyzed. This analysis, therefore, employs security-by-design tactics across the vertical attack surface. Attack Path 1 illustrates the exploitation of a vulnerability across the vertical attack surface: the attacker initiates an attack against the API Gateway of the PetClinic application, from the application layer to the image layer.
From there, another attack is launched against the Customers service application layer, across the image layer, and finally the database is compromised. The same attack can be repeated against the other microservices if they are affected by the same vulnerabilities. Hence, we need to express such causal relationships in vulnerability correlation matrices.

$$
\begin{array}{c|cccc}
       & V_1    & V_2    & \cdots & V_n    \\ \hline
M_1    & 1      & 1      & \cdots & \cdots \\
M_2    & 1      & 0      & \cdots & \cdots \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
M_n    & 1      & 1      & \cdots & \cdots
\end{array}
$$

Figure 6. Microservice Vulnerability Correlation Matrix

Correlated vulnerabilities can be represented with correlation matrices, more specifically referred to as microservices vulnerability correlation matrices. Influenced by [28], we define the microservices vulnerability correlation matrix as a mapping of vulnerabilities to microservice instances in a microservice-based application. The microservices vulnerability correlation matrix presents a view of vulnerabilities that concurrently affect multiple microservices. An example of the microservice correlation matrix is Figure 6, where the microservices $M_1$ and $M_2$ will have a correlated failure under an attack that exploits vulnerability $V_1$, since they share the same vulnerability. However, an attack that exploits $V_2$ affects $M_1$, while $M_2$ remains unaffected.

² https://cwe.mitre.org/index.html
³ https://cwe.mitre.org/data/definitions/89.html
⁴ https://nvd.nist.gov/vuln/detail/CVE-2016-6652

C. Evaluation

The PetClinic application was used for our evaluation. PetClinic is part of the Spring Cloud demo applications and an established cloud-native reference application used for demonstration purposes in plenty of industrial and academic microservice-related use cases [17]. It is, therefore, an excellent reference. However, we had to modify the original PetClinic by adding OpenAPI support. Two experiments have been conducted: (1) a security risk comparison to verify the effectiveness of our security-by-diversity tactics, and (2) an attack surface analysis to evaluate the improvement in the horizontal and vertical attack surfaces.

TABLE VI. Vulnerabilities Detected in PetClinic App-Layer

CWE-ID   API-GATEWAY  CUSTOMERS-SERVICE  VETS-SERVICE  VISITS-SERVICE
CWE-16   31           4                  2             2
CWE-524  48           17                 6             11
CWE-79   0            3                  0             1
CWE-425  0            0                  20            0
CWE-200  14           6                  0             0
CWE-22   0            1                  0             0
CWE-933  1            0                  0             0
TOTAL    94           31                 28            14

Figure 7. Vulnerability scanning results of the Homogeneous PetClinic application

In order to perform the security risk analysis, we leveraged the Cloud Aware Vulnerability Assessment System (CAVAS) [29]. The vulnerability scanners integrated into CAVAS (Anchore and OWASP ZAP) are used for launching vulnerability scans against PetClinic images and microservice instances, respectively. The detected vulnerabilities were persisted in the Security Reports and CMDB. First, the diversification index is derived by computing risks per PetClinic microservice to obtain the Security Risk $S_R$. We inspect the results of the image vulnerability scan and notice that the vulnerabilities are too similar (Figure 7). Therefore, $S_R$ will be too similar for meaningful vulnerability prioritization. Since the prioritization step is imperative for ordering microservices by risk severity, we compute $S_R$ using the ORRM (Section IV-A2). The application layer scan results are retrieved from the database and analyzed. Scores are assigned to the detected vulnerabilities based on the risk scores for the OWASP Top-10 2017 web vulnerabilities [30]. This is a reasonable approach given that OWASP uses ORRM for deriving the Top-10 web application vulnerability scores. Also, this affords an objective assignment of scores⁵, which are publicly verifiable. Table VI shows the distribution of detected vulnerabilities, while a subset of the mapping between CWE-Ids and OWASP Top-10 is given in Table VII.
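The correlation matrix of Section IV-B and the risk-based ordering can be sketched from the published numbers. The counts below are copied from Table VI and the scores from Table VII. Two things are assumptions and not taken from the paper: CWE-Ids are used as the matrix columns, and the per-microservice risk is aggregated as the sum of count times risk score, since the text does not state the aggregation rule.

```python
SERVICES = ["API-GATEWAY", "CUSTOMERS-SERVICE", "VETS-SERVICE", "VISITS-SERVICE"]

# Table VI: detected vulnerabilities per CWE-ID, in SERVICES order.
COUNTS = {
    "CWE-16":  [31, 4, 2, 2],
    "CWE-524": [48, 17, 6, 11],
    "CWE-79":  [0, 3, 0, 1],
    "CWE-425": [0, 0, 20, 0],
    "CWE-200": [14, 6, 0, 0],
    "CWE-22":  [0, 1, 0, 0],
    "CWE-933": [1, 0, 0, 0],
}

# Table VII: risk score per CWE-ID.
SCORES = {"CWE-16": 6.0, "CWE-524": 3.0, "CWE-79": 6.0, "CWE-425": 3.0,
          "CWE-200": 7.0, "CWE-22": 6.0, "CWE-933": 3.0}

# Correlation matrix: rows are microservices, columns are CWE-IDs;
# an entry of 1 means the weakness class was detected in that microservice.
matrix = {
    svc: {cwe: int(row[i] > 0) for cwe, row in COUNTS.items()}
    for i, svc in enumerate(SERVICES)
}

# Weakness classes shared by more than one microservice (correlated).
correlated = [cwe for cwe, row in COUNTS.items() if sum(c > 0 for c in row) > 1]

# Assumed aggregation: sum of (count x risk score) per microservice.
risk = {
    svc: sum(row[i] * SCORES[cwe] for cwe, row in COUNTS.items())
    for i, svc in enumerate(SERVICES)
}
ranking = sorted(SERVICES, key=risk.get, reverse=True)

print(correlated)
print(ranking)
```

Under this assumed aggregation, the ordering places the API gateway first, followed by the Customers, Vets, and Visits services.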
From Table VII, it is obvious that the API Gateway has the most severe risks, followed by the Customers, Vets, and Visits microservices. Therefore, we apply diversification based on this result using a diversification index of 3:4, i.e. three out of four microservices are diversified. The diversified PetClinic is retested, and the results are shown in Figure 8. We observe that the application layer vulnerabilities of the diversified PetClinic are reduced by about 53.3%. However, the image vulnerabilities increased, especially for the Customers and Vets services, which were transformed to NodeJS and Ruby, respectively. Importantly, the microservices are no longer homogeneous, and the possibilities for correlated attacks have been eliminated. Also, the vulnerabilities in the API Gateway's image are drastically reduced from 696 to 6, while its application layer vulnerabilities are reduced from 94 to 24. The reduction is due to the smaller code base, a distinct characteristic of the Python programming model. The API Gateway is the most important microservice, since it presents the most vulnerable and sensitive attack surface of the application. We therefore consider the security of PetClinic improved: out of 94 opportunities for attacking the API Gateway, only 24 are left.

5 https://www.owasp.org/index.php/Top_10-2017_Details_About_Risk_Factors

TABLE VII. Risk Scores By CWE

CWE-ID    OWASP T10 Risk Category          Risk Score
CWE-16    A6 - Security Misconfiguration   6.0
CWE-524   Not Listed                       3.0
CWE-79    A6 - Security Misconfiguration   6.0
CWE-425   Not Listed                       3.0
CWE-200   A3 - Sensitive Data Exposure     7.0
CWE-22    A5 - Broken Access Control       6.0
CWE-933   Not Listed                       3.0

Figure 8. Vulnerabilities detected in the Diversified PetClinic Application

D. Attack Surface Analysis

Here we analyze the attack surfaces of the homogeneous and diversified PetClinic versions. We consider direct and indirect attack surfaces, i.e. vulnerabilities that lead to attacks directly or indirectly, respectively. From the vulnerability scan reports, each detected vulnerability is counted as an attack surface unit (attack opportunities concept [31], [32]). Figure 9 compares the horizontal app layer attack surface of both PetClinic versions. The diversified version has a reduced attack surface and thus better security. Essentially, the attackability of PetClinic has been reduced.

Figure 9. Horizontal Attack Surface Analysis

However, the results for the vertical attack surface are different. This attack surface portrays attacks traversing the app-image layer (Figure 5). While there are fewer correlated vulnerabilities in the diversified API Gateway, the correlated vulnerabilities in the Customers and Vets services have increased. This increase is due to the corresponding increase in image vulnerabilities. Nevertheless, the attackability due to homogeneity is reduced. We want to emphasize that intruders would perceive this approach as a permanently changing attack surface, which dramatically increases the effort needed to breach the system.

V. CRITICAL DISCUSSION

The idea presented in Section III, an immune-system-like approach to remove undetected intruders from virtual machines, seems intriguing to many experts. Nevertheless, according to the state of the art, this is currently not done. There might be reasons for that, and there are open questions the reader should consider.

It is often remarked that the proposal can be compared with the practice of periodically restarting virtual machines that have memory leak issues, that this apparently has nothing to do with security concerns, and that it could be applied to traditional (non-cloud) systems as well. So, the approach may have an even broader focus than presented (which is not a bad thing).

Another question is how to detect “infected” nodes. The presented approach simply selects nodes at random and will hit every node at some time.
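The random selection described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation evaluated in Section III: the node names and both function names are hypothetical, and the use of an operating-system entropy source is our assumption, chosen so that the selection order is hard for an attacker to predict.

```python
import random


def pick_node_for_regeneration(nodes, rng=None):
    """Select the next node to regenerate uniformly at random.

    By default an OS-backed generator is used (an assumption of this sketch).
    """
    if rng is None:
        rng = random.SystemRandom()
    return rng.choice(nodes)


def steps_until_all_regenerated(nodes, rng=None, max_steps=100_000):
    """Count the regenerations needed until every node was hit at least once.

    Returns None if max_steps is reached first.
    """
    if rng is None:
        rng = random.SystemRandom()
    seen = set()
    for step in range(1, max_steps + 1):
        seen.add(pick_node_for_regeneration(nodes, rng))
        if len(seen) == len(nodes):
            return step
    return None


if __name__ == "__main__":
    cluster = ["node-1", "node-2", "node-3", "node-4", "node-5"]
    print("Next node:", pick_node_for_regeneration(cluster))
    print("Regenerations until full coverage:", steps_until_all_regenerated(cluster))
```

The second function illustrates the cost of pure random selection: some nodes are regenerated several times before the last one is hit, so full coverage needs more regenerations than there are nodes.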
The same could be done using a round-robin approach, but a round-robin strategy would be more predictable for an attacker. However, both strategies create a lot of additional regenerations, which leaves room for improvement. It seems obvious to search for solutions like those presented in [33], [34] to provide some “intelligence” for the identification of “suspicious” nodes. Such intelligence would limit regenerations to nodes that are likely “infected”. In all cases, it is essential for anomaly detection approaches to secure the forensic trail [35], [36].

Furthermore, regenerating nodes periodically or even randomly is likely nontrivial in practice and depends on the state management requirements of the affected nodes. Therefore, this paper proposes the approach only as a promising solution for Level 2 or 3 cloud applications (see Table III) that are operated on elastic container platforms. These kinds of applications have desirable state management characteristics. However, this limits the approach to applications that follow the microservice architecture approach.

One could be further concerned about exploits that adapt to bio-inspired systems. Stealthy resident worms dating back to the old PC era would be an example. This concern might be especially valid for the frequently encountered case of not entirely stateless services, when data-as-code dependencies or code-injection vulnerabilities exist. Furthermore, attackers could shift their focus to the platform itself in order to disable the regeneration mechanism as a first step. On the other hand, this could be easily detected, although more sophisticated attacks could exist.

The “immunization” results on the infrastructure level (see Section III) are impressive but should be combined with secure coding practices in development pipelines, i.e. with continuous security assessments. We presented how to automate security in CNA development environments [29].
In these cases, detected web vulnerabilities, e.g. X-Content-Type-Options Header Missing, can be resolved by appending the appropriate headers, as described and advised in the CAVAS reports. Furthermore, image vulnerabilities can be reduced by using more secure container images. For example, Alpine Linux images can replace Ubuntu images as base images, because their smaller footprint means a smaller attack surface [37].

Our MTD approach presented in Section IV leverages automatic code generation techniques on the application level via the Swagger Codegen library. We found that over 150 companies/projects use Swagger Codegen in production (see footnote 6); hence, the library is mature and capable of transforming large microservice applications. Nevertheless, in this work a basic application has been used to introduce the concepts; more complex applications will be tested in the future.

However, our approach also has some limitations. Our techniques can be applied only to OpenAPI-compatible microservices. Also, Swagger Codegen currently supports about 30 programming languages/frameworks, which might limit the number of possible combinations (entropy), although more languages can be added via customizations. Manual effort might be needed to check whether the transformation output is functionally compatible, especially for complex applications.

An event-based technique might enhance our MTD technique by detecting attacks and triggering commensurate diversification. Conventionally, Web Application Firewalls (WAF) are deployed in front of web applications to detect and stop malicious traffic (which might also indicate an ongoing attack). Hence, a WAF can be deployed at the API Gateway and configured with attack thresholds. Once a threshold is breached, the WAF would trigger the diversification of the entire microservice application or of the endangered microservice. A scheduled diversification routine might support this methodology.
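The threshold idea can be illustrated with a small sliding-window counter. The Python sketch below is hypothetical and has not been implemented or evaluated in this work: the class name, its parameters, and the callback are our assumptions, and a real WAF would supply the suspicious-request events through its own rule engine.

```python
from collections import deque


class AttackThresholdTrigger:
    """Fires a callback once the number of suspicious requests observed
    inside a sliding time window exceeds a configured threshold."""

    def __init__(self, threshold, window_seconds, on_breach):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.on_breach = on_breach      # e.g. a function that starts diversification
        self._events = deque()          # (timestamp, service) pairs, oldest first

    def record(self, service, timestamp):
        """Register one suspicious request; return True if the trigger fired."""
        self._events.append((timestamp, service))
        # Drop events that have fallen out of the sliding window.
        while self._events and self._events[0][0] <= timestamp - self.window_seconds:
            self._events.popleft()
        if len(self._events) > self.threshold:
            self.on_breach(service)
            self._events.clear()        # start counting afresh after a breach
            return True
        return False


if __name__ == "__main__":
    diversified = []
    trigger = AttackThresholdTrigger(
        threshold=3, window_seconds=60, on_breach=diversified.append
    )
    for t in (0, 10, 20, 30):
        fired = trigger.record("api-gateway", timestamp=t)
    print("Trigger fired:", fired, "for", diversified)
```

In this sketch the callback receives the name of the microservice that caused the breach, so the caller can decide whether to diversify only that microservice or the entire application.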
These techniques can comfortably be applied across cloud platforms using orchestration technologies, e.g. Kubernetes.

VI. RELATED WORK

To the best of the authors' knowledge, there are currently no approaches that make intentional use of virtual machine regeneration for security purposes, neither on the infrastructure level nor on the application level. However, the proposed approach is derived from multi-cloud scenarios and their increased security requirements. Several promising approaches deal with multi-cloud scenarios, so all of them could offer equal opportunities. However, these approaches often come with much inherent complexity. A container-based approach seems to handle this kind of complexity better. There are some good survey papers on this topic [38], [39], [40], [41].

MTD via software diversity was first introduced by Forrest et al. [42]; since then, the concept has been applied at different abstraction levels. Baudry et al. [43] introduced sosiefication, a diversification method that transforms software programs by generating corresponding replicas through statement deletion, addition, or replacement operators. These variants still exhibit the same functionality but are computationally diverse. Williams et al. [44] presented Genesis, a VM-based dynamic diversification system. Genesis employed the Strata VM to distribute software components such that every version became unique and hence difficult to attack. A detailed comparison of automated diversification techniques was presented in [21]. The authors have not found prior work that applied MTD concepts to microservices.

6 https://github.com/swagger-api/swagger-codegen

VII. CONCLUSION

There is still no such thing as an impenetrable system. Once attackers successfully breach a system, there is little to prevent them from doing arbitrary harm, but we can reduce the time available to the intruder to do this.
Moreover, we can make it harder to replay a successful attack. The presented approach evolved mainly from transferability research questions for cloud-native applications. Therefore, it is limited to microservice-based application architectures, but it provides some unusual perspectives for thinking about security in general.

Basically, we proposed an “immune system” inspired approach to tackle zero-day exploits. The founding cells are continuously regenerated. The primary intent is to massively reduce the time in which an attacker can act undetected. Therefore, this paper proposed to regenerate virtual machines (the cells of an IT system) with a much higher frequency than usual to purge even undetected intruders. Evaluations on infrastructures provided by AWS, GCE, Azure, and OpenStack showed that a virtual machine could be regenerated within two minutes (AWS) to 10 minutes (Azure). The reader should compare these times with recent cybersecurity reports. In 2016, an attacker remained undetected on a victim system for about 100 days. The presented approach means for intruders that their undetected time on victim systems is no longer measured in months or days; it would be measured in minutes.

However, regenerated virtual machines will incorporate the same set of application vulnerabilities. So, a reasonable approach for intruders would be to script their attacks and simply rerun them. Although they might lose their foothold in a system within minutes, they can regain it automatically within seconds. Therefore, we propose to alter the attack surface of applications by randomizing the entry and exit points, which are commonly used for identifying attack surfaces [25], [26]. Based on horizontal and vertical microservice attack surfaces, we demonstrated how to employ a vulnerability correlation to identify vulnerability similarities on the application layer and how to adapt the attack surface accordingly.
This attack surface modification would let even automated and formerly successful attack scripts fail (at least partly). We propose and demonstrate the feasibility of diversifying the application via dynamic transformations of its containerized components at runtime. In our presented use cases, we could show that it is possible to change the attack surface of a reference application incorporating over 600 container image vulnerabilities and approximately 80 application vulnerabilities to a surface with no image vulnerabilities and only 24 application vulnerabilities. That is a reduction of almost 98%. What is more, the surface of the application can be changed continuously, so that scripted attacks fail with each surface change. That is a nightmare from an intruder's point of view.

The critical discussion in Section V showed that there is a need for additional evaluation and room for more in-depth research on both levels: continuous infrastructure regeneration and application surface modification. However, several reviewers remarked independently that the basic idea is so “intriguing” that it should be pursued more rigorously.

ACKNOWLEDGMENT

This research is partly funded by the Cloud TRANSIT project (13FH021PX4, German Federal Ministry of Education and Research). The authors would like to thank Bob Duncan from the University of Aberdeen for his inspiring thoughts on cloud security challenges.

REFERENCES

[1] N. Kratzke, “About an Immune System Understanding for Cloud-native Applications - Biology Inspired Thoughts to Immunize the Cloud Forensic Trail,” in Proc. of the 9th Int. Conf. on Cloud Computing, GRIDS, and Virtualization (CLOUD COMPUTING 2018, Barcelona, Spain), 2018.
[2] B. Duncan and M. Whittington, “Compliance with standards, assurance and audit: does this equal security?” in Proc. 7th Int. Conf. Secur. Inf. Networks - SIN '14. Glasgow: ACM, 2014, pp. 77–84. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2659651.2659711
[3] N. Kratzke, “Smuggling Multi-Cloud Support into Cloud-native Applications using Elastic Container Platforms,” in Proc. of the 7th Int. Conf. on Cloud Computing and Services Science (CLOSER 2017), 2017.
[4] N. Kratzke, “About the complexity to transfer cloud applications at runtime and how container platforms can contribute?” in Cloud Computing and Service Sciences: 7th International Conference, CLOSER 2017, Revised Selected Papers, Communications in Computer and Information Science (CCIS). Springer International Publishing, 2018, to be published.
[5] N. Kratzke and P.-C. Quint, “Understanding Cloud-native Applications after 10 Years of Cloud Computing - A Systematic Mapping Study,” Journal of Systems and Software, vol. 126, no. April, 2017.
[6] N. Kratzke and R. Peinl, “ClouNS - a Cloud-Native Application Reference Model for Enterprise Architects,” in 2016 IEEE 20th Int. Enterprise Distributed Object Computing Workshop (EDOCW), Sep. 2016.
[7] L. Bilge and T. Dumitras, “Before we knew it: an empirical study of zero-day attacks in the real world,” in ACM Conference on Computer and Communications Security, 2012.
[8] K. Krombholz, H. Hobel, M. Huber, and E. Weippl, “Advanced social engineering attacks,” Journal of Information Security and Applications, vol. 22, 2015.
[9] S. Gupta, A. Singhal, and A. Kapoor, “A literature survey on social engineering attacks: Phishing attack,” 2016 International Conference on Computing, Communication and Automation (ICCCA), 2016, pp. 537–540.
[10] N. Kratzke and P.-C. Quint, “Technical Report of the Project CloudTRANSIT - Transfer Cloud-native Applications at Runtime,” Oct. 2018, technical report.
[11] C. Fehling, F. Leymann, R. Retter, W. Schupeck, and P. Arbitter, Cloud Computing Patterns: Fundamentals to Design, Build, and Manage Cloud Applications. Springer Publishing Company, Incorporated, 2014.
[12] A. Balalaie, A. Heydarnoori, and P. Jamshidi, “Migrating to Cloud-Native Architectures Using Microservices: An Experience Report,” in 1st Int. Workshop on Cloud Adoption and Migration (CloudWay), Taormina, Italy, 2015.
[13] S. Newman, Building Microservices. O'Reilly Media, Incorporated, 2015.
[14] S. Ashtikar, C. Barker, B. Clem, P. Fichadia, V. Krupin, K. Louie, G. Malhotra, D. Nielsen, N. Simpson, and C. Spence, “OPEN DATA CENTER ALLIANCE Best Practices: Architecting Cloud-Aware Applications Rev. 1.0,” 2014. [Online]. Available: https://www.opendatacenteralliance.org/docs/architecting_cloud_aware_applications.pdf
[15] M. Stine, Migrating to Cloud-Native Application Architectures. O'Reilly, 2015.
[16] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. H. Katz, S. Shenker, and I. Stoica, “Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center,” in 8th USENIX Conf. on Networked systems design and implementation (NSDI'11), vol. 11, 2011.
[17] C. M. Aderaldo, N. C. Mendonça, C. Pahl, and P. Jamshidi, “Benchmark requirements for microservices architecture research,” in Proc. of the 1st Int. Workshop on Establishing the Community-Wide Infrastructure for Architecture-Based Software Engineering, ser. ECASE '17. Piscataway, NJ, USA: IEEE Press, 2017.
[18] N. Kratzke and P.-C. Quint, “About Automatic Benchmarking of IaaS Cloud Service Providers for a World of Container Clusters,” Journal of Cloud Computing Research, vol. 1, no. 1, 2015.
[19] J. Gummaraju, T. Desikan, and Y. Turner, “Over 30% of official images in docker hub contain high priority security vulnerabilities,” BanyanOps, Tech. Rep., 2015.
[20] R. Shu, X. Gu, and W. Enck, “A study of security vulnerabilities on docker hub,” in Proceedings of the Seventh ACM on Conference on Data and Application Security and Privacy, 2017.
[21] P. Larsen, S. Brunthaler, L. Davi, A.-R. Sadeghi, and M. Franz, “Automated software diversity,” Synthesis Lectures on Information Security, Privacy, & Trust, vol. 10, no. 2, 2015, pp. 1–88.
[22] P. Mell, K. Scarfone, and S. Romanosky, “Common vulnerability scoring system,” IEEE Security & Privacy, 2006.
[23] OWASP, “Owasp risk rating methodology,” online.
[24] Pivotal, “Distributed version of spring petclinic built with spring cloud,” https://github.com/spring-petclinic/spring-petclinic-microservices, 2019.
[25] A. Younis, Y. K. Malaiya, and I. Ray, “Assessing vulnerability exploitability risk using software properties,” Software Quality Journal, 2016.
[26] P. K. Manadhata, Y. Karabulut, and J. M. Wing, “Report: Measuring the attack surfaces of enterprise software,” ESSoS, vol. 9, 2009, pp. 91–100.
[27] M. Ficco, “Security event correlation approach for cloud computing,” International Journal of High Performance Computing and Networking 1, 2013.
[28] P.-Y. Chen, G. Kataria, and R. Krishnan, “Correlated failures, diversification, and information security risk management,” MIS quarterly, 2011, pp. 397–422.
[29] K. A. Torkura, M. I. Sukmana, and C. Meinel, “Cavas: Neutralizing application and container security vulnerabilities in the cloud native era (to appear),” in 14th EAI International Conference on Security and Privacy in Communication Networks. Springer, 2018.
[30] OWASP, “Application security risks-2017. open web application security project (owasp),” 2017.
[31] M. Howard, J. Pincus, and J. M. Wing, “Measuring relative attack surfaces,” in Computer security in the 21st century. Springer, 2005, pp. 109–137.
[32] OWASP, “Attack surface analysis cheat sheet,” https://www.owasp.org/index.php/Attack_Surface_Analysis_Cheat_Sheet.
[33] Q. Fu, J.-G. Lou, Y. Wang, and J. Li, “Execution Anomaly Detection in Distributed Systems through Unstructured Log Analysis,” in 2009 Ninth IEEE Int. Conf. on Data Mining, 2009.
[34] M. Wurzenberger, F. Skopik, R. Fiedler, and W. Kastner, “Applying High-Performance Bioinformatics Tools for Outlier Detection in Log Data,” in CYBCONF, 2017.
[35] B. Duncan and M. Whittington, “Cloud cyber-security: Empowering the audit trail,” Int. J. Adv. Secur., vol. 9, no. 3 & 4, 2016, pp. 169–183.
[36] B. Duncan and M. Whittington, “Creating an Immutable Database for Secure Cloud Audit Trail and System Logging,” in Cloud Comput. 2017 8th Int. Conf. Cloud Comput. GRIDs, Virtualization. Athens, Greece: IARIA, ISBN: 978-1-61208-529-6, 2016, pp. 54–59.
[37] H. Gantikow, C. Reich, M. Knahl, and N. Clarke, “Providing security in container-based hpc runtime environments,” in International Conference on High Performance Computing. Springer, 2016.
[38] A. Barker, B. Varghese, and L. Thai, “Cloud Services Brokerage: A Survey and Research Roadmap,” in 2015 IEEE 8th International Conference on Cloud Computing. IEEE, jun 2015.
[39] D. Petcu and A. V. Vasilakos, “Portability in clouds: approaches and research opportunities,” Scalable Computing: Practice and Experience, vol. 15, no. 3, oct 2014.
[40] A. N. Toosi, R. N. Calheiros, and R. Buyya, “Interconnected Cloud Computing Environments,” ACM Computing Surveys, vol. 47, no. 1, may 2014.
[41] N. Grozev and R. Buyya, “Inter-Cloud architectures and application brokering: taxonomy and survey,” Software: Practice and Experience, vol. 44, no. 3, mar 2014.
[42] S. Forrest, A. Somayaji, and D. H. Ackley, “Building diverse computer systems,” in Operating Systems, 1997., The Sixth Workshop on Hot Topics in. IEEE, 1997, pp. 67–72.
[43] B. Baudry, S. Allier, and M. Monperrus, “Tailored source code transformations to synthesize computationally diverse program variants,” in Proceedings of the 2014 International Symposium on Software Testing and Analysis. ACM.
[44] D. Williams, W. Hu, J. W. Davidson, J. D. Hiser, J. C. Knight, and A. Nguyen-Tuong, “Security through diversity: Leveraging virtual machine technology,” IEEE Security & Privacy, 2009.