ARMS: Automated rules management system for fraud detection


Authors: David Aparício, Ricardo Barata, João Bravo, João Tiago Ascensão, Pedro Bizarro

David Aparício (david.aparicio@feedzai.com), Ricardo Barata (ricardo.barata@feedzai.com), João Bravo (joao.bravo@feedzai.com), João Tiago Ascensão (joao.ascensao@feedzai.com), Pedro Bizarro (pedro.bizarro@feedzai.com) — Feedzai

ABSTRACT

Fraud detection is essential in financial services, with the potential of greatly reducing criminal activities and saving considerable resources for businesses and customers. We address online fraud detection, which consists of classifying incoming transactions as either legitimate or fraudulent in real-time. Modern fraud detection systems consist of a machine learning model and rules defined by human experts. Often, the rules' performance degrades over time due to concept drift, especially of an adversarial nature. Furthermore, they can be costly to maintain, either because they are computationally expensive or because they send transactions for manual review. We propose ARMS, an automated rules management system that evaluates the contribution of individual rules and optimizes the set of active rules using heuristic search and a user-defined loss function. It complies with critical domain-specific requirements, such as handling different actions (e.g., accept, alert, and decline), priorities, blacklists, and large datasets (i.e., hundreds of rules and millions of transactions). We use ARMS to optimize the rule-based systems of two real-world clients. Results show that it can maintain the original systems' performance (e.g., recall, or false-positive rate) using only a fraction of the original rules (≈ 50% in one case, and ≈ 20% in the other).
CCS CONCEPTS

• Theory of computation → Optimization with randomized search heuristics; • Software and its engineering → Genetic programming; • Applied computing → Online banking; Online shopping; Secure online transactions

KEYWORDS

fraud detection; genetic programming; evolutionary algorithms; greedy algorithms; randomized search

ACM Reference Format: David Aparício, Ricardo Barata, João Bravo, João Tiago Ascensão, and Pedro Bizarro. 2020. ARMS: Automated rules management system for fraud detection. In Proceedings of KDD '20 (submitted). ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). KDD '20 (submitted), August 22–27, San Diego, CA, USA. © 2020 Copyright held by the owner/author(s). ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

Financial institutions, merchants, customers, and government agencies alike suffer fraud-related losses daily, including credit card theft and other scams. Financial fraud consists of someone inappropriately obtaining the details of a payment card (e.g., a credit/debit card) and using it to make unauthorized transactions. Frequently, the cardholder detects such illicit usage and initiates a dispute with the bank to be reimbursed (a chargeback), at the expense of the merchant or bank that accepted the transaction. An over-conservative decision-maker might block all suspicious activity. However, this is far from optimal, as fraud patterns are not trivial, and it prevents legitimate economic activity.
Therefore, it is essential to adjust automated fraud detection systems to the risk profile of the client. Modern automated fraud detection systems consist of a machine learning (ML) model followed by a rule-based system. The model scores the transaction. The rule-based system uses the score and the triggers of manually defined rules to decide an action (i.e., accept, alert, or decline the transaction). Rule-based systems with many rules are complex, hard to maintain, and frequently computationally expensive. An ideal system has only a minimal set of rules that ensures performance while keeping requirements satisfied and alerts low.

Our main contributions are the following:

(1) Identifying a new problem: how to properly evaluate a complex rules system, taking into account overlapping rule triggers with different rule priorities and blacklists (Section 2).
(2) Proposing ARMS (Figure 1), a framework which handles all the bookkeeping necessary to correctly evaluate such rules systems (Sections 3.1–3.4).
(3) Exploring optimization methods (namely random search, greedy expansion, and genetic programming) to improve the original system according to user-defined criteria (Sections 3.5–3.8).
(4) Evaluating our proposed solutions on both synthetic and real data, demonstrating improvements to existing rules systems deployed at Feedzai (Section 4).

Evaluating the performance of the whole fraud detection system is simple: given the fraud labels (i.e., the chargebacks) and the historical decisions, we compute performance metrics (e.g., recall at a given false-positive rate, or FPR). However, it is not enough to analyze the performance of each rule by itself. We need to consider how it contributes to the entire system, as its triggers may overlap with other rules with different decisions and priorities. Blacklists are another source of dependencies.
Blacklisting rules, when triggered due to fraudulent behavior, blacklist the user (or email, or card) so that their future transactions are promptly declined. Deactivating blacklisting rules has side effects on the blacklists themselves and, therefore, on whether rules that check them trigger.

Figure 1: ARMS components: handling blacklists, priority shuffling, and optimizing a user-defined loss function. Panels: (a) evaluate the original rules system; (b) augment the rules pool and handle blacklists; (c) optimize the system according to the loss function; (d) results.

In this work, we study the use case of a system with a pre-existing set of rules and priorities to optimize according to a user-defined objective function. As far as we know, we are the first to address the proper evaluation and optimization of such complex rules systems. A suitable goal is to minimize the number of rules and alerts while keeping the original system's performance (e.g., recall). We explore three different methods (random search, greedy expansion, and genetic algorithms), using synthetic data and data sets from real-world online merchants. Our results show that ARMS can significantly reduce the number of rules while maintaining the system's performance. We stress that rules can depend on expensive aggregations (e.g., the average amount of the user's transactions in the last month).
Thus, ARMS brings meaningful gains in practical fraud detection settings.

We organize the remainder of the paper as follows. Section 2 gives an overview of fraud detection systems and discusses related work. Section 3 presents ARMS's main components: handling blacklists, rules system evaluation, priority shuffling, and rules system optimization. Section 4 presents our results on synthetic data and real-world clients. Finally, we discuss our conclusions in Section 5.

2 BACKGROUND

2.1 Fraud detection

We focus on fraud detection in online payments, where a fraudster makes unauthorized transactions online. Fraud detection can be formulated as a binary classification task: each transaction is represented as a feature vector, z, and labeled as either fraudulent (positive class, y = 1) or legitimate (negative class, y = 0). Other approaches frame it as an outlier detection problem [9] that treats fraudulent transactions as anomalies. Typical outlier detection is unsupervised, and often results in much lower performance.

We consider fraud detection as a two-step process. First, when a transaction occurs, a feature engineering step, g(z), is applied to the raw features z, resulting in processed features, x. An example of a processed feature (a profile) is the number of transactions for a card in the last hour. Secondly, the automated fraud detection system evaluates the transaction and decides between three actions: to accept the transaction, to decline it, or to alert it for manual review (so that specialized fraud analysts investigate it and produce a final decision). Reviews are complicated (i.e., subject to human error) and expensive, as they require specialized knowledge and introduce unnecessary friction for legitimate transactions.

2.2 Automated fraud detection system

We consider an automated fraud detection system consisting of a machine learning model followed by a rule-based system.

2.2.1 Machine learning model.
The supervised machine learning model trains offline using historical data. When evaluating a transaction, the model produces a score, ŷ ∈ [0, 1], that is typically the probability of fraud given the features, P(y = 1 | x).

2.2.2 Rule-based system. Rules consist of conditions and corresponding actions. Depending on the action, rules can be accept, alert, or decline rules. Rules may depend on the model score (e.g., if ŷ < 0.5 then accept the transaction) and on the features (e.g., if the transaction is above a risky amount, then alert/decline it). Since a transaction might trigger multiple rules with contradictory actions, priorities are necessary. Finally, rules can be switched on and off at any time. The rules system encapsulates all rules, their state (active or not), and priorities. Generally, the rules system is a function, f(x, ŷ), that evaluates a list of rules and returns an action.

2.3 System evaluation

To assess system performance, we compare the system's decisions with the labels coming from chargebacks or the fraud analysts' decisions. Then, we compute the relevant performance metrics.

2.4 Rule evaluation

The rules system, f(x, ŷ), receives the processed features and the model score and returns a decision to accept, alert, or decline. It comprises a set of rules, R = (R_1, R_2, ..., R_k), applied individually to incoming transactions. Hence, transactions may trigger none, one, some, or all of these rules. We aim to measure the contribution of individual rules to the system.

Typically, at the time of the deployment of the system (i.e., after training with the latest data), rules and priorities perform well. However, as time goes by, fraud patterns change, and performance degrades. This degradation is acute in fraud detection, given the adversarial context (fraudsters often change their strategies).
Whereas some rules remain beneficial, others may become redundant or even degrade the performance of the system. Figure 2 illustrates how an initially good rule can degrade over time. As rule-based systems remain in production for a long time, it is essential to monitor how individual rules are impacting the system, namely their fraud detection and computational performance (rules can be heavy to compute, e.g., if they depend on profiles).

One naive approach is to evaluate each rule independently by measuring how well its decisions match the labels (intuitively, accept rules should find legitimate transactions, while alert and decline rules should find fraudulent transactions). Then, if the rule's performance is inadequate, it is discarded.

Figure 2: Rule degradation: (a) transactions in the feature space, (b–d) an alert, an accept, and a decline rule are progressively added, and trigger for some transactions. The alert rule has a good ratio of correct alerts when added, but by the end it is alerting only legitimate transactions (e).

Notwithstanding, this approach is problematic and insufficient because it disregards interactions between rules. Consider the following examples:

• Low-priority rules can perform outstandingly when no high-priority rules are triggered (e.g., in specific corner cases), but perform very poorly if used individually.
• Turning off high-priority rules allows lower-priority rules to act; this can lead to different decisions by the system.

Instead of this naive approach, we build a rules management system that takes into account the interactions between rules with different actions and different priorities when evaluating them.
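To make the interaction concrete, here is a toy sketch in Python (our own illustration; the rule set, thresholds, and the accept-when-nothing-triggers default are assumptions, not ARMS's actual implementation) showing how deactivating a high-priority rule changes the system's decision:

```python
# Toy rules system: each rule has a condition, an action, and a priority.
# The highest-priority triggered rule decides; we assume "accept" when
# nothing triggers. All names and thresholds here are illustrative.

def decide(x, y_hat, rules):
    triggered = [(r["priority"], r["action"]) for r in rules
                 if r["active"] and r["condition"](x, y_hat)]
    return max(triggered)[1] if triggered else "accept"

rules = [
    {"priority": 3, "action": "decline", "active": True,
     "condition": lambda x, y: x["amount"] > 5000},      # high priority
    {"priority": 2, "action": "alert", "active": True,
     "condition": lambda x, y: y > 0.9},                 # low priority
]

tx = {"amount": 9000}
print(decide(tx, 0.95, rules))   # decline: the high-priority rule wins

rules[0]["active"] = False       # turn the high-priority rule off...
print(decide(tx, 0.95, rules))   # ...and the low-priority alert rule now acts
```

Evaluating either rule in isolation would miss this effect: the alert rule's observed behavior depends on which higher-priority rules are active.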
2.5 State-of-the-art

We review current work on the optimization of rule-based systems using search heuristics. Table 1 shows an overview of the methods.

Ishibuchi et al. propose a method to maximize correctly classified instances, while reducing the number of rules, using genetic programming [7]. This approach is not sufficiently flexible for the fraud detection use case, as the client (e.g., a merchant or bank) may want to optimize for other metrics (e.g., recall, or a combination of metrics). Moreover, their method neglects priorities, and it is not clear whether it scales up well to fraud detection data sets with millions of transactions (they used the Iris data set [5], which consists of 150 records). In a later study, they compare multiple heuristics, namely greedy search and genetic programming, on four small data sets [8].

Some approaches target specific use cases, namely financial trading [1] or opinion mining [10]. Besides the domain, another crucial difference between our research and the work by Allen et al. is that, instead of learning new rules, we optimize existing rule systems. Rosset et al. describe a method that learns and selects rules for telecommunication fraud detection [11]. Like us, the authors stress the importance of choosing a good set of rules, instead of a set of good rules. However, we target online transaction fraud and optimize a more complex system with priorities and blacklists.

Table 1: Comparison of rules management systems (* optimizes rule weights instead).

                        [8]  [1]  [10]  [11]  [3]  [6]  ARMS
various rule actions     ✗    ✗    ✗     ✗    ✗    ✗    ✓
rule priorities          ✗    ✗    ✗     ✗    *    ✗    ✓
> million instances      ✗    ✗    ✗     ✗    ✗    ✗    ✓
user-defined loss        ✗    ✗    ✗     ✗    ✗    ✗    ✓
blacklists               ✗    ✗    ✗     ✗    ✗    ✗    ✓
rule learning            ✗    ✓    ✗     ✓    ✗    ✗    ✗

Table 2: Notation.

Features:                   X = (X_1, X_2, ..., X_m)
Rules:                      R = (R_1, R_2, ..., R_k)
Priority space:             P = {p ∈ ℤ | p ≥ −1}
Rule priority:              p_i ∈ P
Rule active condition:      p_i > −1
Rules priority vector:      p = (p_1, p_2, ..., p_k)
Priority-action map:        a : p_i → {accept, alert, decline}
Transaction feature vector: x = (x_1, x_2, ..., x_m)
Transactions:               X = (x_1, x_2, ..., x_n)
Transaction rules vector:   r = ({p_1, −1}, {p_2, −1}, ..., {p_k, −1})
Rules triggers matrix:      R = [r_{x_1}, r_{x_2}, ..., r_{x_n}]^T
Labels vector:              ℓ = [ℓ_{x_1}, ℓ_{x_2}, ..., ℓ_{x_n}]^T
Blacklist updater rules:    B_u ⊂ R
Blacklist checker rules:    B_c ⊂ R
Loss function:              λ
Performance:                Ω (contains Ω_loss, Ω_recall, etc.)

Duman et al. propose a system combining genetic programming and scatter search to optimize rule weights and other parameters [3]. Similar to our case, the rules are based on expert knowledge and suffer from concept drift. Each rule has a weight corresponding to its contribution to a fraud score, unlike our work, which considers priorities to activate a single rule. Furthermore, Duman et al. do not consider blacklists or different rule actions (e.g., accept, alert, and decline), and use a predefined fitness function that minimizes the money loss. Additionally, the largest data set considered contains only ≈ 250 thousand transactions and 43 rules and parameters. They report money savings of 212% at the cost of a 67% increase in false positives, and, after manual tuning, they settled for a system with savings of 189% and a 35% increase in false positives.

Gianini et al. optimize a system of 51 rules using a game-theoretic approach [6]. They measure rule importance using Shapley values [13] as a measure of contribution to the system. They propose two strategies: (1) select the n rules with the highest Shapley values (and deactivate the others) and (2) greedily expand the set of rules using the Shapley values of the rules. Both strategies performed identically and were able to reduce the number of rules down to 30 while maintaining the original system's F-score. Like Duman et al.
[3], this approach disregards essential constraints of the fraud detection system we are considering: rule priorities, rule actions, blacklists, and support for a user-defined loss function.

3 ARMS

We start this section with an overview of ARMS. Then, we describe in detail each of its main components: handling blacklisting rules, the evaluation of the rule-based system, rule priority shuffling, and, finally, the optimization strategies to select rules.

3.1 System overview

Algorithm 1 gives a general view of ARMS. We refer the reader to Table 2 for the notation used throughout this work. ARMS receives the following information as inputs:

• Features. A vector of features, X (e.g., username, email).
• Transactions. A matrix X_{n×m} containing the values of the m features for each of the n transactions. It is needed to compute blacklists (i.e., to know the blacklisted values for each feature, e.g., username = fraudster91).
• Triggers or activations. A matrix R_{n×k} containing the rule triggers of the k rules for each of the n transactions. Each cell R_ij = −1 if rule R_j did not trigger for transaction x_i, or R_ij = p_j (i.e., the rule's priority) if it did.
• Labels. A vector with the label for each transaction, ℓ.
• Priorities. A vector with the priority of each rule, p.
• Actions. A map, a, mapping rule priorities to actions (i.e., accept, alert, or decline).
• Blacklisting rules. A set of blacklisting rules containing rules that update the blacklist, B_u, and rules that check it, B_c.
• Method. Optimization strategy, µ (i.e., random search, greedy expansion, or genetic programming).
• Loss function. A loss function, λ, defined by the user.
• Priority shuffle. A boolean, arp, specifying whether to augment the rules pool R by cloning rules with different priorities.
• Optimization parameters.
Set of parameters, θ, which are specific to the optimization strategy (e.g., population size or mutation probability for the genetic algorithm, or the number of evaluations for the random search).

ARMS starts by addressing the blacklist dependencies (line 1 of Algorithm 1; details in Section 3.2). Then, ARMS evaluates the original system's performance, Ω_1 (line 2 of Algorithm 1; Section 3.3). This evaluation runs before optimization because the loss function often depends on the original performance (e.g., optimize the FPR while maintaining recall). Afterwards, ARMS augments the rules pool, if the user so desires (lines 3–4 of Algorithm 1; Section 3.4). This adds new rules with the same triggers as existing rules, but with different priorities. The rationale is that changing priorities might improve the system. Finally, ARMS optimizes the rules system (line 5 of Algorithm 1; Section 3.5). In essence, ARMS turns off rules and changes their priorities, obtaining a new priority vector, p_best, to reduce the loss of the system, Ω_best.

Algorithm 1 ARMS: Automated Rules Management System.
Input: vector X, matrix X, matrix R, vector ℓ, vector p, map a, set B, loss function λ, method µ, parameter arp ∈ {0, 1}, parameters θ
Output: vector p_best, performance Ω_best
1: BD ← computeBlacklistDependencies(R, X, X, B)
2: Ω_1 ← evaluate(X, R, ℓ, p, a, B, BD, λ)
3: if arp = 1 then
4:   R ← augmentRulesPool(R, p, a)
5: (p_best, Ω_best) ← µ.optimize(X, R, ℓ, p, a, BD, λ, Ω_1, θ)

3.2 Handling blacklists

Both analysts and rules can blacklist entities. If an analyst finds transaction x to be fraudulent, they can blacklist some of its entities (e.g., in the future, always decline transactions from the email used in transaction x). Similarly, blacklist updater rules add entities to the blacklist when they trigger.
Other rules, called blacklist checker rules, trigger when a transaction contains a blacklisted entity. Therefore, blacklist rules have side effects. Deactivating blacklist updater rules can lead to blacklist checker rules not triggering, and affect the system's performance. Thus, we need to take this into account when evaluating the system. For this purpose, ARMS keeps a state of the blacklists and manages them according to the interaction between blacklist updater and blacklist checker rules (for a detailed description, we refer to Supplementary Algorithm S1).

3.3 Rules system evaluation

ARMS evaluates (Algorithm 2) the original system and the configurations produced by the optimization strategies (Section 3.5). It creates an empty confusion matrix, V (line 2), to be updated by traversing each transaction, x ∈ X, alongside its rule triggers, r_i ∈ R, and its label, ℓ_i ∈ ℓ (lines 3–9). For each transaction:

(1) ARMS computes the activations r′_i (i.e., which rules are active and with what priority), using the priority vector p (line 4). When ARMS is evaluating the original system, p contains the original rules' priorities; however, rule priority shuffling and optimization strategies generate variations of p.
(2) ARMS checks whether to turn off any blacklist checker rules as a side effect and stores that in r′′_i (line 5).
(3) ARMS obtains the final decision, o_i, from r′′_i, i.e., to accept, alert, or decline (line 6). It is the action of the highest-priority active rule triggered for the transaction.
(4) ARMS evaluates the decision, o_i, against the label ℓ_i, storing it in v_i (line 7). Accepting a legitimate transaction is a true negative. Declining/alerting a legitimate transaction is a false positive. Declining/alerting a fraudulent transaction is a true positive. Accepting a fraudulent transaction is a false negative. The confusion matrix, V, is updated with v_i (line 8).
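The per-transaction steps above (minus the blacklist side effects of step 2, omitted for brevity) can be sketched as follows, using the trigger encoding of Table 2 (−1 when a rule did not fire, otherwise the rule's priority); function and variable names are our own illustration, and the accept-when-nothing-fires default is an assumption:

```python
# Sketch of the evaluation loop: pick the highest-priority fired rule,
# map its priority to an action, and tally the confusion matrix.

def evaluate(triggers, labels, action_of):
    """Build a confusion matrix from per-transaction rule triggers and labels."""
    V = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for r_i, label in zip(triggers, labels):
        top = max(r_i)                           # highest priority that fired
        decision = action_of.get(top, "accept")  # assumed default: accept
        flagged = decision in ("alert", "decline")
        if flagged:
            V["tp" if label == 1 else "fp"] += 1
        else:
            V["fn" if label == 1 else "tn"] += 1
    return V

action_of = {1: "accept", 2: "alert", 3: "decline"}  # priority -> action map
triggers = [[-1, 2, -1],    # alert rule fired, fraudulent  -> true positive
            [3, 1, -1],     # decline outranks accept, legit -> false positive
            [-1, -1, -1]]   # nothing fired, legit           -> true negative
labels = [1, 0, 0]
print(evaluate(triggers, labels, action_of))
```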
Finally, ARMS uses the confusion matrix V to compute the rule configuration's performance, Ω, based on a user-defined loss function, λ (line 9). The loss function allows optimizing metrics (e.g., minimize the number of active rules, maximize recall) while satisfying other metrics or constraints (e.g., keep the original system's FPR). We discuss the loss functions used for synthetic data and real-world clients in Section 4.

Algorithm 2 Rules system evaluation.
1: function evaluate(X, R, ℓ, p, a, B, BD, λ)
2:   V ← initConfusionMatrix()
3:   for all x ∈ X, r_i ∈ R, ℓ_i ∈ ℓ do
4:     r′_i ← mask(r_i, p)
5:     r′′_i ← handleBD(r′_i, B, BD[x])
6:     o_i ← a(max(r′′_i))
7:     v_i ← getTruthValue(o_i, ℓ_i)
8:     V ← updateConfusionMatrix(V, v_i)
9:   Ω ← λ(V)
10:  return Ω

3.4 Priority shuffle

Initial rule priorities require expert knowledge and are defined by clients or fraud analysts. Over time, however, the system requires adjusted priorities to deal with concept drift and to incorporate emerging knowledge (e.g., new rules). In this section we discuss how ARMS uses priority shuffling for optimization (Section 3.5). First, we discuss how ARMS changes the priority of individual rules. Then, we discuss how ARMS can augment the initial rules pool by cloning existing rules and assigning them alternative priorities.

3.4.1 Random priority shuffle. Since the system might have many rules and many possible priorities, the search space of all possible rule priorities can be gigantic. A more efficient alternative for such cases is to use random priority shuffling. For a given rule r_i with priority p_i, ARMS changes its priority to some p_j ≠ p_i with the same action, i.e., a_i = a_j. The new rule priority is sampled with uniform probability.
Consider an illustrative example with three types of accept rules: weak accept with priority 1, strong accept with priority 3, and whitelist accept with priority 5. Random priority shuffling can, for example, change the priority of a strong accept to either 1 (weak accept) or 5 (whitelist accept).

3.4.2 Augment rules pool. Another option is to augment the initial pool by cloning existing rules (i.e., keeping the same triggers) but assigning them different priorities. Starting from the existing priorities, p, we create variants for each p_i ∈ p, with the set of all possible alternative priorities with the same action, E. Then, for each p_j ∈ E, ARMS adds a new vector (a new "rule") with the same triggers as the original rule and the new priority, p_j, to the rules triggers matrix, R.

3.5 Optimization strategies

ARMS uses two fundamental mechanisms to optimize a rule-based system: deactivating underperforming rules and changing priorities. It is unfeasible to test all possible combinations. Instead, we employ three heuristics (methods): random search (Section 3.6), greedy expansion (Section 3.7), and genetic programming (Section 3.8).

First, we give an overview of ARMS optimization (Algorithm 3), as the methods share a similar structure. The original system (i.e., with rule priorities p and performance Ω_1) is the one to beat (line 1). Until meeting a predefined stopping criterion (line 3), ARMS generates new priority vectors, p′, which are variations of the original p (line 4). The criterion can be to stop after k hours, after computing n variations, or when the loss between consecutive iterations does not improve above a threshold ϵ. ARMS saves the variation with the lowest loss it finds, p_best, alongside its performance Ω_best, and returns them to the user (lines 5–8). The fundamental difference between the methods is how they generate the variations, p′.

Algorithm 3 ARMS optimization.
θ: parameters of the method
1: function µ.optimize(X, R, ℓ, p, a, BD, λ, Ω_1, θ)
2:   (p_best, Ω_best) ← (p, Ω_1)
3:   while stoppingCriteriaNotMet() do
4:     generate a new p′ from p
5:     if p′ is the best so far then
6:       save it as p_best
7:       save its performance as Ω_best
8:   return (p_best, Ω_best)

3.6 Random search

A straightforward approach is to generate random rule priority vectors, p′, and evaluate them against the original p, saving the best rule configuration p′ found. While this approach seems naive, it is a natural baseline that can be better and less expensive than grid or manual searches [2]. Random search has two parameters:

• Rule shutoff probability, ρ. Percentage of rules to deactivate; e.g., if ρ = 50%, then ARMS turns off ≈ 50% of the rules.
• Rule priority shuffle probability, γ. Percentage of rules with changed priorities; e.g., if γ = 50%, then ARMS generates new priorities for ≈ 50% of the rules.

For more detail, we refer to Supplementary Algorithm S2.

3.7 Greedy expansion

ARMS contains a greedy expansion module that starts from a set of inactive rules and greedily turns on rules, one at a time. Greedy solutions are not guaranteed to find the global optimum. Consider the following example, where we want to optimize recall and rules R_1, R_2, and R_3 have recall 70%, 69%, and 20%, respectively. A greedy solution would pick R_1 first. Now, imagine that rules R_2 and R_3 are detrimental to R_1, i.e., the system becomes worse if we combine R_1 with either R_2 or R_3. Hence, the final solution is a system with only R_1. Imagine, however, that R_2 and R_3 are somewhat complementary and that, when combined, the system's recall is > 70%. Then, the global optimum is > 70%, and the greedy solution is not optimal. Nevertheless, greedy heuristics can find useful solutions in a reasonable time. For more detail, we refer to Supplementary Algorithm S3.
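The R_1/R_2/R_3 example above can be reproduced with a small sketch (our own illustration, not ARMS's actual module; the scoring table is a toy stand-in for evaluating a whole rule set together):

```python
# Greedy expansion sketch: start with no active rules and repeatedly activate
# whichever inactive rule most improves the score of the *whole* system.
# `score_fn` scores a set of rules jointly, which is why interactions matter.

def greedy_expansion(rule_ids, score_fn):
    active = set()
    best = score_fn(active)
    improved = True
    while improved:
        improved = False
        for rid in sorted(rule_ids - active):
            trial = score_fn(active | {rid})
            if trial > best:
                best, pick, improved = trial, rid, True
        if improved:
            active.add(pick)
    return active, best

# Toy interaction mirroring the text: R1 alone has the best recall, but any
# combination containing R1 degrades to 0.50, while R2+R3 are complementary.
recalls = {frozenset(): 0.0, frozenset({1}): 0.70, frozenset({2}): 0.69,
           frozenset({3}): 0.20, frozenset({2, 3}): 0.75}
score = lambda s: recalls.get(frozenset(s), 0.50)
print(greedy_expansion({1, 2, 3}, score))  # greedy settles for {1} at 0.70,
                                           # missing the optimum {2, 3} at 0.75
```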
3.8 Genetic programming

Genetic programming is standard in classification tasks [4], such as fraud detection. It continuously improves a population of solutions by combining them using crossovers and random mutations, while keeping a fraction of the best solutions for the next iteration. In our case, we build a population of random rule configurations and improve them with genetic programming. The algorithm has three parameters:

• Population size, ψ. Number of configurations per iteration; e.g., if ψ = 100, ARMS evaluates 100 different rule configurations per iteration.
• Survivors fraction, α. Fraction of the top configurations that survive to the next iteration; e.g., if ψ = 100 and α = 20%, only the 20 best solutions survive to the next iteration. If α is high, we might achieve higher variability but get stuck trying to improve bad solutions. If α is low, the lack of variability might prevent the system from reaching a good solution.
• Mutation probability, ρ. The percentage of rules subject to random mutation; e.g., if ρ = 20%, then 20% of the rules are randomly mutated (i.e., the child rule configuration mutates the parents' rule configurations). If ρ is high, we leave little room for genetic optimization and are essentially doing a random search. If ρ is low, we are more dependent on finding good parent configurations.

For more detail, we refer to Supplementary Algorithm S4.

4 EXPERIMENTS AND RESULTS

We test the following hypotheses: (h1) ARMS turns off rules and, at least, maintains system performance; (h2) ARMS changes the priority of rules and improves system performance; (h3) results are stable (i.e., similar across folds).

4.1 Synthetic data

Since we cannot find public data sets similar to our own, we use synthetic data to test hypotheses (h1–h2). Later, we also test (h1–h3) on real datasets.
We generate 225k labels with a fraud rate of 5% (i.e., 11,250 positive labels) and simulate accept, alert, and decline rules from the labels. The support of a rule corresponds to how many times it triggers. An accept rule with a negative predictive value (NPV) of k% is correct k% of the times it triggers (i.e., out of all triggers, k% will be true negatives). The same goes for the precision (PPV) of an alert or decline rule (i.e., out of all triggers, k% will be true positives). We sample the support, NPV, and precision from Gaussian distributions, use 10 different priority levels (for details see Supplementary Section A.1), and divide the data set into three splits: train, validation, and test, with 75k "transactions" each.

4.2 Methodology

We run ARMS on the train set and do parameter tuning on the validation set. We detail the parameter space in Supplementary Section A.2. We ensure that results are comparable between random search and genetic programming by keeping the number of rule configuration evaluations fixed (i.e., n = 300k). We optimize the loss function from Equation 1 with α = 0.1, β = 0.5, and γ = 0.4. Note that Ω_1 and Ω′ are the performance of the original system and of a configuration found by ARMS, respectively; Ω_rules% is the percentage of active rules, Ω_recall is the recall, and Ω_alerts% is the alert rate.

λ(R′) = α · Ω′_rules% − β · Ω′_recall + γ · Ω′_alerts%     (1)

Finally, we evaluate the four final methods on the test set: the original system, and the best rule system configuration found by random search, greedy expansion, and genetic programming.

4.3 Results on synthetic data

After running parameter tuning (note that the greedy expansion method does not have any parameters), we find that the following parameters were the best:

• Random search: ρ = 40%.
• Genetic programming: ρ = 10%, ψ = 30, α = 5%.
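Equation 1 is cheap to compute once the performance terms are known; a quick sketch (assuming the three terms are fractions in [0, 1], with variable names of our own choosing):

```python
# Loss from Equation 1: penalize active rules and alerts, reward recall.
# Weights follow the synthetic-data setting (alpha=0.1, beta=0.5, gamma=0.4).

def loss(rules_pct, recall, alerts_pct, alpha=0.1, beta=0.5, gamma=0.4):
    return alpha * rules_pct - beta * recall + gamma * alerts_pct

# Original synthetic-data system: all rules active, 13.11% recall,
# 0.779% alert rate.
print(round(loss(1.0, 0.1311, 0.00779), 4))  # 0.0376, the original loss in Table 3
```

Lower is better, so deactivating rules and cutting alerts reduce the loss, while any drop in recall increases it, weighted five times as heavily as the rule count.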
For brevity, we omit results for other parameters; we analyze the parameters more thoroughly on real data sets (Section 4.6). We observe that all methods improved upon the original system, and that genetic programming had the highest performance (Table 3). When we ran augmented rules pool (ARP) before optimization, results consistently improved. Thus, we verify hypotheses (h1) and (h2).

Table 3: Performance of ARMS on synthetic data.

method          | recall  | alerts % | rules off   | loss
original        | 13.11%  | 0.779%   | none        |  0.0376
random          | 79.53%  | 1.013%   | 38 (38.8%)  | -0.1837
greedy          | 54.42%  | 1.746%   | 34 (34.7%)  | -0.1998
genetic         | 52.82%  | 1.067%   | 45 (45.9%)  | -0.2058
greedy w/ ARP   | 53.30%  | 1.107%   | 43 (43.9%)  | -0.2060
genetic w/ ARP  | 53.09%  | 0.97%    | 45 (45.9%)  | -0.2075

[Figure 3: Datasets' fraud rate evolution (concept drift): Δ fraud rate (pp) for D1 and D2, May through October.]

4.4 Real-world data sets

We evaluate ARMS on representative samples of real-world data sets of two online merchants. In both cases, an automated fraud detection system actively scores transactions in production. We collected the rule triggers, model decisions, and blacklists. The data sets comprise dozens of rules, with different actions (i.e., accept, alert, and decline) and multiple priorities (more details in Supplementary Section A.3). For privacy compliance, we refer to the data sets simply as D1 and D2.

The data covers six months of transactions. We divide each data set into four sequential and overlapping folds of three months each (for temporal cross-validation, detailed in Section 4.5.3) and split each fold into three sequential sets (train, validation, and test) of one month each. Unless explicitly stated, when we mention fraud, we are referring to validated fraud (i.e., chargebacks or fraud confirmed by analysts, not transactions declined by the automated fraud detection system). Due to the adversarial setting and other factors, we observe concept drift in both data sets.
Figure 3 shows the evolution of the fraud rate in D1 and D2 (with May 2018 as reference), highlighting the system's ability to reduce fraud over time.

While both clients are online merchants, they have three important differences:
(1) D1 has more non-verified declined transactions. It has ≈14x more auto-declined transactions than confirmed frauds, due to the specific requirements of the client. Using automatically declined transactions for training is dangerous, as it creates a feedback loop. Thus, we disregard them in training and validation but use them in testing, so that results are comparable to a production setting. Moreover, for this data set, ARMS does not optimize decline rules.
(2) Only D2 uses blacklists.
(3) The active rules in D2 changed multiple times during the period under study, while the rules in D1 never changed.

4.5 Methodology

4.5.1 Optimization metrics (loss functions). Online merchants are required to keep the fraud-to-gross rate (FTG) under a certain threshold, or else they face fines. Thus, a sensible approach is to minimize the FPR while ensuring that recall is within the legal requirements. The system should pick up all the necessary fraud (ideally, all of it) without declining legitimate transactions. Additionally, reducing the number of rules and alerts decreases the overall cost of the system. We use a different loss function for each data set, showing ARMS' ability to fit diverse use-cases:

• In D1, the FPR is artificially high due to the many transactions declined by the automated fraud detection system. Therefore, our focus is to remove rules, Ω′_rules%, and reduce alerts, Ω′_alert%, while maintaining approximately the same recall, Ω′_recall, as the original rule-based system, Ω¹_recall (Equation 2). We use α = β = 1/2, thus giving equal importance to both objectives.
• In D2, the objective is to remove rules, Ω′_rules%, but also to improve recall, Ω′_recall, while maintaining approximately the same FPR, Ω′_fpr, as the original system, Ω¹_fpr (Equation 3). We use α = 0.05 and β = 0.95, thus attributing more importance to improving recall than to reducing the number of rules.

λ(R′) = { α · Ω′_rules% + β · Ω′_alert%,        if Ω′_recall ≥ 0.95 · Ω¹_recall
        { α + β + (Ω¹_recall − Ω′_recall),      otherwise                        (2)

λ(R′) = { α · Ω′_rules% − β · Ω′_recall,        if Ω′_fpr ≤ Ω¹_fpr
        { α + (Ω¹_fpr − Ω′_fpr),                otherwise                        (3)

4.5.2 Baselines. We compare ARMS-optimized rule systems against three baselines:
(1) Original system (All on): the system with all rules and the original priorities.
(2) Mandatory system (All off): the system with no rules except the ones that cannot be deactivated for business reasons, with the original priorities.
(3) Random search: generate r independent rule configurations, using different values of ρ (Section 3.6).

If ARMS finds rule systems better than baselines 1 and 2 by turning off rules, we successfully address (h1). If it further improves performance by also tuning rule priorities, we address (h2).

4.5.3 Temporal cross-validation (TCV). We use TCV to verify (h3). For each data set, we create four folds composed of three sets (i.e., train, validation, and test) of one month each. We train ARMS with different search heuristics and parameters on each train set, evaluate the resulting configurations on the validation set, and identify the best configuration for each heuristic. Finally, we evaluate the winners and the baselines on the test set.

4.5.4 Optimization strategies. We run ARMS with two different optimization strategies: greedy expansion (Section 3.7) and genetic programming (Section 3.8). Results for both are shown in Sections 4.6.2 and 4.6.3, respectively.
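The constrained losses of Equations 2 and 3 can be sketched in code as follows. This is a minimal sketch: the function and argument names are ours, percentages are assumed to be fractions, and the otherwise-branches reproduce the signs exactly as in the equations.

```python
def loss_d1(rules_pct, alerts_pct, recall, recall_orig, alpha=0.5, beta=0.5):
    """Equation 2: minimize rules and alerts, subject to keeping 95% of recall."""
    if recall >= 0.95 * recall_orig:
        return alpha * rules_pct + beta * alerts_pct
    # Constraint violated: a value above any feasible loss, growing with the shortfall.
    return alpha + beta + (recall_orig - recall)

def loss_d2(rules_pct, recall, fpr, fpr_orig, alpha=0.05, beta=0.95):
    """Equation 3: remove rules and improve recall, subject to not worsening FPR."""
    if fpr <= fpr_orig:
        return alpha * rules_pct - beta * recall
    # Penalty branch, sign as in the paper.
    return alpha + (fpr_orig - fpr)

# Feasible configuration: half the rules, 1% alerts, recall preserved.
print(loss_d1(rules_pct=0.5, alerts_pct=0.01, recall=0.9, recall_orig=0.9))  # 0.255
```

The penalty branch makes any constraint-violating configuration rank behind feasible ones, so the search heuristics only compete inside the feasible region.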
4.6 Results on real data

Unless stated otherwise, the results refer to rule configurations obtained on the train data of each fold and evaluated on the respective validation set. Results are always relative to the original system baseline and show the gains over the current system in production, i.e., Δloss is the difference between the loss of the system being evaluated and that of the original one.

[Figure 4: Baselines comparison in D1 (Δloss per validation split, for "all on", "all off", and random search with ρ from 4% to 88%).]

[Figure 5: Greedy expansion against baselines in D1 (Δloss per validation split, for "all on", "all off", random ρ = 46%, and greedy).]

4.6.1 Baselines comparison. We compare the original system (all on) against the mandatory system (all off) and against random search, with n = 10000 and ρ as a tunable parameter with values spaced in 4% intervals (Figure 4 for D1). We observe that the mandatory system has a higher loss than the other systems, as it fails to meet the recall constraint from Equation 2. We also observe that random search is almost always superior to the original system, regardless of ρ. In a few cases, random search is worse than the original system because it does not meet the recall constraint, namely with aggressive configurations (e.g., ρ = 88%). On the other hand, aggressive random search (higher ρ) can decrease the loss significantly, so there is a trade-off between meeting the recall constraints and lowering the loss. We observe similar behavior for D2 and thus omit its results for brevity. Nevertheless, we show metrics besides the loss for D2 in Supplementary Figure S1. From these results, we decide to use random search with ρ = 46% for D1 and ρ = 58% for D2 as baselines, alongside the original system and the mandatory system for both data sets.

4.6.2 Greedy expansion results.
We test greedy expansion with and without ARP. We find that ARP did not improve the system in D1 or D2. One possible explanation is that greedy expansion yields simple systems with few rules, so it did not benefit from ARP. Another possibility is that the original priorities are already well-tuned for both data sets, as they correspond to mature systems.

When compared against the baselines, the outcomes vary. For D1, greedy expansion was superior to the baselines except on the second fold, where it failed to meet the constraints (Figure 5). In the other three folds, greedy expansion removed ≈75% of the rules and reduced alerts. For D2, however, greedy expansion was worse than the baselines in two of the four folds, as it did not respect the constraints (Supplementary Figure S2).

[Figure 6: Greedy expansion rule order consistency in D1 (NDCG of the top-n rules, fold A against folds B, C, and D).]

We also evaluate the consistency of rules across folds. Recall that greedy expansion obtains a list of rules ordered sequentially by importance. We compare the ordered lists across folds and compute their normalized discounted cumulative gain (NDCG) in Figure 6. We show results of the first fold of D1 compared with the other folds. We observe that rules are consistent across folds (NDCG values are consistently > 0.7), but the NDCG line drops over time (e.g., important rules in fold A are more similar to important rules in fold B than in fold C). We observe similar behavior in D2 (omitted for brevity).

4.6.3 Genetic programming results. We evaluate how the genetic programming method (Section 3.8) improves fraud detection. Since our data sets are very big, we cannot perform a grid search over all parameters. Thus, we use a three-phase process. First, we find a good set of default parameters.
For this purpose, we set ψ = 100 and do a grid search on α and ρ. We do n = 10000 evaluations by default, i.e., for ψ = 100, r = 100 runs. We perform a grid search on α ∈ {2%, 5%, 10%, 20%} and ρ ∈ {0%, 2%, 5%, 10%}. For D1, we find that ρ = 10% outperforms the baselines across data splits and that random search takes longer to achieve similar losses (e.g., for fold A; Supplementary Figure S3). The overall best parameters were found to be α = 5% and ρ = 5%.

Second, we study how each parameter influences the loss. For this purpose, we vary one parameter at a time and keep the others at their default values. Since the parameters r, ψ, ρ, and α are ordinal, we try 10 different values for each and observe how increasing each parameter individually influences the loss. Figure 7 shows results for fold A of D1. We observe that, in general, increasing ρ and α worsens performance; however, the best α is 10%, so keeping some of the best individual configurations is important. We also observe that r influences the loss much more than ψ (e.g., both (r = 100, ψ = 10) and (r = 10, ψ = 100) perform 1000 rule evaluations, but the first leads to lower losses). Typically, the loss improves as r and ψ increase, but it plateaus relatively quickly for both (i.e., r at 300, ψ at 400). Similar conclusions hold for D2 (omitted for space concerns). We did not observe gains from changing rule priorities during genetic optimization.

[Figure 7: Genetic programming: influence of the mutation probability ρ, survivors rate α, number of runs r, and population size ψ on Δloss, for fold A of D1.]

Finally, we measure ARMS' performance on the test sets. We compare ARMS using genetic programming against the baselines and against ARMS using greedy search. To do this, we evaluate the rule deactivations suggested by ARMS (trained on the train sets and selected on the validation sets) on the respective test sets of each fold. For D1, we evaluate the best rule configuration found by ARMS using r = 1000, ψ = 250, α = 5%, ρ = 5%, and no priority shuffling. For D2, we evaluate the best rule configuration found by ARMS using r = 1000, ψ = 150, α = 20%, ρ = 5%, and no priority shuffling.

For D1, we observe that greedy and genetic optimization performed similarly and better than random search with ρ = 46% (Figure 8). For D2, we observe that random search and genetic programming perform similarly; the greedy method fails to comply with the constraints in two of the four folds (Figure 9).

[Figure 8: Performance of ARMS on the test sets of D1 (Δloss per test split for "all on", "all off", random ρ = 46%, greedy, and genetic).]

[Figure 9: Performance of ARMS on the test sets of D2 (Δloss per test split for "all on", "all off", random ρ = 58%, greedy, and genetic).]

To check the consistency of ARMS across data folds, we measure the Jaccard similarity [12] of the deactivated rules suggested by ARMS on different splits. We see that the Jaccard is higher for D1 than for D2 (Table 4 (a)-(b)). The fact that the D2 rule set changes across folds naturally leads to intrinsically lower values (i.e., regardless of what ARMS deactivates). We also evaluate systems trained on a given fold on more recent folds (e.g., we train ARMS on fold A and evaluate it on the test sets of A, B, C, and D). We observe that systems trained on older folds have good performance on more recent test sets (Table 4 (c)-(d)).

Table 4: ARMS consistency results (i.e., across folds). We highlight in bold the lowest loss for each fold.

(a) Jaccard of removed rules (D1).
   A      B      C      D
A  1      0.930  0.902  0.826
B  –      1      0.950  0.820
C  –      –      1      0.829
D  –      –      –      1

(b) Jaccard of removed rules (D2).
   A      B      C      D
A  1      0.789  0.696  0.636
B  –      1      0.773  0.565
C  –      –      1      0.708
D  –      –      –      1

(c) Loss on future folds (D1).
   A      B      C      D
A  0.275  0.344  0.274  0.273
B  –      0.348  0.277  0.275
C  –      –      0.268  0.267
D  –      –      –      0.264

(d) Loss on future folds (D2).
   A       B       C       D
A  -0.626  -0.613  -0.651  -0.662
B  –       -0.612  -0.651  -0.662
C  –       –       -0.678  -0.696
D  –       –       –       -0.704

4.6.4 Summary. We evaluated ARMS on two big online merchants. For D1, ARMS using genetic programming (or greedy expansion) was able to remove ≈50% of the original 193 rules while maintaining the original system's performance (i.e., keeping 95% of the original recall). Thus, ARMS was able to improve the original system (h1). We also saw that results are stable across data splits (h3). We did not see gains from using priority shuffling (h2). For D2, we observed that ARMS was able to remove ≈80% of the system's rules while maintaining the original system's performance (i.e., keeping a low FPR). Thus, ARMS improved the original system (h1). Similarly to D1, we found evidence supporting (h3) but not (h2).

4.6.5 Discussion. Real-world transaction data sets for fraud detection pose several challenges. Auto-declines lead to unreliable labels: we cannot verify whether a system positive is a true positive, meaning that decline rules cannot be evaluated unless an analyst verifies auto-declines. In practice, this is difficult because fraud analysts' time is a very limited resource. The two systems that we chose are also particularly hard to optimize, since they have been in production for years and have been manually tuned by data scientists. Finally, we evaluated ARMS' performance on past transactions and did not measure its performance in production.
We think that putting ARMS in production and continuously optimizing the rule system could lead to better results.

5 CONCLUSION

We have proposed ARMS, a framework that optimizes rule systems using search heuristics, namely random search, greedy expansion, and genetic programming. To the best of our knowledge, ARMS is the first to (1) handle different rule priorities and actions, (2) address blacklist side effects, and (3) optimize user-defined loss functions. These components are essential in real-world fraud detection systems. Our results on real-world clients demonstrate that ARMS is capable of maintaining the original system's performance while greatly reducing the number of rules (between 50% and 80% in our experiments) and minimizing other metrics (e.g., alert rate).

Currently, we are adding a rule suggestion module to ARMS, which is beyond the scope of this paper. In the future, we also plan to incorporate a module to simultaneously tune the rules and the machine learning model threshold.

ACKNOWLEDGMENTS

We want to thank the other members of Feedzai's research team, who always gave insightful suggestions. In particular, we want to give special thanks to Marco Sampaio, for reviewing the paper internally, and Patrícia Rodrigues, for starting ARMS.

Note on reproducibility

We make available a binary of ARMS, the synthetic data described in Section 4.1 (as well as the script used to generate it), and all the necessary steps to reproduce our results from Section 4.3 at https://github.com/feedzai/research-arms. For privacy compliance, we cannot share our clients' data sets.

REFERENCES
[1] Franklin Allen and Risto Karjalainen. 1999. Using genetic algorithms to find technical trading rules. Journal of Financial Economics 51, 2 (1999), 245–271.
[2] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research 13, Feb (2012), 281–305.
[3] Ekrem Duman and M Hamdi Ozcelik. 2011. Detecting credit card fraud by genetic algorithm and scatter search. Expert Systems with Applications 38, 10 (2011), 13057–13063.
[4] Pedro G Espejo, Sebastián Ventura, and Francisco Herrera. 2009. A survey on the application of genetic programming to classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 40, 2 (2009), 121–144.
[5] Ronald A Fisher. 1936. The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, 2 (1936), 179–188.
[6] Gabriele Gianini, Leopold Ghemmogne Fossi, Corrado Mio, Olivier Caelen, Lionel Brunie, and Ernesto Damiani. 2020. Managing a pool of rules for credit card fraud detection by a Game Theory based approach. Future Generation Computer Systems 102 (2020), 549–561.
[7] Hisao Ishibuchi, Ken Nozaki, Naohisa Yamamoto, and Hideo Tanaka. 1995. Selecting fuzzy if-then rules for classification problems using genetic algorithms. IEEE Transactions on Fuzzy Systems 3, 3 (1995), 260–270.
[8] Hisao Ishibuchi and Takashi Yamamoto. 2004. Comparison of heuristic criteria for fuzzy rule selection in classification problems. Fuzzy Optimization and Decision Making 3, 2 (2004), 119–139.
[9] Yufeng Kou, Chang-Tien Lu, Sirirat Sirwongwattana, and Yo-Ping Huang. 2004. Survey of fraud detection techniques. In IEEE International Conference on Networking, Sensing and Control, 2004, Vol. 2. IEEE, 749–754.
[10] Qian Liu, Zhiqiang Gao, Bing Liu, and Yuanlin Zhang. 2015. Automated rule selection for aspect extraction in opinion mining. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
[11] Saharon Rosset, Uzi Murad, Einat Neumann, Yizhak Idan, and Gadi Pinkas. 1999. Discovery of fraud rules for telecommunications: challenges and solutions. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 409–413.
[12] Cesare Baroni Urbani. 1980. A statistical table for the degree of coexistence between two species. Oecologia (1980), 287–289.
[13] Eyal Winter. 2002. The Shapley value. Handbook of Game Theory with Economic Applications 3 (2002), 2025–2054.

A SUPPLEMENTARY MATERIALS

A.1 Synthetic data

The set of rules comprises 8 accept rules, 30 review rules, and 60 decline rules. The support of the accept rules was sampled from a Gaussian distribution N(45000, 22500²), while the support of the review and decline rules was sampled from N(22.5, 225.0²). The NPV of accept rules was sampled from N(0.75, 0.20²), while the precision of the alert and decline rules was sampled from N(0.17, 0.05²). Rules have ten possible priorities. Accept rules have priority p_a ∈ {0, 1, 5, 6, 10}, alert rules have priority p_l ∈ {2, 4, 7, 9}, and decline rules have priority p_d ∈ {3, 8}.

A.2 Synthetic data parameter tuning

A.2.1 Random search. We use 16 mutation probabilities, i.e., ρ ∈ [4%, 94%] in intervals of 4% (ρ = 4%, ρ = 8%, ...).

A.2.2 Genetic programming. We use three mutation probabilities (ρ = 10%, ρ = 20%, and ρ = 30%), two population sizes (ψ = 20 and ψ = 30), and two survivors fractions (α = 2% and α = 5%).

A.3 Real-world data sets

A.3.1 D1. The client has 198 rules, each with one of three possible actions: accept, alert, or decline. Out of the 198 rules, 30 are accept rules, 89 are alert rules, and 79 are decline rules. Accept rules have four different priority levels, p_a ∈ {1, 8, 10, 15}, alert rules have two, p_l ∈ {5, 11}, and decline rules have three, p_d ∈ {6, 9, 12}. If no rules are triggered, the default action is to accept the transaction.
The data set contains little validated fraud, i.e., of the declined (by the model/rules) and fraudulent population of transactions, only a small portion was validated by analysts or via chargeback. We note that decline rules and auto-declined transactions are ignored in the train and validation data sets. We make this choice because decline rules cannot be validated. However, when we measure performance on the test set, decline rules are included, in order to make results directly comparable to the results obtained in production. We do temporal cross-validation (TCV) with four folds, where each set has one month of data.

A.3.2 D2. Unlike D1, which has the same activated rules for the whole period, in D2 the rules changed. During the seven-month period, a total of 13 rules were added, while some were removed, increasing the number of rules in the set from the original 77 to 90. Rules have one of three outcomes: accept, alert, and alert&decline (this means that most auto-declines are verified, unlike in D1). From those, 6 are accept rules, 48 are alert rules, and 36 are alert&decline rules. Accept rules have priority p_a ∈ {0, 5, 10}, alert rules have priority p_l = 1, and alert&decline rules have priority p_d ∈ {2, 4, 8}. Three of the decline rules are blacklist checker rules, and all 36 alert&decline rules are blacklist updater rules.

Since D2 has a high ratio of validated fraud, all rules are optimized by ARMS; however, the auto-declined transactions are not used during training but are present in the test set, in order to make results directly comparable to the results obtained in production. We do temporal cross-validation (TCV) with four folds, where each set has one month of data.

A.4 Supplementary Algorithms and Figures

Algorithm S1 Blacklist propagation.
1: function computeBlacklistDependencies(R, X, 𝒳, B)
2:   BL ← {}
3:   BD ← {}
4:   for all x ∈ X do
5:     for all R_j ∈ B_u do
6:       if r_j ≠ −1 then
7:         for all X_l ∈ 𝒳 that R_j blacklists do
8:           BL[(R_j, X_l : x_l)].append([x.time, +∞])
9:           for all R_q ∈ B_c that check X_l do
10:            BD[x].add(R_j ≺ R_q)
11:      if r_j = −1 then
12:        for all X_l ∈ 𝒳 that R_j can blacklist do
13:          if x.time is in any BL[(R_j, X_l : x_l)] then
14:            r_j ← p_j
15:    for all {x_l ∈ x | x_l is in any active blacklist} do
16:      if (∄ R_q ∈ B_c | p_q ≠ −1) then
17:        for all {R_j ∈ B_u | (R_j, X_l : x_l) ∈ BL} do
18:          BL[(R_j, X_l : x_l)].last() ← [_, x.time]
19:    for all R_q ∈ B_c do
20:      if r_q ≠ −1 and |{(R_i ≺ R_q) ∈ BD[x] | R_i ∈ B_u}| = 0 then
21:        BD[x].add(R_q ≺ R_q)
22:  return BD

Algorithm S2 Random search optimization.
θ: {rule shutoff probability ρ, rule priority shuffle probability γ}
1: function Random.optimize(X, R, ℓ, p, a, BD, λ, Ω¹, θ)
2:   p_best ← p
3:   Ω_best ← Ω¹
4:   while stoppingCriteriaNotMet() do
5:     p_rand ← p
6:     for all p_i ∈ p_rand do
7:       with γ% probability, do:
8:         p_i ← randomPriorityShuffle(p_i, a)
9:       with ρ% probability, do:
10:        p_i ← −1
11:    Ω_rand ← evaluate(X, R, ℓ, p_rand, a, B, BD, λ)
12:    if Ω_rand.loss < Ω_best.loss then
13:      Ω_best ← Ω_rand
14:      p_best ← p_rand
15:  return (p_best, Ω_best)

Algorithm S3 Greedy expansion optimization.
θ: {backtracking bt ∈ {true, false}}
1: function Greedy.optimize(X, R, ℓ, p, a, BD, λ, Ω¹, θ)
2:   p_best ← p
3:   Ω_best ← Ω¹
4:   p_keep ← (−1, ..., −1)
5:   p_greedy ← (−1, ..., −1)
6:   Q ← ∅
7:   while |Q| < |R| and stoppingCriteriaNotMet() do
8:     R_keep ← None
9:     Ω_keep.loss ← +∞
10:    for all {R_j ∈ R | R_j ∉ Q} do
11:      p_greedy,j ← p_j
12:      Ω_greedy ← evaluate(X, R, ℓ, p_greedy, a, B, BD, λ)
13:      if Ω_greedy.loss < Ω_keep.loss then
14:        R_keep ← R_j
15:        Ω_keep ← Ω_greedy
16:        p_keep ← p_greedy
17:      p_greedy,j ← −1
18:    Q.add(R_keep)
19:    if Ω_keep.loss < Ω_best.loss then
20:      Ω_best ← Ω_keep
21:      p_best ← p_keep
22:    if bt is true and isBacktrackingTime() then
23:      run greedy contraction to remove l rules, l < |Q|
24:  return (p_best, Ω_best)

[Figure S1: Baseline metrics comparison in D2 (ΔFPR, Δrecall, active rules rate, and Δsent-to-review per validation set, for "all on" and random search with ρ from 16% to 88%).]

[Figure S2: Greedy expansion results in D2 (Δloss per validation split for "all on", random ρ = 58%, greedy, and greedy w/ ARP).]

Algorithm S4 Genetic programming optimization.
θ: {population size ψ, survivors fraction α, mutation probability ρ}
1: function Genetic.optimize(X, R, ℓ, p, a, BD, λ, Ω¹, θ)
2:   p_best ← p
3:   Ω_best ← Ω¹
4:   P ← generateInitialPopulation(R, p, ψ, ρ)
5:   while stoppingCriteriaNotMet() do
6:     (P⁼, P⁻) ← evaluatePopulation(P, α)
7:     P⁺ ← mutateAndCrossover(P⁼, α, ψ, ρ)
8:     P ← {P⁼, P⁺}
9:   (P⁼, P⁻) ← evaluatePopulation(P)
10:  p_best ← P⁼₁
11:  Ω_best ← evaluate(X, R, ℓ, p_best, a, B, BD, λ)
12:  return (p_best, Ω_best)

13: function generateInitialPopulation(R, p, ψ, ρ)
14:   P ← ∅
15:   for i ∈ [0, ψ[ do
16:     p′ ← p
17:     for all p′_j ∈ p′ do
18:       with ρ% probability, do:
19:         p′_j ← −1
20:     P[i] ← p′
21:   return P

22: function mutateAndCrossover(P⁼, α, ψ, ρ)
23:   P⁺ ← ∅
24:   for i ∈ [0, (1 − α) · ψ[ do
25:     p_mother ← getRandomVector(P⁼)
26:     p_father ← getRandomVector(P⁼)
27:     p_child ← p_mother
28:     for all p_child,j ∈ p_child do
29:       with 50% probability, do:
30:         p_child,j ← p_father,j
31:     for all p_child,j ∈ p_child do
32:       with ρ% probability, do:
33:         p_child,j ← randomPriorityShuffle(p_j, a)
34:     P⁺.add(p_child)
35:   return P⁺

[Figure S3: Genetic programming loss versus random search by number of evaluations in fold A of D1 (zoomed in on the first 10000 rule evaluations; the methods nearly converge eventually).]
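For concreteness, the crossover-and-mutation step of Algorithm S4 might look as follows in Python. This is our own rendering, not ARMS code: the names are ours, and randomPriorityShuffle is simplified to drawing a random valid priority for the rule, which is an assumption.

```python
import random

def mutate_and_crossover(survivors, alpha, psi, rho, priorities_by_rule, rng):
    """Produce (1 - alpha) * psi children by uniform crossover plus mutation.

    `survivors` are priority vectors (-1 = rule off); `priorities_by_rule[j]`
    lists the valid priorities for rule j (our simplification of
    randomPriorityShuffle).
    """
    children = []
    for _ in range(round((1 - alpha) * psi)):
        mother = rng.choice(survivors)
        father = rng.choice(survivors)
        # Uniform crossover: each gene comes from either parent with equal odds.
        child = [m if rng.random() < 0.5 else f for m, f in zip(mother, father)]
        # Mutation: reassign each gene with probability rho.
        for j in range(len(child)):
            if rng.random() < rho:
                child[j] = rng.choice(priorities_by_rule[j])
        children.append(child)
    return children

rng = random.Random(1)
survivors = [[5, -1, 9], [5, 11, -1]]   # two surviving configurations, 3 rules
kids = mutate_and_crossover(survivors, alpha=0.2, psi=10, rho=0.1,
                            priorities_by_rule=[[5, -1], [11, -1], [9, -1]],
                            rng=rng)
assert len(kids) == 8  # (1 - 0.2) * 10 children
```

Together with the α · ψ survivors kept verbatim, the children restore the population to size ψ for the next generation, mirroring lines 22–35 of Algorithm S4.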
