Spam filtering by quantitative profiles

M. Grendár, J. Škutová, V. Špitalský
Slovanet a.s., Záhradnícka 151, 821 08 Bratislava, Slovakia

Abstract

Instead of the "bag-of-words" representation, in the quantitative profile approach to spam filtering and email categorization an email is represented by an $m$-dimensional vector of numbers, with $m$ fixed in advance. Inspired by Sroufe et al. [Sroufe, P., Phithakkitnukoon, S., Dantu, R., and Cangussu, J. (2010). Email shape analysis. In LNCS, 5935, pp. 18-29], two instances of quantitative profiles are considered: the line profile and the character profile. Performance of these profiles is studied on the TREC 2007 and CEAS 2008 corpuses and on a private corpus. At low computational cost, the two quantitative profiles achieve performance that is at least comparable to that of heuristic rules and naive Bayes.

Keywords: email categorization, spam filtering, quantitative profile, character profile, line profile, Random Forest

1. Introduction

Spam is an unsolicited email message. From the receiver's perspective, spam is an annoyance, and thus it is necessary to block its delivery, for instance by filtering it out. The traditional approach to spam filtering and email categorization, which is based on heuristic rules, naive Bayes filtering and/or text mining, suffers from several deficiencies. Among the shortcomings of the traditional approach are high computational costs, language dependence, the necessity to update the heuristic rules, the high number of rules, and vulnerability.

Instead of the "bag-of-words" representation employed in text mining and naive Bayes filtering, in the quantitative profile (QP) approach that we propose an email is represented by an $m$-dimensional vector of numbers, with $m$ fixed in advance. Inspired by Sroufe, Phithakkitnukoon, Dantu, and Cangussu [12], two instances of QPs are considered: the line profile (LP) and the character profile (CP).
Informally put, the line profile of an email is a vector of the lengths of its first $m$ lines. The character profile is a histogram of characters. Of course, many other QPs are conceivable.

The main advantages of the two considered quantitative profiles are i) sound performance, ii) simple computability, iii) language independence, iv) robustness to outlying emails, v) high scalability and vi) low vulnerability. The two considered instances of the quantitative profile approach perform comparably to naive Bayes filtering and heuristics-based approaches, and they also perform very well in multi-language, non-English communication. Furthermore, the satisfactory performance is attained by means of a small set of easily computable quantitative features.

A performance study was done on the TREC 2007 and CEAS 2008 corpuses and on a private corpus. To demonstrate the power of the considered QPs, the profiles are obtained from raw emails, without any preprocessing. Consequently, we intentionally ignore the structure of emails and the character encoding. On the two QPs, the Random Forest algorithm substantially outperforms other classification algorithms (SVM, LDA/QDA, logistic regression). Thanks to the Random Forest, the QP approach gains robustness to emails with extreme-valued profiles as well as high scalability.

Email addresses: marian.grendar@slovanet.net (M. Grendár), jana.skutova@slovanet.net (J. Škutová), vladimir.spitalsky@slovanet.net (V. Špitalský)
Preprint submitted to Elsevier, June 4, 2021

In our view, classification and email categorization by LP (or CP) should have low vulnerability, because the lines whose lengths differentiate between spam and ham change from corpus to corpus. For instance, in the CEAS 2008 corpus the line lengths most important for deciding between spam and ham are those of lines 10, 17, 15, 14 and 16, whilst in the TREC 2007 corpus they are those of lines 5, 13, 6, 15 and 7.
As the training corpus is usually not available to a spammer, it should not be easy to evade the LP (or CP) filter.

As a by-product of the performance study of CP and LP, we note that in the TREC 2007 and CEAS 2008 corpuses the number of header lines is capable of discriminating between spam and ham at a rate that is, in our view, too high. Consequently, these corpuses lead to overly optimistic assessments of spam filtering methods.

The paper is organized as follows. In the next section we formally introduce the QPs mentioned above. Then, in Section 3, we describe the three email corpuses used for assessment of the QPs' performance. In Section 4 we describe the measures used for performance evaluation. The results are summarized in Section 5. In the concluding section some directions for future research are briefly discussed.

All the computations were performed with R [7]. To make the results reproducible, supplementary material including the source code was prepared, cf. [4].

2. Quantitative profiles

The quantitative profile (QP) of an email is an $m$-dimensional vector of real numbers that represents the email. The dimension $m$ of the profile is set in advance, and it is the same for all emails. In this paper we consider two particular QPs: the line profile and the character profile.

These profiles can be introduced by means of a simple probabilistic model. An email is represented as a realization of a vector random variable that is generated by a hierarchical data-generating process. The length $n$ of an email is an integer-valued random variable with the probability distribution $F_n$. Given the length, the email is represented by a random vector $X_1^n = (X_1, \dots, X_n)$ from the probability distribution $F_{X_1^n \mid n}$ with support in $\mathcal{A}^n$, where $\mathcal{A} = \{a_1, \dots, a_m\}$ is a finite set (alphabet) of size $m = |\mathcal{A}|$.

Then the character profile (CP) of an email is the $m$-dimensional random vector $CP = (CP_1, \dots, CP_m)$, where

$$CP_j = \sum_{i=1}^{n} I\{X_i = a_j\}, \qquad j = 1, \dots, m,$$

and $I$ is the indicator function.

Figure 1: Graphical representation of the line and character profiles of an email. Panels: (a) Email, (b) Line profile, (c) Character profile.

In order to introduce the other QP, it is necessary to select a special character (or a subset of characters) from the alphabet. Let $k$ be the number of occurrences of the special character in an email and let $T_j$ ($j = 1, \dots, k$) be the index of the $j$-th occurrence of the special character; put $T_0 = 0$. Then the binary profile (BP) of an email is defined as the $\tilde{k}$-dimensional random vector $BP = (BP_1, \dots, BP_{\tilde{k}})$, where

$$BP_j = T_j - T_{j-1} - 1, \qquad j = 1, \dots, \tilde{k}.$$

Here $\tilde{k}$ is the maximum allowable number of occurrences of the special character, and it is set in advance. Hence, if $k > \tilde{k}$, the rest of the email is ignored; and if an email has $k < \tilde{k}$, then $BP_j = 0$ for $j > k$.
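The two definitions above can be illustrated by a short sketch. It is not the authors' implementation (their computations were performed in R); the function names `character_profile` and `binary_profile` are hypothetical, and treating the alphabet as the 256 possible byte values is an assumption made here for illustration.

```python
from collections import Counter
from typing import List, Sequence


def character_profile(email: bytes,
                      alphabet: Sequence[int] = tuple(range(256))) -> List[int]:
    """CP_j = number of positions i with X_i == a_j, for each symbol a_j.

    Assumption: the alphabet defaults to all 256 byte values.
    """
    counts = Counter(email)  # iterating over bytes yields integers 0..255
    return [counts[a] for a in alphabet]


def binary_profile(email: bytes, special: int, k_max: int) -> List[int]:
    """BP_j = T_j - T_{j-1} - 1, with T_j the 1-based index of the j-th
    occurrence of the special symbol and T_0 = 0."""
    profile: List[int] = []
    previous = 0  # T_0 = 0
    for index, symbol in enumerate(email, start=1):
        if len(profile) == k_max:
            break  # occurrences beyond k_max are ignored
        if symbol == special:
            profile.append(index - previous - 1)
            previous = index
    # Pad with zeros when the email has fewer than k_max occurrences.
    return profile + [0] * (k_max - len(profile))


# Illustrative input: three end-of-line bytes, k_max chosen arbitrarily as 5.
example = b"ab\ncde\n\nf"
print(binary_profile(example, special=ord("\n"), k_max=5))  # [2, 3, 0, 0, 0]
print(character_profile(example)[ord("\n")])                # 3
```

Note that, following the definition literally, bytes after the last occurrence of the special symbol (the trailing `f` in the example) do not contribute to the binary profile.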
In this work, an email is understood as a stream of bytes without any preprocessing, so that $\mathcal{A}$ is taken to be the ASCII character set. The end-of-line is taken for the special character. Consequently, the binary profile becomes the line profile (LP), since $BP_j$ is the length of the $j$-th line of an email in bytes. We fix the maximal number $\tilde{k}$ of considered lines to 100.

3. Data sets

To assess the performance of the two quantitative profiles, we consider three email corpuses: the publicly available TREC 2007 and CEAS 2008 corpuses, and a private corpus.

The TREC 2007 corpus [3] comprises over 75 000 emails (25 220 hams and 50 199 spams), of which approximately 67% is spam, and the rest are ham emails. For the training phase we used the first 50 000 emails. The rest forms the test set. The ratio of spam to ham in the training and test sets is approximately 2:1.

The other publicly available corpus, used for the performance analysis of QPs, is CEAS 2008 [2]. It consists of 137 705 emails (27 126 hams and 110 579 spams). The corpus was hand labeled [9]. To form the training set, we used the first 90 000 emails. The test set comprises the remaining emails. The ratio of spam to ham in the training and test sets is approximately 4:1.

Performance of spam filtering algorithms is typically assessed on English language corpuses, such as the above mentioned TREC and CEAS. When applied to non-English emails, their performance may be different. Due to the support from a Slovak internet services provider, we enjoyed the opportunity of access to a private corpus created in 2010, that comprises mainly non-English emails. Structure of the corpus is summarized in Table 1. The training set we used consists of 11 050 emails and the test set consists of 12 200 emails.
Table 1: Composition of the private corpus

corpus    s.ham   advert   notify   spam    total
train      6837     1611     1225   1377   11 050
test       3650     1409     5758   1383   12 200

The private corpus was hand labeled. Emails were placed into one of two groups: ham and spam. Moreover, ham was divided into advert, solicited ham (denoted s.ham) and notify. Based on the language of the major part of an email, the emails were placed into one of four language groups: Slovak/Czech, English, German, and other. Slovak/Czech ham makes up 64% of the training set and 38% of the test set.

4. Classification algorithm and performance measures

Quantitative profiles serve as an input to a classification algorithm. In this work we use the Random Forest classifier, introduced by Breiman [1] and ported to R by Liaw and Wiener [6], with the default settings. We have also employed LDA, logistic regression, LASSO and SVM [5], but these methods performed much worse.

To evaluate the QPs' performance, we calculate the false positive rate $\mathrm{fpr} = FP/(TN + FP)$ and the false negative rate $\mathrm{fnr} = FN/(TP + FN)$, where TP (TN) stands for the number of true positive (true negative) emails, i.e. the correctly recognized spam (ham) emails, and FP (FN) stands for the number of false positive (false negative) emails, i.e. the incorrectly recognized ham (spam) emails, respectively. We also present the receiver operating characteristic (ROC) curve, i.e. the graph of the true positive rate vs. the false positive rate, both obtained as functions of the decision threshold. The area under the ROC curve (AUC) is also reported.

5. Results

In this section we summarize the performance of the basic QPs on the three corpuses mentioned above.

5.1. Comparison of quantitative profiles with SpamAssassin and Bogofilter

For the sake of comparison, we also report results for SpamAssassin (SA) [11], version 3.3.1, off-line and without the Bayes filter, and for Bogofilter (BF) [8], version 1.2.2 with the default configuration. On the private corpus, the output from SA was processed by the Random Forest classification algorithm, as it attains much better performance than SA with the default weights. In addition to much better performance, it allows for email categorization, which is impossible with the default SA. On the public corpuses, however, the default SA classification performs better. Bogofilter allows binary classification only. BF was trained in batch mode.

Table 2: fnr (%) at fixed fpr = 0.5% or fpr = 1%

               private              TREC07               CEAS08
filter    at 0.5%   at 1%     at 0.5%   at 1%     at 0.5%   at 1%
CP          14.39   11.64        2.53    0.49        4.38    4.25
LP          21.33   20.10        0.30    0.13        0.39    0.27
SA          12.68   10.10       35.87   30.51       76.14   69.92
BF          13.05    7.38        0.40    0.06        0.47    0.36

Figure 2: ROC curves (true positive rate vs. false positive rate). Panels and reported AUC values: (a) private corpus: LP 0.96395, CP 0.95982, SA 0.95682, BF 0.96131; (b) TREC 2007 corpus: LP 0.99986, CP 0.9993, SA 0.94216, BF 0.99959; (c) CEAS 2008 corpus: LP 0.99974, CP 0.99889, SA 0.79569, BF 0.99842.

On the private corpus, for fpr = 0.5% the best performance is attained by SA and BF, and CP performs only slightly worse, cf. Table 2 and Figure 2. LP is slightly less effective, and it attains fnr of about 21% at fpr = 0.5%. With the exception of SA, on the public corpuses all the studied filters attain much better performance than on the private corpus, see also Section 5.5.
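The entries of Table 2 pair a fixed fpr with the resulting fnr. The following minimal sketch shows one way such a value could be derived from classifier scores. It is an illustration only, not the authors' procedure: the function name `fnr_at_fixed_fpr`, the convention that an email is flagged as spam when its score lies strictly above the threshold, and the handling of ties are all assumptions.

```python
from typing import Sequence


def fnr_at_fixed_fpr(scores: Sequence[float],
                     is_spam: Sequence[bool],
                     target_fpr: float) -> float:
    """Smallest fnr over all thresholds whose fpr does not exceed target_fpr.

    Assumption: an email is flagged as spam when score > threshold.
    """
    ham = sorted((s for s, y in zip(scores, is_spam) if not y), reverse=True)
    spam = [s for s, y in zip(scores, is_spam) if y]
    if not ham or not spam:
        raise ValueError("both classes must be present")

    # Number of ham emails that may be misclassified (FP) at the target rate;
    # the small tolerance guards against floating-point round-off.
    allowed_fp = int(target_fpr * len(ham) + 1e-9)

    # Lowest admissible threshold: the (allowed_fp + 1)-th highest ham score.
    # Only ham scores strictly above it become false positives.
    threshold = ham[allowed_fp] if allowed_fp < len(ham) else float("-inf")

    false_negatives = sum(1 for s in spam if s <= threshold)  # FN count
    return false_negatives / len(spam)                        # FN / (TP + FN)


# Toy data: four ham and four spam scores (made up for illustration).
toy_scores = [0.1, 0.2, 0.3, 0.9, 0.8, 0.95, 0.4, 0.25]
toy_labels = [False, False, False, False, True, True, True, True]
print(fnr_at_fixed_fpr(toy_scores, toy_labels, target_fpr=0.25))  # 0.25
```

Lowering the threshold can only reduce fnr, so the lowest threshold that still respects the fpr budget gives the reported value.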
On the public corpuses, the two QPs perform much better than SA, as Table 2 and Figures 2b and 2c indicate. The line profile attains better performance (smaller fnr) than BF at the 0.5% level of fpr.

5.2. Email categorization for the private corpus

On the private corpus, the performance of CP and especially LP in spam filtering (i.e. binary categorization) is worse than that of SA and BF. However, in categorization of emails into one of the four categories, both CP and LP perform much better in the most interesting category of solicited ham. The misclassification table for CP and SA is given in Table 3.

Table 3: CP and SA confusion tables (counts) for categorization, private corpus. Rows: hand-labeled category; columns: assigned category.

                        CP                                 SA
           advert   s.ham   notify   spam     advert   s.ham   notify   spam
advert        530     837       10     32        515     830       17     47
s.ham          26    3597       27      0         95    3290      248     17
notify         20     237     5499      2         24     235     5498      1
spam           37     221       12   1113         44     161       14   1164

5.3. Comparison with email shape analysis

Sroufe et al. [12] suggest filtering spam by means of its shape, which the authors define (using our terminology) as a smoothed line profile of the email body, where the smoothing is performed by a kernel smoother. Sroufe et al. also report a total error of 30%, based on a preliminary study on the TREC corpus. Further, the authors find the performance very good, 'considering that no content or context was even referenced', cf. [12], p. 26. The line profile, which is inspired by the email shape analysis, attains on the TREC corpus fpr = 4.23% and fnr = 17.00% when the threshold is not optimized and email headers are intentionally not taken into account, cf. Section 5.5. This gives a total error of around 12.29% and indicates that smoothing is unnecessary.

5.4. Reduction of the feature space

It is also important to know to what extent the dimension of the QP feature space can be reduced without a substantive reduction of the classifier's performance.
To this end, the top 20 and the top 50 features were considered, where the ranking of features was provided by the Random Forest's measure of the mean decrease of accuracy. The study was done on the private corpus, and solely the email body was considered.

In the binary classification, the top 20 features of the line profile attain essentially the same performance as the entire line profile of length 100. In the case of CP, the top 50 features out of 256 are needed to attain the full-set performance. The same holds for SA. However, in the case of email categorization it is not possible to reduce the dimension of the SA features without a substantive decrease in accuracy.

5.5. Why are TREC and CEAS misleading?

All the considered email filters except SA attain much better performance on the public corpuses than on the private one; cf. Table 2. In search of an explanation, we have noted that in the public corpuses, unlike in the private corpus, the number of header lines contains information that is substantive for spam filtering. Figure 3 depicts the distribution of the number of lines in the header, for spam and ham, in the TREC corpus. Once the email header is not taken into account and solely the email body is processed, the performance of LP and CP worsens and becomes comparable to that on the private corpus; cf. Table 4. In Table 4, LPH (CPH) denotes the line (character) profile of the email header and LPB (CPB) denotes the line (character) profile of the email body.

Figure 3: Distribution of emails with respect to the number of header lines (density vs. number of header lines), for spam and ham, in the TREC 2007 corpus.

Table 4: fnr (%) at fixed fpr = 0.5% or fpr = 1%

               private              TREC07               CEAS08
filter    at 0.5%   at 1%     at 0.5%   at 1%     at 0.5%   at 1%
CPH         16.51   13.71        0.19    0.05        0.78    0.30
LPH         18.06   15.11        1.78    0.12        0.68    0.36
CPB         14.24   11.92       15.05    5.47        4.70    4.51
LPB         21.26   18.58       45.01   43.67        7.07    6.35

The decline in the performance of CP and LP caused by the exclusion of email headers supports the hypothesis that in the TREC 2007 and CEAS 2008 corpuses the profiles of headers carry substantive information for discriminating between spam and ham.

5.6. Summary of the performance study

The empirical study implies that the simple and easily obtainable line and character profiles attain performance at least comparable to that of the optimally tuned SpamAssassin, which is based on hundreds of fixed rules, and that the performance of character profiles is close to that of Bogofilter. In particular, on the public corpuses LP is better than BF and SA. On the private corpus CP attains performance comparable to that of BF and SA, and LP is slightly worse.

6. Conclusions

Motivated by Sroufe et al. [12], we have proposed the quantitative profile approach to email classification. In this report we explored two quantitative profiles, the line profile and the character profile. The profiles are obtained from raw emails, without any preprocessing. The computational costs of the two profiles are minimal. Performance of the profiles was studied on the TREC 2007 and CEAS 2008 corpuses and on a private, multi-lingual corpus. The two quantitative profiles attained performance at least comparable to that of the optimally tuned SpamAssassin and the batch-mode trained Bogofilter. Besides the good performance, the two quantitative profiles are language independent, and the resulting filter is robust to outlying emails, highly scalable and has low vulnerability.

As a by-product, we have noted that the number of header lines in the TREC 2007 and CEAS 2008 corpuses contains rather strong information on the email class.
The corpuses thus lead to overly optimistic assessments of the performance of spam filters.

In the near future we plan to explore quantitative profiles based on the size and structure of emails and on symbolic dynamics, as well as other instances of the binary profile. Also, the profiles are worth employing in semi-supervised email categorization.

7. Acknowledgement

Stimulating feedback from Ján Gallo and Stanislav Záriš is gratefully acknowledged. This paper was prepared as a part of the project "SPAMIA", MŠ SR 3709/2010-11, supported by the Ministry of Education, Science, Research and Sport of the Slovak Republic, under the heading of the state budget support for research and development.

References

[1] Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
[2] Cormack, G. V., and Lynam, T. R. (2008). CEAS 2008 corpus. http://plg.uwaterloo.ca/~gvcormac/ceascorpus
[3] Cormack, G. V., and Lynam, T. R. (2007). TREC 2007 corpus. http://plg.uwaterloo.ca/~gvcormac/treccorpus07/about.html
[4] Grendár, M., Škutová, J., and Špitalský, V. (2011). Supplement to "Spam filtering by quantitative profiles". http://www.savbb.sk/~grendar/spam/SupplementToQuantitativeProfiles.pdf
[5] Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed., Springer, New York.
[6] Liaw, A., and Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18-22.
[7] R Development Core Team (2010). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. http://www.R-project.org
[8] Raymond, E. S., Relson, D., Andree, M., and Louis, G. (2004). Bogofilter. http://Bogofilter.sourceforge.net
[9] Segal, R., Bratko, A., and Cormack, G. (2008). CEAS 2008 Spam Filter Challenge, conference talk. http://www.ceas.cc/2008/challenge/results.pdf
[10] Sing, T., Sander, O., Beerenwinkel, N., and Lengauer, T. (2009). ROCR: Visualizing the performance of scoring classifiers. R package version 1.0-4. http://CRAN.R-project.org/package=ROCR
[11] SpamAssassin. http://spamassassin.apache.org
[12] Sroufe, P., Phithakkitnukoon, S., Dantu, R., and Cangussu, J. (2010). Email shape analysis. In Distributed Computing and Networking, Lecture Notes in Computer Science, K. Kant et al. (eds), 5935/2010, pp. 18-29.
