Combined Classifier for Website Messages Filtration

The paper describes a new approach to filtering website messages using a combined classifier. Information security standards for internet resources require user data protection; however, the increasing volume of spam messages in interactive sections of websites poses a special problem. Spam messages vary significantly in content, but their common feature is that they are usually of little interest to the majority of recipients. Many filtering approaches are based on the Naive Bayesian classifier, an effective method for automatically constructing anti-spam filters with high performance. Unlike many email filtering solutions, the proposed approach is based on an effective combination of the Bayes and Fisher methods, which allows us to build an accurate and stable spam filter. In this paper we consider the organization of the combined classifier according to optimization criteria for grading messages, based on statistical methods, probability calculations and decision rules. Classifiers normally admit a compromise between the acceptable levels of false-positive and false-negative errors and use threshold values for decision-making, which may vary. To obtain more valid spam detection results, we need to analyze the sets of results produced by the various filters and the subsets where they overlap. The approach we suggest is a classifier organization that presumes the combined use of the Bayes and Fisher methods to improve filtration quality, based on the analysis of the subsets and set overlaps identified by both methods (spam, non-spam, false triggering and spam leaks).


Introduction
The constantly growing volumes of data, the number of users and the number of groups devoted to various subjects significantly decrease the effectiveness and authenticity of communicated information. In this regard, the task of increasing the efficiency of statistical data filtration and authentication algorithms becomes undoubtedly topical. The history of this subject in computer science spans 20 to 30 years, and the problem is becoming more urgent. Antispam features for interactive sections of websites are still at a very early stage of development. Message filtration in email is being widely developed and manual antispam methods are in use, while automated antispam protection of corporate websites (including comments, forums and other interactive sections) is becoming a priority. In practice there are no universal software solutions that protect all types of interactive website sections from spam. There is only a small number of specialized tools that prevent automatic message posting. Some of them are designed as plugins for a particular content management system such as WordPress: Akismet, Quiz, Spam Karma, etc. These modules have some disadvantages: the "as is" distribution model does not include the statistical base, and most online services do not provide multilingual filtration and support only the English language. Other blog comment hosting services, such as IntenseDebate, Disqus and Livefyre, do not provide a self-hosted option; Discourse is the exception. Therefore a spam filtering software solution should have the following properties: the use of multiple filtering methods, both formal and linguistic, united by a common intelligent decision-making core; high speed and precision; easy installation and use.
This work describes a new approach to spam filtration involving the combined use of the Bayes and Fisher methods, which significantly reduces the number of false triggerings and increases spam detection.

Calculation of combined probabilities of conditions
V. Tarasov, E. Mezentseva, D. Karbaev. Combined classifier for website messages filtration. Trudy ISP RAN (Proceedings of ISP RAS), vol. 27, issue 3, 2015, pp. 291-302.
The main idea of message classification is to select all conditions, calculate the probabilities of the selected conditions, and then combine all calculated probabilities into one value for the message under study. Messages with many spam attributes and few non-spam attributes will have a value close to 1, and messages with many non-spam attributes and few spam attributes will have a value close to 0. We will build a classifier of messages received by the website to grade incoming messages into three categories (spam, non-spam, unidentified). To do this, we need to identify all conditions (words and word combinations) in the message to be analyzed, calculate statistical probabilities for some selected conditions and combine all probabilities into one value for the whole message. In most cases the probability of assigning a message to a certain category is much higher than to the others, which determines the grading of such a message. Before calculating the combined probabilities of conditions, we need to calculate the probability of assigning a certain condition to a specific category. For this we could divide the number of messages with condition $i$ in this category by the total number of messages in the same category, but we would rather use another method described below. Let $a_i$ be the number of messages with condition $i$ in the spam group and $b_i$ the number of messages with condition $i$ in the non-spam group.
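Then the statistical probability of the appearance of condition $i$ in a spam message can be calculated as follows:

$$p_i^{s} = \frac{a_i}{a_i + b_i}, \quad (1)$$

and, correspondingly, in a non-spam message:

$$p_i^{ns} = \frac{b_i}{a_i + b_i}. \quad (2)$$

Thus, the number of messages with condition $i$ in one category is divided by the total number of messages featuring this condition $i$.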
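As a minimal sketch of the ratio described above, the following Python function computes the per-condition probabilities from the two counts. The function name and the handling of a condition that was never seen are assumptions made for illustration; they are not taken from the described system.

```python
def condition_probabilities(a_i, b_i):
    """Return (p_spam, p_nonspam) for one condition.

    a_i: number of spam messages that contain the condition.
    b_i: number of non-spam messages that contain the condition.
    """
    total = a_i + b_i
    if total == 0:
        # Assumption: an unseen condition has no defined statistical probability.
        raise ValueError("condition was never seen in the training messages")
    return a_i / total, b_i / total


# A condition seen in 30 spam and 10 non-spam messages.
print(condition_probabilities(a_i=30, b_i=10))  # (0.75, 0.25)
```

The two values always sum to 1, which reflects that the estimate depends only on the messages featuring the condition and not on the size of either category.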
The use of (1) and (2) takes into account the fact that with time the number of messages in both categories may become equal, i.e. these formulas do not depend on the number of messages in a specific category. Note that the formulas above give an accurate result only for those conditions which the filter has encountered in both categories. As a result, the spam filter becomes too sensitive to rare words in the early stages of learning. To solve this problem we calculate a new probability that combines the expected a priori probability ($P_{ex}$), taken with the applied weight ($w$), with the probabilities calculated according to (1) and (2). With $P_{ex} = 0.5$ and the weight of the expected probability equal to one word ($w = 1$), the weighted probabilities are estimated from (1) and (2). This approach avoids division by zero in the following formulas and takes rare words into account. To obtain the combined probabilities for the whole document (message) we use the dictionary built at the filter learning step. We introduce the following events: $A$, the document is spam; $B$, the document is non-spam. We assume that the probabilities are independent, so they can be multiplied:

$$P(A) = \prod_{i=1}^{n} \tilde{p}_i^{\,s} \quad (3)$$

for the probability of word co-occurrence in spam, and

$$P(B) = \prod_{i=1}^{n} \tilde{p}_i^{\,ns} \quad (4)$$

for the probability of word co-occurrence in non-spam [1], where $\tilde{p}_i^{\,s}$ and $\tilde{p}_i^{\,ns}$ are the weighted probabilities of condition $i$ for spam and non-spam, and $n$ is the number of conditions found in the document.
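The smoothing and combination steps can be sketched as follows. The weighted formula itself is not reproduced in the text, so the blend used here, $(w \cdot P_{ex} + n_i \cdot p_i) / (w + n_i)$ with $n_i$ the number of messages containing the condition, is an assumption based on a commonly used form of this correction. The function names are hypothetical.

```python
import math


def weighted_probability(p, n, p_ex=0.5, w=1.0):
    """Blend the observed probability p with the expected prior p_ex.

    p:    probability from the raw counts (pass 0.0 when n is 0).
    n:    number of messages in which the condition was seen.
    p_ex: expected a priori probability (0.5 in the text).
    w:    weight of the prior, measured in words (1 in the text).
    """
    return (w * p_ex + n * p) / (w + n)


def combined_probability(probabilities):
    """Multiply per-condition probabilities under the independence assumption."""
    return math.prod(probabilities)


# A word seen once, only in spam: the raw estimate 1.0 is pulled towards 0.5.
print(weighted_probability(p=1.0, n=1))        # 0.75
# A word never seen falls back to the prior; no division by zero occurs.
print(weighted_probability(p=0.0, n=0))        # 0.5
print(combined_probability([0.5, 0.5, 0.25]))  # 0.0625
```

With this blend a rare word can never receive a probability of exactly 0 or 1, so the product for the whole message does not collapse to zero because of a single unseen word.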

Decision rules based on Bayes theorem
To estimate the probability that a message belongs to one of three categories (spam, non-spam, unidentified messages) we consider two methods of classification. Here we apply the Bayes formulas using a priori knowledge [1]. We introduce two hypotheses for any given message: the message is spam, and the message is non-spam. We further introduce notation for the a priori probability that a message is not spam and for the a priori expectations that a message will be spam and that it will be non-spam.
Then, based on Bayes theorem and using a priori knowledge, we obtain the a posteriori probabilities for a message, including the a posteriori probability that the message is non-spam. The probabilities $P(A)$ and $P(B)$ are estimated according to (3) and (4).
The given algorithm is implemented in a spam detection and filtering system for websites.
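A minimal sketch of such a decision rule is given below, assuming the standard form of Bayes theorem for two hypotheses. The parameter `prior_spam`, the fallback value for zero evidence and the two thresholds are illustrative assumptions; the text does not state the values used in the implemented system.

```python
def posterior_spam(p_a, p_b, prior_spam=0.5):
    """A posteriori probability that a message is spam.

    p_a, p_b:   combined probabilities P(A) and P(B) of the message.
    prior_spam: assumed a priori probability that a message is spam.
    """
    prior_nonspam = 1.0 - prior_spam
    evidence = prior_spam * p_a + prior_nonspam * p_b
    if evidence == 0.0:
        return 0.5  # assumption: no information, treat as undecided
    return prior_spam * p_a / evidence


def grade(p_spam, spam_threshold=0.9, nonspam_threshold=0.1):
    """Grade a message into one of three categories (thresholds are hypothetical)."""
    if p_spam >= spam_threshold:
        return "spam"
    if p_spam <= nonspam_threshold:
        return "non-spam"
    return "unidentified"


p = posterior_spam(p_a=0.03, p_b=0.01)
print(round(p, 2), grade(p))  # 0.75 unidentified
```

The a posteriori probability of non-spam is the complement, `1 - posterior_spam(...)`, so a single value is enough to place the message in one of the three categories.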

Decision rules based on Fisher's method
According to the Fisher method, all probabilities are multiplied together in a manner similar to the Bayes method, then the natural logarithm of the product is taken and the result is multiplied by -2. To do this we introduce the variable hisqv, which is estimated by the following expressions:

$$\mathit{hisqv}_A = -2\ln(P(A)), \qquad \mathit{hisqv}_B = -2\ln(P(B)),$$

where the probabilities $P(A)$ and $P(B)$ are calculated according to (3) and (4).
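The statistic can be sketched in Python as shown below. Summing logarithms instead of multiplying first is a design choice of this sketch, not something stated in the text; it gives the same value while avoiding floating point underflow for long messages. The function name is hypothetical, and all probabilities are assumed to be strictly positive.

```python
import math


def fisher_statistic(probabilities):
    """Return -2 * ln(p_1 * p_2 * ... * p_n), computed as a sum of logarithms."""
    return -2.0 * sum(math.log(p) for p in probabilities)


probabilities = [0.01] * 400
# The naive product underflows to zero, so its logarithm cannot be taken.
print(math.prod(probabilities) == 0.0)  # True
# The sum of logarithms stays finite.
print(fisher_statistic(probabilities))
```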
Fisher proved that if a set of independent random probabilities (3) and (4) is given, the value $-2\ln(P(A))$ follows the $\chi^2$ distribution with $2n$ degrees of freedom ($n$ is the number of words in the document):

$$F(x) = \frac{1}{2^{n}\,\Gamma(n)} \int_0^{x} t^{\,n-1} e^{-t/2}\,dt, \quad (5)$$

where $\Gamma(n)$ is the gamma function. Using the representation of the gamma function for an even number of degrees of freedom, $\Gamma(n) = (n-1)!$, (5) can be written as:

$$F(x) = \frac{1}{2^{n}\,(n-1)!} \int_0^{x} t^{\,n-1} e^{-t/2}\,dt. \quad (6)$$

Calculating the factorial and the integrand in (6) could cause an overflow error because of the range of floating point numbers in the PHP programming language. Thus a recurrence formula is used in the calculation algorithm. The probability (6) is calculated by the Gaussian quadrature formula with 15 nodes. The resulting value is low if a text contains many spam conditions. We need the opposite result to rate the message correctly, so we subtract the value from 1. Applying the same subtraction for a large number of non-spam conditions gives the probability that the message is not spam. However, the Fisher method is not symmetrical. We need to combine the probabilities of spam and non-spam into a single value in the range between 0 and 1. For this we use the Fisher index.
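The following Python sketch illustrates the calculation under stated assumptions. Instead of the 15-node Gaussian quadrature used by the authors, it evaluates the $\chi^2$ distribution function for $2n$ degrees of freedom with the closed-form series $1 - e^{-x/2}\sum_{k=0}^{n-1}(x/2)^k/k!$, building each term from the previous one so that no factorial or large power is ever computed. The index formula $(1 + s - h)/2$ is one common way to combine the two scores and is an assumption here, since the text does not reproduce the formula. All function names are hypothetical.

```python
import math


def chi2_cdf_even(x, n):
    """Lower-tail probability of the chi-square distribution with 2n degrees of freedom."""
    half = x / 2.0
    term = math.exp(-half)  # k = 0 term of the series
    total = term
    for k in range(1, n):
        term *= half / k    # recurrence: each term is derived from the previous one
        total += term
    return min(1.0, max(0.0, 1.0 - total))


def fisher_score(probabilities):
    """One minus the chi-square probability of -2 * ln(product of probabilities)."""
    statistic = -2.0 * sum(math.log(p) for p in probabilities)
    return 1.0 - chi2_cdf_even(statistic, len(probabilities))


def fisher_index(spam_probs, nonspam_probs):
    """Combine the spam and non-spam scores into one value between 0 and 1."""
    s = fisher_score(spam_probs)
    h = fisher_score(nonspam_probs)
    return (1.0 + (s - h)) / 2.0


# Two conditions that both point strongly to spam.
print(fisher_index([0.9, 0.9], [0.1, 0.1]) > 0.9)  # True
```

A message whose conditions carry no preference, for example all probabilities equal to 0.5 in both lists, receives an index of 0.5, which falls into the unidentified band of a three-category rule.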

Optimization criteria for grading messages based on statistical methods
Let's assume that the whole set of conditions is divided into classes $A$ and $B$, where $A$ is the class of spam messages and $B$ is the class of non-spam messages. The task of assigning a message to any of these classes is not directly connected to the statistical verification of the following hypotheses: the simple hypothesis $H_A: X \in A$ against the alternative $H_B: X \in B$, where $X$ is the message qualifying condition. As we know from mathematical statistics, if a message belongs to class $A$ but was qualified as class $B$, a 1st type error occurs with conditional probability $\alpha$, the significance level. This is the error of selecting the alternative hypothesis $H_B$ instead of the correct $H_A$. If hypothesis $H_B$ is true but $H_A$ was nevertheless selected, a 2nd type error occurs with conditional probability $\beta$. The 1st type error, or false-negative error, occurs if the spam filter erroneously lets an undesired message through, identifying it as non-spam (spam leakage, or insufficient completeness of the method). While the spam filter is capable of identifying a large share of undesired messages, minimizing the number of desired (non-spam) messages that are filtered by mistake, i.e. minimizing the 2nd type error, may become a higher priority. The 2nd type error, or false-positive error, occurs if the spam filter erroneously classifies a legitimate message as spam (false triggering, or insufficient accuracy of the method). The spam filter is more efficient with a lower number of such errors, i.e. with a minimal 2nd type error level. However, all current antispam systems show a correlation between 1st and 2nd type errors. Classifiers normally admit a compromise between the acceptable levels of 1st and 2nd type errors and use threshold values for decision-making, which may vary. This results in the "strictness" or "softness" of the classifier. The significance level set during statistical hypothesis verification is taken as the threshold value.
An increase in the filter sensitivity leads to an increased occurrence of 1st type errors (spam leaks), whereas a decrease in sensitivity leads to an increased occurrence of 2nd type errors (false triggering).
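The trade-off between the two error types can be illustrated with a short Python sketch that measures both rates on labelled messages for a given threshold. The scores, labels and the rule "spam when the score is at least the threshold" are hypothetical illustration data, not results from the paper.

```python
def error_rates(scores, labels, threshold):
    """Return (spam_leak_rate, false_trigger_rate) for a score threshold.

    scores: spam scores in [0, 1], one per message.
    labels: True for spam, False for non-spam.
    A message is classified as spam when its score is >= threshold.
    """
    spam = [s for s, is_spam in zip(scores, labels) if is_spam]
    ham = [s for s, is_spam in zip(scores, labels) if not is_spam]
    leaks = sum(1 for s in spam if s < threshold)           # 1st type errors
    false_triggers = sum(1 for s in ham if s >= threshold)  # 2nd type errors
    return leaks / len(spam), false_triggers / len(ham)


scores = [0.95, 0.8, 0.55, 0.3, 0.7, 0.4, 0.2, 0.05]
labels = [True, True, True, True, False, False, False, False]

for threshold in (0.25, 0.5, 0.75):
    print(threshold, error_rates(scores, labels, threshold))
# 0.25 (0.0, 0.5)
# 0.5 (0.25, 0.25)
# 0.75 (0.5, 0.0)
```

Moving the threshold reduces one error rate only at the cost of the other, which is the correlation between 1st and 2nd type errors described above.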

Bayes optimization criterion
To evaluate the classification quality we need to consider the losses related to 1st and 2nd type errors. For this we split the space of conditions $X$ into two semispaces $X_A$ and $X_B$ by the point $x_0$. Let $c_1$ be the conditional price of a 1st type error and $c_2$ the conditional price of a 2nd type error, $P(A)$ the a priori probability of class $A$ and $P(B)$ the a priori probability of class $B$, with $P(A) + P(B) = 1$. The values $c_1$ and $c_2$ depend on the coefficients of the price matrix $C_{2\times 2} = \{c_{ij}\}$ and on the 1st and 2nd type errors:

$$c_1 = c_{11}(1 - \alpha) + c_{12}\,\alpha, \quad (7)$$

$$c_2 = c_{21}\,\beta + c_{22}(1 - \beta). \quad (8)$$

These values are also called the conditional risks given that hypotheses $H_A$ and $H_B$, respectively, are true. According to decision-making theory, we introduce the classification decision rule that minimizes the loss (risk) function [3]:

$$R = P(A)\,c_1 + P(B)\,c_2 \to \min, \quad (9)$$

where $c_1$ and $c_2$ are determined by (7) and (8). Function (9) represents the average risk, which depends on the threshold value $x_0$ because $c_1$ and $c_2$ depend on $x_0$ through the type I and type II errors; therefore these errors are correlated. The threshold values are determined by the minimum value $R_{\min}$ of the risk function (9). Thereby we set strict limits for spam and regular limits for non-spam messages. Such threshold values provide minimum leakage of desired messages into spam, i.e. minimum false triggering. However, any system administrator can easily set more convenient threshold values to suit their needs.
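A minimal Python sketch of choosing the threshold by average risk is shown below. It assumes that the conditional risk $c_1$ has the form symmetric to (8), that is $c_1 = c_{11}(1-\alpha) + c_{12}\alpha$, and that the average risk is $P(A)c_1 + P(B)c_2$. The price matrix, the a priori probability and the error rates per threshold are hypothetical illustration values.

```python
def average_risk(alpha, beta, p_a, cost):
    """Average risk for error rates alpha and beta and prior P(A) = p_a.

    cost is the 2x2 price matrix [[c11, c12], [c21, c22]].
    """
    (c11, c12), (c21, c22) = cost
    c1 = c11 * (1 - alpha) + c12 * alpha  # conditional risk when H_A is true (assumed form)
    c2 = c21 * beta + c22 * (1 - beta)    # conditional risk when H_B is true, as in (8)
    return p_a * c1 + (1 - p_a) * c2


def best_threshold(rates, p_a, cost):
    """Pick the threshold x0 with the lowest average risk.

    rates maps each candidate threshold to its (alpha, beta) pair.
    """
    return min(rates, key=lambda x0: average_risk(*rates[x0], p_a, cost))


# Hypothetical prices: a false trigger costs five times more than a spam leak.
cost = [[0.0, 1.0], [5.0, 0.0]]
rates = {0.25: (0.0, 0.5), 0.5: (0.25, 0.25), 0.75: (0.5, 0.0)}

print(best_threshold(rates, p_a=0.5, cost=cost))  # 0.75
```

Because false triggering is priced higher in this example, the minimum of the risk falls on the strictest threshold, which mirrors the choice of strict limits for spam described above.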

Combined filter
To obtain more valid spam detection results we need to analyze the sets of results of the various filters and the subsets where they overlap. We suggest exactly this kind of approach to classifier organization: the combined use of the Bayes and Fisher methods to improve filtration quality, based on the analysis of the subsets and set overlaps identified by both methods (spam, non-spam, false triggering and spam leaks). Let $S = \{s_i\}$ ($i = 1, \dots, M$) be the set of documents (messages), including both desired and spam messages, and let $S_B \subseteq S$ and $S_F \subseteq S$ be the sets of documents identified by the Bayes and Fisher classifiers, respectively. Then the subset resulting from the overlap $S_B \cap S_F$ across all indicated categories may be used for evaluating the quality of the combined filter operation (see Fig. 1).

Fig. 1. Illustration of overlap degree of two subsets SB and SF.
The completeness of the overlap $S_B \cap S_F$ also characterizes the subsets $S_B \setminus S_F$ and $S_F \setminus S_B$. As a measure of the degree of overlap of the two sets $S_B$ and $S_F$ we suggest using the absolute measure $N(S_B \cap S_F)$, the number of documents shared by these subsets. Thus, the maximum value of this measure for category $l$ (spam, non-spam, false triggering and spam leaks) is used as the optimality criterion for evaluating spam filter self-teaching:

$$N_l(S_B \cap S_F) \to \max.$$

Once the best overlap values of the sets $S_B$ and $S_F$ are reached across all categories, the administrator can choose a filter for further application (see Fig. 2). A benefit of the combined filter implementation is that all components of the overall picture can be evaluated:
- spam messages caught by both filters;
- spam messages caught only by the Bayes or only by the Fisher filter;
- simultaneous false triggering of both filters;
- false triggering of each individual filter;
- simultaneous spam leaks by both filters;
- spam leaks of each individual filter.
Before testing, the filter was trained on 1100 messages (400 spam and 500 non-spam).
The tests were run on a flow of 1223 messages. The Bayes method showed 2.9 percent false triggering and 9.8 percent spam omission. The Fisher method showed 1.5 and 4.5 percent, respectively. The combined filter showed the best result with 1.0 and 4.5 percent. The experimental results confirmed the feasibility of using the selected filtering algorithms. Only with the whole picture can we make a reasonable comparison of the self-teaching quality of the combined filter.
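The overlap analysis of the two filters can be sketched with Python sets. The function name, the report keys and the message identifiers are hypothetical; the sketch only counts how many messages fall into each component of the picture, given which messages are truly spam and which ones each filter flagged.

```python
def overlap_report(all_ids, spam_ids, bayes_flagged, fisher_flagged):
    """Count messages in each overlap category for two spam filters.

    All arguments are sets of message identifiers.
    """
    ham_ids = all_ids - spam_ids
    both = bayes_flagged & fisher_flagged
    only_bayes = bayes_flagged - fisher_flagged
    only_fisher = fisher_flagged - bayes_flagged
    neither = all_ids - (bayes_flagged | fisher_flagged)
    return {
        "spam caught by both": len(spam_ids & both),
        "spam caught only by Bayes": len(spam_ids & only_bayes),
        "spam caught only by Fisher": len(spam_ids & only_fisher),
        "false triggering of both": len(ham_ids & both),
        "false triggering only by Bayes": len(ham_ids & only_bayes),
        "false triggering only by Fisher": len(ham_ids & only_fisher),
        "spam leaked by both": len(spam_ids & neither),
    }


report = overlap_report(
    all_ids=set(range(1, 11)),
    spam_ids={1, 2, 3, 4, 5},
    bayes_flagged={1, 2, 3, 6},
    fisher_flagged={1, 2, 4, 6, 7},
)
for category, count in report.items():
    print(category, count)
```

In this toy data both filters agree on two spam messages and one false trigger, each filter catches one spam message the other misses, and one spam message leaks through both.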