pátek 2. září 2016

What is the correct measure for classification?

There are multiple measures that we can use to describe performance of a classifier. To name just a few: classification accuracy (ACC), area under receiver-operating curve (ROC-AUC) or F-measure.

Each metric is suitable for a different task. And there are two questions we have ask to select an appropriate metric:
  1. Is the decision threshold known ahead and fixed?
  2. Do we need calibrated predictions?  
If a single threshold is known ahead and fixed, for example, because we know the misclassification cost matrix, threshold-based metrics like classification accuracy or F-measure are appropriate. However, if we do not know the cost matrix or if the cost matrix is expected to evolve in time, it is desirable to use ranking-based metrics like ROC-AUC, that consider all possible thresholds with equal weight.

Of course, sometimes we may want sometimes in the middle, because we have some knowledge about the distribution of the values in the cost matrix. Example measures that consider just a range of thresholds or weight the thresholds unequally are partial Gini index, lift or H-measure. If you have some information about the working region of the model, you should use a measure that takes that knowledge into account (see the correlations of measures in the second referenced article).

Another question to ask is whether we need calibrated results. If all we need is to rank results and pick top n% of samples, ranking measures like ROC-AUC or lift are appropriate, following the  Vapnik's philosophy: "Do not solve harder problem than you have to". On the other end, if we need to asses risk, calibrated results are desirable. Measures that evaluate quality of calibration are Brier score or logarithmic-loss.

And of course, there are measures that are somewhere in the middle, like threshold-base measures, that evaluate quality of calibration at a single threshold, while ignoring the quality of calibration at the rest of thresholds.

A metric may also feature following desirable properties:
  1. Provides information about how much better is the model in comparison to a baseline for easier interpretation of the quality of the model.
  2. Works not only with binary, but also with polynomial labels.
  3. Is bounded for easier comparison of models across different datasets.
  4. Works also with unbalanced classes. 
But that would be for a longer post. Relevant literature:
  1. Getting the Most Out of Ensemble Selection
  2. Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research

sobota 27. srpna 2016

How to design and evaluate tests in schools

There two ways how to end up with high quality products. Either you invest more energy into a design phase, or into a test phase. Following paragraphs describe the second approach.

Premise: If you replace paper-based tests with electronic tests, you can easily analyze the student's responses.

The first trivial analysis is to rank the questions based on the students' average success rate on the question. By looking at the bottom 10% of questions, you may identify badly formulated questions or flawed answer options. Alternatively, the topic may have not be thoroughly explained and practiced. Similarly, you may look at the top 10% of questions and identify accidentally too easy questions. 

The second analysis is based on correlation coefficients. Pearson correlation (although we should call it point biserial correlation) is calculated for each question between the vector of students' responses on that question and the students' overall gain of points from the test. Ideally, each question would highly correlate with the overall test score. If we sort the questions in the descending order and look at the bottom 10% of questions, we may identify irrelevant, too easy, too hard or simply flawed questions.

An example of a irrelevant question is calculation of Euclidean distance between 2 points in 3 dimensional space - each university student should already know how to calculate that. But this question could have been missed by the first analysis simply because of frequent numerical errors.

On the other end, an example of a relevant question is computation of Chebyshev distance - it is unlikely that students have ever heard about it before the course. Plus, it is easier to compute than Euclidean distance. Hence, the effect of numerical errors is minimized.

After identifying 10% of the worst questions, give students free points for these questions. It will make students happy for following reasons:
  1. If a test consists of more than a single question, students will almost inevitably complain, that some of the questions were too difficult. Not counting the toughest questions solves the problem.
  2. If a test is long, students will almost inevitably complain, that something was not properly covered. Not counting the toughest questions solves the problem.
  3. The correction may improve students' scores, but never hurt.
 Professors like it because:
  1. They may dedicate the test creation to unskilled workforce (read: teacher assistants), as the questions do not have to perfect. The students themselves will identify troublesome questions and the system will deal with them.
  2. It is a systematic (and proactive) approach how deal with students' complains about the tests.
  3. As a side product, it allows generation of a large set of relevant, correct and appropriately difficult test questions, which may work as poor man's replacement for absent/obsolete textbooks.
Summary: The described approach is suitable for fast evolving domains like machine learning (or any computer science related domain), where battle-field tested questions are still not available or are already obsolete.

sobota 13. srpna 2016

Pragmatic comparison of MySQL and PostgreSQL

The biggest advantage of MySQL is that it is feature lightweight and that it does not closely adhere to SQL standard. Consequently, MySQL has following nice properties:
  1. MySQL implements a rather subtle but sufficient subset of functions from SQL. Hence, it is easier to learn MySQL thoroughly than PostgreSQL.
  2. When you get stuck with MySQL, you know it must be doable. And after a while you write a query that does what do you want. In PostgreSQL, you just Google for the function. That is not sporty.
  3. It is easier to migrate from MySQL to another database than reversely, because almost each relational database implements the bare minimum implemented in MySQL. As old wisdom says: it is always easier to move from worse to better than reversely.
  4. Tables in MySQL are treated as a matrices, not as a relations (sets), as dictated by SQL standard. Consequently, rows in a table have fixed order and the order of the columns in a table can be altered. That makes the usage of the table more user friendly.
  5. Many of MySQL commands are shorter than PostgreSQL alternatives, although it is frequently at the expense of ability to configure the command. An example is autoincrement column.
  6. MySQL is more forgiving to errors in queries.
On the other end, PostgreSQL, which more closely adhere to SQL standard, has following advantages (some of them are likely more related to the habit of thinking thru the impact of a change than adherence to standards):
  1. PostgreSQL has much more usable default setting than MySQL. I had to reconfigure MySQL 10 times during the first 6 months of deployment, until MySQL was processing all the beasty queries it had. On the other end, PostgreSQL with the same content, on the same hardware and corresponding load was working to our satisfaction for 2 years. Then it become necessary to increase the memory limit from 64MB RAM to something bigger to work reasonably with 100GBs of data... MySQL is still better in this respect than Oracle or SAS. But PostgreSQL leads in this respect.
  2. Error messages in PostgreSQL not only tell you what is wrong, but they also tell you the line where is the error and propose a solution to the problem. PostgreSQL is the nicest database in this respect I have ever worked with. It may look like a minor advantage. And if you just need to run a legacy database, it is. But if you need to develop something new on a database, it makes a heaven from PostgreSQL. Not a hell like SAS.

středa 20. července 2016

The trickery of ranking

Ranking measures in machine learning are extremely useful. But a user should be aware of their  counterintuitive properties (at least I found them startling when I observed them for the first time).


 Area Under Curve

Area Under Curve (AUC) is a measure for evaluation of binary classification accuracy, which is based on ranking of samples based on their probability (or other continuous measure). AUC has a nice property that it does not evaluate a model at one threshold but at all possible thresholds. Hence, AUC is a great measure particularly when you don't know the threshold for which to optimize (for example, because no one told you the cost matrix) or if the threshold may vary in time (one month a marketing may have a budget to spam top 5% customers, another month 10%).

So far good. Now imagine that you get a new source of data, based on which you may enhance your customer behaviour model. The new data are a treasure - they are extremely predictive of the label. But they are scarce - they are available only for 2% of customers. Currently, your model has AUC=0.80. What is the maximal improvement of the AUC we can hope for by adding new data?

In the best scenario we fix predictions for 2% of samples. If we were using classification accuracy and our starting accuracy was 0.80, at best we could move from 0.80 to 0.82. Simple.

But when it comes to AUC, the question is underdetermined. We have to know the ratio of classes. For simplicity, let's for the beginning assume a balanced dataset. Than by fixing n% of the top incorrect predictions, AUC improves by 2n percent points. An example:
    label:                0   0   0   0   0   1   1   1   1   1
    prediction:          .1  .2  .3  .4  .5  .6  .7  .8  .9   0
    improving change:    .1  .2  .3  .4  .5  .6  .7  .8  .9   1
    non-improving change:.1  .1  .3  .4  .5  .6  .7  .8  .9   0
We have 10 samples. And a prediction with 1 probability wrong (in bold). AUC for the prediction is 0.80 (as calculated by MATLAB's perfcurve). But if we are permitted to change just 1 sample and change the most wrong prediction, we can get a perfect AUC of 1.0. That is a great improvement! But of course, just like with the classification accuracy, a change of already accurate predictions may not have any effect as is the case of the last row of the example - AUC remains at 0.80.

Now, let's consider a following unbalanced dataset:
    label:                0   0   0   0   0   0   0   0   0   1
    prediction:          .1  .2  .3  .4  .5  .6  .7  .8  .9   0
    improved prediction: .1  .2  .3  .4  .5  .6  .7  .8  .9   1
In the case of an unbalanced dataset a single change of the probability can mean a difference between a completely wrong model (with AUC=0) to a perfect model (with AUC=1).

Conclusion: an addition of few new data into a model may have an unproportionally high impact on AUC.


 Friedman test

Friedman test is a popular choice for comparison of multiple classifiers. The popularity of Friedman test lies in the fact that ANOVA, a canonical parametric method for comparison of multiple treatments, has multiple assumptions that are not valid on bounded measures like accuracy or AUC (the assumption of normally distributed errors is violated). But Friedman test, which is a nonparametric test, does not have any problem with bounded measures. And even in praxis, Friedman test is for comparison of classifiers more accurate than ANOVA.

Friedman test takes measured responses, converts them to ranks. And then the ranks are averaged. Since Friedman test is based on ranking, it has quirks just like AUC.

For example, if you want to compare bagging to boosting, the outcome will depend on the presence of other models in the comparison. How is that possible? It is because of different characteristics of the models. Bagging generally reliably improves accuracy of a base classifier, although not by much. On the other end, boosting generally improves accuracy by a lot, but in some situations, boosting fails and performs worse than the base classifier.

Hence, if we compare just bagging to boosting, boosting is going to win, because in majority of scenarios, boosting is going to outperform bagging.

But if we include 98 variants of week base classifiers (like tree stumps), bagging is going to win. Why? It is because bagging is always going to be the 1st or 2nd (remember, bagging reliably beats weak classifiers, but is frequently beated by boosting). With 10% failure rate of boosting, the average ranking of bagging is going to be 1.9. But boosting is going to be the 1st or 100th, with the average rank of 10.9. Since the rank for bagging is smaller than for boosting, this times bagging outperforms boosting.

Notably, ANOVA performs consistently in both situations because ANOVA takes into the account the exact amount by how much is each classifier better than another.

Conclusion: in some situations, Friedman test is inconsistent. And you may tamper a design of the experiment to exploit the inconsistency.

The pitfall or ranking

The single most problematic thing on ranking algorithms is treatment of ties. See: Why I lost trust in RapidMiner.

pondělí 30. května 2016

In what data type to store year?

If you are using MySQL, use "year" data type, if possible. The reasons follows:
  1. It takes just 1 byte. Hence it saves storage.
  2. Date arithmetic can be applied without a conversion. Hence, it saves processor time.
  3. The reference point is well defined and unique. Hence, everyone can select the right attribute value for an event without any background knowledge (at least if you are a country that is using Gregorian calendar).
  4. Visualization and analytical tools automatically treat a "year" attribute as a temporal variable. That saves user's time.
But of course, "year" data type has some disadvantages. Namely, it has a narrow range of values. Hence, it can happen that the value you need to store is out of range. An example could be the year when an organization was established. On the other hand this limitation can work in your advantage - if someone types a year with 3 or 5 digits, the database will complain. And the database will be most likely right because the typed year is most likely a type.

čtvrtek 26. května 2016

Is it possible to evaluate a campaign without a control group?

First, this is a problem of so-called "quasi experimental studies". But all these quasi experiments need a time dimension. Can we design an experimental study that does not use a control group or multiple measurements in time?

If we assume additivity of campaigns, all we have to do is to get as many independent equations as is the count of unknown variables. For example, let's imagine we have two campaigns, the first one with response rate A and the second one with the response rate B. Let's also imagine that if we do not expose a user to a campaign, then the response rate of the user is N. And let's imagine that we can combine campaigns, either because each is using a different channel or because each is to a different product. Than we can set up 3 campaigns:

  N+A = 5
  N+B = 7
  N+A+B = 9

The first group of users is exposed to the first campaign and the response rate is 5%. The second group to users is exposed to the second campaign and the response rate is 7%. And the last group of users is exposed to both campaigns and the resulting response rate is 9%.

Since we have 3 independent equations and 3 unknown variables, we can calculate that the first campaign has uplift 9-7 = 2 percent points and that the second campaign has uplift 9-5 = 4 percent points.

Of course, this approach neglects interactions between the campaigns. But we can model interactions as well. In this example, let's consider 3 campaigns with following response rates:

 N+A = 5
 N+B = 7
 N+A+B+AB = 9
 N+C = 8
 N+A+C = 3

where AB represents an interaction of two first two campaigns. We can put the equation into a matrix form:

 matrix = [1 1 0 0 0; 1 0 1 0 0; 1 1 1 1 0; 1 0 0 0 1; 1 1 0 0 1] 
 response = [5; 7; 9; 8; 3] 
 x = [?; ?; ?; ?; ?]

Where x is a vector of the unknown variables. It holds that:

 matrix*x = response

With linear algebra:

 x = inv(matrix) * response

We can get that x = [10; -5; -3; 7; -2].

To get higher confidence in the estimates it is possible to use more independent equations than is the count of unknown variables and fit the unknown variables with least squares estimate (or other method of your choice).

neděle 8. května 2016

Why I lost trust in RapidMiner

I used to like RapidMiner despite many flaws it has. But the discovery of a bug in calculation of ROC and AUC made me furious - if I can't trust the metric that I optimize, how can I trust any model produced by RapidMiner?

To wreak my anger I publish a shame list of bugs in RapidMiner.  

The list of bugs in RapidMiner 7.1 that I am aware of:
  1. The first step in ROC (Receiver Operating Curve) is not correctly drawn - the first step is always horizontal. Sometimes the error is negligible. Sometimes it is the whole difference between a perfect model with AUC=1 and a random model with AUC=0.5. The bug is best visible on binary estimates of the label (i.e. all estimates are either p=0 or  p=1).
  2. AUC (Area Under Curve) is similarly way off (like 0.5 instead of 1.0). The reported value is different from both, the expected AUC and the area under the returned (and flawed) ROC. 
  3. DBI (Davies-Bouldin index) reported by the performance operator is negative. But by the definition it can't be negative.
  4. The returned correlation matrix sometimes contains values out of range (like -67). The error is caused by unstable calculation of variance. Since correlation matrix is in the heart of many algorithms, it is worrisome.
  5. The operator for declaration of missing values causes troubles because the missing values often backpropagate to other branches (it's because the operator is using on-the-fly processing, which is buggy in RapidMiner).
  6. Evolutionary optimization often crashes. Even on toy datasets like Iris.
  7. Weight by Chi Squared Statistic does not work with date attributes while other weighting operators (like Gini or Information Gain Ratio) work.
And finally, one missing feature that bugs me:
  1. Whenever I make a plot and re-run the schema, the setting of the plot is reset.
  2. They removed an "invert" checkbox from dictionary filter in text mining extension.
And of course there are deficiencies. For example, 200 times slower linear regression than the linear regression (lm) in R (measured with turned off feature selection and ridge bias in the both scenarios). EDIT: This deficiency was addressed in RapidMiner 7.2.