pátek 11. srpna 2017

A proper handling of nulls

I am familiar with two approaches how to deal with nulls in programming languages: either completely avoid them or embrace them. Languages for logic programming generally avoid the nulls (a great apologetic is given by Design and Implementation of the LogicBlox System). But languages that permit them should provide following facilities:

1) meta information about why the value is missing
2) nullable & non-nullable variables

For example, SAS got it right 40 years ago. In SAS, a missing value is represented with a dot. That by itself is not great whenever you need to print out the code or the data, because you never know whether that dot really represents a missing variable or it is just an imperfection of the paper or the printer. But it permits to easily define the reason why the value is missing:
    .                 // Generic missing value
    .refusedToAnswer  // Missing value with a metadatum
    .didntKnow        // Missing value with a different metadatum
Hence, generic algorithms can threat all missing values the same way. But if you want to treat them differently, for example because refusedToAnswer can have a vastly different meaning in a questionary than didntKnow, you can do it.

Furthermore, SAS provides optional non-null constraints on attributes, just like SQL. The only ward on SAS's implementation is that it raises exceptions only during the runtime, not during the compilation time as, for example, Kotlin does.

pondělí 31. července 2017

International Software Testing Contest - experience

Why I write about the contest: Since I am one of the winners of the first ISTC competition held in 2017, it is in my best interest to promote this competition in order to make famous.

The assignment description: The contest consisted of writing tests for two Java projects: Naive Bayes classifier and Prim's algorithm for calculation of minimum spanning trees. A copy of the assignment: download.

Strategy: In 2017, we were evaluated only based on the branch coverage (line coverage was ignored contrary to what was written in the invitation) and the mutation score. Only in the case of a tie, the count of test cases would have be taken into the consideration. Since no tie happened in 2017, the recommended strategy for the next years (under the assumption that the rules do not change) is simple: maximize branch coverage even at the expense of the count of test cases.

Furthermore, to maximize the mutation score you have to use asserts in your tests. My strategy was: print the result of the method into a console:
And use the printed output in the assert:
    assertEquals("textOutput".trim(), someMethod().trim());
I used trim methods because I did not want to deal with the errors caused by wrongly copy-pasting too many/too few white space characters. Is it a test strategy I would use outside of the contest? No way, because the toString() format can change any time. It may not catch all deviations. And not all objects have to implement toString(). But at ISTC 2017 it worked reasonably well.

Ideas: Prepare a script/code that executes all methods in a program in a pseudo-random order, stores the outputs of the methods and writes an appropriate unit test which uses the stored outputs into asserts. Most likely, there are already tools for that (something from a field of data-driven-testing or mocking?) but I am not familiar with them.

Mutation testing: To measure mutation score you may use PITest. If you use IDEA, a nice plugin providing integration of PITest into IDEA is Zester. Following mutators were used to calculate the mutation score:
  1. Return Values Mutator
  2. Negate Conditionals Mutator
  3. Void Method Call Mutator
  4. Conditionals Boundary Mutator
  5. Increments Mutator
Warning: The branch coverage and the mutation score that you obtain from your favourite tools may not 100% agree with the scores reported by MoocTest - the tool used to evaluate the these two metrics in the competition. Hence, if you can, train yourself against the framework used at the contest.

Another trickery I run into is that JDK7 was required. But because I also had JDK6 and JDK8 installed on my computer, I run into unpleasant clashes during the competition that has cost me 7 minutes of debugging. If you can, have just a single JDK installed on your computer.

Finally, the source codes were in their default packages and that was interfering with PITest. And I didn't identify the root of the problem until the end of the competition. Hence, if you design an intricate plan how to win the competition, prepare also a simple fallback plan.

pátek 23. června 2017

Each database has its own charm, it only takes a while to discover it

Microsoft SQL Server
This is a tough one. But this was the only database that warned me about data corruption, when the SSD disk started to fail. While other databases (Oracle, PostgreSQL and MySQL) were silently returning corrupted data. To be fair to other databases, if nothing else, you can [[https://www.postgresql.org/docs/9.5/static/wal-reliability.html|configure]] PostgreSQL to be more resistant to silent errors.

You can relly on the order of the data in the tables, i.e. the tables are not sets, as dictated by relational algebra, but lists! That makes working with the database more intuitive. Of course, some other databases have this property as well (Oracle, MSSQL, SAS) but other not (PostgreSQL, Teradata, Netteza) - generally, all distributed databases use sets. Another nice property is that you can change the order of the columns in a table any time you want to (in PostgreSQL, for example, you can only append new columns at the end of the tables).

PostgreSQL has the nicest installer on OsX I have ever seen for a database. It's just drag and drop like any other normal app. And after starting the database it tells you the connection parameters. And that's it! No configuration needed! In comparison to installation of SAS or full blooded Oracle, it is Heaven versus Hell. Also, PostgreSQL does not need any configuration fine tuning to be usable. Once, I installed MySQL and PostgreSQL at the same server and mirrored their content. While PostgreSQL worked without any touch for a year, I had to change the configuration of MySQL multiple times, because some default limit (always different one) was too tight.

Hand down Oracle has the best execution planner I have ever used. Plus, it has provides a rich set of commands.

The best part of SAS's SQL is that it allows you to use some of data step conventions in the SQL. Do you need to limit the count of read (not outputted!) rows from a table? No problem, just use inobs parameter! Or is the SQL too inconvenient for your task? Just use SAS code!

sobota 25. února 2017

TPC benchmarks

Since I repeatedly stumble upon the problem how to generate the benchmark databases, here are the references:
  1. HammerDB (for TPC-C and TPC-H databases).
  2. Benchmark Factory (for TPC-C, TPC-D, TPC-E, TPC-H and ASP3AP databases).

středa 22. února 2017


Sometimes I find LaTeX to be frustrating because of the following reasons:
  1. It is difficult to parse TeX code. This occasionally leads to unclear error messages and inconsistent treatment of white space characters.
  2. It is sensitive to the order, in which packages are loaded. If you are loading more than a few packages, finding the correct order can be a non-trivial problem.
  3. Inconsistent quality of typesetting. While TeX is prized for typography, kerning by default is not applied on equations.
For further discussion of the topic see 25 Years of TEX and METAFONT: Looking Back and Looking Forward.

čtvrtek 10. listopadu 2016

Information Gain Ratio vs. Gini vs. Chi2

Information Gain Ratio (IGR), Gini index and Chi2 are among the most popular univariate feature selection methods for classification (and decision tree construction). But just as no free lunch theorems were formulated for optimization and classification, a no free lunch theorem for feature selection could be formulated - a single method may not be averagely better than other methods over all datasets.

If we compare feature selection methods on a single dataset by the accuracy on the subsequent classifier, we generally find out that either IGR or Chi2 is the best, while Gini is (almost) always the second:
What is intriguing is the fact that both, IGR and Chi2 sometimes fail terribly. And that Gini generally lacks behind the best method just a bit. Hence, if we calculate accuracy of the feature selection methods over many (real-world) datasets, we find out that Gini is, on average, the best method.

Recommendation: On text related datasets, Chi2 generally excells. On datasets with id-like attributes (attributes with very high cardinality), IGR generally excels, because IGR, in comparison to Chi2, penalizes attributes with high cardinality. If you want to get a reasonable model on the first shot regardless of the data, use Gini (assuming real world datasets since we can always craft datasets to foul the feature selection methods).

pátek 4. listopadu 2016

How to write an agnostic Java application for SQL databases

Whenever we want to talk to a relational database, we generally have two options: use native connection or use some standard API like Open Database Connectivity (ODBC).

If we aim to support multiple databases, it is easier to use a single standard API than to deal with multiple native protocols. There are three well known APIs (and some less known) for communication with SQL databases: ODBC, JDBC and OLE DB. OLE DB was deprecated by its author, Microsoft (e.g. http://www.simba.com). Hence, it is possibly not smart to start new projects with this technology. Consequently, we are left with the decision between ODBC and JDBC. JDBC was developed for Java as a (better) replacement of ODBC. Hence, if you are developing in Java, the choice is clear - pick JDBC. If you are developing in something else (e.g. C), you don't have a choice but to pick ODBC.

Following paragraphs describe useful tricks to know about JDBC. First, forget about information_schema (PostgreSQL, MySQL,...) or dictionary tables (Oracle) to collect metadata about objects (tables, columns,...) in the database. Instead of that use DatabaseMetaData interface in JDBC because only this way works reliably and without any vendor specific code over all databases that have a JDBC driver.

Second, to deal with different data types supported by the databases, use JDBC data types. That way the JDBC driver takes care of the conversions, not you.

Third, limit yourself to a subset of SQL-92 that is supported across multiple databases. A good list of supported functions is entry conformance of SQL-92. If this list is too narrow, use ODBC Escape Sequences - they are fully supported in JDBC. Just be aware of the fact that escape sequences do not generally add any new functionality into databases - they just translate vendor's function names (and parameters) into a standardized format. Hence, if a database does not provide the functionality natively, it is unlikely it will provide the functionality over the escape sequences. Consequently, only a subset of escape sequences are save to use.

Fourth, named entity quoting and literal quoting is specific to each database. You may extract the quotes from metaData in JDBC (e.g. getIdentifierQuoteString()).

Fifth, even if you try hard, situations arise, when you want to do things vendor specific way. To deal with these scenarios, have a single class, which implements the default solution (e.g. "Ansi"), and have vendor specific classes that inherit from it (e.g. "Mysql", "Postgre",...). That way, you can switch from vendor agnostic code, which is deemed to be too slow, to vendor specific code any time. Also, it gives you the ability to work-around bugs in implementations of JDBC drivers.

Unfortunately, there are some aspects of SQL for which I didn't find an a good enough agnostic solution. For example, "limit" clause is not agnostic because it does not work, for example, in SQL Server, Oracle or SAS. The ability to limit the size of the ResultSet in JDBC is nice, but if you need to write the result into a table in the database, properties of the ResultSet are not applicable. If you know of a nice agnostic way how to "limit", let me know.