Why are the datasets not stored in CSV files?
Because CSV files don't store information about data types, primary keys (PK), foreign keys (FK), indexes and constraints.
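As a small illustration of the data type point, here is a sketch in Python using only the standard library. The column names and values (order_id, placed_on, amount) are made up for this example and do not come from any dataset in the repository. It writes one typed row to CSV, reads it back, and shows that every value returns as a plain string.

```python
import csv
import io
from datetime import date

# Hypothetical row: an integer key, a date and a decimal amount.
rows = [{"order_id": 1, "placed_on": date(2024, 1, 31), "amount": 19.99}]

# Write the row to an in-memory CSV "file".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["order_id", "placed_on", "amount"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the CSV carries no type information, so every value is a str.
buffer.seek(0)
restored = list(csv.DictReader(buffer))

print(restored[0])
print({name: type(value).__name__ for name, value in restored[0].items()})
```

Nothing in the file says that order_id is a key or that placed_on is a date, so a consumer has to guess. A database schema records this explicitly.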
Why a MySQL database?
Because in combination with ClowdFlows you can process the datasets online.
Just open one of the public workflows (like Wordification or Cross-validation), change the credentials in the "MySQL Connect" operator to the credentials from the repository, and you are ready to go!
What should I do if I want the datasets in an ILP format?
See a collection of datasets at ILPnet2.
Why do the datasets contain missing values/composite keys/strange data types/any other ugly thing you may think of?
Because they are also present in real-world datasets.
What's the point of including artificial datasets?
While datasets like Adventure Works may not contain any pattern that could be found during modeling, they still increase the diversity of the repository. For example, Adventure Works has the highest table count in the whole repository.
If your algorithm can process all the tables present in Adventure Works, it may be able to process real-world datasets.