To support the growth of relational machine learning.
How to cite
Cite this article.
- Why are the datasets not stored in CSV files?
- Because CSV files do not store data types, primary keys (PKs), foreign keys (FKs), or other constraints.
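The loss of type information can be seen in a minimal sketch using Python's standard `csv` module. The table below (`id`, `score`, `parent_id`) is a hypothetical example and not taken from any dataset in the repository:

```python
import csv
import io


def csv_round_trip(rows):
    """Write rows to CSV text and read them back, as a CSV consumer would."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    buffer.seek(0)
    return list(csv.DictReader(buffer))


# Hypothetical table: an integer key, a float, and a nullable reference to id.
original = [
    {"id": 1, "score": 0.5, "parent_id": None},
    {"id": 2, "score": 1.25, "parent_id": 1},
]

restored = csv_round_trip(original)

# Every value comes back as a string, NULL becomes an empty string,
# and nothing records that parent_id refers to id.
print(restored[0])
```

After the round trip, the reader has to guess the types and the key relationships, which is what a database schema avoids.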
- Why a MariaDB database?
- Because in combination with ClowdFlows you can process the datasets online.
Just open one of the public workflows (like Wordification or Cross-validation), change the credentials in the "MySQL Connect" operator to those from the repository, and you are ready to go!
- Why am I not able to connect to the database?
- If you are connecting to the database over a corporate network, a corporate firewall could be the culprit (it may block port 3306).
Try to access the database through a different internet provider (e.g. your cellular provider).
Also, keep in mind that database names are case sensitive. Database "mutagenesis" is not the same database as "Mutagenesis".
If the problems persist, contact us and provide us with the following information:
- Your database client and its version (e.g. MySQL Workbench 6.3.10).
- The database name you tried to connect to (e.g. mutagenesis).
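Before contacting us, a quick way to test the firewall hypothesis is to check whether a TCP connection to port 3306 can be opened at all. The following is a minimal sketch; the function name `can_reach` and the host `db.example.org` are placeholders, so substitute the host name given with the repository credentials:

```python
import socket


def can_reach(host, port=3306, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


# Usage (placeholder host, replace with the real one):
# print(can_reach("db.example.org"))
```

If this returns `False` on the corporate network but `True` over a cellular connection, the firewall is the likely cause. A `True` result only shows that the port is reachable; it does not verify the credentials or the database name.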
- Why does MySQL Workbench complain about an incompatible/nonstandard server version?
- We use MariaDB, an open-source fork of MySQL, hence the warning. For everything that the public account permits, it is safe to ignore the message.
- Why can't mysqldump find COLUMN_STATISTICS in information_schema?
- MariaDB keeps the corresponding table in mysql.column_stats. Use one of the workarounds.
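One commonly used workaround, which may or may not be among those meant above, is to tell the MySQL 8 `mysqldump` client not to query column statistics by passing `--column-statistics=0`. The sketch below only builds the command line; the helper name `build_dump_command`, the host, and the user are placeholders, and the flag is assumed to be supported by your client (MariaDB's own `mysqldump` does not need it and may reject it):

```python
import shlex


def build_dump_command(host, user, database):
    """Build a mysqldump command that skips the COLUMN_STATISTICS query."""
    return [
        "mysqldump",
        "--column-statistics=0",  # MySQL 8 client option; an assumption about your client
        f"--host={host}",
        f"--user={user}",
        "--password",  # no value given, so mysqldump prompts for it
        database,
    ]


# Placeholder host and user; "mutagenesis" is the example database name used above.
command = build_dump_command("db.example.org", "guest_user", "mutagenesis")
print(shlex.join(command))
```

The printed line can be pasted into a terminal, or the list can be passed to `subprocess.run` directly.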
- What if I want the data in an ILP format?
- See a collection of datasets at ILPnet2.
Or use a conversion tool, where you have to change the connection parameters in
- Why do the datasets contain missing values/composite keys/strange data types/any other ugly thing you may think of?
- Because these things also occur in real-world datasets.
- What is the point of including artificial datasets?
- While datasets like Adventure Works may not contain any pattern that could be found during modeling, they still increase the diversity of the repository. For example, Adventure Works has the highest table count in the whole repository.
If your algorithm can process all the tables present in Adventure Works, it may be able to process real-world datasets.