The dataset comprises 230 molecules tested for mutagenicity on Salmonella typhimurium. A subset of 188 molecules is learnable using linear regression; this subset was later termed the "regression friendly" dataset. The remaining subset of 42 molecules is named the "regression unfriendly" dataset.
The PKDD'99 Financial dataset contains 606 successful and 76 unsuccessful loans, along with their associated information and transactions.
The East-West challenge (1980) database describes east-bound and west-bound trains.
The IMDb database is a moderately large, real-world database of movies.
The KDD Cup 2001 dataset targets the prediction of gene/protein function and localization.
The PKDD'02 Hepatitis dataset describes 206 instances of Hepatitis B, contrasted against 484 cases of Hepatitis C.
A database from the Predictive Toxicology Evaluation Challenge (1997). The task is to predict whether a compound is carcinogenic.
This dataset lists facts about the Department of Computer Science and Engineering at the University of Washington (UW-CSE), such as entities (e.g., Student, Professor) and their relationships (e.g., AdvisedBy, Publication).
The Predictive Toxicology Challenge (2000) dataset consists of more than three hundred organic molecules labeled according to their carcinogenicity in male and female mice and rats.
This is an older dataset of chemical structures containing 328 compounds labeled by their half-life for aerobic aqueous biodegradation (a regression task).
The Musk database describes molecules occurring in different conformations. Each molecule is either musk or non-musk and one of the conformations determines this property. Such a problem is known as a multiple-instance problem, and is modeled by two tables molecule and conformation, joined by a one-
The task is to predict whether a given molecule is carcinogenic. The dataset contains 182 positive carcinogenicity tests and 148 negative tests.