The traffic accident database consists of all accidents that occurred in Ljubljana, Slovenia’s capital city, between 1995 and 2005.
Airline on-time data are reported each month to the U.S. Department of Transportation (DOT), Bureau of Transportation Statistics (BTS) by the 16 U.S. air carriers that have at least 1 percent of total domestic scheduled-service passenger revenues, plus two other carriers that report voluntarily. The
STULONG is a 20-year longitudinal primary prevention study of middle-aged men. The study aims to identify the prevalence of atherosclerosis risk factors (RFs) in a population generally considered to be the most endangered by possible atherosclerosis complications, i.e., middle-aged m
The task is to predict the outcomes of every match in the 2015 AFL season.
The task is to predict whether a team makes the playoffs.
Evaluation of patients for liver disorders.
The task is to predict whether a given molecule is carcinogenic. The dataset contains 182 positive carcinogenicity tests and 148 negative tests.
The goal is to predict the outcome of a match.
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dic
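The 0/1-valued word vector mentioned in the Cora description can be illustrated with a minimal Python sketch. The function name `binary_word_vector` and the four-word dictionary are hypothetical, chosen only for illustration; they are not taken from the dataset itself.

```python
def binary_word_vector(words, dictionary):
    """Return a 0/1 vector: 1 if the dictionary word occurs in the publication, else 0."""
    present = set(words)  # repeated words collapse; only presence matters
    return [1 if term in present else 0 for term in dictionary]


# Hypothetical toy dictionary and publication text (not real Cora data).
dictionary = ["graph", "learning", "network", "neural"]
vec = binary_word_vector(["neural", "network", "network"], dictionary)
print(vec)
```

Each position in the vector corresponds to one dictionary word, so every publication gets a vector of the same length regardless of how many words it contains.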
A slightly more complex artificial database with loops.
Artificial data from a Czech bank.
The set of positive examples consists of all sentences of up to seven words that can be generated by the DCG in Bratko's book (565 positive examples). The set of negative examples was generated by randomly selecting one word in each positive example and replacing it by a randomly selected word the le
Dunur is a relationship between two people by marriage: A is dunur of B if a child of A is married to a child of B.
Elti is a relationship between two people by marriage: A is elti of B if A's husband is a brother of B's husband.
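The dunur and elti definitions above can be written as simple rules over base relations. The following is a minimal Python sketch, assuming hypothetical input structures (`children`, `married`, `husband`, `brothers`) and made-up names; the actual datasets store these facts in their own tables.

```python
from itertools import product


def dunur_pairs(children, married):
    """A is dunur of B if a child of A is married to a child of B.

    children: dict mapping parent -> set of children (hypothetical layout)
    married:  set of frozenset({x, y}) for each married couple
    """
    pairs = set()
    for a, b in product(children, repeat=2):
        if a == b:
            continue
        if any(frozenset((ca, cb)) in married
               for ca in children[a] for cb in children[b]):
            pairs.add((a, b))
    return pairs


def elti_pairs(husband, brothers):
    """A is elti of B if A's husband is a brother of B's husband.

    husband:  dict mapping wife -> husband (hypothetical layout)
    brothers: set of frozenset({x, y}) for each pair of brothers
    """
    pairs = set()
    for a, b in product(husband, repeat=2):
        if a != b and frozenset((husband[a], husband[b])) in brothers:
            pairs.add((a, b))
    return pairs


# Toy facts, invented for illustration only.
children = {"Ayse": {"Can"}, "Fatma": {"Elif"}, "Zehra": {"Mert"}}
married = {frozenset(("Can", "Elif"))}
dunur = dunur_pairs(children, married)

husband = {"Elif": "Can", "Selin": "Cem", "Derya": "Ali"}
brothers = {frozenset(("Can", "Cem"))}
elti = elti_pairs(husband, brothers)
```

Both relations come out symmetric here because marriage and brotherhood are stored as unordered pairs.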
This dataset consists of 'circles' (or 'friends lists') from Facebook.
The PKDD'99 Financial dataset contains 606 successful and 76 unsuccessful loans along with their information and transactions.
PAKDD'15 Data Mining Competition: The task is to reconstruct users’ gender from product viewing logs. The data were obtained from simulations of product viewing activities of users with known gender. The data closely follow the real-life distribution in that regard.
Data on deputies and senators in the Czech Republic.
KDD Cup 2001 prediction of gene/protein function and localization.
PKDD'02 Hepatitis dataset describes 206 instances of Hepatitis B (contrasting them against 484 cases of Hepatitis C).
The Hockey Database follows the same general design as the Lahman Baseball Database. In addition to the NHL, the Hockey DB covers the following early and alternative leagues: NHA, PCHA, WCHL and WHA. It contains individual and team statistics from 1909-10 through the 2011-12 season.
The IMDb database: moderately large, real database of movies.
MovieLens data set from the UC Irvine machine learning repository.
The task is to identify whether the position of two kings and a rook on a chessboard is legal or illegal.
Bulgarian court decision metadata.
PKDD'99 Medical dataset describes 41 patients with Thrombosis.
A geography dataset from the University of Göttingen that describes 114 Christian countries and 71 non-Christian countries.
The dataset describes a family composed of 86 people across 5 generations. The family dataset includes 744 positive instances and 1488 randomly generated negative instances.
The Musk database describes molecules occurring in different conformations. Each molecule is either musk or non-musk and one of the conformations determines this property. Such a problem is known as a multiple-instance problem, and is modeled by two tables molecule and conformation, joined by a one-
The dataset comprises 230 molecules trialed for mutagenicity on Salmonella typhimurium. A subset of 188 molecules is learnable using linear regression. This subset was later termed the "regression friendly" dataset. The remaining subset of 42 molecules is named the "regression unfriendly" dataset
A sample database from the Alchemy website.
A database with information about basketball matches from the National Basketball Association. Lists Players, Teams, and matches with action counts for each player.
2015 NCAA Basketball Tournament.
The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix.
A database with information about football matches from the UK Premier League. Lists Players, Teams, and matches with action counts for each player.
A database from the Predictive Toxicology Evaluation Challenge (1997). The task is to predict whether a compound is carcinogenic.
The task is to predict whether two given people are from the same generation.
You are a member of the Sales Management team in a large retail bank. The current date is July 02, 2007. Your Sales Director has just asked you to generate additional revenues of $1,500,000 before September 01, 2007. You must find ways to sell more "Credit++" – the new product of consumer credit t
The task is to diagnose power-supply failures in a communications satellite.
Student Loan contains data about students' enrollment and employment status; the aim is to find rules that define a student's obligation to repay a loan.
Predictive Toxicology Challenge (2000) consists of more than three hundred organic molecules labeled according to their carcinogenicity in male and female mice and rats.
TPC-C is the benchmark published by the Transaction Processing Performance Council (TPC) for Online Transaction Processing (OLTP).
TPC-D represents a broad range of decision support (DS) applications that require complex, long-running queries against large, complex data structures.
TPC-DS is the new decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. Although the underlying business model of TPC-DS is a retail product supplier, the database schema, data population, queries, data maint
The East-West challenge (1980) database describes east-bound and west-bound trains.
An artificial database from Simon Fraser University describing students, professors and courses.
The task is to learn rules that identify the legal states of the U-tube dynamical system.
This dataset lists facts about the Department of Computer Science and Engineering at the University of Washington (UW-CSE), such as entities (e.g., Student, Professor) and their relationships (e.g., AdvisedBy, Publication).
Visual Genome is a dataset, a knowledge base, and an ongoing effort to connect structured image concepts to language.
The VOC database provides a peephole view into the administrative system of an early multinational company, the Vereenigde geoctrooieerde Oostindische Compagnie (VOC for short; the Dutch East India Company), established on March 20, 1602.
The WebKB dataset consists of 877 scientific publications classified into one of five classes. The citation network consists of 1608 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dict
A database of 239 states and their cities.