The traffic accident database covers all accidents that occurred in Ljubljana, the capital of Slovenia, between 1995 and 2005.
Adventure Works 2014 (OLTP version) is a sample database for Microsoft SQL Server that replaced the Northwind and Pubs sample databases shipped earlier. The database models a fictitious multinational bicycle manufacturer called Adventure Works Cycles.
Airline on-time data are reported each month to the U.S. Department of Transportation (DOT), Bureau of Transportation Statistics (BTS) by the 16 U.S. air carriers that have at least 1 percent of total domestic scheduled-service passenger revenues, plus two other carrie…
STULONG is a 20-year longitudinal primary-prevention study of middle-aged men. The study aims to identify the prevalence of atherosclerosis risk factors (RFs) in a population generally considered to be the most endangered by possible atherosclerosis com…
The task is to predict the outcomes of every match in the 2015 AFL season.
The task is to predict the rank of teams.
The task is to predict whether a team plays in the playoffs.
This is an older data set of chemical structures containing 328 compounds labeled by their half-life for aerobic aqueous biodegradation (a regression task).
The task is to predict whether a given molecule is carcinogenic. The dataset contains 182 positive carcinogenicity tests and 148 negative tests.
Transactional data from a Czech debit card company specialising in payments at petrol pumps.
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding wo…
The task is to predict "Forest area (% of land area)" for 247 countries in 2012 based on the previous values.
A slightly more complex artificial database with loops.
The employees test database: small, fake database of employees.
This dataset consists of 'circles' (or 'friends lists') from Facebook.
The PKDD'99 Financial dataset contains 606 successful and 76 unsuccessful loans, along with their details and transactions.
Anonymised data from a hospital in Hradec Kralove about treatment and medication.
PAKDD'15 Data Mining Competition: The task is to reconstruct users' gender from product viewing logs. The data were obtained from simulations of product viewing activities of users with known gender. The data closely follow the real-life distribut…
Data on deputies and senators in the Czech Republic.
KDD Cup 2001 prediction of gene/protein function and localization.
The PKDD'02 Hepatitis dataset describes 206 instances of Hepatitis B, contrasted against 484 cases of Hepatitis C.
The Hockey Database follows the same general design as the Lahman Baseball Database. In addition to the NHL, the Hockey DB covers the following early and alternative leagues: NHA, PCHA, WCHL and WHA. It contains individual and team statistics from 1909-10 through the 2…
The IMDb database: moderately large, real database of movies.
MovieLens data set from the UC Irvine machine learning repository.
Lahman’s baseball database contains complete batting and pitching statistics from 1871 to 2014, plus fielding statistics, standings, team stats, managerial records, post-season data, and more.
Bulgarian court decision metadata.
The PKDD'99 Medical dataset describes 41 patients with thrombosis.
This domain is about finite element methods in engineering. The task is to predict how many elements should be used to model each edge of a structure. The target predicate is mesh(Edge,Number) where the Number of elements in the Mesh model can vary between 1 and 17.
A geography dataset from the University of Göttingen that describes 114 Christian countries and 71 non-Christian countries.
The dataset describes a family composed of 86 people across 5 generations. The family dataset includes 744 positive instances and 1488 randomly generated negative instances.
A sample database from the Alchemy website.
2015 NCAA Basketball Tournament.
The Northwind database contains the sales data for a fictitious company called Northwind Traders, which imports and exports specialty foods from around the world.
A database with information about football matches from the UK Premier League. It lists players, teams, and matches, with action counts for each player.
A database from the Predictive Toxicology Evaluation Challenge (1997). The task is to predict whether a compound is carcinogenic.
A database of restaurants in San Francisco. The goal is to predict customer satisfaction.
The venerable sakila test database: small, fake database of movies.
A simple artificial database with a star schema.
You are a member of the Sales Management team in a large retail bank. The current date is July 02, 2007. Your Sales Director has just asked you to generate additional revenues of $1,500,000 before September 01, 2007. You must find ways to sell more "Credit++" – the ne…
The task is to diagnose power-supply failures in a communications satellite.
Seznam.cz is a web portal and search engine in the Czech Republic. The data are from Seznam's "wallet".
An anonymized dump of all user-contributed content on the Stats Stack Exchange network.
The Predictive Toxicology Challenge (2000) dataset consists of more than three hundred organic molecules labelled according to their carcinogenicity in male and female mice and rats.
TPC-C is the benchmark published by the Transaction Processing Performance Council (TPC) for Online Transaction Processing (OLTP).
Visual Genome is a dataset, a knowledge base, an ongoing effort to connect structured image concepts to language.
The VOC database provides a peephole view into the administrative system of an early multinational company, the Vereenigde geoctrooieerde Oostindische Compagnie (VOC for short, the Dutch East India Company), established on March 20, 1602.
Walmart challenges participants to accurately predict the sales of 111 potentially weather-sensitive products (like umbrellas, bread, and milk) around the time of major weather events at 45 of their retail locations.
The WebKB dataset consists of 877 scientific publications classified into one of five classes. The citation network consists of 1608 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding wor…
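
The Cora and WebKB entries both describe each publication by a 0/1-valued word vector and connect publications through citation links. A minimal sketch of that representation follows; the vocabulary, publication identifiers, and links are made-up toy values for illustration, not data from either dataset, and the function name `binary_word_vector` is hypothetical.

```python
from typing import Dict, List, Set, Tuple


def binary_word_vector(words: Set[str], vocabulary: List[str]) -> List[int]:
    """Return a 0/1 vector: 1 if the vocabulary term occurs in the publication, else 0."""
    return [1 if term in words else 0 for term in vocabulary]


# Toy vocabulary and publications (hypothetical, for illustration only).
vocabulary: List[str] = ["graph", "learning", "network", "query"]
publications: Dict[str, Set[str]] = {
    "p1": {"graph", "network"},
    "p2": {"learning"},
    "p3": {"graph", "learning", "query"},
}

# Citation links stored as (citing, cited) pairs; the direction is an assumption.
citations: List[Tuple[str, str]] = [("p2", "p1"), ("p3", "p1"), ("p3", "p2")]

# One 0/1 word vector per publication, aligned with the vocabulary order.
vectors: Dict[str, List[int]] = {
    pub_id: binary_word_vector(words, vocabulary)
    for pub_id, words in publications.items()
}

# Count how often each publication is cited.
cited_counts: Dict[str, int] = {pub_id: 0 for pub_id in publications}
for _citing, cited in citations:
    cited_counts[cited] += 1
```

In a relational layout, the vectors would typically be stored as (publication, word) rows and the links as a separate two-column table, which is why such datasets are useful for testing multi-table learning.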