The traffic accident database consists of all accidents that happened in Slovenia’s capital city, Ljubljana, between 1995 and 2005.
STULONG is a 20-year longitudinal primary prevention study of middle-aged men. The study aims to identify the prevalence of atherosclerosis risk factors (RFs) in a population generally considered to be the most endangered by possible atherosclerosis com…
Evaluation of patients for liver disorders.
Transactional data from a Czech debit card company specialising in payments at petrol pumps.
A database containing geospatial information, as well as SAT average scores and Free-or-Reduced-Price Meal eligibility data, for California schools.
The goal is to predict the outcome of a match.
The schema is for Classic Models, a retailer of scale models of classic cars. The database contains typical business data such as customers, orders, order line items, products and so on.
The Consumer Expenditure Survey (CE) collects data on expenditures, income, and demographics in the United States. The public-use microdata (PUMD) files provide this information for individual respondents without any information that could identify respondents. PUMD fi…
The task is to predict "Forest area (% of land area)" for 247 countries in 2012 based on the previous values.
Craft beers labeled by styles and composition. A separate dataset lists breweries by state.
Officer-involved shootings as disclosed by the Dallas Police Department. Includes separate tables for officer and subject/suspect information.
The set of positive examples consists of all sentences of up to seven words that can be generated by the DCG in Bratko's book (565 positive examples). The set of negative examples was generated by randomly selecting one word in each positive example and replacing it by …
Anonymised data from a hospital in Hradec Kralove, Czech Republic, about treatment and medication.
PAKDD'15 Data Mining Competition: The task is to reconstruct a user’s gender from product viewing logs. The data were obtained from simulations of product viewing activities of users with known gender. The data closely follow the real-life distribut…
The GO Sales dataset from IBM contains information about daily sales, methods, retailers, and products of a fictitious outdoor equipment retail chain, “Great Outdoors” (GO). The task is to predict sale quantity.
This dataset includes funding grants from the National Science Foundation. The task is to predict the award amount.
The PKDD'02 Hepatitis dataset describes 206 instances of Hepatitis B (contrasted against 484 cases of Hepatitis C).
The IMDb database: a moderately large, real database of movies.
MovieLens data set from the UC Irvine machine learning repository.
The task is to identify whether the position of two kings and a rook on a chessboard is legal or illegal.
The PKDD'99 Medical dataset describes 41 patients with thrombosis.
The Musk database describes molecules occurring in different conformations. Each molecule is either musk or non-musk and one of the conformations determines this property. Such a problem is known as a multiple-instance problem, and is modeled by two tables molecule and…
The Northwind database contains the sales data for a fictitious company called Northwind Traders, which imports and exports specialty foods from around the world.
The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix.
The pubs sample database is modeled after a book publishing company.
A pyrimidine QSAR dataset. The goal is to predict the inhibition of dihydrofolate reductase by pyrimidines.
A simple artificial database in star schema.
You are a member of the Sales Management team in a large retail bank. The current date is July 02, 2007. Your Sales Director has just asked you to generate additional revenues of $1,500,000 before September 01, 2007. You must find ways to sell more "Credit++" – the ne…
Seznam.cz is a web portal and search engine in the Czech Republic. The data represent online advertisement expenditures from Seznam's "wallet". Table description: client: location and domain field of the client (anonymized); dobito: prepaid into a wallet in Czech cur…
The San Francisco Dept. of Public Health’s database of eateries, inspections of those eateries, and violations found during the inspections. The task is to predict the unscheduled inspection scores from 2013 to 2016. The scores range from 1 to 100, where 100 means that…
The Open Source Shakespeare is a collection of Shakespeare's complete works. This is a much more interesting data set than some boring imaginary online retailer. In this dataset, people die! The task is to predict which character speaks the lines.
Student Loan contains data about students' enrollment and employment status; the aim is to find rules that define a student's obligation to pay back their loan.
Predictive Toxicology Challenge (2000) consists of more than three hundred organic molecules labeled according to their carcinogenicity in male and female mice and rats.
The East-West challenge (1980) database describes east-bound and west-bound trains.
A pyrimidine QSAR dataset. The goal is to predict the inhibition of dihydrofolate reductase by pyrimidines.
An artificial database from Simon Fraser University describing students, professors and courses.
The task is to learn rules that identify the legal states of the U-tube dynamical system.
The VOC database provides a peephole view into the administrative system of an early multinational company, the Vereenigde geoctrooieerde Oostindische Compagnie (VOC for short; the Dutch East India Company), established on March 20, 1602.
Walmart challenges participants to accurately predict the sales of 111 potentially weather-sensitive products (like umbrellas, bread, and milk) around the time of major weather events at 45 of their retail locations.
A database of 239 states and their cities.