Factorization of relational data: benchmark datasets

This page lists widely used benchmark datasets for the matrix factorization problem. All datasets are available in the MATLAB format (.mat file), which can be opened in any version of MATLAB or GNU Octave (e.g. via the load command). Boolean datasets (number of values in the scale of grades = 2) are stored as the logical type; all other datasets are stored as the double type. If you use any of the datasets, cite the sources listed for that dataset rather than this page.
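For readers who work outside MATLAB or Octave, the following is a minimal Python sketch of reading one of the .mat files and checking it against the dimensions listed below. It assumes SciPy is installed and uses its loadmat function. The helper names summarize and load_matrices are hypothetical, and since the variable names stored inside each file are not documented on this page, the sketch simply returns every variable it finds.

```python
def summarize(matrix):
    """Return (objects, attributes, distinct values) for a matrix given as a list of rows."""
    objects = len(matrix)
    attributes = len(matrix[0]) if objects else 0
    distinct = {value for row in matrix for value in row}
    return objects, attributes, len(distinct)


def load_matrices(path):
    """Read every variable stored in a .mat file (assumption: SciPy is installed)."""
    from scipy.io import loadmat  # imported lazily so summarize() works without SciPy

    contents = loadmat(path)
    # loadmat adds metadata keys such as "__header__"; skip them.
    return {name: value.tolist() for name, value in contents.items()
            if not name.startswith("__")}


if __name__ == "__main__":
    # Toy Boolean matrix: 2 objects, 3 attributes, 2 distinct values.
    toy = [[True, False, True], [False, False, True]]
    print(summarize(toy))  # (2, 3, 2)
```

For a downloaded file, calling summarize on each matrix returned by load_matrices should reproduce the listed numbers of objects and attributes. The count of distinct values can be smaller than the listed number of values in the scale of grades if a dataset does not use every grade. Note also that loadmat does not read files saved in the MATLAB v7.3 format; whether any file here uses that format is not stated on this page.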

Boolean data

Accidents

dimension: 67557 objects, 130 attributes, number of values in the scale of grades: 2

Description and proper citation can be found here.

download

Adult

dimension: 20 objects, 10 attributes, number of values in the scale of grades: 2

Description can be found here.

Lichman, M: UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science, 2013.

download

Advertisement

dimension: 3279 objects, 1557 attributes, number of values in the scale of grades: 2

The Advertisement data describes a collection of possible advertisements (objects) on the Internet. The attributes represent the geometry of the images as well as the textual information in the advertisement. More details can be found here.

Lichman, M: UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science, 2013.

download

Americas large

dimension: 3485 objects, 10127 attributes, number of values in the scale of grades: 2

The Americas large dataset represents network access control rules used at Hewlett-Packard to control the connectivity of external business partners.

Ene, A., Horne, W., Milosavljevic, N., Rao, P., Schreiber, R., Tarjan, R. E.: Fast exact and heuristic methods for role minimization problems. In Proceedings of the 13th ACM symposium on Access control models and technologies (SACMAT '08). ACM, New York, NY, USA, 1–10, 2008.

download

Americas small

dimension: 3477 objects, 1587 attributes, number of values in the scale of grades: 2

The Americas small dataset represents network access control rules used at Hewlett-Packard to control the connectivity of external business partners.

Ene, A., Horne, W., Milosavljevic, N., Rao, P., Schreiber, R., Tarjan, R. E.: Fast exact and heuristic methods for role minimization problems. In Proceedings of the 13th ACM symposium on Access control models and technologies (SACMAT '08). ACM, New York, NY, USA, 1–10, 2008.

download

Apj

dimension: 2044 objects, 1164 attributes, number of values in the scale of grades: 2

The Apj dataset represents network access control rules used at Hewlett-Packard to control the connectivity of external business partners.

Ene, A., Horne, W., Milosavljevic, N., Rao, P., Schreiber, R., Tarjan, R. E.: Fast exact and heuristic methods for role minimization problems. In Proceedings of the 13th ACM symposium on Access control models and technologies (SACMAT '08). ACM, New York, NY, USA, 1–10, 2008.

download

Chess

dimension: 3196 objects, 76 attributes, number of values in the scale of grades: 2

The Chess data describes chess games of a certain kind by means of attributes representing the positions of pieces on the board. More info: http://fimi.ua.ac.be/data/

Lichman, M: UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science, 2013.

download

Customer

dimension: 10961 objects, 277 attributes, number of values in the scale of grades: 2

The Customer dataset represents network access control rules used at Hewlett-Packard to control the connectivity of external business partners.

Ene, A., Horne, W., Milosavljevic, N., Rao, P., Schreiber, R., Tarjan, R. E.: Fast exact and heuristic methods for role minimization problems. In Proceedings of the 13th ACM symposium on Access control models and technologies (SACMAT '08). ACM, New York, NY, USA, 1–10, 2008.

download

DBLP

dimension: 6980 objects, 19 attributes, number of values in the scale of grades: 2

The DBLP dataset records in which of the 19 conferences each of the 6980 authors has published. The dataset was collected from the DBLP database.

Miettinen, P.: Matrix Decomposition Methods for Data Mining: Computational Complexity and Algorithms, PhD thesis, University of Helsinki, 2009.

download

Domino

dimension: 79 objects, 231 attributes, number of values in the scale of grades: 2

The Domino dataset represents network access control rules used at Hewlett-Packard to control the connectivity of external business partners.

Ene, A., Horne, W., Milosavljevic, N., Rao, P., Schreiber, R., Tarjan, R. E.: Fast exact and heuristic methods for role minimization problems. In Proceedings of the 13th ACM symposium on Access control models and technologies (SACMAT '08). ACM, New York, NY, USA, 1–10, 2008.

download

DNA

dimension: 4590 objects, 392 attributes, number of values in the scale of grades: 2

The DNA data describes DNA copy number amplifications. These copies activate oncogenes and are hallmarks of nearly all advanced tumors. The attributes represent chromosomal loci, i.e. positions on a chromosome.

Myllykangas, S., et al.: DNA copy number amplification profiling of human neoplasms, Oncogene 25(55), 7324–7332, 2006.

download

Emea

dimension: 35 objects, 3046 attributes, number of values in the scale of grades: 2

The Emea dataset represents network access control rules used at Hewlett-Packard to control the connectivity of external business partners.

Ene, A., Horne, W., Milosavljevic, N., Rao, P., Schreiber, R., Tarjan, R. E.: Fast exact and heuristic methods for role minimization problems. In Proceedings of the 13th ACM symposium on Access control models and technologies (SACMAT '08). ACM, New York, NY, USA, 1–10, 2008.

download

Firewall 1

dimension: 365 objects, 709 attributes, number of values in the scale of grades: 2

The Firewall 1 dataset represents network access control rules used at Hewlett-Packard to control the connectivity of external business partners.

Ene, A., Horne, W., Milosavljevic, N., Rao, P., Schreiber, R., Tarjan, R. E.: Fast exact and heuristic methods for role minimization problems. In Proceedings of the 13th ACM symposium on Access control models and technologies (SACMAT '08). ACM, New York, NY, USA, 1–10, 2008.

download

Firewall 2

dimension: 325 objects, 590 attributes, number of values in the scale of grades: 2

The Firewall 2 dataset represents network access control rules used at Hewlett-Packard to control the connectivity of external business partners.

Ene, A., Horne, W., Milosavljevic, N., Rao, P., Schreiber, R., Tarjan, R. E.: Fast exact and heuristic methods for role minimization problems. In Proceedings of the 13th ACM symposium on Access control models and technologies (SACMAT '08). ACM, New York, NY, USA, 1–10, 2008.

download

Hayes

dimension: 132 objects, 18 attributes, number of values in the scale of grades: 2

More details can be found here.

Lichman, M: UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science, 2013.

download

Healthcare

dimension: 46 objects, 46 attributes, number of values in the scale of grades: 2

The Healthcare dataset represents network access control rules used at Hewlett-Packard to control the connectivity of external business partners.

Ene, A., Horne, W., Milosavljevic, N., Rao, P., Schreiber, R., Tarjan, R. E.: Fast exact and heuristic methods for role minimization problems. In Proceedings of the 13th ACM symposium on Access control models and technologies (SACMAT '08). ACM, New York, NY, USA, 1–10, 2008.

download

Mushroom

dimension: 8124 objects, 119 attributes, number of values in the scale of grades: 2

The Mushroom dataset describes samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family. The entries are taken from The Audubon Society Field Guide to North American Mushrooms. More info: http://fimi.ua.ac.be/data/.

Lichman, M: UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science, 2013.

download

NFS

dimension: 12841 objects, 4894 attributes, number of values in the scale of grades: 2

The NFS dataset contains document-word information on a collection of project abstracts submitted to NFS for funding. Description and proper citation can be found here.

download

Paleo

dimension: 501 objects, 139 attributes, number of values in the scale of grades: 2

The Paleo dataset describes fossil records per location (objects) by means of various properties such as size, genome, species, etc. (attributes).

NOW public release 030717, available from http://www.helsinki.fi/science/now/.

download

Post

dimension: 90 objects, 25 attributes, number of values in the scale of grades: 2

More details can be found here.

Lichman, M: UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science, 2013.

download

Servo

dimension: 167 objects, 19 attributes, number of values in the scale of grades: 2

More details can be found here.

Lichman, M: UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science, 2013.

download

Shuttle

dimension: 15 objects, 23 attributes, number of values in the scale of grades: 2

More details can be found here.

Lichman, M: UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science, 2013.

download

Tic Tac Toe

dimension: 958 objects, 30 attributes, number of values in the scale of grades: 2

More details can be found here.

Lichman, M: UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science, 2013.

download

Zoo

dimension: 101 objects, 28 attributes, number of values in the scale of grades: 2

More details can be found here.

Lichman, M: UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science, 2013.

download

Data with grades

Dog breeds

dimension: 151 objects, 13 attributes, number of values in the scale of grades: 6

The Dog breeds data represents data from www.petfinder.com.

Belohlavek, R., Krmelova, M.: Beyond Boolean Matrix Decompositions: Toward Factor Analysis and Dimensionality Reduction of Ordinal Data, ICDM 2013, pp. 961–966.

download

Decathlon

dimension: 5 objects, 10 attributes, number of values in the scale of grades: 5

The Decathlon data captures results of the 2004 Olympic decathlon.

Belohlavek, R., Krmelova, M.: Factor analysis of ordinal data via decomposition of matrices with grades, Annals of Mathematics and Artificial Intelligence 72(1-2), 23–44, 2014.

download

Decathlon

dimension: 20 objects, 10 attributes, number of values in the scale of grades: 5

The Decathlon data captures results of the 2004 Olympic decathlon.

Belohlavek, R., Krmelova, M.: Factor analysis of ordinal data via decomposition of matrices with grades, Annals of Mathematics and Artificial Intelligence 72(1-2), 23–44, 2014.

download

Decathlon

dimension: 5 objects, 10 attributes, number of values in the scale of grades: 10

The Decathlon data captures results of the 2004 Olympic decathlon.

Belohlavek, R., Krmelova, M.: Factor analysis of ordinal data via decomposition of matrices with grades, Annals of Mathematics and Artificial Intelligence 72(1-2), 23–44, 2014.

download

G&P data

dimension: 2730 objects, 17 attributes, number of values in the scale of grades: 2 and 8 (two different scales)

The data relates to an examination in the subject Government and Politics. The whole examination consists of four modules (papers), but this data covers only the second paper, which deals with current British governance. The first, second and third attributes represent the id, total score and class, respectively.

download

IPAQ

dimension: 4510 objects, 16 attributes, number of values in the scale of grades: 3

The IPAQ data consists of international questionnaire data involving 4510 respondents answering 16 questions using a three-element scale, regarding physical activity.

Belohlavek, R., Krmelova, M.: Beyond Boolean Matrix Decompositions: Toward Factor Analysis and Dimensionality Reduction of Ordinal Data, ICDM 2013, pp. 961–966.

download

Music

dimension: 900 objects, 26 attributes, number of values in the scale of grades: 7

The Music data consists of results of a study of how people perceive the speed of a song depending on various song characteristics.

Belohlavek, R., Krmelova, M.: Beyond Boolean Matrix Decompositions: Toward Factor Analysis and Dimensionality Reduction of Ordinal Data, ICDM 2013, pp. 961–966.

download

Rio

dimension: 87 objects, 31 attributes, number of values in the scale of grades: 4

The Rio data covers the 87 countries that won at least one medal in any of 31 sports at the 2016 Olympic Games in Rio de Janeiro.

Trneckova, M.: Formal concept analysis with ordinal attributes. PhD Thesis, 2017.

download