Here we release two datasets representing users' perception of the similarity between pairs of movies. Both datasets are released under the Creative Commons Attribution license (CC BY 4.0). The work was done by practicum students at Carnegie Mellon University (CMU) Silicon Valley, supervised by researchers from Ericsson. Should you use these datasets, we would appreciate it if you referred to them as CMU-Ericsson Movie Similarity Dataset 1 and 2, respectively (or MovieSim-1 and MovieSim-2).

Dataset 1

This dataset has 3803 binary labels from 14 users and 143 unique movies selected. [download]

Published as: Lucas Colucci et al. “Evaluating item-item similarity algorithms for movies.” Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems. ACM, 2016.

Download the publication here.


Dataset 2

This dataset has 6605 binary labels from 114 users and 383 unique movies selected. [download]

Published as: Hongkun Leng et al. “Finding Similar Movies: Dataset, Tools, and Methods.” Proceedings of the 2018 International Conferences in Central Europe on Human Computer Interaction.

Download the full publication here.



If you are interested in other aspects of this work, including source code and intermediate data, send me an email. In addition, note that the above datasets omit movies labeled unknown. If you’d like these included, send me an email.

You may also want to check out the work of Wang et al. (2017), “Content-Based Top-N Recommendations With Perceived Similarity,” which showed how improving similarity improves recommendations.