mmid

Words and their images in 100 languages

View the Project on GitHub penn-nlp/mmid

The Massively Multilingual Image Dataset (MMID)

MMID is a large-scale, massively multilingual dataset of images paired with the words they represent, collected at the University of Pennsylvania. The dataset is doubly parallel: for each language, every word is stored alongside images that represent it and alongside its English translation (with the corresponding English images).
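To make the "doubly parallel" idea concrete, here is a minimal in-memory sketch. It is not the dataset's actual on-disk format (see the documentation page for that); the `WordEntry` class, the `parallel_images` helper, and the file paths are hypothetical names chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class WordEntry:
    """One word with its images and (for non-English words) its English translations."""
    word: str
    images: List[str] = field(default_factory=list)   # hypothetical image paths
    english: List[str] = field(default_factory=list)  # empty for English entries


# Toy data, invented for illustration only.
foreign_index: Dict[str, WordEntry] = {
    "gato": WordEntry("gato", ["gato/01.jpg", "gato/02.jpg"], ["cat"]),
}
english_index: Dict[str, WordEntry] = {
    "cat": WordEntry("cat", ["cat/01.jpg"]),
}


def parallel_images(
    word: str,
    foreign: Dict[str, WordEntry],
    english: Dict[str, WordEntry],
) -> Tuple[List[str], List[str]]:
    """Return (images of the foreign word, images of its English translations)."""
    entry = foreign[word]
    english_images = [
        image
        for translation in entry.english
        if translation in english
        for image in english[translation].images
    ]
    return entry.images, english_images


foreign_images, translated_images = parallel_images("gato", foreign_index, english_index)
```

The first link (word to images) and the second link (word to English translation to images) are the two parallel axes described above.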

By far the largest dataset of its kind, with 100 languages and up to 10,000 words per language, it is useful for evaluating image-based translation methods.
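As a rough illustration of what an image-based translation method can look like, the sketch below ranks candidate English words by how similar their image features are to those of a foreign word. The scoring rule (average of per-image maximum cosine similarity) is one simple choice and is not claimed to be the exact method evaluated in the paper; the function names and the three-dimensional toy vectors are assumptions made for this example, whereas real CNN features have far more dimensions.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def avg_max_similarity(source: List[Vector], target: List[Vector]) -> float:
    """For each source image, take its best match in target, then average."""
    return sum(max(cosine(s, t) for t in target) for s in source) / len(source)


def rank_translations(source: List[Vector], candidates: Dict[str, List[Vector]]) -> List[str]:
    """Return candidate English words sorted from most to least similar."""
    scores = {word: avg_max_similarity(source, feats) for word, feats in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy features, invented for illustration only.
gato_features = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]]
candidate_features = {
    "cat": [[1.0, 0.0, 0.0]],
    "car": [[0.0, 1.0, 0.0]],
    "tree": [[0.0, 0.0, 1.0]],
}

ranking = rank_translations(gato_features, candidate_features)
```

Evaluating such a method then amounts to checking how often the gold English translation appears at or near the top of `ranking`.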

Getting Started

See the documentation page.

If you have questions, please email the MMID users list, mmid-users@googlegroups.com.

Downloads

CNN features and web text for the 30 languages evaluated in the paper are available now. Hosting the full dataset (more than 21 TB) will take some time, but we’re working on it!

Check out the downloads page for options on how to access the dataset.

Citation

If you use this dataset for your research, please cite:

Learning Translations via Images with a Massively Multilingual Image Dataset.
John Hewitt*, Daphne Ippolito*, Brendan Callahan, Reno Kriz, Derry Tanti Wijaya and Chris Callison-Burch.
ACL 2018.

@InProceedings{hewitt-et-al:2018:Long,
  author    = {Hewitt, John and Ippolito, Daphne and Callahan, Brendan and Kriz, Reno and Wijaya, Derry Tanti and Callison-Burch, Chris},
  title     = {Learning Translations via Images with a Massively Multilingual Image Dataset},
  booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = {July},
  year      = {2018},
  address   = {Melbourne, Australia},
  publisher = {Association for Computational Linguistics}
}