mmid

Words and their images in 100 languages

View the Project on GitHub penn-nlp/mmid

The Massively Multilingual Image Dataset (MMID)

MMID is a large-scale, massively multilingual dataset of images paired with the words they represent, collected at the University of Pennsylvania. The dataset is doubly parallel: for each language, every word is stored alongside images that represent it, and alongside its English translation (with that translation's own images).
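As a rough illustration of what "doubly parallel" means, the sketch below models a single word entry in Python. Everything in it is hypothetical: the class name `WordEntry`, its fields, the helper `parallel_pairs`, and the example paths are made up for illustration and do not reflect MMID's actual file layout, which is described on the documentation page.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class WordEntry:
    """Hypothetical record: one foreign word, its images, and its English side."""
    word: str
    language: str
    # Images that represent the foreign word (first kind of parallelism).
    image_paths: List[str] = field(default_factory=list)
    # English translation -> images for that English word (second kind).
    english: Dict[str, List[str]] = field(default_factory=dict)


def parallel_pairs(entry: WordEntry) -> Iterator[Tuple[str, str, str]]:
    """Yield (foreign_image, english_word, english_image) triples."""
    for foreign_img in entry.image_paths:
        for en_word, en_imgs in entry.english.items():
            for en_img in en_imgs:
                yield foreign_img, en_word, en_img


# Made-up example data; the paths are placeholders, not real MMID files.
entry = WordEntry(
    word="gato",
    language="Spanish",
    image_paths=["es/gato/01.jpg", "es/gato/02.jpg"],
    english={"cat": ["en/cat/01.jpg"]},
)
pairs = list(parallel_pairs(entry))
```

Here `pairs` links each image of the foreign word to each image of its English translation, which is the kind of cross-lingual image pairing the doubly parallel structure makes possible.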

By far the largest dataset of its kind, it covers 100 languages (including English), with up to 10,000 words per language and many more for English.

Getting Started

See the documentation page.

If you have questions, please email the MMID users list (mmid-users@googlegroups.com).

Downloads

We’re happy to announce that MMID is available via the Amazon Public Datasets program! Through their generosity, we’re able to provide all of MMID free of charge via a public S3 bucket.

We currently host data for 75 of the 100 languages. We’re working on getting the remaining languages (as well as the English translations) ready for distribution. Also in preparation are text dumps for a subset of the webpages associated with the images in the dataset.

Check out the downloads page for options on how to access the dataset.

Citation

We gratefully acknowledge the support of an Amazon Research Award and AWS Research Credits, which enabled the construction of MMID.

If you use MMID for your research, please cite:

Learning Translations via Images with a Massively Multilingual Image Dataset.
John Hewitt*, Daphne Ippolito*, Brendan Callahan, Reno Kriz, Derry Tanti Wijaya and Chris Callison-Burch.
ACL 2018.

@InProceedings{hewitt-et-al:2018:Long,
  author    = {Hewitt, John  and  Ippolito, Daphne  and  Callahan, Brendan and Kriz, Reno and Wijaya, Derry Tanti and Callison-Burch, Chris},
  title     = {Learning Translations via Images with a Massively Multilingual Image Dataset},
  booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = {July},
  year      = {2018},
  address   = {Melbourne, Australia},
  publisher = {Association for Computational Linguistics}
}