Words and their images in 98 languages
Below are links, for each language, to the full MMID image/word dataset (100 images per word), a smaller view of MMID with only 1 image per word, the metadata for all images and the webpages they appeared on, and the dictionary containing just the words we have images for in that language, along with each word's canonical MMID ID within the language. For more information, see our documentation page.
MMID was constructed by building translations for the bilingual dictionaries found here, which were built as described in the paper The Language Demographics of Amazon Mechanical Turk.
Through the generosity of the Amazon Public Datasets program, each download is available via a public S3 bucket!
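Because the bucket is public, its objects can be fetched over plain HTTPS without AWS credentials. The following is a minimal sketch, not part of the MMID tooling: the bucket name `example-bucket` and the object key are placeholders (assumptions), to be replaced with the values behind the download link you want.

```python
import shutil
import urllib.parse
import urllib.request


def public_s3_url(bucket, key):
    """Build a virtual-hosted-style HTTPS URL for an object in a public bucket."""
    # quote() leaves "/" intact and percent-encodes characters such as spaces.
    return "https://{}.s3.amazonaws.com/{}".format(bucket, urllib.parse.quote(key))


def download(url, dest_path):
    """Stream the object to disk so a multi-gigabyte file is never held in memory."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)


# Placeholder names (assumptions): substitute the real bucket and key.
url = public_s3_url("example-bucket", "path/to/language-package.tgz")
# download(url, "language-package.tgz")  # uncomment to fetch the file
```

Streaming with `shutil.copyfileobj` matters here because individual downloads can run to several gigabytes.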
To replicate the experiments in Learning Translations via Images, you’ll need the code at this GitHub repo. It contains scripts for reading in CNN image feature files and predicting translations as described in the paper.
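To give a feel for how image features can drive translation prediction, here is a toy sketch. It assumes each word is represented by a list of image feature vectors and scores a candidate translation by the average, over the source word's images, of the best cosine similarity to any of the candidate's images. This aggregation, the function names, and the tiny hand-written vectors are illustrative assumptions; the repository's scripts and file formats may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_u * norm_v)


def avg_max_similarity(source_images, target_images):
    """Average over source images of the best match among target images."""
    if not source_images or not target_images:
        return 0.0
    best = [max(cosine(s, t) for t in target_images) for s in source_images]
    return sum(best) / len(best)


def rank_translations(source_images, candidates):
    """Return (word, score) pairs sorted from most to least similar."""
    scored = [(word, avg_max_similarity(source_images, feats))
              for word, feats in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional "features" (made up); real CNN features are far larger.
gato = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]]
english = {
    "cat": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.1]],
    "car": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.2]],
}
ranking = rank_translations(gato, english)
print(ranking[0][0])  # top-ranked candidate word
```

In this toy setup the source word's images point in nearly the same direction as those of "cat", so that candidate ranks first.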
For these 30 languages, we extracted CNN features and plaintext for all words of a language. Using these, you can recreate or improve on the translation results of our ACL paper. Be warned: a single language's download can be as large as 11 GB. The metadata files relate images to their URLs.
Language | Dataset | Language | Dataset
---|---|---|---
Albanian | download | Latvian | download
Arabic | download | Romanian | download
Azerbaijani | download | Serbian | download
Bengali | download | Slovak | download
Bosnian | download | Somali | download
Bulgarian | download | Spanish | download
Cebuano | download | Swedish | download
Chinese | download | Tamil | download
Dutch | download | Telugu | download
Filipino | download | Thai | download
French | download | Turkish | download
German | download | Ukrainian | download
Gujarati | download | Urdu | download
Hindi | download | Uzbek | download
Hungarian | download | Vietnamese | download
Indonesian | download | Welsh | download
Italian | download | Yoruba | download |