mmid


Words and their images in 100 languages

View the Project on GitHub penn-nlp/mmid

MMID packages

Below are links, for each language, to four downloads: the full MMID image/word dataset ("100 images"), a smaller version of MMID with only one image per word ("1 image"), the metadata for all images and the webpages they appeared on, and a dictionary containing just the words we have images for in that language, along with each word's canonical MMID ID within the language. For more information, see our documentation page.

Through the generosity of the Amazon Public Datasets program, each download is available via a public S3 bucket!

language 100 images 1 image metadata dictionary
afrikaans link link link pending
albanian link link link link
arabic link link link link
aragonese link link link link
armenian link link link link
asturian link link link link
azerbaijani link link link link
basque link link link link
belarusian link link link link
bengali link link link link
bishnupriya-manipuri link link link link
bosnian link link link link
breton link link link link
bulgarian link link link link
catalan link link link link
cebuano link link link link
central-bicolano link link link link
chinese link link link link
croatian link link link link
czech link link link link
danish link link link link
dutch link link link link
esperanto link link link link
filipino link link link link
french link link link link
frisian link link link link
galician link link link link
georgian link link link link
german link link link link
greek link link link link
gujarati link link link link
haitian link link link link
hindi link link link link
ido link link link link
irish link link link link
italian link link link link
japanese link link link link
kazakh link link link link
korean link link link link
kurdish link link link link
latvian link link link link
lithuanian link link link link
low-saxon link link link link
luxembourgish link link link link
malagasy link link link link
malay link link link link
malayalam link link link link
marathi link link link link
neapolitan link link link link
nepali link link link link
newar link link link link
norwegian-nynorsk link link link link
norwegian link link link link
pashto link link link link
persian link link link link
piedmontese link link link link
polish link link link link
portuguese link link link link
punjabi link link link link
russian link link link link
serbian link link link link
serbo-croatian link link link link
sicilian link link link link
sindhi link link link link
slovak link link link link
slovenian link link link link
somali link link link link
spanish link link link link
sundanese link link link link
swahili link link link link
swedish link link link link
tamil link link link link
telugu link link link link
thai link link link link
turkish-august link link link link

Code

To replicate the experiments in Learning Translations via Images, you’ll need the code in this GitHub repo. It contains scripts for reading the CNN image feature files and predicting translations as described in the paper.
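As a rough illustration of the idea behind those scripts (not the repository's actual code), the sketch below ranks candidate English translations for a foreign word by the cosine similarity between averaged image feature vectors. The function names, the toy three-dimensional vectors, and the choice to average features per word are all assumptions made for illustration; the repository and paper define the real feature format and scoring method.

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_translations(foreign_images, candidates):
    """Rank candidate words by similarity of their images to the foreign word's.

    foreign_images: list of feature vectors for one foreign word (hypothetical).
    candidates: dict mapping an English word to its list of feature vectors.
    Returns (word, score) pairs, best match first.
    """
    query = mean_vector(foreign_images)
    scored = [
        (word, cosine(query, mean_vector(images)))
        for word, images in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: real CNN features would have far more dimensions.
foreign = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
english = {
    "cat": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "car": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.2]],
}
ranking = rank_translations(foreign, english)
best_word = ranking[0][0]
```

With this toy data, `best_word` is `"cat"`, because the foreign word's images point in nearly the same direction as the "cat" images. Averaging the vectors is only one simple way to combine a word's images; comparing individual image pairs is another option.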

CNN package downloads

For these 30 languages, we extracted CNN features and plaintext for every word in the language. With these, you can reproduce or improve on the translation results of our ACL paper. Be warned: the download for a single language can be as large as 11 GB. The metadata files map images to their URLs.

Language Dataset   Language Dataset
Albanian download   Latvian download
Arabic download   Romanian download
Azerbaijani download   Serbian download
Bengali download   Slovak download
Bosnian download   Somali download
Bulgarian download   Spanish download
Cebuano download   Swedish download
Chinese download   Tamil download
Dutch download   Telugu download
Filipino download   Thai download
French download   Turkish download
German download   Ukrainian download
Gujarati download   Urdu download
Hindi download   Uzbek download
Hungarian download   Vietnamese download
Indonesian download   Welsh download
Italian download   Yoruba download