mmid

Words and their images in 100 languages

View the Project on GitHub penn-nlp/mmid

Code

To replicate the experiments in Learning Translations via Images, you’ll need the code at this GitHub repo. It contains scripts for reading in CNN image feature files and predicting translations as described in the paper.
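To give a feel for what predicting translations from image features can look like, here is a minimal sketch. It assumes each word's CNN features are already loaded as a list of float vectors, one per image. The function names, the toy vectors, and the average-of-best-match cosine score are illustrative assumptions, not the repo's actual API or necessarily the exact scoring used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_u * norm_v)


def avg_max_similarity(src_images, tgt_images):
    """For each source image, take its best match among the target
    images, then average those best-match scores."""
    if not src_images or not tgt_images:
        return 0.0
    best = [max(cosine(s, t) for t in tgt_images) for s in src_images]
    return sum(best) / len(best)


def rank_translations(src_images, candidates):
    """Rank candidate target words by image similarity, highest first.

    candidates maps each target word to its list of image feature vectors.
    Returns a list of (word, score) pairs.
    """
    scored = [(avg_max_similarity(src_images, feats), word)
              for word, feats in candidates.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(word, score) for score, word in scored]


# Toy 3-dimensional "features" (hypothetical; real CNN features are far larger).
gato = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]]
english = {
    "cat": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.1]],
    "car": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.2]],
}

ranking = rank_translations(gato, english)
print(ranking[0][0])  # the top-ranked candidate word
```

The real scripts in the repo handle file reading and evaluation; this sketch only illustrates the ranking step on in-memory data.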

[raw] Downloads

[raw] downloads are not available at this time. We’re working on it!

[med] Downloads

For these 30 languages, we extracted CNN features and plaintext for every word in each language. Using these, you can recreate or improve on the translation results of our ACL paper. Be warned: a single language's download can be as large as 11 GB! The metadata files relate images to their URLs.

Language Dataset   Language Dataset
Albanian download   Latvian download
Arabic download   Romanian download
Azerbaijani download   Serbian download
Bengali download   Slovak download
Bosnian download   Somali download
Bulgarian download   Spanish download
Cebuano download   Swedish download
Chinese download   Tamil download
Dutch download   Telugu download
Filipino download   Thai download
French download   Turkish download
German download   Ukrainian download
Gujarati download   Urdu download
Hindi download   Uzbek download
Hungarian download   Vietnamese download
Indonesian download   Welsh download
Italian download   Yoruba download

[small] Downloads

[small] views are in the same format as [med] views, but with fewer words for each language, letting you get a taste for working with the dataset without committing disk space and transfer time.
[small] downloads are not available at this time. We’re working on it!

[text] Downloads

[text] views are raw web crawl dumps of the webpages on which all our images appeared. We provide a Docker image for extracting text and performing language detection as described in Learning Translations via Images, but you’re also free to munge the text yourself.
[text] downloads are not available at this time. We’re working on it!