To replicate the experiments in Learning Translations via Images, you'll need the code at this GitHub repo. It contains scripts for reading the CNN image-feature files and predicting translations as described in the paper.
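As a rough illustration of the prediction step, here is a minimal sketch of an AvgMax-style scorer: a candidate translation is scored by averaging, over the source word's images, the maximum cosine similarity to any image of the candidate. The data layout (lists of NumPy feature vectors per word) is an assumption for illustration; the repo's scripts define the actual formats and scoring variants.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def avg_max_score(src_feats, tgt_feats):
    """AvgMax: average over source images of the best match
    among the candidate's images."""
    return float(np.mean(
        [max(cosine(s, t) for t in tgt_feats) for s in src_feats]
    ))

def rank_translations(src_feats, tgt_vocab_feats):
    """Rank candidate target words (hypothetical dict of
    word -> list of feature vectors) by AvgMax score."""
    scores = {w: avg_max_score(src_feats, feats)
              for w, feats in tgt_vocab_feats.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

This is a sketch, not the repo's implementation; in particular, real CNN features are high-dimensional and should be scored in batched matrix operations rather than Python loops.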
[raw] downloads are not available at this time. We’re working on it!
For these 30 languages, we extracted CNN features and plaintext for all words in each language. Using these, you can recreate or improve on the translation results of our ACL paper. As a warning, each download can be as large as 11 GB per language! The metadata files relate images to their URLs.
[small] views are of the same format as [med] views, but with fewer words per language, letting you get a taste of working with the dataset without committing the disk space and transfer time.
[small] downloads are not available at this time. We’re working on it!
[text] views are raw web crawl dumps of the webpages on which all our images appeared.
We provide a docker image for extracting text and performing language detection as described in Learning Translations via Images, but you're also free to munge the text yourself.
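If you do munge the text yourself, one very simple stand-in for the language-detection step is stopword overlap, sketched below. This is a toy illustration only; the docker image uses a proper language-identification pipeline, and the stopword lists here are hypothetical fragments.

```python
# Toy language detector: pick the language whose stopword list
# overlaps most with the page's tokens. Illustrative only.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is"},
    "fr": {"le", "la", "et", "les", "des", "est"},
}

def detect_language(text):
    """Return the language code with the largest stopword overlap."""
    tokens = set(text.lower().split())
    scores = {lang: len(tokens & sw) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)
```

A real pipeline would use a trained character- or word-level classifier covering all 30 languages, not two hand-picked stopword sets.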
[text] downloads are not available at this time. We’re working on it!