High-quality test collections are increasingly important for advancing geo-referenced image retrieval and analytics. In this paper, we present a large-scale test collection that supports robust performance evaluation of landmark image search, together with the methodology used to construct it. Using this methodology, we develop a very large-scale test collection consisting of three key components: (1) 355,141 images of 128 landmarks in five cities across three continents, crawled from Flickr; (2) different kinds of textual features for each image, including surrounding text (e.g., tags), contextual data (e.g., geo-location and upload time), and metadata (e.g., uploader and EXIF); and (3) six types of low-level visual features. To support robust and effective performance assessment, we conduct a series of baseline experiments on search performance with both textual and visual queries. The results demonstrate the importance and effectiveness of the test collection.