Studying and analyzing system performance is fundamental to technological advancement in image retrieval. In this paper, we report the construction of a large-scale test collection that facilitates robust performance evaluation of mobile landmark image search. In total, the test collection consists of (1) 355,141 Flickr images of 128 landmarks in five cities on three continents; (2) several kinds of textual features for each image, including surrounding text (e.g., tags), contextual data (e.g., geo-location and upload time), and metadata (e.g., uploader and EXIF); and (3) six types of low-level visual features. To support the evaluation of landmark image retrieval, we also conduct a series of baseline experiments on search performance across different visual queries, which represent different views of a landmark.