Benchmarking unsupervised near-duplicate image detection
Authors: Lia Morra, Fabrizio Lamberti
Lia Morra*, Fabrizio Lamberti
Department of Control and Computer Engineering, Politecnico di Torino, Torino, Italy

Abstract

Unsupervised near-duplicate detection has many practical applications, ranging from social media analysis and web-scale retrieval to digital image forensics. It entails running a threshold-limited query on a set of descriptors extracted from the images, with the goal of identifying all possible near-duplicates, while limiting the false positives due to visually similar images. Since the rate of false alarms grows with the dataset size, a very high specificity is thus required, up to 1 - 10^-9 for realistic use cases; this important requirement, however, is often overlooked in the literature. In recent years, descriptors based on deep convolutional neural networks have matched or surpassed traditional feature extraction methods in content-based image retrieval tasks. To the best of our knowledge, ours is the first attempt to establish the performance range of deep learning-based descriptors for unsupervised near-duplicate detection on a range of datasets, encompassing a broad spectrum of near-duplicate definitions. We leverage both established and new benchmarks, such as the Mir-Flickr Near-Duplicate (MFND) dataset, in which a known ground truth is provided for all possible pairs over a general, large-scale image collection. To compare the specificity of different descriptors, we reduce the problem of unsupervised detection to that of binary classification of near-duplicate vs. not-near-duplicate images. The latter can be conveniently characterized using Receiver Operating Characteristic (ROC) analysis. Our findings in general favor the choice of fine-tuning deep convolutional networks, as opposed to using off-the-shelf features, but differences at high specificity settings depend on the dataset and are often small.
The best performance was observed on the MFND benchmark, achieving 96% sensitivity at a false positive rate of 1.43 × 10^-6.

Keywords: near-duplicate detection, convolutional neural networks, instance-level retrieval, unsupervised detection, performance analysis, image forensics

© 2019. This manuscript version is made available under the CC-BY-NC-ND 4.0 license, http://creativecommons.org/licenses/by-nc-nd/4.0/. Full text published in Expert Systems with Applications, available at https://doi.org/10.1016/j.eswa.2019.05.002. Corresponding author: lia.morra@polito.it (Lia Morra); fabrizio.lamberti@polito.it (Fabrizio Lamberti). Preprint submitted to Expert Systems with Applications, July 8, 2019.

1. Introduction

Near-duplicate (ND) image detection, or discovery, entails finding altered or alternative versions of the same image or scene in a large-scale collection. This technique has plenty of practical applications, ranging from social media analysis and web-scale retrieval to digital image forensics. Our work was motivated in particular by applications in the latter domain, as detecting the re-use of photographic material is a key component of several passive image forensics techniques. Examples include detection of copyright infringements (Zhou et al., 2017c; Chiu et al., 2012; Ke et al., 2004), digital forgery attacks such as cut-and-paste, copy-move and splicing (Chennamma et al., 2009; Hirano et al., 2006), analysis of media devices seized during criminal investigations (Connor & Cardillo, 2016; Battiato et al., 2014), tracing the online origin of sequestered content (de Oliveira et al., 2016; Amerini et al., 2017), and fraud detection (Li et al., 2018; Cicconet et al., 2018).

In all the above-mentioned applications, we cannot resort to standard hashing techniques, given that even minimal alterations would make different copies untraceable.
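The fragility of exact hashing is easy to demonstrate: changing a single bit of a file yields a completely unrelated digest, so a hash index cannot link an altered copy back to its source. A minimal sketch using Python's standard hashlib (the byte strings below are hypothetical stand-ins for image files):

```python
import hashlib

# Stand-ins for an image file and a minimally altered copy of it
# (a real case would read two image files from disk).
original = bytes(range(256)) * 100
altered = bytearray(original)
altered[0] ^= 1  # minimal alteration: flip a single bit

h_orig = hashlib.md5(original).hexdigest()
h_alt = hashlib.md5(bytes(altered)).hexdigest()

# The two digests share no useful relationship, so a hash-based
# lookup cannot retrieve the altered copy as a duplicate.
print(h_orig == h_alt)  # False
```

This is precisely the avalanche behavior cryptographic hashes are designed for, and it is why content-based descriptors, rather than exact hashes, are needed for near-duplicate detection.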
Similarly, it is not possible to rely on associated text, tags or taxonomies for retrieval, as done for instance in (Gonçalves et al., 2018), since they would likely change in the different sites or devices where content is used. Images may be subject to digital forgery, with parts of one or more existing images combined to create fake ones. Therefore, it is imperative to resort to content-based image retrieval techniques for the task of locating near-duplicates.

Let us consider, as a motivating example, the case of fraud detection. Many companies, like insurance ones, are relying on user-supplied photographic evidence to support business processes (Li et al., 2018). Photos of the same object or scene may be re-used multiple times to obtain an unfair advantage: such frauds are unlikely to be detected unless a largely automatic image analysis system is in place.

It should be noticed that we are adopting a very broad definition of ND, encompassing all images of the same object or scene, whereas many papers in the literature restrict the definition to copies of the same digital source that have been digitally manipulated (Connor et al., 2015; Foo & Sinha, 2007). Naturally occurring NDs, such as images of the same scene or object acquired at different times or from different viewpoints, are often more challenging to detect. However, in emerging applications such as fraud detection, which motivate our work, we do not wish to restrict ourselves to either definition: as a matter of fact, we have no reason to assume that, when constructing fraudulent claims, digital content manipulation is more likely than simply acquiring different shots of the same scene.
This broad definition brings ND detection closer to the task of instance-level image retrieval, which is abundantly studied in the literature, but with a crucial difference: while the latter is usually formulated as a human-guided supervised search, the former needs as little human supervision as possible. To achieve this goal, we need to re-frame the problem from a supervised K-nearest neighbors search to an unsupervised threshold-limited search, where the distance is used as a classification function to distinguish ND from non-ND pairs.

Realistic datasets in image forensics and fraud detection range between 10^5 and 10^7 images (Connor & Cardillo, 2016). Since the number of possible pairs grows quadratically with the dataset size, a very low false positive rate (or, conversely, a high specificity) is needed to obtain a tractable number of false alarms and therefore be acceptable to the end user. For a dataset of 10^6 images, a false positive rate of 10^-9, which would be considered exceptionally low in many applications, would still translate to 500 false alarms.

In recent years, deep convolutional neural networks (CNNs) have shown unprecedented performance in many computer vision tasks, and content-based image retrieval is no exception. To the best of our knowledge, very few papers have exploited CNN-based descriptors for ND detection, but if we look at the closely related task of instance-level retrieval, a consistent body of research has emerged in recent years favoring the adoption of CNN-based representations over traditional SIFT-based methods (Zheng et al., 2017). Experimental results on several benchmark datasets show that they achieve better performance, use more compact representations and are faster to compute (Zheng et al., 2017).
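The false-alarm arithmetic above can be reproduced with a short computation: the number of unordered pairs grows as N(N-1)/2, so even a vanishingly small false positive rate is multiplied by an enormous factor.

```python
def expected_false_alarms(n_images: int, fp_rate: float) -> float:
    """Expected number of false alarms when all unordered image pairs
    in a collection of n_images are tested at the given FP rate."""
    n_pairs = n_images * (n_images - 1) // 2  # grows quadratically in n_images
    return n_pairs * fp_rate

# One million images yield ~5 x 10^11 candidate pairs, so even an
# exceptionally low FP rate of 10^-9 leaves roughly 500 false alarms.
print(round(expected_false_alarms(10**6, 1e-9)))  # 500
```

The same computation shows that a 10^7-image collection at the same FP rate would already produce around 50,000 false alarms, which is why specificity dominates the design of unsupervised ND detectors.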
However, given the need to re-frame the problem as an unsupervised threshold-limited search (where the overall performance is dominated by specificity rather than sensitivity), it is not straightforward to evaluate whether unsupervised near-duplicate search lies within the grasp of the current state of the art. To the best of our knowledge, only Connor & Cardillo (2016) have previously addressed the issue of quantifying the performance of unsupervised near-duplicate detection, and ours is the first contribution to specifically characterize deep learning descriptors on a wide range of ND categories. One of the underlying reasons is certainly the lack of suitably annotated benchmark datasets, as well as of an established methodology to measure a descriptor's performance. It is crucial that benchmarks for ND detection include a sufficiently large number of negative queries, i.e., images for which the absence of NDs has been established, in order to assess both specificity and sensitivity. In some cases, we can resort to digital transformations to simulate NDs, but this is not applicable to all transformations.

Instance-level retrieval benchmarks, such as the Oxford5k, UKBench and Holidays datasets, comprise a variety of naturally occurring and challenging NDs, but are rather small scale and include only clusters of related images (Zheng et al., 2017). Recently, a new benchmark has become available to address the specific needs of ND detection: the Mir-Flickr Near Duplicate (MFND) dataset, based on the pre-existing MIR-Flickr collection (Connor et al., 2015).
In this benchmark, a large number of NDs were mined using a semi-automatic procedure, so that the remaining images can be assumed to be negative queries; however, in their initial search Connor and colleagues focused on specific subclasses of NDs, which limits the representativeness of this benchmark for applications such as fraud detection. Connor & Cardillo (2016) showed that the problem of unsupervised detection could thus be characterized as a binary classification problem, and we build upon their contribution for our experimental methodology.

The overarching objective of this study is to evaluate the performance of state-of-the-art deep learning descriptors and establish a baseline against which future research can be compared. A thorough experimental comparison includes a wide range of established and emerging public benchmarks, as well as data from a real-life fraud detection case study. Our contributions can be summarized as follows:

• we compare the performance of CNN-based descriptors on the task of unsupervised near-duplicate detection, and show empirically on a variety of datasets that specificity has a large impact on the relative ranking of different descriptors;

• we extend considerably the available annotations for the MFND benchmark to obtain a large-scale benchmark which supports a wide range of ND definitions and use cases;

• finally, we extend previous work by Connor & Cardillo (2016) towards a principled evaluation methodology that captures the performance requirements of unsupervised ND discovery; we show analytically and experimentally that, by using hard negative mining, we can approximate the area under the ROC curve (AUC), which can be used to rank the performance of different descriptors.

The rest of the paper is organized as follows: in Section 2, related work on instance retrieval and ND detection is reviewed.
Section 3 introduces the datasets that are considered in the experiments. The evaluation methodology is presented in Section 4, whereas the experimental setup is described in Section 5. Results are presented and discussed in Sections 6 and 7, respectively.

2. Related work

2.1. Content-based image retrieval and instance-level retrieval

Content-based image retrieval (CBIR) systems are designed to retrieve semantically similar images within a database based on a specific query (e.g., by providing another image). This problem can be decomposed into two main challenges: describing image content in terms of visual features, and conducting an exact or approximate nearest neighbor search based on a distance measure (Zheng et al., 2017; Bay et al., 2008). Such features can be hand-crafted, or learned from data by using deep CNNs. In this section, we review feature extraction techniques, and refer to existing surveys for the challenges related to feature aggregation, quantization, indexing and distance measures (Zheng et al., 2017; Zhou et al., 2017b).

2.1.1. Hand-crafted features

Global features based on the characteristics of the entire image (color, shape, texture, histogram, etc.) were extensively used in early CBIR systems. In the early 2000s, local feature extraction emerged as a more effective alternative, which generally involves two key steps: key interest point detection and local region description. In the first step, key salient features in the image are identified with high repeatability, using dense sampling or, more commonly, by detecting local extrema in the scale-space domain (e.g., Difference of Gaussians, Hessian matrix, etc.). One or multiple descriptors are then extracted from the local region centered at each interest point, usually designed to be invariant to rotation changes and robust to affine distortions, addition of noise, illumination changes, etc.
The most popular local feature descriptors are SIFT and SURF (Zheng et al., 2017). SIFT-based approaches generally yield very large feature sets, on the order of thousands per image. The Bag of Visual Words (BoVW) is the most common approach for feature reduction and quantization in CBIR and instance retrieval.

2.1.2. Deep learning approaches

Since 2015, deep learning has become the state-of-the-art approach to CBIR (Wan et al., 2014; Gordo et al., 2016; Balntas et al., 2016; Babenko & Lempitsky, 2015; Zagoruyko & Komodakis, 2015). Deep CNNs have the distinct advantage of learning hierarchical, high-level abstractions close to human cognition processes. Similarly to SIFT, CNNs can be trained to extract features from local regions of interest (patches) after detecting key interest points, which are then quantized, e.g., using the BoVW (Zagoruyko & Komodakis, 2015; Balntas et al., 2016). Alternatively, it is possible to extract semantic-aware features from the activations of the top convolutional layers of an image: it can be shown that such feature vectors can be interpreted as an approximate many-to-many region matching, without the need to explicitly extract key points, and with the advantage of obtaining faster and more compact representations. To this aim, two fundamental approaches are available. In the first approach, feature extraction is based on pre-trained models, like the VGG network trained for object recognition, alone or in combination with traditional visual features (Babenko & Lempitsky, 2015; Wan et al., 2014). In the second one, a CNN can be trained to learn a ranking function in an end-to-end fashion, mapping the input space to a target latent space such that the Euclidean distance in latent space approximates visual similarity (Gordo et al., 2016; Wang et al., 2014).
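One standard formulation of such a ranking objective is the contrastive loss, which pulls the descriptors of matching pairs together and pushes non-matching pairs apart up to a margin; the cited works use related but not identical losses, so the NumPy sketch below (with made-up descriptors and a margin of 1.0) is only illustrative of the general idea:

```python
import numpy as np

def contrastive_loss(desc_a, desc_b, is_nd_pair, margin=1.0):
    """Contrastive loss over a batch of descriptor pairs.

    ND (positive) pairs are penalized by their squared Euclidean
    distance; NND (negative) pairs are penalized only when closer
    than `margin`, pushing them apart in the latent space.
    """
    d = np.linalg.norm(desc_a - desc_b, axis=1)  # Euclidean distance per pair
    pos_term = is_nd_pair * d ** 2
    neg_term = (1 - is_nd_pair) * np.maximum(margin - d, 0.0) ** 2
    return float(np.mean(pos_term + neg_term))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 128))
positives = anchors + 0.01 * rng.normal(size=(4, 128))  # near-duplicate descriptors
negatives = rng.normal(size=(4, 128))                   # unrelated images

print(contrastive_loss(anchors, positives, np.ones(4)))   # small: positives already close
print(contrastive_loss(anchors, negatives, np.zeros(4)))  # zero once negatives exceed the margin
```

Minimizing such a loss over many labeled pairs is what makes the Euclidean distance in the learned latent space behave as a proxy for visual similarity.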
In order to optimize a ranking loss, a special architecture called a Siamese network is used (Wang et al., 2014; Gordo et al., 2016, 2017). Usually, descriptors are pre-trained on ImageNet to learn image semantics, and then fine-tuned on a second training set with relevance information (Wang et al., 2014; Gordo et al., 2016).

2.2. Near-duplicate image detection

Several works have focused on near-duplicate image detection as a distinct application from content-based image retrieval (Chennamma et al., 2009; Foo & Sinha, 2007; Chum et al., 2008; Li et al., 2015; Xie et al., 2014; Hu et al., 2009; Liu et al., 2015; Xu et al., 2010; Kim et al., 2015; Battiato et al., 2014; Zhou et al., 2017c; Cicconet et al., 2018; Chen et al., 2017; Connor & Cardillo, 2016).

In order to frame our contribution with respect to previous literature, a more precise working definition of near-duplicate is needed. Given the range of potential applications, it comes as no surprise that the definition of near-duplicate image is indeed quite varied. Starting from the work by Foo & Sinha (2007), two main sources of near-duplicates have been identified in the literature, namely identical and non-identical near-duplicates (Foo & Sinha, 2007; Connor et al., 2015; Jinda-Apiraksa et al., 2013; Chen et al., 2017). Identical near-duplicates (INDs) are derived from the same digital source after applying some transformations, including cropping and rescaling, changes in image format, thumbnail resizing, insertion of logos or watermarks, or cosmetic changes (black & white conversion, image enhancement and so forth).

Non-identical near-duplicates (NINDs), on the contrary, are defined as images that share the same content (i.e., they depict the same scene or object), but with different illuminations, subject movement, viewpoint changes, occlusions, etc.
(Foo & Sinha, 2007; Jinda-Apiraksa et al., 2013). Detecting NINDs is deemed more challenging than detecting INDs, and their definition is more subjective (Jinda-Apiraksa et al., 2013); for these reasons, many authors have mostly targeted INDs.

Depending on the type of ND targeted and the level of transformation involved, most papers in the literature have focused either on global features or on SIFT features combined with BoVW quantization. Global features have been mostly used for IND detection (Connor & Cardillo, 2016; Chen et al., 2017; Li et al., 2015). Local descriptors, such as SIFT features combined with BoVW quantization, allow detecting more aggressive alterations (including NINDs), sub-image retrieval or image forgery (e.g., copy-move attacks). Local descriptors are prone to false positive matches, as they do not take into account spatial coherence; to reduce false alarms, some authors have proposed pruning techniques to improve specificity and scalability (Foo & Sinha, 2007; Liu et al., 2015), whereas other authors have focused on post-query verification (Zhou et al., 2017c; Hu et al., 2009; Xu et al., 2010).

From an evaluation point of view, many papers framed the problem of ND detection as a supervised K-nearest neighbor search, and few papers have addressed the issue of quantifying the specificity of descriptors when performing unsupervised, threshold-limited near-duplicate discovery (Connor & Cardillo, 2016; Chen et al., 2017; Kim et al., 2015). The most relevant prior work is that by Connor and colleagues, who proposed a method to evaluate the specificity of ND detectors and choose the optimal distance threshold, based on Receiver Operating Characteristic (ROC) analysis (Connor & Cardillo, 2016). A more in-depth analysis of this methodology, and of the extensions that we propose, is available in Section 4.
Other authors have used small test sets to establish the optimal threshold, which was subsequently applied to a larger dataset (Chen et al., 2017). For instance, Chen et al. used Bloom filtering and range queries to detect duplicate images under scaling, watermarking and format change transformations (Chen et al., 2017); for evaluation purposes, they estimated the percentage of false and correct rejections, as well as precision and recall curves, on a smaller case dataset.

[Figure 1 panels: MFND, CLAIMS (difficulty levels 1-3), California-ND (rater agreement 100%, 50%, 30%), Holidays; IND and NIND collections.]
Figure 1: Examples of near-duplicate pairs of varying complexity from the four datasets included in the comparison. For the CLAIMS dataset, difficulty was evaluated subjectively by one rater, whereas for California-ND it was established based on the agreement between 10 independent raters.

Dataset        Size       IND clusters (pairs)   NIND clusters (pairs)
CLAIMS         201,961    NA                     1,037 (1,475)
MFND           1,000,000  3,825 (4,672)          10,454 (18,299)
California-ND  701        NA                     107 (4,609)
Holidays       1,491      NA                     500 (2,072)

Table 1: Comparison of the benchmark datasets. Related IND or NIND pairs were grouped into clusters; the average number of images per cluster ranges from 2.07 to 6.55. IND and NIND pairs are counted separately, where applicable.

3. Datasets

The experimental results presented in this paper were based on four image collections, including a private dataset (Section 3.1) and three publicly available benchmarks (Sections 3.2, 3.3 and 3.4). The CLAIMS dataset was collected for insurance purposes, and therefore constitutes a realistic case study for fraud detection applications. The MFND dataset (Connor et al., 2015), based on the MIR-Flickr image retrieval benchmark (Huiskes & Lew, 2008), contains a variety of both INDs and NINDs.
Both the California-ND and Holidays datasets contain personal holiday photos and, while much smaller in size, include several challenging NIND examples (Jegou et al., 2008; Jinda-Apiraksa et al., 2013). Examples of ND pairs of different complexity are given in Fig. 1, whereas a summary of the dataset characteristics is reported in Table 1.

3.1. CLAIMS dataset

The CLAIMS dataset includes a variety of indoor and outdoor scenes, mostly from residential and commercial buildings. It contains a total of 201,961 images coming from 22,327 claims. Image subsets that are associated with a given claim generally include images of the same scenes or objects, representing a source of relatively rapidly identifiable NINDs. More infrequently, images from different claims may also represent the same scene or object. This dataset contains both IND (e.g., insertion of small captions or logos, changes in aspect ratio, format change, compression, etc.) and NIND clusters (e.g., sequential snapshots of the same scene, viewpoint changes, etc.).

The collection was annotated to generate both positive queries (i.e., with known NDs) and negative queries (i.e., for which the absence of NDs was confirmed). NDs were annotated following a semi-manual procedure, in which a set of claims was randomly selected. For each claim, all potential image pairs were generated and the ND pairs were manually selected. Connected pairs of NDs from the same claim were grouped to form clusters. Non-near-duplicate (NND) pairs were randomly extracted following a hard negative mining strategy (see Section 4.1 for details). The results were visually inspected, yielding 103 additional near-duplicate pairs. The final annotated set included 1,475 ND pairs, forming 1,037 distinct clusters; the average number of images per cluster is 2.2.
3.2. MIR-Flickr Near Duplicate

The MIR-Flickr Near Duplicate (MFND) collection is a recent revisitation of the MIR-Flickr image retrieval benchmark (Huiskes & Lew, 2008). Connor and colleagues observed a significant number of NDs in this one-million-image collection, which were semi-automatically retrieved using different ND finders (Connor et al., 2015). We have expanded their annotations by adopting a broader definition of ND, as well as by using different descriptors.

The first MFND annotation was generated using a set of four global descriptors (based on MPEG-7 and perceptual hashing global features) and five distance measures, which were combined to form different similarity functions (Connor et al., 2015). A threshold-limited nearest-neighbor search was conducted using approximated metric search techniques, yielding a few thousand potential ND pairs for every function. We have expanded this annotation by using the three CNN-based descriptors included in this study, and the Euclidean distance. Several threshold-limited, K-nearest neighbor searches were performed (with K=5 and K=1), yielding a few hundred thousand potential ND pairs, which were visually inspected. Exact duplicates were eliminated based on the MD5 hash.

Each of the resulting image pairs was manually assigned to one of three categories, IND, NIND or other, following the categorization illustrated in Section 2.2 (Connor et al., 2015). The strength of this methodology is that it minimizes biases with respect to the images in the collection, as well as to the method with which the near-duplicates have been detected. We assumed, as in previous work (Connor et al., 2015), that both IND and NIND relations are transitive, allowing the identification of clusters of images that share the same content. The resulting clusters were also visually inspected for consistency.
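Under the transitivity assumption, grouping verified ND pairs into clusters amounts to taking the connected components of the pair graph, which a disjoint-set (union-find) structure computes efficiently. A minimal sketch (the image identifiers are hypothetical):

```python
def cluster_nd_pairs(pairs):
    """Group verified near-duplicate pairs into clusters, assuming the
    ND relation is transitive (i.e., take connected components)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # union the two clusters

    clusters = {}
    for image in parent:
        clusters.setdefault(find(image), set()).add(image)
    return sorted(clusters.values(), key=len, reverse=True)

# Two pairs sharing img2 merge into a single cluster of three images.
result = cluster_nd_pairs([("img1", "img2"), ("img2", "img3"), ("img4", "img5")])
print([sorted(c) for c in result])  # [['img1', 'img2', 'img3'], ['img4', 'img5']]
```

Note that, as discussed for the California-ND dataset, the ND relation is not transitive in general; a consistency check of the resulting clusters (as done here by visual inspection) therefore remains necessary.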
As for the CLAIMS collection, NND pairs were generated through a hard negative mining procedure; the results were visually inspected, identifying 120 additional NIND pairs. The available annotations were thus substantially extended, from 1,958 to 3,825 IND clusters (4,672 vs. 2,407 pairs) and from 379 to 10,454 NIND clusters (18,299 pairs). Many of the newly detected IND pairs were subject to digital content manipulations, cropping or color alterations; we found that CNN-based descriptors were particularly robust to colorization techniques. A total of 30,925 images were found to have at least one IND or NIND in the collection, with a mean cluster size of 2.2.

3.3. California-ND

The California-ND collection comprises 701 photos taken from a real user's personal photo collection (Jinda-Apiraksa et al., 2013). It includes many challenging NIND cases, without resorting to artificial image transformations. To account for the intrinsic ambiguity of the NIND definition, the collection was manually annotated by 10 different observers, including the photographer himself. Raters were given instructions such as: "If any two (or more) images look similar in visual appearance, or convey similar concepts to you, label them as near-duplicates." Out of 245,350 unique possible combinations, 4,609 image pairs were identified as ND by at least one subject; notably, in 82% of the cases raters disagreed to some extent on whether or not a pair of images should be considered ND. The image pairs form 107 clusters of NIND images, where each cluster contains on average 6.55 images; the ND pairs were grouped assuming that the ND relationship is transitive (which is not generally the case, but seemed reasonable in this particular situation).

3.4. Holidays

The INRIA Holidays dataset (Jegou et al., 2008), a popular benchmark for instance retrieval, is mainly composed of the authors' personal holiday photos.
The images, all high resolution, include a large variety of scene types (natural, man-made, water and fire effects, etc.). Images were grouped by the authors into 500 disjoint image clusters, each representing a distinct scene or object, for a total of 1,491 images and an average cluster size of 2.98 images. From the 500 clusters, a total of 2,072 ND pairs, mostly NINDs, can be identified.

4. Performance evaluation

In this section, we illustrate the performance metrics and protocol used to evaluate the specificity and sensitivity of unsupervised ND detection. In the first battery of tests, we extended the work of Connor and colleagues (Connor & Cardillo, 2016), reducing the problem of unsupervised discovery to that of binary classification; ROC analysis can be used to measure the ability of a ND detector to distinguish ND pairs from visually similar examples, as detailed in Section 4.1. The second battery of tests involves estimating the average false positive rate generated by a negative query, and is explained in Section 4.3; the relationship between these two performance measures is also explored. An overview of the methodology is presented in Fig. 2.

[Figure 2 steps: (1) semi-automated discovery of near-duplicate (positive) pairs; (2) hard negative mining of not-near-duplicate (negative) pairs; (3) distance computation and distance distributions; (4) ROC analysis and threshold selection (high sensitivity vs. high specificity); (5) query-level analysis.]
Figure 2: Methodology employed to calculate the performance on the MFND benchmark. First, all near-duplicate pairs are discovered through a semi-supervised search technique (step 1). On the remainder of the collection, hard negative mining is used to identify hard samples of visually similar, but not near-duplicate, pairs (step 2); crucially, this step needs to be repeated for each descriptor.
Using the distance as a classification function (step 3), ROC analysis can be used to characterize the ability of the descriptor to distinguish near-duplicate from not-near-duplicate pairs. From the ROC analysis, suitable thresholds on the distance can be selected based on the application requirements. Performance at query time can thus be reliably estimated (step 5).

4.1. ROC analysis

As suggested by Connor & Cardillo (2016), a near-duplicate finder can be modeled as a positive numeric function D over any two image descriptors x and y, where normally D will be a proper distance metric. To run an unsupervised search, it is necessary to use D as a classification function over image pairs, which without loss of generality can be achieved by choosing a threshold t:

    D_t(x, y) = D(x, y) < t    (1)

The problem of unsupervised discovery can be characterized as finding the near-duplicate intersection of two image sets, X ∩_ND Y, that is, the set of pairs of images from sets X and Y that satisfy the conceptual near-duplicate relation ND (Connor & Cardillo, 2016). If Sens(t) and Spec(t) are the sensitivity and specificity of D_t(x, y), the number of true positive (TP) matches will be

    TP(t) = Sens(t) · |X ∩_ND Y|    (2)

and the number of false positives (FP) will be

    FP(t) = (1 − Spec(t)) · |X| · |Y|    (3)

assuming that |X ∩_ND Y| ≪ |Y|. In our setting, |X| = K is the number of query images and |Y| = M is the size of the collection. Another useful figure is the number of false positives per query image, which can be computed as

    FPq(t) = (1 − Spec(t)) · M    (4)

Given a set of ND and NND pairs, the sensitivity (or recall) and the specificity of a ND detector can be estimated as follows:

    Sens(t) = (no. of correctly identified ND pairs) / (total no. of ND pairs)    (5)

    Spec(t) = (no. of correctly identified NND pairs) / (total no. of NND pairs)    (6)

Both quantities are functions of the threshold t, and the overall performance can be characterized by ROC analysis.

4.2. Hard negative mining

In a realistic dataset the pool of NND images is very large compared to the number of ND pairs: it is not feasible to evaluate all possible pairs. Hard negative mining extracts a compact set of NND pairs from a large image collection, starting from a subset of randomly selected query images, for which we can assume that a near-duplicate match does not exist in the collection. For each query image, the pairwise distances between the query image and all the other images in the collection are calculated, and the most "difficult" examples are selected. Starting from a random sample of query images, two hard negative mining strategies were considered:

• the nearest neighbor for each query is selected (hn1);

• the K-nearest neighbors for all queries are retrieved and sorted; the most difficult pairs (i.e., those with the smallest distances) are then selected (hn2).

Notably, the hard negative mining procedure depends on the relative ranking of the images, and hence has to be repeated for each descriptor and for each distance formulation.

The distances of the hard negatives are among the smallest of the K × M distances measured, where K is the number of query images and M is the size of the dataset: if a distance threshold exists such that all the "difficult" pairs are correctly rejected, then we can assume that all potential NND pairs in the collection will be rejected as well. For instance, for the CLAIMS dataset K = 4,400 and M = 80,000, yielding a specificity of 1 − 1/353,000,000 = 1 − 2.83 × 10^-9, which is the highest specificity that can be measured in this setting.

The specificity measured on the hard negative samples can be used to approximate the true specificity that would be observed at large.
For instance, a 0.9 specificity (0.1 FP rate) measured on the hard negatives would successfully discard 90% of the mined NND pairs; however, it would also fail to discard at least 0.1 K of the most difficult pairs out of the K × M possible ones, and hence would correspond to a true specificity of at most 1 − 0.1 K / (K M) = 1 − 1.25 × 10⁻⁶ for M = 80,000. In this way, it is possible to estimate a lower bound on the number of FPs generated on datasets of arbitrary size.

4.2.1. Area under the ROC curve

The AUC is a common summary metric that quantifies the global performance of a classifier (Bradley, 1997). We estimated the AUC for each descriptor, and 95% confidence intervals were calculated under the normal assumption according to Hanley & McNeil (1982). Since the ROC is calculated on a subset of all possible negative pairs, the resulting AUC is an approximation of the true AUC that would be obtained if all pairs were taken into account. In the following, AUC_hn1 and AUC_hn2 denote the AUC calculated on pairs extracted using hard negative mining strategies hn1 and hn2, respectively.

Let p_1, ..., p_{N+} be the ND pairs (i.e., positive samples) and n_1, ..., n_{N−} be all the NND pairs (i.e., negative samples), where in our case N− = K × M. The AUC can be expressed as a sum of indicator functions (Hanley & McNeil, 1982):

    AUC = (1 / (N+ N−)) Σ_{i=1}^{N+} Σ_{j=1}^{N−} 1_{f(p_i) > f(n_j)}    (7)

where f(·) is a scoring function which, in our case, is the distance between the descriptors of the two images in the pair¹. For simplicity, we omit f(·) from the notation in the rest of the paper. Let n_1, ..., n_{H−} be the hard negative NND pairs, where H− ≪ N−.
The estimated AUC can be expressed as follows:

    AUC_hn = (1 / (N+ H−)) Σ_{i=1}^{N+} Σ_{l=1}^{H−} 1_{p_i > n_l}    (8)

¹ We follow the notation normally used in the ROC literature, where positive samples are expected to be scored higher than negative samples; in our case the scoring function is a distance, and pairs with lower distance are scored higher.

Since in general

    1_{p_i > n_j} ≤ 1_{p_i > n_l}   ∀ n_j ∈ N− \ H−, ∀ n_l ∈ H−    (9)

it can be demonstrated that, for both hard negative mining strategies, AUC_hn is an upper bound for the true AUC. It follows that the most appropriate choice is the strategy that provides the tighter bound. An analytical proof is provided in Appendix A.

4.3. Range query performance

ROC analysis does not directly represent the observed system performance, which also depends on the distribution of image types, the size of the clusters, and so forth. We analyzed an alternative performance measure, obtained by simulating the case of a single query image x compared against a collection of images Y, which is a special case of the general near-duplicate detection problem described in Section 4.1. An unsupervised, threshold-limited range search is conducted to retrieve a list of potential near-duplicates, which is used to estimate the number of FPs per query, FP_q(t). In practice, it is convenient to restrict the search to the K-nearest neighbors in order to cap the number of FPs per query.
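A threshold-limited range query capped at the K nearest neighbours can be sketched as follows. This is an illustrative brute-force sketch with names of our choosing; in our experiments this role is played by the FAISS flat index with exact L2 search.

```python
import numpy as np

def range_query(query, collection, t, k_max=10):
    """Return indices of collection images whose L2 distance to `query`
    is below threshold t, capped at the k_max nearest neighbours."""
    dist = np.sqrt(((collection - query) ** 2).sum(axis=1))
    nearest = np.argsort(dist)[:k_max]    # cap the number of candidates
    return [int(i) for i in nearest if dist[i] < t]
```

Averaging the number of results over negative queries then estimates FP_q(t), while the fraction of known near-duplicates retrieved, averaged over positive queries, estimates the average recall.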
The proposed experimental setup executes a number of positive queries (i.e., images with one or more known NDs) and negative queries (i.e., images with no expected NDs) over a dataset constructed as follows:

• positive queries were derived from the clusters of ND images: the first image of each cluster is used as a query and the rest are inserted in the database, as normally done for Holidays and other image retrieval benchmarks;
• negative query images were selected from the NND pairs, and a set of distractors is used to evaluate specificity; in practice, we use for convenience the same image pool used for hard negative mining.

For varying values of the threshold t = T_i on the distance measure, we compared the average recall, calculated over all positive queries, and the average number of FPs per query, calculated over all negative queries. Note that average recall differs from the pair-wise sensitivity used in ROC analysis, as each query may contain multiple pairs of varying "difficulty". The FPs per query depend on the size of the dataset and the specificity, as predicted by Equation 4.

5. Experimental setup

In this section, a detailed analysis of the experimental setup is given, covering the selection of the descriptors, their implementation, and the hard negative mining parameters.

5.1. Descriptors

Two sets of descriptors were compared in this work: global descriptors and CNN-based descriptors; for the latter, we compared examples of the two main approaches described in Section 2 (aggregation of raw deep convolutional features without embedding, and Siamese networks). Among global descriptors, GIST (Oliva & Torralba, 2001) was selected based on previous results on the MFND collection (Connor & Cardillo, 2016). The GIST, or spatial envelope, is a bio-inspired feature that simulates human visual perception to extract rough but concise context information (Oliva & Torralba, 2001).
The input image is decomposed using a spatial pyramid into N blocks, filtered by a bank of multi-scale, multi-orientation Gabor filters (4 scales, 8 orientations per scale), and then summarized by a feature extractor that captures the "gist" of the image, handling translational, angular, scale and illumination changes. We also experimented with perceptual hashing; those results are not reported, as they were generally very poor.

The SPoC (Sum-Pooled Convolutional) descriptor was initially proposed by Babenko & Lempitsky (2015). The features are extracted from the top convolutional layer of a pre-trained neural network and spatially aggregated using sum pooling. The length of the feature vector is thus equal to the depth of the final convolutional layer (usually in the order of the hundreds). Best results were obtained by extracting features after the ReLU activation, confirming previous findings (Babenko & Lempitsky, 2015). PCA whitening and compression are applied, and the vectors are normalized to unit length (L2 normalization).

The R-MAC architecture, proposed by Tolias et al. (2015), builds a compact feature vector by encoding several image regions in a single pass. First, sub-regions are defined using a fixed grid over a range of progressively finer scales l, ranging from 1 to L; then, max-pooling is used to extract features from each individual region. Each regional feature vector is post-processed with PCA whitening and L2 normalization. Finally, the regional feature vectors are summed into a single image vector, which is again L2-normalized.

The DeepRetrieval architecture, proposed by Gordo et al. (2016), employs a Siamese network to learn a ranking function based on the triplet loss. Let I_q be a query image with descriptor q, I+ be a relevant image with descriptor d+, and I− be a non-relevant image with descriptor d−.
The ranking triplet loss is defined as

    L(I_q, I+, I−) = ½ max(0, m + ‖q − d+‖² − ‖q − d−‖²)    (10)

where m is a scalar that controls the margin. At test time, the features are extracted from the top convolutional layer and aggregated using sum pooling and normalization. The DeepRetrieval architecture includes an additional proposal network, similar to the R-MAC grid, so that the features are calculated on several potential regions of interest, as opposed to the entire image. The DeepRetrieval network is pre-trained on the Landmarks dataset (Babenko et al., 2014).

5.2. Implementation

The tested descriptors, and related parameters, are summarized in Table 2. For SPoC and GIST, images were resized to 512 × 512, whereas for DeepRetrieval images were rescaled so that the longest side is equal to S.

A Python re-implementation of the original Matlab code by Oliva and Torralba was used for GIST, after converting images to grayscale². The SPoC descriptor was computed from pre-trained network architectures, namely VGG (Simonyan & Zisserman, 2014) and Residual Networks (He et al., 2016). We included models pre-trained on ImageNet (Deng et al., 2009), on the Places205 or Places365 datasets (Zhou et al., 2014, 2017a), and on a hybrid dataset including images from both ImageNet and Places. The R-MAC descriptor was re-implemented in Python based on the original Matlab implementation by the authors³; R-MAC was calculated only for the ResNet101 architecture. All networks were available in Caffe; pre-trained models were downloaded from the Caffe Model Zoo⁴, or were made available by the authors for the Places dataset⁵,⁶. We trained the PCA parameters, without reducing the number of features, on a representative subset of each collection (95,230 images for CLAIMS and 100,000 for MFND), which was not used in testing.
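Fitting such full-dimensional PCA-whitening parameters on a held-out descriptor set amounts to an SVD of the centred data. A minimal sketch follows; the function and parameter names are ours, not those of the original implementations.

```python
import numpy as np

def fit_pca_whitening(X):
    """Fit full-dimensional PCA whitening on training descriptors X (n, d)."""
    mean = X.mean(axis=0)
    # SVD of the centred data yields the principal axes and singular values.
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    eigvals = s ** 2 / (len(X) - 1)
    scale = 1.0 / np.sqrt(eigvals + 1e-12)   # per-component whitening scale
    return mean, Vt, scale

def apply_pca_whitening(v, mean, Vt, scale):
    """Project a descriptor, whiten it, and L2-normalize the result."""
    w = (v - mean) @ Vt.T * scale
    return w / np.linalg.norm(w)
```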
The parameters trained on the MFND collection were also used for the Holidays and California-ND collections; although the image characteristics of the datasets differ, this is consistent with previous works in the literature (Jegou et al., 2008).

The FAISS library (Johnson et al., 2017), specifically the flat index with L2 exact search, was used to index the collection and perform range queries. All experiments were run on a system with an i7-7700 CPU @ 3.60 GHz and an nVIDIA GTX1080Ti GPU.

Descriptor | Label | Size | Parameters
GIST (Oliva & Torralba, 2001) | GIST4 | 512 | number of blocks = 4
GIST (Oliva & Torralba, 2001) | GIST8 | 512 | number of blocks = 8
Deep Retrieval (Gordo et al., 2016) | DeepRet800 | 2048 | ResNet101, fine-tuned on Landmarks, S = 800, no multiresolution
Deep Retrieval (Gordo et al., 2016) | DeepRet500 | 2048 | ResNet101, fine-tuned on Landmarks, S = 500, no multiresolution
Deep Retrieval (Gordo et al., 2016) | DeepRet500MR | 2048 | ResNet101, fine-tuned on Landmarks, S = 500, multiresolution (2)
SPoC (Babenko & Lempitsky, 2015) | SP VGG19IM | 512 | VGG19, trained on ImageNet
SPoC (Babenko & Lempitsky, 2015) | SP VGG16PL | 512 | VGG16, trained on Places205
SPoC (Babenko & Lempitsky, 2015) | SP VGG16HY | 512 | VGG16, trained on Hybrid (Places205 & ImageNet)
SPoC (Babenko & Lempitsky, 2015) | SP ResNet101IM | 2048 | ResNet101, trained on ImageNet
SPoC (Babenko & Lempitsky, 2015) | SP ResNet152IM | 2048 | ResNet152, trained on ImageNet
SPoC (Babenko & Lempitsky, 2015) | SP ResNet152HY | 2048 | ResNet152, trained on Hybrid (Places365 & ImageNet)
R-MAC (Tolias et al., 2015) | RMAC | 2048 | ResNet101, trained on ImageNet, L = 2

Table 2: Summary of the descriptors used.
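As an illustration of the SPoC pipeline summarized in Table 2, the aggregation step can be sketched as follows. This is a simplified sketch: the (H, W, C) activation array from the last convolutional layer and the PCA-whitening parameters (fitted as described in Section 5.2) are assumed to be available, and the parameter names are ours.

```python
import numpy as np

def spoc_descriptor(conv_features, pca_mean, pca_components, whiten_scale):
    """Sum-pooled convolutional (SPoC) aggregation.

    conv_features: (H, W, C) post-ReLU activations of the last conv layer.
    pca_*:         PCA-whitening parameters fitted on a held-out image set.
    """
    v = conv_features.sum(axis=(0, 1))   # sum pooling over locations -> (C,)
    v = v / np.linalg.norm(v)            # first L2 normalization
    v = (v - pca_mean) @ pca_components.T * whiten_scale   # PCA whitening
    return v / np.linalg.norm(v)         # final L2 normalization
```

With unit-length vectors, the L2 distance used in the range queries is a monotone function of the cosine similarity.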
² http://people.csail.mit.edu/torralba/code/spatialenvelope/
³ https://github.com/gtolias/rmac
⁴ https://github.com/BVLC/caffe/wiki/Model-Zoo
⁵ https://github.com/CSAILVision/places365
⁶ http://www.europe.naverlabs.com/Research/Computer-Vision/Learning-Visual-Representations/Deep-Image-Retrieval

5.3. Hard negative mining

For each dataset (CLAIMS and MFND), we randomly selected a set of negative query images (4,500 and 5,000, respectively) and a larger pool for mining (80,000 and 70,000 images, respectively), after excluding IND or NIND pairs and images used to train the PCA parameters. The two hard negative mining strategies introduced in Section 4.2 were compared: in hn2, the 10 nearest neighbors were found for each query image, and the 10,000 most difficult pairs were then selected. The number of samples was increased for hn2 to account for images that belong to more than one pair, which we observed experimentally, and to ensure sufficient diversity. For GIST and DeepRetrieval, the hard negative mining procedure was computed for a single parameter set, to reduce the computational cost. For the Holidays and California-ND datasets, we used the NND pairs mined for MFND; this is consistent with previous works that have used the same collection as distractors for large-scale retrieval testing (Jegou et al., 2008).

For both datasets, the hard negative mining procedure was repeated for a large number of descriptors (8 for MFND, and 13 for CLAIMS), and the results were visually inspected for the presence of near-duplicates. A total of 101 (2.2%) and 121 (2.4%) ND pairs were found for the CLAIMS and MFND datasets, respectively, and their labels were changed accordingly.

6. Results

In this section, results of the ROC analysis are compared across different descriptors (Section 6.1) and different datasets (Section 6.2).
Finally, the performance obtained from random queries is analyzed and compared with that predicted by ROC analysis in Section 6.3.

Descriptor | CLAIMS AUROC | MFND AUROC-IND | MFND AUROC-all | California-ND AUROC | Holidays AUROC
GIST4 | 0.397 (0.381–0.413) | 0.808 (0.799–0.817) | 0.5 (0.491–0.509) | 0.598 (0.587–0.610) | 0.365 (0.351–0.378)
GIST8 | 0.45 (0.433–0.467) | 0.854 (0.847–0.862) | 0.561 (0.552–0.569) | 0.696 (0.685–0.706) | 0.493 (0.478–0.508)
DeepRet800 | 0.891 (0.88–0.903) | 0.994 (0.993–0.996) | 0.983 (0.981–0.984) | 0.929 (0.924–0.934) | 0.744 (0.731–0.758)
DeepRet500 | 0.88 (0.868–0.892) | 0.992 (0.991–0.994) | 0.979 (0.978–0.981) | 0.917 (0.911–0.923) | 0.748 (0.734–0.761)
DeepRet500MR | 0.891 (0.88–0.903) | 0.992 (0.99–0.994) | 0.983 (0.982–0.984) | 0.938 (0.933–0.943) | 0.789 (0.777–0.802)
SP VGG19IM | 0.391 (0.375–0.407) | 0.934 (0.929–0.940) | 0.904 (0.901–0.908) | 0.88 (0.873–0.887) | 0.671 (0.656–0.685)
SP VGG16PL | 0.446 (0.429–0.463) | 0.907 (0.901–0.914) | 0.881 (0.877–0.886) | 0.909 (0.903–0.915) | 0.675 (0.66–0.689)
SP VGG16HY | 0.418 (0.401–0.434) | 0.929 (0.923–0.934) | 0.906 (0.903–0.910) | 0.896 (0.89–0.903) | 0.697 (0.683–0.711)
SP ResNet101IM | 0.518 (0.501–0.536) | 0.961 (0.957–0.965) | 0.941 (0.938–0.943) | 0.93 (0.925–0.936) | 0.776 (0.763–0.789)
SP ResNet152IM | 0.522 (0.505–0.540) | 0.963 (0.959–0.967) | 0.944 (0.941–0.946) | 0.921 (0.916–0.927) | 0.78 (0.767–0.793)
SP ResNet152HY | 0.459 (0.442–0.476) | 0.924 (0.918–0.930) | 0.916 (0.913–0.920) | 0.866 (0.859–0.874) | 0.737 (0.723–0.751)
RMAC | 0.323 (0.308–0.338) | 0.99 (0.988–0.992) | 0.945 (0.942–0.947) | 0.88 (0.873–0.888) | 0.737 (0.723–0.751)
Average | 0.549 | 0.937 | 0.870 | 0.863 | 0.684

Table 3: Area under the ROC curve (AUC) with 95% confidence intervals. The NND pairs are extracted using the first hard negative mining strategy (hn1).
For the MFND dataset, the AUC is calculated both over IND and NIND pairs (AUROC-all) and for IND pairs vs. NND pairs only (AUROC-IND); in the latter case, NINDs are not counted as either FPs or TPs. The average AUC across all descriptors provides a semi-quantitative estimate of the "difficulty" of each dataset.

6.1. ROC analysis

The area under the ROC curve (AUC) for all descriptors, along with 95% confidence intervals, is reported in Tables 3 and 4.

Descriptor | CLAIMS AUROC | MFND AUROC-IND | MFND AUROC-all | California-ND AUROC | Holidays AUROC
GIST4 | 0.108 (0.101–0.114) | 0.706 (0.696–0.715) | 0.317 (0.31–0.323) | 0.398 (0.389–0.408) | 0.178 (0.17–0.186)
GIST8 | 0.121 (0.114–0.128) | 0.756 (0.747–0.765) | 0.346 (0.339–0.352) | 0.462 (0.452–0.472) | 0.243 (0.234–0.253)
DeepRet800 | 0.381 (0.366–0.395) | 0.99 (0.988–0.992) | 0.969 (0.967–0.970) | 0.929 (0.924–0.934) | 0.628 (0.614–0.642)
DeepRet500 | 0.428 (0.412–0.443) | 0.987 (0.985–0.989) | 0.962 (0.96–0.964) | 0.917 (0.911–0.923) | 0.614 (0.6–0.628)
DeepRet500MR | 0.46 (0.444–0.476) | 0.987 (0.985–0.989) | 0.97 (0.968–0.971) | 0.938 (0.933–0.943) | 0.676 (0.663–0.690)
SP VGG19IM | 0.24 (0.229–0.251) | 0.882 (0.876–0.889) | 0.82 (0.816–0.825) | 0.787 (0.779–0.796) | 0.52 (0.507–0.534)
SP VGG16PL | 0.288 (0.274–0.302) | 0.866 (0.859–0.873) | 0.821 (0.817–0.826) | 0.859 (0.851–0.866) | 0.577 (0.563–0.591)
SP VGG16HY | 0.267 (0.255–0.279) | 0.881 (0.874–0.887) | 0.834 (0.83–0.838) | 0.814 (0.806–0.822) | 0.563 (0.55–0.577)
SP ResNet101IM | 0.397 (0.382–0.412) | 0.943 (0.939–0.948) | 0.911 (0.908–0.914) | 0.892 (0.885–0.898) | 0.685 (0.671–0.698)
SP ResNet152IM | 0.396 (0.381–0.411) | 0.947 (0.943–0.952) | 0.917 (0.914–0.920) | 0.885 (0.878–0.891) | 0.694 (0.68–0.707)
SP ResNet152HY | 0.327 (0.313–0.340) | 0.881 (0.874–0.887) | 0.866 (0.862–0.870) | 0.784 (0.776–0.793) | 0.627 (0.613–0.641)
RMAC | 0.336 (0.322–0.350) | 0.985 (0.983–0.988) | 0.917 (0.914–0.920) | 0.825 (0.817–0.833) | 0.641 (0.627–0.655)
Average | 0.312 | 0.901 | 0.804 | 0.791 | 0.554

Table 4: Area under the ROC curve (AUC) with 95% confidence intervals. The NND pairs are extracted using the second hard negative mining strategy (hn2). For the MFND dataset, the AUC is calculated both over IND and NIND pairs (AUROC-all) and for IND pairs vs. NND pairs only (AUROC-IND); in the latter case, NINDs are not counted as either FPs or TPs. The average AUC across all descriptors provides a semi-quantitative estimate of the "difficulty" of each dataset.

There is a large difference in estimated performance depending on the hard negative mining technique employed, with hn1 yielding optimistically biased estimates. This is most evident for the CLAIMS dataset, which contains a larger number of visually similar, but not duplicate, images. For IND detection, the difference between global features and deep learning-based features is less pronounced. The results are lower than previously reported in the literature because the IND dataset has been significantly expanded, and the new pairs include transformations to which previous descriptors were less robust.

The DeepRetrieval architecture generally outperforms SPoC on all datasets, despite being trained on a different dataset (Landmarks) with no fine-tuning. The actual gap in performance is very low for the MFND dataset and increases for the other datasets, with CLAIMS exhibiting the largest gap. It is worth noting, however, that DeepRetrieval is also more prone to FPs due to visually similar images, and its performance estimates are more sensitive than SPoC's to the hard negative mining strategy. The R-MAC descriptor performs slightly better than SPoC on the MFND dataset, and slightly worse on the others. The performance of the SPoC descriptor strongly depends on the network architecture, with Residual Networks consistently outperforming VGG on all datasets.
The dataset on which the network was trained has instead a limited impact, possibly due to the effect of PCA whitening. ROC curves for selected descriptors and datasets are reported in Fig. 3; the remaining ROC curves are available as supplementary material. On the MFND collection, the best performance is obtained by the DeepRet descriptor, retrieving 96% of the true positives at a FP rate of 1.43 × 10⁻⁶.

Figure 3: ROC curves for the Deep Retrieval descriptors for hard negative mining strategies hn1 (a-d) and hn2 (e-h), respectively. A logarithmic scale was used for the FP rate axis to highlight low values in the 0.01–0.1 range. Since the NND pairs were extracted using a hard negative mining strategy, a 0.1 FP rate corresponds to a projected minimum FP rate of 1.25 × 10⁻⁶ and 1.43 × 10⁻⁶ for the CLAIMS and MFND datasets, respectively.

6.2. Dataset comparison

In order to better highlight differences between the datasets, we computed the FP and TP rates with respect to the distance threshold for each dataset and for each of the two best performing descriptors, as detailed in Fig. 4.

Figure 4: Comparison of TP and FP rates across different datasets for selected descriptors: SP ResNet101 ImageNet (a-b) and DeepRet800 (c-d). For FPs, pairs selected with hard negative mining strategies hn1 and hn2 are plotted separately.

Not surprisingly, INDs are more easily detected than NINDs. The CLAIMS dataset contains the most challenging near-duplicates, closely followed by the Holidays dataset. Given the annotation procedure followed for the MFND benchmark, it is possible that the NIND examples are skewed towards examples that are more easily detected using the present descriptors, and future experiments will likely find new examples. Examples of ND pairs that were poorly scored are reported in Fig. 5; empirically, large changes in viewpoint appear among the most challenging differences.

We also compared FP rates on the MFND and CLAIMS datasets with the two hard negative mining strategies. For hn1, MFND appears to be more difficult than CLAIMS, whereas for hn2 the two datasets are quite comparable for both descriptors. Given a random query image, it is more likely to find a similar image in MFND than in CLAIMS, but CLAIMS contains larger clusters of images that are both semantically and visually similar, as is likely to be the case for any dataset drawn from a focused domain. Examples of hard negatives (hn2) for both datasets are reported in Fig. 5.

Figure 5: Examples of challenging image pairs for unsupervised near-duplicate detection. Examples (a-e) are challenging negative examples which, despite high semantic and visual similarity, are not near-duplicates. Examples derived from CLAIMS (a-c) are related to image types that are particularly common in this collection, whereas examples from MFND (d-e) are mostly of subjects that are particularly popular on the Internet, such as sunsets and cats. Examples (g-h) are challenging near-duplicates from the CLAIMS dataset which were given low similarity scores by all descriptors; common patterns that are difficult to detect include drastic changes in viewpoint, or one of the two images in the pair representing a detail of a larger scene.

6.3. Query performance analysis

In this section, the two best performing descriptors at ROC analysis were compared: DeepRet800 and SP ResNet152IM, using the experimental setup detailed in Section 4.3. We performed threshold-limited queries at thresholds T corresponding to a FP rate in the [0.01–0.1] range, and a maximum number of results per query K between 2 and 10.

The results are plotted in Fig. 6. Since we are using the same dataset for both hard negative mining and estimating query performance, it can easily be shown from Eq. 4 that a FP rate of 0.1 should correspond to an average number of FPs/query of 0.1 as well. Estimates based on hn1 have larger deviations from the expected values, especially for the DeepRet descriptor on the CLAIMS dataset, which is 20× larger than expected. For hn2, the measured FPs/query are usually slightly lower than predicted. Since we limit the maximum number of images retrieved by each query, this factor may explain the discrepancy, which is higher for the CLAIMS dataset, where images are more tightly clustered in feature space. It should also be noted that in Eq. 4 the specificity depends only on the threshold, and not on the query image; our experiments, however, suggest that this does not hold in practice, and that certain types of images are more prone to false positives.

7. Discussion

7.1. Dataset and methodology

Our contributions are a crucial step towards a principled evaluation methodology, through which estimating the specificity of unsupervised detection on arbitrarily sized datasets is reduced to the simpler problem of binary classification of ND vs. NND pairs; a tractable number of NND pairs can be extracted through hard negative mining strategies. In the simplest implementation, hard negatives can be mined by finding the nearest neighbors in the dataset, using exact or approximate search depending on its size.

We established the first benchmark for unsupervised NIND detection, an extension of the MFND benchmark comprising more than 20,000 pairs of INDs or NINDs.
We followed a semi-automatic procedure that could potentially locate almost all pairs of NDs in the dataset (Connor et al., 2015). In our experiments, the mined hard negatives may still occasionally contain a small percentage (1–2%) of NDs: hence, the annotation of the MFND benchmark should be regarded as an ongoing process that will grow as new descriptors are tested. For comparison, in an initial experiment performed before extending the dataset (Connor et al., 2015), roughly 8% of the mined hard negatives were either NIND or IND pairs. Our experimental comparison of state-of-the-art descriptors suggests that, when compared with a real-life dataset representative of a fraud detection application, MFND is a surprisingly realistic benchmark for estimating the specificity. On the other hand, NIND samples in MFND are on average slightly easier to detect than in the other datasets, albeit the difference is much smaller than for IND samples.

The presented methodology builds upon previous results by Connor & Cardillo (2016) on IND detection; we proved that the accuracy of the estimated specificity crucially depends on choosing a proper hard negative mining strategy. We provide two additional contributions that strengthen the adoption of this methodology: first, we show analytically that the AUC of the ROC obtained on the hard negative subset is an upper bound of the true AUC. Secondly, we show experimentally that, starting from the experimental ROC, we are able to predict quite accurately the false positive rate per query, which is indirect proof that the ROC is indeed a good approximation of the true curve. For this experimental comparison, we used the same dataset for hard negative mining and performance evaluation, but in principle it would be more convenient to perform the hard negative mining on a smaller dataset.
Future work is needed to determine whether the false positive rate can be extrapolated to a larger dataset.

An alternative, more intuitive figure of merit would be the average recall and FPs/query as a function of the distance threshold t. This curve is less practical to use, as it depends on the size of the dataset and, being unbounded, defining summary performance measures such as the area under the curve is not straightforward. It closely resembles the Free-Response Receiver Operating Characteristic (FROC), an extension of ROC analysis used for many diagnostic tasks where the observer (human or machine) can identify the location of an arbitrary number of potential abnormalities, as opposed to the binary prediction task of determining whether an abnormality is present or not (Petrick et al., 2013). In that context, alternatives to the AUC have been proposed and could be extended to our use case.

7.2. Performance comparison

To the best of our knowledge, this is the first attempt to evaluate deep learning descriptors on unsupervised discovery of non-identical near-duplicates. Connor & Cardillo (2016) argued that global descriptors are sufficient for IND detection. Our experience with the GIST descriptor, which obtained the highest performance in that comparison, suggests that CNN-based descriptors offer significant advantages also in this case, and compare favorably in terms of execution time. We have included in our comparison three widely used architectures: SPoC, R-MAC and DeepRetrieval. Note that the DeepRetrieval architecture includes region pooling (like R-MAC), but unlike the other descriptors its features are fine-tuned on the Landmarks dataset for the retrieval task using a Siamese network. Confirming previous results on instance-level image retrieval benchmarks, nicely summarized by Zheng et al.
(2017), our experimental results overall favor the choice of fine-tuning the representation for retrieval, as opposed to using off-the-shelf features trained with a classification loss (Gordo et al., 2016). The actual performance gap, however, strongly depends on the network architecture, the chosen trade-off between specificity and sensitivity, and the underlying dataset structure.

The Holidays dataset has been extensively used to benchmark instance-level retrieval tasks, and all descriptors analyzed in this paper were previously tested on it, albeit using a different approach for performance assessment. The performance (mean Average Precision) reported in previous literature is as follows: 75.9 (SPoC), 85.2 (R-MAC) and 86.7 (DeepRetrieval) (Gordo et al., 2016). For the Holidays near-duplicate detection task, the best results for the three descriptors are 0.641 (R-MAC), 0.694 (SPoC) and 0.676 (DeepRetrieval), suggesting that SPoC may outperform architectures that are significantly more complex to train and deploy. We should note that none of the descriptors was trained on the Holidays dataset, but the PCA for SPoC and R-MAC was trained on the MFND dataset, which is used as the distractor set for the near-duplicate detection task.

First, the task is different, not only because the performance measure is different, but also because in our experimental setting images from the MFND collection are used as negative samples; this is needed to evaluate specificity, which is difficult to do directly on Holidays due to the small size of the dataset and the absence of distractor images. We found experimentally that in many cases the increase in sensitivity is counterbalanced by a corresponding increase in the false positive rate. This is especially evident for the R-MAC descriptor, for which the overall performance decreases on all datasets except MFND.
Secondly, each descriptor has many parameters, and the best combination is dataset dependent. While exploring all possible combinations is a daunting task, our experiments provide some useful insights. We found that the backbone depth and architecture were the single most important factor affecting performance. The original SPoC paper, and many subsequent comparisons (Babenko & Lempitsky, 2015; Gordo et al., 2016; Zheng et al., 2017), employed the VGG architecture as backbone, but we found a major boost in performance by using Residual Networks; the DeepRetrieval architecture, by contrast, uses ResNet101 as its backbone (Gordo et al., 2017). In our experiments, the depth of the architecture appears to be a more relevant factor than the specific feature training, an important consideration that practitioners should keep in mind.

When compared on the same backbone architecture (ResNet101), DeepRetrieval outperformed SPoC on CLAIMS and MFND, but not on Holidays. The Holidays dataset contains a lot of outdoor and natural scene imagery, which may not be sufficiently covered by the Landmarks dataset. We expected that SPoC features extracted from networks trained on a scene recognition task, for instance on the Places dataset or a mixture of Places and ImageNet, could perform better for near-duplicate detection, since many near-duplicates include complex scenes. However, we did not find consistent advantages, especially when using Residual Networks as the backbone architecture. In a high-specificity setting, the difference between pre-trained and fine-tuned networks is further reduced, as visually similar images tend to generate many false positives. Future work will be dedicated to training a specific descriptor for unsupervised near-duplicate detection, incorporating specificity requirements at training time as well as at test time.
In the literature, feature weighting schemes have also been proposed (Mohedano et al., 2018; Kalantidis et al., 2016); such descriptors can be trained in an unsupervised fashion, or require no training at all. The performance of such schemes from the point of view of specificity is another direction worth exploring.

In this work, we have used the same descriptor and distance function for all images, regardless of their content. Notably, images are not uniformly distributed in the embedded feature space, and the specificity is largely affected by the presence of clusters of images that are very similar from a semantic and visual point of view. This behaviour is observed in both the CLAIMS and MFND datasets, despite their different origins. Exploiting this underlying structure to improve the performance of ND discovery is an important avenue for future research.

8. Conclusions

Unsupervised discovery of near-duplicates is an important problem in digital forensics and fraud detection. As the number of false alarms grows quadratically with the size of the input dataset, practical applications require a very high specificity, or conversely a low false positive rate, often in the range of 10⁻⁷–10⁻¹⁰. Hard negative mining can be used to select a subset of the dataset, on which ROC analysis can be used to evaluate the performance.

We have evaluated a selection of descriptors based on Convolutional Neural Networks following the proposed methodology. While the task of NIND detection is conceptually similar to instance-level image retrieval, we experimentally found that the same descriptors may be ranked differently, as the area under the ROC curve depends more strongly on specificity than the mean Average Precision does. This strengthens the need for a dedicated benchmark targeting applications where unsupervised search is required.
Our findings in general favor the choice of fine-tuning deep convolutional networks, as opposed to using off-the-shelf features, but differences at high specificity settings strongly depend on the specific dataset and are often small. On the MFND collection, promising performance is obtained by the DeepRetrieval descriptor, retrieving 96% of the true positives at a false positive rate of 1.43 x 10^-6. However, further improvements in specificity would benefit many applications, especially in the forensics domain.

Appendix A. Hard negative mining provides an upper bound for the AUC

In this section, we prove that the AUC calculated using either hard negative mining strategy is an upper bound for the true AUC.

Proposition 1. When using hard negative mining strategy hn2, the resulting $AUC_{hn2}$ is an upper bound for the true AUC.

Proof. Hard negative mining strategy hn2 ensures that the selected $n_l$ pairs are the most difficult pairs within the set $N^-$; it follows that

$$f(n_j) \leq f(n_l) \quad \forall n_j \in N^- \setminus H^-, \ \forall n_l \in H^- \tag{A.1}$$

and consequently

$$\mathbb{1}_{p_i > n_j} \leq \mathbb{1}_{p_i > n_l} \quad \forall n_j \in N^- \setminus H^-, \ \forall n_l \in H^- \tag{A.2}$$

The AUC can be decomposed into two terms:

$$AUC = \frac{1}{N^+ N^-} \left( \sum_{i=1}^{N^+} \sum_{l=1}^{H^-} \mathbb{1}_{p_i > n_l} + \sum_{i=1}^{N^+} \sum_{j=1}^{N^- - H^-} \mathbb{1}_{p_i > n_j} \right) \tag{A.3}$$

where the first term is known and is proportional to $AUC_{hn2}$ from Eq. 8, and the second term is the contribution of the negative samples that are not observed. However, we can bound the second term by replicating the hard negative samples $\frac{N^- - H^-}{H^-}$ times; combining with Eq. A.2, we conclude that

$$AUC \leq \frac{1}{N^+ N^-} \left( \sum_{i=1}^{N^+} \sum_{l=1}^{H^-} \mathbb{1}_{p_i > n_l} + \sum_{k=1}^{\frac{N^- - H^-}{H^-}} \sum_{i=1}^{N^+} \sum_{l=1}^{H^-} \mathbb{1}_{p_i > n_l} \right) = \frac{1}{N^+ N^-} \left( N^+ H^- AUC_{hn2} + \frac{N^- - H^-}{H^-} N^+ H^- AUC_{hn2} \right) = AUC_{hn2}$$

Proposition 2. When using hard negative mining strategy hn1, the resulting $AUC_{hn1}$ is an upper bound for the true AUC.

Proof.
Again, let us decompose the AUC as the sum of two terms, where the first term is known and is proportional to $AUC_{hn1}$, and the second term is the contribution of the negative samples that are not observed, as detailed in Eq. A.3. Each sample $n_j$ consists of a pair of images $(x_k, y_m)$, where $x_k \in X$ and $y_m \in Y$; in other terms, $N^- = \{(x_k, y_m), k = 1, \ldots, K, m = 1, \ldots, M\}$. Then, according to the definition of hn1,

$$H^- = \{(x_k, y_{m^*}) \mid m^* = \arg\max_m f(x_k, y_m)\} \tag{A.4}$$

The sums over $l = 1, \ldots, H^-$ and $j = 1, \ldots, N^- - H^-$ in Eq. A.3 can be decomposed in terms of $k = 1, \ldots, K$ and $m = 1, \ldots, M$ as follows:

$$AUC = \frac{1}{N^+ N^-} \left( \sum_{i=1}^{N^+} \sum_{k=1}^{K} \mathbb{1}_{p_i > (x_k, y_{m^*})} + \sum_{i=1}^{N^+} \sum_{k=1}^{K} \sum_{m=1, m \neq m^*}^{M} \mathbb{1}_{p_i > (x_k, y_m)} \right) \tag{A.5}$$

where $N^- = K \times M$ and, according to the definition of hn1, there are exactly $K$ hard negative pairs. By definition, $f(x_k, y_m) \leq f(x_k, y_{m^*})$, and thus

$$\mathbb{1}_{p_i > (x_k, y_m)} \leq \mathbb{1}_{p_i > (x_k, y_{m^*})} \quad \forall k = 1, \ldots, K, \ \forall m \neq m^* \tag{A.6}$$

Combining Eqs. A.5 and A.6, we conclude that

$$AUC \leq \frac{1}{N^+ N^-} \left( N^+ K \, AUC_{hn1} + \sum_{i=1}^{N^+} \sum_{k=1}^{K} \sum_{m=1, m \neq m^*}^{M} \mathbb{1}_{p_i > (x_k, y_{m^*})} \right) = \frac{1}{N^+ N^-} \left( N^+ K \, AUC_{hn1} + (M - 1) N^+ K \, AUC_{hn1} \right) = AUC_{hn1}$$

Acknowledgment

The authors gratefully acknowledge the financial support of Reale Mutua Assicurazioni. We are grateful to Lucia Sabatino, Lucia Romano and Giulia Gemesio for help in data cleaning and annotation.

References

Amerini, I., Uricchio, T., & Caldelli, R. (2017). Tracing images back to their social network of origin: A CNN-based approach. In Information Forensics and Security (WIFS), 2017 IEEE Workshop on (pp. 1-6).

Babenko, A., & Lempitsky, V. (2015). Aggregating local deep features for image retrieval. In Proceedings of the IEEE International Conference on Computer Vision.

Babenko, A., Slesarev, A., Chigorin, A., & Lempitsky, V. (2014). Neural codes for image retrieval.
In European Conference on Computer Vision (pp. 584-599).

Balntas, V., Riba, E., Ponsa, D., & Mikolajczyk, K. (2016). Learning local feature descriptors with triplets and shallow convolutional neural networks. In BMVC (p. 3).

Battiato, S., Farinella, G. M., Puglisi, G., & Ravì, D. (2014). Aligning codebooks for near duplicate image detection. Multimedia Tools and Applications.

Bay, H., Ess, A., Tuytelaars, T., & Van Gool, L. (2008). Speeded-up robust features (SURF). Computer Vision and Image Understanding.

Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition.

Chen, M., Li, Y., Zhang, Z., Hsu, C.-H., & Wang, S. (2017). Real-time, large-scale duplicate image detection method based on multi-feature fusion. Journal of Real-Time Image Processing.

Chennamma, H., Rangarajan, L., & Rao, M. (2009). Robust near duplicate image matching for digital image forensics. International Journal of Digital Crime and Forensics (IJDCF).

Chiu, C.-Y., Li, S.-Y., & Hsieh, C.-Y. (2012). Video query reformulation for near-duplicate detection. IEEE Transactions on Information Forensics and Security.

Chum, O., Philbin, J., Zisserman, A. et al. (2008). Near duplicate image detection: min-hash and tf-idf weighting. In BMVC (pp. 812-815).

Cicconet, M., Elliott, H., Richmond, D. L., Wainstock, D., & Walsh, M. (2018). Image forensics: Detecting duplication of scientific images with manipulation-invariant image similarity.

Connor, R., Cardillo, F., MacKenzie-Leigh, S., & Moss, R. (2015). Identification of MIR-Flickr near-duplicate images.

Connor, R., & Cardillo, F. A. (2016). Quantifying the specificity of near-duplicate image classification functions.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009.
CVPR 2009. IEEE Conference on (pp. 248-255).

Foo, J. J., & Sinha, R. (2007). Pruning SIFT for scalable near-duplicate image matching. In Proceedings of the Eighteenth Conference on Australasian Database - Volume 63 (pp. 63-71).

Gonçalves, F. M. F., Guilherme, I. R., & Pedronette, D. C. G. (2018). Semantic guided interactive image retrieval for plant identification. Expert Systems with Applications.

Gordo, A., Almazán, J., Revaud, J., & Larlus, D. (2016). Deep image retrieval: Learning global representations for image search. In European Conference on Computer Vision (pp. 241-257).

Gordo, A., Almazán, J., Revaud, J., & Larlus, D. (2017). End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision.

Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology.

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Hirano, Y., Garcia, C., Sukthankar, R., & Hoogs, A. (2006). Industry and object recognition: Applications, applied research and challenges. In Toward Category-Level Object Recognition (pp. 49-64).

Hu, Y., Cheng, X., Chia, L.-T., Xie, X., Rajan, D., & Tan, A.-H. (2009). Coherent phrase model for efficient image near-duplicate retrieval. IEEE Transactions on Multimedia.

Huiskes, M. J., & Lew, M. S. (2008). The MIR Flickr retrieval evaluation. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval (pp. 39-43).

Jegou, H., Douze, M., & Schmid, C. (2008). Hamming embedding and weak geometric consistency for large scale image search. In European Conference on Computer Vision (pp. 304-317).

Jinda-Apiraksa, A., Vonikakis, V., & Winkler, S. (2013).
California-ND: An annotated dataset for near-duplicate detection in personal photo collections. In Quality of Multimedia Experience (QoMEX), 2013 Fifth International Workshop on (pp. 142-147).

Johnson, J., Douze, M., & Jégou, H. (2017). Billion-scale similarity search with GPUs.

Kalantidis, Y., Mellina, C., & Osindero, S. (2016). Cross-dimensional weighting for aggregated deep convolutional features. In European Conference on Computer Vision (pp. 685-701).

Ke, Y., Sukthankar, R., & Huston, L. (2004). An efficient parts-based near-duplicate and sub-image retrieval system. In Proceedings of the 12th Annual ACM International Conference on Multimedia (pp. 869-876).

Kim, S., Wang, X.-J., Zhang, L., & Choi, S. (2015). Near duplicate image discovery on one billion images. In Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on (pp. 943-950).

Li, J., Qian, X., Li, Q., Zhao, Y., Wang, L., & Tang, Y. Y. (2015). Mining near duplicate image groups. Multimedia Tools and Applications.

Li, P., Shen, B., & Dong, W. (2018). An anti-fraud system for car insurance claim based on visual evidence.

Liu, L., Lu, Y., & Suen, C. Y. (2015). Variable-length signature for near-duplicate image matching. IEEE Transactions on Image Processing.

Mohedano, E., McGuinness, K., Giró-i-Nieto, X., & O'Connor, N. E. (2018). Saliency weighted convolutional features for instance search. In 2018 International Conference on Content-Based Multimedia Indexing (CBMI) (pp. 1-6).

Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision.

de Oliveira, A. A., Ferrara, P., De Rosa, A., Piva, A., Barni, M., Goldenstein, S., Dias, Z., & Rocha, A. (2016). Multiple parenting phylogeny relationships in digital images.
IEEE Transactions on Information Forensics and Security.

Petrick, N., Sahiner, B., Armato, S. G., Bert, A., Correale, L., Delsanto, S., Freedman, M. T., Fryd, D., Gur, D., Hadjiiski, L. et al. (2013). Evaluation of computer-aided detection and diagnosis systems. Medical Physics.

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition.

Tolias, G., Sicre, R., & Jégou, H. (2015). Particular object retrieval with integral max-pooling of CNN activations.

Wan, J., Wang, D., Hoi, S. C. H., Wu, P., Zhu, J., Zhang, Y., & Li, J. (2014). Deep learning for content-based image retrieval: A comprehensive study. In Proceedings of the 22nd ACM International Conference on Multimedia (pp. 157-166).

Wang, J., Song, Y., Leung, T., Rosenberg, C., Wang, J., Philbin, J., Chen, B., & Wu, Y. (2014). Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Xie, L., Tian, Q., Zhou, W., & Zhang, B. (2014). Fast and accurate near-duplicate image search with affinity propagation on the ImageWeb. Computer Vision and Image Understanding.

Xu, D., Cham, T. J., Yan, S., Duan, L., & Chang, S.-F. (2010). Near duplicate identification with spatially aligned pyramid matching. IEEE Transactions on Circuits and Systems for Video Technology.

Zagoruyko, S., & Komodakis, N. (2015). Learning to compare image patches via convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Zheng, L., Yang, Y., & Tian, Q. (2017). SIFT meets CNN: A decade survey of instance retrieval.

Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., & Torralba, A. (2017a). Places: A 10 million image database for scene recognition.

Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014).
Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems.

Zhou, W., Li, H., & Tian, Q. (2017b). Recent advance in content-based image retrieval: A literature survey.

Zhou, Z., Wang, Y., Wu, Q. J., Yang, C.-N., & Sun, X. (2017c). Effective and efficient global context verification for image copy detection. IEEE Transactions on Information Forensics and Security.

Figure 6: Average recall vs. FPs/query for the CLAIMS (a-b, e-f) and MFND (c-d, g-h) datasets, with thresholds calculated using hard negative mining hn1 (top row, a-d) and hn2 (bottom row, e-h). Performance is measured at fixed thresholds (dots in the above curves); bars indicate the standard error. The maximum number of images retrieved by each query is limited to K = 2, 4, 6, 8, 10; results are plotted as separate curves.