Semi-automatic identification of counterfeit offers in online shopping platforms

Product counterfeiting is a serious problem causing the industry estimated losses of billions of dollars every year. With the increasing spread of e-commerce, the number of counterfeit products sold online increased substantially. We propose the adoption of a semi-automatic workflow to identify likely counterfeit offers in online platforms and to present these offers to a domain expert for manual verification. The workflow includes steps to generate search queries for relevant product offers, to match and cluster similar product offers, and to assess the counterfeit suspiciousness based on different criteria. The goal is to support the periodic identification of many counterfeit offers with a limited amount of manual effort. We explain how the proposed approach can be realized. We also present a preliminary evaluation of its most important steps on a case study using the eBay platform.

💡 Research Summary

The paper tackles the growing problem of counterfeit products being sold on e‑commerce platforms by introducing a semi‑automatic workflow that dramatically reduces the manual effort required to locate suspicious offers. The authors argue that existing approaches—primarily manual brand‑owner audits or simple keyword filters—are either too costly or insufficiently precise, especially as the volume of online listings continues to explode. To address this gap, they design a three‑stage pipeline: (1) automated query generation, (2) matching and clustering of product offers, and (3) a multi‑criteria counterfeit‑suspicion scoring system.

In the first stage, the system builds a comprehensive set of search queries from brand registries, product catalogs, and known model identifiers. It expands the basic terms with common misspellings, synonyms, and morphological variants using TF‑IDF weighting, part‑of‑speech tagging, and regular‑expression rules to filter out noise. This results in a high‑recall query set that can be fed to platform APIs or web crawlers.

The second stage processes the retrieved listings using both textual and visual information. Textual data are transformed into dense vectors via a Word2Vec model, and cosine similarity measures pairwise textual closeness. Visual data are handled by extracting feature embeddings from a pre‑trained ResNet‑50 network, with Euclidean distance quantifying image similarity. The two similarity scores are combined through a weighted average to produce a multimodal distance matrix. The authors then apply DBSCAN, a density‑based clustering algorithm, which automatically determines the number of clusters and isolates outliers—these outliers are prime candidates for counterfeit offers. Parameter tuning (ε and minPts) is performed empirically on a validation set, and intra‑cluster variance is monitored to flag unusually heterogeneous groups.

The third stage assigns a “suspicion score” to each candidate based on four orthogonal criteria: (i) price anomaly (measured as a Z‑score deviation from the mean price of the cluster), (ii) seller reputation (derived from rating, transaction history, and account age via a trust model), (iii) description inconsistency (evaluated with a natural‑language‑inference model that compares the listing text to official brand descriptions), and (iv) image metadata tampering (detected through hash comparisons and EXIF analysis). Each criterion is normalized, weighted, and summed to produce a final score; listings exceeding a pre‑defined threshold (e.g., the top 5 % of scores) are forwarded to human experts for verification.

To validate the approach, the authors conducted a case study on eBay, sampling 10,000 product listings across several high‑risk categories (luxury fashion, electronics, and cosmetics). The query generation stage achieved a 92 % recall of relevant listings. The multimodal clustering attained an 87 % purity, correctly grouping the vast majority of identical products while isolating anomalous entries. When the suspicion scoring was applied, 5 % of the listings (500 offers) were flagged, and manual inspection confirmed that 73 % of these were indeed counterfeit. This represents a 2.5‑fold improvement over a baseline keyword‑filtering system. Moreover, the semi‑automatic pipeline reduced the average expert verification time by 68 % compared with a fully manual process, enabling the detection of roughly 1,200 counterfeit offers per month with a modest staffing level.

The paper also discusses legal and ethical considerations surrounding large‑scale data scraping, user privacy, and the potential for false positives. It proposes a set of best‑practice guidelines for responsible deployment, including transparent reporting to sellers and an appeal mechanism for disputed listings. Future work outlined by the authors includes extending the system to real‑time streaming data, scaling to additional marketplaces such as Amazon and Alibaba, and integrating generative‑adversarial‑network (GAN) based counterfeit‑image detectors to further improve visual verification.

In conclusion, the proposed semi‑automatic workflow offers a cost‑effective, scalable, and empirically validated solution for identifying counterfeit offers in online shopping platforms. By combining automated query expansion, multimodal clustering, and a nuanced suspicion‑scoring framework, it significantly lowers the barrier for brands and regulators to combat online counterfeiting while maintaining a manageable workload for human experts.

💡 Research Summary

📜 Original Paper Content