Using Weighted Sampling to Understand the Prevalence of Spam
To effectively fight spam, we need an unbiased estimate of how much bad content there is in the ecosystem and where it resides. In this presentation we discuss sampling schemes to identify the small percentage of bad content viewed from both user generated content and commercially-motivated content such as ads and sponsored posts. These methods specifically employ ML-derived classifiers to weight the sampling, increasing the volume of bad content in the samples. With more bad content we are able to segment it further, allowing us to measure the prevalence of bad material in certain segments, or as identified by certain policies.