Journalists at The Washington Post have investigated Google’s C4 data set, which has been used to train AI models at Google and Facebook. The list of sites holds few surprises, apart from a couple of odd choices (a World of Warcraft forum and sites that sell dumpsters) and, of course, personal blogs. A neat tool lets you search by domain to find out whether a specific site, like yours, is part of the corpus. (via)