Google’s Webspam report explains SpamBrain’s role

Google’s annual Webspam Report for 2022 highlights the ways its SpamBrain anti-spam system has become better at detecting multiple forms of spam. Although the report is primarily intended to show how much more spam Google caught compared with the previous year, the information it gives about how SpamBrain works is equally important.

Google SpamBrain Platform

SpamBrain is the name Google gave to its machine learning anti-spam system. Google calls it a platform, which suggests it is a base on which algorithms that detect multiple forms of unwanted content can be launched.

Machine learning is a form of artificial intelligence in which a system uses data to become increasingly adept at the tasks it is designed to learn and complete.

Not much is known about SpamBrain, except that it’s a machine learning platform and the “heart” of Google’s efforts to keep spam out of the search rankings.

Google’s Webspam report has this to say about SpamBrain:

“We have also launched multiple solutions that improve SpamBrain as a robust and versatile platform and better deal with different types of exploits.”

SpamBrain improvements

The Webspam Report says that system improvements enabled SpamBrain to detect 500% more spam sites than in the previous year.

Additional training increased SpamBrain’s ability to identify hacked websites by a factor of 10.

Link spam detection

The report credits SpamBrain’s ability to learn as a key to its success, saying that after it was trained specifically on link spam, it detected 50 times more link spam sites than the previous link spam update did.

“Thanks to SpamBrain’s learning capabilities, we detected over 50x more link spam sites compared to the previous link spam update.”

Index gatekeeper

An interesting fact about SpamBrain is that it identifies spam while pages are being crawled.

If a crawled page is detected as spam, it is blocked immediately, which keeps the page out of Google’s search index and avoids wasting resources on unwanted web pages.

The ability to block spam at crawl time was announced in 2021. It blocks indexing not only when spam is found by crawling but also when it tries to enter through Search Console and sitemaps.

In 2021, Google wrote:

“…we have systems that can detect spam when we crawl pages and other content. Some content detected as spam will not be added to the index.

These systems also work for content discovered through sitemaps and Search Console.

For example, Search Console has a Request Indexing feature that allows creators to let Google know about new pages that should be added soon. We observed spammers hacking into vulnerable sites, pretending to be the owners of these sites, verifying themselves in Search Console, and using the tool to ask Google to crawl and index the spammy pages they created.

Using AI, we were able to identify suspicious validations and prevent spam URLs from entering the index this way.”

So it’s fair to say that one of SpamBrain’s many features is to act like a gatekeeper, blocking spam before it gets indexed by Google.