
Defending Your Site Against Content Scrapers

[Image: defending a website from content scrapers is like defending a castle]

In our last blog post, we talked about the dangers of content scrapers. In short, some websites make money by copying content from your legal blog and pasting it on their own site – often automatically, using content scraping bots. This can hurt your search engine optimization (SEO) efforts or even get your site penalized, dropping you in the rankings, reducing your web traffic, and sinking your online marketing efforts.

So, content scrapers are bad. Here’s how to detect them, what to do if you see your legal blog’s content appearing on other websites without your permission, and, more importantly, how to prevent it from happening in the first place.

How to Tell if Your Legal Blog’s Content Has Been Scraped

There are a handful of ways to quickly and easily detect content scraping on your website.

Trackbacks or Pingbacks

If a content scraper copies a legal blog post from your law firm’s site and then pastes it somewhere else, and that legal blog post has an internal link to somewhere else on your website, that internal link suddenly becomes a backlink.

While backlinks are the Holy Grail of internet marketing, when they come from sites of ill repute like those that use content scraping, they can actually hurt (for the full details, see last week’s post).

When it comes to detecting content scrapers, though, these backlinks are a flashing beacon that shows you where your content is being replicated. Trackback and pingback notifications – built into WordPress, with plugins like Akismet filtering out the spam among them – alert you whenever someone links to your site. Whenever a backlink notification looks suspicious, look into it to make sure your content isn’t being copied somewhere you don’t want it to be.

Good, Old-Fashioned Google

Trackback tools can only see backlinks, so if a content scraper lifts a post from your legal blog that doesn’t have any internal links (just one more reason to include internal links in all of your posts), those tools will never detect it.

The solution is simple:

  1. Copy an entire line from one of your legal blog posts,
  2. Go to a search engine,
  3. Type an opening quotation mark into the query field,
  4. Paste your excerpt into the query field,
  3. Add a closing quotation mark to the end of your excerpt, and
  6. Hit enter.

This searches for exact matches of that passage from your legal blog post, so the only result you should see is your own post, on your own site. If there are other hits, ta-da, you’ve found a content scraping website with your content on it.
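If you want to spot-check several posts at once, the same quoted search is easy to script. Here is a minimal sketch in Python – the excerpt and the use of Google’s standard search URL are illustrative, and you can swap in any search engine’s query format:

```python
from urllib.parse import quote_plus
import webbrowser

def exact_match_search_url(excerpt: str) -> str:
    """Build a search URL that looks for the excerpt as an exact phrase.

    Wrapping the excerpt in double quotation marks asks the search
    engine for verbatim matches, so any result that isn't your own
    site is a candidate content scraper.
    """
    return "https://www.google.com/search?q=" + quote_plus(f'"{excerpt}"')

# Replace with a full line copied from one of your legal blog posts.
url = exact_match_search_url("content scrapers are bad")
webbrowser.open(url)  # opens the exact-phrase search in your browser
```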

What To Do Once You’ve Detected a Content Scraper Stealing Your Content

Once you’ve found another site hosting your content, you are legally entitled to flex your intellectual property rights. The good part about being a lawyer is that, unlike the rest of the law-abiding populace, you probably already have a cease-and-desist letter filed away somewhere. Pull it out, tailor it for your current circumstances, and fire away.

If they take the content down, great. If they don’t, though, you’ll have to invoke your rights under the Digital Millennium Copyright Act (DMCA) by filing a takedown request with whoever is hosting the content scraper’s website. While the process is streamlined, it is still time-consuming and a pain, highlighting the need to prevent content scrapers rather than react to them.
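Figuring out who actually hosts the scraper’s site is the first step of that takedown request. One quick way is a WHOIS lookup on the site’s IP address – here’s a rough sketch in Python that assumes a Unix-like machine with the standard whois utility installed, and scraper-example.com is a hypothetical stand-in for the offending domain:

```python
import socket
import subprocess

def find_host(domain: str) -> None:
    """Resolve a domain to its IP address, then run WHOIS on the IP.

    The WHOIS record for the IP address usually names the hosting
    provider - the party that receives your DMCA takedown notice -
    even when the domain's own registration details are privacy-protected.
    """
    ip = socket.gethostbyname(domain)
    print(f"{domain} resolves to {ip}")
    subprocess.run(["whois", ip])  # prints the hosting provider's record

find_host("scraper-example.com")  # hypothetical scraper domain
```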

3 Ways to Prevent Content Scrapers from Wasting Your Time

Actually preventing content scrapers might be impossible – as sites get better at fighting them, they’ll evolve and find new ways to do what they do. However, putting walls and moats around your site can minimize the damage until the golden day when they disappear forever in flames.

1. Kill Your RSS Feed

We think that this is an overreaction, but one way to protect your legal blog is to eliminate its RSS feed. Content scrapers tend to use RSS feeds as an automatic way of detecting new posts on your site, so shutting yours off denies them that easy access.
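If you go this route, it’s worth verifying that the feed is actually gone. Here’s a quick sketch in Python – it assumes a WordPress site serving its feed at the conventional /feed/ path, and the URL is a hypothetical stand-in:

```python
import urllib.request
import urllib.error

def feed_is_live(feed_url: str) -> bool:
    """Return True if the feed URL still serves content."""
    try:
        with urllib.request.urlopen(feed_url) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        return False  # e.g. a 404 once the feed has been disabled

# Hypothetical URL - WordPress serves feeds at /feed/ by default.
print(feed_is_live("https://www.example-law-firm.com/feed/"))
```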

However, it also prevents legitimate readers from accessing your content that way. Contrary to popular belief, RSS feeds are still widely used: The Verge polled 14,755 readers and found that 68% of them used RSS feeds “religiously.” The Verge’s audience is relatively internet-savvy, though, so the share among internet users at large is likely smaller – but still sizeable.

Additionally, eliminating your site’s RSS feed might only be a temporary solution. RSS feeds have been losing popularity to social media as a source of news, so it’s only a matter of time before content scrapers adjust accordingly.

2. Only Send an Article Summary to RSS Readers

Another, less “scorched earth” policy is to send only an article summary to your RSS readers. A scraper that republishes your feed then gets just the excerpt, not the full post, while human readers can still click through to your site. This might lower the number of readers your legal blog gets from RSS feeds, but it can be worth it, depending on how many you lose and how often your legal blog’s content is getting stolen.
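You can check what your feed currently exposes with the third-party feedparser library (pip install feedparser); the feed URL below is a hypothetical stand-in:

```python
import feedparser  # third-party: pip install feedparser

# Hypothetical URL - substitute your legal blog's feed.
feed = feedparser.parse("https://www.example-law-firm.com/feed/")

for entry in feed.entries:
    # Summary-only feeds omit the full-text content element;
    # full-content feeds include it alongside the summary.
    full_text = "content" in entry
    print(f"{entry.title}: {'FULL TEXT' if full_text else 'summary only'}")
```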

3. Yoast’s Solution

Yoast, a popular SEO plugin for sites that use WordPress (so popular that we felt the need to clarify how it handles keywords), has its own way of protecting your RSS feed from content scrapers: it can automatically append a link back to the original post on your domain to every item in the feed. Even if a scraper republishes your feed verbatim, the copy then points search engines back to your firm’s site as the original source. This feature can be activated in the plugin’s advanced settings.
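To confirm that attribution link is actually going out with every item, you can scan your feed for links back to your domain – again a rough sketch with hypothetical URLs, using the same feedparser library as above:

```python
import feedparser  # third-party: pip install feedparser

FEED_URL = "https://www.example-law-firm.com/feed/"  # hypothetical
YOUR_DOMAIN = "example-law-firm.com"                 # hypothetical

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    # The attribution link lives in the item body, so check both the
    # summary and the full content (when present) for a link home.
    body = entry.get("summary", "")
    if "content" in entry:
        body += "".join(part.value for part in entry.content)
    status = "links home" if YOUR_DOMAIN in body else "NO attribution link"
    print(f"{entry.title}: {status}")
```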

Unfortunately, the popularity of the plugin could also be its downfall: the more sites that use Yoast’s anti-scraping measures, the more those very scrapers will be inclined to pursue other avenues to your content.

Conclusion

Of course, all of these options deal with the RSS feed pipeline between the content scraping website and your own. Right now, that avenue is the path of least resistance. Once content scrapers find another route into your site, though, you can expect to have to adapt your castle to new means of attack once again.