SMX Notes – Duplicate Content Summit
It’s always tough to be the first to address an “advanced” audience. How advanced is advanced? Tough to say. Not everyone there is going to be a Cuttlet, a Mozzer or a Bruce Clay employee (you guys need a term. How about Bruisers?)
That being said, I think a lot of the people there already knew what duplicate content was and some of the presentations were a little low level. However, I’m sure that at least a few people there benefited greatly (if only the search engine reps getting feature requests).
Eytan Seidman, Lead Program Manager, Live Search, Microsoft
Dupe content is bad because it:
- Fragments anchor text
- Creates different versions of the page you might want people to link to
- Makes it hard (confusing) for others to link in
- Can be difficult for search engines to determine which is the canonical URL
- Session parameters in URLs: keep them simple. Wherever possible, avoid feeding them to the engines (especially lots and lots of parameters).
- Location-based dupe content: largely identical content for US/UK sites is not okay. Think about unique content
- Employ client-side versus server-side redirects. (He would later clarify this statement to include 301s as 'client side,' whereas server-side is like Wikipedia, where the client is not taken to a new URL/meta refresh.) Examples: JCPenney.com has the same content as JCPenny.com. Wikipedia does this a lot (look at the URLs for the pages on "startrek" vs. "Star Trek").
- Don’t produce entire secure (https) and nonsecure (http) versions of your site. Use absolute links to move between secure/nonsecure versions.
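The 301 approach described above can be sketched as a minimal WSGI app that redirects misspelled hostnames and the wrong scheme to one canonical URL. The hostnames here are illustrative placeholders, not real site configuration:

```python
# Minimal WSGI sketch: send users (and crawlers) a 301 to the one
# canonical URL instead of serving duplicate hosts/schemes.
# CANONICAL_HOST is a hypothetical example, not a real config value.

CANONICAL_HOST = "www.example.com"

def canonical_redirect_app(environ, start_response):
    host = environ.get("HTTP_HOST", "")
    scheme = environ.get("wsgi.url_scheme", "http")
    path = environ.get("PATH_INFO", "/")
    if host != CANONICAL_HOST or scheme != "https":
        # A permanent redirect consolidates links onto one URL.
        location = f"https://{CANONICAL_HOST}{path}"
        start_response("301 Moved Permanently", [("Location", location)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<html>canonical content</html>"]
```

Any request for a typo domain or the plain-http version gets one hop to the canonical page, so the engines never see two live copies.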
So, how do you avoid having people copy your content? (Speaking for myself, to prevent scraping, I have this gun… No, not really, I don’t own a gun.) Tell people not to take it and call out people who do. Verify user agents and block unknown IP addresses from crawling your site.
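Verifying user agents usually means forward-confirmed reverse DNS: a bot claiming to be a major crawler should reverse-resolve to the engine's domain, and that hostname should resolve back to the same IP. A hedged sketch (the lookup functions are injectable so it can be exercised without network access; the domain suffixes are the commonly documented Googlebot ones):

```python
import socket

def verify_crawler(ip, claimed_suffixes=(".googlebot.com", ".google.com"),
                   reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS check for a claimed crawler IP.
    The IP must reverse-resolve to the engine's domain, and that
    hostname must resolve back to the same IP. Lookups are injectable
    so this sketch is testable offline."""
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or socket.gethostbyname
    try:
        hostname = reverse_lookup(ip)
    except OSError:
        return False
    if not hostname.endswith(claimed_suffixes):
        return False
    try:
        return forward_lookup(hostname) == ip
    except OSError:
        return False
```

A scraper can fake a User-Agent string but not the engine's DNS, which is why this beats UA matching alone.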
Not all dupe content is bad: proper attribution and links from other sites can add value and drive traffic to your site.
If you think you have duplicate content, make sure you're adding value, attribute your sources, and consider blocking local copies, Linux pages, etc., via robots.txt.
How Live Search handles duplicate content: We don’t have sitewide penalties, generally speaking. We use session parameter analysis at crawl time and keep them hidden from the crawler for the most part. We filter dupes at run time to prune the results returned to include only useful, unique results.
Peter Linsley, Senior Product Manager for Search, Ask.com
Dupe content is an issue for search engines because it impairs the user experience and consumes our resources. It’s an issue for webmasters because:
- There’s a risk of missing votes (links)
- There’s a risk of selecting the wrong candidate (the wrong URL)
- Some cases are beyond our control
While concerns are valid, issues are rare.
How Ask handles dupe content: There’s no penalty; it’s similar to not being crawled. We pick one version to be the best one and keep that in our index. We only compare indexable content, not templates or HTML code. We filter them out when our confidence factor is high: we have low tolerance on false positives. The proper candidate is identified from numerous signals, much like ranking. Usually, it’s the most popular version of the site.
What to do:
- Consolidate content
- Canonicalize URLs
- Employ redirects
- Display a copyright notice
- "Uniquify your content": differentiate yourself from everyone else. Competing with resellers? Make your content stand out. Add on comments, reviews, etc.
- Make it hard for scrapers: mark your territory with absolute links, be specific, make it difficult for that content to be used out of context; take legal action; make contact.
Open questions from Ask:
- Webmaster outreach: educate them.
- W3C Standardization (trailing slash or no? Index.html or no?)
- Watermarking or other authentication formats?
- Search engines to improve on antispam
- Legal: what else can be done? Creative commons?
- Economic: make it harder for others to monetize your content (but how?)
Yahoo eliminates dupes in 4 or 5 places:
- During the crawl: less likely to crawl or extract links from known duplicate pages and sites.
- Index time filtering: less representation from dupes when indexing crawled pages
- Query-time filtering (they do it here as much as possible): limit to two results per domain per SERP, filter out similar documents
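The query-time "two results per domain" rule is easy to picture as a filter over a ranked result list. A minimal sketch of that kind of host crowding (not Yahoo's actual implementation):

```python
from urllib.parse import urlparse

def crowd_limit(results, per_host=2):
    """Query-time host crowding: keep at most `per_host` results from
    any one hostname, preserving rank order. `results` is a ranked
    list of URLs."""
    seen = {}
    kept = []
    for url in results:
        host = urlparse(url).netloc
        if seen.get(host, 0) < per_host:
            seen[host] = seen.get(host, 0) + 1
            kept.append(url)
    return kept
```

This is also why the later audience question about eBay-style subdomains matters: a per-hostname limit is trivially sidestepped by spreading content across subdomains.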
Legitimate reasons to dupe:
- Alternate document formats (HTML, PDF, .doc)
- Legal syndication (like the Associated Press)
- Multiple languages or regional markets (different language versions are never considered duplicates)
- Partial duplicates: navigation, common site elements, disclaimers.
Accidental duplication (not abusive, but may “hamper Yahoo’s ability to display your content”):
- Session ids in URLs: remember that engines aren’t smart. A URL is a URL is a URL. Even if you’re doing rewriting.
- Make sure your 404 pages return a 404 status, or they’ll be indexed repeatedly.
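Since "a URL is a URL is a URL," the fix for session ids is to make sure crawlers only ever see the session-free form. A small sketch of stripping known session parameters from a URL (the parameter names below are common examples, not a definitive list):

```python
from urllib.parse import urlencode, urlparse, urlunparse, parse_qsl

# Illustrative session/tracking parameter names; adjust for your platform.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}

def strip_session_params(url, session_params=SESSION_PARAMS):
    """Produce the session-free URL a crawler should see: drop known
    session parameters, keep everything else in its original order."""
    parts = urlparse(url)
    query = [(k, v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in session_params]
    return urlunparse(parts._replace(query=urlencode(query)))
```

The same normalization can run server-side when a verified crawler requests a sessioned URL, so every crawl path collapses to one canonical address.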
Duplication that’s “somewhat dodgy”: unnecessary repeating of content across domains. Aggregation or identical content repeated without adding value.
Abusive duplication: scrapers/spammers, weaving/stitching content from multiple sources together, cross-domain duplication, all with or without small changes.
- Avoid bulk duplication of underlying documents
- Avoid accidental proliferation of many URLs for the same document: offer crawlers a session id-free/cookie-free path.
- Attribute nonoriginal content
- Similar content (The Buffy episode with two Xanders: they eventually figured out that it was two parts of the same Xander and they needed to recombine them.)
- But different (The Buffy episode with two Willows): they’re obviously not the same. One needed to “go back to . . . evil land.”
- Examples: Syndicated content (make arrangements with syndicators to have them block it), manufacturer’s databases, printable versions, multiple languages (not a problem).
- Dynamic content
- Blog issues: RSS, archive pages, category pages
Vanessa Fox: What’s your ideal duplicate content world?
- Submit site map with canonical versions (and Google ignores duplicates & redirects link value)
- The ability to tell spiders to scrape off parameters
- A way to verify authorship even on content that’s been scraped
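Submitting a sitemap of canonical versions is already doable with the sitemaps.org protocol; the wished-for part is the engines ignoring the duplicates in its favor. A minimal sketch of emitting such a sitemap:

```python
from xml.etree import ElementTree as ET

def build_sitemap(canonical_urls):
    """Emit a minimal XML sitemap listing only the canonical version
    of each page, per the sitemaps.org protocol."""
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for url in canonical_urls:
        loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
        loc.text = url
    return ET.tostring(urlset, encoding="unicode")
```

Feeding only canonical URLs into the sitemap is a signal today; the panel's open question is whether the engines will treat it as authoritative.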
Can the four search engines agree on a variable we can use to indicate to spiders that they should strip all parameters after that parameter?
People don’t know about parameters, and they won’t know about these. Can all CMSes handle this? Would we do it through a meta tag, the site map, Webmaster Console? Robots.txt? Wild card?
We nofollow links to the noncanonical versions of our page. Does that raise a red flag with search engines?
Vanessa: That’s not a good way to prevent dupes. Other people can link to the page. Robot them out instead.
WordPress as a CMS: Multiple author designations, tags, tag/author combinations have resulted in posts appearing on lots of pages. Can search engines work with major blogging platforms to fix this?
Vanessa: Probably. They’d come up in SERPs, but we could work with CMSes to put the burden on their side.
Eytan: What’s your goal? Do you want to designate which of those is the primary page or tag?
Vanessa: Mostly we can sort these things out.
Amit: In general, the linking structures lend themselves to the right things happening, noting where things are important.
Danny Sullivan: Characterize the three major ways of duplicate content: scraping, syndication (including without permission), site-specific (on your site, for whatever reason, you have two versions of the same page). Vote: people were mostly concerned with site-specific, then syndication and scraping. (“All of the above!” comes a shout from the back.)
Is there some way to provide a date-stamp, time-stamp or date of discovery? If our content is scraped or syndicated right away, often the scrape is discovered first. Do you use date of discovery and how do you differentiate valid dates? How do you know who’s first? (Focused mostly on blogs, but applies to all pages)
Eytan: It’s a factor, but not a big part of it. We’re looking for a lot of other signals to indicate which is the canonical content. There’s some aspect of that on news with a time-based element.
Peter: It’s gameable by scrapers
Amit: If you want your content to show up earlier in the indexes: sitemap updates when you update your sites.
Danny: To me, best is first. To you is it the most linked to?
Eytan: The best is the one that ranks highest in our run-time ranking (lots of factors). Over time, we won’t look at time as the defining variable.
Danny: How many would be interested in a way to tell search engines that mine really is the canonical version?
Peter: That’s still gameable: we can’t solve it for mom & pops that way.
Danny: Then never solve it for anyone? Maybe they can learn. It just seems like there ought to be a way to tell search engines that “this is my doc, I just pinged you with it, I know you trust me.”
You collaborated on the site maps standard. Is work continuing?
Danny: They walked over hand in hand. You missed it; it was beautiful.
Vanessa: Yes. We all support the ability to specify the location of your sitemap in robots file and we regularly discuss this.
Follow up: Do you pull the files or do you leave them on our servers?
Yahoo & Google pull the site map files. MSN is doing this on a fairly selective basis but is ramping it up; Ask is similar to MSN.
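The robots.txt mechanism Vanessa mentions is the `Sitemap:` directive, which may appear multiple times and is case-insensitive. A tiny parser sketch for pulling those locations out of a robots.txt file:

```python
def sitemap_urls_from_robots(robots_txt):
    """Extract Sitemap: directive URLs from robots.txt text. The
    directive name is case-insensitive and can occur more than once."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if line.lower().startswith("sitemap:"):
            urls.append(line.split(":", 1)[1].strip())
    return urls
```

This is roughly what an engine does on each robots.txt fetch before deciding whether to pull the sitemap files themselves.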
Amit Kumar—Poll: how many have sitemap files? All. How many specify sitemap in robots file? Much fewer. How many update sitemap weekly? Some.
Comment: On tracking parameters: if you use robots exclusion on parameterized URLs, I want to consolidate link popularity rather than throw it all out, since our pages are accumulating incoming links that include session ids.
Amit Kumar: How many have submitted site maps to Site Explorer? 10-15
How should we address duplicate content as we begin syndicating our video?
Vanessa: In video serps? (Exactly).
Amit: Does the syndicated version actually point back to you? (Yes and no; we also drive traffic back to our site from YouTube.)
Vanessa: Ask that they block it with robots when making the syndication deal.
Can we use a digital signature to avoid scraping? There are no good reporting tools from engines telling us which pages they consider as dupes. Why not?
Danny: Maybe permission to have one hidden link; standardize the format to make sure that scrapers don’t steal that one?
(Questioner) No, more like my special secret word that I set in the webmaster console.
Eytan: we’ve done this in email with moderate success. The challenge is adoption. How do we get people to broadly adopt it?
Vanessa: Of course, scrapers could steal content from unauthenticating sites and authenticate it themselves.
Danny: To your [the search engines’] credit, you’ve done lots of cool stuff. Some have worked and been widely adopted.
Amit: Robots nocontent came from conferences.
We’ve given resellers our content. Hundreds or thousands have our content out there. Now what? How do you prove it’s yours? What can we do to address it?
Vanessa: Unfortunately, you may just have to rewrite your content to regain control of it.
Amit: they might be outranking you for other reasons. You might want to work on other aspects of your marketing (get more authority/links/trust).
Danny: It’s a trade off: maybe resellers/affiliates will do a better job marketing for you.
Websites like eBay with good SEO teams are creating subdomains/multiple websites with unique content, and they’re receiving more than the two results per SERP. Is this subdomain spam?
Vanessa: We’re trying to serve up a variety of results
Amit: send test cases to Yahoo. We’re working on it!
Danny: It’s tough to fix this. Can you only show one blog hosted on wordpress.com in a SERP, then?
Eytan: We want to show a lot of unique, but relevant content. We don’t want to bring in a bunch more sites that are less relevant.
Why should we only see two listings or one listing of a piece of content? For example, searching for a fact. Why not just have something where you can click and see all dupes?
Peter: We try to be as accurate as possible. Dupes are the EXACT same content. Would that enhance real user experience?