Dealing with Duplicate Content
Ehren Reilly, May 8, 2012


DESCRIPTION

A presentation I shared with members of the Ask team before we embarked on a major duplicate content cleanup project.

TRANSCRIPT

Page 1: Dealing with Duplicate Content

Dealing with Duplicate Content
Ehren Reilly
May 8, 2012

Page 2: Agenda

• Why focus on Answers?
• What is duplicate content? Why is it bad?
• Examples of duplicate content on our site and other notable sites
• Techniques:
  – robots.txt disallow
  – Meta robots tag
  – 301 redirect from dupe URL to primary URL
  – Canonical URL tag
  – Prevent the duplicate page from being created in the first place
• Related topics:
  – rel=“alternate” for language/regional support
  – Expired and no-longer-relevant content, and strategic content deletion
• Suggested reading


Page 4: What is duplicate content?

• The same content often appears at more than one URL.
• Intentional duplication:
  – Quotation
  – Re-use (e.g., Ask.com Wikipedia)
  – Content syndication
• Inadvertent duplication:
  – Separate mobile-optimized or printer-optimized versions of a page
  – Separate regional design or branding (e.g., uk.ask.com/wiki/Rihanna vs. www.ask.com/wiki/Rihanna)
  – Dynamic content where different queries return the same results (e.g., ask.com/questions-about/iPhones vs. ask.com/questions-about/iphone vs. ask.com/questions-about/iPhone)
  – Extra junk in the URL that does not substantively change the content (e.g., www.ask.com/wiki/Symbolics?qsrc=3044 vs. www.ask.com/wiki/Symbolics)
  – Pagination and filtering
• Manipulative duplication:
  – Scraper sites
  – Blatant copyright infringement / plagiarism
  – SEO spam

Page 5: Why is duplicate content a problem?

• Fragmentation of link equity, authority & anchor text.
  – If there are 100 links to “iphone” and 50 links to “iPhones”:
    • Do I treat this as a single page with 150 links?
    • Do I treat both pages as separate and important? If so, “iphone” is 100 links’ worth of importance, and “iPhones” is 50 links’ worth of importance.
• Lower confidence in a single, definitive source.
  – If there are many versions, which version is the definitive one?
  – Which URL has the most relevant/reliable copy of this for a given search query?
  – “I know ask.com is a good source for X, but I can’t figure out which of these URLs is ask.com’s definitive page on X.”
• Penalties for manipulative and non-user-friendly duplication.
  – Posting the exact same content on multiple different sites
  – Panda penalty for “thin content”

Page 6: Examples from Ask.com

• Case-insensitive URLs:
  – /q/ vs. /Q/
    • www.ask.com/q/What-Causes-Sepsis
    • www.ask.com/Q/What-Causes-Sepsis
  – “Questions About” page paths:
    • ask.com/questions-about/t-rex
    • ask.com/questions-about/T-Rex
• Duplicate questions in Ask.com Community
• US vs. UK Ask.com wiki:
  – uk.ask.com/wiki/Rihanna vs. www.ask.com/wiki/Rihanna
• Accidentally indexable weird subdomains:
  – replyask.lc.iad.www.ask.com

Page 7: Examples from Other Sites

• The same people’s bios used verbatim on two different brands’ websites:
  – http://www.google.com/search?q=pangea+media+snapapp+management+team
  – http://www.google.com/search?q=snapapp+management+team
• Google will never show you both versions for a single search query.

Page 8: Examples from Other Sites

• Facebook has massive duplicate content issues, and as a result its deep pages do not rank well in Google search results.
  – There are five different versions of NYC Ballet’s “Videos” page.
  – None of them is on the first page in Google for “New York City Ballet Videos”.
  – Instead, the main facebook.com/nycballet page is #8 in Google.
• In a heavily re-blogged post from Google’s blog:
  – Many people quoted the same passage.
  – The original source shows up first in the SERP.

Page 9: Techniques for Managing Duplicate Content

• There are many ways duplicate content can arise, and many techniques to manage it.
• Different techniques are better suited to different situations.
• Things to consider about each method:
  – Prevents penalties?
  – Allows for alternate styling?
  – Speed/effectiveness?
  – Propagates link equity to all outbound links?
  – Consolidates link equity from all inbound links?

Page 10: Robots.txt “Disallow”

• What it is: A file on the site that tells bots how to crawl various sections of your site. Specific to each subdomain.
• Message to bots: “Don’t crawl this content, don’t put it in your index, and disregard any links that point here. Go away.”
• What it’s good for:
  – Sections of the site that have no SEO value
  – Secret stuff that you don’t want getting crawled
• What it’s bad for:
  – It is an inelegant, brute-force way of dealing with duplicate content.
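For concreteness, a minimal robots.txt in the shape described above. The two Disallow paths are hypothetical examples, not actual Ask.com paths:

  # Applies to all crawlers on this subdomain
  User-agent: *
  Disallow: /printer-friendly/
  Disallow: /internal-tools/

A bot that honors the file will skip any URL whose path begins with either prefix.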

Page 11: Meta Robots

• What it is: A meta tag on an individual page, which is like a more targeted version of robots.txt.
• Message to bots: Has two separate parameters.
  – index/noindex: Should this page be crawled & indexed by the bot?
  – follow/nofollow: Should links out from this page be allowed to propagate link equity?
  ☞ Usually, if you’re trying to block a page from the index but it has links to other indexed pages, you want <meta name="robots" content="noindex,follow">.
• What it’s good for:
  – A more targeted version of robots.txt
  – Allows you to block a page from the index but still propagate link equity
  – Great for deep pages of paginated/listed content
• What it’s bad for:
  – Alternate versions of content that users might actually want to find from search.
• Suggested use: eHow content pages that they won’t let us use for SEO.
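As a sketch, here is the tag in page context; the page title and URL are invented for illustration:

  <head>
    <title>Questions About iPhones, page 37</title>
    <!-- Deep paginated page: keep it out of the index,
         but let its links pass equity to indexed pages. -->
    <meta name="robots" content="noindex,follow">
  </head>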

Page 12: 301 Redirect

• What it is: Permanent redirection of a duplicate/old URL to the primary/new URL.
• Message to bots: “Don’t go to that old URL, go to this new one. Remove the old one from the index, and forward all link equity to the new one.”
• What it’s good for:
  – Consolidating content that exists in unnecessary variations
  – Preserving the value of links, no matter which URL they link to
• What it’s bad for:
  – It is not possible to maintain alternate versions of the content, since both users and bots are redirected to a different URL.
• Suggested use: Content deleted from Community because it is redundant/duplicate.
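One common way to implement this is a server-level directive. A minimal sketch using Apache’s mod_alias, reusing the case-variant URLs from the Page 6 example (the right mechanism depends on the actual serving stack):

  # Permanently redirect the uppercase /Q/ dupe to the primary /q/ URL
  Redirect 301 "/Q/What-Causes-Sepsis" "https://www.ask.com/q/What-Causes-Sepsis"

Bots that see the 301 drop the old URL from the index and credit its inbound links to the target.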

Page 13: Canonical URL Tag

• What it is: A meta tag that tells bots which instance of the page to index. If there are multiple instances of the same page, they are consolidated together into a single URL.
  – E.g., <link rel="canonical" href="http://www.ask.com/questions-about/T-Rex"/>
• Message to bots: “For purposes of search listings, this content belongs to such-and-such URL.”
• What it’s good for:
  – Consolidating link equity among various versions of the same content
  – Allows you to maintain different versions without incurring a penalty or forfeiting any link equity
  – Prevents accidental indexing of trivially/accidentally different URLs
• What it’s bad for:
  – Slow to work
  – Officially just a “suggestion”, not a “rule”
  – Not 100% effective at keeping pages out of the index
• Suggested use: Any page that gets listed in search engines. Especially:
  – Pages with lots of meaningless URL parameters
  – Pages with case-insensitive URLs (choose a single, canonical capitalization format)
  – Pages that can be accessed on multiple weird subdomains
• Warning: Do not just dynamically put the page’s own URL here.
  – Make sure it is actually the canonical version of the URL.
  – It should not vary based on capitalization.
  – It should not dynamically insert the domain; specify the correct domain for the search index.
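A sketch of the consolidation in practice: every variant of the page serves the same tag, pointing at the one URL chosen for the index. The canonical URL is the slide’s T-Rex example; the variant list is illustrative:

  <!-- Served identically at all of these URLs:
       www.ask.com/questions-about/T-Rex
       www.ask.com/questions-about/t-rex
       www.ask.com/questions-about/T-Rex?qsrc=3044 -->
  <link rel="canonical" href="http://www.ask.com/questions-about/T-Rex"/>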

Page 14: Prevention

• Nothing else is more effective than abstinence.
• Foresee potential duplicate content issues, and build technology to prevent them.
  – For user-generated pages, suggest an already-created page rather than creating a new one on the same topic (e.g., Quora).
  – For automatically-created pages, do programmatic de-duplication before pages are created (see the sketch below):
    • Does this query return all the same content as some other query?
    • Don’t include words/characters in the URL that don’t affect the query results.
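One preventive tactic in this spirit is to normalize URL case at the server, so case variants never become distinct pages. A sketch using Apache mod_rewrite’s built-in tolower map; the path pattern is illustrative, and RewriteMap must be declared in server-level config rather than .htaccess:

  RewriteEngine On
  RewriteMap lowercase int:tolower
  # Collapse /questions-about/T-Rex, /questions-about/IPHONES, etc.
  # into one lowercase URL before a duplicate page can ever exist.
  RewriteCond %{REQUEST_URI} [A-Z]
  RewriteRule ^/questions-about/(.*)$ /questions-about/${lowercase:$1} [R=301,L]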

Page 15: Technique Comparison Chart

Columns: (1) Prevents penalties, (2) Allows for alternate styling, (3) Fast & effective removal from index, (4) Propagates link equity to all outbound links, (5) Consolidates link equity from all inbound links.

  Technique                        (1)  (2)  (3)  (4)  (5)
  Robots.txt                        ✔    ✔    ✔    ✗    ✗
  Meta Robots “noindex,nofollow”    ✔    ✔    ✔    ✗    ✗
  Meta Robots “noindex,follow”      ✔    ✔    ✔    ✔    ✗
  301 Redirect                      ✔    ✗    ✔    ✔    ✔
  Canonical URL Tag                 ✔    ✔    ✗    ✔    ✔
  Prevention                        ✔    ✗    ✔    ✔    ✔

Page 16: Related: International versions with rel=“alternate”

• Used in combination with rel=“canonical”.
• Tells Google if and when there are country/language-specific versions of the page.
• Different versions share link equity and other ranking signals.
• The Google SERP links to the appropriate country-specific version for each user.
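A sketch of the markup on the US/UK Rihanna pages from the Page 6 example; each regional version serves the same set of alternates, itself included (the hreflang values follow the standard language-region codes):

  <link rel="alternate" hreflang="en-us" href="http://www.ask.com/wiki/Rihanna"/>
  <link rel="alternate" hreflang="en-gb" href="http://uk.ask.com/wiki/Rihanna"/>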

Page 17: Related: Deleted and Expired Content

• Sometimes content gets intentionally deleted:
  – Community Terms violations
  – Legal/copyright issues
  – Terminated partnerships
  – Expired or no-longer-valuable content
• User experience options for deleted/empty pages:
  a) 301 redirect to another relevant page
  b) Replace with a “content deleted” message and links to other relevant pages
  c) Generic error message
• HTTP/robots treatment of deleted/empty pages (option (c) is sketched below):
  a) 301
  b) 404
  c) 200 with meta robots “noindex,follow”
  d) 200 that can be indexed
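To make option (c) concrete: a “content deleted” page that still returns HTTP 200 but stays out of the index, while its links continue to pass equity. The page copy and link target are invented for illustration:

  <head>
    <meta name="robots" content="noindex,follow">
  </head>
  <body>
    <p>This question was removed for a Community Terms violation.</p>
    <p>You may be looking for <a href="/questions-about/iphone">questions about the iPhone</a>.</p>
  </body>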