how big is the web.pdf


Upload: sam0x

Post on 07-Jan-2016


7/17/2019 How Big Is The Web.pdf

http://slidepdf.com/reader/full/how-big-is-the-webpdf 1/14

Let’s start by looking at the growth of the web

The Linking Open Data (LOD) Cloud is essentially the beginnings of the Web of Data, or Semantic Web. It represents the interconnected data that has been produced by the Open Data Movement as part of the W3C SWEO Linking Open Data community project (http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData). SWEO is an acronym for the W3C Semantic Web Education and Outreach Interest Group (http://www.w3.org/wiki/SweoIG).
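To make ‘interconnected data’ a little more concrete, here is a minimal sketch in Python. It assumes two made-up datasets whose facts are stored as (subject, predicate, object) triples, and it uses a plain "sameAs" predicate as a simplified stand-in for the owl:sameAs links that connect real LOD datasets. Every URI and the describe() helper are invented for illustration.

```python
# Hypothetical mini-datasets: all URIs below are invented for illustration.
music_data = [
    ("http://music.example.org/artist/42", "name", "The Beatles"),
    ("http://music.example.org/artist/42", "sameAs",
     "http://encyclopedia.example.org/The_Beatles"),
]
encyclopedia_data = [
    ("http://encyclopedia.example.org/The_Beatles", "formedIn", "Liverpool"),
]


def describe(uri, triples, seen=None):
    """Collect facts about `uri`, following sameAs links into other datasets."""
    seen = set() if seen is None else seen
    if uri in seen:  # guard against sameAs cycles
        return {}
    seen.add(uri)
    facts = {}
    for subject, predicate, obj in triples:
        if subject != uri:
            continue
        if predicate == "sameAs":
            # The link says both URIs name the same thing, so merge its facts.
            facts.update(describe(obj, triples, seen))
        else:
            facts[predicate] = obj
    return facts


merged = describe("http://music.example.org/artist/42",
                  music_data + encyclopedia_data)
print(merged)
```

The point of the sketch is that neither dataset knows where the band was formed and what it is called at the same time; the link between them is what makes the combined answer possible.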

The information in the LOD cloud is often referred to as a data commons: the intention is that the data is open and freely available, but that may not always be the case. Depending on the license for any given dataset from a particular source or supplier, “open” will mean, as a minimum, that the data is at least accessible. As part of the ‘common data’ it is also usually expected that the data can be freely copied and re-used, but again that depends on the license, so always, always check the license conditions.

Let’s look at how the semantic web has grown since 2007 using the LOD cloud (http://lod-cloud.net/) diagrams, which are maintained by Richard Cyganiak (DERI, NUI Galway) and Anja Jentzsch (HPI). As you look through, you may find ‘data’ that will be useful to your business as we progress. I’ve highlighted some important additions that will give you an indication of how far advanced ‘things’ already are.


(http://networkempire.com/semantic/wp-content/uploads/2013/11/lod-cloud-may-2007.png)

It was next updated in October 2007, and the introduction of W3C WordNet should not go unnoticed. WordNet® (http://wordnet.princeton.edu/) is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Its structure makes it a useful tool for computational linguistics and natural language processing, and natural language processing is used by search engines.
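If the idea of a synset is new to you, the following rough sketch may help. It is toy data only: the synset identifiers and word lists are invented for illustration and are not taken from WordNet itself, and synonyms() is a hypothetical helper rather than part of any WordNet API. What it shows is the structure described above, where one word can belong to several synsets and each synset stands for one distinct concept.

```python
# Toy data, invented for illustration; real WordNet is far larger
# and uses its own identifiers.
SYNSETS = {
    "car.n.01": {"car", "auto", "automobile", "motorcar"},
    "bank.n.01": {"bank", "riverbank"},    # sloping land beside water
    "bank.n.02": {"bank", "depository"},   # financial institution
}


def synonyms(word):
    """Map each synset containing `word` to the other words in that synset."""
    return {
        synset_id: sorted(words - {word})
        for synset_id, words in SYNSETS.items()
        if word in words
    }


print(synonyms("bank"))
```

Looking up "bank" returns two separate groups of synonyms, one per concept, which is exactly the kind of distinction a search engine needs when it tries to work out what a query means.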


(Figure: LOD cloud diagram, October 2007)

The next update was in November 2007, with four new additions including OpenCyc. OpenCyc (http://www.cyc.com/platform/opencyc) is described by its makers as the world’s largest and most complete open source general knowledge base and commonsense reasoning engine. It is produced and maintained by Cycorp, which participates in cutting-edge artificial intelligence, machine reasoning, knowledge modeling, and natural language processing research initiatives. And let’s not forget their semantic search framework.


(Figure: LOD cloud diagram, November 2007)

The Cloud grew and was updated twice in early 2008; we’ll show you the later of the two diagrams (March) here. In February, Audioscrobbler, QDOS and Flickr joined by linking their data.


(Figure: LOD cloud diagram, March 2008)

The next updates were in March 2009, and things were being added quickly. It’s worth noting that Freebase joined, and also that the number of medical and gene sites was growing as the importance and benefits of previous work started to become apparent to the research community.


(http://networkempire.com/semantic/wp-content/uploads/2013/11/lod-cloud_colored-2009-03-27.png)


(Figure: LOD cloud diagram)

Did you spot the changes? To me the important […] addition of the BBC Music database.

The reason that there wasn’t a massive increase in size as a result of Tim’s call for ‘more data now’ is that it takes time for anyone, let alone big corporations, companies and governments, to change the way they do things. You can only wonder at the number of meetings and discussions that took place. And what about all of those blank looks and serious questions from senior board members and managers? Oh, to have been a fly on the wall!

But things were changing, and data was being published, as you’re about to see in the next update, which came in September 2010.


(Figure: LOD cloud diagram, September 2010)

The last update to the LOD Cloud diagram was in September 2011 (below) but in many respects it’s only the beginning of the story.


What makes this diagram so important is that when you click on it, it opens a new window and takes you to an image map in which each dataset is a hyperlink to the homepage where that dataset, and more information about it, is available for you to use, subject to license as we’ve already mentioned.

(http://lod-cloud.net/versions/2011-09-19/lod-cloud.html)


So How Big is The Web Now?

That’s an impossible question to answer accurately, as content is being added on a daily basis; quite probably not even Google would be able to give you an accurate answer. But we can give you a good idea, and possibly blow your mind with what is available for you to use and experiment with today.

On Tuesday, November 12, 2013, Prof. Dr. Christian Bizer, Oliver Lehmberg and Robert Meusel made the following announcement via the W3C mailing list and other channels:

Hi all,

we are happy to announce the publication of a new large hyperlink graph.

The WDC Hyperlink Graph has been extracted from the Common Crawl 2012 web corpus [1] and covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, the graph is the largest hyperlink graph that is available to the public.

As the graph covers hyperlinks between web pages and not hyperlinks between data items, it is of course a bit off-topic for this list, but we thought it might still be interesting to some people on this list, maybe for comparing linking patterns in the LOD cloud with linking patterns in the classic document Web.

The graph can be downloaded in various formats from

http://webdatacommons.org/hyperlinkgraph

We provide initial statistics about the topology of the graph at

http://webdatacommons.org/hyperlinkgraph/topology.html

We hope that the graph will be useful for researchers who develop

+ Search algorithms that rank results based on the hyperlinks between pages.
+ SPAM detection methods which identify networks of web pages that are published in order to trick search engines.
+ Graph analysis algorithms and can use the hyperlink graph for testing the scalability and performance of their tools.
+ Web Science researchers who want to analyze the linking patterns within specific topical domains in order to identify the social mechanisms that govern these domains.

The creation of the WDC Hyperlink Graph was supported by the EU research project PlanetData and by Amazon Web Services. We thank our sponsors a lot.

Best,

Christian Bizer, Oliver Lehmberg & Robert Meusel

[1] http://commoncrawl.org/
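The first use the announcement lists is search algorithms that rank results based on the hyperlinks between pages. As a rough illustration of what that means, here is a small PageRank-style sketch in Python. It is a textbook power iteration run on a four-page toy graph invented for this example, not the method used by any particular search engine and not code from the WDC project; the pagerank() function, its parameters and the page names are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by repeatedly passing each page's score along its links.

    `links` maps a page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the small "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


# Toy graph, invented for illustration: three pages link to "c".
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph page "c" comes out on top because three other pages link to it, while page "d", which nobody links to, ends up at the bottom. The same basic idea, applied to billions of pages, is why a public hyperlink graph matters to anyone studying ranking or link spam.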

How does all of this help you?

First, let’s cover what Common Crawl is: Common Crawl is a non-profit foundation dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone. You can see just how big the whole crawl is below (source: http://www.webdatacommons.org/).

Crawl Date             January-June 2012
Total Data             40.1 terabytes (compressed size)
Parsed HTML URLs       3,005,629,093
URLs with Triples      369,254,196
Domains in Crawl       40,600,000
Domains with Triples   2,286,277
Typed Entities         1,811,471,956
Triples                7,350,953,995
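To put those figures in proportion, the short Python calculation below works out three ratios from the numbers in the table: the share of parsed URLs that carry triples, the share of domains that carry triples, and the average number of triples per URL that has any. The variable names are ours; the input numbers are copied from the table above.

```python
# Figures copied from the crawl statistics table above.
parsed_html_urls = 3_005_629_093
urls_with_triples = 369_254_196
domains_in_crawl = 40_600_000
domains_with_triples = 2_286_277
triples = 7_350_953_995

url_share = urls_with_triples / parsed_html_urls
domain_share = domains_with_triples / domains_in_crawl
triples_per_url = triples / urls_with_triples

print(f"URLs with triples:    {url_share:.1%} of parsed URLs")
print(f"Domains with triples: {domain_share:.1%} of crawled domains")
print(f"Triples per URL:      {triples_per_url:.1f} on average")
```

In round terms, about one parsed URL in eight carried structured data in this crawl, spread over fewer than 6% of the domains, with roughly 20 triples on each page that had any.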

You could access the hyperlink graph data and work out what your competitors are doing and where. Based on what you find, you could then work out a more efficient plan, but …

Take a deeper look at what the announcement hopes researchers will use all of these facts for, especially the second item. If you haven’t already seen the first of our semantic revolution webinars, where we introduced the term ‘glowing beacons of spam’ (and made a few predictions), it may be worth watching now. We don’t want to alarm you unnecessarily, but one of the things that was stopping independent and university research on link and data spam, and how to combat it, was the lack of actual real-world data to work on. We’re sure you can join the dots now …

The Good News

The semantic web is here to help you get the most from the web. There is (as you can see) lots of useful data that you can use in your sites to make them more interesting, not only for people but also for search engines. There are new business models just waiting to be discovered. There are new apps and sites just waiting to be developed. While they may take time to come to fruition, there are simple things you can do now using the stable semantic technologies that already exist, which you’re about to discover.

Next: It All Begins With You and Starting to Build Your Authority (http://networkempire.com/semantic/semantic-web-overview/your-personal-what-to-do-checklist/)


© COPYRIGHT 2013-2014. POWERED BY THE SEMANTIC REVOLUTION (HTTP://THESEMANTICREVOLUTION.COM/)