Using Wikipedia as a Source of Chemical Information

Prof. Martin A. Walker, SUNY Potsdam. June 27, 2013. Webinar for ACS Chemical Information Division: Using Wikipedia as a Source of Chemical Information

Upload: martin-walker

Post on 28-May-2015


DESCRIPTION

Webinar for the Chemical Information Division of the American Chemical Society. Describes the types of chemical data in Wikipedia and how these are uploaded and maintained by the Wikipedia community.

TRANSCRIPT

Page 1: Using wikipedia as a source of chemical information

Prof. Martin A. Walker, SUNY Potsdam

June 27, 2013

Webinar for ACS Chemical Information Division

Using Wikipedia as a Source of Chemical Information

Page 2: Using wikipedia as a source of chemical information

Introduction

Chemical substance data in Wikipedia

Other chemistry-related content

Behind the scenes:

• How articles are written

• WikiProjects

Conclusion

Overview

Page 3: Using wikipedia as a source of chemical information

What is a wiki?

Page 4: Using wikipedia as a source of chemical information

What is a wiki?

“A collaborative website which can be directly edited by anyone with access to it.”

(Wiktionary, March 20, 2007)

From the Hawaiian word “wiki wiki” meaning “quick.”

Picture by Jshapiro, WM Commons, CC license

Page 5: Using wikipedia as a source of chemical information

What is Wikipedia?

Wikipedia defines itself as: "a free, web-based, collaborative, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation."

Wikipedia logo is © Wikimedia Foundation, San Francisco, CA

Page 6: Using wikipedia as a source of chemical information

Wikipedia is…

An encyclopedia in over 200 languages

An incredibly useful resource for academia

Written by volunteers

Editable by anyone

Free to be copied, re-used

Free to use (no cost)

Operating for no profit

Wikipedia is not…

A “soapbox” or a place to publish your own work

An authoritative resource for academia

Written mainly by kids, or by paid professionals

Free to re-use without attribution

Run by a corporation

Page 7: Using wikipedia as a source of chemical information

Traditional encyclopedia: “Experts know best”

1. Editors choose an expert

2. Expert writes, based on authoritative resources

3. Editors review and check facts

Page 8: Using wikipedia as a source of chemical information

Wikipedia: a new paradigm? “Many eyes are better”

1. Volunteer writes, supposedly using authoritative resources

2. Other volunteers review and check facts

3. Ongoing process of adding content, then review

Page 9: Using wikipedia as a source of chemical information

Much chemical information on the Web is generated by machine. Wikipedia is large, even though most information is entered word-by-word by a human. This means that:

• It exhibits nuances of human analysis

• Much of it first enters the Web through Wikipedia

• It is curated by humans

• It has silly human errors!

The value to cheminformatics – original human input

Editing Wikipedia articles

Picture by Girona7, Wikimedia Commons, CC license

Page 10: Using wikipedia as a source of chemical information

Article pages describe a specific topic

To comment on something in the article, click on the “Discussion” tab

To look at earlier versions, click the “History” tab

To change the article, click the “Edit” tab – but be careful!

The Wikipedia article page

Page 11: Using wikipedia as a source of chemical information

Chemical substance information

Page 12: Using wikipedia as a source of chemical information

Substance articles

Page 13: Using wikipedia as a source of chemical information

After a general lead section (“lede”), most decent substance articles cover these main areas:

• Physical & chemical properties

• Preparation

• Uses

• Identifiers, physical & chemical data (in a Chembox)

Detailed information on safety or chemical suppliers is considered inappropriate.

Substance articles

Page 14: Using wikipedia as a source of chemical information

Wikipedia - an encyclopedia, NOT a database

• But can it be used like a database anyway?

• What about DBpedia?

Substance data in Wikipedia

Page 15: Using wikipedia as a source of chemical information

The Chembox on a substance page contains standard representations such as

• Skeletal formula

• IUPAC name

• InChI and InChIKey

• CAS no. (represents substance, not structure)

• SMILES (de facto standard before InChI)

These were traditionally supplied for readers to copy and paste, but we were asked to make a machine-friendly version.

Chemboxes & Drugboxes

Page 16: Using wikipedia as a source of chemical information

Chemboxes were originally set up as tables – OK for people, but not for data mining.

EARLY CHEMBOXES

A typical chembox from 2007

Page 17: Using wikipedia as a source of chemical information

Some data (e.g., InChIs for complex molecules) can be very long – and this was a hindrance to their use in Wikipedia.

TABLE EXPLOSIONS!

Page 18: Using wikipedia as a source of chemical information

Now designed as a set of data fields with values entered by the editor – better for data extraction and for validation

Drugboxes also redesigned

Machine-friendly formats (SMILES, InChI, InChIKey, CAS Reg. No.) included in nearly all chemboxes

Hide/show used to avoid table “explosions”

Collections of Wikipedia data are now available for cheminformatics groups to use

NEW CHEMBOXES
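
The "set of data fields" design is what makes extraction practical. A minimal Python sketch of that idea follows, assuming a simplified, flat chembox. The sample wikitext and the field names (`CASNo`, `SMILES`, `StdInChIKey`) are illustrative assumptions; real chemboxes are nested and much longer, and a proper wikitext parser would be safer than this line-based approach.

```python
import re
from typing import Dict

# Hypothetical, simplified wikitext used only for illustration.
SAMPLE_WIKITEXT = """{{Chembox
| Name = Caffeine
| CASNo = 58-08-2
| SMILES = CN1C=NC2=C1C(=O)N(C(=O)N2C)C
| StdInChIKey = RYYVLZVUVIJVGH-UHFFFAOYSA-N
}}"""

# A field line looks like "| name = value"; the value may itself contain "=".
FIELD_LINE = re.compile(r"^\|\s*([^=|]+?)\s*=\s*(.*?)\s*$")


def parse_fields(wikitext: str) -> Dict[str, str]:
    """Collect 'name = value' pairs from a flat infobox-style template."""
    fields: Dict[str, str] = {}
    for line in wikitext.splitlines():
        match = FIELD_LINE.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


if __name__ == "__main__":
    for name, value in parse_fields(SAMPLE_WIKITEXT).items():
        print(f"{name}: {value}")
```

Splitting on the first `=` only matters here because SMILES strings use `=` for double bonds.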

Page 19: Using wikipedia as a source of chemical information

FULL FORM

SIMPLE

Current form of CHEMBOX

Page 20: Using wikipedia as a source of chemical information

• InChI can be used to define what structure is being represented when compiling a virtual database.

• InChI can provide an unambiguous reference when validating structures on Wikipedia

• InChIKey is useful to help those using search engines

Value of the InChI and InChIKey
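
Because an InChIKey has a fixed layout (14 letters, a hyphen, 10 letters, a hyphen, 1 letter), it is easy to sanity-check before using it as a search term or database key. The following Python sketch shows such a check; the function names are hypothetical, and the check tests only the shape of the string, not whether the key really corresponds to the structure shown in the article.

```python
import re

# Standard InChIKey layout: 14 uppercase letters, 10 uppercase letters, 1 letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchikey(text: str) -> bool:
    """Return True if the text has the shape of an InChIKey."""
    return INCHIKEY_PATTERN.fullmatch(text.strip()) is not None


def connectivity_block(inchikey: str) -> str:
    """Return the first block, which is a hash of the molecular skeleton."""
    return inchikey.strip().split("-")[0]


if __name__ == "__main__":
    sample = "RYYVLZVUVIJVGH-UHFFFAOYSA-N"  # sample key used for illustration
    print(looks_like_inchikey(sample), connectivity_block(sample))
```

Grouping records by the first block is one way to find entries that share a skeleton but differ in stereochemistry or isotopes.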

Page 21: Using wikipedia as a source of chemical information

PROBLEM: Table creep – a user asks for the table to include the Standard Free Energy of Hydroformylation in a Black Box

ANSWER: Put it on a sub-page – the supplementary data page (chemistry is unique in Wikipedia in having these!).

Click on a link from the bottom of the Chembox:

Data pages

These do have value, with some data pages having over 50,000 hits/year

Page 22: Using wikipedia as a source of chemical information

Data pages

Page 23: Using wikipedia as a source of chemical information

Wikipedia Drug pages

Page 24: Using wikipedia as a source of chemical information

Maintained by the Pharmacology WikiProject, which has a medicinal focus. This means that:

• Some items of interest to chemists may be missing (though main ones are in the drugbox)

• There are no supplementary data pages with spectral data, etc.

• At the “border” between drugs and chemicals, there may be two similar substances that have different boxes. For example:

• caffeine has a drugbox, but paraxanthine has a chembox

Drugboxes

Page 25: Using wikipedia as a source of chemical information

Other chemistry content

Page 26: Using wikipedia as a source of chemical information

Chemical reactions

Page 27: Using wikipedia as a source of chemical information

Some have ReactionBoxes

Page 28: Using wikipedia as a source of chemical information

Good coverage of named organic reactions, but otherwise coverage is patchy. Wikipedia is very weak on reactions compared to a reference text such as March's Advanced Organic Chemistry, probably because of the classic cheminformatics problem: substances are easy to define, reactions are hard.

Only a handful have ReactionBoxes. No database based on Wikipedia reaction articles is available.

Typical content:

• Mechanism

• Reagents, catalysts, conditions

• Scope & limitations

• Stereochemistry

• Variations

Reaction articles

Page 29: Using wikipedia as a source of chemical information

Biographical articles

Page 30: Using wikipedia as a source of chemical information

• Large proportion of Wikipedia overall, but low in chemistry – chemists tend to be more interested in chemistry than in people! Many more could be written.

• Mainly covers Nobel Laureates and important historical figures, plus a few chemists where someone has taken the time to write an article.

• “Vanity articles” are strongly discouraged!

Biographical articles

Page 31: Using wikipedia as a source of chemical information

Variable coverage. None of these usually have data boxes, but many include templates to show related topics.

• Methods and equipment

• Constants, equations

• Theories and hypotheses

• Chemical families (e.g., “Aldehyde”)

• Terms used (e.g., “Coordination complex”)

• Many others: history, chemical companies, etc.

Concepts & other chemistry content

Page 32: Using wikipedia as a source of chemical information

The Wikipedia community

How articles are written

User:Polimerek, a Polish Wikipedian and polymer chemist

Picture from Wikimedia Polska, CC license

Page 33: Using wikipedia as a source of chemical information

The lonely editor…

Most articles are started by a topic enthusiast, and then expanded & maintained by the community if they are considered useful.

Picture: WM Commons, public domain

These “Wikipedians” are motivated by altruism and a love of learning, and they want to share their knowledge with the world, for free. They can also enjoy seeing their work read by thousands, or even millions.

Picture by Ziko van Dijk, CC license

Page 34: Using wikipedia as a source of chemical information

WikiProjects provide a place for like-minded editors to discuss articles and organize collaborations. They also agree on standards & templates, and assess quality.

WikiProjects

Page 35: Using wikipedia as a source of chemical information

If you plan major changes to an article or articles, post a comment on the article talk page and also on the relevant WikiProject talk page.

WikiProject talk pages – for informing

Page 36: Using wikipedia as a source of chemical information

These discussions matter; the article discussed here had half a million hits in the last year. Wikipedia’s influence may be unofficial, but it is powerful, and in many cases its definitions become the de facto standard.

…and for discussions

Page 37: Using wikipedia as a source of chemical information

Types of chemistry article

WIKIPROJECT CHEMISTRY

• Chemical concepts

• Chemical reactions & processes

• Chemists

WIKIPROJECT ELEMENTS

• Chemical elements

WIKIPROJECT CHEMICALS

• Chemical substances

WIKIPROJECT PHARMACOLOGY

• Pharmaceuticals

WIKIPROJECT CELL & MOLECULAR BIOLOGY

• Molecular biology

Page 38: Using wikipedia as a source of chemical information

WikiProject Chemicals

~60 members (10-20 active)

Collaborates on writing quality articles and standards for:

• developing data boxes for articles

• chemical naming, structure drawing

• article assessment

Data validation

Beta-Cyclodextrin

Public domain picture by Edgar181

Page 39: Using wikipedia as a source of chemical information

ChemBoxes, article validation, chemical names, structure drawing, style guide: all are organized by the WikiProjects. Type WP:MOSCHEM into Wikipedia to find the Manual of Style for Chemistry.

WikiProjects collaborate to set standards

Page 40: Using wikipedia as a source of chemical information

Articles are assessed, then tagged on the talk page. A bot compiles these assessments into lists & tables, allowing the project to review and track their articles.

WikiProjects assess articles for quality & importance
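
The kind of summary table such a bot compiles can be sketched in a few lines of Python. This is only an illustration of the tallying step, not the actual bot: the article ratings below are hypothetical sample data, and the function name is an assumption.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical (article, quality, importance) tags, as might be read from talk pages.
ASSESSMENTS = [
    ("Caffeine", "GA", "High"),
    ("Paraxanthine", "Start", "Low"),
    ("Aldehyde", "B", "High"),
    ("Coordination complex", "B", "High"),
]


def tally(assessments: Iterable[Tuple[str, str, str]]) -> Counter:
    """Count articles for each (quality, importance) combination."""
    return Counter((quality, importance) for _, quality, importance in assessments)


if __name__ == "__main__":
    for (quality, importance), count in sorted(tally(ASSESSMENTS).items()):
        print(f"{quality:6} {importance:6} {count}")
```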

Page 41: Using wikipedia as a source of chemical information

Type WP:ASSESS into Wikipedia to see this

Article assessment – by editors

Page 42: Using wikipedia as a source of chemical information

Assessment guides article improvement priorities

Page 43: Using wikipedia as a source of chemical information

WikiTrust – to check trustworthiness of contributions

Downloadable as an extension to Firefox, this adds a tab above the article; click it to see:

Page 44: Using wikipedia as a source of chemical information
Page 45: Using wikipedia as a source of chemical information

General ways to remove vandalism

Watchlists: Users watch all changes to specific pages they care about

Huggle: Software to help Wikipedians track and remove vandalism quickly

Bots: “Obvious” vandalism (such as deleting all content from a page) is spotted and reverted almost immediately by “bots” that patrol the recent edits. (Bots are scripts that automate the editing process.)

Part of my Watchlist from early this morning

Page 46: Using wikipedia as a source of chemical information

Collaborations for validating data

2007-present: ChemSpider and Antony Williams have a longstanding collaboration with the Chemicals WikiProject, aimed at curating data in both ChemSpider and Wikipedia.

2008-2010: CAS provided a database of around 8000 substances to the Chemicals WikiProject free of charge; this collection was also used as the basis for a new CAS open access site for the general public, CAS Common Chemistry.

Page 47: Using wikipedia as a source of chemical information

CAS Common Chemistry

• Launched in April 2009

• Offered as a free service to provide CAS RNs to the public.

Page 48: Using wikipedia as a source of chemical information

Since 2007 Wikipedia has collaborated with IUPAC to help propagate IUPAC definitions.

This ensures that Wikipedia has accurate, current definitions, and IUPAC can reach a much wider audience.

Currently, a collaboration is actively inserting IUPAC definitions for polymer terms into articles, and editing/expanding content as needed.

IUPAC collaboration

Page 49: Using wikipedia as a source of chemical information

Data validation

Page 50: Using wikipedia as a source of chemical information

How I use the key terms:

Validation => “How can I be sure the data are correct?”

Curation => an ongoing process of fixing errors

Data validation

Page 51: Using wikipedia as a source of chemical information

In 2008 a data validation drive was initiated for basic chemical identifiers, in collaboration with Antony Williams (ChemSpider)

Led to a collaboration with CAS, to ensure Wikipedia CAS registry nos. are correct

Now more than 3500 substances have been validated against CAS Common Chemistry as having the correct name, structure & CAS RN

Other identifier fields (e.g., KEGG) have since been validated.

Validated content indicated with a check mark

Content validation
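
One cheap first check on a CAS RN, before any comparison against CAS Common Chemistry, is its built-in check digit: the final digit equals the weighted sum of the other digits (weights 1, 2, 3, … counting from the right) modulo 10. The Python sketch below implements that rule; the function name is hypothetical. A passing check digit only catches typing errors, it does not show that the number belongs to the substance in the article.

```python
import re

# CAS RN layout: 2 to 7 digits, 2 digits, then 1 check digit.
CAS_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def cas_check_digit_ok(cas_rn: str) -> bool:
    """Return True if the string is a well-formed CAS RN with a valid check digit."""
    match = CAS_PATTERN.fullmatch(cas_rn.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the rightmost digit by 1, the next by 2, and so on.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


if __name__ == "__main__":
    for candidate in ("7732-18-5", "58-08-2", "58-08-3"):
        print(candidate, cas_check_digit_ok(candidate))
```

For 7732-18-5 the weighted sum is 8·1 + 1·2 + 2·3 + 3·4 + 7·5 + 7·6 = 105, and 105 mod 10 = 5, which matches the final digit.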

Page 52: Using wikipedia as a source of chemical information

Every old version (called a RevID) of an article is preserved (for all) for posterity, and can potentially serve as a permanent record of a validated version.

The approach to validation
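
Because each revision keeps its own ID, a validated version can be cited with a permanent link. The Python sketch below builds such a link using MediaWiki's `index.php?title=…&oldid=…` URL form; the helper name and the revision number in the example are hypothetical.

```python
from urllib.parse import urlencode


def permalink(title: str, rev_id: int, host: str = "en.wikipedia.org") -> str:
    """Build a permanent link to one specific revision of an article."""
    # MediaWiki titles use underscores in place of spaces.
    query = urlencode({"title": title.replace(" ", "_"), "oldid": rev_id})
    return f"https://{host}/w/index.php?{query}"


if __name__ == "__main__":
    # 123456789 is a made-up revision ID used only for illustration.
    print(permalink("Acetic acid", 123456789))
```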

Page 53: Using wikipedia as a source of chemical information

PROBLEM: This is “the encyclopedia anyone can edit”, so anyone can change the BP of water to 200 °C.

SOLUTION: A bot patrols the pages, and watches for edits to key fields. Any dubious edits are flagged with a red X (next to the data), and logged.

System developed by Dirk Beetstra (Eindhoven University of Technology). It is the only such tool on Wikipedia.

Protecting validated fields
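
The core comparison behind this kind of patrol can be sketched as follows. This is not the actual bot, whose implementation the slides do not describe; it is a minimal Python illustration, and the field names and values are hypothetical. It compares the current values of watched fields against the values stored at validation time and reports any that differ.

```python
from typing import Dict, List, Sequence

# Hypothetical names for the fields a project might choose to watch.
WATCHED_FIELDS = ("CASNo", "StdInChIKey", "BoilingPtC")


def flag_changes(
    validated: Dict[str, str],
    current: Dict[str, str],
    watched: Sequence[str] = WATCHED_FIELDS,
) -> List[str]:
    """Return the watched fields whose current value differs from the validated one."""
    return [name for name in watched if current.get(name) != validated.get(name)]


if __name__ == "__main__":
    validated = {"CASNo": "7732-18-5", "BoilingPtC": "100"}
    vandalized = {"CASNo": "7732-18-5", "BoilingPtC": "200"}
    for name in flag_changes(validated, vandalized):
        print(f"Flag with red X: {name}")
```

Flagging and logging a mismatch, as described above, leaves the decision to a human editor; a change may be a legitimate correction.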

Page 54: Using wikipedia as a source of chemical information

If anyone tries to vandalize a validated field, this will be flagged by a bot soon afterwards.

• This example received a red X 11 minutes after it was vandalized.

Validation protected by bot

Page 55: Using wikipedia as a source of chemical information

Validated revisionIDs

Page 56: Using wikipedia as a source of chemical information

In 2008-2010, around 3000 chemical structures were informally checked against CAS Common Chemistry

PROBLEM: Structures are loaded from an external file on Wikimedia Commons, which can be “invisibly” changed

Checking structures

Page 57: Using wikipedia as a source of chemical information

The bot has been modified to watch changes to the RevID of the Wikimedia Commons structure image

A few hundred images validated so far

Since fall 2010

Page 58: Using wikipedia as a source of chemical information

Drugboxes are patrolled by the bot, but at present WP:PHARM is not active in formal validation. Most work is done by Dirk Beetstra, using official lists from data sources (e.g., ChEBI).

Drugboxes

Page 59: Using wikipedia as a source of chemical information

CONCLUSIONS

Page 60: Using wikipedia as a source of chemical information

Type the shortcuts shown in yellow into the Wikipedia search window

• P:CHEM takes you to the Chemistry Portal

• WP:CHEM and WP:CHEMISTRY – WikiProject pages are often a useful place to look for guidelines and to ask for help

• WP:MOSCHEM takes you to the Chemistry Manual of Style – be sure to check this before making major edits

• WP:CHEAT gives a “cheat sheet” for common edits

• For general chemical information resources, Gary Wiggins has a WikiBook available at http://en.wikibooks.org/wiki/Chemical_Information_Sources

Useful sources

Page 61: Using wikipedia as a source of chemical information

• Wikipedia can be a useful source of highly curated information on chemistry, common chemicals and drugs.

• WikiProjects and the Wikipedia community play an important role in setting standards and maintaining articles. Validation will improve quality further.

• Don’t forget the data page information!

• The writing and the validation need to go further – YOUR help is very welcome!

Conclusion

Page 62: Using wikipedia as a source of chemical information

Thanks to Antony Williams for the invitation to present this Webinar, and also for his many contributions to Wikipedia.

Thanks to Dave Martinsen for moderating this session, even while traveling!

Thanks to the Wikipedia chemists who built this resource.

Thank you for your attention.

Acknowledgements

Picture by Vistamommy, Flickr, CC license

Page 63: Using wikipedia as a source of chemical information

Thank you for your attention

Any questions?

Page 64: Using wikipedia as a source of chemical information

All of my own content in this presentation is released under a Creative Commons BY-SA-3.0 license

Copyright information for images is usually attributed on the slide itself

Content from Wikipedia and Learn Chemistry is reused via a Creative Commons BY-SA-3.0 license. For authors, please visit the original Wikipedia page and select the “history” tab.

Other pictures not attributed should only be my own personal pictures, also CC-BY-SA3.

Copyright information