Using Wikipedia as a Source of Chemical Information
DESCRIPTION
Webinar for the Chemical Information Division of the American Chemical Society. Describes the types of chemical data in Wikipedia, and how these are uploaded and maintained by the Wikipedia community.
TRANSCRIPT
Prof. Martin A. Walker, SUNY Potsdam
June 27, 2013
Webinar for ACS Chemical Information Division
Using Wikipedia as a Source of Chemical Information
Introduction
Chemical substance data in Wikipedia
Other chemistry-related content
Behind the scenes:
• How articles are written
• WikiProjects
Conclusion
Overview
What is a wiki?
“A collaborative website which can be directly edited by anyone with access to it.”
(Wiktionary, March 20, 2007)
From the Hawaiian word “wiki wiki” meaning “quick.”
Picture by Jshapiro, WM Commons, CC license
What is Wikipedia?
Wikipedia defines itself as: "a free, web-based, collaborative, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation."
Wikipedia logo is © Wikimedia Foundation, San Francisco, CA
Wikipedia is…
An encyclopedia in over 200 languages
An incredibly useful resource for academia
Written by volunteers
Editable by anyone
Free to be copied, re-used
Free to use (no cost)
Operating for no profit
Wikipedia is not…
A “soapbox” or a place to publish your own work
An authoritative resource for academia
Written mainly by kids, or by paid professionals
Free to re-use without attribution
Run by a corporation
Traditional encyclopedia: “Experts know best”
1. Editors choose an expert
2. Expert writes, based on authoritative resources
3. Editors review and check facts
Wikipedia – a new paradigm? “Many eyes are better”
1. Volunteer writes, supposedly using authoritative resources
2. Other volunteers review and check facts
3. Ongoing process of adding content then review
Much chemical information on the Web is generated by machine. Wikipedia is large, even though most information is entered word-by-word by a human. This means that:
• It exhibits nuances of human analysis
• Much of it first enters the Web through Wikipedia
• It is curated by humans
• It has silly human errors!
The value to cheminformatics – original human input
Editing Wikipedia articles
Picture by Girona7, Wikimedia Commons, CC license
Article pages describe a specific topic
To comment on something in the article, click on the “Discussion” tab
To look at earlier versions, click the “History” tab
To change the article, click the “Edit” tab – but be careful!
The Wikipedia article page
Chemical substance information
Substance articles
After a general lead section (“lede”), most decent substance articles cover these main areas:
• Physical & chemical properties
• Preparation
• Uses
• Identifiers, physical & chemical data (in a Chembox)
Detailed information on safety or chemical suppliers is considered inappropriate.
Substance articles
Wikipedia - an encyclopedia, NOT a database
• But can it be used like a database anyway?
• What about DBpedia?
Substance data in Wikipedia
The Chembox on a substance page contains standard representations such as
• Skeletal formula
• IUPAC name
• InChI and InChIKey
• CAS no. (represents substance, not structure)
• SMILES (de facto standard before InChI)
These were traditionally supplied for use by readers to copy/paste, but we were asked to make a machine-friendly version
Chemboxes & Drugboxes
Chemboxes were originally set up as tables – OK for people, but not for data mining.
EARLY CHEMBOXES
A typical chembox from 2007
Some data (e.g., InChIs for complex molecules) can be very long – and this was a hindrance to their use in Wikipedia.
TABLE EXPLOSIONS!
Now designed as a set of data fields with values entered by the editor – better for data extraction and for validation
Drugboxes also redesigned
Machine-friendly formats (SMILES, InChI, InChIKey, CAS Reg. No.) included in nearly all chemboxes
Hide/show used to avoid table “explosions”
Collections of Wikipedia data are now available for cheminformatics groups to use
NEW CHEMBOXES
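As a rough illustration of why field-based chemboxes are easier to mine than tables, here is a minimal Python sketch that reads `| field = value` lines into a dictionary. The wikitext is a simplified, flat stand-in written for this example; real chemboxes nest sub-templates and carry many more fields, and the field names used here should be treated as assumptions.

```python
import re

# Illustrative wikitext only: a simplified, flat stand-in for a chembox.
# Real chemboxes are nested and the exact field names may differ.
SAMPLE_WIKITEXT = """{{Chembox
| Name = Water
| CASNo = 7732-18-5
| SMILES = O
| StdInChI = 1S/H2O/h1H2
| StdInChIKey = XLYOFNOQVPJJNP-UHFFFAOYSA-N
}}"""

# One "| field = value" pair per line.
FIELD_LINE = re.compile(r"^\|\s*(\w+)\s*=\s*(.*?)\s*$")


def parse_fields(wikitext: str) -> dict[str, str]:
    """Collect `| field = value` lines into a dictionary."""
    fields = {}
    for line in wikitext.splitlines():
        match = FIELD_LINE.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


water = parse_fields(SAMPLE_WIKITEXT)
print(water["CASNo"])
```

Once the values sit in named fields like this, they can be extracted in bulk or compared against a reference source, which is much harder with free-form table markup.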
FULL FORM / SIMPLE
Current form of CHEMBOX
• InChI can be used to define what structure is being represented when compiling a virtual database.
• InChI can provide an unambiguous reference when validating structures on Wikipedia
• InChIKey is useful to help those using search engines
Value of the InChI and InChIKey
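One way to picture the use of the InChIKey when compiling a virtual database is as a join key: records from different sources that carry the same key are taken to describe the same structure. The sketch below is only an illustration. It checks the general shape of a standard InChIKey (14 letters, 10 letters, 1 letter, separated by hyphens) and groups records by key; the source names are placeholders, and a shape check does not prove that a key is correct.

```python
import re

# Shape of a standard InChIKey: 14 + 10 + 1 uppercase letters.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchikey(text: str) -> bool:
    """True if the text has the layout of an InChIKey (format check only)."""
    return INCHIKEY_PATTERN.fullmatch(text.strip()) is not None


def group_by_inchikey(records):
    """Group (source, inchikey) pairs by key, skipping malformed keys."""
    grouped = {}
    for source, key in records:
        if looks_like_inchikey(key):
            grouped.setdefault(key.strip(), []).append(source)
    return grouped


# Placeholder records; "Other source" is hypothetical.
RECORDS = [
    ("Wikipedia", "XLYOFNOQVPJJNP-UHFFFAOYSA-N"),
    ("Other source", "XLYOFNOQVPJJNP-UHFFFAOYSA-N"),
    ("Wikipedia", "not-a-key"),
]

merged = group_by_inchikey(RECORDS)
print(merged)
```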
PROBLEM: Table creep – a user asks for the table to include the Standard Free Energy of Hydroformylation in a Black Box
ANSWER: Put it on a sub-page – the supplementary data page (chemistry is unique in Wikipedia in having these!).
Click on a link from the bottom of the Chembox:
Data pages
These do have value, with some data pages having over 50,000 hits/year
Data pages
Wikipedia Drug pages
Maintained by the Pharmacology WikiProject, which has a medicinal focus. This means that:
• Some items of interest to chemists may be missing (though main ones are in the drugbox)
• There are no supplementary data pages with spectral data, etc.
• At the “border” between drugs and chemicals, there may be two similar substances that have different boxes. For example:
• caffeine has a drugbox, but paraxanthine has a chembox
Drugboxes
Other chemistry content
Chemical reactions
Some have ReactionBoxes
Good coverage of named organic reactions, but otherwise coverage is patchy – Wikipedia is very weak on reactions compared to March's Advanced Organic Chemistry, probably because of the classic cheminformatics problem: substances are easy to define, reactions are hard.
Only a handful have ReactionBoxes. No database available based on Wikipedia reaction articles
Typical content:
• Mechanism
• Reagents, catalysts, conditions
• Scope & limitations
• Stereochemistry
• Variations
Reaction articles
Biographical articles
• Large proportion of Wikipedia overall, but low in chemistry – chemists tend to be more interested in chemistry than in people! Many more could be written.
• Mainly covers Nobel Laureates and important historical figures, plus a few chemists where someone has taken the time to write an article.
• “Vanity articles” are strongly discouraged!
Biographical articles
Variable coverage. None of these usually have data boxes, but many include templates to show related topics.
• Methods and equipment
• Constants, equations
• Theories and hypotheses
• Chemical families (e.g., “Aldehyde”)
• Terms used (e.g., “Coordination complex”)
• Many others – history, chemical companies, etc.
Concepts & other chemistry content
The Wikipedia community
How articles are written
User:Polimerek – a Polish Wikipedian and polymer chemist
Picture from Wikimedia Polska, CC license
The lonely editor…
Most articles are started by a topic enthusiast, and then expanded & maintained by the community if they are considered useful.
Picture: WM Commons, Public domain
These “Wikipedians” are motivated by altruism and a love of learning, and they want to share their knowledge with the world, for free. They can also enjoy seeing their work read by thousands, or even millions.
Picture by Ziko van Dijk, CC license
WikiProjects provide a place for like-minded editors to discuss articles and organize collaborations. They also agree on standards & templates, and assess quality.
WikiProjects
If you plan major changes to an article or articles, post a comment on the article talk page and also on the relevant WikiProject talk page.
WikiProject talk pages – for informing
These discussions matter; the article discussed here had half a million hits in the last year. Wikipedia’s influence may be unofficial, but it is powerful and in many cases its definitions become the de facto standard.
…and for discussions
Types of chemistry article
WIKIPROJECT CHEMISTRY: Chemical concepts; Chemical reactions & processes; Chemists
WIKIPROJECT ELEMENTS: Chemical elements
WIKIPROJECT CHEMICALS: Chemical substances
WIKIPROJECT PHARMACOLOGY: Pharmaceuticals
WIKIPROJECT CELL & MOLECULAR BIOLOGY: Molecular biology
WikiProject Chemicals
~60 members (10-20 active)
Collaborates on writing quality articles and standards for:
• developing data boxes for articles
• chemical naming, structure drawing
• article assessment
Data validation
Beta-Cyclodextrin
Public domain picture by Edgar181
ChemBoxes, article validation, chemical names, structure drawing, style guide: all are organized by the WikiProjects. Type WP:MOSCHEM into Wikipedia to find the Manual of Style for Chemistry.
WikiProjects collaborate to set standards
Articles are assessed, then tagged on the talk page. A bot compiles these assessments into lists & tables, allowing the project to review and track their articles.
WikiProjects assess articles for quality & importance
Type WP:ASSESS into Wikipedia to see this
Article assessment – by editors
Assessment guides article improvement priorities
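The way assessments can guide priorities may be sketched as a simple tally and sort. This is not the code of the actual Wikipedia bot; the article names and ratings below are invented placeholders, and the quality and importance labels are assumed to follow the usual scales.

```python
from collections import Counter

# Invented ratings for placeholder articles; not real Wikipedia assessments.
ASSESSMENTS = [
    ("Article A", "B", "Top"),
    ("Article B", "Start", "High"),
    ("Article C", "Stub", "Low"),
    ("Article D", "B", "High"),
    ("Article E", "Stub", "Top"),
]

# Assumed orderings, weakest quality first and highest importance first.
QUALITY_ORDER = ["Stub", "Start", "C", "B", "GA", "A", "FA"]
IMPORTANCE_ORDER = ["Top", "High", "Mid", "Low"]


def tally(assessments):
    """Count articles in each (quality, importance) cell of the table."""
    return Counter((quality, importance) for _, quality, importance in assessments)


def priorities(assessments):
    """Sort so that important but weak articles come first."""
    return sorted(
        assessments,
        key=lambda row: (IMPORTANCE_ORDER.index(row[2]), QUALITY_ORDER.index(row[1])),
    )


table = tally(ASSESSMENTS)
worklist = [title for title, _, _ in priorities(ASSESSMENTS)]
print(worklist)
```

Sorting by importance first and quality second puts a top-importance stub ahead of a top-importance B-class article, which matches the idea of spending effort where it matters most.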
WikiTrust – to check trustworthiness of contributions
Downloadable as an extension to Firefox, this adds a tab above the article – click to see:
General ways to remove vandalism
Watchlists: Users watch all changes to specific pages they care about
Huggle: Software to help Wikipedians track and remove vandalism quickly
Bots: “Obvious” vandalism (such as deleting all content from a page) is spotted and reverted almost immediately by “bots” that patrol the recent edits. (Bots are scripts that automate the editing process.)
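To make the idea of “obvious” vandalism concrete, here is a toy heuristic for spotting page blanking by comparing the size of a page before and after an edit. It is a sketch under stated assumptions, not how any real Wikipedia bot works; the function name and both thresholds are invented for illustration.

```python
def is_probable_blanking(old_text: str, new_text: str,
                         min_length: int = 200,
                         keep_ratio: float = 0.1) -> bool:
    """Flag an edit that removes nearly all of a non-trivial page.

    min_length and keep_ratio are arbitrary illustrative thresholds.
    """
    if len(old_text) < min_length:
        # Very short pages are too small to judge by size alone.
        return False
    return len(new_text) < keep_ratio * len(old_text)


before = "x" * 1000
print(is_probable_blanking(before, ""))          # whole page deleted
print(is_probable_blanking(before, "x" * 990))   # small trim
```

A size check like this only catches the crudest cases; subtle changes to a single number need the kind of field-level validation described later in the talk.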
Part of my Watchlist from early this morning
Collaborations for validating data
2007-present: ChemSpider and Antony Williams have a longstanding collaboration with the Chemicals WikiProject, aimed at curating data in both ChemSpider and Wikipedia.
2008-2010: CAS provided a database of around 8000 substances to the Chemicals WikiProject free of charge; this collection was also used as the basis for a new CAS open access site for the general public, CAS Common Chemistry.
CAS Common Chemistry
• Launched in April 2009
• Offered as a free service to provide CAS RNs to the public.
Since 2007 Wikipedia has collaborated with IUPAC to help propagate IUPAC definitions.
This ensures that Wikipedia has accurate, current definitions, and IUPAC can reach a much wider audience.
Currently, a collaboration is actively inserting IUPAC definitions for polymer terms into articles, and editing/expanding content as needed.
IUPAC collaboration
Data validation
How I use the key terms:
Validation => “How can I be sure the data are correct?”
Curation => an ongoing process of fixing errors
Data validation
In 2008 a data validation drive was initiated for basic chemical identifiers, in collaboration with Antony Williams (ChemSpider)
Led to a collaboration with CAS, to ensure Wikipedia CAS registry nos. are correct
Now 3500+ substances have been validated against CAS Common Chemistry, as having correct name, structure & CAS RN
Other identifier fields (e.g., KEGG) have since been validated.
Validated content indicated with a check mark
Content validation
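A cheap first pass before comparing against a reference source is to test whether a CAS Registry Number is internally consistent. The final digit is a check digit: multiply each preceding digit by its position counted from the right, sum the products, and take the result modulo 10. The sketch below implements that test in Python; the function name is made up for this example, and passing the test only rules out typos, it does not confirm that the number belongs to the substance in the article.

```python
import re

# CAS RN layout: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_check_digit_ok(cas_rn: str) -> bool:
    """True if the string is shaped like a CAS RN and its check digit matches."""
    match = CAS_PATTERN.match(cas_rn.strip())
    if not match:
        return False
    body = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(pos * int(digit)
                for pos, digit in enumerate(reversed(body), start=1))
    return total % 10 == check


print(cas_check_digit_ok("7732-18-5"))  # water
print(cas_check_digit_ok("7732-18-6"))  # last digit altered
```

For water, the weighted sum is 8×1 + 1×2 + 2×3 + 3×4 + 7×5 + 7×6 = 105, and 105 mod 10 = 5, which matches the final digit.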
Every old version of an article is preserved for posterity and identified by a revision ID (RevID), so it can potentially serve as a permanent record of a validated version.
The approach to validation
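Because each revision has its own ID, a validated version can be cited with a permanent link. The sketch below builds such a link using the `oldid` URL parameter of MediaWiki; the helper name is hypothetical and the revision number in the example is a placeholder, not a real validated revision.

```python
from urllib.parse import urlencode


def permalink(title: str, rev_id: int, host: str = "en.wikipedia.org") -> str:
    """Build a link to one specific revision of an article."""
    query = urlencode({"title": title.replace(" ", "_"), "oldid": rev_id})
    return f"https://{host}/w/index.php?{query}"


# 123456789 is a placeholder revision ID used only for illustration.
print(permalink("Acetic acid", 123456789))
```

Storing the RevID alongside a validation record means later edits to the article cannot silently change what was checked.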
PROBLEM: This is “the encyclopedia anyone can edit” – so anyone can change the BP of water to 200 °C.
SOLUTION: A bot patrols the pages, and watches for edits to key fields. Any dubious edits are flagged with a red X (next to the data), and logged.
System developed by Dirk Beetstra (Eindhoven University of Technology). It is the only such tool on Wikipedia.
Protecting validated fields
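The patrol idea can be sketched as a comparison between the field values recorded at validation time and the values in the current revision. This is a simplified illustration, not the actual bot's code; the field names (including `BoilingPtC`) and the choice of which fields to watch are assumptions made for the example.

```python
# Hypothetical list of fields to watch.
WATCHED_FIELDS = ("CASNo", "StdInChIKey", "BoilingPtC")


def flag_changes(validated: dict, current: dict, watched=WATCHED_FIELDS) -> dict:
    """Return {field: (validated_value, current_value)} for watched fields that differ."""
    flags = {}
    for field in watched:
        before = validated.get(field)
        after = current.get(field)
        # Only fields that were validated can be flagged.
        if before is not None and before != after:
            flags[field] = (before, after)
    return flags


validated_water = {
    "CASNo": "7732-18-5",
    "StdInChIKey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
    "BoilingPtC": "100",
}
# The same record after someone changes the boiling point.
edited_water = dict(validated_water, BoilingPtC="200")

flags = flag_changes(validated_water, edited_water)
print(flags)
```

In this sketch a non-empty result would correspond to placing a red X next to the changed field and logging the edit for a human to review.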
If anyone tries to vandalize a validated field, this will be flagged by a bot soon afterwards.• This example received a red X 11 minutes after it was vandalized.
Validation protected by bot
Validated revisionIDs
In 2008-2010, around 3000 chemical structures were informally checked against CAS Common Chemistry
PROBLEM: Structures are loaded from an external file on Wikimedia Commons, which can be “invisibly” changed
Checking structures
The bot has been modified to watch changes to the RevID of the Wikimedia Commons structure image
A few hundred images validated so far
Since fall 2010
Drugboxes are patrolled by the bot, but at present WP:PHARM is not active in formal validation. Most work is done by Dirk Beetstra, using official lists from data sources (e.g., ChEBI).
Drugboxes
CONCLUSIONS
Type the shortcuts shown in yellow into the Wikipedia search window
• P:CHEM takes you to the Chemistry Portal
• WP:CHEM and WP:CHEMISTRY – WikiProject pages are often a useful place to look for guidelines and to ask for help
• WP:MOSCHEM takes you to the Chemistry Manual of Style – be sure to check this before making major edits
• WP:CHEAT gives a “cheat sheet” for common edits
• For general chemical information resources, Gary Wiggins has a WikiBook available at http://en.wikibooks.org/wiki/Chemical_Information_Sources
Useful sources
• Wikipedia can be a useful source of highly curated information on chemistry, common chemicals and drugs.
• WikiProjects and the Wikipedia community play an important role in setting standards and maintaining articles. Validation will improve quality further.
• Don’t forget the data page information!
• The writing and the validation need to go further – YOUR help is very welcome!
Conclusion
Thanks to Antony Williams for the invitation to present this Webinar, and also for his many contributions to Wikipedia.
Thanks to Dave Martinsen for moderating this session, even while traveling!
Thanks to the Wikipedia chemists who built this resource.
Thank you for your attention.
Acknowledgements
Picture by Vistamommy, Flickr, CC license
Thank you for your attention
Any questions?
All of my own content in this presentation is released under a Creative Commons BY-SA-3.0 license
Copyright information for images is usually attributed on the slide itself
Content from Wikipedia and Learn Chemistry is reused via a Creative Commons BY-SA-3.0 license. For authors, please visit the original Wikipedia page and select the “history” tab.
Other pictures not attributed should only be my own personal pictures, also released under CC BY-SA-3.0.
Copyright information