The limits of synthesis
for integrative biology
Nico Franz
School of Life Sciences, Arizona State University
Center of Biology + Society Conversation Series
October 11, 2017 – School of Life Sciences, ASU
@ http://www.slideshare.net/taxonbytes/franz-2017-sols-cbs-seminar-the-limits-of-synthesis-for-integrative-biology
Premise: The notion of synthesis is appealing
https://www.nsf.gov/funding/index.jsp
Implementation (in systematics): Synthesis = one view (at a time)
• Example: The Open Tree of Life project
doi:10.1073/pnas.1423041112
• Example: The Global Biodiversity Information Facility (GBIF)
https://www.slideshare.net/mdoering/gbif-checklist-bank-and-the-backbone
• "It is updated regularly through an automated process in which the Catalogue of Life acts as a
starting point also providing the complete higher classification above families. The following 54
sources have been used to assemble the GBIF backbone: …"
doi:10.5072/hufs9m
Initial questions – How to integrate biological data?
• Does synthesis necessarily mean one view?
⇒ No. Most generally: "The combination of components or elements
to form a connected whole" (~ Oxford).
• Is equating synthesis with one hierarchy empirically and socially adequate,
or desirable?
⇒ Likely not if novel or conflicting views are thereby somehow suppressed.
• What are the consequences of synthesis = one view?
• What are the remedies?
• What are the incentives to conceive of synthesis differently?
• What are the obstacles to doing so?
⇒ To be explored for the use case of biological systematics / biodiversity data.
Background: Linnaean names refer to "non-types" contingently
[Figure: diagram with the labels Language, Types, and Non-types; in successive builds the name Cleistes bifaria is shown acc. to author 1, acc. to author 2, and acc. to author 3]
Dubois. 2005. http://sciencepress.mnhn.fr/sites/default/files/articles/pdf/z2005n2a8.pdf
The Cleistes/Cleistesiopsis use case
⇒ 20 orchid occurrence records, 3 taxonomies, 1 synthesis
⇒ Let's map them!
Charly Lewis, CC BY-SA 3.0
doi:10.1101/157214
A. sec. Radford, Ahles & Bell 1968 – The Bible
Source: Radford, Ahles & Bell. 1968. Manual of the vascular flora of the Carolinas. UNC Press, Chapel Hill.
B. sec. Kartesz 2010 – The Federal Standard
Source: Kartesz. 2010. Floristic synthesis of North America, version 9-15-2010. Biota of North America Program, Chapel Hill.
C. sec. Weakley 2015 – The "Best" New Regional Flora
Source: Weakley. 2015. Flora of the Southern and Mid-Atlantic States. UNC Herbarium, Chapel Hill.
D. sec. SERNEC Raw – Mid-Level Herbarium Aggregator
Source: SERNEC Data Portal. 2017. Available from http://sernecportal.org. Accessed 01 June 2017.
E. sec. SERNEC Synthesis – Mid-Level Herbarium Aggregator
Source: SERNEC Data Portal. 2017. Available from http://sernecportal.org. Accessed 01 June 2017.
What are the implications of "synthesis"?
⇒ The orchids are variously rare and red-listed
Charly Lewis, CC BY-SA 3.0
How to remedy?
⇒ Synthesis as a conflict exposition and alignment service
Charly Lewis, CC BY-SA 3.0
Remedy: Representing taxonomic concepts and alignments
• 9 schemata for the Cleistes/Cleistesiopsis complex
• Vertical sections identify congruent taxonomic concept regions
• Colors identify lineages of taxonomic names (epithets) in use
• There is no consensus! Five incongruent schemata are used concurrently
doi:10.3897/rio.2.e10610
Further diagnosis:
If incongruent taxonomies are endorsed – locally, provisionally, and democratically – then what is the impact for aggregated biodiversity data?
⇒ Taxonomy becomes a variable that we need to represent, and thereby control for (at the system level).
"Controlling the taxonomic variable"
[Figure: panel labels are The 'consensus', The 'bible', The (formerly) federal 'standard', and The 'best', latest regional flora; one annotation reads "Just bad". First build: Expert views are in conflict. Second build: Expert views are reconciled.]
Solution: Instead of aggregating an artificial 'consensus', build translation services
doi:10.3897/rio.2.e10610
Challenge:
How can we redesign aggregation to yield
high-quality biodiversity data packages?
(very abbreviated version)
Step 1 ⇒ Represent only taxonomic concept labels (TCLs) [1]
• Syntax (TCL): taxonomic name [author, year, page] sec. source
• Examples: Cleistes divaricata sec. Gregg & Catling 1993; Pogonia sec. Brown & Wunderlin 1997
[1] Multi-taxonomy input/alignment visualizations generated with Euler/X toolkit: https://github.com/EulerProject/EulerX
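The TCL syntax from Step 1 can be sketched in code. This is a minimal Python illustration, assuming a TCL is simply a name string and a source string joined by " sec. "; the class `TCL` and the helper `parse_tcl` are hypothetical names and are not part of Euler/X.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TCL:
    """A taxonomic concept label: a name as used according to (sec.) one source."""
    name: str    # taxonomic name, optionally followed by [author, year, page]
    source: str  # the treatment in which the name is used

    def __str__(self) -> str:
        return f"{self.name} sec. {self.source}"


def parse_tcl(label: str) -> TCL:
    """Split 'name sec. source' at the first ' sec. ' separator."""
    name, sep, source = label.partition(" sec. ")
    if not sep or not name.strip() or not source.strip():
        raise ValueError(f"not a TCL (expected 'name sec. source'): {label!r}")
    return TCL(name.strip(), source.strip())


a = parse_tcl("Cleistes divaricata sec. Gregg & Catling 1993")
b = parse_tcl("Pogonia sec. Brown & Wunderlin 1997")
print(a.name, "|", a.source)
print(b)
```

Rejecting a bare name (no "sec." part) is the point of the sketch: the same name under two sources yields two distinct labels.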
Step 2 ⇒ Represent each source coherently (Parent-Child relationships)
• Syntax (PC): TCL1 is a child/parent of TCL2 [where TCL1/2 = same source]
• Example: Cleistesiopsis bifaria sec. Pans. & de Barr. 2008 is a child of Cleistesiopsis sec. Pans. & de Barr. 2008
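The parent-child syntax from Step 2 can be sketched as a small per-source structure. This is an illustrative Python sketch, assuming each source is stored separately so that both ends of every parent-child link carry the same "sec." source; the class `SourceTaxonomy` and its methods are hypothetical names.

```python
from collections import defaultdict


class SourceTaxonomy:
    """Parent-child relationships among the concepts of ONE source."""

    def __init__(self, source: str):
        self.source = source
        self.parent_of = {}                  # child name -> parent name
        self.children_of = defaultdict(set)  # parent name -> child names

    def add_child(self, child: str, parent: str) -> None:
        # A concept has at most one parent within a single source.
        if child in self.parent_of and self.parent_of[child] != parent:
            raise ValueError(f"{child} already has a parent in {self.source}")
        self.parent_of[child] = parent
        self.children_of[parent].add(child)

    def label(self, name: str) -> str:
        return f"{name} sec. {self.source}"

    def lineage(self, name: str) -> list:
        """TCLs from the given concept up to the root of this source."""
        path = [self.label(name)]
        while name in self.parent_of:
            name = self.parent_of[name]
            path.append(self.label(name))
        return path


pdb = SourceTaxonomy("Pans. & de Barr. 2008")
pdb.add_child("Cleistesiopsis bifaria", "Cleistesiopsis")
print(pdb.lineage("Cleistesiopsis bifaria"))
```

The sketch does not guard against cycles; a fuller version would check for them when adding a link.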
Step 3 ⇒ Align concepts with Region Connection Calculus (RCC–5)
• Two regions N, M are either:
  • congruent (N == M)
  • properly inclusive (N < M)
  • inversely properly inclusive (N > M)
  • overlapping (N >< M)
  • exclusive of each other (N ! M)
• RCC–5 articulations answer the query: "Can we join regions N and M?"
• Taxonomies have multiple RCC–5 alignable components: nodes (parents, children), node-associated traits, even node-anchoring specimens.
Source: Thau, D.M. 2010. Reasoning about taxonomies. Thesis, UC Davis. http://gradworks.proquest.com/3422778.pdf
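The five RCC–5 relations from Step 3 can be illustrated with plain sets. This is a minimal Python sketch, assuming each region is modeled extensionally as a non-empty finite set of members (for example specimen identifiers). In practice articulations are typically asserted by experts rather than computed this way; the function name `rcc5` is hypothetical.

```python
def rcc5(n: set, m: set) -> str:
    """Return the RCC-5 symbol relating two non-empty regions given as sets."""
    if not n or not m:
        raise ValueError("RCC-5 regions must be non-empty")
    if n == m:
        return "=="   # congruent
    if n < m:
        return "<"    # properly inclusive: N is a proper part of M
    if n > m:
        return ">"    # inversely properly inclusive: M is a proper part of N
    if n & m:
        return "><"   # overlapping: shared and unshared members on both sides
    return "!"        # exclusive of each other


for n, m in [({1, 2}, {1, 2}), ({1}, {1, 2}), ({1, 2}, {1}),
             ({1, 2}, {2, 3}), ({1}, {2})]:
    print(sorted(n), rcc5(n, m), sorted(m))
```

Exactly one branch fires for any pair of non-empty sets, which mirrors the "either ... or" wording of the slide: the five relations are exhaustive and mutually exclusive.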
Step 4 ⇒ Identify occurrence records only to TCLs
• Syntax (ID): Occurrence / organism is identified to TCL
• Example: "CLEMS0012881" is identified to Cleistes divaricata sec. Smith et al. 2004 [additional ID metadata]
[Figure: groups of occurrence records, each group identified to one TCL]
Records: EKY39235, MTSU003611, NCSC00040204, …
Records: BOON8098, CLEMS0061133, WILLI39399, …
Records: GMUF-0039355, IBE006808, USCH58399, …
Records: CONV0006268, MDKY00006482, NCU00038930, …
Records: BRYV0023582, BRYV0023584, KHD00032030, MISS0016604, MMNS000227, NCSC00040206, USMS_000002923, USMS_000002924, VSC0053223, VSC0065528, …
Records: ARIZ393087, DBG39049, USCH51217, …
Records: NCU00040710, USCH96248, VSC0053218, …
Records: CLEMS0012881, FUGR0003293, GA023130, …
Records: BOON8100, NCSC00040210, SJNM45487, …
Records: GA023144, LSU00012494, MISS0016608, …
Records: IBE006810, IND-0012374, MMNS000227
Records: NY8654
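The identification syntax from Step 4 can be sketched as a record type that refuses bare names. This is an illustrative Python sketch, assuming a TCL is recognized by the presence of " sec. "; the class `Identification`, the helper `records_by_tcl`, and the shape of the metadata field are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Identification:
    record_id: str  # occurrence / specimen catalog number
    tcl: str        # full label: "name sec. source"
    metadata: dict = field(default_factory=dict)  # additional ID metadata

    def __post_init__(self):
        # Enforce the rule of Step 4: identify only to TCLs, never to bare names.
        if " sec. " not in self.tcl:
            raise ValueError(f"{self.record_id}: identify to a TCL, not a bare name")


def records_by_tcl(identifications):
    """Group record IDs under the TCL to which they are identified."""
    groups = defaultdict(list)
    for ident in identifications:
        groups[ident.tcl].append(ident.record_id)
    return dict(groups)


ids = [Identification("CLEMS0012881", "Cleistes divaricata sec. Smith et al. 2004")]
print(records_by_tcl(ids))
```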
Step 5 ⇒ Generate logically consistent RCC–5 alignments
• Euler/X is a toolkit that infers logically consistent RCC–5 alignments
• Value-added: MIR – set of Maximally Informative Relations containing the RCC–5 articulation for every possible TCL pair ⇒ Scalability
[Figure: Reasoner inference]
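The idea behind Step 5, deriving further articulations from asserted ones and flagging contradictions, can be illustrated with a toy forward-chaining loop. This Python sketch is not how Euler/X works internally and does not reproduce its reasoning; it assumes a deliberately partial composition table that lists only the compositions with a single definite outcome, so it may leave pairs unresolved that a full reasoner would resolve. The names `CONVERSE`, `DEFINITE`, and `infer` are hypothetical, and the concepts A to D are placeholders.

```python
from itertools import product

# Converse of each relation: if A R B, then B CONVERSE[R] A.
CONVERSE = {"==": "==", "<": ">", ">": "<", "><": "><", "!": "!"}

# Partial composition table: (R1, R2) -> R3 means A R1 B and B R2 C imply A R3 C.
# Only compositions with one definite outcome are listed (an assumption of this sketch).
DEFINITE = {("<", "<"): "<", (">", ">"): ">", ("<", "!"): "!", ("!", ">"): "!"}
for r in CONVERSE:  # composing with congruence leaves the other relation unchanged
    DEFINITE[("==", r)] = r
    DEFINITE[(r, "==")] = r


def infer(articulations):
    """Close a dict {(a, b): symbol} under the partial table; raise on conflict."""
    known = {}

    def record(a, c, rel):
        if known.get((a, c), rel) != rel:
            raise ValueError(f"inconsistent: {a} {known[(a, c)]} {c} vs {a} {rel} {c}")
        if (a, c) in known:
            return False
        known[(a, c)] = rel
        known[(c, a)] = CONVERSE[rel]
        return True

    for (a, b), rel in articulations.items():
        record(a, b, rel)
    changed = True
    while changed:  # repeat until no new articulation can be derived
        changed = False
        for ((a, b), r1), ((b2, c), r2) in product(list(known.items()), repeat=2):
            if b == b2 and a != c and (r1, r2) in DEFINITE:
                changed |= record(a, c, DEFINITE[(r1, r2)])
    return known


mir = infer({("A", "B"): "==", ("B", "C"): "<", ("C", "D"): "!"})
for pair in sorted(mir):
    print(pair[0], mir[pair], pair[1])
```

Starting from three asserted articulations, the loop fills in a relation for every pair among A to D, which is the flavor of the MIR set described above.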
Step 6 ⇒ Integrate occurrence-to-TCL identifications & alignments
• Specimen integration is fully driven by TCL-to-TCL RCC–5 signals
[Figure: occurrence records pooled into three groups]
Records: BOON8098, CLEMS0061133, CONV0006268, EKY39235, GMUF-0039355, IBE006808, IBE006810, IND-0012374, MDKY00006482, MMNS000227, MTSU003611, NCSC00040204, NCU00038930, NY8654, USCH58399, WILLI39399, …
Records: ARIZ393087, BRYV0023582, BRYV0023584, DBG39049, KHD00032030, MISS0016604, MMNS00022, NCSC00040206, USMS_000002923, USMS_000002924, VSC0053223, VSC0065528, …
Records: BOON8100, CLEMS0012881, FUGR0003293, GA023130, GA023144, LSU00012494, MISS0016608, NCSC00040210, NCU00040710, SJNM45487, USCH96248, VSC0053218, …
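One way to read Step 6 is that records can be pooled whenever their TCLs are articulated as congruent. This Python sketch illustrates that reading only, assuming a union-find over "==" articulations; the function `integrate`, the record IDs, and the concept labels are placeholders, not data from the talk.

```python
from collections import defaultdict


def integrate(identifications, articulations):
    """Pool records whose TCLs are linked by congruence (==) articulations.

    identifications: dict record_id -> TCL
    articulations:   dict (tcl_a, tcl_b) -> RCC-5 symbol
    """
    root = {}

    def find(tcl):
        root.setdefault(tcl, tcl)
        while root[tcl] != tcl:
            root[tcl] = root[root[tcl]]  # path halving
            tcl = root[tcl]
        return tcl

    # Only congruence merges groups; all other relations keep TCLs apart.
    for (a, b), rel in articulations.items():
        if rel == "==":
            root[find(a)] = find(b)

    pooled = defaultdict(set)
    for record_id, tcl in identifications.items():
        pooled[find(tcl)].add(record_id)
    return sorted(sorted(group) for group in pooled.values())


# Placeholder data for illustration only.
idents = {
    "rec-1": "Concept-X sec. Source-1",
    "rec-2": "Concept-X2 sec. Source-2",
    "rec-3": "Concept-Y sec. Source-2",
}
arts = {
    ("Concept-X sec. Source-1", "Concept-X2 sec. Source-2"): "==",
    ("Concept-X sec. Source-1", "Concept-Y sec. Source-2"): "!",
}
print(integrate(idents, arts))
```

The two congruent concepts carry different names, yet their records end up in one group; the grouping follows the articulations, not the name strings.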
"Controlling the taxonomic variable"
[Figure: panel labels are The 'consensus', The 'bible', The (formerly) federal 'standard', and The 'best', latest regional flora]
Impact: "Please select your preference (A – D); we can perform all translations"
doi:10.3897/rio.2.e10610
Remedy: Aggregation as a translational service
• We can now respond to queries such as:
  • "Show all specimens identified to the taxonomic name Cleistes divaricata"
    • Returns many records ⇒ Resolves incongruent lineage of name usages
  • "Now show specimens with the TCL Cleistesiopsis divaricata sec. Weakley 2015"
    • Returns record subset ⇒ Resolves only one narrowly circumscribed concept
  • "Now show specimens identified to the TCL Cleistes divaricata sec. RAB 1968, yet translated into the more granular TCLs sec. Weakley 2015"
    • Returns (again) many records, yet represents and contrasts two treatments, as opposed to providing the ambiguous lineage view (above)
  • "Show all specimens with ambiguous 2010/2015 TCL identifications…" (etc.)
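The three kinds of query listed for the translational service can be contrasted in a small sketch. This is illustrative Python over toy data: the record IDs, the concept "Hypothetical segregate sec. Weakley 2015", and the two ">" articulations are assumptions made for the example and are not the published alignment; the function names are hypothetical as well.

```python
# Toy data: record IDs and articulations are invented for illustration only.
RAB = "Cleistes divaricata sec. RAB 1968"
SMITH = "Cleistes divaricata sec. Smith et al. 2004"
W_DIV = "Cleistesiopsis divaricata sec. Weakley 2015"
W_OTHER = "Hypothetical segregate sec. Weakley 2015"

IDENTIFICATIONS = {
    "rec-1": RAB, "rec-2": RAB, "rec-3": W_DIV, "rec-4": W_OTHER, "rec-5": SMITH,
}
# Assumed articulations: (source concept, target concept) -> RCC-5 symbol.
ARTICULATIONS = {(RAB, W_DIV): ">", (RAB, W_OTHER): ">"}


def name_of(tcl):
    return tcl.split(" sec. ")[0]


def source_of(tcl):
    return tcl.split(" sec. ")[1]


def by_name(name):
    """Query 1: every record whose TCL uses this name, whatever the source."""
    return sorted(r for r, t in IDENTIFICATIONS.items() if name_of(t) == name)


def by_tcl(tcl):
    """Query 2: only records identified to exactly this concept."""
    return sorted(r for r, t in IDENTIFICATIONS.items() if t == tcl)


def translate(tcl, target_source):
    """Query 3: target-source concepts that are not exclusive of the given TCL."""
    return {b: rel for (a, b), rel in ARTICULATIONS.items()
            if a == tcl and source_of(b) == target_source and rel != "!"}


print(by_name("Cleistes divaricata"))
print(by_tcl(W_DIV))
print(by_tcl(RAB), "->", translate(RAB, "Weakley 2015"))
```

The name query mixes two sources under one string, the TCL query returns one circumscribed concept, and the translation keeps both treatments visible side by side.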
Understanding the attraction of synthesis = one view
• Ok, so we have diagnosed an issue. How prevalent does it need to be for
aggregation designs to actually change?
• Complication: Under the one-view design, we cannot measure the extent of the
phenomenon very well.
• Is the threshold (of the prevalence of the phenomenon) shared universally
between contributors and users? [⇒ Fitness for use]
• Are unitary aggregation systems designed to foster distrust particularly among
career-advancing experts (e.g. graduate students, postdocs, early-career
researchers) who tend to produce novel, "groundbreaking" views?
• Is the "sweeping under the rug" of conflict an expectation grounded in the long
history of taxonomy? It's 2017 for crying out loud, shouldn't we have figured
out orchids already? Why can't we have one unified "webpage" for every
species? Or: We're so close, fund us once more and we'll promise to "get there".
• Is the quieting of conflict an increasingly acceptable design feature of big data?
Understanding the attraction of synthesis = one view
• Better integration – that accounts for past/present/future conflict – requires a
kind of cognitive readjustment. "I need to ready my data now so that a
dissenting view is more easily/scalably linkable to them". That may be asking
for too much…
• Better integration will likely also force contributors and users to be more
transparent upfront regarding the aims of integration, i.e., to make stronger and
more transparent commitments about fitness for use. Again, asking a lot.
• It does seem that we are in the process of giving something up for the sake of
big data integration. To some extent the integration designs are still too driven
by technical feasibility constraints (which are a moving target, however).
• Dealing with ambiguity and conflict in the ways we humans are accustomed to
in integrative biology is not something that we have translated well enough
into the machine processing realm yet.
• Personal issue: At what point should my advocacy "stop"?
Acknowledgments
• CBS hosts: Kelle Dhein, Andrea Cottrell & Beckett Sterner – Thank you!
• Euler/X team: Bertram Ludäscher, Shizhuo Yu, Jessica Cheng, Ed Gilbert.
• NSF DEB–1155984 (PI Franz); IIS–118088, DBI–1147273 (PI Ludäscher).
• If you have to read one paper: https://doi.org/10.1093/sysbio/syw023
Products: Concept taxonomy in theory and in practice
ZooKeys. doi:10.3897/zookeys.528.6001
Semantic Web. doi:10.3233/SW-160220
Biological Theory. doi:10.1007/s13752-017-0259-5
PLOS ONE. doi:10.1371/journal.pone.0118247
Systematics Biodiv. doi:10.1080/14772000.2013.806371
Systematic Biology. doi:10.1093/sysbio/syw023
Biodiversity Data Journal. doi:10.3897/BDJ.5.e10469
Research Ideas and Outcomes. doi:10.3897/rio.2.e10610