Discovering similar Research Ideas using Semantic Vectors and Machine Learning Mads Rydahl, UNSILO


Post on 05-Oct-2019


Discovering similar Research Ideas using Semantic Vectors and

Machine Learning

Mads Rydahl, UNSILO


UNSILO
Text Intelligence For Science
Q4/2016


Mads has managed software dev teams for over 20 years. He has built games for Lego Mindstorms, interfaces for Bang & Olufsen, authored a portfolio of patents acquired by Apple, and created the world’s best casual game (before that was profitable ;-)

Mads has lived 5 years in Silicon Valley, worked at Stanford University, and was head of Product and Design for Siri.com, a startup funded by SRI and DARPA and acquired by Apple in 2010.

Mads is cofounder of UNSILO, a Danish startup building semantic discovery tools for science.

Mads Rydahl
[email protected]
linkedin.com/in/rydahl


UNSILO Mission

To build discovery services that make it easy and fast to find relevant knowledge and discover new patterns across all of science

Automated: Because scientific language is constantly growing, evolving, and accelerating.

Omniscient: Because important findings may not be apparent. Even to the author.

Unbiased: Because existing solutions rank by popularity and cause filter bubbles.


UNSILO Core Technology

We extract key phrases without prior domain knowledge, and use Machine Learning to identify novel ideas as they emerge

Unstructured Information Management Framework: Apache UIMA, Apache Ruta

Natural Language Processing suites: Stanford NLP tools, DKPro, et al.

Open Languages, libraries, and frameworks: Python, Java, Hadoop, Spark, TensorFlow, Mahout, Vowpal Wabbit, GenSim, LevelDB, Elasticsearch, Docker, AWS, CloudSigma


Key Challenges

Our Knowledge does Not Compute
▪ The world moves too fast for data curators and ontology writers
▪ Most Scientific Disciplines have no ontologies (or even controlled vocabularies)
▪ Dictionaries and Reference Works are too small and often out-of-date
▪ New discoveries have no official names

People are too creative
▪ There is a lot of variation in language
▪ Researchers often add descriptive detail that obscures facts
▪ There is no “right way” to describe most things

Some things seem obvious …but mostly to the author
▪ The right Level-of-Detail depends both on the context and the reader
▪ The most obvious facts are often omitted because they are implicitly included
▪ Editors think in themes and topics, researchers in methods, properties, and facts


Full Text Search

Pseudohyponatremia: Does It Matter in Current Clinical Practice?
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3894530/
doi: 10.5049/EBP.2006.4.2.77

Serum consists of water (93% of serum volume) and nonaqueous components, mainly lipids and proteins (7% of serum volume). Sodium is restricted to serum water. In states of hyperproteinemia or hyperlipidemia, there is an increased mass of the nonaqueous components of serum and a concomitant decrease in the proportion of serum composed of water. Thus, pseudohyponatremia results because the flame photometry method measures sodium concentration in whole plasma. A sodium-selective electrode gives the true, physiologically pertinent sodium concentration because it measures sodium activity in serum water. Whereas the serum sample is diluted in indirect potentiometry, the sample is not diluted in direct potentiometry. Because only direct reading gives an accurate concentration, we suspect that indirect potentiometry which many hospital laboratories are now using may mislead us to confusion in interpreting the serum sodium data. However, it seems that indirect potentiometry very rarely gives us discernibly low serum sodium levels in cases with hyperproteinemia and hyperlipidemia. As long as small margins of errors are kept in mind of clinicians when serum sodium is measured from the patients with hyperproteinemia or hyperlipidemia, the present methods for measuring sodium concentration in serum by indirect sodium-selective electrode potentiometry could be maintained in the clinical practice.


Using Keywords and Ontologies

Pseudohyponatremia: Does It Matter in Current Clinical Practice? (same article and text as on the Full Text Search slide)

Key: Chemical, Technique, Anatomy, Disease, Species


UNSILO Exhaustive Concept Extraction

Pseudohyponatremia: Does It Matter in Current Clinical Practice? (same article and text as on the Full Text Search slide)

Key: Chemical, Technique, Anatomy, Disease, Species


UNSILO Complete Semantic Mapping

Pseudohyponatremia: Does It Matter in Current Clinical Practice? (same article and text as on the Full Text Search slide)

Key: Action/Relation, Chemical, Technique, Anatomy, Disease, Species


■ Natural Language Processing
Sentences are annotated with part-of-speech tags (noun, verb, adjective) and a dependency tree

methods for measuring sodium concentration in serum by indirect sodium-selective electrode potentiometry

[··thing··] [··action··] [···········thing··········] [·thing·] [····························· thing ······························]

■ Extract all “things”
Method
Sodium concentration
Serum
Indirect Sodium-Selective Electrode Potentiometry

Phrase Extraction
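
The "extract all things" step can be sketched with a very small rule-based chunker. This is only an illustration, not UNSILO's pipeline: the POS tags are assigned by hand here (a real system would take them from an NLP tagger), and the function name `extract_things` is hypothetical.

```python
# Hand-assigned coarse POS tags (an assumption for illustration; a real
# pipeline would get these from a part-of-speech tagger).
TAGGED = [
    ("methods", "NOUN"), ("for", "ADP"), ("measuring", "VERB"),
    ("sodium", "NOUN"), ("concentration", "NOUN"), ("in", "ADP"),
    ("serum", "NOUN"), ("by", "ADP"), ("indirect", "ADJ"),
    ("sodium-selective", "ADJ"), ("electrode", "NOUN"),
    ("potentiometry", "NOUN"),
]


def extract_things(tagged):
    """Collect maximal runs of adjectives and nouns that end in a noun."""
    phrases, current = [], []
    # The sentinel at the end flushes the last open phrase.
    for token, tag in tagged + [("", "END")]:
        if tag in ("ADJ", "NOUN"):
            current.append((token, tag))
            continue
        # Drop trailing adjectives so every phrase ends in a noun.
        while current and current[-1][1] != "NOUN":
            current.pop()
        if current:
            phrases.append(" ".join(t for t, _ in current))
        current = []
    return phrases


things = extract_things(TAGGED)
print(things)
```

On the example sentence this yields the same four "things" as the slide, in lowercase and without lemmatizing "methods" to "Method".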


■ Deduplicate variations

Reduce Morphological and Syntactic variation (Grammar)
■ Normalize adjectival modifiers, compound paraphrases, and expand coordinations
Concentration of Sodium >> Sodium Concentration
The Electrode Potentiometry was indirect >> Indirect Electrode Potentiometry
Methodology >> Method
Sera >> Serum

Reduce Lexical and Semantic variation (Synonyms, hypernyms, and form)
■ Normalize semantic Level-of-Detail using ontologies and vector models
Method >> Mechanism
Serum Sample >> Blood Sample
Serum Sodium Concentration >> Serum Natrium Concentration
Indirect Electrode Potentiometry >> Electroanalysis
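
A minimal sketch of the grammatical part of this normalization, assuming two toy rules: rewriting "X of Y" as "Y X", and a tiny hand-written lemma table. The table `LEMMAS` and the function `normalize_phrase` are hypothetical; a real system would use a morphological analyzer and far more rules.

```python
import re

# Tiny illustrative lemma table (hypothetical; a real system would use a
# morphological analyzer or lemmatizer).
LEMMAS = {"sera": "serum", "methodology": "method"}

# Matches compound paraphrases of the form "<head> of <modifier>".
OF_PATTERN = re.compile(r"^(?P<head>.+?) of (?P<modifier>.+)$")


def normalize_phrase(phrase):
    """Lowercase, rewrite 'X of Y' as 'Y X', and lemmatize each word."""
    text = phrase.lower().strip()
    match = OF_PATTERN.match(text)
    if match:
        text = f"{match.group('modifier')} {match.group('head')}"
    return " ".join(LEMMAS.get(word, word) for word in text.split())


for variant in ["Concentration of Sodium", "Sodium Concentration", "Sera", "Methodology"]:
    print(variant, ">>", normalize_phrase(variant))
```

Both spellings of the sodium phrase collapse to the same key, which is what allows later counting and ranking to treat them as one concept.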

Snap to most common fragments and hypernyms
■ Indirect Sodium Selective Potentiometry is-a-kind-of Indirect Potentiometry is-a-kind-of Electroanalysis

Remove too rare super-grams and hyponyms
■ E.g. Clinically implemented indirect sodium-selective potentiometry
■ E.g. Error-prone Indirect ion-selective electrode potentiometry

Phrase Extraction
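
Snapping a rare phrase to a more common hypernym could look roughly like the sketch below. The is-a links, the corpus counts, and the `min_count` threshold are all made up for illustration; the slides do not say how UNSILO chooses the cut-off.

```python
# Hypothetical is-a links and corpus counts, for illustration only.
PARENT = {
    "clinically implemented indirect sodium-selective potentiometry":
        "indirect sodium-selective potentiometry",
    "indirect sodium-selective potentiometry": "indirect potentiometry",
    "indirect potentiometry": "electroanalysis",
}
COUNTS = {
    "clinically implemented indirect sodium-selective potentiometry": 1,
    "indirect sodium-selective potentiometry": 40,
    "indirect potentiometry": 300,
    "electroanalysis": 5000,
}


def snap_to_common(phrase, min_count=10):
    """Walk up the is-a chain until a phrase is frequent enough to keep."""
    current = phrase
    while COUNTS.get(current, 0) < min_count and current in PARENT:
        current = PARENT[current]
    return current


rare = "clinically implemented indirect sodium-selective potentiometry"
print(snap_to_common(rare))
print(snap_to_common(rare, min_count=100))
```

Raising `min_count` trades detail for coverage: the same rare phrase lands on a broader ancestor.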


■ Rank and Filter using Frequency and Distribution Metrics

Local features:
■ Occurrence count
■ Position in document graph
■ Textual context

Global features:
■ Occurrence count
■ TF-IDF
■ Domain distribution
■ Document concentration
■ Aggregated textual context
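
As a worked example of one global feature, the sketch below computes a plain TF-IDF score per concept, using tf × log(N / df). The corpus is made up, and this particular weighting variant is an assumption; the slides do not specify which formula is used.

```python
import math
from collections import Counter

# Toy corpus of already-extracted concepts per document (made-up data).
DOCS = {
    "doc1": ["serum", "sodium concentration", "serum", "potentiometry"],
    "doc2": ["serum", "blood pressure"],
    "doc3": ["serum", "sodium concentration"],
}


def tf_idf(docs):
    """Return {doc_id: {concept: score}} using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each concept occurs.
    df = Counter(c for concepts in docs.values() for c in set(concepts))
    scores = {}
    for doc_id, concepts in docs.items():
        tf = Counter(concepts)
        scores[doc_id] = {
            c: (count / len(concepts)) * math.log(n_docs / df[c])
            for c, count in tf.items()
        }
    return scores


scores = tf_idf(DOCS)
ranked = sorted(scores["doc1"], key=scores["doc1"].get, reverse=True)
print(ranked)
```

A concept that occurs in every document ("serum" here) scores zero, however often it is repeated locally, which is why occurrence count alone is a weak ranking signal.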

■ Train Ranking Models using External Metrics

Human training data:
■ Article data: Which concepts have been included in the abstract and title
■ Behavioral data: Which concepts are clicked on by users
■ Behavioral data: Which articles are clicked on by users (...mostly those with promising titles ;-)

Synthetic training data:
■ Synthetic sentence data: Measure synonymy recall/precision against a known outcome
■ Synthetic text collections: Aggregate using simple keyword searches, then prune out keywords

Phrase Extraction
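
One way to turn the "article data" signal into training labels is sketched below: a concept is marked positive when it appears in the title or abstract. The function `label_concepts` is hypothetical, the abstract string is a shortened stand-in, and plain substring matching is a simplification that a real system would replace with matching on normalized concepts.

```python
def label_concepts(concepts, title, abstract):
    """Label a concept 1 if it occurs in the title or abstract, else 0."""
    reference = f"{title} {abstract}".lower()
    return {concept: int(concept.lower() in reference) for concept in concepts}


labels = label_concepts(
    ["pseudohyponatremia", "flame photometry", "serum water"],
    title="Pseudohyponatremia: Does It Matter in Current Clinical Practice?",
    # Shortened stand-in text, not the article's actual abstract.
    abstract="Indirect potentiometry rarely gives low serum sodium levels.",
)
print(labels)
```

The resulting 0/1 labels could then serve as targets for a ranking model trained on the local and global features.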


● We build high-dimensional vector-space representations of all concepts and phrases from corpus context

Word Embeddings and Word2Vec
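
The idea of representing a term by its corpus context can be illustrated without a trained model. The sketch below is not Word2Vec: it swaps in simple count-based context vectors, built from a made-up three-sentence corpus, and compares them with cosine similarity. The function names and the `window` parameter are assumptions.

```python
import math
from collections import Counter, defaultdict

# Made-up mini corpus; each sentence is a list of tokens.
SENTENCES = [
    ["serum", "sodium", "measured", "by", "potentiometry"],
    ["plasma", "sodium", "measured", "by", "potentiometry"],
    ["gold", "film", "coats", "iron", "nanoparticles"],
]


def context_vectors(sentences, window=2):
    """Count, for each token, the tokens seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        for i, token in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[token][sentence[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = context_vectors(SENTENCES)
print(cosine(vectors["serum"], vectors["plasma"]))
print(cosine(vectors["serum"], vectors["gold"]))
```

"serum" and "plasma" never co-occur, yet they end up close because they share contexts; Word2Vec learns dense vectors from the same kind of signal.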


Vasodilatation (finding)
Peripheral vasodilation (finding)
Vasodilator (substance)
Poisoning by vasodilator (disorder)
Vasodilating agent (product)
Intra-cavernosal vasodilator (product)
Intra-arterial vasodilator (product)
Coronary vasodilator (product)
Alpha blocking vasodilator (product)
Nitrate-based vasodilating agent (product)
Human B-type natriuretic peptide (product)
Endothelin receptor antagonist (product)
Pentaerythritol tetranitrate (product)
Nitroglycerin (product)
Isosorbide mononitrate (product)
Isosorbide dinitrate (product)

Measurement of blood pressure (procedure)
Self-measurement devices (product)
Systolic arterial pressure (observable entity)
Non-invasive arterial pressure (observable entity)
Blood pressure finding (finding)
Blood pressure cuff, device (physical object)
Blood pressure cuff inflator (physical object)
Lying blood pressure (observable entity)
Abnormal blood pressure (finding)
Lower tourniquet cuff inflation (procedure)
Cuff inflated (attribute)

principle.n.01: generalization, basic truth, assumption, law

receptor.n.01: Plasma membrane molecule, G protein-coupled receptor, ligand-gated ion channel, P2X receptor, P2Y receptor

● We build high-dimensional vector-space representations of all concepts and phrases from corpus context

● We use POS tags and disambiguated senses to increase the precision of concepts and phrases in the context

● We apply ontologies, dictionaries and thesauri to improve accuracy on rare, complex, or novel concepts

● We use our high-dimensional model to build real-time semantic indexes with unprecedented precision

Ontology Augmented Vector-space
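
The slide does not say how ontologies are combined with the vector space. One plausible reading, sketched here purely as an assumption, is to pull the vector of a rarely seen concept toward the vector of its ontology parent. All vectors, counts, and is-a links below are invented, and `augmented_vector` is a hypothetical helper.

```python
# Hypothetical vectors, counts, and is-a links, for illustration only.
VECTORS = {
    "vasodilator": [0.9, 0.1, 0.0],
    "nitroglycerin": [0.8, 0.2, 0.1],
    "pentaerythritol tetranitrate": [0.1, 0.1, 0.9],  # noisy: seen only twice
}
COUNTS = {
    "vasodilator": 900,
    "nitroglycerin": 400,
    "pentaerythritol tetranitrate": 2,
}
PARENT = {
    "nitroglycerin": "vasodilator",
    "pentaerythritol tetranitrate": "vasodilator",
}


def augmented_vector(concept, min_count=50):
    """Blend a rare concept's vector with its ontology parent's vector."""
    own = VECTORS[concept]
    parent = PARENT.get(concept)
    if COUNTS.get(concept, 0) >= min_count or parent not in VECTORS:
        return own
    # Trust the corpus vector in proportion to how often the concept was seen.
    weight = COUNTS.get(concept, 0) / min_count
    return [weight * o + (1 - weight) * p for o, p in zip(own, VECTORS[parent])]


print(augmented_vector("nitroglycerin"))
print(augmented_vector("pentaerythritol tetranitrate"))
```

Frequent concepts keep their corpus-derived vector unchanged, while the rare one is moved most of the way toward "vasodilator".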


Synsets built from Vector Cosine Similarity
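
A minimal sketch of grouping terms into synsets by cosine similarity. The vectors are invented three-dimensional stand-ins, the 0.95 threshold is arbitrary, and the greedy grouping in `build_synsets` is just one simple strategy, not necessarily the one used for the slide.

```python
import math

# Hypothetical low-dimensional vectors; real embeddings have many more dimensions.
VECTORS = {
    "vasodilatation": [0.90, 0.10, 0.05],
    "peripheral vasodilation": [0.85, 0.15, 0.05],
    "blood pressure cuff": [0.05, 0.90, 0.20],
    "cuff inflated": [0.10, 0.85, 0.25],
}


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def build_synsets(vectors, threshold=0.95):
    """Greedy grouping: add a term to the first set containing a similar term."""
    synsets = []
    for term, vec in vectors.items():
        for group in synsets:
            if any(cosine(vec, vectors[other]) >= threshold for other in group):
                group.append(term)
                break
        else:
            # No existing set is close enough, so the term starts a new one.
            synsets.append([term])
    return synsets


synsets = build_synsets(VECTORS)
print(synsets)
```

Because the grouping is greedy, the result can depend on the order in which terms are visited; a production system would more likely use a proper clustering method.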


Human-readable Fingerprints

We have built a Corpus-based Recommender that uses our novel and flexible approach to document fingerprinting and similarity

▪ Traditional Document Similarity
  ▪ Document vectors based on TF-IDF and Naïve BOW
  ▪ Slow moving ontologies (snomed, doid, dron)
  ▪ Simple concepts (“insulin” and “obesity”)
  ▪ Limited recognition (only lemmatization/stemming)

▪ UNSILO
  ▪ Dynamic corpus-driven concept similarity
  ▪ Captures novel significant phrases (“insulin insensitivity”)
  ▪ Links concepts across terminology variations (“reduced hormone response”)
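
A human-readable fingerprint can be pictured as a short, weighted list of a document's top concepts. The sketch below assumes concepts have already been extracted, normalized, and scored; the scores are made up, and `fingerprint` and `similarity` are hypothetical helpers rather than UNSILO's API.

```python
import math


def fingerprint(concept_scores, top_n=3):
    """Keep the top-scoring concepts as a small, readable {concept: weight} map."""
    top = sorted(concept_scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return dict(top)


def similarity(fp_a, fp_b):
    """Cosine similarity over the concepts two fingerprints share."""
    dot = sum(w * fp_b[c] for c, w in fp_a.items() if c in fp_b)
    norm_a = math.sqrt(sum(w * w for w in fp_a.values()))
    norm_b = math.sqrt(sum(w * w for w in fp_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up concept scores for three documents.
doc_a = fingerprint({"insulin insensitivity": 0.9, "obesity": 0.6, "diet": 0.2, "mouse": 0.1})
doc_b = fingerprint({"insulin insensitivity": 0.8, "obesity": 0.5, "exercise": 0.3})
doc_c = fingerprint({"gold nanoparticles": 0.9, "thin film": 0.7})

print(list(doc_a))
print(similarity(doc_a, doc_b), similarity(doc_a, doc_c))
```

Unlike a dense document vector, the fingerprint can be shown to a reader directly, and the shared concepts explain why two documents were matched. Matching here is by exact concept key, so linking terminology variations would have to happen before fingerprints are compared.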


UNSILO Products for Science

Mads Rydahl

[email protected]

Founder & Chief Product Officer


UNSILO Discovery Widgets


Springer.com

“Using UNSILO’s fully automated content enrichment technology, we can identify the most descriptive concepts and phrases within any document in our content portfolio, and provide more valuable reading suggestions, even across domains with a highly variable terminology.”

Jan-Erik de Boer
Chief Information Officer
Springer Nature

“Our goal with this new feature is to make it easy for our users to drill down on what they find important in an article, and use that insight as a departure point for their discovery process.”

Stephen Cornelius
Product Owner
IT Platform Development
Springer Nature

UNSILO technology vendor for Springer Nature
▪ 9M scientific articles and book chapters
▪ 22M monthly users
▪ Significant increase in traffic and user engagement
▪ Displaced leading competitor


UNSILO value for researchers
▪ Point directly to the most important ideas of an article
▪ Provide more relevant suggestions by applying a deep semantic understanding of key article concepts
▪ Allow users to "drill down" and interactively explore key concepts of the most relevant related articles

UNSILO value for Scientific Publishers
▪ A scalable way of adding value across all content types
▪ Supplements or replaces manual curation of ontologies
▪ Broader discovery, reduced bounce rates, longer session times, more article views

Easier Content Exploration


■ Normalize Actions and Relationships
Sample linguistic variations of common relationships from re-statements of known facts, then apply what we learn to less well understood domains:
■ Serum consists of water
■ Serum amounts to 93% water
■ Serum contains water
■ Serum is composed of water
■ Serum is mostly water
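
The relationship variations above can all be mapped to one canonical triple. The sketch below uses a few hand-written regular expressions for this; the relation name `has_component` and the pattern list are assumptions, and the slide's proposal is to learn such variations from data rather than write them by hand.

```python
import re

# Hypothetical surface patterns mapped to one canonical relation name.
RELATION_PATTERNS = [
    (re.compile(r"^(?P<subj>\w+) (?:consists of|contains|is composed of|is mostly) (?P<obj>\w+)$"),
     "has_component"),
    (re.compile(r"^(?P<subj>\w+) amounts to \d+% (?P<obj>\w+)$"), "has_component"),
]


def normalize_relation(sentence):
    """Return a (subject, relation, object) triple, or None if nothing matches."""
    text = sentence.strip().rstrip(".").lower()
    for pattern, relation in RELATION_PATTERNS:
        match = pattern.match(text)
        if match:
            return (match.group("subj"), relation, match.group("obj"))
    return None


SENTENCES = [
    "Serum consists of water",
    "Serum amounts to 93% water",
    "Serum contains water",
    "Serum is composed of water",
    "Serum is mostly water",
]
triples = {normalize_relation(s) for s in SENTENCES}
print(triples)
```

All five restatements reduce to a single triple, while sentences outside the pattern list return `None` and would need further patterns or a learned model.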

■ Providing hooks into Unstructured Text
Improve training and prediction capabilities of larger AI initiatives by improving access to consumer feedback, corporate data lakes, or conversations within large communities of practice.

■ Reasoning at Scale
Question answering, uncover hidden causal chains, invalidate futile research projects

■ Augment Researcher’s cognitive abilities
■ Accelerate the pace of Research
■ Improve the return on R&D investments
■ Finding the Cure for Cancer

Future Directions

■ Thin film Coated Gold Nano Particles
■ Coating of Iron nano-particles with thin Gold film
■ Fe Nanoparticles thin-film Gold coat
■ Evaporation-coating of nanoparticles with gold
■ Gold-coated magnetic nanoparticles


Yes! We are hiring!

[email protected]

Mads Rydahl

[email protected]

Founder & Chief Product Officer
