TRANSCRIPT
ProtVec: Problem Based Learning (Rostlab, July 2, 2018)
General introduction: what is ProtVec?
Asgari et al. (2015): Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics
● Feature selection/representation of features is a central step in machine learning
● ProtVec: an "unsupervised data-driven distributed representation for biological sequences"
● Each sequence is represented as an n-dimensional vector
○ characterizes biophysical and biochemical properties
○ determined using neural networks
Concept of Word2Vec (1)
● Mikolov et al. (2013): Efficient Estimation of Word Representations in Vector Space
● Continuous vector representation established in Natural Language Processing (NLP): Word2Vec
● Goal: develop word embeddings that represent the meaning of a word
○ The meaning of a word is defined by its context (i.e. its neighbouring words)
○ → Supervised learning task of predicting context-target pairs (is word a used in the same context as word b?) to train word embeddings → Skip-gram model
○ The Skip-gram model attempts to find word representations that are useful for predicting the surrounding words in a sentence or a document.
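The Skip-gram training data described above can be sketched as (target, context) pairs extracted with a sliding window. A minimal sketch in Python; the function name `skipgram_pairs` and the window size of 1 in the example are illustrative choices, not part of Word2Vec itself:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs for a skip-gram model."""
    pairs = []
    for i, target in enumerate(tokens):
        # every token within `window` positions of the target is a context word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox".split()
print(skipgram_pairs(sentence, window=1))
```

A real Word2Vec model then trains embeddings so that a target word's vector is predictive of its context words in exactly these pairs.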
Concept of Word2Vec (2)
● Syntactically or semantically similar words have similar vector representations
● Pairs of words with similar relationships can be found through simple algebraic operations
Vec(King) - Vec(Man) + Vec(Woman) ≈ Vec(Queen)
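The analogy above can be reproduced on toy data: subtract and add vectors, then look for the nearest remaining word by cosine similarity. The 3-dimensional vectors below are made up purely for illustration (real Word2Vec embeddings have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy embeddings (invented for this example): axes loosely encode
# "royalty", "male", "female"
vecs = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.9, 0.2],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.0, 0.8],
    "apple": [0.0, 0.05, 0.05],
}

# Vec(King) - Vec(Man) + Vec(Woman)
query = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]

# nearest word, excluding the three input words
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(query, vecs[w]))
print(best)  # -> queen
```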
Concept of Word2Vec (3)
● Can this be applied to proteins as well? → ProtVec
Protein-Space Construction (1)
● In NLP: a large corpus of sentences to train word embeddings
● For proteins: a large corpus of sequences to train the representation
○ Swiss-Prot with 546,790 manually annotated and reviewed sequences
● Break each sequence into subsequences (i.e. biological words): three shifted lists of non-overlapping 3-grams per sequence
→ 546,790 x 3 = 1,640,370 sequences (sentences) of 3-grams (words)
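The x3 factor comes from generating three reading frames per sequence: each frame is a "sentence" of non-overlapping 3-grams, shifted by 0, 1, or 2 residues. A minimal sketch (function name is illustrative):

```python
def split_into_3gram_sentences(seq, n=3):
    """Split one sequence into n shifted lists of non-overlapping n-grams."""
    sentences = []
    for offset in range(n):
        shifted = seq[offset:]
        # take complete n-grams only; trailing partial chunks are dropped
        sentences.append([shifted[i:i + n]
                          for i in range(0, len(shifted) - n + 1, n)])
    return sentences

print(split_into_3gram_sentences("MAKLPQRST"))
# -> [['MAK', 'LPQ', 'RST'], ['AKL', 'PQR'], ['KLP', 'QRS']]
```

Applied to all 546,790 Swiss-Prot sequences, this yields the 1,640,370 training "sentences" mentioned above.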
Protein-Space Construction (2)
● Training of the embedding through the Skip-gram neural network
○ for protein sequences: a vector size of 100 and a context size of 25
○ → every 3-gram is represented as a vector of size 100
● Advantage: the model only has to be trained once and can then be adapted to specific problems
Application
● Representation of a sequence as the sum of the vector representations of its overlapping 3-grams
○ → vector of size 100 for one sequence
● Representation of a residue r_i by the vector representation of the 3-gram "r_{i-1} r_i r_{i+1}"
○ → vector of size 100 for one residue
○ problems with the first and last residue (no complete 3-gram)
● ProtVec can be used to predict features at the protein or residue level
○ e.g. used for protein family classification
○ captures important physical and chemical patterns
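The residue-level lookup can be sketched as follows. The function name `residue_vector` is hypothetical, and returning None at the first and last residue is just one possible convention; the slides only note that these positions are problematic:

```python
def residue_vector(seq, i, three_grams):
    """Vector for residue i = vector of the 3-gram (r[i-1], r[i], r[i+1]).

    The first and last residue have no complete 3-gram; None is returned
    for them here (one possible convention, not prescribed by the slides).
    """
    if i == 0 or i == len(seq) - 1:
        return None
    return three_grams[seq[i - 1:i + 2]]

# toy dictionary with 2-dimensional vectors, purely for illustration
# (the real pre-trained vectors have 100 dimensions)
toy = {"MAK": [1.0, 0.0], "AKL": [0.0, 1.0]}
print(residue_vector("MAKL", 1, toy))  # -> [1.0, 0.0]  (3-gram "MAK")
```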
Application
Implementation (1)
● You can download the pre-trained 3-grams here
○ tab-separated file with a header (skip the 1st row when reading it into Python)
○ each row consists of 101 columns: the 1st column is the 3-gram, the remaining 100 columns are the vector of length 100 representing that 3-gram
○ create a dictionary which maps every 3-gram to its vector
● The vector representation of a residue can be obtained directly from the dictionary
● Calculate the vector representation of a sequence by summing up its overlapping 3-gram vectors
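Loading the file into a dictionary could look like the sketch below. It assumes exactly the format described above (tab-separated, one header row, 3-gram followed by float components); the function name `load_protvec` is illustrative:

```python
import csv

def load_protvec(path):
    """Read the pre-trained file into a dict {3-gram: vector of floats}.

    Assumes a tab-separated file with one header row to skip, where each
    data row is a 3-gram followed by its vector components.
    """
    three_grams = {}
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # skip the header row
        for row in reader:
            three_grams[row[0]] = [float(x) for x in row[1:]]
    return three_grams
```

With the real file, every value in the returned dictionary is a list of 100 floats.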
Implementation (2)

seq = "MAKLPQRST…"              # your sequence
sequence_vec = [0.0] * 100      # length 100, all zeros
# iterate over all overlapping 3-grams (e.g. MAK, AKL, KLP, LPQ, ...)
for i in range(len(seq) - 2):
    # threeGrams is your dictionary with the pre-trained 3-gram vectors
    prot_vec_3_gram = threeGrams[seq[i:i + 3]]
    sequence_vec = [s + v for s, v in zip(sequence_vec, prot_vec_3_gram)]
Any questions?