TRANSCRIPT
ProtVec: Problem Based Learning (Rostlab, July 2, 2018)
General introduction: what is ProtVec?
Asgari et al. (2015): Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics
● Feature selection/representation of features is a central step in machine learning
● ProtVec: an "unsupervised data-driven distributed representation for biological sequences"
● Each sequence is represented as an n-dimensional vector
○ characterizes biophysical and biochemical properties
○ determined using neural networks
Concept of Word2Vec (1)
● Mikolov et al. (2013): Efficient Estimation of Word Representations in Vector Space
● Continuous vector representation established in Natural Language Processing (NLP): Word2Vec
● Goal: develop word embeddings that represent the meaning of a word
○ The meaning of a word is defined by its context (i.e. its neighbouring words)
○ → Supervised learning task of predicting context-target pairs (is word a used in the same context as word b?) to train word embeddings → Skip-gram model
○ The Skip-gram model attempts to find word representations that are useful for predicting the surrounding words in a sentence or a document.
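The Skip-gram training data described above can be sketched as (target, context) pairs extracted with a sliding window. A minimal sketch in Python; the function name `skipgram_pairs` and the window size of 1 in the example are illustrative choices, not part of Word2Vec itself:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs for a skip-gram model."""
    pairs = []
    for i, target in enumerate(tokens):
        # every token within `window` positions of the target is a context word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox".split()
print(skipgram_pairs(sentence, window=1))
```

A real Word2Vec model then trains embeddings so that a target word's vector is predictive of its context words in exactly these pairs.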
Concept of Word2Vec (2)
● Syntactically or semantically similar words have similar vector representations
● Pairs of words with similar relationships can be found through simple algebraic operations
Vec(King) - Vec(Man) + Vec(Woman) ≈ Vec(Queen)
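The analogy above can be reproduced on toy data: subtract and add vectors, then look for the nearest remaining word by cosine similarity. The 3-dimensional vectors below are made up purely for illustration (real Word2Vec embeddings have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy embeddings (invented for this example): axes loosely encode
# "royalty", "male", "female"
vecs = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.9, 0.2],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.0, 0.8],
    "apple": [0.0, 0.05, 0.05],
}

# Vec(King) - Vec(Man) + Vec(Woman)
query = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]

# nearest word, excluding the three input words
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(query, vecs[w]))
print(best)  # -> queen
```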
Concept of Word2Vec (3)
● Can this be applied to proteins as well? → ProtVec
Protein-Space Construction (1)
● In NLP: a large corpus of sentences to train word embeddings
● For proteins: a large corpus of sequences to train the representation
○ Swiss-Prot with 546,790 manually annotated and reviewed sequences
● Break each sequence into subsequences (i.e. biological words): three shifted lists of non-overlapping 3-grams per sequence
→ 546,790 x 3 = 1,640,370 sequences (sentences) of 3-grams (words)
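The x3 factor comes from generating three reading frames per sequence: each frame is a "sentence" of non-overlapping 3-grams, shifted by 0, 1, or 2 residues. A minimal sketch (function name is illustrative):

```python
def split_into_3gram_sentences(seq, n=3):
    """Split one sequence into n shifted lists of non-overlapping n-grams."""
    sentences = []
    for offset in range(n):
        shifted = seq[offset:]
        # take complete n-grams only; trailing partial chunks are dropped
        sentences.append([shifted[i:i + n]
                          for i in range(0, len(shifted) - n + 1, n)])
    return sentences

print(split_into_3gram_sentences("MAKLPQRST"))
# -> [['MAK', 'LPQ', 'RST'], ['AKL', 'PQR'], ['KLP', 'QRS']]
```

Applied to all 546,790 Swiss-Prot sequences, this yields the 1,640,370 training "sentences" mentioned above.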
Protein-Space Construction (2)
● Training of the embedding through the Skip-gram neural network
○ for protein sequences: a vector size of 100 and a context size of 25
○ → every 3-gram is represented as a vector of size 100
● Advantage: the model only has to be trained once and can then be adapted to specific problems
Application
● Representation of a sequence as the sum of the vector representations of its overlapping 3-grams
○ → vector of size 100 for one sequence
● Representation of a residue r_i by the vector representation of the 3-gram "r_{i-1} r_i r_{i+1}"
○ → vector of size 100 for one residue
○ problems with the first and last residue (no complete 3-gram)
● ProtVec can be used to predict features at the protein or residue level
○ e.g. used for protein family classification
○ captures important physical and chemical patterns
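The residue-level lookup can be sketched as follows. The function name `residue_vector` is hypothetical, and returning None at the first and last residue is just one possible convention; the slides only note that these positions are problematic:

```python
def residue_vector(seq, i, three_grams):
    """Vector for residue i = vector of the 3-gram (r[i-1], r[i], r[i+1]).

    The first and last residue have no complete 3-gram; None is returned
    for them here (one possible convention, not prescribed by the slides).
    """
    if i == 0 or i == len(seq) - 1:
        return None
    return three_grams[seq[i - 1:i + 2]]

# toy dictionary with 2-dimensional vectors, purely for illustration
# (the real pre-trained vectors have 100 dimensions)
toy = {"MAK": [1.0, 0.0], "AKL": [0.0, 1.0]}
print(residue_vector("MAKL", 1, toy))  # -> [1.0, 0.0]  (3-gram "MAK")
```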
Application
Implementation (1)
● You can download the pre-trained 3-grams here
○ tab-separated file with a header (skip the 1st row when reading it into Python)
○ each row consists of 101 columns: the 1st column is the 3-gram, the remaining 100 columns are the vector of length 100 representing that 3-gram
○ create a dictionary which maps every 3-gram to its vector
● The vector representation of a residue can be obtained directly from the dictionary
● Calculate the vector representation of a sequence by summing up its overlapping 3-gram vectors
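Loading the file into a dictionary could look like the sketch below. It assumes exactly the format described above (tab-separated, one header row, 3-gram followed by float components); the function name `load_protvec` is illustrative:

```python
import csv

def load_protvec(path):
    """Read the pre-trained file into a dict {3-gram: vector of floats}.

    Assumes a tab-separated file with one header row to skip, where each
    data row is a 3-gram followed by its vector components.
    """
    three_grams = {}
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # skip the header row
        for row in reader:
            three_grams[row[0]] = [float(x) for x in row[1:]]
    return three_grams
```

With the real file, every value in the returned dictionary is a list of 100 floats.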
Implementation (2)

seq = "MAKLPQRST…"              # your sequence
sequence_vec = [0.0] * 100      # length 100, all zeros
# iterate over all overlapping 3-grams (e.g. MAK, AKL, KLP, LPQ, ...)
for i in range(len(seq) - 2):
    # threeGrams is your dictionary with the pre-trained 3-gram vectors
    prot_vec_3_gram = threeGrams[seq[i:i + 3]]
    sequence_vec = [s + v for s, v in zip(sequence_vec, prot_vec_3_gram)]
Any questions?