large scale nlp using python's nltk on azure
Beat Schwegler
head in the cloud, feet on the ground
Twitter: @cloudbeatsch Blog: http://cloudbeatsch.com
I saw Mr. Washington with a saw!
I saw Mr. Washington.
This is your saw… I told you!
Is this really a chainsaw?
fundamentals of nlp
natural language toolkit (nltk)
running python and nltk on Azure
source: http://www.nltk.org/book_1ed/ch01.html
simple pipeline architecture for a spoken dialogue system
dialogue with a chatbot
identify language
tokenize & tag part of speech (pos)
identify named entities
corpora and lexical resources
    a corpus is a large body of text
    a lexical resource is a collection of words associated with additional information
e.g. brown corpus
    first million-word electronic corpus of english, created in 1961 at brown university
segmentation
tokenize
tag part of speech (pos)
identify named entities
source: http://www.nltk.org/book_1ed/ch07.html
entity detection using chunking
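Chunking groups tagged tokens into larger units such as candidate entity names. The sketch below is a simplified pure-Python stand-in for that idea, not NLTK's own chunker: it assumes the tokens are already POS-tagged (the Penn Treebank tags here are written by hand) and collects runs of proper nouns. The function name `chunk_proper_nouns` is made up for illustration.

```python
def chunk_proper_nouns(tagged_tokens):
    """Group consecutive proper-noun tokens into candidate entity chunks."""
    chunks, current = [], []
    for word, tag in tagged_tokens:
        if tag in ("NNP", "NNPS"):
            current.append(word)          # extend the running chunk
        elif current:
            chunks.append(" ".join(current))  # a non-proper-noun closes the chunk
            current = []
    if current:
        chunks.append(" ".join(current))  # flush a chunk that ends the sentence
    return chunks


# Hand-tagged example sentence; not the output of a real tagger.
tagged = [("I", "PRP"), ("saw", "VBD"), ("Mr.", "NNP"), ("Washington", "NNP"),
          ("with", "IN"), ("a", "DT"), ("saw", "NN"), ("!", ".")]
print(chunk_proper_nouns(tagged))
```

A real chunker also labels each chunk (person, location, organization); this sketch only shows the grouping step.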
fundamentals of nlp
natural language toolkit (nltk)
running python and nltk on Azure
text as a sequence of words and punctuation represented as a list
    sent = ['I', 'love', 'Dublin', '!']
    upper_sent = [w.upper() for w in sent]
downloading corpus and lexical resources
    import nltk
    nltk.download('all')
    nltk.download('brown')
segment text into sentences
    from nltk.tokenize import sent_tokenize
    sent_tokenize_list = sent_tokenize(text)
tokenize sentence
    from nltk.tokenize import word_tokenize
    tokens = word_tokenize(sentence)
tag part of speech (pos)
    tags = nltk.pos_tag(tokens)
identify named entities
    entities = nltk.ne_chunk(tags)
    entities.draw()
demo
language recognition
    import langid
    lang = langid.classify(text)[0]
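The individual steps (identify language, segment, tokenize, tag, identify named entities) can be chained into one function. The following is a minimal sketch, not the code shown in the talk: the names `load_default_steps` and `analyze` and the `steps` dictionary are assumptions made for illustration. The langid and NLTK imports are deferred so the steps can be swapped for stand-ins.

```python
def load_default_steps():
    """Build the step table from langid and NLTK (imported lazily)."""
    import langid
    import nltk
    from nltk.tokenize import sent_tokenize, word_tokenize

    return {
        "language": lambda text: langid.classify(text)[0],
        "segment": sent_tokenize,
        "tokenize": word_tokenize,
        "tag": nltk.pos_tag,
        "chunk": nltk.ne_chunk,
    }


def analyze(text, steps=None, language="en"):
    """Return one chunked result per sentence, or [] if the language differs."""
    steps = steps or load_default_steps()
    if steps["language"](text) != language:
        return []  # skip text the English models are not meant for
    results = []
    for sentence in steps["segment"](text):
        tokens = steps["tokenize"](sentence)
        tagged = steps["tag"](tokens)
        results.append(steps["chunk"](tagged))
    return results
```

Calling `analyze(text)` without `steps` requires langid and NLTK to be installed and the NLTK data for the tokenizer, tagger and chunker to be downloaded.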
fundamentals of nlp
natural language toolkit (nltk)
running python and nltk on Azure
azure cloud services
azure webjobs
azure functions
azure cloud services & python
    pip's requirements.txt
    PowerShell scripts for setup and launch
azure webjobs & python
    upload zip (inc. dependencies)
    runs run.py (or the first py file it finds)
configuration settings
    import os
    key = os.environ["STORAGE_KEY"]
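Reading a setting straight from `os.environ` raises a bare `KeyError` when the setting is missing. A small helper can give a clearer failure and allow defaults. This is a sketch only; `get_setting` is a hypothetical name, and `STORAGE_KEY` is the setting name taken from the slide.

```python
import os


def get_setting(name, default=None):
    """Read a setting from the environment; fail clearly if it is missing."""
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(f"missing required setting: {name}")
    return value
```

Usage would look like `key = get_setting("STORAGE_KEY")`, with the value supplied through the environment rather than stored in the code.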
publish webjob
    pip install packages into site-packages
    zip application (inc. the packages it depends on)
    upload zip file
add package location to sys.path
    import os
    import sys
    p = os.path.join(os.getcwd(), "site-packages")
    sys.path.append(p)
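The zip step can be scripted. The sketch below assumes the packages were already installed next to the application, for example with `pip install -r requirements.txt -t site-packages`. The function name `build_webjob_zip` and the folder layout are assumptions for illustration. Uploading is left out.

```python
import os
import zipfile


def build_webjob_zip(app_dir, zip_path):
    """Zip every file under app_dir (including site-packages) with relative paths."""
    zip_abs = os.path.abspath(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(app_dir):
            for name in files:
                full_path = os.path.join(root, name)
                if os.path.abspath(full_path) == zip_abs:
                    continue  # skip the archive itself if it lives inside app_dir
                # store paths relative to app_dir so run.py sits at the zip root
                archive.write(full_path, os.path.relpath(full_path, app_dir))
    return zip_path
```

Keeping `run.py` at the root of the archive matches the slide's note that the webjob runs `run.py`.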
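downloading corpus
    D:\\local\\AppData\\nltk_data
    import os
    import nltk
    # environment values are strings, so compare against "true" rather than True
    if os.getenv("DOWNLOAD", "true").lower() == "true":
        dest = os.environ["NLTK_DATA_DIR"]
        nltk.download('all', download_dir=dest)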
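Environment variables always arrive as strings, so a flag such as `DOWNLOAD` needs explicit parsing before it can gate the corpus download. The sketch below shows one way to do that. The helper names and the set of accepted "true" spellings are assumptions; `DOWNLOAD` and `NLTK_DATA_DIR` are the setting names from the slide.

```python
import os

TRUE_VALUES = {"1", "true", "yes", "on"}  # assumed spellings, adjust as needed


def env_flag(name, default=True):
    """Interpret an environment variable as a boolean flag."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUE_VALUES


def download_corpora_if_enabled():
    """Download the NLTK data only when the DOWNLOAD flag allows it."""
    if env_flag("DOWNLOAD"):
        import nltk  # imported lazily; only needed when downloading
        nltk.download("all", download_dir=os.environ["NLTK_DATA_DIR"])
```

Setting `DOWNLOAD` to `false` skips the download on later runs once the data is already in place.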
using queues for communication
    reads text from input queue
    writes processed text into output queues
auto scale
    based on queue length
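The read-process-write loop can be sketched with Python's in-process `queue.Queue` standing in for the cloud queues. This is not the Azure SDK, and `process` is a placeholder for the real NLP step. All names are illustrative.

```python
import queue


def process(text):
    """Placeholder for the NLP step; here it only upper-cases the text."""
    return text.upper()


def drain(input_queue, output_queue, handler=process):
    """Move every waiting message from input to output, applying handler."""
    handled = 0
    while True:
        try:
            message = input_queue.get_nowait()
        except queue.Empty:
            return handled  # nothing left to read
        output_queue.put(handler(message))
        handled += 1


inbox, outbox = queue.Queue(), queue.Queue()
for text in ("I love Dublin!", "I saw Mr. Washington."):
    inbox.put(text)
count = drain(inbox, outbox)
print(count, outbox.get_nowait())
```

Because the worker only talks to the two queues, more workers can be added without the producers or consumers changing.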
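Scaling on queue length amounts to turning a backlog size into a target number of workers. The function below only illustrates that arithmetic. The name `desired_instances` and every number in it are assumptions, and in practice the rule would be expressed in the platform's scaling configuration rather than in application code.

```python
import math


def desired_instances(queue_length, per_instance=100, min_instances=1, max_instances=10):
    """Target instance count for a backlog, clamped to the allowed range."""
    needed = math.ceil(queue_length / per_instance)
    return max(min_instances, min(max_instances, needed))
```

With these illustrative defaults, a backlog of 250 messages maps to 3 instances, and an empty queue falls back to the minimum of 1.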
debugging python webjobs
    local: Visual Studio and webjob simulator
    cloud: use kudu (xyz.scm.azurewebsites.net) and logs
demo
in closing…
nltk is a great toolkit to perform nlp tasks
azure provides an elastic and scalable platform to run python nltk jobs
http://www.nltk.org/
http://www.nltk.org/book_1ed
http://azure.com/