Text Mining Course for KNIME Analytics Platform
KNIME AG · 2019-08-23
Copyright © 2019 KNIME AG
Table of Contents
1. The Open Analytics Platform
2. The Text Processing Extension
3. Importing Text
4. Enrichment
5. Preprocessing
6. Transformation
7. Classification
8. Visualization
9. Clustering
10. Supplementary Workflows
Overview
KNIME Analytics Platform
What is KNIME Analytics Platform?
• A tool for data analysis, manipulation, visualization, and reporting
• Based on the graphical programming paradigm
• Provides a diverse array of extensions:
– Text Mining
– Network Mining
– Cheminformatics
– Many integrations, such as Java, R, Python, Weka, Keras, H2O, etc.
Visual KNIME Workflows
NODES perform tasks on data
Nodes are combined to create WORKFLOWS
[Figure: anatomy of a node, with its inputs, outputs, and status labeled. Status values shown: Not Configured, Configured, Executed, Error.]
Data Access
• Databases
– MySQL, PostgreSQL
– any JDBC (Oracle, DB2, MS SQL Server)
• Files
– CSV, txt
– Excel, Word, PDF
– SAS, SPSS
– XML
– PMML
– Images, texts, networks, chem
• Web, Cloud
– REST, Web services
– Twitter, Google
Big Data
• Spark
• HDFS support
• Hive
• Impala
• Vertica
• In-database processing
Transformation
• Preprocessing
– Row, column, matrix based
• Data blending
– Join, concatenate, append
• Aggregation
– Grouping, pivoting, binning
• Feature Creation and Selection
Analysis & Data Mining
• Regression
– Linear, logistic
• Classification
– Decision tree, ensembles, SVM, MLP, Naïve Bayes
• Clustering
– k-means, DBSCAN, hierarchical
• Validation
– Cross-validation, scoring, ROC
• Deep Learning
– Keras, DL4J
• External
– R, Python, Weka, H2O, Keras
Visualization
• Interactive Visualizations
• JavaScript-based nodes
– Scatter Plot, Box Plot, Line Plot
– Networks, ROC Curve, Decision Tree
– Adding more with each release!
• Misc
– Tag cloud, open street map, molecules
• Script-based visualizations
– R, Python
Deployment
• Database
• Files
– Excel, CSV, txt
– XML
– PMML
– to: local, KNIME Server, SSH-, FTP-Server
• BIRT Reporting
Over 2000 Native and Embedded Nodes Included:
• Data Access
– MySQL, Oracle, ...; SAS, SPSS, ...; Excel, Flat, ...; Hive, Impala, ...; XML, JSON, PMML; Text, Doc, Image, ...; Web Crawlers; Industry Specific; Community / 3rd
• Transformation
– Row; Column; Matrix; Text, Image; Time Series; Java; Python; Community / 3rd
• Analysis & Mining
– Statistics; Data Mining; Machine Learning; Web Analytics; Text Mining; Network Analysis; Social Media Analysis; R, Weka, Python; Community / 3rd
• Visualization
– R; JFreeChart; JavaScript; Community / 3rd
• Deployment
– via BIRT; PMML; XML, JSON; Databases; Excel, Flat, etc.; Text, Doc, Image; Industry Specific; Community / 3rd
Overview
• Installing KNIME Analytics Platform
• The KNIME Workspace
• The KNIME File Extensions
• The KNIME Workbench
– Workflow editor
– Explorer
– Node Repository
– Node Description
• Installing new features
Install KNIME Analytics Platform
• Select the KNIME version for your computer:
– Mac
– Windows – 32 or 64 bit
– Linux
• Download archive and extract the file, or download installer package and run it
Start KNIME Analytics Platform
• Use the shortcut created by the installer
• Or go to the installation directory and launch KNIME via knime.exe
The KNIME Workspace
• The workspace is the folder/directory in which workflows (and potentially data files) are stored for the current KNIME session.
• Workspaces are portable (just like KNIME)
The KNIME Workbench
[Figure: the KNIME Workbench with its panels labeled: KNIME Explorer, Workflow Coach, Node Repository, Workflow Editor, Outline, Console, Node Description.]
KNIME Explorer
• In LOCAL you can access your own workflow projects.
• The Explorer toolbar on the top has a search box and buttons to
– select the workflow displayed in the active editor
– refresh the view
• The KNIME Explorer can contain 4 types of content:
– Workflows
– Workflow groups
– Data files
– Metanode templates
Creating New Workflows, Importing and Exporting
• Right-click in KNIME Explorer to create a new workflow or workflow group, or to import a workflow
• Right-click on a workflow or workflow group to export it
Node Repository
• The Node Repository lists all KNIME nodes
• The search box has 2 modes
– Standard Search – exact match of node name
– Fuzzy Search – finds the most similar node name
• Nodes can be added by drag and drop from the Node Repository to the Workflow Editor.
Console and Other Views
• Console view prints out error and warning messages about what is going on under the hood
• Click on View and select Other… to add different views
– Node Monitor, Licenses, etc.
• KNIME Hub Search View: search for nodes and workflows on the Hub
Node Description
• The Node Description window gives information about:
– Node Functionality
– Input & Output
– Node Settings
– Ports
– References to literature
Workflow Coach
• Node recommendation engine
– Gives hints about which node to use next in the workflow
– Based on the KNIME community's usage statistics
– Based on your own KNIME workflows
Tool Bar
The buttons in the toolbar can be used for the active workflow. The most important buttons:
– Execute selected and executable nodes (F7)
– Execute all executable nodes
– Execute selected nodes and open first view
– Cancel all selected, running nodes (F9)
– Cancel all running nodes
KNIME File Extensions
• Dedicated file extensions for Workflows and Workflow groups associated with KNIME Analytics Platform
• *.knwf for KNIME Workflow Files
• *.knar for KNIME Archive Files
More on Nodes…
A node can have 3 states:
– Not Configured: The node is waiting for configuration or incoming data.
– Configured: The node has been configured correctly, and can be executed.
– Executed: The node has been successfully executed. Results may be viewed and used in downstream nodes.
Inserting and Connecting Nodes
• Insert nodes into the workspace by dragging them from the Node Repository or by double-clicking them in the Node Repository
• Connect nodes by left-clicking the output port of Node A and dragging the cursor to the (matching) input port of Node B
• Common port types:
– Data
– Model
– Image
– Flow Variable
– Database Connection
– Database Query
Node Configuration
• Most nodes require configuration
• To access a node configuration window:
– Double-click the node
– Right-click -> Configure
Node Execution
• Right-click node
• Select Execute in context menu
• If execution is successful, status shows green light
• If execution encounters errors, status shows red light
Node Views
• Right-click node
• Select Views in context menu
• Select output port to inspect execution results
[Figure: two example node views, a Plot View and a Data View.]
Curved Connections!
Getting Started: KNIME Example Server
• Public repository with large selection of example workflows for many, many applications
• Connect via KNIME Explorer
KNIME Community Workflow Hub
A place to share knowledge about Workflows and Nodes https://hub.knime.com
Hot Keys (for Future Reference)
| Task | Hot key | Description |
| --- | --- | --- |
| Node Configuration | F6 | opens the configuration window of the selected node |
| Node View | Shift + F6 | opens first out-port view |
| Node Execution | F7 | executes selected configured nodes |
| | Shift + F7 | executes all configured nodes |
| | Shift + F10 | executes all configured nodes and opens all views |
| | F9 | cancels selected running nodes |
| | Shift + F9 | cancels all running nodes |
| Move Nodes and Annotations | Ctrl + Shift + Arrow | moves the selected node in the arrow direction |
| | Ctrl + Shift + PgUp/PgDown | moves the selected annotation to the front or to the back of all overlapping annotations |
| Workflow Operations | F8 | resets selected nodes |
| | Ctrl + S | saves the workflow |
| | Ctrl + Shift + S | saves all open workflows |
| | Ctrl + Shift + W | closes all open workflows |
| Meta-node | Shift + F12 | opens meta-node wizard |
Stay connected with KNIME
Blog: knime.com/blog
Forum: forum.knime.com
KNIME Hub: hub.knime.com
KNIME E-Learning Course: www.knime.com/e-learning-course
Today’s Example
Today’s Example
• Classification of free-text documents is a common task in the field of text mining.
• It is used to categorize documents, i.e. assign pre-defined topics, or it can be used for sentiment analysis.
• Today we want to construct a workflow that reads and preprocesses text documents, transforms them into a numerical representation and builds a predictive model to assign pre-defined labels to documents.
• Additional tasks:
– Sentiment analysis
– Visualization of documents
– Document clustering
Today’s Example
[Figure: an example review with its fields labeled: Rating, Title, Fulltext, Author.]
Today’s Example
Goal:
• Build a classifier to distinguish between reviews about Italian and Chinese restaurants.
Review about an Italian or a Chinese restaurant?
Today’s Example
Bonus Examples
The KNIME Text Processing Extension
Installation
1.) 2.) KNIME & Extensions -> KNIME Textprocessing
Tip
• Increase maximum memory for KNIME
• Edit knime.ini
– Add “-Xmx3G” as last line of knime.ini file
– Replace 3 by the amount of gigabytes allocated for KNIME
• Useful additional extensions
– XML-Processing (KNIME extension): parsing and processing of XML documents
– KNIME JavaScript Views (Labs): Tagged Document Viewer
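The memory tip above can also be scripted. The following is a minimal sketch, assuming knime.ini is a plain text file with one option per line; the function name `set_max_heap` and the sample file content are made up for illustration. It replaces any existing -Xmx line instead of adding a second one, so the setting is defined only once.

```python
def set_max_heap(ini_text: str, gigabytes: int) -> str:
    """Return ini_text with exactly one -Xmx<N>G option, placed on the last line."""
    # Drop any existing -Xmx line so the option is not defined twice.
    lines = [
        line for line in ini_text.splitlines()
        if not line.strip().startswith("-Xmx")
    ]
    # Drop trailing blank lines so the new option really is the last line.
    while lines and not lines[-1].strip():
        lines.pop()
    lines.append(f"-Xmx{gigabytes}G")
    return "\n".join(lines) + "\n"


# Made-up sample content; a real knime.ini contains more options than this.
sample_ini = "-vmargs\n-Xmx2048m\n-Dexample.flag=true\n"
updated_ini = set_max_heap(sample_ini, 3)
print(updated_ini)
```

To apply this to a real installation, read knime.ini from the installation directory, pass its text through the function, and write the result back after making a backup copy.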
Philosophy
[Figure: text such as “… perhaps your name is Rumpelstiltskin[Person] ? …” is turned into a numerical matrix (rows: 1 1 1 0 1 0 0 1 1 / 0 1 1 0 0 1 0 0 0 / 0 0 1 1 1 0 1 1 0), which is then used for Visualization, Clustering, and Classification.]
Additional Data Types
• Document Cell
– Encapsulates a document
– Title, sentences, terms, words
– Authors, category, source
– Generic meta data (key, value pairs)
• Term Cell
– Encapsulates a term
– Words, tags
Data Table Structures
• Document table
– List of documents
• Bag of words
– Tuples of documents and terms
• Document vectors
– Numerical representations of documents
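The three table structures above can be illustrated outside of KNIME. The following is a rough sketch in plain Python, not the extension's own data types: the two documents are invented, terms are obtained by simple whitespace splitting, and the vectors use 0/1 entries for term presence.

```python
# Document table: a list of documents (here an id mapped to invented text).
documents = {
    "doc1": "great pasta and great wine",
    "doc2": "noodles and dumplings",
}

# Bag of words: one (document, term) tuple per distinct term in a document.
bag_of_words = [
    (doc_id, term)
    for doc_id, text in documents.items()
    for term in sorted(set(text.split()))
]

# Document vectors: one row per document, one 0/1 column per vocabulary term.
vocabulary = sorted({term for _, term in bag_of_words})
document_vectors = {
    doc_id: [1 if (doc_id, term) in bag_of_words else 0 for term in vocabulary]
    for doc_id in documents
}

print(vocabulary)
for doc_id, vector in document_vectors.items():
    print(doc_id, vector)
```

Other weightings, such as term counts or frequencies, fit the same layout; only the cell values change.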
Section Exercise
• Open KNIME
• Import workflows from USB stick
Importing Text
Data Source Nodes
• Typically characterized by:
– Orange color
– No input ports, 1 output port
[Figure: a source node with its status, node annotation, and output port labeled.]
New Node: File Reader
• Workhorse of the KNIME Source nodes
– Reads text based files
– Many advanced features allow it to read most ‘weird’ files
File Reader: Configuration
[Figure: the File Reader configuration dialog, with the file path, basic settings, advanced settings, preview, and node description labeled.]
New Node: Excel Reader (XLS)
• Reads .xls and .xlsx files from Microsoft Excel
– Supports reading from multiple sheets
Excel Reader Configuration
[Figure: the Excel Reader configuration dialog, with the file path, sheet specific settings, and preview labeled.]
New Node: Table Reader
• Reads tables from the native KNIME Format
• Maximum performance
• Minimum configuration
[Figure: the Table Reader configuration dialog, with the file path labeled.]
New Node: Database Reader
• Connectors for common DB types (MySQL, Postgres, SQLite)
• Also works with any JDBC driver
• Common nodes for SQL query building (Groupby, Join, Filter, Sort)
Other Interesting Nodes
• PMML Reader – reads standard predictive models
• XML Reader with XPATH support
• REST/SOAP, and many more
Parser Nodes
• Node Repository: Other Data Types/Text Processing/IO
• Available Parser Nodes
– Flat File Document Parser
– PDF Parser
– Word Parser
– Document Grabber
– …
New Node: Strings To Document
• Creation of document cells from strings
– Converts string cells to document cells
– Useful in combination with e.g. File Reader, XLS Reader, database nodes
Strings To Document: Configuration
[Figure: the Strings To Document configuration dialog, with the Title, Text, Authors, Category, and Tokenizer settings labeled.]
Tokenizers
• Different tokenizers are available leading to slightly different terms extracted from the document
Example: “I’m enjoying the tutorial”
– WhitespaceTokenizer: “I’m”, “enjoying”, “the”, “tutorial”
– English WordTokenizer: “I”, “’m”, “enjoying”, “the”, “tutorial”
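The difference between the two tokenizers can be reproduced with a small sketch. The functions below are rough approximations written for this example only; they are not the tokenizers shipped with the extension, and the regular expression covers just the simple case of a word followed by an apostrophe suffix.

```python
import re


def whitespace_tokenize(text: str) -> list[str]:
    """Split on whitespace only, so "I’m" stays one token."""
    return text.split()


def english_word_tokenize(text: str) -> list[str]:
    """Approximation: split apostrophe suffixes such as ’m off the preceding word."""
    # Alternatives, tried in order: apostrophe + letters, a run of letters,
    # or any other single non-space character (e.g. punctuation).
    return re.findall(r"[’'][A-Za-z]+|[A-Za-z]+|[^\sA-Za-z]", text)


sentence = "I’m enjoying the tutorial"
whitespace_tokens = whitespace_tokenize(sentence)
word_tokens = english_word_tokenize(sentence)
print(whitespace_tokens)
print(word_tokens)
```

Because the terms differ, everything built on them later (bag of words, document vectors) differs slightly too, which is why the tokenizer is a setting worth choosing deliberately.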
New Node: Meta Info Inserter & Extractor
• Inserter allows adding document meta info
– Adds meta info to documents as key value pairs
– Helpful if more meta info is available than is covered by the Strings To Document node
• Extractor brings data back from document cell into table columns
– Each key results in a column, containing the specific values for each document related to that key.
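The inserter and extractor behavior described above can be sketched with plain dictionaries. This is an illustration only: the document layout, the function names `insert_meta` and `extract_meta`, and the sample values are assumptions, and a missing key is represented here as `None`.

```python
# Invented sample documents, each carrying meta info as key/value pairs.
documents = [
    {"title": "Review A", "meta": {"rating": "5", "city": "San Francisco"}},
    {"title": "Review B", "meta": {"rating": "2"}},
]


def insert_meta(document: dict, key: str, value: str) -> None:
    """Inserter: attach one more key/value pair to a document."""
    document["meta"][key] = value


def extract_meta(docs: list[dict]) -> list[dict]:
    """Extractor: one output row per document, one column per meta key."""
    keys = sorted({key for doc in docs for key in doc["meta"]})
    return [
        {"title": doc["title"], **{key: doc["meta"].get(key) for key in keys}}
        for doc in docs
    ]


insert_meta(documents[1], "cuisine", "Chinese")
meta_table = extract_meta(documents)
for row in meta_table:
    print(row)
```

Every key seen in any document becomes a column, so documents that lack a key get an empty value in that column.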
New Node: Tika Parser
• Reads files of various formats from directory
– Searches for all files with specified extension in directory
– Creates one document for each file
– Extracts specified (meta) information
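The directory search described above can be approximated with the Python standard library. This sketch only finds the files and records file-system metadata; it does not parse file contents the way the Tika Parser node does, and the function names and record keys are hypothetical.

```python
from pathlib import Path


def find_files(directory, extensions, recursive=False):
    """Return sorted paths in `directory` whose extension is in `extensions`."""
    wanted = {ext.lower().lstrip(".") for ext in extensions}
    root = Path(directory)
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower().lstrip(".") in wanted
    )


def to_documents(paths):
    """Create one record per file (illustrative keys, no content extraction)."""
    return [
        {"Filepath": str(p), "Title": p.stem, "Size": p.stat().st_size}
        for p in paths
    ]
```

Calling `find_files("reviews", ["pdf", "docx"], recursive=True)` would correspond to enabling the recursive search option for a hypothetical `reviews` directory.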
Tika Parser: Configuration
Settings shown in the dialog: directory, file extensions, recursive search, meta data to extract, extraction of attachments
Section Exercise
• Start with “Exercise: Importing text”
– Import string data from:
• TripadvisorReviews-SanFranciscoRestaurants-ItalianChineseFood.table
– Filter rows with missing titles
– Convert strings to documents
– Filter all columns except the document column
You can download the training workflows from the KNIME Hub: https://hub.knime.com/knime/space/Education/02%20KNIME%20Text%20Mining%20Course/
Section Solution
Import text
• Table Reader
• Row Filter
• Strings to Documents
• Column Filter
Enrichment
Enrichment
• Semantic information is indicated by a tag assignment
– Part of speech, named entities (persons, organizations, genes, …), sentiments
• A tag consists of a type and a value
– Type represents the class or set of tags
• e.g. POS (part of speech), NE (named entity)
– Value represents the actual tag value
• e.g. NN (noun), PERSON
Figure: column containing terms with tags; the term “food” has tag value “NN” and type “POS”
Tagger Nodes
• Typically characterized by:
– Yellow color
– 1 to 2 input ports (requiring one document column), 1 output port
– Assignment of semantic information (tags) to terms
Tagger Nodes
• Node Repository:
Other Data Types/Text Processing/Enrichment
• Available Tagger Nodes
– Stanford tagger
– Dictionary (& Wildcard) tagger
– OpenNLP tagger
– Abner tagger
– Amazon Comprehend
– …
Tagger Nodes
• The number of parallel threads can be set in the node dialog.
• Note: each thread loads a separate model into memory!
• Tagged terms can be set unmodifiable.
New Node: Stanford tagger
• Assigns part of speech tags to terms
– Models for English, German, French (from Stanford NLP Group)
– Alternative node: POS tagger
• Model only for English (from OpenNLP)
Stanford Tagger: Configuration
Settings shown in the dialog: model to use, number of parallel threads
New Node: Dictionary Tagger
• Assigns selected tag to matching terms
– Matches terms in documents against terms in dictionary
– Tag to be assigned to matching terms is specified in the dialog
– Alternative node: Wildcard tagger
• Terms in dictionary may contain wild cards and regular expressions
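A minimal Python sketch of dictionary-based tagging follows. It is not the node's implementation: the `Term` class, the function name, and the `SENTIMENT`/`POSITIVE` labels are assumptions made for illustration. It shows the two matching modes, exact match and “contains”.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    text: str
    tags: list = field(default_factory=list)  # list of (tag_type, tag_value) pairs


def dictionary_tag(terms, dictionary, tag_type, tag_value,
                   exact=True, case_sensitive=False):
    """Append (tag_type, tag_value) to every term that matches a dictionary entry."""
    norm = (lambda s: s) if case_sensitive else str.lower
    entries = {norm(entry) for entry in dictionary}
    for term in terms:
        word = norm(term.text)
        if exact:
            hit = word in entries
        else:
            # "contains" mode: a dictionary entry occurs somewhere inside the term
            hit = any(entry in word for entry in entries)
        if hit:
            term.tags.append((tag_type, tag_value))
    return terms


terms = [Term(w) for w in ["The", "food", "was", "wonderful"]]
dictionary_tag(terms, ["wonderful", "good"], "SENTIMENT", "POSITIVE")
print([(t.text, t.tags) for t in terms])
```

Running the function once per word list, with a different tag value each time, mirrors the use of two tagger nodes in sequence.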
Dictionary Tagger: Configuration
Settings shown in the dialog: dictionary column, type of tag to be assigned, tag value to be assigned, exact match or “contains”
New Node: Tagged Document Viewer
• Displays documents with tags highlighted:
– Takes a column with documents as input
– Allows inspecting the tags assigned to documents
Figure: document column; document with tags highlighted
Tagged Document Viewer: Configuration
Settings shown in the dialog: document column, enable display of tags, number of documents to display, view and interactivity configuration
Section Exercise
• Start with “Exercise: Enrichment”
– Assign (English) POS tags
– View tagged documents
Section Solution
Enrichment
• POS tagger
• Tagged Document Viewer
Section Exercise (Bonus)
• Start with “Exercise: Enrichment II”
– Read files that contain positive and negative words
• MPQA-OpinionCorpus-PositiveList.csv
• MPQA-OpinionCorpus-NegativeList.csv
– Assign positive and negative sentiment tags based on positive and negative word lists
– View tagged documents
– Tip: Dictionary Tagger node
Section Solution (Bonus)
Enrichment
• File Reader
• Dictionary Tagger
• Tagged Document Viewer
Custom NER models
• The NER models provided with the OpenNLP NE tagger and the StanfordNLP NE tagger are trained for only a few entity types and for English only.
• More specific applications and other languages require custom models.
New Node: StanfordNLP NE Learner
• Trains an NER model based on the input dictionary and corpus
– Tag type and value can be set in the dialog
– Creates a tagged corpus from the input documents and dictionary, then trains the model on the tagged corpus
Figure: dictionary and document corpus (inputs), StanfordNLP NE model (output)
StanfordNLP NE Learner: Configuration
Settings shown in the dialog: dictionary column, document corpus, tag type and value
New Node: StanfordNLP NE tagger
• Tags documents based on input NER model.
– NER model can be specified in the dialog: a built-in model or a model from the input port
StanfordNLP NE tagger: Configuration
Settings shown in the dialog: use model from input port or built-in models, parameters for built-in model
Tagging Conflicts
• In case of tag intersections the last tagger node overwrites earlier tags.
• Example: “Serbian-American inventor Nikola Tesla developed the …”
1. POS tagger: “Serbian-American\NNP inventor\NNP Nikola\NNP Tesla\NNP developed\VBD the\DT …”
2. NE tagger: “Serbian-American\NNP inventor\NNP Nikola Tesla\Person developed\VBD the\DT …”
– Overwrite: the NNP tags on “Nikola” and “Tesla” are replaced by the Person tag on “Nikola Tesla”.
Unmodifiable Terms
• Tagged terms can be set unmodifiable
• Unmodifiable terms are not affected by any preprocessing node
• Preprocessing nodes can explicitly ignore unmodifiability
Figure: “Set unmodifiable” option in tagger nodes; “Ignore unmodifiability” option in preprocessing nodes
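The interplay between the unmodifiable flag and preprocessing can be sketched as follows. The `Term` class and the `preprocess` function are hypothetical stand-ins, not KNIME API; the sketch only shows how a flag set during tagging protects a term unless the preprocessing step is told to ignore it.

```python
from dataclasses import dataclass


@dataclass
class Term:
    text: str
    unmodifiable: bool = False  # would be set by a tagger step


def preprocess(terms, transform, ignore_unmodifiability=False):
    """Apply `transform` to term texts, skipping protected terms by default."""
    result = []
    for term in terms:
        if term.unmodifiable and not ignore_unmodifiability:
            result.append(term)  # left untouched
        else:
            result.append(Term(transform(term.text), term.unmodifiable))
    return result


terms = [Term("Nikola Tesla", unmodifiable=True), Term("Developed")]
print([t.text for t in preprocess(terms, str.lower)])
print([t.text for t in preprocess(terms, str.lower, ignore_unmodifiability=True)])
```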
Supplementary Workflows: NER Tagger Model Training
• Trains an NER model for Latin and Gallic names based on “De Bello Gallico” by Julius Caesar.
Preprocessing
Preprocessing
• Reduction of feature space (terms)
• Filtering of unnecessary terms
– Stop words, based on POS tags, dictionaries, regex, …
• Normalization of terms
– Stemming, case conversion
Preprocessing Nodes
• Typically characterized by:
– Yellow color
– 1 to 2 input ports (requiring one document column), 1 output port
– For filtering and normalizing terms of documents and bags of words
Preprocessing Nodes
• Node Repository:
Other Data Types/Text Processing/Preprocessing
• Available Preprocessing Nodes
– Stop Word Filter
– Snowball Stemmer
– Tag Filter
– Case Converter
– RegEx Filter
– …
Preprocessing Nodes
• Preprocessing tab in node dialog to specify:
– Append original documents
– Ignore term unmodifiability (set by tagger nodes)
New Node: Stop Word Filter
• Filters stop words
– Built-in stop word lists: English, French, German, Italian, …
– Alternatively load custom stop word list
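Stop word filtering amounts to a set lookup per token. The sketch below assumes a tiny custom list (`CUSTOM_STOP_WORDS` is made up for the example) rather than one of the built-in lists.

```python
def filter_stop_words(tokens, stop_words, case_sensitive=False):
    """Return the tokens that do not appear in the stop word list."""
    if case_sensitive:
        blocked = set(stop_words)
        return [t for t in tokens if t not in blocked]
    blocked = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in blocked]


# Hypothetical custom stop word list, one entry per word.
CUSTOM_STOP_WORDS = ["the", "a", "was", "and"]

print(filter_stop_words(["The", "food", "was", "great"], CUSTOM_STOP_WORDS))
```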
Stop Word Filter: Configuration
Settings shown in the dialog: built-in stop word lists, custom stop word list
Stemming
Stemming reduces different forms of a word to its common base by sequential application of stemming rules.
Original text: Light caresses colours, sets them aglow, plays with nuances, shadows and structures
Porter stemmer: Light caress colour, set them aglow, plai with nuanc, shadow and structure.
Example rules:
– SSES → SS: caresses → caress
– IES → I: ponies → poni
– SS → SS: caress → caress
– S → (nothing): cats → cat
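The four rules listed above can be written as an ordered chain of suffix checks. This sketch covers only these plural-handling rules, not the complete Porter algorithm, and the function name is an assumption. Order matters: the longest suffix is tested first, so “caresses” is handled by the SSES rule before the plain S rule can apply.

```python
def strip_plural_suffix(word):
    """Apply the first matching rule: SSES->SS, IES->I, SS->SS, S->(nothing)."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", strip_plural_suffix(w))
```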
New Node: Snowball Stemmer
• Reduces terms to word stem
– For various languages: English, German, French, Italian, …
– Integration of Snowball stemming library
– Alternative nodes: Porter Stemmer, Kuhlen Stemmer
• For English only
Snowball Stemmer: Configuration
Setting shown in the dialog: language selection
New Node: Tag Filter
• Filters terms based on specified tag values
– For all tag types and values
Tag Filter: Configuration
Settings shown in the dialog: tag type selection, tag value selection
Section Exercise
• Start with “Exercise: Preprocessing”
– Filtering:
• Numbers
• Punctuation marks
• Stop words
• All terms except: nouns, verbs, adjectives
– Stemming
– To lower case
Section Solution
Preprocessing
• Number Filter
• Punctuation Erasure
• Stop Word Filter
• Case Converter
• Snowball Stemmer
• POS Filter
Transformation
Transformation
• Transformation of data table structures
– List of documents ➔ bag of words
– Bag of words ➔ document / term vectors
– Extraction of document fields to string columns
– Conversion of terms to strings
Transformation Nodes
• Typically characterized by:
– Yellow color
– 1 input port, 1 output port
Transformation Nodes
• Node Repository:
Other Data Types/Text Processing/Transformation
• Available Transformation Nodes
– Bag of Words Creator
– Document Vector
– Strings to Document
– Sentence Extractor
– Document Data Extractor
– Unique Term Extractor
– …
New Node: Bag of Words Creator
• Transforms list of documents into bag of words
– Original documents can be appended in a column
Figure: document list (input), bag of words (output)
Bag of Words Creator: Configuration
Settings shown in the dialog: documents used to create the bag of words, option to append the original documents
New Node: Term to String
• Transforms term cells into string cells
– Tag information is lost in the conversion
Figure: bag of words (input), bag of words with string column (output)
Term to String: Configuration
Setting shown in the dialog: terms to transform to strings
Section Exercise
• Start with “Exercise: Preprocessing II”
– Create bag of words
– Filter terms that occur in fewer than 5 documents
– Tip: Bag Of Words Creator, GroupBy, and Reference Row Filter
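The filtering logic behind this exercise can be sketched in Python. The row layout `(term, doc_id)` and the function name are assumptions; in the workflow, the counting step corresponds roughly to a GroupBy on the term and the final step to a Reference Row Filter against the surviving terms.

```python
from collections import Counter


def filter_rare_terms(bag_of_words, min_docs=5):
    """Keep only rows whose term occurs in at least `min_docs` distinct documents.

    bag_of_words: list of (term, doc_id) rows.
    """
    # Count each (term, document) pair once, then count documents per term.
    doc_freq = Counter(term for term, _ in set(bag_of_words))
    keep = {term for term, n in doc_freq.items() if n >= min_docs}
    return [(term, doc) for term, doc in bag_of_words if term in keep]


bow = [("pizza", d) for d in range(5)] + [("sushi", d) for d in range(2)]
print(filter_rare_terms(bow))
```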
Section Solution
Preprocessing II
• Bag Of Words Creator
• Term to String
• GroupBy
• Row Filter
• Reference Row Filter
New Node: Document Vector
• Transforms bag of words into document vectors
– Creates bit or numerical vectors
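As an illustration, the step from a bag of words to document vectors can be written as below. The row layout `(doc_id, term, frequency)` and the function name are assumptions; the sketch builds one vector position per distinct term and fills it with either 1 (bit vector) or the frequency value (numerical vector).

```python
def document_vectors(bag_of_words, as_bit_vector=True):
    """Build one vector per document from (doc_id, term, frequency) rows.

    Returns (features, vectors) where features is the sorted term list
    and vectors maps doc_id to a list aligned with features.
    """
    features = sorted({term for _, term, _ in bag_of_words})
    index = {term: i for i, term in enumerate(features)}
    vectors = {}
    for doc_id, term, freq in bag_of_words:
        vec = vectors.setdefault(doc_id, [0] * len(features))
        vec[index[term]] = 1 if as_bit_vector else freq
    return features, vectors


bow = [("d1", "food", 0.5), ("d1", "great", 0.5), ("d2", "food", 1.0)]
print(document_vectors(bow))
print(document_vectors(bow, as_bit_vector=False))
```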
Figure: bag of words with frequency column (input), document vector (output)
Document Vector: Configuration
Settings shown in the dialog: documents to append to the left of the created vector columns, create bit or numerical vector
New Node: Document Vector Applier
• Transforms bag of words into document vectors
– Uses the feature space of the reference document vectors
– Creates bit or numerical vectors
Figure: reference document vectors (input), document vector (output)
Document Vector Applier: Configuration
Settings shown in the dialog: include and exclude lists of features of the reference vectors, use settings from model input
New Node: Document Vector Hashing
• Transforms documents into document vectors
– Vector indices of terms are determined by term hashing
– Creates bit or numerical vectors
– Is streamable
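Term hashing can be illustrated with a short Python sketch. The hash function used here (MD5 from the standard library) and the default dimension are assumptions for the example only; the node offers its own choice of hashing functions. Because each index depends only on the term itself, no global vocabulary is needed, which is what makes a streaming implementation possible.

```python
import hashlib


def hashed_vector(tokens, dimensions=16, as_bit_vector=True):
    """Map tokens to a fixed-length vector using a hash of each token."""
    vec = [0] * dimensions
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "big") % dimensions
        # Bit vector: mark presence. Numerical vector: count occurrences.
        vec[idx] = 1 if as_bit_vector else vec[idx] + 1
    return vec


print(hashed_vector(["food", "great", "food"], as_bit_vector=False))
```

Different terms can collide on the same index, so a larger dimension trades memory for fewer collisions.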
Figure: hashed document vector (output)
Document Vector Hashing: Configuration
Settings shown in the dialog: dimensions of document vectors, hashing function
New Node: Document Data Extractor
• Extracts document fields as strings
– Title, text, categories, …
Figure: document column (input), extracted field as string column (output)
Reminder: we stored the restaurant type in the Category field during the Strings To Document conversion
Document Data Extractor: Configuration
Setting shown in the dialog: fields to extract
Frequencies
• Frequencies are based on the number of occurrences of terms
– Locally (in documents): term frequency (TF) absolute or relative
– Globally (in corpus): inverse document frequency (IDF)
• Frequencies can also be used for term filtering
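The local and global counts can be sketched as follows, assuming each document is already a list of tokens. The function names are illustrative. Relative TF divides the count by the number of tokens in the document; document frequency counts each document at most once per term.

```python
from collections import Counter


def term_frequencies(tokens, relative=True):
    """Absolute or relative term frequency within one document."""
    counts = Counter(tokens)
    if not relative:
        return dict(counts)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def document_frequencies(documents):
    """Number of documents in the corpus that contain each term."""
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    return dict(df)


doc = ["food", "great", "food", "service"]
print(term_frequencies(doc, relative=False))
print(term_frequencies(doc))
print(document_frequencies([doc, ["food"], ["service"]]))
```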
Frequency Nodes
• Typically characterized by:
– Green color
– 1 input port, 1 output port
– Require bag of words
Figure: appended column with relative TF values
Frequency Nodes
• Node Repository:
Other Data Types/Text Processing/Frequencies
• Available Frequency Nodes
– TF
– IDF
– Ngram creator
– …
New Node: TF
• Computes the relative or absolute term frequency (TF) of each term within a document
Figure: appended column with TF values
New Node: DF
• Computes the number of documents that contain each term
Figure: appended column with DF values
New Node: IDF
• Computes three variants of inverse document frequency (IDF) for each term within the documents
– Smooth, normalized, and probabilistic
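The slides name the three variants without giving formulas. The sketch below uses commonly cited definitions as an assumption: smooth = log(1 + N/df), normalized = log(N/df), probabilistic = log((N − df)/df), where N is the number of documents and df the number of documents containing the term. The natural logarithm is also an assumption; check the node description for the exact definitions it uses.

```python
import math


def idf(num_docs, doc_freq, variant="smooth"):
    """Inverse document frequency under assumed textbook definitions."""
    if variant == "smooth":
        return math.log(1 + num_docs / doc_freq)
    if variant == "normalized":
        return math.log(num_docs / doc_freq)
    if variant == "probabilistic":
        if doc_freq == num_docs:
            # log(0) is undefined; report negative infinity instead of raising
            return float("-inf")
        return math.log((num_docs - doc_freq) / doc_freq)
    raise ValueError(f"unknown variant: {variant}")


for variant in ("smooth", "normalized", "probabilistic"):
    print(variant, idf(10, 5, variant))
```

Under these definitions a term that occurs in every document gets a normalized IDF of zero, which is why IDF is useful for down-weighting uninformative terms.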
Figure: appended column with IDF values
New Node: Term Co-Occurrence Counter
• Counts the number of pairwise co-occurrences of terms in a bag of words within selected parts of a document (e.g. sentence, paragraph, title)
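A small Python sketch of sentence-level co-occurrence counting. It assumes that a pair is counted once per sentence in which both terms appear; the node's exact counting rules may differ, and the function name is illustrative.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count how often each unordered pair of distinct terms shares a sentence."""
    pairs = Counter()
    for tokens in sentences:
        # Sort the distinct terms so each pair has one canonical ordering.
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1
    return pairs


sentences = [
    ["pizza", "was", "great"],
    ["pasta", "was", "great"],
    ["service", "was", "slow"],
]
pair_counts = cooccurrence_counts(sentences)
```

The same function works at paragraph or title level if each inner list holds the terms of that unit instead of a sentence.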
![Page 129: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/129.jpg)
New Node: Ngram Creator
• Creates n-grams from the documents of the input table and counts their frequencies
• Both word and character n-grams are possible
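A short Python sketch of word and character n-grams with frequency counts. The helper name and the sample text are illustrative assumptions, not the node's implementation.

```python
from collections import Counter


def ngrams(sequence, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


text = "the food was great the food was cheap"
# Word n-grams: slide a window of n words over the token list.
word_bigrams = Counter(ngrams(text.split(), 2))
# Character n-grams: slide the same window over the characters of a string.
char_trigrams = Counter("".join(g) for g in ngrams("pizza", 3))
```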
![Page 130: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/130.jpg)
Section Exercise
• Start with “Exercise: Transformation”
– Compute relative term frequencies
– Create document vectors
– Extract class label / category
![Page 131: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/131.jpg)
Section Solution
Transformation
• TF
• Document Vector
• Document Data Extractor
![Page 132: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/132.jpg)
Classification
![Page 133: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/133.jpg)
Classification
• Assigning pre-defined labels to documents
– Categorization
– Sentiment analysis
– Topic assignment
• Supervised learning
• In the last section we transformed textual documents into a numerical representation (document vectors).
• We can use standard KNIME nodes to classify / analyze these vectors.
![Page 134: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/134.jpg)
Classification
Methods:
• Decision Trees
• Neural Networks
• Naïve Bayes
• Logistic Regression
• Support Vector Machine
• Tree Ensembles
![Page 135: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/135.jpg)
Predictive Modeling Overview
[Diagram: the original data set is partitioned into a training set and a test set (Data Partitioning); a model is trained on the training set and applied to the test set (Training and Applying Models); the model is then scored (Model Evaluation)]
![Page 136: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/136.jpg)
New Node: Partitioning
• Use it to split data into training and evaluation sets
• Partition by count (e.g. 10 rows) or fraction (e.g. 10%)
• Sample by a variety of methods: random, linear, stratified
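A minimal Python sketch of a stratified split by fraction, shown only to illustrate the idea behind stratified sampling. The function name, the fixed seed, and the rounding rule are assumptions and do not describe the node's implementation.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, fraction, seed=42):
    """Split rows into two partitions, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    first, second = [], []
    for members in by_class.values():
        rng.shuffle(members)
        # Take the same fraction from every class.
        cut = round(len(members) * fraction)
        first.extend(members[:cut])
        second.extend(members[cut:])
    return first, second


rows = [("doc%d" % i, "Italian" if i % 2 else "Chinese") for i in range(20)]
train, test = stratified_split(rows, label_of=lambda r: r[1], fraction=0.7)
```

Stratification matters most when classes are imbalanced, because a purely random split can leave a rare class underrepresented in one partition.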
![Page 137: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/137.jpg)
Predictive Modeling Overview
[Diagram: the original data set is partitioned into a training set and a test set (Data Partitioning); a model is trained on the training set and applied to the test set (Training and Applying Models); the model is then scored (Scoring Strategies)]
![Page 138: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/138.jpg)
The Learner-Predictor Motif
• All data mining models use a Learner-Predictor motif.
• The Learner node trains the model with its input data.
• The Predictor node applies the model to a different subset of data.
[Diagram labels: Training set, Test set, Trained Model]
![Page 139: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/139.jpg)
Decision Tree
• C4.5 builds a tree from a set of training data using the concept of information entropy.
• At each node of the tree, the attribute of the data with the highest normalized information gain (difference in entropy) is chosen to split the data.
• The C4.5 algorithm then recurses on the smaller sublists.
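A small Python sketch of the split criterion described above: entropy, information gain, and the gain normalized by the entropy of the split itself (gain ratio). The toy data and function names are illustrative assumptions; this is not the code of the Decision Tree Learner node.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Information gain of splitting on `values`, normalized by the split entropy."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += len(subset) / total * entropy(subset)
    gain = entropy(labels) - remainder
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


labels = ["Italian", "Italian", "Italian", "Chinese", "Chinese", "Chinese"]
contains_italian = [True, True, True, False, False, False]
contains_good = [True, False, True, False, True, False]
```

On this toy data the attribute `contains_italian` separates the classes perfectly and would be chosen for the split, while `contains_good` carries almost no information about the class.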
J.R. Quinlan, “C4.5: Programs for Machine Learning”
J. Shafer, R. Agrawal, M. Mehta, “SPRINT: A Scalable Parallel Classifier for Data Mining”
![Page 140: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/140.jpg)
New Node: Decision Tree Learner
![Page 141: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/141.jpg)
Decision Tree: View
If the word “Italian” occurs in a review, the restaurant is very likely an Italian restaurant.
![Page 142: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/142.jpg)
New Node: Decision Tree Predictor
• Consumes a Decision Tree model and new data to classify
• Check the box to append class probabilities
![Page 143: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/143.jpg)
Predictive Modeling Overview
[Diagram: the original data set is partitioned into a training set and a test set (Data Partitioning); a model is trained on the training set and applied to the test set (Training and Applying Models); the model is then scored (Scoring Strategies)]
![Page 144: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/144.jpg)
New Node: Scorer
• Compare predicted results to known truth to evaluate model quality
• Confusion matrix shows the distribution of model errors
• An accuracy statistics table provides additional info
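A hedged Python sketch of what such a scoring step computes: a confusion matrix from actual and predicted labels, and a few statistics derived from it. The function names and the choice of statistics are illustrative; the Scorer node reports more measures than shown here.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))


def binary_statistics(matrix, positive, negative):
    """Derive common accuracy measures for one class treated as positive."""
    tp = matrix[(positive, positive)]
    fn = matrix[(positive, negative)]
    fp = matrix[(negative, positive)]
    tn = matrix[(negative, negative)]
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


actual = ["Italian", "Italian", "Italian", "Chinese", "Chinese", "Chinese"]
predicted = ["Italian", "Italian", "Chinese", "Chinese", "Chinese", "Italian"]
stats = binary_statistics(confusion_matrix(actual, predicted), "Italian", "Chinese")
```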
![Page 145: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/145.jpg)
Scorer: Confusion Matrix
This is the difference between the confusion matrix data table and the confusion matrix view
[Confusion matrix cells: True Positives, False Negatives, False Positives, True Negatives]
![Page 146: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/146.jpg)
Scorer: Accuracy Measures
From the confusion matrix
![Page 147: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/147.jpg)
Section Exercise
• Start with “Exercise: Classification”
– Append color information based on class labels
– Split data into training and test set
– Train decision tree classifier on training set
– Apply trained model on test set
– Score model
![Page 148: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/148.jpg)
Section Solution
Classification
• Color Manager
• Column Filter
• Partitioning
• Decision Tree Learner
• Decision Tree Predictor
• Scorer
![Page 149: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/149.jpg)
Classification (Bonus)
• Usually, the documents used to train a model come from a different source than the documents to which the model is applied afterwards.
• To apply a trained model on a second set of documents we need to ensure that all features of the training set exist as features of the second set.
• This means that all document vector columns of the training set must exist as document vector columns in the second set.
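A minimal Python sketch of aligning a new document vector to the training feature space. Filling missing terms with 0 and dropping unseen terms is an assumption made for this illustration; the column names are hypothetical.

```python
def align_features(row, training_columns):
    """Project a document vector onto the training feature space.

    Terms missing from the new document get 0.0; terms the model never
    saw during training are dropped.
    """
    return [row.get(column, 0.0) for column in training_columns]


training_columns = ["pasta", "pizza", "noodl", "rice"]
new_document = {"pizza": 0.25, "lobster": 0.5}  # "lobster" was not a training feature
aligned = align_features(new_document, training_columns)
```

After alignment, every vector has the same columns in the same order as the training set, which is what the trained model expects.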
![Page 150: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/150.jpg)
Classification (Bonus)
All features of the training set must exist as features in the second set.
![Page 151: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/151.jpg)
Section Exercise (Bonus)
• Start with “Exercise: Classification II”
– Create document vectors for the second set of documents “Boston Tripadvisor Reviews”
– The feature space of the second set has to contain all features of the training set!
– Apply the trained model on the second set of documents
![Page 152: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/152.jpg)
Section Solution (Bonus)
Classification II
![Page 153: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/153.jpg)
Sentiment Analysis (Bonus)
• In sentiment analysis, predefined sentiment labels, such as “positive” or “negative”, are assigned to texts.
Methods:
• Predictive modeling
• Dictionary based
• Deep parsing
• …
![Page 154: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/154.jpg)
Sentiment Analysis Example (Bonus)
• The Large Movie Review Dataset v1.0
– 50,000 English movie reviews
– Associated sentiment labels “positive” and “negative”
– http://ai.stanford.edu/~amaas/data/sentiment/
• Subset contains 2000 documents
– 1000 positive reviews
– 1000 negative reviews
– …/data/IMDb-sample.csv
![Page 155: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/155.jpg)
Sentiment Analysis Example (Bonus)
Predictive modeling:
• Build classifier to distinguish between positive and negative reviews.
– “Ah, Moonwalker, I'm a huge Michael Jackson fan, I grew up with his music, Thriller was actually the first music video I ever saw apparently. …”
– “This film has a very simple but somehow very bad plot. …”
Positive or negative?
![Page 156: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/156.jpg)
Section Exercise (Bonus)
• Start with “Exercise: Classification III”
– Create document cells
– Preprocess documents
• Punctuation Erasure, N Chars Filter, Stop Word Filter, Case Converter, Snowball Stemmer
• Filter all terms that occur in fewer than 20 documents
– Create document vectors
– Extract sentiment label and assign colors
– Partition into training and test set
– Train decision tree model and score it
![Page 157: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/157.jpg)
Section Solution (Bonus)
Classification
• Strings to document
• Preprocessing nodes
• Bag of words creation, grouping, counting, and filtering
• Vector creation
• Model training and scoring
![Page 158: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/158.jpg)
Sentiment Analysis Example (Bonus)
Dictionary based:
• Use a custom dictionary to count positive and negative words.
• Compute sentiment score to predict sentiment label.
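A hedged Python sketch of the dictionary-based approach. The score formula (positives minus negatives, divided by their sum), the threshold, and the tiny word lists are assumptions chosen for illustration; the course workflow may define the score differently.

```python
def sentiment_score(tokens, positive_words, negative_words):
    """Score in [-1, 1]: (positives - negatives) / (positives + negatives)."""
    pos = sum(1 for t in tokens if t in positive_words)
    neg = sum(1 for t in tokens if t in negative_words)
    if pos + neg == 0:
        return 0.0  # no dictionary word found in the text
    return (pos - neg) / (pos + neg)


def predict_label(score, threshold=0.0):
    """Map a score to a label; scores at or below the threshold count as negative."""
    return "positive" if score > threshold else "negative"


POSITIVE = {"great", "good", "huge"}      # toy dictionaries, illustrative only
NEGATIVE = {"bad", "boring", "simple"}
review = "this film has a very simple but somehow very bad plot".split()
score = sentiment_score(review, POSITIVE, NEGATIVE)
label = predict_label(score)
```

Unlike predictive modeling, this approach needs no labeled training data, but its quality depends entirely on the coverage of the dictionaries.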
![Page 159: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/159.jpg)
Section Exercise (Bonus)
• Start with “Exercise: Classification IV”
– Create document cells
– Tag terms based on sentiment dictionaries
• Tip: Dictionary Tagger
– Extract and count positive and negative terms
– Compute sentiment score based on the number of positive and negative terms
– Predict sentiment labels based on score
– Score predictions
![Page 160: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/160.jpg)
Section Solution (Bonus)
Classification
• Strings to Documents
• Dictionary Tagger
• Bag of words, TF, and GroupBy for counting
• Pivoting
• Math Formula
• Rule Engine
• Scorer
![Page 161: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/161.jpg)
Visualization
![Page 162: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/162.jpg)
Visualization Nodes
• Typically characterized by:
– Blue color
– 1 input port, 1-2 output ports (image port)
![Page 163: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/163.jpg)
Visualization Nodes
• Node Repository:
Other Data Types/Text Processing/Misc
• Available Visualization Nodes
– Document Viewer
– Tagged Document Viewer (in JS Views (Labs))
– Tag Cloud
• KNIME Text Processing provides only two dedicated visualization nodes
• Various other nodes can be used for visualization too.
![Page 164: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/164.jpg)
New Node: Tag Cloud (JavaScript)
• Shows terms visualized in a cloud
– Colors are specified via the Color Manager
– Requires a term and a numerical column (usually tf)
– Creates image, available at image out port
List of terms and frequencies
Size of words corresponds to frequency
![Page 165: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/165.jpg)
Tag Cloud: Configuration
Display only top N terms (rows)
Term column and frequency column
![Page 166: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/166.jpg)
Tag Cloud: View
Min and max font size, angle, …
Scaling of font size: linear, log, exp
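A rough Python sketch of how a frequency could be mapped to a font size under the three scaling options. The mapping below is purely an assumption for illustration (including the use of log(1 + x) to handle zero values); the Tag Cloud node's actual formula is not given in the slides.

```python
import math


def font_size(value, v_min, v_max, size_min=10.0, size_max=40.0, scale="linear"):
    """Map a frequency to a font size between size_min and size_max."""
    if v_max == v_min:
        return size_max
    transform = {"linear": lambda x: x, "log": math.log1p, "exp": math.exp}[scale]
    low, high, v = transform(v_min), transform(v_max), transform(value)
    # Rescale the transformed value to the requested font size range.
    return size_min + (v - low) / (high - low) * (size_max - size_min)
```

With this mapping, log scaling enlarges mid-frequency terms relative to linear scaling, while exp scaling emphasizes only the most frequent terms.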
![Page 167: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/167.jpg)
Additional Visualizations
• Decision Tree View
– Inspect trained model
– See which terms are discriminative
![Page 168: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/168.jpg)
Section Exercise
• Start with “Exercise: Visualization”
– Inspect decision tree via its view
– Visualize bag of words using a tag cloud
– Assign colors to terms in tag cloud (Optional)
• Green if a term occurs mostly in Chinese reviews, blue if it occurs mostly in Italian reviews
![Page 169: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/169.jpg)
Section Solution
Visualization
• Decision Tree Learner
• Tag Cloud
• (Optional Coloring)
– TF, Document Data Extractor, Group By, Pivoting, Math Formula, Color Manager
![Page 170: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/170.jpg)
New Node: Document Viewer
• Shows details of documents
– Title, Full text
– Meta information
– Tagged terms can be hilited and linked
Document column
![Page 171: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/171.jpg)
Document Viewer: View
List of all documents. Double click for details
Details view with title and full text
Tagged terms can be hilited
Author, category, meta information, …
Tag set to hilite
![Page 172: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/172.jpg)
Reminder: Tagged Document Viewer
• Displays documents with tags highlighted:
– Takes a column with documents as input
– Allows inspection of the tags assigned to documents
Document column
Document with tags highlighted
Section Exercise
• Start with “Exercise: Visualization II”
– View document content
– View document content and highlight tagged terms
– View tagged documents
![Page 174: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/174.jpg)
Section Solution
Visualization
• Document Viewer
• Tagged Document Viewer
![Page 175: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/175.jpg)
Bonus Visualizations
• Supplementary Workflows/
– R Theme River (R plot)
– Twitter Word Tree (JavaScript view)
![Page 176: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/176.jpg)
Clustering
![Page 177: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/177.jpg)
Clustering
• Find groups (clusters) of similar documents
– Topic detection
– Exploration
• Unsupervised learning
• We can use standard KNIME nodes to cluster the numerical document vectors.
![Page 178: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/178.jpg)
Clustering
Methods:
• Hierarchical clustering
• K-Means / Medoids
• Density based
• …
![Page 179: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/179.jpg)
Hierarchical Clustering
• Creates hierarchy for all data points
– Agglomerative, bottom-up
– Combine the “closest” data points/clusters, one at a time
• Hierarchy can be illustrated by dendrogram
• Applicable only on small data sets (<5000)
• Complete linkage: combine data object/cluster with minimal maximum distance
– Finds compact, convex clusters
• Single linkage: combine data object/cluster with minimal minimum distance
– Also finds concave clusters
• Average linkage: distance between two clusters c1 and c2 = mean distance between all points in c1 and c2
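A compact Python sketch of agglomerative clustering with the three linkage criteria. It is a naive implementation written for readability, with illustrative function names and one-dimensional toy data; it is not how the KNIME nodes are implemented.

```python
def cluster_distance(c1, c2, dist, linkage="complete"):
    """Distance between two clusters under the given linkage criterion."""
    pairwise = [dist(a, b) for a in c1 for b in c2]
    if linkage == "complete":
        return max(pairwise)
    if linkage == "single":
        return min(pairwise)
    if linkage == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(points, dist, linkage="complete", k=1):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        index_pairs = (
            (i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))
        )
        i, j = min(
            index_pairs,
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], dist, linkage),
        )
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


points = [0.0, 1.0, 1.5, 10.0, 11.0]
clusters = agglomerate(points, dist=lambda a, b: abs(a - b), linkage="single", k=2)
```

Every merge step compares all pairs of clusters, which is one reason why hierarchical clustering is practical only for small data sets.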
![Page 180: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/180.jpg)
Prototype-based Clustering
• K-Medoids, K-Means, Fuzzy C-Means, …
• Data are condensed to a small fixed number of prototypical data points
• Each prototype represents a subset of data points
• Applicable on large data sets
• Number of prototypes (k) must be specified in advance
![Page 181: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/181.jpg)
New Node: Distance Matrix Calculate
• Computes all pairwise distances
• Different distance measures available
– Euclidean, Manhattan, Cosine, Dice, Tanimoto, …
• Optional distance model input port
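A minimal Python sketch of a pairwise distance matrix using cosine distance, defined here as 1 minus the cosine similarity. The treatment of all-zero vectors and the function names are assumptions made for this example.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; 0 for vectors pointing in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # convention chosen here for all-zero vectors
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors, dist):
    """Full symmetric matrix of pairwise distances."""
    return [[dist(u, v) for v in vectors] for u in vectors]


doc_vectors = [[1, 1, 0], [2, 2, 0], [0, 0, 3]]
matrix = distance_matrix(doc_vectors, cosine_distance)
```

Cosine distance ignores vector length, so a short and a long document with the same term proportions are treated as identical, which is often desirable for document vectors.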
Document vectors
Distance column
![Page 182: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/182.jpg)
Distance Matrix Calculate: Configuration
Distance measure
Columns to use for distance computation
Name of distance column
![Page 183: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/183.jpg)
New Node: Hierarchical Clustering (DistMatrix)
• Creates hierarchy of input data points
– Complete Linkage, Average Linkage, Single Linkage
• Requires distance column or model
Distance column
Clustering model
Distance function (optional)
![Page 184: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/184.jpg)
Hierarchical Clustering (DistMatrix): Configuration
Distance column
Linkage type
![Page 185: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/185.jpg)
New Node: Hierarchical Cluster View
• Shows:
– Dendrogram of clustering
– Distance curve
– Colors
Data points, e.g. document vectors
Hierarchical clustering model
Dendrogram or distance
![Page 186: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/186.jpg)
New Node: Hierarchical Cluster Assigner
• Assigns data points to clusters based on
– Distance threshold
– Number of clusters
Data points, e.g. document vectors
Hierarchical clustering model
Cluster assignment
![Page 187: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/187.jpg)
Hierarchical Cluster Assigner: Configuration
Threshold or cluster count based assignment
![Page 188: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/188.jpg)
Hierarchical Clustering: Example Workflow
Data, e.g. document vectors
Hierarchy of data points
Illustration of dendrogram
Assignment of clusters
![Page 189: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/189.jpg)
New Node: k-Medoids
• Computes k prototypes (medoids)
• Requires distance column or model
• Requires specification of k
• Similar nodes:
– k-Means
– Fuzzy c-Means
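A simplified Python sketch of k-medoids on a precomputed distance matrix. It alternates between assigning items to the nearest medoid and re-selecting each cluster's medoid; this is an illustrative variant with assumed names and defaults, not the algorithm used by the KNIME node.

```python
import random


def k_medoids(dist, k, seed=0, max_iter=100):
    """Cluster items 0..n-1 given a full distance matrix `dist`."""
    n = len(dist)
    # A fixed seed makes the random initialization reproducible.
    medoids = random.Random(seed).sample(range(n), k)

    def assign(current):
        # Each item goes to the medoid it is closest to.
        return [min(current, key=lambda m: dist[i][m]) for i in range(n)]

    for _ in range(max_iter):
        assignment = assign(medoids)
        new_medoids = []
        for m in medoids:
            members = [i for i in range(n) if assignment[i] == m] or [m]
            # New medoid: the member with the smallest total distance to the others.
            new_medoids.append(min(members, key=lambda c: sum(dist[c][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


values = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in values] for a in values]
medoids, assignment = k_medoids(dist, k=2, seed=1)
```

Because medoids are actual data points and only distances are needed, the method works with any distance measure, including cosine distance on document vectors.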
Data points and distance column
Cluster assignment
![Page 190: Text Mining Course for KNIME Analytics Platform · 2019-08-23 · Hot Keys (for Future Reference) 32 Task Hot key Description Node Configuration F6 opens the configuration window](https://reader030.vdocument.in/reader030/viewer/2022033120/5e2f886146203a2fd2156869/html5/thumbnails/190.jpg)
k-Medoids: Configuration
Cluster count k
Distance matrix column
Random seed for reproducible results
k-Medoids Clustering: Example Workflow
Data, e.g. document vectors
Assignment of clusters
Section Exercise
• Start with “Exercise: Clustering”
– What groups of documents are in the data?
– Compute pairwise cosine distances
– Apply hierarchical clustering
• View dendrogram to find out the number of clusters (k)
• Assign k clusters
– Apply k-Medoids with k as number of clusters
– Select documents of one cluster in dendrogram, hilite them, and inspect data in a table view
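For reference, the pairwise cosine distance step of the exercise can be written as a short Python sketch. The helper names and the toy document vectors are assumptions for illustration. Returning a distance of 1.0 for an all-zero vector is a convention chosen here, not something the exercise prescribes.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two equally long numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumed convention for empty documents
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """Full symmetric matrix of pairwise cosine distances."""
    return [[cosine_distance(u, v) for v in vectors] for u in vectors]


# Toy document vectors (term counts); illustrative only.
doc_vectors = [[1, 0], [2, 0], [0, 3]]
matrix = distance_matrix(doc_vectors)
```

The first two vectors point in the same direction, so their distance is 0 regardless of length; the third shares no terms with them, so its distance to both is 1.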
Section Solution
Clustering
• Distance Matrix Calculate
• Hierarchical Clustering
– Cluster View
– Cluster Assigner
• k-Medoids
Supplementary Workflows
R Theme River
Creates a theme river using ggplot2.
• ggplot2 has to be installed!
• Change lib path
Twitter Word Tree
Creates a word tree using the JavaScript Google charting library.
Term Co-occurrences
Co-occurrences of all term pairs are counted at the sentence level and at the document level.
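A minimal Python sketch of the counting idea is shown below. It is a simplification under stated assumptions: sentences are split on ".", "!" and "?", terms are lowercased words, and each pair is counted once per sentence or document in which both terms appear. The function name `cooccurrences` is hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrences(documents):
    """Count term pairs per sentence and per document.

    Returns two Counters keyed by alphabetically sorted (term_a, term_b) tuples.
    """
    sentence_counts = Counter()
    document_counts = Counter()
    for text in documents:
        doc_terms = set()
        for sentence in re.split(r"[.!?]+", text):
            terms = set(re.findall(r"[a-z']+", sentence.lower()))
            doc_terms |= terms
            # Every unordered pair of distinct terms in this sentence.
            sentence_counts.update(combinations(sorted(terms), 2))
        # Every unordered pair of distinct terms anywhere in this document.
        document_counts.update(combinations(sorted(doc_terms), 2))
    return sentence_counts, document_counts


# Toy corpus; illustrative only.
documents = ["Cats chase mice. Dogs chase cats.", "Dogs bark."]
by_sentence, by_document = cooccurrences(documents)
```

In the toy corpus, "cats" and "chase" share two sentences but only one document, while "dogs" and "mice" co-occur only at the document level.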
Topic Extraction
Extracts two topics from the input documents and 10 words to represent each topic.
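The slide does not name the algorithm behind the workflow. As one possible illustration, the sketch below assumes latent Dirichlet allocation (LDA) fitted with collapsed Gibbs sampling and returns up to 10 words per topic for two topics. The function name, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions.

```python
import random
from collections import Counter


def lda_top_words(docs, n_topics=2, n_words=10, alpha=0.1, beta=0.01,
                  iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]      # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                    # tokens per topic
    assignment = []

    def bump(d, w, z, step):
        doc_topic[d][z] += step
        topic_word[z][w] += step
        topic_total[z] += step

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        assignment.append([rng.randrange(n_topics) for _ in doc])
        for w, z in zip(doc, assignment[d]):
            bump(d, w, z, +1)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token, then resample its topic from the
                # conditional distribution given all other assignments.
                bump(d, w, assignment[d][i], -1)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignment[d][i] = z
                bump(d, w, z, +1)

    return [
        [w for w, count in topic_word[t].most_common(n_words) if count > 0]
        for t in range(n_topics)
    ]


# Toy tokenized documents; illustrative only.
docs = [
    "cat dog pet cat".split(),
    "dog pet vet cat".split(),
    "stock bond market stock".split(),
    "market bond trade stock".split(),
]
topics = lda_top_words(docs)
```

Each returned list holds the most frequent words assigned to one topic. On a corpus this small the split between topics depends on the seed.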
RESTful Geolocation
Try Catch block
REST call to get lat/long for IPs
• Translates IPs to geo coordinates via a RESTful service
• GET Resource: accesses the RESTful API via GET
– IP to geo coordinates (lat/lon)
• Read REST Representation: parses the REST result
– JSON, XML, CSV, …
• Try Catch nodes to log errors gracefully
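The same pattern (GET request, parse the result, catch and log failures) can be sketched in Python. The endpoint URL and the JSON field names `lat` and `lon` are hypothetical placeholders, since the slides do not name the service. The `fetch` parameter exists so the parsing and error handling can be exercised without a network call.

```python
import json
import logging
from urllib.request import urlopen

logger = logging.getLogger("geolocation")


def http_fetch(ip, base_url="https://geo.example.invalid/json/"):
    """GET the lookup result for one IP (hypothetical endpoint)."""
    with urlopen(base_url + ip, timeout=10) as response:
        return response.read().decode("utf-8")


def ip_to_coordinates(ip, fetch=http_fetch):
    """Return (lat, lon) for an IP, or None if the lookup fails."""
    try:
        payload = json.loads(fetch(ip))
        # Field names are assumed; adapt them to the actual service.
        return float(payload["lat"]), float(payload["lon"])
    except (OSError, ValueError, KeyError, TypeError) as error:
        # Network errors, malformed JSON, and missing fields are logged
        # instead of aborting the whole run.
        logger.warning("geolocation failed for %s: %s", ip, error)
        return None
```

A failed lookup yields `None` for that IP, so a batch of addresses can be processed to the end and the failures reviewed in the log afterwards.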
Geographic Analysis
• Reads IPs from download weblog and related geo coordinates
• Aggregates downloads by city, country, and US states
• OSM Map View to visualize geo coordinates
• OSM Map to Image to create image of map view
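The aggregation step amounts to grouped counting. A minimal Python sketch follows, assuming each weblog record has already been joined with its location. The record fields `city`, `state`, and `country` and the sample rows are hypothetical.

```python
from collections import Counter


def aggregate_downloads(records):
    """Count downloads by country, by (country, city), and by US state."""
    by_country = Counter(r["country"] for r in records)
    by_city = Counter((r["country"], r["city"]) for r in records)
    by_us_state = Counter(r["state"] for r in records if r["country"] == "US")
    return by_country, by_city, by_us_state


# Sample rows with documentation-range IPs; illustrative only.
records = [
    {"ip": "192.0.2.1", "city": "Zurich", "state": None, "country": "CH"},
    {"ip": "192.0.2.2", "city": "Austin", "state": "TX", "country": "US"},
    {"ip": "192.0.2.3", "city": "Austin", "state": "TX", "country": "US"},
    {"ip": "192.0.2.4", "city": "Boston", "state": "MA", "country": "US"},
]
by_country, by_city, by_us_state = aggregate_downloads(records)
```

Cities are keyed together with their country so that equally named cities in different countries are not merged.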
Social Media Analysis
Sentiment analysis of users
Leader / Follower analysis of users
• Slashdot forum data
• Text Mining: sentiment analysis of users
• Network Mining: leader and follower scoring of users
Romeo and Juliet
Load JPEG and convert to PNG
Read epub file
Insert PNG images and visualize network
Tag character names and count frequencies
• Interaction network of characters
• Border color indicates family assignment
• Node size reflects the term frequency (TF) of character names
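How node size can follow term frequency is shown in the short Python sketch below. It counts how often each character name occurs and scales the counts linearly to a size range. The linear scaling, the size limits, and the function names are assumptions. The slides only state that size is related to TF.

```python
import re
from collections import Counter


def character_frequencies(text, names):
    """Absolute term frequency of each character name (case-insensitive)."""
    wanted = {name.lower(): name for name in names}
    tokens = re.findall(r"[A-Za-z]+", text)
    return Counter(wanted[t.lower()] for t in tokens if t.lower() in wanted)


def node_sizes(frequencies, min_size=10.0, max_size=40.0):
    """Scale frequencies linearly so the most frequent name gets max_size."""
    top = max(frequencies.values(), default=0)
    if top == 0:
        return {}
    return {
        name: min_size + (max_size - min_size) * count / top
        for name, count in frequencies.items()
    }


# Toy text; illustrative only.
text = "Romeo loves Juliet. Juliet loves Romeo. Romeo fights Tybalt."
frequencies = character_frequencies(text, ["Romeo", "Juliet", "Tybalt"])
sizes = node_sizes(frequencies)
```

The most frequently mentioned character gets the largest node, and rarely mentioned characters stay close to the minimum size.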
Copyright © 2019 KNIME AG
Thank You!
The KNIME® trademark and logo and OPEN FOR INNOVATION® trademark are used by KNIME AG under license from KNIME GmbH, and are registered in the United States.
KNIME® is also registered in Germany.