Post on 15-Jul-2015

Exploring Social Media with NodeXL

(Updated April 15, 2016)

One Goal Today

• What are the capabilities of NodeXL, and do I have a use for it (for research, for exploration, for fun, or some mix of the prior)?

Overview

1. Network graphs and related terminology

2. Potential uses in research

3. NodeXL (Network Overview, Discovery, and Exploration for Excel)

4. Social media platforms

5. Data extraction runs

6. Data processing

Overview (cont.)

7. Network graph data visualizations

8. NodeXL Graph Gallery and the virtual community

9. Beyond NodeXL

10. Presentation review

11. Some general takeaways about network analysis using social media data

12. Questions? Comments?

1. Network Graphs and Related Terminology

A Very Brief Overview

Underlying Data in Matrices

Underlying Data in Matrices (cont.)

• May show whether a relationship exists or not (binary)

• May show strength (or intensity) of a relationship

• May show the direction of a relationship (one-way, two-way / reciprocated)

• …and other information
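
As a minimal sketch of how one matrix can carry existence, strength, and direction of ties, the following builds a small adjacency matrix. The account names and reply counts are made up for illustration and do not come from any NodeXL dataset.

```python
# Hypothetical four-account network; names and weights are invented.
nodes = ["ann", "bob", "cat", "dan"]
# Directed, weighted edges: (source, target, number of replies).
edges = [("ann", "bob", 3), ("bob", "ann", 1), ("ann", "cat", 2), ("dan", "cat", 5)]

index = {name: i for i, name in enumerate(nodes)}
n = len(nodes)

# Weighted matrix: cell [i][j] holds the strength of the tie from i to j.
weighted = [[0] * n for _ in range(n)]
for source, target, weight in edges:
    weighted[index[source]][index[target]] = weight

# Binary matrix: 1 if any tie exists from i to j, 0 otherwise.
binary = [[1 if cell else 0 for cell in row] for row in weighted]


def is_reciprocated(a, b):
    """True when ties run in both directions between a and b."""
    return bool(binary[index[a]][index[b]] and binary[index[b]][index[a]])
```

Rows are sources and columns are targets, so an asymmetric matrix signals a directed graph; an undirected graph would produce a symmetric one.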

Unit of Analysis

Global Network

• Global network measures: Indicators of the types of interrelated communities being observed

• Inferences about the state of the community

• Inferences about how power moves, how information moves

• Inferences about who is influential and how

• Predictive analytics about where this community is going

• Overlapping networks

About Online Global Networks

• Central masses continue for a time

• Small clusters either meld with larger ones, or they eventually disappear

• Often held together for a time through charismatic leaders

• Isolates and pendants usually disappear over time

• Dynamism is a part of all networks

Unit of Analysis (cont.)

Nodes

• Node-level measures: Indicators of the egos

• Inferences about the ego even if it is “invisible” based on its effect on the surrounding egos and entities

About Online Nodes

• Cyber selves somewhat representational of the real-world selves

• Messaging / location / imagery / profiles may be analyzed to infer personality and interests

• Popularity falling under a power law (a few stars garnering most of the attention, the rest in the long tail of social aspirants and poseurs)

Statistical Measures

Global Network Measures

Betweenness centrality: The number of shortest paths (between every pair of nodes) that pass through a node; because information tends to move along the shortest paths and closest ties, this indicates how much of a bridge a node is for network connectivity

Closeness centrality: Based on the geodesic (shortest-path) distance between a node and every other node (farness as the sum of all distances to all other nodes; closeness as the inverse of farness)

In an undirected graph, distances to all other nodes are used

In a directed graph, distances to a node are more meaningful because a node has little control over its in-coming links

Node-level (Local) Measures

Degree centrality: In-degree and out-degree (relative popularity within the network)

Clustering coefficient: Embeddedness of single nodes in cliques or ego neighborhoods with its alters
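
The degree and closeness definitions above can be sketched in a few lines of plain Python. This is an illustrative sketch on a made-up undirected graph, not NodeXL's own implementation; NodeXL may normalize or handle disconnected graphs differently.

```python
from collections import deque

# Hypothetical undirected network stored as an adjacency dict.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}


def degree(graph, node):
    """Number of ties attached to a node."""
    return len(graph[node])


def distances_from(graph, start):
    """Breadth-first search: hop counts from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    return dist


def closeness(graph, node):
    """Inverse of farness, where farness is the sum of geodesic distances."""
    farness = sum(distances_from(graph, node).values())
    return 1 / farness if farness else 0.0
```

In this toy graph, node "c" has both the highest degree and the highest closeness, while the pendant "d" has the lowest of each.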

Statistical Measures (cont.)

Global Network Measures

Eigenvector centrality (diversity): A node scores higher when it is connected to higher-value or popular nodes; values fall between 0 and 1 and serve as a measure of relative influence in a graph

Clustering coefficient: Aggregation of multiple nodes based on similarity (like co-occurrence) or connectivity, and expressed as proximity or closeness visually; may be a measure of transitivity

Motif Measures

Dyads, triads, and other structured sub-groupings

Local and experiential for the nodes in terms of structured connections

May (as with fractals) or may not be reflective of the overall structure

Global motif censuses (counts of occurrences of various types of motif structures in a whole network)

Structural holes as indicators of potential openings for nodes and links (to build resilience)
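
As a rough illustration of the clustering coefficient and of a very small motif census, the sketch below computes a node's local clustering coefficient and counts closed triads (triangles). The graph is hypothetical, and the functions are a plain-Python sketch rather than the routines NodeXL uses.

```python
from itertools import combinations

# Hypothetical undirected network: one triangle (a, b, c) plus a pendant d.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}


def local_clustering(graph, node):
    """Share of a node's neighbor pairs that are themselves connected."""
    neighbors = graph[node]
    if len(neighbors) < 2:
        return 0.0
    pairs = list(combinations(sorted(neighbors), 2))
    linked = sum(1 for u, v in pairs if v in graph[u])
    return linked / len(pairs)


def triangle_count(graph):
    """Number of closed triads in the whole graph."""
    return sum(
        1
        for u, v, w in combinations(sorted(graph), 3)
        if v in graph[u] and w in graph[u] and w in graph[v]
    )
```

Node "b" sits entirely inside the triangle, so all of its neighbor pairs are linked; node "a" also bridges out to "d", which lowers its coefficient.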

Different Network Graphs

Social Network Graphs

• Entities and interrelationships

• Follower – following (formally declared relationships)

• Parasocial relationships online

• Weak ties, fragile linkage

• Reply message / retweet / reply video / likes / comment-to and others (situationally created relationships, empirical)

• Entities and contents

• Entities and events

Content Network Graphs

• Based on content similarity

• Based on content proximity to other terms (based on different-sized “windows” moved across a text)

• Co-occurring terms (content) or tags (metadata)

• Scraped thumbnail images

• May be based on pre-structured content “thesauruses” or may be extracted and structured in an emergent way from the text corpuses (or texts)
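
The idea of moving a fixed-size "window" across a text to find co-occurring terms can be sketched as follows. The function name, the sample sentence, and the window size are illustrative assumptions; real pipelines would usually also remove stopwords and normalize the tokens.

```python
from collections import Counter


def cooccurrence_edges(tokens, window=3):
    """Count unordered term pairs that fall within `window` consecutive tokens."""
    counts = Counter()
    for i, term in enumerate(tokens):
        # Pair the current term with the terms that follow it inside the window.
        for other in tokens[i + 1 : i + window]:
            if other != term:
                counts[tuple(sorted((term, other)))] += 1
    return counts


text = "network graphs show network structure and graphs show ties"
edges = cooccurrence_edges(text.split(), window=3)
```

Each key in `edges` is a pair of terms and each value is an edge weight, so the result can be written out as a weighted edge list for a content network graph. A wider window links more distant terms and produces a denser graph.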

Some Related Terminology

• Structure-mining: The study of networks and interrelationships in order to make inferences about systems (structure as a “topology” or a “map”)

• Graph: A data visualization of interrelationships (in either 2D or 3D), including node-link diagrams (usually without set x- and y-axes but spatially relational via Euclidean distance)

• Undirected graph: A graph in which the relationship between nodes is associational, without arrows at the ends

• Directed graph (digraph): A graph in which the relationship between nodes is directional, with the potential for arrows at the ends

Some Related Terminology (cont.)

• Sociogram / sociograph: A network graph showing interrelationships between social entities

• Degree: The proximity of relationship (such as a 1, 1.5, or 2 degree relationship), in terms of directness of ties

• In-degree: The numbers of relationships in-coming to a vertex or node

• Out-degree: The numbers of out-relationships from a vertex or node
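
In-degree and out-degree can be counted directly from a directed edge list. The edges below are hypothetical, with (source, target) read as "source replied to target".

```python
from collections import Counter

# Hypothetical directed edges: (source, target).
edges = [("ann", "bob"), ("cat", "bob"), ("dan", "bob"), ("bob", "ann")]

# Out-degree counts the ties a node sends; in-degree counts the ties it receives.
out_degree = Counter(source for source, _ in edges)
in_degree = Counter(target for _, target in edges)
```

Here "bob" receives three ties but sends only one, the kind of imbalance that marks a relatively popular account in a directed network.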

Some Related Terminology (cont.)

• Motifs: Various types of structures of relationships between nodes in dyadic, triadic, and other polyadic groupings

• Clusters: Densely connected groups (subgraphs) in a network, including islands

• Isolates: Nodes in a network that are not directly connected to any other node

• Pendants or whiskers: Nodes connected to a network by only one relationship (link, edge)
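
Isolates and pendants are easy to pick out once a network is stored as an adjacency structure. A minimal sketch on a made-up undirected graph:

```python
# Hypothetical undirected network; "e" has no ties and "d" has exactly one.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}

# Isolates have no neighbors; pendants hang on by a single tie.
isolates = sorted(node for node, neighbors in graph.items() if not neighbors)
pendants = sorted(node for node, neighbors in graph.items() if len(neighbors) == 1)
```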

Some Related Terminology (cont.)

• Bridging nodes: A node which is on the periphery of multiple social networks and connects them in a way that would not exist otherwise (and so is influential even if it is peripheral in the respective networks)

• Core-periphery dynamic: A concept of power and influence with those closest in to the core considered as most influential and those on the periphery as less so

• Graph diameter: The distance between the two farthest nodes in a network (in terms of shortest-distance hops between intermediate nodes)
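
Graph diameter can be sketched as the longest of all shortest-path hop counts. The example assumes a small, connected, undirected graph with made-up nodes; on a disconnected graph this version reports the largest diameter found within any one component.

```python
from collections import deque

# Hypothetical undirected network: a chain a-b-c-d with a side branch b-e.
graph = {
    "a": {"b"},
    "b": {"a", "c", "e"},
    "c": {"b", "d"},
    "d": {"c"},
    "e": {"b"},
}


def hop_distances(graph, start):
    """Breadth-first search: shortest hop counts from start to reachable nodes."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    return dist


def diameter(graph):
    """Longest shortest path, measured in hops."""
    return max(max(hop_distances(graph, node).values()) for node in graph)
```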

Affordances of Electronic Social Network Analysis (E-SNA)

• Plenty of theory and current research

• Social Networks (journal, Elsevier)

• Structure-mining (relational topologies) and content-mining (text analysis, cultural analysis)

• Micro-, meso-, and macro- levels of analysis (zooming in and zooming out for different levels of granularity): nodes / entities and links / relationships; motifs, clusters, and branches; entire networks

• Part of “network science”

2. Potential Uses in Research

General Research Possibilities

• Social media account profiling (through inferential analysis, data leakage, de-aliasing to personally identifiable information or “PII”)

• Trending online conversations (by #hashtag, by keyword); human sensor networks

• Identification of the “mayors of the hashtag” (per Dr. Marc A. Smith of SMRF)

• Public mindset on a topic (by both direct and indirect analysis) (by related tags networks from “free-form” folk tagging / folksonomies)

Some General Research Possibilities (cont.)

• Eventgraphing, event detection and monitoring, and event postmortems

• Reverse-engineering a social-mediated (political, marketing, fund-raising, or other) campaign; semi-live-tracking a social-mediated campaign

• Discovery of artificial accounts (including AI social bots); some application to potential fraud analysis

• The “company you keep” concept

Some General Research Possibilities (cont.)

• Geolocational applications: location -> messaging; messaging -> location

• “Oppo” (opposition) research (such as for political campaigns) through open-source intelligence (OSINT)

• Messaging: broadscale themes and particulars

• Inter-relationships

General Research Sequence

A Simplified Research Sequence of Extracting and Analyzing Social Media Information with NodeXL

1. Research question / open exploration / mixed intent

2. Social media strategy: social media platform(s), seeding term(s), and data extraction parameters

3. Data extractions using NodeXL

4. Data processing

5. Data visualizations

6. Data analysis (in NodeXL)

7. Data analysis (outside of NodeXL)

Data Limitations

• Limited data sets (with no knowledge of the “N of all,” at least not without insider access)

• “Recent” data only (usually reverse listed from present to the past)

• Rate-limited data extractions

• Time-dependent data (with hidden dependencies)

Data Limitations (cont.)

• Reliance on (often noisy) textual descriptions of multimedia contents

• Inherent “noise” in metadata, content labeling, content descriptions, tagging, and related online conversations

• Sparse geolocational data in microblogging messages and in uploaded imagery / videos (in terms of “exchangeable image file format” or “EXIF” data)

Local NodeXL Mitigations to Data Limitations

• Re-running the data extraction on different machines but with the same parameters at the same time

• Running the data extractions at slightly different times

• Running multiple and different data visualizations on the same dataset

• Using multiple seeding terms for a particular issue

Using an N = All

• Capturing an N= all through Gnip (a company now owned by Twitter) or a similar company (unless Gnip has an exclusivity contract)

• Working directly with the company or organization behind the social media platform (particularly their research divisions), but research may be embargoed (restricted from any release or publication)

Using Proper Research Practices

• Posing research questions in strategic ways: Ask ambitiously, but do not over-claim from results

• Respecting the research traditions and methods of the respective domain or field

• Applying serious efforts at (dis)confirmation of findings

Using Proper Research Practices (cont.)

• Capturing multiple streams of data (often in a cross-platform way)

• Documenting all data extraction parameters, data processing, and data provenance issues

• Using multiple analytical tools to analyze the captured data

• Comparing cyber info with real-world info (determining where the cyber-physical confluence lies)

• Using accurate qualifiers for the presented data

3. NodeXL

Network Overview, Discovery, and Exploration for Excel

NodeXL “Template”

Brief History

• Formerly known as .NetMap

• First released in July 2008 as an add-on to Microsoft Excel

• Available at the Microsoft CodePlex site

• Supported by the Social Media Research Foundation (SMRF) with the tagline “Open Tools, Open Data, Open Scholarship for Social Media”

• Third-party data importers for NodeXL are available through links integrated into NodeXL

APIs and Add-ons

• Application programming interface (API): Protocols for the building of software applications to interact with (in this case) public-facing social media platform databases

• Add-on: An addition to a software program to add functionality

Workspace

4. Social Media Platforms

Social Media

• Integrated online sites and applications that enable people to …

• Interact

• Inter-communicate

• Share information, digital artifacts and objects, materials, funds, and other elements

• Collaborate (co-create knowledge, fund-raise, support, and others)

• Create continuing and long-term profiles

Web 2.0 / the Social Web

• Microblogging sites: Twitter, Sina Weibo

• Social networking sites: Facebook, LinkedIn

• Wikis: Wikipedia (with a MediaWiki understructure)

• Video sharing: YouTube, Vimeo

• Image-sharing: Flickr

• Blogs: WordPress (understructure)

• Email:

• Short message service (SMS):

• and others

Note: It helps to immerse in each platform and observe how users use the platform and how the platform’s community responds to in-world events. It helps to challenge assumptions about how things actually work vs. how one assumes they work.

Social Media Accessible via NodeXL

• Facebook Fan Page Network

• Facebook Personal Network

• Flickr Related Tags Network

• Flickr User’s Network

• MediaWiki Page Network*

• Twitter Search Network (#hashtag, keyword, other)

• Twitter User’s Network (@account, @group)

• Web 1.0 / Blog Network (via VOSON / “Virtual Observatory for the Study of Online Networks”)*

• YouTube User’s Network

• YouTube Video Network (topic)

• [3rd party graph data importers]*

Social Media Account Types

• Social media accounts

• Public or private accounts

• Individual or group (often topic-focused) accounts

• Human, cyborg, ‘bot (including socialbots)

Application Programming Interfaces (APIs)

• Application Programming Interfaces (APIs) enabling access to some limited data from the social media platforms

• Often rate-limited by the social media platform

• Enables downloading of a percentage of the available public data (full amount of dataset not indicated by the API)

• Data released by content creators through the end user license agreements (EULAs)

• Data scraping also possible

Application Programming Interfaces (APIs) (cont.)

• Access requires an email-verified account that the platform can “whitelist” for data access (which also enables the platform’s rate-limiting)

• Some (like Flickr) require a secret and a key

• Terms of access change, and developers may not keep up with changing the software to ensure some access

Types of Social Media Data Available

NodeXL

• Topical slice-in-time; dynamic and continuous (for a certain period of time) (on Twitter)

• Protected user accounts in Facebook (with log-in authentication into Facebook)

• Public-facing user accounts in Flickr, YouTube, and fan accounts in Facebook

• Article edits in Wikipedia

Others

• Tweetstreams going back in time (up to about 3,000 per account) (NCapture in NVivo, on Twitter)

• Geomapping of Tweets (NCapture in NVivo, on Twitter)

• Links between accounts on social media platforms to the Surface Web (Maltego Chlorine 3.6.0)

5. Data Extraction Runs

General Parameters of a Data Extraction

• Seeding term(s)

• Boolean data types (sets) [# and #; # and keyword; tag and tag]

• Type of social or content network (or two-mode / bipartite or multi-mode networks)

• Degree of network (1, 1.5, or 2)

• Number of vertices or messages or videos (size of network), and others
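
The "degree of network" parameter can be illustrated with a small ego-network sketch. It assumes a common reading of the levels: 1 keeps only the ties between the focal account (the ego) and its direct contacts (alters), 1.5 adds the ties among those alters, and 2 adds the alters' own contacts. The function name and the sample edges are hypothetical, and NodeXL's importers may define the levels somewhat differently per platform.

```python
def ego_network(edges, ego, level):
    """Return the undirected ties kept for an ego network at level 1, 1.5, or 2."""
    # Treat ties as undirected and ignore self-loops.
    ties = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    alters = {node for tie in ties if ego in tie for node in tie if node != ego}

    # Level 1: ties between the ego and its alters only.
    keep = {tie for tie in ties if ego in tie}
    if level >= 1.5:
        # Level 1.5: add ties among the alters themselves.
        keep |= {tie for tie in ties if tie <= alters}
    if level >= 2:
        # Level 2: add every tie that touches an alter (the alters' alters).
        keep |= {tie for tie in ties if tie & alters}
    return keep


# Hypothetical edge list around the account "ego".
edges = [("ego", "a"), ("ego", "b"), ("a", "b"), ("a", "c"), ("c", "d")]
```

Each step up in level can grow the extraction sharply on a real platform, which is one reason higher-degree crawls run into rate limits sooner.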

6. Data Processing

Graph Metrics

• Selection of desired metrics of the extracted graph

• Processed on the local machine

• May have to process in parts and pieces (instead of “select all”) because of machine processing limits (saving after each iteration)

Graph Metrics (in detail)

• Overall graph metrics

• Vertex degree (undirected graphs only)

• Vertex in-degree (directed graphs only)

• Vertex out-degree (directed graphs only)

• Vertex betweenness and closeness centralities (a measure of influence in the network based on “bridging” along shortest paths / transmission / propagation efficiency)

• Vertex eigenvector centrality (a measure of influence in the network based on connectivity to influential or high-scoring nodes)

• Vertex PageRank

• Vertex clustering coefficient

• Vertex reciprocated vertex pair ratio (directed graphs only)

• Edge reciprocation (directed graphs only)

• Group metrics

• Words and word pairs

• Edge creation by shared content similarity

• Top items

• Twitter search network top items

Resulting Global-View Graph Metrics Table

• Vertices

• Unique edges

• Edges with duplicates

• Total edges

• Self-loops

• Reciprocated vertex pair ratio

• Reciprocated edge ratio

• Connected components

• Single-vertex connected components

• Maximum vertices in a connected component

• Maximum edges in a connected component

• Maximum geodesic distance (diameter)

• Average geodesic distance

• Graph density (or sparseness)

• Modularity
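
Several rows of such a table can be approximated from a plain directed edge list. The sketch below assumes these readings of the row names: "unique edges" are edges that occur once, "edges with duplicates" counts every copy of a repeated edge, and density ignores duplicates and self-loops. These are assumptions for illustration; NodeXL's documentation should be checked for the exact definitions it applies.

```python
from collections import Counter


def graph_summary(edges):
    """Summarize a directed edge list that may hold duplicates and self-loops."""
    counts = Counter(edges)
    vertices = {vertex for edge in edges for vertex in edge}
    unique_edges = sum(1 for c in counts.values() if c == 1)
    edges_with_duplicates = sum(c for c in counts.values() if c > 1)
    self_loops = sum(1 for source, target in edges if source == target)

    # Density: distinct non-loop directed ties over all possible directed ties.
    distinct = {(s, t) for s, t in counts if s != t}
    n = len(vertices)
    density = len(distinct) / (n * (n - 1)) if n > 1 else 0.0

    # Reciprocated vertex pair ratio: pairs tied both ways / pairs tied at all.
    pairs = {frozenset(tie) for tie in distinct}
    both_ways = {frozenset((s, t)) for s, t in distinct if (t, s) in distinct}
    pair_ratio = len(both_ways) / len(pairs) if pairs else 0.0

    return {
        "vertices": n,
        "unique_edges": unique_edges,
        "edges_with_duplicates": edges_with_duplicates,
        "total_edges": len(edges),
        "self_loops": self_loops,
        "graph_density": density,
        "reciprocated_vertex_pair_ratio": pair_ratio,
    }


# Hypothetical edge list: one reciprocated pair, one repeated edge, one self-loop.
sample = [("a", "b"), ("b", "a"), ("a", "c"), ("a", "c"), ("d", "d")]
summary = graph_summary(sample)
```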

7. Network Graph Data Visualizations

Data Visualizations

• Graph Layout Algorithms: Fruchterman-Reingold, Harel-Koren Fast Multiscale, Circle (Ring Lattice Graph), Spiral, Horizontal Sine Wave, Vertical Sine Wave, Grid, Polar, Polar Absolute, Sugiyama, and Random

• Autofill Columns

• Dynamic Filters

• Layout Options

• Graph Options

Toggling between the Graph Visualizations and the Underlying Data

Data Cleaning

• Deletion of information from the graph that may not be directly relevant (from the data worksheets)

• De-duplication of messaging (if relevant)

Data Filtering

• Using “Dynamic Filters” to select particular types of data of interest to show in the graph pane: relationship date, Tweet Date (UTC), x-axis, y-axis, in-degree, out-degree, betweenness centrality, closeness centrality, eigenvector centrality, PageRank, clustering coefficient, reciprocated vertex pair ratio, followed, followers, Tweets, favorites, joined Twitter date (UTC)

• (and UTC degree time to geo-location)

Dynamic Filters

NodeXL Graph Gallery

• Set up as a place for shared research about social network graphs

• Includes experimental interactive versions of the graphs (if GraphML version is enabled in the upload by the creators of the data)

• Includes some downloadable datasets

• Enables email-verified account creation (which allows the revision of related texts and reversing publication of graphs)

• No commenting on others’ graphs or datasets here

NodeXL Virtual Community and Resources

• NodeXL on CodePlex

• Source Code (open-source)

• Documentation

• Discussions

• Issues

• License (Ms-PL, Microsoft Public License)

9. Beyond NodeXL

Other Complementary Software Tools

Other (Complementary) Tools

Surface Web Data Collection

• Maltego Chlorine 3.6.0 (commercial “subscription” license but with a limited community version)

• NCapture of NVivo 10 (commercial license: perpetual or subscription-type site license)

Text Analysis

• Natural Language Toolkit (NLTK) in Python (open-source and free)

• AutoMap and NetScenes (CASOS) (open-source and free)

10. Presentation Review

Review: NodeXL Capabilities

NodeXL Capabilities with Social Media Data

• Data extractions from both social media platforms and the Surface Web (with VOSON or “Virtual Observatory for the Study of Online Networks” third-party data importer server)

• Additional social media platforms in the works

NodeXL Capabilities

• Network graph data processing

• Network graph analysis

• Graph visualizations

• Multi-lingual data processing

• … and others

• Addition of rudimentary sentiment analysis in commercial version (“NodeXL Pro”) released in 2015

A Short Note about the Sentiment Analysis Feature

• Based on a positive-negative polarity

• Uses a built-in positive word set and a built-in negative word set

• Customizable

• Enables the addition of a third type of word set (a new construct) based on a custom-made text set
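
A word-set approach to polarity can be sketched as follows. The word sets, function names, and sample messages are hypothetical; NodeXL Pro's built-in lists and scoring are not reproduced here. The same counting function can be pointed at a custom third word set.

```python
import re

# Hypothetical word sets, invented for this sketch.
POSITIVE = {"good", "great", "love", "helpful"}
NEGATIVE = {"bad", "awful", "hate", "broken"}


def count_matches(message, word_set):
    """Count how many words in the message belong to the given word set."""
    words = re.findall(r"[a-z']+", message.lower())
    return sum(1 for word in words if word in word_set)


def polarity(message, positive=POSITIVE, negative=NEGATIVE):
    """Label a message by comparing its positive and negative word counts."""
    pos = count_matches(message, positive)
    neg = count_matches(message, negative)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"
```

Simple counting of this kind misses negation, sarcasm, and context ("not good" still counts as positive), which is why the feature is best treated as rudimentary.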

11. Some General Takeaways about Network Analysis using Social Media Data

Some General Takeaways

• Unique aspects of social media platforms and their particular users. The social media platforms are constantly changing. Their users and their metrics are critical to understanding the extracted data. As such, only some voices are captured via social media platforms.

• In other words, who is online, and how are they actually using the social media platforms? What geographical regions are covered? (How does this skew the data?)

Some General Takeaways (cont.)

• Nature of social media platforms. The nature of the social media platforms is important: whether they are for content sharing, knowledge structures, social networking (and for what purpose), and so on.

• Rules of engagement change what is seeable and seen in terms of messaging

• Technically, how “entities” and “relationships” are defined depends on the social media platform. (Read the fine print. Read the developers’ pages.)

• Continuous (dynamic data) vs. slice-in-time (static data); access to historical data

Some General Takeaways (cont.)

• A sampling. There are numerous dependencies in terms of data extractions. The connectivity speed, the busyness of the target servers, the rate limiting of the application programming interfaces (APIs), the dynamism of the data, and such, affect what is collected. This sampling is not a random sample, but it is hard to know how much of a part of a full set has been captured. In most cases, only a very small sample is acquired.

• Very rarely is a full set possible, and only for particular types of data (such as an article network from Wikipedia).

Some General Takeaways (cont.)

• Data visualizations used with underlying data: The data visualizations are rich and varied; however, they are always in a sense less than the full set of information. By definition, data visualizations are data summaries.

• The “graph metrics” table is a critical aspect of the information. Data visualizations should be used with the underlying data.

Some General Takeaways (cont.)

• Understandings of how social media platforms are used: The general public tends to be a lot faster than one would assume in terms of responding to breaking events with messaging across the various platforms.

• Any eventgraphing has to draw from all public sources (and across social media platforms) because each contributes different angles and perspectives on the events; each also attracts different portions of the population.

• (And of course, a lot of information is not publicly shared, so the whole social media angle is still somewhat limited.)

Some General Takeaways (cont.)

• Speed: With unfolding events on social media, most have gone to automated means to surveil and monitor communications. Computational text analytics (and visual analytics) are applied to the messaging in order to see

• what is trending

• the strength and direction of sentiments (positive or negative)

• the types of emotions expressed and in what textual contexts, and so on.

• There is progress in terms of computational visual analysis for object identification, facial recognition, and others.

12. Questions? Comments?

Questions? Comments?

• What research questions are you interested in pursuing? What is the potential role of social media in augmenting your (main) research?

• How do you think you might go about capturing the required social media information? How would you confirm or disconfirm any findings?

• What are complementary streams of data you could use to bolster your work?

Questions? Comments? (cont.)

• How would you pursue leads that are surfaced from social media? The Surface Web?

• How would you represent your work in publication and / or presentation (to show methods, explain complexity, and delimit your assertions)?

• What skills can you hone in order to better exploit public social media data? What do you perceive as strengths in this area? Weaknesses? Why?

Conclusion and Contact

• Dr. Shalin Hai-Jew

• Instructional Designer, iTAC, K-State

• 212 Hale / Farrell Library

• shalin@k-state.edu

• 785-532-5262

• Querying Social Media with NodeXL (an open-source text on the Scalar platform)
