

 SCM System

Joel H. Levine*, Kathleen M. Carley June 3, 2016

CMU-ISR-16-108

Institute for Software Research School of Computer Science Carnegie Mellon University

Pittsburgh, PA 15213

Center for the Computational Analysis of Social and Organizational Systems CASOS technical report.

“Measure what is measurable, and make measurable what is not so.” - Galileo Galilei

Quoted in I Gordon and S Sorkin, The Armchair Science Reader (New York 1959). Quotations by Galileo Galilei http://www-history.mcs.st-and.ac.uk/Quotations/Galileo.html

This work was supported in part by the Office of Naval Research (ONR) N000141512797 Minerva award for Dynamic Statistical Network Informatics, and the Center for Computational Analysis of Social and Organizational Systems (CASOS). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Office of Naval Research or the U.S. government.

*Joel H. Levine is a Professor at Dartmouth College, Hanover, NH


ABSTRACT 

Socio-cultural cognitive maps (SCMs) are the best-fit network model to the set of underlying node-to-variable data. SCMs permit objective visualization of the network, inference about the impact of changes in the underlying conditions influencing the nodes, and comparison of disparate data. This report details the process of creating and assessing these SCMs. First a general introduction is provided, and then a step-by-step guide based on a cognitive walkthrough is presented.

   


INTRODUCTION 

How do we make sense of communities? How do we understand and predict changes in these communities? From a socio-cultural perspective, addressing these questions means attaining a structural understanding of actors, issues, and the relations connecting them. In other words, it means answering these questions: 1) Who are the critical actors, particularly the political, tribal, religious, economic, and educational elite and associated groups? 2) On what specific micro-issues are the interests of these elites and their groups aligned, and on what issues do they compete? 3) What is the basis for those relations, alliances, and conflicts; e.g., are they based in economics, status, education, religion, or location? Further, it is important not only to understand the lay of the land, but to use that information to assess the community of actors of interest vis-à-vis some issue, e.g., resilience, cyber-attacks, or deterrence, given those relations and the basis for them. 4) How will the community change its position on a broad area of concern, an issue, given changes in the alliances and competitions on the more micro issues, perhaps due to the underlying basis for a relation being altered, or an actor being removed? That is, how stable is the community, how resilient is the community, given change at the actor level or at the basis-for-alliance/competition level?

We suggest that these questions can be addressed through the development and assessment of socio-cultural cognitive maps (SCMs). We further suggest that it is critical to develop, visualize, and assess these SCMs quickly, and in a fashion that supports ‘what-if’ reasoning. The SCM is the best-fit model of the underlying relations among actors in the region of interest, based on a socio-cognitive understanding of the social and cultural similarities and differences among a community of actors given a set of topics relevant to an issue.

In general, in an SCM the actors might be individuals or collectives, and the set of actors in the SCM is the “community”. This community may be comprised of individual actors that are public personas (e.g., political elites such as heads of state), groups (e.g., ethno-religious, socio-economic, covert, and political groups), nation states or governance bodies (e.g., the UN), or key stakeholder groups (e.g., the executive or military branch of a country). An SCM is typically developed around an issue. This issue is the thing about which the analyst wants to know the community’s current position, and how that position will change if the set of actors, the relations among them, or the basis of those relations changes. Illustrative issues are the resilience of the community of groups within a country to changes in socio-economic conditions such as changes in wealth and education; the danger of a nuclear event (deterrence) in a region of interest given changes in the force posture of key stakeholders; or the resiliency to cyber-attacks of third-world countries given the global cyber-threat mapping. For each actor in the community, pursuant to these issues, there are a number of topics of relevance on which the individual actors have “scores.” These topics include issues, beliefs, or norms where the actor has a position (or score), such as the belief that an ally will support them in the event of a nuclear incident or the level of concern with climate change; a general socio-demographic attribute such as level of education or wealth; or an infrastructure attribute such as internet penetration. These topics are the dimensions along which actors can be similar or different. The SCM can be represented as a network where the nodes are actors and the links express the connectivity among the actors, taking into account the similarity and dissimilarity of the two actors given issue-relevant topics and/or issue-relevant networks (such as trade volume networks, hostility networks, and alliance agreements).

Mathematically, the SCM is a reduction of the more complex detail available in the hyper-cube where the dimensions are actors by actors by topics by topics; the actor-topic links are the strength of connectivity; the topic-topic links are the co-presence or covariance of the topics; and the actor-actor links are inferred from the other dimensions given the degree to which two actors share the same topics and the extent of the connectivity between those topics. Finding the SCM and assessing it, however, is a time-intensive process that requires the analyst to make a large number of choices regarding the underlying data. The goal then is to develop an SCM technology that a) supports the rapid development of SCMs, b) is sensitive to cultural differences, c) results in an interpretable model, and d) is usable for assessing possible interventions. The algorithms needed for generating, assessing, and visualizing SCMs need to be robust, scalable, reusable, and reproducible. However, there is no such technology. In this document we therefore lay out a possible technology and walk the reader through the underlying steps in the construction, visualization, and use of SCMs.
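To make the idea of inferring actor-actor links from actor-topic and topic-topic links concrete, the following is a minimal sketch under stated assumptions. The combination rule used here (the matrix product of the actor-topic matrix, the topic-topic matrix, and the transposed actor-topic matrix) is one simple illustrative choice, not necessarily the rule used in the SCM system, and the data are invented for the example.

```python
# Hypothetical toy data: 3 actors x 2 topics (actor-topic link strengths).
actor_topic = [
    [1.0, 0.0],   # actor 0 is tied only to topic 0
    [0.8, 0.2],   # actor 1 is tied mostly to topic 0
    [0.0, 1.0],   # actor 2 is tied only to topic 1
]
# Hypothetical topic-topic co-presence (symmetric, 1.0 on the diagonal).
topic_topic = [
    [1.0, 0.3],
    [0.3, 1.0],
]


def infer_actor_links(a, t):
    """Return S = A * T * A^T as nested lists (an assumed, illustrative rule)."""
    n, k = len(a), len(t)
    # First form A * T, then multiply by the transpose of A.
    at = [[sum(a[i][p] * t[p][q] for p in range(k)) for q in range(k)]
          for i in range(n)]
    return [[sum(at[i][q] * a[j][q] for q in range(k)) for j in range(n)]
            for i in range(n)]


links = infer_actor_links(actor_topic, topic_topic)
for row in links:
    print([round(v, 3) for v in row])
```

In this toy case actors 0 and 1, who share a topic, receive a stronger inferred link than actors 0 and 2, who are connected only through the co-presence of their two different topics.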

This document is organized as follows. We begin by describing the vision of SCMs and the role of linear methods in that process. Then we describe the cognitive walkthrough process used to identify the workflow and technologies needed to create, use, visualize, and assess SCMs.

LINEAR METHODS FOR NOT‐YET‐MEASURED VARIABLES 

The linear model is the best tool we have for describing the relation between two variables. It says that variable Y is a linear function of variable X. It is simple and powerful, yet limited. It does not apply to categorical variables nor to networks that have structures but no variables. It is limited to numerical variables.

We can and have developed other tools for correlations among non-numerical objects, but there is an alternative, which is to extend number and the linear model to these objects. We do that by assuming that numbers and linear relations exist for these variables but have not yet been discovered. Then we attempt to reverse engineer these not-yet-measured variables and validate the assumptions under which they have been discovered.

There is ample reason to suspect that these numbers and variables exist. For example, Figure 1 shows data from the Washington Post describing prior activities of terrorists involved in the 9/11 attack on the United States. The data tell us that Nawaf al-Hazmi (column 3) and Khalid al-Mihdhar (column 4) appeared together in a video (row 1), that Mohammed Atta and Marwan al-Shehhi took flying lessons in Venice, Florida (row 3), and so on for 31 activities.


Figure 1. Activities of the 9/11 Terrorists - Washington Post Data Ordered by Date and Target

Is there order in these data? Is there a dimension? The Post presented the data by date and by airplane, but re-organizing the data as in Figure 2 strongly suggests there is a linear order. Visual patterns prove nothing, but they can suggest a great deal, here suggesting an order and intervals that exist but have not been measured.

Figure 2. Activities of the 9/11 Terrorists - Washington Post Data, Reorganized

Subject to test, the discovery of the missing numbers begins by practicing on data for which the numbers are known and by learning the rules by which the known x’s and y’s are linked to the fine detail of the data. Then, having learned the rules, we switch to more challenging data in which the x’s and y’s are unknown and work backward from the fine detail of the data to estimates of the not-previously-measured x’s and y’s, assuming and testing the assumption that the rules continue to apply. The SCM process is a human-in-the-loop, semi-automated approach to doing this test and discovery, to finding the patterns hidden in, but not previously measured from, the raw data.


Now let’s move to another set of data referring to the height and weight of individuals. In Figure 3, the height-weight data demonstrate the empirical link between the numbers for height and weight (shown at the left and at the bottom) and the data for joint frequencies of heights and weights (shown in the cells). We can ask of these data: What are the rules that govern this empirical link?


Figure 3. Image of relation between Height and Weight

The do’s and the don’ts 

The don’ts

In principle, this might be simple. We might assume that the data have a two-variable “normal” (Gaussian) distribution, and use the Gaussian assumption as the rule, acknowledging that the Gaussian assumption might be only an approximation.

The problem is that, in so doing, we assume away what may be (and is) a real structure in the data, a structure that is both not Gaussian and rich with information. We are not free to assume rules according to convenience or convention: The “rules” linking numbers to data are theories and need to be respected as such.

How do we know it is wrong? The goodness (or badness) of fit of the Gaussian assumption to these data can be observed by writing and testing 343 simultaneous equations that link the known heights and weights to the 343 known frequencies in the data, where the ordinates of the Gaussian should be approximately proportional to the observed frequencies. By direct computation, we know the 5 parameters of the Gaussian that are required by the equations (2 means, 2 standard deviations, and one correlation, r). And we can estimate the constant of proportionality, a sixth parameter, by chi-square best fit between the equations and the data. The result is a chi-square error of 879,401 with 368 degrees of freedom:

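Equation 1

$$F(X, Y) \propto \frac{1}{2\pi\,\sigma_X \sigma_Y \sqrt{1 - r^2}} \exp\left[-\frac{1}{2}\cdot\frac{1}{1 - r^2}\left(x^2 - 2rxy + y^2\right)\right],$$

where $x = (X - \bar{X})/\sigma_X$ and $y = (Y - \bar{Y})/\sigma_Y$,

where $\bar{X}$, $\bar{Y}$, $\sigma_X$, $\sigma_Y$, and $r$ are the appropriate means, standard deviations, and correlation,

and where the equation is approximate because the bivariate normal is continuous while the frequencies are constructed by grouping the data into intervals.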

That is a bad fit: Using the theoretical properties of chi-square as a convention, the chi-square error of a good fit should be in the neighborhood of 368 (the number of degrees of freedom). By contrast, the chi-square value of 879,401 (associated with the normal) is more than 2,000 times greater (worse) than this target.
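The test just described can be sketched in a few lines. The example below is a minimal illustration, assuming a small invented 3-by-3 height-weight table rather than the report's data; the function names are hypothetical. It computes the five Gaussian parameters directly from the grouped table, evaluates the Gaussian ordinate at each cell, chooses the constant of proportionality c that minimizes the chi-square error, and reports chi-square against degrees of freedom (cells minus six parameters). For expected values c·E, minimizing the sum of (O - cE)²/(cE) gives c = sqrt(Σ(O²/E) / ΣE).

```python
import math


def bivariate_normal_ordinate(x, y, mx, my, sx, sy, r):
    """Density of the bivariate normal at the point (x, y)."""
    zx, zy = (x - mx) / sx, (y - my) / sy
    q = (zx * zx - 2.0 * r * zx * zy + zy * zy) / (1.0 - r * r)
    return math.exp(-0.5 * q) / (2.0 * math.pi * sx * sy * math.sqrt(1.0 - r * r))


def best_scale(observed, ordinates):
    """Constant c that minimizes sum((O - c*E)**2 / (c*E)) over all cells."""
    num = sum(o * o / e for o, e in zip(observed, ordinates))
    return math.sqrt(num / sum(ordinates))


def chi_square(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical grouped table: (height_in, weight_lb, count). Not the report's data.
cells = [(62, 120, 30), (62, 140, 12), (62, 160, 3),
         (64, 120, 14), (64, 140, 40), (64, 160, 15),
         (66, 120, 2), (66, 140, 16), (66, 160, 28)]

# The five Gaussian parameters, by direct computation from the table.
n = sum(c for _, _, c in cells)
mx = sum(h * c for h, _, c in cells) / n
my = sum(w * c for _, w, c in cells) / n
sx = math.sqrt(sum(c * (h - mx) ** 2 for h, _, c in cells) / n)
sy = math.sqrt(sum(c * (w - my) ** 2 for _, w, c in cells) / n)
r = sum(c * (h - mx) * (w - my) for h, w, c in cells) / (n * sx * sy)

observed = [c for _, _, c in cells]
ordinates = [bivariate_normal_ordinate(h, w, mx, my, sx, sy, r) for h, w, _ in cells]

# The sixth parameter: the chi-square best-fit constant of proportionality.
c_hat = best_scale(observed, ordinates)
expected = [c_hat * e for e in ordinates]
chi2 = chi_square(observed, expected)
dof = len(cells) - 6

print(f"r = {r:.3f}, chi-square = {chi2:.2f}, degrees of freedom = {dof}")
```

A fit is conventionally judged good when the chi-square value is in the neighborhood of the degrees of freedom, which is the comparison made in the text.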


Assessing the goodness of fit by a different criterion, the error associated with the Gaussian can be compared to the error from a model we know to be false: the obviously false assumption that there is no correlation. The best-fit no-correlation model produces a chi-square error of 1,127, with 330 degrees of freedom. This means that the assumptions embedded in the Gaussian are not only inappropriate but worse than the assumption that there is no correlation at all: the null reduces error to about one-tenth of one percent of the error associated with the Gaussian.

More important for a scientist, while the data are not Gaussian, they do show a pattern. For example, consider the isolated data for women at 62 and 64 inches with weights 146.5 to 152.5 pounds. In this subset of the data the taller women are lighter: The odds that a 62-inch woman will weigh 152.5 lbs, as compared to 146.5, are 81 to 76, approximately 1.07 to 1. By contrast, at 71 to 91 (approximately 0.78 to 1), the odds that the taller, 64-inch women will have the heavier weight are smaller.

152.5 lbs      81         71
146.5 lbs      76         91
               62.0 in    64.0 in

Figure 4. Taller women are lighter - data subset.

This reverse correlation is both counter-intuitive and, if the data were Gaussian, impossible. Yet it occurs in roughly one third of the subsets that can be isolated (even excluding low-frequency data by using only those examples that show a minimum of 5 people per cell). [See Levine 1995?]

The message is that the reverse local correlations are an unexpected but real and orderly property of the data.
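The arithmetic behind this local reversal can be checked directly. The sketch below uses the four counts from Figure 4; the helper function and the variable names are illustrative additions. It also evaluates the local log odds ratio implied by a bivariate normal density over a rectangle of width dx and height dy, which works out to r·dx·dy / ((1 - r²)·σx·σy) and therefore has the sign of r everywhere. That is why a reversed local odds ratio cannot occur under a positively correlated Gaussian. The values of r, σx, and σy passed in at the end are placeholders chosen only to show the sign.

```python
# Counts from Figure 4: rows are weights, columns are heights.
n_152_at_62, n_152_at_64 = 81, 71   # 152.5 lbs
n_146_at_62, n_146_at_64 = 76, 91   # 146.5 lbs

odds_heavier_at_62 = n_152_at_62 / n_146_at_62   # about 1.07 to 1
odds_heavier_at_64 = n_152_at_64 / n_146_at_64   # about 0.78 to 1

# A ratio below 1 means the taller group is less likely to be heavier.
local_odds_ratio = odds_heavier_at_64 / odds_heavier_at_62


def gaussian_local_log_odds_ratio(r, dx, dy, sx, sy):
    """Log odds ratio of a bivariate normal density over a dx-by-dy rectangle."""
    return r * dx * dy / ((1.0 - r * r) * sx * sy)


print(round(odds_heavier_at_62, 3), round(odds_heavier_at_64, 3))
print(round(local_odds_ratio, 3))
# Placeholder parameters (assumed, not estimated from the report's data).
print(gaussian_local_log_odds_ratio(r=0.5, dx=2.0, dy=6.0, sx=2.5, sy=20.0) > 0)
```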

And the do’s 

Reducing the algebra (and the specificity) of the Gaussian, the better model replaces the key factor (the correlation factor) with an expression in $|X - Y|^a$ where the Gaussian uses the expression $|X - Y|^2$. And it drops assumptions about the one-variable terms of the Gaussian, other than to assume that they are multiplicative and can be replaced with multiplicative parameters to be estimated from the data. See Figure 5 for an illustration.


Figure 5. Illustrative Non-Gaussian, a ≈ 1.1

The figure tells us that there exist strongly preferred combinations of height and weight, more narrowly limited by the data than the averages or a bell-shaped (normal) distribution.

It sends research down a different path: Where Gaussian variables are thought to be generated by aggregations of many uncorrelated causal variables, the “spike” is not Gaussian. Fitting these “spiked” distributions to the data fits the data (chi-square ≈ 248) but estimates a substantially different non-Gaussian line, with a linear slope of 4.3 lbs per inch (as compared to the 2.6 lbs per inch by standard regression or the 8.4 lbs per inch obtained by attempting to fit the full two-variable Gaussian). And unlike the conventional estimates of the number of pounds per inch, the non-Gaussian (non-least-squares) hypothesis is backed up by a tight fit to the data.

Making no a priori decision about a, the improved model replaces 2 with a value to be estimated from the data.

$$F(X, Y) = [\text{X-factor}] \cdot [\text{Y-factor}] \cdot 2^{-|x - y|^{a}} \qquad (4)$$

where the three terms are the X-factor, the Y-factor, and the X & Y correlation factor,

where the “All factor” (of Equation 2) has been absorbed into the X-factor and Y-factor,

where $x = (X - M_X)/W_X$ and $y = (Y - M_Y)/W_Y$,

where base “2” is used in preference to “e” for convenience with applications (without affecting the goodness of fit),

where the M and W symbols indicate that these additive and multiplicative parameters of the linear relations need not be means and standard deviations,

and where a (estimated at 1.1 for the height-weight data) is an attenuation constant that governs the rate of descent (of the correlation factor) with respect to the distance of each combination of X and Y from the line X = Y.

With this not-necessarily-quadratic supplement to the kinds of correlation that will be recognized, the best fit of a is 1.1, which reduces the error another 63% (another 2.7-fold) to 248 (with 325 degrees of freedom). With this rule/theory the magnitude of the chi-square is now less than the magnitude of the degrees of freedom, exceeding the conventional standard for goodness of fit. With this fit, Equation 4 establishes a close link between the numbers and the data (although the model is not necessarily unique).
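The rule can be written down compactly. The following is a sketch, assuming the correlation factor has the base-2 form with exponent |x - y| raised to the power a, as described in the text, and assuming the linear transformations take the form x = (X - M_X)/W_X and y = (Y - M_Y)/W_Y. The function and argument names are illustrative, not the authors' code.

```python
def correlation_factor(x, y, a):
    """2 ** -(|x - y| ** a): equals 1 on the line x = y and 1/2 at |x - y| = 1."""
    return 2.0 ** -(abs(x - y) ** a)


def expected_frequency(X, Y, x_factor, y_factor, m_x, w_x, m_y, w_y, a):
    """Expected cell frequency: X-factor * Y-factor * correlation factor."""
    x = (X - m_x) / w_x   # linear transformation of the row variable
    y = (Y - m_y) / w_y   # linear transformation of the column variable
    return x_factor * y_factor * correlation_factor(x, y, a)


# Compare the descent from the ridge for a = 1.1 (spiked) and a = 2 (bell-shaped).
for d in (0.0, 0.1, 0.5, 1.0, 2.0):
    spiked = correlation_factor(0.0, d, 1.1)
    bell = correlation_factor(0.0, d, 2.0)
    print(f"|x - y| = {d:3.1f}   a = 1.1: {spiked:.4f}   a = 2: {bell:.4f}")
```

Near the line x = y the a = 1.1 factor falls away much faster than the a = 2 factor, which is the “spike”; farther out (beyond |x - y| = 1) it falls away more slowly, giving heavier tails than the bell-shaped curve.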

For the height-weight exemplar in Figure 3, Figure 6 shows the successive improvements of fit as the features of Equation 4 are implemented in stages.

‘Rule’                                          Chi-Square   Standardized   Chi-Square/DF   Number of    Degrees of
                                                             Chi-Square                     Parameters   Freedom

Bivariate normal assumed:                       879,401      32,357.62      2,383.20         6           368
  Wt = 8.369 lbs/inch * Ht - 394.912 pounds (f1)

Best-fit null¹ (f2)                             1,127        30.94          3.40            44           330

... as above, plus best-fit attenuation,        248          -3.45          .76             47           327
  a = 1.1 estimated:
  Wt = 4.290 lbs/inch * Ht - 157.394 pounds (F4)

Figure 6. Reduction of Error Corresponding to Features of the rules for height-weight data (in one dimension).² The parameters of the two linear transformations, m, n, p, and q, resolve into two parameters describing the linear relation between them and use only two degrees of freedom.

¹ As every model based solely on row and column multipliers is a null model, the best-fit null is the null model that best fits the data using the same criterion (least chi-square) that is used to assess the positive model.

² There is no probability calculation implied by these empirical chi-square values: Probability tests would require that the cell values be Poisson distributed with means greater than approximately 4 or 5, which is not the case in these data.

Falsifiable Hypotheses:

Transferring this non-Gaussian rule to a network for which numbers are yet to be found, the Washington Post’s 9/11 data were prepared as a frequency table showing the number of links (activities) shared by each pair of individuals, Figure 7. The fitted frequencies are shown in Figure 8. Finally, for these frequencies the best-fit two-dimensional SCM displays the not-previously-estimated numbers, Figure 9. The SCM of not-previously-measured numbers is backed up by a close fit to the data, chi-square ≈ 1.66. (To work in two dimensions, the distances on the line, for height and weight, were replaced by Minkowski distances in two dimensions, with coordinates yet to be estimated.³)
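The preparation step, turning a list of activities and their participants into a pairwise frequency table, can be sketched as follows. The activity names and participant labels are invented placeholders, not the Post's data, and the function name is hypothetical.

```python
from itertools import combinations

# Hypothetical incidence data: activity -> set of participants.
activities = {
    "video": {"P1", "P2"},
    "flight_school": {"P3", "P4"},
    "shared_address": {"P1", "P2", "P3"},
}


def shared_counts(activities):
    """Count, for every pair of people, the number of activities they share."""
    counts = {}
    for people in activities.values():
        # Sorting makes each pair key canonical, e.g. ("P1", "P2").
        for pair in combinations(sorted(people), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


frequency_table = shared_counts(activities)
for pair, count in sorted(frequency_table.items()):
    print(pair, count)
```

Pairs that never appear together are simply absent from the dictionary and correspond to zero cells in the frequency table. The diagonal (the activities a person shares with themselves) is not produced at all.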

It displays some subjectively familiar features: a dense ‘clique’ combining parts of the two World Trade Center groups, a separate ‘clique’ for the Pentagon, and no structural coherence for the three-member group that failed.

Addressing the primary question, does the evidence support the hypothesis that the SCM with not-previously-measured x’s and y’s is an objective representation of this network? The evidence from the fit is consistent with that hypothesis: Given these estimates of the x’s and y’s, the hypothesis achieves a close fit to the frequency data in much the same way that an ordinary linear model, with known x’s and y’s, might achieve a close fit to the means.⁴ For these data, the close fit is a strong argument in support of the attenuation and Minkowski parameters of the model (Appendix I), in support of the reality of the space, and in support of the estimates of these not previously estimated x’s and y’s.

³ To generalize Hidden Line methods to 2 or more dimensions, the absolute difference

$$|x - y| \qquad (4)$$

is generalized to the Minkowski distance, there being no reason to assume that the geometry of Euclidean space (used for physical space) is a good geometry for other data spaces:

$$\left( \sum_{k} |x_k - y_k|^{M} \right)^{1/M} \qquad (5)$$

This is the family of Minkowski metrics. Their different forms of combination, parameterized by M, allow exploration of best-fit real-world rules of combination. If M were equal to 2, if it fits, the metric would imply that dimensions of a data space, like dimensions of physical space, combine according to the square root of the sum of their squares. If the best-fit M were equal to 1, it would imply that the dimensions of a data space combine by straight addition of their orthogonal components.


Figure 7. Numbers of Activities Shared by Each Pair - Observed Frequencies

Figure 8. Numbers of Activities Shared by Each Pair - Fitted Frequencies


Figure 9. Map: Inferred Coordinates for 9/11 Terrorists - Attenuation 12.64, Minkowski parameter .81, Chi-Square = 1.66

Why Measure?

Putting numbers on not-yet-measured objects is not an end in itself. It is important, in part, because it joins measurement with theory. In turn, that puts a data analysis in jeopardy, as it should be: “Methods” are not theory-neutral. Joining measurement to theory means that an analysis can fail, which means it can be improved, which means it can extract simplicity (in the height-weight case) that the creative ambiguities of English cannot detect. The proposed SCM approach will lay bare the relations and help the analyst to develop empirically driven theory in which ambiguity is reduced through the systematic assessment of alternatives.

LINEAR MODEL VISUALIZATION IS OBJECTIVE 

Current mapping technologies using standard graph theory and social network visual analytics do not support formal inference and deduction of facts not explicitly present in the data. Current mapping techniques also have trouble simultaneously handling both networks and attributes, and then using the position of the nodes on those attributes both to infer networks and to interpret the results of findings about the position of actors in the network or the composition of subgroups. Finally, many social network visualization methods, e.g., force-field layouts, invite the analyst to change the visualization in pursuit of visual clarity, which is subjective. The force-field layout allows the user to alter “gravity” and “repulsion” to suit. This makes such layouts subjective, by definition: The result depends on the observer.

To be sure, there is a body of research aimed at two-mode data, and methods exist for creating networks from bipartite graphs (e.g., multiplying a network by its transpose). That resolves part of the visualization issue. More relevant to the work proposed here, there are a number of techniques for creating distance networks. A distance network is a matrix of relations among nodes such that the link weight represents the “distance” between those nodes given a set of indicators. Common distance metrics include similarity, relative similarity, Euclidean, Chebyshev, Canberra, and Minkowski. The metrics vary in the extent to which they weight outliers, are valid for continuous versus categorical data, and control for correlation among variables. We note that there is no agreement on which metric to use. Thus, a key element of our research will be to identify the best candidates and assess the sensitivity of the SCM results to this metric.

The visualization results achieved from the use of linear models and the SCM approach are quite different from those typical in social network analysis. In ordinary geometry, if the distances between one object and three different objects are known, then the distance between that object and all other objects can be inferred, whether or not data are provided for these additional distances. Further, if data exist for the distances between one object and four or more different objects, and there are errors in the data, then the data can be corrected, because the correct values must be consistent with the Euclidean rules. It is for this reason that satellite navigation systems use as many satellites as are available. In contrast, current “network geography” techniques, which are used to assess topic maps and social networks, do not support this type of formalized inference and deduction, as there is no meaning associated with the position of the nodes in the 2D or 3D visual image.
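The geometric argument can be illustrated with a small trilateration sketch in the Euclidean plane. This is an illustrative example under stated assumptions (known reference positions, a linearized least-squares solution, invented coordinates), not part of the SCM system. With exact distances to three or more non-collinear reference objects the position is recovered exactly; with four or more noisy distances the least-squares solution averages out the errors, and the distance to any further object can then be inferred.

```python
import math


def locate(anchors, distances):
    """Least-squares 2-D position from three or more anchors and distances."""
    (x0, y0), d0 = anchors[0], distances[0]
    rows, rhs = [], []
    # Subtracting the first circle equation from the others gives linear equations.
    for (xi, yi), di in zip(anchors[1:], distances[1:]):
        rows.append((2.0 * (xi - x0), 2.0 * (yi - y0)))
        rhs.append((xi**2 + yi**2 - x0**2 - y0**2) - (di**2 - d0**2))
    # Normal equations (A^T A) p = A^T b for the two unknown coordinates.
    s11 = sum(a * a for a, _ in rows)
    s12 = sum(a * b for a, b in rows)
    s22 = sum(b * b for _, b in rows)
    t1 = sum(a * v for (a, _), v in zip(rows, rhs))
    t2 = sum(b * v for (_, b), v in zip(rows, rhs))
    det = s11 * s22 - s12 * s12
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)


# Hypothetical reference objects and a "true" position used to generate distances.
anchors = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (5.0, 5.0)]
true_position = (1.0, 2.0)
exact = [math.hypot(true_position[0] - ax, true_position[1] - ay) for ax, ay in anchors]
noisy = [d + e for d, e in zip(exact, (0.05, -0.04, 0.03, -0.02))]

estimate_exact = locate(anchors, exact)
estimate_noisy = locate(anchors, noisy)

# Distance to a fifth object for which no measurement was supplied.
other = (7.0, 1.0)
inferred = math.hypot(estimate_exact[0] - other[0], estimate_exact[1] - other[1])
print(estimate_exact, estimate_noisy, inferred)
```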

If a semantic or social network is represented as shown in Figure 10, visualization 1, which is a typical network visualization, the visualization is not true to all that we know: In this network, A, B, D, and E are equidistant from each other, through the center node C. But the visualization does not show that: It shows A closer to B than to E. Further, C is perceived as critical because it is in the middle. Finally, any information about the strength of the relation or the basis of the relation is missing. In Figure 10, visualization 2, which is an enhanced traditional network visualization, color and weight are used to provide additional information. But again, the image is not accurate, as the length of the lines connecting the nodes has no inherent meaning. One cannot infer that node A is twice as different from node C as is B. We propose to develop an SCM network visualization approach where the nodes are placed in either 2D or 3D space such that the nearness of the nodes to each other reflects similarity in this space; where inference can be drawn based on position; and where missing data can be estimated given the model expressed in this topic space. A stylized interpretation of this is shown in Figure 10, visualization 3, which is the SCM visualization. Our proposed approach will generate models where formalized inference and correction for missing data are possible and distance is meaningful.

Figure 10, visualization 3 is the result of the linear model visualization. This visualization, like the real data example shown in Figure 9, is objective. These representations are ‘objective’ in the sense that once data decisions are made (e.g., once it is decided that the frequency table is to be mapped and that the diagonal will be ignored, ignoring the events that a node shares with itself), decisions that the user can choose, the placement on the map is objectively determined. This adds a degree of objectivity to network analysis, an objectivity that has been lacking with network visualizations. In the 9/11 data, there might be some visual appeal in spreading out the tangled web of the five people at bottom center. But that decision is not ours and is not made on subjective grounds.


Figure 10. Illustration of different network visualization styles

In these SCM visualizations we are trying to display non-Euclidean images on a Euclidean piece of paper. Imagine the above images on a city-block grid. In the city-block metric, visualization 1 (in Figure 10) would have A, B, D, and E all equidistant from each other. Consider drawing a grid of “city blocks” in which each was two units away from every one of the other three.
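One such drawing can be verified numerically. In the sketch below the coordinates are an assumed placement chosen for illustration: the four outer nodes sit one block from a center node C along the grid axes. Under the city-block metric every pair of outer nodes is exactly two units apart, whereas under the Euclidean metric the same placement is not equidistant (and no placement of four points in the Euclidean plane can be).

```python
import math
from itertools import combinations


def city_block(p, q):
    """City-block (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Assumed placement on a city-block grid; C is the center node.
outer = {"A": (1, 0), "B": (0, 1), "D": (-1, 0), "E": (0, -1)}
center = (0, 0)

block_distances = {(m, n): city_block(outer[m], outer[n])
                   for m, n in combinations(outer, 2)}
euclid_distances = {(m, n): euclidean(outer[m], outer[n])
                    for m, n in combinations(outer, 2)}

print(block_distances)   # every pair is 2 blocks apart
print(euclid_distances)  # adjacent pairs and opposite pairs differ
```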

What does the SCM model tell us? 

We can address that question in terms of both theory (the rules) and practice; the two come together in the attenuation parameter, “a”.

Consider what it is not: a is not 2. In simple one-variable distributions, “2” and the bell-shaped curves it describes go with the informal interpretation of the Central Limit Theorem to the effect that when an observed variable is an aggregate (a sum or average of several independent and identically distributed contributing variables) it will approximate a “normal” distribution, a = 2. This distribution is not normal: it has a ‘spike’ as it crosses the central line, where the “normal” would be flat. The suggestion is that weight is not an aggregate and is relatively simple, with few (or highly correlated) contributing variables that are themselves “spiked”. It suggests a research path looking for few causal variables. That is very different from the standard story of regression analysis: Use one variable, pick up a couple of percent of the variation. Add another, pick up a couple of percent more; in years of further research, the “percent variance explained” will gradually improve. This spike is a different story.

Also note that the chi-square error dropped at a = 1.1 and, at the same time, the slope of the linear relation dropped from 8.4 lbs per inch (the fitted bivariate normal) down to 4.3 lbs per inch, almost 50%. And note that one thing that was clearly wrong with the standard analysis is that it includes some women whose relatively heavy weight is not associated with height. Ordinary least squares and the normal do not “know” what to do with this: There is something ‘going on’ in these data that has nothing to do with the relation between height and weight.

These extreme cases will affect the column means and, therefore, the least squares regression line. By contrast, these extreme cases do not affect cells associated with the “spike” (from a ~ 1.1). The non-Gaussian ‘rule’ is associating the line with that part of the data that exhibits the spike.

Where the strategy is consistent with the data, what does the strategy teach about the world from which the data come? What do we learn that is not taught by the pre-computer statistical devices of classical data analysis? What does the strategy simplify and advance?

One thing it allows is a distinction between prediction and process: The least-squares best-fit line is designed to predict averages. The linear relation that allows this model to fit the data describes descent on either side of a linear relation between x and y. But there is no reason to assume that these two lines are the same. For height and weight they are not. The ridge associated with this linear relation literally changes our description of the world, or offers a competing description of the world that generates the data. Anticipating subsequent analyses, the “deepest” questions may involve that “a” parameter.

The “a” relates to the seldom explicit but often taught “story line” of behavioral research. The story has it that the world is a very complicated place wherein what we see is the result of many variables acting to produce the behavior we see. The story tells us to predict “normal” (bell-shaped) scatter among values surrounding predicted values because, roughly speaking (very roughly), that is what the Central Limit Theorem tells us to expect from phenomena that are the aggregate of many underlying phenomena. The story line tells us that one generation of scholars will settle for explanations that explain some “percent of the variance”, to be followed by the next generation “explaining” another couple of percent of the variance that is left, followed by …

Where a is not equal to 2, the story changes. Then the logic runs backward: If we are not seeing bell-shaped distributions (a = 2), then our phenomena are (or may be) simple. They may indeed be the result of many underlying variables, but those variables will be correlated such that the number of independent determinants of behavior is small. When a ≠ 2 fits the data it suggests, but does not prove, that “the world” is simpler than we have assumed: not simple, but more so.

The extension of Equation 4 to multiple dimensions generalizes distance from the one-dimensional expression

$$|x - y|$$

to the two or more dimensional expression

$$\left( \sum_{k} |x_k - y_k|^{M} \right)^{1/M}$$

This is the family of Minkowski metrics for distance. They express properties of the underlying data, specifically the rule by which differences on several dimensions combine.


For M = 2, components of distance are combined by adding their squares and taking the square root of the sum: the Euclidean distance. For Euclidean distance each component contributes to the combination in proportion to its square so that larger components dominate the combination.

For M = 1 distance components combine by addition, the so-called Manhattan metric in reference to roads arranged in a rectangular grid. For Manhattan distance each component contributes to the combination in strict proportion to its size.

For 0 < M < 1 distance components combine by adding their roots and putting the sum to a power. It can be referred to as the “can’t get there from here” metric because the shortest distance between two points lies through an intermediary third point. (These metrics are properly referred to as semi-metrics as they relax the triangle inequality satisfied by ordinary (physical) distance.)
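The three cases above can be compared numerically. The following is a minimal sketch, assuming the standard Minkowski form (sum of absolute differences raised to M, then the 1/M power); the function name and the three example points are hypothetical and chosen only for illustration.

```python
def minkowski_distance(p, q, M):
    """Combine per-dimension differences using Minkowski parameter M (> 0)."""
    if M <= 0:
        raise ValueError("M must be positive")
    return sum(abs(a - b) ** M for a, b in zip(p, q)) ** (1.0 / M)


# Hypothetical points: go from `origin` to `far` directly, or via `corner`.
origin, corner, far = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)

for M in (2.0, 1.0, 0.5):
    direct = minkowski_distance(origin, far, M)
    via_corner = (minkowski_distance(origin, corner, M)
                  + minkowski_distance(corner, far, M))
    # For M < 1 the indirect route is shorter than the direct one,
    # which is the triangle-inequality violation described in the text.
    print(f"M={M}: direct={direct:.3f}, via third point={via_corner:.3f}")
```

With M = 0.5 the direct distance exceeds the distance through the intermediate point, which is the “can’t get there from here” behavior.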

Whichever metric applies, if one metric produces a better fit to the data, that fit reveals theoretical information about the underlying data. Examining combinations of attenuation and metric, the best combination for these 9-11 data is the Manhattan metric, M ~ 1.0, and attenuation slightly flatter than the normal attenuation, a ~ 3. (With a chi square of approximately 3, accumulated from 153 cells, errors this small can only be approximate.) The impact of attenuation and the chosen metric value is shown in Figure 11.

| Metric \ Attenuation | a=.7 | a=1 | a=2 | a=3 | a=4 |
|---|---|---|---|---|---|
| M=.70 | 87.82 | 22.80 | 15.62 | 3.96 | 4.07 |
| M=1.00 | 122.99 | 8.06 | 3.14 | 2.85 | 1.99 |
| M=2.00 | 108.46 | 8.29 | 4.24 | 4.25 | 5.31 |
| M=3.00 | 110.56 | 6.80 | 4.89 | 3.42 | 4.55 |
| M=4.00 | 92.12 | 6.05 | 4.21 | 6.54 | What happened here? |

Figure 11, Chi-Squares - Least chi-square values corresponding to each of 25 fixed combinations of the metric and the attenuation.

MINKOWSKI PROCESS 

The core of the SCM process relies on the Minkowski metric (or semi-metric). This section describes the underlying mathematics.

Ri = row i multiplier

Cj = column j multiplier

dMij = the Minkowski metric (or semi-metric), with Minkowski parameter M, for distance from row i to column j

a = attenuation – this is the power to which the Minkowski metric is raised

Fij = fitted frequency for cell ij

xi,dim = x is the row coordinate for row i

yj,dim = y is the column coordinate for column j

M = Minkowski parameter

Ndim = the number of dimensions – this will be 1, 2 or 3.

Chi-square = the sum over all relevant cells of (frequency – Fij)² / Fij

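Now the fitted frequency can be calculated as:

$$F_{ij} = R_i \, C_j \, e^{-\left( d_{Mij} \right)^{a}}$$

The above equation is often referred to as the basic model. The alternative is to use 2 rather than e in the above function. Sometimes it is easier to have a distance of “1” on the SCM correspond to a half-distance decline of frequencies.

Finally, the Minkowski distance is calculated as:

$$d_{Mij} = \left( \sum_{dim=1}^{N_{dim}} \left| x_{i,dim} - y_{j,dim} \right|^{M} \right)^{1/M}$$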
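As an illustration of how the quantities defined in this section fit together, here is a minimal sketch that assumes the basic model multiplies the row and column multipliers by the base (e or 2) raised to minus the attenuated Minkowski distance. That functional form is our reading of the definitions, and the function and argument names are hypothetical.

```python
import math


def minkowski_distance(x, y, M):
    """Minkowski (semi-)distance between a row point x and a column point y."""
    return sum(abs(a - b) ** M for a, b in zip(x, y)) ** (1.0 / M)


def fitted_frequency(Ri, Cj, x, y, M=2.0, a=2.0, base=math.e):
    """Assumed basic model: Fij = Ri * Cj * base ** (-(dMij ** a))."""
    d = minkowski_distance(x, y, M)
    return Ri * Cj * base ** (-(d ** a))


# At distance 0 the fitted frequency is just the product of the multipliers.
print(fitted_frequency(2.0, 3.0, (0.0, 0.0), (0.0, 0.0)))
# With base 2, a distance of 1 halves the frequency whatever the attenuation.
print(fitted_frequency(2.0, 3.0, (0.0, 0.0), (1.0, 0.0), a=3.0, base=2.0))
```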

AUTOMATING THE PROCESS 

The implementation of these strategies is computationally intense: too costly for our predecessors, increasingly accessible to us. It leads to more ambitious appetites for what can be measured, to rules that can be stated, applied to data, and tested, and to a class of ‘methods’ that are half methods and half theory, able to extract more order from our data than was previously known to exist.

What are the nuts and bolts of implementing this approach to previously unmeasured behavior? This section provides the result of a cognitive walkthrough that will include step by step instructions to create an SCM.

In the SCM process the analyst will start by inputting a one mode network with attributes or a two mode network into the system, and then using the automated workflows which will direct the analyst through the creation, analysis and visualization, and forecasting phases for the SCM (see Figure 11). Users with multiple data sources will enter two or more data matrices and will then be routed through the enhancement workflow (see Figure 12). In Figure 11, SCM creation is a transformation and editing process by which the source data, in whatever form it is in, is converted into a frequency table. Note that if the input is a binary actor x topic matrix (M) then this is created by multiplying M by its transpose, and then generally removing the diagonal. But as will be seen, depending on the nature of the data this may be more complex.

Figure 11. High level workflow for SCMs. SCM relevant technologies are in red. Note the assessment and visualization technologies mostly exist and are in ORA and will be re-used. This workflow hides what we estimate to be about 50 low level tasks that are currently not automated, but that will be automated in the proposed system.

Figure 12. High level conceptualization of enhancement. The enhancement section of the process will be automatically called when the user enters or selects two or more actor x topic matrices. Specialized analysis and visualization tools will be used to support comparing and contrasting the original SCMs and the unified enhanced SCM. This will involve presenting the SCM visualization with annotations, creating a specialized report that provides information on the fit of the SCM to the raw data, key nodes, etc. And, this will involve the ability to not just spatially visualize the SCM but to visualize the difference in two SCMs or overlay them.

Step 0: Keys in identifying attributes for creating an SCM from node by attribute data

In this step the user chooses what they want to use as nodes and attributes. The nodes are the entities that will be displayed when the SCM is visualized. If this is actor x topics then the nodes might be actors. The attributes are the things that are, in a binary sense, true or not true of that node, e.g., whether or not they are concerned with that topic. When we are referring to the raw data we will use the term question, and reserve the term attribute for the final binary indicator.

For this process, the user starts with the raw data. We assume that the raw data is already in ORA and that each question from a questionnaire or variable from a coding scheme is its own column. We assume that the nodes of interest are the rows. Note – the original raw data can be very messy and different solutions may be needed for each attribute in the raw data. For example, education may be a single attribute but it is categorical with the categories less than high school, high school, college, Ph.D., other. This could be converted to 5 binary attributes. In contrast, in the raw data identity might be coded across many different attributes such as Sunni, Sufi, Christian… and the node may have a score on each. In this case in the SCM creation process each of these original attributes might be saved as the binary attributes. If the raw data is originally binary, e.g. Column ?? “are you Sunni”, you can use it to produce 1 attribute (maybe 2 if non-response is a 2nd). And, of course, the raw data could be continuous.

The user has many choices here:

• How are non-responses or missing data treated? Should they become their own attribute? And if so, is there one such attribute per original question with missing data or non-response? Note – the default is to ignore this and treat non-response or missing data as a 0.

• Does the question have a binary response? In this case, it is already an attribute and used as is. Or, if there is missing data or non-response it can be converted into two attributes.

• Does the question have a categorical response? In this case each category becomes an attribute (default) or alternatively the user can choose to create an attribute that is 1 if the categorical response was greater than the mean or mode and 0 otherwise.

• Does the question have a continuous response? In this case a binary attribute is created by putting a 1 if the response is >= the mean or else 0 (default). Option 1 – allow the user to define categories of responses that form an attribute, e.g. if < 18 then attribute youth = 1 else 0, if 19-30 then attribute young adult = 1 else 0, and so on.

• Are there meta-attributes created by combining answers to questions – e.g. < 18 and Sunni?

In general, we assume the user has an actor by attribute matrix. But it could be any node class with a set of attributes. We use actor by attribute to describe the process. The walkthrough revealed that it does not make sense to automate the process of defining attributes – but automation should provide some support tools. In ORA when there are node attributes we will let the user select a set of these and then have ORA create the “binary” file needed for SCM.

Case 1: the attributes are binary or can be converted to binary. If not, the user needs to convert them to binary.

To do this, first select the set of attributes. Then for each attribute do the following. If it is already binary but text, convert to binary numeric. Let the user choose which string will be a 1 and which 0. If it is not binary and not text, tell the user they cannot use it. If it is binary and numeric use as is. If it is categorical, allow for three options: convert each category to its own attribute, let the user select a category, collapse the categories to binary using >= mean is 1 and else 0. If continuous then convert to binary by if value is >= mean then 1 else 0.

In an actor by attribute file – the cells are binary – i.e. the actor either has the attribute or they do not. Note missing data can be treated as an attribute. Make this an option for the user. In this step the SCM does not need to distinguish between missing data and 0’s.
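The default conversions described in Case 1 might be sketched as follows. This is an illustrative sketch only, not ORA code: the function names are hypothetical, missing values are represented by `None`, and missing data is treated as 0 as in the stated default.

```python
from statistics import mean


def binarize_continuous(values):
    """1 if value >= mean of the observed values, else 0; missing (None) -> 0."""
    cutoff = mean(v for v in values if v is not None)
    return [1 if (v is not None and v >= cutoff) else 0 for v in values]


def binarize_categorical(values):
    """Turn each category into its own binary attribute; missing (None) -> 0."""
    categories = sorted({v for v in values if v is not None})
    return {c: [1 if v == c else 0 for v in values] for c in categories}


def binarize_text(values, one_label):
    """Binary text answers: the user picks which string counts as 1."""
    return [1 if v == one_label else 0 for v in values]


ages = [10, 20, 30, None]
education = ["college", "high school", "college", None]
print(binarize_continuous(ages))
print(binarize_categorical(education))
print(binarize_text(["yes", "no", "yes"], one_label="yes"))
```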


Case 2: the attributes are not binary and are to be left as non-binary. This is an extension that will not be initially dealt with.

Step 1: SCM Binary input is selected

Get the actor by attribute file

For Syria this is (this is the identity question) 800 people by 17 attributes (11 identities and 6 educational levels)

For 9-11 this is 18 people by 26 attributes

Case 1: Cells are binary - such as that shown in Figure 1.

Case 2: Cells are non-binary

We consider this an extension and will not handle it in V1. We will consider this in V2.

Then given a binary matrix use it to generate a frequency table. Note, at this point the user can enter the process by selecting a two mode network that is binary or can be converted to binary by setting each cell by: if value is >= network mean then 1 else 0. In addition, at this point the matrix should be saved as a two mode network.

Step 2: Create the Frequency Table

• Input is a binary matrix. This may be a two mode network that is binary (e.g., actor by knowledge) or a node set and the set of binary attributes from the previous step (e.g., actor by attribute). A third option is if you have a set of three factors, e.g. actors’ responses about education and actors’ responses about identity. In this case education might become the nodes and identity the other set of nodes.

• Output is a frequency table (e.g., actor by actor). In the case of option 3 just described the frequency table is two mode (e.g. education by identity).

• Or you can skip this step and use as frequencies a one or two mode network that already exists. Note - it is possible to use 0’s and 1’s in a binary one-mode network as if they were frequencies, where “1” might indicate a high (but unmeasured) frequency of friendly behavior between two nodes.

Can’t just do simple matrix multiplication

• If this were done then the diagonal would have to be converted to 0 but only for square matrices.


• You need to keep track of missing data for later analysis – use the ORA missing value number – something like -9.9999999999999999

• If A is the 800 x 17 matrix then AA’ is not what is wanted.

• A’A is attributes by attributes

– Identities x identities - In this case you want to just look at identities; you would remove the rows and columns for education - probably would just start with 800 x 11 – shared identities

– Education x education - If you just wanted education you remove rows and columns for identity – probably would just start with the 800 x 6 – if you zero the diagonal it is empty

– Relation of identity to education – this is 11 x 6 - this is the upper right of the 17 x 17
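The A’A construction and its blocks can be illustrated on a toy matrix. This is a minimal sketch in plain Python; the 4 x 4 matrix (two identity columns followed by two education columns) is invented for illustration and is not the Syrian data.

```python
def attribute_cooccurrence(A, zero_diagonal=True):
    """Return A'A: how many rows (people) share each pair of attributes."""
    columns = list(zip(*A))
    freq = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns]
            for ci in columns]
    if zero_diagonal:
        for k in range(len(freq)):
            freq[k][k] = 0
    return freq


# Hypothetical data: 4 people; columns 0-1 are identities, 2-3 education levels.
A = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
]
freq = attribute_cooccurrence(A)

# Upper-right block of A'A: identity (rows) by education (columns).
identity_by_education = [row[2:] for row in freq[:2]]
# Upper-left block: identity by identity; empty here once the diagonal is
# zeroed, because each person holds exactly one identity.
identity_by_identity = [row[:2] for row in freq[:2]]
print(identity_by_education)
print(identity_by_identity)
```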

Step 3: Check to make sure the attributes are not mutually exclusive and if they are fix it

If the attributes are such that they are mutually exclusive then

• If all are mutually exclusive it will generate missing data cells in the frequency table – and the entire matrix is blank. Just code this as the ORA missing value number – something like -.9999999999999999. Note that 0 can be a correct real value in this procedure, but negative values cannot show up.

• If only some are mutually exclusive this will generate a 0 cell that will distort the final map as it impacts the goodness of fit. All 0 cells must be resolved. There are two approaches to resolving this:

1) don’t create such categories. In this case tell the analyst which categories created the problem and see if they want to remove the category, or select another binarization approach, or add the categories.

2) if you need such a category the cell needs to be marked as missing data before mapping. Then mark the cell as missing – use -9.999999999999999999999

Result – a well formed frequency table. An example of which is shown in Figure 6.
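The second approach (marking structurally empty cells as missing) might look like the sketch below. The sentinel value here is a stand-in for whatever missing-value number ORA actually uses, and the function names are hypothetical.

```python
MISSING = -9.9999  # stand-in sentinel; the real ORA missing-value number may differ


def find_zero_cells(freq, skip_diagonal=True):
    """List (row, column) positions holding 0, optionally ignoring the diagonal."""
    cells = []
    for i, row in enumerate(freq):
        for j, value in enumerate(row):
            if skip_diagonal and i == j:
                continue
            if value == 0:
                cells.append((i, j))
    return cells


def mark_missing(freq, cells, missing=MISSING):
    """Return a copy of the table with the given cells flagged as missing."""
    marked = [list(row) for row in freq]
    for i, j in cells:
        marked[i][j] = missing
    return marked


freq = [[0, 3, 0],
        [3, 0, 2],
        [0, 2, 0]]
problem_cells = find_zero_cells(freq)   # report these to the analyst first
print(problem_cells)
print(mark_missing(freq, problem_cells))
```

Reporting `problem_cells` before marking them gives the analyst the chance to take the first approach (removing or redefining the categories) instead.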

If you are using as input a two mode network, instead of a node set by attribute matrix – you also want to make sure that no two columns are mutually exclusive. (This is assuming that the rows are what you are forming the SCM on and you are treating the column nodes and links to those as attributes.)

Note, in V2 you want ORA to store the choices on how the attributes were created so that the user can later go back and try a different approach. For example, it is not uncommon for the user to redefine which attributes to include – such as including or not including DRUZE in the Syrian data, or segmenting DRUZE into two different binary attributes based on whether they had high or low education.

Step 4: Check to make sure the frequency table is well formed and if not fix it

There are a bunch of checks here that are error checks to make sure that the frequency file that is sent to the next step is well formed, the right size, etc. Note, if the frequency table is generated via the ORA process – these checks should all pass easily. If the user is entering the SCM process at this point with a frequency table then it may not pass. Also if the user starts and restarts the SCM process and messes up the ORA generated frequency file then it may not pass. A frequency file from ORA’s perspective is just a weighted network. There are 3 types of frequency files that should be allowed:

• Square symmetric, e.g. identity by identity – This is a one mode network.

• Square asymmetric, e.g. a directed relation such as mobility: occupation of father by occupation of son. This is a two mode network.

• Rectangular – e.g. identity by education. This is a two mode network.

So first the system needs to identify what type of network it is and the size of each mode. That should be reported to the user and the information used to choose the path through the SCM procedure.

Second, the user should be asked whether the SCM process should try to fit the diagonal. In general, the default is that for a one mode network the diagonal is not fit and indeed it should be zeroed out, and for a two mode network it is fit. The advanced option is to allow the user to choose.
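The type identification and the default diagonal choice could be sketched as below. This is an assumption-laden illustration: the function name and returned keys are hypothetical, and it classifies purely by shape and symmetry, so a two mode table that happens to be square and symmetric would need the user to override the result.

```python
def classify_frequency_table(table):
    """Report the table type, mode sizes, and the default diagonal choice."""
    n_rows, n_cols = len(table), len(table[0])
    if any(len(row) != n_cols for row in table):
        raise ValueError("frequency table is ragged")
    if n_rows != n_cols:
        kind = "rectangular"
    elif all(table[i][j] == table[j][i]
             for i in range(n_rows) for j in range(i)):
        kind = "square symmetric"
    else:
        kind = "square asymmetric"
    return {
        "kind": kind,
        "rows": n_rows,
        "columns": n_cols,
        # Default from the text: one mode -> do not fit; two mode -> fit.
        "default_fit_diagonal": kind != "square symmetric",
    }


print(classify_frequency_table([[0, 2], [2, 0]]))
print(classify_frequency_table([[5, 1], [3, 4]]))
print(classify_frequency_table([[1, 2, 3], [4, 5, 6]]))
```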

At any point in going through the SCM workflow the user should be able to back up to the prior step or quit.

Step 5. Set parameters for optimization and the Minkowski procedure 

Select Number of Dimensions 

First, the user needs to specify the number of dimensions, Ndim, of the desired solution. At this point the options are 1, 2 and 3 and Ndim = 2 is the default.

Select Function for Calculating Distance/Similarity 

Second, the user needs to specify the metric for measuring distance and whether the SCM procedure can optimize or change that metric. If it is optimized or changed it is done with Minkowski. Note a set of similarity/distance metrics should be provided, and the default is to use the Minkowski approach. This is choosing the way in which similarity/distance will be calculated.

For distance, dMij, the default is to use the Minkowski metrics with parameter M.

For 0 < M < 1, the name “metric” does not apply because the numbers for “distance” can violate the triangle inequality. In the help we will use the phrase “semi-distance” to name this thing. These semi-distances violate the triangle inequality, but this is probably peculiarly appropriate to network analysis. Given paths diverging from a center, the shortest distance may be through the center, rather than directly across the coordinate space. Therefore the indirect path, through a common center, may be shorter than a direct-looking path that tries to get between two nodes without going to the center.

In future we want to allow the user to select a set of metrics and then each is run in parallel. But this can cause a problem with local minimization.

Using the Minkowski distance is the default. Advanced options are:


Cauchy Option 

Using the base model to calculate the frequency matrix is the default. An advanced option is to try instead the Cauchy form:

$$F_{ij} = \frac{R_i \, C_j}{1 + \left( d_{Mij} \right)^{a}}$$

This model worked spectacularly well on 4 examples from one of Green’s books on correspondence analysis. Oddly, it hasn’t worked well elsewhere, although I rarely try it.

This model is to the base model as “fat tailed” probability distributions are to the Gaussian. It is an option, but will rarely be used.

Other distance metrics 

These are to be determined. These may not require a Minkowski parameter.

Select the Minkowski parameter – M 

By default a Euclidean space is assumed and M is equal to 2. The user can choose to alter it. By default constrain M > 0. As an advanced option, allow the user to choose whether or not to enforce this constraint.

Determine the number of multipliers and coordinates 

Third, the system needs to set how many multipliers and coordinates are needed so that the ensuing system will produce the correct number.

• For square symmetric the row multipliers and coordinates are the same as for the column. Thus if there are 4 rows/columns there are 4 multipliers and 4 coordinates.

• For square asymmetric the multipliers can be different but the coordinates are the same. In this case if there are 4 rows/columns you will have either 4 or 8 multipliers and 4 coordinates

– You would want the multipliers to be different if there is a logical reason why the rows and columns are facing different issues. Otherwise you want them to be the same. So ask the user what is the case.

• For rectangular – the row multipliers and coordinates are different from the column’s. So if you have 4 rows and 6 columns you want 10 multipliers and 10 coordinates.

• Note – how many “numbers” there are to a coordinate depends on the dimensions – so if there are 10 coordinates but 1 dimension there are 10 numbers, if 2 dimensions 10 pairs of numbers or 20 numbers, and if 3 dimensions 10 triplets of numbers or 30 numbers.
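The counting rules in the list above can be written down directly. The sketch below is illustrative; the function name, the `kind` strings and the `separate_multipliers` flag (the answer to the question put to the user in the asymmetric case) are hypothetical.

```python
def parameter_counts(kind, n_rows, n_cols, n_dim=2, separate_multipliers=False):
    """Count multipliers, coordinates, and the numbers needed for coordinates."""
    if kind == "square symmetric":
        multipliers, coordinates = n_rows, n_rows
    elif kind == "square asymmetric":
        multipliers = 2 * n_rows if separate_multipliers else n_rows
        coordinates = n_rows
    elif kind == "rectangular":
        multipliers = coordinates = n_rows + n_cols
    else:
        raise ValueError(f"unknown table type: {kind}")
    return {
        "multipliers": multipliers,
        "coordinates": coordinates,
        # Each coordinate is one number per dimension.
        "coordinate_numbers": coordinates * n_dim,
    }


print(parameter_counts("square symmetric", 4, 4))
print(parameter_counts("square asymmetric", 4, 4, separate_multipliers=True))
print(parameter_counts("rectangular", 4, 6, n_dim=3))
```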

Select whether to fit the diagonals 

If the diagonals are 0 do not fit, else fit them.


Step 6.  System determines whether it will fit the diagonal

• If the frequency file is square symmetric

– If the diagonal is 0 then you don’t try to fit it. Note this is the default.

– If the diagonal is not 0 then the user should be given the choice to fit it or not; e.g. if the diagonal is “different in kind” and not just the result of folding then don’t try to fit it. The ORA SCM process will know this if it has been used to create the frequency file.

• If the frequency file is square asymmetric, fit the diagonal if it is non-0, otherwise do not fit it.

• If the frequency file is rectangular always fit the diagonal
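These rules amount to a small decision function. The sketch below is one possible reading; the names are hypothetical, and treating an unanswered user choice as "do not fit" for a square symmetric table is an assumption based on the stated default.

```python
def should_fit_diagonal(kind, diagonal, user_choice=None):
    """Decide whether the diagonal cells take part in the fit."""
    has_nonzero = any(value != 0 for value in diagonal)
    if kind == "rectangular":
        return True
    if kind == "square asymmetric":
        return has_nonzero
    if kind == "square symmetric":
        if not has_nonzero:
            return False
        # Non-zero diagonal: the user decides; assumed default is not to fit.
        return bool(user_choice)
    raise ValueError(f"unknown table type: {kind}")


print(should_fit_diagonal("square symmetric", [0, 0, 0]))
print(should_fit_diagonal("square symmetric", [4, 2, 7], user_choice=True))
print(should_fit_diagonal("square asymmetric", [4, 0, 7]))
print(should_fit_diagonal("rectangular", [0, 0]))
```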

Step 7. Generate multipliers and coordinates

These are generated automatically by the SCM process. The user is not involved directly.

This step basically sets the initial values, as the optimizer will change them.

Step 8.  Calculating the Chi-Square

In the next step, an optimization function is run to minimize the chi-square. This chi-square is based on a table of data. The data is the frequency table from step 2 that has been checked through steps 3 and 4. The “Observed” in the chi-square are the cells in the frequency table. The role of “Expected” values is taken by the values predicted by the SCM model.

Note – in later variants we might try things other than a chi-square.
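A minimal sketch of this calculation follows, assuming the usual Pearson form in which each cell contributes (observed – expected)² / expected. The optional `include` mask is a hypothetical device for skipping cells that are missing or not being fit.

```python
def chi_square(observed, expected, include=None):
    """Sum (observed - expected)^2 / expected over the cells being fit."""
    total = 0.0
    for i, (obs_row, exp_row) in enumerate(zip(observed, expected)):
        for j, (o, e) in enumerate(zip(obs_row, exp_row)):
            if include is not None and not include[i][j]:
                continue  # cell excluded (e.g. missing data or unfit diagonal)
            total += (o - e) ** 2 / e
    return total


observed = [[10, 20], [30, 40]]   # cells of the frequency table
expected = [[10, 25], [20, 40]]   # values predicted by the model
print(chi_square(observed, expected))
print(chi_square(observed, expected, include=[[True, True], [False, True]]))
```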

Step 9.  Optimize the fit of the SCM

The input is the options just identified in steps 5 and 6, the frequency file, and the initial values for the multipliers and coordinates.

The goal is to generate a least chi-square for the fitted value for the cells being examined. This does not satisfy the mathematical properties of chi-square; we are just using it as a convenience. So we should put a warning about that in the interface. Further, it is slightly difficult to calculate the degrees of freedom. For the initial tool we will not even try. For V2 we will include this calculation. With NR row multipliers and CR column multipliers there are NR + CR – 1 of them that count against the degrees of freedom. Further, it actually matters whether or not the space is Euclidean with attenuation equal to 2. In this case fewer of the parameters are independent. See the section on determining the number of coordinates and multipliers.

The fit is a function of the multipliers, the coordinates, the Minkowski parameter, and the attenuation, given a model.

Optimization is used to select the multipliers, the coordinates (and maybe the Minkowski parameter and the attenuation) that give the least chi-square. Optimization is a multi-step process. The goal of the optimization is to minimize the chi-square.

Why? Heuristically it works. Would a pure optimization approach be better? It is not clear, as one needs to consider rate of convergence and so the speed of the overall system. The idea is to use this approach and then experiment with alternatives.

Heuristics Method: pick 4 possible values for Minkowski and 4 for attenuation – then for each of these 16 cells run the optimizer and find the multipliers and coordinates that minimize the chi-square

• The values to use as a default for Minkowski are .7, 1, 2, 3, and infinity. As an option allow the user to set their own values to try and there can be any number of these between 0 and infinity.

• The values to use as a default for attenuation are 1, 2, infinity and .7. As an option allow the user to set their own values to try and there can be any number of these between 0 and infinity.

• Good results often have attenuation being equal to the Minkowski value minus 1.

Select the one of these that led to the minimum and then run a second optimization where it changes all of the multipliers, the coordinates, the Minkowski and the attenuation. If the operation is fast to run, calculate all 16 cells, pick the best as the starting point and move out from there. If it is slow to run, start out only checking the cases: Minkowski 2 attenuation 1, Minkowski 1 attenuation .7 and see which is better – then move out from those in the direction that makes it better.
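The first stage of this heuristic (run the optimizer at each fixed combination and keep the best) can be sketched as a plain grid search. In the sketch, `fit` stands for "optimize multipliers and coordinates at fixed M and a, and return the resulting chi-square"; the toy function used in its place here is invented purely so the example runs.

```python
from itertools import product


def grid_search(fit, m_values, a_values):
    """Evaluate fit(M, a) on every combination; return the best and all results."""
    results = sorted((fit(M, a), M, a) for M, a in product(m_values, a_values))
    return results[0], results


def toy_fit(M, a):
    """Stand-in for the real optimizer run; smallest near M = 1, a = 3."""
    return (M - 1.0) ** 2 + (a - 3.0) ** 2 + 2.0


best, results = grid_search(toy_fit,
                            m_values=(0.7, 1.0, 2.0, 3.0),
                            a_values=(0.7, 1.0, 2.0, 3.0))
chi2, M, a = best
print(f"start the second optimization from M={M}, a={a} (chi-square {chi2:.2f})")
```

The best cell would then seed the second optimization, in which the Minkowski parameter and the attenuation are also free to change.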

A possible optimizer to use is the simulated annealer. Note everything should be written in C++ like the rest of the tool.

Options for Optimizing the SCM   

Log2 or ln 

Use e as the default, and 2 as an option in the basic model. Note, this may not be the best default and alternatives should be explored. The issue is that if you use the same parameters you can compare absolute distances. If we’ve given them to different bases, there is going to be confusion. (If the distances are short, you get higher frequencies, even if those distances look exactly the same (proportional to) longer distances in a different map.) A default of 2 is nice because regardless of the a parameter, a distance of 1 will give half the frequencies found at distance 0.
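As a worked check of the half-frequency remark, assume the basic model has the form F(d) = R C b^(–d^a) with base b (our reading of the definitions; the symbols F(d) and b are introduced here only for illustration). Then

$$\frac{F(1)}{F(0)} = \frac{b^{-1^{a}}}{b^{0}} = b^{-1}$$

so with b = 2 the ratio is 1/2 for every a > 0, while with b = e it is e^{-1} ≈ 0.368 and the frequency instead halves at d = (ln 2)^{1/a}, a distance that depends on the attenuation.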


Multipliers 

The multipliers should be set so that the geometric mean of the row multipliers is equal to the geometric mean of the column multipliers, where the geometric mean of n multipliers is the nth root of the product of the multipliers. This is the default.

The user can select this option. The option is to use an “All multiplier” applicable to ‘all’ cells, which would allow the row multipliers to be standardized to the geometric mean 1, ditto for the column multipliers.
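Both standardizations leave every product of a row multiplier and a column multiplier unchanged, so the fitted frequencies are not affected. The following sketch shows one way to do each; the function names are hypothetical and all multipliers are assumed to be positive.

```python
import math


def geometric_mean(values):
    """nth root of the product of n positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def balance_multipliers(rows, cols):
    """Default: rescale so row and column geometric means are equal."""
    shift = math.sqrt(geometric_mean(cols) / geometric_mean(rows))
    return [r * shift for r in rows], [c / shift for c in cols]


def split_all_multiplier(rows, cols):
    """Option: pull out one 'All multiplier'; rows and columns get mean 1."""
    g_rows, g_cols = geometric_mean(rows), geometric_mean(cols)
    return g_rows * g_cols, [r / g_rows for r in rows], [c / g_cols for c in cols]


rows, cols = [2.0, 8.0], [1.0, 1.0]
balanced_rows, balanced_cols = balance_multipliers(rows, cols)
all_mult, unit_rows, unit_cols = split_all_multiplier(rows, cols)
print(balanced_rows, balanced_cols)
print(all_mult, unit_rows, unit_cols)
```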

Coordinates 

When M is 2 you are in the Euclidean case. If M = 2 then set the unweighted mean of the row coordinates to 0 (in each dimension). Same for the column coordinates. Note that, at M = 2, the effect that the row coordinates have on the model is invariant under an additive transformation. Ditto for column coordinates. Hence this standardization. At M = 2 for the row, only the intervals among rows matter. Ditto for columns. (Do not alter their scale, just their means.)

If M is not equal to 2, then the coordinates in any one dimension, both row coordinates and column coordinates, have to be treated together. In this case, subtract the unweighted mean of their coordinates from their coordinate – translating them together as a set. Do not alter their scale, just their joint mean.
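The two centering rules can be sketched as follows. Coordinates are represented as tuples with one number per dimension; the function name is hypothetical and the sketch only translates points, never rescales them.

```python
def center_coordinates(row_coords, col_coords, M):
    """Center rows and columns separately when M == 2, jointly otherwise."""
    n_dim = len(row_coords[0])

    def mean_by_dim(points):
        return [sum(p[k] for p in points) / len(points) for k in range(n_dim)]

    def shift(points, offset):
        return [tuple(p[k] - offset[k] for k in range(n_dim)) for p in points]

    if M == 2:
        return (shift(row_coords, mean_by_dim(row_coords)),
                shift(col_coords, mean_by_dim(col_coords)))
    joint = mean_by_dim(list(row_coords) + list(col_coords))
    return shift(row_coords, joint), shift(col_coords, joint)


rows = [(0.0, 0.0), (2.0, 2.0)]
cols = [(4.0, 4.0)]
print(center_coordinates(rows, cols, M=2))   # separate means removed
print(center_coordinates(rows, cols, M=1))   # one joint mean removed
```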

Minkowski power 

This helps to define frequency

• User selects

– Starting value and allows the system to improve. As noted above the default is to use the 4 pre-defined values.

– What functional model to use:

• Do anything

• Select the function from a list of those available

Frequency Attenuation 

• This controls what power of the semi-distance is used for attenuation of the frequency

• User selects

– Starting value and allows the system to improve. As noted above the default is to use the 4 pre-defined values.

– What functional model to use:

• Do anything

• Select the function from a list of those available

Functional models 

These are used as the options for the Minkowski power and the frequency attenuation.

The functional models to make available are:

• E to the attenuated distance

• Inverse power law

• Correspondence analysis

– (Side note – use the correspondence analysis already written in ORA)

• Cauchy

• Others will be added in the future

Additional Option 

Allow the user to turn off the attempt to calculate Minkowski power or frequency attenuation. By default this calculation is turned on.

Allow the user to let the Minkowski generate a 0 distance. By default a 0 distance is not allowed. Note – we may need to set this differently depending on the functional model. The distance can be controlled by not letting the optimizer set certain values or by using starting points in conjunction with an optimizer that can never reach 0.

Note a help file should contain the information in these steps but as explanations.

Calculate Error 

This is the chi-squared error. If the frequency table is square symmetric then calculate error on only the upper triangle. Else calculate error on the entire table.
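Selecting the cells that contribute to the error might be sketched as below. The function name and the `include_diagonal` flag are hypothetical; excluding the diagonal by default reflects the earlier default of not fitting it for a one mode table.

```python
def error_cells(n_rows, n_cols, symmetric, include_diagonal=False):
    """List the (row, column) cells over which the error is accumulated."""
    if symmetric:
        start = 0 if include_diagonal else 1
        return [(i, j) for i in range(n_rows) for j in range(i + start, n_cols)]
    return [(i, j) for i in range(n_rows) for j in range(n_cols)]


# An 18 x 18 symmetric table has 153 cells above the diagonal.
print(len(error_cells(18, 18, symmetric=True)))
# A rectangular 11 x 6 table uses every cell.
print(len(error_cells(11, 6, symmetric=False)))
```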

Optimization Routine 

It is very likely for this system, regardless of the optimizer chosen, not to settle into a single final value. In the end we may want to allow for multiple optimization approaches. However, in phase 1, rather than annealing, try this simple heuristic. It is like an annealer but without the cost function and the re-starts. Just simple hill climbing.

For each of 1, 2 and 3 dimensions - Set three initial values for the row and column coordinates, row and column multipliers based on the constraints in the document and for other variables. Note that there can be lots of parameters (including each coordinate). So if you have P parameters, “all combinations” is going to be 3^P which could quickly be very large. Scalability needs to be checked for V2. If you are going to do a full evaluation of the table for each combination, the evaluation is the time burner, so we should check options in parallel.

Run all combinations.

Rank order the results in terms of the fit of the chi-square. Find the set of these parameters that gives the best fit for that dimension.

Now going through the variables in this order - multipliers then coordinates - slightly raise and lower the value while holding the other parameters constant. Within this "snowball" set - take the new value if the fit is better than the original. Continue to quiescence or 10 steps, whichever is quicker.
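The refinement stage (nudge one parameter at a time, keep improvements, stop at quiescence or after 10 passes) can be sketched as a simple coordinate-wise hill climb. The parameter vector is assumed to be ordered with multipliers first and coordinates after; the fixed step size and the toy error function are assumptions made so the example runs.

```python
def hill_climb(params, error, step=0.05, max_sweeps=10):
    """Raise/lower each parameter in turn, keeping any change that lowers error."""
    current = list(params)
    best = error(current)
    for _ in range(max_sweeps):
        improved = False
        for k in range(len(current)):          # multipliers first, then coordinates
            for delta in (step, -step):
                trial = list(current)
                trial[k] += delta
                trial_error = error(trial)
                if trial_error < best:
                    current, best, improved = trial, trial_error, True
                    break
        if not improved:                       # quiescence: no parameter helped
            break
    return current, best


def toy_error(p):
    """Stand-in for the chi-square; minimized at (0.1, -0.1)."""
    return (p[0] - 0.1) ** 2 + (p[1] + 0.1) ** 2


start = [0.0, 0.0]
fitted, final_error = hill_climb(start, toy_error)
print(fitted, final_error)
```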

Make it possible for the user to view the plots for how the fit changed as one or two parameters of interest changed.

As a side note on optimization, if there is no parallelization you can do this one parameter at a time as each parameter may impact only a small number of cells in the SCM. In contrast, downhill simplex (with or without annealing) is always a full evaluation simultaneously changing all parameters. It has different costs.

Step 10: Visualization

Once chi-square is minimized, then you produce a visualization. This is done using the ORA visualization tool for grid-based visualization. The minimum chi-square is considered the best fit. An example of the visualization is shown in Figure 7.

Add a report that provides:

a) The values for all the Minkowski parameters.

b) The amount of error

c) Attempt to calculate the degrees of freedom and print that

d) Calculate the standardized chi-square – this is a convention. Ideally the chi-square is equal to the degrees of freedom. Print the standardized chi-square

e) Do a t-test to compare the square root of two of degrees of freedom and the chi-square. Is it near 2? If so print out that this is a good fit.
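One possible reading of items d) and e) is sketched below. It relies only on the standard facts that a chi-square variable with df degrees of freedom has mean df and variance 2·df; taking the standardized chi-square to be chi-square divided by df, and calling the fit good when the deviation is within 2 standard deviations, are our assumptions rather than the report's stated procedure. The function and key names are hypothetical.

```python
import math


def fit_summary(chi2, df):
    """Summarize a fit from its chi-square and (estimated) degrees of freedom."""
    standardized = chi2 / df                 # ideally close to 1
    z = (chi2 - df) / math.sqrt(2 * df)      # deviation in standard-deviation units
    return {
        "chi_square": chi2,
        "degrees_of_freedom": df,
        "standardized_chi_square": standardized,
        "z": z,
        "good_fit": abs(z) < 2,              # assumed threshold
    }


print(fit_summary(100.0, 100))
print(fit_summary(200.0, 100))
```

Because the report itself notes that the fitted statistic does not satisfy the mathematical properties of chi-square, any such summary is a convenience rather than a formal test.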

EXTENSIONS 

How far does this strategy go toward introducing numerical variables and testable hypotheses?

Beyond height-weight, a pedagogical “workhorse” of the statistics trade, and beyond this terrorist network, how far does this strategy take us into solutions for not-yet-measured variables? We suggest that this approach will have value in a wide number of areas, such as:

• to a re-conceptualization of sociological/political surveys as networks

• to pharmacology where it detects a dimension within the relation between drugs (functional groups) and biological effect

• to an examination of social structure and stability within the Syrian Opposition,

• to network analysis where it provides an objective goodness of fit with which to evaluate competing ‘visualizations’ of a network,

• to text analysis, and the analysis of real world budgets.