pivotal greenplum textgptext.docs.pivotal.io/archives/gptext-docs-330.pdf · 2020. 8. 14. · in...
Pivotal Greenplum® Text
Version 3.3.0
User Guide
Rev: 01
© 2019 Pivotal Software, Inc.
Table of Contents
Pivotal® Greenplum® Text 3.3.0 Documentation
Pivotal® GPText 3.3.0 Release Notes
Installing GPText
Upgrading GPText
Introduction to Pivotal GPText
Administering GPText
GPText High Availability
GPText Best Practices
Troubleshooting Hadoop Connection Problems
Working With GPText Indexes
Querying GPText Indexes
Customizing GPText Indexes
Working With GPText External Indexes
Natural Language Processing with GPText Indexes
GPText Function Reference
GPText Management Utilities
GPText and Solr Data Type Mappings
GPText Schema Tables
GPText Configuration Parameters
© Copyright Pivotal Software, Inc, 2013-2019
Pivotal® Greenplum® Text 3.3.0 Documentation
GPText Documentation PDF
Pivotal GPText 3.3.0 Release Notes
Installing Pivotal GPText
Upgrading Pivotal GPText
Using Pivotal GPText
GPText References
Additional Resources
Pivotal Greenplum Database
Apache Solr Web Site
Apache MADlib
http://docs-gptext-staging.cfapps.io/archives/GPText-docs-320.pdf
http://docs-gptext-staging.cfapps.io/330/topics/
http://docs-gptext-staging.cfapps.io/330/topics/FuncRef_preface.html
http://gpdb.docs.pivotal.io
http://lucene.apache.org/solr/
http://madlib.apache.org/
Pivotal® GPText 3.3.0 Release Notes
This document contains release information for Pivotal GPText 3.3.0.
Released: June 2019
About Pivotal GPText
Pivotal GPText joins the Greenplum Database massively parallel-processing database server with Apache SolrCloud enterprise search and the Apache MADlib Analytics Library to provide large-scale analytics processing and business decision support. GPText includes free text search as well as support for text analysis.
GPText includes the following features:
The GPText database schema provides in-database access to Apache Solr indexing and searching
Build indexes with database data or external documents and search with the GPText API
Custom tokenizers for international text and social media text
A Universal Query Processor that accepts queries with mixed syntax from supported Solr query processors
Faceted search results
Term highlighting in results
Natural language processing, including part-of-speech tagging and named entity extraction
Greater emphasis on high availability
The GPText management utility suite includes command-line utilities to perform the following tasks:
Start, stop, and monitor ZooKeeper and GPText nodes
Configure GPText nodes and indexes
Add and delete replicas for index shards
Back up and restore GPText indexes
Recover a GPText node
Expand the GPText cluster by adding GPText nodes
Prerequisites
Installing GPText also installs Apache SolrCloud and, optionally, Apache ZooKeeper.
The following are GPText installation prerequisites.
GPText runs on Red Hat Enterprise Linux 5.x, 6.x, and 7.x.
GPText runs on Greenplum Database version 4.3.6 or higher, Greenplum Database 5, or Greenplum Database 6. Greenplum Database 6 requires at least GPText 3.3.
GPText requires Java 8, OpenJDK 8, Java 11, or OpenJDK 11 to be installed on each host in the Greenplum Database cluster. Add the JRE bin directory to the PATH on all hosts in the cluster.
Install and configure your Greenplum Database system before you install GPText. See the Pivotal Greenplum Database Installation Guide at https://gpdb.docs.pivotal.io.
Ensure that nc (netcat) is installed on all Greenplum cluster hosts (sudo yum install nc).
Installing lsof on all cluster hosts is recommended (sudo yum install lsof).
GPText cannot be installed onto a shared NFS mount.
GPText nodes can be installed on the Greenplum Database cluster hosts alongside the Greenplum segments or on additional, non-database hosts accessible on the Greenplum cluster network. All hosts participating in the GPText system must have the same operating system and configuration and have passwordless ssh access for the gpadmin user. See the Pivotal Greenplum Database Installation Guide for instructions to configure hosts.
If you plan to place GPText nodes on the Greenplum Database segment hosts, ensure that you reserve memory for GPText use when you configure Greenplum Database. To determine the memory to set aside for GPText, multiply the number of GPText nodes to create on each Greenplum segment host by the JVM maximum size. Subtract this memory from the physical RAM when calculating the value for the Greenplum Database gp_vmem_protect_limit server configuration parameter. See the Greenplum Database server configuration parameter gp_vmem_protect_limit in the Greenplum Database Reference Guide for recommended memory calculation formulas or visit the GPDB Virtual Memory Calculator website (http://greenplum.org/calc/).
Apache Solr requires a ZooKeeper cluster with a minimum of three nodes (five nodes recommended). You can install a "binding" ZooKeeper cluster with GPText on the Greenplum cluster hosts, or you can use an existing ZooKeeper cluster. When deployed alongside Greenplum Database segments, ZooKeeper performance can be affected under heavy database load. For best performance, install a ZooKeeper cluster on separate hosts with network connectivity to the Greenplum network.
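The memory reservation rule in the prerequisites can be sketched as simple arithmetic. The function names and the sample figures (2 nodes per host, a 2048 MB JVM maximum, 65536 MB of physical RAM) are illustrative assumptions, not values from GPText; the result is only the input to the gp_vmem_protect_limit formulas, not the parameter value itself.

```shell
# Hypothetical helper: memory (MB) to set aside for GPText on one segment host.
# $1 = GPText nodes planned per host, $2 = JVM maximum size per node in MB.
gptext_reserved_mb() {
    nodes_per_host=$1
    jvm_max_mb=$2
    echo $(( nodes_per_host * jvm_max_mb ))
}

# Hypothetical helper: RAM (MB) left to feed into the gp_vmem_protect_limit formula.
# $1 = physical RAM in MB, $2 = memory reserved for GPText in MB.
ram_for_greenplum_mb() {
    physical_ram_mb=$1
    reserved_mb=$2
    echo $(( physical_ram_mb - reserved_mb ))
}

reserved=$(gptext_reserved_mb 2 2048)
echo "Reserve for GPText: ${reserved} MB"
echo "RAM for Greenplum calculation: $(ram_for_greenplum_mb 65536 "$reserved") MB"
```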
New Features and Enhancements in GPText 3.3.0
Using GPText 3.3 with Greenplum Database 6.0
GPText 3.3 can be installed on a Greenplum Database 6 system with Java 8 or Java 11.
A GPText binary distribution has been added to Pivotal Network (https://network.pivotal.io/products/pivotal-gpdb) for Red Hat 7/CentOS 7 with Greenplum Database 6.
Using GPText with Greenplum Database 6 differs from using it with earlier Greenplum Database releases in the following ways:
The custom_variable_classes server configuration parameter has been removed in Greenplum Database 6. With earlier Greenplum Database versions, it was necessary to add 'gptext' to this parameter in order to set GPText configuration parameters. Greenplum Database 6 allows you to set configuration parameters in a database session without declaring a variable class.
In Greenplum Database 4 and 5, the default output format for the binary data type bytea is the PostgreSQL escape format, a sequence of ASCII characters with escape sequences where bytes cannot be represented with ASCII. In Greenplum Database 6, the default output format is the hex format, which represents each byte with hexadecimal digits. In Greenplum Database 5, the hex output format can be specified by setting the bytea_output configuration parameter to hex. To produce the same output in Greenplum Database 4, 5, and 6, you can set the bytea_output configuration parameter to escape.
Custom Configuration Directory
A new optional installation parameter, GPTEXT_CUSTOM_CONFIG_DIR, can be set in the gptext_install_config file to specify a directory to store custom configuration files.
By default, GPText saves custom configuration files under the $GPTEXTHOME/share/ directory on each Solr host, for example $GPTEXTHOME/share/external_ .
To specify a different directory to store external configuration files, before you run the GPText installer, uncomment the GPTEXT_CUSTOM_CONFIG_DIR parameter in the gptext_install_config file and specify the full path to the directory. For example:
GPTEXT_CUSTOM_CONFIG_DIR="/home/gpadmin/config_dir"
The gpadmin user must have the OS permissions required to create the directory.
If the parameter is set, the GPText installer will create the custom configuration directory on every Solr host. Configuration files you upload using the gptext-external upload command will be stored under this directory on every Solr host to allow Solr to access the external document source from every host.
For example, if the GPTEXT_CUSTOM_CONFIG_DIR parameter is set to /home/gpadmin/config_dir when you install GPText, an s3 configuration with the name s3_conf will be saved in the directory /home/gpadmin/config_dir/external_source/s3/s3_conf on each host.
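The s3 example above suggests a directory layout of the form custom directory, then external_source, then source type, then configuration name. The sketch below composes such a path. The function name is hypothetical, and the assumption that other source types follow the same layout as the s3 example is not confirmed by this document.

```shell
# Hypothetical helper: build the directory where an uploaded external
# configuration would be stored, generalizing the s3 example in the text.
# $1 = GPTEXT_CUSTOM_CONFIG_DIR value, $2 = source type, $3 = configuration name.
external_config_path() {
    config_dir=$1
    source_type=$2
    config_name=$3
    printf '%s/external_source/%s/%s\n' "$config_dir" "$source_type" "$config_name"
}

external_config_path /home/gpadmin/config_dir s3 s3_conf
# prints: /home/gpadmin/config_dir/external_source/s3/s3_conf
```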
New Features and Enhancements in GPText 3.2.0
The GPText 3.2.0 release provides the following features and enhancements.
Lemmatization
GPText 3.2.0 enables lemmatizing terms in GPText indexes. You can define Solr analysis chains that include the Apache OpenNLP parts-of-speech filter and the new GPText WordNet Lemmatizer filter, which replaces terms with the root form of the term. The WordNet Lemmatizer filter uses a lexical database from the Princeton University WordNet® project to determine the root form.
GPText Configuration Files Location
GPText now saves the configuration files gptext.conf, gptxtenvs.conf, and zookeeper.conf only in the Greenplum Database master and standby master directories. The gptext.conf file is no longer saved in each segment data directory.
Flexible Sharding
By default, GPText creates one Solr index shard for each Greenplum Database primary segment. You can now specify a smaller number of shards by setting the gptext.idx_num_shards parameter to the number of shards you want before you create the index. This works for both regular GPText indexes and external indexes.
When gptext.idx_num_shards is set to the default (0), GPText configures the index to use the Solr implicit router, with one shard per Greenplum Database segment. When the gptext.idx_num_shards parameter is changed to the number of shards desired, GPText creates the index using the Solr compositeId router to route documents to shards. The compositeId router does not support duplicate IDs, so if you set the if_check_id_uniqueness argument to false when you call the gptext.create_index() function, the implicit router is used and the index will have one shard per Greenplum Database segment.
The content_id column is removed from the output of the gptext.index_status() and gptext.index_summary() functions, since Greenplum Database segments are not always associated with a single index shard.
See Specifying the Number of Shards for more information about this feature.
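The routing rules described for flexible sharding can be restated as a small decision function. This is only a sketch of the rules as written above, not GPText code; the function name, argument order, and the sample segment count of 16 are assumptions for illustration.

```shell
# Sketch of the router selection rules described above (not GPText code).
# $1 = gptext.idx_num_shards value
# $2 = if_check_id_uniqueness argument passed to gptext.create_index() (true/false)
# $3 = number of Greenplum Database primary segments
# Prints the router name and the resulting shard count.
describe_index_routing() {
    num_shards=$1
    check_unique=$2
    primary_segments=$3
    if [ "$num_shards" -eq 0 ] || [ "$check_unique" = "false" ]; then
        # Default setting, or duplicate IDs allowed: one shard per segment.
        echo "implicit $primary_segments"
    else
        echo "compositeId $num_shards"
    fi
}

describe_index_routing 0 true 16    # prints: implicit 16
describe_index_routing 4 true 16    # prints: compositeId 4
describe_index_routing 4 false 16   # prints: implicit 16
```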
gptext-recover Utility
When using the -f (--force) option, the gptext-recover utility now verifies that there are no indexes in a red state before proceeding. If any index is down, the utility exits.
ZooKeeper Upgrade
Apache ZooKeeper included with GPText 3.2.0 has been upgraded to version 3.4.11. This ZooKeeper release includes bug fixes that resolve an inconsistent cluster issue with GPText (MPP-29742).
New Features and Enhancements in GPText 3.1.0
The GPText 3.1.0 release provides the following features and enhancements.
Improvements to aid in developing and testing analyzer chains
The new gptext.list_field_types() function lists the field types defined in the managed-schema configuration file for an index.
The new gptext.get_field_type() function displays the index and query analyzer chains for a field type in JSON format.
The new gptext.analyzer() function shows the index or query analyzer chain output for a given field type and input text. This function is useful for testing and debugging analyzer chains interactively without modifying the index.
Part-of-speech tagging and named entity recognition
GPText includes OpenNLP libraries and analyzer classes to classify indexed terms' parts-of-speech (POS), and to recognize named entities, such as the names of persons, locations, and organizations (NER). GPText saves NER terms in the field's terms vector, prepended with a code to identify the type of entity recognized. This allows searching documents by entity type.
The new gptext.ner_terms() function lists NER-tagged terms for documents that match a query.
GPText includes the OpenNLP models for the English language. You can download models for other languages from the OpenNLP web site and use them with GPText.
Other enhancements and fixes
The first argument of the gptext.terms() function, an anytable data type, has been made optional.
Fixed an error where the gptext.partition_status() function displayed partition information for an index after it was dropped.
Apache Solr updated to Solr version 7.3
GPText 3.1.0 includes Apache Solr 7.3. See the following release documents for information about the Solr 7.3 release.
Apache Solr 7.3 Upgrade Notes (https://lucene.apache.org/solr/guide/7_3/solr-upgrade-notes.html)
Apache Solr 7.3 Release Highlights (https://wiki.apache.org/solr/ReleaseNote73)
The following are GPText changes and Solr usage notes related to the Solr 7.3 upgrade.
GPText server-side components are rebuilt and tested with the new Solr JAR files.
The managed-schema, solrconfig.xml, and other collection configuration files are updated.
The top-level element in solrconfig.xml is now officially deprecated in favor of the equivalent syntax. This element has been out of use in default Solr installations for several releases already.
The legacyCloud parameter now defaults to false. If an entry for a replica does not exist in state.json, that replica will not be registered. This may affect users who bring up replicas and expect them to be registered automatically as part of a shard. It is possible to revert to the old behavior by setting the property legacyCloud=true in the cluster properties by running the following command in the GPText installation directory:
$ ./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181 -cmd clusterprop -name legacyCloud -val true
With earlier Solr releases, if you drop an index while a Solr node with a replica of the index is down, when the down node comes back online, the index comes back and cannot be deleted. Solr 7 fixes this bug. The GPText workaround for this bug is removed.
PointFields are the default numeric types. Solr has implemented *PointField types across the board, to replace Trie*-based numeric fields. All Trie* fields are now considered deprecated, and will be removed in Solr 8. If you are using Trie* fields in your schema, you should consider moving to PointFields as soon as feasible. Changing to the new PointField types will require you to re-index your data.
The following spatial-related fields have been deprecated:
LatLonType
GeoHashField
FieldType
SpatialTermQueryPrefixTreeFieldType
Use one of these field types instead:
LatLonPointSpatialField
SpatialRecursivePrefixTreeField
RptWithGeometrySpatialField
To improve parameter consistency in the Collections API, the parameter name fromNode for the MOVEREPLICA command and the parameter names source and target for the REPLACENODE command have been deprecated and replaced with sourceNode and targetNode. The old names will continue to work for backwards compatibility, but they will be removed in Solr 8.
The replica core name has changed from _shard#_replica# to _shard#_replica_# . For example, demo.wikipedia.articles_shard0_replica1 becomes demo.wikipedia.articles_shard0_replica_n1.
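Scripts that parse replica core names may need to account for the new naming. The sketch below rewrites an old-style name into the form shown in the example above. The function name is hypothetical, and the fixed "n" prefix is an assumption taken from that single example; a script that must be reliable should read the actual core names from the cluster instead.

```shell
# Hypothetical helper: convert an old-style replica core name
# (..._shard0_replica1) to the form shown in the example (..._shard0_replica_n1).
# Names that are already in the new form are printed unchanged.
new_replica_core_name() {
    printf '%s\n' "$1" | sed -E 's/_replica([0-9]+)$/_replica_n\1/'
}

new_replica_core_name demo.wikipedia.articles_shard0_replica1
# prints: demo.wikipedia.articles_shard0_replica_n1
```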
New Features and Enhancements in GPText 3.0.0
GPText 3.0.0 allows adding documents stored in Amazon Web Services S3 buckets to a GPText external index. This enhancement includes changes to enable uploading AWS credentials to ZooKeeper and support for the s3 document source type for the gptext.external_login(), gptext.external_logout(), gptext.index_external(), and gptext.index_external_dir() GPText functions.
The gptext-state utility with the --index (-i) option now includes the date and time the GPText index was last modified.
New Features and Enhancements in GPText 2.4.0
GPText 2.4.0 allows adding documents stored in an authenticated FTP server to a GPText external index. This enhancement includes changes to add support for the ftp type to the gptext-external upload command-line utility and the gptext.external_login(), gptext.external_logout(), gptext.index_external(), and gptext.index_external_dir() GPText functions.
New Features and Enhancements in GPText 2.3.1
The gptext-backup command-line utility can now back up GPText indexes to local GPText cluster storage as well as to a directory on a shared drive. For local backups, backup metadata and the index configuration files are backed up to the Greenplum Database master data directory and index shards are backed up in the segment data directories on each host.
The gptext-backup utility has a new option to back up just the index configuration files from ZooKeeper, with no index data.
The gptext-restore utility is updated to restore backups created on local cluster storage.
The gptext-restore utility has a new option to restore only the configuration files from a backup. This option loads the configuration files into ZooKeeper and creates an empty GPText index.
New Features and Enhancements in GPText 2.3.0
Revised gptext-config Utility Syntax
The gptext-config command-line utility was revised to have a more user-friendly syntax.
A new list subcommand was added to gptext-config that you can use to list all of the configuration files for a specified GPText index.
$ gptext-config list -i
Index Documents in a Hadoop File System (hdfs) Document Source
GPText 2.3.0 enables you to add documents stored in an hdfs system to a GPText external index.
The new gptext-external command-line utility uploads Hadoop configuration and authentication files to a named configuration in ZooKeeper. The utility has subcommands upload, list, and delete to manage the configurations you have uploaded.
The new gptext.external_login() function logs in to the hdfs system using the named configuration you have uploaded. You can log in to only one external document source at a time.
Use URLs of the form hdfs:// with the gptext.index() and gptext.index_external() functions to add documents to a GPText external index.
Use the new gptext.index_external_dir() function to add all documents in an hdfs directory to a GPText external index.
Log out of the hdfs external document source with the new gptext.external_logout() function.
See Authenticating with an External Document Source for steps to enable access to an hdfs document source.
Known Issues
See the Apache Jira (https://issues.apache.org/jira/projects/SOLR/summary) for known issues in Apache Solr.
The following are known issues in GPText. Workarounds are provided when available.
Wildcards in GPText Search Options
Solr does not return all fields when the fl Solr search option contains a wildcard that matches field names. For example, given a table with columns contenta and contentb, specifying fl=contenta,contentb,sum(1,1) correctly returns three fields. Specifying fl=cont*,sum(1,1) correctly returns contenta and contentb, but omits the pseudo-field sum(1,1).
Specifying a wildcard to match all fields (fl=*,sum(1,1)) also omits the pseudo-field.
Index Load Failure After Configuration File Error
If Solr fails to load an index because of a configuration file error, and then the index is dropped without first correcting the configuration file error, the index cannot be recreated until GPText is restarted. This can happen if you edit managed-schema or solrconfig.xml and introduce an XML syntax error or a typo in configuration values.
Workaround:
1. When an index fails to load, check the Solr log to find the cause.
2. If the cause is a configuration file error, such as invalid XML, use the gptext-config utility to edit the file and fix the error. Dropping the index without first correcting the error is not recommended.
3. If you have dropped an index that failed to load without first correcting the cause of the failure, you must restart GPText before you can recreate the index. Run gptext-start -r to restart GPText.
Startup Failure with Large Numbers of Indexes
When there is a large number of Solr cores, SolrCloud can fail to restart successfully, with error messages indicating failure to elect leaders for shards. This is a known Solr issue; see https://issues.apache.org/jira/browse/SOLR-5990 in the Apache Solr Jira for an example. Because of this issue, avoid designing GPText applications that create large numbers of indexes, shards, and replicas. The number of cores you can create before you observe this behavior is hardware dependent, so you should test to determine your system's limits. You can create and successfully operate a larger number of indexes than can be restarted successfully later, so be sure to test restarting GPText to determine a practical limit.
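When planning a restart test, a rough core count can help. The sketch below assumes that each replica of each shard is one Solr core, so the total is indexes multiplied by shards per index and by replicas per shard. The function name and the sample figures are illustrative assumptions; the document gives no threshold, so the result is only a number to compare across tests.

```shell
# Hypothetical estimate of the Solr core count, assuming one core per
# replica of each shard: indexes x shards per index x replicas per shard.
estimate_solr_cores() {
    indexes=$1
    shards_per_index=$2
    replicas_per_shard=$3
    echo $(( indexes * shards_per_index * replicas_per_shard ))
}

estimate_solr_cores 50 16 2   # prints: 1600
```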
Setting GPText Configuration Parameters Without First Setting custom_variable_classes
In Greenplum Database versions before Greenplum Database 6, if the custom_variable_classes Greenplum Database server configuration parameter does not include the value "gptext", attempting to set a GPText configuration parameter returns an error message, for example:
mydb-# set gptext.replication_factor=4;
WARNING: Please logon again to make GUC setting take effect. (GucValue.h:301)
WARNING: Please logon again to make GUC setting take effect. (GucValue.h:301)
ERROR: unrecognized configuration parameter "gptext.replication_factor"
In GPText 2.0, in addition to the error message, the value of the configuration parameter persisted in ZooKeeper is zero, replacing the previous value of the parameter.
mydb-# show gptext.replication_factor;
 gptext.replication_factor
----------------------------
 0
Beginning with GPText 2.1, the error message is still generated; however, the value saved in ZooKeeper is the value specified in the set command, 4 in the preceding example.
To prevent the error message, before setting any GPText configuration parameters, use the gpconfig command-line utility to set the custom_variable_classes configuration parameter:
$ gpconfig -c custom_variable_classes -v 'gptext'
In Greenplum Database 6.0, the custom_variable_classes configuration parameter is removed and custom parameters can be set without errors.
Installing GPText
Prerequisites
The GPText installation includes the installation of Apache SolrCloud and, optionally, Apache ZooKeeper.
If you are installing a new GPText release into an existing GPText system, follow the instructions in Upgrading GPText instead.
The following are GPText installation prerequisites.
Install and configure your Greenplum Database system, version 4.3.6 or higher. See the Pivotal Greenplum Database Installation Guide at https://gpdb.docs.pivotal.io.
GPText runs on Red Hat Enterprise Linux or CentOS 5.x, 6.x, or 7.x.
GPText cannot be installed onto a shared NFS mount.
Install a Java 8 (1.8) or Java 11 JRE on all hosts in the cluster.
Ensure that nc (netcat) is installed on all Greenplum cluster hosts (yum install nc).
Installing lsof on all cluster hosts is recommended (sudo yum install lsof).
GPText nodes can be installed on the Greenplum Database cluster hosts alongside the Greenplum segments or on additional, non-database hosts accessible on the Greenplum cluster network. All hosts participating in the GPText system must have the same operating system and configuration and have passwordless ssh access for the gpadmin user. See the Pivotal Greenplum Database Installation Guide for instructions to configure hosts.
If you plan to place GPText nodes on the Greenplum Database segment hosts, ensure that you reserve memory for GPText use when you configure Greenplum Database. To determine the memory to set aside for GPText, multiply the number of GPText nodes to create on each Greenplum segment host by the JVM maximum size. Subtract this memory from the physical RAM when calculating the value for the Greenplum Database gp_vmem_protect_limit server configuration parameter. See the Greenplum Database server configuration parameter gp_vmem_protect_limit in the Greenplum Database Reference Guide for recommended memory calculation formulas or visit the GPDB Virtual Memory Calculator website (http://greenplum.org/calc/).
Apache Solr requires a ZooKeeper cluster with a minimum of three nodes. You can install a "binding" ZooKeeper cluster with GPText on the Greenplum cluster hosts, or you can use an existing ZooKeeper cluster. When deployed alongside Greenplum Database segments, ZooKeeper performance can be affected under heavy database load. For best performance, install a ZooKeeper cluster with at least three nodes (five nodes recommended) on separate hosts with network connectivity to the Greenplum network.
Install the GPText Binary Distribution
1. On the Greenplum master host, extract the GPText distribution file. For example:
$ cd /home/gpadmin
$ tar xvfz greenplum-text--.tar.gz
This creates the directory greenplum-text-- containing the files gptext_install_config and the GPText installation binary, which has a name in the format greenplum-text--.bin .
2. If necessary, grant execute permission to the GPText binary. For example:
$ chmod +x /home/gpadmin/greenplum-text--.bin
3. If you are installing GPText in a directory that is only writable by root, such as the default directory /usr/local, perform these steps as root:
a. Source the greenplum_path.sh file in the Greenplum Database installation directory.
# source /usr/local/greenplum-db-/greenplum_path.sh
b. Locate or create a text file containing a list of the names of all hosts where you will install GPText, one per line, including the master and standby host names.
c. Start gpssh, specifying the text file with host names.
# gpssh -f hostlist.txt
d. Create the installation directory and the greenplum-solr directory and set the ownership and permissions. For example, if you are installing GPText in the default directory, /usr/local:
=> mkdir /usr/local/greenplum-text-
=> mkdir /usr/local/greenplum-solr
=> chown gpadmin:gpadmin /usr/local/greenplum-text-
=> chmod 775 /usr/local/greenplum-text-
=> chown gpadmin:gpadmin /usr/local/greenplum-solr
=> chmod 775 /usr/local/greenplum-solr
=> exit
e. Complete the remaining steps as the gpadmin user.
4. Edit the gptext_install_config file to set parameters for the installation. See Set Installation Parameters for details.
5. Run the GPText installation binary as gpadmin on the master server:
$ ./greenplum-text--.bin -c
6. Accept the Pivotal license agreement.
Optional Two-Part GPText Installation
The GPText two-part installation installs and deploys the GPText software in separate steps. This gives you the option to install the software files to a read-only, shared directory mounted on all GPText hosts in the cluster, rather than installing the software on every GPText host.
If you install the GPText software onto a shared drive, you must set the GPTEXT_CUSTOM_CONFIG_DIR parameter in the installation configuration file. This parameter specifies a writable directory that exists on every GPText host where GPText can store configuration files for external data sources. See GPText installation parameters for more information about this parameter.
Run the GPText installation in two parts by following the steps in this section.
1. Prepare GPText installation directories as described in steps 1 through 3 in Install the GPText Binary Distribution.
2. Run the GPText installation binary as gpadmin on the master server:
$ ./greenplum-text-.bin -b
Note that the -c option is omitted.
3. Source the GPText environment script in the GPText installation directory:
$ source /greenplum-text_path.sh
4. Edit the gptext_install_config file to set parameters for the GPText deployment. See Set Installation Parameters for details. Be sure to uncomment and set the GPTEXT_CUSTOM_CONFIG_DIR parameter if you installed the software on a read-only drive.
5. Deploy the GPText cluster with the gptext-deploy command. The command requires the -c option to specify the installation configuration file. Also include the -m option because you installed the GPText software to a shared drive mounted on all GPText hosts. If you do not include -m, gptext-deploy copies the GPText software to all GPText hosts.
$ gptext-deploy -m -c
Set Installation Parameters
A GPText configuration file named gptext_install_config contains parameters to configure the GPText installation. Edit the file and set the parameters as described in the following table.
The GPTEXT_HOSTS and DATA_DIRECTORY installation parameters determine the number of GPText nodes that are deployed. The number of directories included in the DATA_DIRECTORY array is the number of GPText nodes that are created per host.
The GPTEXT_HOSTS parameter determines the number of hosts. If set to the constant "ALLSEGHOSTS", the number of GPText node hosts is the same as the number of Greenplum segment hosts. If GPTEXT_HOSTS is set to an array of host names, the length of the array is the number of GPText node hosts.
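The relationship between these two parameters and the deployed node count can be sketched as hosts multiplied by data directories. The function name and the sample values (three hosts, two data directories) are assumptions for illustration; the installer performs its own calculation.

```shell
# Hypothetical helper: total GPText nodes = number of GPText hosts
# multiplied by the number of DATA_DIRECTORY entries.
# $1 = host count; remaining arguments = the DATA_DIRECTORY entries.
total_gptext_nodes() {
    host_count=$1
    shift
    echo $(( host_count * $# ))
}

# Three hosts with two data directories each.
total_gptext_nodes 3 /data/primary /data/primary   # prints: 6
```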
GPText installation parameters

GPTEXT_HOSTS
An array of host names on which to install GPText, or use the constant "ALLSEGHOSTS" to install GPText on all Greenplum Database segment hosts. GPText hosts must be passwordless ssh-accessible by the gpadmin user from all other hosts in the Greenplum cluster.
declare -a GPTEXT_HOSTS=(gptext_h1 gptext_h2 gptext_h3)
GPTEXT_HOSTS="ALLSEGHOSTS"

DATA_DIRECTORY
An array of directory paths where GPText data directories are to be created. The number of directories in the array determines the number of GPText nodes that will be created on each physical host. If GPTEXT_HOSTS lists multiple interfaces per host, the GPText nodes are spread evenly across the interface addresses.
declare -a DATA_DIRECTORY=(/data/primary /data/primary)

GPTEXT_CUSTOM_CONFIG_DIR
The path to a directory where GPText stores uploaded external data source configuration files and custom libraries. If you do not set this parameter, the default is to store these files in the share subdirectory of the GPText installation directory. If you do specify a directory with this parameter, the directory is created on every Solr host in the cluster, and external configuration files and custom libraries will be stored there, leaving the GPText installation directory free from application data.

JAVA_OPTS
Sets the minimum and maximum memory each SolrCloud JVM can use.
JAVA_OPTS="-Xms1024M -Xmx2048M"

GPTEXT_PORT_BASE
GP_MAX_PORT_LIMIT
Set a range of port numbers available to GPText nodes. GPText finds unused ports in the specified range.
GPTEXT_PORT_BASE=18983
GP_MAX_PORT_LIMIT=28983

ZOO_CLUSTER
Whether to deploy a GPText binding ZooKeeper cluster or use an existing ZooKeeper cluster. If set to "BINDING", the installation deploys a ZooKeeper cluster. To use an existing ZooKeeper cluster, set this parameter to a list of ZooKeeper nodes in the format "host1:port,host2:port,host3:port".
ZOO_CLUSTER="BINDING"

ZOO_HOSTS
If ZOO_CLUSTER is set to "BINDING", this parameter is an array of the hosts where the ZooKeeper nodes are to be installed. The array must contain 3, 5, or 7 host names, for example ZOO_HOSTS=(sdw1 sdw2 sdw3 sdw4 sdw5). If you are using a single host for ZooKeeper, specify it multiple times, for example, ZOO_HOSTS=(sdw1 sdw1 sdw1).
declare -a ZOO_HOSTS=(sdw1 sdw2 sdw3 sdw4 sdw5)

ZOO_DATA_DIR
The ZooKeeper data directory, required when ZOO_CLUSTER is set to "BINDING".
ZOO_DATA_DIR="/data/master/"

ZOO_GPTXTNODE
The node path in ZooKeeper for GPText. This parameter is required whether ZOO_CLUSTER is set to "BINDING" or a list of hosts.
ZOO_GPTXTNODE="gptext"

ZOO_PORT_BASE
ZOO_MAX_PORT_LIMIT
A range of port numbers to use for the ZooKeeper cluster. Unused ports are allocated from within this range. The range must contain at least 4000 port numbers.
ZOO_PORT_BASE=2188
ZOO_MAX_PORT_LIMIT=12188

GPTEXT_JAVA_HOME
The home directory of the Java installation to run for ZooKeeper and Solr processes. If not set, the JRE specified in the PATH and JAVA_HOME environment variables will be used.
GPTEXT_JAVA_HOME=/usr/java/jdk1.8.0_131

The maximum number of GPText nodes is the number of Greenplum Database primary segments. The best practice recommendation is to deploy fewer GPText nodes with more memory rather than to divide the memory available to GPText among the maximum number of GPText nodes allowed. For example, if there are eight primary segments per host in the Greenplum Database cluster, the maximum number of GPText nodes per host is eight, but you should test with two or four GPText nodes per host, adjusting the JAVA_OPTS installation parameter to divide the memory reserved for GPText among them.
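Two of the constraints in the parameter table lend themselves to a quick check before running the installer. The sketch below is not part of GPText; the function names are hypothetical, and counting the port range inclusively (both the base and the limit count as ports) is an assumption.

```shell
# Hypothetical pre-install check: ZOO_HOSTS must list 3, 5, or 7 host names.
# Pass the ZOO_HOSTS entries as arguments.
valid_zoo_host_count() {
    case $# in
        3|5|7) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical pre-install check: the ZooKeeper port range must contain
# at least 4000 port numbers. $1 = ZOO_PORT_BASE, $2 = ZOO_MAX_PORT_LIMIT.
zoo_port_range_ok() {
    base=$1
    max=$2
    [ $(( max - base + 1 )) -ge 4000 ]
}

if valid_zoo_host_count sdw1 sdw2 sdw3 sdw4 sdw5; then
    echo "ZOO_HOSTS count ok"
fi
if zoo_port_range_ok 2188 12188; then
    echo "ZooKeeper port range ok"
fi
```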
Starting GPText
First, make sure the GPText command-line utilities are in your path by sourcing the Greenplum Database and GPText environment scripts. It is important to source the GPText environment script each time you source the Greenplum Database script. For example:
$ source /usr/local/greenplum-db-/greenplum_path.sh
$ source /usr/local/greenplum-text-/greenplum-text_path.sh
To use GPText in a database, you must first use the gptext-installsql management utility to install the GPText user-defined functions and other objects in the database:
$ gptext-installsql database [database2 ...]
The GPText objects are created in the gptext schema.
The ZooKeeper cluster must be running before you start GPText. If you installed a bound ZooKeeper cluster, start it with the zkManager command-line utility.
$ zkManager start
Start GPText with the gptext-start utility.
$ gptext-start
Configure Greenplum Database
GPText configuration parameters are saved in ZooKeeper. You can, however, view and set GPText configuration parameters in a Greenplum Database session using the SHOW and SET commands.
If you are using Greenplum Database 4.3.x or 5.x, you must first declare the GPText custom variable class by adding it to the Greenplum Database custom_variable_classes configuration parameter. The custom_variable_classes parameter is removed in Greenplum Database 6, so this step is unnecessary if you have Greenplum Database 6.
The custom_variable_classes configuration parameter is a comma-separated list of class names. It is unset by default. To see if any custom variable classes have already been configured, run this gpconfig command at the command line.
$ gpconfig -s custom_variable_classes
If no custom variable classes have been set, set the parameter with the following command.
$ gpconfig -c custom_variable_classes -v 'gptext'
[gpadmin@gpsne ~]$ gpconfig -c custom_variable_classes -v 'gptext'
20171029:12:29:11:028199 gpconfig:gpsne:gpadmin-[INFO]:-completed successfully
If other classes have been configured, add gptext to the existing list, separated by a comma.
Run gpstop -u to have Greenplum Database reload the configuration file.
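Adding gptext to an existing class list can be sketched as a small string operation. The function name is hypothetical, "plr" stands in for any class name that might already be configured, and the sketch assumes the existing list contains no spaces around its commas. It only computes the new list value; passing that value to gpconfig -c custom_variable_classes is left to the reader.

```shell
# Hypothetical helper: return a custom_variable_classes value that includes
# gptext, leaving the list unchanged if gptext is already present.
# $1 = current comma-separated list (may be empty).
merge_custom_variable_classes() {
    current=$1
    case ",$current," in
        *,gptext,*) echo "$current" ;;      # already listed
        ,,)         echo "gptext" ;;        # nothing configured yet
        *)          echo "$current,gptext" ;;
    esac
}

merge_custom_variable_classes ""             # prints: gptext
merge_custom_variable_classes "plr"          # prints: plr,gptext
merge_custom_variable_classes "plr,gptext"   # prints: plr,gptext
```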
View or Set GPText Configuration Parameters
When you want to view or set GPText configuration parameters in a psql session, first execute the gptext.version() function to load the GPText configuration parameters into the session.
=# SELECT gptext.version();
            version
--------------------------------
 Greenplum Text Analytics 3.2.0
(1 row)

=# SHOW gptext.idx_delim;
 gptext.idx_delim
------------------
 ,
(1 row)
See Setting GPText Configuration Parameters for more about GPText configuration parameters.
Uninstalling GPText
To uninstall GPText, run the gptext-uninstall utility. You must have superuser permissions on all databases with GPText schemas to run gptext-uninstall.
gptext-uninstall runs only if there is at least one database with a GPText schema.
Execute:
$ gptext-uninstall
Upgrading GPText
Upgrading a GPText system to a new GPText release installs the new GPText software release on all hosts in the Greenplum cluster and then upgrades the GPText system.
Upgrading GPText and Greenplum Database at the Same Time
If you are upgrading to new releases of Greenplum Database and GPText at the same time, follow these steps:
1. Complete the Greenplum Database upgrade first and ensure the database is operational.
2. Run the GPText gptext-migrator utility to migrate your current GPText system to the newly upgraded Greenplum Database system.
3. Ensure that the current version of GPText works with the new Greenplum Database version.
4. Proceed with the GPText upgrade.
Upgrading a GPText Release
Upgrading a GPText release is a two-part process: install the new software release on the Greenplum cluster hosts and then upgrade the existing GPText system. The GPText installer performs the first part, installing the new software. The gptext-upgrade utility performs the second part, upgrading the current GPText system to the new version.
The GPText installer detects an existing GPText system and, after installing the new software release, offers to run the gptext-upgrade utility for you. If you choose to upgrade the GPText system later, you can run the gptext-upgrade utility yourself.
All upgrade tasks are executed on the Greenplum master host as the gpadmin user. The gpadmin user must have write permission in the directory where the new GPText release is to be installed, /usr/local/greenplum-text-- by default.
The Greenplum Database, ZooKeeper, and GPText clusters must be running. The procedure stops and restarts GPText during the upgrade.
Follow these steps:
1. Download the new GPText release for your platform from Pivotal Network (http://network.pivotal.io).
2. Extract the release package.
$ tar xfz greenplum-text--.tar.gz
3. Make sure that ZooKeeper and GPText are running.
$ gptext-state
4. Run the GPText installer.
$ ./greenplum-text--.bin
5. The installer prompts you to accept the Pivotal license agreement and to choose and create the installation directory.
6. The installer verifies the environment to ensure that prerequisites are present, such as Python and Java. If any problems are discovered, the installer outputs an error message and stops. Correct the problem identified by the message and run the installer again.
7. After the new software has been installed on the Greenplum cluster, the installer looks for an existing GPText installation. If an existing GPText system is found, the installer asks if you wish to upgrade GPText directly.
If you answer yes, the installer runs the gptext-upgrade script. The gptext-upgrade utility validates the environment to ensure it can complete the upgrade, then executes the upgrade and restarts the GPText system. If any problems are discovered, gptext-upgrade outputs a message and quits. Fix the indicated problems and run the gptext-upgrade utility (at /bin/gptext-upgrade ) to complete the GPText system upgrade. If you answer no, you must run the gptext-upgrade script after the installer completes. See the gptext-upgrade utility reference for instructions.
When upgrading GPText, you do not specify an installation configuration file as you do for the initial GPText installation.
Important: If you answer no, or if the gptext-upgrade utility quits without upgrading your software, follow these steps to re-run gptext-upgrade at a later time:
a. Source the greenplum-text_path.sh script in the old GPText installation directory. For example:
$ source /usr/local/greenplum-text-/greenplum-text_path.sh
b. Run the gptext-upgrade command from the new GPText installation directory:
$ /usr/local/greenplum-text-/bin/gptext-upgrade
8. After the upgrade has completed, source the greenplum-text_path.sh script in the new GPText release directory and run gptext-state healthcheck to verify the GPText system:
$ source /usr/local/greenplum-text-/greenplum-text_path.sh
$ gptext-state healthcheck
Introduction to Pivotal GPText
Pivotal GPText enables processing mass quantities of raw text data (such as social media feeds or e-mail databases) into mission-critical information that guides business and project decisions. GPText joins the Greenplum Database massively parallel-processing database server with Apache SolrCloud enterprise search. GPText includes powerful text search as well as support for text analysis. GPText supports business decision making by offering:
Multiple kinds of data: GPText supports both semi-structured and unstructured data searches, which greatly increases the kinds of information you can find.
Multiple document sources: GPText can index documents stored in Greenplum Database tables or documents retrieved from external stores, such as HTTP or FTP servers, Amazon S3, or Hadoop hdfs. Most document formats are recognized automatically.
Less schema dependence: GPText does not require static schemas to successfully locate information; schemas can change or be quite simple and still return targeted results.
Natural language text processing: GPText provides NLP capabilities with the integrated Apache OpenNLP toolkit.
Text analytics: You can use Apache MADlib in Greenplum Database for advanced machine learning, graph, statistics, and analytics.
This chapter contains the following topics:
GPText System Architecture
GPText Sample Use Case
GPText Workflow
Text Analysis
GPText System Architecture
GPText combines a Greenplum Database cluster with an Apache SolrCloud cluster. Greenplum Database segments and GPText nodes can be deployed on the same hosts or on different hosts with network connectivity.
The following figure shows the process architecture of the combined Greenplum Database and Apache Solr clusters. The figure shows four cluster nodes with four Greenplum segments and four Solr instances deployed on each. An Apache ZooKeeper service manages the SolrCloud cluster. ZooKeeper nodes are deployed on three of the four hosts. Greenplum Database users access SolrCloud services via GPText user-defined functions installed in Greenplum databases and command-line utilities.
The figure omits the Greenplum master host, secondary master, and mirror segments for the Greenplum primary segments.
The Greenplum segments, Solr instances, and ZooKeeper nodes may all be deployed on separate hosts on the same network, depending on application and performance requirements.
The following sections describe how GPText integrates SolrCloud with Greenplum Database and how the two clusters work together to provide parallel text search capabilities in Greenplum Database and maintain high availability.
Greenplum Database Cluster
A Greenplum Database cluster is comprised of the following components:
A master database instance, executing on a dedicated host, conventionally named mdw. (Not illustrated)
A secondary master instance, on a host conventionally named smdw, acting as a warm standby for the master instance. (Not illustrated)
An array of database primary segment instances and mirrors deployed on segment hosts, by convention sdw1 through sdwn. A segment instance is an independent Postgres database server managing a portion of the distributed data. Each segment has a mirror (not illustrated) on another host in the cluster to provide uninterrupted service in case of a segment or segment host failure. The number of primary segments per host is determined by the hardware configuration (the number and type of processor cores, the amount of physical RAM, local storage capacity, and network capacity) as well as availability and performance requirements.
The Greenplum Database master instance, which stores no user data, coordinates the work of the segment instances. Database users log in to the master instance and submit SQL queries. The master instance creates a plan for executing the query, distributes the work to the segments, and gathers and returns the results to the user.
Apache SolrCloud
Apache Solr is a server providing access to Apache Lucene full-text indexes. Apache SolrCloud is a highly available, fault tolerant cluster of Apache Solr servers. The term GPText cluster is another way to refer to a SolrCloud cluster deployed by GPText for use with a Greenplum Database system.
A SolrCloud cluster is comprised of the following components:
An Apache ZooKeeper cluster to manage the SolrCloud cluster. SolrCloud uses ZooKeeper to manage server and index configurations and to coordinate the cluster’s activities. GPText can install a ZooKeeper cluster that is bound to the GPText cluster, or it can share an existing ZooKeeper cluster. If GPText installs the ZooKeeper cluster, it can be managed using GPText functions and utilities. The ZooKeeper cluster can be deployed on Greenplum Database cluster hosts or, for best performance, on separate hosts accessible to the Greenplum Database cluster.
Multiple SolrCloud server instances deployed on the Greenplum segment hosts or on other hosts on the same network. Each instance is a JVM process running Solr server. SolrCloud instances use local storage, which may be the same local storage volumes that store Greenplum Database data. The number of SolrCloud instances per host can be the same as the number of Greenplum primary segments per host, but this is not a requirement. The number of instances to execute per host is specified during GPText installation.
GPText provides document indexing and search capabilities for Greenplum Database with user-defined functions (UDFs) that access Solr APIs from within database queries.
GPText UDFs perform the following tasks:
create and manage GPText indexes
provide status information about indexes
insert documents into indexes from database tables or, for GPText external indexes, from documents stored outside of Greenplum Database
search indexes
There are also GPText UDFs and command-line utilities to configure, monitor, and manage the SolrCloud cluster, and to manage replicas, SolrCloud’s high-availability mechanism. (More on replicas in the next section.)
Parallelism in GPText Indexing and Searching
SolrCloud distributes document indexes in slices called shards. Each shard is managed by a SolrCloud instance and ZooKeeper ensures that the shards are distributed evenly among the SolrCloud instances. The SolrCloud instances and Greenplum segments are not required to be on the same hosts.
With GPText, the default number of shards for an index is the number of Greenplum Database segments, so that each segment operates on an equal portion of the index. Optionally, a smaller number of shards can be specified when you create a GPText index, allowing indexing workloads to be scaled for performance requirements and resource usage.
High Availability for GPText Indexes
SolrCloud provides high availability by maintaining replicas of shards and providing automatic failover if a shard fails or becomes unavailable. One replica of each shard is the lead replica and any changes to it are applied to the other replicas. The replication factor, which determines the number of replicas to maintain for each shard, is set when the index is created. Replicas may also be added or dropped later using GPText UDFs or command-line utilities.
ZooKeeper determines the locations of shard replicas among the Solr nodes and hosts. When adding a replica using a GPText UDF or command-line utility, a new shard can be explicitly placed on a SolrCloud instance.
GPText Sample Use Case
Forensic financial analysts need to locate communications among corporate executives that point to financial malfeasance in their firm. The analysts use the following workflow:
1. Load the email records into a Greenplum database.
2. Create a Solr index of the email records.
3. Run queries that look for text strings and their authors.
4. Refine the queries until they pair a dummy company name with the top three or four executives corresponding about suspect offshore financial transactions. With this data, the analysts can focus the investigation on specific individuals rather than the thousands of authors in the initial data sample.
GPText Workflow
GPText works with Greenplum Database and Apache SolrCloud to store and index big data for information retrieval (query) purposes. High-level workflows include data loading and indexing, and data querying.
This topic describes the following information:
Data Loading and Indexing Workflow
Querying Data Workflow
Data Loading and Indexing Workflow
The following diagram shows the GPText workflow for loading and indexing data.
All client interaction with the system is through the Greenplum master instance.
1. Load data into your Greenplum Database system. Create a database table to hold data and then add the data to the table. Greenplum provides parallel data loading utilities and protocols that help to transform and load external data in various formats and from various sources. For details, see the Greenplum Database Administrator Guide, at http://gpdb.docs.pivotal.io . You can also create an external index for documents you retrieve from a web server, ftp server, Amazon S3, or hdfs. You can
2. Create and configure an empty GPText index. Use the gptext.create_index() user-defined function (UDF) to create an empty GPText index for a database table. GPText stores configuration files for the index in ZooKeeper.
3. Customize the index, if desired, by editing the index configuration files with the gptext-config command-line utility. You can customize the way document text is tokenized, filtered, and transformed before storing in the index and how query text is prepared to search the index.
4. Populate the index with data from the database table or external data source. Use the gptext.index() or gptext.index_external() UDF to add data to the index. These UDFs work by dispatching SQL queries to execute on each Greenplum segment. The segments execute the queries and add the results to the index using Solr APIs.
5. Commit changes to the index. Commit changes to the GPText index by calling the gptext.commit_index() UDF. Until the changes are committed, queries executed on the index cannot access any data added to the index with gptext.index(). If needed, uncommitted changes can be rolled back. SolrCloud replicates changes committed to the lead replica to the shards’ non-lead replicas.
Querying Data Workflow
The following diagram shows the high-level GPText query process workflow:
1. A user submits a SQL query designed to search the indexed data. A GPText search query is a SQL SELECT statement on a GPText search UDF that contains full-text search expressions.
2. The Greenplum master dispatches the query to the Greenplum Database segments.
3. Each segment executes the query, using the Solr API to search its index shard. Solr analyzes and executes the search query on the lead replica for the shard.
4. The Greenplum Database segments return the results of the search query to the Greenplum Database master.
5. The Greenplum Database master aggregates the results from all segments and returns them to the client.
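The dispatch-and-aggregate pattern in the steps above can be pictured with a small toy model. This is only a conceptual sketch in Python: the shard contents, the function names search_shard and search_all, and the hit-count ranking are all invented for illustration, and it does not call GPText or Solr or reproduce Solr's scoring.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for an index shard: document id -> document text.
Shard = Dict[int, str]


def search_shard(shard: Shard, term: str) -> List[Tuple[int, int]]:
    """Segment-side step: return (doc_id, hit_count) for matching documents in one shard."""
    hits = []
    for doc_id, text in shard.items():
        count = text.lower().split().count(term.lower())
        if count:
            hits.append((doc_id, count))
    return hits


def search_all(shards: List[Shard], term: str) -> List[Tuple[int, int]]:
    """Master-side step: gather the per-shard results and order them by hit count."""
    merged: List[Tuple[int, int]] = []
    for shard in shards:  # a real cluster would search the shards in parallel
        merged.extend(search_shard(shard, term))
    return sorted(merged, key=lambda hit: (-hit[1], hit[0]))


shards = [
    {1: "offshore transfer approved", 2: "lunch on friday"},
    {3: "offshore account offshore bank"},
]
print(search_all(shards, "offshore"))
```

Each shard is searched independently, so the work per shard shrinks as the index is split across more segments; only the merge happens in one place.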
Text Analysis
GPText enables analysis of Solr indexes with Apache MADlib, an open source library for scalable in-database analytics. MADlib provides data-parallel implementations of mathematical, statistical, and machine learning methods for structured and unstructured data. You can use GPText to perform a variety of MADlib analyses.
Learn more about Apache MADlib at http://madlib.apache.org . A gppkg package for MADlib is available on the Pivotal network at http://network.pivotal.io .
The Apache OpenNLP toolkit provides advanced machine learning tools for tokenizing, recognizing, and tagging natural language text that you can enable for GPText indexing and searching. See Natural Language Processing with GPText Indexes for more information.
Administering GPText
GPText administration includes security considerations, monitoring Solr index statistics, managing and monitoring ZooKeeper, and troubleshooting.
Viewing the Cluster Configuration
GPText deploys Apache ZooKeeper and Apache Solr nodes on hosts in your Greenplum Database network. Each node is a JVM server process listening for requests from other nodes. Use the gptext-state config command to list the host and port for each ZooKeeper and Solr node and the memory configuration for Solr nodes.
$ gptext-state configs
20181112:12:38:26:018080 gptext-state:mdw:gpadmin-[INFO]:-Execute GPText state ...
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Check zookeeper cluster state ...
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Cluster Configurations.
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:----------------------------------------------------------
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-JVM Min|Max    Xms1024M|Xmx2048M
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Node information
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:----------------------------------
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Host   Node Name         Port    Solr Dir
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw1   sdw1_solr:18983   18983   /data/gptext/solr0
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw1   sdw1_solr:18984   18984   /data/gptext/solr1
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw2   sdw2_solr:18983   18983   /data/gptext/solr0
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw2   sdw2_solr:18984   18984   /data/gptext/solr1
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Zookeeper information
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:----------------------------------
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Host   Port   Zookeeper Dir
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-mdw    2189   /data/zoo/zoo0
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw2   2189   /data/zoo/zoo0
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-sdw1   2189   /data/zoo/zoo0
20181112:12:38:27:018080 gptext-state:mdw:gpadmin-[INFO]:-Done.
You don’t need these details to use the GPText functions and utilities, but the information can be useful for monitoring and troubleshooting the cluster. For example, you can access the Solr Admin UI by browsing to the URL http://: on any Solr node. See Using the Solr Administration Interface (https://lucene.apache.org/solr/guide/7_3/using-the-solr-administration-user-interface.html#using-the-solr-administration-user-interface) for information about the Solr Admin UI.
Changing GPText Server Configuration Parameters
Configuration parameters used with GPText are built into GPText with default values. You set new values for the parameters in a Greenplum Database session using the SET command, the same way you set Greenplum Database session parameters. When you enter the SET command, GPText updates the value in ZooKeeper so that the change persists between database sessions.
With Greenplum Database 4.x and 5.x, a one-time Greenplum Database configuration change is needed so that Greenplum Database allows you to set and display GPText configuration parameters. Until you have performed this step, any attempt to set a GPText parameter results in an “Unrecognized configuration parameter” error. You must declare a custom variable class for GPText.
As the gpadmin user, enter the following commands in a shell:
$ gpconfig -c custom_variable_classes -v 'gptext'
$ gpstop -u
Once this step is completed, you can view and set GPText configuration parameters in psql.
The custom_variable_classes configuration parameter is removed in Greenplum Database 6. You can set custom variables in a database session without error, so this step is not needed for Greenplum Database 6.
To view GPText configuration parameters, you first need to fetch them from ZooKeeper into your Greenplum Database session by executing the gptext.version() UDF.
=# SELECT gptext.version();
                       version
------------------------------------------------------
 Greenplum Text Analytics 3.2.0
(1 row)
Then you can use the SHOW command to display values of the parameters, for example:
=# SHOW gptext.idx_num_shards;
 gptext.idx_num_shards
-----------------------
 0
(1 row)
See GPText Configuration Parameters for a complete list of configuration parameters.
GPText uses the current values of the configuration parameters when you create a new index, so changing a configuration parameter affects new indexes, but does not affect existing indexes.
Change the values of GPText configuration variables using the SET command in a session with a database that contains the GPText schema. The following example sets values for three configuration parameters in a psql session:
=# set gptext.idx_buffer_size=10485760;
SET
=# set gptext.idx_delim='|';
SET
=# set gptext.extension_factor=5;
SET
You can view the new value of a configuration parameter that you have set using the SHOW command:
=# show gptext.idx_delim;
 gptext.idx_delim
------------------
 |
(1 row)
Security and GPText Indexes
GPText security is based on Greenplum Database security. Your privileges to execute GPText functions depend on your privileges for the database table that is the source for the index. For example, if you have SELECT privileges for a table in the Greenplum database, then you have SELECT privileges for an index generated from that table.
Executing GPText functions requires one of OWNER, SELECT, INSERT, UPDATE, or DELETE privileges, depending on the function. The OWNER is the person who created the table and has all privileges. See the Greenplum Database Administrator Guide for information about setting privileges.
ZooKeeper Administration
Apache ZooKeeper enables coordination between the Apache Solr and Pivotal GPText distributed processes through a shared namespace that resembles a file system. In ZooKeeper, a node (called a znode) can contain data, like a file, and can have child znodes, like a directory. ZooKeeper replicates data between multiple instances deployed as a cluster to provide a highly available, fault-tolerant service. Both Solr and GPText store configuration files and share status by writing data to ZooKeeper znodes. GPText stores information in the /gptext znode. The configuration files for a GPText index are in the /gptext/configs/ znode.
The number of ZooKeeper instances in the cluster determines how many ZooKeeper node failures the cluster can tolerate and still remain active. The service remains available as long as a clear majority of the nodes in the cluster are running and able to communicate with each other. To tolerate a failure of n nodes the cluster must have 2n+1 nodes. A cluster of five nodes, for example, can tolerate two failed nodes.
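The sizing rule in this paragraph is simple arithmetic, and a short sketch can make it concrete. The Python helpers below are hypothetical names written for this illustration only; they are not part of GPText or ZooKeeper.

```python
def min_ensemble_size(failures_to_tolerate: int) -> int:
    """Smallest ZooKeeper cluster that survives the given number of failed nodes (2n + 1)."""
    if failures_to_tolerate < 0:
        raise ValueError("failures_to_tolerate must be non-negative")
    return 2 * failures_to_tolerate + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Largest number of failed nodes that still leaves a strict majority running."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return (ensemble_size - 1) // 2


for size in (3, 4, 5):
    print(size, "nodes tolerate", tolerated_failures(size), "failure(s)")
```

Under this rule a four-node cluster tolerates no more failures than a three-node cluster, which is why odd cluster sizes are the usual choice.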
ZooKeeper is very fast for read requests because it stores data in memory. If ZooKeeper begins to swap memory to disk, Solr and GPText performance will suffer and the services could experience failures, so it is critical to allocate sufficient memory to the ZooKeeper Java processes. To avoid ZooKeeper instances competing with Greenplum Database segments for memory, you should deploy the ZooKeeper instances and Greenplum Database segments on different hosts. The ZooKeeper and Greenplum Database hosts must be on the same network and accessible with passwordless SSH by the gpadmin user. You can use the Greenplum Database gpssh-exkeys utility to share SSH keys between ZooKeeper and Greenplum Database hosts.
You must start the ZooKeeper cluster before you start GPText. When you start GPText, the Solr nodes each load the replicas for indexes they manage. With large numbers of indexes, shards, and replicas, starting up the cluster can generate a very high, atypical load on ZooKeeper. It can take a long time to get all indexes loaded and some ZooKeeper requests may time out waiting for responses. Using the gptext-start --slow_start option starts Solr nodes one at a time, providing a more ordered start-up and limiting the number of concurrent ZooKeeper requests.
The GPText command-line utility zkManager can be used to monitor the ZooKeeper cluster. If the ZooKeeper cluster is bound to GPText, you can also start and stop the cluster using zkManager.
Checking ZooKeeper Status
Use the zkManager utility from the command line to check the ZooKeeper cluster status. The utility lists the hosts, ports, latency, and follower/leader mode for each ZooKeeper instance. If a node is down, its mode is listed as Down.
To check the ZooKeeper cluster status, run the zkManager state command.
$ zkManager state
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Execute zookeeper state process.
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Check zookeeper cluster state ...
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Host   port   Latency min/avg/max   Mode
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-gpdb   2189   0/0/22                follower
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-gpdb   2190   0/0/29                leader
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-gpdb   2188   0/0/27                follower
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Done.
In a database session, you can use the gptext.zookeeper_hosts() function to list the ZooKeeper hosts.
=# SELECT * FROM gptext.zookeeper_hosts();
  host  | port
--------+------
 gpdb51 | 2188
 gpdb51 | 2189
 gpdb51 | 2190
(3 rows)
Starting and Stopping the ZooKeeper Cluster
If the ZooKeeper cluster was installed by the GPText installer, the zkManager utility can start or stop the ZooKeeper cluster. To start the cluster, run the zkManager start command.
$ zkManager start
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-Execute zookeeper start process
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:------------------------------------------------
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-Starting Zookeeper:
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:------------------------------------------------
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-Host   Zookeeper Dir
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo0
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo1
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo2
20171016:16:14:48:017845 zkManager:gpdb:gpadmin-[INFO]:-Check zookeeper cluster state ...
20171016:16:14:53:017845 zkManager:gpdb:gpadmin-[INFO]:-Done.
To stop ZooKeeper, run the zkManager stop command.
$ zkManager stop
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-Execute zookeeper stop process.
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:------------------------------------------------
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-Stop Zookeeper:
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:------------------------------------------------
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-Host   Zookeeper Dir
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo0
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo1
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo2
20171016:16:14:09:016499 zkManager:gpdb:gpadmin-[INFO]:-Done.
See the zkManager reference for more information.
Checking SolrCloud Status
You can check the status of the SolrCloud cluster and indexes by running the gptext-state utility from the command line.
To check the state of the GPText nodes and each index, run the gptext-state utility with the -D ( --details ) option. Example:
$ gptext-state -D
20180615:16:09:24:031986 gptext-state:mdw:gpadmin-[INFO]:-Execute GPText state ...
20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-Check zookeeper cluster state ...
20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-Check GPText cluster status...
20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-Current GPText Version: 3.0.0
20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-All nodes are up and running.
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:------------------------------------------------
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-Index state details.
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:------------------------------------------------
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-database   index name                state
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-demo       demo.twitter.message      Green
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-demo       demo.wikipedia.articles   Green
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-Done.
This command reports the status of the GPText nodes and status of each GPText index.
Run gptext-state list to view just the indexes.
The gptext-state healthcheck command checks the GPText configuration files, the index status, required disk space, user privileges, and index and database consistency. By default, the required disk space check passes if there is at least 20% disk free. You can set a different disk free threshold using the --disk_free option. For example:
[gpadmin@gpdb-sandbox ~]$ gptext-state healthcheck --disk_free=25
20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Execute healthcheck on GPText cluster!
20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Check GPText config files ...
20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Check GPText index status ...
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Checking for required disk space...
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Checking for required user privileges ...
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Checking for indexes and database consistency ...
20160629:15:45:27:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
20160629:15:45:27:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Done.
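To get a feel for what a disk-free threshold of 20% or 25% means on a given host, the comparison can be sketched in a few lines of Python. This is an assumption-laden illustration: the function names are invented, and the source does not say how gptext-state measures free space or which file systems it inspects.

```python
import shutil


def free_percent(total_bytes: int, free_bytes: int) -> float:
    """Percentage of a volume that is free."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def passes_disk_check(total_bytes: int, free_bytes: int, threshold_percent: float = 20.0) -> bool:
    """True when at least threshold_percent of the volume is free (20 mirrors the documented default)."""
    return free_percent(total_bytes, free_bytes) >= threshold_percent


# Example: inspect the root file system of the local machine.
usage = shutil.disk_usage("/")
print("free: %.1f%%" % free_percent(usage.total, usage.free))
print("passes at 25%:", passes_disk_check(usage.total, usage.free, 25.0))
```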
See the gptext-state utility reference for additional options.
Recovering GPText Nodes
Use the gptext-recover utility to recover down GPText nodes, for example after a failed Greenplum Database segment host is recovered.
With no arguments, the gptext-recover utility discovers down GPText nodes and restarts them.
With the -f (or --force ) option, if a GPText node cannot be restarted and no shards are down, the node is deleted and created again on the same host. Missing replicas are added and the failed node and failed replicas are removed. If the index is in a red state, gptext-recover -f will print a message and exit.
The -H ( --new_hosts ) option allows recreating down GPText nodes on new hosts that replace failed hosts. The down GPText nodes are deleted and recreated on the new hosts. The argument to the -H option is a comma-separated list of the new hosts that are to replace the failed hosts. The number of new hosts must match the number of failed hosts. If shards are down, it advises reindexing. If only some replicas are down, it recreates the replicas on the new hosts and updates gptext.conf.
The -r option recovers replicas, but does not attempt to recover any down nodes.
Note: Before recovering GPText nodes on newly added hosts, ensure that the following GPText prerequisites have been installed on the host:
Java 1.8
Python 2.6
The Linux lsof utility
Viewing Solr Index Statistics
You can view Solr index statistics by running the gptext-state utility from the command line.
To list all GPText indexes, enter the following command at the command line:
gptext-state list
A command line that retrieves all statistics for an index:
gptext-state --index demo.wikipedia.articles
A command line that retrieves the number of documents in an index:
gptext-state --index demo.wikipedia.articles --stats_columns=num_docs
A command line that retrieves num_docs, index size, and the date and time last_modified:
gptext-state --index demo.wikipedia.articles --stats_columns num_docs,size,last_modified
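When these statistics are collected from a script, it can help to assemble the command line in one place. The sketch below is a hypothetical Python helper (stats_command is not a GPText function); it uses only the --index and --stats_columns options shown above and does not run the command or parse its output.

```python
import shlex
from typing import List, Sequence


def stats_command(index_name: str, columns: Sequence[str] = ()) -> List[str]:
    """Build an argument list for gptext-state that requests statistics for one index."""
    argv = ["gptext-state", "--index", index_name]
    if columns:
        # Columns are passed as one comma-separated value, as in the examples above.
        argv += ["--stats_columns", ",".join(columns)]
    return argv


argv = stats_command("demo.wikipedia.articles", ["num_docs", "size", "last_modified"])
print(shlex.join(argv))
```

The resulting list could be handed to subprocess.run on a host where the GPText utilities are on the PATH.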
Backing Up and Restoring GPText Indexes
With the gptext-backup management utility, you can back up a GPText index so that, if needed, you can quickly recover from a failure. The backup can be restored to the same GPText system or to another system with the same number of Greenplum Database segments.
The gptext-backup management utility backs up an index and its configuration files to either a shared file system, which must be mounted on and writable by each host in the Greenplum Database cluster, or to local storage on the Greenplum Database master and segment hosts.
Backing Up to a Shared File System
To back up on a shared file system, use the -p ( --path ) command-line option to specify the location of a directory on the mounted file system and the -n ( --name ) option to provide a name for the backup. Specify the index to back up with the -i ( --index ) option.
$gptext-backup-i-p--n
The gptext-backup utility then checks that:
the GPText cluster is up
the shared file system is valid
the backup name specified with the -n option does not already exist in the directory specified with the -p option
The utility creates the new directory and then saves one copy of each index shard to that directory, along with the index’s configuration files from ZooKeeper.
To save the configuration files only, with no data, add the -c ( --backup_conf ) command-line option.
To restore an index from a shared file system, use the gptext-restore management utility. The GPText system you restore to must be on a Greenplum Database cluster with the same number of segments. The database and schema for the index must be present.
The -i ( --index ) option specifies the name of the GPText index that will be restored. If the index exists, you must first drop it with the gptext.drop_index() user-defined function.
The -p ( --path ) option specifies the location of the directory containing the backup files, that is, the directory that gptext-backup created on the shared file system.
$ gptext-restore -i -p
You can add the -c option to restore only the configuration files to ZooKeeper and create an empty GPText index, without restoring any saved index data.
Backing Up to Local Storage
To back up to local storage on the Greenplum Database cluster, add the local keyword to the gptext-backup command line.
A local GPText backup has a unique name constructed by appending a timestamp to the index name. You do not use the -n option with local backups.
$ gptext-backup local -i
On the master host, in the master data directory by default, the backup utility saves a JSON file with backup metadata and a directory containing the index’s configuration files from ZooKeeper.
The utility backs up each index shard on the Greenplum Database segment host with the GPText node that manages the shard’s lead replica. By default, the shard backup files are saved in a segment data directory.
The gptext-backup command output reports the locations of all backup files.
You can add the -p ( --path ) option to the gptext-backup command to specify a local directory where the backup will be saved. The directory must be present on every Greenplum Database host and must be writable by the gpadmin user.
$ gptext-backup local -i -p
The backup files will be saved in the specified directory on each host instead of in the Greenplum Database master and segment data directories.
To restore a backup saved to local storage, add the local keyword to the gptext-restore command line and specify the path to the backup directory on the master host.
$ gptext-restore local -p
The -p argument is the full path to the directory the gptext-backup command created on the master host, including the timestamp, for example $MASTER_DATA_DIRECTORY/demo.twitter.message_2018-05-08T15:32:21.397779 .
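Because a local backup directory name combines the index name and a timestamp, a script that manages backups may want to split the two apart. The Python sketch below assumes the name always has the form shown in the example (index name, an underscore, then an ISO-style timestamp); that format is inferred from the single example above, and split_backup_name is a hypothetical helper, not a GPText utility.

```python
from datetime import datetime
from typing import Tuple


def split_backup_name(dirname: str) -> Tuple[str, datetime]:
    """Split '<index name>_<timestamp>' into the index name and a datetime."""
    # Split on the last underscore, since index names may contain underscores too.
    index_name, separator, stamp = dirname.rpartition("_")
    if not separator or not index_name:
        raise ValueError("no timestamp suffix found in %r" % dirname)
    return index_name, datetime.fromisoformat(stamp)


name, taken = split_backup_name("demo.twitter.message_2018-05-08T15:32:21.397779")
print(name, taken.isoformat())
```

Sorting backups by the parsed timestamp rather than by directory name avoids surprises when several indexes share a backup location.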
Seegptext-backupforsyntaxandexamplesforrunning gptext-backup .Seegptext-restoreforsyntaxandexamplesforrunning gptext-restore .
Expanding the GPText Cluster
The gptext-expand management utility adds GPText nodes to the cluster. There are two ways to add nodes:
- Add GPText nodes to existing hosts in the cluster. This option increases the number of GPText nodes on each host.
- Add GPText nodes to new hosts added by using the Greenplum Database gpexpand management utility to expand the Greenplum Database system.
Adding GPText Nodes to Existing Segment Hosts
To add nodes to existing segment hosts, run the gptext-expand utility with a command like the following:
gptext-expand -e -p /data1/nodes,/data2/nodes
This example adds two GPText nodes to each host.
The -e (--existing) option specifies that nodes are to be added to existing hosts.
The -p (--expand_paths) option provides a list of directories where the new nodes' data directories are to be created. These should be the same directories that contain the Greenplum Database segment data directories and existing GPText data directories. The number of directories in the list is the number of new nodes that are added.
A directory can be repeated in the directory list multiple times to increase the number of new GPText nodes to create. For example, if there is currently one GPText node per host in the /data1/nodes directory, you could add three nodes with a command like the following:
gptext-expand -e -p /data1/nodes,/data2/nodes,/data2/nodes
This adds one node to the /data1/nodes directory and two nodes to the /data2/nodes directory so there are two GPText nodes in each directory.
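The effect of repeating a directory in the -p list can be checked before running the utility. The following Python sketch only models the counting rule described above (one new node per listed directory). The nodes_after_expand function is a hypothetical helper, not a GPText API.

```python
from collections import Counter


def nodes_after_expand(existing, expand_paths):
    """Return the number of GPText nodes per directory after an expansion.

    existing:     dict mapping directory -> current node count on a host
    expand_paths: the comma-separated value passed to the -p option
    """
    totals = Counter(existing)
    # Each occurrence of a directory in the list adds one node there.
    totals.update(expand_paths.split(","))
    return dict(totals)


# One node already exists in /data1/nodes; add three more.
print(nodes_after_expand({"/data1/nodes": 1},
                         "/data1/nodes,/data2/nodes,/data2/nodes"))
```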
Adding GPText nodes affects new indexes, but not existing indexes. Replicas for new indexes will be distributed across all of the nodes, including both old nodes and the newly created nodes. Replicas for indexes that existed before running gptext-expand are not automatically moved. Rebalancing existing replicas requires reindexing.
Adding GPText Nodes to New Hosts
Check that the following GPText prerequisites are installed on each new host added to the Greenplum Database cluster:
- Java 1.8
- Python 2.6 or greater
- Linux lsof utility
New hosts must be reachable by all hosts in the GPText cluster, including existing hosts and the new hosts you are adding.
After expanding the Greenplum Database cluster with the gpexpand management utility, call gptext-expand with the -H (--new_hosts) option and a list of the new hosts on which to install GPText:
gptext-expand -H newhost1,newhost2
The gptext-expand utility installs GPText binaries on the new hosts and then creates new GPText nodes on the new hosts.
Expanding a Greenplum Database cluster increases the number of segments, so the number of GPText index shards for existing indexes must be increased to equal the new number of segments. This requires reindexing all existing documents. Newly created indexes will automatically be distributed among the new shards.
Troubleshooting
GPText errors are of the following types:
- Solr errors
- gptext errors
Most of the Solr errors are self-explanatory.
gptext errors are caused by misuse of a function or utility. They provide a message that tells you when you have used an incorrect function or argument.
Monitoring Logs
You can examine the Greenplum Database and Solr logs for more information if errors occur. Greenplum Database logs reside in:
segment-directory/pg_log
Solr logs reside in:
/solr/logs
Determining Segment Status with gptext-state
Use the gptext-state utility to determine if any primary or mirror segments are down. See gptext-state in the GPText Management Utilities Reference.
GPText High Availability
The GPText high availability feature ensures that you can continue working with GPText indexes as long as each shard in the index has at least one working replica.
A GPText index has one shard for each Greenplum segment, so there is a one-to-one correspondence between Greenplum segments and GPText index shards. The shard managed by a Greenplum segment is an index of the documents that are managed by that segment.
The GPText high availability mechanism is to maintain multiple copies, or replicas, of the shard. The ZooKeeper service that manages SolrCloud chooses a GPText instance (SolrCloud node) for each replica to ensure even distribution and high availability. For each shard, one replica is elected leader and the Greenplum segment associated with the shard operates on this leader replica. The GPText instance managing the lead replica may or may not be on another Greenplum host, so indexing and searching operations are passed over the Greenplum cluster's interconnect network. SolrCloud replicates changes made to the leader replica to the remaining replicas.
The following figure illustrates the relationships between Greenplum segments and GPText index shards and replicas. The leader replica for each shard is shown in green and the followers are gray.
The number of replicas to create for each shard, the replication factor, is a SolrCloud property. By default, GPText starts SolrCloud with a replication factor of three. The replication factor for each individual index is the value of the SolrCloud replication factor when the index is created. Changing the replication factor does not alter the replication factor for existing indexes.
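The relationship between segments, shards, and replicas reduces to simple arithmetic, which can help when estimating how many replicas a new index places on the cluster. This Python sketch is illustrative only. The function name is hypothetical, and the replication factor is passed explicitly because each index keeps the factor that was in effect when it was created.

```python
def index_replica_counts(num_segments, replication_factor):
    """Return (shards, total_replicas, followers_per_shard) for one index."""
    shards = num_segments                    # one shard per Greenplum segment
    total_replicas = shards * replication_factor
    followers_per_shard = replication_factor - 1  # one replica is the leader
    return shards, total_replicas, followers_per_shard


# Example: 16 primary segments with a replication factor of 3.
print(index_replica_counts(16, 3))
```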
Greenplum Segment or Host Failure
If a Greenplum primary segment fails and its mirror is activated, GPText functions and utilities continue to access the leader replica. No intervention is needed.
If a host in the cluster fails, both Greenplum and GPText are affected. Mirrors for the Greenplum primary segments located on the failed host are activated on other hosts. SolrCloud elects a new leader replica for affected shards. Because Greenplum segment mirrors and GPText shard replicas are distributed throughout the cluster, a single host failure should not prevent the cluster from continuing to operate. The performance of database queries and indexing operations will be affected until the failed host is recovered and the cluster is brought back into balance.
ZooKeeper Cluster Availability
SolrCloud is dependent on a working, available ZooKeeper cluster. For ZooKeeper to be active, a majority of the ZooKeeper cluster nodes must be up and able to communicate with each other. A ZooKeeper cluster with three nodes can continue to operate if one of the nodes fails, since two is a majority of three. To tolerate two failed nodes, the cluster must have at least five nodes so that the number of working nodes remaining after the failure is a majority. To tolerate n node failures, then, a ZooKeeper cluster must have 2n+1 nodes. This is why ZooKeeper clusters usually have an odd number of nodes.
The best practice for a high-availability GPText cluster is a ZooKeeper cluster with five or seven nodes so that the cluster can tolerate two or three failed nodes.
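The majority rule described above can be expressed as two small functions. This Python sketch is only an illustration of the arithmetic. The function names are hypothetical. It also shows why an even-sized cluster tolerates no more failures than the next smaller odd-sized cluster.

```python
def tolerated_failures(cluster_size):
    """Number of ZooKeeper nodes that can fail while a majority remains."""
    return (cluster_size - 1) // 2


def nodes_required(failures):
    """Smallest cluster size that tolerates the given number of failures."""
    return 2 * failures + 1


for size in (3, 4, 5, 6, 7):
    print(size, "nodes tolerate", tolerated_failures(size), "failure(s)")
```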
Managing GPText Cluster Health
GPText document indexing and searching services remain available as long as each shard of an index has at least one working replica. To ensure availability in the event of a failure, it is important to monitor the status of the cluster and ensure that all of the index shard replicas are healthy. You can monitor the SolrCloud cluster and indexes using the SolrCloud Dashboard or using GPText functions and management utilities. Access the SolrCloud Dashboard with a web browser on any GPText instance with a URL such as http://sdw3:18983/solr. (The port numbers for GPText instances are set with the GPTEXT_PORT_BASE parameter in the installation parameters file at installation time.)
Refer to the Apache SolrCloud documentation for help using the SolrCloud Dashboard.
Monitoring the Cluster with GPText
The GPText gptext-state management utility allows you to query the state of the GPText cluster and indexes. You can also use gptext.index_status() to view the status of all indexes or a specified index.
To see the GPText cluster state, run the gptext-state command-line utility with the -d option to specify a database that has the GPText schema installed.
gptext-state -d mydb
The utility reports any GPText nodes that are down and lists the status of every GPText index. For each index, the database name, index name, and status are reported. The status column contains “Green”, “Yellow”, or “Red”:
- Green: all replicas for all shards are healthy
- Yellow: all shards have at least one healthy replica but at least one replica is down
- Red: no replicas are available for at least one index shard
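The three status values follow directly from the health of each shard's replicas. The following Python sketch restates the rules above as code. It is not how gptext-state is implemented, and the input structure (a dict of shard name to a list of replica health flags) is an assumption made for illustration.

```python
def index_status(shards):
    """Classify an index as 'Green', 'Yellow', or 'Red'.

    shards: dict mapping shard name -> list of booleans,
            True for a healthy replica, False for a replica that is down.
    """
    # Red: at least one shard has no healthy replica.
    if any(not any(replicas) for replicas in shards.values()):
        return "Red"
    # Green: every replica of every shard is healthy.
    if all(all(replicas) for replicas in shards.values()):
        return "Green"
    # Otherwise every shard is served, but some replica is down.
    return "Yellow"


print(index_status({"shard1": [True, True], "shard2": [True, False]}))
```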
To see the distribution of index shards and replicas in the GPText cluster, execute this SQL statement.
SELECT index_name, shard_name, replica_name, node_name FROM gptext.index_summary() ORDER BY node_name;
To list all GPText indexes, run the gptext-state list command.
gptext-state list -d mydb
The gptext-state healthcheck command checks the health of the cluster. The -f flag specifies the percentage of available disk space required to report a healthy cluster. The default is 10.
gptext-state healthcheck -f 20 -d mydb
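The -f threshold is a percentage of free disk space. The Python sketch below shows the kind of comparison this implies. It is an assumption about the check, not the utility's actual logic: whether gptext-state compares with "greater than or equal" is not stated in this guide, and the disk_is_healthy function is hypothetical.

```python
def disk_is_healthy(total_bytes, free_bytes, required_percent=10):
    """Return True if the free space meets the required percentage.

    Assumption: the threshold is inclusive (free percentage >= required).
    """
    percent_free = 100.0 * free_bytes / total_bytes
    return percent_free >= required_percent


# A volume with 15% free space passes the default threshold of 10
# but fails a threshold of 20.
print(disk_is_healthy(1000, 150), disk_is_healthy(1000, 150, required_percent=20))
```

On a real host, the totals could be obtained from shutil.disk_usage(path), which returns total, used, and free byte counts.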
See gptext-state in the Management Utilities reference for help with additional gptext-state options.
The gptext.index_status() user-defined function reports the status of all GPText indexes or a specified index.
SELECT * FROM gptext.index_status();
Specify an index name to report only the status of that index.
SELECT * FROM gptext.index_status('demo.twitter.message');
Adding and Dropping Replicas
The gptext-replica utility adds or drops a replica of a single index shard. Use the gptext.add_replica() and gptext.delete_replica() user-defined functions to perform the same tasks from within the database.
If a replica of a shard fails, use gptext-replica to add a new replica and then drop the failed replica to bring the index back to “Green” status.
gptext-replica add -i mydb.public.messages -s shard3
Here is the equivalent, using the gptext.add_replica() function:
SELECT * FROM gptext.add_replica('mydb.public.messages', 'shard3');
ZooKeeper determines where the replica will be located, but you can also specify the node where the replica is created:
gptext-replica add -i mydb.public.messages -s shard3 -n sdw3
In the gptext.add_replica() function, add the node name as a third argument.
To drop a replica, call gptext.delete_replica() with the name of the index, the name of the shard, and the name of the replica. You can find the name of the replica by calling gptext.index_status(index_name). The name is in the format core_noden. With the gptext-replica utility, an optional -o flag specifies that the replica is to be deleted only if it is down.
gptext-replica drop -i mydb.public.messages -s shard3 -r core_node4 -o
Here is the equivalent of the above command using the gptext.delete_replica() user-defined function.
SELECT * FROM gptext.delete_replica('mydb.public.messages', 'shard3', 'core_node4', true);
GPText Best Practices
Each GPText/Apache Solr node is a Java Virtual Machine (JVM) process and is allocated memory at startup. The maximum amount of memory the JVM will use is set with the -Xmx parameter on the Java command line. Performance problems and out of memory failures can occur when the nodes have insufficient memory.
Other performance problems can result from resource contention between the Greenplum Database, Solr, and ZooKeeper clusters.
This topic discusses GPText use cases that stress Solr JVM memory in different ways and the best practices for preventing or alleviating performance problems from insufficient JVM memory and other causes.
Indexing Large Numbers of Documents
Indexing documents consumes Solr JVM memory. When the index is committed, parts of the memory are released, but some data remains in memory to support fast search. By default, Solr performs an automatic soft commit when 1,000,000 documents are indexed or 20 minutes (1,200,000 milliseconds) have passed. A soft commit pushes documents from memory to the index, freeing JVM memory. A soft commit also makes the documents visible in searches. A soft commit does not, however, make the index updates durable; it is still necessary to commit the index with the gptext.commit() user-defined function.
You can configure an index to perform a more frequent automatic soft commit by editing the solrconfig.xml file for the index:
$ gptext-config edit -f solrconfig.xml -i ..
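The <autoSoftCommit> element is a child of the <updateHandler> element. Edit the <maxDocs> and <maxTime> values to reduce the time between automatic commits. For example, the following settings perform an autocommit every 100,000 documents or 10 minutes.
<autoSoftCommit>
  <maxDocs>100000</maxDocs>
  <maxTime>600000</maxTime>
</autoSoftCommit>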
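The document-count and time thresholds act as an "either one" trigger. The Python sketch below is a deliberately simplified model of that behavior, not Solr's implementation. The function and parameter names are hypothetical, and the defaults are the values quoted in this section.

```python
def soft_commit_due(pending_docs, ms_since_first_pending_doc,
                    max_docs=1_000_000, max_time_ms=1_200_000):
    """Return True when either automatic soft commit threshold is reached."""
    return pending_docs >= max_docs or ms_since_first_pending_doc >= max_time_ms


# With the defaults, 250,000 pending documents after 11 minutes do not
# trigger a soft commit; with the reduced settings (100,000 documents or
# 600,000 ms) they do.
print(soft_commit_due(250_000, 660_000))
print(soft_commit_due(250_000, 660_000, max_docs=100_000, max_time_ms=600_000))
```

Lower thresholds release JVM memory sooner at the cost of more frequent commit work.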
Indexing Very Large Documents
Indexing very large documents can use a large amount of JVM memory. To manage this, you can set the gptext.idx_buffer_size configuration parameter to reduce the size of the indexing buffer.
See Changing GPText Server Configuration Parameters for instructions to change configuration parameter values.
Determining the Number of GPText Nodes to Deploy
A GPText node is a Solr instance managed by GPText. The nodes can be deployed on the Greenplum Database cluster hosts or on separate hosts accessible to the Greenplum Database cluster. The number of nodes is configured during GPText installation.
The maximum recommended number of GPText nodes you can deploy is the number of Greenplum Database primary segments. However, the best practice recommendation is to deploy fewer GPText nodes with more memory rather than to divide the memory available to GPText among the maximum number of GPText nodes. Use the JAVA_OPTS installation parameter to set memory size for GPText nodes.
A single GPText node per host can easily handle several indexes. Each additional node consumes additional CPU and memory resources, so it is desirable to limit the number of nodes per host. For most GPText installations, a single GPText node per host is sufficient.
If the JVM has a very large amount of memory, however, garbage collection can cause long pauses while the JVM reorganizes memory. Also, the JVM employs a memory address optimization that cannot be used when JVM memory exceeds 32GB, so at more than 32GB, a GPText node loses capacity and performance. Therefore, no GPText node should have more than 32GB of memory.
For example, if you have 48GB memory available for GPText per host, you should deploy two GPText nodes with 24GB memory. If you have 128GB available, you should deploy at least four JVMs, and more if garbage collection becomes a problem.
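The sizing rule in the examples above (use as few nodes as possible while keeping each node at or below 32GB) can be written as a short calculation. This Python sketch is only a starting point under that assumption. The plan_nodes function is hypothetical, and it returns the minimum node count, which you may need to raise if garbage collection becomes a problem.

```python
import math

MAX_NODE_GB = 32  # per-node ceiling recommended in this guide


def plan_nodes(host_memory_gb):
    """Return (node_count, memory_per_node_gb) for one host."""
    # Fewest nodes such that no node is given more than MAX_NODE_GB.
    nodes = max(1, math.ceil(host_memory_gb / MAX_NODE_GB))
    return nodes, host_memory_gb / nodes


print(plan_nodes(48))   # two nodes with 24GB each
print(plan_nodes(128))  # four nodes with 32GB each
```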
Configure Maximum JVM Heap Size
Each Solr core file consumes JVM heap memory. Adding more indexes increases JVM swapping and garbage collection frequency so that it takes longer to create indexes and to load the core files when GPText is started. If you continue to create indexes without increasing the JVM heap, an out of memory error will eventually occur.
Monitor performance at startup and during index creation and increase the JVM size when you begin to see degraded performance. You can also use tools such as jconsole, included with the Java Development Kit, to monitor Java heap usage. If garbage collections are occurring too frequently and freeing too little memory, JVM heap should be increased.
The JVM size is initially configured during GPText installation by setting the JAVA_OPTS parameter in the installation configuration file. After installation, use the gptext-config jvm command to increase the JVM heap size. For example, this gptext-config jvm command sets the JVM maximum heap option to 4GB:
$ gptext-config jvm -o "-Xmx=4096M"
Manage Indexing and Search Loads
With high indexing or search load, JVM garbage collection pauses can cause the Solr overseer queue to back up. For a heavily loaded GPText system, you can prevent some performance problems by scheduling document indexing for times when search activity is low.
Terms Queries and Out of Memory Errors
The gptext.terms() function retrieves terms vectors from documents that match a query. An out of memory error may occur if the documents are large, or if the query matches a large number of documents on each node. Other factors can contribute to out of memory errors when running a gptext.terms() query, including the maximum memory available to the Solr nodes (-Xmx value in JAVA_OPTS) and concurrent queries.
If you experience out of memory errors with gptext.terms() you can set a lower value for the term_batch_size GPText configuration variable. The default value is 1000. For example, you could try running the failing query with term_batch_size set to 500. Lowering the value may prevent out of memory errors, but performance of terms queries can be affected.
See GPText Configuration Parameters for help setting GPText configuration parameters.
Configure File System Caching for ZooKeeper
Good Solr performance is dependent on fast response for ZooKeeper requests. ZooKeeper performs best when its database is cached so it does not have to go to disk for lookups. If you find that ZooKeeper JVMs have frequent disk accesses, look for ways to improve file caching or move ZooKeeper disks to faster storage.
The ZooKeeper zkClientTimeout parameter is the length of time a client may go without communicating with ZooKeeper before its session expires.
Troubleshooting Hadoop Connection Problems
This section describes Hadoop-related problems and potential solutions to these issues.
DataNode Access Errors
You may experience Hadoop access errors with GPText if any DataNodes in the Hadoop cluster reside in a multi-homed network. GPText uses an external IP address to access the HDFS NameNode. GPText encounters an error when the NameNode provides an internal IP address for a DataNode. In this situation, additional configuration is required to configure GPText to perform its own DNS resolution of DataNode hostnames.
Perform the following procedure to explicitly configure DNS resolution of DataNode hostnames:
1. Locate a local copy of the Hadoop authentication configuration directory that you previously uploaded to ZooKeeper. For example, if the directory is located at /home/gpadmin/auths/hdfs_conf:
$ cd /home/gpadmin/auths/hdfs_conf
$ ls
core-site.xml hdfs-site.xml user.txt
2. Open hdfs-site.xml in the editor of your choice. For example:
$ vi hdfs-site.xml
3. Add the following property block to the file, and then save the file and exit:
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
This property allows GPText hosts to perform their own DNS resolution of HDFS DataNode hostnames.
4. Re-upload the modified configuration to ZooKeeper. For example, if the hdfs_conf directory includes the authentication configuration files for a Hadoop cluster with hdfs_bill_auth:
$ cd ..
$ gptext-external upload -t hdfs -c hdfs_bill_auth -p hdfs_conf
5. Determine the hostname-to-IP address mapping for all DataNodes, and add the associated entries into the /etc/hosts file on all GPText client hosts.
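Once the DataNode hostname-to-IP mapping is known, the /etc/hosts lines for step 5 can be generated rather than typed by hand on each host. The following Python sketch only formats the entries. The hostnames and the 192.0.2.x documentation addresses are placeholders, and hosts_entries is a hypothetical helper.

```python
def hosts_entries(mapping):
    """Format a hostname -> IP mapping as /etc/hosts lines (IP first)."""
    return "\n".join(f"{ip}\t{host}" for host, ip in sorted(mapping.items()))


# Placeholder DataNode names and addresses; substitute your own.
datanodes = {
    "datanode1": "192.0.2.11",
    "datanode2": "192.0.2.12",
}
print(hosts_entries(datanodes))
```

Review the generated lines before appending them to /etc/hosts on each GPText client host.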
Kerberos-Related Errors
The following problems are specific to Hadoop clusters secured with Kerberos.
Clock Skew
A login attempt to a Hadoop cluster secured with Kerberos will fail if clock skew between GPText client hosts and the Kerberos KDC host is too great. In this situation, you may see the following error in the Solr log:
java.io.IOException caused by a KrbException noting “Clock skew too great”
To resolve this situation, ensure that the clocks on the Kerberos KDC host and GPText client hosts are synchronized.
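A quick way to reason about this error is to compare the two clocks against the tolerance the KDC enforces. The Python sketch below assumes a tolerance of five minutes, which is a common Kerberos default but may be configured differently at your site. The function name and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: five minutes is a common Kerberos default; check your KDC settings.
MAX_SKEW = timedelta(minutes=5)


def skew_too_great(client_time, kdc_time, max_skew=MAX_SKEW):
    """Return True if the two clocks differ by more than the tolerance."""
    return abs(client_time - kdc_time) > max_skew


kdc = datetime(2019, 6, 1, 12, 0, 0)
print(skew_too_great(datetime(2019, 6, 1, 12, 3, 0), kdc))  # within tolerance
print(skew_too_great(datetime(2019, 6, 1, 11, 52, 0), kdc))  # eight minutes behind
```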
Timeout Errors
A login attempt to a Hadoop cluster secured with Kerberos may fail with timeout errors when the kdc and admin_server settings in the krb5.conf file are specified with a hostname, and the GPText client hosts cannot resolve the hostname. In this situation, you may see one of the following errors in the Solr log:
- org.apache.solr.common.SolrException: Failed to login HDFS message caused by a java.io.IOException specifying javax.security.auth.login.LoginException: Receive timed out
- java.nio.channels.UnresolvedAddressException with SocketIOWithTimeout referenced in the stack trace
In this situation, you may choose either of the following:
- Update the Kerberos krb5.conf file to specify the kdc and admin_server settings using IP addresses. Or
- Update all GPText hosts to perform their own DNS resolution of the Kerberos KDC server.
If you choose to update the krb5.conf file:
1. Locate a local copy of the Hadoop Kerberos authentication configuration directory that you previously uploaded to ZooKeeper. For example, if the directory is located at /home/gpadmin/auths/hdfs_kerb_conf:
$ cd /home/gpadmin/auths/hdfs_kerb_conf
$ ls
core-site.xml hdfs-site.xml keytab krb5.conf user.txt
2. Open krb5.conf in the editor of your choice. For example:
$ vi krb5.conf
3. Replace the KERBEROS block attributes with their equivalent IP addresses and then save the file and exit. For example:
[realms]
KERBEROS = {
  kdc =
  admin_server =
}
4. Re-upload the modified configuration to ZooKeeper. For example, if the directory named hdfs_kerb_conf includes the authentication configuration files for a Hadoop cluster defined with the hdfs_kerb_auth:
$ cd ..
$ gptext-external upload -t hdfs -c hdfs_kerb_auth -p hdfs_kerb_conf
Alternately, if you choose to configure the GPText hosts to perform their own DNS resolution of the Kerberos KDC server, add an entry for the KDC hostname-to-IP address mapping to the /etc/hosts file on all GPText client hosts.
Working With GPText Indexes
Indexing prepares documents for text analysis and fast query processing. This topic shows you how to create GPText indexes and add documents from Greenplum Database tables to them, and how to maintain and customize indexes for your own applications.
For help indexing and searching documents stored outside of Greenplum Database see Working With GPText External Indexes.
Setting Up the Sample Database
The examples in this documentation work with a demo database containing three database tables, called wikipedia.articles, twitter.message, and store.products. If you want to run the examples yourself, follow the instructions in this section to set up the demo database.
1. Log in to the Greenplum Database master as the gpadmin user and create the demo database.
$ createdb demo
2. Open an interactive shell for executing queries in the demo database.
$ psql demo
3. Create the articles table in the wikipedia schema with the following statements.
CREATE SCHEMA wikipedia;
CREATE TABLE wikipedia.articles (
    id int8 primary key,
    date_time timestamptz,
    title text,
    content text,
    refs text
) DISTRIBUTED BY (id);
4. Create the message table in the twitter schema with the following statements.
CREATE SCHEMA twitter;
CREATE TABLE twitter.message (
    id bigint,
    message_id bigint,
    spam boolean,
    created_at timestamp without time zone,
    source text,
    retweeted boolean,
    favorited boolean,
    truncated boolean,
    in_reply_to_screen_name text,
    in_reply_to_user_id bigint,
    author_id bigint,
    author_name text,
    author_screen_name text,
    author_lang text,
    author_url text,
    author_description text,
    author_listed_count integer,
    author_statuses_count integer,
    author_followers_count integer,
    author_friends_count integer,
    author_created_at timestamp without time zone,
    author_location text,
    author_verified boolean,
    message_url text,
    message_text text
) DISTRIBUTED BY (id)
PARTITION BY RANGE (created_at)
( START (DATE '2011-08-01') INCLUSIVE
  END (DATE '2011-12-01') EXCLUSIVE
  EVERY (INTERVAL '1 month') );
CREATE INDEX id_idx ON twitter.message USING btree (id);
5. Create the store.products table with these statements.
CREATE SCHEMA store;
CREATE TABLE store.products (
    id bigint,
    title text,
    category varchar(32),
    brand varchar(32),
    price float
) DISTRIBUTED BY (id);
6. Download test data for the three tables here. Right-click the link, save the file, and then copy it to the gpadmin user's home directory.
7. Extract the data files with this tar command.
$ tar xvfz gptext-demo-data.tgz
8. Load the wikipedia data into the wikipedia.articles table using the psql \COPY metacommand.
\COPY wikipedia.articles FROM '/home/gpadmin/demo/articles.csv' HEADER CSV;
The articles table now contains text from 23 Wikipedia articles.
9. Load the twitter data into the twitter.message table using the following psql \COPY metacommand.
\COPY twitter.message FROM '/home/gpadmin/demo/twitter.csv' CSV;
The message table now contains 1730 tweets from August to October, 2011.
10. Load the products table into the store.products table with the following psql \COPY metacommand.
\COPY store.products FROM '/home/gpadmin/demo/products.csv' HEADER CSV;
The products table now contains 50 rows. This table is used to demonstrate faceted search queries. See Creating Faceted Search Queries.
Setting up the GPText Command-line Environment
To work with GPText indexes, you must first set up your environment and add the GPText schema to the database containing the documents (Greenplum Database data) you want to index.
To set the environment, log in as the gpadmin user and source the Greenplum Database and GPText environment scripts. The Greenplum Database environme