Pivotal Greenplum® Text
Version 3.1.0
User Guide
Rev: 01
© 2018 Pivotal Software, Inc.
Table of Contents
Pivotal® Greenplum® Text 3.1.0 Documentation
Pivotal® GPText 3.1.0 Release Notes
Installing GPText
Upgrading GPText
Introduction to Pivotal GPText
Administering GPText
GPText High Availability
GPText Best Practices
Troubleshooting Hadoop Connection Problems
Working With GPText Indexes
Querying GPText Indexes
Customizing GPText Indexes
Working With GPText External Indexes
Using Named Entity Recognition with GPText
GPText Function Reference
GPText Management Utilities
GPText and Solr Data Type Mappings
GPText Schema Tables
GPText Configuration Parameters
© Copyright Pivotal Software, Inc, 2013-2018
Pivotal® Greenplum® Text 3.1.0 Documentation
GPText Documentation PDF
Pivotal GPText 3.1.0 Release Notes
Installing Pivotal GPText
Upgrading Pivotal GPText
Using Pivotal GPText
GPText References
Additional Resources
Pivotal Greenplum Database (http://gpdb.docs.pivotal.io)
Apache Solr Web Site (http://lucene.apache.org/solr/)
Apache MADlib (http://madlib.apache.org/)
Pivotal® GPText 3.1.0 Release Notes
This document contains release information for Pivotal GPText 3.1.0.
Released: September 2018
About Pivotal GPText
Pivotal GPText joins the Greenplum Database massively parallel-processing database server with Apache SolrCloud enterprise search and the Apache MADlib Analytics Library to provide large-scale analytics processing and business decision support. GPText includes free text search as well as support for text analysis.
GPText includes the following features:
The GPText database schema provides in-database access to Apache Solr indexing and searching
Build indexes with database data or external documents and search with the GPText API
Custom tokenizers for international text and social media text
A Universal Query Processor that accepts queries with mixed syntax from supported Solr query processors
Faceted search results
Term highlighting in results
Natural language processing, including part-of-speech tagging and named entity extraction
Greater emphasis on high availability
The GPText management utility suite includes command-line utilities to perform the following tasks:
Start, stop, and monitor ZooKeeper and GPText nodes
Configure GPText nodes and indexes
Add and delete replicas for index shards
Back up and restore GPText indexes
Recover a GPText node
Expand the GPText cluster by adding GPText nodes
Prerequisites
Installing GPText also installs Apache SolrCloud and, optionally, Apache ZooKeeper.
Following are GPText installation prerequisites.
GPText runs on Red Hat Enterprise Linux 5.x, 6.x, and 7.x.
Install and configure your Greenplum Database system, version 4.3.6 or higher. See the Pivotal Greenplum Database Installation Guide at https://gpdb.docs.pivotal.io.
Install Java JRE 1.8.x and add the bin directory to the PATH on all hosts in the cluster. GPText is tested with Oracle Java 1.8 and OpenJDK 1.8.
Ensure that nc (netcat) is installed on all Greenplum cluster hosts (sudo yum install nc).
Installing lsof on all cluster hosts is recommended (sudo yum install lsof).
GPText cannot be installed onto a shared NFS mount.
GPText nodes can be installed on the Greenplum Database cluster hosts alongside the Greenplum segments or on additional, non-database hosts accessible on the Greenplum cluster network. All hosts participating in the GPText system must have the same operating system and configuration and have passwordless-ssh access for the gpadmin user. See the Pivotal Greenplum Database Installation Guide for instructions to configure hosts.
If you plan to place GPText nodes on the Greenplum Database segment hosts, ensure that you reserve memory for GPText use when you configure Greenplum Database. To determine the memory to set aside for GPText, multiply the number of GPText nodes to create on each Greenplum segment host by the JVM maximum size. Subtract this memory from the physical RAM when calculating the value for the Greenplum Database gp_vmem_protect_limit server configuration parameter. See the Greenplum Database server configuration parameter gp_vmem_protect_limit in the Greenplum Database Reference Guide for recommended memory calculation formulas, or visit the GPDB Virtual Memory Calculator website (http://greenplum.org/calc/).
Apache Solr requires a ZooKeeper cluster with at minimum three nodes (five nodes recommended). You can install a “binding” ZooKeeper cluster with GPText on the Greenplum cluster hosts, or you can use an existing ZooKeeper cluster. When deployed alongside Greenplum Database segments, ZooKeeper performance can be affected under heavy database load. For best performance, install a ZooKeeper cluster on separate hosts with network connectivity to the Greenplum network.
New Features and Enhancements in GPText 3.1.0
The GPText 3.1.0 release provides the following features and enhancements.
Improvements to aid in developing and testing analyzer chains
The new gptext.list_field_types() function lists the field types defined in the managed-schema configuration file for an index.
The new gptext.get_field_type() function displays the index and query analyzer chains for a field type in JSON format.
The new gptext.analyzer() function shows the index or query analyzer chain output for a given field type and input text. This function is useful for testing and debugging analyzer chains interactively without modifying the index.
Part-of-speech tagging and named entity recognition
GPText includes OpenNLP libraries and analyzer classes to classify indexed terms’ parts-of-speech (POS), and to recognize named entities (NER), such as the names of persons, locations, and organizations. GPText saves NER terms in the field’s terms vector, prepended with a code to identify the type of entity recognized. This allows searching documents by entity type.
The new gptext.ner_terms() function lists NER-tagged terms for documents that match a query.
GPText includes the OpenNLP models for the English language. You can download models for other languages from the OpenNLP website and use them with GPText.
Other enhancements and fixes
The first argument of the gptext.terms() function, an anytable data type, has been made optional.
Fixed an error where the gptext.partition_status() function displayed partition information for an index after it was dropped.
Apache Solr updated to Solr version 7.3
GPText 3.1.0 includes Apache Solr 7.3. See the following release documents for information about the Solr 7.3 release.
Apache Solr 7.3 Upgrade Notes (https://lucene.apache.org/solr/guide/7_3/solr-upgrade-notes.html)
Apache Solr 7.3 Release Highlights (https://wiki.apache.org/solr/ReleaseNote73)
Following are GPText changes and Solr usage notes related to the Solr 7.3 upgrade.
GPText server-side components are rebuilt and tested with the new Solr JAR files.
The managed-schema, solrconfig.xml, and other collection configuration files are updated.
The top-level element in solrconfig.xml is now officially deprecated in favor of the equivalent syntax. This element has been out of use in default Solr installations for several releases already.
The legacyCloud parameter now defaults to false. If an entry for a replica does not exist in state.json, that replica will not be registered. This may affect users who bring up replicas and expect them to be registered automatically as part of a shard. It is possible to revert to the old behavior by setting the property legacyCloud=true in the cluster properties by running the following command in the GPText installation directory:
$ ./server/scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181 -cmd clusterprop -name legacyCloud -val true
With earlier Solr releases, if you drop an index while a Solr node with a replica of the index is down, when the down node comes back on-line, the index comes back and cannot be deleted. Solr 7 fixes this bug. The GPText workaround for this bug is removed.
PointFields are the default numeric types. Solr has implemented *PointField types across the board to replace Trie*-based numeric fields. All Trie* fields are now considered deprecated and will be removed in Solr 8. If you are using Trie* fields in your schema, you should consider moving to PointFields as soon as feasible. Changing to the new PointField types will require you to re-index your data.
The following spatial-related fields have been deprecated:
LatLonType
GeoHashField
FieldType
SpatialTermQueryPrefixTreeFieldType
Use one of these field types instead:
LatLonPointSpatialField
SpatialRecursivePrefixTreeField
RptWithGeometrySpatialField
To improve parameter consistency in the Collections API, the parameter name fromNode for the MOVEREPLICA command and the parameter names source and target for the REPLACENODE command have been deprecated and replaced with sourceNode and targetNode. The old names will continue to work for backwards compatibility, but they will be removed in Solr 8.
The replica core name has changed from _shard#_replica# to _shard#_replica_n#. For example, test_shard0_replica1 becomes test_shard0_replica_n3.
New Features and Enhancements in GPText 3.0.0
GPText 3.0.0 allows adding documents stored in Amazon Web Services S3 buckets to a GPText external index. This enhancement includes changes to enable uploading AWS credentials to ZooKeeper and support for the s3 document source type for the gptext.external_login(), gptext.external_logout(), gptext.index_external(), and gptext.index_external_dir() GPText functions.
The gptext-state utility with the --index (-i) option now includes the date and time the GPText index was last modified.
New Features and Enhancements in GPText 2.4.0
GPText 2.4.0 allows adding documents stored in an authenticated FTP server to a GPText external index. This enhancement includes changes to add support for the ftp type to the gptext-external upload command-line utility and the gptext.external_login(), gptext.external_logout(), gptext.index_external(), and gptext.index_external_dir() GPText functions.
New Features and Enhancements in GPText 2.3.1
The gptext-backup command-line utility can now back up GPText indexes to local GPText cluster storage as well as to a directory on a shared drive. For local backups, backup metadata and the index configuration files are backed up to the Greenplum Database master data directory, and index shards are backed up in the segment data directories on each host.
The gptext-backup utility has a new option to back up just the index configuration files from ZooKeeper, with no index data.
The gptext-restore utility is updated to restore backups created on local cluster storage.
The gptext-restore utility has a new option to restore only the configuration files from a backup. This option loads the configuration files into ZooKeeper and creates an empty GPText index.
New Features and Enhancements in GPText 2.3.0
Revised gptext-config Utility Syntax
The gptext-config command-line utility was revised to have a more user-friendly syntax.
A new list subcommand was added to gptext-config. Use it to list all of the configuration files for a specified GPText index.
$ gptext-config list -i
Index Documents in a Hadoop File System (hdfs) Document Source
GPText 2.3.0 enables you to add documents stored in an hdfs system to a GPText external index.
The new gptext-external command-line utility uploads Hadoop configuration and authentication files to a named configuration in ZooKeeper. The utility has subcommands upload, list, and delete to manage the configurations you have uploaded.
The new gptext.external_login() function logs into the hdfs system using the named configuration you have uploaded. You can log in to only one external document source at a time.
Use URLs of the form hdfs:// with the gptext.index() and gptext.index_external() functions to add documents to a GPText external index.
Use the new gptext.index_external_dir() function to add all documents in an hdfs directory to a GPText external index.
Log out of the hdfs external document source with the new gptext.external_logout() function.
See Authenticating with an External Document Source for steps to enable access to an hdfs document source.
Known Issues
Following are known issues in GPText. Workarounds are provided when available.
Wildcards in GPText Search Options
Solr does not return all fields when the fl Solr search option contains a wildcard that matches field names. For example, given a table with columns contenta and contentb, specifying fl=contenta,contentb,sum(1,1) correctly returns three fields. Specifying fl=cont*,sum(1,1) returns contenta and contentb, but omits the pseudo-field sum(1,1).
Specifying a wildcard to match all fields (fl=*,sum(1,1)) also omits the pseudo-field.
Index Load Failure After Configuration File Error
If Solr fails to load an index because of a configuration file error, and then the index is dropped without first correcting the configuration file error, the index cannot be recreated until GPText is restarted. This can happen if you edit managed-schema or solrconfig.xml and introduce an XML syntax error or a typo in configuration values.
Workaround:
1. When an index fails to load, check the Solr log to find the cause.
2. If the cause is a configuration file error, such as invalid XML, use the gptext-config utility to edit the file and fix the error. Dropping the index without first correcting the error is not recommended.
3. If you have dropped an index that failed to load without first correcting the cause of the failure, you must restart GPText before you can recreate the index. Run gptext-start -r to restart GPText.
Startup Failure with Large Numbers of Indexes
When there is a large number of Solr cores, SolrCloud can fail to restart successfully, with error messages indicating failure to elect leaders for shards. This is a known Solr issue; see https://issues.apache.org/jira/browse/SOLR-5990 in the Apache Solr Jira for an example. Because of this issue, avoid designing GPText applications that create large numbers of indexes, shards, and replicas. The number of cores you can create before you observe this behavior is hardware dependent, so you should test to determine your system’s limits. You can create and successfully operate a larger number of indexes than can be restarted successfully later, so be sure to test restarting GPText to determine a practical limit.
Setting GPText Configuration Parameters Without First Setting custom_variable_classes
If the custom_variable_classes Greenplum Database server configuration parameter does not include the value “gptext”, attempting to set a GPText configuration parameter returns an error message, for example:
mydb-# set gptext.replication_factor=4;
WARNING: Please logon again to make GUC setting take effect. (GucValue.h:301)
WARNING: Please logon again to make GUC setting take effect. (GucValue.h:301)
ERROR: unrecognized configuration parameter "gptext.replication_factor"
In GPText 2.0, in addition to the error message, the value of the configuration parameter persisted in ZooKeeper is zero, replacing the previous value of the parameter.
mydb-# show gptext.replication_factor;
 gptext.replication_factor
----------------------------
 0
Beginning with GPText 2.1, the error message is still generated; however, the value saved in ZooKeeper is the value specified in the set command, 4 in the preceding example.
To prevent the error message, before setting any GPText configuration parameters, use the gpconfig command-line utility to set the custom_variable_classes configuration parameter:
$ gpconfig -c custom_variable_classes -v 'gptext'
Installing GPText
Prerequisites
The GPText installation includes the installation of Apache SolrCloud and, optionally, Apache ZooKeeper.
If you are installing a new GPText release into an existing GPText system, follow the instructions in Upgrading GPText instead.
Following are GPText installation prerequisites.
Install and configure your Greenplum Database system, version 4.3.6 or higher. See the Pivotal Greenplum Database Installation Guide at https://gpdb.docs.pivotal.io.
GPText runs on Red Hat Enterprise Linux or CentOS 5.x, 6.x, or 7.x.
GPText cannot be installed onto a shared NFS mount.
Install a JRE 1.8.x on all hosts in the cluster.
Ensure that nc (netcat) is installed on all Greenplum cluster hosts (yum install nc).
Installing lsof on all cluster hosts is recommended (sudo yum install lsof).
GPText nodes can be installed on the Greenplum Database cluster hosts alongside the Greenplum segments or on additional, non-database hosts accessible on the Greenplum cluster network. All hosts participating in the GPText system must have the same operating system and configuration and have passwordless-ssh access for the gpadmin user. See the Pivotal Greenplum Database Installation Guide for instructions to configure hosts.
If you plan to place GPText nodes on the Greenplum Database segment hosts, ensure that you reserve memory for GPText use when you configure Greenplum Database. To determine the memory to set aside for GPText, multiply the number of GPText nodes to create on each Greenplum segment host by the JVM maximum size. Subtract this memory from the physical RAM when calculating the value for the Greenplum Database gp_vmem_protect_limit server configuration parameter. See the Greenplum Database server configuration parameter gp_vmem_protect_limit in the Greenplum Database Reference Guide for recommended memory calculation formulas, or visit the GPDB Virtual Memory Calculator website (http://greenplum.org/calc/).
Apache Solr requires a ZooKeeper cluster with at minimum three nodes. You can install a “binding” ZooKeeper cluster with GPText on the Greenplum cluster hosts, or you can use an existing ZooKeeper cluster. When deployed alongside Greenplum Database segments, ZooKeeper performance can be affected under heavy database load. For best performance, install a ZooKeeper cluster with at least three nodes (five nodes recommended) on separate hosts with network connectivity to the Greenplum network.
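The memory reservation rule in the prerequisites (GPText nodes per host multiplied by the JVM maximum size, subtracted from physical RAM) can be sketched as a small shell calculation. The function names and the example figures are hypothetical; the result is only the RAM figure to feed into the gp_vmem_protect_limit formulas in the Greenplum Database Reference Guide, not the parameter value itself.

```shell
#!/bin/sh
# Hypothetical helper: memory (MB) to set aside for GPText on one segment host.
# $1 = GPText nodes per host, $2 = JVM maximum size in MB (the -Xmx value).
gptext_reserved_mb() {
    echo $(( $1 * $2 ))
}

# Hypothetical helper: RAM (MB) left over for the gp_vmem_protect_limit calculation.
# $1 = physical RAM in MB, $2 = GPText nodes per host, $3 = JVM maximum size in MB.
ram_for_greenplum_mb() {
    reserved=$(gptext_reserved_mb "$2" "$3")
    echo $(( $1 - $reserved ))
}

# Assumed example: a 64 GB host running 2 GPText nodes with -Xmx2048M each.
gptext_reserved_mb 2 2048          # prints 4096
ram_for_greenplum_mb 65536 2 2048  # prints 61440
```

Use the second figure in place of physical RAM when you apply the recommended memory formulas.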
Install the GPText Binary Distribution
1. On the Greenplum master host, extract the GPText distribution file. For example:
$ cd /home/gpadmin
$ tar xvfz greenplum-text--.tar.gz
This extracts two files in the current directory: gptext_install_config and the GPText installation binary, which has a name in the format greenplum-text--.bin.
2. If necessary, grant execute permission to the GPText binary. For example:
$ chmod +x /home/gpadmin/greenplum-text--.bin
3. If you are installing GPText in a directory that is only writable by root, such as the default directory /usr/local, perform these steps as root:
a. Source the greenplum_path.sh file in the Greenplum Database installation directory.
# source /usr/local/greenplum-db-/greenplum_path.sh
b. Locate or create a text file containing a list of the names of all hosts where you will install GPText, one per line, including the master and standby host names.
c. Start gpssh, specifying the text file with host names.
# gpssh -f hostlist.txt
d. Create the installation directory and the greenplum-solr directory and set the ownership and permissions. For example, if you are installing GPText in the default directory, /usr/local:
=> mkdir /usr/local/greenplum-text-
=> mkdir /usr/local/greenplum-solr
=> chown gpadmin:gpadmin /usr/local/greenplum-text-
=> chmod 775 /usr/local/greenplum-text-
=> chown gpadmin:gpadmin /usr/local/greenplum-solr
=> chmod 775 /usr/local/greenplum-solr
=> exit
e. Complete the remaining steps as the gpadmin user.
4. Edit the gptext_install_config file to set parameters for the installation. See Set Installation Parameters for details.
5. Run the GPText installation binary as gpadmin on the master server:
$ ./gptext-.bin -c
6. Accept the Pivotal license agreement.
Optional Two-Part GPText Installation
You can run the GPText installation in two parts by following these steps.
1. Prepare GPText installation directories as described in steps 1 through 3 in Install the GPText Binary Distribution.
2. Run the GPText installation binary as gpadmin on the master server:
$ ./greenplum-text-.bin -b
Note that the -c option is omitted.
3. Source the GPText environment script in the GPText installation directory:
$ source /greenplum-text_path.sh
4. Edit the gptext_install_config file to set parameters for the GPText installation. See Set Installation Parameters for details.
5. Deploy the GPText cluster with the following command:
$ gptext-deploy -c
Set Installation Parameters
A GPText configuration file named gptext_install_config contains parameters to configure the GPText installation. Edit the file and set the parameters as described in the following table.
GPText installation parameters
The GPTEXT_HOSTS and DATA_DIRECTORY installation parameters determine the number of GPText nodes that are deployed. The number of directories included in the DATA_DIRECTORY array is the number of GPText nodes that are created per host.
The GPTEXT_HOSTS parameter determines the number of hosts. If set to the constant "ALLSEGHOSTS", the number of GPText node hosts is the same as the number of Greenplum segment hosts. If GPTEXT_HOSTS is set to an array of host names, the length of the array is the number of GPText node hosts.
The maximum number of GPText nodes is the number of Greenplum Database primary segments. The best practice recommendation is to deploy fewer GPText nodes with more memory rather than to divide the memory available to GPText among the maximum number of GPText nodes allowed. For example, if there are eight primary segments per host in the Greenplum Database cluster, the maximum number of GPText nodes per host is eight, but you should test with two or four GPText nodes per host, adjusting the JAVA_OPTS installation parameter to divide the memory reserved for GPText among them.
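As a rough illustration of how GPTEXT_HOSTS and DATA_DIRECTORY combine, the sketch below multiplies the host count by the number of data directories and compares the total with the number of primary segments. The function names and the example cluster size are assumptions made for this illustration; they are not part of the GPText utilities.

```shell
#!/bin/sh
# Hypothetical helper: total GPText nodes that would be deployed.
# $1 = number of GPText hosts, $2 = number of DATA_DIRECTORY entries (nodes per host).
gptext_total_nodes() {
    echo $(( $1 * $2 ))
}

# Hypothetical check: succeeds when the total does not exceed the number of
# Greenplum primary segments, which the text gives as the maximum.
# $1 = hosts, $2 = directories per host, $3 = primary segments in the cluster.
gptext_nodes_within_limit() {
    [ "$(gptext_total_nodes "$1" "$2")" -le "$3" ]
}

# Assumed example: 4 segment hosts with 8 primaries each (32 in total),
# and 2 data directories per host.
gptext_total_nodes 4 2   # prints 8
gptext_nodes_within_limit 4 2 32 && echo "within limit"
```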
GPTEXT_HOSTS
An array of host names on which to install GPText, or the constant "ALLSEGHOSTS" to install GPText on all Greenplum Database segment hosts. GPText hosts must be passwordless ssh-accessible by the gpadmin user from all other hosts in the Greenplum cluster.
declare -a GPTEXT_HOSTS=(gptext_h1 gptext_h2 gptext_h3)
GPTEXT_HOSTS="ALLSEGHOSTS"
DATA_DIRECTORY
An array of directory paths where GPText data directories are to be created. The number of directories in the array determines the number of GPText nodes that will be created on each physical host. If GPTEXT_HOSTS lists multiple interfaces per host, the GPText nodes are spread evenly across the interface addresses.
declare -a DATA_DIRECTORY=(/data/primary /data/primary)
JAVA_OPTS
Sets the minimum and maximum memory each SolrCloud JVM can use.
JAVA_OPTS="-Xms1024M -Xmx2048M"
GPTEXT_PORT_BASE
GP_MAX_PORT_LIMIT
Set a range of port numbers available to GPText nodes. GPText finds unused ports in the specified range.
GPTEXT_PORT_BASE=18983
GP_MAX_PORT_LIMIT=28983
ZOO_CLUSTER
Whether to deploy a GPText binding ZooKeeper cluster or use an existing ZooKeeper cluster. If set to "BINDING", the installation deploys a ZooKeeper cluster. To use an existing ZooKeeper cluster, set this parameter to a list of ZooKeeper nodes in the format "host1:port,host2:port,host3:port".
ZOO_CLUSTER="BINDING"
ZOO_HOSTS
If ZOO_CLUSTER is set to "BINDING", this parameter is an array of the hosts where the ZooKeeper nodes are to be installed. The array must contain 3, 5, or 7 host names, for example ZOO_HOSTS=(sdw1 sdw2 sdw3 sdw4 sdw5). If you are using a single host for ZooKeeper, specify it multiple times, for example, ZOO_HOSTS=(sdw1 sdw1 sdw1).
declare -a ZOO_HOSTS=(sdw1 sdw2 sdw3 sdw4 sdw5)
ZOO_DATA_DIR
The ZooKeeper data directory, required when ZOO_CLUSTER is set to "BINDING".
ZOO_DATA_DIR="/data/master/"
ZOO_GPTXTNODE
The node path in ZooKeeper for GPText. This parameter is required whether ZOO_CLUSTER is set to "BINDING" or a list of hosts.
ZOO_GPTXTNODE="gptext"
ZOO_PORT_BASE
ZOO_MAX_PORT_LIMIT
A range of port numbers to use for the ZooKeeper cluster. Unused ports are allocated from within this range. The range must contain at least 4000 port numbers.
ZOO_PORT_BASE=2188
ZOO_MAX_PORT_LIMIT=12188
GPTEXT_JAVA_HOME
The home directory of the Java installation to run for ZooKeeper and Solr processes. If not set, the JRE specified in the PATH and JAVA_HOME environment variables will be used.
GPTEXT_JAVA_HOME=/usr/java/jdk1.8.0_131
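Two of the ZooKeeper constraints in the table (3, 5, or 7 host names, and a port range of at least 4000 numbers) are easy to check before running the installer. The following is a minimal sketch with hypothetical function names. It assumes the range width is ZOO_MAX_PORT_LIMIT minus ZOO_PORT_BASE, which is a conservative reading because the text does not say whether the bounds are inclusive.

```shell
#!/bin/sh
# Hypothetical check: succeeds when given 3, 5, or 7 host names.
# Pass the ZOO_HOSTS entries as arguments; repeated names count separately.
zoo_host_count_ok() {
    case $# in
        3|5|7) return 0 ;;
        *)     return 1 ;;
    esac
}

# Hypothetical check: succeeds when the port range spans at least 4000 numbers.
# $1 = ZOO_PORT_BASE, $2 = ZOO_MAX_PORT_LIMIT.
# Assumption: width is computed as max - base.
zoo_port_range_ok() {
    [ $(( $2 - $1 )) -ge 4000 ]
}

# Values taken from the examples in the table above.
zoo_host_count_ok sdw1 sdw2 sdw3 sdw4 sdw5 \
    && zoo_port_range_ok 2188 12188 \
    && echo "ZooKeeper settings look valid"
```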
Starting GPText
First, make sure the GPText command-line utilities are in your path by sourcing the Greenplum Database and GPText environment scripts. It is important to source the GPText environment script each time you source the Greenplum Database script. For example:
$ source /usr/local/greenplum-db-/greenplum_path.sh
$ source /usr/local/greenplum-text-/greenplum-text_path.sh
To use GPText in a database, you must first use the gptext-installsql management utility to install the GPText user-defined functions and other objects in the database:
$ gptext-installsql database [database2 ...]
The GPText objects are created in the gptext schema.
The ZooKeeper cluster must be running before you start GPText. If you installed a bound ZooKeeper cluster, start it with the zkManager command-line utility.
$ zkManager start
Start GPText with the gptext-start utility.
$ gptext-start
Configure Greenplum Database
GPText configuration parameters are saved in ZooKeeper. You can, however, view and set GPText configuration parameters in a Greenplum Database session using the SHOW and SET commands. This requires adding the GPText custom variable class to the Greenplum Database custom_variable_classes configuration parameter.
The custom_variable_classes configuration parameter is a comma-separated list of class names. It is unset by default. To see if any custom variable classes have already been configured, run this gpconfig command at the command line.
$ gpconfig -s custom_variable_classes
If no custom variable classes have been set, set the parameter with the following command.
[gpadmin@gpsne ~]$ gpconfig -c custom_variable_classes -v 'gptext'
20171029:12:29:11:028199 gpconfig:gpsne:gpadmin-[INFO]:-completed successfully
If other classes have been configured, add gptext to the existing list, separated by a comma.
Run gpstop -u to have Greenplum Database reload the configuration file.
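The rule for extending custom_variable_classes (set it to gptext when unset, otherwise append gptext with a comma) can be sketched as a small shell function. The function name is hypothetical, and the sketch assumes the existing value is a plain comma-separated list with no spaces around the commas.

```shell
#!/bin/sh
# Hypothetical helper: print the custom_variable_classes value with gptext added.
# $1 = current value as reported by gpconfig -s (empty string when unset).
with_gptext_class() {
    case ",$1," in
        *,gptext,*) echo "$1" ;;         # gptext already listed: keep the value
        ,,)         echo "gptext" ;;     # parameter unset: gptext alone
        *)          echo "$1,gptext" ;;  # other classes set: append gptext
    esac
}

with_gptext_class ""      # prints gptext
with_gptext_class "plr"   # prints plr,gptext
```

The resulting string is what you would pass to gpconfig -c custom_variable_classes -v before running gpstop -u.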
When you want to view or set GPText configuration parameters, first execute the gptext.version() function to load the GPText configuration parameters into the session.
=# SELECT gptext.version();
 version
--------------------------------
 Greenplum Text Analytics 2.1.2
(1 row)
=# SHOW gptext.idx_delim;
 gptext.idx_delim
------------------
 ,
(1 row)
See Setting GPText Configuration Parameters for more about GPText configuration parameters.
Uninstalling GPText
To uninstall GPText, run the gptext-uninstall utility. You must have superuser permissions on all databases with GPText schemas to run gptext-uninstall.
gptext-uninstall runs only if there is at least one database with a GPText schema.
Execute:
$ gptext-uninstall
Upgrading GPText
Upgrading a GPText system to a new GPText release installs the new GPText software release on all hosts in the Greenplum cluster and then upgrades the GPText system.
Upgrading GPText and Greenplum Database at the Same Time
If you are upgrading to new releases of Greenplum Database and GPText at the same time, follow these steps:
1. Complete the Greenplum Database upgrade first and ensure the database is operational.
2. Run the GPText gptext-migrator utility to migrate your current GPText system to the newly upgraded Greenplum Database system.
3. Ensure that the current version of GPText works with the new Greenplum Database version.
4. Proceed with the GPText upgrade.
Upgrading a GPText Release
Upgrading a GPText release is a two-part process: install the new software release on the Greenplum cluster hosts and then upgrade the existing GPText system. The GPText installer performs the first part, installing the new software. The gptext-upgrade utility performs the second part, upgrading the current GPText system to the new version.
The GPText installer detects an existing GPText system and, after installing the new software release, offers to run the gptext-upgrade utility for you. If you choose to upgrade the GPText system later, you can run the gptext-upgrade utility yourself.
All upgrade tasks are executed on the Greenplum master host as the gpadmin user. The gpadmin user must have write permission in the directory where the new GPText release is to be installed, /usr/local/greenplum-text-- by default.
The Greenplum Database, ZooKeeper, and GPText clusters must be running. The procedure stops and restarts GPText during the upgrade.
Follow these steps:
1. Download the new GPText release for your platform from Pivotal Network (http://network.pivotal.io).
2. Extract the release package.
$ tar xfz greenplum-text--.tar.gz
3. Make sure that ZooKeeper and GPText are running.
$ gptext-state
4. Run the GPText installer.
$ ./greenplum-text--.bin
5. The installer prompts you to accept the Pivotal license agreement and to choose and create the installation directory.
6. The installer verifies the environment to ensure that prerequisites are present, such as Python and Java. If any problems are discovered, the installer outputs an error message and stops. Correct the problem identified by the message and run the installer again.
7. After the new software has been installed on the Greenplum cluster, the installer looks for an existing GPText installation. If an existing GPText system is found, the installer asks if you wish to upgrade GPText directly.
If you answer yes, the installer runs the gptext-upgrade script. The gptext-upgrade utility validates the environment to ensure it can complete the upgrade, then executes the upgrade and restarts the GPText system. If any problems are discovered, gptext-upgrade outputs a message and quits. Fix the indicated problems and run the gptext-upgrade utility (at /bin/gptext-upgrade) to complete the GPText system upgrade. If you answer no, you must run the gptext-upgrade script after the installer completes. See the gptext-upgrade utility reference for instructions.
When upgrading GPText, you do not specify an installation configuration file as you do for the initial GPText installation.
Important: If you answer no or if gptext-upgrade quits without upgrading your software, follow these steps to re-run gptext-upgrade at a later time:
a. Source the greenplum-text_path.sh script in the old GPText installation directory. For example:
$ source /usr/local/greenplum-text-/greenplum-text_path.sh
b. Run the gptext-upgrade command from the new GPText installation directory:
$ /usr/local/greenplum-text-/bin/gptext-upgrade
8. After the upgrade has completed, source the greenplum-text_path.sh script in the new GPText release directory and run gptext-state healthcheck to verify the GPText system:
$ source /usr/local/greenplum-text-/greenplum-text_path.sh
$ gptext-state healthcheck
Introduction to Pivotal GPText
Pivotal GPText enables processing mass quantities of raw text data (such as social media feeds or e-mail databases) into mission-critical information that guides business and project decisions. GPText joins the Greenplum Database massively parallel-processing database server with Apache SolrCloud enterprise search and the MADlib Analytics Library to provide large-scale analytics processing and business decision support. GPText includes free text search as well as support for text analysis. GPText supports business decision making by offering:
Multiple kinds of data: GPText supports both semi-structured and unstructured data searches, which exponentially increases the kinds of information you can find.
Less schema dependence: GPText does not require static schemas to successfully locate information; schemas can change or be quite simple and still return targeted results.
Text analytics: GPText supports analysis of text data with machine learning algorithms. The MADlib analytics library is integrated with Greenplum Database and is available for use with GPText.
This chapter contains the following topics:
GPText System Architecture
GPText Sample Use Case
GPText Workflow
Text Analysis
GPText System Architecture
GPText combines a Greenplum Database cluster with an Apache SolrCloud cluster. Greenplum Database segments and GPText nodes can be deployed on the same hosts or on different hosts with network connectivity.
The following figure shows the process architecture of the combined Greenplum Database and Apache Solr clusters. The figure shows four cluster nodes with four Greenplum segments and four Solr instances deployed on each. An Apache ZooKeeper service manages the SolrCloud cluster. Because ZooKeeper is most efficient with an odd number of servers, ZooKeeper nodes are deployed on three of the four hosts. Greenplum Database users access SolrCloud services via GPText user-defined functions installed in Greenplum databases and command-line utilities.
The figure omits the Greenplum master host, secondary master, and mirror segments for the Greenplum primary segments.
The Greenplum segments, Solr instances, and ZooKeeper nodes may all be deployed on separate hosts on the same network, depending on application and performance requirements.
The following sections describe how GPText integrates SolrCloud with Greenplum Database and how the two clusters work together to provide parallel text search capabilities in Greenplum Database and maintain high availability.
Greenplum Database Cluster
A Greenplum Database cluster is comprised of the following components:
A master database instance, executing on a dedicated host, conventionally named mdw. (Not illustrated)
A secondary master instance, on a host conventionally named smdw, acting as a warm standby for the master instance. (Not illustrated)
An array of database primary segment instances and mirrors deployed on segment hosts, by convention sdw1 through sdwn. A segment instance is an independent Postgres database process managing a portion of the distributed data. Each segment has a mirror (not illustrated) on another host in the cluster to provide uninterrupted service in case of a segment or segment host failure. The number of primary segments per host is determined by the hardware configuration (the number and type of processor cores, the amount of physical RAM, local storage capacity, and network capacity) as well as availability and performance requirements.
The Greenplum master instance coordinates the work of the segment instances. Optimal performance of a Greenplum Database cluster requires that all segment hosts be configured identically with the same number of primary and mirror segments on each, and with the database data distributed evenly among the segment instances. The full capacity of the database cluster is utilized when every segment host performs an equal amount of work.
Apache SolrCloud
Apache Solr is a server providing access to Apache Lucene full-text indexes. Apache SolrCloud is a highly available, fault-tolerant cluster of Apache Solr servers. The term GPText cluster is another way to refer to a SolrCloud cluster deployed by GPText for use with a Greenplum Database system.
A SolrCloud cluster is comprised of the following components:
An Apache ZooKeeper cluster to manage the SolrCloud cluster. SolrCloud uses ZooKeeper to manage server configuration and to coordinate the cluster’s activities. GPText can install a ZooKeeper cluster that is bound to the GPText cluster, or it can share an existing ZooKeeper cluster. If GPText installs the ZooKeeper cluster, it can be managed using GPText functions and utilities. The ZooKeeper cluster can be deployed on Greenplum Database cluster hosts or, for best performance, on separate hosts accessible to the Greenplum Database cluster.
Multiple SolrCloud server instances deployed on the Greenplum segment hosts or on other hosts on the same network. Each instance is a JVM process running a Solr server. SolrCloud instances use local storage, which may be the same local storage volumes that store Greenplum Database data. The number of SolrCloud instances per host can be the same as the number of Greenplum primary segments per host, but this is not a requirement. The number of instances to execute per host is specified during GPText installation.
GPText provides document indexing and search capabilities for Greenplum Database by adding user-defined functions (UDFs) that access Solr APIs from within database queries.
GPText UDFs perform the following tasks:
create and manage GPText indexes
insert documents into indexes from database tables or, for GPText external indexes, from documents stored outside of Greenplum Database
search indexes
There are also GPText UDFs and command-line utilities to configure, monitor, and manage the SolrCloud cluster and to manage replicas, SolrCloud’s high-availability mechanism. (More on replicas in the next section.)
Parallelism in GPText Indexing and Searching
SolrCloud distributes document indexes in slices called shards. With GPText, the number of shards for an index is the same as the number of Greenplum segments, so each Greenplum segment operates on an equal portion of the index. Each shard is managed by a SolrCloud instance and the shards are distributed evenly among the SolrCloud instances. The SolrCloud instance and Greenplum segment are not required to be on the same host.
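The one-shard-per-segment layout can be pictured with a small sketch. It assumes a simple round-robin placement purely for illustration; in a real cluster the placement is decided by SolrCloud and ZooKeeper, and the shard and instance names used here are hypothetical.

```python
def assign_shards(num_segments, solr_instances):
    """Spread one shard per Greenplum segment across Solr instances.

    Round-robin is an illustrative assumption, not the SolrCloud algorithm.
    """
    placement = {name: [] for name in solr_instances}
    for shard in range(num_segments):
        target = solr_instances[shard % len(solr_instances)]
        placement[target].append(f"shard{shard}")  # hypothetical shard labels
    return placement


# 16 Greenplum segments and 4 Solr instances (hypothetical names)
placement = assign_shards(16, ["solr0", "solr1", "solr2", "solr3"])
for instance, shards in placement.items():
    print(instance, shards)
```

With 16 segments and 4 instances, each instance ends up managing four shards, which is the even distribution described above.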
High Availability for GPText Indexes
SolrCloud provides high availability by maintaining replicas of shards and providing automatic failover if a shard fails or becomes unavailable. One replica of each shard is the lead replica and any changes to it are applied to the other replicas. The replication factor, which determines the number of replicas to maintain for each shard, is set when the index is created. Replicas may also be added or dropped later using GPText UDFs or command-line utilities.
ZooKeeper determines the locations of shard replicas among the Solr nodes and hosts. When adding a replica using a GPText UDF or command-line utility, the new replica can be explicitly placed on a SolrCloud instance.
GPText Sample Use Case
Forensic financial analysts need to locate communications among corporate executives that point to financial malfeasance in their firm. The analysts use the following workflow:
1. Load the email records into a Greenplum database.
2. Create a Solr index of the email records.
3. Run queries that look for text strings and their authors.
4. Refine the queries until they pair a dummy company name with the top three or four executives corresponding about suspect offshore financial transactions. With this data, the analysts can focus the investigation on specific individuals rather than the thousands of authors in the initial data sample.
GPText Workflow
GPText works with Greenplum Database and Apache SolrCloud to store and index big data for information retrieval (query) purposes. High-level workflows include data loading and indexing, and data querying.
This topic describes the following information:
Data Loading and Indexing Workflow
Querying Data Workflow
Data Loading and Indexing Workflow
The following diagram shows the GPText workflow for loading and indexing data.
All client interaction with the system is through the Greenplum master instance.
1. Load data into your Greenplum Database system. Create a database table to hold data and then add the data to the table. Greenplum provides parallel data loading utilities and protocols that help to transform and load external data in various formats and from various sources. For details, see the Greenplum Database Administrator Guide, at http://gpdb.docs.pivotal.io.
2. Create an empty GPText index. Use the gptext.create_index() user-defined function (UDF) to create an empty GPText index for the table. Each Greenplum segment will manage a slice of the index, called a shard. SolrCloud creates multiple replicas for each shard, distributed among the Solr instances, and chooses a lead replica for the Greenplum segment to operate upon. Solr manages replication between the replicas.
3. Populate the index with data from the database table. Use the gptext.index() UDF to add data to the index. This UDF works by dispatching a SQL query to execute on each Greenplum segment. The segments execute the query and add the results to their shards using Solr APIs.
4. Commit changes to the index. Commit changes to the GPText index by calling the gptext.commit_index() UDF. Until the changes are committed, queries executed on the index cannot access any data added to the index with gptext.index(). If needed, uncommitted changes can be rolled back. SolrCloud replicates changes committed to the lead replica to the shards’ non-lead replicas.
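The commit step is easy to misread, so here is a toy model of commit visibility. It is not the GPText or Solr implementation: the ToyIndex class and its methods are hypothetical stand-ins that only mimic the behavior described in step 4, where added documents stay invisible to searches until they are committed and can be discarded by a rollback.

```python
class ToyIndex:
    """Toy model of commit visibility; not GPText or Solr code."""

    def __init__(self):
        self.committed = []  # documents visible to searches
        self.pending = []    # documents added but not yet committed

    def add(self, doc):
        self.pending.append(doc)

    def commit(self):
        self.committed.extend(self.pending)
        self.pending.clear()

    def rollback(self):
        self.pending.clear()

    def search(self, term):
        # Only committed documents are searched
        return [d for d in self.committed if term in d]


idx = ToyIndex()
idx.add("offshore transfer approved")
before = idx.search("offshore")  # the document is not visible yet
idx.commit()
after = idx.search("offshore")   # the committed document is now visible
idx.add("offshore draft")
idx.rollback()                   # discards the uncommitted document
print(before, after, idx.search("offshore"))
```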
Querying Data Workflow
The following diagram shows the high-level GPText query process workflow:
1. A user submits a SQL query designed to search the indexed data. A GPText search query is a SQL SELECT statement on a GPText search UDF that contains full-text search expressions.
2. The Greenplum master dispatches the query to the Greenplum segments.
3. Each segment executes the query, using the Solr API to search its index shard. SolrCloud executes the search query on the lead replica for the shard.
4. The Greenplum segments return the results of the search query to the Greenplum master.
5. The Greenplum master aggregates the results from all segments and returns them to the client.
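The five steps above follow a scatter-gather pattern, sketched below under simplifying assumptions: the shards are plain in-memory lists, a sequential loop stands in for the parallel dispatch to segments, and the function names are hypothetical rather than GPText APIs.

```python
def search_shard(shard_docs, term):
    """Stand-in for one segment searching its own index shard."""
    return [(doc_id, text) for doc_id, text in shard_docs if term in text]


def master_query(shards, term):
    """Dispatch the search to every segment, then merge the partial results."""
    results = []
    for shard_docs in shards:  # in a real cluster the segments work in parallel
        results.extend(search_shard(shard_docs, term))
    return sorted(results)


# Two shards holding (id, text) pairs; the data is made up for illustration
shards = [
    [(1, "quarterly report"), (4, "offshore account opened")],
    [(2, "lunch on friday"), (5, "offshore transfer")],
]
hits = master_query(shards, "offshore")
print(hits)
```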
Text Analysis
GPText enables analysis of Solr indexes with Apache MADlib, an open source library for scalable in-database analytics. MADlib provides data-parallel implementations of mathematical, statistical, and machine learning methods for structured and unstructured data. You can use GPText to perform a variety of MADlib analyses.
Learn more about Apache MADlib at http://madlib.apache.org. A gppkg package for MADlib is available on the Pivotal network at http://network.pivotal.io.
Administering GPText
GPText administration includes security considerations, monitoring Solr index statistics, managing and monitoring ZooKeeper, and troubleshooting.
Changing GPText Server Configuration Parameters
Configuration parameters used with GPText are built into GPText with default values. You can change the values for these parameters by setting the new values in a Greenplum Database session. The new values are stored in ZooKeeper. GPText indexes use the values of configuration parameters when they are created. Changing configuration parameters affects new indexes, but does not affect existing indexes.
See GPText Configuration Parameters for a complete list of configuration parameters.
A one-time Greenplum Database configuration change is needed for Greenplum Database to allow setting and displaying GPText configuration variables. As the gpadmin user, enter the following commands in a shell:
$ gpconfig -c custom_variable_classes -v 'gptext'
$ gpstop -u
Then connect to a database that contains the GPText schema and execute the gptext.version() function to expose the GPText configuration variables:
=# select * from gptext.version();
Change the values of GPText configuration variables using the SET command in a session with a database that contains the GPText schema. The following example sets values for three configuration parameters in a psql session:
=# set gptext.idx_buffer_size=10485760;
SET
=# set gptext.idx_delim='|';
SET
=# set gptext.extension_factor=5;
SET
You can view the current value of a configuration parameter that you have set using the SHOW command:
=# show gptext.idx_delim;
 gptext.idx_delim
------------------
 |
(1 row)
Security and GPText Indexes
GPText security is based on Greenplum Database security. Your privileges to execute GPText functions depend on your privileges for the database table that is the source for the index. For example, if you have SELECT privileges for a table in the Greenplum Database database, then you have SELECT privileges for an index generated from that table.
Executing GPText functions requires one of OWNER, SELECT, INSERT, UPDATE, or DELETE privileges, depending on the function. The OWNER is the person who created the table and has all privileges. See the Greenplum Database Administrator Guide for information about setting privileges.
ZooKeeper Administration
Apache ZooKeeper enables coordination between the Apache Solr and Pivotal GPText distributed processes through a shared namespace that resembles a file system. In ZooKeeper, a node (called a znode) can contain data, like a file, and can have child znodes, like a directory. ZooKeeper replicates data between multiple instances deployed as a cluster to provide a highly available, fault-tolerant service. Both Solr and GPText store configuration files and share status by writing data to ZooKeeper znodes. GPText stores information in the /gptext znode. The configuration files for a GPText index are in the /gptext/configs/ znode.
The number of ZooKeeper instances in the cluster determines how many ZooKeeper node failures the cluster can tolerate and still remain active. The service remains available as long as a majority of the nodes in the cluster are running and able to communicate with each other. To tolerate a failure of n nodes the cluster must have 2n + 1 nodes. A cluster of five nodes, for example, can tolerate two failed nodes.
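The majority rule translates directly into arithmetic. The following sketch is only a worked illustration of that rule; the function names are hypothetical and nothing here is part of ZooKeeper or GPText.

```python
def nodes_required(tolerated_failures):
    """Smallest cluster size that survives the given number of failed nodes."""
    return 2 * tolerated_failures + 1


def failures_tolerated(cluster_size):
    """Failed nodes a cluster can lose while a majority remains."""
    return (cluster_size - 1) // 2


for size in (3, 4, 5, 7):
    print(size, "nodes tolerate", failures_tolerated(size), "failure(s)")
```

A four-node cluster tolerates only one failure, the same as a three-node cluster, which is why an even node count adds no resilience.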
ZooKeeper is very fast for read requests because it stores data in memory. If ZooKeeper begins to swap memory to disk, Solr and GPText performance will suffer and failures could occur, so it is critical to allocate sufficient memory to the ZooKeeper Java processes. To avoid ZooKeeper instances competing with Greenplum Database segments for memory, you should deploy the ZooKeeper instances and Greenplum Database segments on different hosts. The ZooKeeper and Greenplum Database hosts must be on the same network and accessible with passwordless SSH by the gpadmin user. You can use the Greenplum Database gpssh-exkeys utility to share SSH keys between ZooKeeper and Greenplum Database hosts.
You must start the ZooKeeper cluster before you start GPText. When you start GPText, the Solr nodes each load the replicas for indexes they manage. With large numbers of indexes, shards, and replicas, starting up the cluster can generate a very high, atypical load on ZooKeeper. It can take a long time to get all indexes loaded and some ZooKeeper requests may time out waiting for responses. Using the gptext-start --slow_start option starts Solr nodes one at a time, providing a more ordered start-up and limiting the number of concurrent ZooKeeper requests.
The GPText command-line utility zkManager can be used to monitor the ZooKeeper cluster. If the ZooKeeper cluster is bound to GPText, you can also start and stop the cluster using zkManager.
Checking ZooKeeper Status
Use the zkManager utility from the command line to check the ZooKeeper cluster status. The utility lists the hosts, ports, latency, and follower/leader mode for each ZooKeeper instance. If a node is down, its mode is listed as Down.
To check the ZooKeeper cluster status, run the zkManager state command.
$ zkManager state
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Execute zookeeper state process.
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Check zookeeper cluster state ...
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Host   port   Latency min/avg/max   Mode
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-gpdb   2189   0/0/22                follower
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-gpdb   2190   0/0/29                leader
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-gpdb   2188   0/0/27                follower
20171016:12:59:47:026338 zkManager:gpdb:gpadmin-[INFO]:-Done.
In a database session, you can use the gptext.zookeeper_hosts() function to list the ZooKeeper hosts.
=# SELECT * FROM gptext.zookeeper_hosts();
  host  | port
--------+------
 gpdb51 | 2188
 gpdb51 | 2189
 gpdb51 | 2190
(3 rows)
Starting and Stopping the ZooKeeper Cluster
If the ZooKeeper cluster was installed by the GPText installer, the zkManager utility can start or stop the ZooKeeper cluster. To start the cluster, run the zkManager start command.
$ zkManager start
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-Execute zookeeper start process
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:------------------------------------------------
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-Starting Zookeeper:
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:------------------------------------------------
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-Host   Zookeeper Dir
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo0
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo1
20171016:16:14:46:017845 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo2
20171016:16:14:48:017845 zkManager:gpdb:gpadmin-[INFO]:-Check zookeeper cluster state ...
20171016:16:14:53:017845 zkManager:gpdb:gpadmin-[INFO]:-Done.
To stop ZooKeeper, run the zkManager stop command.
$ zkManager stop
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-Execute zookeeper stop process.
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:------------------------------------------------
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-Stop Zookeeper:
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:------------------------------------------------
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-Host   Zookeeper Dir
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo0
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo1
20171016:16:14:08:016499 zkManager:gpdb:gpadmin-[INFO]:-gpdb   /data/master/zoo2
20171016:16:14:09:016499 zkManager:gpdb:gpadmin-[INFO]:-Done.
See the zkManager reference for more information.
Checking SolrCloud Status
You can check the status of the SolrCloud cluster and indexes by running the gptext-state utility from the command line.
To check the state of the GPText nodes and each index, run the gptext-state utility with the -D (--details) option. Example:
$ gptext-state -D
20180615:16:09:24:031986 gptext-state:mdw:gpadmin-[INFO]:-Execute GPText state...
20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-Check zookeeper cluster state...
20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-Check GPText cluster status...
20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-Current GPText Version: 3.0.0
20180615:16:09:25:031986 gptext-state:mdw:gpadmin-[INFO]:-All nodes are up and running.
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:------------------------------------------------
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-Index state details.
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:------------------------------------------------
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-database   index name                state
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-demo       demo.twitter.message      Green
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-demo       demo.wikipedia.articles   Green
20180615:16:09:26:031986 gptext-state:mdw:gpadmin-[INFO]:-Done.
This command reports the status of the GPText nodes and status of each GPText index.
Run gptext-state list to view just the indexes.
The gptext-state healthcheck command checks the GPText configuration files, the index status, required disk space, user privileges, and index and database consistency. By default, the required disk space check passes if there is at least 20% disk free. You can set a different disk free threshold using the --disk_free option. For example:
[gpadmin@gpdb-sandbox ~]$ gptext-state healthcheck --disk_free=25
20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Execute healthcheck on GPText cluster!
20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Check GPText config files...
20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
20160629:15:45:24:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Check GPText index status...
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Checking for required disk space...
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Checking for required user privileges...
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
20160629:15:45:25:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Checking for indexes and database consistency...
20160629:15:45:27:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-GOOD
20160629:15:45:27:669652 gptext-state:gpdb-sandbox:gpadmin-[INFO]:-Done.
See the gptext-state utility reference for additional options.
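The disk space threshold can be reasoned about with a few lines of Python. This is a sketch of the percentage comparison only and makes no claim about how gptext-state measures disk usage; the function names are hypothetical, and the default of 20 is taken from the text above.

```python
import shutil


def disk_free_percent(total_bytes, free_bytes):
    """Free space as a percentage of total capacity."""
    return 100.0 * free_bytes / total_bytes


def passes_disk_check(total_bytes, free_bytes, threshold_percent=20):
    """True when at least threshold_percent of the disk is free."""
    return disk_free_percent(total_bytes, free_bytes) >= threshold_percent


# Check the root file system of the local machine against a 25% threshold
usage = shutil.disk_usage("/")
print(passes_disk_check(usage.total, usage.free, threshold_percent=25))
```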
Recovering GPText Nodes
Use the gptext-recover utility to recover down GPText nodes, for example after a failed Greenplum Database segment host is recovered.
With no arguments, the gptext-recover utility discovers down GPText nodes and restarts them.
With the -f (or --force) option, if a GPText node cannot be restarted and no shards are down, the node is deleted and created again on the same host. Missing replicas are added and the failed node and failed replicas are removed.
The -H (--new_hosts) option allows recreating down GPText nodes on new hosts that replace failed hosts. The down GPText nodes are deleted and recreated on the new hosts. The argument to the -H option is a comma-separated list of the new hosts that are to replace the failed hosts. The number of new hosts must match the number of failed hosts. If shards are down, the utility advises reindexing. If only some replicas are down, it recreates the replicas on the new hosts and updates gptext.conf.
The -r option recovers replicas, but does not attempt to recover any down nodes.
Note: Before recovering GPText nodes on newly added hosts, ensure that the following GPText prerequisites have been installed on the host:
Java 1.8
Python 2.6
The Linux lsof utility
Viewing Solr Index Statistics
You can view Solr index statistics by running the gptext-state utility from the command line.
To list all GPText indexes, enter the following command at the command line:
gptext-state list
A command line that retrieves all statistics for an index:
gptext-state --index demo.wikipedia.articles
A command line that retrieves the number of documents in an index:
gptext-state --index demo.wikipedia.articles --stats_columns=num_docs
A command line that retrieves num_docs, index size, and the date and time last_modified:
gptext-state --index demo.wikipedia.articles --stats_columns num_docs,size,last_modified
Backing Up and Restoring GPText Indexes
With the gptext-backup management utility, you can back up a GPText index so that, if needed, you can quickly recover from a failure. The backup can be restored to the same GPText system or to another system with the same number of Greenplum Database segments.
The gptext-backup management utility backs up an index and its configuration files to either a shared file system, which must be mounted on and writable by each host in the Greenplum Database cluster, or to local storage on the Greenplum Database master and segment hosts.
Backing Up to a Shared File System
To back up on a shared file system, use the -p (--path) command-line option to specify the location of a directory on the mounted file system and the -n (--name) option to provide a name for the backup. Specify the index to back up with the -i (--index) option.
$ gptext-backup -i -p -n
The gptext-backup utility then checks that:
the GPText cluster is up
the shared file system is valid
the backup name specified with the -n option does not already exist in the directory specified with the -p option
The utility creates the new directory and then saves one copy of each index shard to that directory, along with the index’s configuration files from ZooKeeper.
To save the configuration files only, with no data, add the -c (--backup_conf) command-line option.
To restore an index from a shared file system, use the gptext-restore management utility. The GPText system you restore to must be on a Greenplum Database cluster with the same number of segments. The database and schema for the index must be present.
The -i (--index) option specifies the name of the GPText index that will be restored. If the index exists, you must first drop it with the gptext.drop_index() user-defined function.
The -p (--path) option specifies the location of the directory containing the backup files, that is, the directory that gptext-backup created on the shared file system.
$ gptext-restore -i -p
You can add the -c option to restore only the configuration files to ZooKeeper and create an empty GPText index, without restoring any saved index data.
Backing Up to Local Storage
To back up to local storage on the Greenplum Database cluster, add the local keyword to the gptext-backup command line.
A local GPText backup has a unique name constructed by appending a timestamp to the index name. You do not use the -n option with local backups.
$ gptext-backup local -i
On the master host, in the master data directory by default, the backup utility saves a JSON file with backup metadata and a directory containing the index’s configuration files from ZooKeeper.
The utility backs up each index shard on the Greenplum Database segment host with the GPText node that manages the shard’s lead replica. By default, the shard backup files are saved in a segment data directory.
The gptext-backup command output reports the locations of all backup files.
You can add the -p (--path) option to the gptext-backup command to specify a local directory where the backup will be saved. The directory must be present on every Greenplum Database host and must be writable by the gpadmin user.
$ gptext-backup local -i -p
The backup files will be saved in the specified directory on each host instead of in the Greenplum Database master and segment data directories.
To restore a backup saved to local storage, add the local keyword to the gptext-restore command line and specify the path to the backup directory on the master host.
$ gptext-restore local -p
The path is the full path to the directory the gptext-backup command created on the master host, including the timestamp, for example $MASTER_DATA_DIRECTORY/demo.twitter.message_2018-05-08T15:32:21.397779.
See gptext-backup for syntax and examples for running gptext-backup. See gptext-restore for syntax and examples for running gptext-restore.
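The local backup directory name in the example above looks like the index name joined to an ISO 8601 timestamp. The sketch below reproduces that shape; the format is inferred from the single example in this guide, so treat it as an assumption rather than the utility's documented behavior, and note that the helper name is hypothetical.

```python
from datetime import datetime


def local_backup_name(index_name, when):
    """Index name plus an ISO 8601 timestamp (format inferred, not documented)."""
    return f"{index_name}_{when.isoformat()}"


name = local_backup_name(
    "demo.twitter.message",
    datetime(2018, 5, 8, 15, 32, 21, 397779),
)
print(name)
```

A helper like this can be handy when scripting a restore, for instance to locate the newest backup directory for an index by sorting the names.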
Expanding the GPText Cluster
The gptext-expand management utility adds GPText nodes to the cluster. There are two ways to add nodes:
Add GPText nodes to existing hosts in the cluster. This option increases the number of GPText nodes on each host.
Add GPText nodes to new hosts added by using the Greenplum Database gpexpand management utility to expand the Greenplum Database system.
Adding GPText Nodes to Existing Segment Hosts
To add nodes to existing segment hosts, run the gptext-expand utility with a command like the following:
gptext-expand -e -p /data1/nodes,/data2/nodes
This example adds two GPText nodes to each host.
The -e (--existing) option specifies that nodes are to be added to existing hosts.
The -p (--expand_paths) option provides a list of directories where the new nodes’ data directories are to be created. These should be the same directories that contain the Greenplum Database segment data directories and existing GPText data directories. The number of directories in the list is the number of new nodes that are added.
A directory can be repeated in the directory list multiple times to increase the number of new GPText nodes to create. For example, if there is currently one GPText node per host in the /data1/nodes directory, you could add three nodes with a command like the following:
gptext-expand -e -p /data1/nodes,/data2/nodes,/data2/nodes
This adds one node to the /data1/nodes directory and two nodes to the /data2/nodes directory so there are two GPText nodes in each directory.
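The node count that results from a given path list is easy to verify by tallying directories. The sketch below only mirrors the counting described above and does not invoke gptext-expand; the function name is hypothetical.

```python
from collections import Counter


def nodes_per_directory(existing_dirs, expand_paths):
    """Count GPText nodes per directory after applying an expand path list."""
    counts = Counter(existing_dirs)          # one entry per existing node
    counts.update(expand_paths.split(","))   # one new node per listed path
    return dict(counts)


result = nodes_per_directory(
    ["/data1/nodes"],
    "/data1/nodes,/data2/nodes,/data2/nodes",
)
print(result)
```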
Adding GPText nodes affects new indexes, but not existing indexes. Replicas for new indexes will be distributed across all of the nodes, including both old nodes and the newly created nodes. Replicas for indexes that existed before running gptext-expand are not automatically moved. Rebalancing existing replicas requires reindexing.
Adding GPText Nodes to New Hosts
Check that the following GPText prerequisites are installed on each new host added to the Greenplum Database cluster:
Java 1.8
Python 2.6 or greater
Linux lsof utility
New hosts must be reachable by all hosts in the GPText cluster, including existing hosts and the new hosts you are adding.
After expanding the Greenplum Database cluster with the gpexpand management utility, call gptext-expand with the -H (--new_hosts) option and a list of the new hosts on which to install GPText:
gptext-expand -H newhost1,newhost2
The gptext-expand utility installs GPText binaries on the new hosts and then creates new GPText nodes on the new hosts.
Expanding a Greenplum Database cluster increases the number of segments, so the number of GPText index shards for existing indexes must be increased to equal the new number of segments. This requires reindexing all existing documents. Newly created indexes will automatically be distributed among the new shards.
Troubleshooting
GPText errors are of the following types:
Solr errors
gptext errors
Most of the Solr errors are self-explanatory.
gptext errors are caused by misuse of a function or utility. They provide a message that tells you when you have used an incorrect function or argument.
Monitoring Logs
You can examine the Greenplum Database and Solr logs for more information if errors occur. Greenplum Database logs reside in:
segment-directory/pg_log
Solr logs reside in:
/solr/logs
Determining Segment Status with gptext-state
Use the gptext-state utility to determine if any primary or mirror segments are down. See gptext-state in the GPText Management Utilities Reference.
GPText High Availability
The GPText high availability feature ensures that you can continue working with GPText indexes as long as each shard in the index has at least one working replica.
A GPText index has one shard for each Greenplum segment, so there is a one-to-one correspondence between Greenplum segments and GPText index shards. The shard managed by a Greenplum segment is an index of the documents that are managed by that segment.
The GPText high availability mechanism is to maintain multiple copies, or replicas, of the shard. The ZooKeeper service that manages SolrCloud chooses a GPText instance (SolrCloud node) for each replica to ensure even distribution and high availability. For each shard, one replica is elected leader and the Greenplum segment associated with the shard operates on this leader replica. The GPText instance managing the lead replica may or may not be on another Greenplum host, so indexing and searching operations are passed over the Greenplum cluster’s interconnect network. SolrCloud replicates changes made to the leader replica to the remaining replicas.
The following figure illustrates the relationships between Greenplum segments and GPText index shards and replicas. The leader replica for each shard is shown in green and the followers are gray.
The number of replicas to create for each shard, the replication factor, is a SolrCloud property. By default, GPText starts SolrCloud with a replication factor of three. The replication factor for each individual index is the value of the SolrCloud replication factor when the index is created. Changing the replication factor does not alter the replication factor for existing indexes.
Greenplum Segment or Host Failure
If a Greenplum primary segment fails and its mirror is activated, GPText functions and utilities continue to access the leader replica. No intervention is needed.
If a host in the cluster fails, both Greenplum and GPText are affected. Mirrors for the Greenplum primary segments located on the failed host are activated on other hosts. SolrCloud elects a new leader replica for affected shards. Because Greenplum segment mirrors and GPText shard replicas are distributed throughout the cluster, a single host failure should not prevent the cluster from continuing to operate. The performance of database queries and indexing operations will be affected until the failed host is recovered and the cluster is brought back into balance.
ZooKeeper Cluster Availability
SolrCloud is dependent on a working, available ZooKeeper cluster. For ZooKeeper to be active, a majority of the ZooKeeper cluster nodes must be up and able to communicate with each other. A ZooKeeper cluster with three nodes can continue to operate if one of the nodes fails, since two is a majority of three. To tolerate two failed nodes, the cluster must have at least five nodes so that the number of working nodes remaining after the failure is a majority. To tolerate n node failures, then, a ZooKeeper cluster must have 2n + 1 nodes. This is why ZooKeeper clusters usually have an odd number of nodes.
The best practice for a high-availability GPText cluster is a ZooKeeper cluster with five or seven nodes so that the cluster can tolerate two or three failed nodes.
Managing GPText Cluster Health
GPText document indexing and searching services remain available as long as each shard of an index has at least one working replica. To ensure availability in the event of a failure, it is important to monitor the status of the cluster and ensure that all of the index shard replicas are healthy. You can monitor the SolrCloud cluster and indexes using the SolrCloud Dashboard or using GPText functions and management utilities. Access the SolrCloud Dashboard with a web browser on any GPText instance with a URL such as http://sdw3:18983/solr. (The port numbers for GPText instances are set with the GPTEXT_PORT_BASE parameter in the installation parameters file at installation time.)
Refer to the Apache SolrCloud documentation for help using the SolrCloud Dashboard.
Monitoring the Cluster with GPText
The GPText gptext-state management utility allows you to query the state of the GPText cluster and indexes. You can also use gptext.index_status() to view the status of all indexes or a specified index.
To see the GPText cluster state, run the gptext-state command-line utility with the -d option to specify a database that has the GPText schema installed.
gptext-state -d mydb
The utility reports any GPText nodes that are down and lists the status of every GPText index. For each index, the database name, index name, and status are reported. The status column contains “Green”, “Yellow”, or “Red”:
Green: all replicas for all shards are healthy
Yellow: all shards have at least one healthy replica but at least one replica is down
Red: no replicas are available for at least one index shard
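The three status values follow mechanically from the health of each shard's replicas. The sketch below encodes those rules over a plain dictionary; it is an illustration of the definitions only, with a hypothetical function name and data layout, not the logic inside gptext-state.

```python
def index_status(shards):
    """Classify an index from per-shard replica health.

    shards maps a shard name to a list of booleans, one per replica,
    where True means the replica is healthy.
    """
    if any(not any(replicas) for replicas in shards.values()):
        return "Red"     # some shard has no healthy replica
    if any(not all(replicas) for replicas in shards.values()):
        return "Yellow"  # every shard is served, but a replica is down
    return "Green"       # all replicas of all shards are healthy


print(index_status({"shard0": [True, True], "shard1": [True, True]}))
print(index_status({"shard0": [True, False], "shard1": [True, True]}))
print(index_status({"shard0": [False, False], "shard1": [True, True]}))
```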
To see the distribution of index shards and replicas in the GPText cluster, execute this SQL statement.
SELECT index_name, shard_name, replica_name, node_name FROM gptext.index_summary() ORDER BY node_name;
To list all GPText indexes, run the gptext-state list command.
gptext-state list -d mydb
The gptext-state healthcheck command checks the health of the cluster. The -f flag specifies the percentage of available disk space required to report a healthy cluster. The default is 10.
gptext-state healthcheck -f 20 -d mydb
See gptext-state in the Management Utilities reference for help with additional gptext-state options.
The gptext.index_status() user-defined function reports the status of all GPText indexes or a specified index.
SELECT * FROM gptext.index_status();
Specify an index name to report only the status of that index.
SELECT * FROM gptext.index_status('demo.twitter.message');
Adding and Dropping Replicas
The gptext-replica utility adds or drops a replica of a single index shard. Use the gptext.add_replica() and gptext.delete_replica() user-defined functions to perform the same tasks from within the database.
If a replica of a shard fails, use gptext-replica to add a new replica and then drop the failed replica to bring the index back to “Green” status.
gptext-replica add -i mydb.public.messages -s shard3
Here is the equivalent, using the gptext.add_replica() function:
SELECT * FROM gptext.add_replica('mydb.public.messages', 'shard3');
ZooKeeper determines where the replica will be located, but you can also specify the node where the replica is created:
gptext-replica add -i mydb.public.messages -s shard3 -n sdw3
In the gptext.add_replica() function, add the node name as a third argument.
To drop a replica, run gptext-replica drop with the name of the index (-i), the name of the shard (-s), and the name of the replica (-r). You can find the name of the replica by calling gptext.index_status(index_name). The name is in the format core_noden. An optional -o flag specifies that the replica is to be deleted only if it is down.
gptext-replica drop -i mydb.public.messages -s shard3 -r core_node4 -o
Here is the equivalent of the above command using the gptext.delete_replica() user-defined function.
SELECT * FROM gptext.delete_replica('mydb.public.messages', 'shard3', 'core_node4', true);
GPText Best Practices
Each GPText/Apache Solr node is a Java Virtual Machine (JVM) process and is allocated memory at startup. The maximum amount of memory the JVM will use is set with the -Xmx parameter on the Java command line. Performance problems and out of memory failures can occur when the nodes have insufficient memory.
Other performance problems can result from resource contention between the Greenplum Database, Solr, and ZooKeeper clusters.
This topic discusses GPText use cases that stress Solr JVM memory in different ways and the best practices for preventing or alleviating performance problems from insufficient JVM memory and other causes.
Indexing Large Numbers of Documents
Indexing documents consumes Solr JVM memory. When the index is committed, parts of the memory are released, but some data remains in memory to support fast search. By default, Solr performs an automatic soft commit when 1,000,000 documents are indexed or 20 minutes (1,200,000 milliseconds) have passed. A soft commit pushes documents from memory to the index, freeing JVM memory. A soft commit also makes the documents visible in searches. A soft commit does not, however, make the index updates durable; it is still necessary to commit the index with the gptext.commit_index() user-defined function.
You can configure an index to perform a more frequent automatic soft commit by editing the solrconfig.xml file for the index:
$ gptext-config edit -f solrconfig.xml -i ..
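The <autoSoftCommit> element is a child of the <updateHandler> element. Edit the <maxDocs> and <maxTime> values to reduce the time between automatic commits. For example, the following settings perform an autocommit every 100,000 documents or 10 minutes.
<autoSoftCommit>
  <maxDocs>100000</maxDocs>
  <maxTime>600000</maxTime>
</autoSoftCommit>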
Indexing Very Large Documents
Indexing very large documents can use a large amount of JVM memory. To manage this, you can set the gptext.idx_buffer_size configuration parameter to reduce the size of the indexing buffer.
See Changing GPText Server Configuration Parameters for instructions to change configuration parameter values.
Determining the Number of GPText Nodes to Deploy
A GPText node is a Solr instance managed by GPText. The nodes can be deployed on the Greenplum Database cluster hosts or on separate hosts accessible to the Greenplum Database cluster. The number of nodes is configured during GPText installation.
The maximum recommended number of GPText nodes you can deploy is the number of Greenplum Database primary segments. However, the best practice recommendation is to deploy fewer GPText nodes with more memory rather than to divide the memory available to GPText among the maximum number of GPText nodes. Use the JAVA_OPTS installation parameter to set memory size for GPText nodes.
A single GPText node per host can easily handle several indexes. Each additional node consumes additional CPU and memory resources, so it is desirable to limit the number of nodes per host. For most GPText installations, a single GPText node per host is sufficient.
If the JVM has a very large amount of memory, however, garbage collection can cause long pauses while the JVM reorganizes memory. Also, the JVM employs a memory address optimization that cannot be used when JVM memory exceeds 32GB, so at more than 32GB, a GPText node loses capacity and performance. Therefore, no GPText node should have more than 32GB of memory.
For example, if you have 48GB memory available for GPText per host, you should deploy two GPText nodes with 24GB memory each. If you have 128GB available, you should deploy at least four JVMs, and more if garbage collection becomes a problem.
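The sizing rule above can be sketched as a small shell calculation. This is only an illustration of the arithmetic: the function names are hypothetical and are not part of GPText, and the result is a lower bound that you may need to raise if garbage collection becomes a problem.

```shell
#!/usr/bin/env bash
# Sketch: the smallest number of GPText nodes per host that keeps each
# JVM at or below a memory cap (32 GB by default, per the guidance above).
# nodes_for_memory and heap_per_node are hypothetical helper names.

# nodes_for_memory <memory_gb> [cap_gb]
nodes_for_memory() {
    local mem_gb=$1 cap_gb=${2:-32}
    # Integer ceiling of mem_gb / cap_gb.
    echo $(( (mem_gb + cap_gb - 1) / cap_gb ))
}

# heap_per_node <memory_gb> [cap_gb]
heap_per_node() {
    local mem_gb=$1 cap_gb=${2:-32}
    local nodes
    nodes=$(nodes_for_memory "$mem_gb" "$cap_gb")
    # Integer division; any remainder is left unallocated.
    echo $(( mem_gb / nodes ))
}

nodes_for_memory 48     # prints 2
heap_per_node 48        # prints 24
nodes_for_memory 128    # prints 4
```

With 48GB the sketch gives two nodes of 24GB, and with 128GB it gives four nodes, matching the examples above.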
Configure Maximum JVM Heap Size
Each Solr core file consumes JVM heap memory. Adding more indexes increases JVM swapping and garbage collection frequency so that it takes longer to create indexes and to load the core files when GPText is started. If you continue to create indexes without increasing the JVM heap, an out of memory error will eventually occur.
Monitor performance at startup and during index creation and increase the JVM size when you begin to see degraded performance. You can also use tools such as jconsole, included with the Java Development Kit (JDK), to monitor Java heap usage. If garbage collections are occurring too frequently and freeing too little memory, the JVM heap should be increased.
The JVM size is initially configured during GPText installation by setting the JAVA_OPTS parameter in the installation configuration file. After installation, use the gptext-config jvm command to increase the JVM heap size. For example, this gptext-config jvm command sets the JVM maximum heap option to 4GB:
$ gptext-config jvm -o "-Xmx=4096M"
Manage Indexing and Search Loads
With high indexing or search load, JVM garbage collection pauses can cause the Solr overseer queue to back up. For a heavily loaded GPText system, you can prevent some performance problems by scheduling document indexing for times when search activity is low.
Terms Queries and Out of Memory Errors
The gptext.terms() function retrieves terms vectors from documents that match a query. An out of memory error may occur if the documents are large, or if the query matches a large number of documents on each node. Other factors can contribute to out of memory errors when running a gptext.terms() query, including the maximum memory available to the Solr nodes (-Xmx value in JAVA_OPTS) and concurrent queries.
If you experience out of memory errors with gptext.terms(), you can set a lower value for the term_batch_size GPText configuration variable. The default value is 1000. For example, you could try running the failing query with term_batch_size set to 500. Lowering the value may prevent out of memory errors, but performance of terms queries can be affected.
See GPText Configuration Parameters for help setting GPText configuration parameters.
Configure File System Caching for ZooKeeper
Good Solr performance is dependent on fast response for ZooKeeper requests. ZooKeeper performs best when its database is cached so it does not have to go to disk for lookups. If you find that ZooKeeper JVMs have frequent disk accesses, look for ways to improve file caching or move ZooKeeper disks to faster storage.
The ZooKeeper zkClientTimeout parameter is the time a client is allowed to not talk to ZooKeeper before having its session expired.
Troubleshooting Hadoop Connection Problems
This section describes Hadoop-related problems and potential solutions to these issues.
DataNode Access Errors
You may experience Hadoop access errors with GPText if any DataNodes in the Hadoop cluster reside in a multi-homed network. GPText uses an external IP address to access the HDFS NameNode. GPText encounters an error when the NameNode provides an internal IP address for a DataNode. In this situation, additional configuration is required to configure GPText to perform its own DNS resolution of DataNode hostnames.
Perform the following procedure to explicitly configure DNS resolution of DataNode hostnames:
1. Locate a local copy of the Hadoop authentication configuration directory that you previously uploaded to ZooKeeper. For example, if the directory is located at /home/gpadmin/auths/hdfs_conf:
$ cd /home/gpadmin/auths/hdfs_conf
$ ls
core-site.xml hdfs-site.xml user.txt
2. Open hdfs-site.xml in the editor of your choice. For example:
$ vi hdfs-site.xml
3. Add the following property block to the file, and then save the file and exit:
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
This property allows GPText hosts to perform their own DNS resolution of HDFS DataNode hostnames.
4. Re-upload the modified configuration to ZooKeeper. For example, if the hdfs_conf directory includes the authentication configuration files for a Hadoop cluster with hdfs_bill_auth:
$ cd ..
$ gptext-external upload -t hdfs -c hdfs_bill_auth -p hdfs_conf
5. Determine the hostname-to-IP address mapping for all DataNodes, and add the associated entries into the /etc/hosts file on all GPText client hosts.
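Step 5 can be sketched as a small shell function that appends only the missing entries to a hosts file. This is a minimal sketch, not a GPText utility: the function name, the mapping file, and the host names and addresses in the usage comment are hypothetical, and writing to /etc/hosts requires root privileges.

```shell
#!/usr/bin/env bash
# add_host_entries <mapping_file> <hosts_file>
# mapping_file holds one "IP hostname" pair per line.
# add_host_entries is a hypothetical helper name.
add_host_entries() {
    local mapping_file=$1 hosts_file=$2
    local ip host
    while read -r ip host || [ -n "$ip" ]; do
        # Skip blank lines and comment lines.
        [ -z "$ip" ] && continue
        case $ip in \#*) continue ;; esac
        # Append the pair only if the hostname is not already listed
        # as a name (second field or later) in the hosts file.
        if ! awk -v h="$host" \
            '{ for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
             END { exit !found }' "$hosts_file"; then
            printf '%s\t%s\n' "$ip" "$host" >> "$hosts_file"
        fi
    done < "$mapping_file"
}

# Example use, run as root on each GPText client host, where datanodes.txt
# contains lines such as "10.0.0.11 dn1.example.com":
#   add_host_entries datanodes.txt /etc/hosts
```

Running the function a second time with the same mapping file leaves the hosts file unchanged, so it is safe to repeat on every host.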
Kerberos-Related Errors
The following problems are specific to Hadoop clusters secured with Kerberos.
Clock Skew
A login attempt to a Hadoop cluster secured with Kerberos will fail if clock skew between GPText client hosts and the Kerberos KDC host is too great. In this situation, you may see the following error in the Solr log:
java.io.IOException caused by a KrbException noting “Clock skew too great”
To resolve this situation, ensure that the clocks on the Kerberos KDC host and GPText client hosts are synchronized.
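A quick way to check for this condition is to compare the clock on a GPText client host with the clock on the KDC host. The sketch below assumes a tolerance of 300 seconds, which is a common Kerberos default; the actual limit depends on your Kerberos configuration. The function name and the KDC host name in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# clock_skew_ok <local_epoch> <kdc_epoch> [max_skew_seconds]
# Succeeds when the absolute difference between the two epoch
# timestamps is within the limit (300 seconds assumed by default).
# clock_skew_ok is a hypothetical helper name.
clock_skew_ok() {
    local local_epoch=$1 kdc_epoch=$2 max_skew=${3:-300}
    local diff=$(( local_epoch - kdc_epoch ))
    # Take the absolute value of the difference.
    if [ "$diff" -lt 0 ]; then
        diff=$(( -diff ))
    fi
    [ "$diff" -le "$max_skew" ]
}

# Example use from a GPText client host (requires ssh access to the KDC):
#   clock_skew_ok "$(date +%s)" "$(ssh kdc.example.com date +%s)" \
#       || echo "Clock skew too great: synchronize the clocks"
```

If the check fails, synchronize the hosts with a time service such as NTP before retrying the login.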
Timeout Errors
A login attempt to a Hadoop cluster secured with Kerberos may fail with timeout errors when the kdc and admin_server settings in the krb5.conf file are specified with a hostname, and the GPText client hosts cannot resolve the hostname. In this situation, you may see one of the following errors in the Solr log:
org.apache.solr.common.SolrException: Failed to login HDFS message caused by a java.io.IOException specifying javax.security.auth.login.LoginException: Receive timed out
java.nio.channels.UnresolvedAddressException with SocketIOWithTimeout referenced in the stack trace
In this situation, you may choose either of the following:
Update the Kerberos krb5.conf file to specify the kdc and admin_server settings using IP addresses. Or
Update all GPText hosts to perform their own DNS resolution of the Kerberos KDC server.
If you choose to update the krb5.conf file:
1. Locate a local copy of the Hadoop Kerberos authentication configuration directory that you previously uploaded to ZooKeeper. For example, if the directory is located at /home/gpadmin/auths/hdfs_kerb_conf:
$ cd /home/gpadmin/auths/hdfs_kerb_conf
$ ls
core-site.xml hdfs-site.xml keytab krb5.conf user.txt
2. Open krb5.conf in the editor of your choice. For example:
$ vi krb5.conf
3. Replace the KERBEROS block attributes with their equivalent IP addresses and then save the file and exit. For example:
[realms]
KERBEROS = {
  kdc = <KDC IP address>
  admin_server = <admin server IP address>
}
4. Re-upload the modified configuration to ZooKeeper. For example, if the directory named hdfs_kerb_conf includes the authentication configuration files for a Hadoop cluster defined with hdfs_kerb_auth:
$ cd ..
$ gptext-external upload -t hdfs -c hdfs_kerb_auth -p hdfs_kerb_conf
Alternately, if you choose to configure the GPText hosts to perform their own DNS resolution of the Kerberos KDC server, add an entry for the KDC hostname-to-IP address mapping to the /etc/hosts file on all GPText client hosts.
Working With GPText Indexes
Indexing prepares documents for text analysis and fast query processing. This topic shows you how to create GPText indexes and add documents from Greenplum Database tables to them, and how to maintain and customize indexes for your own applications.
For help indexing and searching documents stored outside of Greenplum Database, see Working With GPText External Indexes.
Setting Up the Sample Database
The examples in this documentation work with a demo database containing three database tables, called wikipedia.articles, twitter.message, and store.products. If you want to run the examples yourself, follow the instructions in this section to set up the demo database.
1. Log in to the Greenplum Database master as the gpadmin user and create the demo database.
$ createdb demo
2. Open an interactive shell for executing queries in the demo database.
$ psql demo
3. Create the articles table in the wikipedia schema with the following statements.
CREATE SCHEMA wikipedia;
CREATE TABLE wikipedia.articles (
    id int8 primary key,
    date_time timestamptz,
    title text,
    content text,
    refs text
) DISTRIBUTED BY (id);
4. Create the message table in the twitter schema with the following statements.
CREATE SCHEMA twitter;
CREATE TABLE twitter.message (
    id bigint,
    message_id bigint,
    spam boolean,
    created_at timestamp without time zone,
    source text,
    retweeted boolean,
    favorited boolean,
    truncated boolean,
    in_reply_to_screen_name text,
    in_reply_to_user_id bigint,
    author_id bigint,
    author_name text,
    author_screen_name text,
    author_lang text,
    author_url text,
    author_description text,
    author_listed_count integer,
    author_statuses_count integer,
    author_followers_count integer,
    author_friends_count integer,
    author_created_at timestamp without time zone,
    author_location text,
    author_verified boolean,
    message_url text,
    message_text text
) DISTRIBUTED BY (id)
PARTITION BY RANGE (created_at)
( START (DATE '2011-08-01') INCLUSIVE
  END (DATE '2011-12-01') EXCLUSIVE
  EVERY (INTERVAL '1 month') );
CREATE INDEX id_idx ON twitter.message USING btree (id);
5. Create the store.products table with these statements.
CREATE SCHEMA store;
CREATE TABLE store.products (
    id bigint,
    title text,
    category varchar(32),
    brand varchar(32),
    price float
) DISTRIBUTED BY (id);
6. Download test data for the three tables from http://docs-gptext-staging.cfapps.io/demo/gptext-demo-data.tgz. Right-click the link, save the file, and then copy it to the gpadmin user’s home directory.
7. Extract the data files with this tar command.
$ tar xvfz gptext-demo-data.tgz
8. Load the wikipedia data into the wikipedia.articles table using the psql \COPY metacommand.
\COPY wikipedia.articles FROM '/home/gpadmin/demo/articles.csv' HEADER CSV;
The articles table now contains text from 23 Wikipedia articles.
9. Load the twitter data into the twitter.message table using the following psql \COPY metacommand.
\COPY twitter.message FROM '/home/gpadmin/demo/twitter.csv' CSV;
The message table now contains 1730 tweets from August to October, 2011.
10. Load the products data into the store.products table with the following psql \COPY metacommand.
\COPY store.products FROM '/home/gpadmin/demo/products.csv' HEADER CSV;
The products table now contains 50 rows. This table is used to demonstrate faceted search queries. See Creating Faceted Search Queries.
Setting up the GPText Command-line Environment
To work with GPText indexes, you must first set up your environment and add the GPText schema to the database containing the documents (Greenplum Database data) you want to index.
To set the environment, log in as the gpadmin user and source the Greenplum Database and GPText environment scripts. The Greenplum Database environment must be set before you source the GPText environment script. For example, if both Greenplum Database and GPText are installed in the /usr/local/ directory, enter these commands:
$ source /usr/local/greenplum-db-<version>/greenplum_path.sh
$ source /usr/local/greenplum-text-<version>/greenplum-text_path.sh
With the environment now set, you can access the GPText command-line utilities.
Adding the GPText Schema to a Database
Use the gptext-installsql utility to add the GPText schema to databases containing data you want to index with GPText. You perform this task one time for each database. In this example, the gptext schema is installed into the demo database.
$ gptext-installsql demo
The gptext schema provides user-defined types, tables, views, and functions for GPText. This schema is reserved for GPText. If you create any new objects in the gptext schema, they will be lost when you reinstall the schema or upgrade GPText.
Creating GPText Indexes and Indexing Data
http://docs-gptext-staging.cfapps.io/demo/gptext-demo-data.tgz
The general steps for creating a GPText index and indexing documents are:
1. Create an empty Solr index
2. Customize the index (optional)
3. Populate the index
4. Commit the index
After you complete these steps, you can create and execute a search query or implement machine learning algorithms. Searching GPText indexes is described in the Querying GPText Indexes topic.
The following steps are completed by executing SQL commands and GPText functions in the database. Refer to the GPText Function Reference for details about the GPText functions described in the following examples.
Create an empty GPText index
A GPText index is an Apache Solr collection containing documents added from a Greenplum Database table. There can be one GPText index per Greenplum Database table. Each row in the database table is a document that can be added to the GPText index.
If the database table is partitioned, there is one GPText index for all partitions. You must specify the root table name when creating the index and adding documents to it. GPText provides search semantics that enable searching partitions efficiently.
A GPText external index is a Solr index for documents that are located outside of Greenplum Database. GPText provides user-defined functions to create external indexes and insert documents into them. See Working With GPText External Indexes.
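The gptext.create_index() function creates a new GPText index. This function has two signatures:
gptext.create_index(<schema>, <table>, <id column>, <default search column> [, <check id uniqueness>])
or
gptext.create_index(<schema>, <table>, <columns>, <types>, <id column>, <default search column> [, <check id uniqueness>])
The first two arguments, the schema name and the table name, specify the database table that contains the source documents.
The id column argument is the name of the table column that contains a unique identifier for each row. The id column can be of type int4, int8, varchar, text, or uuid.
The default search column argument is the name of the table column that contains the content you want to search by default. For example, if you want to index and search just the content column, you can use the first signature and specify the content column name in the default search column argument.
The final, optional argument is a Boolean that controls whether id uniqueness is checked. When true, the default, attempting to add a document with an id that already exists in the index generates an error. If you set the argument to false, you can add documents with the same id, but when you search the index all documents with the same id are returned.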
The following command creates an index for the twitter.message table, with the id column as the unique ID field and the message_text column for the default search column:
=# SELECT * FROM gptext.create_index('twitter', 'message', 'id', 'message_text');
To verify that the demo.twitter.message index was created, call gptext.index_status():
=# SELECT * FROM gptext.index_status('demo.twitter.message');
 content_id |      index_name      | shard_name | shard_state | replica_name | replica_state |                  core                   |    node_name    |        base_url        | is_leader | partitioned | external_index
------------+----------------------+------------+-------------+--------------+---------------+-----------------------------------------+-----------------+------------------------+-----------+-------------+----------------
          0 | demo.twitter.message | shard0     | active      | core_node3   | active        | demo.twitter.message_shard0_replica_n1  | sdw2:18983_solr | http://sdw2:18983/solr | t         | t           | f
          0 | demo.twitter.message | shard0     | active      | core_node5   | active        | demo.twitter.message_shard0_replica_n2  | sdw1:18983_solr | http://sdw1:18983/solr | f         | t           | f
          1 | demo.twitter.message | shard1     | active      | core_node7   | active        | demo.twitter.message_shard1_replica_n4  | sdw2:18983_solr | http://sdw2:18983/solr | t         | t           | f
          1 | demo.twitter.message | shard1     | active      | core_node9   | active        | demo.twitter.message_shard1_replica_n6  | sdw1:18983_solr | http://sdw1:18983/solr | f         | t           | f
          2 | demo.twitter.message | shard2     | active      | core_node11  | active        | demo.twitter.message_shard2_replica_n8  | sdw2:18983_solr | http://sdw2:18983/solr | t         | t           | f
          2 | demo.twitter.message | shard2     | active      | core_node13  | active        | demo.twitter.message_shard2_replica_n10 | sdw1:18983_solr | http://sdw1:18983/solr | f         | t           | f
          3 | demo.twitter.message | shard3     | active      | core_node15  | active        | demo.twitter.message_shard3_replica_n12 | sdw2:18983_solr | http://sdw2:18983/solr | t         | t           | f
          3 | demo.twitter.message | shard3     | active      | core_node16  | active        | demo.twitter.message_shard3_replica_n14 | sdw1:18983_solr | http://sdw1:18983/solr | f         | t           | f
(8 rows)
This example was executed on a Greenplum Database cluster with four primary segments. Four shards were created, one for each segment, and each shard has two replicas.
You can also run the gptext-state -D command-line utility to verify the index was created. See the gptext-state reference for details.
The GPText index for the demo.twitter.message table is configured, by default, to index all columns in the twitter.message database table. You can write search queries that contain criteria using any column in the table.
If you want to index and search a subset of the table columns, you can use the second gptext.create_index() signature, specifying the columns to index in the columns array argument and the data types of those columns in the types array argument. Both arguments are text arrays. The id column name and default search column name must be included in the arrays.
Use the second gptext.create_index() signature to create an index for the wikipedia.articles table. This index will allow you to search on the title, content, and refs columns. Note that the id column and default search column are still specified in separate arguments following the columns and types arrays.
=# SELECT * FROM gptext.create_index('wikipedia', 'articles', '{id,title,content,refs}', '{long,text_intl,text_intl,text_intl}', 'id', 'content', true);
INFO:  Created index demo.wikipedia.articles
 create_index
--------------
 t
(1 row)
Because the date_time column was omitted from the columns and types arrays, it will not be possible to search the wikipedia.articles index on date with the GPText search functions.
Customize the index (optional)
Creating a GPText index generates a set of configuration files for the index. Before you add documents to the index, you can customize the configuration files to change the way data is indexed and stored. You can customize an index later, after you have added documents to it, but you must then reindex the data to take advantage of your customizations.
One common customization is to remap data types for some database columns. In the managed-schema configuration file for an index, GPText maps the data types for each field from the Greenplum Database type to an equivalent Solr data type. GPText applies default mappings (see GPText and Solr Data Type Mappings), but your index may be more effective if you use a different mapping for some fields.
The demo.twitter.message table, for example, has a message_text text column that contains tweets. By default, GPText maps text columns to the Solr text_intl (international text) type. The GPText text_sm (social media text) type is a better mapping for a text column that contains social media idioms such as emoticons.
Follow these steps to remap the message_text field to the text_sm type.
1. Use the gptext-config utility to edit the managed-schema file for the demo.twitter.message index.
$ gptext-config edit -i demo.twitter.message -f managed-schema
The managed-schema file loads in a text editor (normally vi).
2. Find the <field> element for the message_text field.
3. Change the type attribute from text_intl to text_sm.
4. Save the file and exit the editor.
There are many other ways to customize a GPText index. For example, you can omit fields from the index by changing the indexed attribute of the <field> element to false, store the contents of the field in the index by changing the stored attribute to true, or use gptext-config to edit the stopwords.txt file to specify additional words to ignore when indexing.
See Customizing GPText Indexes to learn how data type mapping determines how Solr analyzes and indexes field contents and for more ways to customize GPText indexes.
Populate the index
To populate the index, use the table function gptext.index(), which has the following syntax:
SELECT * FROM gptext.index(TABLE(SELECT * FROM <table_name>), <index_name>);
To index all rows in the twitter.message table, execute this command:
=# SELECT * FROM gptext.index(TABLE(SELECT * FROM twitter.message), 'demo.twitter.message');
 dbid | num_docs
------+----------
    2 |      892
    3 |      838
(2 rows)
This command indexes the rows in the wikipedia.articles table.
=# SELECT * FROM gptext.index(TABLE(SELECT * FROM wikipedia.articles), 'demo.wikipedia.articles');
 dbid | num_docs
------+----------
    3 |       11
    2 |       12
(2 rows)
The results of this command show that 23 documents from two segments were added to the index.
The first argument of the gptext.index() function is a table expression. TABLE(SELECT * FROM wikipedia.articles) creates a table expression from the articles table, using the table function TABLE.
You can choose the data to index or update by changing the inner select list in the query to select the rows you want to index. When adding new documents to an existing index, for example, specify a WHERE clause in the gptext.index() call to choose only the new rows to index.
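As an illustration of indexing only new rows, the following sketch wraps the gptext.index() call in a shell function that sends the SQL to psql. It assumes the index name follows the database.schema.table form shown earlier and that new rows can be identified by a timestamp column. The function name and the cutoff date are hypothetical, and the arguments are interpolated into the SQL without escaping, so pass only trusted values.

```shell
#!/usr/bin/env bash
# index_rows_since <database> <schema.table> <timestamp_column> <since>
# Indexes only the rows at or after the given timestamp, then commits
# the index. index_rows_since is a hypothetical helper name.
index_rows_since() {
    local db=$1 table=$2 ts_col=$3 since=$4
    local index="${db}.${table}"
    psql "$db" <<SQL
SELECT * FROM gptext.index(
    TABLE(SELECT * FROM ${table} WHERE ${ts_col} >= '${since}'),
    '${index}');
SELECT * FROM gptext.commit_index('${index}');
SQL
}

# Example: index tweets created on or after October 1, 2011.
#   index_rows_since demo twitter.message created_at '2011-10-01'
```

For a partitioned table such as twitter.message, the sketch passes the root table name, as required when adding documents to the index.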
The inner SELECT statement could also be a query on a different table with the same structure, or a result set constructed with an arbitrarily complex join, provided the columns specified in the gptext.create_index() function are present in the results. If you index data from a source other than the table used to create the index, be sure the distribution key for the result set matches the distribution key of the base table. The Greenplum Database SELECT statement has a SCATTER BY clause that you can use to specify the distribution key for the results from a query. See Specifying a distribution key with SCATTER BY for more about the distribution policy and GPText indexes.
Commit the index
After you create and populate an index, you commit the index using gptext.commit_index().
This example commits the documents added to the indexes in the previous example.
=# SELECT * FROM gptext.commit_index('demo.twitter.message');
 commit_index
--------------
 t
(1 row)
=# SELECT * FROM gptext.commit_index('demo.wikipedia.articles');
 commit_index
--------------
 t
(1 row)
The gptext.commit_index() function commits any new data added to or deleted from the index since the last commit.
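The create, populate, and commit steps described above can be combined in one script. The following is a minimal sketch rather than a supplied utility: the function name is hypothetical, it uses the first gptext.create_index() signature, and it interpolates its arguments into the SQL without escaping, so pass only trusted values.

```shell
#!/usr/bin/env bash
# build_index <database> <schema> <table> <id_column> <search_column>
# Creates, populates, and commits a GPText index for one table.
# build_index is a hypothetical helper name.
build_index() {
    local db=$1 schema=$2 table=$3 id_col=$4 search_col=$5
    local index="${db}.${schema}.${table}"
    # ON_ERROR_STOP makes psql stop at the first statement that raises an error.
    psql -v ON_ERROR_STOP=1 "$db" <<SQL
-- 1. Create the empty index.
SELECT * FROM gptext.create_index('${schema}', '${table}', '${id_col}', '${search_col}');
-- 2. Optional customization with gptext-config would happen here.
-- 3. Populate the index from the table.
SELECT * FROM gptext.index(TABLE(SELECT * FROM ${schema}.${table}), '${index}');
-- 4. Commit the index.
SELECT * FROM gptext.commit_index('${index}');
SQL
}

# Example: rebuild the demo index for the twitter.message table.
#   build_index demo twitter message id message_text
```

Keeping the steps in one psql session means a failure in an earlier step stops the script before later steps run against a missing or empty index.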
Managing GPText Indexes
GPText provides command-line utilities and functions you can use to perform these GPText management tasks:
Configuring an index
Optimizing an index
Specifying a distribution policy with SCATTER BY
Deleting from an index
Dropping an index
Adding a field to an index
Dropping a field from an index
Listing all indexes
Configuring an index