Schemaless Solr and the Solr Schema REST API

Steve Rowe
Senior Software Engineer, LucidWorks
Twitter: @steven_a_rowe

Who am I?

•  LucidWorks employee
•  Lucene/Solr committer since 2010
•  JFlex committer since 2008
•  Previously at the Center for Natural Language Processing at Syracuse University’s iSchool (School of Information)
•  Twitter: @steven_a_rowe

Schemaless Solr

•  As of version 4.4, Solr can operate in schemaless mode:
   –  No need to pre-configure fields in the schema
   –  As documents are indexed, previously unknown fields are automatically added to the schema
   –  Field types are auto-detected from a limited set of basic types:
      •  Long, Double, Boolean, Date, Text (default)
      •  All are multi-valued
   –  Works in standalone Solr and SolrCloud

•  Solr features used to implement schemaless mode:
   –  Managed schema
      •  Required for runtime schema modification
   –  Field value class guessing
      •  Parsers attempt to detect the Java class of String-valued field content
   –  Automatic schema field addition
      •  Java class(es) mapped to schema field type

The slide about the nature and utility of schemalessness

•  “Schemaless” does not mean that there is no schema
•  Search applications need schemas to support non-trivial document models
   –  No schema needed when there is only one field, or only one field type, i.e. all fields share:
      •  Document & query processing, including analysis
      •  Index features & format
      •  Similarity implementation
      •  (etc.)
   –  Otherwise, search apps need to manage per-field processing configuration (i.e. a schema) to consistently index documents and effectively serve queries
•  So what does “schemaless” mean for Solr?
   –  No up-front schema configuration required
   –  Schema discovery: document structure is either not fixed or not fully known

Dynamic fields

•  Convention over configuration
•  Glob-like patterns match field names with field types

    <dynamicField name="*_i" type="int" indexed="true" stored="true"/>
    <fieldType name="int" class="solr.TrieIntField"
               precisionStep="0" positionIncrementGap="0"/>

•  Dynamic fields solve the problem of assigning field types to unknown fields by inferring a field’s type from its name
•  By contrast, Solr’s schemaless mode infers an unknown field’s type from its value or values
•  These two approaches are complementary
•  The Solr schemaless example defines a number of dynamic fields, including the *_i → int mapping above
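The name-based inference that dynamic fields perform can be illustrated with a small Python sketch. This is not Solr's matching code: it uses Python's `fnmatchcase` as a stand-in for Solr's glob-like patterns, and the pattern table is hypothetical apart from the `*_i` to `int` entry taken from the slide.

```python
from fnmatch import fnmatchcase

# Hypothetical pattern table for illustration. Only "*_i" -> "int" comes from
# the slide; the other two entries are assumed examples.
DYNAMIC_FIELDS = [
    ("*_i", "int"),
    ("*_s", "string"),
    ("*_t", "text_general"),
]


def dynamic_field_type(field_name, patterns=DYNAMIC_FIELDS):
    """Return the type of the first pattern that matches field_name, else None."""
    for pattern, field_type in patterns:
        if fnmatchcase(field_name, pattern):
            return field_type
    return None


print(dynamic_field_type("sequence_i"))  # int
print(dynamic_field_type("price"))       # None
```

A name such as `price` matches no pattern, which is the gap that value-based guessing in schemaless mode is meant to fill.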

Schemaless mode example

From example/example-schemaless/solr/collection1/conf/schema.xml:

    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="_version_" type="long" indexed="true" stored="true"/>

From example/exampledocs/books.csv:

    id,cat,name,price,inStock,author,series_t,sequence_i,genre_s
    0441385532,book,Jhereg,7.95,false,Steven Brust,Vlad Taltos,1,fantasy
    ...

    $ cd example && java -Dsolr.solr.home=example-schemaless/solr -jar start.jar

    $ cd exampledocs && java -Dtype=text/csv -jar post.jar books.csv

    SimplePostTool version 1.5
    Posting files to base url http://localhost:8983/solr/update using content-type text/csv..
    POSTing file books.csv
    1 files indexed.
    COMMITting Solr index changes to http://localhost:8983/solr/update..
    Time spent: 0:00:00.147

Schemaless mode example

    $ curl http://localhost:8983/solr/schema/fields

    { "fields":[{ "name":"_version_", "type":"long", "indexed":true, "stored":true },
                { "name":"author", "type":"text_general" },
                { "name":"cat", "type":"text_general" },
                { "name":"id", "type":"string", "multiValued":false, "indexed":true,
                  "required":true, "stored":true,
                  "uniqueKey":true },
                { "name":"inStock", "type":"booleans" },
                { "name":"name", "type":"text_general" },
                { "name":"price", "type":"tdoubles" }]}

From example/example-schemaless/solr/collection1/conf/schema.xml:

    <fieldType name="booleans" class="solr.BoolField" sortMissingLast="true" multiValued="true"/>
    <fieldType name="tdoubles" class="solr.TrieDoubleField" precisionStep="8"
               positionIncrementGap="0" multiValued="true"/>

    id          cat   name    price  inStock  author        series_t     sequence_i  genre_s
    0441385532  book  Jhereg  7.95   false    Steven Brust  Vlad Taltos  1           fantasy

Managed schema

•  The schema resource is managed by Solr, rather than hand edited
•  On first startup, Solr auto-converts schema.xml to managed-schema
•  Managed schema format is currently XML, but may change in the future
•  XML comments don’t survive the conversion.
•  mutable=true enables runtime schema modification
   –  Automatic schema field addition
   –  Schema REST API

From example/example-schemaless/solr/collection1/conf/solrconfig.xml:

    <schemaFactory class="ManagedIndexSchemaFactory">
      <bool name="mutable">true</bool>
      <str name="managedSchemaResourceName">managed-schema</str>
    </schemaFactory>

conf/ before startup

    currency.xml
    elevate.xml
    lang/
    protwords.txt
    schema.xml
    solrconfig.xml
    stopwords.txt
    synonyms.txt

conf/ after startup

    currency.xml
    elevate.xml
    lang/
    managed-schema
    protwords.txt
    schema.xml.bak
    solrconfig.xml
    stopwords.txt
    synonyms.txt

Field value class guessing

•  Unknown fields’ String-typed values are speculatively parsed
   –  Cascading parsers attempt to recognize field values
   –  On failure, the next one is tried
   –  First successful parse wins
•  Reconfigurable
   –  Integer parser could be swapped in for the Long parser, etc.
   –  Numeric parsers can take a locale for java.text.NumberFormat
   –  Date parser, implemented using Joda-Time, can be configured with other patterns, a locale, and/or a default time zone

    <updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
      <processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
      <processor class="solr.ParseBooleanFieldUpdateProcessorFactory"/>
      <processor class="solr.ParseLongFieldUpdateProcessorFactory"/>
      <processor class="solr.ParseDoubleFieldUpdateProcessorFactory"/>
      <processor class="solr.ParseDateFieldUpdateProcessorFactory">
        <arr name="format">
          <str>yyyy-MM-dd'T'HH:mm:ss.SSSZ</str>
          <str>yyyy-MM-dd'T'HH:mm:ss,SSSZ</str>
          <str>yyyy-MM-dd'T'HH:mm:ss.SSS</str>
          <str>yyyy-MM-dd'T'HH:mm:ss,SSS</str>
          <str>yyyy-MM-dd'T'HH:mm:ssZ</str>
          <str>yyyy-MM-dd'T'HH:mm:ss</str>
          <str>yyyy-MM-dd'T'HH:mmZ</str>
          <str>yyyy-MM-dd'T'HH:mm</str>
          <str>yyyy-MM-dd HH:mm:ss.SSSZ</str>
          <str>yyyy-MM-dd HH:mm:ss,SSSZ</str>
          <str>yyyy-MM-dd HH:mm:ss.SSS</str>
          <str>yyyy-MM-dd HH:mm:ss,SSS</str>
          <str>yyyy-MM-dd HH:mm:ssZ</str>
          <str>yyyy-MM-dd HH:mm:ss</str>
          <str>yyyy-MM-dd HH:mmZ</str>
          <str>yyyy-MM-dd HH:mm</str>
          <str>yyyy-MM-dd</str>
        </arr>
      </processor>
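The parser cascade (try Boolean, then Long, then Double, then Date, and keep the string if nothing fits) can be sketched in Python. This is an illustration of the idea only, not Solr's processors: the function names are hypothetical, Python's own `int`/`float` parsing rules stand in for `java.text.NumberFormat`, and only two of the date patterns from the configuration are translated to `strptime` syntax.

```python
from datetime import datetime


def parse_boolean(text):
    lowered = text.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise ValueError(text)


def parse_long(text):
    return int(text.strip())


def parse_double(text):
    return float(text.strip())


def parse_date(text):
    # Assumed subset of the configured patterns, in strptime notation.
    for fmt in ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d"):
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            pass
    raise ValueError(text)


# Order matters: the first parser that succeeds wins.
PARSERS = [parse_boolean, parse_long, parse_double, parse_date]


def guess_value(text):
    """Return text converted by the first parser that accepts it, else text unchanged."""
    for parser in PARSERS:
        try:
            return parser(text)
        except ValueError:
            continue
    return text


for raw in ["false", "1", "7.95", "2014-03-08", "Jhereg"]:
    print(raw, "->", type(guess_value(raw)).__name__)
```

Swapping the order of `PARSERS`, or replacing one entry, corresponds loosely to reconfiguring the processor chain.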

Automatic schema field addition

•  Field value classes are mapped to field types
•  First match wins
•  If none of the typeMapping-s match, the default field type is assigned
•  If a multi-valued field contains a mix of value classes, the first mapping that matches all values’ classes wins
•  The new field is added to the schema with the mapped field type
•  Reconfigurable

      <processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
        <str name="defaultFieldType">text_general</str>
        <lst name="typeMapping">
          <str name="valueClass">java.lang.Boolean</str>
          <str name="fieldType">booleans</str>
        </lst>
        <lst name="typeMapping">
          <str name="valueClass">java.util.Date</str>
          <str name="fieldType">tdates</str>
        </lst>
        <lst name="typeMapping">
          <str name="valueClass">java.lang.Long</str>
          <str name="valueClass">java.lang.Integer</str>
          <str name="fieldType">tlongs</str>
        </lst>
        <lst name="typeMapping">
          <str name="valueClass">java.lang.Number</str>
          <str name="fieldType">tdoubles</str>
        </lst>
      </processor>
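The "first mapping that matches all values' classes wins" rule can be sketched as follows. This is a hedged Python analogue of the typeMapping list, not the processor's implementation: Python's `bool`, `datetime`, `int`, and `numbers.Number` stand in for the Java classes, and the helper names are hypothetical.

```python
from datetime import datetime
from numbers import Number

# Ordered like the typeMapping entries in the configuration above.
TYPE_MAPPINGS = [
    ((bool,), "booleans"),
    ((datetime,), "tdates"),
    ((int,), "tlongs"),
    ((Number,), "tdoubles"),
]
DEFAULT_FIELD_TYPE = "text_general"


def matches(value, classes):
    # In Python bool is a subclass of int; keep booleans separate from
    # numbers, as java.lang.Boolean is separate from java.lang.Number.
    if isinstance(value, bool):
        return bool in classes
    return isinstance(value, classes)


def field_type_for(values):
    """Return the field type of the first mapping that matches every value."""
    if not values:
        return DEFAULT_FIELD_TYPE
    for classes, field_type in TYPE_MAPPINGS:
        if all(matches(v, classes) for v in values):
            return field_type
    return DEFAULT_FIELD_TYPE


print(field_type_for([1, 2]))      # tlongs
print(field_type_for([1, 2.5]))    # tdoubles
print(field_type_for([1, "one"]))  # text_general
```

A mixed long/double field falls through the long mapping and lands on the broader Number mapping, while a mix of numbers and text matches nothing and gets the default type.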

Schemaless mode limitations

•  Automatically adding new schema fields in production may not be a good idea
   –  Unwanted fields, e.g. field name typos, won’t trigger an error
•  First instance wins: field type detection can’t know about the full range of a field’s values
•  Wasted space: e.g. Longs are always used, when Integers might suffice
•  Limited gamut of detectable field types
•  Single analysis specification for text fields
•  Single processing model for all fields

Schema REST API

Schema REST API: read-only

•  Each element of the schema is individually readable via the Schema REST API
•  Output format can be JSON or XML (wt request param)
•  Read-only elements:
   –  The entire schema
      •  In addition to JSON and XML output formats, output can also be in schema.xml format (?wt=schema.xml)
   –  All fields, or a specified set of them
   –  All dynamic fields, or a specified set of them
   –  All field types, or a specific one
   –  All copy field directives
   –  The schema name, version, uniqueKey, and default query operator
   –  The global similarity
•  Managed schema is not required to use the read-only schema REST API.

Schema REST API: read-only examples

    $ SOLR=http://localhost:8983/solr/collection1

    $ curl "$SOLR/schema/dynamicfields/*_i"

    {
      "responseHeader":{
        "status":0,
        "QTime":1},
      "dynamicField":{
        "name":"*_i",
        "type":"int",
        "indexed":true,
        "stored":true}}

    $ curl "$SOLR/schema/uniquekey?wt=xml"

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">1</int>
    </lst>
    <str name="uniqueKey">id</str>
    </response>

•  The URLs are quoted so that the shell does not treat * and ? as glob characters.
•  Schema REST API URLs employ the downcased form of all schema elements, but the responses use the same casing as schema.xml.
•  For full details on the Solr Schema REST API, see the Schema API section of the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Schema+API
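The URL convention described above (downcased element names in the path, optional `wt` parameter) can be captured in a small helper. This is a sketch under assumptions: `schema_url` is a hypothetical function, not part of any Solr client library, and the base URL is the local example instance used on the slide.

```python
from urllib.parse import quote, urlencode


def schema_url(base, element, name=None, wt=None):
    """Build a read-only Schema REST API URL.

    element is a schema element such as "dynamicFields" or "uniqueKey";
    it is downcased for the URL path. name (e.g. a field name or a
    dynamic field pattern) keeps its original case.
    """
    parts = [base.rstrip("/"), "schema", element.lower()]
    if name is not None:
        # Leave "*" unescaped so the path mirrors the curl example.
        parts.append(quote(name, safe="*"))
    url = "/".join(parts)
    if wt is not None:
        url += "?" + urlencode({"wt": wt})
    return url


SOLR = "http://localhost:8983/solr/collection1"  # assumed local example instance
print(schema_url(SOLR, "dynamicFields", "*_i"))
print(schema_url(SOLR, "uniqueKey", wt="xml"))

# With a Solr instance running, the URL could then be fetched, for example:
#   from urllib.request import urlopen
#   with urlopen(schema_url(SOLR, "uniqueKey")) as resp:
#       print(resp.read().decode("utf-8"))
```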

Schema REST API: runtime schema modification

•  To enable schema modification via the schema REST API, the schema must be managed, and must be configured as mutable.
•  Schema modifications possible as of Solr 4.4:
   –  Fields may be added
      •  Copy field directives may optionally be added at the same time
   –  Copy field directives may be added
•  Works under both standalone Solr and SolrCloud
   –  Under SolrCloud, conflicting simultaneous requests are detected using a form of optimistic concurrency and automatically retried
•  Core/collection reload not required for schema modifications that are compatible with previously indexed documents
   –  Generally additions are not sources of schema incompatibility
•  Schema incompatibility-inducing operations will require core/collection reload:
   –  Modifying or removing (dynamic) fields or copy field directives
   –  Modifying all other schema elements
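For readers unfamiliar with optimistic concurrency, the general pattern (read a version, prepare the change, write only if the version is unchanged, retry on conflict) can be sketched generically. This toy in-memory example is not how SolrCloud implements it; the class, function, and parameter names are all hypothetical.

```python
class VersionConflict(Exception):
    """Raised when the stored version changed since it was read."""


class VersionedSchema:
    """Toy in-memory stand-in for a shared, versioned schema store."""

    def __init__(self):
        self.version = 0
        self.fields = {}

    def read(self):
        return self.version, dict(self.fields)

    def write_if_unchanged(self, expected_version, fields):
        if expected_version != self.version:
            raise VersionConflict(expected_version)
        self.fields = dict(fields)
        self.version += 1


def add_field_with_retry(store, name, field_type, max_attempts=5, before_write=None):
    """Add a field, retrying when another writer got in first.

    before_write is a hook used only to simulate a concurrent writer.
    Returns the number of attempts that were needed.
    """
    for attempt in range(max_attempts):
        version, fields = store.read()
        if name in fields:
            raise ValueError("field already exists: " + name)
        fields[name] = field_type
        if before_write is not None:
            before_write(attempt)
        try:
            store.write_if_unchanged(version, fields)
            return attempt + 1
        except VersionConflict:
            continue  # someone else changed the schema; re-read and retry
    raise RuntimeError("gave up after %d attempts" % max_attempts)
```

The retry re-reads the schema, so a change made by the competing writer is preserved rather than overwritten.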

Schema REST API: add field example

    $ SOLR=http://localhost:8983/solr/collection1

    $ curl $SOLR/schema/fields/claimid -X PUT -H 'Content-type: application/json' --data-binary '
    {
      "type":"string",
      "stored":true,
      "copyFields": [
        "claims",
        "all"
      ]
    }'

•  The copyField destinations “claims” and “all” must already exist in the schema.
•  For full details on the Solr Schema REST API, see the Schema API section of the Solr Reference Guide: https://cwiki.apache.org/confluence/display/solr/Schema+API
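The same add-field request can be prepared from Python's standard library. This is a minimal sketch that mirrors the curl command above; `add_field_request` is a hypothetical helper, and the request is only constructed here, since sending it requires a running Solr with a mutable managed schema.

```python
import json
from urllib.request import Request


def add_field_request(base, field_name, field_type, stored=True, copy_fields=()):
    """Build (but do not send) a PUT request that adds one field."""
    body = {"type": field_type, "stored": stored}
    if copy_fields:
        body["copyFields"] = list(copy_fields)
    return Request(
        url=base.rstrip("/") + "/schema/fields/" + field_name,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="PUT",
    )


req = add_field_request(
    "http://localhost:8983/solr/collection1",  # assumed local example instance
    "claimid",
    "string",
    copy_fields=["claims", "all"],
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To send it against a running instance:
#   from urllib.request import urlopen
#   with urlopen(req) as resp:
#       print(resp.read().decode("utf-8"))
```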

Schema REST API TODOs

•  https://issues.apache.org/jira/browse/SOLR-4898 is the umbrella JIRA issue under which further schema REST API work will be done, including:
   –  adding dynamic fields
   –  adding field types
   –  enabling wholesale replacement by PUTing a new schema.
   –  modifying and removing fields, dynamic fields, field types, and copy field directives
   –  modifying all remaining aspects of the schema: Name, Version, Unique Key, Global Similarity, and Default Query Operator

Proposal: Schema Annotations

•  Add arbitrary metadata at the top level of the schema and at each leaf node
•  Allow read/write access to that metadata via the REST API.
•  Use cases:
   –  Round-trippable documentation
      •  Conversion to managed schema format drops all comments
   –  Documentable tags
   –  When modifying the schema via REST API, a "last-modified" annotation could be automatically added.
   –  User-level arbitrary key/value metadata
•  W3C XML Schema has a similar facility: http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/structures.html#element-annotation

Schema Annotation example

    <schema name="example" version="1.5">
      <annotation>
        <description element="tag"
                     content="plain-numeric-field-types">
          Plain numeric field types store and index the
          text value verbatim.
        </description>
        <documentation element="copyField">
          copyField commands copy one field to another at
          the time a document is added to the index. It's
          used either to index the same field differently,
          or to add multiple fields to the same field for
          easier/faster searching.
        </documentation>
        <last-modified>2014-03-08T12:14:02Z</last-modified>
        …
      </annotation>
      …

      <fieldType name="pint" class="solr.IntField">
        <annotation>
          <tag>plain-numeric-field-types</tag>
        </annotation>
      </fieldType>
      <fieldType name="plong" class="solr.LongField">
        <annotation>
          <tag>plain-numeric-field-types</tag>
        </annotation>
      </fieldType>
      …
      <copyField source="cat" dest="text">
        <annotation>
          <todo>Copy to the catchall field?</todo>
        </annotation>
      </copyField>
      …
      <field name="text" type="text_general">
        <annotation>
          <description>catchall field</description>
          <visibility>public</visibility>
        </annotation>
      </field>

Summary

•  Schemaless Solr mode enables quick prototyping with minimal setup
•  Schema REST API provides programmatic read/write access to Solr’s schema
•  More elements writeable soon
•  Schema annotations would enable round-trippable documentation, tagging, and arbitrary user-provided metadata
