Elasticsearch is a search engine designed to handle natural language and to calculate the relevancy of matched results, so it can provide more than just boolean yes-or-no matching. It is particularly well suited to parsing large amounts of text and finding relevant documents for a given query.

Parts of WordPress.com use Elasticsearch to augment and speed up searches under the hood. In addition, some JSON API endpoints allow limited querying of our Elasticsearch index of posts. This doc details how the internal Elasticsearch index is set up and how to query for results through the WordPress.com API.

Document Schema

Posts and pages are indexed into Elasticsearch as documents of the “post” type. Individual fields within an Elasticsearch document can be referenced using just the field name (e.g. “tag.name”) or with the document type prepended (e.g. “post.tag.name”).
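
As a small illustration of the two equivalent spellings, the sketch below maps a type-prefixed field reference back to its short form. The helper name `short_field_name` is hypothetical and not part of any WordPress.com or Elasticsearch API. It only demonstrates the naming convention described above.

```python
def short_field_name(field: str, doc_type: str = "post") -> str:
    """Return the field name without a leading document-type prefix.

    Hypothetical helper: "post.tag.name" and "tag.name" are treated
    as two spellings of the same field.
    """
    prefix = doc_type + "."
    if field.startswith(prefix):
        return field[len(prefix):]
    return field


print(short_field_name("post.tag.name"))
print(short_field_name("tag.name"))
```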

Elasticsearch only uses UTF-8 character encoding, so all documents are converted to UTF-8 before being indexed. In addition, all URL fields have their protocol part (e.g. http/https) removed to aid in prefix matching.
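
The two preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the actual WordPress.com indexing code: the function names `to_utf8` and `strip_protocol`, the regular expression, and the example URLs are all assumptions made for the example.

```python
import re

# Assumed pattern: a URI scheme followed by "://" at the start of the string.
_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding into UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def strip_protocol(url: str) -> str:
    """Drop a leading protocol such as http:// or https:// from a URL."""
    return _SCHEME.sub("", url, count=1)


# With the protocol removed, http and https variants share one prefix.
a = strip_protocol("http://example.com/2014/01/hello")
b = strip_protocol("https://example.com/2014/02/world")
print(a.startswith("example.com/2014/"), b.startswith("example.com/2014/"))
```

Stripping the protocol means a single prefix such as “example.com/2014/” matches a URL regardless of whether it was stored with http or https.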

Please see the Elasticsearch Core Data Type documentation for details on each native data type. Textual data is handled in one of three ways, depending on the field:

  • analyzed: the text has been broken up into individual terms based on the language analyzer being applied to this document (generally based on whitespace, but Chinese and Japanese for example are broken up into words using word segmentation algorithms)
  • not analyzed: the entire string is treated as a single term
  • lowercased: the entire string is treated as a single term, but it is lowercased, and character folding/normalization is performed so “My Resumé” will be one term “my resume”
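
The three treatments listed above can be approximated in a few lines of Python. This is only a sketch under simplifying assumptions: the real index uses Elasticsearch language analyzers, whereas here “analyzed” is reduced to a plain whitespace split, and folding is approximated by Unicode NFKD decomposition followed by removal of combining marks. All function names are illustrative.

```python
import unicodedata


def analyzed(text: str) -> list[str]:
    """Very rough stand-in for a language analyzer: split on whitespace."""
    return text.split()


def not_analyzed(text: str) -> list[str]:
    """The entire string is kept as a single term, unchanged."""
    return [text]


def lowercased(text: str) -> list[str]:
    """Single term, lowercased, with accents folded away."""
    decomposed = unicodedata.normalize("NFKD", text)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return [folded.lower()]


title = "My Resum\u00e9"
print(analyzed(title))
print(not_analyzed(title))
print(lowercased(title))
```

A lookup against a lowercased field would need the same folding applied to the search string, so that “my resume” and “My Resumé” compare equal.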

Schema Details:

Allowed Queries & Filters

Please see the Elasticsearch Query DSL Guide for how to build queries. Note that some types of Elasticsearch queries are too resource intensive for us to run on the WordPress.com infrastructure, so only the following queries and filters are allowed:

Common Filters:

Exotic Filters:

Common Queries:

Exotic Queries:

Why-Not-Use-A-Filter-Instead Queries:

Faceting

Running faceted queries requires a custom index, so faceting is currently only available for WordPress.com VIP clients with the VIP Search Add On. The custom indices for VIP clients also add fields to the post documents that hold the post meta.
