twitter-text Parser

twitter-text parsing updates

The twitter-text library provides implementations in four languages: Java, JavaScript, Ruby, and Objective-C. As part of changes to Tweet anatomy to make Tweets more flexible, we are introducing some changes to the way we count characters, as well as some new APIs to simplify usage of the library.

Overview

To make the counting logic more flexible, a twitter-text Configuration has been introduced that is used by all four language implementations, and a new weighted length calculation has been added.

twitter-text Configuration:

The Configuration defines Unicode code point ranges, with a weight associated with each of these ranges. This enables language density to be taken into consideration when counting characters.

Other fields in this configuration are:

  • Maximum weighted length for validation
  • Default weight to cover code points not defined in the ranges
  • Transformed URL length, which is the default length assigned to all URLs

Weighted Length Calculation:

A fixed maximum character count is no longer defined. Instead, twitter-text computes a weighted length using the weights assigned to the Unicode code point ranges. The algorithm for length calculation is as follows:

  weightedLength ← 0
  for codepoint in text:
      weight ← defaultWeight
      for range in ranges:
          if codepoint in range:
              weight ← range.weight
              break
      weightedLength ← weightedLength + weight
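
The loop above can be sketched in Python as follows. This is a minimal illustration, not the library's code: the names WeightRange and weighted_length are hypothetical, and range bounds are assumed to be inclusive at both ends.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class WeightRange:
    start: int   # first code point in the range (assumed inclusive)
    end: int     # last code point in the range (assumed inclusive)
    weight: int  # weight applied to every code point in the range


def weighted_length(text: str, ranges: Iterable[WeightRange], default_weight: int) -> int:
    """Sum the weight of every code point in text, as in the pseudocode."""
    ranges = list(ranges)
    total = 0
    for ch in text:            # a Python str iterates by Unicode code point
        cp = ord(ch)
        weight = default_weight
        for r in ranges:
            if r.start <= cp <= r.end:
                weight = r.weight
                break          # the first matching range wins
        total += weight
    return total


latin = [WeightRange(0, 4351, 100)]
print(weighted_length("Hi", latin, 200))    # 200: two code points inside the range
print(weighted_length("日本", latin, 200))  # 400: two code points at the default weight
```

The result is the raw sum of weights, exactly as in the pseudocode; it has not been divided by the configuration's scale value.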

Note that the weighted length is not the number of characters in the Tweet, and it should not be used to estimate how many characters remain. See the Recommendations section for an alternative.

New API method

Previous versions of twitter-text provided different helper methods for Tweet validation, Tweet length, and the remaining characters calculation. To simplify the API and obtain this information with just one call, twitter-text now exposes a new “parseTweet” method that will return the following fields:

  • weightedLength: Integer that indicates the weighted length calculated by the algorithm above.
  • permillage: Integer indicating the proportion of the maximum weighted length that has been consumed, expressed in parts per thousand.
  • isValid: Boolean indicating whether it is a valid Tweet.
  • displayTextRange: A display range with start and end indices on the Tweet string.
  • validDisplayTextRange: A display range indicating the start and end indices for valid text. Its end index can be less than the end index of the display text range.
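
To make the relationship between weightedLength, permillage, and isValid concrete, here is a small sketch. It does not reproduce the library's API: ParseResult and summarize are hypothetical names, the permillage is assumed to be rounded down (which agrees with the sample in this document, where 26 out of 280 gives 92), and the validity rule is simplified to a length check only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParseResult:
    weighted_length: int
    permillage: int
    is_valid: bool


def summarize(weighted_length: int, max_weighted_length: int = 280) -> ParseResult:
    """Derive permillage and a simplified validity flag from a weighted length."""
    # Parts per thousand of the maximum, rounded down (assumption).
    permillage = weighted_length * 1000 // max_weighted_length
    # Simplified rule: non-empty and within the maximum. A real validator
    # may apply further checks that are not modelled here.
    is_valid = 0 < weighted_length <= max_weighted_length
    return ParseResult(weighted_length, permillage, is_valid)


print(summarize(26))   # ParseResult(weighted_length=26, permillage=92, is_valid=True)
print(summarize(281))  # permillage above 1000 and is_valid False
```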

Marked for Deprecation

Methods related to “tweetLength” and “remaining character count” will be marked as deprecated and removed in a future version. A remaining character count is not meaningful under the new weighted counting scheme, because the number of characters that still fit depends on the language the user is writing in.

Previous versions also provided separate configurable options for “http” and “https” URL lengths. This is now a single constant defined in the twitter-text Configuration.

Details about these changes will be specified in the READMEs of the specific modules when they are released on GitHub.

Recommendations

Adopt a progress-based UI, instead of an absolute count, to indicate remaining characters. The “permillage” field that is part of the parseTweet response was added for exactly this reason.
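
As one possible illustration of a progress-based indicator, the sketch below turns a permillage value into a text progress bar. The function name progress_bar and its layout are hypothetical; a real client would draw a graphical control from the same value.

```python
def progress_bar(permillage: int, width: int = 20) -> str:
    """Render a permillage value (0 to 1000, possibly higher) as a text bar."""
    clamped = max(0, min(permillage, 1000))   # keep the bar within its bounds
    filled = clamped * width // 1000          # number of filled cells
    bar = "#" * filled + "-" * (width - filled)
    suffix = " (over limit)" if permillage > 1000 else ""
    return f"[{bar}] {clamped // 10}%{suffix}"


print(progress_bar(92))    # mostly empty bar, 9%
print(progress_bar(500))   # half-filled bar, 50%
print(progress_bar(1200))  # full bar flagged as over the limit
```

Because the bar depends only on the ratio, it behaves the same whichever language the Tweet is written in.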

twitter-text Configuration

  {
    "version": 2,
    "maxWeightedTweetLength": 280,
    "scale": 100,
    "defaultWeight": 200,
    "transformedURLLength": 23,
    "ranges": [
      {
        "start": 0,
        "end": 4351,
        "weight": 100
      },
      {
        "start": 8192,
        "end": 8205,
        "weight": 100
      },
      {
        "start": 8208,
        "end": 8223,
        "weight": 100
      },
      {
        "start": 8242,
        "end": 8247,
        "weight": 100
      }
    ]
  }
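
To show how these fields interact, the following sketch loads the configuration above and reports how much of the limit a single character consumes, taken here as weight divided by scale. The helper name effective_count is hypothetical, and both range bounds are assumed to be inclusive.

```python
import json

CONFIG_JSON = """
{
  "version": 2,
  "maxWeightedTweetLength": 280,
  "scale": 100,
  "defaultWeight": 200,
  "transformedURLLength": 23,
  "ranges": [
    {"start": 0, "end": 4351, "weight": 100},
    {"start": 8192, "end": 8205, "weight": 100},
    {"start": 8208, "end": 8223, "weight": 100},
    {"start": 8242, "end": 8247, "weight": 100}
  ]
}
"""

config = json.loads(CONFIG_JSON)


def effective_count(char: str) -> float:
    """Return how many units of the limit one character consumes."""
    cp = ord(char)
    for r in config["ranges"]:
        if r["start"] <= cp <= r["end"]:   # assumption: inclusive bounds
            return r["weight"] / config["scale"]
    return config["defaultWeight"] / config["scale"]


print(effective_count("a"))    # 1.0
print(effective_count("日"))   # 2.0
print(config["maxWeightedTweetLength"] / effective_count("日"))  # 140.0
```

Under this configuration, a Tweet made up only of characters in the listed ranges can hold 280 of them, while one made up only of default-weight characters can hold 140.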

Sample

  - description: "Regular Tweet with url"
    text: "Hi http://test.co"
    expected:
      weightedLength: 26
      valid: true
      permillage: 92
      displayRangeStart: 0
      displayRangeEnd: 16
      validRangeStart: 0
      validRangeEnd: 16
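
The expected values in this sample can be checked by hand. The sketch below rests on three assumptions that the configuration suggests but the text does not spell out: a URL contributes transformedURLLength multiplied by scale to the raw sum, the raw sum is divided by scale at the end, and the permillage is rounded down. URL detection is not modelled; the URL is supplied directly.

```python
# Values taken from the configuration shown earlier.
SCALE = 100
MAX_WEIGHTED_TWEET_LENGTH = 280
TRANSFORMED_URL_LENGTH = 23
LATIN_WEIGHT = 100  # weight of the range covering code points 0 to 4351

text = "Hi http://test.co"
url = "http://test.co"  # assumption: the URL span is already known

non_url_text = text.replace(url, "")  # "Hi " (three characters)
raw = len(non_url_text) * LATIN_WEIGHT + TRANSFORMED_URL_LENGTH * SCALE
weighted_length = raw // SCALE
permillage = weighted_length * 1000 // MAX_WEIGHTED_TWEET_LENGTH
display_range_end = len(text) - 1     # index of the last character

print(weighted_length)    # 26
print(permillage)         # 92
print(display_range_end)  # 16
```

The URL is 14 characters long but counts as 23, which is why the weighted length (26) exceeds the number of characters in the text (17).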