OCFL and “source of truth” — two options

Some of the great things about conferences are how different sessions can play off each other, and how lots of people interested in the same thing are in the same place (virtual or real) at the same time, able to bounce ideas off each other.

I found both of those things coming into play to help elucidate what I think is an important issue in how software might use the Oxford Common File Layout (OCFL). This was prompted by the Code4Lib 2023 session The Oxford Common File Layout – Understanding the specification, institutional use cases and implementations, with presentations by Tom Wrobel, Stefano Cossu, and Arran Griffith (recorded video here).

OCFL is a specification for laying files out in a disk-like storage system in a way that is suitable for long-term preservation. It uses a simple, standard layout that is both human- and machine-readable, and that would allow someone (or some software) at a future point to reconstruct digital objects and metadata from the record left on disk.

The role of OCFL in a software system: Two choices

After the conference presentation, Matt Lincoln from JStor Labs asked a question in Slack chat that had been rising up in my mind too, but which Matt put more clearly than I had managed to at the time! This prompted a discussion on Slack, largely but not entirely between me and Stefano Cossu, which I found to be very productive, and which I’m going to detail here with my own additional glosses. But first, let’s start with Matt’s question.

(I will insert slack links to quotes in this piece; you probably can’t see the sources unless you are a member of the Code4Lib workspace).

For the OCFL talk, I’m still unclear what the relationship is/can/will be in these systems between the database supporting the application layer, and the filesystem with all the OCFL-laid-out objects. Does DB act as a source of truth and OCFL as a copy? OCFL as source of truth and DB as cache? No db at all, and just r/w directly to OCFL?  If I’m a content manager and edit an item’s metadata in the app’s web interface, does that request get passed to a DB and THEN to OCFL? Is the web app reading/writing directly to the OCFL filesystem without mediating DB representation? Something else?

Matt Lincoln

I think Matt, using the helpful term “source of truth”, accurately identifies two categories of OCFL use in a software system — and in fact, different people in the OCFL community — even different presenters in this single OCFL conference session — have been taking different paths, maybe assuming that everyone else was on the same page as them, or at least not frequently drawing out the difference and consequences of these two paths.

Stefano Cossu, one of the presenters from the OCFL talk at Code4Lib, described it this way in a Slack response:

IMHO OCFL can either act as a source from which you derive metadata, or a final destination for preservation derived from a management or access system, that you don’t want to touch until disaster hits. It all depends on how your ideal information flow is. I believe Fedora is tied to OCFL which is its source of truth, upon which you can build indices and access services, but it doesn’t necessarily need to be that way.

Stefano Cossu

It turns out that both paths are challenging in different ways; there is no magic bullet. I think this is a foundational question for the software engineering of systems that use OCFL for preservation, with significant implications on the practice of digital preservation as a whole.

First, let’s say a little bit more about what the paths are.

“OCFL as a source of truth”

If you are treating OCFL as a “source of truth”, the files stored in OCFL are the primary location of your data.

When the software wants to add, remove, or change data, the change will probably happen in the OCFL first, or at any rate won’t be considered successful until it is reflected in OCFL.

There might be other layers on top providing alternate access, some kind of “index” of the OCFL for faster and/or easier access to the data, but these are considered “derivative” and can always be re-created by an automated process from just the OCFL on disk. The OCFL is “the data”; everything else is derivative.

This may be what some of the OCFL designers were assuming everyone would do; as we’ll see, it makes certain things possible, and provides the highest level of confidence in our preservation activities.

“OCFL off to the side”

Alternately, you might write an application more or less using standard architectures for writing (eg) web applications. The data is probably in a relational database system (rdbms) like postgres or MySQL, or some other data store meant for supporting application development.

When the application makes a change to the data, it’s made to the primary data store.

Then the data is “mirrored” to OCFL, possibly after every change, or possibly periodically. The OCFL can be thought of as a kind of “backup” — a backup in a specific standard format meant to support long-term preservation and interoperability. I’m calling this “off to the side”; Stefano above calls it “final destination”; in either case it’s contrasted with “source of truth”.
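As a concrete example of the “mirror after every change” variant, here is a minimal sketch of how a Rails app might do it; MirrorToOcflJob is a hypothetical job you would write to serialize the record into your OCFL store:

# Hypothetical sketch: mirror a record to OCFL after every committed change.
class Work < ApplicationRecord
  # after_commit, so we only mirror once the primary data store (the actual
  # source of truth in this path) has successfully saved the change
  after_commit :enqueue_ocfl_mirror, on: [:create, :update]

  private

  def enqueue_ocfl_mirror
    # MirrorToOcflJob is hypothetical; it would write this record's files and
    # metadata as a new OCFL object version, off to the side
    MirrorToOcflJob.perform_later(self.class.name, id)
  end
end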

It’s possible you haven’t stored all the data the application uses in OCFL, only the data you want to back up “for long-term preservation purposes”. (Stefano later suggests this is their practice, in fact). Maybe there is some data you think is necessary only for the particular present application’s functionality (say, to support back-end accounts and workflows), which you think of as accidental, ephemeral, contextual, or system-specific and non-standard — and which you don’t see any use in storing for long-term preservation.

In this path, if ALL you have is the OCFL, you aren’t intending that you can necessarily stand your actual present application back up — maybe you didn’t store all the data you’d need for that; maybe you don’t have existing software capable of translating the OCFL back to the form the application actually needs it in to function. Or if you are intending that, the challenge of accomplishing it is greater, as we’ll see.

So why would you do this? Well, let’s start with that.

Why not OCFL as a source of truth?

There’s really only one reason — because it makes application development a lot harder. What do I mean by “a lot harder”? I mean it’s going to take more development time and more care in design and decisions; you’re going to have more trouble achieving reasonable performance in a large-scale system; and you’re going to make more mistakes, have more bugs and problems, and have more initial deliveries with problems. It’s not all “up-front” cost or known cost: as you continue to develop the system, you’re going to keep struggling with these things. You honestly have an increased chance of failure.

Why?

In the Slack thread, Stefano Cossu spoke up for OCFL to be a “final destination”, not the “source of truth” for the daily operating software:

I personally prefer OCFL to be the final destination, since if it’s meant to be for preservation, you don’t want to “stir” the medium by running indexing and access traffic, increasing the chances of corruption.

Stefano Cossu

If you’re using it as the actual data store for a running application, instead of leaving it off to the side as a backup, it perhaps increases the chances of bugs affecting data reliability.

The problem with that setup [OCFL as source of truth] is that a preservation system has different technical requirements from an access system. E.g. you may not want store (and index) versioning information in your daily-churn system. Or you may want to use a low-cost, low-performance medium for preservation

Stefano Cossu

OCFL is designed to rebuild knowledge (not only data, but also the semantic relationships between resources) without any supporting software. That’s what I intend for long-term preservation. In order to do that, you need to serialize everything in a way that is very inefficient for daily use.

Stefano Cossu

The form that OCFL prescribes is cumbersome to use for ordinary daily functionality. It makes it harder to achieve the goals you want for your actually running software.

I think Stefano is absolutely right about all of this, by the way, and also thank him for skillfully and clearly delineating a perspective that may, explicitly or not, actually be somewhat against the stream of some widespread OCFL assumptions.

One aspect of the cumbersomeness is that writes to OCFL need to be “synchronized” with regard to concurrency — the contents of a new version written to OCFL are expressed as deltas on the previous version, so if another version is added while you are preparing your additional version, your version will be wrong. You need to use some form of locking, whether optimistic locking or naive pessimistic locks.
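To make the “naive pessimistic lock” option concrete, here is a minimal sketch for an OCFL root on a plain local filesystem. The paths are illustrative, write_new_version! is a hypothetical placeholder, and a real system (especially one on object storage like S3) would need a different coordination mechanism:

require "fileutils"

# Naive pessimistic locking sketch: take an exclusive lock on a per-object
# lockfile so only one writer at a time can compute deltas against the head
# version and append the next version. flock only coordinates processes on
# a single host.
def with_ocfl_object_lock(object_id, lock_dir: "/var/locks/ocfl")
  FileUtils.mkdir_p(lock_dir)
  lock_path = File.join(lock_dir, object_id.gsub(/[^A-Za-z0-9_\-]/, "_"))
  File.open(lock_path, File::RDWR | File::CREAT) do |f|
    f.flock(File::LOCK_EX) # blocks until any concurrent writer finishes
    yield
  end
end

with_ocfl_object_lock("info:example/object-1") do
  # read the current inventory, compute the new version as deltas against
  # the head version, then write the version directory and updated inventory
  # write_new_version!(...)  # hypothetical
end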

Whereas a relational database system is built on decades of work to ensure ACID (atomicity, consistency, isolation, durability) with regard to writes, while also trying to optimize performance within these constraints (which can be a real tension) — with OCFL we don’t have the built-up solutions (tools and patterns) for this to the same extent.

Application development gets a lot harder

In general, building a (say) web app on a relational database system is a known problem with a huge corpus of techniques, patterns, shared knowledge, and toolsets available. A given developer may be more or less experienced or skilled; different developers may disagree on optimal choices in some cases. But those choices are being made from a very established field, with deep shared knowledge on how to build applications rapidly (cheaply), with good performance and reliability.

When we switch to OCFL as the primary “source of truth” for an app, we in some ways are charting new territory and have to figure out and invent the best ways to do certain things, with much less support from tooling, the “literature” (even including blogs you find on google etc), and a much smaller community of practice.

The Fedora repository platform is in some sense meant to be a kind of “middleware” to make this lift easier. In its version 6 incarnation, its own internal data store is OCFL. It doesn’t give you a user-facing app. It gives you a “middleware” you can access over a more familiar HTTP API with clear semantics, so you don’t have to deal with the underlying OCFL (or, in previous incarnations, other internal formats) yourself. (Seth Erickson’s ocfl_index could be thought of as a similar peer “middleware” in some ways, although it’s read-only; it doesn’t provide for writing.)

But it’s still not the well-trodden path of rapid web application development on top of an rdbms.

I think the samvera (née hydra) community really learned this to some extent the hard way: trying to build on top of this novel architecture really raised the complexity, cost, and difficulty of implementing the user-facing application (with implications for succession, hiring, and retention too). I’m not saying this happened because the Fedora team did something wrong; I’m saying a novel architecture like this inherently and necessarily raises the difficulty over a well-trodden architectural path. (Although it’s possible to recognize the challenge and attempt to ameliorate it with features that make things easier on developers, it’s not possible to eliminate it.)

Some samvera peer institutions have left the Fedora-based architecture, I think as a result of this experience. Where I work at the Science History Institute, we left sufia/hydra/samvera to write something closer to a “just plain Rails app”, and I believe it successfully and seriously increased our capacity to meet organizational and business needs within our available software engineering capacity. I personally would be really reluctant to go back to attempting to use Fedora and/or OCFL as a “source of truth”, instead of more conventional web app data storage patterns.

So… that’s why you might not… but what do you lose?

What do you lose without OCFL as source of truth?

The trade-off is real though — I think some of the assumptions about what OCFL provides, and how, are actually based on the assumption of OCFL as source of truth in your application.

Mike Kastellec’s Code4Lib presentation just before the OCFL one, on How to Survive a Disaster [Recovery] really got me thinking about backups and reliability.

Many of us have heard (or worse, found out ourselves the hard way) the adage: You don’t really know if you have a good backup unless you regularly go through the practice of recovery using it, to test it. Many have found that what they thought was their backup — was missing, was corrupt, or was not in a format suitable for supporting recovery. Because they hadn’t been verifying it would work for recovery, they were just writing to it but not using it for anything.

(Where I work, we try to regularly use our actual backups as the source of sync’ing from a production system to a staging system, in part as a method of incorporating backup recovery verification into our routine).

How is a preservation copy analogous? If your OCFL is not your source of truth, but just “off to the side” as a “preservation copy” — it can easily be a similar “write-only” copy. How do you know what you have there is sufficient to serve as a preservation copy?

Just as with backups, there are (at least) two categories of potential problem. It could be that there are bugs in your synchronization routines, such that what you thought was being copied to OCFL was not, or not on the schedule you thought, or was getting corrupted or lost in transit. But the other category is even worse: it could be that your design had problems, and what you chose to sync to OCFL left out some crucial things that future consumers of your preservation copy would have needed to fully restore and access the data. Stefano also wrote:

We don’t put everything in OCFL. Some resources are not slated for long-term preservation. (or at least, we may not in the future, but we do now)

If you are using the OCFL as your daily “source of truth”, you at least know the data you have stored in OCFL is sufficient to run your current system. Or at least you haven’t noticed any bugs with it yet, and if anyone notices any you’ll fix them ASAP.

The goal of preservation is that some future system will be able to use these files to reconstruct the objects and metadata in a useful way… It’s good to at least know it’s sufficient for some system, your current system. If you are writing to OCFL and not using it for anything… it reminds us of writing to a backup that you never restore from. How do you know it’s not missing things, by bug or by misdesign?

Do you even intend the OCFL to be sufficient to bring up your current system (I think some do, some don’t, some haven’t thought about it), and if you do, how do you know it meets your intents?

OCFL and Completeness and Migrations

The OCFL web page lists as one of its benefits (which I think can also be understood as design goals for OCFL):

Completeness, so that a repository can be rebuilt from the files it stores

If OCFL is your application’s “source of truth”, you have this necessarily, in the sense that it’s almost the definition of OCFL being the “source of truth”. (Maybe this suggests at least some OCFL designers were assuming it as source of truth.)

But if your OCFL is “off to the side”… do you even have that? I guess it depends on whether you intended the OCFL to be transformable back into your application’s own internal source of truth, and whether that intention was successful. If we’re talking about data from your application being written “off to the side” to OCFL, and then later transformed back into your application — I think we’re talking about what is called “round-tripping” the data.

There was another Code4Lib presentation about repository migration at Stanford. In the Slack discussion about that presentation, Stanford’s Justin Coyne and Mike Giarlo wrote:

I don’t recommend “round trip mappings”. I was a developer on this project.  It’s very challenging to not lose data when going from A -> B -> A

Justin Coyne

We spent sooooo much time on getting these round-trip mappings correct. Goodness gracious.

Mike Giarlo

So, if you want to make your OCFL “off to the side” provide this quality of completeness via round-trippability, you probably have to be focusing on it intentionally, and then it’s still going to be really hard, maybe one of the hardest (most time-consuming, most buggy) aspects of your application, or at least of its persistence layer.

I found this presentation about repository migration really connecting my neurons to the OCFL discussion generally — when I thought about this I realized, well, that makes sense, woah, is one description of “preservation” activities actually: a practice of trying to plan and provide for unknown future migrations not yet fully spec’d?

So, while we were talking about repository migrations on Slack, and how challenging the data migrations were (several conference presentations dealt with data migrations in repositories), Seth Erickson made a point about OCFL:

One of the arguments for OCFL is that the repository software should upgradeable/changeable without having to migrate the data… (that’s the aspiration, anyways)

Seth Erickson

If the vision is that with nothing more than an OCFL storage system, we can point new software to it and be up and running without a data migration — I think we can see this is basically assuming OCFL as the “source of truth”, and also talking about the same thing the OCFL webpage calls “completeness” again.

And why is this vision aspirational? Well, to begin with, we don’t actually have very many repository systems that use OCFL as a source of truth. We may only have Fedora — that is, systems that use Fedora as middleware. Or maybe ocfl_index too, although, being read-only middleware that doesn’t necessarily have user-facing software built on it yet, it’s probably currently a partial entry at most.

If we had multiple systems that could already do this, we’d be a lot more confident it would work out — but of course, the expense and difficulty of building a system using OCFL as the “source of truth” is probably a large part of why we don’t!

OK, do we at least have multiple systems based on fedora? Well… yes. Even before Fedora was based on OCFL, it would hypothetically be possible to upgrade/change repository software without a data migration if both source and target software were based on Fedora… except, in fact, it was not possible to do this between Samvera sufia/hydra and Islandora, because even though they were both based on Fedora, the metadata they stored in Fedora (or OCFL) was not consistent. That’s a whole giant topic we’re not going to cover here, except to point out it’s a huge challenge for that vision of “completeness” providing for software changes without data migration — a huge challenge that we have seen in practice, without necessarily seeing a success in practice. (Even within hyrax alone, there are currently two different possible fedora data layouts, using traditional activefedora with the “wings” adapter or instead the valkyrie-fedora adapter, requiring data migration between them!)

And if we think of the practice of preservation as being trying to maximize chances of providing for migration to future unknown systems with unknown needs… then we see it’s all aspirational (that far-future digital preservation is an aspirational endeavor is of course probably not a controversial thing to say either).

But the little bit of paradox here is that while “completeness” makes it more likely you will be able to easily change systems without data loss, the added cost of developing systems that achieve “completeness” via OCFL as “source of truth” means you will probably have far fewer, if any, choices of suitable systems to change to, or resources available to develop them!

So… what do we do? Can we split the difference?

I think the first step is acknowledging the issue, the tension here between completeness via “OCFL as source-of-truth” and, well, ease of software development. There is no magic answer that optimizes everything, there are trade-offs.

That quality of “completeness” of data (“source of truth”) is going to make your software much more challenging to develop. Take longer, take more skill, have more chance of problems and failures. And another way to say this is: within a given amount of engineering resources, you will be delivering fewer features that matter to your users and organization, because you are spending more of your resources building on a more challenging architecture.

What you get out of this is aspirationally increased chances of successful preservation. This doesn’t mean you shouldn’t do it; digital preservation is necessarily aspirational. I’m not sure how one balances this cost and benefit — it may well be different for different institutions — but I think we should be careful not to routinely under-estimate the cost or over-estimate the size or confidence of benefits from the “source of truth” approach. Undoubtedly many institutions will still choose to develop OCFL as a source of truth, especially using middleware intended to ease the burden, like Fedora.

I will probably not be one of them at my current institution — the cost is just too high for us; we can’t give up the capacity to relatively rapidly meet other organizational and user needs. But I’d like to look at incorporating OCFL as an “off to the side” preservation copy in the future anyway.

(And Stefano and I are definitely not the only ones considering this or doing it. Many institutions are using an “off to the side” “final destination” approach to preservation copies, if not with OCFL, then with some of its progenitors or peers like BagIt or Stanford’s MOAB — the “off to the side” approach is not unusual, and for good reasons! We can acknowledge it and talk about it without shame!)

If you are developing instead with OCFL as an “off to the side” (or “final destination”) copy, are there things you can do to try to get closer to the benefits of OCFL as “source of truth”?

The main thing I can think of involves “round-trippability”:

  • Yes, commit to storing all of your objects and metadata necessary to restore a working current system in your OCFL
  • And commit to storing it round-trippably
  • One way to ensure/enforce this would be: every time you write a new version to OCFL, run a job that serializes those objects and metadata to OCFL and back into your internal format, and verifies that the result is still equivalent. Verify the round-trip, as sketched below.
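Here is a minimal sketch of what that verification job could look like; export_to_ocfl and load_from_ocfl are hypothetical serializers you would have to write for your own models, and the equality check would need to be whatever “equivalent” means for your data:

# Hypothetical round-trip verification, run after each OCFL write.
class VerifyOcflRoundTripJob < ApplicationJob
  def perform(work_id)
    work = Work.find(work_id)

    ocfl_version = export_to_ocfl(work)         # app data -> OCFL (hypothetical)
    restored     = load_from_ocfl(ocfl_version) # OCFL -> app data (hypothetical)

    unless restored.attributes == work.attributes
      # surface the failure loudly; a silent gap here is exactly the
      # "backup you never restore from" problem
      raise "OCFL round-trip mismatch for Work #{work.id}"
    end
  end
end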

Round-trippability doesn’t just happen on its own, and ensuring it will definitely significantly increase the cost of your development — as the Stanford folks said from experience, round-trippability is a headache and a major cost! But it could conceivably get you a lot of the confidence in “completeness” that “source of truth” OCFL gets you. And since it is still “off to the side”, it still allows you to write your application using whatever standard (or innovative in different directions) architectures you want; you don’t have a novel data persistence architecture involved in all of your feature development to meet user and business needs.

This will perhaps arrive at a better cost/benefit balance for some institutions.

There may be other approaches or thoughts, this is hopefully the beginning of a long conversation and practice.

Escaping/encoding URI components in ruby 3.2

Thanks to zverok_kha’s awesome writeup of Ruby changes, I noticed a new method released in ruby 3.2: CGI.escapeURIComponent

This is the right thing to use if you have an arbitrary string that might include characters not legal in a URI/URL, and you want to include it as a path component or part of the query string:

require 'cgi'

url = "https://example.com/some/#{ CGI.escapeURIComponent path_component }" + 
  "?#{CGI.escapeURIComponent my_key}=#{CGI.escapeURIComponent my_value}"
  • The docs helpfully refer us to RFC3986, a rare citation in the wild world of confusing and vaguely-described implementations of escaping (to various different standards and mistakes) for URLs and/or HTML
  • This will escape / as %2F, meaning you can use it to embed a string with / in it inside a path component, for better or worse
  • This will escape a space ( ) as %20, which is correct and legal in either a query string or a path component
  • There is also a reversing method available CGI.unescapeURIComponent

What if I am running on a ruby previous to 3.2?

Two things in standard library probably do the equivalent thing. First:

require 'cgi'
CGI.escape(input).gsub("+", "%20")

CGI.escape, but take the +s it encodes space characters into and gsub them into the more correct %20. This will not be as performant because of the gsub, but it works.

This, I noticed a while ago, is what the ruby aws-sdk does… well, except it also unescapes %7E back to ~, which does not need to be escaped in a URI. But… generally… it is fine to percent-encode ~ as %7E. Or copy what aws-sdk does, hoping they actually got it right to be equivalent?

Or you can use:

require 'erb'
ERB::Util.url_encode(input)

But it’s kind of weird to have to require the ERB templating library just for URI escaping. (And would I be shocked if the ruby team moves erb from “default gem” to “bundled gem”, or further, causing you more headache down the road? I would not.) (Btw, ERB::Util.url_encode leaves ~ alone!)

Do both of these things do exactly the same thing as CGI.escapeURIComponent? I can’t say for sure — see the discussion of CGI.escape and ~ above. Sure is confusing. (There would be a way to figure it out: take all the chars in the various relevant character classes in the RFC and test them against these different methods. I haven’t done it exhaustively yet.)
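Here’s a rough sketch of the kind of check one could run (ruby 3.2+), simply comparing the three approaches over printable ASCII and printing any character where they disagree; exploratory only, not a conformance test:

require "cgi"
require "erb"

# Compare the three escaping approaches over printable ASCII and print any
# character where they disagree.
(32..126).map(&:chr).each do |char|
  results = {
    escape_uri_component: CGI.escapeURIComponent(char),
    cgi_escape_plus_gsub: CGI.escape(char).gsub("+", "%20"),
    erb_url_encode:       ERB::Util.url_encode(char)
  }
  puts "#{char.inspect} => #{results}" if results.values.uniq.size > 1
end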

What about URI.escape?

In old code I encounter, I often see places using URI.escape to prepare URI query string values…

# don't do this, don't use URI.escape
url = "https://example.com?key=#{ URI.escape value }"

# not this either, don't use URI.escape
url = "https://example.com?" + 
   query_hash.collect { |k, v| "#{URI.escape k}=#{URI.escape v}"}.join("&")

This was never quite right, in that URI.escape was a huge mess… intending to let you pass in whole URLs that were not legal URLs in that they had some illegal characters that needed escaping, and it would somehow parse them and then escape the parts that needed escaping… this is a fool’s errand and not something it’s possible to do in a clear consistent and correct way.

But… it worked out okay because the output of URI.escape overlapped enough with (the new RFC 3986-based) CGI.escapeURIComponent that it mostly (or maybe even always?) worked out. URI.escape did not escape a /… but it turns out / is probably actually legal in a query string value anyway, it’s optional to escape it to %2F in a query string? I think?

And people used it in this scenario, I’d guess, because its name made it sound like the right thing? Hey, I want to escape something to put it in a URI, right? And then other people copied from code they saw, etc.

But URI.escape was an unpredictable bad idea from the start, and was deprecated by ruby, then removed entirely in ruby 3.0!

When it went away, it was a bit confusing to figure out what to replace it with. Because if you asked, sometimes people would say “it was broken and wrong, there is nothing to replace it”, which is technically true… but the code escaping things for inclusion in, eg, query strings, still had to do that… and then the “correct” behavior for this actually only existed in the ruby stdlib in the erb module (?!?) (where few had noticed it before URI.escape went away)… and CGI.escapeURIComponent which is really what you wanted didn’t exist yet?

Why is this so confusing and weird?

Why was this functionality in ruby stdlib non-existent/tucked away? Why are there so many slightly different implementations of “uri escaping”?

Escaping is always a confusing topic in my experience — and a very very confusing thing to debug when it goes wrong.

The long history of escaping in URLs and HTML is even more confusing. Like, turning a space into a + was specified for application/x-www-form-urlencoded format (for encoding an HTML form as a string for use as a POST body)… and people then started using it in url query strings… but I think possibly that was never legal, or perhaps the specifications were incomplete/inconsistent on it.

But it was so commonly done that most things receiving URLs would treat a literal + as an encoded space… and then some standards were retroactively changed to allow it for compatibility with common practice…. maybe. I’m not even sure I have this right.

And then, as with the history of the web in general, there have been a progression of standards slightly altering this behavior, leapfrogging with actual common practice, where technically illegal things became common and accepted, and then standards tried to cope… and real world developers had trouble understanding there might be different rules for legal characters/escaping in HTML vs URIs vs application/x-www-form-urlencoded strings vs HTTP headers…. and then language stdlib implementers (including but not limited to ruby) implemented things with various understandings according to various RFCs (or none, or buggy), documented only with words like “Escapes the string, replacing all unsafe characters with codes.” (unsafe according to what standard? For what purpose?)

PHEW.

It being so confusing, lots of people haven’t gotten it right — I swear that AWS S3 uses different rules for how to refer to spaces in filenames than AWS MediaConvert does, such that I couldn’t figure out how to get AWS MediaConvert to actually input files stored on S3 with spaces in them, and had to just make sure to not use spaces in filenames on S3 destined for MediaConvert. But maybe I was confused! But honestly I’ve found it’s best to avoid spaces in filenames on S3 in general, because S3 docs and implementation can get so confusing and maybe inconsistent/buggy on how/when/where they are escaped. Because like we’re saying…

Escaping is always confusing, and URI escaping is really confusing.

Which is I guess why the ruby stdlib didn’t actually have a clearly labelled provided-with-this-intention way to escape things for use as a URI component until ruby 3.2?

Just use CGI.escapeURIComponent in ruby 3.2+, please.

What about using the Addressable gem?

When the horrible URI.escape disappeared, people who had been wrongly using it to escape strings for use as URI components needed some replacement, and the ruby stdlib was confusing (maybe they hadn’t noticed ERB::Util.url_encode, or weren’t confident it did the right thing, and gee, I wonder why not), so some people turned to the addressable gem.

This gem for dealing with URLs does provide ways to escape strings for use in URLs… it actually provides two different algorithms depending on whether you want to use something in a path component or a query component.

require 'addressable'

Addressable::URI.encode_component(query_param_value, Addressable::URI::CharacterClasses::QUERY)

Addressable::URI.encode_component(path_component, Addressable::URI::CharacterClasses::PATH)

Note Addressable::URI::CharacterClasses::QUERY vs Addressable::URI::CharacterClasses::PATH? Two different routines? (Both by the way escape a space to %20 not +).

I think that while some things need to be escaped in (eg) a path component and don’t need to be in a query component, the specs also allow some things that don’t need to be escaped to be escaped in both places, such that you can write an algorithm that produces legally escaped strings for both places, which I think is what CGI.escapeURIComponent is. Hopefully we’re in good hands.

On Addressable, neither the QUERY nor PATH variant escapes /, but CGI.escapeURIComponent does escape it to %2F. PHEW.

You can also call Addressable::URI.encode_component with no second arg, in which case it seems to use CharacterClasses::RESERVED + CharacterClasses::UNRESERVED from this list as the default character class. Whereas PATH, it looks like, is equivalent to UNRESERVED plus SOME of RESERVED (SUB_DELIMS but only some of GENERAL_DELIMS), and QUERY is just PATH plus ? as needing escaping…. (CGI.escapeURIComponent, btw, WILL escape ? to %3F.)

PHEW, right?

Anyhow

Anyhow, just use CGI.escapeURIComponent to… escape your URI components, just like it says on the lid.

Thanks to /u/f9ae8221b for writing it and answering some of my probably annoying questions on reddit and github.

attr_json 2.0 release: ActiveRecord attributes backed by JSON column

attr_json is a gem that provides attributes in ActiveRecord that are serialized to a JSON column, usually postgres jsonb, with multiple attributes in one json hash, in a way that can be treated as much as possible like any other “ordinary” (database column) ActiveRecord attribute.

It supports arrays and nested models as hashes, and the embedded nested models can also be treated much like an ordinary “associated” record — for instance, the CI build tests nested forms with cocoon, and I’ve had a report that it works well with stimulus nested forms, though I don’t currently know how to use those myself. (PR welcome for a test in the build?)

An example:

# An embedded model, if desired
class LangAndValue
  include AttrJson::Model

  attr_json :lang, :string, default: "en"
  attr_json :value, :string
end

class MyModel < ActiveRecord::Base
   include AttrJson::Record

   # use any ActiveModel::Type types: string, integer, decimal (BigDecimal),
   # float, datetime, boolean.
   attr_json :my_int_array, :integer, array: true
   attr_json :my_datetime, :datetime

   attr_json :embedded_lang_and_val, LangAndValue.to_type
end

model = MyModel.create!(
  my_int_array: ["101", 2], # it'll cast like ActiveRecord
  my_datetime: DateTime.new(2001,2,3,4,5,6),
  embedded_lang_and_val: LangAndValue.new(value: "a sentence in default language english")
)

By default it will serialize attr_json attributes to a json_attributes column (this can also be specified differently), and the above would be serialized like so:

{
  "my_int_array": [101, 2],
  "my_datetime": "2001-02-03T04:05:06Z",
  "embedded_lang_and_val": {
    "lang": "en",
    "value": "a sentence in default language english"
  }
}

Oh, attr_json also supports some built-in construction of postgres jsonb contains (“@>“) queries, with proper rails type-casting, through embedded models with keypaths:

MyModel.jsonb_contains(
  my_datetime: Date.today,
  "embedded_lang_and_val.lang" => "de"
) # an ActiveRelation, you can chain on whatever as usual

And it supports in-place mutations of the nested models, which I believe is important for them to work “naturally” as ruby objects.

my_model.embedded_lang_and_val.lang = "de"
my_model.embedded_lang_and_val_change 
# => will correctly return changes in terms of models themselves
my_model.save!

There are some other gems in this “space” of ActiveRecord attribute json serialization, with different fits for different use cases, created either before or after I created attr_json — but none provide quite this combination of features — or, I think, have architectures that make this combination feasible (I could be wrong!). Some to compare are jsonb_accessor, store_attribute, and store_model.

One use case where I think attr_json really excels is when using Rails Single-Table Inheritance, where different sub-classes may have different attributes.

And especially for a “content management system” type of use case, where on top of that single-table inheritance polymorphism, you can have complex hierarchical data structures, in an inheritance hierarchy, where you don’t actually want or need the complexity of an actual normalized rdbms schema for data that has both some polymorphism and some heterogeneity. We get some aspects of a schema-less json-document-store, but embedded in postgres, without giving up rdbms features or ordinary ActiveRecord affordances.
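As a sketch of that STI use case (the classes and attributes here are hypothetical illustrations, using the same attr_json API shown above):

# Single-table inheritance: all subclasses share one table (with a standard
# Rails `type` column), but each declares its own attr_json attributes,
# serialized into the shared json_attributes column.
class Work < ActiveRecord::Base
  include AttrJson::Record

  attr_json :title, :string
end

class Photograph < Work
  attr_json :camera,   :string
  attr_json :exposure, :string
end

class OralHistory < Work
  attr_json :interviewer,    :string
  attr_json :interview_date, :datetime
end

# No schema migration needed to give a subclass its own attributes:
Photograph.create!(title: "Lab bench, 1964", camera: "Nikon FM2")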

Slow cadence, stability and maintainability

While the 2.0 release includes a few backwards incompats, it really should be an easy upgrade for most if not everyone. And it comes three and a half years after the 1.0 release. That’s a pretty good run.

Generally, I try to really prioritize backwards compatibility and maintainability, doing my best to avoid anything that could provide backwards incompat between major releases, and trying to keep major releases infrequent. I think that’s done well here.

I know that management of rails “plugin” dependencies can end up a nightmare, and I feel good about avoiding this with attr_json.

attr_json was actually originally developed for Rails 4.2 (!!), and has kept working all the way to Rails 7. The last attr_json 1.x release actually supported (in same codebase) Rails 5.0 through Rails 7.0 (!), and attr_json 2.0 supports 6.0 through 7.0. (also grateful to the quality and stability of the rails attributes API originally created by sgrif).

I think this successfully makes maintenance easier for downstream users of attr_json, while also demonstrating success at prioritizing maintainability of attr_json itself — it hasn’t needed a whole lot of work on my end to keep working across Rails releases. Occasionally changes to the test harness are needed when a new Rails version comes out, but I actually can’t think of any changes needed to the implementation itself for new Rails versions, although there may have been a few.

Because, yeah, it is true that this is still basically a one-maintainer project. But I’m pleased it has successfully gotten some traction from other users — 390 github “stars” is respectable if not huge, with occasional Issues and PR’s from third parties. I think this is a testament to its stability and reliability, rather than to any (almost non-existent) marketing I’ve done.

“Slow code”?

In working on this and other projects, I’ve come to think of a way of working on software that might be called “slow code”. To really get stability and backwards compatibility over time, one needs to be very careful about what one introduces into the codebase in the first place. And very careful about getting the fundamental architectural design of the code solid in the first place — coming up with something that is parsimonious (few architectural “concepts”) and consistent and coherent, but can handle what you will want to throw at it.

This sometimes leads me to holding back on satisfying feature requests, even if they come with pull requests, even if it seems like “not that much code” — if I’m not confident it can fit into the architecture in a consistent way. It’s a trade-off.

I realize that in many contemporary software development environments, it’s not always possible to work this way. I think it’s a kind of software craftsmanship for shared “library” code (mostly open source) that… I’m not sure how much our field/industry accommodates development with (and the development of) this kind of craftsmanship these days. I appreciate working for a non-profit academic institute that lets me develop open source code in a context where I am given the space to attend to it with this kind of care.

The 2.0 Release

There aren’t actually any huge changes in the 2.0 release, mostly it just keeps on keeping on.

Mostly, 2.0 tries to make things adhere even closer and more consistently to what is expected of Rails attributes.

The “Attributes” API was still brand new in Rails 4.2 when this project started, but now that it has shown itself solid and mature, we can always create a “cover” Rails attribute in the ActiveRecord model, instead of making it “optional” as attr_json originally did. Which provides for some code simplification.

Some rough edges were sanded around making Time/Date attributes timezone-aware in the way Rails usually does transparently. And with some underlying Rails bugs/inconsistencies having long since been fixed in Rails, these attributes can now store milliseconds in JSON serialization rather than just whole seconds too.

I try to keep a good CHANGELOG, which you can consult for more.

The 2.0 release is expected to be a very easy migration for anyone on 1.x. If anyone on 1.x finds it challenging, please get in touch in a github issue or discussion, I’d like to make it easier for you if I can.

For my Library-Archives-Museums Rails people….

The original motivation for this came from trying to move off samvera (née hydra) sufia/hyrax to an architecture that was more “Rails-like”, and then realizing that the way we wanted to model our data in a digital collections app along the lines of sufia/hyrax would be rather too complicated to do with a reasonably normalized rdbms schema.

So… can we model things in the database in JSON — similar to how valkyrie-postgres would actually model things in postgres — but while maintaining an otherwise “Rails-like” development architecture? The answer: attr_json.

So, you could say the main original use case for attr_json was to persist a “PCDM“-ish data model ala sufia/hyrax, those kinds of use cases, in an rdbms, in a way that supported performant SQL queries (minimal queries per page, avoiding n+1 queries), in a Rails app using standard Rails tools and conventions, without an enormously complex expansive normalized rdbms schema.

While the effort to base hyrax on valkyrie is still ongoing, in order to allow postgres vs fedora (vs other possible future stores) to be a swappable choice in the same architecture — I know at least some institutions (like those of the original valkyrie authors) are using valkyrie in homegrown apps directly, as the main persistence API (instead of ActiveRecord).

In some sense, valkyrie-postgres (in a custom app) vs attr_json (in a custom app) are two paths to “step off” the hyrax-fedora architecture. They both result in similar things actually stored in your rdbms (and we both chose postgres, for similar reasons, including I think good support for json(b)). They both have advantages and disadvantages. Valkyrie-postgres kind of intentionally chooses not to use ActiveRecord (at least not in controllers/views etc, not in your business logic); one advantage of that is getting around some of the known, widely-commented-upon deficiencies and complaints with the standard Rails ActiveRecord architecture.

Whereas I followed a different path with attr_json — how can we store things in postgres similarly, but while still using ActiveRecord in a very standard Rails way, and how can we make it as standard a Rails way as possible? This keeps the disadvantages people sometimes complain about in the Rails architecture, but with the benefit of sticking to the standard Rails ecosystem, having less “custom community” stuff to maintain or figure out (including fewer lines of code in attr_json), and being more familiar or accessible to Rails-experienced or trained developers.

At least that’s the idea, and several years later, I think it’s still working out pretty well.

In addition to attr_json, I wrote a layer on top of it to provide some components that I thought would be both common and somewhat tricky in writing a pcdm/hyrax-ish digital collections app as “standard Rails as much as it makes sense”. This is kithe, and it hasn’t had very much uptake. The only other user I’m aware of (who is using only a portion of what kithe provides; but kithe means to provide for that as a use case) is Eric Larson at https://github.com/geobtaa/geomg.

However, meanwhile, attr_json itself has gotten quite a bit more uptake — from the wider Rails developer community, not our library-museum-archives community. attr_json’s 390 github stars isn’t that big in the wider world of things, but it’s pretty big for our corner of the world. (Compare to 160 for hyrax or 721 for blacklight.) That the people using attr_json and submitting Issues or Pull Requests largely aren’t library-museum-archives developers is something I consider positive and encouraging: it has escaped the cultural-heritage-rails bubble, and is meeting a more domain-independent or domain-neutral need, at a lower level of architecture, with a broader potential community.

A tiny donation to rubyland.news would mean a lot

I started rubyland.news in 2016 because it was a thing I wanted to see for the ruby community. I had been feeling a shrinking of the ruby open source collaborative community, it felt like the room was emptying out.

If you find value in Rubyland News, just a few dollars contribution on my Github Sponsors page would be so appreciated.

I wanted to make people writing about ruby and what they were doing with it visible to each other and to the community, in order to try to (re)build/preserve/strengthen a self-conception as a community, connect people to each other, provide entry to newcomers, and just make it easier to find ruby news.

I’ve been solely responsible for its development, and editorial and technical operations. I think it’s been a success. I don’t have analytics, but it seems to be somewhat known and used.

Rubyland.news has never been a commercial project. I have never tried to “monetize” it. I don’t even really highlight my personal involvement much. I have in the past occasionally had modest paid sponsorship barely enough to cover expenses, but decided it wasn’t worth the effort.

I have never and would never provide any kind of paid content placement, because I think that would be counter to my aims and values — I have had offers, specifically asking for paid placement not labelled as such, because apparently this is how the world works now, but I would consider that an unethical violation of trust.

It’s purely a labor of love, in attempted service to the ruby community, building what I want to see in the world as an offering of mutual aid.

So why am I asking for money?

The operations of Rubyland News don’t cost much, but they do cost something. A bit more since Heroku eliminated free dynos.

I currently pay for it out of my pocket, and mostly always have modulo occasional periods of tiny sponsorship. My pockets are doing just fine, but I do work for an academic non-profit, so despite being a software engineer the modest expenses are noticeable.

Sure, I could run it somewhere cheaper than heroku (and eventually might have to) — but I’m doing all this in my spare time, I don’t want to spend an iota more time or psychic energy on (to me) boring operational concerns than I need to. (But if you want to volunteer to take care of setting up, managing, and paying for deployment and operations on another platform, get in touch! Or if you are another platform that wants to host rubyland news for free!)

It would be nice to not have to pay for Rubyland News out of my pocket. But also, some donations would, as well as being monetarily helpful, help motivate me to keep putting energy into this, showing me that the project really does have value to the community.

I’m not looking to make serious cash here. If I were able to get just $20-$40/month in donations, that would about pay my expenses (after taxes, cause I’d declare it if I were getting that much), and I’d be overjoyed. Even 5 monthly sustainers at just $1 would really mean a lot to me, as a demonstration of support.

You can donate one-time or monthly on my Github Sponsors page. The suggested levels are $1 and $5.

(If you don’t want to donate or can’t spare the cash, but do want to send me an email telling me about your use of rubyland news, I would love that too! I really don’t get much feedback! jonathan at rubyland.news)

Thanks

  • Thanks to anyone who donates anything at all
  • also to anyone who sends me a note to tell me that they value Rubyland News (seriously, I get virtually no feedback — telling me things you’d like to be better/different is seriously appreciated too! Or things you like about how it is now. I do this to serve the community, and appreciate feedback and suggestions!)
  • To anyone who reads Rubyland News at all
  • To anyone who blogs about ruby, especially if you have an RSS feed, especially if you are doing it as a hobbyist/community-member for purposes other than business leads!
  • To my current single monthly github sponsor, for $1, who shall remain unnamed because they listed their sponsorship as private
  • To anyone contributing in their own way to any part of open source communities for reasons other than profit, sometimes without much recognition, to help create free culture that isn’t just about exploiting each other!

vite-ruby for JS/CSS asset management in Rails

I recently switched to vite and vite-ruby for managing my JS and CSS assets in Rails. I was switching from a combination of Webpacker and sprockets — I moved all of my Webpacker and most of my sprockets to vite.

  • Note that vite-ruby has smooth ready-made integrations for Padrino, Hanami, and jekyll too, and possibly hook points for integrations with arbitrary ruby; plus you could always just use vite without vite-ruby — but I’m using vite-ruby with Rails.

I am finding it generally pretty agreeable, so I thought I’d write up some of the things I like about it for others. And a few other notes.

I am definitely definitely not an expert in Javascript build systems (or JS generally), which both defines me as an audience for build tools and also means I don’t always know how these things might compare with other options. The main other option I was considering was jsbundling-rails with esbuild and cssbundling-rails with SASS, but I didn’t get very far into the weeds of checking those out.

I moved almost all my JS and (S)CSS into being managed/built by vite.

My context

I work on a monolith “full stack” Rails application, with a small two-developer team.

I do not do any very fancy Javascript — this is not React or Vue or anything like that. It’s honestly pretty much “JQuery-style” (although increasingly I try to do it without jquery itself, using just native browser APIs, it’s still pretty much that style).

Nonetheless, I have accumulated non-trivial Javascript/NPM dependencies, including things like video.js, @shopify/draggable, fontawesome (v4), and openseadragon. I need package management and I need building.

I also need something dirt simple. I don’t really know what I’m doing with JS; my stack may seem really old-fashioned, but here it is. Webpacker had always been a pain. I started using it to have something to manage and build NPM packages, but was still mid-stream in trying to switch all my sprockets JS over to webpacker when it was announced that webpacker was no longer recommended/maintained by Rails. My CSS was still in sprockets all along.

Vite

One thing to know about vite is that it’s based on the idea of using different methods in dev vs production to build/serve your JS (and other managed assets). In “dev”, you ordinarily run a “vite server” which serves individual JS files, whereas for production you “build” more combined files.

Vite is basically an integration that puts together tools like esbuild and (in production) rollup, as well as integrating optional components like sass — making them all just work. It intends to be simple and provide a really good developer experience where doing simple best practice things is simple and needs little configuration.

vite-ruby tries to make that “just works” developer experience as good as Rubyists expect when used with ruby too — it intends to integrate with Rails as well as webpacker did, just doing the right thing for Rails.
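For instance, in a Rails layout the integration is basically just tag helpers; the entrypoint names here are illustrative, and the details are in the vite-ruby docs:

<%# app/views/layouts/application.html.erb %>
<head>
  <%# enables hot module reload when the vite dev server is running %>
  <%= vite_client_tag %>

  <%# entrypoints: served by the dev server (or built on demand) in development, built files in production %>
  <%= vite_javascript_tag "application" %>
  <%= vite_stylesheet_tag "application" %>
</head>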

Things I am enjoying with vite-ruby and Rails

  • You don’t need to run a dev server (like you do with jsbundling-rails and css-bundling rails)
    • If you don’t run the vite dev server, you’ll wind up with auto-built vite on-demand as needed, same as webpacker basically did.
    • This can be slow, but it works and is awesome for things like CI without having to configure or set up anything. If there have been no changes to your source, it is not slow, as it doesn’t need to re-build.
    • If you do want to run the dev server for much faster build times, hot module reload, better error messages, etc, vite-ruby makes it easy, just run ./bin/vite dev in a terminal.
  • If you DO run the dev server — you have only ONE dev-server to run, that will handle both JS and CSS
    • I’m honestly really trying to avoid the foreman approach taken by jsbundling-rails/cssbundling-rails, because of how it makes accessing the interactive debugger at a breakpoint much more complicated. Maybe with only one dev server (that is optional), I can handle running it manually without a procfile.
  • Handling SASS and other CSS with the same tool as JS is pretty great generally — you can even @import CSS from a javascript file, and also @import plain CSS to aggregate into a single file server-side (without sass). With no non-default configuration, it just works, and will spit out stylesheet <link> tags, and it means your css/sass is going through the same processing whether you import it from .js or .css.
    • I handle fontawesome 4 this way. Include "font-awesome": "^4.7.0" in my package.json, then @import "font-awesome/css/font-awesome.css"; just works, and from either a .js or a .css file. It actually spits out not only the fontawesome CSS file, but also all the font files referenced from it and included in the npm package, in a way that just works. Amazing!!
    • Note how you can reference things from NPM packages with just package name. On google for some tools you find people doing contortions involving specifically referencing node-modules, I’m not sure if you really have to do this with latest versions of other tools but you def don’t with vite, it just works.
  • In general, I really appreciate vite’s clear opinionated guidance and focus on developer experience. Understanding all the options from the docs is not as hard because there are fewer options, but it does everything I need it to. vite-ruby successfully carries this into ruby/Rails; its documentation is really good, without being enormous. In Rails, it just does what you want, automatically.
  • Vite supports source maps for SASS!
    • Not currently on by default, you have to add a simple config.
    • Unfortunately sass sourcemaps are NOT supported in production build mode, only in dev server mode. (I think I found a ticket for this, but can’t find it now)
    • But that’s still better than the official Rails options? I don’t understand how anyone develops SCSS without sourcemaps!
      • But even though sprockets 4.x finally supported JS sourcemaps, that support does not work for SCSS! Even though there is an 18-month-old PR to fix it, it goes unreviewed by Rails core and unmerged.
      • Possibly even more surprisingly, SASS sourcemaps don’t seem to work for the newer cssbundling-rails=>sass solution either. https://github.com/rails/cssbundling-rails/issues/68
      • Previous to this switch, I was still using sprockets old-style “comments injected into CSS built files with original source file/line number” — that worked. But to give that up, and not get working scss sourcemaps in return? I think that would have been a blocker for me against cssbundling-rails/sass anyway… I feel like there’s something I’m missing, because I don’t understand how anyone is developing sass that way.

  • If you want to split up your js into several built files (“chunks”), I love how easy it is. It just works. Vite/rollup will do it for you automatically for any dynamic runtime imports, which it also supports: just write import with parens, inside a callback or whatever, and it just works.

Things to be aware of

  • vite and vite-ruby by default will not create .gz variants of built JS and CSS
    • Depending on your deploy environment, this may not matter, maybe you have a CDN or nginx that will automatically create a gzip and cache it.
    • But in, eg, a default heroku Rails deploy, it really really does. A default Heroku deploy uses the Rails app itself to deliver your assets. The Rails app will deliver content-encoding gzip if it’s there. If it’s not… when you switch to vite from webpacker/sprockets, you may now be delivering uncompressed JS and CSS with no other changes to your environment, with non-trivial performance implications but ones you may not notice.
    • Yeah, you could probably configure the CDN you hopefully have in front of your heroku app’s static assets to gzip for you, but you may not have noticed the need.
    • Fortunately it’s pretty easy to configure
  • There are some vite NPM packages involved (vite itself as well as some vite-ruby plugins), as well as the vite-ruby gem, and you have to keep them up to date and in sync. You don’t want to be using a new version of the vite NPM packages with a too-old gem, or vice versa. (This is kind of a challenge in general with ruby gems that have accompanying npm packages.)
    • But vite_ruby actually includes a utility to check this on boot and complain if they’ve gotten out of sync! As well as tools for syncing them! Sweet!
    • But that can be a bit confusing sometimes if you’re running CI after an accidentally-out-of-sync upgrade, and all your tests are now failing with the failed sync check. But no big deal.

Things I like less

  • vite-ruby itself doesn’t seem to have a CHANGELOG or release notes, which I don’t love.
  • Vite is a newer tool written for modern JS; it mostly does not support CommonJS/node require, preferring modern import. In some cases that I can't totally explain, require in dependencies seems to work anyway… but something related to this made it apparently impossible for me to import an old, not-very-maintained dependency I had been importing fine in Webpacker. (I don't know how it would have fared with jsbundling-rails/esbuild.) So all is not roses.

Am I worried that this is a third-party integration not blessed by Rails?

The vite-ruby maintainer ElMassimo is doing an amazing job. It is currently very well-maintained software, with frequent releases, quick turnaround from bug report to release, and ElMassimo is very responsive in github discussions.

But it looks like it is just one person maintaining. We know how open source goes. Am I worried that in the future some release of Rails might break vite-ruby in some way, and there won’t be a maintainer to fix it?

I mean… a bit? But let's face it… Rails officially blessed solutions haven't seemed very well-maintained for years now either! The three-year gap of abandonware between the first sprockets 4.x beta and final release, followed by more radio silence? The fact that for a couple years before webpacker was officially retired it seemed to be getting no maintenance, including requiring dependency versions with CVEs that just stayed that way? Not much documentation (i.e., Rails Guides) support for webpacker ever, or for jsbundling-rails still?

One would think it might be a new leaf with css/jsbundling-rails… but I am still baffled by there being no support for sass sourcemaps with cssbundling-rails and sass! Official Rails blessing hasn't necessarily gotten you much "just works" DX when it comes to asset handling, for years now.

Let’s face it, this has been an area where being in the Rails github org and/or being blessed by Rails docs has been no particular reason to expect maintenance, or to expect you won’t have problems down the line anyway. It’s open source, nobody owes you anything, maintainers spend time on what they have interest to spend time on (including time to review/merge/maintain others’ PRs — which is definitely non-trivial time!) — it just is what it is.

While the vite-ruby code provides a pretty great Rails-integrated DX, it’s also actually mostly pretty simple code, especially at the Rails touch points most at risk of Rails breaking — it’s not doing anything too convoluted.

So, you know, you take your chances, I feel good about my chances compared to a css/jsbundling-rails solution. And if someday I have to switch things over again, oh well — Rails just pulled webpacker out from under us quicker than expected too, so you take your chances regardless!


(thanks to colleague Anna Headley for first suggesting we take a look at vite in Rails!)

Using engine_cart with Rails 6.1 and Ruby 3.1

Rails does not seem to generally advertise ruby version compatibility, but Rails 6.1, I believe, works with Ruby 3.1 — as long as you manually add three dependencies to your Gemfile.

gem "net-imap"
gem "net-pop"
gem "net-smtp"

(Here’s a somewhat cryptic gist from one (I think) Rails committer with some background. Although it doesn’t specifically and clearly tell you to add these dependencies for Rails 6.1 and ruby 3.1… it won’t work unless you do. You can find other discussion of this on the net.)

Or you can instead add one line to your Gemfile, opting in to using the pre-release mail gem 2.8.0.rc1, which includes these dependencies for ruby 3.1 compatibility. Mail is already a Rails dependency; but pre-release gems (whose version numbers end in something including letters after a third period) won’t be selected by bundler unless you mention a pre-release version explicitly in your Gemfile.

gem "mail", ">= 2.8.0.rc1"

Once mail 2.8.0 final is released, if I understand what’s going on right, you won’t need to do any of this: since it won’t be a pre-release version, bundler will just use it when bundle update-ing a Rails app, and it expresses the dependencies you need for ruby 3.1, so Rails 6.1 will Just Work with ruby 3.1. Phew! I hope it gets released soon (been about 7 weeks since 2.8.0.rc1).

Engine cart

Engine_cart is a gem for dynamically creating Rails apps at runtime for use in CI build systems, mainly to test Rails engine gems. It’s in use in some collaborative open source communities I participate in. While it has plusses (actual integration testing of real app generation) and minuses (kind of a maintenance nightmare, it turns out), I don’t generally recommend it. If you haven’t heard of it before and are wondering “does jrochkind think I should use this for testing engine gems in general?” — this is not an endorsement. In general it can add a lot of pain.

But it’s in use in some projects I sometimes help maintain.

How do you get a build using engine_cart to successfully test under Rails 6.1 and ruby 3.1? Since if it were “manual” you’d have to add a line to a Gemfile…

It turns out you can create a ./spec/test_app_templates/Gemfile.extra file, with the necessary extra gem calls:

gem "net-imap"
gem "net-pop"
gem "net-smtp"

# OR, above OR below, don't need both

gem "mail", ">= 2.8.0.rc1"
  • I think ./spec/test_app_templates/Gemfile.extra is a “magic path” used by engine_cart… or if the app I’m working on is setting it, I can’t figure out why/how! But I also can’t quite figure out why/if engine_cart is defaulting to it…
  • Adding this to your main project Gemfile is not sufficient, it needs to be in Gemfile.extra
  • Some projects I’ve seen have a line in their Gemfile using eval_gemfile and referencing the Gemfile.extra… which I don’t really understand… and does not seem to be necessary to me… I think maybe it’s leftover from past versions of engine_cart best practices?
  • To be honest, I don’t really understand how/where the Gemfile.extra is coming in, and I haven’t found any documentation for it in engine_cart. So if this doesn’t work for you… you probably just haven’t properly configured engine_cart to use the Gemfile.extra in that location, which the project I’m working on has done in some way?

Note that you may still get an error produced in build output at some point of generating the test app:

run  bundle binstubs bundler
rails  webpacker:install
You don't have net-smtp installed in your application. Please add it to your Gemfile and run bundle install
rails aborted!
LoadError: cannot load such file -- net/smtp

But it seems to continue and work anyway!

None of this should be necessary when mail 2.8.0 final is released, it should just work!

The above is of course always including those extra dependencies, for all builds in your matrix, when they are only necessary for Rails 6.1 (not 7!) and ruby 3.1. If you’d instead like to guard it to only apply for that build, and your app is using the RAILS_VERSION env variable convention, this seems to work:

# ./spec/test_app_templates/Gemfile.extra
#
# Only necessary until mail 2.8.0 is released, allow us to build with engine_cart
# under Rails 6.1 and ruby 3.1, by opting into using pre-release version of mail
# 2.8.0.rc1
#
# https://github.com/mikel/mail/pull/1472

if ENV['RAILS_VERSION'] && ENV['RAILS_VERSION'] =~ /^6\.1\./ && RUBY_VERSION =~ /^3\.1\./
  gem "mail", ">= 2.8.0.rc1"
end

Rails 7 connection.select_all is stricter about its arguments in a backwards-incompatible way: TypeError: Can’t Cast Array

I have code that wanted to execute some raw SQL against an ActiveRecord database. It is complicated and weird multi-table SQL (involving a postgres recursive CTE), so none of the specific-model-based API for specifying SQL seemed appropriate. It also needed to take some parameters, that needed to be properly escaped/sanitized.

At some point I decided that the right way to do this was with Model.connection.select_all, which would create a parameterized prepared statement.

Was I right? Is there a better way to do this? The method is briefly mentioned in the Rails Guide (demonstrating it is public API!), but without many details about the arguments. It has very limited API docs, just doc’d as: select_all(arel, name = nil, binds = [], preparable: nil, async: false), “Returns an ActiveRecord::Result instance.” No explanation of the type or semantics of the arguments.

In my code working on Rails previous to 7, the call looked like:

MyModel.connection.select_all(
  "select complicated_stuff WHERE something = $1",
  "my_complicated_stuff_name",
  [[nil, value_for_dollar_one_sub]],
  preparable: true
)
  • yeah that value for the binds is weird, a duple-array within an array, where the first value of the duple-array is just nil? This isn’t documented anywhere, I probably got that from somewhere… maybe one of the several StackOverflow answers.
  • I honestly don’t know what preparable: true does, or what difference it makes.

In Rails 7.0, this started failing with the error: TypeError: can’t cast Array.

I couldn’t find any documentation of that select_all method at all, or other discussion of this; I couldn’t find any select_all change mentioned in the Rails Changelog. I tried looking at the actual code history but got lost. I’m guessing “can’t cast Array” refers to that weird binds value… but what is it supposed to be?

Eventually I thought to look for Rails tests of this method that used the binds argument, and managed to find one!

So… okay, rewrote that with new binds argument like so:

bind = ActiveRecord::Relation::QueryAttribute.new(
  "something", 
  value_for_dollar_one_sub, 
  ActiveRecord::Type::Value.new
)

MyModel.connection.select_all(
  "select complicated_stuff WHERE something = $1",
  "my_complicated_stuff_name",
  [bind],
  preparable: true
)
  • Confirmed this worked not only in Rails 7, but all the way back to Rails 5.2 no problem.
  • I guess that way I was doing it previously was some legacy way of passing args that was finally removed in Rails 7?
  • I still don’t really understand what I’m doing. The first arg to ActiveRecord::Relation::QueryAttribute.new I made match the SQL column it was going to be compared against, but I don’t know if it matters or if it’s used for anything. The third argument appears to be an ActiveRecord Type… I just left it the generic ActiveRecord::Type::Value.new, which seemed to work fine for both integer and string values; I’m not sure in what cases you’d want to use a specific type value here, or what it would do (a hedged sketch of what that might look like follows this list).
  • In general, I wonder if there’s a better way for me to be doing what I’m doing here? It’s odd to me that nobody else findable on the internet has run into this… even though there are stackoverflow answers suggesting this approach… maybe I’m doing it wrong?
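
For illustration only, here’s a hedged sketch of what passing a specific type might look like. The explicit integer type and the reuse of the “something” name are my assumptions, not anything I found in the Rails docs, and I haven’t verified it behaves any differently than the generic Value for my query:

# Sketch only: same call as before, but with an explicit (assumed) integer type
# instead of the generic ActiveRecord::Type::Value. Names/values are placeholders.
bind = ActiveRecord::Relation::QueryAttribute.new(
  "something",
  value_for_dollar_one_sub,
  ActiveRecord::Type::Integer.new # presumably casts/serializes the value as an integer
)

MyModel.connection.select_all(
  "select complicated_stuff WHERE something = $1",
  "my_complicated_stuff_name",
  [bind],
  preparable: true
)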

But anyways, since this was pretty hard to debug, hard to find in docs or explanations on google, and I found no mention at all of this changing/breaking in Rails 7… I figured I’d write it up so someone else had the chance of hitting on this answer.

Exploring hosting video in a non-profit archival digital collections web app

We are a small independent academic institution that hosts “digital collections” of digitized historical materials on the web, using a custom in-house Rails app.

We’re getting into including some historical video content for the first time, and I didn’t have much experience with video; it required me to figure out a few things, which I’m going to document here. (Thanks to many folks from the Code4Lib and Samvera communities who answered my questions, including mbklein, cjcolvar, and david.schober.)

Some more things about us and our content:

  • Our content at least initially will be mostly digitized VHS (480p, fairly low-visual-quality content), although we could also eventually have some digitized 16mm film and other content that could be HD.
  • Our app is entirely cloud-hosted, mainly on heroku and AWS S3. We don’t manage any of our own servers at the OS level (say an EC2), and don’t want to except as a last resort (we don’t really have the in-house capacity to).

Our usage patterns are not necessarily typical for a commercial application! We have (or at least eventually will have) a lot of fairly low-res/low-bitrate stuff (old VHSs!), it’s unclear how much will get viewed how often (a library/archives probably stores a lot more content as a proportion of content viewed than a commercial operation), and our revenue doesn’t increase as our content accesses do! So, warning to the general reader, our lens on things may or may not match a commercial enterprise’s. But if you are one of our peers in cultural heritage, you’re thinking, “I know, right?”

All of these notes are as of February 2022, if you are reading far in the future, some things may have changed. Also, I may have some things wrong, this is just what I have figured out, corrections welcome.

Standard Video Format: MP4 (h.264) with html5 video

I originally had in my head that maybe you needed to provide multiple video formats to hit all browsers — but it seems you don’t really anymore.

An MP4 using H.264 codec and AAC audio can be displayed in pretty much any browser of note with an html5 video tag.

There are newer/better/alternate video formats. Some terms (I’m not totally sure which of these are containers and which are codecs, and which containers can go with which codecs!) include: WebM, Ogg, Theora, H.265, VP8, and VP9. Some of these are “better” in various ways (whether better licenses or better quality video at the same filesize) — but none are yet supported across all common browsers.

You could choose to provide alternate formats so browsers that can use one of the newer formats will — but I think the days of needing to provide multiple formats to satisfy 99%+ of possible consumers appear gone. (Definitely forget about flash).

The vast majority of our cultural heritage peers I could find are using MP4. (Except for the minority using an adaptive bitrate streaming format…).

Another thing to know about H.264 MP4 is that even at the same screen size/resolution, the bitrate, and therefore the filesize (per second of footage), can vary quite a bit between files. This is because of the (lossy) compression in H.264. Some original source material may just be more compressible than other material — not sure how much this comes into play. What definitely comes into play is that when encoding, you can choose to balance higher compression against higher quality (similar to JPG). Most encoding software lets you either set a maximum bitrate (sometimes using a variable-bit-rate (VBR) or constant-bit-rate (CBR) algorithm), OR choose a “quality level” on some scale, where the encoder will compress differently from second to second of footage, at whatever level gives you that level of “quality”.

An Aside: Tools: ffmpeg and ffprobe

ffmpeg (and its accompanying ffprobe analyzer) are amazing open source tools. They do almost anything. They are at the heart of most other video processing software.

ffprobe is invaluable for figuring out “what do I have here” if say given an mp4 file.

One thing I didn’t notice at first that is really neat is that both ffmpeg and ffprobe can take a URL as an input argument just about anywhere they can take a local file path. This can be very convenient if your video source material is, say, on S3. You don’t need to download it before feeding to ffmpeg or ffprobe, you can just give them a URL, leading to faster more efficient operations as ffmpeg/ffprobe take care of downloading only what they need as they need it. (ffprobe characterization can often happen from only the first portion of the file, for instance).
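
As a hedged illustration of how we might call it from Ruby (the URL is hypothetical, and this assumes ffprobe is installed and the file is reachable):

require "json"
require "open3"

# Hypothetical source: could just as easily be a local file path
url = "https://example-bucket.s3.amazonaws.com/originals/some_video.mp4"

# Ask ffprobe for machine-readable characterization of the container and streams
stdout, status = Open3.capture2(
  "ffprobe", "-v", "error",
  "-print_format", "json",
  "-show_format", "-show_streams",
  url
)

if status.success?
  info = JSON.parse(stdout)
  # e.g. info["format"]["duration"], info["format"]["bit_rate"],
  # info.dig("streams", 0, "codec_name")
end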

IF you need an adaptive bitrate streaming format: HLS will do

So then there are “streaming” formats, the two most popular being HLS and MPEG-DASH. While required if you are doing live-streaming, you can also use them for pre-recorded video (which is often called “video on demand” (VOD) in internet discussions/documentation, coming out of terminology used in commercial applications).

The main reason to do this is for the “adaptive bitrate” features. You can provide variants at various bitrates, and someone on a slow network (maybe a 3G cellphone) can still watch your video, just at lower resolution/visual quality. An adaptive bitrate streaming format can even change the resolution/quality in mid-view, if network conditions change. Perhaps you’ve noticed this watching commercial video over the internet.

Just like with static video, there seems to be one format which is supported in (nearly) all current browsers (not IE though): HLS. (Which sometimes/usually may actually be h.264, just reformatted for streaming use? Not sure). While the caniuse chart makes it look like only a few browsers support HLS, in fact the other browsers do via their support of the Media Source API (formerly called Media Source Extensions). Javascript code can be used to add HLS support to a browser via the Media Source API, and if you use a Javascript viewer like video.js or MediaElement.js, this just happens for you transparently. This is one reason people may be using such players instead of raw html5 <video> tags alone. (I plan to use video.js).

Analogously to MP4, there are other adaptive bitrate streaming formats too, that may be superior in various ways to HLS, but don’t have as wide support. Like MPEG-DASH. At the moment, you can reach pretty much any browser in use with HLS, and of community/industry peers I found using an adaptive bitrate streaming format, HLS was the main one in use, usually presented without alternative sources.

But do we even need it?

At first I thought that one point of HLS was allowing better/more efficient seeking: when the viewer wants to jump to the middle of the video, you don’t want to force the browser to download the whole thing. But in fact, after more research I believe current browsers can seek in a static mp4 just fine using HTTP byte-range requests (important that they are on a server that supports them!), so long as the MP4 was prepared with what ffmpeg calls faststart. (This is important! You can verify whether a video is properly prepared by running mediainfo on it and looking for the IsStreaming line in the output, or by jumping through some hoops with ffmpeg.)

It’s a bit hacky to do seeking with just http-hosted mp4s, but user-agents seem to have gotten pretty good at it. You may want to be sure any HTTP caching layer you are using can appropriately pass through and cache HTTP byte-range requests.
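
If an MP4 wasn’t prepared with faststart, re-preparing it is a cheap remux rather than a re-encode. A minimal sketch of shelling out to ffmpeg from Ruby, assuming ffmpeg is installed; file names are placeholders:

# Remux (no re-encode) so the moov atom is moved to the front ("faststart"),
# which is what lets browsers start playing and seeking before the whole
# file has downloaded. Paths below are placeholders.
input  = "original.mp4"
output = "original-faststart.mp4"

system(
  "ffmpeg", "-i", input,
  "-c", "copy",               # copy audio/video streams as-is, no re-encode
  "-movflags", "+faststart",
  output
) or raise "ffmpeg faststart remux failed"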

So, as far as I know, that leaves actual dynamic adaptive bitrate feature as the reason to use HLS. But you need that only if the MP4’s you’d be using otherwise are a higher bitrate than the lowest bitrate you’d be preparing for your inclusion in your HLS bundle. Like, if your original MP4 is only 500 kbps, and that’s the lowest bitrate you’d be including in your HLS package anyway… your original MP4 is already viewable on as slow a connection as your HLS preparation would be.

What is the lowest bitrate typically included as an option in HLS? I’ve found it hard to find advice on what variants to prepare for HLS distribution! In Avalon’s default configuration for creating HLS with ffmpeg, the lowest bitrate variant is 480p at 500 kbps. For more comparisons: if you turn on simulation of a slow connection in Chrome Dev Tools, Chrome says “Slow 3G” is about 400 kbps, and “Fast 3G” is about 1500 kbps. The lowest bitrate offered by AWS MediaConvert’s HLS presets is 400 kbps (for 270p or 360p) or 600 kbps at 480p.

(In some cases these bitrates may be video-only; audio can add another 100-200kbps, depending on what quality you encode it at.)

I think if your original MP4 is around 500 kbps, there’s really no reason at all to use HLS. As it approaches 1500 kbps (1.5 Mbps)… you could consider creating HLS with a 500kbps variant, but also probably get away with serving most users adequately at the original bitrate (sorry to the slower end of cell phone network). As you start approaching 3Mbps, I’d start considering HLS, and if you have HD 720p or 1080p content (let alone 4K!) which can get up to 6 Mbps bitrates and even much higher — I think you’d probably be leaving users on not-the-fastest connections very frustrated without HLS.

This is me doing a lot of guessing, cause I haven’t found a lot of clear guidance on this!

As our originals are digitizations of VHS (and old, sometimes degraded VHS at that), and started out pretty low-quality, our initial bitrates are fairly low. In one of our sample collections, the bitrates were around 1300 kbps — I’d say we probably don’t need HLS? Some of our film was digitized in SD at around 2300 kbps — meh? But we had a couple films digitized at 1440p and 10 Mbps — okay, probably want to either downsample the access MP4, or use HLS.

MOST of our cultural heritage peers do not yet seem to be using HLS. In a sampling of videos found on DPLA, almost all were being delivered as MP4 (and usually fairly low-quality videos at under 2 Mbps, so that’s fine!). However, most of our Samvera peers using Avalon are probably using HLS.

So how do you create and serve HLS?

Once you create it, it’s actually just static files on disk, you can serve with a plain old static HTTP web server. You don’t need any kind of fancy media server to serve HLS! (Another misconception I started out not being sure of, that maybe used to be different, in the days when “RTP” and such were your only adaptive streaming options). An HLS copy on disk is one (or usually several) .m3u8 manifest files, and a lot of .ts files with chunked data referenced by the manifests.

You can (of course) use ffmpeg to create HLS. A lot of people do that happily, but it doesn’t work well for our cloud deployment — creating HLS takes too long for us to want to do it in a Rails background job on a heroku worker dyno, and we don’t want to be in the business of running our own servers otherwise.
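
For reference, a minimal sketch of a single-variant HLS encode (assuming ffmpeg is available; the bitrates, segment length, and file names here are illustrative, not anything we actually settled on):

# Sketch: one 480p ~500 kbps HLS variant from an MP4 source.
# A real adaptive package would repeat this at several bitrates and
# write a master playlist referencing all the variant playlists.
input = "original.mp4"

system(
  "ffmpeg", "-i", input,
  "-vf", "scale=-2:480",            # scale to 480p, preserve aspect ratio
  "-c:v", "libx264", "-b:v", "500k",
  "-c:a", "aac", "-b:a", "128k",
  "-hls_time", "6",                 # ~6-second segments
  "-hls_playlist_type", "vod",
  "-hls_segment_filename", "hls_480p_%03d.ts",
  "hls_480p.m3u8"
) or raise "ffmpeg HLS encode failed"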

Another thing some of our peers use is the wowza media server. We didn’t really look at that seriously either, I think our peers using it are at universities with enormous campus-wide site licenses for running it on-premises (which we don’t want to do), there might be a cloud-hosted SaaS version… but I just mention this product for completeness in case you are interested, it looked too “enterprisey” for our needs, and we didn’t significantly investigate.

The solutions we found and looked at that fit into the needs we had for our cloud-hosted application were:

  • Use AWS Elemental MediaConvert to create the HLS variants, host them on S3 and serve from there (probably with CloudFront, paying AWS data egress fees for video watched). This is sort of like a cloud alternative to ffmpeg: you tell it exactly what HLS variants you want.
  • AWS Elemental MediaPackage ends up working more like a “video server” — you just give it your originals and get back an HLS URL; you leave the characteristics of the HLS variants up to the black box, and it creates them apparently on-the-fly as needed. You don’t pay for storage of any HLS variants, but you do pay a non-trivial fee for every minute of video processed (potentially multiple times, if it expires from cache and is watched again) on top of the AWS egress fees.
  • CloudFlare Stream is basically CloudFlare’s version of MediaPackage. They charge by second of footage instead of by byte (for both storage and the equivalent of egress bandwidth), and it’s not cheap… whether it’s cheaper or more expensive than MediaPackage can depend on the bitrate of your material and the usage patterns of viewing/storing. For our low-bitrate, unpredictable-usage patterns, it looks to me likely to end up more expensive? But not sure.
  • Just upload them all to youtube and/or vimeo, and serve from there? Kind of crazy but it just might work? They could still be playing on our web pages, but actually pointing at youtube or vimeo hosting the video…. While this has the attraction of being the only SaaS/PaaS solution I know of that won’t have metered bandwidth (you don’t pay per minute viewed)… there are some limitations too. I couldn’t really find any peers doing this with historical cultural heritage materials.

I’m going to talk about each of these in somewhat more detail below, especially with regard to costs.

First a word on cost estimating of video streaming

One of our biggest concerns with beginning to include video in digital collections is cost. Especially because, serving out of S3/Cloudfront, or most (all?) other cloud options, we pay data egress costs for every minute of video viewed. As a non-profit educational institution, the more the video gets viewed, the more we’re achieving our mission — but our revenue doesn’t go up with minutes viewed, and it can really add up.

So that’s one of the things we’re most interested in comparing between different HLS options.

But comparing it requires guessing at a bunch of metrics. Some that are easier to guess are how many hours of video we’ll be storing, and how many hours we’ll be ingesting per month. Some services charge in bytes, some in hours; converting from hours to bytes requires us to guess our average bitrate, which we can take a reasonable stab at. (Our digitized VHS is going to be fairly low quality, maybe 1500-2500 kbps.)
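
The hours-to-bytes conversion itself is just arithmetic; here’s the kind of back-of-envelope sketch I mean (the bitrate and hours below are illustrative guesses, not our actual figures):

# Back-of-envelope: GB per hour of footage at an assumed average bitrate.
def gb_per_hour(avg_kbps)
  avg_kbps * 1000.0 * 3600 / 8 / 1_000_000_000 # kilobits/sec -> GB per hour
end

gb_per_hour(2000)        # => 0.9 GB per hour of footage at ~2000 kbps
100 * gb_per_hour(2000)  # => ~90 GB to store 100 hours
200 * gb_per_hour(2000)  # => ~180 GB/month of egress if 200 hours are viewed per month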

Then there are the numbers that are a lot harder to guess — how many hours of video will be viewed a month? 100 hours? 1000? 10,000? I am not really sure what to expect, it’s the least under our control, it could grow almost unboundedly and cost us a lot of money. Similarly, If we offer full-file downloads, how many GB of video files will be downloaded a month?

Well, I made some guesses, and I made a spreadsheet that tries to estimate the costs of various platforms under various scenarios. (There are also probably assumptions I don’t even realize I’m making, not reflected in the spreadsheet, that will affect costs!) Our initial estimates are pretty small for a typical enterprise: maybe 100 hours of video hosted, maybe 200 hours viewed a month? Low-res digitized VHS material. (Our budget is also fairly sensitive to what would be very small amounts for a commercial enterprise!)

You can see/copy the spreadsheet here, and I’ll put a few words about each below.

Serve MP4 files from S3

The base case! Just serve plain MP4 files from S3 (probably with CloudFront in front). Sufficient if our MP4 bitrates are 500 kbps, maybe up to around 1.5 Mbps.

Our current app architecture actually keeps three copies of all data — production, a backup, and a sync’d staging environment. So that’s some S3 storage charges, initially estimated at just full-cost standard S3. There are essentially no “ingest” costs (some nominal cost to replicate production to our cross-region backup).

Then there are the standard AWS data egress costs — CloudFront is not actually that different from standard S3, until you get into trying to do bulk reserved purchases or something, but we’ll just estimate at the standard rate.

The storage costs will probably be included in any other solution too, since we’ll probably still keep our “canonical” cop(ies) on S3 regardless.

HLS via AWS Elemental MediaConvert

AWS Elemental MediaConvert is basically a transcoding service — think of it like ffmpeg transcoding but AWS-hosted SaaS. Your source needs to be on S3 (well, technically it can be a public URL elsewhere), you decide what variants you want to create, they are written to an S3 bucket.

Bandwidth costs are exactly the same as our base MP4-on-S3 case, since we’d still be serving from CloudFront — so this scales up with traffic exactly the same as our base case, which is nice. (Hypothetically it could be somewhat less bandwidth, depending on how many users receive lower-bitrate variants via HLS, but we just estimated that everyone would get the high-quality one, as an upper bound.)

We pay a bit more for storage (have to store the HLS derivatives, just standard S3 prices).

Then we pay an ingest cost to create the HLS, that is actually charged per minute (rather than per GB) — for SD, if we avoid “professional tier” features and stay no more than 30 fps, $0.0075 per minute of produced video (meaning the more HLS variants you create, the more you pay).

Since we will probably be digitizing a fairly small volume per month, and there is no premium over standard S3 for bandwidth (our least predictable cost), this may be a safe option?

Also in its favor: we know of at least one peer using AWS Elemental MediaConvert (Northwestern University Libraries).

AWS Elemental MediaPackage

MediaPackage is basically an AWS media server offering. You can use it for live streaming, but we’re looking at the “Video on Demand” use case and pricing only. You just give it a source video, and it creates an HLS URL for it. (Among other possible streaming formats; HLS is the one we care about). You don’t (and I think can’t) tell it what bitrates/variants to create in the HLS stream, it just does what it thinks best. On the fly/on-demand, I think?

The pricing model includes a fairly expensive $0.05/GB packaging fee — that is, fee to create the stream. (why not per-minute like MediaConvert? I don’t know). This is charged on-demand: Not until someone tries to watch a video. If multiple people are watching the same video at the same time, you’ll only pay the packaging fee once as it’ll be cached in CloudFront. But I don’t know if it’s clear exactly how long it will remain cached in CloudFront — and I don’t know how to predict my viewers usage patterns anyway, how much they’ll end up watching the same videos taking advantage of cache — how many views will result in packaging fees vs cached views.

So taking a worst-case estimate of zero cache utilization, MediaPackage basically adds a 50% premium to our bandwidth costs. These being our least predictable and unbounded costs, this is a bit risky — if we have a lot of viewed minutes, that don’t cache well, this could end up being much more expensive than MediaConvert approach.

But, you don’t pay any storage fees for the HLS derivatives at all. If you had a large-stored-volume and relatively small viewed-minutes, MediaPackage could easily end up cheaper than doing it yourself with MediaConvert (as you can see in our spreadsheet). Plus, there’s just less to manage or code, you really just give it an S3 source URL, and get back an HLS URL, the end.

CloudFlare Stream

CloudFlare Stream is basically CloudFlare’s alternative to MediaPackage. Similarly, it can be used for livestreaming, but we’re just looking at it for “video on demand”. Similarly, you basically just give it video and get back an HLS URL (or possibly other streaming formats), without specifying the details.

The big difference is that CloudFlare meters per minute instead of per GB: per minute for storage of a copy of the “originals” in the Stream system, which is required, so we’d end up storing an extra copy in CloudFlare, since we’re still going to want our canonical copy in AWS. (I don’t know if we could do everything we’d need/want with a copy only in CloudFlare Stream.) And CloudFlare charges per minute for bandwidth from Stream too. (There is no ingest/packaging fee, and no cost for storage of the derived HLS.)

Since they charge per minute, how competitive it is really depends on your average bitrate: the higher your average bitrate, the better a deal CloudFlare is compared to AWS! At an average bitrate of more than about 1500 kbps, the CloudFlare bandwidth cost starts beating AWS — at 10 Mbps HD, it’s going to really beat it. But we’re looking at relatively low-quality SD under 1500 kbps, so.

Whether CloudFlare Stream is more or less expensive than one of the AWS approaches is going to depend not only on bandwidth, but on your usage patterns (how much are you storing, how much are you ingesting a month) — from a bit more expensive to, in the right circumstances, a lot less expensive.

CloudFlare Stream has pretty much the convenience of AWS MediaPackage, except that we need to deal with getting a separate copy of originals into CloudFlare, with prepaid storage limits (you need to kind of sign up for what storage limit you want). Which is actually kind of inconvenient.

What if we used YouTube or Vimeo though?

What if we host our videos on YouTube or Vimeo, and deliver them from there? Basically use them as a cloud hosted video server? I haven’t actually found any peers doing this with historical/cultural heritage/archival materials. But the obvious attraction is that these services don’t meter bandwidth, we could get out of paying higher egress/bandwidth as viewing usage goes up — our least predictable and potentially otherwise largest budget component.

The idea is that this would be basically invisible to the end-user, they’d still be looking at our digital collections app and an embedded viewer; ideally the items would not be findable in youtube/vimeo platform search or on google to a youtube/vimeo page. It would also be mostly invisible to content management staff, they’d ingest into our Digital Collections system same as ever, and our software would add to vimeo and get referencing links via vimeo API.

We’d just be using youtube or vimeo as a video server platform, really not too different from how one uses AWS MediaPackage or Cloudflare Stream.

Youtube is completely free, but, well, it’s youtube. It’s probably missing features we might want (I don’t think you can get a direct HLS link), has unclear/undocumented limits or risk of having your account terminated, you get what you pay for as far as support, etc.

Vimeo is more appealing. Features included (some possibly only in the “Pro” account and above) seem (I am still at the beginning of my investigation) to include:

  • HLS URLs we could use with whatever viewers we wanted, same viewers we’d be using with any other HLS URL.
    • Also note direct video download links, if we want, so we can avoid bandwidth/egress charges on downloads too!
  • Support for high-res videos, no problem, all the way up to 4K/HDR. (Although we don’t need that for this phase of VHS digitization)
  • “team” accounts where multiple staff accounts can have access to Vimeo management of our content. (Only 3 accounts on $20/month “pro”, 10 on “Business” and higher)
  • Unlisted/private settings that should keep our videos off of any vimeo searches or google. Even a “Hide from Vimeo” setting where the video cannot be viewed on a vimeo page at all, but only as embedded (say via HLS)!
    • One issue, though, is that the HLS and video download links we do have probably won’t be signed/expiring — once someone has it, they have it and can share it (until/unless you delete the content from vimeo). This is probably fine for our public content use cases, but worth keeping in mind.
  • An API that looks reasonable and full-featured.

Vimeo storage/ingest limits

The way Vimeo does price tiers/metering is a bit odd, especially at the $20/month “Pro” level. It’s listed as being limited to ingesting 20GB/week, and 1TB/year. But I guess it can hold as much content as you want, as long as you ingest it at no more than 20GB/week? Do I understand that right? For our relatively low-res SD content, let’s say at a 3 Mbps bitrate, 20GB/week is about 15 hours/week — at our current planned capacity, we wouldn’t be ingesting more than that as a rule, although it’s a bit annoying that if we did, as an unusual spike, our software would have to handle the weekly rate limit.

At higher plans, the Vimeo limits are total storage rather than weekly ingest. The “Business” plan at $50/month has 5TB total storage. At a 3 Mbps bitrate, that’s around 3700 hours of content. At our current optimistic planned ingest capacity, it would take us over 10 years to fill that up. If it were HD content at 10 Mbps, 5TB is around 1100 hours of content, which we might reach in 4 or 5 years at our current planned ingest rate.
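
The arithmetic behind those storage estimates, as a quick sanity check (bitrates are our rough guesses, not actual measurements):

# Hours of content that fit in a given storage allotment at an average bitrate.
def hours_of_content(storage_tb, avg_kbps)
  gb_per_hour = avg_kbps * 1000.0 * 3600 / 8 / 1_000_000_000
  storage_tb * 1000 / gb_per_hour
end

hours_of_content(5, 3000)   # => ~3700 hours at ~3 Mbps SD
hours_of_content(5, 10_000) # => ~1100 hours at ~10 Mbps HD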

The “Premium” plan lets you bump that up to 7TB for $75/month.

It’s certainly conceivable we could reach those limits — and even sooner if we increase our digitization/ingest capacity beyond what we’re starting with. I imagine at that point we’d have to talk to them about a custom “enterprise” plan, and hope they can work out something reasonable for a non-profit academic institution that just needs expanded storage limits.

I imagine we’d write our software so it could serve straight MP4 if the file wasn’t (yet?) in Vimeo, but would just use vimeo HLS (and download link?) URLs if it was.

It’s possible there will be additional unforeseen limitations or barriers once we get into an investigatory implementation, but this seems worth investigating.

So?

Our initial implementation may just go with static MP4 files, for our relatively low-bitrate SD content.

When we are ready to explore streaming (which could be soon after MVP), I think we’d probably explore Vimeo first? If not Vimeo… AWS MediaConvert is more “charted territory”, as we have cooperating peers who have used it… but the possibility of MediaPackage or CloudFlare Stream being cheaper under some usage patterns is interesting. (And they are possibly somewhat simpler to implement.) However, their risk of being much more expensive under other usage patterns may be too risky. Predictability of budget is really high-value in the non-profit world, which is a large part of our budgeting challenge here: the unpredictability of costs when increased usage means increased costs due to metered bandwidth/egress.

Blacklight: Automatic retries of failed Solr requests

Sometimes my Blacklight app makes a request to Solr and it fails in a temporary/intermittent way.

  • Maybe there was a temporary network interruption, resulting in a failed connection or timeout
  • Maybe Solr was overloaded and being slow, and timed out
    • (Warning: Blacklight by default sets no timeouts, and is willing to wait forever for Solr, which you probably don’t want. How to set a timeout is under-documented, but set a read_timeout: key in your blacklight.yml to a number of seconds; or if you have RSolr 2.4.0+, set key timeout. Both will do the same thing, pass the value timeout to an underlying faraday client).
  • Maybe someone restarted the Solr being used live, which is not a good idea if you’re going for zero downtime, but maybe you aren’t that ambitious, or if you’re me maybe your SaaS solr provider did it without telling you, to resolve the Log4Shell bug.
    • And btw, if this happens, it can appear as a series of connection refused, 503 responses, and 404 responses, for maybe a second or three.
  • (By the way, also note well: your blacklight app may be encountering these without you knowing, even if you think you are monitoring errors. Blacklight by default will take pretty much all Solr errors, including timeouts, and rescue them, responding with an HTTP 200 status page with a message “Sorry, I don’t understand your search.” And HoneyBadger or other error monitoring you may be using will probably never know. Which I think is broken and would like to fix, but have been having trouble getting consensus and PR reviews to do so. You can fix it with some code locally, but that’s a separate topic, ANYWAY…)

So I said to myself, self, is there any way we could get Blacklight to automatically retry these sorts of temporary/intermittent failures, maybe once or twice, maybe after a delay? So there would be fewer errors presented to users (and fewer errors alerting me, after I fix Blacklight to alert on ’em), in exchange for some users in those temporary error conditions waiting a bit longer for a page?

Blacklight talks to Solr via RSolr — it can use 1.x or 2.x — and RSolr, if you’re using 2.x, uses faraday for its solr http connections. So one nice way might be to configure the Blacklight/RSolr faraday connection with the faraday retry middleware. (1.x rubydoc). (Moved into its own gem in the recently released faraday 2.0.)

Can you configure custom faraday middleware for the Blacklight faraday client? Yeesss…. but it requires making and configuring a custom Blacklight::Solr::Repository class, most conveniently by sub-classing the Blacklight class and overriding a private method. :( But it seems to work out quite well after you jump through some slightly kludgey hoops! Details below.

Questions for the Blacklight/Rsolr community:

  • Is this actually safe/forwards-compatible/supported, to be sub-classing Blacklight::Solr::Repository and over-riding build_connection with a call to super? Is this a bad idea?
  • Should Blacklight have its own supported and more targeted API for supplying custom faraday middleware generally (there are lots of ways this might be useful), or setting automatic retries specifically? I’d PR it, if there was some agreement about what it should look like and some chance of it getting reviewed/merged.
  • Is there anyone, anyone at all, who is interested in giving me emotional/political/sounding-board/code-review support for improving Blacklight’s error handling so it doesn’t swallow all connection/timeout/permanent configuration errors by returning an http 200 and telling the user “Sorry, I don’t understand your search”?

Oops, this may break in Faraday 2?

I haven’t actually tested this on the just-released Faraday 2.0, that was released right after I finished working on this. :( If faraday changes something that makes this approach infeasible, that might be added motivation to make Blacklight just have an API for customizing faraday middleware without having to hack into it like this.

The code for automatic retries in Blacklight 7

(and probably many other versions, but tested in Blacklight 7).

Here’s my whole local pull request if you find that more convenient, but I’ll also walk you through it a bit below and paste in frozen code.

There were some tricks to figuring out how to access and change the middleware on the existing faraday client returned by the super call; and how to remove the already-configured Blacklight middleware that would otherwise interfere with what we wanted to do (including an existing use of the retry middleware that I think is configured in a way that isn’t very useful or as intended). But overall it works out pretty well.

I’m having it retry timeouts, connection failures, 404 responses, and any 5xx response. Nothing else. (For instance it won’t retry on a 400 which generally indicates an actual request error of some kind that isn’t going to have any different result on retry).

I’m at least for now having it retry twice, waiting a fairly generous 300ms before the first retry, then another 600ms before a second retry if needed. Hey, my app can be slow, so it goes.

Extensively annotated:

# ./lib/scihist/blacklight_solr_repository.rb
module Scihist
  # Custom sub-class of stock blacklight, to override build_connection
  # to provide custom faraday middleware for HTTP retries
  #
  # This may not be a totally safe forwards-compat Blacklight API
  # thing to do, but the only/best way we could find to add-in
  # Solr retries.
  class BlacklightSolrRepository < Blacklight::Solr::Repository
    # this is really only here for use in testing, skip the wait in tests
    class_attribute :zero_interval_retry, default: false

    # call super, but then mutate the faraday_connection on
    # the returned RSolr 2.x+ client, to customize the middleware
    # and add retry.
    def build_connection(*_args, **_kwargs)
      super.tap do |rsolr_client|
        faraday_connection = rsolr_client.connection

        # remove if already present, so we can add our own
        faraday_connection.builder.delete(Faraday::Request::Retry)

        # remove so we can make sure it's there AND added AFTER our
        # retry, so our retry can successfully catch its exceptions
        faraday_connection.builder.delete(Faraday::Response::RaiseError)

        # add retry middleware with our own configuration
        # https://github.com/lostisland/faraday/blob/main/docs/middleware/request/retry.md
        #
        # Retry at most twice, once after 300ms, then if needed after
        # another 600ms (backoff_factor set to result in that).
        # Slow, but the idea is slow is better than an error, and our
        # app is already kinda slow.
        #
        # Retry not only the default Faraday exception classes (including timeouts),
        # but also Solr returning a 404 or 5xx. Which gets converted to a
        # Faraday error because RSolr includes raise_error middleware already.
        #
        # Log retries. I wonder if there's a way to have us alerted if
        # there are more than X in some time window Y…
        faraday_connection.request :retry, {
          interval: (zero_interval_retry ? 0 : 0.300),
          # exponential backoff 2 means: 1) 0.300; 2) 0.600; 3) 1.2; 4) 2.4
          backoff_factor: 2,
          # But we only allow the first two before giving up.
          max: 2,
          exceptions: [
            # default faraday retry exceptions
            Errno::ETIMEDOUT,
            Timeout::Error,
            Faraday::TimeoutError,
            Faraday::RetriableResponse, # important to include when overriding!
            # we add some that could be Solr/jetty restarts, based
            # on our observations:
            Faraday::ConnectionFailed, # nothing listening there at all
            Faraday::ResourceNotFound, # HTTP 404
            Faraday::ServerError # any HTTP 5xx
          ],
          retry_block: ->(env, options, retries_remaining, exc) do
            Rails.logger.warn("Retrying Solr request: HTTP #{env["status"]}: #{exc.class}: retry #{options.max - retries_remaining}")
            # other things we could log include `env.url` and `env.response.body`
          end
        }

        # important to add this AFTER retry, to make sure retry can
        # rescue and retry its errors
        faraday_connection.response :raise_error
      end
    end
  end
end

Then in my local CatalogController config block, nothing more than:

config.repository_class = Scihist::BlacklightSolrRepository

I had some challenges figuring out how to test this. I ended up testing against a live running Solr instance, which my app’s test suite does sometimes (via solr_wrapper, for better or worse).

One test is just a simple smoke test that this thing still seems to function properly as a Blacklight::Solr::Repository without raising. And then a couple that stub sample error responses, to check the retry behavior:

require "rails_helper"
describe Scihist::BlacklightSolrRepository do
# a way to get a configured repository class…
let(:repository) do
Scihist::BlacklightSolrRepository.new(CatalogController.blacklight_config).tap do |repo|
# if we are testing retries, don't actually wait between them
repo.zero_interval_retry = true
end
end
# A simple smoke test against live solr hoping to be a basic test that the
# thing works like a Blacklight::Solr::Repository, our customization attempt
# hopefully didn't break it.
describe "ordinary behavior smoke test", solr: true do
before do
create(:public_work).update_index
end
it "can return results" do
response = repository.search
expect(response).to be_kind_of(Blacklight::Solr::Response)
expect(response.documents).to be_present
end
end
# We're actually going to use webmock to try to mock some error conditions
# to actually test retry behavior, not going to use live solr.
describe "retry behavior", solr:true do
let(:solr_select_url_regex) { /^#{Regexp.escape(ScihistDigicoll::Env.lookup!(:solr_url) + "/select")}/ }
describe "with solr 400 response" do
before do
stub_request(:any, solr_select_url_regex).to_return(status: 400, body: "error")
end
it "does not retry" do
expect {
response = repository.search
}.to raise_error(Blacklight::Exceptions::InvalidRequest)
expect(WebMock).to have_requested(:any, solr_select_url_regex).once
end
end
describe "with solr 404 response" do
before do
stub_request(:any, solr_select_url_regex).to_return(status: 404, body: "error")
end
it "retries twice" do
expect {
response = repository.search
}.to raise_error(Blacklight::Exceptions::InvalidRequest)
expect(WebMock).to have_requested(:any, solr_select_url_regex).times(3)
end
end
end
end