Simon Willison’s Weblog

Recent entries

Weeknotes: the Datasette Cloud API, a podcast appearance and more two days ago

Datasette Cloud now has a documented API, plus a podcast appearance, some LLM plugins work and some geospatial excitement.

The Datasette Cloud API

My biggest achievement this week is that I documented and announced the API for Datasette Cloud.

I wrote about this at length in Getting started with the Datasette Cloud API on the Datasette Cloud blog. I also used this as an opportunity to start a documentation site for the service, now available at datasette.cloud/docs.

The API is effectively the Datasette 1.0 alpha write API, described here previously. You can use the API to both read and write data in a Datasette Cloud space, with fine-grained permissions (powered by the datasette-auth-tokens plugin) so you can create tokens that are restricted to specific actions against specified tables.

The blog entry about it doubles as a tutorial, describing how I wrote code to import the latest documents from the US Government Federal Register into a Datasette Cloud space, using a dependency-free Python script and GitHub Actions.

You can see that code in the new federal-register-to-datasette GitHub repository. It’s pretty small—just 70 lines of Python and 22 of YAML.
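To give a sense of the shape of that API, here’s a minimal sketch of the kind of write call the script makes, using only the Python standard library. The space URL, database and table names, the example row and the environment variable are all placeholders for this illustration; see the Datasette Cloud documentation for the exact endpoints and payload options:

import json
import os
import urllib.request

# Placeholder values: substitute your own space, database, table and token
url = "https://your-space.datasette.cloud/data/documents/-/insert"
token = os.environ["DATASETTE_CLOUD_TOKEN"]

rows = [{"document_number": "2023-12345", "title": "An example Federal Register document"}]

request = urllib.request.Request(
    url,
    # "replace": True asks the API to overwrite rows with matching primary keys
    data=json.dumps({"rows": rows, "replace": True}).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    },
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))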

The more time I spend writing code against the Datasette API the more confident I get that it’s shaped in the right way. I’m happy to consider it stable for the 1.0 release now.

Talking Large Language Models with Rooftop Ruby

I recorded a podcast episode this week for Rooftop Ruby with Collin Donnell and Joel Drapper. It was a really high quality conversation—we went for about an hour and 20 minutes and covered a huge amount of ground.

After the podcast came out I took the MP3, ran it through MacWhisper and then spent several hours marking up speakers and editing the resulting text. I also added headings corresponding to the different topics we covered, along with inline links to other relevant material.

I’m really pleased with the resulting document, which you can find at Talking Large Language Models with Rooftop Ruby. It was quite a bit of work but I think it was worthwhile—I’ve since been able to answer some questions about LLMs on Mastodon and Twitter by linking directly to the point in the transcript where we discussed them.

I also dropped in my own audio player, developed with GPT-4 assistance, and provided links from the different transcript sections that jump the audio to that point in the conversation.

Also this week: while closing a bunch of VS Code tabs I stumbled across a partially written blog entry about Things I’ve learned about building CLI tools in Python, so I finished that up and published it.

I’m trying to leave fewer unfinished projects lying around on my computer, so if something is 90% finished I’ll try to wrap it up and put it out there to get it off my ever-expanding plate.

llm-llama-cpp

LLM has started to collect a small but healthy community on Discord, which is really exciting.

My absolute favourite community project so far is Drew Breunig’s Faucet Finder, which he described in Finding Bathroom Faucets with Embeddings. He used llm-clip to calculate embeddings for 20,000 pictures of faucets, then ran both similarity and text search against them to help renovate his bathroom. It’s really fun!

I shipped a new version of the llm-llama-cpp plugin this week which was mostly written by other people: llm-llama-cpp 0.2b1. Alexis Métaireau and LoopControl submitted fixes to extend the default max token limit (fixing a frustrating issue with truncated responses) and to allow for increasing the number of GPU layers used to run the models.

I also shipped LLM 0.11, the main feature of which was support for the new OpenAI gpt-3.5-turbo-instruct model. I really need to split the OpenAI support out into a separate plugin so I can ship fixes to that without having to release the core LLM package.

And I put together an llm-plugin cookiecutter template, which I plan to use for all of my plugins going forward.

Getting excited about TG and sqlite-tg

TG is a brand new C library from Tile38 creator Josh Baker. It’s really exciting: it provides a set of fast geospatial operations—the exact subset I usually find myself needing, based around polygon intersections, GeoJSON, WKT, WKB and geospatial indexes—implemented with zero external dependencies. It’s shipped as a single C file, reminiscent of the SQLite amalgamation.

I noted in a few places that it could make a great SQLite extension... and Alex Garcia fell victim to my blatant nerd-sniping and built the first version of sqlite-tg within 24 hours!

I wrote about my own explorations of Alex’s work in Geospatial SQL queries in SQLite using TG, sqlite-tg and datasette-sqlite-tg. I’m thrilled at the idea of having a tiny, lightweight alternative to SpatiaLite as an addition to the Datasette ecosystem, and the SQLite world in general.

Two tiny Datasette releases

I released dot-releases for Datasette:

Both of these feature the same fix, described in Issue 2189: Server hang on parallel execution of queries to named in-memory databases.

Short version: it turns out the experimental work I did a while ago to try running SQL queries in parallel was causing threading deadlock issues against in-memory named SQLite databases. No-one had noticed because those are only available within Datasette plugins, but I started running into them once I began writing my own plugins that used that feature.

ChatGPT in the newsroom

I signed up for a MOOC (Massive Open Online Course) about journalism and ChatGPT!

How to use ChatGPT and other generative AI tools in your newsrooms is being taught by Aimee Rinehart and Sil Hamilton for the Knight Center.

I actually found out about it because people were being snarky about it on Twitter. That’s not a big surprise—there are many obvious problems with applying generative AI to journalism.

As you would hope, this course is not a hype-filled pitch for writing AI-generated news stories. It’s a conversation between literally thousands of journalists around the world about the ethical and practical implications of this technology.

I’m really enjoying it. I’m learning a huge amount about how people experience AI tools, the kinds of questions they have about them and the kinds of journalism problems that make sense for them to solve.

Releases this week

TIL this week

Things I’ve learned about building CLI tools in Python three days ago

I build a lot of command-line tools in Python. It’s become my favorite way of quickly turning a piece of code into something I can use myself and package up for other people to use too.

My biggest CLI projects are sqlite-utils, LLM, shot-scraper and Datasette—but I have dozens of others and I build new ones at the rate of at least one a month. A fun recent example is blip-caption, a tiny CLI wrapper around the Salesforce BLIP model that can generate usable captions for image files.

Here are some notes on what I’ve learned about designing and implementing CLI tools in Python so far.

Starting with a template

I build enough CLI apps that I developed my own Cookiecutter template for starting new ones.

That template is simonw/click-app. You can create a new application from that template directly on GitHub, too—I wrote more about that in Dynamic content for GitHub repository templates using cookiecutter and GitHub Actions.

Arguments, options and conventions

Almost all of my tools are built using the Click Python library. Click encourages a specific way of designing CLI tools which I really like—I find myself annoyed at the various tools from other ecosystems that don’t stick to the conventions that Click encourages.

I’ll try to summarize those conventions here.

  • Commands have arguments and options. Arguments are positional—they are strings that you pass directly to the command, like data.db in datasette data.db. Arguments can be required or optional, and you can have commands which accept an unlimited number of arguments.
  • Options are, usually, optional. They are things like --port 8000. Options can also have a single character shortened version, such as -p 8000.
    • Very occasionally I’ll create an option that is required, usually because a command has so many positional arguments that forcing an option makes its usage easier to read.
  • Some options are flags—they don’t take any additional parameters, they just switch something on. shot-scraper --retina is an example of this.
  • Flags with single character shortcuts can be easily combined—symbex -in fetch_data is short for symbex --imports --no-file fetch_data for example.
  • Some options take multiple parameters. datasette --setting sql_time_limit_ms 10000 is an example, taking both the name of the setting and the value it should be set to.
  • Commands can have sub-commands, which can form their own families of further commands. llm templates is an example of this, with llm templates list and llm templates show and several more.
  • Every command should have help text—the more detailed the better. This can be viewed by running llm --help—or for sub-commands, llm templates --help.

Click makes it absurdly easy and productive to build CLI tools that follow these conventions.
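
To make those conventions concrete, here’s a small sketch of a Click app (the tool, command and option names are invented for this illustration):

import click


@click.group()
def cli():
    "An example tool demonstrating the conventions described above"


@cli.command()
@click.argument("database")  # a required positional argument, like data.db
@click.option("-p", "--port", default=8000, help="Port to listen on")
@click.option("--retina", is_flag=True, help="A flag: takes no value, just switches something on")
@click.option(
    "--setting",
    nargs=2,
    multiple=True,
    help="An option that takes two values, and can be passed more than once",
)
def serve(database, port, retina, setting):
    "Serve DATABASE over HTTP"
    click.echo(f"Serving {database} on port {port}")


if __name__ == "__main__":
    cli()

Click generates the --help output for both the cli group and the serve sub-command automatically from those decorators and docstrings.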

Consistency is everything

As CLI utilities get larger, they can end up with a growing number of commands and options.

The most important thing in designing these is consistency with other existing commands and options (example here)—and with related tools that your user may have used before.

I often turn to GPT-4 for help with this: I’ll ask it for examples of existing CLI tools that do something similar to what I’m about to build, and see if there’s anything in their option design that I can emulate.

Since my various projects are designed to complement each other I try to stay consistent between them as well—I’ll often post an issue comment that says “similar to functionality in X”, with a copy of the --help output for the tool I’m about to imitate.

CLI interfaces are an API—version appropriately

I try to stick to semantic versioning for my projects, bumping the major version number on breaking changes and the minor version number for new features.

The command-line interface to a tool is absolutely part of that documented API. If someone writes a Bash script or a GitHub Actions automation that uses one of my tools, I’m careful to avoid breaking it without bumping my major version number.

Include usage examples in --help

A habit I’ve formed more recently is always trying to include a working example of the command in the --help for that command.

I find I use this a lot for tools I’ve developed myself. All of my tools have extensive online documentation, but I like to be able to consult --help without opening a browser for most of their functionality.

Here’s one of my more involved examples—the help for the sqlite-utils convert command:

Usage: sqlite-utils convert [OPTIONS] DB_PATH TABLE COLUMNS... CODE

  Convert columns using Python code you supply. For example:

      sqlite-utils convert my.db mytable mycolumn \
          '"\n".join(textwrap.wrap(value, 10))' \
          --import=textwrap

  "value" is a variable with the column value to be converted.

  Use "-" for CODE to read Python code from standard input.

  The following common operations are available as recipe functions:

  r.jsonsplit(value, delimiter=',', type=<class 'str'>)

      Convert a string like a,b,c into a JSON array ["a", "b", "c"]

  r.parsedate(value, dayfirst=False, yearfirst=False, errors=None)

      Parse a date and convert it to ISO date format: yyyy-mm-dd
      
      - dayfirst=True: treat xx as the day in xx/yy/zz
      - yearfirst=True: treat xx as the year in xx/yy/zz
      - errors=r.IGNORE to ignore values that cannot be parsed
      - errors=r.SET_NULL to set values that cannot be parsed to null

  r.parsedatetime(value, dayfirst=False, yearfirst=False, errors=None)

      Parse a datetime and convert it to ISO datetime format: yyyy-mm-ddTHH:MM:SS
      
      - dayfirst=True: treat xx as the day in xx/yy/zz
      - yearfirst=True: treat xx as the year in xx/yy/zz
      - errors=r.IGNORE to ignore values that cannot be parsed
      - errors=r.SET_NULL to set values that cannot be parsed to null

  You can use these recipes like so:

      sqlite-utils convert my.db mytable mycolumn \
          'r.jsonsplit(value, delimiter=":")'

Options:
  --import TEXT                   Python modules to import
  --dry-run                       Show results of running this against first
                                  10 rows
  --multi                         Populate columns for keys in returned
                                  dictionary
  --where TEXT                    Optional where clause
  -p, --param <TEXT TEXT>...      Named :parameters for where clause
  --output TEXT                   Optional separate column to populate with
                                  the output
  --output-type [integer|float|blob|text]
                                  Column type to use for the output column
  --drop                          Drop original column afterwards
  --no-skip-false                 Don't skip falsey values
  -s, --silent                    Don't show a progress bar
  --pdb                           Open pdb debugger on first error
  -h, --help                      Show this message and exit.
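
In Click that help text lives in the command’s docstring. One wrinkle: Click re-wraps paragraphs by default, so indented example blocks need to be preceded by a line containing \b to be left intact. A stripped-down sketch of how help like the above gets written (this is an illustration, not the real sqlite-utils source):

import click


@click.command()
@click.argument("db_path")
@click.argument("table")
def convert(db_path, table):
    """
    Convert columns using Python code you supply. For example:

    \b
        example-tool convert my.db mytable mycolumn \\
            '"\\n".join(textwrap.wrap(value, 10))' \\
            --import=textwrap

    "value" is a variable with the column value to be converted.
    """
    click.echo(f"Converting {table} in {db_path}")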

Including --help in the online documentation

My larger tools tend to have extensive documentation independently of their help output. I update this documentation at the same time as the implementation and the tests, as described in The Perfect Commit.

I like to include the --help output in my documentation sites as well. This is mainly for my own purposes—having the help visible on a web page makes it much easier to review it and spot anything that needs updating.

Here are some example pages from my documentation that list --help output:

All of these pages are maintained automatically using Cog. I described the pattern I use for this in Using cog to update --help in a Markdown README file, or you can view source on the Datasette CLI reference for a more involved example.
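
The pattern itself is compact: a Markdown file gets a pair of Cog comment markers, and the Python between them uses Click’s test runner to capture the help text. Roughly like this, with my_tool.cli standing in for whichever CLI is being documented:

<!-- [[[cog
import cog
from click.testing import CliRunner
from my_tool.cli import cli  # stand-in for the real CLI entry point

result = CliRunner().invoke(cli, ["--help"])
cog.out("\n```\n{}\n```\n".format(result.output.strip()))
]]] -->
<!-- [[[end]]] -->

Running cog -r against the file then rewrites everything between the markers with the current --help output.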

Talking Large Language Models with Rooftop Ruby three days ago

I’m on the latest episode of the Rooftop Ruby podcast with Collin Donnell and Joel Drapper, talking all things LLM.

Here’s a full transcript of the episode, which I generated using Whisper and then tidied up manually (after failing to get a good editing job out of Claude and GPT-4). I’ve also provided a link from each section heading to jump to the relevant spot in the recording.

The topics we covered:

You can listen to it on Apple Podcasts, Spotify, Google Podcasts, Podcast Index, Overcast and a bunch of other places.

Weeknotes: Embeddings, more embeddings and Datasette Cloud 15 days ago

Since my last weeknotes, a flurry of activity. LLM has embeddings support now, and Datasette Cloud has driven some major improvements to the wider Datasette ecosystem.

Embeddings in LLM

LLM gained embedding support in version 0.9, and then got binary embedding support (for CLIP) in version 0.10. I wrote about those releases in detail in:

Embeddings are a fascinating tool. If you haven’t got your head around them yet the first of my blog entries tries to explain why they are so interesting.

There’s a lot more I want to build on top of embeddings—most notably, LLM (or Datasette, or likely a combination of the two) will be growing support for Retrieval Augmented Generation on top of the LLM embedding mechanism.

Annotated releases

I always include a list of new releases in my weeknotes. This time I’m going to use those to illustrate the themes I’ve been working on.

The first group of releases relates to LLM and its embedding support. LLM 0.10 extended that support:

  • llm 0.10—2023-09-12
    Access large language models from the command-line

Embedding models can now be built as LLM plugins. I’ve released two of those so far:

The CLIP one is particularly fun, because it genuinely allows you to build a sophisticated image search engine that runs entirely on your own computer!

  • symbex 1.4—2023-09-05
    Find the Python code for specified symbols

Symbex is my tool for extracting symbols—functions, methods and classes—from Python code. I introduced that in Symbex: search Python code for functions and classes, then pipe them into a LLM.

Symbex 1.4 adds a tiny but impactful feature: it can now output a list of symbols as JSON, CSV or TSV. These output formats are designed to be compatible with the new llm embed-multi command, which means you can easily create embeddings for all of your functions:

symbex '*' '*:*' --nl | \
  llm embed-multi symbols - \
  --format nl --database embeddings.db --store

I haven’t fully explored what this enables yet, but it should mean that both related functions and semantic function search (“Find me a function that downloads a CSV”) are now easy to build.

Yet another thing you can do with embeddings is use them to find clusters of related items.

The neatest feature of my new llm-cluster plugin is that you can ask it to generate names for these clusters by sending the names of the items in each cluster through another language model, something like this:

llm cluster issues 10 \
  -d issues.db \
  --summary \
  --prompt 'Short, concise title for this cluster of related documents'

One last embedding related project: datasette-llm-embed is a tiny plugin that adds a select llm_embed('sentence-transformers/all-mpnet-base-v2', 'This is some text') SQL function. I built it to support quickly prototyping embedding-related ideas in Datasette.

Spending time with embedding models has led me to spend more time with Hugging Face. I realized last week that the Hugging Face all models sorted by downloads page doubles as a list of the models that are most likely to be easy to use.

One of the models I tried out was Salesforce BLIP, an astonishing model that can genuinely produce usable captions for images.

It’s really easy to work with. I ended up building this tiny little CLI tool that wraps the model:

  • blip-caption 0.1—2023-09-10
    Generate captions for images with Salesforce BLIP

Releases driven by Datasette Cloud

Datasette Cloud continues to drive improvements to the wider Datasette ecosystem as a whole.

It runs on the latest Datasette 1.0 alpha series, taking advantage of the JSON write API.

This also means that it’s been highlighting breaking changes in 1.0 that have caused old plugins to break, either subtly or completely.

This has driven a bunch of new plugin releases. Some of these are compatible with both 0.x and 1.x—the ones that only work with the 1.x alphas are themselves marked as alpha releases.

Datasette Cloud’s API works using database-backed access tokens, to ensure users can revoke tokens if they need to (something that’s not easily done with purely signed tokens) and that each token can record when it was most recently used.

I’ve been building that into the existing datasette-auth-tokens plugin:

Alex Garcia has been working with me building out features for Datasette Cloud, generously sponsored by Fly.io.

We’re beginning to build out social features for Datasette Cloud—features that will help teams privately collaborate on data investigations together.

Alex has been building datasette-short-links as an experimental link shortener. In building that, we realized that we needed a mechanism for resolving actor IDs displayed in a list (e.g. this link created by X) to their actual names.

Datasette doesn’t dictate the shape of actor representations, and there’s no guarantee that actors would be represented in a predictable table.

So... we needed a new plugin hook. I released Datasette 1.0a6 with a new hook, actors_from_ids(actor_ids), which can be used to answer the question “who are the actors represented by these IDs”.

Alex is using this in datasette-short-links, and I built two plugins to work with the new hook as well:

Datasette Cloud lets users insert, edit and delete rows from their tables, using the plugin Alex built called datasette-write-ui which he introduced on the Datasette Cloud blog.

This inspired me to finally put out a fresh release of datasette-edit-schema—the plugin which provides the ability to edit table schemas—adding and removing columns, changing column types, even altering the order columns are stored in the table.

datasette-edit-schema 0.6 is a major release, with three significant new features:

  • You can now create a brand new table from scratch!
  • You can edit the table’s primary key
  • You can modify the foreign key constraints on the table

Those last two became important when I realized that Datasette’s API is much more interesting if there are foreign key relationships to follow.

Combine that with datasette-write-ui and Datasette Cloud now has a full set of features for building, populating and editing tables—backed by a comprehensive JSON API.

  • sqlite-migrate 0.1a2—2023-09-03
    A simple database migration system for SQLite, based on sqlite-utils

sqlite-migrate is still marked as an alpha, but won’t be for much longer: it’s my attempt at a migration system for SQLite, inspired by Django migrations but with a less sophisticated set of features.

I’m using it in LLM now to manage the schema used to store embeddings, and it’s beginning to show up in some Datasette plugins as well. I’ll be promoting this to non-alpha status pretty soon.

  • sqlite-utils 3.35.1—2023-09-09
    Python CLI utility and library for manipulating SQLite databases

A tiny fix in this, which with hindsight was less impactful than I thought.

I spotted a bug on Datasette Cloud when I configured full-text search on a column, then edited the schema and found that searches no longer returned the correct results.

It turned out the rowid column in SQLite was being rewritten by calls to the sqlite-utils table.transform() method. FTS records are related to their underlying row by rowid, so this was breaking search!

I pushed out a fix for this in 3.35.1. But then... I learned that rowid in SQLite has always been unstable—they are rewritten any time someone VACUUMs a table!

I’ve been designing future features for Datasette that assume that rowid is a useful stable identifier for a row. This clearly isn’t going to work! I’m still thinking through the consequences of it, but I think there may be Datasette features (like the ability to comment on a row) that will only work for tables with a proper primary key.

sqlite-chronicle

  • sqlite-chronicle 0.1—2023-09-11
    Use triggers to track when rows in a SQLite table were updated or deleted

This is very early, but I’m excited about the direction it’s going in.

I keep on finding problems where I want to be able to synchronize various processes with the data in a table.

I built sqlite-history a few months ago, which uses SQLite triggers to create a full copy of the updated data every time a row in a table is edited.

That’s a pretty heavy-weight solution. What if there was something lighter that could achieve a lot of the same goals?

sqlite-chronicle uses triggers to instead create what I’m calling a “chronicle table”. This is a shadow table that records, for every row in the main table, four integer values:

  • added_ms—the timestamp in milliseconds when the row was added
  • updated_ms—the timestamp in milliseconds when the row was last updated
  • version—a constantly incrementing version number, global across the entire table
  • deleted—set to 1 if the row has been deleted

Just storing four integers (plus copies of the primary key) makes this a pretty tiny table, and hopefully one that’s cheap to update via triggers.

But... having this table enables some pretty interesting things—because external processes can track the last version number that they saw and use it to see just which rows have been inserted and updated since that point.
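
Here’s a rough sketch of the idea in SQL run from Python. This is my own illustration of the shape of a chronicle table and the “what changed since version N” query, not the exact schema or triggers that sqlite-chronicle generates:

import sqlite3

db = sqlite3.connect("demo.db")
db.execute("CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, name TEXT)")

# Shadow table: four integers plus a copy of the primary key
db.execute("""
CREATE TABLE IF NOT EXISTS _chronicle_items (
    id INTEGER PRIMARY KEY REFERENCES items(id),
    added_ms INTEGER,
    updated_ms INTEGER,
    version INTEGER,
    deleted INTEGER DEFAULT 0
)""")

# One of the triggers: record a new version number whenever a row is inserted
# (timestamps here are at second precision purely to keep the example short)
db.execute("""
CREATE TRIGGER IF NOT EXISTS items_chronicle_insert AFTER INSERT ON items
BEGIN
    INSERT INTO _chronicle_items (id, added_ms, updated_ms, version)
    VALUES (
        new.id,
        CAST(strftime('%s', 'now') AS INTEGER) * 1000,
        CAST(strftime('%s', 'now') AS INTEGER) * 1000,
        (SELECT COALESCE(MAX(version), 0) + 1 FROM _chronicle_items)
    );
END""")

db.execute("INSERT INTO items (name) VALUES ('hello')")

# An external process remembers the highest version it has seen,
# then asks for just the rows that changed after that point
last_seen_version = 0
changed = db.execute("""
    SELECT items.id, items.name, chronicle.version
    FROM items JOIN _chronicle_items AS chronicle USING (id)
    WHERE chronicle.version > ?
""", (last_seen_version,)).fetchall()
print(changed)

The deleted flag and an update trigger would follow the same pattern.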

I gave a talk at DjangoCon a few years ago called the denormalized query engine pattern, describing the challenge of syncing an external search index like Elasticsearch with data held in a relational database.

These chronicle tables can solve that problem, and can be applied to a whole host of other problems too. So far I’m thinking about the following:

  • Publishing SQLite databases up to Datasette, sending only the rows that have changed since the last sync. I wrote a prototype that does this and it seems to work very well.
  • Copying a table from Datasette Cloud to other places—a desktop copy, or another instance, or even into an alternative database such as PostgreSQL or MySQL, in a way that only copies and deletes rows that have changed.
  • Saved search alerts: run a SQL query against just rows that were modified since the last time that query ran, then send alerts if any rows are matched.
  • Showing users a note that “34 rows in this table have changed since your last visit”, then displaying those rows.

I’m sure there are many more applications for this. I’m looking forward to finding out what they are!

I needed to fix a bug in Datasette Cloud by moving a table from one database to another... so I built a little plugin for sqlite-utils that adds a sqlite-utils move-tables origin.db destination.db tablename command. I love being able to build single-use features as plugins like this.

And some TILs

This was a fun TIL exercising the new embeddings feature in LLM. I used Django SQL Dashboard to break up my blog entries into paragraphs, exported those as CSV, piped them into llm embed-multi and then used the results to build a CLI-driven semantic search engine for my blog.

llama-cpp has grammars now, which enable you to control the exact output format of the LLM. I’m optimistic that these could be used to implement an equivalent to OpenAI Functions on top of Llama 2 and similar models. So far I’ve just got them to output arrays of JSON objects.

I’m using this trick a lot at the moment. I have API access to Claude now, which has a 100,000 token context limit (GPT-4 is just 8,000 by default). That’s enough to summarize 100+ comment threads from Hacker News, for which I’m now using this prompt:

Summarize the themes of the opinions expressed here, including quotes (with author attribution) where appropriate.

The quotes part has been working really well—it turns out summaries of themes with illustrative quotes are much more interesting, and so far my spot checks haven’t found any that were hallucinated.

cr-sqlite adds full CRDTs to SQLite, which should enable multiple databases to accept writes independently and then seamlessly merge them together. It’s a very exciting capability!

It turns out Hugging Face offers free scale-to-zero hosting for demos that run in Docker containers on machines with a full 16GB of RAM! I’m used to optimizing Datasette for tiny 256MB containers, so having this much memory available is a real treat.

And the rest:

Build an image search engine with llm-clip, chat with models with llm chat 20 days ago

LLM is my combination CLI tool and Python library for working with Large Language Models. I just released LLM 0.10 with two significant new features: embedding support for binary files and the llm chat command.

Image search by embedding images with CLIP

I wrote about LLM’s support for embeddings (including what those are and why they’re interesting) when I released 0.9 last week.

That initial release could only handle embeddings of text—great for things like building semantic search and finding related content, but not capable of handling other types of data.

It turns out there are some really interesting embedding models for working with binary data. Top of the list for me is CLIP, released by OpenAI in January 2021.

CLIP has a really impressive trick up its sleeve: it can embed both text and images into the same vector space.

This means you can create an index for a collection of photos, each placed somewhere in 512-dimensional space. Then you can take a text string—like “happy dog”—and embed that into the same space. The images that are closest to that location will be the ones that contain happy dogs!
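
Here’s a tiny sketch of that trick using the SentenceTransformers library directly (which is also how my llm-clip plugin, described next, loads the model). The checkpoint name and image path are illustrative:

from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # a CLIP checkpoint packaged for SentenceTransformers

# Both of these land in the same 512-dimensional space
image_vector = model.encode(Image.open("photos/raccoon.jpeg"))
text_vector = model.encode("a raccoon in a trash can")

# Cosine similarity: closer to 1.0 means the image matches the text more closely
print(util.cos_sim(image_vector, text_vector))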

My llm-clip plugin provides the CLIP model, loaded via SentenceTransformers. You can install and run it like this:

llm install llm-clip
llm embed-multi photos --files photos/ '*.jpg' --binary -m clip

This will install the llm-clip plugin, then use embed-multi to embed all of the JPEG files in the photos/ directory using the clip model.

The resulting embedding vectors are stored in an embedding collection called photos. This defaults to going in the embeddings.db SQLite database managed by LLM, or you can add -d photos.db to store it in a separate database instead.

Then you can run text similarity searches against that collection using llm similar:

llm similar photos -c 'raccoon'

I get back:

{"id": "IMG_4801.jpeg", "score": 0.28125139257127457, "content": null, "metadata": null}
{"id": "IMG_4656.jpeg", "score": 0.26626441704164294, "content": null, "metadata": null}
{"id": "IMG_2944.jpeg", "score": 0.2647445926996852, "content": null, "metadata": null}

And sure enough, IMG_4801.jpeg is this:

A night time blurry photo of a Raccoon in a trash can. A stencilled label on the bin below the raccoon's face says TRASH

(I was pretty excited to snap a photo of a trash panda in an actual trash can.)

CLIP is a pretty old model at this point, and there are plenty of interesting alternatives that are just waiting for someone to wrap them in a plugin. I’m particularly excited about Facebook’s ImageBind, which can embed images, text, audio, depth, thermal, and IMU data all in the same vector space!

Chat with models using llm chat

The other big feature in LLM 0.10 is the new llm chat command.

Prior to this release, the way to have an ongoing conversation with a model was through the -c/--continue flag. You could start a conversation like this:

llm -m gpt-4 'Help me understand generators in Python'

Then ask a follow-up question using -c like so:

llm -c 'Show me an example involving the cast of Riverdale'

This works well, and everything gets logged to SQLite so you can run llm logs -c to see a full log of your most recent conversation.

You could continue a past conversation as well, using llm --cid ID to specify the conversation ID, recorded in those logs.

There was one big problem though: LLM plugins allow you to install models on your own machine, but a conversation with them that spans multiple command invocations resulted in that model being loaded into memory and then un-loaded again for every prompt.

The new llm chat solves that. You can now run llm chat -m model_id to start an interactive chat in your terminal with your model of choice.

Here’s an example using Llama 2 13B, installed via the llm-mlc plugin.

llm chat -m mlc-chat-Llama-2-13b-chat-hf-q4f16_1

You can set an alias for the model to make that easier to remember.

Here’s an example chat session with Llama 2:

Type 'exit' or 'quit' to exit
Type '!multi' to enter multiple lines, then '!end' to finish
> Who are you?
Hello! I'm just an AI, here to assist you with any questions you may have.
My name is LLaMA, and I'm a large language model trained to provide helpful
and informative responses to a wide range of questions and topics. I'm here
to help you find the information you need, while ensuring a safe and
respectful conversation. Please feel free to ask me anything, and I'll do my
best to provide a helpful and accurate response.
> Tell me a joke about otters
Sure, here's a joke about otters:

Why did the otter go to the party?

Because he heard it was a "whale" of a time!

(Get it? Whale, like a big sea mammal, but also a "wild" or "fun" time.
Otters are known for their playful and social nature, so it's a lighthearted
and silly joke.)

I hope that brought a smile to your face! Do you have any other questions or
topics you'd like to discuss?
> exit

I like how Llama 2 enthusiastically explains its own jokes! This is a great demonstration of why techniques like RLHF are so important—you need to go way beyond a raw language model if you’re going to teach one not to be this corny.

Each line of your chat will be executed as soon as you hit <enter>. Sometimes you might need to enter a multi-line prompt, for example if you need to paste in an error message. You can do that using the !multi token, like this:

llm chat -m gpt-4
Chatting with gpt-4
Type 'exit' or 'quit' to exit
Type '!multi' to enter multiple lines, then '!end' to finish
> !multi custom-end
 Explain this error:

   File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.10/urllib/request.py", line 1391, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.10/urllib/request.py", line 1351, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>

 !end custom-end

llm chat also supports system prompts and templates. If you want to chat with a sentient cheesecake, try this:

llm chat -m gpt-3.5-turbo --system '
You are a stereotypical sentient cheesecake with strong opinions
who always talks about cheesecake'

You can save those as templates too:

llm --system 'You are a stereotypical sentient cheesecake with
strong opinions who always talks about cheesecake' --save cheesecake -m gpt-4
llm chat -t cheesecake

For more options, see the llm chat documentation.

Get involved

My ambition for LLM is for it to provide the easiest way to try out new models, both full-sized Large Language Models and now embedding models such as CLIP.

I’m not going to write all of these plugins myself!

If you want to help out, please come and say hi in the #llm Discord channel.

LLM now provides tools for working with embeddings 28 days ago

LLM is my Python library and command-line tool for working with language models. I just released LLM 0.9 with a new set of features that extend LLM to provide tools for working with embeddings.

This is a long post with a lot of theory and background. If you already know what embeddings are, here’s a TLDR you can try out straight away:

# Install LLM
pip install llm

# If you already installed via Homebrew/pipx you can upgrade like this:
llm install -U llm

# Install the llm-sentence-transformers plugin
llm install llm-sentence-transformers

# Install the all-MiniLM-L6-v2 embedding model
llm sentence-transformers register all-MiniLM-L6-v2

# Generate and store embeddings for every README.md in your home directory, recursively
llm embed-multi readmes \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --files ~/ '**/README.md'
  # Add --store to store the text content as well

# Run a similarity search for "sqlite" against those embeddings
llm similar readmes -c sqlite

For everyone else, read on and the above example should hopefully all make sense.

Embeddings

Embeddings are a fascinating concept within the larger world of language models.

I explained embeddings in my recent talk, Making Large Language Models work for you. The relevant section of the slides and transcript is here, or you can jump to that section on YouTube.

An embedding model lets you take a string of text—a word, sentence, paragraph or even a whole document—and turn that into an array of floating point numbers called an embedding vector.

On the left is a text post from one of my sites: Storing and serving related documents with openai-to-sqlite and embeddings. An arrow points to a huge JSON array on the right, with the label 1536 floating point numbers.

A model will always produce the same length of array—1,536 numbers for the OpenAI embedding model, 384 for all-MiniLM-L6-v2—but the array itself is inscrutable. What are you meant to do with it?

The answer is that you can compare them. I like to think of an embedding vector as a location in 1,536-dimensional space. The distance between two vectors is a measure of how semantically similar they are, at least according to the model that produced them.

A location in 1,536-dimension space: a 3D plot with 400 red dots arranged randomly across three axes.

“One happy dog” and “A playful hound” will end up close together, even though they don’t share any keywords. The embedding vector represents the language model’s interpretation of the meaning of the text.
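
The comparison itself is usually cosine similarity, which is what LLM’s similarity search uses. A toy sketch with made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot / (magnitude_a * magnitude_b)

# Made-up vectors, three dimensions instead of 1,536 so they fit on the page
one_happy_dog = [0.91, 0.32, 0.11]
a_playful_hound = [0.89, 0.35, 0.08]
spreadsheet_formulas = [0.02, 0.77, 0.61]

print(cosine_similarity(one_happy_dog, a_playful_hound))       # ~0.999: very similar
print(cosine_similarity(one_happy_dog, spreadsheet_formulas))  # ~0.35: not similar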

Things you can do with embeddings include:

  1. Find related items. I use this on my TIL site to display related articles, as described in Storing and serving related documents with openai-to-sqlite and embeddings.
  2. Build semantic search. As shown above, an embeddings-based search engine can find content relevant to the user’s search term even if none of the keywords match.
  3. Implement retrieval augmented generation—the trick where you take a user’s question, find relevant documentation in your own corpus and use that to get an LLM to spit out an answer. More on that here.
  4. Clustering: you can find clusters of nearby items and identify patterns in a corpus of documents.
  5. Classification: calculate the embedding of a piece of text and compare it to pre-calculated “average” embeddings for different categories.

LLM’s new embedding features

My goal with LLM is to provide a plugin-driven abstraction around a growing collection of language models. I want to make installing, using and comparing these models as easy as possible.

The new release adds several command-line tools for working with embeddings, plus a new Python API for working with embeddings in your own code.

It also adds support for installing additional embedding models via plugins. I’ve released one plugin for this so far: llm-sentence-transformers, which adds support for new models based on the sentence-transformers library.

The example above shows how to use sentence-transformers. LLM also supports API-driven access to the OpenAI ada-002 model.

Here’s how to embed some text using ada-002, assuming you have installed LLM already:

# Set your OpenAI API key
llm keys set openai
# <paste key here>

# Embed some text
llm embed -m ada-002 -c "Hello world"

This will output a huge JSON list of floating point numbers to your terminal. You can add -f base64 (or -f hex) to get that back in a different format, though none of these outputs are instantly useful.

Embeddings are much more interesting when you store them.

LLM already uses SQLite to store prompts and responses. It was a natural fit to use SQLite to store embeddings as well.

Embedding collections

LLM 0.9 introduces the concept of a collection of embeddings. A collection has a name—like readmes—and contains a set of embeddings, each of which has an ID and an embedding vector.

All of the embeddings in a collection are generated by the same model, to ensure they can be compared with each other.

The llm embed command can store the vector in the database instead of returning it to the console. Pass it the name of an existing (or to-be-created) collection and the ID to use to store the embedding.

Here we’ll store the embedding for the phrase “Hello world” in a collection called phrases with the ID hello, using that ada-002 embedding model:

llm embed phrases hello -m ada-002 -c "Hello world"

Future phrases can be added without needing to specify the model again, since it is remembered by the collection:

llm embed phrases goodbye -c "Goodbye world"

The llm embed-db collections command shows a list of collections:

phrases: ada-002
  2 embeddings
readmes: sentence-transformers/all-MiniLM-L6-v2
  16796 embeddings

The data is stored in a SQLite embeddings table with the following schema:

CREATE TABLE [collections] (
   [id] INTEGER PRIMARY KEY,
   [name] TEXT,
   [model] TEXT
);
CREATE TABLE "embeddings" (
   [collection_id] INTEGER REFERENCES [collections]([id]),
   [id] TEXT,
   [embedding] BLOB,
   [content] TEXT,
   [content_hash] BLOB,
   [metadata] TEXT,
   [updated] INTEGER,
   PRIMARY KEY ([collection_id], [id])
);

CREATE UNIQUE INDEX [idx_collections_name]
    ON [collections] ([name]);
CREATE INDEX [idx_embeddings_content_hash]
    ON [embeddings] ([content_hash]);

By default this is the SQLite database at the location revealed by llm embed-db path, but you can pass --database my-embeddings.db to various LLM commands to use a different database.

Each embedding vector is stored as a binary BLOB in the embedding column, consisting of those floating point numbers packed together as 32 bit floats.
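
A quick sketch of what that packing looks like. The byte order here is an assumption for illustration; the llm.encode() and llm.decode() helpers described later in this post handle the real format:

import struct

vector = [0.25, -0.5, 0.125]  # a tiny example embedding

# Pack the floats into a contiguous run of 32 bit values
blob = struct.pack("<" + "f" * len(vector), *vector)
print(len(blob))  # 4 bytes per float: 12

# And unpack the BLOB back into a list of floats
print(list(struct.unpack("<" + "f" * len(vector), blob)))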

The content_hash column contains an MD5 hash of the content. This helps avoid re-calculating the embedding (which can cost actual money for API-based embedding models like ada-002) unless the content has changed.

The content column is usually null, but can contain a copy of the original text content if you pass the --store option to the llm embed command.

metadata can contain a JSON object with metadata, if you pass --metadata '{"json": "goes here"}'.

You don’t have to pass content using -c—you can instead pass a file path using the -i/--input option:

llm embed docs llm-setup -m ada-002 -i llm/docs/setup.md

Or pipe things to standard input like this:

cat llm/docs/setup.md | llm embed docs llm-setup -m ada-002 -i -

Embedding similarity search

Once you’ve built a collection, you can search for similar embeddings using the llm similar command.

The -c "term" option will embed the text you pass in using the embedding model for the collection and use that as the comparison vector:

llm similar readmes -c sqlite

You can also pass the ID of an object in that collection to use that embedding instead. This gets you related documents, for example:

llm similar readmes sqlite-utils/README.md

The output from this command is currently newline-delimited JSON.

Embedding in bulk

The llm embed command embeds a single string at a time. llm embed-multi is much more powerful: you can feed a CSV or JSON file, a SQLite database or even have it read from a directory of files in order to embed multiple items at once.

Many embedding models are optimized for batch operations, so embedding multiple items at a time can provide a significant speed boost.

The embed-multi command is described in detail in the documentation. Here are a couple of fun things you can do with it.

First, I’m going to create embeddings for every single one of my Apple Notes.

My apple-notes-to-sqlite tool can export Apple Notes to a SQLite database. I’ll run that first:

apple-notes-to-sqlite notes.db

This took quite a while to run on my machine and generated an 828MB SQLite database containing 6,462 records!

Next, I’m going to embed the content of all of those notes using the sentence-transformers/all-MiniLM-L6-v2 model:

llm embed-multi notes \
  -d notes.db \
  --sql 'select id, title, body from notes' \
  -m sentence-transformers/all-MiniLM-L6-v2

This took around 15 minutes to run, and increased the size of my database by 13MB.

The --sql option here specifies a SQL query. The first column must be an id, then any subsequent columns will be concatenated together to form the content to embed.

In this case the embeddings are written back to the same notes.db database that the content came from.

And now I can run embedding similarity operations against all of my Apple notes!

llm similar notes -d notes.db -c 'ideas for blog posts'

Embedding files in a directory

Let’s revisit the example from the top of this post. In this case, I’m using the --files option to search for files on disk and embed each of them:

llm embed-multi readmes \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --files ~/ '**/README.md'

The --files option takes two arguments: a path to a directory and a pattern to match against filenames. In this case I’m searching my home directory recursively for any files named README.md.

Running this command gives me embeddings for all of my README.md files, which I can then search against like this:

llm similar readmes -c sqlite

Embeddings in Python

So far I’ve only covered the command-line tools. LLM 0.9 also introduces a new Python API for working with embeddings.

There are two aspects to this. If you just want to embed content and handle the resulting vectors yourself, you can use llm.get_embedding_model():

import llm

# This takes model IDs and aliases defined by plugins:
model = llm.get_embedding_model("sentence-transformers/all-MiniLM-L6-v2")
vector = model.embed("This is text to embed")

vector will then be a Python list of floating point numbers.

You can serialize that to the same binary format that LLM uses like this:

binary_vector = llm.encode(vector)
# And to deserialize:
vector = llm.decode(binary_vector)

The second aspect of the Python API is the llm.Collection class, for working with collections of embeddings. This example code is quoted from the documentation:

import sqlite_utils
import llm

# This collection will use an in-memory database that will be
# discarded when the Python process exits
collection = llm.Collection("entries", model_id="ada-002")

# Or you can persist the database to disk like this:
db = sqlite_utils.Database("my-embeddings.db")
collection = llm.Collection("entries", db, model_id="ada-002")

# You can pass a model directly using model= instead of model_id=
embedding_model = llm.get_embedding_model("ada-002")
collection = llm.Collection("entries", db, model=embedding_model)

# Store a string in the collection with an ID:
collection.embed("hound", "my happy hound")

# Or to store content and extra metadata:
collection.embed(
    "hound",
    "my happy hound",
    metadata={"name": "Hound"},
    store=True
)

# Or embed things in bulk:
collection.embed_multi(
    [
        ("hound", "my happy hound"),
        ("cat", "my dissatisfied cat"),
    ],
    # Add this to store the strings in the content column:
    store=True,
)

As with everything else in LLM, the goal is that anything you can do with the CLI can be done with the Python API, and vice-versa.

Clustering with llm-cluster

Another interesting application of embeddings is that you can use them to cluster content—identifying patterns in a corpus of documents.

I’ve started exploring this area with a new plugin, called llm-cluster.

You can install it like this:

llm install llm-cluster

Let’s create a new collection using data pulled from GitHub. I’m going to import all of the LLM issues from the GitHub API, using my paginate-json tool:

paginate-json 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
  | jq '[.[] | {id: .id, title: .title}]' \
  | llm embed-multi llm-issues - \
    --database issues.db \
    --model sentence-transformers/all-MiniLM-L6-v2 \
    --store

Running this gives me an issues.db SQLite database with 218 embeddings contained in a collection called llm-issues.

Now let’s try out the new llm cluster command added by the plugin, requesting ten clusters from that collection:

llm cluster llm-issues --database issues.db 10

The output from this command, truncated, looks like this:

[
  {
    "id": "0",
    "items": [
      {
        "id": "1784149135",
        "content": "Tests fail with pydantic 2"
      },
      {
        "id": "1837084995",
        "content": "Allow for use of Pydantic v1 as well as v2."
      },
      {
        "id": "1857942721",
        "content": "Get tests passing against Pydantic 1"
      }
    ]
  },
  {
    "id": "1",
    "items": [
      {
        "id": "1724577618",
        "content": "Better ways of storing and accessing API keys"
      },
      {
        "id": "1772024726",
        "content": "Support for `-o key value` options such as `temperature`"
      },
      {
        "id": "1784111239",
        "content": "`--key` should be used in place of the environment variable"
      }
    ]
  },
  {
    "id": "8",
    "items": [
      {
        "id": "1835739724",
        "content": "Bump the python-packages group with 1 update"
      },
      {
        "id": "1848143453",
        "content": "Python library support for adding aliases"
      },
      {
        "id": "1857268563",
        "content": "Bump the python-packages group with 1 update"
      }
    ]
  }
]

These look pretty good! But wouldn’t it be neat if we had a snappy title for each one?

The --summary option can provide exactly that, by piping the members of each cluster through a call to another LLM in order to generate a useful summary.

llm cluster llm-issues --database issues.db 10 --summary

This uses gpt-3.5-turbo to generate a summary for each cluster, with this default prompt:

Short, concise title for this cluster of related documents.

The results I got back are pretty good, including:

  • Template Storage and Management Improvements
  • Package and Dependency Updates and Improvements
  • Adding Conversation Mechanism and Tools

I tried the same thing using a Llama 2 model running on my own laptop, with a custom prompt:

llm cluster llm-issues --database issues.db 10 \
  --summary --model mlc-chat-Llama-2-13b-chat-hf-q4f16_1 \
  --prompt 'Concise title for this cluster of related documents, just return the title'

I didn’t quite get what I wanted! Llama 2 is proving a lot harder to prompt, so each cluster came back with something that looked like this:

Sure! Here’s a concise title for this cluster of related documents:

“Design Improvements for the Neat Prompt System”

This title captures the main theme of the documents, which is to improve the design of the Neat prompt system. It also highlights the focus on improving the system’s functionality and usability

llm-cluster only took a few hours to throw together, which I’m seeing as a positive indicator that the LLM library is developing in the right direction.

Future plans

The two future features I’m most excited about are indexing and chunking.

Indexing

The llm similar command and collection.similar() Python method currently use effectively the slowest brute force approach possible: calculate the cosine distance between the input vector and every other embedding in the collection, then sort the results.

This works fine for collections with a few hundred items, but will start to suffer for collections of 100,000 or more.

There are plenty of potential ways of speeding this up: you can run a vector index like FAISS or hnswlib, use a database extension like sqlite-vss or pgvector, or turn to a hosted vector database like Pinecone or Milvus.

With this many potential solutions, the obvious answer for LLM is to address this with plugins.

I’m still thinking through the details, but the core idea is that users should be able to define an index against one or more collections, and LLM will then coordinate updates to that index. These may not happen in real-time—some indexes can be expensive to rebuild, so there are benefits to applying updates in batches.

I experimented with FAISS earlier this year in datasette-faiss. That’s likely to be the base for my first implementation.
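
To give a flavour of what an index plugin might do, here’s a rough sketch of brute-force FAISS built over embeddings pulled straight out of that SQLite schema. It isn’t the plugin API (which doesn’t exist yet), just the general shape:

import sqlite3
import faiss  # pip install faiss-cpu
import numpy as np
import llm

db = sqlite3.connect("embeddings.db")
# collection_id 1 is a placeholder: look it up in the collections table by name
rows = db.execute(
    "select id, embedding from embeddings where collection_id = ?", (1,)
).fetchall()

ids = [row[0] for row in rows]
vectors = np.array([llm.decode(row[1]) for row in rows], dtype="float32")

# IndexFlatL2 is still brute force - FAISS's approximate indexes are where the speedups live
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)

# Find the five stored embeddings nearest to the first one
distances, positions = index.search(vectors[:1], 5)
print([ids[i] for i in positions[0]])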

The embeddings table has an updated timestamp column to support this use-case—so indexers can run against just the items that have changed since the last indexing run.

Follow issue #216 for updates on this feature.

Chunking

When building an embeddings-based search engine, the hardest challenge is deciding how best to “chunk” the documents.

Users will type in short phrases or questions. The embedding for a four word question might not necessarily map closely to the embedding of a thousand word article, even if the article itself should be a good match for that query.

To maximize the chance of returning the most relevant content, we need to be smarter about what we embed.

I’m still trying to get a good feeling for the strategies that make sense here. Some that I’ve seen include:

  • Split a document up into fixed-length shorter segments.
  • Split into segments, but include a ~10% overlap with the previous and next segments, to reduce problems caused by words and sentences being split in a way that disrupts their semantic meaning.
  • Split by sentence, using NLP techniques.
  • Split into higher-level sections, based on things like document headings.

Then there are more exciting, LLM-driven approaches:

  • Generate an LLM summary of a document and embed that.
  • Ask an LLM “What questions are answered by the following text?” and then embed each of the resulting questions!

It’s possible to try out these different techniques using LLM already: write code that does the splitting, then feed the results to Collection.embed_multi() or llm embed-multi.
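
Here’s a minimal sketch of that, using the fixed-length-with-overlap strategy from the list above and the Collection API described earlier (file name and chunk sizes are arbitrary):

import sqlite_utils
import llm


def chunks(text, length=200, overlap=20):
    "Yield chunks of `length` words, each overlapping the previous one by `overlap` words"
    words = text.split()
    step = length - overlap
    for start in range(0, len(words), step):
        yield " ".join(words[start:start + length])


db = sqlite_utils.Database("chunked.db")
collection = llm.Collection("chunked-docs", db, model_id="ada-002")

document = open("example-document.txt").read()

collection.embed_multi(
    [(f"example-document:{i}", chunk) for i, chunk in enumerate(chunks(document))],
    store=True,
)

Swapping in a different chunks() implementation is then the only change needed to try another strategy.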

But... it would be really cool if LLM could split documents for you—with the splitting techniques themselves defined by plugins, to make it easy to try out new approaches.

Get involved

It should be clear by now that the potential scope of the LLM project is enormous. I’m trying to use plugins to tie together a vast and rapidly growing ecosystem of models and techniques into something that’s as easy for people to work with and build on as possible.

There are plenty of ways you can help!

  • Join the #llm Discord to talk about the project.
  • Try out plugins and run different models with them. There are 12 plugins already, and several of those can be used to run dozens if not hundreds of models (llm-mlc, llm-gpt4all and llm-llama-cpp in particular). I’ve hardly scratched the surface of these myself, and I’m testing exclusively on Apple Silicon. I’m really keen to learn more about which models work well, which models don’t and which perform the best on different hardware.
  • Try building a plugin for a new model. My dream here is that every significant Large Language Model will have an LLM plugin that makes it easy to install and use.
  • Build stuff using LLM and let me know what you’ve built. Nothing fuels an open source project more than stories of cool things people have built with it.

Elsewhere

Yesterday

  • Weird A.I. Yankovic, a cursed deep dive into the world of voice cloning. Andy Baio reports back on his investigations into the world of AI voice cloning.

    This is no longer a niche interest. There’s a Discord with 500,000 members sharing tips and tricks on cloning celebrity voices in order to make their own cover songs, often built with Google Colab using models distributed through Hugging Face.

    Andy then makes his own, playing with the concept “What if every Weird Al song was the original, and every other artist was covering his songs instead?”

    I particularly enjoyed Madonna’s cover of “Like A Surgeon”, Lady Gaga’s “Perform This Way” and Lorde’s “Foil”. #2nd October 2023, 6:50 pm
  • [On Python 3.12 subinterpreters] there’s massive advantages for mixed C(++) and Python: I can now have multiple sub interpreters running concurrently and accessing the same shared state in a thread-safe C++ library.

    Previously this required rewriting the whole C++ library to support either pickling (multiplying the total memory consumption by the number of cores), or support allocating everything in shared memory (which means normal C++ types like `std::string` are unusable, need to switch e.g. to boost::interprocess).

    Now is sufficient to pickle a pointer to a C++ object as an integer, and it’ll still be a valid pointer in the other subinterpreter.

    ynik # 2nd October 2023, 6:13 pm

  • jq 1.7. First new release of jq in five years! The project has moved from a solo maintainer to a new team with a dedicated GitHub organization. A ton of new features in this release—I’m most excited about the new pick(.key1, .key2.nested) builtin for emitting a selected subset of the incoming objects, and the --raw-output0 option which outputs zero byte delimited lists, designed to be piped to “xargs -0”. #2nd October 2023, 4:58 am
  • I think that discussions of this technology become much clearer when we replace the term AI with the word “automation”. Then we can ask:

    What is being automated?
    Who’s automating it and why?
    Who benefits from that automation?
    How well does the automation work in its use case that we’re considering?
    Who’s being harmed?
    Who has accountability for the functioning of the automated system?
    What existing regulations already apply to the activities where the automation is being used?

    Emily M. Bender # 2nd October 2023, 12:20 am

1st October 2023

  • Database Migrations. Vadim Kravcenko provides a useful, in-depth description of the less obvious challenges of applying database migrations successfully. Vadim uses and likes Django’s migrations (as do I) but notes that running them at scale still involves a number of thorny challenges.

    The biggest of these, which I’ve encountered myself multiple times, is that if you want truly zero downtime deploys you can’t guarantee that your schema migrations will be deployed at the exact same instant as changes you make to your application code.

    This means all migrations need to be forward-compatible: you need to apply a schema change in a way that your existing code will continue to work error-free, then ship the related code change as a separate operation.

    Vadim describes what this looks like in detail for a number of common operations: adding a field, removing a field and changing a field that has associated business logic implications. He also discusses the importance of knowing when to deploy a dual-write strategy. I’ve included a minimal sketch of the add-a-field version of this pattern after this day’s links. #1st October 2023, 11:55 pm
  • Observable notebook: Detect objects in images (via) I built an Observable notebook that uses Transformers.js and the Xenova/detr-resnet-50 model to detect objects in images, entirely running within your browser. You can select an image using a file picker and it will show you that image with bounding boxes and labels drawn around items within it. I have a demo image showing some pelicans flying overhead, but it works with any image you give it—all without uploading that image to a server. #1st October 2023, 3:46 pm
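Here’s the forward-compatible migration sketch promised above. It’s a hypothetical Django example (the orders app, Order model and fulfillment_status field are all made up), showing the usual way to split an added column across deploys:

```python
# Deploy 1: schema change only. The new column is nullable, so application
# servers that have never heard of it keep working while old and new code
# overlap during the rollout.
from django.db import migrations, models


class Migration(migrations.Migration):
    dependencies = [("orders", "0007_previous_migration")]

    operations = [
        migrations.AddField(
            model_name="order",
            name="fulfillment_status",
            field=models.CharField(max_length=32, null=True),
        ),
    ]

# Deploy 2: ship the application code that reads and writes the new column.
# Deploy 3: backfill existing rows, then a final migration can tighten the
# column to NOT NULL once every row has a value.
```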

30th September 2023

  • Get Your Mac Python From Python.org. Glyph recommends the official Python installer from python.org as the best way to get started with a Python environment on macOS—with require-virtualenv = true in your ~/.pip/pip.conf to help avoid accidentally installing global packages. #30th September 2023, 2:39 am
  • Meta in Myanmar, Part I: The Setup. The first in a series by Erin Kissane explaining in detail exactly how things went so incredibly wrong with Facebook in Myanmar, contributing to a genocide ending hundreds of thousands of lives. This is an extremely tough read. #30th September 2023, 2:27 am

29th September 2023

  • Draggable objects (via) Amit Patel’s detailed write-up of a small but full-featured JavaScript function for creating draggable objects, with support for both mouse and touch devices “using browser features that are widely supported since 2020”. #29th September 2023, 7:56 pm

28th September 2023

  • Getting started with the Datasette Cloud API. I wrote an introduction to the Datasette Cloud API for the company blog, with a tutorial showing how to use Python and GitHub Actions to import data from the Federal Register into a table in Datasette Cloud, then configure full-text search against it. There’s a short sketch of what a write API call looks like after this day’s links. #28th September 2023, 11:05 pm
  • Looking at LLMs as chatbots is the same as looking at early computers as calculators. We’re seeing an emergence of a whole new computing paradigm, and it is very early.

    Andrej Karpathy # 28th September 2023, 8:50 pm
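Here’s the write API sketch mentioned under the Datasette Cloud link above. The space name, table and token are all made up, and the URL follows the /data/table-name/-/insert shape of the Datasette 1.0 alpha write API; check the Datasette Cloud documentation for the exact details for your own space.

```python
import json
import urllib.request

# Hypothetical space and table; the token comes from the Datasette Cloud API
# tokens interface and needs insert-row permission against this table.
url = "https://demos.datasette.cloud/data/documents/-/insert"
token = "dstok_..."  # placeholder - use a real token here

payload = {"rows": [
    {"title": "Example document", "published": "2023-09-28"},
]}

request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    },
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))
```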

27th September 2023

  • Google was accidentally leaking its Bard AI chats into public search results. I’m quoted in this piece about yesterday’s Bard privacy bug: it turned out the share URL and “Let anyone with the link see what you’ve selected” feature wasn’t correctly setting a noindex parameter, and so some shared conversations were being swept up by the Google search crawlers. Thankfully this was a mistake, not a deliberate design decision, and it should be fixed by now. #27th September 2023, 7:35 pm
  • Finding Bathroom Faucets with Embeddings. Absolutely the coolest thing I’ve seen someone build on top of my LLM tool so far: Drew Breunig is renovating a bathroom and needed a way to filter through literally thousands of options for faucet taps. He scraped 20,000 images of fixtures from a plumbing supply site and used LLM to embed every one of them via CLIP... and now he can ask for “faucets that look like this one”, or even run searches for faucets that match “Gawdy” or “Bond Villain” or “Nintendo 64”. Live demo included! #27th September 2023, 6:18 pm
  • Optimizing for Taste. David Cramer’s detailed explanation as to why his company Sentry mostly avoids A/B testing. David wrote this as an internal blog post originally, but is now sharing it with the world. I found myself nodding along vigorously as I read this—lots of astute observations here.

    I particularly appreciated his closing note: “The strength of making a decision is making it. You can always make a new one later. Choose the obvious path forward, and if you don’t see one, find someone who does.” #27th September 2023, 4:34 am
  • The profusion of dubious A.I.-generated content resembles the badly made stockings of the nineteenth century. At the time of the Luddites, many hoped the subpar products would prove unacceptable to consumers or to the government. Instead, social norms adjusted.

    Kyle Chayka # 27th September 2023, 12:26 am

26th September 2023

  • Rethinking the Luddites in the Age of A.I. I’ve been staying way clear of comparisons to Luddites in conversations about the potential harmful impacts of modern AI tools, because it seemed to me like an offensive, unproductive cheap shot.

    This article has shown me that the comparison is actually a lot more relevant—and sympathetic—than I had realized.

    In a time before labor unions, the Luddites represented an early example of a worker movement that tried to stand up for their rights in the face of transformational, negative change to their specific way of life.

    “Knitting machines known as lace frames allowed one employee to do the work of many without the skill set usually required” is a really striking parallel to what’s starting to happen with a surprising array of modern professions already. #26th September 2023, 11:45 pm
  • Batch size one billion: SQLite insert speedups, from the useful to the absurd (via) Useful, detailed review of ways to maximize the performance of inserting a billion integers into a SQLite database table. #26th September 2023, 5:31 pm
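For a sense of the baseline here: the usual first two moves for fast SQLite inserts are wrapping everything in a single transaction and using executemany(). A minimal sketch (mine, not taken from the article):

```python
import sqlite3
import time

db = sqlite3.connect("numbers.db")
db.execute("create table if not exists numbers (n integer)")

start = time.time()
with db:  # one transaction for the whole batch, not one per row
    db.executemany(
        "insert into numbers (n) values (?)",
        ((i,) for i in range(1_000_000)),
    )
print(f"Inserted 1,000,000 rows in {time.time() - start:.2f}s")
```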

25th September 2023

  • Upsert in SQL (via) Anton Zhiyanov is currently on a one-man quest to write detailed documentation for all of the fundamental SQL operations, comparing and contrasting how they work across multiple engines, generally with interactive examples.

    Useful tips in here on why “insert... on conflict” is usually a better option than “insert or replace into”: the latter can perform a delete and then an insert, firing delete triggers you may not have wanted fired. There’s a small demonstration of the difference after this day’s links. #25th September 2023, 8:34 pm
  • Geospatial SQL queries in SQLite using TG, sqlite-tg and datasette-sqlite-tg. Alex Garcia built sqlite-tg—a SQLite extension that uses the brand new TG geospatial library to provide a whole suite of custom SQL functions for working with geospatial data.

    Here are my notes on trying out his initial alpha releases. The extension already provides tools for converting between GeoJSON, WKT and WKB, plus the all-important tg_intersects() function for testing whether one polygon or point overlaps another.

    It’s pretty useful already. Without any geospatial indexing at all I was still able to get 700ms replies to a brute-force point-in-polygon query against 150MB of GeoJSON timezone boundaries stored as JSON text in a table. #25th September 2023, 7:45 pm
  • We already know one major effect of AI on the skills distribution: AI acts as a skills leveler for a huge range of professional work. If you were in the bottom half of the skill distribution for writing, idea generation, analyses, or any of a number of other professional tasks, you will likely find that, with the help of AI, you have become quite good.

    Ethan Mollick # 25th September 2023, 4:37 pm

  • A Hackers' Guide to Language Models. Jeremy Howard’s new 1.5 hour YouTube introduction to language models looks like a really useful place to catch up if you’re an experienced Python programmer looking to start experimenting with LLMs. He covers what they are and how they work, then shows how to build against the OpenAI API, build a Code Interpreter clone using OpenAI functions, run models from Hugging Face on your own machine (with NVIDIA cards or on a Mac) and finishes with a demo of fine-tuning a Llama 2 model to perform text-to-SQL using an open dataset. #25th September 2023, 12:24 am
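Here’s the demonstration promised under the Upsert in SQL link above, using Python’s sqlite3 module. It shows the difference directly: “insert or replace” deletes the conflicting row and inserts a new one, silently discarding column values you didn’t re-specify, while the upsert form updates the existing row in place.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "create table users (id integer primary key, name text, visits integer default 0)"
)
db.execute("insert into users (id, name, visits) values (1, 'Alice', 5)")

# INSERT OR REPLACE resolves the conflict by deleting the old row and
# inserting a fresh one, so visits silently resets to its default.
db.execute("insert or replace into users (id, name) values (1, 'Alicia')")
print(db.execute("select name, visits from users where id = 1").fetchone())
# ('Alicia', 0)

# Restore the original row, then make the same change as an upsert: the row
# is updated in place and the visits value survives.
db.execute("update users set name = 'Alice', visits = 5 where id = 1")
db.execute(
    "insert into users (id, name) values (1, 'Alicia') "
    "on conflict (id) do update set name = excluded.name"
)
print(db.execute("select name, visits from users where id = 1").fetchone())
# ('Alicia', 5)
```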

24th September 2023

  • Should you give candidates feedback on their interview performance? Jacob provides a characteristically nuanced answer to the question of whether you should provide feedback to candidates you have interviewed. He suggests telling candidates early in the interview process that they can email to ask for feedback, so the request doesn’t feel pushy later on, and proposes the phrase “you failed to demonstrate...” as a useful framing device. #24th September 2023, 10:25 pm

23rd September 2023

  • TG: Polygon indexing (via) TG is a brand new geospatial library by Josh Baker, author of the Tile38 in-memory spatial server (kind of a geospatial Redis). TG is written in pure C and delivered as a single C file, reminiscent of the SQLite amalgamation.

    TG looks really interesting. It implements almost the exact subset of geospatial functionality that I find most useful: point-in-polygon, intersect, WKT, WKB, and GeoJSON—all with no additional dependencies.

    The most interesting thing about it is the way it handles indexing. In this documentation Josh describes the two novel approaches he uses to speed up point-in-polygon and intersection checks, going beyond the usual R-tree implementation.

    I think this could make the basis of a really useful SQLite extension—a lighter-weight alternative to SpatiaLite. #23rd September 2023, 4:32 am