Simon Willison’s Weblog

Recent entries

Weeknotes: the Datasette Cloud API, a podcast appearance and more two days ago

Datasette Cloud now has a documented API, plus a podcast appearance, some LLM plugins work and some geospatial excitement.

The Datasette Cloud API

My biggest achievement this week is that I documented and announced the API for Datasette Cloud.

I wrote about this at length in Getting started with the Datasette Cloud API on the Datasette Cloud blog. I also used this as an opportunity to start a documentation site for the service, now available at datasette.cloud/docs.

The API is effectively the Datasette 1.0 alpha write API, described here previously. You can use the API to both read and write data in a Datasette Cloud space, with fine-grained permissions (powered by the datasette-auth-tokens plugin) so you can create tokens that are restricted to specific actions against specified tables.

The blog entry about it doubles as a tutorial, describing how I wrote code to import the latest documents from the US Government Federal Register into a Datasette Cloud space, using a dependency-free Python script and GitHub Actions.

You can see that code in the new federal-register-to-datasette GitHub repository. It’s pretty small—just 70 lines of Python and 22 of YAML.
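To give a sense of the shape of that API, here’s a minimal sketch of the kind of write call the script makes, using only the Python standard library. The space URL, database and table names, the example row and the environment variable are all placeholders for this illustration; see the Datasette Cloud documentation for the exact endpoints and payload options:

import json
import os
import urllib.request

# Placeholder values: substitute your own space, database, table and token
url = "https://your-space.datasette.cloud/data/documents/-/insert"
token = os.environ["DATASETTE_CLOUD_TOKEN"]

rows = [{"document_number": "2023-12345", "title": "An example Federal Register document"}]

request = urllib.request.Request(
    url,
    # "replace": True asks the API to overwrite rows with matching primary keys
    data=json.dumps({"rows": rows, "replace": True}).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    },
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))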

The more time I spend writing code against the Datasette API the more confident I get that it’s shaped in the right way. I’m happy to consider it stable for the 1.0 release now.

Talking Large Language Models with Rooftop Ruby

I recorded a podcast episode this week for Rooftop Ruby with Collin Donnell and Joel Drapper. It was a really high quality conversation—we went for about an hour and 20 minutes and covered a huge amount of ground.

After the podcast came out I took the MP3, ran it through MacWhisper and then spent several hours marking up speakers and editing the resulting text. I also added headings corresponding to the different topics we covered, along with inline links to other relevant material.

I’m really pleased with the resulting document, which you can find at Talking Large Language Models with Rooftop Ruby. It was quite a bit of work but I think it was worthwhile—I’ve since been able to answer some questions about LLMs on Mastodon and Twitter by linking directly to the point in the transcript where we discussed them.

I also dropped in my own audio player, developed with GPT-4 assistance, and provided links from the different transcript sections that jump the audio to that point in the conversation.

Also this week: while closing a bunch of VS Code tabs I stumbled across a partially written blog entry about Things I’ve learned about building CLI tools in Python, so I finished that up and published it.

I’m trying to leave fewer unfinished projects lying around on my computer, so if something is 90% finished I’ll try to wrap it up and put it out there to get it off my ever-expanding plate.

llm-llama-cpp

LLM has started to collect a small but healthy community on Discord, which is really exciting.

My absolute favourite community project so far is Drew Breunig’s Faucet Finder, which he described in Finding Bathroom Faucets with Embeddings. He used llm-clip to calculate embeddings for 20,000 pictures of faucets, then ran both similarity and text search against them to help renovate his bathroom. It’s really fun!

I shipped a new version of the llm-llama-cpp plugin this week which was mostly written by other people: llm-llama-cpp 0.2b1. Alexis Métaireau and LoopControl submitted fixes to extend the default max token limit (fixing a frustrating issue with truncated responses) and to allow for increasing the number of GPU layers used to run the models.

I also shipped LLM 0.11, the main feature of which was support for the new OpenAI gpt-3.5-turbo-instruct model. I really need to split the OpenAI support out into a separate plugin so I can ship fixes to that without having to release the core LLM package.

And I put together an llm-plugin cookiecutter template, which I plan to use for all of my plugins going forward.

Getting excited about TG and sqlite-tg

TG is a brand new C library from Tile38 creator Josh Baker. It’s really exciting: it provides a set of fast geospatial operations—the exact subset I usually find myself needing, based around polygon intersections, GeoJSON, WKT, WKB and geospatial indexes—implemented with zero external dependencies. It’s shipped as a single C file, reminiscent of the SQLite amalgamation.

I noted in a few places that it could make a great SQLite extension... and Alex Garcia fell victim to my blatant nerd-sniping and built the first version of sqlite-tg within 24 hours!

I wrote about my own explorations of Alex’s work in Geospatial SQL queries in SQLite using TG, sqlite-tg and datasette-sqlite-tg. I’m thrilled at the idea of having a tiny, lightweight alternative to SpatiaLite as an addition to the Datasette ecosystem, and the SQLite world in general.

Two tiny Datasette releases

I released dot-releases for Datasette:

Both of these feature the same fix, described in Issue 2189: Server hang on parallel execution of queries to named in-memory databases.

Short version: it turns out the experimental work I did a while ago to try running SQL queries in parallel was causing threading deadlock issues against in-memory named SQLite databases. No-one had noticed because those are only available within Datasette plugins, but I started running into them once I began writing my own plugins that used that feature.

ChatGPT in the newsroom

I signed up for a MOOC (Massive Open Online Course) about journalism and ChatGPT!

How to use ChatGPT and other generative AI tools in your newsrooms is being taught by Aimee Rinehart and Sil Hamilton for the Knight Center.

I actually found out about it because people were being snarky about it on Twitter. That’s not a big surprise—there are many obvious problems with applying generative AI to journalism.

As you would hope, this course is not a hype-filled pitch for writing AI-generated news stories. It’s a conversation between literally thousands of journalists around the world about the ethical and practical implications of this technology.

I’m really enjoying it. I’m learning a huge amount about how people experience AI tools, the kinds of questions they have about them and the kinds of journalism problems that make sense for them to solve.

Releases this week

TIL this week

Things I’ve learned about building CLI tools in Python three days ago

I build a lot of command-line tools in Python. It’s become my favorite way of quickly turning a piece of code into something I can use myself and package up for other people to use too.

My biggest CLI projects are sqlite-utils, LLM, shot-scraper and Datasette—but I have dozens of others and I build new ones at the rate of at least one a month. A fun recent example is blip-caption, a tiny CLI wrapper around the Salesforce BLIP model that can generate usable captions for image files.

Here are some notes on what I’ve learned about designing and implementing CLI tools in Python so far.

Starting with a template

I build enough CLI apps that I developed my own Cookiecutter template for starting new ones.

That template is simonw/click-app. You can create a new application from that template directly on GitHub, too—I wrote more about that in Dynamic content for GitHub repository templates using cookiecutter and GitHub Actions.

Arguments, options and conventions

Almost all of my tools are built using the Click Python library. Click encourages a specific way of designing CLI tools which I really like—I find myself annoyed at the various tools from other ecosystems that don’t stick to the conventions that Click encourages.

I’ll try to summarize those conventions here.

  • Commands have arguments and options. Arguments are positional—they are strings that you pass directly to the command, like data.db in datasette data.db. Arguments can be required or optional, and you can have commands which accept an unlimited number of arguments.
  • Options are, usually, optional. They are things like --port 8000. Options can also have a single character shortened version, such as -p 8000.
    • Very occasionally I’ll create an option that is required, usually because a command has so many positional arguments that forcing an option makes its usage easier to read.
  • Some options are flags—they don’t take any additional parameters, they just switch something on. shot-scraper --retina is an example of this.
  • Flags with single character shortcuts can be easily combined—symbex -in fetch_data is short for symbex --imports --no-file fetch_data for example.
  • Some options take multiple parameters. datasette --setting sql_time_limit_ms 10000 is an example, taking both the name of the setting and the value it should be set to.
  • Commands can have sub-commands, which can form their own families of further commands. llm templates is an example of this, with llm templates list and llm templates show and several more.
  • Every command should have help text—the more detailed the better. This can be viewed by running llm --help—or for sub-commands, llm templates --help.

Click makes it absurdly easy and productive to build CLI tools that follow these conventions.
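
To make those conventions concrete, here’s a small sketch of a Click app (the tool, command and option names are invented for this illustration):

import click


@click.group()
def cli():
    "An example tool demonstrating the conventions described above"


@cli.command()
@click.argument("database")  # a required positional argument, like data.db
@click.option("-p", "--port", default=8000, help="Port to listen on")
@click.option("--retina", is_flag=True, help="A flag: takes no value, just switches something on")
@click.option(
    "--setting",
    nargs=2,
    multiple=True,
    help="An option that takes two values, and can be passed more than once",
)
def serve(database, port, retina, setting):
    "Serve DATABASE over HTTP"
    click.echo(f"Serving {database} on port {port}")


if __name__ == "__main__":
    cli()

Click generates the --help output for both the cli group and the serve sub-command automatically from those decorators and docstrings.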

Consistency is everything

As CLI utilities get larger, they can end up with a growing number of commands and options.

The most important thing in designing these is consistency with other existing commands and options (example here)—and with related tools that your user may have used before.

I often turn to GPT-4 for help with this: I’ll ask it for examples of existing CLI tools that do something similar to what I’m about to build, and see if there’s anything in their option design that I can emulate.

Since my various projects are designed to complement each other I try to stay consistent between them as well—I’ll often post an issue comment that says “similar to functionality in X”, with a copy of the --help output for the tool I’m about to imitate.

CLI interfaces are an API—version appropriately

I try to stick to semantic versioning for my projects, bumping the major version number on breaking changes and the minor version number for new features.

The command-line interface to a tool is absolutely part of that documented API. If someone writes a Bash script or a GitHub Actions automation that uses one of my tools, I’m careful to avoid breaking it without bumping my major version number.

Include usage examples in --help

A habit I’ve formed more recently is always trying to include a working example of the command in the --help for that command.

I find I use this a lot for tools I’ve developed myself. All of my tools have extensive online documentation, but I like to be able to consult --help without opening a browser for most of their functionality.

Here’s one of my more involved examples—the help for the sqlite-utils convert command:

Usage: sqlite-utils convert [OPTIONS] DB_PATH TABLE COLUMNS... CODE

  Convert columns using Python code you supply. For example:

      sqlite-utils convert my.db mytable mycolumn \
          '"\n".join(textwrap.wrap(value, 10))' \
          --import=textwrap

  "value" is a variable with the column value to be converted.

  Use "-" for CODE to read Python code from standard input.

  The following common operations are available as recipe functions:

  r.jsonsplit(value, delimiter=',', type=<class 'str'>)

      Convert a string like a,b,c into a JSON array ["a", "b", "c"]

  r.parsedate(value, dayfirst=False, yearfirst=False, errors=None)

      Parse a date and convert it to ISO date format: yyyy-mm-dd
      
      - dayfirst=True: treat xx as the day in xx/yy/zz
      - yearfirst=True: treat xx as the year in xx/yy/zz
      - errors=r.IGNORE to ignore values that cannot be parsed
      - errors=r.SET_NULL to set values that cannot be parsed to null

  r.parsedatetime(value, dayfirst=False, yearfirst=False, errors=None)

      Parse a datetime and convert it to ISO datetime format: yyyy-mm-ddTHH:MM:SS
      
      - dayfirst=True: treat xx as the day in xx/yy/zz
      - yearfirst=True: treat xx as the year in xx/yy/zz
      - errors=r.IGNORE to ignore values that cannot be parsed
      - errors=r.SET_NULL to set values that cannot be parsed to null

  You can use these recipes like so:

      sqlite-utils convert my.db mytable mycolumn \
          'r.jsonsplit(value, delimiter=":")'

Options:
  --import TEXT                   Python modules to import
  --dry-run                       Show results of running this against first
                                  10 rows
  --multi                         Populate columns for keys in returned
                                  dictionary
  --where TEXT                    Optional where clause
  -p, --param <TEXT TEXT>...      Named :parameters for where clause
  --output TEXT                   Optional separate column to populate with
                                  the output
  --output-type [integer|float|blob|text]
                                  Column type to use for the output column
  --drop                          Drop original column afterwards
  --no-skip-false                 Don't skip falsey values
  -s, --silent                    Don't show a progress bar
  --pdb                           Open pdb debugger on first error
  -h, --help                      Show this message and exit.
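
In Click that help text lives in the command’s docstring. One wrinkle: Click re-wraps paragraphs by default, so indented example blocks need to be preceded by a line containing \b to be left intact. A stripped-down sketch of how help like the above gets written (this is an illustration, not the real sqlite-utils source):

import click


@click.command()
@click.argument("db_path")
@click.argument("table")
def convert(db_path, table):
    """
    Convert columns using Python code you supply. For example:

    \b
        example-tool convert my.db mytable mycolumn \\
            '"\\n".join(textwrap.wrap(value, 10))' \\
            --import=textwrap

    "value" is a variable with the column value to be converted.
    """
    click.echo(f"Converting {table} in {db_path}")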

Including --help in the online documentation

My larger tools tend to have extensive documentation independently of their help output. I update this documentation at the same time as the implementation and the tests, as described in The Perfect Commit.

I like to include the --help output in my documentation sites as well. This is mainly for my own purposes—having the help visible on a web page makes it much easier to review it and spot anything that needs updating.

Here are some example pages from my documentation that list --help output:

All of these pages are maintained automatically using Cog. I described the pattern I use for this in Using cog to update --help in a Markdown README file, or you can view source on the Datasette CLI reference for a more involved example.
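
The pattern itself is compact: a Markdown file gets a pair of Cog comment markers, and the Python between them uses Click’s test runner to capture the help text. Roughly like this, with my_tool.cli standing in for whichever CLI is being documented:

<!-- [[[cog
import cog
from click.testing import CliRunner
from my_tool.cli import cli  # stand-in for the real CLI entry point

result = CliRunner().invoke(cli, ["--help"])
cog.out("\n```\n{}\n```\n".format(result.output.strip()))
]]] -->
<!-- [[[end]]] -->

Running cog -r against the file then rewrites everything between the markers with the current --help output.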

Talking Large Language Models with Rooftop Ruby three days ago

I’m on the latest episode of the Rooftop Ruby podcast with Collin Donnell and Joel Drapper, talking all things LLM.

Here’s a full transcript of the episode, which I generated using Whisper and then tidied up manually (after failing to get a good editing job out of Claude and GPT-4). I’ve also provided a link from each section heading to jump to the relevant spot in the recording.

The topics we covered:

You can listen to it on Apple Podcasts, Spotify, Google Podcasts, Podcast Index, Overcast and a bunch of other places.

Weeknotes: Embeddings, more embeddings and Datasette Cloud 15 days ago

Since my last weeknotes, a flurry of activity. LLM has embeddings support now, and Datasette Cloud has driven some major improvements to the wider Datasette ecosystem.

Embeddings in LLM

LLM gained embedding support in version 0.9, and then got binary embedding support (for CLIP) in version 0.10. I wrote about those releases in detail in:

Embeddings are a fascinating tool. If you haven’t got your head around them yet the first of my blog entries tries to explain why they are so interesting.

There’s a lot more I want to build on top of embeddings—most notably, LLM (or Datasette, or likely a combination of the two) will be growing support for Retrieval Augmented Generation on top of the LLM embedding mechanism.

Annotated releases

I always include a list of new releases in my weeknotes. This time I’m going to use those to illustrate the themes I’ve been working on.

The first group of releases relates to LLM and its embedding support. LLM 0.10 extended that support:

  • llm 0.10—2023-09-12
    Access large language models from the command-line

Embedding models can now be built as LLM plugins. I’ve released two of those so far:

The CLIP one is particularly fun, because it genuinely allows you to build a sophisticated image search engine that runs entirely on your own computer!

  • symbex 1.4—2023-09-05
    Find the Python code for specified symbols

Symbex is my tool for extracting symbols—functions, methods and classes—from Python code. I introduced that in Symbex: search Python code for functions and classes, then pipe them into a LLM.

Symbex 1.4 adds a tiny but impactful feature: it can now output a list of symbols as JSON, CSV or TSV. These output formats are designed to be compatible with the new llm embed-multi command, which means you can easily create embeddings for all of your functions:

symbex '*' '*:*' --nl | \
  llm embed-multi symbols - \
  --format nl --database embeddings.db --store

I haven’t fully explored what this enables yet, but it should mean that both related functions and semantic function search (“Find me a function that downloads a CSV”) are now easy to build.

Yet another thing you can do with embeddings is use them to find clusters of related items.

The neatest feature of my new llm-cluster plugin is that you can ask it to generate names for these clusters by sending the names of the items in each cluster through another language model, something like this:

llm cluster issues 10 \
  -d issues.db \
  --summary \
  --prompt 'Short, concise title for this cluster of related documents'

One last embedding related project: datasette-llm-embed is a tiny plugin that adds a select llm_embed('sentence-transformers/all-mpnet-base-v2', 'This is some text') SQL function. I built it to support quickly prototyping embedding-related ideas in Datasette.

Spending time with embedding models has led me to spend more time with Hugging Face. I realized last week that the Hugging Face all models sorted by downloads page doubles as a list of the models that are most likely to be easy to use.

One of the models I tried out was Salesforce BLIP, an astonishing model that can genuinely produce usable captions for images.

It’s really easy to work with. I ended up building this tiny little CLI tool that wraps the model:

  • blip-caption 0.1—2023-09-10
    Generate captions for images with Salesforce BLIP

Releases driven by Datasette Cloud

Datasette Cloud continues to drive improvements to the wider Datasette ecosystem as a whole.

It runs on the latest Datasette 1.0 alpha series, taking advantage of the JSON write API.

This also means that it’s been highlighting breaking changes in 1.0 that have caused old plugins to break, either subtly or completely.

This has driven a bunch of new plugin releases. Some of these are compatible with both 0.x and 1.x—the ones that only work with the 1.x alphas are themselves marked as alpha releases.

Datasette Cloud’s API works using database-backed access tokens, to ensure users can revoke tokens if they need to (something that’s not easily done with purely signed tokens) and that each token can record when it was most recently used.

I’ve been building that into the existing datasette-auth-tokens plugin:

Alex Garcia has been working with me building out features for Datasette Cloud, generously sponsored by Fly.io.

We’re beginning to build out social features for Datasette Cloud—features that will help teams privately collaborate on data investigations together.

Alex has been building datasette-short-links as an experimental link shortener. In building that, we realized that we needed a mechanism for resolving actor IDs displayed in a list (e.g. this link created by X) to their actual names.

Datasette doesn’t dictate the shape of actor representations, and there’s no guarantee that actors would be represented in a predictable table.

So... we needed a new plugin hook. I released Datasette 1.0a6 with a new hook, actors_from_ids(actor_ids), which can be used to answer the question “who are the actors represented by these IDs”.

Alex is using this in datasette-short-links, and I built two plugins to work with the new hook as well:

Datasette Cloud lets users insert, edit and delete rows from their tables, using the plugin Alex built called datasette-write-ui which he introduced on the Datasette Cloud blog.

This inspired me to finally put out a fresh release of datasette-edit-schema—the plugin which provides the ability to edit table schemas—adding and removing columns, changing column types, even altering the order columns are stored in the table.

datasette-edit-schema 0.6 is a major release, with three significant new features:

  • You can now create a brand new table from scratch!
  • You can edit the table’s primary key
  • You can modify the foreign key constraints on the table

Those last two became important when I realized that Datasette’s API is much more interesting if there are foreign key relationships to follow.

Combine that with datasette-write-ui and Datasette Cloud now has a full set of features for building, populating and editing tables—backed by a comprehensive JSON API.

  • sqlite-migrate 0.1a2—2023-09-03
    A simple database migration system for SQLite, based on sqlite-utils

sqlite-migrate is still marked as an alpha, but won’t be for much longer: it’s my attempt at a migration system for SQLite, inspired by Django migrations but with a less sophisticated set of features.

I’m using it in LLM now to manage the schema used to store embeddings, and it’s beginning to show up in some Datasette plugins as well. I’ll be promoting this to non-alpha status pretty soon.

  • sqlite-utils 3.35.1—2023-09-09
    Python CLI utility and library for manipulating SQLite databases

A tiny fix in this, which with hindsight was less impactful than I thought.

I spotted a bug on Datasette Cloud when I configured full-text search on a column, then edited the schema and found that searches no longer returned the correct results.

It turned out the rowid column in SQLite was being rewritten by calls to the sqlite-utils table.transform() method. FTS records are related to their underlying row by rowid, so this was breaking search!

I pushed out a fix for this in 3.35.1. But then... I learned that rowid in SQLite has always been unstable—they are rewritten any time someone VACUUMs a table!

I’ve been designing future features for Datasette that assume that rowid is a useful stable identifier for a row. This clearly isn’t going to work! I’m still thinking through the consequences of it, but I think there may be Datasette features (like the ability to comment on a row) that will only work for tables with a proper primary key.

sqlite-chronicle

  • sqlite-chronicle 0.1—2023-09-11
    Use triggers to track when rows in a SQLite table were updated or deleted

This is very early, but I’m excited about the direction it’s going in.

I keep on finding problems where I want to be able to synchronize various processes with the data in a table.

I built sqlite-history a few months ago, which uses SQLite triggers to create a full copy of the updated data every time a row in a table is edited.

That’s a pretty heavy-weight solution. What if there was something lighter that could achieve a lot of the same goals?

sqlite-chronicle uses triggers to instead create what I’m calling a “chronicle table”. This is a shadow table that records, for every row in the main table, four integer values:

  • added_ms—the timestamp in milliseconds when the row was added
  • updated_ms—the timestamp in milliseconds when the row was last updated
  • version—a constantly incrementing version number, global across the entire table
  • deleted—set to 1 if the row has been deleted

Just storing four integers (plus copies of the primary key) makes this a pretty tiny table, and hopefully one that’s cheap to update via triggers.

But... having this table enables some pretty interesting things—because external processes can track the last version number that they saw and use it to see just which rows have been inserted and updated since that point.
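
Here’s a rough sketch of the idea in SQL run from Python. This is my own illustration of the shape of a chronicle table and the “what changed since version N” query, not the exact schema or triggers that sqlite-chronicle generates:

import sqlite3

db = sqlite3.connect("demo.db")
db.execute("CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, name TEXT)")

# Shadow table: four integers plus a copy of the primary key
db.execute("""
CREATE TABLE IF NOT EXISTS _chronicle_items (
    id INTEGER PRIMARY KEY REFERENCES items(id),
    added_ms INTEGER,
    updated_ms INTEGER,
    version INTEGER,
    deleted INTEGER DEFAULT 0
)""")

# One of the triggers: record a new version number whenever a row is inserted
# (timestamps here are at second precision purely to keep the example short)
db.execute("""
CREATE TRIGGER IF NOT EXISTS items_chronicle_insert AFTER INSERT ON items
BEGIN
    INSERT INTO _chronicle_items (id, added_ms, updated_ms, version)
    VALUES (
        new.id,
        CAST(strftime('%s', 'now') AS INTEGER) * 1000,
        CAST(strftime('%s', 'now') AS INTEGER) * 1000,
        (SELECT COALESCE(MAX(version), 0) + 1 FROM _chronicle_items)
    );
END""")

db.execute("INSERT INTO items (name) VALUES ('hello')")

# An external process remembers the highest version it has seen,
# then asks for just the rows that changed after that point
last_seen_version = 0
changed = db.execute("""
    SELECT items.id, items.name, chronicle.version
    FROM items JOIN _chronicle_items AS chronicle USING (id)
    WHERE chronicle.version > ?
""", (last_seen_version,)).fetchall()
print(changed)

The deleted flag and an update trigger would follow the same pattern.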

I gave a talk at DjangoCon a few years ago called the denormalized query engine pattern, describing the challenge of syncing an external search index like Elasticsearch with data held in a relational database.

These chronicle tables can solve that problem, and can be applied to a whole host of other problems too. So far I’m thinking about the following:

  • Publishing SQLite databases up to Datasette, sending only the rows that have changed since the last sync. I wrote a prototype that does this and it seems to work very well.
  • Copying a table from Datasette Cloud to other places—a desktop copy, or another instance, or even into an alternative database such as PostgreSQL or MySQL, in a way that only copies and deletes rows that have changed.
  • Saved search alerts: run a SQL query against just rows that were modified since the last time that query ran, then send alerts if any rows are matched.
  • Showing users a note that “34 rows in this table have changed since your last visit”, then displaying those rows.

I’m sure there are many more applications for this. I’m looking forward to finding out what they are!

I needed to fix a bug in Datasette Cloud by moving a table from one database to another... so I built a little plugin for sqlite-utils that adds a sqlite-utils move-tables origin.db destination.db tablename command. I love being able to build single-use features as plugins like this.

And some TILs

This was a fun TIL exercising the new embeddings feature in LLM. I used Django SQL Dashboard to break up my blog entries into paragraphs, exported those as CSV, piped them into llm embed-multi and then used the results to build a CLI-driven semantic search engine for my blog.

llama-cpp has grammars now, which enable you to control the exact output format of the LLM. I’m optimistic that these could be used to implement an equivalent to OpenAI Functions on top of Llama 2 and similar models. So far I’ve just got them to output arrays of JSON objects.

I’m using this trick a lot at the moment. I have API access to Claude now, which has a 100,000 token context limit (GPT-4 is just 8,000 by default). That’s enough to summarize 100+ comment threads from Hacker News, for which I’m now using this prompt:

Summarize the themes of the opinions expressed here, including quotes (with author attribution) where appropriate.

The quotes part has been working really well—it turns out summaries of themes with illustrative quotes are much more interesting, and so far my spot checks haven’t found any that were hallucinated.

cr-sqlite adds full CRDTs to SQLite, which should enable multiple databases to accept writes independently and then seamlessly merge them together. It’s a very exciting capability!

It turns out Hugging Face offers free scale-to-zero hosting for demos that run in Docker containers on machines with a full 16GB of RAM! I’m used to optimizing Datasette for tiny 256MB containers, so having this much memory available is a real treat.

And the rest:

Build an image search engine with llm-clip, chat with models with llm chat 20 days ago

LLM is my combination CLI tool and Python library for working with Large Language Models. I just released LLM 0.10 with two significant new features: embedding support for binary files and the llm chat command.

Image search by embedding images with CLIP

I wrote about LLM’s support for embeddings (including what those are and why they’re interesting) when I released 0.9 last week.

That initial release could only handle embeddings of text—great for things like building semantic search and finding related content, but not capable of handling other types of data.

It turns out there are some really interesting embedding models for working with binary data. Top of the list for me is CLIP, released by OpenAI in January 2021.

CLIP has a really impressive trick up its sleeve: it can embed both text and images into the same vector space.

This means you can create an index for a collection of photos, each placed somewhere in 512-dimensional space. Then you can take a text string—like “happy dog”—and embed that into the same space. The images that are closest to that location will be the ones that contain happy dogs!
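
Here’s a tiny sketch of that trick using the SentenceTransformers library directly (which is also how my llm-clip plugin, described next, loads the model). The checkpoint name and image path are illustrative:

from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # a CLIP checkpoint packaged for SentenceTransformers

# Both of these land in the same 512-dimensional space
image_vector = model.encode(Image.open("photos/raccoon.jpeg"))
text_vector = model.encode("a raccoon in a trash can")

# Cosine similarity: closer to 1.0 means the image matches the text more closely
print(util.cos_sim(image_vector, text_vector))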

My llm-clip plugin provides the CLIP model, loaded via SentenceTransformers. You can install and run it like this:

llm install llm-clip
llm embed-multi photos --files photos/ '*.jpg' --binary -m clip

This will install the llm-clip plugin, then use embed-multi to embed all of the JPEG files in the photos/ directory using the clip model.

The resulting embedding vectors are stored in an embedding collection called photos. This defaults to going in the embeddings.db SQLite database managed by LLM, or you can add -d photos.db to store it in a separate database instead.

Then you can run text similarity searches against that collection using llm similar:

llm similar photos -c 'raccoon'

I get back:

{"id": "IMG_4801.jpeg", "score": 0.28125139257127457, "content": null, "metadata": null}
{"id": "IMG_4656.jpeg", "score": 0.26626441704164294, "content": null, "metadata": null}
{"id": "IMG_2944.jpeg", "score": 0.2647445926996852, "content": null, "metadata": null}

And sure enough, IMG_4801.jpeg is this:

A night time blurry photo of a Raccoon in a trash can. A stencilled label on the bin below the raccoon's face says TRASH

(I was pretty excited to snap a photo of a trash panda in an actual trash can.)

CLIP is a pretty old model at this point, and there are plenty of interesting alternatives that are just waiting for someone to wrap them in a plugin. I’m particularly excited about Facebook’s ImageBind, which can embed images, text, audio, depth, thermal, and IMU data all in the same vector space!

Chat with models using llm chat

The other big feature in LLM 0.10 is the new llm chat command.

Prior to this release, the way to have an ongoing conversation with a model was through the -c/--continue flag. You could start a conversation like this:

llm -m gpt-4 'Help me understand generators in Python'

Then ask a follow-up question using -c like so:

llm -c 'Show me an example involving the cast of Riverdale'

This works well, and everything gets logged to SQLite so you can run llm logs -c to see a full log of your most recent conversation.

You could continue a past conversation as well, using llm --cid ID to specify the conversation ID, recorded in those logs.

There was one big problem though: LLM plugins allow you to install models on your own machine, but a conversation with them that spans multiple command invocations resulted in that model being loaded into memory and then un-loaded again for every prompt.

The new llm chat solves that. You can now run llm chat -m model_id to start an interactive chat in your terminal with your model of choice.

Here’s an example using Llama 2 13B, installed via the llm-mlc plugin.

llm chat -m mlc-chat-Llama-2-13b-chat-hf-q4f16_1

You can set an alias for the model to make that easier to remember.

Here’s an example chat session with Llama 2:

Type 'exit' or 'quit' to exit
Type '!multi' to enter multiple lines, then '!end' to finish
> Who are you?
Hello! I'm just an AI, here to assist you with any questions you may have.
My name is LLaMA, and I'm a large language model trained to provide helpful
and informative responses to a wide range of questions and topics. I'm here
to help you find the information you need, while ensuring a safe and
respectful conversation. Please feel free to ask me anything, and I'll do my
best to provide a helpful and accurate response.
> Tell me a joke about otters
Sure, here's a joke about otters:

Why did the otter go to the party?

Because he heard it was a "whale" of a time!

(Get it? Whale, like a big sea mammal, but also a "wild" or "fun" time.
Otters are known for their playful and social nature, so it's a lighthearted
and silly joke.)

I hope that brought a smile to your face! Do you have any other questions or
topics you'd like to discuss?
> exit

I like how Llama 2 enthusiastically explains its own jokes! This is a great demonstration of why techniques like RLHF are so important—you need to go way beyond a raw language model if you’re going to teach one not to be this corny.

Each line of your chat will be executed as soon as you hit <enter>. Sometimes you might need to enter a multi-line prompt, for example if you need to paste in an error message. You can do that using the !multi token, like this:

llm chat -m gpt-4
Chatting with gpt-4
Type 'exit' or 'quit' to exit
Type '!multi' to enter multiple lines, then '!end' to finish
> !multi custom-end
 Explain this error:

   File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.10/urllib/request.py", line 1391, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.10/urllib/request.py", line 1351, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>

 !end custom-end

llm chat also supports system prompts and templates. If you want to chat with a sentient cheesecake, try this:

llm chat -m gpt-3.5-turbo --system '
You are a stereotypical sentient cheesecake with strong opinions
who always talks about cheesecake'

You can save those as templates too:

llm --system 'You are a stereotypical sentient cheesecake with
strong opinions who always talks about cheesecake' --save cheesecake -m gpt-4
llm chat -t cheesecake

For more options, see the llm chat documentation.

Get involved

My ambition for LLM is for it to provide the easiest way to try out new models, both full-sized Large Language Models and now embedding models such as CLIP.

I’m not going to write all of these plugins myself!

If you want to help out, please come and say hi in the #llm Discord channel.

LLM now provides tools for working with embeddings 28 days ago

LLM is my Python library and command-line tool for working with language models. I just released LLM 0.9 with a new set of features that extend LLM to provide tools for working with embeddings.

This is a long post with a lot of theory and background. If you already know what embeddings are, here’s a TLDR you can try out straight away:

# Install LLM
pip install llm

# If you already installed via Homebrew/pipx you can upgrade like this:
llm install -U llm

# Install the llm-sentence-transformers plugin
llm install llm-sentence-transformers

# Install the all-MiniLM-L6-v2 embedding model
llm sentence-transformers register all-MiniLM-L6-v2

# Generate and store embeddings for every README.md in your home directory, recursively
llm embed-multi readmes \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --files ~/ '**/README.md'
  # Add --store to store the text content as well

# Run a similarity search for "sqlite" against those embeddings
llm similar readmes -c sqlite

For everyone else, read on and the above example should hopefully all make sense.

Embeddings

Embeddings are a fascinating concept within the larger world of language models.

I explained embeddings in my recent talk, Making Large Language Models work for you. The relevant section of the slides and transcript is here, or you can jump to that section on YouTube.

An embedding model lets you take a string of text—a word, sentence, paragraph or even a whole document—and turn that into an array of floating point numbers called an embedding vector.

On the left is a text post from one of my sites: Storing and serving related documents with openai-to-sqlite and embeddings. An arrow points to a huge JSON array on the right, with the label 1536 floating point numbers.

A model will always produce the same length of array—1,536 numbers for the OpenAI embedding model, 384 for all-MiniLM-L6-v2—but the array itself is inscrutable. What are you meant to do with it?

The answer is that you can compare them. I like to think of an embedding vector as a location in 1,536-dimensional space. The distance between two vectors is a measure of how semantically similar they are, at least according to the model that produced them.

A location in 1,536-dimension space: a 3D plot with 400 red dots arranged randomly across three axes.

“One happy dog” and “A playful hound” will end up close together, even though they don’t share any keywords. The embedding vector represents the language model’s interpretation of the meaning of the text.
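
The comparison itself is usually cosine similarity, which is what LLM’s similarity search uses. A toy sketch with made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot / (magnitude_a * magnitude_b)

# Made-up vectors, three dimensions instead of 1,536 so they fit on the page
one_happy_dog = [0.91, 0.32, 0.11]
a_playful_hound = [0.89, 0.35, 0.08]
spreadsheet_formulas = [0.02, 0.77, 0.61]

print(cosine_similarity(one_happy_dog, a_playful_hound))       # ~0.999: very similar
print(cosine_similarity(one_happy_dog, spreadsheet_formulas))  # ~0.35: not similar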

Things you can do with embeddings include:

  1. Find related items. I use this on my TIL site to display related articles, as described in Storing and serving related documents with openai-to-sqlite and embeddings.
  2. Build semantic search. As shown above, an embeddings-based search engine can find content relevant to the user’s search term even if none of the keywords match.
  3. Implement retrieval augmented generation—the trick where you take a user’s question, find relevant documentation in your own corpus and use that to get an LLM to spit out an answer. More on that here.
  4. Clustering: you can find clusters of nearby items and identify patterns in a corpus of documents.
  5. Classification: calculate the embedding of a piece of text and compare it to pre-calculated “average” embeddings for different categories.

LLM’s new embedding features

My goal with LLM is to provide a plugin-driven abstraction around a growing collection of language models. I want to make installing, using and comparing these models as easy as possible.

The new release adds several command-line tools for working with embeddings, plus a new Python API for working with embeddings in your own code.

It also adds support for installing additional embedding models via plugins. I’ve released one plugin for this so far: llm-sentence-transformers, which adds support for new models based on the sentence-transformers library.

The example above shows how to use sentence-transformers. LLM also supports API-driven access to the OpenAI ada-002 model.

Here’s how to embed some text using ada-002, assuming you have installed LLM already:

# Set your OpenAI API key
llm keys set openai
# <paste key here>

# Embed some text
llm embed -m ada-002 -c "Hello world"

This will output a huge JSON list of floating point numbers to your terminal. You can add -f base64 (or -f hex) to get that back in a different format, though none of these outputs are instantly useful.

Embeddings are much more interesting when you store them.

LLM already uses SQLite to store prompts and responses. It was a natural fit to use SQLite to store embeddings as well.

Embedding collections

LLM 0.9 introduces the concept of a collection of embeddings. A collection has a name—like readmes—and contains a set of embeddings, each of which has an ID and an embedding vector.

All of the embeddings in a collection are generated by the same model, to ensure they can be compared with each other.

The llm embed command can store the vector in the database instead of returning it to the console. Pass it the name of an existing (or to-be-created) collection and the ID to use to store the embedding.

Here we’ll store the embedding for the phrase “Hello world” in a collection called phrases with the ID hello, using that ada-002 embedding model:

llm embed phrases hello -m ada-002 -c "Hello world"

Future phrases can be added without needing to specify the model again, since it is remembered by the collection:

llm embed phrases goodbye -c "Goodbye world"

The llm embed-db collections command shows a list of collections:

phrases: ada-002
  2 embeddings
readmes: sentence-transformers/all-MiniLM-L6-v2
  16796 embeddings

The data is stored in a SQLite embeddings table with the following schema:

CREATE TABLE [collections] (
   [id] INTEGER PRIMARY KEY,
   [name] TEXT,
   [model] TEXT
);
CREATE TABLE "embeddings" (
   [collection_id] INTEGER REFERENCES [collections]([id]),
   [id] TEXT,
   [embedding] BLOB,
   [content] TEXT,
   [content_hash] BLOB,
   [metadata] TEXT,
   [updated] INTEGER,
   PRIMARY KEY ([collection_id], [id])
);

CREATE UNIQUE INDEX [idx_collections_name]
    ON [collections] ([name]);
CREATE INDEX [idx_embeddings_content_hash]
    ON [embeddings] ([content_hash]);

By default this is the SQLite database at the location revealed by llm embed-db path, but you can pass --database my-embeddings.db to various LLM commands to use a different database.

Each embedding vector is stored as a binary BLOB in the embedding column, consisting of those floating point numbers packed together as 32 bit floats.
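
A quick sketch of what that packing looks like. The byte order here is an assumption for illustration; the llm.encode() and llm.decode() helpers described later in this post handle the real format:

import struct

vector = [0.25, -0.5, 0.125]  # a tiny example embedding

# Pack the floats into a contiguous run of 32 bit values
blob = struct.pack("<" + "f" * len(vector), *vector)
print(len(blob))  # 4 bytes per float: 12

# And unpack the BLOB back into a list of floats
print(list(struct.unpack("<" + "f" * len(vector), blob)))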

The content_hash column contains an MD5 hash of the content. This helps avoid re-calculating the embedding (which can cost actual money for API-based embedding models like ada-002) unless the content has changed.

The content column is usually null, but can contain a copy of the original text content if you pass the --store option to the llm embed command.

metadata can contain a JSON object with metadata, if you pass --metadata '{"json": "goes here"}'.

You don’t have to pass content using -c—you can instead pass a file path using the -i/--input option:

llm embed docs llm-setup -m ada-002 -i llm/docs/setup.md

Or pipe things to standard input like this:

cat llm/docs/setup.md | llm embed docs llm-setup -m ada-002 -i -

Embedding similarity search

Once you’ve built a collection, you can search for similar embeddings using the llm similar command.

The -c "term" option will embed the text you pass in using the embedding model for the collection and use that as the comparison vector:

llm similar readmes -c sqlite

You can also pass the ID of an object in that collection to use that embedding instead. This gets you related documents, for example:

llm similar readmes sqlite-utils/README.md

The output from this command is currently newline-delimited JSON.

Embedding in bulk

The llm embed command embeds a single string at a time. llm embed-multi is much more powerful: you can feed a CSV or JSON file, a SQLite database or even have it read from a directory of files in order to embed multiple items at once.

Many embedding models are optimized for batch operations, so embedding multiple items at a time can provide a significant speed boost.

The embed-multi command is described in detail in the documentation. Here are a couple of fun things you can do with it.

First, I’m going to create embeddings for every single one of my Apple Notes.

My apple-notes-to-sqlite tool can export Apple Notes to a SQLite database. I’ll run that first:

apple-notes-to-sqlite notes.db

This took quite a while to run on my machine and generated an 828MB SQLite database containing 6,462 records!

Next, I’m going to embed the content of all of those notes using the sentence-transformers/all-MiniLM-L6-v2 model:

llm embed-multi notes \
  -d notes.db \
  --sql 'select id, title, body from notes' \
  -m sentence-transformers/all-MiniLM-L6-v2

This took around 15 minutes to run, and increased the size of my database by 13MB.

The --sql option here specifies a SQL query. The first column must be an id, then any subsequent columns will be concatenated together to form the content to embed.

In this case the embeddings are written back to the same notes.db database that the content came from.

And now I can run embedding similarity operations against all of my Apple notes!

llm similar notes -d notes.db -c 'ideas for blog posts'

Embedding files in a directory

Let’s revisit the example from the top of this post. In this case, I’m using the --files option to search for files on disk and embed each of them:

llm embed-multi readmes \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --files ~/ '**/README.md'

The --files option takes two arguments: a path to a directory and a pattern to match against filenames. In this case I’m searching my home directory recursively for any files named README.md.

Running this command gives me embeddings for all of my README.md files, which I can then search against like this:

llm similar readmes -c sqlite

Embeddings in Python

So far I’ve only covered the command-line tools. LLM 0.9 also introduces a new Python API for working with embeddings.

There are two aspects to this. If you just want to embed content and handle the resulting vectors yourself, you can use llm.get_embedding_model():

import llm

# This takes model IDs and aliases defined by plugins:
model = llm.get_embedding_model("sentence-transformers/all-MiniLM-L6-v2")
vector = model.embed("This is text to embed")

vector will then be a Python list of floating point numbers.

You can serialize that to the same binary format that LLM uses like this:

binary_vector = llm.encode(vector)
# And to deserialize:
vector = llm.decode(binary_vector)

The second aspect of the Python API is the llm.Collection class, for working with collections of embeddings. This example code is quoted from the documentation:

import sqlite_utils
import llm

# This collection will use an in-memory database that will be
# discarded when the Python process exits
collection = llm.Collection("entries", model_id="ada-002")

# Or you can persist the database to disk like this:
db = sqlite_utils.Database("my-embeddings.db")
collection = llm.Collection("entries", db, model_id="ada-002")

# You can pass a model directly using model= instead of model_id=
embedding_model = llm.get_embedding_model("ada-002")
collection = llm.Collection("entries", db, model=embedding_model)

# Store a string in the collection with an ID:
collection.embed("hound", "my happy hound")

# Or to store content and extra metadata:
collection.embed(
    "hound",
    "my happy hound",
    metadata={"name": "Hound"},
    store=True
)

# Or embed things in bulk:
collection.embed_multi(
    [
        ("hound", "my happy hound"),
        ("cat", "my dissatisfied cat"),
    ],
    # Add this to store the strings in the content column:
    store=True,
)

As with everything else in LLM, the goal is that anything you can do with the CLI can be done with the Python API, and vice-versa.

Clustering with llm-cluster

Another interesting application of embeddings is that you can use them to cluster content—identifying patterns in a corpus of documents.

I’ve started exploring this area with a new plugin, called llm-cluster.

You can install it like this:

llm install llm-cluster

Let’s create a new collection using data pulled from GitHub. I’m going to import all of the LLM issues from the GitHub API, using my paginate-json tool:

paginate-json 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
  | jq '[.[] | {id: .id, title: .title}]' \
  | llm embed-multi llm-issues - \
    --database issues.db \
    --model sentence-transformers/all-MiniLM-L6-v2 \
    --store

Running this gives me an issues.db SQLite database with 218 embeddings contained in a collection called llm-issues.

Now let’s try out the new llm cluster command added by the plugin, requesting ten clusters from that collection:

llm cluster llm-issues --database issues.db 10

The output from this command, truncated, looks like this:

[
  {
    "id": "0",
    "items": [
      {
        "id": "1784149135",
        "content": "Tests fail with pydantic 2"
      },
      {
        "id": "1837084995",
        "content": "Allow for use of Pydantic v1 as well as v2."
      },
      {
        "id": "1857942721",
        "content": "Get tests passing against Pydantic 1"
      }
    ]
  },
  {
    "id": "1",
    "items": [
      {
        "id": "1724577618",
        "content": "Better ways of storing and accessing API keys"
      },
      {
        "id": "1772024726",
        "content": "Support for `-o key value` options such as `temperature`"
      },
      {
        "id": "1784111239",
        "content": "`--key` should be used in place of the environment variable"
      }
    ]
  },
  {
    "id": "8",
    "items": [
      {
        "id": "1835739724",
        "content": "Bump the python-packages group with 1 update"
      },
      {
        "id": "1848143453",
        "content": "Python library support for adding aliases"
      },
      {
        "id": "1857268563",
        "content": "Bump the python-packages group with 1 update"
      }
    ]
  }
]

These look pretty good! But wouldn’t it be neat if we had a snappy title for each one?

The --summary option can provide exactly that, by piping the members of each cluster through a call to another LLM in order to generate a useful summary.

llm cluster llm-issues --database issues.db 10 --summary

This uses gpt-3.5-turbo to generate a summary for each cluster, with this default prompt:

Short, concise title for this cluster of related documents.

The results I got back are pretty good, including:

  • Template Storage and Management Improvements
  • Package and Dependency Updates and Improvements
  • Adding Conversation Mechanism and Tools

I tried the same thing using a Llama 2 model running on my own laptop, with a custom prompt:

llm cluster llm-issues --database issues.db 10 \
  --summary --model mlc-chat-Llama-2-13b-chat-hf-q4f16_1 \
  --prompt 'Concise title for this cluster of related documents, just return the title'

I didn’t quite get what I wanted! Llama 2 is proving a lot harder to prompt, so each cluster came back with something that looked like this:

Sure! Here’s a concise title for this cluster of related documents:

“Design Improvements for the Neat Prompt System”

This title captures the main theme of the documents, which is to improve the design of the Neat prompt system. It also highlights the focus on improving the system’s functionality and usability

llm-cluster only took a few hours to throw together, which I’m seeing as a positive indicator that the LLM library is developing in the right direction.

Future plans

The two future features I’m most excited about are indexing and chunking.

Indexing

The llm similar command and collection.similar() Python method currently use effectively the slowest brute force approach possible: calculate the cosine distance between the input vector and every other embedding in the collection, then sort the results.

This works fine for collections with a few hundred items, but will start to suffer for collections of 100,000 or more.

There are plenty of potential ways of speeding this up: you can run a vector index like FAISS or hnswlib, use a database extension like sqlite-vss or pgvector, or turn to a hosted vector database like Pinecone or Milvus.

With this many potential solutions, the obvious answer for LLM is to address this with plugins.

I’m still thinking through the details, but the core idea is that users should be able to define an index against one or more collections, and LLM will then coordinate updates to that index. These may not happen in real-time—some indexes can be expensive to rebuild, so there are benefits to applying updates in batches.

I experimented with FAISS earlier this year in datasette-faiss. That’s likely to be the base for my first implementation.
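
To give a flavour of what an index plugin might do, here’s a rough sketch of brute-force FAISS built over embeddings pulled straight out of that SQLite schema. It isn’t the plugin API (which doesn’t exist yet), just the general shape:

import sqlite3
import faiss  # pip install faiss-cpu
import numpy as np
import llm

db = sqlite3.connect("embeddings.db")
# collection_id 1 is a placeholder: look it up in the collections table by name
rows = db.execute(
    "select id, embedding from embeddings where collection_id = ?", (1,)
).fetchall()

ids = [row[0] for row in rows]
vectors = np.array([llm.decode(row[1]) for row in rows], dtype="float32")

# IndexFlatL2 is still brute force - FAISS's approximate indexes are where the speedups live
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)

# Find the five stored embeddings nearest to the first one
distances, positions = index.search(vectors[:1], 5)
print([ids[i] for i in positions[0]])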

The embeddings table has an updated timestamp column to support this use-case—so indexers can run against just the items that have changed since the last indexing run.

Follow issue #216 for updates on this feature.

Chunking

When building an embeddings-based search engine, the hardest challenge is deciding how best to “chunk” the documents.

Users will type in short phrases or questions. The embedding for a four word question might not necessarily map closely to the embedding of a thousand word article, even if the article itself should be a good match for that query.

To maximize the chance of returning the most relevant content, we need to be smarter about what we embed.

I’m still trying to get a good feeling for the strategies that make sense here. Some that I’ve seen include:

  • Split a document up into fixed-length shorter segments.
  • Split into segments, but include a ~10% overlap with the previous and next segments, to reduce problems caused by words and sentences being split in a way that disrupts their semantic meaning.
  • Split by sentence, using NLP techniques.
  • Split into higher-level sections, based on things like document headings.

Then there are more exciting, LLM-driven approaches:

  • Generate an LLM summary of a document and embed that.
  • Ask an LLM “What questions are answered by the following text?” and then embed each of the resulting questions!

It’s possible to try out these different techniques using LLM already: write code that does the splitting, then feed the results to Collection.embed_multi() or llm embed-multi.
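
Here’s a minimal sketch of that, using the fixed-length-with-overlap strategy from the list above and the Collection API described earlier (file name and chunk sizes are arbitrary):

import sqlite_utils
import llm


def chunks(text, length=200, overlap=20):
    "Yield chunks of `length` words, each overlapping the previous one by `overlap` words"
    words = text.split()
    step = length - overlap
    for start in range(0, len(words), step):
        yield " ".join(words[start:start + length])


db = sqlite_utils.Database("chunked.db")
collection = llm.Collection("chunked-docs", db, model_id="ada-002")

document = open("example-document.txt").read()

collection.embed_multi(
    [(f"example-document:{i}", chunk) for i, chunk in enumerate(chunks(document))],
    store=True,
)

Swapping in a different chunks() implementation is then the only change needed to try another strategy.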

But... it would be really cool if LLM could split documents for you—with the splitting techniques themselves defined by plugins, to make it easy to try out new approaches.

Get involved

It should be clear by now that the potential scope of the LLM project is enormous. I’m trying to use plugins to tie together a vast and rapidly growing ecosystem of models and techniques into something that’s as easy for people to work with and build on as possible.

There are plenty of ways you can help!

  • Join the #llm Discord to talk about the project.
  • Try out plugins and run different models with them. There are 12 plugins already, and several of those can be used to run dozens if not hundreds of models (llm-mlc, llm-gpt4all and llm-llama-cpp in particular). I’ve hardly scratched the surface of these myself, and I’m testing exclusively on Apple Silicon. I’m really keen to learn more about which models work well, which models don’t and which perform the best on different hardware.
  • Try building a plugin for a new model. My dream here is that every significant Large Language Model will have an LLM plugin that makes it easy to install and use.
  • Build stuff using LLM and let me know what you’ve built. Nothing fuels an open source project more than stories of cool things people have built with it.

Elsewhere

Yesterday

  • Weird A.I. Yankovic, a cursed deep dive into the world of voice cloning. Andy Baio reports back on his investigations into the world of AI voice cloning.

    This is no longer a niche interest. There’s a Discord with 500,000 members sharing tips and tricks on cloning celebrity voices in order to make their own cover songs, often built with Google Colab using models distributed through Hugging Face.

    Andy then makes his own, playing with the concept “What if every Weird Al song was the original, and every other artist was covering his songs instead?”

    I particularly enjoyed Madonna’s cover of “Like A Surgeon”, Lady Gaga’s “Perform This Way” and Lorde’s “Foil”. #2nd October 2023, 6:50 pm
  • [On Python 3.12 subinterpreters] there’s massive advantages for mixed C(++) and Python: I can now have multiple sub interpreters running concurrently and accessing the same shared state in a thread-safe C++ library.

    Previously this required rewriting the whole C++ library to support either pickling (multiplying the total memory consumption by the number of cores), or support allocating everything in shared memory (which means normal C++ types like `std::string` are unusable, need to switch e.g. to boost::interprocess).

    Now is sufficient to pickle a pointer to a C++ object as an integer, and it’ll still be a valid pointer in the other subinterpreter.

    ynik # 2nd October 2023, 6:13 pm

  • jq 1.7. First new release of jq in five years! The project has moved from a solo maintainer to a new team with a dedicated GitHub organization. A ton of new features in this release—I’m most excited about the new pick(.key1, .key2.nested) builtin for emitting a selected subset of the incoming objects, and the --raw-output0 option which outputs zero byte delimited lists, designed to be piped to “xargs -0”. #2nd October 2023, 4:58 am
  • I think that discussions of this technology become much clearer when we replace the term AI with the word “automation”. Then we can ask:

    What is being automated?
    Who’s automating it and why?
    Who benefits from that automation?
    How well does the automation work in its use case that we’re considering?
    Who’s being harmed?
    Who has accountability for the functioning of the automated system?
    What existing regulations already apply to the activities where the automation is being used?

    Emily M. Bender # 2nd October 2023, 12:20 am

1st October 2023

  • Database Migrations. Vadim Kravcenko provides a useful, in-depth description of the less obvious challenges of applying database migrations successfully. Vadim uses and likes Django’s migrations (as do I) but notes that running them at scale still involves a number of thorny challenges.

    The biggest of these, which I’ve encountered myself multiple times, is that if you want truly zero downtime deploys you can’t guarantee that your schema migrations will be deployed at the exact same instant as changes you make to your application code.

    This means all migrations need to be forward-compatible: you need to apply a schema change in a way that your existing code will continue to work error-free, then ship the related code change as a separate operation.

    Vadim describes what this looks like in detail for a number of common operations: adding a field, removing a field and changing a field that has associated business logic implications. He also discusses the importance of knowing when to deploy a dual-write strategy. I’ve included a minimal sketch of the add-a-field version of this pattern after this day’s links. #1st October 2023, 11:55 pm
  • Observable notebook: Detect objects in images (via) I built an Observable notebook that uses Transformers.js and the Xenova/detr-resnet-50 model to detect objects in images, entirely running within your browser. You can select an image using a file picker and it will show you that image with bounding boxes and labels drawn around items within it. I have a demo image showing some pelicans flying overhead, but it works with any image you give it—all without uploading that image to a server. #1st October 2023, 3:46 pm
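Here’s the forward-compatible migration sketch promised above. It’s a hypothetical Django example (the orders app, Order model and fulfillment_status field are all made up), showing the usual way to split an added column across deploys:

```python
# Deploy 1: schema change only. The new column is nullable, so application
# servers that have never heard of it keep working while old and new code
# overlap during the rollout.
from django.db import migrations, models


class Migration(migrations.Migration):
    dependencies = [("orders", "0007_previous_migration")]

    operations = [
        migrations.AddField(
            model_name="order",
            name="fulfillment_status",
            field=models.CharField(max_length=32, null=True),
        ),
    ]

# Deploy 2: ship the application code that reads and writes the new column.
# Deploy 3: backfill existing rows, then a final migration can tighten the
# column to NOT NULL once every row has a value.
```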

30th September 2023

  • Get Your Mac Python From Python.org. Glyph recommends the official Python installer from python.org as the best way to get started with a Python environment on macOS—with require-virtualenv = true in your ~/.pip/pip.conf to help avoid accidentally installing global packages. #30th September 2023, 2:39 am
  • Meta in Myanmar, Part I: The Setup. The first in a series by Erin Kissane explaining in detail exactly how things went so incredibly wrong with Facebook in Myanmar, contributing to a genocide ending hundreds of thousands of lives. This is an extremely tough read. #30th September 2023, 2:27 am

29th September 2023

  • Draggable objects (via) Amit Patel’s detailed write-up of a small but full-featured JavaScript function for creating draggable objects, with support for both mouse and touch devices “using browser features that are widely supported since 2020”. #29th September 2023, 7:56 pm

28th September 2023

  • Getting started with the Datasette Cloud API. I wrote an introduction to the Datasette Cloud API for the company blog, with a tutorial showing how to use Python and GitHub Actions to import data from the Federal Register into a table in Datasette Cloud, then configure full-text search against it. There’s a short sketch of what a write API call looks like after this day’s links. #28th September 2023, 11:05 pm
  • Looking at LLMs as chatbots is the same as looking at early computers as calculators. We’re seeing an emergence of a whole new computing paradigm, and it is very early.

    Andrej Karpathy # 28th September 2023, 8:50 pm
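Here’s the write API sketch mentioned under the Datasette Cloud link above. The space name, table and token are all made up, and the URL follows the /data/table-name/-/insert shape of the Datasette 1.0 alpha write API; check the Datasette Cloud documentation for the exact details for your own space.

```python
import json
import urllib.request

# Hypothetical space and table; the token comes from the Datasette Cloud API
# tokens interface and needs insert-row permission against this table.
url = "https://demos.datasette.cloud/data/documents/-/insert"
token = "dstok_..."  # placeholder - use a real token here

payload = {"rows": [
    {"title": "Example document", "published": "2023-09-28"},
]}

request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    },
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))
```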

27th September 2023

  • Google was accidentally leaking its Bard AI chats into public search results. I’m quoted in this piece about yesterday’s Bard privacy bug: it turned out the share URL and “Let anyone with the link see what you’ve selected” feature wasn’t correctly setting a noindex parameter, and so some shared conversations were being swept up by the Google search crawlers. Thankfully this was a mistake, not a deliberate design decision, and it should be fixed by now. #27th September 2023, 7:35 pm
  • Finding Bathroom Faucets with Embeddings. Absolutely the coolest thing I’ve seen someone build on top of my LLM tool so far: Drew Breunig is renovating a bathroom and needed a way to filter through literally thousands of options for faucet taps. He scraped 20,000 images of fixtures from a plumbing supply site and used LLM to embed every one of them via CLIP... and now he can ask for “faucets that look like this one”, or even run searches for faucets that match “Gawdy” or “Bond Villain” or “Nintendo 64”. Live demo included! #27th September 2023, 6:18 pm
  • Optimizing for Taste. David Cramer’s detailed explanation as to why his company Sentry mostly avoids A/B testing. David wrote this as an internal blog post originally, but is now sharing it with the world. I found myself nodding along vigorously as I read this—lots of astute observations here.

    I particularly appreciated his closing note: “The strength of making a decision is making it. You can always make a new one later. Choose the obvious path forward, and if you don’t see one, find someone who does.” #27th September 2023, 4:34 am
  • The profusion of dubious A.I.-generated content resembles the badly made stockings of the nineteenth century. At the time of the Luddites, many hoped the subpar products would prove unacceptable to consumers or to the government. Instead, social norms adjusted.

    Kyle Chayka # 27th September 2023, 12:26 am

26th September 2023

  • Rethinking the Luddites in the Age of A.I. I’ve been staying way clear of comparisons to Luddites in conversations about the potential harmful impacts of modern AI tools, because it seemed to me like an offensive, unproductive cheap shot.

    This article has shown me that the comparison is actually a lot more relevant—and sympathetic—than I had realized.

    In a time before labor unions, the Luddites represented an early example of a worker movement that tried to stand up for their rights in the face of transformational, negative change to their specific way of life.

    “Knitting machines known as lace frames allowed one employee to do the work of many without the skill set usually required” is a really striking parallel to what’s starting to happen with a surprising array of modern professions already. #26th September 2023, 11:45 pm
  • Batch size one billion: SQLite insert speedups, from the useful to the absurd (via) Useful, detailed review of ways to maximize the performance of inserting a billion integers into a SQLite database table. #26th September 2023, 5:31 pm
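For a sense of the baseline here: the usual first two moves for fast SQLite inserts are wrapping everything in a single transaction and using executemany(). A minimal sketch (mine, not taken from the article):

```python
import sqlite3
import time

db = sqlite3.connect("numbers.db")
db.execute("create table if not exists numbers (n integer)")

start = time.time()
with db:  # one transaction for the whole batch, not one per row
    db.executemany(
        "insert into numbers (n) values (?)",
        ((i,) for i in range(1_000_000)),
    )
print(f"Inserted 1,000,000 rows in {time.time() - start:.2f}s")
```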

25th September 2023

  • Upsert in SQL (via) Anton Zhiyanov is currently on a one-man quest to write detailed documentation for all of the fundamental SQL operations, comparing and contrasting how they work across multiple engines, generally with interactive examples.

    Useful tips in here on why “insert... on conflict” is usually a better option than “insert or replace into”: the latter can perform a delete and then an insert, firing delete triggers you may not have wanted fired. There’s a small demonstration of the difference after this day’s links. #25th September 2023, 8:34 pm
  • Geospatial SQL queries in SQLite using TG, sqlite-tg and datasette-sqlite-tg. Alex Garcia built sqlite-tg—a SQLite extension that uses the brand new TG geospatial library to provide a whole suite of custom SQL functions for working with geospatial data.

    Here are my notes on trying out his initial alpha releases. The extension already provides tools for converting between GeoJSON, WKT and WKB, plus the all-important tg_intersects() function for testing whether one polygon or point overlaps another.

    It’s pretty useful already. Without any geospatial indexing at all I was still able to get 700ms replies to a brute-force point-in-polygon query against 150MB of GeoJSON timezone boundaries stored as JSON text in a table. #25th September 2023, 7:45 pm
  • We already know one major effect of AI on the skills distribution: AI acts as a skills leveler for a huge range of professional work. If you were in the bottom half of the skill distribution for writing, idea generation, analyses, or any of a number of other professional tasks, you will likely find that, with the help of AI, you have become quite good.

    Ethan Mollick # 25th September 2023, 4:37 pm

  • A Hackers' Guide to Language Models. Jeremy Howard’s new 1.5 hour YouTube introduction to language models looks like a really useful place to catch up if you’re an experienced Python programmer looking to start experimenting with LLMs. He covers what they are and how they work, then shows how to build against the OpenAI API, build a Code Interpreter clone using OpenAI functions, run models from Hugging Face on your own machine (with NVIDIA cards or on a Mac) and finishes with a demo of fine-tuning a Llama 2 model to perform text-to-SQL using an open dataset. #25th September 2023, 12:24 am
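Here’s the demonstration promised under the Upsert in SQL link above, using Python’s sqlite3 module. It shows the difference directly: “insert or replace” deletes the conflicting row and inserts a new one, silently discarding column values you didn’t re-specify, while the upsert form updates the existing row in place.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "create table users (id integer primary key, name text, visits integer default 0)"
)
db.execute("insert into users (id, name, visits) values (1, 'Alice', 5)")

# INSERT OR REPLACE resolves the conflict by deleting the old row and
# inserting a fresh one, so visits silently resets to its default.
db.execute("insert or replace into users (id, name) values (1, 'Alicia')")
print(db.execute("select name, visits from users where id = 1").fetchone())
# ('Alicia', 0)

# Restore the original row, then make the same change as an upsert: the row
# is updated in place and the visits value survives.
db.execute("update users set name = 'Alice', visits = 5 where id = 1")
db.execute(
    "insert into users (id, name) values (1, 'Alicia') "
    "on conflict (id) do update set name = excluded.name"
)
print(db.execute("select name, visits from users where id = 1").fetchone())
# ('Alicia', 5)
```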

24th September 2023

  • Should you give candidates feedback on their interview performance? Jacob provides a characteristically nuanced answer to the question of whether you should provide feedback to candidates you have interviewed. He suggests telling candidates early in the interview process that they can email to ask for feedback, so the request doesn’t feel pushy later on, and proposes the phrase “you failed to demonstrate...” as a useful framing device. #24th September 2023, 10:25 pm

23rd September 2023

  • TG: Polygon indexing (via) TG is a brand new geospatial library by Josh Baker, author of the Tile38 in-memory spatial server (kind of a geospatial Redis). TG is written in pure C and delivered as a single C file, reminiscent of the SQLite amalgamation.

    TG looks really interesting. It implements almost the exact subset of geospatial functionality that I find most useful: point-in-polygon, intersect, WKT, WKB, and GeoJSON—all with no additional dependencies.

    The most interesting thing about it is the way it handles indexing. In this documentation Josh describes the two novel approaches he uses to speed up point-in-polygon and intersection checks, going beyond the usual R-tree implementation.

    I think this could make the basis of a really useful SQLite extension—a lighter-weight alternative to SpatiaLite. #23rd September 2023, 4:32 am