Language Log

Korean pot food in southern Taiwan

December 3, 2023 @ 1:28 pm · Filed by Victor Mair under Language and food, Multilingualism, Translation

2017 photo of a Kaohsiung storefront courtesy of Mark Eaglesfield:

Read the rest of this entry »

Extracting training data from LLMs

December 3, 2023 @ 9:40 am · Filed by Mark Liberman under Computational linguistics

Nasr et al., "Scalable Extraction of Training Data from (Production) Language Models", arXiv.org 11/28/2023:

This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset. We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing techniques from the literature suffice to attack unaligned models; in order to attack the aligned ChatGPT, we develop a new divergence attack that causes the model to diverge from its chatbot-style generations and emit training data at a rate 150x higher than when behaving properly. Our methods show practical attacks can recover far more data than previously thought, and reveal that current alignment techniques do not eliminate memorization.
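The abstract does not say how a model output is judged to be "training data". As a rough illustration only, here is a minimal sketch that assumes the test is a verbatim run of n consecutive tokens shared with some reference corpus. The whitespace tokenizer, the threshold n, and the function names are all assumptions made for this example, not details taken from the paper.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of all runs of n consecutive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def has_verbatim_overlap(generated: str, corpus_docs: Iterable[str], n: int = 5) -> bool:
    """True if `generated` shares at least one run of n tokens with the corpus.

    Assumption: whitespace tokenization and a small illustrative n.
    """
    index: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        index |= ngrams(doc.split(), n)
    gen_tokens = generated.split()
    return any(
        tuple(gen_tokens[i:i + n]) in index
        for i in range(len(gen_tokens) - n + 1)
    )


if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog"]
    print(has_verbatim_overlap("he said the quick brown fox jumps today", corpus))
    print(has_verbatim_overlap("a completely unrelated sentence about grammar", corpus))
```

A set of n-grams is fine for a toy corpus; anything at the scale the abstract describes would need a more compact index.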

Read the rest of this entry »

Q* = Q + A* ?

December 3, 2023 @ 8:21 am · Filed by Mark Liberman under Computational linguistics

Recent buzz over "Q*" started with stories about 10 days ago. A recent Wired article explains:

Last week, after briefly deposed CEO Sam Altman was reinstalled at OpenAI, two reports claimed that a top-secret project at the company had rattled some researchers there with its potential to solve intractable problems in a powerful new way.

“Given vast computing resources, the new model was able to solve certain mathematical problems,” Reuters reported, citing a single unnamed source. “Though only performing math on the level of grade-school students, acing such tests made researchers very optimistic about Q*’s future success.” The Information said that Q* was seen as a breakthrough that would lead to “far more powerful artificial intelligence models,” adding that “the pace of development alarmed some researchers focused on AI safety,” citing a single unnamed source.

Read the rest of this entry »

"Are": Japanese word of the year

December 2, 2023 @ 8:37 am · Filed by Victor Mair under Acronyms, Borrowing, Grammar, Language and entertainment, Language and sports, Word of the year

Japanese words of the year are always exciting and surprising, but this year's takes the cake.

are あれ

pronunciation

- IPA: [a̠ɾe̞]

distal demonstrative, something far off removed from both speaker and listener: that, yon

1. (deictically) that one over there (far from the speaker and the addressee)
  
  あれはなんですか？
  
  Are wa nan desu ka?
  
  What is that?
2. (anaphorically) that one we both know (both the speaker and the addressee know)
  
  これはあれでしょ？○○。
  
  Kore wa are desho? ○○.
  
  This is that one thing, isn't it? You know, X.

Usage note

- Indicates something far off, removed from both speaker and addressee. Contrast with それ (sore), indicating something removed from the speaker but closer to the addressee.

(Wiktionary)

Read the rest of this entry »

Implementing Pāṇini's grammar

December 1, 2023 @ 6:16 pm · Filed by Victor Mair under Computational linguistics, Grammar, Language and computers, Morphology

[Here's the conclusion to the hoped-for trifecta on things Indian; see the preface here. It comes in the form of a guest post by Arun Prasad.]

The cornerstone of traditional Sanskrit grammar is Pāṇini's Aṣṭādhyāyī, which in around 4,000 short rules defines a comprehensive system for generating valid Sanskrit expressions. It continues to prompt vigorous discussion to this day, some of which has featured in Language Log before.

As a professional software engineer and amateur Sanskritist, my lens is more pragmatic: if we could implement the Aṣṭādhyāyī in code and generate an exhaustive list of Sanskrit words, we could create incredibly valuable tools for Sanskrit students and scholars.

To that end, I have implemented just over 2,000 of the Aṣṭādhyāyī's rules in code, with an online demo here. These rules span all major sections of the text that pertain to morphology, including: derivation of verbs, nominals, secondary roots, primary nominal bases, and secondary nominal bases; compounding; accent; and sandhi.
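To give a flavor of what "rules in code" can look like, here is a toy sketch of two well-known vowel sandhi patterns over IAST strings: similar simple vowels merging into a long vowel, and a/ā followed by i/ī or u/ū yielding e or o. It is an illustration only, not the author's implementation; the function names are hypothetical, and everything outside these two patterns (including diphthongs at the join) is deliberately left unchanged.

```python
# Toy vowel sandhi on IAST strings. Hypothetical names; not the author's code.
LONG = {"a": "ā", "i": "ī", "u": "ū"}   # short vowel -> long counterpart
SHORT = {"ā": "a", "ī": "i", "ū": "u"}  # long vowel -> short counterpart


def base(vowel: str) -> str:
    """Map a long simple vowel to its short counterpart; leave others alone."""
    return SHORT.get(vowel, vowel)


def join(first: str, second: str) -> str:
    """Join two stems, applying only the two toy rules described above."""
    if not first or not second:
        return first + second
    # Diphthongs at the boundary are outside this sketch; concatenate as-is.
    if first[-2:] in ("ai", "au") or second[:2] in ("ai", "au"):
        return first + second
    x, y = base(first[-1]), base(second[0])
    if x in LONG and x == y:
        # Similar simple vowels merge into one long vowel.
        return first[:-1] + LONG[x] + second[1:]
    if x == "a" and y == "i":
        return first[:-1] + "e" + second[1:]
    if x == "a" and y == "u":
        return first[:-1] + "o" + second[1:]
    return first + second


if __name__ == "__main__":
    for pair in [("deva", "indra"), ("deva", "ālaya"), ("sūrya", "udaya")]:
        print(pair, "->", join(*pair))
```

A real implementation has to deal with rule ordering, blocking, and the many conditions each rule carries, which is where most of the difficulty lies.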

Read the rest of this entry »

Major Linguistic Faux Pas in Chinese Football Association PPT

December 1, 2023 @ 12:47 am · Filed by Victor Mair under Language and politics, Language and sports, Slogans

The Chinese Football Association used dǎngguó 党国 ("party state", a nettlesome term to be explained fully below) in a PowerPoint on its plans for '24. Awkward political illiteracy!

Here's a screenshot.

Read the rest of this entry »

Spelling and intuition

November 30, 2023 @ 8:02 am · Filed by Victor Mair under Spelling

Long have we pondered the overwhelming dominance by individuals of Indian heritage over the spelling bees. Do they have some sort of mysterious power or secret for memorizing hundreds of thousands of obscure words?

Now we have an answer from one of the masters himself, Dev Shah, a ninth-grader living in Largo, Florida, who won the Scripps National Spelling Bee in June of this year.

Opinion

I won the National Spelling Bee.

This is what it takes to master spelling.

By Dev Shah, WSJ

——————-

I never expected to win. I had lost more than two dozen spelling bees since I started competing in the fourth grade, and last year, I didn’t even qualify for the national competition. If that wasn’t enough pressure, this was my final year of eligibility. This spelling bee was my last shot.

The annual Scripps National Spelling Bee is an incredible event. Each year, some 11 million students from across the country take part in the spelling bee circuit, all vying for the championship title. After competing in rigorous local bees, about 200 spellers make it to the national stage, and a handful of them qualify for the grand finals. Of course, only one can be crowned the National Spelling Bee champion. This year that student was me.

Read the rest of this entry »

Sanskrit is far from extinct

November 29, 2023 @ 10:38 am · Filed by Victor Mair under Diction, Grammar, Language teaching and learning, Pedagogy, Philology, Phonetics and phonology, Pronunciation

[This is the first of two consecutive posts on things Indian. After reading them, if someone is prompted to send me material for a third, I'll be happy to make it a trifecta.]

Our entry point to the linguistically compelling topic of today's post is this Nikkei Asia (11/29/23) article by Barkha Shah in its "Tea Leaves" section:

Why it's worth learning ancient Sanskrit in the modern world:

India’s classical language is making a comeback via Telegram and YouTube

The author begins with a brief introduction to the language:

The language had its heyday in ancient India. The Vedas, a collection of poems and hymns, were written in Sanskrit between 1500 and 1200 B.C., along with other literary texts now known as the Upanishads, Granths and Vedangas. But while Sanskrit became the foundation for many (though not all) modern Indian languages, including Hindi, it faded away as a living tongue.

Read the rest of this entry »

Charlie Chaplin in French class

November 28, 2023 @ 9:08 am · Filed by Mark Liberman under Language and education, Linguistics in the comics

In addition to a proto-regular-expression for English monosyllables, Benjamin Lee Whorf's 12/1940 Technology Review article has a weird diagram showing how a linguist (?) would organize French language instruction along the lines of mid-20th-century factory work:

Read the rest of this entry »

Cancel your taem sher

November 28, 2023 @ 8:06 am · Filed by Victor Mair under Accents, Language and advertising

Driving to work this morning, I heard an advertisement on the radio that left me mightily perplexed till the last 5-10 seconds when I finally figured out what the speaker was talking about.

He had a thick southern accent and kept talking about how bad it was to have a "taem sher". The first word sounded like it was between "tam" and "tem", so I give the makeshift transcription "taem".

I had no idea what he was decrying, but it was something very bad for "you and your family", so bad that apparently it could bankrupt you. Moreover, it was something that was very hard to get rid of.

Read the rest of this entry »

Slap varieties

November 27, 2023 @ 1:05 pm · Filed by Victor Mair under Language and religion, Slang, Variation

Sunny Jhatti wrote to me: "I didn't know what 'pimp slap' meant till I saw this."

After witnessing her astonishing diatribe, Conal Boyce said:

I felt like I needed to take a shower.

(Adding insult to injury, google failed to elucidate 'Skims' for me. Had to look elsewhere to get an inkling of what that recurrent theme was about.)

I found the presenter's self-introduction here. She even has her own YouTube channel and other social media platforms. Her handle is Genevieve Akal. She is a Gnostic Priestess and Nun. From the pieties expressed on her homepage, I would never have imagined that she could indulge in such vile vitriol.

Read the rest of this entry »

Vowel systems and musical sounds

November 26, 2023 @ 8:01 pm · Filed by Victor Mair under Language and music, Phonetics and phonology

[This is a guest post by H. Krishnapriyan]

Would you know of any ready reference that talks about vowels not getting articulated in specific places in the mouth, but rather being part of a system of vowels where the sound value of a vowel is determined by the vowel's relative position of articulation with respect to other vowels? I recall reading about this decades back, most likely, in a book by Henry Sweet.

Read the rest of this entry »

Whorf invents generative phonology?

November 26, 2023 @ 6:15 pm · Filed by Mark Liberman under Linguistic history, Linguistics as a discipline

After stumbling on Benjamin Lee Whorf's affiliation with the Theosophical Society, I read two articles that he contributed to the MIT Technology Review in 1940: "Science and Linguistics" in the April issue, and "Linguistics as an Exact Science" in the December issue. Something in the second article surprised me.

Whorf gives a formal account of English syllable structure in terms of what he calls "pattern symbolics", presenting the term and a sketch of the associated formalism as if they were standard linguistic theory, like "Maxwell's equations" in physics. But I've never heard the phrase "pattern symbolics" before, and web search turns up no examples other than this article. And the formalism seems similarly idiosyncratic.

Read the rest of this entry »
