So we're talking about how to make good decisions, or the idea of 'bounded rationality', or what sufficiently advanced Artificial Intelligences might be like; and somebody starts dragging up the concepts of 'expected utility' or 'utility functions'.
And before we even ask what those are, we might first ask, Why?
Hi guys, I'd like to share a proposal regarding AI alignment: training an AI on the curriculum of Classical Virtue Ethics could be a promising approach. (A) Because general virtues with many exemplifications can help us teach the AI what we would really want it to do, even when we can't micromanage it, and (B) because this pedagogy seems to be a good fit for AI's style of learning more generally.
Background i) The Pedagogy of Classical Art
In the Classical Humanist Tradition, pedagogy depends on whether one studies a subject or practices an art. An art is understood as a craft or skill: for example, the art of speaking well and persuading (rhetoric), or the art of living (virtue ethics). Training in...
Thanks for your suggestions. I think having more people deeply engaged with alignment is good for our chances of getting it right.
I think this proposal falls into the category of goal crafting (a term proposed by Roko) - deciding what we want an AGI to do. Most alignment work addresses technical alignment - how we might get an AGI to reliably do anything. I think you imply the approach "just train it"; this might work for some types of AGI, and some types of training.
I think many humans trained in classical ethics are not actually ethical by their own standards...
Suppose our old friends Alice and Bob decide to undertake an art project. Alice will draw a bunch of random purple and green lines on a piece of paper. That will be Alice’s picture (A). She’ll then make a copy, erase all the purple lines, and send the result as a message (M) to Bob. Bob then generates his own random purple lines, and adds them to the green lines from Alice, to create Bob’s picture (B). The two then frame their two pictures and hang them side-by-side to symbolize something something similarities and differences between humans something. Y’know, artsy bullshit.
Now, suppose Carol knows the plan and is watching all this unfold. She wants to make predictions about Bob’s picture, and doesn’t want to remember irrelevant details...
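To make the picture-passing setup concrete, here is a minimal toy simulation. This is my own sketch, not from the original post, and representing "lines" as randomly chosen grid cells is a simplifying assumption.

```python
import random

# Toy model (assumed representation): a "picture" is a set of (cell, color)
# pairs on a 10x10 grid, where a colored cell stands in for a line.
GRID = [(x, y) for x in range(10) for y in range(10)]

def random_lines(color, n, rng):
    """Pick n random cells and paint them the given color."""
    return {(cell, color) for cell in rng.sample(GRID, n)}

rng_alice, rng_bob = random.Random(1), random.Random(2)

# Alice's picture A: her random purple lines plus her random green lines.
green = random_lines("green", 20, rng_alice)
A = random_lines("purple", 20, rng_alice) | green

# The message M: a copy of A with all the purple erased, i.e. just the green lines.
M = {(cell, color) for (cell, color) in A if color == "green"}

# Bob's picture B: the green lines from M plus his own, independent purple lines.
B = M | random_lines("purple", 20, rng_bob)

# Everything about B that is predictable from Alice's side is carried by M:
# the green part of B equals M exactly, while Bob's purple part is fresh
# randomness that Carol cannot predict from Alice's picture at all.
assert {(cell, color) for (cell, color) in B if color == "green"} == M
```

The assert spells out why Carol only needs to remember M: the green lines are shared between the two pictures by construction, while each artist's purple lines are independent noise.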
Only if they both predictably painted that part purple, e.g. as part of the overall plan. If they both randomly happened to paint the same part purple, then no.
I've been in many conversations where I've mentioned the idea of using neuroscience for outer alignment, and the people who I'm talking to usually seem pretty confused about why I would want to do that. Well, I'm confused about why one wouldn't want to do that, and in this post I explain why.
As I see it, there are three main strategies people have for dealing with AI alignment in worlds where AI alignment is hard.
In my opinion, these are all great efforts, but I personally like the idea of working on value alignment directly. Why? First, some downsides of the others:
Does this comment I wrote clear up my claim?
...A clarification of the sense in which I claim that "biological and artificial neural networks are based upon the same fundamental principles":
I would not be surprised if the reasons why neural networks "work" are also exploited by the brain.
In particular, I think neuroscience for value alignment is promising because we can expect the values part of the brain to be compatible with these reasons and not to require too many additional fundamental advances to actually implement, unlike, say, corrigibility, which will first...
I second this recommendation. This book was amazing. It's quite unlike other scifi, and that's a good thing.
My model of utility (and the standard one, as far as I can tell) doesn't work that way. No rational agent ever gives up a utilon - that is the thing they are maximizing. I think of it as "how many utilons do you get from thinking about John Doe's increased satisfaction (not utilons, as you have no access to his, though you could say "inferred utilons") compared to the direct utilons you would otherwise get".
Those moral weights are "just" terms in your utility function.
And, since humans aren't actually rational and don't have consistent utility functions, the moral weights implied by their actions are highly variable and contextual.
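For an idealized agent, the claim in the two comments above can be written down directly. This is my own notational sketch, not the commenter's; the symbols $w_i$ and $\hat{u}_i$ are assumptions for illustration:

$$U_{\text{you}}(a) = u_{\text{direct}}(a) + \sum_i w_i \, \hat{u}_i(a)$$

Here $\hat{u}_i(a)$ is your *inferred* satisfaction for person $i$ under action $a$ (you have no access to their actual utilons), and $w_i$ is the moral weight you place on them. On this reading, choosing an action that benefits John Doe never "gives up" utilons: the action is chosen exactly when it maximizes $U_{\text{you}}$, with the moral weights just being terms inside it.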
It's true, but I don't think there's anything fundamental preventing the same sort of proliferation and advances in open-source LLMs that we've seen in Stable Diffusion (aside from the fact that LLMs aren't as useful for porn). That it has been relatively tame so far doesn't change the basic pattern of how open source affects the growth of technology.
I attended Secular Solstice in Berkeley last December.
My perspective is quite unusual: I live in a rationalist group house and work at an AI safety office, but I also am a Christian and attend church every week.[1]
I was originally not planning on going to Solstice, but a decent number of people (~5) told me in person that they would be particularly interested in my opinions of it. I realized that I was interested in learning what I would think of it too, so I went.
I took notes on my thoughts throughout the service.[2] This blog post covers my broader thoughts on the experience. I also have blog posts for a fun little correction to one of the songs and my detailed notes & commentary.
I do not agree...
I reject the claim that faith implies the world cannot change
Me too, which is why I didn't write this.
Alice takes a chemistry class at a university. She gets a professor who basically just reads from the textbook during lectures. She reads the textbook on her own, talks to her classmates, and finds some relevant Wikipedia articles and youtube videos.
Bob studies chemistry on his own. He buys the textbook Alice uses because it's popular, reads it on his own, talks to other people interested in chemistry, and finds some relevant Wikipedia articles and youtube videos.
Bob is an autodidact. Alice is not.
OK, I understand that, but what's the key difference? What is the essence of autodidact-ness? Is it...
I don't think there's a clear consensus, and I don't think it describes a clear distinction, and that's why I don't normally use the word "autodidact".
I think we probably agree on how far the existing system is from the ideal. I wanted to point at the opposite end of the scale as a reminder that we are even further away from that.
When I was in the first grade of elementary school, they tried to teach us about "sets", which mostly meant that instead of "two plus two equals four" the textbook said "the union of a set containing two elements and another set containing two elements has four elements". In hindsight I see this was a cargo-cultish version of set theory, which probably was very high-status at t...
"to clean house" as implication of violence...
Due to a tragic shortage of outbuildings (to be remedied in the mid term but not immediately), my living room is the garage/makerspace of my home. Last week I cleaned as one cleans for guests, because a friend from way back was dropping by. I then got to enjoy a clean-enough-for-guests home for several days, which is a big part of why it is nice to be visited by friends un-intimate enough to feel like cleaning for.
Then my partner-in-crafts came over, and we re-occupied every table with a combination of resin ca...