Planning for Distillation of Llama 3 70b -> 4x8b / 25b

Hello r/LocalLLaMA! This is your resident meme-sampler designer kalomaze.

So as of late, I was feeling pretty burned by the lack of an effective mid-range Llama 3 release that would appeal to both the single 3090 (24GB) and 3060 (12GB) demographics.

Out of the blue, I was generously offered a server with 4xA100s to run training experiments on, which led me to an idea...

[Image: 4x8b, topk=1 expert selection, testing basic modeling loss]

I decided to contact StefanGliga and AMOGUS so we could collaborate on a team project dedicated to transfer learning, in which the objective is to distill Llama 3 70b into a smaller 4x8b (25b total) MoE model.

The objective of distillation / transfer learning (in conventional machine learning) is to train a smaller "student" network on the predictions of a larger "teacher" network. What this means is that, instead of training on the one-hot vectors of the tokens in the dataset itself (or on the output generations of a larger model, which is not what is happening here), the training objective is modified so that the model learns to mimic the full spread of possible next-token outputs as predicted by the larger teacher model.

We can do this by training the student model to minimize the KL divergence (a measure of how much one probability distribution diverges from another) against the teacher model's output predictions, rather than training to minimize the cross-entropy on the dataset itself (since the "true distribution" is fundamentally unknowable).
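
To make the objective concrete, here's a minimal PyTorch sketch of one way such a top-k distillation loss can be written. The names, the top-k truncation + renormalization, and the temperature knob are my assumptions about a reasonable setup, not a dump of our actual trainer:

```python
import torch
import torch.nn.functional as F

def topk_distillation_loss(student_logits, teacher_logits, k=200, temperature=1.0):
    """KL(teacher || student), computed only over the teacher's top-k tokens.

    Both tensors are [batch, seq_len, vocab_size]. Truncating to the top k
    teacher tokens keeps the stored targets small (k values + k indices per
    position) while preserving nearly all of the probability mass.
    """
    # Teacher's k most likely tokens at each position, and their vocab indices.
    teacher_topk, topk_idx = teacher_logits.topk(k, dim=-1)

    # Student's logits at the same vocabulary indices.
    student_topk = student_logits.gather(-1, topk_idx)

    # Renormalize both distributions over the k retained tokens.
    teacher_probs = F.softmax(teacher_topk / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_topk / temperature, dim=-1)

    # D_KL(teacher || student), averaged per token position.
    return F.kl_div(
        student_log_probs.reshape(-1, k),
        teacher_probs.reshape(-1, k),
        reduction="batchmean",
    )
```

In the training loop, something like this simply replaces the usual cross-entropy against one-hot labels; the teacher's top-k logits can either be produced on the fly by a frozen teacher or precomputed and stored.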

Current Progress

After about a week of studying / investigating, we've gotten to the point where we can confirm that topk=200 distillation of Llama2 13b logits is fully functional when applied to TinyLlama 1b.

With just ~100k tokens or so worth of compute on a tiny 1b model, there is a noticeable, if ever so slight, trend of continued improvement:

[Image: TinyLlama 1b, initial test of distillation loss]

Right now, the objective is to get the trainer up and running on the 4xA100s for Llama 3 8b, and once that is confirmed to be functional, scale it up to a larger MoE network by duplicating the FFNs as individual experts (with the attention tensors shared, much like in Mixtral 8x7b or 8x22b).
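
For reference, the upcycling step can be sketched roughly like this. This is a simplified illustration assuming a LLaMA-style block where dense_ffn is the pretrained MLP module; our actual code will differ:

```python
import copy
import torch
import torch.nn as nn

class UpcycledMoEFFN(nn.Module):
    """A dense FFN duplicated into n_experts experts behind a top-k router.

    Everything else in the transformer block (attention, norms, embeddings)
    stays shared, which is how a single 8b checkpoint becomes a 4x8b MoE.
    """
    def __init__(self, dense_ffn: nn.Module, hidden_size: int,
                 n_experts: int = 4, top_k: int = 2):
        super().__init__()
        # Each expert starts as an exact copy of the pretrained dense FFN.
        self.experts = nn.ModuleList(
            copy.deepcopy(dense_ffn) for _ in range(n_experts)
        )
        self.router = nn.Linear(hidden_size, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                                  # x: [batch, seq, hidden]
        scores = self.router(x)                            # [batch, seq, n_experts]
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        gate = torch.softmax(topk_scores, dim=-1)          # weights over chosen experts

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Gate weight for expert e at each position (0 where it wasn't chosen).
            w = ((topk_idx == e).float() * gate).sum(dim=-1, keepdim=True)
            # Running every expert densely and masking is wasteful but keeps the
            # sketch short; real MoE kernels dispatch tokens to experts instead.
            out = out + w * expert(x)
        return out
```

Because every expert starts as a copy of the dense FFN and the gate weights sum to 1, the upcycled block initially reproduces the dense model's output, so distillation starts from the 8b model's behavior rather than from scratch.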

Progressive TopK / Random Routing

In Sparse MoE as the new Dropout, the authors argue that gradually increasing the computational cost of a MoE throughout the training process (in such a way that the run ends with all experts activated during inference) implicitly encourages the model to make use of more compute as the run progresses. In addition to this, learnable routing is completely disabled and replaced with a frozen, uniformly random router.

By the end of the training run (where all experts are used during inference), this technique was shown to be more effective than both training a dense network and training a standard sparse MoE with a fixed computational cost (i.e., a constant topk=2, as seen in Mixtral 8x7b or 8x22b).

However, a dense network is still more effective when the total number of experts is limited (~4 and lower). I plan to remedy this by introducing a random element to the topk selection process (e.g., to target 1.5 experts on average, the training script is allowed to randomly select between topk=1 and topk=2 with a 50/50 chance).

I hope that this way, the typical amount of compute used can smoothly increase with time (as it does in a MoE network with more total experts) and we can see similar improvements; if not, the training methods they described are still competitive with a dense network, and should hopefully lead to considerable gains over the single 8b model regardless.
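
As a concrete illustration of that random element (a rough sketch only; the exact schedule shape is still undecided), the per-step topk can be drawn by stochastically rounding a target that rises over the run:

```python
import random

def sample_topk(step: int, total_steps: int,
                start_experts: float = 1.0, end_experts: float = 4.0) -> int:
    """Draw this step's topk so its expected value follows a linear schedule.

    Fractional targets are realized by stochastic rounding: a target of 1.5
    experts becomes topk=1 or topk=2 with a 50/50 chance, so the *average*
    compute rises smoothly even though each step uses a whole number of experts.
    """
    frac = min(step / max(total_steps, 1), 1.0)
    target = start_experts + frac * (end_experts - start_experts)

    low = int(target)            # e.g. 1 when target = 1.5
    p_up = target - low          # e.g. 0.5
    return low + (1 if random.random() < p_up else 0)
```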

Why 4x8b / 25b?

4x8b is planned because of a few useful traits:

- Will barely fit into ~11-12GB of VRAM with a 4-bit quant (or a 5-6 bit quant with a couple of layers offloaded to CPU)

- Will cleanly fit into ~22-23GB of VRAM with an 8-bit quant

- Higher quantization levels + lower topk expert usage could be used to further balance the speed / efficiency tradeoff to the user's liking

- Less risk of catastrophic forgetting compared to interleaving / "depth up-scaling"

What about Data?

The plan is to take randomly sampled excerpts of FineWeb (a 15T-token English web dataset), as well as excerpts from The Stack, a permissively licensed code dataset. I am also considering adding samples from Project Gutenberg and Archive.org, though I feel that the quality of the dataset is not as important as the quality of the teacher model's predictions when it comes to distillation.
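
To give a sense of how the sampling could be wired up (a sketch only; the dataset ids, fields, and the web/code mixing ratio below are my assumptions, not final choices), Hugging Face datasets in streaming mode avoids downloading the full corpus:

```python
from datasets import load_dataset, interleave_datasets

# Stream both corpora so nothing has to be downloaded in full; the dataset ids
# and the 80/20 web/code mix here are illustrative assumptions.
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
the_stack = load_dataset("bigcode/the-stack-dedup", split="train", streaming=True)

mixed = interleave_datasets(
    [fineweb.shuffle(seed=42, buffer_size=10_000),
     the_stack.shuffle(seed=42, buffer_size=10_000)],
    probabilities=[0.8, 0.2],
    seed=42,
)

# Peek at a few mixed samples before feeding them to the tokenizer / trainer.
for example in mixed.take(3):
    print(sorted(example.keys()))
```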

Assuming the computational cost across the full run averages out to ~topk=2, I've already confirmed that 4x8b at this expert count can train on about 140 million tokens in around ~8 hours [batch size 1, 8192 context].

In other words, about ~2.5-3 billion tokens worth of data can be distilled in around a week on the 4xA100s that were provisioned to me (assuming no bespoke CUDA kernels are written to accelerate the process). I am hoping that I can start this process by the beginning of next week, but I can't make any promises.
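
That weekly figure is just the measured 8-hour run extrapolated linearly; a quick back-of-envelope check (assuming throughput stays constant):

```python
# Measured: ~140M tokens in ~8 hours on the 4xA100s (batch size 1, 8192 context).
tokens_per_hour = 140e6 / 8        # ~17.5M tokens/hour
hours_per_week = 24 * 7            # 168 hours

print(f"{tokens_per_hour * hours_per_week / 1e9:.1f}B tokens/week")  # ~2.9B
```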

What about more Data?

My hope is that the signal provided by distillation is information-dense enough to get a smaller model within the ballpark of Llama 3 70b in far less time. After all, there is evidence that even Llama 3 8b was undertrained, considering the continued log-linear improvement at the time the models were released; transferring the full distributional patterns of a far bigger model seems like a reasonable way to accelerate this process.

https://preview.redd.it/planning-for-distillation-of-llama-3-70b-4x8b-25b-v0-wiuc3s2kwazc1.png

With that being said, compute is king, and I imagine the project still needs as much of it as we can muster for the results to stand out. If any group is willing to provide additional compute to distill on a larger volume of tokens (once we have empirically proven that this can improve models larger than TinyLlama), I am more than willing to work with you or your team to make this happen. I want this project to be as successful as it can be, and I am hoping that a larger run could be scheduled to make that happen.

If I am unable to secure a grant for a larger training run (which may or may not happen, depending on whether any offers come my way), the estimated cost of renting 8xA100s for a month straight is around ~$10,000. That is still cheap enough that crowdfunding the compute would be in the picture, but I'm not sure whether there would be enough interest or trust from the community to support the cost.

With the (probably naive) assumption that I can link multiple nodes together and triple the training speed with a higher batch size (and that I can avoid memory-saving techniques such as gradient checkpointing, which reduce throughput), I guesstimate that about ~40-50 billion tokens should be doable within a month's time on this budget; possibly 2-3x that with optimized kernels (though designing those is outside my current capabilities).

Conclusion

Regardless, the plan is to release an openly available Llama 3 variant that comes as close to the Pareto-optimal tradeoff of VRAM / intelligence as we can make it. If I am not mistaken, this would also be the first large-scale (open) application of this kind of transfer learning to language models; so even if it underperforms my personal hopes / expectations, we will at least have conducted some interesting research on bringing down the parameter cost of locally hostable language models.

If there are any concerns or suggestions from those more seasoned with large scale training, feel free to reach out to me on Twitter (@kalomaze) or through this account.

Peace!






