Planning for Distillation of Llama 3 70b -> 4x8b / 25b

Hello r/LocalLLaMA! This is your resident meme-sampler designer kalomaze.

So as of late, I was feeling pretty burned by the lack of an effective mid-range Llama 3 release that would appeal to both the single 3090 (24GB) and 3060 (12GB) demographics.

Out of the blue, I was generously offered a server with 4xA100s to run training experiments on, which led me to an idea...

[Image: 4x8b, topk=1 expert selection, testing basic modeling loss]

I decided to contact StefanGliga and AMOGUS so we could collaborate on a team project dedicated to transfer learning, in which the objective is to distill Llama 3 70b into a smaller 4x8b (25b total) MoE model.

The objective of distillation / transfer learning (in conventional machine learning) is to train a smaller "student" network on the predictions of a larger "teacher" network. What this means is that, instead of training on the one-hot vectors of the tokens in the dataset itself (or on the output generations of a larger model, which is not what is happening here), the training objective is modified so that the model learns to mimic the full spread of possible next-token outputs as predicted by the larger teacher model.

We can do this by training the student model to minimize the KL divergence (a measure of how much one probability distribution diverges from another) against the teacher model's output predictions, rather than training to minimize the cross-entropy on the dataset itself (since the "true distribution" is fundamentally unknowable).
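
To make the objective concrete, here's a minimal PyTorch sketch of one way such a top-k distillation loss can be written. The names, the top-k truncation + renormalization, and the temperature knob are my assumptions about a reasonable setup, not a dump of our actual trainer:

```python
import torch
import torch.nn.functional as F

def topk_distillation_loss(student_logits, teacher_logits, k=200, temperature=1.0):
    """KL(teacher || student), computed only over the teacher's top-k tokens.

    Both tensors are [batch, seq_len, vocab_size]. Truncating to the top k
    teacher tokens keeps the stored targets small (k values + k indices per
    position) while preserving nearly all of the probability mass.
    """
    # Teacher's k most likely tokens at each position, and their vocab indices.
    teacher_topk, topk_idx = teacher_logits.topk(k, dim=-1)

    # Student's logits at the same vocabulary indices.
    student_topk = student_logits.gather(-1, topk_idx)

    # Renormalize both distributions over the k retained tokens.
    teacher_probs = F.softmax(teacher_topk / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_topk / temperature, dim=-1)

    # D_KL(teacher || student), averaged per token position.
    return F.kl_div(
        student_log_probs.reshape(-1, k),
        teacher_probs.reshape(-1, k),
        reduction="batchmean",
    )
```

In the training loop, something like this simply replaces the usual cross-entropy against one-hot labels; the teacher's top-k logits can either be produced on the fly by a frozen teacher or precomputed and stored.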

Current Progress

After about a week of studying / investigating, we've gotten to the point where we can confirm that topk=200 distillation of Llama2 13b logits is fully functional when applied to TinyLlama 1b.

With just ~100k tokens or so worth of compute on a tiny 1b model, there is a noticeable, if ever so slight, trend of continued improvement:

[Image: TinyLlama 1b, initial test of distillation loss]

Right now, the objective is to get the trainer up and running on the 4xA100s for Llama 3 8b, and once that is confirmed to be functional, scale it up to a larger MoE network by duplicating the FFNs as individual experts (with the attention tensors shared, much like in Mixtral 8x7b or 8x22b).
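
For reference, the upcycling step can be sketched roughly like this. This is a simplified illustration assuming a LLaMA-style block where dense_ffn is the pretrained MLP module; our actual code will differ:

```python
import copy
import torch
import torch.nn as nn

class UpcycledMoEFFN(nn.Module):
    """A dense FFN duplicated into n_experts experts behind a top-k router.

    Everything else in the transformer block (attention, norms, embeddings)
    stays shared, which is how a single 8b checkpoint becomes a 4x8b MoE.
    """
    def __init__(self, dense_ffn: nn.Module, hidden_size: int,
                 n_experts: int = 4, top_k: int = 2):
        super().__init__()
        # Each expert starts as an exact copy of the pretrained dense FFN.
        self.experts = nn.ModuleList(
            copy.deepcopy(dense_ffn) for _ in range(n_experts)
        )
        self.router = nn.Linear(hidden_size, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                                  # x: [batch, seq, hidden]
        scores = self.router(x)                            # [batch, seq, n_experts]
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        gate = torch.softmax(topk_scores, dim=-1)          # weights over chosen experts

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Gate weight for expert e at each position (0 where it wasn't chosen).
            w = ((topk_idx == e).float() * gate).sum(dim=-1, keepdim=True)
            # Running every expert densely and masking is wasteful but keeps the
            # sketch short; real MoE kernels dispatch tokens to experts instead.
            out = out + w * expert(x)
        return out
```

Because every expert starts as a copy of the dense FFN and the gate weights sum to 1, the upcycled block initially reproduces the dense model's output, so distillation starts from the 8b model's behavior rather than from scratch.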

Progressive TopK / Random Routing

In Sparse MoE as the new Dropout, the authors argue that gradually increasing the computational cost of a MoE throughout the training process (in such a way that the run ends with all experts activated during inference) implicitly encourages the model to make use of more compute as the run progresses. In addition to this, learnable routing is completely disabled and replaced with a frozen, uniformly random router.

By the end of the training run (where all experts are used during inference), this technique was shown to be more effective than both training a dense network and training a standard sparse MoE with a fixed computational cost (i.e., a constant topk=2, as seen in Mixtral 8x7b or 8x22b).

However, a dense network is still more effective when the total number of experts is limited (~4 and lower). I plan to remedy this by introducing a random element to the topk selection process (e.g., to target 1.5 experts on average, the training script is allowed to randomly select between topk=1 and topk=2 with a 50/50 chance).

I hope that this way, the typical amount of compute used can smoothly increase with time (as it does in a MoE network with more total experts) and we can see similar improvements; if not, the training methods they described are still competitive with a dense network, and should hopefully lead to considerable gains over the single 8b model regardless.
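
As a concrete illustration of that random element (a rough sketch only; the exact schedule shape is still undecided), the per-step topk can be drawn by stochastically rounding a target that rises over the run:

```python
import random

def sample_topk(step: int, total_steps: int,
                start_experts: float = 1.0, end_experts: float = 4.0) -> int:
    """Draw this step's topk so its expected value follows a linear schedule.

    Fractional targets are realized by stochastic rounding: a target of 1.5
    experts becomes topk=1 or topk=2 with a 50/50 chance, so the *average*
    compute rises smoothly even though each step uses a whole number of experts.
    """
    frac = min(step / max(total_steps, 1), 1.0)
    target = start_experts + frac * (end_experts - start_experts)

    low = int(target)            # e.g. 1 when target = 1.5
    p_up = target - low          # e.g. 0.5
    return low + (1 if random.random() < p_up else 0)
```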

Why 4x8b / 25b?

4x8b is planned because of a few useful traits:

- Will barely fit into ~11-12GB of VRAM with a 4-bit quant (or a 5-6 bit quant with a couple of layers offloaded to CPU)

- Will cleanly fit into ~22-23GB of VRAM with an 8-bit quant

- Higher quantization levels + lower topk expert usage could be used to further balance the speed / efficiency tradeoff to the user's liking

- Less risk of catastrophic forgetting compared to interleaving / "depth up-scaling"

What about Data?

The plan is to take randomly sampled excerpts of FineWeb (a 15T-token English web dataset), as well as excerpts from The Stack, a permissively licensed code dataset. I am also considering adding samples from Project Gutenberg and Archive.org, though I feel that the quality of the dataset is not as important as the quality of the teacher model's predictions when it comes to distillation.
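
To give a sense of how the sampling could be wired up (a sketch only; the dataset ids, fields, and the web/code mixing ratio below are my assumptions, not final choices), Hugging Face datasets in streaming mode avoids downloading the full corpus:

```python
from datasets import load_dataset, interleave_datasets

# Stream both corpora so nothing has to be downloaded in full; the dataset ids
# and the 80/20 web/code mix here are illustrative assumptions.
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
the_stack = load_dataset("bigcode/the-stack-dedup", split="train", streaming=True)

mixed = interleave_datasets(
    [fineweb.shuffle(seed=42, buffer_size=10_000),
     the_stack.shuffle(seed=42, buffer_size=10_000)],
    probabilities=[0.8, 0.2],
    seed=42,
)

# Peek at a few mixed samples before feeding them to the tokenizer / trainer.
for example in mixed.take(3):
    print(sorted(example.keys()))
```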

Assuming the computational cost across the full run averages out to ~topk=2, I've already confirmed that 4x8b at this expert count can train on about 140 million tokens in around ~8 hours [batch size 1, 8192 context].

In other words, about ~2.5-3 billion tokens worth of data can be distilled in around a week on the 4xA100s that were provisioned to me (assuming no bespoke CUDA kernels are written to accelerate the process). I am hoping that I can start this process by the beginning of next week, but I can't make any promises.
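
That weekly figure is just the measured 8-hour run extrapolated linearly; a quick back-of-envelope check (assuming throughput stays constant):

```python
# Measured: ~140M tokens in ~8 hours on the 4xA100s (batch size 1, 8192 context).
tokens_per_hour = 140e6 / 8        # ~17.5M tokens/hour
hours_per_week = 24 * 7            # 168 hours

print(f"{tokens_per_hour * hours_per_week / 1e9:.1f}B tokens/week")  # ~2.9B
```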

What about more Data?

My hope is that the signal provided by distillation is information-dense enough to get a smaller model within the ballpark of Llama 3 70b in far less time. After all, there is evidence that even Llama 3 8b was undertrained, considering the continued log-linear improvement at the time the models were released; transferring the full distributional patterns of a far bigger model seems like a reasonable way to accelerate this process.

https://preview.redd.it/planning-for-distillation-of-llama-3-70b-4x8b-25b-v0-wiuc3s2kwazc1.png

With that being said, compute is king, and I imagine the project still needs as much of it as we can muster for the results to stand out. If any group is willing to provide additional compute to distill on a larger volume of tokens (once we have empirically proven that this can improve models larger than TinyLlama), I am more than willing to work with you or your team to make this happen. I want this project to be as successful as it can be, and I am hoping that a larger run could be scheduled to make that happen.

If I am unable to secure a grant for a larger training run (which may or may not happen, depending on whether any offers come my way), the estimated cost of renting 8xA100s for a month straight is around ~$10,000. That is still cheap enough that crowdfunding the compute would be in the picture, but I'm not sure whether there would be enough interest or trust from the community to support the cost.

With the (probably naive) assumption that I can link multiple nodes together and triple the training speed with a higher batch size (and that I can avoid memory-saving techniques such as gradient checkpointing, which reduce throughput), I guesstimate that about ~40-50 billion tokens should be doable within a month's time on this budget; possibly 2-3x that with optimized kernels (though designing those is outside my current capabilities).

Conclusion

Regardless, the plan is to release an openly available Llama 3 variant that comes as close to the Pareto-optimal tradeoff of VRAM / intelligence as we can make it. If I am not mistaken, this would also be the first large-scale (open) application of this kind of transfer learning to language models; so even if it underperforms my personal hopes / expectations, we will at least have conducted some interesting research on bringing down the parameter cost of locally hostable language models.

If there are any concerns or suggestions from those more seasoned with large scale training, feel free to reach out to me on Twitter (@kalomaze) or through this account.

Peace!






