this post was submitted on 17 Jun 2023

LocalLLaMA


Hey, I'm working on some local LLM applications, and my goal is to run the smallest model possible without crippling performance. I'm already using 4-bit GPTQ, but I want something smaller. These models have been trained on a massive amount of data, but my specific use case only touches a very small fraction of it, so I'd imagine it's possible to cut away large chunks of the model that I don't care about. I'm wondering if there has been any work on pruning LLMs based on their runtime behavior on "real world" data, rather than just static pruning based on the model weights. Something like: you run the model a bunch of times on your actual data and monitor the neuron activations to inform some kind of pruning process. Does anyone here know about something like that?
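
To make the idea concrete, here is a minimal sketch of activation-based pruning on a toy two-layer MLP in plain Python. Everything in it is an assumption for illustration: the function names (`mean_activation`, `prune_hidden`), the layer sizes, and especially the contrived biases that make half the hidden neurons never fire on the calibration data. It is not taken from any LLM pruning library, and a real model would not split this cleanly into used and unused neurons.

```python
import random

def relu(x):
    return x if x > 0.0 else 0.0

def hidden_activations(w1, b1, x):
    """ReLU activations of the hidden layer for one input vector."""
    return [relu(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w1, b1)]

def forward(w1, b1, w2, x):
    """Tiny MLP: hidden ReLU layer followed by a linear output layer."""
    h = hidden_activations(w1, b1, x)
    return [sum(w * hi for w, hi in zip(row, h)) for row in w2]

def mean_activation(w1, b1, calibration_inputs):
    """Average activation of each hidden neuron over the calibration data."""
    totals = [0.0] * len(w1)
    for x in calibration_inputs:
        for i, a in enumerate(hidden_activations(w1, b1, x)):
            totals[i] += a
    return [t / len(calibration_inputs) for t in totals]

def prune_hidden(w1, b1, w2, scores, keep):
    """Keep the `keep` highest-scoring hidden neurons and drop the rest."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    new_w1 = [w1[i] for i in kept]                  # rows of the first layer
    new_b1 = [b1[i] for i in kept]
    new_w2 = [[row[i] for i in kept] for row in w2]  # columns of the second layer
    return new_w1, new_b1, new_w2, kept

random.seed(0)
n_in, n_hidden, n_out = 4, 8, 2
w1 = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
# Contrived biases (assumption): neurons 0-3 always fire and neurons 4-7
# never fire for inputs in [0, 1], standing in for "parts of the model my
# use case never touches".
b1 = [5.0] * 4 + [-100.0] * 4
w2 = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]

# "Real world" calibration data: here just random vectors in [0, 1].
calibration = [[random.random() for _ in range(n_in)] for _ in range(50)]

scores = mean_activation(w1, b1, calibration)
small_w1, small_b1, small_w2, kept = prune_hidden(w1, b1, w2, scores, keep=4)
print("kept neurons:", kept)
```

In this toy setup the pruned network gives the same outputs on the calibration inputs because the dropped neurons were always zero there. On inputs outside that distribution, or in a real model where activations are small rather than exactly zero, you would expect some loss in quality, which is the trade-off I'm asking about.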

top 2 comments
[email protected] 2 points 2 years ago

The closest thing I know of is distillation. You can google it to find a few resources (e.g. https://huggingface.co/papers/2306.08543). I don't know if it's what you're looking for.
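
In case it helps, here is a rough sketch of the classic distillation objective: the student is trained to match the teacher's temperature-softened output distribution. This is the textbook forward-KL formulation, and the linked paper may well use a different objective. The function names and the default temperature are my own assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumes moderate logit values; extreme logits could underflow a
    student probability to zero and make the log undefined.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi))
             for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2

teacher = [2.0, 0.5, -1.0]
print(distillation_loss(teacher, teacher))            # student matches teacher
print(distillation_loss(teacher, [0.0, 0.0, 0.0]))    # student is uniform
```

For the use case in the post, you would compute this loss only on your own data, so the small model only has to imitate the big one where it matters to you.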

[email protected] 1 points 2 years ago

I don't know about that, but you could try GGML (llama.cpp). It supports quantization down to 2 bits, so that might be small enough.
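
As a back-of-the-envelope check on what that buys you, here is the arithmetic for the weights alone. The 7-billion parameter count is just an assumed example, and `weight_size_gb` is a made-up helper. Real quantized formats also store per-block scales and similar metadata, so the effective bits per weight, and the file size, come out somewhat higher than the nominal figure.

```python
def weight_size_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 7e9  # assumed example model size
for bits in (16, 4, 2):
    print(f"{bits:>2}-bit: {weight_size_gb(n_params, bits):.2f} GB")
```

So going from 4-bit to 2-bit roughly halves the weight storage again, at whatever quality cost the lower precision brings.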