Machine Learning

26

1

[P] Critique my geospatial Machine Learning approach. (I need second opinions) (old.reddit.com)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/No-Discipline-2354 on 2025-06-11 14:46:17+00:00.

I am working on a geospatial ML problem. It is a binary classification problem where each data sample (a geometric point location) has about 30 different features that describe the various land topography (slope, elevation, etc).

Upon doing literature surveys I found out that a lot of other research in this domain, take their observed data points and randomly train - test split those points (as in every other ML problem). But this approach assumes independence between each and every data sample in my dataset. With geospatial problems, a niche but big issue comes into the picture is spatial autocorrelation, which states that points closer to each other geometrically are more likely to have similar characteristics than points further apart.

Also a lot of research also mention that the model they have used may only work well in their regions and there is not guarantee as to how well it will adapt to new regions. Hence the motive of my work is to essentially provide a method or prove that a model has good generalization capacity.

Thus other research, simply using ML models, randomly train test splitting, can come across the issue where the train and test data samples might be near by each other, i.e having extremely high spatial correlation. So as per my understanding, this would mean that it is difficult to actually know whether the models are generalising or rather are just memorising cause there is not a lot of variety in the test and training locations.

So the approach I have taken is to divide the train and test split sub-region wise across my entire region. I have divided my region into 5 sub-regions and essentially performing cross validation where I am giving each of the 5 regions as the test region one by one. Then I am averaging the results of each 'fold-region' and using that as a final evaluation metric in order to understand if my model is actually learning anything or not.

My theory is that, showing a model that can generalise across different types of region can act as evidence to show its generalisation capacity and that it is not memorising. After this I pick the best model, and then retrain it on all the datapoints ( the entire region) and now I can show that it has generalised region wise based on my region-wise-fold metrics.

I just want a second opinion of sorts to understand whether any of this actually makes sense. Along with that I want to know if there is something that I should be working on so as to give my work proper evidence for my methods.

If anyone requires further elaboration do let me know :}

27

1

[R] ABBA: Highly Expressive Hadamard Product Adaptation for Large Language Models (old.reddit.com)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/AccomplishedCode4689 on 2025-06-12 13:43:22+00:00.

We introduce ABBA, a new architecture for Parameter-Efficient Fine-Tuning (PEFT) that significantly outperforms LoRA and all its major variants across a broad range of benchmarks, all under the same parameter budget.

Most PEFT methods, including LoRA, represent weight updates using a low-rank decomposition added to the frozen model weights. While effective, this structure can limit the expressivity of the update, especially at low rank.

ABBA takes a fundamentally different approach:

ABBA Architecture

Reparameterizes the update as a Hadamard product of two independently learned low-rank matrices
Decouples the two components of the update from the base model, allowing them to be optimized freely
Enables significantly higher expressivity and improved performance under the same parameter budget

📈 Empirical Results

ABBA consistently beats state-of-the-art LoRA-based methods like HiRA, DoRA, and LoRA-Pro across four open-source LLMs: Mistral-7B, Gemma-2 9B, LLaMA-3.2 1B, and LLaMA-3.2 3B, on a suite of commonsense and arithmetic reasoning benchmarks. In several cases, ABBA even outperforms full fine-tuning.

📄 Paper: https://arxiv.org/abs/2505.14238

💻 Code: https://github.com/CERT-Lab/abba

We’d love to hear your thoughts, whether you're working on PEFT methods, fine-tuning, or anything related to making LLMs more adaptable and efficient. We're happy to answer questions, discuss implementation details, or just hear how this fits into your work.

28

1

[P]: I reimplemented all of frontier deep learning from scratch to help you learn (old.reddit.com)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/tanishqkumar07 on 2025-06-12 12:33:31+00:00.

Hey friends, the world needs more serious AI researchers. Many AI/LLM beginners mentioned to me that they learn better from implementations than from papers/math, but existing open-source examples rarely go beyond basic nanoGPT-level demos.

To help bridge the gap, I spent the last two months full-time reimplementing and open-sourcing a self-contained implementation of most modern deep learning techniques from scratch. The result is beyond-nanoGPT, containing 20k+ lines of handcrafted, minimal, and extensively annotated PyTorch code for your educational pleasure.

It contains a clean, working implementation + demo of everything from KV caching to linear attention to diffusion Transformers to AlphaZero to even a minimal coding agent that can make end-to-end PRs autonomously.

I'd love feedback on how to make it more helpful for people interested in transitioning into deep learning research. I will continue to add features and maintain the repo for the foreseeable future. The roaring 2020s are a surreal time to be alive, and we need all hands on deck.

29

1

[D] Are GNNs/GCNs dead ? (old.reddit.com)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/xiikjuy on 2025-06-12 12:29:44+00:00.

Before the LLMs era, it seems it could be useful or justifiable to apply GNNs/GCNs to domains like molecular science, social network analyasis etc. but now... everything is LLMs-based approaches. Are these approaches still promising at all?

30

1

[P] SWE-rebench Major Update: Tool Usage, Claude Sonnet 3.5/4, OpenAI o3 and May Data (old.reddit.com)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Long-Sleep-13 on 2025-06-12 12:25:53+00:00.

Hey everyone,

Following up on our initial announcement, we're excited to launch a major update for SWE-rebench, the continuously updated benchmark for software engineering LLMs.

Thanks to valuable community's feedback, we've added several new features:

Tool Usage Support: Agents can now interact with the environment using both text-based and tool-based approaches. You can filter the leaderboard to see results for each type.
New Frontier Models: We've evaluated the latest models such as Claude Sonnet 3.5/4 and OpenAI o3. We're working on adding more, like Gemini 2.5 Pro, and we'd love to hear your suggestions for other models to include.
Fresh May Problems: We've mined a new set of problems from May 2025 and evaluated all current models against them.

Check out the updated leaderboard here: https://swe-rebench.com/leaderboard

We welcome your feedback!

31

1

[D] Image generation using latent space learned from similar data (old.reddit.com)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/joacojoaco on 2025-06-12 00:04:00+00:00.

Okay, I just had one of those classic shower thoughts and I’m struggling to even put it into words well enough to Google it — so here I am.

Imagine this:

You have Dataset A, which contains different kinds of cells, all going through various labeled stages of mitosis.

Then you have Dataset B, which contains only one kind of cell, and only in phase 1 of mitosis.

Now, suppose you train a VAE using both datasets together. Ideally, the latent space would organize itself into clusters — different types of cells, in different phases.

Here’s the idea: Could you somehow compute the “difference” in latent space between phase 1 and phase 2 for the same cell type from Dataset A? Like a “phase change direction vector”. Then, apply that vector to the B cell cluster in phase 1, and use the decoder to generate what the B cell in phase 2 might look like.

Would that work?

A bunch of questions are bouncing around in my head: • Does this even make sense? • Is this worth trying? • Has someone already done something like this? • Since VAEs encode into a probabilistic latent space, what would be the mathematically sound way to define this kind of “direction” or “movement”? Is it something like vector arithmetic in the mean of the latent distributions? Or is that too naive?

I feel like I’m either stumbling toward something or completely misunderstanding how VAEs and biological processes work. Any thoughts, hints, papers, keywords, or reality checks would be super appreciated

32

1

[D] About spatial reasoning VLMs (old.reddit.com)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/stalin1891 on 2025-06-11 19:31:27+00:00.

Are there any state-of-the-art VLMs which excel at spatial reasoning in images? For e.g., explaining the relationship of a given object with respect to other objects in the scene. I have tried VLMs like LLaVA, they give satisfactory responses, however, it is hard to refer to a specific instance of an object when multiple such instances are present in the image (e.g., two chairs).

33

1

[D] Should I publish single-author papers to explain research output? (old.reddit.com)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/NumberGenerator on 2025-06-11 11:01:45+00:00.

I am a researcher in a small group and would appreciate a second perspective on my situation.

My typical workload involves 1-2 independent projects at a time, with the goal of publishing in top-tier conferences. Collaboration within my group is non-existent; my main interaction is a monthly meeting with my supervisor for general updates. Before deadlines, my supervisor might provide minor grammatical/styilistic edits, but the core idea, research, and writing are done independently. Alongside my research, I also have other responsibilities that do not contribute to my research output like grant applications and student supervision.

I am concerned that my research output might be significantly lower than researchers in larger, more collaborative groups. So I am wondering if publishing single-author papers would be a good strategy to explain my research output. What are your thoughts on this? Would single-author papers be perceived positively?

34

1

[R] FlashDMoE: Fast Distributed MoE in a single Kernel (old.reddit.com)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Kingandpawnendgame on 2025-06-11 03:03:39+00:00.

We introduce FlashDMoE, the first system to completely fuse the Distributed MoE forward pass into a single kernel—delivering up to 9x higher GPU utilization, 6x lower latency, and 4x improved weak-scaling efficiency.

Code: https://github.com/osayamenja/Kleos/blob/main/csrc/include/kleos/moe/README.MD

Paper: https://arxiv.org/abs/2506.04667

If you are a CUDA enthusiast, you would enjoy reading the code :) We write the fused layer from scratch in pure CUDA.

35

1

[R] Semantic Drift in LLMs Is 6.6x Worse Than Factual Degradation Over 10 Recursive Generations (old.reddit.com)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Actual_Requirement58 on 2025-06-11 02:35:19+00:00.

We ran a study to test how truth degrades in LLMs over recursive generations—but instead of measuring hallucinations, we measured semantic drift.

The common assumption is that recursive use of LLM outputs results in factual degradation. But when we systematically tested this over 10 academic domains and 10 generations of GPT-4o outputs, we found something different:

Facts are mostly retained: Only a 2% drop in factual accuracy over 10 generations
Semantic intent collapses: A new metric we introduced, Purpose Fidelity, dropped 42.5%
That’s a 6.63× higher rate of semantic drift vs factual decay

Examples:

A Descartes excerpt (“Cogito, ergo sum”) became career advice about leadership and self-awareness

A history excerpt on the Berlin Wall became a lesson in change management

Law and medicine were rewritten as “best practices” for business professionals

Chemistry and CS stayed stable: semantic degradation was domain-specific

Why this matters: Most LLM eval frameworks focus on factual accuracy and hallucination rates. But our data suggests the real long-term risk may be subtle, systematic recontextualization. Outputs can look factual and well-structured, while completely losing their intended purpose. This may impact content authenticity, training data curation, and long-term epistemic stability.

📄 Full paper (ResearchGate) - https://www.researchgate.net/publication/392558645_The_Half-Life_of_Truth_Semantic_Drift_vs_Factual_Degradation_in_Recursive_Large_Language_Model_Generation

🧵 Medium summary for general audience - https://medium.com/@maxwell.ian/when-ai-loses-its-mind-but-keeps-the-facts-the-hidden-danger-of-recursive-ai-content-08ae538b745a

36

1

[R] PINNs are driving me crazy. I need some expert opinion (old.reddit.com)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/WAIHATT on 2025-06-10 20:36:17+00:00.

Hi!

I'm a postdoc in Mathematics, but as you certainly know better than me, nowadays adding some ML to your research is sexy.

As part of a current paper I'm writing, I need to test several methods for solving inverse problems, and I have been asked by my supervisor to test also PINNs. I have been trying to implement a PINN to solve our problem, but for the love of me I cannot seem to make it converge.

Is this expected? Shouldn't PINNs be good at inverse problems?

Just to give some context, the equation we have is not too complicated, but also not too simple. It's a 2D heat equation, of which we need to identify the space-dependent diffusivity, k(x,y). So the total setup is:

Some observations, data points in our domain, taken at different times
k is defined, for simplicity, as a sum of two gaussians. Accordingly, we only have 6 parameters to learn (4 for the centers and 2 for the amplitudes), in addition to the PINNs weights and biases
We also strongly enforce BC and IC.

But there is no way to make the model converge. Heck, even if I set the parameters to be exact, the PINN does not converge.

Can someone confirm me that I'm doing something wrong? PINNs should be able to handle such a problem, right?

37

1

[P] GNNs for time series anomaly detection (Part 2) (old.reddit.com)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Important-Gear-325 on 2025-06-10 13:17:26+00:00.

Hey everyone! 👋

A while back, we posted about our project, GraGOD, which explores using Graph Neural Networks (GNNs) for Time Series Anomaly Detection. The feedback in the post was really positive and motivating, so with a lot of excitement we can announce that we've now completed our thesis and some important updates to the repository!

For anyone who was curious about the project or finds this area of research interesting, the full implementation and our detailed findings are now available in the repository. We'd love for you to try it out or take a look at our work. We are also planning on dropping a shorter paper version of the thesis, which will be available in a couple of weeks.

🔗 Updated Repo: GraGOD - GNN-Based Anomaly Detection

🔗 Original Post: P GNNs for time series anomaly detection

A huge thank you to everyone who showed interest in the original post! We welcome any further discussion, questions, or feedback. If you find the repository useful, a ⭐ would be greatly appreciated.

Looking forward to hearing your thoughts!

38

1

[D] What underrated ML techniques are better than the defaults (old.reddit.com)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/NOAMIZ on 2025-06-09 17:59:26+00:00.

I come from a biology/medicine background and slowly made my way into machine learning for research. One of the most helpful moments for me was when a CS professor casually mentioned I should ditch basic grid/random search and try Optuna for hyperparameter tuning. It completely changed my workflow, way faster, more flexible, and just better results overall.

It made me wonder what other "obvious to some, unknown to most" ML techniques or tips are out there that quietly outperform the defaults?

Curious to hear what others have picked up, especially those tips that aren’t widely taught but made a real difference in your work

39

1

[P][R] Sparse Transformers: Run 2x faster LLM with 30% lesser memory (old.reddit.com)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Economy-Mud-6626 on 2025-06-09 13:11:57+00:00.

We have built fused operator kernels for structured contextual sparsity based on the amazing works of LLM in a Flash (Apple) and Deja Vu (Zichang et al). We avoid loading and computing activations with feed forward layer weights whose outputs will eventually be zeroed out.

The result? We are seeing 5X faster MLP layer performance in transformers with 50% lesser memory consumption avoiding the sleeping nodes in every token prediction. For Llama 3.2, Feed forward layers accounted for 30% of total weights and forward pass computation resulting in 1.6-1.8x increase in throughput:

Sparse LLaMA 3.2 3B vs LLaMA 3.2 3B (on HuggingFace Implementation):
- Time to First Token (TTFT):  1.51× faster (1.209s → 0.803s)
- Output Generation Speed:     1.79× faster (0.7 → 1.2 tokens/sec)  
- Total Throughput:           1.78× faster (0.7 → 1.3 tokens/sec)
- Memory Usage:               26.4% reduction (6.125GB → 4.15GB)

Please find the operator kernels with differential weight caching open sourced (Github link in the comment).

PS: We will be actively adding kernels for int8, CUDA and sparse attention.

40

1

[D] Monthly Who's Hiring and Who wants to be Hired? (old.reddit.com)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/AutoModerator on 2025-05-31 02:30:35+00:00.

For Job Postings please use this template

Hiring: [Location], Salary:[], [Remote | Relocation], [Full Time | Contract | Part Time] and [Brief overview, what you're looking for]

For Those looking for jobs please use this template

Want to be Hired: [Location], Salary Expectation:[], [Remote | Relocation], [Full Time | Contract | Part Time] Resume: [Link to resume] and [Brief overview, what you're looking for]

Please remember that this community is geared towards those with experience.

41

1

[D][R][N] Are current AI's really reasoning or just memorizing patterns well.. (i.redd.it)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/theMonarch776 on 2025-06-08 17:20:09+00:00.

42

1

[D] is there a mistake in the RoPE embedding paper? (old.reddit.com)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/samas69420 on 2025-06-08 12:47:15+00:00.

i'm reading the paper about rope embedding but there's something weird in equation 16, we start from

q_m.T*k_n = (R_m*W_q*x_m).T*(R_n*W_k*x_n) and computing the transpose of the first term we get

q_m.T*k_n = (W_q*x_m).T * R_m.T * R_n * W_k * x_n) = x_m.T * W_q.T * (R_m.T * R_n) * W_k * x_n = x_m.T * W_q.T * R_n-m * W_k * x_n

in my case in the final step i get the transpose of the W_q matrix but in the paper at that point the matrix is not transposed, is that a mistake or i am missing something?

43

1

[R] Machine learning with hard constraints: Neural Differential-Algebraic Equations (DAEs) as a general formalism (www.stochasticlifestyle.com)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/ChrisRackauckas on 2025-06-08 11:36:43+00:00.

44

1

[R] Geometric Adam Optimizer (github.com)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/jaepil on 2025-06-08 01:24:34+00:00.

I have designed a new Adam-family optimizer. While the experimental scale is limited due to the personal project nature, I made efforts to test it across as diverse scales as possible. Although this is still an ongoing stage, I’m releasing the research report and experimental code up to this point. In the experimental environment, it successfully avoided the divergence and overfitting problems that other standard optimizers experience, even without separate hyperparameter tuning.

45

1

[R] Transferring Pretrained Embeddings (i.redd.it)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Arkamedus on 2025-06-07 16:47:37+00:00.

46

1

[D] Reproducing/Implementing Research Papers (old.reddit.com)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/thapaa3 on 2025-06-07 03:22:54+00:00.

I'm currently pursuing a Master’s in Data Science & Applied Statistics (Non-Thesis track). I don’t have experience working with research papers, but I’m considering reproducing or implementing a research paper from scratch (Attention, ResNet & BERT) and showcasing it on my resume.

I was wondering how beneficial would this be for gaining experience or standing out to employers? Thank you in advance!

47

1

[D] Got access to Gemini Diffusion (text-based) and it's lightning fast (old.reddit.com)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/hiskuu on 2025-06-07 12:46:57+00:00.

Pretty good at reasoning tasks as well. And it's blazing fast. Hope this comes to commercial models soon!

48

1

[R] Apple Research: The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity (old.reddit.com)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/hiskuu on 2025-06-07 10:47:40+00:00.

Abstract:

Recent generations of frontier language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scal ing properties, and limitations remain insufficiently understood. Current evaluations primarily fo cus on established mathematical and coding benchmarks, emphasizing final answer accuracy. How ever, this evaluation paradigm often suffers from data contamination and does not provide insights into the reasoning traces’ structure and quality. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of composi tional complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs “think”. Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counter intuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget. By comparing LRMs with their standard LLM counterparts under equivalent inference compute, we identify three performance regimes: (1) low complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models’ computational behavior, shedding light on their strengths, limitations, and ultimately raising crucial questions about their true reasoning capabilities.

Did not know Apple wrote ML research papers haha the paper was worth the read anyways! Just wanted to share it here. They did a pretty good job showing the limitations of "Reasoning Models" and how they don't really reason even after being provided the exact algorithm to solve certain complex problems.

Paper link: the-illusion-of-thinking.pdf

49

1

[R] Log-Linear Attention (old.reddit.com)

submitted 1 week ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Potential_Duty_6095 on 2025-06-07 08:34:07+00:00.

Super new research, from the authors of FlashAttention and Mamba(2):

https://arxiv.org/abs/2506.04761

Long Story Short: They extend Mamba2 to have state that can is not fixed and can grow in time, directly increasing Long Range Performance. This seem a sweet point between traditional Mamba2 where the state is fixed sized, being an bottleneck for long sequences, and Attention which is stateless, but need to store past KV pairs! All with specialised Triton kernels!

50

1

[R] Better quantization: Yet Another Quantization Algorithm (old.reddit.com)

submitted 2 weeks ago by [email protected] to c/[email protected]

0 comments fedilink

This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/tsengalb99 on 2025-06-06 16:13:09+00:00.

We're introducing Yet Another Quantization Algorithm, a new quantization algorithm that better preserves the original model's outputs after quantization. YAQA reduces the KL by >30% over QTIP and achieves an even lower KL than Google's QAT model on Gemma 3.

See the paper https://arxiv.org/pdf/2505.22988 and code https://github.com/Cornell-RelaxML/yaqa for more details. We also have some prequantized Llama 3.1 70B Instruct models at https://huggingface.co/collections/relaxml/yaqa-6837d4c8896eb9ceb7cb899e