Machine Learning

51
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Potential_Duty_6095 on 2025-06-07 08:34:07+00:00.


Super new research, from the authors of FlashAttention and Mamba(2):

https://arxiv.org/abs/2506.04761

Long story short: they extend Mamba2 so that the state is no longer fixed in size and can grow over time, directly improving long-range performance. This seems like a sweet spot between traditional Mamba2, where the fixed-size state is a bottleneck for long sequences, and attention, which is stateless but needs to store all past KV pairs. All with specialised Triton kernels!
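A minimal sketch of the trade-off being split here (an illustration of the memory costs only, not the paper's architecture):

```python
import torch

# Hedged toy sketch: a linear-recurrent model keeps a fixed-size state no matter
# how long the sequence is, while attention must cache every past key/value pair.
d, n, T = 64, 16, 1000                 # model dim, recurrent state dim, sequence length

# Fixed-size recurrent state: memory stays O(n * d), independent of T.
A, B = 0.9 * torch.eye(n), torch.randn(n, 1)
state = torch.zeros(n, d)
for _ in range(T):
    x_t = torch.randn(1, d)
    state = A @ state + B @ x_t        # state shape never changes

# Attention-style KV cache: memory grows as O(T * d) with sequence length.
kv_cache = []
for _ in range(T):
    kv_cache.append((torch.randn(d), torch.randn(d)))   # one more entry per token

print(state.shape, len(kv_cache))      # torch.Size([16, 64]) 1000
```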

52
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/tsengalb99 on 2025-06-06 16:13:09+00:00.


We're introducing Yet Another Quantization Algorithm (YAQA), a new quantization algorithm that better preserves the original model's outputs after quantization. YAQA reduces the KL divergence to the original model by >30% compared to QTIP and achieves an even lower KL than Google's QAT model on Gemma 3.
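For context, the metric here is the KL divergence between the original and quantized models' next-token distributions. A minimal sketch of how one might measure it (an illustration, not YAQA itself; the model handles are placeholders for any Hugging Face causal LMs):

```python
import torch
import torch.nn.functional as F

# Hedged sketch: average per-position KL(P_original || P_quantized) on a batch of inputs.
def mean_next_token_kl(model_orig, model_quant, input_ids):
    with torch.no_grad():
        logp_orig = F.log_softmax(model_orig(input_ids).logits, dim=-1)
        logp_quant = F.log_softmax(model_quant(input_ids).logits, dim=-1)
    kl = (logp_orig.exp() * (logp_orig - logp_quant)).sum(dim=-1)  # per-position KL
    return kl.mean()
```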

See the paper https://arxiv.org/pdf/2505.22988 and code https://github.com/Cornell-RelaxML/yaqa for more details. We also have some prequantized Llama 3.1 70B Instruct models at https://huggingface.co/collections/relaxml/yaqa-6837d4c8896eb9ceb7cb899e

53
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/jamesvoltage on 2025-06-06 12:55:36+00:00.


https://arxiv.org/abs/2505.24293

https://github.com/jamesgolden1/llms-are-llms

Hello all, I'd like to share my new research describing an alternative approach to LLM interpretability. I show that transformer decoder LLMs can be made locally linear at inference time without changing outputs or weights.

Result: LLMs can be converted into nearly exactly equivalent linear systems that reconstruct the next-token output for any given input text sequence. Instead of 25+ layers of nonlinear computations, this method computes a single set of matrix multiplications that linearly operates on the input embedding vectors and nearly exactly reconstructs the output embedding for a single token prediction.

Method: A "linear path" through the transformer is identified, the nonlinear components are detached from the gradient, and the Jacobian with respect to the input embeddings is computed. This yields the "detached Jacobian", which is the set of matrices that operate linearly on input embeddings to reproduce the predicted output embedding with ~10⁻⁶ error for float32 models.
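A rough sketch of that recipe (an illustration under assumptions, not the code from the linked repo; the model name is a placeholder, and plain autograd is used here, so this only approximates the fully detached construction):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hedged sketch: compute the Jacobian of the final next-token hidden state with
# respect to the input embeddings for one fixed input sequence, then check how
# well the resulting linear map reconstructs the forward output.
model_name = "meta-llama/Llama-3.2-3B"   # placeholder; any float32 decoder LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)

def last_hidden(e):
    # Final-layer hidden state at the last position (the "output embedding").
    return model(inputs_embeds=e, output_hidden_states=True).hidden_states[-1][0, -1]

# Note: the paper additionally detaches the nonlinear components (e.g. normalization
# denominators and gating terms) so that this Jacobian becomes locally exact; plain
# autograd as used here only gives a first-order approximation.
J = torch.autograd.functional.jacobian(last_hidden, embeds)   # (d_model, 1, seq, d_model)
linear_reconstruction = torch.einsum("dbse,bse->d", J, embeds)
print((linear_reconstruction - last_hidden(embeds)).abs().max())
```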

Interpretability: This method provides nearly exact token attribution rather than approximate attention weights; tools from linear algebra like the SVD are used to understand which concepts drive predictions.
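As a follow-on sketch (carrying over J, model, and tok from the block above; again an assumption-laden illustration, not the author's code), the detached Jacobian can be inspected with the SVD and its top singular vectors decoded through the unembedding matrix:

```python
# Hedged sketch: which tokens do the largest singular directions of J align with?
J_mat = J.reshape(J.shape[0], -1)                        # (d_model, seq * d_model)
U, S, Vh = torch.linalg.svd(J_mat, full_matrices=False)

W_unembed = model.get_output_embeddings().weight         # (vocab, d_model)
for i in range(3):
    token_scores = W_unembed @ U[:, i]                   # project singular vector onto vocab
    top = token_scores.topk(5).indices
    print(f"singular value {S[i].item():.2f}:", tok.convert_ids_to_tokens(top.tolist()))
```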

Scope: Works across Qwen 3, Gemma 3, Llama 3, Phi 4, Ministral and OLMo 2 (tested up to 70B parameters at q4).

Practical: The method works on free Colab T4 instances for Gemma 3 4B and Llama 3.2 3B models.

Concept steering: Preliminary results are shown for using the detached Jacobian as a linear conceptual steering operator in mid to late layers for guided generation of 8B models.

Trade-offs and costs: The detached Jacobian linear system is only valid for that specific input sequence (and must be computed from scratch for each new sequence). This is slow (10 sec to compute the Jacobian for Llama 3.2 3B on a T4, up to minutes for models > 30B parameters), VRAM intensive and currently limited to very short sequences, but I plan to continue working on this aspect.

Applications: In addition to steering, there is some potential for safety analysis (bias detection, deceptive content).

Background: This extends prior work on adaptive linear networks (Mohan, Kadkhodaie, Simoncelli et al.) and locally linear image diffusion models (Kadkhodaie, Simoncelli et al.) to transformer decoder architectures, building on decoder circuit analysis (Elhage, Nanda, Olsson et al.).

Abstract

We demonstrate that the inference operations of several open-weight large language models (LLMs) can be mapped to an exactly equivalent linear system for an input sequence without modifying the model weights or altering output predictions. Extending techniques from image diffusion models that exhibit local or piecewise linearity, we strategically alter the gradient computation with respect to a given input sequence for a next-token prediction such that the Jacobian of the model nearly exactly reproduces the forward prediction with a linear system. We demonstrate this approach across models (Llama 3, Gemma 3, Qwen 3, Phi 4, Mistral Ministral and OLMo 2, up to Llama 3.3 70B Q4) and show through the singular value decomposition of the detached Jacobian that these LLMs operate in extremely low-dimensional subspaces where many of the largest singular vectors decode to concepts related to the most-likely output token. This approach also allows us to examine the operation of each successive layer (and its attention and MLP components) as nearly-exact linear systems and observe the emergence of semantic concepts. Additionally, we present preliminary results on the detached Jacobian as a steering operator for inserting concepts into inference responses. Despite their expressive power and global nonlinearity, modern LLMs can be interpreted through nearly-exact locally linear decompositions that provide insights into their internal representations and reveal interpretable semantic structures in the next-token prediction process.

54
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Sad_Hall_2216 on 2025-06-06 08:46:35+00:00.


This new Apple paper focuses on the limited "true", human-like reasoning capabilities of LLMs and LRMs, and goes into detail on where these models fail on highly complex tasks.

Interesting finding: LRMs reduce their reasoning steps as task complexity increases, pointing to an overall lack of true reasoning.

55
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/simple-Flat0263 on 2025-06-05 08:03:29+00:00.


Hi guys, I am an incoming MS student at one of the T5 CS institutes in the US, in a fairly competitive program. I want to do a PhD and plan to move to the EU for personal reasons. I want to carry out research in computational materials science, but this may change over the course of my degree. I basically want some real advice from people currently in the EU about funding, employment opportunities, teaching opportunities, etc. I saw some posts about DeepMind fellowships, Meta fellowships, etc. Are part-time work or part-time PhDs common?

56
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/StartledWatermelon on 2025-06-05 11:00:58+00:00.


TL;DR: The team from Google Research continues to publish new SotA architectures for autoregressive language modelling, backed by thorough theoretical considerations.

Paper: https://www.arxiv.org/pdf/2505.23735

Abstract:

Transformers have been established as the most popular backbones in sequence modeling, mainly due to their effectiveness in in-context retrieval tasks and the ability to learn at scale. Their quadratic memory and time complexity, however, bound their applicability in longer sequences and so has motivated researchers to explore effective alternative architectures such as modern recurrent neural networks (a.k.a. long-term recurrent memory module). Despite their recent success in diverse downstream tasks, they struggle in tasks that require long context understanding and extrapolation to longer sequences. We observe that these shortcomings come from three disjoint aspects in their design: (1) limited memory capacity that is bounded by the architecture of memory and feature mapping of the input; (2) online nature of update, i.e., optimizing the memory only with respect to the last input; and (3) less expressive management of their fixed-size memory. To enhance all these three aspects, we present ATLAS, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture. Our experimental results on language modeling, common-sense reasoning, recall-intensive, and long-context understanding tasks show that ATLAS surpasses the performance of Transformers and recent linear recurrent models. ATLAS further improves the long context performance of Titans, achieving +80% accuracy in 10M context length of BABILong benchmark.
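A toy sketch of the design axis the abstract highlights (an illustration only, not the paper's actual ATLAS update rule): an "online" memory update fits the memory against the most recent token only, while a windowed update optimizes it against the current and past tokens.

```python
import torch

# Hedged sketch: the memory is just a linear map M from keys to values.
def online_update(M, k_t, v_t, lr=0.1):
    loss = ((M @ k_t - v_t) ** 2).sum()        # error on the last token only
    grad, = torch.autograd.grad(loss, M)
    return M - lr * grad

def windowed_update(M, K_win, V_win, lr=0.1):
    loss = ((K_win @ M.T - V_win) ** 2).sum()  # error over a window of current and past tokens
    grad, = torch.autograd.grad(loss, M)
    return M - lr * grad

d = 16
M = torch.zeros(d, d, requires_grad=True)
K_win, V_win = torch.randn(8, d), torch.randn(8, d)
M_online = online_update(M, K_win[-1], V_win[-1])
M_windowed = windowed_update(M, K_win, V_win)
```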

Visual Highlights:

https://preview.redd.it/uo3umo13835f1.png?width=1201&format=png&auto=webp&s=7caf036556ccaae6821a471449ea885345ec42ea

https://preview.redd.it/37zdk764835f1.png?width=1301&format=png&auto=webp&s=16ea25baa246247a254e3ad0a071fc36c8178951

https://preview.redd.it/yij6yc55835f1.png?width=887&format=png&auto=webp&s=b4c4c28e9ce5abf43f1ecc301293084d6f86a45a

Note that Atlas(MAG) and Atlas(MAL) are hybrid architectures too.

https://preview.redd.it/a724x7n2a35f1.png?width=1203&format=png&auto=webp&s=1c9e7f4328f8dd10593560478e03394bf886a2e2

The Transformer behaviour in the left panel can be explained by the model being trained on a 4k context length without any subsequent extension. The right panel looks super impressive.

57
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/GiftBrilliant6983 on 2025-06-05 01:03:45+00:00.


Hi, I was looking at past competitions and I was wondering if having a go at one of them is worth my time. My goal is to build my resume for when I apply for a PhD in the US this upcoming admission cycle. I want to do a PhD in CS/ML. I already have work in theoretical machine learning (one paper currently in preprint and another to be submitted to AISTATS), and I am currently working in a lab which also does theory. However, I also want to exhibit my coding and applied ML capabilities on my CV. This leads me here.

Are NeurIPS competitions well regarded in academia? Do you get published if you end up winning? Does anyone here know a winner, or is a winner themselves?

If not this, what other avenues should I pursue for my goal? Thanks in advance.

58
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/carrotjuice999 on 2025-06-04 02:42:11+00:00.


Has anyone here done the onsite interviews for a ML research scientist/engineer role at Scale AI?

If so, any tips/advice? Especially for the ML coding and behavioral rounds.

Thanks!

59
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/IEEESpectrum on 2025-06-04 16:22:46+00:00.


New MLPerf training results are in, and Nvidia's Blackwell GPUs continue to dominate across all six benchmarks. That said, the computers built around the newest AMD GPU, MI325X, matched the performance of Nvidia’s H200, Blackwell’s predecessor, on the most popular LLM fine-tuning benchmark.

https://spectrum.ieee.org/mlperf-training-5

60
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/daisy_petals_ on 2025-06-04 00:27:45+00:00.


Hey everyone!

I'm excited to share a project I've been working on: SnapViewer, an alternative to PyTorch's built-in memory visualizer. It's designed to handle large memory snapshots smoothly, providing an efficient way to analyze memory usage in PyTorch models.

Features:

  • Faster: Smoothly display large memory snapshots without the performance issues found in the official snapshot viewer https://docs.pytorch.org/memory_viz.
  • UI: Use WASD keys and mouse scroll to navigate through the memory timeline. Left-click on any allocation to view its size, call stack, and more; Right-click
  • Preprocessing: Convert your PyTorch memory snapshots to a zipped json format using the provided parse_dump.py script.

Getting Started:

  1. Record a Memory Snapshot: Follow PyTorch's documentation to record a memory snapshot of your model (a minimal sketch is shown after this list).
  2. Preprocess the Snapshot: Use the parse_dump.py script to convert the snapshot to a zip format:

```bash
python parse_dump.py -p snapshots/large/transformer.pickle -o ./dumpjson -d 0 -z
```

  3. Run SnapViewer: Use Cargo to run the application:

```bash
cargo run -r -- -z your_dump_zipped.zip --res 2400 1080
```

Note: The CLI options -z and -j are mutually exclusive.
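For step 1, a minimal sketch of recording a snapshot with PyTorch's documented memory-history API (the model and file paths below are placeholders):

```python
import torch

# Hedged sketch: record allocator history, run some work, then dump a snapshot
# that parse_dump.py can preprocess.
torch.cuda.memory._record_memory_history(max_entries=100_000)

model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(64, 4096, device="cuda")
for _ in range(10):
    model(x).sum().backward()

torch.cuda.memory._dump_snapshot("snapshots/large/transformer.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```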

Why SnapViewer?

PyTorch's official web memory visualizer struggles with large snapshots, with a framerate of 2-3 frames per minute (yes, minute). SnapViewer aims to be faster, or at least fast enough to do analyses. Currently, on my RTX 3050 it runs responsively (>30 fps) on hundred-MB-scale snapshots.

I'd love to hear your feedback, suggestions, or any issues you encounter. Contributions are also welcome!

Check it out here: https://github.com/Da1sypetals/SnapViewer

61
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/dreamewaj on 2025-06-04 11:58:39+00:00.


Found this paper pretty interesting. None of the models got anything right.

arxiv link: https://arxiv.org/abs/2505.24867

Abstract:

Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce SpookyBench, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when trained on datasets with low spatial signal-to-noise ratios (SNR), models' temporal understanding degrades more rapidly than human perception, especially in tasks requiring fine-grained temporal reasoning. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and bridge the gap between human and machine video understanding. The dataset and code have been made available on our project website: https://timeblindness.github.io/ .
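One plausible way to build a "temporal-only" stimulus like this (a sketch of the general idea, not necessarily SpookyBench's exact construction): every individual frame looks like pure noise, but the pixels inside a shape flip in lockstep over time while background pixels flicker independently.

```python
import numpy as np

# Hedged sketch: a disc is encoded only in temporal correlations between pixels.
rng = np.random.default_rng(0)
T, H, W = 64, 32, 32
yy, xx = np.mgrid[0:H, 0:W]
mask = (yy - H // 2) ** 2 + (xx - W // 2) ** 2 < (H // 4) ** 2   # the hidden "shape"

base = rng.integers(0, 2, size=(H, W))        # per-pixel random base pattern
flip = rng.integers(0, 2, size=T)             # times at which shape pixels flip together
frames = rng.integers(0, 2, size=(T, H, W))   # independent background noise
frames[:, mask] = base[mask][None, :] ^ flip[:, None]   # shape pixels flip in lockstep

# Any single frame looks like noise; only temporal structure across frames reveals the disc.
```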

62
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/hedgehog0 on 2025-06-03 09:13:54+00:00.


Hi everyone,

I am a Master's student in math in Germany interested in the theory and mathematical foundations of learning theory and neural networks. Recently I learned that there is a program called ELLIS (European Laboratory for Learning and Intelligent Systems) in Europe, which is not mentioned a lot here.

I am interested in applying to some schools in this program, so I was wondering if you could share your thoughts on and experience with it -- such as the admission difficulty, how you like your "grad school experience", and so on.

Many thanks!

63
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Designer-Air8060 on 2025-06-03 15:05:39+00:00.


As title says, what is the cheapest double descent experiment that can be done?
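One classically cheap setup (offered here as an illustration, not the only answer) is random-feature regression on a tiny dataset, sweeping the number of features through the interpolation threshold:

```python
import numpy as np

# Hedged sketch: minimum-norm least squares on random features; test error
# typically spikes near n_features ≈ n_train, then falls again (double descent).
rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 20
X_train, X_test = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
w_true = rng.normal(size=d)
y_train = X_train @ w_true + 0.5 * rng.normal(size=n_train)
y_test = X_test @ w_true + 0.5 * rng.normal(size=n_test)

for n_features in [10, 50, 90, 100, 110, 200, 1000]:
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)            # random projection
    phi_train, phi_test = np.tanh(X_train @ W), np.tanh(X_test @ W)
    coef, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)   # min-norm solution
    test_mse = np.mean((phi_test @ coef - y_test) ** 2)
    print(f"{n_features:5d} features  test MSE = {test_mse:.3f}")
```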

64
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/datashri on 2025-06-03 06:57:58+00:00.


In today's competitive atmosphere, authors usually tout SOTA results in whatever narrow sub-sub-domain. Older generations were more honest about "drawbacks", "limitations", and "directions for future research". Many (not all) modern papers either skip these sections or treat them like a marketing brochure.

An unrelated 3rd person (like me) needs a balanced view of what's good/bad about some methodology. Someone with a very high IQ and vast exposure/experience will probably find it easier to critique a paper after 1-2 reads. But that's not most people. Certainly not me.

Is there an easier way for mere mortals to get a more balanced perspective on where to place the significance of a piece of research?

In many cases, I have found that subsequent publications that cite these papers mention their drawbacks. I suppose one way would be to collect all future papers that cite paper X and use AI to search for all the negative or neutral things they have to say about paper X. This pipeline could probably be put together without too much difficulty (a rough sketch is shown below).
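A rough sketch of that pipeline using the public Semantic Scholar Graph API (the paper ID and keyword list are placeholders; a real pipeline would likely use an LLM rather than keyword matching on the abstracts):

```python
import requests

# Hedged sketch: fetch papers citing "paper X", then scan their abstracts for
# limitation-related language.
PAPER_ID = "arXiv:2106.00001"  # placeholder for paper X
url = f"https://api.semanticscholar.org/graph/v1/paper/{PAPER_ID}/citations"
resp = requests.get(url, params={"fields": "title,abstract", "limit": 100})
resp.raise_for_status()

keywords = ("limitation", "drawback", "fails", "does not generalize", "shortcoming")
for item in resp.json().get("data", []):
    paper = item["citingPaper"]
    abstract = (paper.get("abstract") or "").lower()
    if any(k in abstract for k in keywords):
        print(paper["title"])
```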

Is there a more Luddite approach?

65
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/taesiri on 2025-06-03 12:59:47+00:00.

66
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/hiskuu on 2025-06-03 01:37:20+00:00.


Abstract

Human cognition typically involves thinking through abstract, fluid concepts rather than strictly using discrete linguistic tokens. Current reasoning models, however, are constrained to reasoning within the boundaries of human language, processing discrete token embeddings that represent fixed points in the semantic space. This discrete constraint restricts the expressive power and upper potential of such reasoning models, often causing incomplete exploration of reasoning paths, as standard Chain-of-Thought (CoT) methods rely on sampling one token per step. In this work, we introduce Soft Thinking, a training-free method that emulates human-like “soft” reasoning by generating soft, abstract concept tokens in a continuous concept space. These concept tokens are created by the probability-weighted mixture of token embeddings, which form the continuous concept space, enabling smooth transitions and richer representations that transcend traditional discrete boundaries. In essence, each generated concept token encapsulates multiple meanings from related discrete tokens, implicitly exploring various reasoning paths to converge effectively toward the correct answer. Empirical evaluations on diverse mathematical and coding benchmarks consistently demonstrate the effectiveness and efficiency of Soft Thinking, improving pass@1 accuracy by up to 2.48 points while simultaneously reducing token usage by up to 22.4% compared to standard CoT. Qualitative analysis further reveals that Soft Thinking outputs remain highly interpretable and readable, highlighting the potential of Soft Thinking to break the inherent bottleneck of discrete language-based reasoning.
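The core mechanism is concrete enough to sketch (a minimal illustration under Hugging Face conventions, not the authors' implementation): each step feeds back the probability-weighted mixture of token embeddings instead of the embedding of a single sampled token.

```python
import torch

# Hedged sketch of one "soft thinking" step; `model` is assumed to be an HF causal LM
# and `input_embeds` a (1, seq_len, d_model) tensor of continuous input embeddings.
def soft_thinking_step(model, input_embeds):
    embedding_matrix = model.get_input_embeddings().weight        # (vocab, d_model)
    logits = model(inputs_embeds=input_embeds).logits[:, -1, :]   # next-token logits
    probs = torch.softmax(logits, dim=-1)                         # (1, vocab)
    concept_token = probs @ embedding_matrix                      # (1, d_model) weighted mixture
    return torch.cat([input_embeds, concept_token[:, None, :]], dim=1)
```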

If you’re into reasoning models, continuous representations, or just want to see where AI reasoning might go beyond token-limited models, I think you’ll enjoy this paper. Might be worth looking into!

Paper link: [2505.15778] Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space

67
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/tibetbefree on 2025-06-02 15:22:13+00:00.


I found that, quality- and correctness-wise, TMLR papers seem to be better than CVPR and ICLR papers on average, with the latter having huge variance in paper quality. Do people think so as well? If so, why?

68
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/idkwhatever1337 on 2025-06-01 22:47:57+00:00.


Seems like an LLM-written paper got accepted to the ACL main conference. To me this seems like a bad sign for research saturation and future innovation, but I’d be curious to hear people’s perspectives…

Relevant blog post:

https://www.intology.ai/blog/zochi-acl

69
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Dev-Table on 2025-06-01 19:41:50+00:00.

70
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Expensive-Ad8916 on 2025-06-01 14:53:13+00:00.

71
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/HopeIsGold on 2025-06-01 07:19:05+00:00.


Please mention the niche you work in and in what capacity. If at all possible you can share link to your works.

Now, coming to the question: assuming that you actively work in machine-learning-related fields, which books have given you the greatest benefit so far? They can also be books on foundational math topics or engineering skills.

I am a second year grad student (topic not yet finalised, mostly something in computer vision).

I am reading Probability Theory by E.T. Jaynes and for programming Structure and Interpretation of Computer Programs by Abelson and Sussman. Both are blowing my mind in a tremendously good way.

72
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/random_sydneysider on 2025-05-31 22:10:31+00:00.


Quick question about research engineer/scientist roles at DeepMind (or Google Research).

Would joining as a SWE and transferring internally be easier than joining externally?

I have two machine learning publications currently, and a couple of others that I'm submitting soon. It seems that the bar is quite high for external hires at Google Research, whereas joining as a SWE and doing 20% projects seems like it might be easier. Google wanted to hire me as a SWE a few years back (though I ended up going to another company), but I did not get an interview when I applied for research scientist. My PhD is in theoretical math from a well-known university, and a few of my classmates are in Google Research now.

73
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Own_Dirt_2408 on 2025-05-31 16:53:38+00:00.


Hello, I first-authored a paper and it was posted on arXiv by my co-author, but unfortunately on Google Scholar everyone's name except mine shows up, and I am worried that my name won't appear when the work is cited. My name is still there on arXiv and in the paper, and I'm unsure if this is just a Scholar bug and how to fix it.

74
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Fantastic-Nerve-4056 on 2025-05-31 20:45:24+00:00.


Hi folks, just came across this blog https://www.intology.ai/blog/zochi-acl

It started with an ICLR workshop and now ACL main; I was just wondering where we are heading. Is this all an effect of a noisy review process, or are the works indeed worth publishing?

PS: Not an NLP guy, so I couldn't really comment on the novelty/technical correctness of the work.

Edit: Just found a GitHub repo, corresponding to the agent https://github.com/IntologyAI/Zochi?tab=readme-ov-file

75
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Beyond_Birthday_13 on 2025-05-31 14:32:27+00:00.
