Machine Learning

1
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/seraschka on 2025-06-21 11:47:08+00:00.

2
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/datashri on 2025-06-21 05:46:46+00:00.


I'm reading through the Qwen2 paper.

Something escapes my limited comprehension -

Section 3.1

... the pre-training data was expanded from 3 trillion tokens in Qwen1.5 (Qwen Team, 2024a) to 7 trillion tokens. An attempt to further relax the quality threshold resulted in a 12 trillion token dataset. However, the model trained on this dataset did not show a significant performance improvement over the 7 trillion token model. It is suspected that increasing the volume of data does not necessarily benefit model pre-training.

So a smaller, higher-quality dataset is better. Got it.

All Qwen2 dense models, excluding Qwen2-0.5B, were pre-trained on this large-scale dataset of over 7 trillion tokens. Qwen2-0.5B was pre-trained using the 12 trillion token dataset.

How is it conceivable to train that tiny model on the humongous but lower quality dataset?? My modest intellect feels borderline abused.

Appreciate any tips to guide my understanding.

3
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/locomotus on 2025-06-20 23:45:28+00:00.

4
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/viskyx on 2025-06-20 10:50:48+00:00.


I wanted to share something I’ve been working on that might be useful to folks here. This is not a promotion; I’m genuinely just looking for feedback and ideas from the community.

I got frustrated with the process of finding affordable cloud GPUs for AI/ML projects. Between AWS, GCP, Vast.ai, Lambda, and all the new providers, it was taking hours to check specs, prices, and availability. There was no single source of truth, and price fluctuations and spot-instance changes made things even more confusing.

So I built GPU Navigator (nvgpu.com), a platform that aggregates real-time GPU pricing and specs from multiple cloud providers. The idea is to let researchers and practitioners quickly compare GPUs by type (A100, H100, B200, etc.), see what’s available where, and pick the best deal for their workflow.

What makes it different:

  • It’s a neutral, non-reselling site: no markups, just price data and links.
  • You can filter by use case (AI/ML, gaming, mining, etc.).
  • All data is pulled from provider APIs, so it stays updated with the latest pricing and instance types.
  • No login required, no personal info collected.

I’d really appreciate:

  • Any feedback on the UI/UX or missing features you’d like to see
  • Thoughts on how useful this would actually be for the ML community (or if there’s something similar I missed)
  • Suggestions for additional providers, features, or metrics to include

Would love to hear what you all think. If this isn’t allowed, mods please feel free to remove.

5
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/New-Skin-5064 on 2025-06-19 20:59:57+00:00.


For some reason, my training loss keeps oscillating, and never falls below 4 after one epoch. It is still generating garbage like: "Once upon a time, with a alone example, pre Deg; is a disease, the American casual Plate. Roberts of campaign"(Once upon a time was the prompt). I am using the GPT-2 Small architecture and training on FineWeb-Edu 10B. The batch size is ~525k tokens, and I use 0.1 dropout. Because the Kaggle TPU times out after 9 hours, I would reupload the latest checkpoint the next day to resume training, which I think is why the learning rate randomly spikes in the graph. I checked my dataloader, and it appears to be loading text from the shards correctly. If anybody knows what I am doing wrong, I would appreciate your feedback.

Here is my code for reference: https://github.com/sr5434/llm/blob/main/gpt-2-pretraining.ipynb

I also modified the same pipeline, shrank the model, and trained on TinyStories v2, and the model began to generate better text after 900 steps than the other did in over 20 thousand! The only difference between the two pipelines is the dataloader, as FineWeb is sharded but TinyStories is not. That implementation can be found here: https://github.com/sr5434/llm/blob/main/gpt-2-pretraining.ipynb

https://preview.redd.it/07m56zpx6y7f1.png?width=789&format=png&auto=webp&s=f99900a3d0ac834dea630baf7641cee2204072d3
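One thing worth checking on the resume path: if a checkpoint restores only the model weights, the optimizer's Adam moments and the LR scheduler restart from scratch, which can produce exactly this kind of spike on resume. A minimal sketch of a fuller checkpoint, assuming a standard PyTorch optimizer and scheduler (function names are illustrative, not from the linked notebook):

```python
import torch

def save_checkpoint(path, model, optimizer, scheduler, step):
    # Persist everything needed to resume exactly where training left off,
    # not just the model weights.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),  # Adam moment estimates
        "scheduler": scheduler.state_dict(),  # current position in the LR schedule
        "step": step,
    }, path)

def load_checkpoint(path, model, optimizer, scheduler):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["step"]  # also useful for fast-forwarding the dataloader
```

It may also be worth confirming that each resumed run advances the dataloader to the right shard position rather than replaying the same shards from the start.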

6
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Mission-Balance-4250 on 2025-06-19 14:08:14+00:00.


Hey everyone, I'm an ML Engineer who spearheaded the adoption of Databricks at work. I love the agency it affords me because I can own projects end-to-end and do everything in one place.

However, I am sick of the infra overhead and bells and whistles. Now, I am not in a massive org, but there aren't actually that many massive orgs... So many problems can be solved with a simple data pipeline and a basic model (e.g., XGBoost). Not only is there technical overhead, but also systems and process overhead; bureaucracy and red tape significantly slow delivery.

Anyway, I decided to try to address this myself by developing FlintML. Basically: Polars, Delta Lake, a unified catalog, Aim experiment tracking, a notebook IDE, and orchestration (still working on this), all spun up with Docker Compose.
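For a sense of the core data layer, here's a minimal sketch of the Polars-on-Delta-Lake pattern (generic library usage, not FlintML's own API; writing Delta tables from Polars needs the deltalake package installed):

```python
import polars as pl

# Toy feature table; in practice this comes out of a pipeline step.
df = pl.DataFrame({
    "user_id": [1, 2, 3],
    "clicks": [12, 7, 31],
    "converted": [0, 1, 1],
})

# Persist it as a local Delta Lake table.
df.write_delta("data/features", mode="overwrite")

# Read it back and filter for downstream training.
features = pl.read_delta("data/features").filter(pl.col("clicks") > 5)
print(features)
```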

I'm hoping to get some feedback from this subreddit. I've spent a couple of months developing this and want to know whether I would be wasting my time by continuing, or if this might actually be useful.

Thanks heaps

7
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/jsonathan on 2025-06-19 07:07:08+00:00.

8
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/jsonathan on 2025-06-17 10:11:24+00:00.

9
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/WristbandYang on 2025-06-19 00:19:47+00:00.


For some context I’ve been working on a number of NLP projects lately (classifying textual conversation data). Many of our use cases are classification tasks that align with our niche objectives. I’ve found in this setting that structured output from LLMs can often outperform traditional methods.

That said, my boss is now asking for likelihoods instead of just classifications. I haven’t implemented this yet, but my gut says this could be pushing LLMs into the “lying machine” zone. I mean, how exactly would an LLM independently rank documents and do so accurately and consistently?

So I’m curious:

  • What kinds of tasks have you found to be unreliable or risky for zero-shot LLM use?
  • And on the flip side, what types of tasks have worked surprisingly well for you?
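For the likelihoods piece specifically, one option is to score each candidate label by its next-token log-probability and renormalize over the label set, rather than asking the model to self-report a number. A rough sketch with a local Hugging Face model (the model name and single-token labels are illustrative placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; swap in whatever model you actually use
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

labels = [" positive", " negative", " neutral"]   # leading space matters for GPT-2's BPE
label_ids = [tok.encode(l)[0] for l in labels]    # assumes each label is a single token

prompt = "Classify the sentiment of this message: 'The rollout went smoothly.'\nSentiment:"
with torch.no_grad():
    logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]  # next-token logits

probs = torch.softmax(logits[label_ids], dim=0)  # renormalize over the label set only
for label, p in zip(labels, probs):
    print(label.strip(), round(p.item(), 3))
```

These are relative scores rather than calibrated probabilities; if the numbers need to behave like real likelihoods, they would still have to be calibrated against a labeled hold-out set.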
10
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/OhDeeDeeOh on 2025-06-18 21:26:06+00:00.


We've compiled a curated collection of real-world case studies from over 100 companies, showcasing practical machine learning applications—including those using large language models (LLMs) and generative AI. Explore insights, use cases, and lessons learned from building and deploying ML and LLM systems, and discover how top companies like Netflix, Airbnb, and DoorDash leverage AI to enhance their products and operations.

https://www.hubnx.com/nodes/9fffa434-b4d0-47d2-9e66-1db513b1fb97

11
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Single-Blackberry885 on 2025-06-17 13:21:59+00:00.


Hi everyone, I’m in year 2 of my PhD at a top 15 global university, working on interpretability and robust ML. Lately, I’ve hit a wall — no strong results for months, and I’m feeling demotivated. Financial constraints are also starting to bite.

I started this PhD with the goal of becoming a Research Scientist at a top lab (e.g., DeepMind, FAIR, Amazon etc.). But now I’m wondering how realistic or stable that goal actually is:

• These roles are highly competitive, very market-dependent, and seem just as exposed to layoffs as any other.
• Recent cuts at big labs have made me rethink whether investing 3 more years is the right move, especially if the payoff isn’t guaranteed.

I’ve been considering switching to a full-time ML or Research Engineer role in London or Singapore, where I’d like to settle long-term.

But here’s my dilemma:

• As an Indian citizen, a layoff could mean having to leave the country — it’s not just a job loss, but a complete life disruption.
• Would working in industry without a PhD make me even more vulnerable in the job market?

So I’m reaching out to those already working in the field:

• How stable are research scientist vs. ML/research engineer roles right now?
• Does having a PhD actually give you better protection or flexibility when layoffs happen?
• What’s the real-world job availability like in these roles — both in Big Tech and smaller labs?

Any experiences or guidance would mean a lot. I want to make a decision with open eyes — either push through the next 3 years, or start building stability sooner.

Thanks in advance

12
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/moschles on 2025-06-17 12:10:04+00:00.


Causal Machine Learning

Do you work in CausalML? Have you heard of it? Do you have an opinion about it? Anything else you would like to share about CausalML?

The 140-page survey paper on CausalML.

One of the breakout books on causal inference.

13
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Worried-Variety3397 on 2025-06-17 01:31:39+00:00.


Honestly, the prices I have seen from data labeling vendors are just insane. The delivery timelines are way too long as well. We had a recent project with some medical data that needed pre-sales labeling. The vendor wanted us to pay them every week, but every delivery was a mess and needed countless rounds of revisions.

Later we found out the labeling company had outsourced the whole task to a group of people who clearly had no idea what they were doing. If your project is small, niche, or long-tail, the bigger vendors do not even want to take it. The smaller teams? I just cannot trust their quality.

Besides being crazy expensive, the labeling is always super subjective, especially for big, complex, or domain-specific datasets. Consistency is basically nonexistent. The turnover at these labeling companies is wild too. It feels like half their team just gets a crash course and then is thrown onto your project. I really cannot convince myself they are going to deliver anything good.

Now I am getting emails from companies claiming their "automated labeling" is faster and better than anything humans can do. I honestly have no clue if that is for real since I have never actually tried it.

Is anyone else seeing this problem? How do you all deal with the labeling part of the workflow? Is automated labeling actually any good? Has anyone tried it or had it totally flop?

Would appreciate any honest feedback. Thanks for your time.

14
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/bawkbawkbot on 2025-06-16 12:42:32+00:00.


Hi, I'm bawkbawkbot! I'm a five-year-old chicken recognition bot 🐔 built using TensorFlow. I am open source and can be found here: https://gitlab.com/Lazilox/bawkbawkbot. I've been serving the Reddit community by identifying their chicken breeds. I'm not an expert (I am only a chicken-bot), but the community seems happy with my performance and I often contribute to threads meaningfully!

I run on a Pi 4 and don’t need a GPU. People ask why I don’t use LLMs or diffusion models, but for small, focused tasks like “which chicken is this?” the old-school CV approach works.
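For anyone curious, inference for a small classifier like this is only a few lines. A hedged sketch (file name and breed list are made up; the real model and labels live in the GitLab repo above):

```python
import numpy as np
import tensorflow as tf

# Hypothetical artifacts; the real ones are in the linked repo.
model = tf.keras.models.load_model("chicken_breeds.h5")
class_names = ["silkie", "leghorn", "orpington"]  # illustrative subset

img = tf.keras.utils.load_img("hen.jpg", target_size=(224, 224))
x = tf.keras.utils.img_to_array(img)[np.newaxis] / 255.0  # batch of one, scaled to [0, 1]

probs = model.predict(x)[0]
print(class_names[int(np.argmax(probs))], float(probs.max()))
```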

Curious what people think — does this kind of task still make sense as a standalone model, or is there value in using multimodal LLMs even at this scale? How long before I'm obsolete?

Bawk bawk!

15
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Rajivrocks on 2025-06-16 14:44:11+00:00.


I believe this has been asked on multiple occasions, but I have a concrete example to get references on. I am writing my Master's thesis at the moment, and while writing I'm skipping the figures because I don't know which web app works best. Here is the figure whose style I'd like to "copy":

https://preview.redd.it/lqwl88m5wa7f1.png?width=1445&format=png&auto=webp&s=8287eeda6dd8151ccb177509c4d46f9cc1a0cf96

From Chen et al 2021 "TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation"

What I specifically like are the 3D representations of the down/upsampling layers in the CNN and decoder respectively.

What tools do you guys recommend that can create figures that look as visually appealing and informative as this one?

In my Bachelor's I used Lucidchart because we had a license, but I don't have it anymore, so I've moved to draw.io. I feel that I can't create these kinds of figures with that website, though.

What do you guys recommend and what do you guys use for your papers?

16
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Deep_Expression182 on 2025-06-16 07:48:07+00:00.


We’re hiring senior and principal research scientists to shape the future of generative AI at NVIDIA.

We're looking for builders with deep experience in LLMs and/or multimodal models. You’ll work on training and deploying frontier-scale models, designing next-gen model architectures, optimizing training stacks, and helping us push the frontier of AI performance.

We’re a tight-knit team with high standards, strong research instincts, and a bias for shipping.

Open roles:

What we value:

  • Deep understanding of transformer architectures, distributed training and optimization
  • Using the scientific method for conducting methodical training experiments
  • Data curation for pre-training and post-training
  • Experience working with LLMs and/or large multimodal models
  • A builder mindset — clean code, fast iterations, deep thinking

This is a rare opportunity to help shape NVIDIA’s genAI stack from the ground up. We work closely with software, optimization, deployment, and many other research teams, and have massive scale and resources behind us.

Feel free to apply directly through the links.

17
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/avd4292 on 2025-06-16 03:56:54+00:00.


Hi, we have released a new paper that studies the underlying mechanism of the attention and feature-map artifacts described in Vision Transformers Need Registers, a phenomenon that has also been observed in LLMs (e.g., 1, 2). We propose a training-free method to mitigate this. As one of the authors, I am creating this post to kickstart any discussion.

Paper: https://arxiv.org/abs/2506.08010

Project Page: https://avdravid.github.io/test-time-registers/

Code: https://github.com/nickjiang2378/test-time-registers/tree/main

18
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Fantastic-Nerve-4056 on 2025-06-16 00:28:24+00:00.


Thought of posting this to get an expert point of view (mainly Research Scientists or Profs.)

So I am a current PhD student in Machine Learning, working on theoretical aspects of Reinforcement Learning. Additionally, I have interned at Google DeepMind and Adobe Research, working on applied aspects of AI, and here's what I observed:

Academia: We don't really have access to a lot of compute (in comparison to industry), and given that my work is theoretical, we prove things mathematically and then move on to the experiments, already knowing the likely outcome. While this is a lengthy process, it does give that "Research Vibe".

Industry: Here, given that we have a lot of compute, the work goes like this: you get an idea, you form a few intuitive expectations, and if it works, great; otherwise you analyse the results, see what could have gone wrong, and come up with a better approach. While I understand things are very applied here, I really don't get that "Research Vibe", and it feels more like a "Product Dev" role.

I am aware that even at these orgs there are teams working on foundational aspects, but that seems to be quite rare.

So I genuinely wanted to get an idea from relevant experts, both from the industry and academia, on what I am really missing. Would appreciate any inputs on it, as I have always thought of joining industry after my PhD, but that vibe seems to be missing.

19
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/hedgehog0 on 2025-06-15 18:15:15+00:00.

20
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/jsonathan on 2025-06-15 12:19:47+00:00.

21
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Striking-Warning9533 on 2025-06-14 16:31:29+00:00.


I have noticed that a lot of people on Reddit only learn pseudo-science about AI from social media and then tell others how AI works in all sorts of imaginary ways. They borrow words from fiction or myth to explain these AI systems, and look down on actual AI researchers who don't share their beliefs. And they keep using big words that aren't actually correct, or even used in the ML/AI community, just because they sound cool.

And when you point it out, they instantly lose it and say you are closed-minded.

Has anyone else noticed this trend? Where do you think this misinformation mainly comes from, and is there any effective way to push back against it?

22
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Educational_Pea_5027 on 2025-06-14 14:00:55+00:00.


Hey r/MachineLearning,

I wanted to share a project I've been working on called HandFonted. It's a full-stack Python application that converts an image of handwriting into an installable font file (.ttf).

I'll post direct links to the live demo and the GitHub repo in my first comment below.

The Machine Learning Pipeline

The core of the project is a three-stage process. The ML model is central, but its success depends heavily on the pre-processing and post-processing steps.

  • 1. Input & Segmentation:
    • A user uploads a single image containing handwritten characters.
    • The image is processed with OpenCV: converted to grayscale, adaptive thresholding is applied, and contours are detected to isolate each character into its own bounding box.
  • 2. Classification & Assignment:
    • Each isolated character image is fed into a pre-trained PyTorch (ResNet-Inception) model.
    • The model outputs a probability matrix for all characters against all possible classes (A-Z, a-z).
    • The Hungarian algorithm (linear_sum_assignment) is used to find the optimal one-to-one assignment, ensuring each character image is mapped to a unique letter (see the sketch after this list).
  • 3. Vectorization & Font Generation:
    • The now-classified character images are converted from raster (pixels) to vector outlines using scikit-image.
    • The fontTools library assembles these vector glyphs into a standard .ttf file, mapping each one to its correct Unicode character.
  • Limitations: The system currently works best when the input image has clearly separated characters on a plain white background.
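The assignment step in particular is compact. A minimal sketch of the idea (array shapes and probabilities are illustrative, not the HandFonted code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative: probs[i, j] = model probability that character crop i is class j.
rng = np.random.default_rng(0)
probs = rng.random((52, 52))
probs /= probs.sum(axis=1, keepdims=True)

# The Hungarian algorithm minimizes total cost, so use negative log-probability.
cost = -np.log(probs + 1e-9)
rows, cols = linear_sum_assignment(cost)  # optimal one-to-one crop-to-class matching

classes = [chr(c) for c in range(65, 91)] + [chr(c) for c in range(97, 123)]  # A-Z, a-z
mapping = {int(i): classes[j] for i, j in zip(rows, cols)}
print(mapping)
```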

This project was a fantastic learning experience in building a practical, end-to-end ML system. The code is fully open-source, and I'd love any feedback or questions you have about the implementation.

23
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/som_samantray on 2025-06-14 02:58:46+00:00.


How to read ML Papers to stay aware of the most recent developments in the AI industry?

I am an average engineering grad working as a PM, and I like to explore concepts in depth. Research papers are a good source of information, unlike news and clickbait.

I am not expert enough to delve into the mathematical analysis in these papers, but I want to find ways to get the general gist of a paper for my own knowledge.

24
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Chocological45 on 2025-06-13 13:13:39+00:00.


TL;DR: The paper introduces MOSAIC, a framework for collaborative learning among autonomous, agentic AI systems that operate in decentralized, dynamic environments. These agents selectively share and reuse modular knowledge (in the form of neural network masks) without requiring synchronization or centralized control.

Key innovations include:

  • Task similarity via Wasserstein embeddings and cosine similarity to guide knowledge retrieval (see the sketch after this list).
  • Performance-based heuristics to decide what, when, and from whom to learn.
  • Modular composition of knowledge to build better policies.
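A minimal sketch of the retrieval idea in the first bullet, assuming each agent already holds a fixed-length task embedding (the Wasserstein embedding step itself is out of scope here, and the threshold is arbitrary):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Hypothetical: the local agent's task embedding and the embeddings peers shared as queries.
rng = np.random.default_rng(1)
my_task = rng.normal(size=64)
peers = {f"agent_{i}": rng.normal(size=64) for i in range(5)}

# Rank peers by similarity and only request masks from sufficiently similar tasks.
scores = {name: cosine_similarity(my_task, emb) for name, emb in peers.items()}
candidates = [name for name, s in sorted(scores.items(), key=lambda kv: -kv[1]) if s > 0.2]
print(candidates)  # peers worth querying for knowledge (masks)
```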

Experiments show that MOSAIC outperforms isolated learners in speed and performance, sometimes solving tasks that isolated agents cannot. Over time, a form of emergent self-organization occurs between agents, resulting from the discovered hierarchies in the curriculum, where simpler tasks support harder ones, enhancing the collective’s efficiency and adaptability.

Overall, MOSAIC demonstrates that selective, autonomous collaboration can produce a collective intelligence that exceeds the sum of its parts.

The paper: https://arxiv.org/abs/2506.05577

The code: https://github.com/DMIU-ShELL/MOSAIC

Abstract:

Agentic AI has gained significant interest as a research paradigm focused on autonomy, self-directed learning, and long-term reliability of decision making. Real-world agentic systems operate in decentralized settings on a large set of tasks or data distributions with constraints such as limited bandwidth, asynchronous execution, and the absence of a centralized model or even common objectives. We posit that exploiting previously learned skills, task similarities, and communication capabilities in a collective of agentic AI are challenging but essential elements to enabling scalability, open-endedness, and beneficial collaborative learning dynamics. In this paper, we introduce Modular Sharing and Composition in Collective Learning (MOSAIC), an agentic algorithm that allows multiple agents to independently solve different tasks while also identifying, sharing, and reusing useful machine-learned knowledge, without coordination, synchronization, or centralized control. MOSAIC combines three mechanisms: (1) modular policy composition via neural network masks, (2) cosine similarity estimation using Wasserstein embeddings for knowledge selection, and (3) asynchronous communication and policy integration. Results on a set of RL benchmarks show that MOSAIC has a greater sample efficiency than isolated learners, i.e., it learns significantly faster, and in some cases, finds solutions to tasks that cannot be solved by isolated learners. The collaborative learning and sharing dynamics are also observed to result in the emergence of ideal curricula of tasks, from easy to hard. These findings support the case for collaborative learning in agentic systems to achieve better and continuously evolving performance both at the individual and collective levels.

High-level illustration of the main MOSAIC algorithmic steps. (A) A Wasserstein task embedding is maintained throughout learning. (B) Embeddings are shared with other agents as queries. (C) Agents respond with information regarding their knowledge. Selection occurs via similarity (D) and performance (E). (F) (G) Network masks are requested. (H) Received masks composed together for the next forward pass.

Comparison of MOSAIC against baseline approaches over 70 runs (14 tasks and five seeds/task) with 95% confidence intervals.

Ablation of MOSAIC with individual components removed from the system. MOSAIC performs best when all components work as one.

25
 
 
This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/SouvikMandal on 2025-06-12 15:41:51+00:00.


We're excited to share Nanonets-OCR-s, a powerful and lightweight (3B) VLM model that converts documents into clean, structured Markdown. This model is trained to understand document structure and content context (like tables, equations, images, plots, watermarks, checkboxes, etc.).

🔍 Key Features:

  • LaTeX Equation Recognition: Converts inline and block-level math into properly formatted LaTeX, distinguishing between $...$ and $$...$$.
  • Image Descriptions for LLMs: Describes embedded images using structured <img> tags. Handles logos, charts, plots, and so on.
  • Signature Detection & Isolation: Finds and tags signatures in scanned documents, outputting them in <signature> blocks.
  • Watermark Extraction: Extracts watermark text and stores it within a <watermark> tag for traceability.
  • Smart Checkbox & Radio Button Handling: Converts checkboxes to Unicode symbols like ☑, ☒, and ☐ for reliable parsing in downstream apps.
  • Complex Table Extraction: Handles multi-row/column tables, preserving structure and outputting both Markdown and HTML formats.
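Downstream, that tagged Markdown can be consumed with plain text parsing. A small, hypothetical sketch based only on the tags listed above (the exact tag shapes and the sample text are assumptions, not real model output):

```python
import re

markdown = """Quarterly report  ☑ approved  ☐ rejected
<img>Bar chart of revenue by region, Q2</img>
<signature>J. Smith</signature>
<watermark>CONFIDENTIAL</watermark>"""

# Pull out the structured pieces the model is documented to emit.
images = re.findall(r"<img>(.*?)</img>", markdown, re.S)
signatures = re.findall(r"<signature>(.*?)</signature>", markdown, re.S)
watermarks = re.findall(r"<watermark>(.*?)</watermark>", markdown, re.S)
checked, unchecked = markdown.count("☑") + markdown.count("☒"), markdown.count("☐")

print(images, signatures, watermarks, checked, unchecked)
```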

Huggingface / GitHub / Try it out:

Huggingface Model Card

Read the full announcement

Try it with Docext in Colab

Example outputs (shown as images in the original post): checkboxes, equations, image descriptions, signatures, tables, watermarks.
