Together AI Launches DSGym Framework for Training Data Science AI Agents




Rebeca Moen
Jan 26, 2026 23:09

Together AI’s DSGym framework benchmarks LLM agents on 90+ bioinformatics tasks and 92 Kaggle competitions. Its fine-tuned 4B-parameter model is competitive with much larger rivals on several benchmarks.




Together AI has released DSGym, a comprehensive framework for evaluating and training AI agents designed to perform data science tasks autonomously. The framework includes over 90 bioinformatics challenges and 92 Kaggle competition datasets, providing standardized benchmarks that address fragmentation issues plaguing existing evaluation methods.

The standout claim: Together AI’s 4-billion-parameter model, trained using DSGym’s synthetic trajectory generation, is competitive with models 50 times its size on certain benchmarks.

Benchmark Results Show Surprising Efficiency

The published benchmarks show a small fine-tuned model closing much of the gap to far larger ones. Together AI’s Qwen3-4B-DSGym-SFT-2k model, fine-tuned using the framework, scored 59.36% on QRData-Verified and 77.78% on DABStep-easy. That puts it ahead of the base Qwen3-4B-Instruct model (45.27% and 58.33% respectively), a gain of roughly 14 and 19 percentage points, and makes it competitive with models like Deepseek-v3.1 and GPT-OSS-120B on several metrics.

Claude 4.5 Sonnet currently leads on harder tasks, scoring 37.04% on DABStep-hard against the fine-tuned 4B model’s 33.07%. That gap of about four percentage points is small given the massive difference in model scale.

Kimi-K2-Instruct posted the highest QRData-Verified score at 63.68%, while GPT-4o achieved 92.26% on DAEval-Verified, which suggests that different models excel at different task types.
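To make the comparisons above concrete, the short sketch below recomputes the percentage-point differences from the scores quoted in this article. The dictionary layout and the `point_gap` helper are illustrative assumptions, not part of DSGym, and the numbers are copied from the text rather than measured independently.

```python
# Scores (percent) as quoted in the article; not independently measured.
scores = {
    "QRData-Verified": {
        "Qwen3-4B-Instruct": 45.27,
        "Qwen3-4B-DSGym-SFT-2k": 59.36,
        "Kimi-K2-Instruct": 63.68,
    },
    "DABStep-easy": {
        "Qwen3-4B-Instruct": 58.33,
        "Qwen3-4B-DSGym-SFT-2k": 77.78,
    },
    "DABStep-hard": {
        "Qwen3-4B-DSGym-SFT-2k": 33.07,
        "Claude 4.5 Sonnet": 37.04,
    },
}


def point_gap(benchmark: str, model_a: str, model_b: str) -> float:
    """Return model_a's score minus model_b's score, in percentage points."""
    table = scores[benchmark]
    return round(table[model_a] - table[model_b], 2)


SFT = "Qwen3-4B-DSGym-SFT-2k"
BASE = "Qwen3-4B-Instruct"

# Gains of the fine-tuned 4B model over its base model.
gain_qrdata = point_gap("QRData-Verified", SFT, BASE)
gain_easy = point_gap("DABStep-easy", SFT, BASE)

# Remaining gaps to the leading models named in the article.
gap_hard = point_gap("DABStep-hard", "Claude 4.5 Sonnet", SFT)
gap_kimi = point_gap("QRData-Verified", "Kimi-K2-Instruct", SFT)

print(f"Gain over base on QRData-Verified: {gain_qrdata} points")
print(f"Gain over base on DABStep-easy:    {gain_easy} points")
print(f"Gap to Claude on DABStep-hard:     {gap_hard} points")
print(f"Gap to Kimi on QRData-Verified:    {gap_kimi} points")
```

Differences are given in percentage points rather than relative percent, since the benchmarks have different baselines and a relative figure would overstate gains on the harder tasks.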

Why This Matters for AI Development

DSGym tackles a real problem in the AI agent space. Current benchmarks suffer from inconsistent evaluation interfaces and limited task diversity, making it difficult to compare agent performance meaningfully. The framework’s modular architecture allows researchers to add new tasks, agent scaffolds, and tools without rebuilding from scratch.
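The article does not show DSGym's actual interfaces, so the following is only a minimal sketch of what a modular task registry could look like in general. Every name here (`Task`, `register_task`, `evaluate`, `constant_agent`) is hypothetical and should not be read as the framework's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Task:
    """A hypothetical benchmark task: a prompt plus an answer checker."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the agent's answer is accepted


# Hypothetical global registry; new tasks plug in without touching evaluate().
TASKS: Dict[str, Task] = {}


def register_task(task: Task) -> Task:
    """Add a task to the registry, rejecting duplicate names."""
    if task.name in TASKS:
        raise ValueError(f"task already registered: {task.name}")
    TASKS[task.name] = task
    return task


def evaluate(agent: Callable[[str], str]) -> Dict[str, bool]:
    """Run one agent over every registered task through the same interface."""
    return {name: task.check(agent(task.prompt)) for name, task in TASKS.items()}


register_task(
    Task(
        name="mean-of-three",
        prompt="What is the mean of 2, 4, and 9?",
        check=lambda answer: answer.strip() == "5",
    )
)


def constant_agent(prompt: str) -> str:
    """Stand-in for an LLM agent; always answers '5'."""
    return "5"


results = evaluate(constant_agent)
print(results)
```

The point of the design is that the evaluation loop depends only on the `Task` interface, so adding a task or swapping the agent does not require changing the harness.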

The execution-verified data synthesis pipeline is particularly notable. Rather than training on static datasets, the system generates synthetic training trajectories that are validated through actual code execution—reducing the garbage-in-garbage-out problem that hampers many AI training pipelines.
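The idea of execution-verified filtering can be sketched as follows. This is an illustration under stated assumptions, not Together AI's pipeline: the `Trajectory` type, the exact-match check on standard output, and the helper names are all hypothetical, and a real system would run untrusted code in a sandbox rather than a plain subprocess.

```python
import subprocess
import sys
from typing import List, NamedTuple


class Trajectory(NamedTuple):
    """Hypothetical synthetic sample: generated code and the output it should print."""
    code: str
    expected_stdout: str


def passes_execution(traj: Trajectory, timeout: float = 10.0) -> bool:
    """Run the code in a fresh interpreter; accept only a clean exit with matching output."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", traj.code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0 and proc.stdout.strip() == traj.expected_stdout.strip()


def filter_verified(trajectories: List[Trajectory]) -> List[Trajectory]:
    """Keep only trajectories whose code actually runs and produces the expected result."""
    return [t for t in trajectories if passes_execution(t)]


candidates = [
    Trajectory("print(sum([1, 2, 3]))", "6"),  # runs and matches
    Trajectory("print(1 / 0)", "inf"),         # raises ZeroDivisionError
    Trajectory("print(2 + 2)", "5"),           # runs, but the answer is wrong
]

verified = filter_verified(candidates)
print(f"kept {len(verified)} of {len(candidates)} trajectories")
```

Only the first candidate survives, which is the behavior the paragraph above describes: samples that crash or produce the wrong result never reach the training set.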

For companies building AI-powered data analysis tools, DSGym provides a standardized way to measure progress. The bioinformatics focus (DSBio) and prediction task coverage (DSPredict) extend beyond generic coding benchmarks into domain-specific applications where AI agents could deliver real productivity gains.

What’s Next

The framework is positioned as an evolving testbed rather than a static benchmark suite. Together AI has emphasized the extensibility angle, suggesting they’ll continue adding task categories and evaluation metrics. With AI agent development accelerating across the industry, having a common evaluation standard could help separate genuine capability improvements from benchmark gaming—though that’s always easier said than done.
