Tuesday February 18, 2025

Kindle devices can now be jailbroken with WinterBreak, Open-CUAK brings open-source management of automation agents at scale, and the EnigmaEval benchmark reveals major limitations in state-of-the-art models' long multimodal reasoning.

News

All Kindles can now be jailbroken

WinterBreak is a jailbreak for Kindles, released on New Year's Day 2025, that gives users access to their device's underlying system so they can install custom modifications. The jailbreak is based on Mesquito and is installed in a few steps: download the WinterBreak release, extract the files onto the Kindle, and run the jailbreak on the device. A troubleshooting section covers common problems that can come up during installation.
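
For the "extract the files onto the Kindle" step, a small script along these lines can do the copy once the Kindle is mounted as a USB drive; the archive name and mount point below are assumptions, so the official WinterBreak instructions remain the authoritative guide.

```python
import zipfile
from pathlib import Path

# Assumed locations -- adjust to where the release was downloaded and where
# the Kindle's USB mass-storage volume mounts on your machine.
RELEASE_ZIP = Path("~/Downloads/WinterBreak.zip").expanduser()
KINDLE_ROOT = Path("/Volumes/Kindle")  # e.g. /media/<user>/Kindle on Linux

def copy_release_to_kindle(release_zip: Path, kindle_root: Path) -> None:
    """Extract the jailbreak release onto the Kindle's USB root."""
    if not kindle_root.exists():
        raise FileNotFoundError(f"Kindle not mounted at {kindle_root}")
    with zipfile.ZipFile(release_zip) as archive:
        archive.extractall(kindle_root)
    print("Files copied; safely eject the Kindle and continue on-device.")

if __name__ == "__main__":
    copy_release_to_kindle(RELEASE_ZIP, KINDLE_ROOT)
```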

Mistral Saba

Mistral Saba is a 24B parameter AI model trained on datasets from the Middle East and South Asia, providing accurate and relevant responses in regional languages such as Arabic and several Indian-origin languages. The model is designed to serve use cases with strong regional context, offering benefits such as conversational support, domain-specific expertise, and cultural content creation, and is available as an API or for local deployment.
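
As a rough sketch of the API route, the call below goes through Mistral's chat-completions endpoint using the requests library; the model identifier is an assumption, so check Mistral's published model list before relying on it.

```python
import os
import requests

# Minimal sketch of calling Saba via Mistral's chat-completions endpoint.
# The model identifier "mistral-saba-latest" is an assumption; confirm the
# exact name against Mistral's model listing.
api_key = os.environ["MISTRAL_API_KEY"]

response = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "model": "mistral-saba-latest",
        "messages": [
            # An Arabic prompt, the kind of regional-language query Saba targets.
            {"role": "user", "content": "ما هي أشهر الأطباق التقليدية في منطقة الخليج؟"},
        ],
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```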

New junior developers can’t code

The author is concerned that the increasing reliance on AI tools like Copilot and GPT among junior developers is leading to a lack of deep understanding of the code they're writing, as they're able to produce working code quickly without fully grasping the underlying principles. To combat this, the author suggests using AI with a learning mindset, engaging in discussions with other developers, and building things from scratch to gain a deeper understanding of the code and the development process.

Elon Musk's terrifying vision for AI

Elon Musk's new Large Language Model, Grok, is capable of spreading propaganda and influencing people's attitudes, often without them even realizing it, which is a concerning development given Musk's immense power and influence. The model's potential for biased and unreliable output, combined with Musk's plans to integrate it into various aspects of society, including education, raises significant concerns about the impact on democracy and the spread of misinformation.

The Generative AI Con

The author argues that the Large Language Model (LLM) industry, and ChatGPT in particular, is a bubble inflated by hype and breathless media coverage, and that its popularity is not a reliable indicator of its sustainability or value as a real industry. Despite ChatGPT's 300 million weekly users, the author questions the significance of that number, arguing that it is largely a product of media attention and does not necessarily translate into a viable or profitable business model.

Research

Assured LLM-Based Software Engineering

This paper proposes Assured LLM-Based Software Engineering, a generate-and-test approach that uses Large Language Models (LLMs) to improve code independently of humans while ensuring the improved code does not regress and is verifiably better. The approach applies semantic filters to discard unsuitable candidates, so that humans only review the final output, much as they would review code written by another engineer.
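
A minimal sketch of that generate-and-test loop, assuming hypothetical helpers for generation, building, testing, and measurement (none of them come from the paper):

```python
from typing import Callable, Optional

# Hypothetical helpers -- stand-ins for an LLM client, a build/test harness,
# and a performance measurement tool; they are not the paper's components.
def generate_candidate(prompt: str) -> str: ...
def compiles(code: str) -> bool: ...
def passes_regression_tests(code: str) -> bool: ...
def measured_improvement(old: str, new: str) -> float: ...

def assured_improvement(original: str, prompt: str, attempts: int = 10) -> Optional[str]:
    """Generate-and-test: keep only candidates that survive every semantic filter."""
    filters: list[Callable[[str], bool]] = [
        compiles,                                            # must build
        passes_regression_tests,                             # must not regress behaviour
        lambda c: measured_improvement(original, c) > 0.0,   # must be verifiably better
    ]
    for _ in range(attempts):
        candidate = generate_candidate(prompt)
        if all(check(candidate) for check in filters):
            return candidate  # only now does a human review the final output
    return None  # no assured improvement found
```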

Step-Video-T2V: The Practice, Challenges, and Future of Video Foundation Model

Step-Video-T2V is a state-of-the-art text-to-video model with 30B parameters, capable of generating high-quality videos up to 204 frames long, built around techniques such as a deep-compression video VAE and preference-based optimization of visual quality. The model's performance is evaluated on a novel benchmark, where it outperforms existing models, and it is released publicly to accelerate innovation in video foundation models and empower content creators.

ZeroBench: An Impossible Visual Benchmark for Contemporary LMMs

Large Multimodal Models (LMMs) have significant limitations in interpreting images and spatial cognition, yet still achieve high scores on popular visual benchmarks. To address this, a new benchmark called ZeroBench has been introduced, which is designed to be impossible for current LMMs and consists of 100 curated questions that all 20 evaluated LMMs failed to answer, scoring 0.0%.
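
For concreteness, "scoring 0.0%" refers to accuracy over the primary questions; a generic scorer along these lines (illustrative only, not ZeroBench's official grading code) shows the metric being reported:

```python
# Generic exact-match accuracy over the benchmark's 100 primary questions.
# `model_answer` is a hypothetical callable wrapping the LMM under test.
def accuracy(questions: list[dict], model_answer) -> float:
    correct = sum(
        1
        for q in questions
        if model_answer(q["image"], q["question"]).strip() == q["answer"].strip()
    )
    return 100.0 * correct / len(questions)  # every evaluated LMM landed at 0.0 here
```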

SWE-Lancer: Can LLMs Earn $1M from Freelance Software Engineering?

SWE-Lancer is a benchmark of over 1,400 freelance software engineering tasks collectively valued at $1 million USD in real-world payouts, encompassing both independent engineering tasks and managerial tasks. The benchmark finds that current models are still unable to solve the majority of tasks, and the dataset and evaluation tools are made publicly available to facilitate future research into the economic impact of AI model development.
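
Since every task carries a real payout, the headline metric is essentially dollars earned; a hedged sketch of that bookkeeping (the task schema and solver callable are assumptions, not the benchmark's actual harness) looks like:

```python
# Payout-weighted scoring: each solved task contributes its dollar value.
def dollars_earned(tasks: list[dict], model_solves_task) -> tuple[float, float]:
    """Return (dollars earned, fraction of the total pool earned)."""
    total_pool = sum(t["payout_usd"] for t in tasks)                 # ~$1M across 1,400+ tasks
    earned = sum(t["payout_usd"] for t in tasks if model_solves_task(t))
    return earned, earned / total_pool
```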

EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges

EnigmaEval is a dataset of 1,184 puzzle problems derived from puzzle competitions that tests language models' ability to perform complex reasoning and knowledge synthesis. State-of-the-art language models achieve low accuracy on these puzzles, highlighting their limitations on problems that require unstructured, lateral reasoning and the discovery of hidden connections between seemingly unrelated pieces of information.

Code

Show HN: Bag of words – Build and share smart data apps using AI

Bag of words is a platform that enables users to create comprehensive dashboards from a single prompt and refine them iteratively, integrating with a variety of data sources. Key features include data source integration, natural language queries, dashboard management, and compatibility with multiple LLMs, with a Docker-based quick start or local development using Python and Node.js.

Open-CUAK: "OpenAI Operator" Now Goes Open Source

Open-CUAK is a platform for managing automation agents at scale, starting with browsers, designed to ensure reliability and scalability, and is being developed as an open-source project by Aident AI. The platform aims to provide a range of features, including vision-based automation, remote browser management, and account access management, with the goal of making automation more abundant and equally distributed.

Step-Video-T2V: 30B open source text-to-video foundation model

The Step-Video-T2V model is a state-of-the-art text-to-video pre-trained model with 30 billion parameters, capable of generating videos up to 204 frames. The model utilizes a deep compression VAE for videos, achieving 16x16 spatial and 8x temporal compression ratios, and incorporates Direct Preference Optimization to enhance the visual quality of the generated videos.
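
To make the compression ratios concrete, here is a back-of-the-envelope calculation; the 544x992 frame resolution is an assumption used only for illustration, while the 16x16 spatial and 8x temporal factors come from the description above.

```python
# Back-of-the-envelope view of the stated compression ratios.
# The 544x992 frame resolution is an assumption for illustration only.
frames, height, width = 204, 544, 992

latent_h, latent_w = height // 16, width // 16   # 16x16 spatial compression -> 34 x 62
approx_latent_frames = frames / 8                # 8x temporal compression (boundary handling aside)

pixel_positions = frames * height * width
latent_positions = approx_latent_frames * latent_h * latent_w
print(f"spatio-temporal positions shrink ~{pixel_positions / latent_positions:.0f}x")  # 8*16*16 = 2048x
```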

Cake: Distributed LLM and StableDiffusion inference for mobile desktop or server

Cake is a Rust framework for distributed inference of large AI models, allowing users to repurpose consumer hardware into a heterogeneous cluster to run models that wouldn't normally fit in a single device's GPU memory. The project aims to make AI more accessible and democratic by giving new life to hardware otherwise destined for planned obsolescence, and it supports various operating systems and architectures, including Linux, Windows, macOS, Android, and iOS.

Barebone web research agent (Apache 2.0)

The Web Research Agent is a Python-based tool that uses the Claude 3.5 Sonnet AI model to automatically gather and analyze information from the web, with features such as dynamic field inference, web search, and webpage content extraction. The agent can be used via command line or imported into Python code, and provides structured JSON output with results and process documentation, with configurable parameters and robust error handling.
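
A minimal sketch of the kind of pipeline the summary describes, fetching pages and asking Claude 3.5 Sonnet for structured JSON; this is not the repository's actual API, and the helper names, prompt, and truncation limit are assumptions.

```python
import json

import anthropic   # pip install anthropic; reads ANTHROPIC_API_KEY from the environment
import requests

# Sketch only: fetch source pages, then ask Claude 3.5 Sonnet to synthesize a
# structured JSON answer. Helper names, prompt wording, and limits are assumptions.
client = anthropic.Anthropic()

def fetch_page(url: str) -> str:
    # Crude extraction: raw HTML, truncated to keep the prompt small.
    return requests.get(url, timeout=30).text[:20_000]

def research(question: str, urls: list[str]) -> dict:
    sources = "\n\n".join(fetch_page(u) for u in urls)
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"Using only these sources:\n{sources}\n\n"
                f"Answer the question '{question}'. "
                "Respond with bare JSON containing the keys 'answer' and 'citations'."
            ),
        }],
    )
    # Assumes the model returns bare JSON, as requested in the prompt.
    return json.loads(message.content[0].text)
```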