Thursday — October 17, 2024

Google backs Kairos Power's nuclear venture, researchers expose a flaw in LLM benchmark reliability while AutoML-Agent advances AI pipeline automation, and Qualcomm's Surface NPU drastically underperforms its CPU counterpart in AI tasks.

News

Show HN: I built the most over-engineered Deal With It emoji generator

This face-detection GIF generator lets users upload an image or paste an image URL to create a "Deal With It" GIF. It was made by Igor Klimer, and the source code is available on GitHub.

Google commits to buying power generated by nuclear-energy startup Kairos Power

FTC announces "click-to-cancel" rule making it easier to cancel subscriptions

The Federal Trade Commission (FTC) has announced a final "click-to-cancel" rule that makes it easier for consumers to end recurring subscriptions and memberships. This rule requires companies to provide clear and easy-to-use cancellation processes for consumers, allowing them to cancel subscriptions with a single click or action.

Busy Status Bar

The Busy Status Bar is a productivity multi-tool device with an LED pixel screen that displays a personal busy message, has a built-in Pomodoro timer, and supports various apps. It is fully customizable, open-source, and hacker-friendly, allowing users to control their device via API and MQTT through the Busy Cloud.

Meta's open AI hardware vision

Meta is showcasing its latest open AI hardware designs at the Open Compute Project (OCP) Global Summit 2024, including a new AI platform, open rack designs, and network fabrics and components. To support emerging AI workloads, the company is releasing its new high-powered rack, Catalina, to the OCP community. Catalina is designed to support the NVIDIA GB200 Grace Blackwell Superchip and uses a modular design to allow customization.

Research

Cheating Automatic LLM Benchmarks

Researchers found that a "null model" that always outputs a constant response can achieve top-ranked win rates on popular automatic language model benchmarks, such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench. This highlights a vulnerability in these benchmarks and calls for the development of anti-cheating mechanisms to ensure reliable evaluation of language models.
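To make the "null model" idea concrete, here is a minimal sketch of what such a model amounts to. The class name, the `generate` method, and the placeholder string are illustrative assumptions, not the researchers' code or their actual constant response.

```python
class NullModel:
    """Hypothetical stand-in for a 'null model': it ignores the prompt entirely."""

    def __init__(self, constant_response: str):
        self.constant_response = constant_response

    def generate(self, prompt: str) -> str:
        # The prompt is never inspected; the same text is returned every time.
        return self.constant_response


# Placeholder text only; the summary does not say what constant response was used.
null_model = NullModel("This is a fixed response.")

prompts = ["Explain quicksort.", "Write a haiku about autumn."]
responses = [null_model.generate(p) for p in prompts]
print(responses)
```

Because the output never depends on the input, any win rate such a model earns reflects how the automatic judge scores responses rather than any real capability.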

Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence

Model Swarms is a collaborative search algorithm that adapts large language models (LLMs) via swarm intelligence, allowing diverse LLM experts to work together to optimize a utility function and improve model performance. This approach offers tuning-free model adaptation, works in low-data regimes, and outperforms existing model composition methods by up to 21.0% across various tasks and contexts.
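As a rough illustration of swarm-style search over a utility function, the sketch below runs a generic particle-swarm update on small weight vectors. It assumes a particle-swarm-style rule, which the summary does not specify, and the `utility` function, coefficients, and vector sizes are hypothetical stand-ins for real model weights and evaluation scores.

```python
import random


def utility(weights):
    # Hypothetical utility: higher is better, with a peak at all weights == 0.5.
    return -sum((w - 0.5) ** 2 for w in weights)


def swarm_search(num_experts=8, dim=4, steps=50, seed=0):
    rng = random.Random(seed)
    # Each "expert" is a point in weight space with its own velocity.
    positions = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)]
    velocities = [[0.0] * dim for _ in range(num_experts)]
    personal_best = [p[:] for p in positions]
    global_best = max(positions, key=utility)[:]
    initial_best = utility(global_best)

    for _ in range(steps):
        for i in range(num_experts):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Blend inertia, pull toward own best, and pull toward swarm best.
                velocities[i][d] = (
                    0.5 * velocities[i][d]
                    + r1 * (personal_best[i][d] - positions[i][d])
                    + r2 * (global_best[d] - positions[i][d])
                )
                positions[i][d] += velocities[i][d]
            if utility(positions[i]) > utility(personal_best[i]):
                personal_best[i] = positions[i][:]
            if utility(positions[i]) > utility(global_best):
                global_best = positions[i][:]

    return initial_best, utility(global_best), global_best


initial, final, best = swarm_search()
print(f"best utility before: {initial:.4f}, after: {final:.4f}")
```

No gradient step appears anywhere in the loop, which is the sense in which this kind of search is tuning-free: the experts only move through weight space guided by utility scores.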

AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML

AutoML-Agent is a novel multi-agent framework that automates the full AI development pipeline, from data retrieval to model deployment, using large language models (LLMs) and a natural language interface. It enhances the efficiency of the AutoML process by introducing a retrieval-augmented planning strategy, parallel sub-task solving, and multi-stage verification, achieving a higher success rate in automating the full AutoML process.

The Dynamics of Social Conventions in LLM Populations

Researchers investigated how Large Language Model (LLM) agents form social conventions through simulated interactions, finding that globally accepted conventions can arise spontaneously and strong collective biases can emerge. They also discovered that minority groups of committed LLMs can drive social change by establishing new conventions, potentially overturning established behaviors once they reach a critical size.

The Energy Implications of AI Adoption

Researchers estimated that the adoption of Artificial Intelligence across industries in the US could lead to a 0.03% annual increase in energy consumption and a 0.02% annual increase in carbon dioxide emissions. The estimated increase in energy consumption and emissions ranges from 28 PJ and 896 ktCO2 per year, respectively, to 0 and 272 ktCO2 per year, respectively.

Code

AI PCs Aren't Good at AI: The CPU Beats the NPU

A benchmark of Qualcomm's NPU on a Microsoft Surface tablet found that it achieved only 1.3% of its claimed 45 Teraops/s performance. The benchmark measured the latency of running a simple model on the CPU and the NPU, with the NPU reaching 225 Gigaops/s and 573 Gigaops/s under two different approaches.
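The percentage follows directly from the figures quoted above. The short calculation below is only a check of that arithmetic; the labels "approach A" and "approach B" are placeholders, since the summary does not name the two approaches.

```python
CLAIMED_OPS_PER_SEC = 45e12  # 45 Teraops/s, the claimed peak

# Measured throughput from the summary; the labels are placeholders.
measured_ops_per_sec = {
    "approach A": 225e9,  # 225 Gigaops/s
    "approach B": 573e9,  # 573 Gigaops/s
}


def fraction_of_claimed(measured, claimed=CLAIMED_OPS_PER_SEC):
    """Return measured throughput as a fraction of the claimed peak."""
    return measured / claimed


for label, ops in measured_ops_per_sec.items():
    print(f"{label}: {fraction_of_claimed(ops):.2%} of claimed peak")
# approach A: 0.50% of claimed peak
# approach B: 1.27% of claimed peak
```

The 1.3% figure corresponds to the faster of the two measurements (573 Gigaops/s); the slower one works out to about 0.5%.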

Ichigo: Local real-time voice AI

Ichigo is an open research experiment to extend a text-based LLM to have native "listening" ability, using an early fusion technique inspired by Meta's Chameleon paper. The project has made significant progress, achieving an enhanced MMLU score of 63.79 and demonstrating stronger speech instruction-following capabilities, even in multi-turn interactions.

Show HN: Arch – an intelligent prompt gateway built on Envoy

Arch is an intelligent Layer 7 gateway designed to protect, observe, and personalize Large Language Model (LLM) applications by handling undifferentiated tasks such as prompt detection, API calling, and observability. It is built on Envoy Proxy and features function calling, prompt guardrails, traffic management, and standards-based observability.

Nvidia Outperforms GPT-4o with Open Source Model

Arena-Hard-Auto is an automatic evaluation tool for instruction-tuned LLMs that contains 500 challenging user queries sourced from Chatbot Arena. It has the highest correlation and separability to Chatbot Arena among popular open-ended LLM benchmarks.

Out-of-Distribution Machine Learning

This repository provides a comprehensive resource for out-of-distribution (OOD) detection, robustness, and generalization in machine learning/deep learning. It aims to address critical challenges in modern AI systems that encounter data differing from their training distribution, leading to unexpected failures.