Tuesday — January 7, 2025
Meta AI selfie leads to Instagram ad controversy, Nvidia's Project Digits debuts as a personal AI supercomputer, and TheAgentCompany evaluates LLMs on real-world tasks with mixed success.
News
Used Meta AI, now Instagram is using my face on ads targeted at me
A user edited a selfie using Meta AI and now claims that Instagram is using their face in ads targeted back at them. The post has gained 5,692 points with a 97% upvote ratio on the r/ABoringDystopia subreddit.
Nvidia's Project Digits is a 'personal AI supercomputer'
Nvidia has unveiled Project Digits, a "personal AI supercomputer" that packs the company's Grace Blackwell hardware platform into a compact form factor aimed at AI researchers, data scientists, and students. The device, available in May starting at $3,000, can run models of up to 200 billion parameters, and two units can be linked to run models of up to 405 billion parameters.
AMD 'Strix Halo' Ryzen AI Max+ Debuts with RDNA 3.5 Graphics and Zen 5 CPU Cores
AMD has unveiled its 'Strix Halo' Ryzen AI Max+ series laptop processors, featuring a new unified memory design that feeds both the RDNA 3.5 integrated graphics and the Zen 5 CPU cores from one large shared pool. The flagship Ryzen AI Max+ 395 packs 16 CPU cores, 40 RDNA 3.5 graphics cores, and up to 128GB of shared memory, and AMD claims up to 1.4X faster gaming performance than Intel's flagship 'Lunar Lake' Core Ultra 9 288V.
LLMs and Code Optimization
David G. Andersen explores the limitations of large language models (LLMs) in optimizing code, revisiting an example from Max Woolf's article on writing better code with LLMs. Andersen finds that while LLMs can generate faster code, they often miss optimizations that would be obvious to an experienced programmer, so human intervention is still necessary to reach optimal performance.
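As a toy illustration of the kind of micro-optimization in question (this example is illustrative, not taken from Andersen's post): an LLM asked to sum the digits of a number often reaches for string conversion, while an arithmetic loop avoids the allocation entirely.

```python
def digit_sum_str(n: int) -> int:
    """Straightforward version: convert to string, sum the digits."""
    return sum(int(c) for c in str(n))

def digit_sum_arith(n: int) -> int:
    """Optimized version: peel digits off with divmod, no string allocation."""
    total = 0
    while n:
        n, d = divmod(n, 10)
        total += d
    return total

# Both agree; the arithmetic version is typically faster in a hot loop.
assert digit_sum_str(987654) == digit_sum_arith(987654) == 39
```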
Research
Long Context vs. RAG for LLMs: An Evaluation and Revisits
Extending context windows (Long Context, LC) and using retrievers (Retrieval-Augmented Generation, RAG) are the two main strategies for incorporating long external contexts into large language models (LLMs). The evaluation finds that LC generally outperforms RAG on question-answering benchmarks, while RAG retains advantages on dialogue-based and general-question queries, highlighting the trade-offs between the two strategies.
Superhuman performance of an LLM on the reasoning tasks of a physician
OpenAI's o1-preview model demonstrated significant improvements in differential diagnosis generation and quality of diagnostic and management reasoning compared to previous models and human controls. However, it showed no improvement in probabilistic reasoning or triage differential diagnosis, highlighting the need for more robust benchmarks and real-world clinical trials to evaluate the capabilities of large language models.
Benchmarking LLM Agents on Consequential Real World Tasks
Researchers developed TheAgentCompany, a benchmark to evaluate AI agents' performance on real-world professional tasks, and found that the most competitive agent could complete 24% of tasks autonomously in a simulated workplace environment. The results suggest that while AI agents can automate simpler tasks, more complex tasks remain beyond their capabilities.
Time-Series Anomaly Detection: A Decade Review
Time series analytics, particularly anomaly detection, has become increasingly important due to the growing volume and velocity of streaming data, with applications in fields such as cybersecurity and healthcare. This survey categorizes and summarizes existing anomaly detection solutions using a process-centric taxonomy, providing a structured characterization of research methods and identifying trends in time-series anomaly detection research.
Code
Show HN: Skeet – A local-friendly command-line copilot that works with any LLM
Skeet is a command-line AI copilot that transforms natural language instructions into precise shell commands or Python scripts, adapting and retrying automatically until the job is done. It supports multiple LLM providers, including OpenAI, Anthropic, and local models, and can be configured using a YAML file to customize its behavior.
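The adapt-and-retry loop such tools implement can be sketched in a few lines. This is a generic sketch, not Skeet's actual code; `ask_llm` is a hypothetical stand-in for whichever provider call (OpenAI, Anthropic, or a local model) the tool is configured to use.

```python
import subprocess

def ask_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; here it just
    # returns a fixed shell command so the sketch is runnable.
    return "echo hello"

def run_with_retries(task: str, max_attempts: int = 3) -> str:
    """Generate a shell command for `task`, run it, and feed any
    failure back into the prompt until it succeeds or attempts run out."""
    prompt = f"Write a shell command to: {task}"
    for _ in range(max_attempts):
        cmd = ask_llm(prompt)
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        # Adapt: include the error output in the next prompt.
        prompt = f"The command `{cmd}` failed with: {result.stderr}. Fix it."
    raise RuntimeError("giving up after max_attempts")

print(run_with_retries("print a greeting"))  # → hello
```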
LOTUS makes LLM-powered data processing fast and easy (as easy as Pandas)
LOTUS is a query engine that enables fast and easy data processing using large language models (LLMs), providing a declarative programming model and an optimized query engine for serving powerful reasoning-based query pipelines over structured and unstructured data. It offers a Pandas-like API with semantic operators, such as sem_join, sem_filter, and sem_extract, that can be used to write AI-based pipelines with high-level logic.
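The semantic-operator idea can be sketched without LOTUS itself: each operator wraps an LLM call behind a Pandas-style interface. In this toy version, `llm_judge` is a hypothetical keyword-matching stand-in for the model call that a real `sem_filter` would make.

```python
def llm_judge(instruction: str, row: dict) -> bool:
    # Hypothetical stand-in for the LLM call behind a semantic operator;
    # fakes "is this course about AI?" with a keyword check.
    text = row["course"].lower()
    return "machine learning" in text or "ai" in text

def sem_filter(rows: list[dict], instruction: str) -> list[dict]:
    """Toy semantic filter: keep rows the (stub) model judges to match."""
    return [r for r in rows if llm_judge(instruction, r)]

courses = [
    {"course": "Machine Learning"},
    {"course": "Art History"},
    {"course": "AI Ethics"},
]
matches = sem_filter(courses, "the course is about AI")
print([r["course"] for r in matches])  # → ['Machine Learning', 'AI Ethics']
```

A real semantic operator batches these judgments into LLM calls and optimizes the query plan; the declarative shape of the pipeline is the same.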
Show HN: A 100-Line LLM Framework
Mini LLM Flow is a minimalist LLM framework in 100 lines of code, designed to be used by LLMs themselves to build applications, focusing on high-level programming paradigms and stripping away low-level implementation details. The framework uses a nested directed graph to break down tasks into multiple LLM steps, allowing for branching and recursion for agent-like decision-making.
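A nested directed graph of that shape can be sketched compactly: each node runs a step and returns an action string that selects the next node, which allows both branching and loops back to earlier steps. This is a generic sketch of the paradigm, not Mini LLM Flow's actual API, and the "agent" steps are stubs rather than real LLM calls.

```python
class Node:
    """One step in the flow; fn(state) returns an action string
    that selects which successor executes next."""
    def __init__(self, fn):
        self.fn = fn
        self.successors = {}  # action string -> next Node

    def then(self, action, node):
        self.successors[action] = node
        return node

def run_flow(node, state):
    while node is not None:
        action = node.fn(state)
        node = node.successors.get(action)  # no match -> flow ends
    return state

# Toy agent: "decide" branches to "answer" or loops back through "search".
def decide(state):
    state["hops"] = state.get("hops", 0) + 1
    return "answer" if state["hops"] >= 2 else "search"

def search(state):
    state.setdefault("notes", []).append("looked something up")
    return "decided"

def answer(state):
    state["answer"] = "done"
    return "end"

decide_n, search_n, answer_n = Node(decide), Node(search), Node(answer)
decide_n.then("search", search_n)
decide_n.then("answer", answer_n)
search_n.then("decided", decide_n)  # loop back for agent-like decision-making

final = run_flow(decide_n, {})
print(final["answer"], final["hops"])  # → done 2
```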
Show HN: LLM Creative Story-Writing Benchmark
The LLM Creative Story-Writing Benchmark evaluates the ability of large language models to incorporate a set of 10 mandatory story elements into a short narrative, measuring both constraint satisfaction and literary quality. The benchmark found that Claude 3.5 Sonnet emerged as the clear overall winner, with Gemini models performing well and Llama models lagging behind, despite some models being larger and more expensive.
Show HN: I built an offline open-source RAG system DataBridge
DataBridge is a powerful document processing and retrieval system designed for building intelligent document-based applications, providing a robust foundation for semantic search, document processing, and AI-powered document interactions. It features a modular design, support for various document formats, and advanced security and access control options, with detailed documentation and a range of deployment options available.