Sunday February 16, 2025

The IRS is purchasing a $7 million AI supercomputer from Nvidia for fraud detection, Distributed-Llama connects home devices into a cluster for LLM inference, and research challenges the necessity of hierarchy in HNSW for high-dimensional data.

News

If you believe in "Artificial Intelligence", take five minutes to ask it

The author tested ChatGPT's knowledge of sauropod vertebrae by asking about the reassignment of the species Brachiosaurus brancai to its own genus, Giraffatitan, and found that ChatGPT provided completely incorrect answers with confidence. The author warns that large language models (LLMs) like ChatGPT can provide plausible-sounding but incorrect information, and advises people to be cautious when using them, especially on topics they are not familiar with.

The Impact of Generative AI on Critical Thinking [pdf]

A survey of 319 knowledge workers found that the use of Generative AI (GenAI) tools is associated with reduced cognitive effort and critical thinking, with higher confidence in GenAI leading to less critical thinking and higher self-confidence leading to more critical thinking. The study suggests that GenAI shifts the nature of critical thinking towards information verification, response integration, and task stewardship, and highlights the need for designing GenAI tools that support critical thinking in knowledge work.

The IRS Is Buying an AI Supercomputer from Nvidia

The IRS is planning to purchase a state-of-the-art Nvidia SuperPod AI computing cluster, which will combine 31 separate servers to train and operate artificial intelligence models, with potential uses including automated fraud detection and identity theft prevention. The purchase, which could start at $7 million, is part of a broader push to leverage AI in the federal government, with the IRS having a vast amount of proprietary data that can be used to train machine learning algorithms.

Fighting the AI Scraperbot Scourge

The website LWN.net is fighting against AI scraperbots that are determined to scrape the entire internet to feed AI training models, with these bots ignoring robots.txt files and disguising themselves as regular readers. The site has tried various methods to combat the issue, including throttling and tarpits, but these have been ineffective as the scraperbots have been designed to evade these defenses by spreading their activity across multiple IP addresses.
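
The throttling the article describes is typically keyed on the client's IP address. The sketch below is a generic sliding-window rate limiter, not LWN's actual configuration; it illustrates why spreading the same request volume across thousands of addresses slips under every per-IP threshold.

```python
# Minimal sketch (not LWN's actual setup) of a per-IP sliding-window rate
# limiter. One aggressive client trips the limit quickly, but a crawler that
# spreads requests across thousands of addresses stays under every per-IP
# budget, which is why this class of defense fails against scraperbots.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 30

_history: dict[str, deque] = defaultdict(deque)

def allow_request(client_ip: str, now: float = None) -> bool:
    """Return True if this IP is still under its per-window request budget."""
    now = time.monotonic() if now is None else now
    window = _history[client_ip]
    # Drop timestamps that have aged out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS_PER_WINDOW:
        return False          # throttle this client
    window.append(now)
    return True

# 3,000 requests from one IP: mostly throttled.
# The same 3,000 requests from 3,000 IPs: every one is allowed.
```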

AI Mistakes Are Different from Human Mistakes

AI systems make mistakes that are fundamentally different from human mistakes: their errors appear at seemingly random times, are delivered with full confidence, and do not cluster around predictably difficult topics. To mitigate these mistakes, researchers are exploring two directions: engineering AI models to make more human-like mistakes, and building new mistake-correcting systems tailored to the unique characteristics of AI errors.

Research

Can We Trust AI Benchmarks? A Review of Current Issues in AI Evaluation

Quantitative Artificial Intelligence (AI) benchmarks are crucial for evaluating AI model performance, but their growing influence has raised concerns about their effectiveness and potential biases in assessing sensitive topics like safety and systemic risks. A review of 100 studies highlights various shortcomings in benchmarking practices, including design flaws, sociotechnical issues, and systemic problems, underscoring the need to improve the accountability and relevance of AI benchmarks in real-world scenarios.

The hierarchy in HNSW is not necessary in high dimensions

Approximate near-neighbor search over vector embeddings has become a crucial computational workload, with graph-based indexes like the Hierarchical Navigable Small World (HNSW) algorithm being the dominant paradigm. However, research has found that a flat navigable small world graph can achieve identical performance to HNSW with less memory overhead, suggesting that the hierarchical structure of HNSW may not be necessary, and instead, a "highway" of hub nodes within the graph provides the key functionality.
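
The operation shared by HNSW and a flat navigable small world graph is greedy best-first search toward the query. Below is a minimal sketch of that search over a single flat layer, using a toy adjacency-list graph and Euclidean distance; it is illustrative only, not the paper's implementation.

```python
# Illustrative greedy beam search over one flat navigable-small-world layer
# (not the paper's code). HNSW repeats this search on every level of its
# hierarchy; the paper's finding is that in high dimensions a single layer
# whose hub nodes act as a "highway" already routes queries efficiently.
import heapq
import numpy as np

def search_layer(query, vectors, neighbors, entry_point, ef=10):
    """vectors: dict node_id -> np.ndarray; neighbors: dict node_id -> list of node_ids."""
    dist = lambda n: float(np.linalg.norm(vectors[n] - query))
    visited = {entry_point}
    candidates = [(dist(entry_point), entry_point)]   # min-heap of nodes to expand
    best = [(-dist(entry_point), entry_point)]        # max-heap holding the ef closest so far
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0]:
            break                                     # nothing left that can improve the results
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(nb)
            if len(best) < ef or d_nb < -best[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(best, (-d_nb, nb))
                if len(best) > ef:
                    heapq.heappop(best)               # evict the current worst result
    return sorted((-d, n) for d, n in best)           # (distance, node_id) pairs
```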

Scaling Test-Time Compute Can Be More Effective Than Scaling Parameters (2024)

Researchers studied how large language models (LLMs) can improve their performance by using more computation at test time, finding that the effectiveness of different approaches varies depending on the difficulty of the prompt. By using a "compute-optimal" strategy that allocates test-time computation adaptively per prompt, they were able to improve efficiency by more than 4x and even outperform a 14x larger model on certain tasks.
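
One simple form of test-time scaling is best-of-N sampling scored by a verifier, with the sample budget chosen per prompt. The sketch below uses hypothetical generate, score, and estimate_difficulty helpers as stand-ins for an LLM sampler, a reward model, and a difficulty estimator; the paper studies richer strategies (revisions, search against process reward models), but the budget-allocation idea is the same.

```python
# Hedged sketch of adaptive best-of-N sampling; `generate`, `score`, and
# `estimate_difficulty` are hypothetical stand-ins, not the paper's system.
# The "compute-optimal" idea: spend few samples on prompts judged easy and
# many on prompts judged hard, instead of a uniform budget per prompt.
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              estimate_difficulty: Callable[[str], float],
              min_samples: int = 1,
              max_samples: int = 16) -> str:
    difficulty = estimate_difficulty(prompt)            # assumed to return a value in [0, 1]
    n = max(min_samples, round(difficulty * max_samples))
    candidates = [generate(prompt) for _ in range(n)]    # sample n candidate answers
    return max(candidates, key=lambda answer: score(prompt, answer))
```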

Direct Ascent Synthesis: Hidden Generative Capabilities in Discriminative Models

Researchers have found that discriminative models, typically used for classification tasks, also possess powerful generative capabilities, allowing them to synthesize high-quality images through a method called Direct Ascent Synthesis (DAS). This approach enables diverse applications, such as text-to-image generation and style transfer, and reveals that standard discriminative models encode more generative knowledge than previously thought, challenging the traditional distinction between discriminative and generative architectures.
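
The underlying mechanism is gradient ascent on the input pixels to maximize a discriminative model's score for a target; DAS's contribution is performing that optimization across multiple resolutions so the result is a coherent image rather than adversarial noise. Below is a stripped-down single-resolution sketch in PyTorch, illustrative only and not the paper's multi-resolution method.

```python
# Minimal single-resolution sketch: optimize an image to maximize a
# classifier's logit for a target class. Direct Ascent Synthesis builds on
# this idea but optimizes across many resolutions simultaneously; this
# simplified version typically yields noisy, adversarial-looking outputs.
import torch
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT).eval()
target_class = 207                                   # an ImageNet class index

image = torch.rand(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    logits = model(image)
    loss = -logits[0, target_class]                  # ascend the target-class logit
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        image.clamp_(0.0, 1.0)                       # keep pixel values in a valid range
```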

What makes math problems hard for reinforcement learning: a case study

Researchers use the Andrews-Curtis conjecture as a case study of what makes problems hard for reinforcement learning when high-reward instances are extremely rare, and propose algorithmic enhancements and a hardness measure with broader implications. The study also resolves several open mathematical questions, including demonstrating length reducibility in certain presentations and addressing potential counterexamples in the Miller-Schupp series.
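
The search space consists of balanced group presentations acted on by AC-moves, with reward tied to shrinking the total relator length. The toy sketch below represents relators as strings over x, y with X, Y as their inverses and implements only concatenation plus free reduction; it is illustrative of the sparse-reward setting, not the paper's environment.

```python
# Toy illustration (not the paper's environment): one class of AC-moves
# multiplies a relator by another relator, followed by free reduction. Most
# moves make the presentation longer, so positive reward (a length drop) is
# a rare event -- the kind of sparsity the paper studies.
INVERSE = {"x": "X", "X": "x", "y": "Y", "Y": "y"}

def free_reduce(word: str) -> str:
    """Cancel adjacent inverse pairs until none remain."""
    stack = []
    for symbol in word:
        if stack and stack[-1] == INVERSE[symbol]:
            stack.pop()
        else:
            stack.append(symbol)
    return "".join(stack)

def concatenate_move(relators: list, i: int, j: int) -> list:
    """Replace relator i by the free reduction of relator_i * relator_j."""
    new = list(relators)
    new[i] = free_reduce(relators[i] + relators[j])
    return new

state = ["xxYXY", "xyXY"]                           # a toy pair of relators
total_length = sum(map(len, state))
next_state = concatenate_move(state, 0, 1)
reward = total_length - sum(map(len, next_state))   # positive only when the total length shrinks
```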

Code

Distributed-Llama: Connect home devices into a cluster for LLM inference

Distributed Llama is a project that allows users to connect home devices into a powerful cluster to accelerate Large Language Model (LLM) inference, supporting Linux, macOS, and Windows, and optimized for ARM and x86_64 AVX2 CPUs. The root node loads the model and synchronizes the state of the neural network, while each worker node processes its own slice of the network.
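
Conceptually, the root hands each worker a slice of a layer's weights, the workers compute partial outputs against the shared activations, and the root reassembles the result. The NumPy sketch below shows that column-wise split in the abstract; it is not Distributed Llama's actual API, wire protocol, or C++ implementation.

```python
# Conceptual sketch of the root/worker split (not Distributed Llama's actual
# protocol): each worker owns a column slice of a layer's weight matrix and
# computes its partial output, so no single device needs the full layer.
import numpy as np

hidden, output, n_workers = 4096, 11008, 4
rng = np.random.default_rng(0)
weight = rng.standard_normal((hidden, output)).astype(np.float32)

# Root: split the weight matrix column-wise, one slice per worker node.
slices = np.array_split(weight, n_workers, axis=1)

def worker_forward(weight_slice, activations):
    """Each worker computes only its slice of the layer output."""
    return activations @ weight_slice

activations = rng.standard_normal((1, hidden)).astype(np.float32)
partials = [worker_forward(s, activations) for s in slices]   # run in parallel across the cluster
layer_output = np.concatenate(partials, axis=1)               # root reassembles the full output

assert np.allclose(layer_output, activations @ weight, atol=1e-3)
```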

Get Started with Vibe Coding

No summary is available for this item; the project's README could not be retrieved.

Are LLMs able to play the card game Set?

The "When AI Fails" project documents instances where artificial intelligence systems produce unexpected or incorrect results, highlighting the limitations and challenges of AI. The project welcomes contributions and showcases various examples of AI failures, such as game AI failures and humorous misinterpretations, under the MIT License.

Want to Train Your Own GPT-Style Model? – Step-by-Step Notebook

This repository contains a Jupyter Notebook that trains a small GPT-style language model from scratch using PyTorch, covering topics such as tokenization, positional encoding, and self-attention. The notebook provides a step-by-step guide to building and training a minimal GPT-style decoder-only transformer model, allowing users to experiment with fine-tuning and inference.
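
The heart of such a decoder-only model is masked (causal) self-attention. The sketch below implements a single attention head in PyTorch; it is illustrative rather than the notebook's exact code, which adds multiple heads, residual connections, and MLP blocks.

```python
# Minimal single-head causal self-attention, the core building block of a
# GPT-style decoder. Illustrative only; not the notebook's exact code.
import math
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, embed_dim: int, max_len: int = 256):
        super().__init__()
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)
        # Lower-triangular mask: position t may only attend to positions <= t.
        mask = torch.tril(torch.ones(max_len, max_len)).view(1, max_len, max_len)
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (batch, seq, embed)
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(C)        # (batch, seq, seq)
        scores = scores.masked_fill(self.mask[:, :T, :T] == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        return self.proj(weights @ v)

tokens = torch.randn(2, 16, 64)                                # (batch=2, seq=16, embed=64)
print(CausalSelfAttention(64)(tokens).shape)                   # torch.Size([2, 16, 64])
```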

Show HN: I created a media player with AI-generated subtitles for Windows

LLPlayer is a media player designed for language learning, offering features such as dual subtitles, AI-generated subtitles, real-time translation, and word lookup. The player is free, open-source, and written in C#, with a customizable interface and support for over 99 languages, making it a valuable tool for language learners.