Sunday — November 10, 2024
Dennis Crowley launches Beebop for audio-based AR, Qwen2.5-Coder rivals GPT-4o in coding abilities, and GPT-4o outperforms humans in reading mental states but shows racial biases.
News
OpenCoder: Open Cookbook for Top-Tier Code Large Language Models
OpenCoder is an open-source code Large Language Model (LLM) family that matches the performance of top-tier code LLMs, supporting both English and Chinese languages. It provides not only the final models but also reproducible training data, a complete data processing pipeline, and detailed training protocols for open scientific research.
When machine learning tells the wrong story
The author presented a research paper at ISCA in 2022, which won awards including Intel's 2024 Hardware Security Academic Award, and discusses the challenges of writing a blog post about the paper due to its complexity and personal significance. The paper explores a machine-learning-assisted side-channel attack that can be executed in modern web browsers, and the author reflects on how working on the paper altered the trajectory of their life and career.
FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI
Researchers have developed FrontierMath, a benchmark of hundreds of original mathematics problems designed to test advanced reasoning capabilities in AI systems. Despite extensive support, leading AI models have shown poor performance, solving less than 2% of the problems, revealing a substantial gap between current AI capabilities and human mathematical expertise.
When you ask ChatGPT "Tell me a story" it's always about a girl named Elara
A teacher was told they can't give students a zero for using AI unless they have proof, so they found a way to comply with the rule while still addressing the issue. The post on the MaliciousCompliance subreddit has gained 22.5k points and 2251 comments.
With AI, the future of augmented reality is in your ears
Dennis Crowley, the founder of Dodgeball and Foursquare, has launched a new startup called Hopscotch Labs, which explores the intersection of AI, ubiquitous headphones, and location-based information. His new project, Beebop, uses AI to surface relevant local information to users through their headphones as they walk by specific locations, effectively creating an audio-based augmented reality experience.
Research
GPT-4o reads the mind in the eyes
Researchers tested the multimodal Large Language Model GPT-4o using the Reading the Mind in the Eyes Test and found that it outperformed humans in interpreting mental states from upright faces but underperformed when faces were inverted. However, GPT-4o showed biases in its accuracy, performing better on White faces than non-White faces, and its error patterns differed from those of humans, particularly when processing inverted faces.
Building, Reusing, Generalizing Abstract Representations from Concrete Sequences
Humans are skilled at learning abstract patterns in sequences and applying them to new situations, but many sequence learning models struggle with this. A new model, called the Hierarchical Variable Model (HVM), has been developed that can efficiently learn and abstract patterns in sequences, and has been shown to perform similarly to humans in certain tasks, outperforming large language models.
Confinement in the Transverse Field Ising Model on the Heavy Hex Lattice
Researchers studied the transverse field Ising model on a decorated hexagonal lattice and found that a quench from a broken symmetry state leads to nonthermal behavior, while quenches to larger fields or from non-symmetry broken states result in thermalization. A minimal model based on the confinement of elementary excitations explains the nonthermal behavior, and the results provide insight into the simulability of a recent large-scale quantum computation.
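For reference, the transverse field Ising model in the study takes the standard form (up to sign and coupling conventions, which vary by paper), with the sum over nearest-neighbor pairs on the heavy-hex lattice:

```latex
H = -J \sum_{\langle i,j \rangle} Z_i Z_j - h \sum_i X_i
```

A "quench" here means preparing a state (e.g. the broken-symmetry state at small $h$) and then evolving it as $|\psi(t)\rangle = e^{-iHt}|\psi(0)\rangle$ under a Hamiltonian with different parameters; whether observables thermalize depends on where the quench ends.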
Rethinking Code Refinement: Learning to Judge Code Efficiency
Large Language Models (LLMs) can refine code, but the refined versions aren't always more efficient than the originals. A proposed method uses a code language model to compare the efficiency of two code versions, either classifying which is superior or predicting the relative improvement, and has shown effectiveness across multiple programming languages.
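The paper's point is to *predict* which version is faster without executing anything; as a naive executable baseline for the same judging task, one can simply time both versions (function names below are illustrative, not from the paper):

```python
import timeit

def judge_efficiency(func_a, func_b, args=(), repeat=5, number=100):
    """Naive efficiency judge: time both implementations and return
    'a' or 'b' for whichever runs faster. (A learned judge, as in the
    paper, would predict this from the source code without running it.)"""
    t_a = min(timeit.repeat(lambda: func_a(*args), repeat=repeat, number=number))
    t_b = min(timeit.repeat(lambda: func_b(*args), repeat=repeat, number=number))
    return "a" if t_a < t_b else "b"

# Two semantically equivalent versions of the same function:
def sum_loop(n):       # O(n) loop
    total = 0
    for i in range(n):
        total += i
    return total

def sum_formula(n):    # O(1) closed form
    return n * (n - 1) // 2

print(judge_efficiency(sum_loop, sum_formula, args=(10_000,)))  # → b
```

The executable baseline is expensive and unsafe on untrusted code, which is exactly why a static, learned judge is attractive.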
Smaller Large Language Models Can Do Moral Self-Correction
Large Language Models (LLMs) with proper safety alignment fine-tuning can achieve moral self-correction, even with fewer parameters, challenging the assumption that only large models are capable of this. However, smaller LLMs still struggle to comprehend social norms and self-explain, and all scales of LLMs perform poorly when given unethical instructions.
Code
Qwen2.5-Coder-32B with coding abilities matching those of GPT-4o
Qwen2.5-Coder is a next-generation open-source coding model that builds on the Qwen2.5 series, available in three model sizes: 1.5B, 7B, and a 32B version (coming soon). The model has been trained on a larger scale of code data, including source code, text-code grounding data, and synthetic data, totaling 5.5 trillion tokens, resulting in substantial enhancements in code-related tasks.
LLM Prompt Tuning Playbook
This document is a playbook for tuning large language models (LLMs) through effective prompting strategies, written by researchers and engineers with experience working with LLMs. The playbook provides mental models, practical techniques, and a high-level procedure for tuning prompts, with the goal of consolidating and sharing helpful intuitions and prompting techniques for the community.
Show HN: LLM driven OS to execute network security exploration agents
BOSS is an intelligent task orchestration system that leverages Large Language Models (LLMs) to coordinate and execute agent-based workflows, breaking down complex tasks into manageable steps and assigning suitable agents. It features real-time monitoring, adaptation, robust error handling, and human-in-the-loop escalation, but is still under development and not recommended for production use.
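BOSS's actual interfaces aren't shown in the summary; the following is a minimal conceptual sketch of that orchestration pattern, with hard-coded steps and stub agents standing in for the LLM's task decomposition and the real agent pool (all names hypothetical):

```python
# Minimal sketch of agent-based task orchestration (hypothetical names;
# not the BOSS API). An orchestrator routes each step to a capable
# agent and escalates failures for human review.

def port_scan_agent(target):
    return f"open ports on {target}: [22, 80]"   # stub result

def report_agent(findings):
    return "report: " + "; ".join(findings)

AGENTS = {"scan": port_scan_agent, "report": report_agent}

def orchestrate(steps):
    """Run (capability, payload) steps; collect results, escalating any
    step with no matching agent or a raised error (human-in-the-loop)."""
    findings, escalated = [], []
    for capability, payload in steps:
        agent = AGENTS.get(capability)
        if agent is None:
            escalated.append((capability, "no capable agent"))
            continue
        try:
            findings.append(agent(payload))
        except Exception as exc:   # robust error handling
            escalated.append((capability, str(exc)))
    return AGENTS["report"](findings), escalated

result, escalated = orchestrate([("scan", "10.0.0.5"), ("exploit", "10.0.0.5")])
print(result)      # report over the completed steps
print(escalated)   # [('exploit', 'no capable agent')]
```

In a real system the step list would come from an LLM decomposing the task, and escalated items would be surfaced to a human operator rather than just returned.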
Fast-Graphrag
Fast GraphRAG is a streamlined and promptable framework for interpretable, high-precision, agent-driven retrieval workflows, offering a 6x cost saving compared to GraphRAG. It features interpretable and debuggable knowledge, fast and low-cost operations, dynamic data generation, and incremental updates, making it suitable for large-scale applications.
Show HN: Simple recipe extractor using AI
To get started with the TikTok recipe extractor, run the development server using a package manager like npm, yarn, pnpm, or bun. The app can be easily deployed using the Vercel Platform, with more details available in the Next.js deployment documentation.