Tagged articles

387 articles

May 24, 2026 · Artificial Intelligence

Can CODA Enable LLMs and Beginners to Write Lightning‑Fast Transformer Kernels?

CODA rewrites Transformer blocks as GEMM‑epilogue programs, exposing five primitive building blocks that let both LLM code generators and human programmers fuse memory‑intensive operations into the GEMM epilogue. This eliminates costly tensor moves and achieves up to 1.8× speed‑ups on H100 GPUs for RMSNorm, SwiGLU, RoPE and other components while preserving numerical accuracy.

CODA · CUDA · GEMM

0 likes · 11 min read

Mike Chen's Internet Architecture

May 21, 2026 · Artificial Intelligence

Demystifying AI Large Models: Architecture, Principles, and Workflow

The article explains that large language models are massive probability engines built on the Transformer architecture with self‑attention, trained through costly pre‑training on trillions of tokens, then refined by instruction fine‑tuning and RLHF, ultimately predicting the next token to generate text.

Large Language Model · RLHF · Self-Attention

0 likes · 5 min read

Machine Learning Algorithms & Natural Language Processing

May 20, 2026 · Artificial Intelligence

Can 99% Sparse Transformers Run Faster? Insights from the ‘Attention Is All You Need’ Authors

The paper shows that applying lightweight L1 regularization can make over 99% of FFN activations zero, and by using a new tile‑wise ELLPACK (TwELL) format together with a hybrid routing scheme, inference speed improves up to 30% while memory usage drops over 24% and energy consumption is reduced, all with negligible impact on downstream task performance.

CUDA · GPU optimization · Hybrid Routing

0 likes · 8 min read

Machine Learning Algorithms & Natural Language Processing

May 20, 2026 · Artificial Intelligence

How New LLM Architectures Like Gemma 4 and DeepSeek V4 Cut Long‑Context Costs

The article surveys recent open‑weight LLM releases—Gemma 4, Laguna XS.2, ZAYA1‑8B and DeepSeek V4—detailing how KV‑cache sharing, per‑layer embeddings, layer‑wise attention budgeting, compressed convolutional attention and manifold‑constrained hyper‑connections dramatically reduce memory and compute for ultra‑long contexts while preserving model quality.

Attention optimization · KV Cache · LLM

0 likes · 25 min read

Lao Guo's Learning Space

May 12, 2026 · Artificial Intelligence

Demystifying the Core Technologies Behind ChatGPT, GPT‑4, and DeepSeek

This article breaks down the key algorithms that power large‑language models—Transformer, Mixture‑of‑Experts, Flash Attention, KV‑Cache, Multi‑Token Prediction, quantization, Chain‑of‑Thought and Retrieval‑Augmented Generation—explaining how each contributes to the performance of ChatGPT, GPT‑4 and DeepSeek.

Flash Attention · KV Cache · Mixture of Experts

0 likes · 10 min read

AI Architecture Path

May 11, 2026 · Artificial Intelligence

OpenMythos: 22‑Year‑Old Recreates Claude Mythos with Recurrent Depth Transformers

A 22‑year‑old developer reverse‑engineered Anthropic’s confidential Claude Mythos, releasing the OpenMythos project that employs a Recurrent Depth Transformer looping a single weight set up to 16 times, matching a 1.3 B‑parameter transformer’s performance with only 770 M parameters while enabling deeper inference and solving gradient instability.

AI · Claude Mythos · Open Source

0 likes · 9 min read

Machine Learning Algorithms & Natural Language Processing

May 9, 2026 · Artificial Intelligence

Can 99% Sparse Transformers Run Faster? Insights from the Original Authors

A new ICML 2026 paper by Sakana AI and NVIDIA shows that applying lightweight L1 regularization can make Feed‑Forward Network activations in Transformers over 99% sparse, and with the TwELL storage format and a hybrid routing scheme this sparsity translates into up to 20.5% inference speedup, 21.9% training‑step acceleration, lower energy consumption and reduced peak memory across 0.5‑2 B‑parameter models while preserving downstream performance.

CUDA · GPU optimization · Hybrid Routing

0 likes · 9 min read

Xiaomi Tech

May 7, 2026 · Artificial Intelligence

OmniVoice: Open‑Source TTS Model Clones Voices in 600+ Languages with a Single Architecture

OmniVoice, an open‑source TTS system from Xiaomi AI Lab, uses a minimalist bidirectional Transformer and LLM‑enhanced pre‑training to synthesize high‑quality speech in over 600 languages, outperforming commercial systems while offering fine‑grained control and fully public code and models.

OmniVoice · Open Source · TTS

0 likes · 8 min read

Data Party THU

Apr 30, 2026 · Artificial Intelligence

Turning Transformers into Mamba: How Apple Linearized Inference Costs

Apple introduced a two‑step cross‑architecture distillation method that converts costly quadratic‑time Transformers into cheaper linear‑time Mamba models, preserving most of the original performance while dramatically reducing inference cost.

AI research · Linear Attention · Mamba

0 likes · 8 min read

SuanNi

Apr 30, 2026 · Artificial Intelligence

Why Transformers Are Naturally Succinct: Insights from the ICLR Best Paper

The ICLR 2026 best paper reveals that Transformers achieve extreme succinctness—encoding complex concepts with exponentially fewer symbols than RNNs—while proving that analyzing or verifying such models incurs EXPSPACE‑complete computational costs.

Computational Complexity · EXPSPACE · Succinctness

0 likes · 8 min read

Machine Heart

Apr 29, 2026 · Artificial Intelligence

LCA Boosts Long-Context Inference: 2.5× Speedup and 90% KV Cache Reduction

The Latent‑Condensed Attention (LCA) method dramatically cuts KV‑cache memory by 90%, speeds up pre‑fill by 2.5× and reduces decode latency by 1.8× for 128K‑token contexts, while requiring no extra parameters and preserving model performance across diverse LLMs.

Efficient Attention · Inference Acceleration · KV cache reduction

0 likes · 10 min read

Bighead's Algorithm Notes

Apr 22, 2026 · Artificial Intelligence

How DeepAries’s Adaptive Rebalancing Timing Boosts Portfolio Returns

DeepAries is a novel deep reinforcement‑learning framework that jointly learns when to rebalance a portfolio and how to allocate assets by combining a Transformer‑based state encoder with PPO, and extensive experiments on four major markets show it significantly outperforms fixed‑frequency baselines in risk‑adjusted return, transaction cost, and drawdown.

DeepAries · PPO · Portfolio Management

0 likes · 15 min read

Machine Heart

Apr 22, 2026 · Artificial Intelligence

Apple Turns Transformers into Mamba with Linear‑Cost Distillation

Apple proposes a two‑step cross‑architecture distillation that converts expensive, high‑performing Transformers into cheaper, nearly equally strong Mamba models by first replacing softmax attention with learned linear attention (Hedgehog) and then embedding this intermediate form into Mamba, achieving comparable perplexity and downstream task performance with far lower inference cost.

Artificial Intelligence · Linear Attention · Mamba

0 likes · 7 min read

Machine Heart

Apr 17, 2026 · Artificial Intelligence

Combining Transformers and RNNs: Google’s Memory Caching Unlocks Ultra‑Long Context

Google Research introduces Memory Caching (MC), a technique that gives RNNs growing memory capacity, bridging the gap with Transformers to enable ultra‑long context processing while reducing memory demands, and demonstrates its effectiveness through extensive language‑modeling and recall experiments.

AI Architecture · Google Research · Memory Caching

0 likes · 7 min read

Weekly Large Model Application

Apr 16, 2026 · Artificial Intelligence

Deep Dive into Conformer: The Convolution‑Augmented Transformer for Speech Recognition

The Conformer architecture blends global self‑attention with a depthwise separable convolution module in a Macaron‑style block, addressing the strong local time‑frequency structure and long sequence length of speech signals while keeping computational cost manageable for modern ASR systems.

ASR · Conformer · Convolution

0 likes · 11 min read

ZhiKe AI

Apr 15, 2026 · Artificial Intelligence

From Sci‑Fi to Reality: How AI Large Models Are Reshaping Our World

The article explains what AI is, traces its three historical waves—from rule‑based expert systems to statistical learning and deep learning—focuses on the current large‑language‑model era, surveys leading domestic and overseas models, and highlights key trends such as open‑source competition, reasoning capabilities, multimodality, and edge deployment.

AI · Multimodal · Open Source

0 likes · 4 min read

Machine Heart

Apr 14, 2026 · Artificial Intelligence

Training a Transformer on a 1970s PDP‑11 Takes Only 5.5 Minutes

A developer recreated a 1970s PDP‑11 environment, wrote a single‑layer, single‑head Transformer in assembly, and trained it on a sequence‑reversal task, achieving 100% accuracy after about 350 steps and a total training time of roughly 5.5 minutes.

Assembly · Low-resource AI · PDP-11

0 likes · 16 min read

Lao Guo's Learning Space

Apr 12, 2026 · Artificial Intelligence

Who Wins the AI Video Throne? HappyHorse-1.0 vs ByteDance Seedance 2.0

The article dissects the April 2026 showdown between the anonymous 15‑billion‑parameter HappyHorse‑1.0 and ByteDance’s two‑year‑old Seedance 2.0, detailing Elo score gaps, contrasting single‑stream versus dual‑branch Transformer designs, speed advantages, quality trade‑offs, and offering a decision tree for different production needs.

AI Video · Elo ranking · Multimodal

0 likes · 11 min read

AI Explorer

Apr 11, 2026 · Artificial Intelligence

How Kronos Redefines Quantitative Analysis with a Financial‑Market Language Model

Kronos, an open‑source large model trained on OHLCV data from over 45 exchanges, treats financial time‑series as a specialized language, using a custom tokenizer and a two‑stage Transformer to enable price prediction, market state detection, signal generation, and risk simulation, with easy Hugging Face integration and a live demo for BTC/USDT.

Kronos · Large Language Model · Open Source

0 likes · 6 min read

AI Tech Publishing

Apr 9, 2026 · Artificial Intelligence

Engineering‑Focused Guide to Training and Inference of Large Language Models

This article walks engineers through the full LLM stack—from tokenization and positional encoding to transformer blocks, efficient fine‑tuning, quantization, and production‑grade inference techniques such as KV‑cache, FlashAttention, PagedAttention, continuous batching, and speculative decoding—highlighting trade‑offs, toolchains, and practical workflow steps.

LLM · LoRA · Transformer

0 likes · 13 min read

Bighead's Algorithm Notes

Apr 6, 2026 · Artificial Intelligence

STORM: A Bidirectional Spatiotemporal Factor Model Achieving Sharpe Ratio >1

STORM introduces a bidirectional VQ‑VAE‑based spatiotemporal factor model that extracts fine‑grained time‑series and cross‑sectional features, uses discrete codebooks for orthogonal, diverse factor embeddings, and outperforms nine baselines on portfolio management and algorithmic trading tasks, delivering Sharpe ratios exceeding 1.

Algorithmic Trading · Portfolio Management · Quantitative Finance

0 likes · 17 min read

AI Programming Lab

Apr 5, 2026 · Artificial Intelligence

Do You Really Understand Tokens? A Deep Dive Starting from a Claude Code Session

The article explains what tokens are, how different models tokenize text, the role of token embeddings, positional encoding, self‑attention, KV cache, and why output tokens cost far more than input tokens, while also covering pricing differences and prompt‑caching savings across major LLM providers.

KV Cache · LLM pricing · Large Language Model

0 likes · 13 min read

Data Party THU

Apr 3, 2026 · Artificial Intelligence

Can Attention Replace Residuals? Inside the New Attention Residuals Breakthrough

The article reviews the Kimi team's Attention Residuals approach, which replaces traditional ResNet additive shortcuts with learned attention‑based weighting. It explains the theoretical motivation linking depth to time, details full‑attention and block‑wise implementations, and presents experimental results showing up to 1.25× compute efficiency and improved performance on reasoning and knowledge tasks.

Residual Networks · Transformer · attention mechanism

0 likes · 11 min read

ShiZhen AI

Apr 2, 2026 · Artificial Intelligence

How KV Cache Works and Why Large Model Outputs Cost Five Times More Than Inputs

The article explains the KV Cache mechanism that stores previously computed key/value vectors to avoid redundant Transformer calculations, delivering roughly a 5× speedup, while also detailing why generating output tokens is far more expensive than processing input tokens due to serial generation and memory trade‑offs.

KV Cache · LLM inference · Memory Optimization

0 likes · 9 min read
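
As an illustration of the mechanism this summary describes (not code taken from the linked article), here is a minimal pure-Python sketch of single-head attention with a key/value cache. The toy projections `W_Q`, `W_K`, `W_V`, the `project` helper, and the `KVCache` class are hypothetical names chosen for the example, and element-wise scaling stands in for real learned weight matrices.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    # Scaled dot-product attention for a single query vector.
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

def project(x, scale):
    # Toy stand-in for a learned projection (assumption): element-wise scaling.
    return [xi * s for xi, s in zip(x, scale)]

# Hypothetical projection parameters for the example.
W_Q, W_K, W_V = [1.0, 0.5], [0.5, 1.0], [2.0, -1.0]

class KVCache:
    """Stores keys and values of earlier tokens so they are computed once."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        # Only the new token's key and value are computed here;
        # everything already in the cache is reused as-is.
        self.keys.append(project(x, W_K))
        self.values.append(project(x, W_V))
        return attend(project(x, W_Q), self.keys, self.values)

def recompute(tokens):
    # Baseline without a cache: re-project every token at every step.
    keys = [project(t, W_K) for t in tokens]
    values = [project(t, W_V) for t in tokens]
    return attend(project(tokens[-1], W_Q), keys, values)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
cached_outputs = [cache.step(t) for t in tokens]
full_outputs = [recompute(tokens[: i + 1]) for i in range(len(tokens))]
```

Both paths produce the same outputs; the cached path does one key/value projection per generated token instead of re-projecting the whole prefix, at the price of memory that grows with sequence length.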

ArcThink

Apr 2, 2026 · Artificial Intelligence

Why LLMs Forget You: Uncovering the Limits and Solutions for Long‑Term Memory

The article explains why large language models lack persistent memory due to the stateless Transformer architecture, breaks down the four dimensions of memory loss, surveys seven technical approaches, three product implementations, and emerging research, and discusses security and privacy implications.

AI · LLM · Long-term Memory

0 likes · 22 min read

AI Explorer

Apr 1, 2026 · Artificial Intelligence

Google Open‑Sources TimesFM: A Foundation Model for Plug‑and‑Play Time‑Series Forecasting

Google’s open‑source TimesFM is a decoder‑only Transformer foundation model that delivers plug‑and‑play time‑series forecasting with zero‑shot accuracy, larger context windows, quantile predictions, and a simple Hugging Face API, making it suitable for retail, energy, finance, monitoring, and IoT use cases.

Foundation Model · Hugging Face · PyTorch

0 likes · 7 min read

Data Party THU

Mar 31, 2026 · Artificial Intelligence

Can Lookup-Based Memory Revolutionize Transformers? Inside the STEM Architecture

The STEM architecture replaces the Transformer feed‑forward network with a static token‑indexed embedding table, enabling lookup‑based memory that decouples capacity from compute, improves training stability, expands addressable memory, and delivers consistent performance gains on long‑context and knowledge‑intensive tasks.

Lookup Memory · STEM Architecture · Transformer

0 likes · 8 min read

AI Large-Model Wave and Transformation Guide

Mar 28, 2026 · Artificial Intelligence

From RNNs to Multimodal Agents: A Decade of Transformer Evolution

This article traces the evolution of sequence models from early RNN/LSTM designs through the breakthrough Transformer, its major branches, dense scaling, efficiency‑focused variants, next‑generation linear‑complexity SSMs, and finally multimodal agent architectures, highlighting each stage's strengths, weaknesses, and typical use cases.

AI Architecture · Efficient Attention · LLM

0 likes · 12 min read

Data Party THU

Mar 26, 2026 · Artificial Intelligence

How Mixture-of-Depths Attention Boosts Large Language Model Efficiency

This article examines the Mixture‑of‑Depths Attention (MoDA) mechanism, detailing its novel flash‑compatible KV layout, combined sequence‑depth attention, theoretical analysis, and extensive experiments that show significantly lower validation loss and higher downstream‑task accuracy than the OLMo2 baseline.

Deep KV · FlashAttention · Mixture-of-Depths Attention

0 likes · 9 min read

Full-Stack Cultivation Path

Mar 23, 2026 · Artificial Intelligence

What Exactly Is a Token in LLMs? A First‑Principles Explanation

The article explains that a token is the smallest discrete text unit a large language model processes, detailing why tokenization is essential, how tokenizers work, how tokens flow through the transformer, and how token counts affect context windows, cost, latency, and overall model behavior.

Cost Management · Embedding · LLM

0 likes · 20 min read

SuanNi

Mar 17, 2026 · Artificial Intelligence

How Attention Residuals Boost Transformer Efficiency and Scale

The article presents the Attention Residuals architecture, explains how it replaces uniform residual addition with learned attention‑based aggregation, details the full and block variants along with engineering tricks for distributed training, and shows extensive scaling‑law experiments where the new design consistently improves validation loss and training efficiency across model sizes.

Attention Residuals · Model Scaling · Transformer

0 likes · 13 min read

ShiZhen AI

Mar 17, 2026 · Artificial Intelligence

Kimi’s Attention Residuals Swap a Decade-Old Residual Trick for 1.25× Faster 48B MoE

The Kimi team introduces Attention Residuals, a softmax‑based replacement for the uniform residual connections used in Transformers for a decade, enabling selective aggregation of layer histories, reducing hidden‑state growth, and achieving a 1.25× compute‑efficiency gain on a 48‑billion‑parameter MoE model with less than 2% inference latency increase.

Attention Residuals · Compute Efficiency · MoE

0 likes · 10 min read

Shi's AI Notebook

Mar 16, 2026 · Artificial Intelligence

What Attention Actually Does in MiniMind: Tracing Q/K/V, Shape Changes, and Context Fusion

This article walks through MiniMind's Attention.forward implementation, explaining why Q, K, and V are created, how tensors are reshaped for multi‑head attention, the role of masks, KV cache, GQA, and how each token aggregates information from the entire context.

KV Cache · Transformer · attention

0 likes · 21 min read

Machine Learning Algorithms & Natural Language Processing

Mar 15, 2026 · Artificial Intelligence

HY‑WU: Real‑Time Adaptive AI Model That Generates Parameters On‑The‑Fly

HY‑WU demonstrates that generating model parameters dynamically during inference enables a single foundation model to perform diverse image‑editing tasks, outperforming fixed‑parameter baselines in human and automatic evaluations, benchmark tests, and conflict‑task experiments, highlighting a practical real‑time adaptation approach for AI systems.

HY-WU · LoRA · Transformer

0 likes · 16 min read

Machine Learning Algorithms & Natural Language Processing

Mar 14, 2026 · Artificial Intelligence

Can Large Language Models Get Stronger Without Human Language Training? A New Pre‑Pre‑Training Path

A recent study shows that pre‑training Transformers on synthetic, non‑language data generated by Neural Cellular Automata can boost language‑model performance by up to 6%, accelerate convergence by 40%, and improve downstream reasoning, even outperforming models trained on massive natural‑text corpora.

In-Context Learning · Language Models · Neural Cellular Automata

0 likes · 12 min read

Bighead's Algorithm Notes

Mar 14, 2026 · Artificial Intelligence

Quantitative Finance Paper Digest: AI‑Driven Market Prediction Studies (Mar 7‑13 2026)

This digest summarizes four recent research papers that apply advanced AI techniques—node‑transformer graphs with BERT sentiment analysis, a quantum‑classical LSTM‑Born machine hybrid, large‑language‑model benchmarking for portfolio optimization, and a conditional diffusion model—to improve stock market prediction, volatility forecasting, and investment decision making, providing detailed experimental results and statistical validation.

BERT · Large Language Model · Quantum Computing

0 likes · 10 min read

High Availability Architecture

Mar 12, 2026 · Artificial Intelligence

How Claude Code Hits 92% Prompt Cache Rate and Slashes AI Agent Costs by 81%

This article explains the prompt‑caching mechanism used by Claude Code, showing how separating static prefixes from dynamic tails and leveraging KV‑tensor caching reduces the O(n²) complexity of transformer pre‑fill to O(n), achieving a 92% cache hit rate and up to 81% cost savings in long‑running AI agent sessions.

AI agents · Claude · Cost Reduction

0 likes · 12 min read

Machine Learning Algorithms & Natural Language Processing

Mar 11, 2026 · Artificial Intelligence

Random Parameter Pruning Boosts Transferable Targeted Attacks Across Model Architectures

The RaPA method introduces random parameter pruning during adversarial generation, creating diverse model variants that markedly increase the success rate of targeted transfer attacks across CNN and Transformer architectures, even against defended models and with higher computational budgets, as demonstrated on ImageNet‑compatible benchmarks.

CNN · Transformer · adversarial attacks

0 likes · 14 min read

Machine Learning Algorithms & Natural Language Processing

Mar 10, 2026 · Artificial Intelligence

How InfLLM‑V2 Achieves Seamless Short‑to‑Long Context Upgrade with Minimal Structural Changes

InfLLM‑V2 introduces a dense‑sparse switchable attention framework that preserves the original dense‑attention parameters while enabling efficient long‑context training, matching full‑attention performance on benchmarks such as RULER, LongBench, and chain‑reasoning tasks, and delivering up to 2.3× end‑to‑end inference speedup without degrading short‑sequence abilities.

Efficiency · InfLLM-V2 · Transformer

0 likes · 16 min read

Machine Learning Algorithms & Natural Language Processing

Mar 10, 2026 · Artificial Intelligence

Why the First Token Becomes a Value Garbage Bin – LeCun Team Dissects Spike and Attention Sink Mechanics

The paper by Yann LeCun’s team reveals that massive activation spikes and attention sinks in Transformers are not inherently coupled; spikes arise from position‑0 token interactions and specific feed‑forward dynamics, while attention sinks emerge from Pre‑norm normalization and head dimension, offering practical insights for model quantization and long‑context inference.

Attention Sink · LLM · Massive Activations

0 likes · 9 min read

Machine Learning Algorithms & Natural Language Processing

Mar 9, 2026 · Artificial Intelligence

Instant LoRA Generation and Long‑Document Internalization: Cost‑Amortized Model Updates via 0.1‑Second Forward Pass

The article analyzes the quadratic attention and KV‑Cache bottlenecks of Transformers on ultra‑long inputs and the heavy compute cost of traditional supervised fine‑tuning, then presents Sakana AI's Cost Amortization framework—Doc‑to‑LoRA and Text‑to‑LoRA—that shifts weight updates to a meta‑training hypernetwork, achieving sub‑50 MB memory for 128K‑token inference, sub‑GB update memory for long‑document QA, and zero‑shot task adaptation with sub‑second latency.

Cost Amortization · Cross-modal · LoRA

0 likes · 13 min read

Machine Learning Algorithms & Natural Language Processing

Mar 7, 2026 · Artificial Intelligence

Transformer Hidden States Can Reconstruct Input with 100% Accuracy – New Invertibility Study

A recent paper from Sapienza University's GLADIA Lab shows that mainstream Transformer language models are injective, enabling a novel SIPIT algorithm to recover original text from hidden states with perfect accuracy, while extensive experiments confirm the models retain all input information.

Injective · Invertibility · SIPIT

0 likes · 11 min read

Data Party THU

Mar 6, 2026 · Artificial Intelligence

How Small Can a Transformer Get? Inside the 121‑Parameter AdderBoard Challenge

This article chronicles the AdderBoard competition, detailing how researchers compressed a Transformer for 10‑digit addition down to just 121 parameters, the experimental rules, the contrasting hand‑coded and data‑driven approaches, and the insights gained about model minimalism and discoverability.

AdderBoard · Model Compression · Transformer

0 likes · 13 min read

Machine Learning Algorithms & Natural Language Processing

Mar 3, 2026 · Artificial Intelligence

Identity Constraint Beats DeepSeek mHC After 150B Tokens: A Surprising Reversal

Extensive experiments on DeepSeek's 1.7B and 8B models reveal that replacing the manifold hyper‑connection (mHC) constraint with a simple identity matrix consistently outperforms the original mHC, improves signal flow stability, and avoids the collapse caused by repeated Sinkhorn‑Knopp projections.

DeepSeek · Hyper-Connection · Sinkhorn

0 likes · 12 min read

Machine Learning Algorithms & Natural Language Processing

Mar 3, 2026 · Artificial Intelligence

Beyond Dense and MoE: JTok Module Cuts Compute by One‑Third as a New Scaling Path

The paper introduces JTok and its dynamic variant JTok‑M, a token‑indexed parameter scaling method that decouples model capacity from compute, achieving up to 35% compute reduction while delivering consistent performance gains across a wide range of downstream tasks and model sizes.

Compute Efficiency · JTok · Token-indexed scaling

0 likes · 16 min read

Data STUDIO

Feb 25, 2026 · Artificial Intelligence

Build a Large Language Model from Scratch with PyTorch—No Libraries, No Shortcuts

This guide walks you through building, training, and fine‑tuning a Transformer‑based large language model entirely from scratch using PyTorch, covering tokenization, self‑attention, multi‑head attention, positional encoding, model architecture, data preparation, training loops, and fine‑tuning on custom lyrics.

GPT · LLM · PyTorch

0 likes · 43 min read

Qborfy AI

Feb 21, 2026 · Artificial Intelligence

How Self-Attention Powers Modern AI: From Theory to Real-World Impact

This article explains the self‑attention mechanism behind transformers, detailing its core components, mathematical formulation, step‑by‑step example, multi‑head extension, industry use cases, and a thorough comparison with RNN and CNN approaches, all supported by concrete numbers and citations.

Self-Attention · Transformer · attention mechanism

0 likes · 8 min read

Data Party THU

Feb 21, 2026 · Artificial Intelligence

Unlocking Compositional Generalization: Meta‑Learning Strategies for Neural Networks

This article examines how meta‑learning combined with compositionality enables neural networks to rapidly adapt to new tasks by formalizing hierarchical optimization, leveraging modular architectures with hypernetworks, and exploiting Transformer latent codes for effective compositional generalization.

Bilevel Optimization · Meta Learning · Transformer

0 likes · 5 min read

Bighead's Algorithm Notes

Feb 18, 2026 · Artificial Intelligence

Which Loss Function Ranks Stocks Best? An Empirical Study with Transformer Models

This paper evaluates point‑wise, pair‑wise, and list‑wise loss functions for Transformer‑based stock‑return prediction on 110 S&P 500 stocks, showing that Margin loss achieves the highest annual return (16.23%) and Sharpe ratio (0.75), ListNet delivers strong returns with low volatility, and BPR minimizes maximum drawdown, highlighting how loss design critically shapes ranking‑driven portfolio performance.

Loss Functions · Machine Learning · Quantitative Trading

0 likes · 15 min read

AI Cyberspace

Feb 15, 2026 · Artificial Intelligence

From GPT-1 to GPT-4o: A Deep Dive into the Evolution of Large Language Models

This article chronicles the rapid progression of GPT models from the 2018 GPT‑1 pre‑training breakthrough through GPT‑2’s multitask learning, GPT‑3’s scaling laws and few‑shot abilities, to GPT‑4’s multimodal capabilities and the 2024 GPT‑4 Turbo, Sora, and GPT‑4o releases, while also explaining core LLM abilities and the decoder‑only architecture of GPT‑2.

AI evolution · GPT · Model architecture

0 likes · 20 min read

AI Cyberspace

Feb 14, 2026 · Artificial Intelligence

Unpacking the Transformer: From Embeddings to Multi‑Head Attention

This article provides a comprehensive, step‑by‑step walkthrough of the Transformer architecture, covering input embedding, positional encoding, the mechanics of Q‑K‑V attention, scaled dot‑product formulas, multi‑head and masked attention, feed‑forward networks, residual connections, layer normalization, decoder generation, and recent attention‑optimization techniques.

Feed-Forward Network · Positional Encoding · Self-Attention

0 likes · 39 min read
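
For reference, the scaled dot‑product and multi‑head formulas that this summary mentions are conventionally written as below. This is the standard form from the original Transformer paper; the linked article's own notation may differ.

```latex
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V})
\]
```

Here $Q$, $K$ and $V$ are the query, key and value matrices, $d_k$ is the key dimension, and $h$ is the number of heads. Dividing by $\sqrt{d_k}$ keeps the dot products from growing with dimension and saturating the softmax.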

AI Cyberspace

Feb 13, 2026 · Artificial Intelligence

How Attention Mechanisms Revolutionized Computer Vision and Machine Translation

This article traces the evolution of attention mechanisms from their inaugural application in computer vision and machine translation to their central role in modern Transformer models, detailing the underlying RNN‑Attention designs, the breakthrough in sequence alignment, and the innovations that enabled high‑performance, parallelizable deep learning architectures.

Transformerattention mechanismcomputer vision

0 likes · 14 min read

How Attention Mechanisms Revolutionized Computer Vision and Machine Translation

HyperAI Super Neural

Feb 6, 2026 · Artificial Intelligence

Inspired by DeepSeek Engram, Gengram Boosts Genomic Foundation Models by Up to 22.6%

The Genos team introduces Gengram, a 20‑million‑parameter plug‑in that stores 1‑ to 6‑mer embeddings in a hash memory, uses local window aggregation and gated writing, and delivers performance gains of up to 22.6% across multiple genomic tasks while accelerating training.

AI genomicsGengramGenomic Engram

0 likes · 12 min read

Inspired by DeepSeek Engram, Gengram Boosts Genomic Foundation Models by Up to 22.6%

Data Party THU

Feb 4, 2026 · Artificial Intelligence

How Sakana AI Redefines Long-Context Transformers: DroPE, REPO, and FwPKM Explained

This article analyzes Sakana AI's three recent papers that challenge traditional Transformer long‑sequence handling by removing positional embeddings, reconstructing position awareness, and adding a fast‑weight external memory, showing how each approach improves ultra‑long text understanding.

Memory MechanismPositional EmbeddingTransformer

0 likes · 12 min read

How Sakana AI Redefines Long-Context Transformers: DroPE, REPO, and FwPKM Explained

HyperAI Super Neural

Feb 3, 2026 · Artificial Intelligence

Walrus: 1.3B Transformer Model Beats Prior Foundations Across 19 Physics Domains

Walrus, a 1.3 billion‑parameter Transformer built by Polymathic AI, is pretrained on 19 diverse physics scenarios—including astrophysics, geoscience, rheology, plasma physics and acoustics—using techniques like patch jittering, adaptive compute tokenization and space‑time factorized attention, and consistently outperforms earlier foundation models on both short‑ and long‑term continuum dynamics predictions.

Foundation ModelTransformerWalrus

0 likes · 13 min read

Walrus: 1.3B Transformer Model Beats Prior Foundations Across 19 Physics Domains

Tencent Technical Engineering

Feb 2, 2026 · Artificial Intelligence

Why Neural Networks Are the Hidden Engine Behind Modern AI: From Basics to Large Language Models

This comprehensive guide walks through the fundamentals of neural networks, activation functions, training methods, and how they power large language models, while also covering tokenization, self‑attention, transformer architectures, AI infrastructure, and practical usage through agents and retrieval‑augmented generation.

Agent SystemsArtificial IntelligenceGPU infrastructure

0 likes · 75 min read

Why Neural Networks Are the Hidden Engine Behind Modern AI: From Basics to Large Language Models

Network Intelligence Research Center (NIRC)

Jan 31, 2026 · Artificial Intelligence

How Engram Lets Large Models Swap GPU Memory for Cheap RAM to ‘Look Up’ Knowledge

The article dissects DeepSeek’s new Engram architecture, which separates computation from memory by using a large, cheap‑RAM‑based lookup table to store factual knowledge, allowing the transformer’s compute layers to focus on reasoning, dramatically reducing GPU memory demand while improving code, math, and long‑context performance.

EngramGPU MemoryLarge Language Model

0 likes · 7 min read

How Engram Lets Large Models Swap GPU Memory for Cheap RAM to ‘Look Up’ Knowledge

HyperAI Super Neural

Jan 23, 2026 · Artificial Intelligence

Weekly AI Paper Digest: New Transformer Advances in Sparsity, Memory, and Reasoning

This article reviews five recent Transformer papers—including Engram's conditional memory, STEM's embedding‑based scaling, SeedFold's biomolecular structure prediction, a critique of Transformers for time‑series forecasting, and reasoning models as societies of thought—highlighting their methods, datasets, and performance gains.

Biomolecular Structure PredictionMemory MechanismsStructural Sparsity

0 likes · 7 min read

Weekly AI Paper Digest: New Transformer Advances in Sparsity, Memory, and Reasoning

PaperAgent

Jan 22, 2026 · Artificial Intelligence

How STEM Replaces MoE Routing with Simple Table Lookup for Faster Transformers

The article presents STEM, a method that transforms dense and MoE transformer architectures by converting the expert routing step into a static table‑lookup operation, achieving higher parameter efficiency, lower communication overhead, and improved interpretability while maintaining or boosting downstream task performance.

Embedding LookupInterpretabilityMixture of Experts

0 likes · 6 min read

How STEM Replaces MoE Routing with Simple Table Lookup for Faster Transformers

Java Tech Enthusiast

Jan 21, 2026 · Artificial Intelligence

Inside X’s Open‑Source Recommendation Engine: How the Grok‑Powered Transformer Works

The X platform has open‑sourced its new "For You" recommendation system, revealing a Grok‑based Transformer architecture, a detailed module breakdown, a seven‑step content ranking pipeline, and the strategic motivations behind the unprecedented move toward algorithmic transparency and community‑driven improvement.

Machine LearningSocial MediaTransformer

0 likes · 12 min read

Inside X’s Open‑Source Recommendation Engine: How the Grok‑Powered Transformer Works

PaperAgent

Jan 20, 2026 · Artificial Intelligence

How X’s Open‑Source “For You” Recommendation Engine Works

X (formerly Twitter) has open‑sourced its “For You” recommendation algorithm, revealing a Grok‑based Transformer that merges on‑platform and off‑platform content, removes manual features, and scores posts through a multi‑stage pipeline with candidate sourcing, hydration, filtering, scoring, and selection.

GrokMachine LearningOpen Source

0 likes · 5 min read

How X’s Open‑Source “For You” Recommendation Engine Works

Data Party THU

Jan 19, 2026 · Artificial Intelligence

How VersatileFFN Cuts Memory Use While Boosting LLM Performance

The article introduces Huawei's VersatileFFN, an adaptive wide‑and‑deep feed‑forward design for large language models that reuses parameters to slash memory consumption while delivering stronger inference, detailing its dual‑system inspiration, technical mechanisms, experimental gains, and implications for efficient LLM deployment.

Adaptive ComputationLLMTransformer

0 likes · 8 min read

How VersatileFFN Cuts Memory Use While Boosting LLM Performance

AI Architecture Hub

Jan 19, 2026 · Artificial Intelligence

Demystifying the Transformer: From Input Embedding to Multi‑Head Attention

This article breaks down the core components of the Transformer architecture—including input embedding, positional encoding, multi‑head self‑attention, residual connections with layer normalization, position‑wise feed‑forward networks, and the rationale behind stacking multiple encoder layers—using clear explanations and illustrative diagrams.

Add&NormFeed ForwardInput Embedding

0 likes · 12 min read

Demystifying the Transformer: From Input Embedding to Multi‑Head Attention

AI Large Model Application Practice

Jan 15, 2026 · Artificial Intelligence

Why Transformers Need Positional Embeddings and How They Work

This article explains the order‑blindness of Transformer self‑attention, why naïvely adding raw position indices harms semantics, and walks through sinusoidal, learnable, and rotary positional encodings together with PI and YaRN techniques for extending sequence length.

AILLMPositional Embedding

0 likes · 12 min read

Why Transformers Need Positional Embeddings and How They Work
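
As a rough illustration of the sinusoidal scheme mentioned in this summary, the sketch below computes the encoding vector for a single position in plain Python. The function name `sinusoidal_encoding` is hypothetical, and the base of 10000 follows the formulation in the original Transformer paper; none of this is code from the article itself.

```python
import math


def sinusoidal_encoding(position: int, d_model: int) -> list[float]:
    """Return the sinusoidal positional encoding vector for one position."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even so sin/cos values come in pairs")
    encoding = []
    for i in range(0, d_model, 2):
        # Each sin/cos pair shares one frequency; the frequency shrinks
        # geometrically as the dimension index i grows.
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))  # even dimension
        encoding.append(math.cos(angle))  # odd dimension
    return encoding


# Position 0 gives sin(0) = 0 and cos(0) = 1 in every pair.
print(sinusoidal_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

Because every value stays within [-1, 1], adding this vector to a token embedding perturbs it far less than adding a raw position index would, which is one way to read the summary's point about raw indices harming semantics.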

AI Cyberspace

Jan 13, 2026 · Artificial Intelligence

From Symbolic AI to LLMs: A Complete NLP History and Model Guide

This article provides a comprehensive overview of natural language processing, tracing its evolution from early symbolic and statistical stages through deep learning breakthroughs, detailing sequence models, key NLP tasks, text representation methods, and the development of modern architectures like RNN, LSTM, GRU, Transformer, and GPT series.

GPTLSTMNLP

0 likes · 60 min read

From Symbolic AI to LLMs: A Complete NLP History and Model Guide

PaperAgent

Jan 13, 2026 · Artificial Intelligence

How Engram’s Conditional Memory Redefines Sparsity in Large Language Models

DeepSeek’s newly released Engram module introduces a conditional memory mechanism that leverages O(1) N‑gram lookup to create a new sparsity axis for large language models, reducing early‑layer compute, improving inference efficiency, and delivering notable performance gains across reasoning and knowledge tasks, as demonstrated by extensive experiments on 27‑billion‑parameter models.

Efficient InferenceEngramLLM Sparsity

0 likes · 8 min read

How Engram’s Conditional Memory Redefines Sparsity in Large Language Models

AI Insight Log

Jan 12, 2026 · Artificial Intelligence

Goodbye H100: How DeepSeek’s Engram Uses CPU Memory to Scale LLM Knowledge Bases

DeepSeek’s Engram architecture adds a deterministic dictionary lookup to Transformers, storing massive N‑gram tables in cheap CPU DRAM, which reduces GPU memory use and boosts both knowledge‑heavy and reasoning benchmarks while keeping the added inference latency under 3%.

CPU memoryDeterministic LookupEngram

0 likes · 7 min read

Goodbye H100: How DeepSeek’s Engram Uses CPU Memory to Scale LLM Knowledge Bases

AI Architecture Hub

Jan 7, 2026 · Artificial Intelligence

Why “Attention Is All You Need” Still Shapes AI: A Beginner’s Deep Dive

This article provides a comprehensive, beginner‑friendly walkthrough of the landmark 2017 paper “Attention Is All You Need,” covering its authors, historical context, the shortcomings of RNNs and CNNs, the birth of self‑attention, the Transformer architecture, and its transformative impact on modern AI.

AI historyTransformerattention mechanism

0 likes · 9 min read

Why “Attention Is All You Need” Still Shapes AI: A Beginner’s Deep Dive

Network Intelligence Research Center (NIRC)

Jan 4, 2026 · Artificial Intelligence

How UniCodebook’s Unified 2D‑3D Discrete Priors Boost Noise‑Robust, Calibration‑Free 3D Human Pose Estimation

UniCodebook introduces a unified 2D‑3D discrete prior that combines continuous and discrete representations, enabling calibration‑free multiview 3D human pose estimation with superior noise robustness and higher accuracy, as demonstrated by state‑of‑the‑art results on Human3.6M and MPI‑INF‑3DHP.

3D pose estimationNeurIPS 2025Transformer

0 likes · 8 min read

How UniCodebook’s Unified 2D‑3D Discrete Priors Boost Noise‑Robust, Calibration‑Free 3D Human Pose Estimation

IT Services Circle

Dec 27, 2025 · Artificial Intelligence

From Ancient Brains to Modern AI: A Journey Through AI’s Evolution and Future

This comprehensive guide traces AI from the origins of human intelligence and the first computers, through the birth of artificial intelligence, the rise of machine learning and large language models, to the emergence of agents, multimodal systems, and the challenges that lie ahead.

AI historyHallucination MitigationRAG

0 likes · 39 min read

From Ancient Brains to Modern AI: A Journey Through AI’s Evolution and Future

Tencent Technical Engineering

Dec 24, 2025 · Artificial Intelligence

Build a Mini LLM from Scratch: Step‑by‑Step Guide to Tokenizer, Attention, and Transformer

This article walks through constructing a small large‑language model from the ground up, covering model architecture, tokenization methods, BPE vocabulary building, embedding, positional encoding, attention mechanisms, multi‑head attention, transformer blocks, training pipelines, inference, and sampling strategies, all with runnable Python code.

LLMPythonTransformer

0 likes · 34 min read

Build a Mini LLM from Scratch: Step‑by‑Step Guide to Tokenizer, Attention, and Transformer

HyperAI Super Neural

Dec 22, 2025 · Artificial Intelligence

DA3 Enables Arbitrary‑View 3D Reconstruction with a Single Transformer

The ByteDance‑Seed team introduces Depth Anything 3 (DA3), a minimalist visual‑geometry model that uses a vanilla Transformer backbone and depth‑ray representation to jointly predict depth and camera pose from any number of images, achieving state‑of‑the‑art performance with a 35.7% gain in pose accuracy and a 23.6% improvement in geometric precision over prior methods.

3D visionDA3Depth estimation

0 likes · 6 min read

DA3 Enables Arbitrary‑View 3D Reconstruction with a Single Transformer

AI2ML AI to Machine Learning

Dec 21, 2025 · Artificial Intelligence

Why KV Caching Is Critical for Efficient LLM Inference

The article breaks down the principles of KV caching in large language models, explaining how Q/K/V behavior differs between training and inference, the role of prompts, cache size trade‑offs, and the complexities of concurrent inference, all backed by concrete examples and references.

Concurrent InferenceKV CacheLLM inference

0 likes · 7 min read

Why KV Caching Is Critical for Efficient LLM Inference
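
To make the caching idea in this summary concrete, here is a minimal single‑head sketch in plain Python. It assumes each token's query, key, and value vectors have already been projected, and the names `attend` and `KVCache` are hypothetical, not taken from the article.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over all keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max before exp for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(dim)
    ]


class KVCache:
    """Stores keys and values of past tokens so they are never recomputed."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Only the newest token's key/value are appended; earlier ones are reused.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


# Usage: decode three tokens one at a time (toy vectors, assumed pre-projected).
cache = KVCache()
tokens = [
    ([1.0, 0.0], [1.0, 0.0], [0.5, 0.5]),
    ([0.0, 1.0], [0.0, 1.0], [1.0, 0.0]),
    ([1.0, 1.0], [1.0, 1.0], [0.0, 1.0]),
]
for q, k, v in tokens:
    output = cache.step(q, k, v)
```

The cache grows by one key and one value per generated token, which is the memory side of the trade‑off the summary refers to: each step avoids recomputing past keys and values, at the cost of storage that scales with sequence length.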

AI2ML AI to Machine Learning

Dec 19, 2025 · Artificial Intelligence

The 9 Key Ideas Behind FlashAttention

FlashAttention accelerates transformer inference by combining nine techniques—including loss‑less attention, GPU memory‑pyramid optimization, SRAM‑reusing tiling, safe softmax scaling, online buffering, tile‑size constraints, parallel multiplication, reduced KV slicing, and integrated backward‑pass caching—to achieve efficient, high‑throughput computation on modern GPUs.

FlashAttentionGPU optimizationOnline Algorithm

0 likes · 8 min read

Xiaomi Tech

Dec 19, 2025 · Artificial Intelligence

AI Evolution Mirrors Biology—Open Source Speeds Progress 1,000× (Daniel Povey)

Daniel Povey compares AI's trial‑and‑error development to biological evolution, argues that open‑source collaboration can make research a thousand times faster, and outlines his dual‑strategy approach and the three breakthroughs of the new Zapformer speech model.

AIMachine LearningOpen Source

0 likes · 12 min read

AI Evolution Mirrors Biology—Open Source Speeds Progress 1,000× (Daniel Povey)

AI Frontier Lectures

Dec 17, 2025 · Artificial Intelligence

Can OmniVGGT Unlock Multi‑Modal 3D Vision with Any Number of Inputs?

OmniVGGT introduces a flexible omni‑modality driven transformer that can ingest arbitrary numbers of geometric cues such as depth maps and camera parameters, achieving state‑of‑the‑art performance on diverse 3D tasks while keeping inference speed comparable to its RGB‑only predecessor.

3D visionGeometryOmniVGGT

0 likes · 13 min read

Can OmniVGGT Unlock Multi‑Modal 3D Vision with Any Number of Inputs?

Architect

Dec 15, 2025 · Artificial Intelligence

Demystifying LLM Architecture: From Transformers to Modern MoE Designs

This comprehensive guide explains the fundamentals of large language model (LLM) architectures, covering the original Transformer, tokenization, embeddings, positional encoding, attention mechanisms, feed‑forward networks, layer stacking, a step‑by‑step translation example, and the latest open‑source and hybrid LLM designs shaping the field.

EmbeddingLLMMoE

0 likes · 41 min read

Demystifying LLM Architecture: From Transformers to Modern MoE Designs

Tencent Cloud Developer

Dec 9, 2025 · Artificial Intelligence

How Do Large Language Models Turn Text into Math? A Deep Dive into Transformers

This article walks through the complete workflow of AI large language models, from turning user queries into token matrices via tokenization and embedding, through the Transformer’s self‑attention and multi‑head mechanisms, to decoding logits into human‑readable text, while also covering position encoding, long‑context strategies, generation parameters, and practical engineering tips.

Inference OptimizationSelf-AttentionTransformer

0 likes · 29 min read

How Do Large Language Models Turn Text into Math? A Deep Dive into Transformers

ShiZhen AI

Dec 5, 2025 · Artificial Intelligence

Can AI Achieve Human‑Like Long‑Term Memory? Inside Google’s Titans Architecture

Google’s newly unveiled Titans architecture tackles AI’s “forgetfulness” by embedding a Neural Long‑Term Memory (LMM) module that updates model weights during inference using a test‑time training approach and a MIRAS surprise metric, enabling context windows of more than 2 million tokens with linear O(N) computation and better benchmark results than GPT‑4 with RAG.

AI ArchitectureGoogle TitansLong-term Memory

0 likes · 5 min read

Can AI Achieve Human‑Like Long‑Term Memory? Inside Google’s Titans Architecture

Tencent Technical Engineering

Dec 3, 2025 · Artificial Intelligence

Why Transformers Power Modern LLMs: A Deep Dive into Architecture and Mechanics

This article provides a comprehensive, step‑by‑step explanation of the Transformer architecture that underpins large language models, covering tokenization, embeddings, positional encoding, attention mechanisms, feed‑forward networks, layer stacking, a detailed translation example, visualized attention weights, and a survey of recent open‑source LLM designs such as DeepSeek V3, OLMo 2, and Gemma 3.

EmbeddingLLMNeural Network

0 likes · 38 min read

Why Transformers Power Modern LLMs: A Deep Dive into Architecture and Mechanics

Wuming AI

Nov 30, 2025 · Artificial Intelligence

What Exactly Is a Large Language Model? A Simple Guide to AI, Transformers, and How They Work

This article explains the relationship between AI, machine learning, deep learning, and large language models, detailing their evolution, training stages, transformer architecture, attention mechanisms, inference APIs, and practical usage examples, while demystifying common misconceptions about LLM capabilities.

AI fundamentalsLarge Language ModelMachine Learning

0 likes · 10 min read

What Exactly Is a Large Language Model? A Simple Guide to AI, Transformers, and How They Work

Java Tech Enthusiast

Nov 30, 2025 · Artificial Intelligence

How a 5‑Million‑Parameter ChatGPT Clone Runs Inside Minecraft’s Redstone

A Minecraft developer built CraftGPT, a 5‑million‑parameter language model that runs entirely on Redstone circuits, demonstrating how the game’s Turing‑complete logic system can implement a transformer‑style AI with hundreds of millions of in‑game blocks.

AIGame ComputingMinecraft

0 likes · 9 min read

How a 5‑Million‑Parameter ChatGPT Clone Runs Inside Minecraft’s Redstone

Data Party THU

Nov 26, 2025 · Artificial Intelligence

Are Transformers Truly Invertible? Uncovering Injectivity and the SIPIT Algorithm

A recent study demonstrates that mainstream Transformer language models are mathematically injective and practically invertible, with large‑scale experiments confirming no hidden‑state collisions and a new SIPIT algorithm achieving 100% input reconstruction across text and code.

InjectivityInvertibilitySIPIT

0 likes · 10 min read

Are Transformers Truly Invertible? Uncovering Injectivity and the SIPIT Algorithm

Huawei Cloud Developer Alliance

Nov 24, 2025 · Artificial Intelligence

How to Supercharge Transformer AI Agents with Model Compression and Inference Acceleration

This article explains why Transformer models dominate modern AI agents, outlines the challenges of large parameter counts and latency, and presents a comprehensive guide to model compression (parameter sharing, knowledge distillation, quantization, pruning) and inference acceleration (parallel computing, optimized attention, TensorRT deployment), complete with PyTorch code examples and a real‑world case study showing speed‑up and storage savings.

AI agentInference AccelerationModel Compression

0 likes · 34 min read

How to Supercharge Transformer AI Agents with Model Compression and Inference Acceleration

Kuaishou Tech

Nov 19, 2025 · Artificial Intelligence

Can a Single Number Create a Whole New Visual Style? Inside CoTyle’s Code‑to‑Style Generation

CoTyle introduces a novel open‑source framework that generates unique image styles from a numeric style code, eliminating the need for reference images, lengthy prompts, or LoRA modules, and demonstrates superior style consistency compared to existing solutions like Midjourney.

Transformerdiffusion modelgenerative AI

0 likes · 8 min read

Can a Single Number Create a Whole New Visual Style? Inside CoTyle’s Code‑to‑Style Generation

Data Party THU

Nov 13, 2025 · Artificial Intelligence

What Makes the Free Transformer a Game‑Changer in AI Decoding?

The Free Transformer paper introduces a decoder architecture that injects random latent variables to condition generation, breaking traditional GPT constraints and achieving notable performance gains on reasoning‑heavy benchmarks such as HumanEval+, MBPP, GSM8K, MMLU, and CSQA.

AI researchFree TransformerTransformer

0 likes · 10 min read

What Makes the Free Transformer a Game‑Changer in AI Decoding?

HyperAI Super Neural

Nov 7, 2025 · Artificial Intelligence

QiankunNet: A Transformer‑Based Framework for Solving the Many‑Electron Schrödinger Equation

Researchers at the University of Science and Technology of China have introduced QiankunNet, a Transformer‑based network that integrates attention mechanisms with quantum wave‑function construction to solve many‑electron Schrödinger equations, achieving near‑FCI accuracy and outperforming traditional coupled‑cluster methods on benchmark molecules.

Many‑Electron SystemsNature CommunicationsQiankunNet

0 likes · 6 min read

QiankunNet: A Transformer‑Based Framework for Solving the Many‑Electron Schrödinger Equation

Tencent Cloud Developer

Nov 4, 2025 · Artificial Intelligence

From Functions to Transformers: Mastering Neural Networks Step by Step

This article walks you through the evolution from basic mathematical functions to modern large‑scale models, explaining activation functions, forward and backward propagation, loss calculation, gradient descent, regularization, dropout, word embeddings, RNNs, and the core mechanics of the Transformer architecture.

RNNRegularizationTransformer

0 likes · 15 min read

From Functions to Transformers: Mastering Neural Networks Step by Step

Data Party THU

Nov 2, 2025 · Artificial Intelligence

From RNN to LLM: How Transformers Power Modern Language Models

This article explains the evolution from RNNs through Encoder‑Decoder models to Transformers, detailing self‑attention, multi‑head attention, and masked attention, and then describes what Large Language Models are, their key components, capabilities, limitations, and common applications.

AILLMLarge Language Model

0 likes · 9 min read

From RNN to LLM: How Transformers Power Modern Language Models

Kuaishou Large Model

Oct 31, 2025 · Artificial Intelligence

EMER: End-to-End Multi-Objective Ranking That Transforms Short-Video Recommendations

EMER, Kuaishou’s end‑to‑end multi‑objective ensemble ranking framework, replaces handcrafted scoring formulas with a transformer‑based model that learns comparative preferences, integrates normalized rank features, optimizes relative satisfaction and multi‑dimensional proxy metrics, and dynamically balances objectives via a self‑evolving advantage evaluator, delivering significant online gains.

Machine LearningRecommendation SystemsTransformer

0 likes · 17 min read

EMER: End-to-End Multi-Objective Ranking That Transforms Short-Video Recommendations

Huawei Cloud Developer Alliance

Oct 31, 2025 · Artificial Intelligence

Beyond Transformers: Exploring Post‑Transformer Architectures for Long‑Sequence Modeling

This article reviews the emerging post‑Transformer research landscape, covering linear state‑space models, efficient attention approximations, MLP/conv/RNN hybrids, sparse and causal attention mechanisms, and outlines future trends that may complement or replace the classic Transformer architecture for handling ultra‑long sequences.

AIEfficient AttentionHybrid Architecture

0 likes · 17 min read

Beyond Transformers: Exploring Post‑Transformer Architectures for Long‑Sequence Modeling

HyperAI Super Neural

Oct 30, 2025 · Artificial Intelligence

OmniCast Achieves 20× Speed Boost and Eliminates Autoregressive Error Accumulation in S2S Weather Forecasting

OmniCast, a novel latent diffusion model from UCLA and Argonne National Laboratory, combines a VAE and a Transformer to generate high‑precision probabilistic sub‑seasonal to seasonal forecasts, dramatically reducing the error accumulation of autoregressive methods and delivering 10‑20× faster inference while surpassing state‑of‑the‑art baselines across accuracy, physical consistency, and probabilistic metrics.

Latent DiffusionOmniCastTransformer

0 likes · 15 min read

OmniCast Achieves 20× Speed Boost and Eliminates Autoregressive Error Accumulation in S2S Weather Forecasting

Bighead's Algorithm Notes

Oct 23, 2025 · Artificial Intelligence

FinCast: A Foundation Model for Financial Time‑Series Forecasting

FinCast introduces a decoder‑only Transformer foundation model for financial time‑series forecasting that tackles non‑stationarity, multi‑domain diversity, and multi‑resolution challenges through input chunking with frequency embeddings, a sparse MoE decoder, and a PQ‑loss, achieving zero‑shot and supervised gains over state‑of‑the‑art baselines while running five times faster on consumer GPUs.

Foundation ModelPQ lossSparse MoE

0 likes · 12 min read

FinCast: A Foundation Model for Financial Time‑Series Forecasting

Wu Shixiong's Large Model Academy

Oct 23, 2025 · Artificial Intelligence

Why the Transformer Core Structure Is the Key to AI Interview Success

This article explains the fundamental purpose, architecture, and variants of the Transformer model—including Encoder‑Decoder, Encoder‑only, and Decoder‑only designs—while detailing how attention mechanisms work and why modern large‑language models favor the Decoder‑only approach, providing a concise framework for answering interview questions.

AI InterviewEncoder-DecoderLarge Language Model

0 likes · 10 min read

Why the Transformer Core Structure Is the Key to AI Interview Success

Bighead's Algorithm Notes

Oct 21, 2025 · Artificial Intelligence

KANMixer: A New KAN‑Centric Paradigm for Long‑Term Time Series Forecasting

This article reviews the KANMixer model, which places Kolmogorov‑Arnold Networks at the core of a lightweight architecture for long‑term time series forecasting, detailing its design, extensive benchmark experiments on seven real‑world datasets, ablation analyses, and its computational trade‑offs versus MLP and Transformer baselines.

Ablation StudyKANLong-term Time Series Forecasting

0 likes · 8 min read

KANMixer: A New KAN‑Centric Paradigm for Long‑Term Time Series Forecasting

Data Party THU

Oct 21, 2025 · Artificial Intelligence

Can Linear‑Time LSTMs Beat Transformers? Scaling Laws Reveal the Answer

The paper presents a systematic scaling‑law study of the linear‑time xLSTM architecture versus quadratic‑time Transformers, evaluating parameter‑data loss surfaces, optimal model size under equal FLOP budgets, and inference latency components, and shows that xLSTM consistently offers better cost‑effectiveness across diverse contexts and budgets.

Inference OptimizationLinear Time ComplexityTransformer

0 likes · 11 min read

Can Linear‑Time LSTMs Beat Transformers? Scaling Laws Reveal the Answer

AI2ML AI to Machine Learning

Oct 19, 2025 · Artificial Intelligence

Deep Dive into nanochat: Source Code, Model Size Calculations, and Optimization Techniques

This article provides a thorough analysis of nanochat’s source code, detailing transformer component differences, precise parameter‑size formulas, FlashNorm and ReLU² innovations, scaling‑law insights, memory‑usage estimations, and the distributed optimizer and training pipelines used to build the model.

LLMTransformerdistributed training

0 likes · 20 min read

Deep Dive into nanochat: Source Code, Model Size Calculations, and Optimization Techniques

Network Intelligence Research Center (NIRC)

Oct 17, 2025 · Artificial Intelligence

LucaOne: Unified Nucleic Acid & Protein Language Model Surpasses Other Models

Researchers present LucaOne, a Transformer‑based foundation model that unifies DNA/RNA and protein sequences using a 39‑token vocabulary, rotary positional encoding, and molecule‑type embeddings, and demonstrate through extensive multi‑task benchmarks that it outperforms domain‑specific models across seven biological tasks.

DNAFoundation ModelMultimodal

0 likes · 5 min read

LucaOne: Unified Nucleic Acid & Protein Language Model Surpasses Other Models

Data Party THU

Oct 16, 2025 · Artificial Intelligence

How Tensor Product Attention Redefines Long‑Context Transformers

The article analyzes the Tensor Product Attention (TPA) method presented at NeurIPS 2025, explaining how it factorizes Q, K, V tensors to drastically reduce KV cache size and attention complexity, and demonstrates superior convergence, lower perplexity, and faster inference on long‑sequence tasks compared with existing attention variants.

Efficient AttentionKV CacheRoPE

0 likes · 11 min read

How Tensor Product Attention Redefines Long‑Context Transformers

Bighead's Algorithm Notes

Oct 11, 2025 · Artificial Intelligence

Recent Advances in Multivariate Time Series Forecasting: Paper Summaries (Sep 27 – Oct 10 2025)

This article summarizes eight newly released AI papers on multivariate time‑series forecasting and anomaly detection, detailing each work's motivation, proposed methodology, key innovations such as CRIB, TS‑JEPA, DSAT‑HD, DIMIGNN, ASTGI, IndexNet, TsLLM, Moon, TimeSeriesScientist, MLG‑4TS, and Augur, and reports their experimental validation on real‑world datasets.

Anomaly DetectionLarge Language ModelTransformer

0 likes · 23 min read

Recent Advances in Multivariate Time Series Forecasting: Paper Summaries (Sep 27 – Oct 10 2025)