Tagged articles

80 articles

Machine Learning Algorithms & Natural Language Processing

May 28, 2026 · Artificial Intelligence

Solo Development of GQLA: Challenging DeepSeek’s MLA and DSA

This article presents GQLA, a single‑author variant of MLA that eliminates three hardware‑related drawbacks of MLA, demonstrates how it achieves balanced compute‑memory performance on both high‑end H100 and more modest H20 GPUs, and details conversion methods (TransGQLA) and sparse extensions with concrete benchmark results.

GQLA · LLM · MLA

0 likes · 16 min read

PaperAgent

May 26, 2026 · Artificial Intelligence

Why External Retrieval in RAG Is Redundant: Insights from NVIDIA’s INTRA Paper

The INTRA paper shows that using a decoder’s cross‑attention as an internal retrieval mechanism eliminates the need for a separate retriever, achieving state‑of‑the‑art multihop QA performance with only 164 K trainable parameters and shared pre‑encoded representations.

INTRA · RAG · attention

0 likes · 8 min read

Geek Labs

May 6, 2026 · Artificial Intelligence

Build a GPT from Scratch and Decode AI Coding Jargon with Two Top GitHub Projects

The article introduces two practical GitHub repositories—how-to-train-your-gpt, a step‑by‑step guide that builds a LLaMA‑style GPT model across 12 chapters, and dictionary-of-ai-coding, a plain‑language glossary of AI‑coding terms—showing how they together provide a complete understanding of modern LLM fundamentals and terminology.

AI · GPT · GitHub

0 likes · 9 min read

AI Tech Publishing

Apr 29, 2026 · Artificial Intelligence

Why Do AI Agents Forget and Hallucinate? A Complete Guide to KV‑Cache Memory Mechanisms

The article explains that AI agents’ forgetting and hallucinations stem from token‑level attention scores causing key‑value cache eviction before retrieval. It then surveys KV‑cache basics, naive cache growth, StreamingLLM windowing, SnapKV’s attention‑guided compression, token‑retention studies, and Memory Sparse Attention, compares these methods, and discusses practical system pitfalls and design implications.

AI agents · KV Cache · Memory Sparse Attention

0 likes · 20 min read

Geek Labs

Apr 20, 2026 · Artificial Intelligence

A Complete Open‑Source Guide to LLM Internals: From Tokenization to Inference Optimization

This open‑source tutorial breaks down large language model internals into 11 detailed topics—covering BPE tokenization, attention mathematics, backpropagation, transformer architecture, KV‑Cache, Paged and Flash Attention, and frontier techniques—each with numeric derivations and Python code, making it ideal for developers and interview preparation.

Flash Attention · Inference Optimization · KV Cache

0 likes · 5 min read

AI Tech Publishing

Apr 9, 2026 · Artificial Intelligence

Engineering‑Focused Guide to Training and Inference of Large Language Models

This article walks engineers through the full LLM stack—from tokenization and positional encoding to transformer blocks, efficient fine‑tuning, quantization, and production‑grade inference techniques such as KV‑cache, FlashAttention, PagedAttention, continuous batching, and speculative decoding—highlighting trade‑offs, toolchains, and practical workflow steps.

LLM · LoRA · Transformer

0 likes · 13 min read

AI Tech Publishing

Apr 5, 2026 · Artificial Intelligence

Why the First Token Is Slow: A Deep Dive into KV Cache for LLM Inference

The article explains how KV cache eliminates redundant computations in autoregressive LLM generation, detailing the attention mechanism, the O(n²) waste of recomputing K and V, the cache‑based solution, its impact on time‑to‑first‑token, and the memory‑vs‑speed trade‑off.

Inference Optimization · KV Cache · LLM

0 likes · 7 min read

SuanNi

Mar 29, 2026 · Artificial Intelligence

How an AI Agent Outperformed NVIDIA Engineers in 7‑Day GPU Kernel Optimization

This article analyzes the AVO system, an autonomous AI agent that replaces traditional evolutionary search pipelines to iteratively improve CUDA attention kernels on NVIDIA's Blackwell B200 GPU, achieving up to 10.5% higher throughput than hand‑tuned implementations after a week of nonstop optimization.

AI · CUDA · GPU optimization

0 likes · 13 min read

Machine Learning Algorithms & Natural Language Processing

Mar 20, 2026 · Artificial Intelligence

Why Kimi Dropped Residual Connections: A First‑Person Deep Dive into Attention Residuals

This article explains how Attention Residuals (AttnRes) replace traditional residual shortcuts with layer‑wise attention, details the mathematical reformulation, design constraints, static‑Q trick, full and block variants, and presents experimental evidence of significant accuracy gains with modest overhead.

Efficient Attention · NLP · RMSNorm

0 likes · 11 min read

PaperAgent

Mar 17, 2026 · Artificial Intelligence

Can Attention Replace Fixed Residuals? Inside the ‘Attention Residuals’ Breakthrough

This article analyzes the newly released Attention Residuals paper, explaining how learnable attention weighting replaces fixed residual addition to mitigate information dilution in deep LLMs, detailing the proposed Block AttnRes design, engineering trade‑offs, experimental results, and its significance for foundational model architecture.

Block Attention · LLM · Model architecture

0 likes · 9 min read

Shi's AI Notebook

Mar 16, 2026 · Artificial Intelligence

What Attention Actually Does in MiniMind: Tracing Q/K/V, Shape Changes, and Context Fusion

This article walks through MiniMind's Attention.forward implementation, explaining why Q, K, and V are created, how tensors are reshaped for multi‑head attention, the role of masks, KV cache, GQA, and how each token aggregates information from the entire context.

KV Cache · Transformer · attention

0 likes · 21 min read

Bighead's Algorithm Notes

Feb 27, 2026 · Artificial Intelligence

Paper Review: NeurIF – Feature‑Controlled Learning of Dynamic Asset‑Pricing Factors and Loadings

NeurIF introduces a neural instrumented factorization framework that leverages company features as instruments, combines spatial and temporal attention to learn time‑varying latent factors and their loadings, achieves 1‑18% RMSE improvement over transformer baselines, and produces statistically significant long‑short portfolios that explain cross‑sectional pricing anomalies.

NeurIF · asset pricing · attention

0 likes · 15 min read

AI Cyberspace

Feb 14, 2026 · Artificial Intelligence

Unpacking the Transformer: From Embeddings to Multi‑Head Attention

This article provides a comprehensive, step‑by‑step walkthrough of the Transformer architecture, covering input embedding, positional encoding, the mechanics of Q‑K‑V attention, scaled dot‑product formulas, multi‑head and masked attention, feed‑forward networks, residual connections, layer normalization, decoder generation, and recent attention‑optimization techniques.

Feed-Forward Network · Positional Encoding · Self-Attention

0 likes · 39 min read

AI Large Model Application Practice

Jan 1, 2026 · Artificial Intelligence

Why Single-Head Attention Falls Short and Multi-Head Saves the Day

This article explains the inherent limitations of single-head attention in Transformers, illustrates them with a linguistic example, and then details how multi-head attention works through independent projection matrices, splitting and concatenation, ultimately boosting model expressiveness, robustness, and interpretability.

AI · attention · multi-head

0 likes · 9 min read

Architect

Dec 15, 2025 · Artificial Intelligence

Demystifying LLM Architecture: From Transformers to Modern MoE Designs

This comprehensive guide explains the fundamentals of large language model (LLM) architectures, covering the original Transformer, tokenization, embeddings, positional encoding, attention mechanisms, feed‑forward networks, layer stacking, a step‑by‑step translation example, and the latest open‑source and hybrid LLM designs shaping the field.

Embedding · LLM · MoE

0 likes · 41 min read

Tencent Technical Engineering

Dec 3, 2025 · Artificial Intelligence

Why Transformers Power Modern LLMs: A Deep Dive into Architecture and Mechanics

This article provides a comprehensive, step‑by‑step explanation of the Transformer architecture that underpins large language models, covering tokenization, embeddings, positional encoding, attention mechanisms, feed‑forward networks, layer stacking, a detailed translation example, visualized attention weights, and a survey of recent open‑source LLM designs such as DeepSeek V3, OLMo 2, and Gemma 3.

Embedding · LLM · Neural Network

0 likes · 38 min read

Data Party THU

Nov 2, 2025 · Artificial Intelligence

From RNN to LLM: How Transformers Power Modern Language Models

This article explains the evolution from RNNs through Encoder‑Decoder models to Transformers, detailing self‑attention, multi‑head attention, and masked attention, and then describes what Large Language Models are, their key components, capabilities, limitations, and common applications.

AI · LLM · Large Language Model

0 likes · 9 min read

Data Party THU

Oct 4, 2025 · Artificial Intelligence

Unveiling Transformer Internals: From Theory to PyTorch Code

This article explores the Transformer architecture in depth by combining the principles of the original paper with PyTorch source code, covering encoder‑decoder design, positional encoding assumptions, core parameters, residual connections, attention mechanisms, and detailed implementation snippets to help readers understand and reproduce the model.

Positional Encoding · PyTorch · Transformer

0 likes · 22 min read

Architect

Sep 16, 2025 · Artificial Intelligence

Why Transformers Outperform RNNs: A Beginner’s Guide to Attention and Architecture

This article introduces the Transformer architecture, explaining its attention mechanism, encoder‑decoder design, training and inference processes, and why it surpasses RNN‑based models, while also covering common applications and variations in natural language processing.

Model architecture · NLP · Transformer

0 likes · 13 min read

Alibaba Cloud Developer

Aug 6, 2025 · Artificial Intelligence

How Transformers Revolutionize Sequence Modeling: From RNN Limits to Self‑Attention Mastery

This article explains why Transformer models surpass traditional RNN‑based seq2seq architectures by introducing self‑attention, multi‑head attention, and positional encoding, detailing the inner workings of encoders, decoders, and attention mechanisms, and comparing their advantages and limitations across NLP and vision tasks.

GRU · LSTM · RNN

0 likes · 30 min read

AI Frontier Lectures

Jul 10, 2025 · Artificial Intelligence

Can 2‑Simplicial Attention Redefine Transformer Scaling Laws?

A recent Meta paper introduces a rotation‑invariant 2‑simplicial attention mechanism, demonstrates its superior scaling‑law coefficients over standard dot‑product attention, and provides experimental evidence of improved token efficiency and model performance under constrained token budgets.

2-simplicial · Meta · Scaling Law

0 likes · 11 min read

IT Services Circle

Jul 6, 2025 · Artificial Intelligence

Why Transformers Train Like Any Neural Network: Backpropagation Explained

This article demystifies how Transformers are trained by showing that all their linear layers have learnable weights and biases, and that the attention mechanism—including softmax and dot‑product operations—is fully differentiable and updated via standard back‑propagation.

Backpropagation · PyTorch · Transformer

0 likes · 7 min read

Baobao Algorithm Notes

May 26, 2025 · Artificial Intelligence

Why Do Reasoning LLMs Lose Instruction-Following Ability? A Deep Dive into Recent Findings

This article compares two recent papers that investigate why large reasoning models such as Llama and Qwen show degraded instruction‑following performance when using chain‑of‑thought prompting, analyzing attention patterns, training effects, and proposed mitigation strategies.

LLM · attention · chain-of-thought

0 likes · 11 min read

AI Frontier Lectures

Mar 24, 2025 · Artificial Intelligence

How MambaIRv2 Boosts Image Restoration with Attentive State‑Space Design

This article introduces MambaIRv2, an image restoration backbone that replaces Mamba’s causal scanning with an attentive state‑space module. The design needs only single‑direction scanning, reduces parameters and computation, and delivers superior performance on lightweight and classic super‑resolution, JPEG artifact removal, and denoising tasks, as validated by CVPR‑2025 results.

MambaIRv2 · attention · computer vision

0 likes · 8 min read

Baobao Algorithm Notes

Feb 17, 2025 · Artificial Intelligence

Can TransMLA Turn GQA into a More Powerful MLA? A Deep Dive into DeepSeek Models

This article presents a theoretical and experimental analysis of converting Grouped‑Query Attention (GQA) models to Multi‑Head Latent Attention (MLA) using the TransMLA method, demonstrating superior expressiveness and performance on DeepSeek‑based large language models while keeping KV‑Cache costs unchanged.

DeepSeek · MLA · TransMLA

0 likes · 11 min read

Ops Development & AI Practice

Feb 16, 2025 · Artificial Intelligence

Why FlashAttention Supercharges Qwen Models: A Technical Deep Dive

This article explains the FlashAttention algorithm, its memory‑efficient tiling and recomputation techniques, and how enabling the flash_attn flag dramatically speeds up Qwen‑series large models while outlining hardware, software requirements and potential trade‑offs.

FlashAttention · GPU optimization · Large Language Model

0 likes · 8 min read

Cognitive Technology Team

Feb 9, 2025 · Artificial Intelligence

A Beginner’s Guide to the History and Key Concepts of Deep Learning

From the perceptron’s inception in 1958 to modern Transformer-based models like GPT, this article traces the evolution of deep learning, explaining foundational architectures such as DNNs, CNNs, RNNs, LSTMs, attention mechanisms, and recent innovations like DeepSeek’s MLA, highlighting their principles and impact.

GPT · MLA · attention

0 likes · 19 min read

DataFunSummit

Dec 28, 2024 · Artificial Intelligence

Memory Optimization for Large Model Inference: Virtual Tensor and LayerKV Techniques

This talk presents the Ant Group team's recent work on large‑model inference memory optimization, covering GPU memory challenges, virtual memory management (VMM), the Virtual Tensor framework, LayerKV techniques, performance comparisons with PagedAttention and FlashAttention, and extensive experimental results demonstrating reduced latency and higher QPS.

GPU · Performance · Virtual Memory

0 likes · 25 min read

NewBeeNLP

Nov 18, 2024 · Artificial Intelligence

How to Optimize Multi-Head Attention: From MQA to FlashAttention and Beyond

This article examines various techniques for compressing and accelerating the KV cache in transformer models, including MQA, GQA, MLA, sliding‑window and linear attention, FlashAttention, PagedAttention and Ring Attention, as well as mixed‑precision training and ZeRO parallelism, providing code snippets, implementation details, and practical trade‑offs.

FlashAttention · KV Cache · Model Parallelism

0 likes · 17 min read

Baidu Intelligent Cloud Tech Hub

Jul 25, 2024 · Artificial Intelligence

How Transformers Work: From Tensor Basics to GPU Performance Analysis

This article provides a comprehensive, engineer‑focused breakdown of transformer architecture—including tensor fundamentals, matrix multiplication, GPU theoretical compute, attention and FFN mechanics, quantitative parameter and FLOP analysis, performance metrics like MFU, parallelism strategies, variant optimizations, and practical exercise questions—offering clear insight into large‑model efficiency and scaling.

FFN · GPU performance · Transformer

0 likes · 33 min read

JD Tech

Jun 7, 2024 · Artificial Intelligence

Understanding Attention Mechanisms, Self‑Attention, and Multi‑Head Attention in Transformers

This article explains the fundamentals of attention mechanisms, including biological inspiration, the evolution from early visual attention to modern self‑attention in Transformers, details the scaled dot‑product calculations, positional encoding, and multi‑head attention, illustrating how these concepts enable efficient parallel processing of sequence data.

AI · Positional Encoding · Self-Attention

0 likes · 12 min read

Baobao Algorithm Notes

May 5, 2024 · Artificial Intelligence

Deep Dive into Transformer Mechanics: Scaling, Q/K Projections, FFNs, and More

This article provides concise technical explanations for 25 common questions about Transformer models, covering scaled dot‑product attention scaling, separate Q/K projections, feed‑forward network design, attention variants, normalization, LoRA versus full‑parameter training, KV‑cache, pre‑ and post‑norm, computational cost analysis, and advanced position‑encoding techniques.

LLM · LoRA · Transformer

0 likes · 25 min read

DaTaobao Tech

Mar 27, 2024 · Artificial Intelligence

Building a Simple Diffusion Model with Python

This tutorial walks through implementing a basic Denoising Diffusion Probabilistic Model in Python, explaining the forward noise schedule, reverse denoising training, and providing complete code for noise schedules, diffusion functions, residual and attention blocks, a UNet architecture, loss computation, and a training loop.

DDPM · Python · U-Net

0 likes · 26 min read

Ele.me Technology

Mar 21, 2024 · Artificial Intelligence

How FIN Boosts CTR in Online Food Ordering: A Spatial‑Temporal Modeling Breakthrough

The paper introduces FIN (Fragment and Integrate Network), a novel spatial‑temporal model that extracts multiple sub‑sequences from ultra‑long user behavior logs, applies simplified and multi‑head attention, and fuses them with physically meaningful set operations, achieving up to 5.7% CTR lift and 7.3% RPM improvement in real‑world food‑delivery advertising.

AI · CTR prediction · Long Sequence Modeling

0 likes · 23 min read

NewBeeNLP

Mar 18, 2024 · Artificial Intelligence

Mastering RAG and LLM Techniques: From Retrieval to Fine‑Tuning

This article provides a comprehensive technical guide on Retrieval‑Augmented Generation (RAG), open‑source large language models such as LLaMA, fine‑tuning methods, evaluation metrics, memory‑optimization tricks, and attention‑related optimizations for modern AI systems.

LLM · LangChain · Memory Optimization

0 likes · 19 min read

Rare Earth Juejin Tech Community

Nov 15, 2023 · Artificial Intelligence

Understanding the Transformer Architecture: Encoder, Decoder, and Attention Mechanisms

This article explains the Transformer model, comparing it with RNNs, detailing its encoder‑decoder structure, multi‑head and scaled dot‑product attention, embedding layers, feed‑forward networks, and the final linear‑softmax output, supplemented with diagrams and code examples.

Artificial Intelligence · Encoder-Decoder · Transformer

0 likes · 10 min read

Rare Earth Juejin Tech Community

Nov 12, 2023 · Artificial Intelligence

A Comprehensive Introduction to RNN, LSTM, Attention Mechanisms, and Transformers for Large Language Models

This article provides a thorough overview of large language models, explaining the relationship between NLP and LLMs, the evolution from RNN to LSTM, the fundamentals of attention mechanisms, and the architecture and operation of Transformer models, all illustrated with clear examples and diagrams.

Artificial Intelligence · LSTM · NLP

0 likes · 25 min read

DataFunSummit

Sep 29, 2023 · Artificial Intelligence

Social4Rec: Enhancing Video Recommendation with Social Interest Networks

This article introduces Social4Rec, a video recommendation algorithm that tackles user cold‑start problems by extracting and integrating social interest information through coarse‑ and fine‑grained interest extractors, attention‑based fusion, and extensive offline and online experiments demonstrating significant CTR improvements.

attention · cold-start · deep learning

0 likes · 14 min read

Alibaba Cloud Developer

Sep 4, 2023 · Artificial Intelligence

Hands‑On Building a Transformer from Scratch with PyTorch

This tutorial walks you through implementing a full Transformer model in PyTorch, starting from basic linear‑regression code, adding attention mechanisms, multi‑head attention, encoder‑decoder architecture, training loops, and inference, all reinforced with practical debugging tips.

NLP · PyTorch · Transformer

0 likes · 17 min read

Nightwalker Tech

Jul 19, 2023 · Artificial Intelligence

Step‑by‑Step Implementation of Transformer Blocks, Attention, Normalization, Feed‑Forward, Encoder and Decoder in PyTorch

This article provides a comprehensive tutorial on building the core components of a Transformer model—including multi‑head attention, layer normalization, feed‑forward networks, encoder and decoder layers—and assembles them into a complete PyTorch implementation, supplemented with explanatory diagrams and runnable code.

Encoder · PyTorch · Transformer

0 likes · 13 min read

Model Perspective

Jul 6, 2023 · Fundamentals

Understanding Information Processing Theory: How the Mind Works Like a Computer

The information processing theory, emerging in the 1950s‑60s, likens human cognition to computer operations, detailing how perception, attention, memory, conceptual knowledge, reasoning, and feedback mechanisms transform sensory input into mental representations and guide behavior, influencing cognitive psychology, education, and HCI.

Memory · Perception · attention

0 likes · 4 min read

DataFunSummit

Jun 21, 2023 · Artificial Intelligence

Graph‑Enhanced Node Representation for Cold‑Start Recommendation: Neighbour‑Enhanced YouTubeDNN

This article proposes a graph‑based node representation method that combines static attribute graphs and dynamic interaction graphs with multi‑level attention to alleviate user and item cold‑start problems in recommendation systems, achieving notable AUC improvements on sparsified MovieLens datasets.

Embedding · Graph Neural Network · MovieLens

0 likes · 9 min read

Network Intelligence Research Center (NIRC)

Jun 5, 2023 · Artificial Intelligence

How DETR and Its Successors Evolve: A Deep Dive into the DETR Series for Object Detection

This article reviews the original DETR model, analyzes its strengths and weaknesses, and then examines two major follow‑up works—Deformable‑DETR and DAB‑DETR—explaining how they modify attention mechanisms, introduce deformable convolutions and dynamic anchor boxes to accelerate convergence and improve small‑object detection.

DAB-DETR · DETR · Deformable-DETR

0 likes · 12 min read

Architect's Guide

Feb 9, 2023 · Artificial Intelligence

Why ChatGPT Is So Powerful: A Technical Overview of NLP Model Evolution

This article explains why ChatGPT performs so well by tracing the evolution of natural‑language processing from rule‑based grammars through statistical n‑gram models to neural architectures like RNNs, LSTMs, attention mechanisms, Transformers, and the massive data and training methods that power modern large language models.

ChatGPT · Language Models · Machine Learning

0 likes · 14 min read

AntTech

Dec 19, 2022 · Artificial Intelligence

TransVCL: Attention‑Enhanced Video Copy Localization Network with Flexible Supervision

TransVCL introduces an end‑to‑end attention‑enhanced video copy localization network that leverages a custom Transformer, correlation‑Softmax similarity matrix, and temporal alignment module, combined with a semi‑supervised learning framework, achieving state‑of‑the‑art performance on VCSL and VCDB benchmarks.

AI · Semi-supervised Learning · Transformer

0 likes · 13 min read

DaTaobao Tech

Feb 22, 2022 · Artificial Intelligence

Graph-based Deep Recall Models for Sparse User Behavior in Content Recommendation

The paper proposes graph‑based deep recall models that enrich sparse user behavior sequences in video recommendation by integrating content knowledge graphs and adaptive attention mechanisms, demonstrating that variants such as GADM, SGGA, and SGGGA significantly boost click‑through rates in online experiments.

Recommendation Systems · attention · graph neural networks

0 likes · 11 min read

DataFunTalk

Feb 18, 2022 · Artificial Intelligence

Travel Intent Prediction in E-commerce: Algorithm Strategies, Multi‑source Behavior Modeling, and Model Design

This talk presents Alibaba's travel intent prediction system, detailing the unique challenges of low‑frequency, multi‑source travel behavior, the multi‑granular CNN and time‑attention model architecture, experimental comparisons with baselines, and how integrated user interest modeling improves recommendation performance.

Machine Learning · attention · deep learning

0 likes · 11 min read

DataFunSummit

Jan 14, 2022 · Artificial Intelligence

Graph Attention Multi‑Layer Perceptron (GAMLP) and Node‑Dependent Local Smoothing (NDLS) for Scalable and Flexible Graph Neural Networks

This presentation introduces Tencent Angel Graph's NDLS and GAMLP techniques that address GNN scalability and flexibility by adaptively selecting propagation depth per node, employing node‑wise feature and label propagation with attention mechanisms, and demonstrating superior performance on large‑scale and sparse graph benchmarks.

GAMLP · Node Adaptive · Scalability

0 likes · 16 min read

Code DAO

Dec 25, 2021 · Artificial Intelligence

Image Captioning with Attention in TensorFlow 2.0: An End-to-End Encoder-Decoder Tutorial

This article walks through building an image‑captioning system using a TensorFlow 2.0 encoder‑decoder with Bahdanau attention, covering dataset preparation, feature extraction with InceptionV3, model architecture, training with teacher forcing, and inference on the Flickr8K dataset.

Encoder-Decoder · Flickr8k · Image Captioning

0 likes · 20 min read

Alimama Tech

Dec 15, 2021 · Artificial Intelligence

Scalable Multi-View Ad Retrieval (SMAD): A Graph-Based Framework for E-commerce Advertising

SMAD is a scalable graph‑based ad retrieval framework for e‑commerce search that builds a heterogeneous Query‑Item‑Ad graph, learns multi‑view embeddings with a parallel deep neural network and attention, employs category‑aware sampling for efficient distributed training, and delivers significant gains in offline relevance and online CTR, RPM, and PVR.

ad retrieval · attention · distributed training

0 likes · 17 min read

Code DAO

Dec 7, 2021 · Artificial Intelligence

Key Deep Learning Architectures for Image Captioning: Encoders, Decoders, Attention & Multimodal Models

This article surveys deep‑learning image captioning, detailing the image encoder, sequence decoder, attention mechanisms and multimodal designs, comparing encoder‑decoder, detection‑backbone, transformer and dense captioning architectures, and explaining generation strategies and BLEU evaluation.

BLEU · CNN · Image Captioning

0 likes · 9 min read

DataFunSummit

Nov 21, 2021 · Artificial Intelligence

Sequential Recommendation Algorithms: Overview and Techniques

This article surveys sequential recommendation methods, covering standard models such as pooling, RNN, CNN, attention, and Transformer, as well as long‑short term, multi‑interest, multi‑behavior approaches, and recent advances like contrastive learning, highlighting their impact on recommendation performance.

Machine Learning · RNN · Transformer

0 likes · 8 min read

DeWu Technology

Nov 18, 2021 · Artificial Intelligence

Background Complexity Detection for Sneaker Images Using MobileNet, FPN, and Modified SAM

The project presents a lightweight MobileNet‑FPN architecture enhanced with a modified spatial‑attention module that evaluates corner‑based self‑similarity to classify sneaker photo backgrounds, achieving 96% test accuracy—exceeding baseline CNN performance—and meeting business targets of over 80% hint accuracy and 90% mandatory enforcement.

CNN · MobileNet · attention

0 likes · 12 min read

DataFunSummit

Nov 2, 2021 · Artificial Intelligence

Applying Deep Learning to Time Series Data for Financial Risk Modeling

This article explains how a financial company leverages deep learning sequence models, including embedding, attention, and transformer techniques, to automatically extract features from massive time‑series data, improve risk model performance, and build a reusable, end‑to‑end system framework.

AI, Embedding, attention

0 likes · 8 min read

58 Tech

Oct 12, 2021 · Artificial Intelligence

Seq2Seq Approaches for Phone Number Extraction from Two‑Speaker Voice Dialogues

This article presents a practical study of extracting phone numbers from two‑speaker voice dialogues using Seq2Seq models—including LSTM, GRU with attention and feature fusion, and Transformer—detailing data characteristics, model architectures, training strategies, experimental results, and comparative analysis showing the GRU‑Attention approach achieving the best performance.

GRU, LSTM, NLP

0 likes · 13 min read

TiPaiPai Technical Team

Aug 2, 2021 · Artificial Intelligence

How Attention Boosts Text Recognition: From CNN‑Seq2Seq to Multi‑Scale Models

This article explains how attention mechanisms are applied to text recognition, covering the basic CNN‑Seq2Seq‑Attention architecture, multi‑scale attention extensions, and a 2D attentional irregular scene text recognizer with detailed network components, training loss, and experimental results.

CNN, Multi-Scale, Seq2Seq

0 likes · 8 min read

DataFunTalk

Jun 4, 2021 · Artificial Intelligence

Advances in Ranking Algorithms for the "Good Goods" Recommendation Scenario

This article presents a comprehensive overview of recent advancements in ranking algorithms for the Good Goods recommendation scenario, covering long‑sequence modeling, category‑retrieval attention, multi‑objective ranking, model structure optimizations, loss functions, and LTR techniques, along with experimental results and practical insights.

LTR, attention, loss

0 likes · 13 min read

Cyber Elephant Tech Team

Apr 28, 2021 · Artificial Intelligence

Understanding BERT: From Encoder-Decoder to Transformer and Attention

This article explains the BERT model by first reviewing the Encoder-Decoder framework, then detailing the attention mechanism—including self-attention and multi-head attention—before describing the full Transformer architecture and finally outlining BERT’s encoder-only design, training stages, and fine-tuning applications.

BERT, Encoder-Decoder, NLP

0 likes · 15 min read

DataFunTalk

Apr 17, 2021 · Artificial Intelligence

Personalized Re-ranking for Recommendation (RecSys'19)

This article introduces a personalized re‑ranking model for recommendation systems, explaining the limitations of traditional point‑wise ranking, describing the PRM architecture with input, encoding, and output layers using multi‑head attention and pre‑trained personalization features, and presenting experimental results and future extensions.

CTR, Machine Learning, Re‑ranking

0 likes · 7 min read

DataFunTalk

Apr 3, 2021 · Artificial Intelligence

A Survey of User Behavior Sequence Modeling for Search and Recommendation Advertising

User behavior sequence modeling, crucial for search and recommendation advertising ranking, has evolved from simple pooling to attention, RNN, capsule, and Transformer architectures, with industrial applications across e‑commerce, social, video, and music platforms, and future directions include time‑aware, multi‑dimensional, and self‑supervised approaches.

Recommendation Systems, Sequence Modeling, Transformer

0 likes · 24 min read

Sohu Tech Products

Feb 17, 2021 · Artificial Intelligence

Improving BERT Pre‑training with RealFormer: Principles, Implementation, and Empirical Evaluation

This article analyzes the RealFormer modification to the Transformer architecture, details its implementation in BERT, and presents extensive experiments showing that while RealFormer can boost performance on low‑label‑count classification tasks, its benefits diminish or disappear as the number of classes grows.

BERT, RealFormer, Residual

0 likes · 12 min read

New Oriental Technology

Feb 1, 2021 · Artificial Intelligence

Neural Machine Translation: Seq2Seq, Beam Search, BLEU, Attention Mechanisms, and GNMT Improvements

This article explains key concepts of neural machine translation, covering Seq2Seq encoder‑decoder models, beam search strategies, BLEU evaluation, various attention mechanisms, and the enhancements introduced in Google's Neural Machine Translation system to improve speed, OOV handling, and translation quality.

BLEU, Beam Search, GNMT

0 likes · 11 min read

JD Tech Talk

Jan 28, 2021 · Artificial Intelligence

Spatial‑Temporal Graph Diffusion Network for City Traffic Flow Forecasting

This article introduces a hierarchical graph neural network model that jointly captures multi‑scale temporal patterns and global spatial context for urban traffic flow prediction, demonstrates its superiority over existing methods on multiple public datasets, and validates each component through extensive ablation studies.

Graph Neural Network, attention, deep learning

0 likes · 8 min read

Didi Tech

May 25, 2020 · Artificial Intelligence

How Didi Harnesses Cutting‑Edge Speech Recognition: From ASR Basics to Transformer Models

This article provides a comprehensive technical overview of modern speech recognition, covering Didi’s driver‑assistant and smart‑customer‑service applications, fundamental ASR concepts, classic GMM‑HMM methods, deep‑learning breakthroughs such as DNN‑HMM, CTC, attention‑based and transformer models, practical training tricks, signal‑processing steps, and multimodal fusion techniques.

ASR, CTC, Multimodal

0 likes · 16 min read

DataFunTalk

Apr 21, 2020 · Artificial Intelligence

Attention Mechanisms in Deep Learning Recommendation Models: A Survey

This article surveys the application of attention mechanisms in deep learning recommendation systems, reviewing models such as AFM, DIN, DIEN, DSIN, Behavior Sequence Transformer, Deep Spatio‑Temporal Networks, and ATRank, and discusses their architectures, attention types, advantages, and limitations.

CTR prediction, Recommendation Systems, attention

0 likes · 10 min read

Alibaba Cloud Developer

Apr 14, 2020 · Artificial Intelligence

How Feizhu Upgraded Its Recommendation Engine from Linear to End‑to‑End Deep Models

This article details the evolution of Feizhu's "Guess You Like" ranking system, moving from a linear FTRL model to several end‑to‑end deep learning versions—including PALM, FB‑PALM, and GLA—highlighting technical challenges, architectural changes, and measurable performance gains.

AI, Model Iteration, attention

0 likes · 15 min read

DataFunTalk

Feb 3, 2020 · Artificial Intelligence

Advances in Speech Recognition: Concepts, Deep Learning Methods, and Didi’s Applications

This article presents a comprehensive overview of modern speech recognition technology, covering basic ASR concepts, classic acoustic and language models, deep‑learning approaches such as DNN‑HMM, CTC, attention‑based and transformer models, multimodal fusion, signal‑processing pipelines, and practical deployment considerations at Didi.

ASR, CTC, Didi

0 likes · 15 min read

DataFunTalk

Dec 16, 2019 · Artificial Intelligence

A Comprehensive Overview of Sequential Recommendation Models and Techniques

This article provides an in-depth overview of sequential recommendation, defining the problem, discussing data preparation, and reviewing various neural architectures—including MLP, CNN, RNN, Temporal CNN, self‑attention, and reinforcement‑learning approaches—while offering practical guidance on model selection and implementation.

CNN, RNN, Sequential Modeling

0 likes · 36 min read

DataFunTalk

Nov 25, 2019 · Artificial Intelligence

Real-time Attention-based Look-alike Model for Recommender Systems

This talk presents a real-time attention-based look‑alike model (RALM) designed to address the long‑tail problem in recommendation systems by efficiently expanding seed users, leveraging user representation learning, attention mechanisms, and clustering to deliver timely, diverse content without retraining the model.

Long Tail, attention, clustering

0 likes · 24 min read

Qunar Tech Salon

Sep 12, 2019 · Artificial Intelligence

A Comprehensive Overview of Attention Mechanisms in Deep Learning

This article systematically reviews the history, core concepts, variants, and practical implementations of attention mechanisms—from early additive and multiplicative forms to self‑attention, multi‑head attention, and recent transformer‑based models—highlighting why attention has become fundamental in modern AI research.

NLP, Self-Attention, Transformer

0 likes · 16 min read

Alibaba Cloud Developer

Aug 9, 2019 · Artificial Intelligence

Demystifying Attention: A Beginner’s Guide to History, Types, and Applications

This article provides a comprehensive, beginner‑friendly overview of attention mechanisms—from their origins in early neural machine translation papers to modern self‑attention, multi‑head attention, and transformer variants—explaining core concepts, common variants, and why attention has become essential across NLP and vision tasks.

NLP, attention

0 likes · 18 min read

HomeTech

Aug 7, 2019 · Artificial Intelligence

Near-Duplicate Video Retrieval: Framework, Feature Extraction, Metric Learning, and Model Optimization

This article presents a comprehensive study of near‑duplicate video retrieval, covering the definition of near‑duplicate videos, motivations for deduplication, challenges, a two‑stage offline/online processing framework, keyframe and VGG16‑based feature extraction, metric‑learning loss functions, training procedures, dataset preparation, evaluation metrics, and model enhancements using LSTM and attention mechanisms.

LSTM, MAP, VGG16

0 likes · 12 min read

JD Tech Talk

Jul 24, 2019 · Artificial Intelligence

Absolute Semantic Recognition Competition: Feature Design, Modeling Strategy, and Core Algorithm Insights

This article presents a comprehensive solution to the absolute semantic recognition competition, detailing the problem background, dataset, evaluation metrics, feature engineering, model architecture—including Attention, Capsule, Bi‑GRU, and BERT—and analysis of results and lessons learned.

BERT, Capsule Networks, Machine Learning

0 likes · 11 min read

DataFunTalk

Mar 13, 2019 · Artificial Intelligence

A Comprehensive Overview of NLP Development and Deep Learning Models

This article reviews the history of natural language processing, explains key deep‑learning models such as NNLM, Word2vec, CNN, RNN, attention mechanisms, and Transformers, and discusses their applications, future trends, and practical considerations in NLP tasks.

NLP, Transformer, attention

0 likes · 38 min read

DataFunTalk

Feb 27, 2019 · Artificial Intelligence

Human‑Interactive Machine Translation: Research, Techniques, and Productization

This article reviews the current state of machine translation, explores the challenges of ambiguity, quality, and domain specificity, and presents human‑in‑the‑loop translation techniques—including attention‑enhanced models, transformer architectures, and online learning—while discussing practical productization and deployment considerations.

AI productization, Human-in-the-Loop, Online Learning

0 likes · 16 min read

Alibaba Cloud Developer

Oct 30, 2018 · Artificial Intelligence

How Advanced LSTM (A‑LSTM) Boosts Speech Emotion Recognition by 5.5%

This article introduces Advanced LSTM (A‑LSTM), which linearly combines multiple past hidden states to overcome traditional LSTM's one‑step dependency, and demonstrates its application in utterance‑level speech emotion recognition, achieving a 5.5% accuracy improvement through attention‑based weighted‑pooling RNNs and auxiliary speaker and gender tasks.

A-LSTM, LSTM, RNN

0 likes · 8 min read

AntTech

Aug 16, 2018 · Artificial Intelligence

Deep Learning Approaches for Text Classification in Alipay Complaint Fraud Detection

This article reviews deep‑learning‑based text classification techniques—including TextCNN, BiGRU, Capsule Networks, Attention mechanisms, and the novel cw2vec embedding—applied to Alipay complaint fraud data, presents experimental comparisons, and discusses their advantages, challenges, and future directions.

Alipay, attention, capsule network

0 likes · 18 min read

Didi Tech

Jun 1, 2018 · Artificial Intelligence

Didi's Attention-Based End-to-End Mandarin Speech Recognition: A Detailed Review

Didi’s attention‑based end‑to‑end Mandarin speech recognizer, built on the Listen‑Attend‑Spell architecture and modeling roughly 5,000 common characters, delivers 15‑25% relative accuracy gains over its prior LSTM‑CTC system while cutting model size, latency and server requirements and simplifying training by eliminating separate acoustic, pronunciation and language components.

End-to-End, LAS, Mandarin

0 likes · 14 min read

Hulu Beijing

Dec 14, 2017 · Artificial Intelligence

Understanding Seq2Seq: Framework, Advantages, and Decoding Techniques

This article explains the Seq2Seq encoder‑decoder framework, its benefits for various sequence modeling tasks, and compares common decoding strategies such as greedy search and beam search, while also introducing attention and other enhancements for improved performance.

Beam Search, Encoder-Decoder, attention

0 likes · 9 min read

Suning Design

Apr 12, 2014 · Product Management

How Emotional Design Shapes User Relationships and Behavior

The article explains how emotional design leverages usefulness, usability, and delight to capture attention, influence emotions, and drive user actions, ultimately forming lasting relationships between users and products through the dimensions of value and arousal.

attention, behavior, emotional design

0 likes · 16 min read
