Tag

tokenization

Articles collected under this tag.

Tencent Technical Engineering
May 12, 2025 · Artificial Intelligence

Comprehensive Summary and Expansion of Andrej Karpathy’s 7‑Hour LLM Lecture

This article provides a detailed Chinese‑to‑English summary of Andrej Karpathy’s 7‑hour LLM tutorial, covering chat process analysis, tokenization, pre‑training data pipelines, model architecture, training strategies, post‑training fine‑tuning, reinforcement learning, chain‑of‑thought reasoning, and current industry applications.

AI · LLM · Model Architecture
0 likes · 25 min read
DevOps
Apr 13, 2025 · Artificial Intelligence

The Amazing Magic of GPT‑4o and a Speculative Technical Roadmap

This article reviews the breakthrough image‑generation capabilities of GPT‑4o, showcases diverse examples, and offers a detailed speculation on its underlying autoregressive architecture, tokenization methods, VQ‑VAE/GAN advances, and training strategies that could explain its performance.

AI research · GPT-4o · VQ-VAE
0 likes · 16 min read
58 Tech
Apr 11, 2025 · Artificial Intelligence

Optimization of Multimodal Visual Large Model Inference: Pre‑processing, ViT TensorRT, CUDA Graphs, Tokenization, Prefix Cache, and Quantization

This report details a comprehensive set of optimizations for multimodal visual large‑model (VLM) inference—including image pre‑processing acceleration, TensorRT integration for the ViT module, CUDA‑Graph replay, token‑count reduction, prefix‑cache handling, and weight quantization—demonstrating up to three‑fold throughput gains while maintaining accuracy.

CUDA Graph · TensorRT · Visual Language Model
0 likes · 19 min read
Code Mala Tang
Mar 27, 2025 · Artificial Intelligence

How Do BPE, WordPiece, and SentencePiece Shape Modern NLP Tokenization?

This article explains the fundamentals, workflows, examples, and trade‑offs of three major subword tokenization algorithms—Byte Pair Encoding, WordPiece, and SentencePiece—helping practitioners choose the right method for their large language model pipelines.
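
The core BPE loop the article describes can be sketched in a few lines: repeatedly count adjacent symbol pairs across a word-frequency table and merge the most frequent pair. This is a toy illustration of the training procedure, not a production tokenizer:

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a word-frequency table.

    `words` maps a word (as a tuple of symbols) to its corpus frequency.
    """
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with its merged symbol.
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges, vocab
```

WordPiece follows the same outline but scores pairs by likelihood gain rather than raw frequency, and SentencePiece applies such algorithms directly to raw text without pre-splitting on whitespace.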

BPE · NLP · SentencePiece
0 likes · 12 min read
Java Tech Enthusiast
Sep 15, 2024 · Fundamentals

How Source Code Is Transformed into Machine Instructions

A compiler transforms source code into executable machine instructions by first tokenizing the text into keywords, identifiers, and literals, then parsing these tokens into an abstract syntax tree, generating and optimizing intermediate code (such as LLVM IR), and finally assembling and linking the output for the target architecture.
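
The first stage, splitting text into keywords, identifiers, and literals, can be sketched with a regex-driven lexer for a hypothetical toy language (the token set here is invented for illustration):

```python
import re

# Token patterns, tried in order: keywords must come before identifiers.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:if|else|while|return|int)\b"),
    ("NUMBER",  r"\d+"),
    ("IDENT",   r"[A-Za-z_]\w*"),
    ("OP",      r"[+\-*/=<>!]=?|[(){};,]"),
    ("SKIP",    r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Split source text into (kind, lexeme) tokens, the compiler's first pass."""
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":  # drop whitespace
            tokens.append((m.lastgroup, m.group()))
    return tokens
```

These tokens are what the parser then assembles into an abstract syntax tree.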

AST · Compiler · LLVM
0 likes · 4 min read
Architect
Aug 11, 2024 · Artificial Intelligence

Understanding Large Language Models: Tokens, Tokenization, and the Evolution from Markov Chains to Transformers

This article explains how generative AI models work by demystifying tokens, tokenization with tools like tiktoken, simple Markov‑chain training, the limitations of small context windows, and how modern LLMs use neural networks, transformers and attention mechanisms to predict the next token.
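
The Markov-chain stepping stone the article uses amounts to counting, for each token, which tokens follow it, then predicting the most frequent successor; a minimal sketch:

```python
from collections import Counter, defaultdict

def train_markov(tokens):
    """Count, for each token, which tokens follow it in the training text."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        counts[cur][nxt] += 1
    return counts

def predict_next(counts, token):
    """Return the most frequent successor of `token` (greedy decoding)."""
    if token not in counts:
        return None
    return counts[token].most_common(1)[0][0]
```

A context window of one token is exactly the limitation the article highlights: transformers replace this single-token lookup with attention over a long context.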

Artificial Intelligence · LLM · Markov Chain
0 likes · 20 min read
Java Architect Essentials
Aug 1, 2024 · Backend Development

Implementing Fuzzy Company Name Matching with MySQL RegExp in a Business Approval Workflow

This article describes a business approval scenario where a company name entered by a business user must be checked for duplicates, and explains how to implement fuzzy matching using MySQL RegExp, tokenization with IKAnalyzer, and Java service code to extract, preprocess, match, and rank results by relevance.
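
As a rough in-memory illustration of the match-and-rank step only (substituting a naive split on non-word characters for IKAnalyzer and substring checks for MySQL REGEXP; all names are hypothetical):

```python
import re

def match_and_rank(query_name, candidates):
    """Tokenize the entered company name, keep candidates that contain
    any token, and rank them by how many tokens they hit."""
    tokens = [t for t in re.split(r"\W+", query_name) if t]
    scored = [(sum(t in name for t in tokens), name) for name in candidates]
    # Highest token-hit count first; drop candidates with no hits.
    return [name for hits, name in sorted(scored, key=lambda s: -s[0]) if hits]
```

In the article's setup the candidate filtering happens inside MySQL via a REGEXP built from the extracted tokens; this sketch only shows the relevance-ranking idea.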

Database · Java · MySQL
0 likes · 11 min read
IT Services Circle
Jul 17, 2024 · Artificial Intelligence

Why Large Language Models Mistake 9.11 > 9.9: Prompting, Tokenizer Effects, and Recent Findings

The article examines why leading large language models such as GPT‑4o, Gemini Advanced, and Claude 3.5 incorrectly claim that 9.11 is larger than 9.9, analyzes how tokenization and prompt phrasing trigger the error, and discusses recent research and OpenAI model updates.

AI reasoning · Numerical Comparison · Prompt Engineering
0 likes · 7 min read
Java Tech Enthusiast
Jul 16, 2024 · Artificial Intelligence

LLMs Misjudge Simple Number Comparison: 9.11 vs 9.9

Recent tests reveal that popular large language models, including GPT‑4o, Gemini Advanced, and Claude 3.5, often claim 9.11 is larger than 9.9 because their tokenizers split the numbers into fragments. Rephrasing the question, zero‑shot chain‑of‑thought prompts, or instructing the model to treat the values as floating‑point numbers can correct the mistake, a pattern also observed, to varying degrees, in Chinese models.
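
The "treat them as floats" fix works because ordinary numeric comparison has none of the tokenizer's ambiguity:

```python
# An LLM tokenizer may split "9.11" into fragments such as "9", ".", "11",
# inviting a digit-string comparison ("11" > "9"). Parsed as numbers,
# the comparison is unambiguous:
result = float("9.11") > float("9.9")
# result is False: 9.11 < 9.9 numerically
```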

AI evaluation · LLM · Prompt Engineering
0 likes · 7 min read
Python Programming Learning Circle
Nov 17, 2023 · Big Data

Building a Simple Search Engine with Bloom Filter, Tokenization, and Inverted Index in Python

This article demonstrates how to implement a basic big‑data search engine in Python by creating a Bloom filter for fast existence checks, designing tokenization functions for major and minor segmentation, building an inverted index, and supporting AND/OR queries with example code and execution results.
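
The three building blocks fit together in a compact sketch: a Bloom filter for fast membership checks, an inverted index from token to document ids, and set intersection/union for AND/OR queries (whitespace tokenization stands in for the article's segmentation functions):

```python
import hashlib
from collections import defaultdict

class BloomFilter:
    """Bit array with k hash probes: false positives are possible,
    false negatives are not."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

def build_index(docs):
    """Map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[token].add(doc_id)
    return index

def search(index, terms, mode="AND"):
    """Resolve a query by intersecting (AND) or uniting (OR) posting sets."""
    sets = [index.get(t, set()) for t in terms]
    if not sets:
        return set()
    return set.intersection(*sets) if mode == "AND" else set.union(*sets)
```

In a real deployment the Bloom filter sits in front of the index so that terms known to be absent never touch disk.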

Big Data · Bloom Filter · Inverted Index
0 likes · 12 min read
Nightwalker Tech
Jul 18, 2023 · Artificial Intelligence

Implementing the Input Processing Layer of a Transformer Model: Tokenization, Embedding, and Positional Encoding

This article explains how to build the input processing stage of a Transformer—including tokenization with Hugging Face tokenizers, token‑to‑embedding conversion using BERT models, custom BPE tokenizers, and positional encoding—providing complete Python code examples and test results.
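
The positional-encoding piece of that input layer follows the sinusoidal scheme from the original Transformer paper, shown here in plain Python rather than PyTorch:

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe
```

Each row is added to the corresponding token embedding, giving the otherwise order-blind attention layers a notion of position.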

BPE · Positional Encoding · PyTorch
0 likes · 14 min read
Baidu Geek Talk
Mar 21, 2022 · Frontend Development

How WebKit Parses HTML: Decoding, Tokenization, and DOM Tree Construction

The article details WebKit’s rendering pipeline in WKWebView, describing how the network process streams HTML bytes to the rendering process, which decodes them via TextResourceDecoder, tokenizes the characters with HTMLTokenizer’s state machine, and constructs an efficient DOM tree using HTMLTreeBuilder and queued insertion tasks.
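
HTMLTokenizer's state-machine approach can be illustrated with a drastically reduced two-state version; the real WebKit machine has dozens of states covering attributes, comments, scripts, and entities:

```python
def tokenize_html(html):
    """Two-state sketch: the 'data' state emits character tokens, and the
    'tag' state accumulates until '>' emits a tag token."""
    tokens, buf, state = [], "", "data"
    for ch in html:
        if state == "data":
            if ch == "<":
                if buf:
                    tokens.append(("chars", buf))
                buf, state = "", "tag"
            else:
                buf += ch
        else:  # state == "tag"
            if ch == ">":
                tokens.append(("tag", buf))
                buf, state = "", "data"
            else:
                buf += ch
    if buf and state == "data":
        tokens.append(("chars", buf))
    return tokens
```

The stream of tag and character tokens is what HTMLTreeBuilder then consumes to construct the DOM tree.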

Browser Engine · DOM · HTML parsing
0 likes · 33 min read
Baidu App Technology
Mar 7, 2022 · Mobile Development

How WKWebView Parses HTML: Decoding, Tokenization, and DOM Tree Construction

WKWebView parses HTML by streaming bytes from the network process to the rendering process, decoding them into characters, tokenizing into HTML tokens, building a DOM tree through node creation and insertion, and finally laying out and painting the document using a doubly‑linked in‑memory structure.

DOM · HTML parsing · WKWebView
0 likes · 37 min read
Aikesheng Open Source Community
Jun 23, 2021 · Databases

Using MySQL Ngram Plugin to Enable Accurate Full‑Text Search for Chinese Text

This article explains why MySQL's default full‑text index struggles with Chinese, demonstrates how to configure token size parameters, activate the ngram parser plugin, and adjust queries (including Boolean mode) to achieve reliable Chinese full‑text search results.
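
The ngram parser's core idea is simple: slide a fixed-size window over the text and index every overlapping slice (MySQL's default `ngram_token_size` is 2), which works for Chinese precisely because no word boundaries are needed:

```python
def ngrams(text, n=2):
    """Split a string into overlapping n-grams, as MySQL's ngram parser
    does for CJK text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

A query is segmented the same way, so a phrase matches whenever all of its bigrams appear in the index.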

Boolean Mode · Full-Text Search · MySQL
0 likes · 12 min read
360 Quality & Efficiency
Jul 3, 2020 · Backend Development

Understanding PHP_CodeSniffer: Tokenization, Lexical Analysis, and Custom Rule Creation

This article explains how PHP_CodeSniffer parses PHP source code into tokens using lexical analysis, demonstrates token extraction with token_get_all, and guides readers through creating a custom rule to prohibit hash‑style comments, covering rule library setup, Sniff implementation, and execution.

Custom Rules · PHP · PHP_CodeSniffer
0 likes · 12 min read
Java Captain
Feb 17, 2019 · Backend Development

Implementing a JSON Parser in Java: Structures, Tokenization, and Parsing

This article explains the fundamentals of JSON, its object and array structures, maps JSON types to Java equivalents, and provides a complete Java implementation of a JSON parser including token definitions, lexical analysis, and object/array construction with detailed code examples.
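
The article's implementation is in Java; the lexical-analysis phase it describes, turning raw text into typed tokens before any structure is built, looks like this in a compact Python sketch:

```python
import re

# One regex alternative per token kind, tried left to right.
TOKEN_RE = re.compile(r"""
    (?P<STRING>"(?:\\.|[^"\\])*")  # string literal with escapes
  | (?P<NUMBER>-?\d+(?:\.\d+)?)    # integer or decimal
  | (?P<BOOL>true|false)
  | (?P<NULL>null)
  | (?P<PUNCT>[{}\[\]:,])          # structural characters
  | (?P<WS>\s+)
""", re.VERBOSE)

def lex_json(text):
    """First phase of a JSON parser: raw text -> (kind, lexeme) tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError(f"unexpected character at {pos}: {text[pos]!r}")
        if m.lastgroup != "WS":  # whitespace is insignificant in JSON
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

The parser proper then walks this token list, recursing on `{` and `[` to build objects and arrays.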

Data Structures · JSON · Parser
0 likes · 14 min read
360 Quality & Efficiency
Nov 13, 2018 · Backend Development

Understanding PHP_CodeSniffer: Tokenization and Lexical Analysis in PHP

This article explains how PHP_CodeSniffer performs static analysis by tokenizing PHP source code, describes PHP’s execution process, clarifies the concept of tokens and how to retrieve them with token_get_all and token_name, and shows how this knowledge enables custom rule creation.
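
PHP's `token_get_all` has a close analogue in Python's stdlib `tokenize` module, shown here (for Python source rather than PHP) to illustrate the kind of token stream a sniffer inspects:

```python
import io
import tokenize

def lex(source):
    """List (token_name, lexeme) pairs for a piece of Python source, the
    same view of code that token_get_all gives a PHP_CodeSniffer rule."""
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type not in (tokenize.NEWLINE, tokenize.ENDMARKER)
    ]
```

A custom rule, like the hash-comment sniff the article builds, is essentially a scan over this stream for a forbidden token type.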

PHP · backend · lexical analysis
0 likes · 5 min read
High Availability Architecture
Jun 4, 2018 · Blockchain

Key Competitive Points of Public Chains and Insights from the GIAC Shenzhen Conference

The article summarizes the author’s takeaways from the GIAC Shenzhen conference, analyzing various public‑chain projects, their architectural choices, competitive focuses such as scalability, asset tokenization, DApp support, and the role of alliance chains in finance, traceability, and anti‑counterfeiting.

Blockchain · UTXO · alliance chain
0 likes · 8 min read