Artificial Intelligence · 29 min read

Fundamentals of Natural Language Processing: Language Models, Smoothing, and Basic Tasks

This article provides a comprehensive overview of natural language processing fundamentals, covering the challenges of language modeling, N‑gram and Markov assumptions, smoothing techniques such as discounting and add‑one, evaluation via perplexity, basic tasks like Chinese word segmentation, subword tokenization, POS tagging, syntactic and semantic parsing, and a range of downstream applications including information extraction, sentiment analysis, question answering, machine translation, and dialogue systems.

DataFunTalk

Natural language processing (NLP) faces eight major challenges—abstraction, compositionality, ambiguity, evolution, non‑standardness, subjectivity, knowledge dependence, and portability—making its tasks diverse and complex.

Language Models describe the probability distribution of text. The classic N‑gram model estimates the conditional probability of the next word given a limited history, relying on the Markov assumption. Unigram, bigram, and trigram models correspond to zeroth‑, first‑, and second‑order Markov chains, respectively, with special start (<BOS>) and end (<EOS>) tokens marking sentence boundaries.
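As a concrete illustration, the maximum-likelihood bigram estimate can be sketched in a few lines of Python; the corpus, function name, and boundary tokens here are illustrative, not from the original article:

```python
from collections import defaultdict

def train_bigram(corpus):
    """Estimate bigram probabilities by maximum likelihood from a list of
    tokenized sentences, padding each sentence with <BOS>/<EOS> markers."""
    bigram_counts = defaultdict(int)
    context_counts = defaultdict(int)
    for sentence in corpus:
        tokens = ["<BOS>"] + sentence + ["<EOS>"]
        for prev, curr in zip(tokens, tokens[1:]):
            bigram_counts[(prev, curr)] += 1
            context_counts[prev] += 1
    # P(curr | prev) = C(prev, curr) / C(prev)
    return {bg: c / context_counts[bg[0]] for bg, c in bigram_counts.items()}

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
probs = train_bigram(corpus)
# "the" is followed by "cat" once out of two occurrences, so P(cat | the) = 0.5
```

Note that any bigram absent from the training data simply has no entry here, i.e. probability zero, which is exactly the sparsity problem smoothing addresses below.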

Smoothing addresses data sparsity and zero‑probability issues. Discounting methods, especially add‑one (Laplace) smoothing, redistribute probability mass from frequent N‑grams to rare or unseen ones. Formulas for discounted probabilities of unigrams and bigrams are presented, and the importance of tuning hyper‑parameters on a development set is noted.
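The add-one rule can be sketched as follows; the toy counts and vocabulary size are hypothetical, chosen only to show how probability mass shifts to an unseen bigram:

```python
from collections import Counter

def add_one_bigram_prob(prev, curr, bigram_counts, context_counts, vocab_size):
    """Laplace (add-one) smoothed bigram probability:
    P(curr | prev) = (C(prev, curr) + 1) / (C(prev) + |V|),
    so every unseen bigram receives a small nonzero probability."""
    return (bigram_counts.get((prev, curr), 0) + 1) / \
           (context_counts.get(prev, 0) + vocab_size)

# Hypothetical toy counts: "the" appears twice, followed once each by "cat"/"dog".
bigram_counts = Counter({("the", "cat"): 1, ("the", "dog"): 1})
context_counts = Counter({"the": 2})
p_seen = add_one_bigram_prob("the", "cat", bigram_counts, context_counts, vocab_size=5)
p_unseen = add_one_bigram_prob("the", "fish", bigram_counts, context_counts, vocab_size=5)
# p_seen = 2/7 and p_unseen = 1/7: mass has moved from seen to unseen events.
```

In practice add-one often over-smooths; adding a fractional count delta (tuned on a development set, as the article notes) is the usual refinement.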

Model Evaluation commonly uses perplexity, the inverse geometric mean of word probabilities on a test set, with logarithmic computation to avoid underflow. Lower perplexity indicates better generalization, though it does not guarantee superior performance on external tasks.
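The log-space computation mentioned above can be sketched like this (the function name is ours; the per-word log probabilities would come from whatever model is being evaluated):

```python
import math

def perplexity(log_probs):
    """Perplexity from per-word natural-log probabilities:
    PPL = exp(-(1/N) * sum(log p_i)).
    Summing logs avoids the numeric underflow that multiplying
    many small probabilities directly would cause."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that assigns every word probability 1/4 has perplexity 4,
# i.e. it is as uncertain as a uniform choice among four words.
ppl = perplexity([math.log(0.25)] * 10)
```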

Basic NLP Tasks include Chinese word segmentation, subword tokenization, part‑of‑speech (POS) tagging, syntactic parsing, and semantic analysis. For segmentation, the forward maximum matching (FMM) algorithm is illustrated:

```python
def fmm_word_seg(sentence, lexicon, max_len):
    """Forward maximum matching (FMM) segmentation.

    sentence: input string
    lexicon: set of known words
    max_len: maximum word length in the lexicon
    """
    begin = 0
    end = min(begin + max_len, len(sentence))
    words = []
    while begin < end:
        word = sentence[begin:end]
        # Accept the longest lexicon match, or fall back to a single character.
        if word in lexicon or end - begin == 1:
            words.append(word)
            begin = end
            end = min(begin + max_len, len(sentence))
        else:
            # No match: shrink the candidate window by one character from the right.
            end -= 1
    return words
```

Subword tokenization, exemplified by Byte‑Pair Encoding (BPE), iteratively merges the most frequent adjacent symbol pairs to build a compact subword vocabulary, enabling efficient handling of rare and out‑of‑vocabulary words.
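The BPE merge loop can be sketched as below, in the style of the standard algorithm; the toy vocabulary (character-split words with an end-of-word marker `</w>`) and the number of merges are illustrative:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word split into characters, with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(3):
    best = max(get_pair_counts(vocab), key=get_pair_counts(vocab).get)
    vocab = merge_pair(best, vocab)
# After a few merges, the frequent suffix "est</w>" becomes a single subword,
# so a rare word like "widest" decomposes into known pieces rather than <UNK>.
```

Each learned merge is also recorded in real implementations, so that new text can be tokenized by replaying the merges in order.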

POS tagging assigns grammatical categories (e.g., noun, verb) to each token, while syntactic parsing produces phrase‑structure or dependency trees that reveal hierarchical relationships. Semantic analysis encompasses word‑sense disambiguation, semantic role labeling, and semantic dependency parsing, providing deeper meaning representations.

Application Tasks span information extraction (named entity recognition, relation extraction, event extraction), sentiment analysis (classification and aspect extraction), question answering (retrieval‑based, knowledge‑base, reading‑comprehension), machine translation (rule‑based, statistical, neural), and dialogue systems (task‑oriented and open‑domain). Each application builds upon the basic tasks and language‑modeling techniques described earlier.

Tags: AI, NLP, language model, word segmentation, semantic analysis, smoothing, subword tokenization
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
