
Multi‑Scale BERT‑Based Automated Essay Scoring: Architecture, Loss Functions, and Experimental Evaluation

This article surveys automated essay scoring (AES), compares handcrafted, deep‑learning, and pre‑trained language‑model approaches, proposes a multi‑scale BERT architecture with document, token, and segment features, introduces three combined loss functions, and demonstrates superior performance on the ASAP dataset and internal tasks.

Liulishuo Tech Team

Automated Essay Scoring (AES) automatically assigns scores to essays by first converting the text into a numerical vector (representation module) and then computing a score from that vector (scoring module).
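As a toy illustration of this two‑module pipeline (the vocabulary, bag‑of‑words representation, and linear scorer below are hypothetical stand‑ins, not the model described later):

```python
import numpy as np

# Hypothetical toy vocabulary for the illustration
VOCAB = {"the": 0, "essay": 1, "argument": 2, "evidence": 3}

def represent(text):
    """Representation module: map text to a fixed-size numerical vector.
    A bag-of-words count vector stands in for a learned encoder."""
    vec = np.zeros(len(VOCAB))
    for word in text.lower().split():
        if word in VOCAB:
            vec[VOCAB[word]] += 1
    return vec

def score(vec, weights, bias=0.0):
    """Scoring module: map the representation to a scalar score."""
    return float(vec @ weights + bias)

essay_vec = represent("The essay presents the argument with evidence")
essay_score = score(essay_vec, weights=np.array([0.1, 0.5, 1.0, 1.0]), bias=1.0)
```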

AES methods fall into three categories: handcrafted‑feature based, deep‑learning based, and pre‑trained language‑model based. Handcrafted methods rely on linguistic features such as grammar, vocabulary, and coherence, but they require extensive feature engineering and scale poorly. Deep‑learning approaches use networks such as LSTMs and CNNs to learn representations end‑to‑end and achieve better results, but their performance degrades on small datasets. Pre‑trained models (e.g., BERT, XLNet) are fine‑tuned on AES data, yet without additional optimizations they often fail to surpass traditional deep‑learning methods.

At Liulishuo, we optimized the pre‑trained‑model pipeline and achieved significant gains over traditional LSTM/CNN methods on both our internal dataset and the ASAP benchmark.

Problem Analysis

Applying pre‑trained models to AES raises two main issues: (1) pre‑training is performed on sentences or short fragments, while AES requires encoding whole essays, causing a length mismatch; and (2) essay data is limited, which makes fine‑tuning difficult. In addition, teachers evaluate essays at multiple granularities (word, sentence, paragraph, document) and also consider score distributions and relative comparisons between essays.

To address these, we propose segmenting an essay into multiple fragments, encoding each fragment with a pre‑trained model, and aggregating the fragment representations. Multiple scales are processed separately and fused, and a distribution‑aware loss is introduced.

Model Structure

As shown in Figure 1, the model consists of a left branch that extracts document‑level and token‑level features, and a right branch that extracts multi‑scale fragment features. The final score is the sum of the document/token score and all fragment‑scale scores. BERT is used as the backbone because it performed best in our experiments.

Document and Token Scale Features: The essay is tokenized with the BERT tokenizer; token, segment, and position embeddings are summed and fed into BERT. The [CLS] output serves as the document‑level feature, while max‑pooled token outputs provide the token‑level feature.
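Given BERT's final hidden states for one essay, the two feature vectors can be extracted as in this minimal NumPy sketch (`hidden_states` stands in for the actual BERT output, which is assumed rather than computed here):

```python
import numpy as np

def doc_and_token_features(hidden_states):
    """hidden_states: (seq_len, d) array of BERT outputs, position 0 = [CLS].

    Returns the document-level feature (the [CLS] vector) and the
    token-level feature (element-wise max over the remaining tokens)."""
    doc_feature = hidden_states[0]
    token_feature = hidden_states[1:].max(axis=0)
    return doc_feature, token_feature

# Dummy hidden states: 4 positions, hidden size 3
H = np.array([[0.1, 0.2, 0.3],
              [1.0, -1.0, 0.0],
              [0.5, 2.0, -0.5],
              [-0.2, 0.0, 4.0]])
doc_f, tok_f = doc_and_token_features(H)
```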

Multi‑Scale Fragment Features: For a set of scales K = [k₁, k₂, …, kₛ], the token sequence of length n is split into ⌈n/kᵢ⌉ fragments of length kᵢ. Each fragment is encoded by BERT (its [CLS] output), and the fragment sequence is then processed by an LSTM or attention layer to obtain a scale‑specific representation.
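The splitting step can be sketched in plain Python (the real tokenizer's padding and truncation details are omitted):

```python
import math

def split_into_fragments(tokens, k):
    """Split a token sequence into ceil(n/k) fragments of length k;
    the last fragment may be shorter and would be padded in practice."""
    return [tokens[i:i + k] for i in range(0, len(tokens), k)]

tokens = list(range(10))              # stand-in for 10 token ids
fragments = split_into_fragments(tokens, k=4)
# 10 tokens at scale 4 -> ceil(10/4) = 3 fragments
```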

Score Prediction: Document and token features are concatenated and passed through a multilayer perceptron (MLP) to produce a document‑level score. The fragment features for each scale kᵢ are likewise fed to an MLP to produce a fragment‑scale score. The final essay score is the sum of all these scores.
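A minimal sketch of this scoring step (the one‑hidden‑layer ReLU head and its dimensions are illustrative assumptions; each scale gets its own head):

```python
import numpy as np

def mlp_score(features, W1, b1, w2, b2):
    """One-hidden-layer MLP regression head (ReLU) producing a scalar score."""
    hidden = np.maximum(0.0, features @ W1 + b1)
    return float(hidden @ w2 + b2)

def predict_essay_score(doc_token_feature, scale_features, heads):
    """Final essay score = document/token score + one score per scale."""
    total = mlp_score(doc_token_feature, *heads["doc"])
    for scale, feature in scale_features.items():
        total += mlp_score(feature, *heads[scale])
    return total
```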

Figure 1: Multi‑scale BERT‑based essay scoring model.

Loss Functions

Three loss components are combined:

MSE (Mean Squared Error): Standard regression loss over a batch of N samples.

SIM (Similarity): Encourages the predicted score distribution within a batch to match the true score distribution.

MR (Margin Ranking): Enforces reasonable pairwise ranking among samples in a batch.

The three losses are weighted and summed; weights are tuned on a validation set.
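A NumPy sketch of the three components (the paper's exact formulas may differ; here SIM is taken as one minus the cosine similarity of the batch score vectors, and MR as a hinge loss over ordered pairs, both common choices):

```python
import numpy as np

def mse_loss(pred, gold):
    """Standard mean squared error over a batch."""
    return float(np.mean((pred - gold) ** 2))

def sim_loss(pred, gold, eps=1e-8):
    """1 - cosine similarity between predicted and gold batch score vectors."""
    cos = float(pred @ gold) / (float(np.linalg.norm(pred) * np.linalg.norm(gold)) + eps)
    return 1.0 - cos

def mr_loss(pred, gold, margin=0.1):
    """Hinge loss over ordered pairs: if gold[i] > gold[j], pred[i]
    should exceed pred[j] by at least `margin`."""
    terms = [max(0.0, margin - (pred[i] - pred[j]))
             for i in range(len(pred)) for j in range(len(pred))
             if gold[i] > gold[j]]
    return float(np.mean(terms)) if terms else 0.0

def combined_loss(pred, gold, w_mse=1.0, w_sim=1.0, w_mr=1.0):
    """Weighted sum; the weights are hyperparameters tuned on validation data."""
    return (w_mse * mse_loss(pred, gold)
            + w_sim * sim_loss(pred, gold)
            + w_mr * mr_loss(pred, gold))
```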

Experimental Results

On the ASAP dataset (Table 1) our approach ranks within the top three. On the long‑essay subset (Table 2) it achieves a QWK of 0.772, surpassing the best competing system (QWK 0.761). Relative to traditional deep‑learning baselines (methods 4 and 6), the multi‑scale BERT model improves QWK from 0.764 to 0.782.

Table 1 and Table 2 (images omitted) detail the performance metrics and rankings.
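QWK (quadratic weighted kappa) is the standard agreement metric behind these numbers; a self‑contained implementation for integer ratings in [0, n_ratings) is:

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_ratings):
    """QWK for integer ratings in [0, n_ratings):
    1.0 = perfect agreement, 0.0 = chance-level agreement."""
    observed = np.zeros((n_ratings, n_ratings))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1
    # Quadratic disagreement weights, normalized to [0, 1]
    weights = np.array([[(i - j) ** 2 for j in range(n_ratings)]
                        for i in range(n_ratings)], dtype=float)
    weights /= (n_ratings - 1) ** 2
    # Expected confusion matrix under independence of the two raters
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```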

Beyond the ASAP task, the method has been applied internally to essay scoring and text‑difficulty grading. The underlying paper has been accepted at NAACL 2022 and is available on arXiv for further study.

Product Applications at Liulishuo

Liulishuo integrates handcrafted, deep‑learning, and pre‑trained‑model AES algorithms into a suite for speaking and writing assessment, exposing an API for external use. Example products include:

1. Liulishuo Writing

2. Darwin Speaking Homework

3. IELTS Liulishuo

Conclusion

This article introduced AES methods, identified the challenges of applying pre‑trained models to essay scoring, and presented a multi‑scale BERT solution that yields strong results on both public and internal datasets while also providing an effective encoding strategy for long texts. Ongoing work will continue to refine the AES system and advance the technology for a better user experience.
