Legal Case Similarity Competition at CCL 2019: Dataset, Task Transformation, and Model Solutions
The article reviews the CCL “Chinese Law Research Cup” similarity competition, describing the legal text dataset, converting the triple‑sample task to a binary similarity problem, outlining challenges such as long documents, and summarizing the BERT‑based Siamese, InferSent, and triplet‑loss models that achieved top‑10 results.
The author attended the 18th China National Conference on Computational Linguistics (CCL 2019) and took part in the "Chinese Law Research Cup" similarity competition, where his team won a third prize.
The competition required computing similarity between pairs of legal documents, each consisting of a title and factual description, and selecting the more similar document from two candidates.
Each data point contains three legal texts (A, B, C) forming a triple where sim(A,B) > sim(A,C). The task is naturally a triple‑sample similarity problem, but due to the small training set (5,000 samples) and long document lengths, it was transformed into a binary similarity task with labels 1 (A‑B similar) and 0 (A‑C dissimilar).
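The triple-to-binary conversion described above can be sketched as follows. This is an illustrative helper, not the competition's official preprocessing code; the function and variable names are made up.

```python
def triples_to_pairs(triples):
    """Expand (A, B, C) triples, where sim(A, B) > sim(A, C),
    into binary-labeled pairs: (A, B) -> label 1, (A, C) -> label 0."""
    pairs = []
    for a, b, c in triples:
        pairs.append((a, b, 1))  # A and B form the more similar pair
        pairs.append((a, c, 0))  # A and C form the less similar pair
    return pairs

# Each triple yields two training examples, so 5,000 triples
# become 10,000 binary-labeled pairs.
print(triples_to_pairs([("doc_a", "doc_b", "doc_c")]))
```

Doubling the effective sample count this way is exactly why the binary framing is attractive for a small training set.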
Key challenges include the structural similarity of legal texts, the prevalence of generic terms, and the difficulty of processing long factual descriptions.
Model approaches explored by participants:
Encoder enhancements using BERT, CNN, and attention mechanisms.
Pairwise interaction methods such as cosine similarity, vector differences, and dot products.
Loss functions like triplet loss and margin loss.
Data augmentation by swapping triples to generate additional negative examples.
Incorporating domain‑specific legal element extraction for the top‑ranked solution.
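The margin/triplet loss and the triple-swapping augmentation in the list above can be sketched in numpy. The margin value and the swap heuristic (assuming B is likewise closer to A than to C) are assumptions for illustration, not details confirmed by the competition write-up.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def triplet_margin_loss(a, b, c, margin=0.3):
    """Penalize triples where sim(A, B) does not exceed sim(A, C)
    by at least `margin`; the loss is zero once the gap is wide enough."""
    return max(0.0, margin - cosine(a, b) + cosine(a, c))

def swap_augment(triples):
    """Double the data by swapping A and B: from (A, B, C) also emit
    (B, A, C), under the heuristic assumption that B, too, is closer
    to A than to C."""
    return triples + [(b, a, c) for a, b, c in triples]
```

Training on the swapped triples gives the model extra negative evidence at no annotation cost, which matters with only 5,000 original samples.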
The author's team combined a BERT encoder with Siamese and InferSent architectures. The Siamese network used shared‑weight Bi‑LSTM encoders and cosine similarity, achieving around 63.9% accuracy, while InferSent with BERT pooling and interaction features reached about 64.5% accuracy.
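The interaction step in the InferSent branch can be sketched as follows: the standard matching vector concatenates the two sentence embeddings with their absolute difference and element-wise product. This is an illustrative numpy version of that generic feature, not the team's exact code.

```python
import numpy as np

def interaction_features(u, v):
    """InferSent-style matching vector for two sentence embeddings:
    [u; v; |u - v|; u * v], fed to a downstream classifier."""
    return np.concatenate([u, v, np.abs(u - v), u * v])

# For d-dimensional inputs the feature vector has dimension 4 * d.
u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])
print(interaction_features(u, v))  # [1. 2. 3. 4. 2. 2. 3. 8.]
```

The absolute-difference and product terms make similarity explicit to the classifier instead of leaving it to infer the comparison from raw concatenation alone.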
Additional experiments with the original BERT model also yielded competitive results. An ensemble of several models secured seventh place overall.
In conclusion, legal NLP remains a highly impactful application of AI, with ongoing research and industrial products aiming to assist judges and the public in case analysis.
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.