Unlocking Spark: Machine Learning, New Word Discovery, and Smart Q&A Techniques

This article explains how to leverage Spark for machine learning, discover new terms from massive text corpora, and build intelligent question‑answer systems, sharing practical tips, performance considerations, and real‑world examples for data analysts and algorithm engineers.

Efficient Ops

Dec 29, 2015

How to Use Spark for Machine Learning

Spark 1.5 provides a unified platform for real‑time and batch processing, SQL, and a rich algorithm library, and it runs on Hadoop’s YARN and HDFS. With Spark‑Shell and Scala, you can write code interactively and execute it across hundreds of nodes as easily as a local script. That makes training on the full dataset practical, and functions like sample are available when a random subset is enough.
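When sample is used without replacement, each record is kept independently with the given probability, so the size of the result is only approximately fraction × N. The following is a minimal single‑machine sketch of that behaviour in plain Python, not Spark code; bernoulli_sample is a hypothetical helper name introduced for illustration.

```python
import random


def bernoulli_sample(records, fraction, seed):
    """Keep each record independently with probability `fraction`.

    Hypothetical helper: mimics sampling without replacement, where the
    output size is close to fraction * len(records) but not exact.
    """
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    return [r for r in records if rng.random() < fraction]


records = list(range(100_000))
subset = bernoulli_sample(records, fraction=0.1, seed=42)
print(len(subset))  # roughly 10,000; the exact count depends on the seed
```

Fixing the seed is what makes an experiment repeatable: the same seed and fraction always select the same records.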

Algorithms such as Naive Bayes, Word2Vec, and linear regression are available out of the box in Spark's MLlib library, and running them from Spark‑Shell makes experimentation much faster than a single‑machine workflow.

New Word Discovery with Spark

Run over more than two million blog posts (≈200 GB), Spark computes five metrics for each candidate term (cohesion, freedom, frequency, IDF, and overlapping substrings) and uses them to filter and rank new words. Before the metric formulas are applied, the pipeline removes HTML tags, tokenizes the text, keeps only candidates of at most five Chinese characters, and cleans out unwanted characters.
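The article does not give the metric formulas, so the sketch below uses one common definition as an assumption: cohesion is the smallest ratio P(word) / (P(left part) × P(right part)) over all two‑way splits, and freedom is the smaller of the entropies of the characters seen to the left and to the right of the candidate. It covers only those two metrics, runs on one machine in plain Python, and approximates every probability as count / total characters. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

MAX_LEN = 5  # maximum candidate length, as stated in the article


def ngram_stats(texts, max_len=MAX_LEN):
    """Count every substring up to max_len and its left/right neighbour characters."""
    freq = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for text in texts:
        n = len(text)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                gram = text[i:j]
                freq[gram] += 1
                if i > 0:
                    left[gram][text[i - 1]] += 1
                if j < n:
                    right[gram][text[j]] += 1
    return freq, left, right


def entropy(counter):
    """Shannon entropy (natural log) of a neighbour-character distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log(c / total) for c in counter.values())


def cohesion(word, freq, total_chars):
    """Assumed definition: min over splits of P(word) / (P(head) * P(tail)).

    Requires len(word) >= 2 and a word that occurs in the corpus.
    """
    p_word = freq[word] / total_chars
    return min(
        p_word / ((freq[word[:k]] / total_chars) * (freq[word[k:]] / total_chars))
        for k in range(1, len(word))
    )


def freedom(word, left, right):
    """Assumed definition: the smaller of the left and right neighbour entropies."""
    return min(entropy(left[word]), entropy(right[word]))


# Toy corpus standing in for the cleaned blog text.
texts = ["吃葡萄不吐葡萄皮", "不吃葡萄倒吐葡萄皮"]
freq, left, right = ngram_stats(texts)
total_chars = sum(len(t) for t in texts)

for candidate in ["葡萄", "吃葡"]:
    print(candidate, cohesion(candidate, freq, total_chars), freedom(candidate, left, right))
```

In this toy corpus the real word 葡萄 appears with varied neighbours on both sides, while the fragment 吃葡 is always followed by 萄, so freedom separates them even though their cohesion is the same. In a Spark job the counting step would be the distributed part; the scoring functions stay the same.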

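Performance tricks matter at this scale. Preferring reduceByKey over groupByKey prevents memory blow‑outs, because reduceByKey merges the values for each key inside every partition before the shuffle, whereas groupByKey ships every value across the network and holds each key's full list of values in memory. Dynamic worker scaling in Spark 1.5 also helps manage resource usage.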
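The difference between the two operators can be illustrated without a cluster. The sketch below simulates two partitions in plain Python and counts how many records would cross the shuffle under each strategy; it is an illustration of the idea rather than Spark code, and the function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def group_then_sum(partitions):
    """groupByKey-style: ship every (key, value) pair, then sum per key."""
    shuffled = [pair for part in partitions for pair in part]  # all pairs are shuffled
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)  # the full value list per key is held in memory
    return {k: sum(v) for k, v in grouped.items()}, len(shuffled)


def combine_then_sum(partitions):
    """reduceByKey-style: pre-aggregate inside each partition, then merge."""
    shuffled = []
    for part in partitions:
        local = Counter()
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())  # at most one pair per key per partition
    totals = Counter()
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


partitions = [
    [("spark", 1), ("spark", 1), ("hdfs", 1)],
    [("spark", 1), ("yarn", 1), ("yarn", 1)],
]

grouped_totals, grouped_shuffled = group_then_sum(partitions)
combined_totals, combined_shuffled = combine_then_sum(partitions)
print(grouped_totals == combined_totals)    # same answer either way
print(grouped_shuffled, combined_shuffled)  # 6 records shuffled vs 4
```

Both strategies produce the same totals; the saving is in how much data moves and how much is buffered per key, which grows with the number of repeated keys in each partition.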

Smart Q&A on Spark

Word2Vec embeddings trained on Spark turn blog post titles and user queries into dense vectors (e.g., 50‑dimensional). A sentence vector is the element‑wise sum of its word vectors, and the similarity score between a query vector and each title vector identifies matching answers: a score that clears the 0.9 threshold marks a direct answer, and one that clears 0.7 marks a reference answer.
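The matching step can be sketched in a few lines of plain Python. The article does not say which similarity measure is used, so cosine similarity is an assumption here, as is treating the thresholds as inclusive. The three‑dimensional embeddings are made‑up toy values standing in for real Word2Vec output, and all function names are hypothetical.

```python
import math

DIRECT, REFERENCE = 0.9, 0.7  # thresholds from the article


def sentence_vector(tokens, embeddings):
    """Element-wise sum of the word vectors; unknown words are skipped."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for tok in tokens:
        for i, x in enumerate(embeddings.get(tok, [0.0] * dim)):
            vec[i] += x
    return vec


def cosine(a, b):
    """Cosine similarity (assumed measure); 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(score):
    """Map a similarity score to the article's two answer tiers."""
    if score >= DIRECT:
        return "direct"
    if score >= REFERENCE:
        return "reference"
    return "no match"


# Toy embeddings, invented for illustration only.
embeddings = {
    "spark": [1.0, 0.0, 0.0],
    "install": [0.0, 1.0, 0.0],
    "setup": [0.0, 0.9, 0.1],
    "error": [0.0, 0.0, 1.0],
}

query = sentence_vector(["spark", "install"], embeddings)
titles = {
    "spark setup": ["spark", "setup"],
    "spark": ["spark"],
    "spark error": ["spark", "error"],
}
results = {
    title: classify(cosine(query, sentence_vector(tokens, embeddings)))
    for title, tokens in titles.items()
}
print(results)
```

Summing word vectors ignores word order, which is part of why this approach suits short titles and queries better than long passages.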

This approach powers simple question‑answer bots that retrieve relevant content from forums, blogs, and documentation.

Conclusion

Data analysts and algorithm engineers should adopt Spark‑Shell for its interactive access to the full dataset. Spark also supports Python and R, and its DataFrames borrow concepts from R.

For building machine‑learning platforms, see the author’s related article (http://www.jianshu.com/p/d59c3e037cb7) for architectural insights.
