Understanding Collaborative Filtering, Matrix Factorization, and Spark ALS for Recommendation Systems
This article explains the fundamentals of recommendation systems, introduces collaborative filtering (both user‑based and item‑based), derives the matrix‑factorization model with ALS optimization, provides a complete Python implementation, and demonstrates how to apply Spark ALS in both demo and production environments.
Author: vivo Internet Server Team – Tang Shutao
Recommendation systems are ubiquitous in apps like Douyin, Taobao, and JD, and they rely on various technologies.
This article uses classic collaborative filtering as a starting point and focuses on the widely used matrix‑factorization algorithm, presenting its theory and practice in an easy‑to‑understand manner.
To truly understand a research paper, the best approach is to reproduce it; the implementation process reveals many theoretical details and practical challenges.
1. Background
In the information‑overloaded 21st century, search engines serve users who already know what they are looking for; recommendation engines solve the complementary problem of surfacing relevant items when users do not, which is why e‑commerce and content platforms depend on them.
We illustrate the problem with three books: "Programmer's Neck‑Pain Guide", "Big Data Analysis with Spark", and "Scala Programming". The goal is to predict how a user would rate each book despite never having purchased them.
We model the user‑item interactions as a rating matrix R of size m × n , where each entry r_{u,i} represents the rating given by user u to item i .
2. Collaborative Filtering
Collaborative filtering predicts user preferences by finding similar users (user‑based CF) or similar items (item‑based CF) using similarity measures such as cosine similarity.
Example: the cosine similarity between the rating vectors of user‑3, [0,2,0,3,0,4], and user‑4, [0,3,3,5,4,0], is computed as a measure of how closely their tastes align.
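As a quick check of the example above, the cosine similarity between the two rating vectors can be computed directly (a minimal sketch using NumPy):

```python
import numpy as np

# Rating vectors of user-3 and user-4 from the example matrix
u3 = np.array([0, 2, 0, 3, 0, 4])
u4 = np.array([0, 3, 3, 5, 4, 0])

# cos(u3, u4) = (u3 . u4) / (||u3|| * ||u4||)
cos_sim = u3 @ u4 / (np.linalg.norm(u3) * np.linalg.norm(u4))
print(round(cos_sim, 4))  # → 0.5077
```

A value near 0.5 indicates moderate agreement between the two users' tastes; note that co-rated items (here items 2 and 4) are what drive the numerator.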
2.1 User‑Based CF (UserCF)
Find the top‑N most similar users.
Predict the target user's rating for each candidate item by a weighted average of the neighbors' ratings.
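The two steps above can be sketched as follows; the neighbor similarities and ratings below are illustrative values, not taken from the example matrix:

```python
import numpy as np

def predict_user_cf(neighbor_sims, neighbor_ratings):
    """Weighted average of the top-N neighbors' ratings for one candidate item.

    neighbor_sims:    similarity between the target user and each neighbor
    neighbor_ratings: each neighbor's rating of the candidate item
    """
    sims = np.asarray(neighbor_sims, dtype=float)
    ratings = np.asarray(neighbor_ratings, dtype=float)
    # r_hat(u, i) = sum_v sim(u, v) * r(v, i) / sum_v |sim(u, v)|
    return float(sims @ ratings / np.abs(sims).sum())

# Two neighbors with similarities 0.8 and 0.5 rated the item 4 and 2
print(round(predict_user_cf([0.8, 0.5], [4, 2]), 2))  # → 3.23
```

The ItemCF prediction in the next subsection has the same weighted-average form, except the sum runs over similar items the user has already rated rather than over similar users.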
2.2 Item‑Based CF (ItemCF)
For each item the target user has interacted with, find its top‑N similar items.
Predict the rating by a weighted average of the similar items' ratings.
2.3 Comparison of UserCF and ItemCF
UserCF captures community trends, while ItemCF focuses on a user's personal taste. Both belong to the neighborhood‑based (K‑Nearest Neighbor) family.
3. Matrix Factorization
The rating matrix is usually very sparse, which makes similarity‑based methods unreliable. Matrix factorization approximates R ≈ X·Yᵀ , where X ∈ ℝ^{m×k} (user latent factors) and Y ∈ ℝ^{n×k} (item latent factors) with k ≪ m,n .
The loss function to minimize is:
min_{X,Y} Σ_{(u,i)∈K} (r_{u,i} − x_u·y_iᵀ)² + λ (‖x_u‖² + ‖y_i‖²)

where K is the set of observed (u, i) pairs. Two common optimization methods are Alternating Least Squares (ALS) and Stochastic Gradient Descent (SGD). Spark uses ALS because it parallelizes easily.
Derivation (user‑wise): fixing Y , the optimal x_u satisfies

(Y_{I_u}ᵀ·Y_{I_u} + λI)·x_u = Y_{I_u}ᵀ·r_{u,I_u}

where I_u is the set of items rated by user u , Y_{I_u} is the sub‑matrix of Y restricted to those items, and r_{u,I_u} is the vector of user u 's ratings of them. Similarly, fixing X yields the item‑wise update.
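For a single user this update is just a regularized least-squares solve. A tiny sketch with made-up numbers (the item factors and ratings below are illustrative, not drawn from the example matrix):

```python
import numpy as np

lam, k = 0.01, 2
Y_Iu = np.array([[1.0, 0.5],     # latent vectors of the items user u rated
                 [0.2, 1.0]])
r_uIu = np.array([4.0, 3.0])     # user u's ratings of those items

# Solve (Y_Iu^T Y_Iu + lam*I) x_u = Y_Iu^T r_uIu for the user factor x_u
x_u = np.linalg.solve(Y_Iu.T @ Y_Iu + lam * np.eye(k), Y_Iu.T @ r_uIu)

# With a small lam, x_u nearly reproduces the observed ratings [4, 3]
print(np.round(Y_Iu @ x_u, 2))
```

The item-wise update is the mirror image: fix X , restrict to the users who rated item i , and solve the same form of system for y_i .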
After alternating updates until convergence (or a maximum number of iterations), the product X·Yᵀ provides predicted ratings.
Root‑Mean‑Square Error (RMSE) is used to evaluate the model.
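A minimal RMSE computation over the observed entries might look like this (here a zero entry is taken to mean "not rated", as in the example matrix; the small matrices are illustrative):

```python
import numpy as np

def rmse(R, R_pred):
    """RMSE over the observed (non-zero) entries of the rating matrix R."""
    mask = R > 0                       # restrict to known ratings only
    errors = (R - R_pred)[mask]
    return float(np.sqrt(np.mean(errors ** 2)))

R = np.array([[4.0, 0.0],
              [3.0, 2.0]])
R_pred = np.array([[3.9, 1.0],         # the 1.0 is ignored: entry unobserved
                   [3.1, 2.2]])
print(round(rmse(R, R_pred), 4))  # → 0.1414
```

Restricting the error to observed entries matters: the unobserved cells of R_pred are exactly the predictions the model is supposed to fill in, so they carry no ground truth to compare against.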
3.1 Python Implementation (single‑machine)
import numpy as np
from scipy.linalg import solve as linear_solve

# Rating matrix 5×6 (0 means "not rated")
R = np.array([[4, 0, 2, 5, 0, 0],
              [3, 2, 1, 0, 0, 3],
              [0, 2, 0, 3, 0, 4],
              [0, 3, 3, 5, 4, 0],
              [5, 0, 3, 4, 0, 0]])
m, n = R.shape
k = 3           # latent dimension
_lambda = 0.01  # regularization

X = np.random.rand(m, k)  # user latent factors
Y = np.random.rand(n, k)  # item latent factors

# Items rated by each user, and users who rated each item (1-based ids)
X_idx_dict = {1: [1, 3, 4], 2: [1, 2, 3, 6], 3: [2, 4, 6], 4: [2, 3, 4, 5], 5: [1, 3, 4]}
Y_idx_dict = {1: [1, 2, 5], 2: [2, 3, 4], 3: [1, 2, 4, 5], 4: [1, 3, 4, 5], 5: [4], 6: [2, 3]}

for _ in range(10):
    # Update user factors: solve (Y_Iu^T Y_Iu + λI) x_u = Y_Iu^T r_uIu
    for u in range(1, m + 1):
        Iu = np.array(X_idx_dict[u]) - 1
        YIu = Y[Iu]
        RuIu = R[u - 1, Iu]
        X[u - 1] = linear_solve(YIu.T @ YIu + _lambda * np.eye(k), YIu.T @ RuIu)
    # Update item factors: solve (X_Ui^T X_Ui + λI) y_i = X_Ui^T r_Ui
    for i in range(1, n + 1):
        Ui = np.array(Y_idx_dict[i]) - 1
        XUi = X[Ui]
        RiUi = R[:, i - 1][Ui]
        Y[i - 1] = linear_solve(XUi.T @ XUi + _lambda * np.eye(k), XUi.T @ RiUi)

R_pred = X @ Y.T
print(R_pred)

The predicted matrix closely matches the original ratings at the observed entries, confirming the correctness of the ALS implementation.
3.2 Spark ALS Demo
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.recommendation.ALS
val spark = SparkSession.builder().appName("als-demo").master("local[*]").getOrCreate()
val rating = spark.read
.options(Map("inferSchema" -> "true", "delimiter" -> ",", "header" -> "true"))
.csv("./data/als-demo-data.csv")
rating.show(5)
val als = new ALS()
.setMaxIter(10)
.setRank(3)
.setRegParam(0.01)
.setUserCol("userId")
.setItemCol("itemId")
.setRatingCol("rating")
val model = als.fit(rating)
model.userFactors.show(truncate = false)
model.itemFactors.show(truncate = false)
model.recommendForAllUsers(2).show()

Running the above code prints the user and item latent vectors and the top‑2 recommendations for each user. The results align with the single‑machine Python implementation.
3.3 Production Use Case
In a real‑world scenario with millions of users and hundreds of thousands of items, raw interaction logs are stored in Hive. Implicit feedback (play time, finish count, likes, shares, etc.) is converted into a weighted rating:
rating = w1*play_time + w2*finish_play_cnt + w3*praise_cnt + w4*share_cnt + ...

Because ALS requires numeric user and item ids, the string imei and content_id columns are first mapped to indices:

import org.apache.spark.ml.feature.{StringIndexer, IndexToString}

val rating_df = spark.sql(
  """SELECT imei, content_id,
    |       SUM(w1*play_time + w2*finish_play_cnt + w3*praise_cnt + w4*share_cnt) AS rating
    |FROM t_user_behavior
    |GROUP BY imei, content_id""".stripMargin)  // w1..w4 are the feedback weights above
val imeiIndexer = new StringIndexer().setInputCol("imei").setOutputCol("userId").fit(rating_df)
val contentIndexer = new StringIndexer().setInputCol("content_id").setOutputCol("itemId").fit(rating_df)
val ratings = contentIndexer.transform(imeiIndexer.transform(rating_df))
val model = als.fit(ratings)
val userRecs = model.recommendForAllUsers(100)
val imeiConverter = new IndexToString().setInputCol("userId").setOutputCol("imei").setLabels(imeiIndexer.labels)
val contentConverter = new IndexToString().setInputCol("itemId").setOutputCol("content_id").setLabels(contentIndexer.labels)
val finalRecs = imeiConverter.transform(userRecs)
finalRecs.foreachPartition { /* write to Redis or other KV store */ }

For scenarios without explicit ratings, the implicit‑feedback ALS variant (Hu et al. 2008) can be used; Spark exposes it via setImplicitPrefs(true).
4. Conclusion
The article introduced recommendation basics, detailed collaborative filtering, derived the matrix‑factorization model with ALS, provided a full Python example, and demonstrated Spark ALS in both a toy demo and a production pipeline, giving readers a clear theoretical and practical understanding of matrix factorization for recommender systems.
References:
1. Wang Zhe. Deep Learning Recommendation Systems.
2. Hu, Yifan, Yehuda Koren, and Chris Volinsky. "Collaborative Filtering for Implicit Feedback Datasets." IEEE ICDM 2008.
3. Zhou, Yunhong, et al. "Large-Scale Parallel Collaborative Filtering for the Netflix Prize." AAIM 2008, Springer.