
Collaborative Filtering and Matrix Factorization: Theory and Spark ALS Implementation

This article introduces collaborative filtering, derives the matrix‑factorization model R ≈ X·Yᵀ with L2‑regularized ALS updates, walks through a complete Python example on a small rating matrix, and then shows how to apply Spark’s ALS to massive user‑item data, closing with production tips and references.

vivo Internet Technology

Recommendation systems are ubiquitous in platforms such as Douyin, Taobao, and JD. This article uses classic collaborative filtering as a starting point and focuses on the matrix factorization algorithm that is widely used in industry.

It explains the problem formulation: given a rating matrix R (users × items) with many missing entries, the goal is to predict the missing scores. User‑based and item‑based collaborative filtering are introduced, along with similarity measures such as cosine similarity.
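As a quick illustration of the similarity step (this snippet is not from the article's own code), cosine similarity between two users' rating rows can be computed directly, treating zeros as unrated:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two rating vectors (0 entries mean 'unrated')."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm > 0 else 0.0

# Toy rating rows for two users over four items
alice = np.array([4.0, 0.0, 2.0, 5.0])
bob = np.array([3.0, 2.0, 1.0, 0.0])
sim = cosine_similarity(alice, bob)
print(round(sim, 4))
```

In user-based CF, this score would be computed against all other users to find the nearest neighbors, whose ratings are then averaged (weighted by similarity) to fill in missing entries.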

The core of the article is the matrix factorization model R ≈ X·Yᵀ, where X (user matrix) is m×k and Y (item matrix) is n×k. The loss function with L2 regularization is presented, and a detailed derivation of the Alternating Least Squares (ALS) optimization is provided, including the closed‑form updates for X and Y.

Key mathematical results are shown with inline equations (e.g., the update for a user vector x_u: (Y_Iuᵀ·Y_Iu + λI)·x_u = Y_Iuᵀ·r_uIu) and the corresponding derivation steps.
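Written out in full, the regularized objective and the two symmetric closed-form updates referenced above are:

```latex
% Regularized squared loss over the set \Omega of observed ratings
L(X, Y) = \sum_{(u,i)\in\Omega} \left( r_{ui} - x_u^\top y_i \right)^2
        + \lambda \Big( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \Big)

% Fixing Y and setting \partial L / \partial x_u = 0 gives the user update
\left( Y_{I_u}^\top Y_{I_u} + \lambda I \right) x_u = Y_{I_u}^\top r_{u, I_u}

% Symmetrically, fixing X gives the item update
\left( X_{U_i}^\top X_{U_i} + \lambda I \right) y_i = X_{U_i}^\top r_{U_i, i}
```

Each update is a small k×k linear system, which is why ALS parallelizes so well: every user vector (and every item vector) can be solved independently within a sweep.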

To make the theory concrete, a small numerical example is given, and the complete Python implementation of ALS on a 5×6 rating matrix is provided:

import numpy as np
from scipy.linalg import solve as linear_solve

# Rating matrix 5 x 6
R = np.array([[4, 0, 2, 5, 0, 0],
              [3, 2, 1, 0, 0, 3],
              [0, 2, 0, 3, 0, 4],
              [0, 3, 3, 5, 4, 0],
              [5, 0, 3, 4, 0, 0]])

m, n, k = 5, 6, 3
_lambda = 0.01

# Random initialization
X = np.random.rand(m, k)
Y = np.random.rand(n, k)

# Observed item indices per user, and observed user indices per item (1-indexed,
# matching the nonzero entries of R)
X_idx_dict = {1: [1, 3, 4], 2: [1, 2, 3, 6], 3: [2, 4, 6], 4: [2, 3, 4, 5], 5: [1, 3, 4]}
Y_idx_dict = {1: [1, 2, 5], 2: [2, 3, 4], 3: [1, 2, 4, 5], 4: [1, 3, 4, 5], 5: [4], 6: [2, 3]}

for iteration in range(10):  # 'iter' shadows the Python builtin, so avoid it
    # Update user factors
    for u in range(1, m+1):
        Iu = np.array(X_idx_dict[u])
        YIu = Y[Iu-1]
        RuIu = R[u-1, Iu-1]
        xu = linear_solve(YIu.T.dot(YIu) + _lambda*np.eye(k), YIu.T.dot(RuIu))
        X[u-1] = xu
    # Update item factors
    for i in range(1, n+1):
        Ui = np.array(Y_idx_dict[i])
        XUi = X[Ui-1]
        RiUi = R.T[i-1, Ui-1]
        yi = linear_solve(XUi.T.dot(XUi) + _lambda*np.eye(k), XUi.T.dot(RiUi))
        Y[i-1] = yi

# Predicted rating matrix
R_pred = X.dot(Y.T)
print(R_pred)

The resulting user and item matrices, as well as the predicted rating matrix, closely approximate the original ratings.
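One way to quantify "closely approximate" is the RMSE restricted to the observed (nonzero) entries, since the zeros in R are missing ratings rather than true values. A minimal helper (not part of the original article) might look like:

```python
import numpy as np

def masked_rmse(R, R_pred):
    """RMSE over the observed (nonzero) entries of R only."""
    mask = R > 0
    err = (R - R_pred)[mask]
    return float(np.sqrt(np.mean(err ** 2)))

# Tiny check: a perfect reconstruction of the observed cells gives RMSE 0,
# regardless of what the model predicts for the unobserved cells.
R_toy = np.array([[4.0, 0.0], [0.0, 3.0]])
R_toy_pred = np.array([[4.0, 1.7], [2.2, 3.0]])
print(masked_rmse(R_toy, R_toy_pred))
```

Calling `masked_rmse(R, R_pred)` after the ALS loop above gives a single number to watch across iterations; it should decrease monotonically per sweep on the training entries.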

The article then shows how to use Spark’s ALS implementation. A CSV file containing (userId, itemId, rating) triples is read, an ALS model is configured (maxIter=10, rank=3, regParam=0.01), trained, and used to generate top‑N recommendations for each user. Sample Spark Scala code is included:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.recommendation.ALS

val spark = SparkSession.builder().appName("als-demo").master("local[*]").getOrCreate()
val rating = spark.read
  .options(Map("inferSchema" -> "true", "delimiter" -> ",", "header" -> "true"))
  .csv("./data/als-demo-data.csv")

val als = new ALS()
  .setMaxIter(10)
  .setRank(3)
  .setRegParam(0.01)
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")

val model = als.fit(rating)
model.userFactors.show(truncate = false)
model.itemFactors.show(truncate = false)
model.recommendForAllUsers(2).show()

In a production setting, the article describes how to handle massive user and item sets (millions of users, hundreds of thousands of items) stored in HDFS. It shows how to compute implicit feedback scores from raw logs (play time, finish count, likes, shares) using weighted sums, convert string IDs to integer indices with StringIndexer, train ALS, generate recommendations, and write the results back to Redis for online serving.
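The weighted-sum step can be sketched in a few lines. Note that the field names and coefficients below are hypothetical; the article only says the score is a weighted sum of signals such as play time, finish count, likes, and shares, with the actual weights tuned per product:

```python
# Hypothetical weights for collapsing raw log signals into one implicit rating.
WEIGHTS = {"play_seconds": 0.01, "finish_count": 1.0, "likes": 2.0, "shares": 3.0}

def implicit_score(event: dict) -> float:
    """Weighted sum of implicit-feedback signals for one user-item record."""
    return sum(WEIGHTS[k] * event.get(k, 0) for k in WEIGHTS)

# Example log record (field names assumed for illustration)
log = {"userId": "u42", "itemId": "v7", "play_seconds": 120, "finish_count": 1, "likes": 1}
print(implicit_score(log))
```

In the real pipeline this aggregation would run as a Spark job over the HDFS logs, and the resulting (userId, itemId, score) triples, after StringIndexer has mapped the string IDs to integers, become the rating column fed to ALS.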

Finally, the article summarizes the workflow, emphasizes the importance of understanding the underlying mathematics, and provides references to classic works on collaborative filtering, implicit feedback, and large‑scale parallel ALS.

machine learning · collaborative filtering · matrix factorization · recommendation systems · Spark · ALS
Written by

vivo Internet Technology

Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.
