Feed-Forward Neural Networks and Their Applications in Language Modeling, Ranking, and Recommendation
This article excerpt explains the structure and training of feed‑forward neural networks, illustrates their use in neural language models, describes deep structured semantic models for ranking tasks, and details two‑stage recommendation systems such as YouTube's, covering both the underlying formulas and practical deployment considerations.
The piece is an excerpt from Zhu Jian's deep‑learning notes. It first reviews the architecture of a feed‑forward neural network, describing its layers, fully connected weight matrices, bias terms, and typical activation choices.
A concrete two‑hidden‑layer example is presented: input vector \(x\in\mathbb{R}^4\), first hidden output \(h^{(1)}\in\mathbb{R}^5\) with weight matrix \(W^{(1)}\in\mathbb{R}^{5\times4}\) and bias \(b^{(1)}\), second hidden output \(h^{(2)}\in\mathbb{R}^3\) with \(W^{(2)}\in\mathbb{R}^{3\times5}\) and \(b^{(2)}\), and final output \(\hat y\in\mathbb{R}^2\) with \(W^{(3)}\in\mathbb{R}^{2\times3}\) and \(b^{(3)}\). Hidden layers typically use ReLU, the output layer uses an identity mapping, and for classification a Softmax layer followed by cross‑entropy loss is applied.
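The forward pass of this example can be written out in a few lines of NumPy. This is a minimal sketch with randomly initialized weights, using exactly the shapes from the text (ReLU hidden layers, identity output, softmax for classification); the random seed and initialization scheme are illustrative assumptions, not part of the original.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

# Shapes taken from the example in the text.
x = rng.standard_normal(4)                          # input x ∈ R^4
W1, b1 = rng.standard_normal((5, 4)), np.zeros(5)   # first hidden layer, R^5
W2, b2 = rng.standard_normal((3, 5)), np.zeros(3)   # second hidden layer, R^3
W3, b3 = rng.standard_normal((2, 3)), np.zeros(2)   # output layer, R^2

h1 = relu(W1 @ x + b1)
h2 = relu(W2 @ h1 + b2)
y_hat = W3 @ h2 + b3     # identity mapping at the output layer
p = softmax(y_hat)       # class probabilities for classification
```

For a classification target `t`, the cross‑entropy loss is then simply `-np.log(p[t])`.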
The same forward‑network structure is then used to describe a neural probabilistic language model (Bengio 2003). Words are first mapped to embeddings via a matrix \(C\in\mathbb{R}^{|V|\times m}\); a context of \(n\) words is concatenated into \(x\), transformed by a hidden layer \(H\) and bias \(b^{(1)}\), and finally projected to a vocabulary‑size output with \(W\in\mathbb{R}^{|V|\times h}\) and bias \(b^{(2)}\). Training minimizes cross‑entropy between the predicted and true next‑word distribution.
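A toy forward pass of the Bengio (2003) model can be sketched as follows. The vocabulary, embedding, context, and hidden sizes here are small illustrative assumptions, and the optional direct input‑to‑output connections from the original paper are omitted; the hidden nonlinearity is tanh, as in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sizes (assumptions for illustration): |V|, m, n, h
V, m, n, h = 10, 4, 3, 6

C  = rng.standard_normal((V, m))       # embedding matrix C ∈ R^{|V|×m}
H  = rng.standard_normal((h, n * m))   # hidden-layer weights
b1 = np.zeros(h)
W  = rng.standard_normal((V, h))       # output projection W ∈ R^{|V|×h}
b2 = np.zeros(V)

context = [2, 7, 5]                    # indices of the n context words
x = C[context].reshape(-1)             # concatenated embeddings, R^{n·m}

hidden = np.tanh(H @ x + b1)
logits = W @ hidden + b2
probs = np.exp(logits - logits.max())
probs /= probs.sum()                   # next-word distribution over V

target = 4                             # assumed true next word
loss = -np.log(probs[target])          # cross-entropy training objective
```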
Next, the article discusses ranking problems such as web‑search relevance. It introduces the Deep Structured Semantic Model (DSSM), where queries and documents are encoded as high‑dimensional one‑hot vectors, reduced by word‑hashing to a 30k vocabulary, then passed through a feed‑forward network to obtain semantic vectors \(y_Q\) and \(y_D\). Similarity is measured (e.g., cosine) and the model is trained with a smoothed cross‑entropy loss that incorporates a smoothing factor \(\gamma\).
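The DSSM scoring and loss can be sketched with cosine similarity and the smoothed softmax over one clicked document plus sampled negatives. The semantic-vector dimension, number of negatives, and the value of γ are illustrative assumptions here; in practice the vectors come from the word-hashed feed-forward towers.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(2)
y_Q    = rng.standard_normal(128)       # query semantic vector
y_pos  = rng.standard_normal(128)       # clicked document D+
y_negs = rng.standard_normal((4, 128))  # sampled negative documents

gamma = 10.0   # smoothing factor γ
sims = np.array([cosine(y_Q, y_pos)] + [cosine(y_Q, d) for d in y_negs])

# Smoothed softmax over the candidate set; train by minimizing -log P(D+|Q)
scores = np.exp(gamma * sims)
p_pos = scores[0] / scores.sum()
loss = -np.log(p_pos)
```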
The recommendation section explains how user, item, context, and other features are embedded and fed into a forward network. Using YouTube as a case study, a two‑stage pipeline is described: (1) candidate generation, treated as an extreme multi‑class classification problem with sampled softmax to handle a vocabulary of millions of videos; (2) ranking, where a weighted logistic‑regression model predicts expected watch time, weighting positive samples by observed watch duration and using odds (\(\frac{P}{1-P}\)) as the final score.
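The ranking-stage idea of weighting positives by watch duration can be sketched with a toy weighted logistic regression trained by gradient descent. The data, weighting scheme, and learning rate below are fabricated for illustration; the key point is the serving-time score, the odds P/(1−P), which for a logistic model equals e^{w·x} and approximates expected watch time under this weighting.

```python
import numpy as np

rng = np.random.default_rng(3)

# Fabricated training data: 200 examples, 5 features.
X = rng.standard_normal((200, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 1.5])
y = (X @ w_true + 0.1 * rng.standard_normal(200) > 0).astype(float)

# Positive examples weighted by (fabricated) watch time in seconds;
# negatives get unit weight, following the weighted-LR scheme.
watch_time = np.where(y == 1, rng.uniform(10, 300, 200), 1.0)

w, lr = np.zeros(5), 0.1
for _ in range(500):                               # weighted gradient descent
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    grad = X.T @ (watch_time * (p - y)) / watch_time.sum()
    w -= lr * grad

# Serving-time score: the odds P/(1-P), equivalently exp(w·x).
p0 = 1.0 / (1.0 + np.exp(-(X[0] @ w)))
score = p0 / (1.0 - p0)
```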
Finally, practical deployment tips are given: massive sparse features (often billions) are stored in embedding tables (e.g., Redis) and looked up via placeholders in TensorFlow, while the dense top‑level network is served separately. This separation enables efficient training on parameter‑server architectures and lightweight inference in TensorFlow Serving.
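The serving-time split can be illustrated with a minimal sketch in which a plain dictionary stands in for the external embedding store (Redis in the article) and a small ReLU layer stands in for the dense top network kept in TensorFlow Serving. All feature ids, dimensions, and the zero-vector cold-start fallback are assumptions for illustration.

```python
import numpy as np

# External embedding store: feature id -> embedding vector.
# A dict stands in for Redis here.
embedding_store = {
    "user:42":  np.array([0.1, -0.3, 0.7, 0.0]),
    "video:99": np.array([0.5, 0.2, -0.1, 0.4]),
}

def lookup(feature_ids, dim=4):
    # Missing ids fall back to zeros, a simple cold-start default.
    return np.concatenate([embedding_store.get(f, np.zeros(dim))
                           for f in feature_ids])

def dense_top(x, W, b):
    # The dense top-level network, served separately from the embeddings.
    return np.maximum(W @ x + b, 0.0)

rng = np.random.default_rng(4)
x = lookup(["user:42", "video:99"])            # sparse side: store lookup
out = dense_top(x, rng.standard_normal((3, 8)), np.zeros(3))  # dense side
```

Because the sparse lookup and the dense network communicate only through a fixed-size vector, the embedding tables can be trained and stored on parameter servers while inference serves only the small dense graph.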
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.