CTR优劣总结

CTR各模型优劣总结

前言

本文是依照上一篇文章的顺序来进行整理，现附上上一篇链接

1	https://perfect-player.github.io/2021/09/27/CTR/

本文参考原论文（主）与网络资料（次）编写而成。

Convolutional Click Prediction Model(卷积点击预测模型CCPM)

由于循环神经网络在连续广告印象上的不可改变的传播方式，在有效建模动态点击预测方面有局限性，而深度CNN架构的池化和卷积层可以从连续的广告印象中充分提取局部-全局的关键特征。

CCPM就是基于CNN的一个架构，CCPM可以从具有不同元素的输入实例中提取局部-全局关键特征，这不仅可以针对单个广告印象，也可以针对连续的广告印象。

1 2	原文表述 CCPM can extract local-global key features from an input instance with varied elements, which can be implemented for not only single ad impression but also sequential ad impression.

Factorization-supported Neural Network(因子分解支持的神经网络FNN)

出发点

1
2
3

1.之前运用的CTR模型大多是线性的，都是基于大量的稀疏特征的编码。性能相对较低，因为在学习非微观模式时，无法捕捉到假定的（有条件的）独立原始特征之间的相互作用。
2.当时的非线性模型不能利用所有可能的不同特征的组合。
3.大多数预测模型有浅层的结构，对复杂的海量数据的基础模型表达有限，数据建模和泛化能力仍然受到限制。

基于上述出发点引入了深度学习模型。

带有监督学习嵌入层的FNN使用因子化机器被提出来，以有效地减少从稀疏特征到密集的连续特征。

1
2

原文表述
Specifically,FNN with a supervised-learning embedding layer using factorisation machines is proposed to efficiently reduce the dimension from sparse features to dense continuous features.

Product-based Neural Network(PNN)

背景

深度神经网络（DNNs）在分类和回归任务中显示了巨大的能力，在用户反应预测中采用DNNs是很有前途的。之前为了改善多领域分类数据的交互，提出的一种基于因子预训练的方法，基于串联的嵌入向量，构建多层感知器（MLPs）来探索特征的相互作用。嵌入初始化的质量在很大程度上受到因式分解机的限制。

为了利用神经网络的学习能力和挖掘以一种比MLPs更有效的方式挖掘数据的潜在模式，所以提出PNN。PNN有望在多领域的分类数据上学习高阶潜在模式。

1
2
3

原文表述
To utilize the learning ability of neural networks and mine the latent patterns of data in a more effective way than MLPs,in this paper we propose Product-based Neural Network。
PNN is promising to learn high-order latent patterns on multi-field categorical data.

Wide & Deep

谷歌曾经的主流推荐模型，业界影响巨大。

1	记忆能力：可以被理解为模型直接学习并利用历史数据中物品和特征的“共现频率”的能力

提出动机

利用手工构造的交叉组合特征来使线性模型具有记忆性会得到一个不错的效果，但特征工程需要耗费大量精力，对于未曾出现过的特征组合，权重系数为0，无法进行泛化。

1 2	线性模型，记忆力较强，但泛化能力弱 embedding模型，记忆能力弱，泛化能力强

基于优势互补，提出Wide & Deep，左边Wide部分是一个简单的线性模型，右边Deep部分是一个经典的DNN模型。

WDL的深层部分将稀疏的特征嵌入连接起来作为MLP的输入，宽层部分使用手工制作的特征作为输入。深度部分和宽度部分的对数相加，得到预测概率。

1
2

原文表述
WDL’s deep part concatenates sparse feature embeddings as the input of MLP,the wide part use handcrafted feature as input. The logits of deep part and wide part are added to get the prediction probability.

DeepFM

整合了FM和深度神经网络（DNN）的架构。它像FM一样对低阶特征的交互进行建模，像DNN一样对高阶特征的交互进行建模。不同于

Wide & Deep，DeepFM可以在没有任何特征工程的情况下进行端到端训练。但复杂性较大。

优点

1
2
3

1.它不需要任何预训练
2.它同时学习高阶和低阶特征的相互作用；
3.它引入了特征嵌入的共享策略以避免特征工程

原文表述

1
2
3

1) it does not need any pre-training; 
2) it learns both high- and loworder feature interactions; 
3) it introduces a sharing strategy of feature embedding to avoid feature engineering

DeepFM可以看作是WDL和FNN的改进。与WDL相比，DeepFM在广义部分使用FM而不是LR，在深义部分使用嵌入向量的连接作为MLP的输入。与FNN相比，FM的嵌入向量和MLP的输入是相同的。而且它们不需要FM预训练向量来初始化，它们是端对端学习。

1
2

原文表述
DeepFM can be seen as an improvement of WDL and FNN.Compared with WDL,DeepFM use FM instead of LR in the wide part and use concatenation of embedding vectors as the input of MLP in the deep part. Compared with FNN,the embedding vector of FM and input to MLP are same. And they do not need a FM pretrained vector to initialiaze,they are learned end2end.

Piece-wise Linear Model

背景

1	CTR预测问题是一个高度非线性的问题。LR很难抓住非线性因素，基于树的方法不适合非常稀疏和高维的数据，FM不能适应数据中所有的一般非线性模式

提出了一个用于大规模数据的片状线性模型及其训练算法LS-PLM,遵循分而治之的策略。首先将特征空间划分为几个局部区域，然后在每个区域内拟合一个线性模型。结果是输出加权线性预测的组合。它可以从稀疏数据中捕捉到稀疏数据中的非线性模式，并将我们从繁重的特征工程工作中解救出来，这对于实际的工业应用是至关重要的。

优势
LS-PLM的优势在于在三个方面对网络规模的数据挖掘具有优势。
非线性。有了足够的划分区域，LS-PLM可以适应任何复杂的非线性函数。
可扩展性。与LR模型类似，LS-PLM可以扩展到大量样本和高维特征。
稀疏性。
工业模型，可以处理具有1000万个参数的10亿个样本的问题，这就是典型的工业数据量。

由于其能够捕获非线性模式的能力和对海量数据的可扩展性，LS-PLMs已经成为在线显示广告系统中主要的CTR预测榜样，自2012年以来为数亿用户提供服务，成为阿里巴巴在线展示广告系统中主要的点击率预测模型。

Deep & Cross Network

applies feature crossing in an automatic fashion.

以自动的方式进行特征交叉，可以处理大量的稀疏和密集的特征集，并与传统的深层网络共同学习程度有限的显性交叉特征。

1
2
3

原文表述
can handle a large set of sparse and dense features, and learns explicit cross features
of bounded degree jointly with traditional deep representations.

Attentional Factorization Machine(AFM)

AFM是FM的一个变种，传统的FM是将嵌入向量的内积均匀地加起来。AFM可以被看作是特征相互作用的加权和。

1 2	原文表述 AFM is a variant of FM,tradional FM sums the inner product of embedding vector uniformly. AFM can be seen as weighted sum of feature interactions.The weight is learned by a small MLP.

AFM弥补了FM对于不同的特征交互不能赋予不同权重的问题。

Neural Factorization Machine

NFM使用一个双交互池层来学习嵌入向量之间的特征交互，并将结果压缩成一个单一的向量，其大小与单一嵌入向量相同。MLP的输出对数和线性部分的输出对数相加，得到预测概率。

1
2

原文表述
NFM use a bi-interaction pooling layer to learn feature interaction between embedding vectors and compress the result into a singe vector which has the same size as a single embedding vector. And then fed it into a MLP.The output logit of MLP and the output logit of linear part are added to get the prediction probability.

FM的性能可能受到其线性的限制，以及仅对成对（即二阶）特征的相互作用进行建模。特别是，对于具有复杂和非线性基础结构的真实世界数据，FM可能无法表达。

NFMs:一个用于稀疏数据预测的新模型,将线性分解机的有效性与非线性神经网络的强大表示能力结合起来，用于稀疏预测分析。通过对高阶和非线性特征的相互作用的建模，增强了FMs的功能。NFM结构的关键是新提出的双交互操作。

xDeepFM

xDeepFM可以自动学习显性和隐性的高阶特征交互，这对于减少人工特征工程的工作具有重要意义。它是将一个CIN和一个DNN纳入一个端到端的框架中所产生的。

原文表述

1	Thus xDeepFM can automatically learn high-order feature interactions in both explicit and implicit fashions, which is of great significance to reducing manual feature engineering work.

xDeepFM使用压缩交互网络（CIN）来显式学习低阶和高阶特征交互，并使用MLP来隐式学习特征交互。在CIN的每一层，首先计算$x^k$和$x0$之间的外积，得到一个张量$Z{k+1}$，然后使用1DConv来学习这个张量上的特征图$H_{k+1}$。最后，对所有的特征图$H_k$应用总和池，得到一个向量。该向量用于计算CIN的贡献对数。

原文表述

xDeepFM use a Compressed Interaction Network (CIN) to learn both low and high order feature interaction explicitly,and use a MLP to learn feature interaction implicitly. In each layer of CIN,first compute outer products between $x^k$ and $x_0$ to get a tensor $Z_{k+1}$,then use a 1DConv to learn feature maps $H_{k+1}$ on this tensor. Finally,apply sum pooling on all the feature maps $H_k$ to get one vector.The vector is used to compute the logit that CIN contributes.

Deep Interest Network

出发点

在传统的深度CTR模型中，使用固定长度的表示法是捕捉用户兴趣多样性的一个瓶颈。用户的各种兴趣被压缩到一个固定长度的向量中，这限制了嵌入和MLP方法的表达能力。

深度兴趣网络（DIN），它通过考虑到候选广告的历史行为的相关性，自适应地计算用户兴趣的表示向量。

原文表述

1	Deep Interest Network (DIN), which adaptively calculates the representation vector of user interests by taking into consideration the relevance of historical behaviors given a candidate ad.

DIN引入了一种注意力方法来学习序列（多值）特征。传统的方法通常在序列特征上使用和/均值池。DIN使用一个局部激活单元来获得候选项目和历史项目之间的激活分数。用户的兴趣由用户行为的加权和表示，用户的兴趣向量和其他嵌入向量被连接起来，并输入MLP得到预测。

原文表述

DIN introduce a attention method to learn from sequence(multi-valued) feature. Tradional method usually use sum/mean pooling on sequence feature. DIN use a local activation unit to get the activation score between candidate item and history items. User’s interest are represented by weighted sum of user behaviors. user’s interest vector and other embedding vectors are concatenated and fed into a MLP to get the prediction.

Deep Interest Evolution Network

出发点

1
2

1.包括DIN在内的大多数兴趣模型都将行为直接视为兴趣，而潜在的兴趣则很难通过显性行为完全反映出来。
2.用户的兴趣是不断变化的，捕捉兴趣的动态对于兴趣的表达是非常重要的。

深度兴趣进化网络（DIEN）使用兴趣提取器层，从历史行为序列中捕捉时间性兴趣。在这一层，提出了一个辅助损失来监督每一步的兴趣提取。由于用户的兴趣是多样化的，特别是在电子商务系统中，兴趣演化层被提出来捕捉与目标项目有关的兴趣演化过程。在兴趣演化层，注意力机制被新颖地嵌入到顺序结构中，并且在兴趣演化过程中加强了相对兴趣的影响。

原文表述

Deep Interest Evolution Network (DIEN) uses interest extractor layer to capture temporal interests from history behavior sequence. At this layer, an auxiliary loss is proposed to supervise interest extracting at each step. As user interests are diverse, especially in the e-commerce system, interest evolving layer is proposed to capture interest evolving process that is relative to the target item. At interest evolving layer, attention mechanism is embedded into the sequential structure novelly, and the effects of relative interests are strengthened during interest evolution.

关于DIEN详情可参照本人的另一篇博客

1	https://perfect-player.github.io/2022/01/09/DIEN/

AutoInt

该模型能够以明确的方式自动学习高阶特征的相互作用。方法的关键的关键是新引入的交互层，它允许每个特征与其他特征交互，并通过学习来确定相关性。

原文表述

1	The key to our method is the newly-introduced interacting layer, which allows each feature to interact with the others and to determine the relevance through learning.

AutoInt使用交互层来模拟不同特征之间的相互作用。在每个交互层中，每个特征都被允许与其他所有的特征进行交互，并且能够自动识别相关的特征，通过多头关注机制形成有意义的高阶特征。通过堆叠多个交互层，AutoInt能够对不同等级的特征交互进行建模。

原文表述

AutoInt use a interacting layer to model the interactions between different features. Within each interacting layer, each feature is allowed to interact with all the other features and is able to automatically identify relevant features to form meaningful higher-order features via the multi-head attention mechanism. By stacking multiple interacting layers,AutoInt is able to model different orders of feature interactions.

ONN

出发点

1	很少有工作专注于改进由嵌入层学习的特征表示

与传统的特征嵌入方法相比，操作感知嵌入方法为所有操作学习一种表征。

原文表述

1	Compared with the traditional feature embedding method which learns one representation for all operations, operation-aware embedding can learn various representations for different operations.

ONN对二阶特征交互进行建模，就像FFM一样，并尽可能地保留二阶交互信息。此外，深度神经网络被用来学习高阶特征交互。

原文表述

ONN models second order feature interactions like like FFM and preserves second-order interaction information as much as possible.Further more,deep neural network is used to learn higher-ordered feature interactions.

FiBiNET(Feature Importance and Bilinear feature Interaction NETwork)

提出了特征重要性和双线性特征交互网络，以动态学习特征重要性和细粒度的特征交互。一方面，FiBiNET可以通过Squeeze-Excitation网络（SENET）机制动态地学习特征的重要性；另一方面，它能够通过双线性函数有效地学习特征的相互作用。

原文表述

Feature Importance and Bilinear feature Interaction NETwork is proposed to dynamically learn the feature importance and fine-grained feature interactions. On the one hand, the FiBiNET can dynamically learn the importance of fea- tures via the Squeeze-Excitation network (SENET) mechanism; on the other hand, it is able to effectively learn the feature interactions via bilinear function.

目的

于动态学习特征重要性和细粒度的特征相互作用。

1	proposed to dynamically learn the feature importance and finegrained feature interactions.

优势

1.对于CTR任务。SENET模块可以动态地学习特征的重要性。它提高了重要特征的权重，抑制了不重要特征的权重。
2.引入了三种类型的双线性交互层来学习特征的交互作用，而不是通过计算特征的交互作用。
3.在浅层模型中，将SENET机制与双线性特征交互结合起来，优于其他浅层、
4.将经典的深度神经网络（DNN）组件与浅层模型相结合，成为一个深度模型。

 1) For CTR task,the SENET module can learn the importance of features dynamically. It boosts the weight of the important feature and suppresses the weight of unimportant features. 
 2) We introduce three types of Bilinear-Interaction layers to learn feature interaction rather
than calculating the feature interactions with Hadamard product or inner product.
3) Combining the SENET mechanism with bilinear feature interaction in our shallow model outperforms other shallow models such as FM and FFM.
4) In order to improve performance further, we combine a classical deep neural network(DNN) component with the shallow model to be a deep model. The deep FiBiNET consistently outperforms the other state-of-the-art deep models such as DeepFM and XdeepFM.

IFM

输入感知因子机（IFM）通过神经网络为不同实例中的同一特征学习一个独特的输入感知因子。

1	Input-aware Factorization Machine (IFM) learns a unique input-aware factor for the same feature in different instances via a neural network.

适用于稀疏的数据集。它的目的是通过有目的地学习更灵活、更准确的特征，来增强传统的FM。

1	It aims to enhance traditional FMs by purposefully learning more flexible and accurate representation of features for different instances with the help of a factor estimating network.

IFM的两个主要优势

1 2	1.与现有的技术相比，它能产生更好的预测结果 2.它能更深入地了解每个特征在预测任务中的作用。

1 2	i). it produces better prediction results compared to existing techniques ii). it provides deeper insights into the role that each feature plays in the prediction task.

DCN V2

以一种富有表现力而又简单的方式为显式交叉建模，观察到交叉网络中权重矩阵的低秩性质。

1	Observing the low-rank nature of the weight matrix in the cross network

DIFM

双输入感知因式分解机（DIFM）可以同时在比特级和矢量级对原始特征表示进行自适应的重新加权。此外，DIFM战略性地将包括多头自适应、残差网络和DNN在内的各种组件整合到一个统一的端到端模型中。

Dual Inputaware Factorization Machines (DIFM) can adaptively reweight the original feature representations at the bit-wise and vector-wise levels simultaneously.Furthermore, DIFMs strategically integrate various components including Multi-Head Self-Attention, Residual Networks and DNNs into a unified end-to-end model.

目的是根据不同的输入实例，借助DIFMs，自适应地学习一个给定特征的灵活表示。

1	It aims to adaptively learn flexible representations of a given feature according to different input instances with the help of the Dual-Factor Estimating Network (Dual-FEN).

主要优点是它不仅能在比特级，而且能在矢量级同时有效地学习输入感知因子（用于重新加权原始特征表示）。

The major advantage of DIFM is that it can effectively learn the inputaware factors (used to reweight the original feature representations) not only at the bit-wise level but also at the vectorwise level imultaneously.

AFN

自适应因子化网络（AFN）可以从数据中自适应地学习任意等级的交叉特征。AFN的核心是一个对数转换层，将特征组合中每个特征的功率转换成要学习的系数。

Adaptive Factorization Network (AFN) can learn arbitrary-order cross features adaptively from data. The core of AFN is a logarith- mic transformation layer to convert the power of each feature in a feature combination into the coefficient to be learned.

AFN能够从数据中自适应地学习任意顺序的特征互动。而不是在一个固定的最大顺序内对所有的交叉特征进行明确的建模。
AFN能够自动生成辨别性的交叉特征和相应特征的权重。

1
2

learns arbitrary-order feature interactions adaptively from data. Instead of explicitly modeling all the cross features within a fixed maximum order,AFN is able to generate discriminative cross features and
the weights of the corresponding features automatically.