Commonly Used Data Mining & Machine Learning Knowledge Points

MSE (Mean Square Error), LMS (Least Mean Square), LSM (Least Square Methods), MLE (Maximum Likelihood Estimation), QP (Quadratic Programming), CP (Conditional Probability), JP (Joint Probability), MP (Marginal Probability), Bayesian Formula, L1/L2 Regularization (and further variants, such as the currently popular L2.5 regularization), GD (Gradient Descent), SGD (Stochastic Gradient Descent), Eigenvalue, Eigenvector, QR Decomposition, Quantile, Covariance Matrix.

Discrete Distribution: Bernoulli Distribution / Binomial Distribution, Negative Binomial Distribution, Multinomial Distribution, Geometric Distribution, Hypergeometric Distribution, Poisson Distribution.

Continuous Distribution: Uniform Distribution, Normal Distribution / Gaussian Distribution, Exponential Distribution, Lognormal Distribution, Gamma Distribution, Beta Distribution, Dirichlet Distribution, Rayleigh Distribution, Cauchy Distribution, Weibull Distribution.

Three Sampling Distributions: Chi-square Distribution, t-distribution, F-distribution.

Data Pre-processing: Missing Value Imputation, Discretization, Mapping, Normalization (normalization/standardization).

Sampling: Simple Random Sampling, Offline Sampling (offline equal-probability sampling of K items), Online Sampling (online equal-probability sampling of K items), Ratio-based Sampling (proportional random sampling), Acceptance-rejection Sampling, Importance Sampling, MCMC (Markov Chain Monte Carlo sampling: Metropolis-Hastings & Gibbs).

K-Means, K-Medoids, Bisecting K-Means, FK-Means, Canopy, Spectral-KMeans (spectral clustering), GMM-EM (Gaussian mixture model, solved with the expectation-maximization algorithm), K-Prototypes, CLARANS (partition-based), BIRCH (hierarchy-based), CURE (hierarchy-based), DBSCAN (density-based), CLIQUE (density-based and grid-based), the density-based clustering algorithm published in Science in 2014, etc.

Clustering Effectiveness Evaluation: Purity, RI (Rand Index), ARI (Adjusted Rand Index), NMI (Normalized Mutual Information), F-measure, etc.

Classification & Regression: LR (Linear Regression), LR (Logistic Regression), SR (Softmax Regression, i.e. multi-class logistic regression), GLM (Generalized Linear Model), RR (Ridge Regression, i.e. L2-regularized least squares regression), LASSO (Least Absolute Shrinkage and Selection Operator, i.e. L1-regularized least squares regression), RF (Random Forest), DT (Decision Tree), GBDT (Gradient Boosting Decision Tree), CART (Classification And Regression Tree), KNN (K-Nearest Neighbor), SVM (Support Vector Machine, including SVC for classification and SVR for regression), KF (Kernel Function: Polynomial Kernel Function, Gaussian Kernel Function / Radial Basis Function (RBF), String Kernel Function), NB (Naive Bayes), BN (Bayesian Network / Bayesian Belief Network / Belief Network), LDA (Linear Discriminant Analysis / Fisher Linear Discriminant), EL (Ensemble Learning: Boosting, Bagging, Stacking), AdaBoost (Adaptive Boosting), MEM (Maximum Entropy Model).

Classification Effectiveness Evaluation: Confusion Matrix, Precision, Recall, Accuracy, F-score, ROC Curve, AUC (area under the ROC curve), Lift Curve, KS Curve.

PGM (Probabilistic Graphical Models): BN (Bayesian Network / Bayesian Belief Network / Belief Network), MC (Markov Chain), HMM (Hidden Markov Model), MEMM (Maximum Entropy Markov Model), CRF (Conditional Random Field), MRF (Markov Random Field).

NN (Neural Network): ANN (Artificial Neural Network), BP (Error Back Propagation), HN (Hopfield Network), Recurrent Neural Network, the clock-driven recurrent neural network (ICML 2014), etc.

Auto-encoder, SAE (Stacked Auto-encoders: Sparse Auto-encoders, Denoising Auto-encoders, Contractive Auto-encoders), RBM (Restricted Boltzmann Machine), DBN (Deep Belief Network), CNN (Convolutional Neural Network), Word2Vec (word-vector learning model).

LDA (Linear Discriminant Analysis / Fisher Linear Discriminant), PCA (Principal Component Analysis), ICA (Independent Component Analysis), SVD (Singular Value Decomposition), FA (Factor Analysis).

VSM (Vector Space Model), Word2Vec (word-vector learning model), TF (Term Frequency), TF-IDF (Term Frequency-Inverse Document Frequency), MI (Mutual Information), ECE (Expected Cross Entropy), QEMI (quadratic information entropy), IG (Information Gain), IGR (Information Gain Ratio), Gini (Gini coefficient), χ² Statistic, TEW (Text Evidence Weight), OR (Odds Ratio), N-Gram Model, LSA (Latent Semantic Analysis), PLSA (Probabilistic Latent Semantic Analysis), LDA (Latent Dirichlet Allocation), SLM (Statistical Language Model), NPLM (Neural Probabilistic Language Model), CBOW (Continuous Bag of Words Model), Skip-gram (Skip-gram Model), etc.

Association Mining: Apriori, FP-growth (Frequent Pattern Tree Growth), AprioriAll, Spade.

DBR (Demographic-based Recommendation), CBR (Content-based Recommendation), CF (Collaborative Filtering), UCF (User-based Collaborative Filtering Recommendation), ICF (Item-based Collaborative Filtering Recommendation).

Similarity Measure & Distance Measure: Euclidean Distance, Manhattan Distance, Chebyshev Distance, Minkowski Distance, Standardized Euclidean Distance, Mahalanobis Distance, Cos (Cosine), Hamming Distance / Edit Distance, Jaccard Distance, Correlation Coefficient Distance, Information Entropy, KL (Kullback-Leibler Divergence / Relative Entropy).

Unconstrained Optimization: Cyclic Variable Methods, Pattern Search Methods, Variable Simplex Methods, Gradient Descent Methods, Newton Methods, Quasi-Newton Methods, Conjugate Gradient Methods.

Constrained Optimization: Approximation Programming Methods, Feasible Direction Methods, Penalty Function Methods, Multiplier Methods.

Heuristic Algorithm: SA (Simulated Annealing), GA (Genetic Algorithm).

Mutual Information, Document Frequency, Information Gain, Chi-squared Test, Gini (Gini coefficient).

Outlier Detection: Statistic-based, Distance-based, Density-based, Clustering-based.

Learning to Rank: Pointwise: McRank; Pairwise: RankingSVM, RankNet, FRank, RankBoost; Listwise: AdaRank, SoftRank, LambdaMART.

MPI, the Hadoop ecosystem, Spark, BSP, Weka, Mahout, Scikit-learn, PyBrain… as well as some specific business scenarios and cases.
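
Several of the distance measures named in the Similarity Measure & Distance Measure group have short closed-form definitions. The following is a minimal sketch in plain Python (standard library only). The function names are illustrative assumptions and do not come from the list itself, and the inputs are assumed to be equal-length sequences (or non-empty sets for the Jaccard case).

```python
import math


def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def euclidean(x, y):
    """Euclidean distance: the Minkowski distance with p = 2."""
    return minkowski(x, y, 2)


def manhattan(x, y):
    """Manhattan distance: the Minkowski distance with p = 1."""
    return minkowski(x, y, 1)


def chebyshev(x, y):
    """Chebyshev distance: the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def hamming(s, t):
    """Hamming distance: number of positions where two sequences differ."""
    return sum(a != b for a, b in zip(s, t))


def jaccard_distance(a, b):
    """Jaccard distance: 1 minus |intersection| / |union| of two sets."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)
```

Euclidean and Manhattan distance are written here as special cases of the Minkowski distance to make the relationship between the three explicit; Chebyshev distance is the limit of the same family as p grows without bound, so it is implemented directly as a maximum.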
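Later, when I have the chance, I will write broader summaries of these knowledge points. If you find any mistakes, please point them out. Please credit the source when reposting.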