
Proceedings in Adaptation, Learning and Optimization 11

Jiuwen Cao • Chi Man Vong • Yoan Miche • Amaury Lendasse
Editors

Proceedings of ELM 2018


Proceedings in Adaptation, Learning and Optimization

Volume 11

Series Editors

Meng-Hiot Lim, Nanyang Technological University, Singapore, Singapore
Yew Soon Ong, Nanyang Technological University, Singapore, Singapore


The roles of adaptation, learning and optimization are becoming increasingly essential and intertwined. The capability of a system to adapt, either through modification of its physiological structure or via some revalidation process of internal mechanisms that directly dictate the response or behavior, is crucial in many real-world applications. Optimization lies at the heart of most machine learning approaches, while learning and optimization are two primary means to effect adaptation in various forms. They usually involve computational processes incorporated within the system that trigger parametric updating and knowledge or model enhancement, giving rise to progressive improvement. This book series serves as a channel to consolidate work related to topics linked to adaptation, learning and optimization in systems and structures. Topics covered under this series include:

• complex adaptive systems including evolutionary computation, memetic computing, swarm intelligence, neural networks, fuzzy systems, tabu search, simulated annealing, etc.

• machine learning, data mining & mathematical programming

• hybridization of techniques that span across artificial intelligence and computational intelligence for synergistic alliance of strategies for problem-solving

• aspects of adaptation in robotics

• agent-based computing

• autonomic/pervasive computing

• dynamic optimization/learning in noisy and uncertain environments

• systemic alliance of stochastic and conventional search techniques

• all aspects of adaptations in man-machine systems.

This book series bridges the dichotomy of modern and conventional mathematical and heuristic/meta-heuristic approaches to bring about effective adaptation, learning and optimization. It propels the maxim that the old and the new can come together and be combined synergistically to scale new heights in problem-solving. To reach such a level, numerous research issues will emerge, and researchers will find the book series a convenient medium to track the progress made.

** Indexing: The books of this series are submitted to ISI Proceedings, DBLP, Google Scholar and Springerlink **

More information about this series at http://www.springer.com/series/13543


Jiuwen Cao • Chi Man Vong • Yoan Miche • Amaury Lendasse
Editors

Proceedings of ELM 2018



Editors

Jiuwen Cao
Institute of Information and Control
Hangzhou Dianzi University
Xiasha, Hangzhou, China

Chi Man Vong
Department of Computer and Information Science
University of Macau
Taipa, Macao

Yoan Miche
Nokia Bell Labs
Cybersecurity Research
Espoo, Finland

Amaury Lendasse
Department of Information and Logistics Technology
University of Houston
Houston, TX, USA

ISSN 2363-6084    ISSN 2363-6092 (electronic)
Proceedings in Adaptation, Learning and Optimization
ISBN 978-3-030-23306-8    ISBN 978-3-030-23307-5 (eBook)
https://doi.org/10.1007/978-3-030-23307-5

© Springer Nature Switzerland AG 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.


Contents

Random Orthogonal Projection Based Enhanced Bidirectional Extreme Learning Machine . . . . . 1
Weipeng Cao, Jinzhu Gao, Xizhao Wang, Zhong Ming, and Shubin Cai

A Novel Feature Specificity Enhancement for Taste Recognition by Electronic Tongue . . . . . 11
Yanbing Chen, Tao Liu, Jianjun Chen, Dongqi Li, and Mengya Wu

Comparison of Classification Methods for Very High-Dimensional Data in Sparse Random Projection Representation . . . . . 17
Anton Akusok and Emil Eirola

A Robust and Dynamically Enhanced Neural Predictive Model for Foreign Exchange Rate Prediction . . . . . 27
Lingkai Xing, Zhihong Man, Jinchuan Zheng, Tony Cricenti, and Mengqiu Tao

Alzheimer's Disease Computer Aided Diagnosis Based on Hierarchical Extreme Learning Machine . . . . . 37
Zhongyang Wang, Junchang Xin, Yue Zhao, and Qiyong Guo

Key Variables Soft Measurement of Wastewater Treatment Process Based on Hierarchical Extreme Learning Machine . . . . . 45
Feixiang Zhao, Mingzhe Liu, Binyang Jia, Xin Jiang, and Jun Ren

A Fast Algorithm for Sparse Extreme Learning Machine . . . . . 55
Zhihong Miao and Qing He

Extreme Latent Representation Learning for Visual Classification . . . . . 65
Tan Guo, Lei Zhang, and Xiaoheng Tan

An Optimized Data Distribution Model for ElasticChain to Support Blockchain Scalable Storage . . . . . 76
Dayu Jia, Junchang Xin, Zhiqiong Wang, Wei Guo, and Guoren Wang



An Algorithm of Sina Microblog User's Sentimental Influence Analysis Based on CNN+ELM Model . . . . . 86
Donghong Han, Fulin Wei, Lin Bai, Xiang Tang, TingShao Zhu, and Guoren Wang

Extreme Learning Machine Based Intelligent Condition Monitoring System on Train Door . . . . . 98
Xin Sun, K. V. Ling, K. K. Sin, and Lawrence Tay

Character-Level Hybrid Convolutional and Recurrent Neural Network for Fast Text Categorization . . . . . 108
Bing Liu, Yong Zhou, and Wei Sun

Feature Points Selection for Rectangle Panorama Stitching . . . . . 118
Weiqing Yan, Shuigen Wang, Guanghui Yue, Jindong Xu, Xiangrong Tong, and Laihua Wang

Point-of-Interest Group Recommendation with an Extreme Learning Machine . . . . . 125
Zhen Zhang, Guoren Wang, and Xiangguo Zhao

Research on Recognition of Multi-user Haptic Gestures . . . . . 134
Lu Fang, Huaping Liu, and Yanzhi Dong

Benchmarking Hardware Accelerating Techniques for Extreme Learning Machine . . . . . 144
Liang Li, Guoren Wang, Gang Wu, and Qi Zhang

An Event Recommendation Model Using ELM in Event-Based Social Network . . . . . 154
Boyang Li, Guoren Wang, Yurong Cheng, and Yongjiao Sun

Reconstructing Bifurcation Diagrams of a Chaotic Neuron Model Using an Extreme Learning Machine . . . . . 163
Yoshitaka Itoh and Masaharu Adachi

Extreme Learning Machine for Multi-label Classification . . . . . 173
Haigang Zhang, Jinfeng Yang, Guimin Jia, and Shaocheng Han

Accelerating ELM Training over Data Streams . . . . . 182
Hangxu Ji, Gang Wu, and Guoren Wang

Predictive Modeling of Hospital Readmissions with Sparse Bayesian Extreme Learning Machine . . . . . 191
Nan Liu, Lian Leng Low, Sean Shao Wei Lam, Julian Thumboo, and Marcus Eng Hock Ong

Rising Star Classification Based on Extreme Learning Machine . . . . . 197
Yuliang Ma, Ye Yuan, Guoren Wang, Xin Bi, Zhongqing Wang, and Yishu Wang



Hand Gesture Recognition Using Clip Device Applicable to Smart Watch Based on Flexible Sensor . . . . . 207
Sung-Woo Byun, Da-Kyeong Oh, MyoungJin Son, Ju Hee Kim, Ye Jin Lee, and Seok-Pil Lee

Receding Horizon Optimal Control of Hybrid Electric Vehicles Using ELM-Based Driver Acceleration Rate Prediction . . . . . 216
Jiangyan Zhang, Fuguo Xu, Yahui Zhang, and Tielong Shen

CO-LEELM: Continuous-Output Location Estimation Using Extreme Learning Machine . . . . . 226
Felis Dwiyasa and Meng-Hiot Lim

Unsupervised Absent Multiple Kernel Extreme Learning Machine . . . . . 236
Lingyun Xiang, Guohan Zhao, Qian Li, and Zijie Zhu

Intelligent Machine Tools Recognition Based on Hybrid CNNs and ELMs Networks . . . . . 247
Kun Zhang, Lu-Lu Tang, Zhi-Xin Yang, and Lu-Qing Luo

Scalable IP Core for Feed Forward Random Networks . . . . . 253
Anurag Daram, Karan Paluru, Vedant Karia, and Dhireesha Kudithipudi

Multi-objective Artificial Bee Colony Algorithm with Information Learning for Model Optimization of Extreme Learning Machine . . . . . 263
Hao Zhang, Dingyi Zhang, and Tao Ku

Short Term PV Power Forecasting Using ELM and Probabilistic Prediction Interval Formation . . . . . 273
Jatin Verma, Xu Yan, Junhua Zhao, and Zhao Xu

A Novel ELM Ensemble for Time Series Prediction . . . . . 283
Zhen Li, Karl Ratner, Edward Ratner, Kallin Khan, Kaj-Mikael Bjork, and Amaury Lendasse

An ELM-Based Ensemble Strategy for POI Recommendation . . . . . 292
Xue He, Tiancheng Zhang, Hengyu Liu, and Ge Yu

A Method Based on S-transform and Hybrid Kernel Extreme Learning Machine for Complex Power Quality Disturbances Classification . . . . . 303
Chen Zhao, Kaicheng Li, and Xuebin Xu

Sparse Bayesian Learning for Extreme Learning Machine Auto-encoder . . . . . 319
Guanghao Zhang, Dongshun Cui, Shangbo Mao, and Guang-Bin Huang



A Soft Computing-Based Daily Rainfall Forecasting Model Using ELM and GEP . . . . . 328
Yuzhong Peng, Huasheng Zhao, Jie Li, Xiao Qin, Jianping Liao, and Zhiping Liu

Comparing ELM with SVM in the Field of Sentiment Classification of Social Media Text Data . . . . . 336
Zhihuan Chen, Zhaoxia Wang, Zhiping Lin, and Ting Yang

Author Index . . . . . 345



Random Orthogonal Projection Based Enhanced Bidirectional Extreme Learning Machine

Weipeng Cao1, Jinzhu Gao2, Xizhao Wang1, Zhong Ming1, and Shubin Cai1

1 College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
[email protected], {xzwang,mingz,shubin}@szu.edu.cn
2 School of Engineering and Computer Science, University of the Pacific, Stockton 95211, CA, USA
[email protected]

Abstract. Bidirectional extreme learning machine (B-ELM) divides the learning process into two parts: at each odd learning step, the parameters of the new hidden node are generated randomly, while at each even learning step, the parameters of the new hidden node are obtained analytically from the parameters of the former node. However, some of the odd-hidden nodes play a minor role, which has a negative impact on the even-hidden nodes and results in a sharp rise in network complexity. To avoid this issue, we propose a random orthogonal projection based enhanced bidirectional extreme learning machine algorithm (OEB-ELM). In OEB-ELM, several orthogonal candidate nodes are generated randomly at each odd learning step, and only the node with the largest residual error reduction is added to the existing network. Experiments on six real datasets show that OEB-ELM has better generalization performance and stability than the B-ELM, EB-ELM, and EI-ELM algorithms.

Keywords: Extreme learning machine · Bidirectional extreme learning machine · Random orthogonal projection

1 Introduction

The training mechanism of traditional single hidden layer feed-forward neural networks (SLFN) is that the input weights and hidden biases are randomly assigned initial values and then iteratively tuned with methods such as gradient descent until the residual error reaches the expected value. This method has several notorious drawbacks, such as a slow convergence rate and the local minima problem.

Different from traditional SLFN, neural networks with random weights (NNRW) train models in a non-iterative way [1, 2]. In NNRW, the input weights and hidden biases are randomly generated from a given range and kept fixed throughout the training


process, while the output weights are obtained by solving a linear system of matrix equations. Compared with traditional SLFN, NNRW can learn faster with acceptable accuracy.

Extreme learning machine (ELM) is a typical NNRW, which was proposed by Huang et al. in 2004 [3]. ELM inherits the advantages of NNRW and extends it to a unified form. In recent years, many ELM based algorithms have been proposed [4–6] and applied to various fields such as unsupervised learning [7] and traffic sign recognition [8]. Although ELM and its variants have achieved many interesting results, there are still several important problems that have not been solved thoroughly, one of which is the determination of the number of hidden nodes [1, 9].

In recent years, many algorithms have been proposed to determine the number of hidden nodes. We can group them into two categories: incremental and pruning strategies. With the incremental strategy, the model begins with a small initial network and then gradually adds new hidden nodes until the desired accuracy is achieved. Some notable incremental algorithms include I-ELM [10], EI-ELM [11], B-ELM [12], EB-ELM [13], etc. With the pruning strategy, the model begins with a larger than necessary network and then cuts off the redundant or less effective hidden nodes. Some notable pruning algorithms include P-ELM [14], OP-ELM [15], etc.

This paper focuses on optimizing the performance of the existing B-ELM algorithm. In B-ELM [12], the authors divided the learning process into two parts: the odd and the even learning steps. At the odd learning steps, the new hidden node is generated randomly at one time, while at the even learning steps, the new hidden node is determined by a formula defined by the parameters of the formerly added node. Compared with fully random incremental algorithms such as I-ELM and EI-ELM, B-ELM shows a much faster convergence rate.

From the above analysis, we can infer that the hidden nodes generated at the odd learning steps (the odd-hidden nodes) play an important role in B-ELM models. However, the quality of the odd-hidden nodes cannot be guaranteed. Actually, some of them may play a minor role, which will cause a sharp rise in the network complexity. The initial motivation of this study is to alleviate this issue.

The orthogonalization technique is one of the effective approaches for parameter optimization. Wang et al. [16] proved that the ELM model with random orthogonal projection has a better capability of sample structure preserving (SSP). Kasun et al. [17] stacked ELM auto-encoders into a deep ELM architecture based on an orthogonal weight matrix. Huang et al. [18] orthogonalized the input weight matrix when building the local receptive fields based ELM model.

Inspired by the above works, in this study, we propose a novel random orthogonal projection based enhanced bidirectional extreme learning machine algorithm (OEB-ELM). In OEB-ELM, at each odd learning step, we first randomly generate K candidate hidden nodes and orthogonalize them into orthogonal hidden nodes based on the Gram-Schmidt orthogonalization method. Then we train it as an initial model for hidden node selection. After obtaining the corresponding value of residual error reduction for each candidate node, the one with the largest residual error reduction is selected as the final odd-hidden node and added to the existing network. The even-hidden nodes are obtained in the same way as in B-ELM and EB-ELM.


Our main contributions in this study are as follows.

(1) The odd learning steps in B-ELM are optimized and better hidden nodes can be obtained. Compared with the B-ELM, the proposed algorithm achieves models with better generalization performance and a smaller network structure.

(2) The method to set the number of candidate hidden nodes in EB-ELM is improved. In OEB-ELM, the number of candidate hidden nodes is automatically determined according to the data attributes, which can effectively improve the computational efficiency and reduce the human intervention in the model.

(3) The random orthogonal projection technique is used to improve the SSP capability of the candidate hidden node selection model, and thus the quality of the hidden nodes is further improved. Experiments on six UCI regression datasets have demonstrated the efficiency of our method.

The organization of this paper is as follows: Sect. 2 briefly reviews the related algorithms. The proposed OEB-ELM is described in Sect. 3. The details of the experiment results and analysis are given in Sect. 4. Section 5 concludes the paper.

2 Review of ELM, I-ELM, B-ELM and EB-ELM

A typical network structure of ELM with a single hidden layer is shown in Fig. 1. The training mechanism of ELM is that the input weights ω and hidden biases b are generated randomly from a given range and kept fixed throughout the training process, while the output weights β are obtained by solving a system of matrix equations.

Fig. 1. A basic ELM neural network structure


The above ELM network can be modeled as

Σ_{i=1}^{L} β_i g(ω_i · x_j + b_i) = t_j,   ω_i ∈ R^n, b_i ∈ R, j = 1, ..., N    (1)

where g(·) denotes the activation function, t_j denotes the actual value of each sample, and N is the size of the dataset. Equation (1) can be rewritten as

Hβ = T    (2)

where

H = [ g(ω_1 · x_1 + b_1)  ...  g(ω_L · x_1 + b_L)
      ...                 ...  ...
      g(ω_1 · x_N + b_1)  ...  g(ω_L · x_N + b_L) ]_{N×L},

β = [β_1^T; ...; β_L^T]_{L×m},   T = [t_1^T; ...; t_N^T]_{N×m}.

In Eq. (2), H represents the hidden layer output matrix of ELM, and the output weights β can be obtained by

β = H^+ T    (3)

where H^+ is the Moore–Penrose generalized inverse of H. The residual error measures the closeness between the current network f_n with n hidden nodes and the target function f, which can be summarized as

e_n = f_n − f    (4)
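The closed-form training of Eqs. (2)–(3) is straightforward to express in code. The following NumPy sketch is not the authors' implementation but a minimal illustration (function and variable names are ours) of fitting an ELM with a sigmoid hidden layer and a pseudoinverse solution:

import numpy as np

def elm_train(X, T, L, seed=0):
    # Random hidden layer (Eq. 2) and Moore-Penrose solution for beta (Eq. 3).
    rng = np.random.default_rng(seed)
    omega = rng.uniform(-1.0, 1.0, size=(X.shape[1], L))   # input weights, kept fixed
    b = rng.uniform(0.0, 1.0, size=L)                      # hidden biases, kept fixed
    H = 1.0 / (1.0 + np.exp(-(X @ omega + b)))             # sigmoid hidden-layer output
    beta = np.linalg.pinv(H) @ T                           # beta = H^+ T
    return omega, b, beta

def elm_predict(X, omega, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ omega + b)))
    return H @ beta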

In the I-ELM algorithm [10], random hidden nodes are added to the hidden layer one by one, and the parameters of the existing hidden nodes stay the same after a new hidden node is added. The output function f_n at the n-th step can be expressed as

f_n(x) = f_{n−1}(x) + β_n G_n(x)    (5)

where β_n denotes the output weights between the newly added hidden node and the output nodes, and G_n(x) is the corresponding output of the hidden node.

The I-ELM can automatically generate the network structure; however, the network structure is often very complex because some of the hidden nodes play a minor role in the network. To alleviate this issue, the EI-ELM [11] and B-ELM [12] algorithms were proposed. The core idea of the EI-ELM algorithm is to generate K candidate hidden nodes at each learning step and only select the one with the smallest residual error. Actually, the I-ELM is a special case of the EI-ELM when K = 1.


Different from the EI-ELM, the B-ELM divides the training process into two parts: the odd and the even learning steps. At each odd learning step (i.e., when the number of hidden nodes L ∈ {2n+1, n ∈ Z}), the new hidden node is generated randomly as in the I-ELM. At each even learning step (i.e., when the number of hidden nodes L ∈ {2n, n ∈ Z}), the parameters of the new hidden node are obtained by

ω_{2n} = g^{−1}(u(H_{2n})) x^{−1}    (6)

b_{2n} = √( mse( g^{−1}(u(H_{2n})) − ω_{2n} x ) )    (7)

H_{2n} = u^{−1}(g^{−1}(ω_{2n} x + b_{2n}))    (8)

where g^{−1} and u^{−1} denote the inverse functions of g and u, respectively.

From the training mechanism of the B-ELM mentioned above, we can infer that the hidden nodes generated at the odd learning steps have a significant impact on the model performance. However, the B-ELM cannot guarantee the quality of these hidden nodes. The odd-hidden nodes that play a minor role in the network will cause a sharp rise in the network complexity.

To avoid this issue, we proposed an enhanced random search method to optimize the odd learning steps of B-ELM, namely the EB-ELM algorithm [13]. In EB-ELM, at each odd learning step, K candidate hidden nodes are generated and only the one with the largest residual error reduction is selected. EB-ELM can achieve better generalization performance than B-ELM. However, the number of candidate nodes K in EB-ELM is assigned based on experience, which makes it difficult to balance computational efficiency and model performance.

3 The Proposed Algorithm

In this section, we propose a random orthogonal projection based enhanced bidirectional extreme learning machine (OEB-ELM) for regression problems.

Theorem 1. Suppose W_{K×K} is an orthogonal matrix, i.e., W^T W = I. Then for any X ∈ R^K, ‖WX‖² = ‖X‖².

Proof. ‖WX‖² = X^T W^T W X = X^T I X = ‖X‖².
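Theorem 1 is easy to verify numerically; the snippet below is our own check, not part of the paper. It orthogonalizes a random matrix (a QR decomposition is equivalent to Gram-Schmidt orthogonalization of its columns) and confirms that the projection preserves the norm:

import numpy as np

rng = np.random.default_rng(0)
K = 8
W, _ = np.linalg.qr(rng.standard_normal((K, K)))   # orthogonal K x K matrix, W^T W = I
x = rng.standard_normal(K)
print(np.allclose(W.T @ W, np.eye(K)))                          # True
print(np.allclose(np.linalg.norm(W @ x), np.linalg.norm(x)))    # True: ||Wx|| = ||x||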

From Theorem 1 and its proof, we can infer that the orthogonal projection can provide a good capability of sample structure preserving for the initial model, which will improve the performance of the model and ensure the good quality of the candidate nodes. The proposed OEB-ELM can be summarized as follows:


OEB-ELM Algorithm:

Input: a training dataset D = {(x_i, t_i)}_{i=1}^{N} ⊂ R^n × R, the number of hidden nodes L, an activation function G, a maximum number of hidden nodes L_max, and an expected error ε.
Output: the model structure and the output weights matrix β.

Step 1 (Initialization): Let the number of hidden nodes L = 0 and the residual error E = T. Set K = d, where K denotes the maximum number of trials of assigning candidate nodes at each odd learning step and d denotes the number of data attributes.

Step 2 (Learning step):
while L < L_max and E > ε do
  (a) Increase the number of hidden nodes L: L = L + 1.
  (b) If L ∈ {2n+1, n ∈ Z} then
        Generate randomly the input weights matrix W_random = [ω_(1), ω_(2), ..., ω_(j), ..., ω_(K)] and the random hidden bias matrix B_random = [b_(1), b_(2), ..., b_(K)]_{1×K};
        Orthogonalize W_random and B_random using the Gram-Schmidt orthogonalization method to obtain W_orth and B_orth, which satisfy W_orth^T W_orth = I and B_orth^T B_orth = 1, respectively.
        Calculate the temporary output weights β_temp according to β_temp = H_temp^+ T.
        for j = 1 : K
            Calculate the residual error E_(j) after pruning the j-th hidden node:
            E_(j) = T − H_residual · β_residual
        end for
  (c) Let j* = arg max_{1 ≤ j ≤ K} ‖E_(j)‖. Set ω_L = ω_orth(j*) and b_L = b_(j*). Update H_L for the new hidden node and calculate the residual error E after adding the L-th hidden node: E = E − H_L β_L.
      End if
  (d) If L ∈ {2n, n ∈ Z} then
        Calculate the error feedback function sequence H_L according to H_{2n} = e_{2n−1} · (β_{2n−1})^{−1}.
        Calculate the parameter pair (ω_L, b_L) and update H_L based on Eqs. (6), (7), and (8).
        Calculate the output weight β_L according to β_{2n} = ⟨e_{2n−1}, H_{2n}⟩ / ‖H_{2n}‖².
        Calculate E after adding the new hidden node L: E = E − H_L β_L.
      End if
End while
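A sketch of one odd learning step, written from our reading of the listing above rather than from the authors' code: a QR decomposition stands in for Gram-Schmidt orthogonalization, and the candidate whose removal leaves the largest residual error (i.e., the largest residual error reduction) is kept. The even step (Eqs. (6)–(8)) is omitted here.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def oeb_elm_odd_step(X, E, seed=0):
    # One odd step of OEB-ELM (our sketch): K = d orthogonal candidates, keep the best one.
    rng = np.random.default_rng(seed)
    N, d = X.shape
    K = d                                            # number of candidates = data attributes
    W_rand = rng.uniform(-1.0, 1.0, size=(d, K))     # K candidate input-weight vectors
    b_rand = rng.uniform(0.0, 1.0, size=K)           # K candidate biases
    W_orth, _ = np.linalg.qr(W_rand)                 # Gram-Schmidt: W_orth^T W_orth = I
    b_orth = b_rand / np.linalg.norm(b_rand)         # scale biases so that b^T b = 1

    H_temp = sigmoid(X @ W_orth + b_orth)            # hidden outputs of all K candidates
    beta_temp = np.linalg.pinv(H_temp) @ E           # temporary output weights

    scores = np.empty(K)
    for j in range(K):                               # residual error after pruning node j
        keep = [i for i in range(K) if i != j]
        scores[j] = np.linalg.norm(E - H_temp[:, keep] @ beta_temp[keep])

    j_star = int(np.argmax(scores))                  # its removal hurts most -> keep it
    return W_orth[:, j_star], b_orth[j_star]

After selecting (ω_L, b_L) this way, the output weight β_L and the residual E are updated exactly as in B-ELM/EB-ELM.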


4 Experimental Results and Analysis

In this section, we present the details of our experiment settings and results. Our experiments are conducted on six benchmark regression problems from the UCI machine learning repository [19], and the specification of these datasets is given in Table 1. We chose the Sigmoid function (i.e., G(x; ω, b) = 1 / (1 + exp(−(ωx + b)))) as the activation function of B-ELM, EB-ELM, EI-ELM, and OEB-ELM. The input weights ω are randomly generated from the range (−1, 1) and the hidden biases b are generated randomly from the range (0, 1) using a uniform sampling distribution. For each regression problem, the average results over 50 trials are obtained for each algorithm. In this study, we ran our experiments in the MATLAB R2014a environment on the same Windows 10 machine with an Intel Core i5 2.3 GHz CPU and 8 GB RAM.

Our experiments are conducted based on the following two questions:

(1) Under the same network structure, which algorithm can achieve the best generalization performance and stability?

(2) With the increase of the number of hidden nodes, which algorithm has the fastest convergence rate?

For Question (1), we set the same number of hidden nodes for the B-ELM, EB-ELM, EI-ELM, and OEB-ELM algorithms. The Root-Mean-Square Error of Testing (Testing RMSE), the Root-Mean-Square Error of Training (Training RMSE), the Standard Deviation of Testing RMSE (SD), and the learning time are selected as the indicators for performance testing. A smaller testing RMSE denotes better generalization performance of the algorithm and a smaller SD indicates better stability of the algorithm. The performance comparison of the four algorithms is shown in Table 2. It is noted that close results are underlined and the best results are in boldface.

From Table 2, we observe that the proposed OEB-ELM algorithm has the smallest testing RMSE and standard deviation on the six regression datasets, which means that OEB-ELM achieves better generalization performance and stability than B-ELM, EB-ELM, and EI-ELM. It is also noted that the EI-ELM algorithm runs the longest time on all datasets, which shows that the OEB-ELM algorithm is more efficient than the EI-ELM algorithm.

Table 1. Specification of six regression datasets

Name                            Training data  Testing data  Attributes
Airfoil Self-noise              750            753           5
Housing                         250            256           13
Concrete compressive strength   500            530           8
White wine quality              2000           2898          11
Abalone                         2000           2177          8
Red wine quality                800            799           11


From the above analysis, it can be seen that the proposed OEB-ELM algorithm achieves models with better generalization performance and stability than the B-ELM and EB-ELM algorithms. Compared with the EI-ELM algorithm, the OEB-ELM has higher computational efficiency and can achieve better generalization performance and stability in most cases.

For Question (2), we gradually increase the number of hidden nodes from 1 to 50 and record the corresponding testing RMSE in the process of adding the hidden nodes. The performance comparison of each algorithm with the increase of hidden nodes is shown in Fig. 2.

Figure 2 shows the changes of testing RMSE of the four algorithms with increasing hidden nodes on the Housing dataset. From Fig. 2, we can observe that the B-ELM, EB-ELM, and OEB-ELM algorithms achieve high accuracy with a few hidden nodes, which means that the three algorithms converge faster than the EI-ELM algorithm.

Table 2. Performance comparison of the EB-ELM, B-ELM, EI-ELM, and OEB-ELM

Datasets                        Algorithm  Learning time (s)  Standard deviation  Training RMSE  Testing RMSE
Airfoil self-noise              EB-ELM     5.5403             0.0025              0.0709         0.0726
                                B-ELM      1.2950             0.0032              0.0729         0.0749
                                EI-ELM     25.3281            0.2230              0.0469         0.2567
                                OEB-ELM    5.7078             0.0025              0.0715         0.0723
Housing                         EB-ELM     12.6897            0.0012              0.0182         0.0196
                                B-ELM      2.9103             0.0060              0.0214         0.0236
                                EI-ELM     23.9253            0.1165              0.0043         0.2232
                                OEB-ELM    15.8434            0.0010              0.0182         0.0192
Concrete compressive strength   EB-ELM     13.1275            0.0033              0.0216         0.0232
                                B-ELM      2.8516             0.0225              0.0229         0.0338
                                EI-ELM     24.5106            0.0323              0.0075         0.0439
                                OEB-ELM    13.3669            0.0012              0.0214         0.0222
White wine                      EB-ELM     13.1369            0.0015              0.0150         0.0167
                                B-ELM      3.1109             0.0071              0.0159         0.0181
                                EI-ELM     32.2019            75.0810             0.0107         44.2757
                                OEB-ELM    16.7169            0.0013              0.0146         0.0162
Abalone                         EB-ELM     12.7944            0.0079              0.0106         0.0143
                                B-ELM      3.0247             0.0149              0.0099         0.0206
                                EI-ELM     32.0638            0.0078              <0.0001        0.0164
                                OEB-ELM    11.1956            0.0071              0.0130         0.0136
Red wine                        EB-ELM     3.5128             0.0064              0.0608         0.0659
                                B-ELM      0.8991             0.0044              0.0625         0.0652
                                EI-ELM     7.1844             0.0060              0.0522         0.0657
                                OEB-ELM    4.1181             0.0020              0.0589         0.0620


We also observe that the OEB-ELM algorithm has smaller testing RMSE and fluctuation than the other algorithms, which means that the OEB-ELM algorithm achieves models with better generalization performance and stability than the B-ELM, EB-ELM, and EI-ELM algorithms. Similar results can be found in the other cases.

5 Conclusions

In this study, we proposed a novel random orthogonal projection based enhanced bidirectional extreme learning machine algorithm (OEB-ELM) for regression problems. In OEB-ELM, the odd-hidden nodes are optimized using the random orthogonal projection method and an improved enhanced random search method. Compared with B-ELM, OEB-ELM has better generalization performance and a smaller network structure. Compared with EB-ELM, the number of candidate hidden nodes in OEB-ELM can be automatically determined from the data attributes. Note that both EB-ELM and B-ELM are specific cases of OEB-ELM. Specifically, EB-ELM is the non-automated and non-orthogonal OEB-ELM, while B-ELM is the non-orthogonal OEB-ELM with the number of candidate nodes K = 1.

Acknowledgment. This research was supported by the National Natural Science Foundation of China (61672358).

Fig. 2. The testing RMSE updating curves of the EB-ELM, B-ELM, EI-ELM, and OEB-ELM


References

1. Cao, W.P., Wang, X.Z., Ming, Z., Gao, J.Z.: A review on neural networks with random weights. Neurocomputing 275, 278–287 (2018)

2. Cao, J.W., Lin, Z.P.: Extreme learning machines on high dimensional and large data applications: a survey. Math. Prob. Eng. 2015, 1–13 (2015)

3. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: a new learning scheme of feedforward neural networks. In: Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, vol. 2, pp. 985–990 (2004)

4. Zhang, L., Zhang, D.: Evolutionary cost-sensitive extreme learning machine. IEEE Trans. Neural Netw. Learn. Syst. 28(12), 3045–3060 (2017)

5. Ding, S.F., Guo, L.L., Hou, Y.L.: Extreme learning machine with kernel model based on deep learning. Neural Comput. Appl. 28(8), 1975–1984 (2017)

6. Zhang, H.G., Zhang, S., Yin, Y.X.: Online sequential ELM algorithm with forgetting factor for real applications. Neurocomputing 261, 144–152 (2017)

7. He, Q., Jin, X., Du, C.Y., Zhuang, F.Z., Shi, Z.Z.: Clustering in extreme learning machine feature space. Neurocomputing 128, 88–95 (2014)

8. Huang, Z., Yu, Y., Gu, J., Liu, H.: An efficient method for traffic sign recognition based on extreme learning machine. IEEE Trans. Cybern. 47(4), 920–933 (2017)

9. Cao, W.P., Gao, J.Z., Ming, Z., Cai, S.B.: Some tricks in parameter selection for extreme learning machine. IOP Conf. Ser. Mater. Sci. Eng. 261(1), 012002 (2017)

10. Huang, G.B., Chen, L., Siew, C.K.: Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Trans. Neural Netw. 17(4), 879–892 (2006)

11. Huang, G.B., Chen, L.: Enhanced random search based incremental extreme learning machine. Neurocomputing 71(16–18), 3460–3468 (2008)

12. Yang, Y.M., Wang, Y.N., Yuan, X.F.: Bidirectional extreme learning machine for regression problem and its learning effectiveness. IEEE Trans. Neural Netw. Learn. Syst. 23(9), 1498–1505 (2012)

13. Cao, W.P., Ming, Z., Wang, X.Z., Cai, S.B.: Improved bidirectional extreme learning machine based on enhanced random search. Memetic Comput., 1–8 (2017). https://doi.org/10.1007/s12293-017-0238-1

14. Rong, H.J., Ong, Y.S., Tan, A.H., Zhu, Z.: A fast pruned-extreme learning machine for classification problem. Neurocomputing 72, 359–366 (2008)

15. Miche, Y., Sorjamaa, A., Bas, P., Simula, O., Jutten, C., Lendasse, A.: OP-ELM: optimally pruned extreme learning machine. IEEE Trans. Neural Netw. 21(1), 158–162 (2010)

16. Wang, W.H., Liu, X.Y.: The selection of input weights of extreme learning machine: a sample structure preserving point of view. Neurocomputing 261, 28–36 (2017)

17. Kasun, L.L.C., Zhou, H., Huang, G.B., Vong, C.M.: Representational learning with extreme learning machine for big data. IEEE Intell. Syst. 28, 31–34 (2013)

18. Huang, G.B., Bai, Z., Kasun, L.L.C., Vong, C.M.: Local receptive fields based extreme learning machine. IEEE Comput. Intell. Mag. 10, 18–29 (2015)

19. Blake, C., Merz, C.: UCI repository of machine learning databases. Technical report, Dept. Inf. Comput. Sci., Univ. California, Irvine, CA, USA (1998). http://archive.ics.uci.edu/ml/


A Novel Feature Specificity Enhancement for Taste Recognition by Electronic Tongue

Yanbing Chen, Tao Liu, Jianjun Chen, Dongqi Li, and Mengya Wu

School of Microelectronics and Communication Engineering, No. 174 Shazheng Street, Shapingba District, Chongqing 400044, China
{yanbingchen,cquliutao,cjj,20161213071}@cqu.edu.cn, [email protected]

Abstract. An electronic tongue (E-Tongue) is a bionic system that relies on an array of electrode sensors to realize taste perception. Large amplitude pulse voltammetry (LAPV) is an important E-Tongue type which generally generates a large amount of response data. Considering that the high common-mode characteristics existing in sensor arrays largely depress recognition performance, we propose an alternative feature extraction method for sensor specificity enhancement, which is called feature specificity enhancement (FSE). Specifically, the proposed FSE method measures sensor specificity on paired sensor responses and utilizes a kernel function for nonlinear projection. Meanwhile, the kernel extreme learning machine (KELM) is utilized to evaluate the overall recognition performance. In the experimental evaluation, we introduce several feature extraction methods and classifiers for comparison. The results indicate that the proposed feature extraction combined with KELM shows the highest recognition accuracy of 95% on our E-Tongue dataset, which is superior to the other methods in both effectiveness and efficiency.

Keywords: Electronic tongue · Specificity enhancement · Kernel function

1 Introduction

An electronic tongue (E-Tongue) is a common type of artificial-taste equipment for liquid phase analysis [1–3]. It relies on an array of sensors with low selectivity and proper pattern recognition methods to realize taste identification like human beings [4]. Many E-Tongue applications have been reported [5–13]. Among them, both tea and wine are popular in recent E-Tongue academic works.

Current E-Tongues are mainly divided into two categories: potentiometric [14] and voltammetric types [15]. Compared with the former type, the latter one attracts more attention [16–18]. Among voltammetric E-Tongues, the large amplitude pulse voltammetry (LAPV) type is popular. For feature extraction of LAPV, recent studies handle the responses of a single electrode directly. However, they ignore the common-mode signals existing between different electrodes in a sensor array, which may be harmful to classification.


In this paper, we propose a feature enhancement method using a nonlinear specificity metric, which can alleviate the common-mode components in sensor responses. Coupled with the kernel extreme learning machine (KELM), the proposed method obtains the highest recognition rate among several referenced methods in our evaluation. It is shown that the nonlinear specificity enhancement, associated with KELM, clearly helps the data analysis of LAPV based E-Tongues.

The rest of this paper is organized as follows. Section 2 introduces the proposed method. The experimental results and analysis are presented in Sect. 3. Finally, conclusions are presented in the last section.

2 Methods

Notations. In this paper, X = [x_1, x_2, ..., x_m]^T ∈ R^{m×d} represents the raw data of a certain sample, where m represents the sample dimension, equivalent to the number of sensors, and d represents the number of values in each sensor response.

2.1 Feature Specificity Enhancement

We propose the feature specificity enhancement (FSE) scheme for E-Tongues working in the LAPV manner and perform the feature extraction as follows:

Z_{ij}^n = κ(x_i^n, x_j^n),   i ≠ j    (1)

where x_i^n represents the i-th sensor response to the n-th sample, Z_{ij}^n indicates the n-th sample feature between the i-th and j-th sensor responses, and κ(·) denotes a kernel function that projects the original specificity component to a nonlinear space. Moreover, we introduce the kernel function to solve the "dimension disaster" (curse of dimensionality) problem in the space projection [19], as follows:

κ(x_i, x_j) = exp( −‖x_i − x_j‖_2^2 / (2σ^2) )    (2)

where exp(·) represents the exponential function, σ is the width of the kernel function, and ‖·‖_2 denotes the l2-norm.
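Equations (1)–(2) amount to a Gaussian kernel evaluated over every pair of sensor responses within one sample. The sketch below is our own illustration, assuming a sample is stored as an m × d array; sigma is a free parameter:

import numpy as np

def fse_features(sample, sigma=1.0):
    # sample: (m, d) array -> m sensors, d response values each (Eqs. 1-2)
    m = sample.shape[0]
    feats = []
    for i in range(m):
        for j in range(i + 1, m):                    # i != j, each unordered pair once
            sq_dist = np.sum((sample[i] - sample[j]) ** 2)
            feats.append(np.exp(-sq_dist / (2.0 * sigma ** 2)))
    return np.array(feats)                           # m*(m-1)/2 specificity features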

In the recognition stage, the extreme learning machine (ELM) module [20] is a favorable choice. It randomly initializes the input weights W = [w_1, w_2, ..., w_L]^T ∈ R^{L×D} and biases b = [b_1, b_2, ..., b_L] ∈ R^L, and then the corresponding output weight matrix β ∈ R^{L×C} can be analytically calculated based on the output matrix of the hidden layer. The output matrix H of the hidden layer with L hidden neurons is computed as:

H = [ g(w_1^T x_1 + b_1)  ...  g(w_L^T x_1 + b_L)
      ...                 ...  ...
      g(w_1^T x_N + b_1)  ...  g(w_L^T x_N + b_L) ]    (3)


where g(·) is the activation function. If regularization is applied, the ELM learning model can be expressed as follows:

min_β  (1/2)‖β‖² + (μ/2) Σ_{i=1}^{N} ‖ξ_i‖²
s.t.  h(x_i)β = t_i − ξ_i,  i = 1, 2, ..., N    (4)

where μ is the regularization coefficient and ξ denotes the prediction error of the ELM on the training set. The approximate solution of the output weight matrix β is calculated as:

β = (H^T H + I_{L×L}/μ)^{−1} H^T T,   N ≥ L
β = H^T (H H^T + I_{N×N}/μ)^{−1} T,   N < L    (5)

where T = [t_1, t_2, ..., t_N]^T ∈ R^{N×C} denotes the label matrix of the training set and N is the number of training samples. Therefore, the output of the ELM can be computed as:

f(x) = h(x) H^T (H H^T + I_{N×N}/μ)^{−1} T    (6)

The KELM additionally introduces a kernel function when calculating the output of the network [21], defined as K_{ij} = h(x_i) · h(x_j) = k(x_i, x_j). Thus, Eq. (6) can be expressed as:

f(x) = [k(x, x_1), ..., k(x, x_N)] (K + I_{N×N}/μ)^{−1} T    (7)

3 Experiments

3.1 Experimental Data

The data acquisition was performed on our own developed E-Tongue system [22]. Seven kinds of drinks including red wine, white spirit, beer, oolong tea, black tea, maofeng tea, and pu'er tea were selected as test objects. In the experiments, we formulated nine tests for each kind of drink. Thus, a total of 63 (7 kinds × 9 tests) samples were collected.

3.2 Experimental Results

In this section, we compare the FSE method experimentally with three feature extraction methods: Raw (no treatment), principal component analysis (PCA), and discrete wavelet transform (DWT). After the original E-Tongue signals are processed by the different


feature extraction methods, support vector machine (SVM), random forest (RF), and KELM are implemented as the recognition part for evaluation. In this section, we use the leave-one-out (LOO) strategy for cross validation. The average accuracies of cross validation are reported in Table 1 and the total computation time for cross-validation training and testing is presented in Table 2. From Tables 1 and 2, we can make the following observations:

(1) When SVM is used for classification, the proposed FSE with SVM performs significantly better than the other feature extraction methods, achieving 90.48%. In terms of execution time, FSE with SVM is also the fastest among the feature extraction methods.

(2) When RF is used for classification, both FSE and the raw features obtain the highest average accuracy (82.54%) compared with the other feature extraction methods. In terms of computation time, FSE is nearly 90 times faster than Raw.

(3) When KELM is adopted, FSE obtains the highest accuracy of 95.24%. Compared with the raw features (22.22%), PCA (69.84%), and DWT (88.89%), it is obvious that KELM shows better fitting and reasoning ability when using the proposed FSE feature extraction method. Moreover, the specificity metric with Hilbert-space projection is more favorable to KELM than to any other classifier. As for time consumption, FSE coupled with KELM costs the least time among all methods. This indicates that KELM keeps the amount of computation to a minimum while providing excellent classification results.

4 Conclusion

In this article, we proposed an FSE method for nonlinear feature extraction from E-Tongue data and achieved taste recognition by using several typical classifiers such as SVM, RF, and KELM. The proposed FSE coupled with KELM achieves the best results in both accuracy and computational efficiency on the data set collected by our self-developed

Table 1. Accuracy comparison

        Feature extraction methods
        Raw      PCA      DWT      FSE
RF      82.54%   73.02%   77.78%   82.54%
SVM     84.13%   79.37%   77.78%   90.48%
KELM    22.22%   69.84%   88.89%   95.24%

Table 2. Time consumption comparison

        Feature extraction methods
        Raw       PCA       DWT       FSE
RF      344.70 s  43.12 s   56.88 s   4.24 s
SVM     48.69 s   37.32 s   52.88 s   2.56 s
KELM    5.63 s    34.40 s   49.73 s   0.06 s


E-Tongue system. FSE appears to be effective for feature extraction from high-dimensional data, especially LAPV signals. On the other hand, KELM can greatly promote the overall recognition performance in both accuracy and speed.

References

1. Legin, A., Rudnitskaya, A., Lvova, L., Di Natale, C., D'Amico, A.: Evaluation of Italian wine by the electronic tongue: recognition, quantitative analysis and correlation with human sensory perception. Anal. Chim. Acta 484(1), 33–44 (2003)

2. Ghosh, A., Bag, A.K., Sharma, P., et al.: Monitoring the fermentation process and detection of optimum fermentation time of black tea using an electronic tongue. IEEE Sensors J. 15(11), 6255–6262 (2015)

3. Verrelli, G., Lvova, L., Paolesse, R., et al.: Metalloporphyrin-based electronic tongue: an application for the analysis of Italian white wines. Sensors 7(11), 2750–2762 (2007)

4. Tahara, Y., Toko, K.: Electronic tongues–a review. IEEE Sensors J. 13(8), 3001–3011 (2013)

5. Kirsanov, D., Legin, E., Zagrebin, A., et al.: Mimicking Daphnia magna bioassay performance by an electronic tongue for urban water quality control. Anal. Chim. Acta 824, 64–70 (2014)

6. Wei, Z., Wang, J.: Tracing floral and geographical origins of honeys by potentiometric and voltammetric electronic tongue. Comput. Electron. Agric. 108, 112–122 (2014)

7. Wang, L., Niu, Q., Hui, Y., Jin, H.: Discrimination of rice with different pretreatment methods by using a voltammetric electronic tongue. Sensors 15(7), 17767–17785 (2015)

8. Apetrei, I.M., Apetrei, C.: Application of voltammetric e-tongue for the detection of ammonia and putrescine in beef products. Sens. Actuators B Chem. 234, 371–379 (2016)

9. Ciosek, P., Brzózka, Z., Wróblewski, W.: Classification of beverages using a reduced sensor array. Sens. Actuators B Chem. 103(1), 76–83 (2004)

10. Domínguez, R.B., Morenobarón, L., Muñoz, R., et al.: Voltammetric electronic tongue and support vector machines for identification of selected features in Mexican coffee. Sensors 14(9), 17770–17785 (2014)

11. Palit, M., Tudu, B., Bhattacharyya, N., et al.: Comparison of multivariate preprocessing techniques as applied to electronic tongue based pattern classification for black tea. Anal. Chim. Acta 675(1), 8–15 (2010)

12. Gutiérrez, M., Llobera, A., Ipatov, A., et al.: Application of an E-tongue to the analysis of monovarietal and blends of white wines. Sensors 11(5), 4840–4857 (2011)

13. Dias, L.A., et al.: An electronic tongue taste evaluation: identification of goat milk adulteration with bovine milk. Sens. Actuators B Chem. 136(1), 209–217 (2009)

14. Ciosek, P., Maminska, R., Dybko, A., et al.: Potentiometric electronic tongue based on integrated array of microelectrodes. Sens. Actuators B Chem. 127(1), 8–14 (2007)

15. Ivarsson, P., et al.: Discrimination of tea by means of a voltammetric electronic tongue and different applied waveforms. Sens. Actuators B Chem. 76(1), 449–454 (2001)

16. Winquist, F., Wide, P., Lundström, I.: An electronic tongue based on voltammetry. Anal. Chim. Acta 357(1–2), 21–31 (1997)

17. Tian, S.Y., Deng, S.P., Chen, Z.X.: Multifrequency large amplitude pulse voltammetry: a novel electrochemical method for electronic tongue. Sens. Actuators B Chem. 123(2), 1049–1056 (2007)


18. Palit, M., Tudu, B., Dutta, P.K., et al.: Classification of black tea taste and correlation with tea taster's mark using voltammetric electronic tongue. IEEE Trans. Instrum. Meas. 59(8), 2230–2239 (2010)

19. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)

20. Huang, G.B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: theory and applications. Neurocomputing 70(1), 489–501 (2006)

21. Huang, G.B., Zhou, H., Ding, X., et al.: Extreme learning machine for regression and multiclass classification. IEEE Trans. Syst. Man Cybern. Part B 42(2), 513–529 (2012)

22. Liu, T., Chen, Y., Li, D., et al.: An active feature selection strategy for DWT in artificial taste. J. Sens. 2018, 1–11 (2018)


Comparison of Classification Methods for Very High-Dimensional Data in Sparse Random Projection Representation

Anton Akusok and Emil Eirola

Department of Business Management and Analytics, Arcada UAS, Helsinki, Finland
{anton.akusok,emil.eirola}@arcada.fi

Abstract. The big data trend has inspired feature-driven learning tasks, which cannot be handled by conventional machine learning models. Unstructured data produces very large binary matrices with millions of columns when converted to vector form. However, such data is often sparse, and hence can be made manageable through the use of sparse random projections.

This work studies efficient non-iterative and iterative methods suitable for such data, evaluating the results on two representative machine learning tasks with millions of samples and features. An efficient Jaccard kernel is introduced as an alternative to the sparse random projection. Findings indicate that non-iterative methods can find larger, more accurate models than iterative methods in different application scenarios.

Keywords: Extreme learning machines · Sparse data · Sparse random projection · Large dimensionality

1 Introduction

Machine learning is a mature scientific field with lots of theoretical results, established algorithms and processes that address various supervised and unsupervised problems using the provided data. In theoretical research, such data is generated in a convenient way, or various methods are compared on standard benchmark problems – where data samples are represented as dense real-valued vectors of fixed and relatively low length. Practical applications represented by such standard datasets can successfully be solved by one of a myriad of existing machine learning methods and their implementations.

However, the largest impact of machine learning is currently in the big data field, with problems that are well explained in natural language ("Find malicious files", "Is that website safe to browse?") but are hard to encode numerically. Data samples in these problems have distinct features coming from a huge unordered set of possible features. The same approach can cover the frequent case of missing feature values [10,29]. Another motivation for representing data by


abstract binary features is a growing demand for security, as such features can be obfuscated (for instance by hashing) to allow secure and confidential data processing.

The unstructured components can be converted into vector form by defining indicator variables, each representing the presence/absence of some property (e.g., the bag-of-words model [11]). Generally, the number of such indicators can be much larger than the number of samples, which is already large by itself. Fortunately, these variables tend to be sparse. In this paper, we study how standard machine learning solutions can be applied to such data in a practical way.

The research problem is formulated as classification of sparse data with a large number of samples (hundreds of thousands) and huge dimensionality (millions of features). In this work, the authors omit feature selection methods because they are slow at such a large scale, they can reduce the performance if a poor set of features is used, and, most importantly, features need to be re-selected if the feature set changes. Feature selection is replaced by Sparse Random Projection [3] (SRP), which provides a dense low-dimensional representation of high-dimensional sparse data while almost preserving relative distances between data samples [1]. All the machine learning methods in the paper are compared on the same SRP representation of the data.

The paper also compares the performance of the proposed methods on SRP to find the ones suitable for big data applications. Large training and test sets are used, with a possibility to process the whole dataset at once. Iterative solutions are typically applied to large data processing, as their parameters can be updated by gradually feeding all available data. Such solutions often come at a higher computational cost and longer training time than methods where explicit solutions exist, as in linear models. Previous works on this topic considered neural networks and logistic regression [9]. There is also application research [27] without a general comparison of classification methods. A wide comparison of iterative methods based on feature subset selection is given in the original paper for the publicly available URL Reputation benchmark [18].

The remainder of this paper is structured as follows. The next Sect. 2 introduces the sparse random projection and the classification methods used in the comparison. The experimental Sect. 3 describes the comparison datasets and compares the experimental results. The final Sect. 4 discusses the findings and their consequences for practical applications.

2 Methodology

2.1 Sparse Random Projection for Dimensionality Reduction

The goal of applying random projections is to efficiently reduce the size of the data while retaining all significant structures relevant to machine learning. According to the Johnson–Lindenstrauss lemma, a small set of points can be projected from a high-dimensional to a low-dimensional Euclidean space such that relative distances between points are well preserved [1]. As relative distances reflect the structure of the dataset (and carry the information related to the neighborhood


and class of particular data samples), standard machine learning methods perform well on data in its low-dimensional representation. The lemma requires an orthogonal projection, which is well approximated by a random projection matrix at high dimensionality.

The Johnson–Lindenstrauss lemma applies to the given case because the number of data samples is smaller than the original dimensionality (millions). However, computing such a high-dimensional projection directly exceeds the memory capacity of contemporary computers. Nevertheless, a similar projection is obtained by using a sparse random projection matrix. The degree of sparseness is tuned so that the result after the projection is a dense matrix with a low number of exact zeros.

Denote the random projection matrix by W, and the original high-dimensional data by X. The product W^T X can be calculated efficiently for very large W and X, as long as they are sparse. Specifically, the elements of W are not drawn from a continuous distribution, but instead distributed as follows:

w_ij = −√(s/d)  with probability 1/(2s)
       0        with probability 1 − 1/s
       +√(s/d)  with probability 1/(2s)    (1)

where s = 1/density and d is the target dimension [15].
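A projection matrix with exactly this element distribution can be assembled directly as a SciPy sparse matrix. The sketch below is our own illustration (scikit-learn's SparseRandomProjection implements the same scheme); sampling row and column indices independently may occasionally merge duplicate entries, which is negligible at the low densities used here:

import numpy as np
from scipy import sparse

def srp_matrix(n_features, d, density=0.01, seed=0):
    # W is (n_features x d) with entries +-sqrt(s/d) w.p. 1/(2s) and 0 w.p. 1-1/s, Eq. (1).
    rng = np.random.default_rng(seed)
    s = 1.0 / density
    nnz = int(n_features * d * density)
    rows = rng.integers(0, n_features, nnz)
    cols = rng.integers(0, d, nnz)
    data = rng.choice([-1.0, 1.0], size=nnz) * np.sqrt(s / d)
    return sparse.coo_matrix((data, (rows, cols)), shape=(n_features, d)).tocsr()

With samples stored as rows of a sparse matrix X (N × n_features), the dense low-dimensional representation is then the product X @ W.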

2.2 Extreme Learning Machine

Extreme Learning Machine (ELM) [13,14] is a single hidden layer feed-forward neural network where only the output layer weights β are optimized, and all the weights w_kj between the input and hidden layer are assigned randomly. With N input vectors x_i, i ∈ [1, N] collected in a matrix X and the targets collected in a vector y, it can be written as

Hβ = y,  where  H = h(W^T X + 1^T b)    (2)

Here W is a projection matrix with L rows corresponding to L hidden neurons, filled with normally distributed values, b is a bias vector filled with the same values, and h(·) is a non-linear activation function applied element-wise. This paper uses the hyperbolic tangent function as h(·). Training this model is simple, as the optimal output weights β are calculated directly by ordinary least squares. Tikhonov regularization [30] is often applied when solving the least squares problem in Eq. (2). The value of the regularization parameter can be selected by minimizing the leave-one-out cross-validation error (efficiently calculated via the PRESS statistic [20]). The model is easily adapted for sparse high-dimensional inputs by using a sparse random matrix W as described in the previous section. ELM with this structure for the random weight matrix is very similar to the ternary ELM from [12].
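For the sparse high-dimensional inputs considered here, the hidden layer can be computed without ever forming a dense projection: the sparse-sparse product X W is evaluated first, and only the much smaller result is densified. This is our own sketch under that reading, with the Tikhonov parameter alpha as a placeholder for the PRESS-selected value:

import numpy as np
from scipy import sparse

def elm_hidden(X_sparse, W_sparse, b):
    # H = tanh(X W + b); X is (N x D) sparse, W is (D x L) sparse, H is dense (N x L).
    return np.tanh((X_sparse @ W_sparse).toarray() + b)

def elm_output_weights(H, y, alpha=1e-3):
    # Tikhonov-regularized least squares for beta (ridge form of Eq. 2).
    L = H.shape[1]
    return np.linalg.solve(H.T @ H + alpha * np.eye(L), H.T @ y)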


ELM can incorporate a linear part by attaching the original data features X to the hidden neuron outputs H. A random linear combination of the original data features can be used if attaching all the features is infeasible, as in the current case of very high-dimensional sparse data. These features let ELM learn any linear dependencies in the data directly, without their non-linear approximation. Such a method is similar to another random neural network method called the Random Vector Functional Link network (RVFL [23]), and is referred to in this paper by the RVFL name.

2.3 Radial Basis Function ELM

An alternative way of computing the hidden layer output H is to assign a centroid vector c_j, j ∈ [1, L] to each hidden neuron, and obtain H as a distance-based kernel between the training/test set and the fixed set of centroid vectors:

H_{i,j} = e^{−γ_j d²(x_i, c_j)},  i ∈ [1, N], j ∈ [1, L]    (3)

where γ_j is the kernel width.

Such an architecture is widely known as a Radial Basis Function (RBF) network [6,17], except that ELM-RBF uses fixed centroids and fixed random kernel widths γ_j. The centroid vectors c_j are chosen from random training set samples to better follow the input data distribution. The distance function for dense data is the Euclidean distance.
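Equation (3) in code, with centroids sampled from the training set and fixed random kernel widths; the range of the widths below is our own placeholder:

import numpy as np

def rbf_elm_hidden(X, X_train, L, seed=0):
    # H[i, j] = exp(-gamma_j * d^2(x_i, c_j)), centroids drawn from training samples, Eq. (3).
    rng = np.random.default_rng(seed)
    centroids = X_train[rng.choice(len(X_train), size=L, replace=False)]
    gammas = rng.uniform(0.1, 1.0, size=L)                       # fixed random kernel widths
    sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gammas * sq_dist)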

2.4 Jaccard Distance for Sparse Binary Data

Distance computations are a major drawback in any RBF network with Euclidean distances, as they are slow and impossible to approximate for high-dimensional data [2]. Jaccard distances can be used for binary data [8]. However, a naive approach to Jaccard distances is infeasible for datasets with millions of features.

An alternative computation of the Jaccard distance matrix directly from sparse data is considered in the paper and proved to be fast enough for practical purposes. Recall the Jaccard distance formulation for sets a and b:

J(a, b) = 1 − |a ∩ b| / |a ∪ b|    (4)

Each column of the sparse binary matrices A and B can be considered as a set of non-zero values, so A = [a_1, a_2, ..., a_m] and B = [b_1, b_2, ..., b_n]. Their union and intersection can be efficiently computed with a matrix product and reductions:

|a_i ∩ b_j| = (A^T B)_{ij},  i ∈ [1, m], j ∈ [1, n]    (5)

|a_i ∪ b_j| = |a_i| + |b_j| − |a_i ∩ b_j| = ( Σ_k A_{ki} + Σ_k B_{kj} − A^T B )_{i,j}    (6)


The sparse matrix multiplication is the slowest part, so this work utilizes its parallel version. Note that the runtime of a sparse matrix product A^T B scales sub-linearly in the number of output elements n · m, so the approach is inefficient for distance calculations between separate pairs of samples (a_i, b_j) not joined in large matrices A, B.
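Equations (4)–(6) map directly onto sparse matrix operations. The following sketch (ours) computes the full pairwise Jaccard distance matrix between the columns of two sparse binary (0/1) matrices in one shot:

import numpy as np
from scipy import sparse

def jaccard_distances(A, B):
    # A: (features x m), B: (features x n), both sparse binary; returns an (m x n) matrix.
    inter = (A.T @ B).toarray()                        # |a_i & b_j|, Eq. (5)
    size_a = np.asarray(A.sum(axis=0)).ravel()         # |a_i|
    size_b = np.asarray(B.sum(axis=0)).ravel()         # |b_j|
    union = size_a[:, None] + size_b[None, :] - inter  # Eq. (6)
    union[union == 0] = 1                              # avoid 0/0 when both sets are empty
    return 1.0 - inter / union                         # Eq. (4)

The resulting matrix can be plugged into the RBF-ELM hidden layer of Eq. (3) or used as a pre-computed metric for kNN.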

3 Experiments

3.1 Datasets

The performance of the various methods is compared on two separate, large datasets with related classification tasks. The first dataset concerns Android application packages, with the task of malware detection. Features are extracted using static analysis techniques, and the current data consists of 6,195,080 binary variables. There are 120,000 samples in total, of which 60,000 are malware, and this is split into a training set of 100,000 samples and a fixed test set of 20,000 samples.

The data is very sparse – the density of nonzero elements is around 0.0017%. Even though the data is balanced between classes, the relative costs of false positives and false negatives are very different. As such, the overall classification accuracy is not the most useful metric, and the area under the ROC curve (AUC) is often preferred to compare models. More information about the data can be found in [22,26,27].

Second, the Web URL Reputation dataset [19] contains a set of 2,400,000 websites that can be malicious or benign. The dataset is split into 120 daily intervals during which the data was collected; the last day is used as the test set. The task is to classify the websites using the given 3,200,000 sparse binary features, as well as 65 dense real-valued features. This dataset has 0.0035% nonzero elements; however, a small number of features are real-valued and dense. For this dataset, the classification accuracy is reported in comparison with the previous works [32,33].

3.2 Additional Methods

Additional methods include Kernel Ridge Regression (KRR), k-Nearest Neighbors (kNN), Support Vector Machine for binary classification (SVC), Logistic Regression and Random Forest. Of these methods, only SVC and logistic regression have iterative solutions.

Kernel Ridge Regression [21,25] combines Ridge regression, a linear least squares solution with an L2-penalty on the weight norm, with the kernel trick. Different kernels may be used, like the Jaccard distance kernel for sparse data proposed above.

The k-Nearest Neighbors (kNN) method is a simple classifier that looks at the k closest training samples to a given test sample and runs a majority vote between them to predict a class. The value of k is usually odd to avoid ties. It can use different distance functions, and even a pre-computed distance matrix (with the Jaccard distance explained in the Methodology section).