arXiv:1706.07515v2 [cs.IR] 21 Jul 2017
Comparing Neural and Attractiveness-based Visual Features for Artwork Recommendation

Vicente Dominguez
Pontificia Universidad Catolica de Chile, Santiago, Chile
[email protected]

Pablo Messina
Pontificia Universidad Catolica de Chile, Santiago, Chile
[email protected]

Denis Parra
Pontificia Universidad Catolica de Chile, Santiago, Chile
[email protected]

Domingo Mery
Pontificia Universidad Catolica de Chile, Santiago, Chile
[email protected]

Christoph Trattner
MODUL University Vienna, Vienna, Austria
[email protected]

Alvaro Soto
Pontificia Universidad Catolica de Chile, Santiago, Chile
[email protected]

ABSTRACT
Advances in image processing and computer vision in recent years have brought about the use of visual features in artwork recommendation. Recent works have shown that visual features obtained from pre-trained deep neural networks (DNNs) perform extremely well for recommending digital art. Other recent works have shown that explicit visual features (EVF) based on attractiveness can perform well in preference prediction tasks, but no previous work has compared DNN features against specific attractiveness-based visual features (e.g., brightness, texture) in terms of recommendation performance. In this work, we study and compare the performance of DNN and EVF features for the purpose of physical artwork recommendation, using transactional data from UGallery, an online store of physical paintings. In addition, we perform an exploratory analysis to understand whether DNN embedded features have some relation to certain EVF. Our results show that DNN features outperform EVF, that certain EVF features are better suited for physical artwork recommendation and, finally, we show evidence that certain neurons in the DNN might be partially encoding visual features such as brightness, providing an opportunity for explaining recommendations based on visual neural models.

KEYWORDS
Artwork Recommendation, Computer Vision

ACM Reference format:
Vicente Dominguez, Pablo Messina, Denis Parra, Domingo Mery, Christoph Trattner, and Alvaro Soto. 2017. Comparing Neural and Attractiveness-based Visual Features for Artwork Recommendation. In Proceedings of DLRS 2017, Como, Italy, August 27, 2017, 5 pages.
DOI: 10.1145/3125486.3125495

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
DLRS 2017, Como, Italy
© 2017 ACM. 978-1-4503-5353-3/17/08 ... $15.00
DOI: 10.1145/3125486.3125495

1 INTRODUCTION
In the last five years, the area of computer vision has been revolutionized by deep neural networks (DNNs): visual neural embeddings from pre-trained convolutional neural networks have improved by orders of magnitude the performance on tasks such as image classification [9] and scene identification [18]. In the area of recommender systems, a few works have exploited neural visual embeddings for recommendation, such as [3-5, 13]. Among these works, we are particularly interested in the use of visual features for recommending art [4]. The online artwork market is booming due to the influence of social media and the new consumption behaviors of millennials [22], but the main works on recommending art date back more than 8 years [1], and they did not utilize visual features for recommendation.

In a recent work, He et al. [4] introduced a recommendation method for digital art employing ratings, social features and pre-trained visual DNN embeddings, with very good results. However, they did not use explicit visual features (EVF) such as brightness, color or texture. Using only latent embeddings (from DNNs) affects model transparency and the explainability of recommendations, which can in turn hinder user acceptance of the suggestions [8, 21]. In addition, their work focused on digital art, while in the present work we are interested in recommending physical artworks (paintings). In a more recent work [14], we compared the performance of visual DNN features against art metadata and explicit visual features (EVF: colorfulness, brightness, etc.) for the recommendation of physical artworks. We showed that visual features (DNN and EVF) outperform manually-curated metadata. However, we did not analyze which specific visual features are more important for the artwork recommendation task. Furthermore, although previous research has studied what is being encoded by neurons in a DNN [15, 23], to the best of our knowledge no previous work has investigated a link between latent DNN visual features and attractiveness-based visual features such as those investigated by San Pedro et al. [17] (colorfulness, brightness, etc.). Understanding what visual neural models are encoding could help the transparency and explainability of recommender systems [21].

Contribution. In this work we compare the performance of 8 explicit visual features (EVF) for the task of recommending physical paintings from an online store, UGallery1. Our results indicate that the combination of these features offers the best performance, but they also highlight that features like entropy and contrast contribute more than features like colorfulness to the recommendation task. Moreover, an exploratory analysis provides evidence that certain latent features from the DNN visual embedding might be partially related to explicit visual features. This result could eventually be utilized in explaining recommendations made with black-box neural visual models.

2 RELATED WORK
Here we survey some works using visual features obtained from pre-trained deep neural networks for recommendation tasks. McAuley et al. [13] introduced an image-based recommendation system based on styles and substitutes for clothing, using visual embeddings pre-trained on a large-scale dataset obtained from Amazon.com. Recently, He et al. [5] went further in this line of research and introduced a visually-aware matrix factorization approach that incorporates visual signals (from a pre-trained DNN) into predictors of people's opinions. The latest work by He et al. [4] deals with visually-aware artistic recommendation, building a model which combines ratings, social signals and visual features. Deldjoo et al. [3] compared visual DNN and explicit (stylistic) visual features for movie recommendation, and they found the explicit visual features more informative than the DNN features for their recommendation task.

Unlike these previous works, in this article we compare neural and specific explicit visual features (such as brightness or contrast) for physical painting recommendation, and we also explore the potential relation between these two types of features (DNN vs. EVF).

3 PROBLEM DESCRIPTION & DATASET
The online web store UGallery supports emergent artists by helping them sell their original paintings online. In this work, we study content-based top-n recommendation based on sets of visual features extracted directly from images. To make personalized recommendations, we create a user profile based on paintings already bought by a user, and based on this model we attempt to predict the next paintings the user will buy. In this work, we focus on comparing different sets of visual features, in order to understand which ones provide better recommendations.

Dataset. UGallery shared with us an anonymized dataset of 1,371 users, 3,490 paintings (with their respective images) and 2,846 purchases (transactions) of paintings, where all users have made at least one transaction. On average, each user has bought 2-3 items over the last few years2.

4 FEATURES & RECOMMENDATION METHOD

Since we are using a content-based approach to produce recommendations, we first describe the visual features extracted from the images, and then we describe how we used them to make recommendations.
1 http://www.UGallery.com
2 Our collaborators at UGallery requested that we not disclose the exact dates when the data was collected.

Figure 1: Evaluation procedure. Each surrounding box represents a test, where we predict the items of the purchase session. In the figure, we predict which artworks User 1 bought in purchase P3. 'Training: P2' means that we used items from purchase session P2 to train the model.

Visual Features. For each image representing a painting in the dataset we obtain features from a pre-trained AlexNet DNN [9], taking the fc6 layer, which outputs a vector of 4,096 dimensions. This network was trained with Caffe [6] on the ImageNet ILSVRC 2012 dataset [9]. We also tested a pre-trained VGG16 [19] model, but the results were not significantly different, so we do not report them in this article.

We also obtain a vector of explicit visual features of attractiveness, using the OpenCV software library3, based on the work of San Pedro et al. [17]: brightness, saturation, sharpness, entropy, RGB-contrast, colorfulness and naturalness. A more detailed description of these features:
• Brightness: measures the level of luminance of an image. For images in the YUV color space, we take the average of the luminance component Y.
• Saturation: measures the vividness of a picture. For images in the HSV or HSL color space, we take the average of the saturation component S.
• Sharpness: measures how detailed the image is.
• Colorfulness: measures how distant the colors are from gray.
• Naturalness: measures how natural the picture is, grouping pixels into Sky, Grass and Skin pixels and applying the formula in [17].
• RGB-contrast: measures the variance of luminance in the RGB color space.
• Entropy: Shannon's entropy computed over the histogram of grayscale pixel values, using the histogram as the distribution.
• Local Binary Patterns (LBP): although this is not actually an "explicit" visual feature, it is a traditional baseline in several computer vision tasks [16], so we test it for recommendation too.
3 http://opencv.org/

Recommendation Method. To produce a content-based recommendation list of k artworks for a user u, we follow this procedure. For every item in the current inventory we: (1) calculate its similarity to each item in the user's model, then (2)


Figure 2: Partial t-SNE map of the UGallery dataset using the image embeddings from the AlexNet DNN. This map helps to show groups of similar images as encoded by the AlexNet embedding. The grid is made using the Jonker-Volgenant algorithm [7].

these single similarities are aggregated into a single score, as either the sum or the maximum of all these similarities, and finally (3) the items are sorted by their scores and the top k items in the list are recommended. Formally, given a user u who has consumed a set of artworks I_u, and an arbitrary artwork i from the inventory, the score of item i to be recommended to u is:

    score(i, u) = Σ_{j ∈ I_u} sim(V_i, V_j)        (1)

where V_z is the feature vector embedding of painting z, obtained from AlexNet or from the EVFs. The similarity function used was cosine similarity:

    sim(V_i, V_j) = cos(V_i, V_j) = (V_i · V_j) / (‖V_i‖ ‖V_j‖)        (2)
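As a rough illustration, the scoring procedure of Equations (1) and (2) can be sketched in a few lines of NumPy. This is a minimal sketch with placeholder data structures (the function and variable names are ours), not the authors' actual implementation:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors (Eq. 2)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend(user_items, inventory, k=10, aggregate=sum):
    """Content-based top-k recommendation (Eq. 1).

    user_items: list of feature vectors of artworks the user bought.
    inventory:  dict mapping artwork id -> feature vector.
    aggregate:  sum (Eq. 1) or max, the two variants described in the text.
    """
    scores = {
        item_id: aggregate(cosine_sim(v, u) for u in user_items)
        for item_id, v in inventory.items()
    }
    # Sort item ids by score, descending, and keep the top k.
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Passing aggregate=max instead of the default sum gives the maximum-similarity variant mentioned in step (2).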

5 EVALUATION METHOD
Our protocol is based on Macedo et al. [11], who evaluate a recommender system in a time-based manner; it is presented in Figure 1. We attempt to predict the items purchased in every transaction, where the training set contains all the artworks previously bought by the user just before making the transaction to be predicted. Users who have purchased exactly one artwork were considered cold-start users, and we removed them from this evaluation. After this filtering, our dataset ends up with 365 people who bought more than a single item and who performed a total of 1,629 transactions (i.e., we conducted 1,629 evaluation tests with each algorithm).
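The time-based protocol can be illustrated with a short sketch: for each purchase, everything the same user bought strictly earlier forms the training profile. The transaction-record layout and function names here are hypothetical, not UGallery's:

```python
def evaluation_cases(transactions):
    """Build (user, train_items, test_items) cases per purchase, time-ordered.

    transactions: list of (user_id, timestamp, item_ids) tuples.
    A user's first purchase yields no test case: there is no history
    to train on (this is how cold-start purchases drop out).
    """
    by_user = {}
    for user, ts, items in sorted(transactions, key=lambda t: t[1]):
        by_user.setdefault(user, []).append(items)
    cases = []
    for user, purchases in by_user.items():
        history = []
        for items in purchases:
            if history:  # skip the first purchase: nothing to train on
                cases.append((user, list(history), items))
            history.extend(items)
    return cases
```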

Metrics. As suggested by Cremonesi et al. [2] for top-n recommendation, we used recall@k and precision@k, as well as nDCG, a commonly used metric for ranking evaluation [12].
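Under the binary relevance of purchase prediction, these metrics can be sketched as follows (standard textbook definitions; the paper's own evaluation code is not published, so this is only an illustration):

```python
import math

def precision_recall_at_k(recommended, relevant, k):
    """precision@k and recall@k for one evaluation case."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k, hits / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance nDCG@k with a log2 position discount."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in relevant)
    # Ideal DCG: all relevant items ranked first.
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0
```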

6 RESULTS
Table 1 presents the results, which we summarize in three points:
• AlexNet DNN features perform better than those based on EVF, either combined or in isolation. Figure 2 presents a sample of the t-SNE map [10] made from the AlexNet DNN visual features, which shows well-defined clusters of images. This result reflects the current state of the art of deep neural networks in computer vision, but, as already stated, the lack of transparency of DNN visual features can hinder user acceptance of these recommendations due to the difficulty of explaining them [8, 21].

Table 1: Results of the recommendation task using a diverse set of latent (DNN) and explicit visual features (EVF).

name                  | ndcg@5 | ndcg@10 | rec@5 | rec@10 | prec@5 | prec@10
DNN                   | .0884  | .1027   | .1180 | .1598  | .0290  | .0204
EVF (all features)    | .0344  | .0459   | .0547 | .0885  | .0127  | .0111
EVF (all, except LBP) | .0370  | .0453   | .0585 | .0826  | .0152  | .0109
EVF (LBP)             | .0292  | .0388   | .0431 | .0715  | .0103  | .0085
EVF (brightness)      | .0048  | .0083   | .0080 | .0186  | .0035  | .0031
EVF (colorfulness)    | .0020  | .0052   | .0033 | .0126  | .0008  | .0017
EVF (contrast)        | .0079  | .0102   | .0149 | .0220  | .0033  | .0025
EVF (entropy)         | .0088  | .0098   | .0110 | .0137  | .0026  | .0019
EVF (naturalness)     | .0062  | .0110   | .0108 | .0248  | .0026  | .0032
EVF (saturation)      | .0055  | .0084   | .0096 | .0187  | .0021  | .0020
EVF (sharpness)       | .0063  | .0085   | .0117 | .0178  | .0024  | .0021

• Combining EVF features generally improves performance over using them in isolation, though in some cases adding a feature is detrimental. For instance, the combination EVF (all, except LBP) yields better ndcg@5, recall@5 and precision@5 than EVF (all features).

• Comparing isolated EVF features, LBP performs best because it encodes texture patterns and local contrast very well; however, it is harder to explain than image brightness or contrast. Considering only the 7 original features proposed by San Pedro et al. [17], we observe that entropy, contrast and naturalness perform consistently well, as does brightness in terms of precision. On the other hand, colorfulness does not seem to have a significant impact on providing good recommendations.

6.1 Relation between DNN and EVF features
We also explored, through a correlation analysis, the potential relation between each of the 4,096 AlexNet features and the isolated EVFs; the results are shown in Table 2. We found that brightness has a significant positive correlation with AlexNet feature F[598] (ρ = .58) as well as a significant negative correlation with AlexNet feature F[3916] (ρ = −.5). The opposite holds for naturalness, whose largest positive and negative correlations are ρ_max = .008 and ρ_min = −.002, meaning that no single AlexNet DNN neuron seems to be explicitly encoding naturalness.

Figure 3 shows a plot of the correlations of brightness with all AlexNet dimensions, sorted from the smallest (F[3916]) to the largest (F[598]). The figure also presents sampled images that support this analysis,


Figure 3: The plot at the left shows the correlation of each AlexNet feature with the visual feature brightness, highlighting the features with the most negative (F[3916] = −.5) and most positive (F[598] = .58) correlations. Example images at the right side show evidence of this correlation. Brightness values range between [0..250], F[598] between [0..49.66], and F[3916] between [0..38.79].

Table 2: Maximum and minimum correlation between attractiveness-based visual features and the 4,096 AlexNet embedding dimensions, indicating the respective dimension index.

Visual feature | max(corr.) | index(F_max) | min(corr.) | index(F_min)
brightness     | 0.5815     | 598          | -0.5020    | 3916
sharpness      | 0.4682     | 4095         | -0.3668    | 1025
saturation     | 0.3645     | 445          | -0.5019    | 2370
colorfulness   | 0.3034     | 1014         | -0.3383    | 2410
entropy        | 0.3410     | 286          | -0.4369    | 3499
contrast       | 0.3941     | 469          | -0.2799    | 3761
naturalness    | 0.0852     | 2626         | -0.0256    | 2499

where two images with high and low brightness also show respectively high and low values in AlexNet feature F[598], and the opposite for F[3916].

Limitations. It is important to note that this is only an exploratory analysis, and generalizability should be considered carefully. Nevertheless, the high correlation of brightness and the small correlation of naturalness provide a hint about which types of features are and are not being explicitly encoded in single neurons of the fc6 layer of the AlexNet DNN.
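The correlation analysis above can be reproduced in outline as follows. This is a sketch under stated assumptions (BT.601 luminance weights for brightness, plain Pearson correlation); the paper does not detail its exact implementation:

```python
import numpy as np

def brightness(image_rgb):
    """Mean luminance (Y of YUV, BT.601 weights) of an RGB image in [0..255]."""
    r, g, b = image_rgb[..., 0], image_rgb[..., 1], image_rgb[..., 2]
    return float(np.mean(0.299 * r + 0.587 * g + 0.114 * b))

def evf_embedding_correlations(evf_values, embeddings):
    """Pearson correlation of one EVF against every embedding dimension.

    evf_values: (n_images,) array, e.g. brightness per image.
    embeddings: (n_images, n_dims) array, e.g. fc6 activations.
    Returns an (n_dims,) array of correlation coefficients.
    """
    x = evf_values - evf_values.mean()
    y = embeddings - embeddings.mean(axis=0)
    denom = np.sqrt((x ** 2).sum()) * np.sqrt((y ** 2).sum(axis=0))
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = (x @ y) / denom
    return np.nan_to_num(corr)  # dimensions with zero variance -> 0
```

np.argmax and np.argmin over the returned vector recover dimension indices analogous to F[598] and F[3916] in Table 2.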

7 CONCLUSION AND FUTURE WORK
In this article we have investigated the impact of latent (DNN) and explicit (EVF) visual features on artwork recommendation. Our results support that DNN features outperform explicit visual features. Although one previous work on movie recommendation found the opposite [3] (i.e., EVF being more informative than DNN visual features), we argue that the domain differences (paintings vs. movies) and the fact that we use images rather than video might explain the difference, but further research is needed on this aspect.

We also show that some EVF contribute more information to the artwork recommendation task, such as entropy, contrast, naturalness and brightness. On the other hand, colorfulness is the least informative visual feature for this task.

Moreover, through a correlation analysis we showed that brightness, saturation and entropy are significantly correlated with some DNN embedding features, while naturalness is poorly correlated. These preliminary results should be studied further with the aim of improving the explainability of these black-box models [21].

In future work, we will study the use of other visual embeddings which have shown good results in computer vision tasks, such as GoogleNet [20]. In addition, we will conduct a user study to investigate ways to explain image recommendations using the relations found between visual DNN features and EVF.

ACKNOWLEDGMENTS
We thank Alex Farkas and Stephen Tanenbaum from UGallery for sharing the dataset and answering our questions. We also thank Felipe Cortes, a PUC Chile student who implemented a web tool to visualize the DNN embedding. Authors Vicente Dominguez, Pablo Messina and Denis Parra are supported by the Chilean research agency Conicyt, under Grant Fondecyt Iniciacion No. 11150783.

REFERENCES
[1] LM Aroyo, Y Wang, R Brussee, Peter Gorgels, LW Rutledge, and N Stash. 2007. Personalized museum experience: The Rijksmuseum use case. In Proceedings of Museums and the Web.
[2] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of Recommender Algorithms on Top-n Recommendation Tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems (RecSys '10). ACM, New York, NY, USA, 39–46.
[3] Yashar Deldjoo, Massimo Quadrana, Mehdi Elahi, and Paolo Cremonesi. 2017. Using Mise-En-Scene Visual Features based on MPEG-7 and Deep Learning for Movie Recommendation. (2017). arXiv:1704.06109
[4] Ruining He, Chen Fang, Zhaowen Wang, and Julian McAuley. 2016. Vista: A Visually, Socially, and Temporally-aware Model for Artistic Recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys '16). ACM, New York, NY, USA, 309–316. https://doi.org/10.1145/2959100.2959152
[5] Ruining He and Julian McAuley. 2016. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. AAAI Press, 144–150.
[6] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093 (2014).
[7] Roy Jonker and Anton Volgenant. 1987. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing 38, 4 (1987), 325–340.
[8] Joseph A Konstan and John Riedl. 2012. Recommender systems: from algorithms to user experience. User Modeling and User-Adapted Interaction 22, 1-2 (2012), 101–123.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[10] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579–2605.
[11] Augusto Q Macedo, Leandro B Marinho, and Rodrygo LT Santos. 2015. Context-aware event recommendation in event-based social networks. In Proceedings of the 9th ACM Conference on Recommender Systems. ACM, 123–130.
[12] Christopher D Manning, Prabhakar Raghavan, Hinrich Schutze, et al. 2008. Introduction to Information Retrieval. Vol. 1. Cambridge University Press, Cambridge.
[13] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 43–52.
[14] Pablo Messina, Vicente Dominguez, Denis Parra, Christoph Trattner, and Alvaro Soto. 2017. Exploring Content-based Artwork Recommendation with Metadata and Visual Features. arXiv preprint arXiv:1706.05786 (2017).
[15] Anh Nguyen, Jason Yosinski, and Jeff Clune. 2016. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks. arXiv preprint arXiv:1602.03616 (2016).
[16] Timo Ojala, Matti Pietikainen, and David Harwood. 1996. A comparative study of texture measures with classification based on featured distributions. Pattern Recognition 29, 1 (1996), 51–59.
[17] Jose San Pedro and Stefan Siersdorfer. 2009. Ranking and Classifying Attractiveness of Photos in Folksonomies. In Proceedings of the 18th International Conference on World Wide Web (WWW '09). ACM, New York, NY, USA, 771–780.
[18] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. 2014. CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 806–813.
[19] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[20] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
[21] Katrien Verbert, Denis Parra, Peter Brusilovsky, and Erik Duval. 2013. Visualizing recommendations to support exploration, transparency and controllability. In Proceedings of the 2013 International Conference on Intelligent User Interfaces. ACM, 351–362.
[22] Deborah Weinswig. 2016. Art Market Cooling, But Online Sales Booming. https://www.forbes.com/sites/deborahweinswig/2016/05/13/art-market-cooling-but-online-sales-booming/. (2016). [Online; accessed 21-March-2017].
[23] Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision. Springer, 818–833.