Posted on 16-Apr-2017
• What is a review?
  • A user-supplied description of, or opinion on, an online product
• What is helpfulness?
  • A criterion for how important a review is
• Why we use helpfulness
  • To fight spam reviews
  • To help users make decisions more easily
56 people found this helpful
review1: ‘hello world’
review2: ‘hello world again and again’
review3: ‘goodbye to a world’
unique words: ‘hello’, ‘world’, ‘again’, ‘goodbye’, ‘and’, ‘to’, ‘a’
vector1: [1, 1, 0, 0, 0, 0, 0]
vector2: [1, 1, 1, 0, 1, 0, 0]
vector3: [0, 1, 0, 1, 0, 1, 1]
unique words vector: [0/1, 0/1, 0/1, 0/1, 0/1, 0/1, 0/1], with 7 dimensions (one per unique word; 1 if the word appears in the review, 0 if not)
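The example above can be reproduced with a few lines of Python. This is a minimal sketch, assuming whitespace tokenization and the word order listed on the slide; the function name `to_binary_vector` is made up for illustration.

```python
# Vocabulary in the same order as the slide's "unique words" list.
vocab = ["hello", "world", "again", "goodbye", "and", "to", "a"]

reviews = [
    "hello world",
    "hello world again and again",
    "goodbye to a world",
]

def to_binary_vector(review, vocab):
    """Return 1 for each vocabulary word present in the review, else 0."""
    words = set(review.split())  # assumption: words are separated by spaces
    return [1 if word in words else 0 for word in vocab]

vectors = [to_binary_vector(review, vocab) for review in reviews]
for vector in vectors:
    print(vector)
```

Repeated words (such as "again" in review2) still produce a 1, because the vector records presence rather than counts.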
• 10,000 reviews contain 30,000 unique words
• Each review becomes a 30,000-dimensional vector, but a review has only 100 words on average
• More than 99% of the elements in each vector end up as zero (the vectors are sparse)
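The sparsity figure follows from simple arithmetic: a review with 100 words has at most 100 non-zero elements out of 30,000. The sketch below works this out and shows one common way to exploit sparsity, storing only the positions of the non-zero elements. The helper `to_sparse` is a hypothetical name, not something from the slides.

```python
n_unique_words = 30_000        # vector dimension, from the slide
avg_words_per_review = 100     # average review length, from the slide

# At most one non-zero element per word in the review.
max_nonzero_fraction = avg_words_per_review / n_unique_words
min_zero_fraction = 1 - max_nonzero_fraction
print(f"at least {min_zero_fraction:.2%} of the elements are zero")

def to_sparse(vector):
    """Keep only the indices of the non-zero elements."""
    return [i for i, x in enumerate(vector) if x != 0]

# vector3 from the earlier example needs 4 indices instead of 7 elements.
sparse3 = to_sparse([0, 1, 0, 1, 0, 1, 1])
```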
http://www.mlsociety.com/wp-content/uploads/2016/10/PredictionMachineLearning.png
vector1 → helpfulness1
vector2 → helpfulness2
vector3 → helpfulness3
$$f(t, d_i) = \begin{cases} 1, & \text{if } t \in d_i \\ 0, & \text{if } t \notin d_i \end{cases}$$

$$F(t) = \sum_{i=1}^{n} f(t, d_i)$$
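These two definitions count, for a term t, how many of the n documents contain it. A minimal sketch, assuming each document (review) is represented as a set of words; the function names mirror the symbols f and F but are otherwise arbitrary.

```python
def f(term, document):
    """Indicator: 1 if the term occurs in the document, 0 otherwise."""
    return 1 if term in document else 0

def document_frequency(term, documents):
    """F(t): number of documents that contain the term."""
    return sum(f(term, d) for d in documents)

reviews = [
    "hello world",
    "hello world again and again",
    "goodbye to a world",
]
documents = [set(review.split()) for review in reviews]

df_hello = document_frequency("hello", documents)
df_world = document_frequency("world", documents)
df_again = document_frequency("again", documents)
```

With the three example reviews, "world" appears in every document, "hello" in two, and "again" in one (it is counted once even though it occurs twice in review2).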
$$G(t) = -\sum_{i=1}^{m} \Pr(c_i) \log \Pr(c_i) + \Pr(t) \sum_{i=1}^{m} \Pr(c_i \mid t) \log \Pr(c_i \mid t) + \Pr(\bar{t}) \sum_{i=1}^{m} \Pr(c_i \mid \bar{t}) \log \Pr(c_i \mid \bar{t})$$
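The score G(t) can be estimated from counts. The sketch below is one way to do it, assuming probabilities are estimated as simple document fractions, that the last term is weighted by the probability that t is absent, and that 0 log 0 is treated as 0. The documents and labels are made-up toy data.

```python
import math

def _plogp(p):
    """p * log(p), with the convention 0 * log(0) = 0."""
    return p * math.log(p) if p > 0 else 0.0

def information_gain(term, documents, labels):
    """G(t) estimated from document counts."""
    n = len(documents)
    classes = sorted(set(labels))
    with_t = [y for d, y in zip(documents, labels) if term in d]
    without_t = [y for d, y in zip(documents, labels) if term not in d]

    # First term: entropy of the categories.
    gain = -sum(_plogp(labels.count(c) / n) for c in classes)
    # Second term: weighted by Pr(t), over documents containing t.
    if with_t:
        gain += (len(with_t) / n) * sum(
            _plogp(with_t.count(c) / len(with_t)) for c in classes)
    # Third term: weighted by Pr(not t), over documents without t.
    if without_t:
        gain += (len(without_t) / n) * sum(
            _plogp(without_t.count(c) / len(without_t)) for c in classes)
    return gain

# Toy data (hypothetical): label 1 = non-zero helpfulness, 0 = zero helpfulness.
documents = [{"great", "the", "phone"}, {"great", "the", "camera"},
             {"the", "box"}, {"the", "cable"}]
labels = [1, 1, 0, 0]

ig_great = information_gain("great", documents, labels)
ig_the = information_gain("the", documents, labels)
```

In this toy set "great" separates the two categories perfectly, so its score equals the category entropy, while "the" appears everywhere and scores zero.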
$$I(t, c) = \log \frac{\Pr(t \wedge c)}{\Pr(t) \times \Pr(c)}$$

$$I_{\max}(t) = \max_{i=1}^{m} \{ I(t, c_i) \}$$
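A small sketch of how I(t, c) and its maximum over categories might be computed, assuming probabilities are estimated as document fractions. Returning negative infinity when a term never co-occurs with a category is a choice made here for illustration; the toy documents and labels are made up.

```python
import math

def mutual_information(term, category, documents, labels):
    """I(t, c) = log( Pr(t and c) / (Pr(t) * Pr(c)) ), from document counts."""
    n = len(documents)
    n_t = sum(1 for d in documents if term in d)
    n_c = sum(1 for y in labels if y == category)
    n_tc = sum(1 for d, y in zip(documents, labels)
               if term in d and y == category)
    if n_tc == 0:
        return float("-inf")  # assumption: log(0) treated as -infinity
    return math.log((n_tc / n) / ((n_t / n) * (n_c / n)))

def mutual_information_max(term, documents, labels):
    """I_max(t): the largest I(t, c_i) over all categories."""
    return max(mutual_information(term, c, documents, labels)
               for c in set(labels))

# Toy data (hypothetical): label 1 = non-zero helpfulness, 0 = zero helpfulness.
documents = [{"great", "the", "phone"}, {"great", "the", "camera"},
             {"the", "box"}, {"the", "cable"}]
labels = [1, 1, 0, 0]

mi_great = mutual_information_max("great", documents, labels)
mi_the = mutual_information_max("the", documents, labels)
```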
$$\chi^2(t, c) = \frac{N \times (AD - BC)^2}{(A + C) \times (B + D) \times (A + B) \times (C + D)}$$

$$\chi^2_{\max}(t) = \max_{i=1}^{m} \{ \chi^2(t, c_i) \}$$
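The slide does not define A, B, C, D, and N. The sketch below assumes the usual contingency-table reading: A is the number of documents in category c that contain t, B the number outside c that contain t, C the number in c without t, D the number outside c without t, and N the total. Returning 0 when the denominator is zero is a choice made for this example, and the data is a made-up toy set.

```python
def chi_square(term, category, documents, labels):
    """Chi-square statistic of a term and a category from a 2x2 count table."""
    a = b = c = d = 0
    for doc, label in zip(documents, labels):
        has_term = term in doc
        in_cat = label == category
        if has_term and in_cat:
            a += 1          # term present, in category
        elif has_term:
            b += 1          # term present, other category
        elif in_cat:
            c += 1          # term absent, in category
        else:
            d += 1          # term absent, other category
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0          # assumption: undefined cases score zero
    return n * (a * d - b * c) ** 2 / denominator

def chi_square_max(term, documents, labels):
    """Largest chi-square value of the term over all categories."""
    return max(chi_square(term, c, documents, labels) for c in set(labels))

documents = [{"great", "the", "phone"}, {"great", "the", "camera"},
             {"the", "box"}, {"the", "cable"}]
labels = [1, 1, 0, 0]

chi_great = chi_square_max("great", documents, labels)
chi_the = chi_square_max("the", documents, labels)
```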
• Calculate a score for each word to determine its importance
  • The higher the score, the more important the word
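Once every word has a score, feature selection amounts to sorting the words and keeping the top portion. A minimal sketch, assuming the scores are held in a dictionary; the function name `select_top_words` and the example scores are invented for illustration.

```python
def select_top_words(scores, top_percent):
    """Return the highest-scoring words, keeping top_percent of the vocabulary.

    scores: dict mapping word -> importance score
    top_percent: integer percentage between 1 and 100
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, len(ranked) * top_percent // 100)  # always keep at least one
    return ranked[:keep]

# Hypothetical scores, e.g. produced by one of the scoring functions.
scores = {"great": 4.0, "the": 0.0, "bad": 3.0, "and": 0.1, "ok": 1.0}

top_40 = select_top_words(scores, 40)
top_all = select_top_words(scores, 100)
```

Only the selected words are then used as vector dimensions, which shrinks the vectors before classification.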
• Classification method: Naive Bayes
  • Separate reviews into two categories
  • Zero helpfulness / non-zero helpfulness
• Test method
  • 10-fold cross-validation
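The setup above (Naive Bayes on two categories, evaluated with 10-fold cross-validation) can be sketched in plain Python. The slides do not say which Naive Bayes variant was used; this example assumes a Bernoulli model over binary word vectors with add-one smoothing, contiguous unshuffled folds, and at least as many samples as folds. The data is a made-up toy set over a hypothetical 4-word vocabulary.

```python
import math

class BernoulliNaiveBayes:
    """Naive Bayes over binary word-presence vectors, add-one smoothing."""

    def fit(self, vectors, labels):
        self.classes = sorted(set(labels))
        self.log_prior, self.p_word = {}, {}
        for c in self.classes:
            rows = [v for v, y in zip(vectors, labels) if y == c]
            self.log_prior[c] = math.log(len(rows) / len(vectors))
            # Smoothed estimate of Pr(word j present | class c).
            self.p_word[c] = [(sum(col) + 1) / (len(rows) + 2)
                              for col in zip(*rows)]
        return self

    def predict(self, vector):
        def log_posterior(c):
            return self.log_prior[c] + sum(
                math.log(p if x else 1 - p)
                for x, p in zip(vector, self.p_word[c]))
        return max(self.classes, key=log_posterior)

def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    start = 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        yield indices[:start] + indices[start + size:], indices[start:start + size]
        start += size

def cross_validate(samples, labels, k=10):
    """Mean fraction of correct predictions over the k held-out folds."""
    fold_scores = []
    for train, test in k_fold_indices(len(samples), k):
        model = BernoulliNaiveBayes().fit([samples[i] for i in train],
                                          [labels[i] for i in train])
        correct = sum(model.predict(samples[i]) == labels[i] for i in test)
        fold_scores.append(correct / len(test))
    return sum(fold_scores) / len(fold_scores)

# Toy vectors (hypothetical vocabulary: great, useful, bad, spam).
helpful = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
unhelpful = [[0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 1], [0, 0, 1, 0]]
samples, labels = [], []
for h, u in zip(helpful * 2, unhelpful * 2):
    samples += [h, u]      # interleave so each fold holds both categories
    labels += [1, 0]       # 1 = non-zero helpfulness, 0 = zero helpfulness

accuracy = cross_validate(samples, labels, k=10)
```

On real data the samples would normally be shuffled before splitting, so that each fold has a similar mix of the two categories.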
• Pre-processing
  • Remove stopwords
  • Remove punctuation
  • Keep only words between 2 and 10 characters long
• Number of reviews used
  • 2,000
  • 18,000
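The pre-processing steps listed above (stopwords, punctuation, word length between 2 and 10) might look like the following. This is a sketch under assumptions: the stopword list is a tiny illustrative one, lowercasing is added here and is not mentioned on the slide, and punctuation is replaced by spaces before splitting.

```python
import string

# Illustrative stopword list only; a real run would use a fuller list.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "is", "it"}

def preprocess(text, stopwords=STOPWORDS, min_len=2, max_len=10):
    """Strip punctuation, drop stopwords, keep words of 2 to 10 characters."""
    # Map every punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return [w for w in words
            if w not in stopwords and min_len <= len(w) <= max_len]

tokens1 = preprocess("Goodbye, to a world!")
tokens2 = preprocess("It is an extraordinarily good phone")
```

In the second example "extraordinarily" is dropped because it is longer than 10 characters.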
[Figure: precision of classification (y-axis, low to high) against the percentage of total unique words used (x-axis, small to large). Annotations: "only the top 20% most important words used" and "almost 80% of the most important words used".]