Zhang second pdf


Transcript of Zhang second pdf

Page 1: Zhang second pdf

• What is a Review? • User-supplied descriptions/opinions of online products

• What is Helpfulness? • A criterion of a review's importance

• Why we use helpfulness • To fight spam reviews • To help users make decisions more easily

56 people found this helpful

review1: ‘hello world’
review2: ‘hello world again and again’
review3: ‘goodbye to a world’

unique words: ‘hello’, ‘world’, ‘again’, ‘goodbye’, ‘and’, ‘to’, ‘a’

vector1: [1, 1, 0, 0, 0, 0, 0]
vector2: [1, 1, 1, 0, 1, 0, 0]
vector3: [0, 1, 0, 1, 0, 1, 1]

unique words vector: [0/1, 0/1, 0/1, 0/1, 0/1, 0/1, 0/1] with 7 dimensions (one binary entry per unique word)
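A minimal Python sketch (not from the slides; the function and variable names are illustrative) of how these binary vectors can be built:

```python
# Build the 0/1 word-occurrence vectors from the slide's toy reviews.
reviews = [
    "hello world",
    "hello world again and again",
    "goodbye to a world",
]
vocab = ["hello", "world", "again", "goodbye", "and", "to", "a"]  # slide's word order

def to_binary_vector(text, vocab):
    """1 if the word occurs in the review, 0 otherwise (repeats are ignored)."""
    words = set(text.split())
    return [1 if w in words else 0 for w in vocab]

vectors = [to_binary_vector(r, vocab) for r in reviews]
print(vectors)
# [[1, 1, 0, 0, 0, 0, 0], [1, 1, 1, 0, 1, 0, 0], [0, 1, 0, 1, 0, 1, 1]]
```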

• 10,000 reviews contain 30,000 unique words • so each review becomes a 30,000-dimensional vector, but a review has only about 100 words on average • roughly 99% of the elements in each vector end up zero (the vectors are sparse)
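Arithmetic check: about 100 distinct words out of 30,000 means roughly 0.3% of the entries are non-zero, which is where the 99%-zeros figure comes from. A hedged sketch, assuming scikit-learn (the slides do not name any library), of building this binary matrix in a sparse format:

```python
# Sketch (assumption: scikit-learn / SciPy; the slides do not name a library).
# CountVectorizer(binary=True) builds exactly this kind of 0/1 matrix and stores
# it sparsely, so the ~99% zero entries cost essentially no memory.
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["hello world", "hello world again and again", "goodbye to a world"]

vectorizer = CountVectorizer(binary=True, token_pattern=r"(?u)\b\w+\b")  # keep 1-letter words like 'a'
X = vectorizer.fit_transform(reviews)   # SciPy sparse matrix: (n_reviews, n_unique_words)

density = X.nnz / (X.shape[0] * X.shape[1])
print(X.shape, f"{density:.1%} of entries are non-zero")
```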

http://www.mlsociety.com/wp-content/uploads/2016/10/PredictionMachineLearning.png

(Figure: each review vector is paired with its helpfulness score as a training example: vector1 with helpfulness1, vector2 with helpfulness2, vector3 with helpfulness3.)

These are the standard word-scoring measures used for feature selection:

Document frequency:
$$ f(t, d_i) = \begin{cases} 1, & \text{if } t \text{ occurs in } d_i \\ 0, & \text{if } t \text{ does not occur in } d_i \end{cases} \qquad F(t) = \sum_{i=1}^{n} f(t, d_i) $$

Information gain:
$$ G(t) = -\sum_{i=1}^{m} \Pr(c_i) \log \Pr(c_i) + \Pr(t) \sum_{i=1}^{m} \Pr(c_i \mid t) \log \Pr(c_i \mid t) + \Pr(\bar{t}) \sum_{i=1}^{m} \Pr(c_i \mid \bar{t}) \log \Pr(c_i \mid \bar{t}) $$

Mutual information:
$$ I(t, c) = \log \frac{\Pr(t \wedge c)}{\Pr(t) \times \Pr(c)} \qquad I_{\max}(t) = \max_{i=1}^{m} \{ I(t, c_i) \} $$

Chi-squared statistic (A, B, C, D are the cells of the term-class contingency table, N the total number of reviews):
$$ \chi^2(t, c) = \frac{N \times (AD - BC)^2}{(A + C) \times (B + D) \times (A + B) \times (C + D)} \qquad \chi^2_{\max}(t) = \max_{i=1}^{m} \{ \chi^2(t, c_i) \} $$

Here d_1, ..., d_n are the reviews and c_1, ..., c_m the classes.

• calculate a score for each word to determine its importance • the higher the score, the more important the word
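As one concrete illustration, here is a small Python sketch of the χ² score defined above. The meanings of A, B, C, D follow the usual term-class contingency-table convention, which the slides leave implicit; the counts in the example call are made up.

```python
# Chi-squared word score, following the slide's formula.
# Assumed conventions (not spelled out on the slides): for a term t and class c,
#   A = # reviews in c containing t       B = # reviews outside c containing t
#   C = # reviews in c without t          D = # reviews outside c without t
def chi_squared(A, B, C, D):
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - B * C) ** 2 / denom if denom else 0.0

def chi_squared_max(tables_per_class):
    """chi^2_max(t): the maximum chi^2(t, c_i) over all classes c_1..c_m."""
    return max(chi_squared(*t) for t in tables_per_class)

# Hypothetical counts: a word concentrated in one class scores high.
print(chi_squared(A=80, B=20, C=120, D=780))
```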

• Classification method: Naive Bayes • separate reviews into two categories • zero helpfulness vs. non-zero helpfulness
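A minimal sketch of this classifier using scikit-learn. The slides only say "Naive Bayes"; BernoulliNB (a natural variant for 0/1 vectors), the toy data, and the variable names are assumptions.

```python
# Sketch: Naive Bayes over binary review vectors, with two classes:
# 1 = review received at least one helpfulness vote, 0 = zero helpfulness.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy stand-ins for the real review texts and their helpfulness vote counts.
reviews = ["hello world", "hello world again and again", "goodbye to a world"]
helpfulness_votes = [56, 0, 3]

X = CountVectorizer(binary=True).fit_transform(reviews)
y = [1 if votes > 0 else 0 for votes in helpfulness_votes]

clf = BernoulliNB().fit(X, y)
print(clf.predict(X))
```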

• Test method • 10-fold cross validation
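Continuing the sketch above, 10-fold cross-validation could be run like this (again assuming scikit-learn; X and y are the binary matrix and labels built from the full review collection, so every fold has enough examples):

```python
# Sketch: 10-fold cross-validation of the Naive Bayes classifier.
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

scores = cross_val_score(BernoulliNB(), X, y, cv=10)  # X, y as in the previous sketch
print("mean accuracy over 10 folds:", scores.mean())
```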

• pre-processing • remove stopwords • remove punctuation • restrict word length to between 2 and 10
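A hedged sketch of this pre-processing step; the stopword list and the tokenization rule are illustrative, since the slides do not specify them.

```python
# Pre-processing sketch: lowercase, strip punctuation, drop stopwords,
# and keep only words of length 2 to 10.
import re

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "is", "it", "this", "that"}

def preprocess(text):
    text = re.sub(r"[^\w\s]", " ", text.lower())   # punctuation -> spaces
    return [w for w in text.split()
            if w not in STOPWORDS and 2 <= len(w) <= 10]

print(preprocess("It's helpful, and surprisingly well-written!"))
# ['helpful', 'well', 'written']
```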

• # of reviews used • 2,000 • 18,000

(Figure: precision of classification, from low to high, plotted against the percentage of total unique words used, from small to large. Annotations on the plot: "only used top 20% most important words" and "used almost 80% most important words".)