Zhang second pdf


Transcript of Zhang second pdf

Page 1: Zhang second pdf

• What is a Review? • User-supplied descriptions/opinions of online products

• What is Helpfulness? • A criterion of a review's importance

• Why we use helpfulness • To fight spam reviews • To help users make decisions more easily

56 people found this helpful

review1: ‘hello world’
review2: ‘hello world again and again’
review3: ‘goodbye to a world’

unique words: ‘hello’, ‘world’, ‘again’, ‘goodbye’, ‘and’, ‘to’, ‘a’

vector1: [1, 1, 0, 0, 0, 0, 0]
vector2: [1, 1, 1, 0, 1, 0, 0]
vector3: [0, 1, 0, 1, 0, 1, 1]

unique words vector: [0/1, 0/1, 0/1, 0/1, 0/1, 0/1, 0/1] with 7 dimensions (one binary entry per unique word)
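A minimal Python sketch (not from the slides; the function and variable names are illustrative) of how these binary vectors can be built:

```python
# Build the 0/1 word-occurrence vectors from the slide's toy reviews.
reviews = [
    "hello world",
    "hello world again and again",
    "goodbye to a world",
]
vocab = ["hello", "world", "again", "goodbye", "and", "to", "a"]  # slide's word order

def to_binary_vector(text, vocab):
    """1 if the word occurs in the review, 0 otherwise (repeats are ignored)."""
    words = set(text.split())
    return [1 if w in words else 0 for w in vocab]

vectors = [to_binary_vector(r, vocab) for r in reviews]
print(vectors)
# [[1, 1, 0, 0, 0, 0, 0], [1, 1, 1, 0, 1, 0, 0], [0, 1, 0, 1, 0, 1, 1]]
```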

• 10,000 reviews contain 30,000 unique words • so each review becomes a 30,000-dimensional vector, but a review has only about 100 words on average • roughly 99% of the elements in each vector end up zero (the vectors are sparse)
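Arithmetic check: about 100 distinct words out of 30,000 means roughly 0.3% of the entries are non-zero, which is where the 99%-zeros figure comes from. A hedged sketch, assuming scikit-learn (the slides do not name any library), of building this binary matrix in a sparse format:

```python
# Sketch (assumption: scikit-learn / SciPy; the slides do not name a library).
# CountVectorizer(binary=True) builds exactly this kind of 0/1 matrix and stores
# it sparsely, so the ~99% zero entries cost essentially no memory.
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["hello world", "hello world again and again", "goodbye to a world"]

vectorizer = CountVectorizer(binary=True, token_pattern=r"(?u)\b\w+\b")  # keep 1-letter words like 'a'
X = vectorizer.fit_transform(reviews)   # SciPy sparse matrix: (n_reviews, n_unique_words)

density = X.nnz / (X.shape[0] * X.shape[1])
print(X.shape, f"{density:.1%} of entries are non-zero")
```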

http://www.mlsociety.com/wp-content/uploads/2016/10/PredictionMachineLearning.png

(Figure: each review vector is paired with its helpfulness score as a training example: vector1 with helpfulness1, vector2 with helpfulness2, vector3 with helpfulness3.)

These are the standard word-scoring measures used for feature selection:

Document frequency:
$$ f(t, d_i) = \begin{cases} 1, & \text{if } t \text{ occurs in } d_i \\ 0, & \text{if } t \text{ does not occur in } d_i \end{cases} \qquad F(t) = \sum_{i=1}^{n} f(t, d_i) $$

Information gain:
$$ G(t) = -\sum_{i=1}^{m} \Pr(c_i) \log \Pr(c_i) + \Pr(t) \sum_{i=1}^{m} \Pr(c_i \mid t) \log \Pr(c_i \mid t) + \Pr(\bar{t}) \sum_{i=1}^{m} \Pr(c_i \mid \bar{t}) \log \Pr(c_i \mid \bar{t}) $$

Mutual information:
$$ I(t, c) = \log \frac{\Pr(t \wedge c)}{\Pr(t) \times \Pr(c)} \qquad I_{\max}(t) = \max_{i=1}^{m} \{ I(t, c_i) \} $$

Chi-squared statistic (A, B, C, D are the cells of the term-class contingency table, N the total number of reviews):
$$ \chi^2(t, c) = \frac{N \times (AD - BC)^2}{(A + C) \times (B + D) \times (A + B) \times (C + D)} \qquad \chi^2_{\max}(t) = \max_{i=1}^{m} \{ \chi^2(t, c_i) \} $$

Here d_1, ..., d_n are the reviews and c_1, ..., c_m the classes.

• calculate a score for each word to determine its importance • the higher the score, the more important the word
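As one concrete illustration, here is a small Python sketch of the χ² score defined above. The meanings of A, B, C, D follow the usual term-class contingency-table convention, which the slides leave implicit; the counts in the example call are made up.

```python
# Chi-squared word score, following the slide's formula.
# Assumed conventions (not spelled out on the slides): for a term t and class c,
#   A = # reviews in c containing t       B = # reviews outside c containing t
#   C = # reviews in c without t          D = # reviews outside c without t
def chi_squared(A, B, C, D):
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - B * C) ** 2 / denom if denom else 0.0

def chi_squared_max(tables_per_class):
    """chi^2_max(t): the maximum chi^2(t, c_i) over all classes c_1..c_m."""
    return max(chi_squared(*t) for t in tables_per_class)

# Hypothetical counts: a word concentrated in one class scores high.
print(chi_squared(A=80, B=20, C=120, D=780))
```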

• Classification method: Naive Bayes • separate reviews into two categories • zero helpfulness vs. non-zero helpfulness
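A minimal sketch of this classifier using scikit-learn. The slides only say "Naive Bayes"; BernoulliNB (a natural variant for 0/1 vectors), the toy data, and the variable names are assumptions.

```python
# Sketch: Naive Bayes over binary review vectors, with two classes:
# 1 = review received at least one helpfulness vote, 0 = zero helpfulness.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy stand-ins for the real review texts and their helpfulness vote counts.
reviews = ["hello world", "hello world again and again", "goodbye to a world"]
helpfulness_votes = [56, 0, 3]

X = CountVectorizer(binary=True).fit_transform(reviews)
y = [1 if votes > 0 else 0 for votes in helpfulness_votes]

clf = BernoulliNB().fit(X, y)
print(clf.predict(X))
```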

• Test method • 10-fold cross validation
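Continuing the sketch above, 10-fold cross-validation could be run like this (again assuming scikit-learn; X and y are the binary matrix and labels built from the full review collection, so every fold has enough examples):

```python
# Sketch: 10-fold cross-validation of the Naive Bayes classifier.
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

scores = cross_val_score(BernoulliNB(), X, y, cv=10)  # X, y as in the previous sketch
print("mean accuracy over 10 folds:", scores.mean())
```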

• pre-processing • remove stopwords • remove punctuation • restrict word length to between 2 and 10
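A hedged sketch of this pre-processing step; the stopword list and the tokenization rule are illustrative, since the slides do not specify them.

```python
# Pre-processing sketch: lowercase, strip punctuation, drop stopwords,
# and keep only words of length 2 to 10.
import re

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "is", "it", "this", "that"}

def preprocess(text):
    text = re.sub(r"[^\w\s]", " ", text.lower())   # punctuation -> spaces
    return [w for w in text.split()
            if w not in STOPWORDS and 2 <= len(w) <= 10]

print(preprocess("It's helpful, and surprisingly well-written!"))
# ['helpful', 'well', 'written']
```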

• # of reviews used • 2,000 • 18,000

(Figure: precision of classification, from low to high, plotted against the percentage of total unique words used, from small to large. Annotations on the plot: "only used top 20% most important words" and "used almost 80% most important words".)