Source: pike.psu.edu/presentations/oracle.pdf
Oracle, Where Shall I Submit My Precious Papers?
IST Faculty Brown Bag
Sep. 22, 2006
Dongwon Lee
Credits
Students
  Ergin Elmacioglu (CSE, Penn State)
  Su Yan (IST, Penn State)
  Ziming Zhuang (IST, Penn State)
Colleagues
  Lee Giles (Penn State)
  Min-Yen Kan (NUS, Singapore)
  Jaewoo Kang (Korea U., Korea)
  Divesh Srivastava (AT&T Labs – Research)
What do I do?
Databases / Data Mining
Digital Libraries / Info. Retrieval
XML / Web
What projects do I do?
Databases / Data Mining
Digital Libraries / Info. Retrieval
XML / Web
IBM Eclipse, 2004 & 2006
Penn State eBRC, 2005
Microsoft SciData 2005
NSF OISE 2006
Today’s Talk
Outline
Motivation
Simple Study
Results
Summary
MIT’s Prank
http://pdos.csail.mit.edu/scigen/
The World Multi-Conference on Systemics, Cybernetics and Informatics (SCI)
Annoyance…
“Dong-Won Lee” as PC?
WMSCI2006
Some Known Questionable Venues
From http://www.inesc-id.pt/~aml/trash.html:
  IMCSE: International Multiconference in Computer Science and Computer Engineering
  WMSCI or SCI: World Multiconference on Systemics, Cybernetics and Informatics
  ICCCT: International Conference on Computing, Communications and Control Technologies
  PISTA: Conference on Politics and Information Systems: Technologies and Applications
  SSCCII: Symposium of Santa Caterina on Challenges in the Internet and Interdisciplinary Research
  CITSA: International Conference on Cybernetics and Information Technologies, Systems and Applications
  ISAS: International Conference on Information Systems Analysis and Synthesis
  CISCI: Conferencia Iberoamericana en Sistemas, Cibernética e Informática
  SIECI: Simposium Iberoamericano de Educación, Cibernética e Informática
  WCAC: World Congress in Applied Computing
  Any IPSI International Conference or journal
  Any GESTS international conference or journal
  KCPR: International Conference on Knowledge Communication and Peer Reviewing
  International e-Conference on Computer Science
  …
http://fakeconferences.org => taken down after a threat
Fakes Everywhere
Microsoft HoneyMonkey
Fake Venues
According to fakeconferences.org,
“… fake venues are ones that are organized for the revenue, not for the advancement of science…They share a lot in common…an abundance of varying, vaguely connected topics, high frequency of conference, spam mailings, obscure organizers and sponsors, and poor peer reviewing and randomly accepting papers …”
WMSCI listed close to 300 research topics as relevant in its Call for Papers (CFP), and reportedly accepted 2,165 and 2,904 papers in 2003 and 2004, respectively.
Differences in Disciplines
Computer Science
  Peer-reviewed conferences
  Top conferences have 5-15% acceptance rate
  Specialized and small conferences (attendance of 500+)
  Often value conferences > journals
Pure Sciences (e.g., Math, Physics)
  Pre-print at arXiv.org
  Rigorous reviews for journals
  Huge flagship conference (ICM 98 attracted ~4000)
Social Sciences
  Often value journals > conferences
  Conferences are mostly for gathering or short abstract-based screening
  Rigorous reviews for journals
Outline
Motivation
Simple Study
Results
Summary
Research Question
Can we detect the so-called “fake venues” automatically?
Desiderata
  Large number of venues per year => scalable
  Automatic detection => no human involvement
  False positives >> false negatives
[Figure: Histogram of CFPs in dbworld; x-axis: year (1999-2006); y-axis: number of CFPs (0-800)]
Candidate Features
Good vs. bad venues
  Citation counting (e.g., Impact Factor)
  Acceptance rate
  Reputation (e.g., society)
  History
  …
In the end, none satisfies our desiderata. We need something else…
Research Hypothesis
PC member list is readily available from the CFP => data extraction + data cleaning
Each CFP has only a finite number of PC members => scalability
Examine quality of PC w.r.t. heuristics:
  Citation counting, productivity, centrality, betweenness, impact, …
Qualities of venues are closely correlated with those of the PC members of the venues
Data Mining Models
Outlier detection
Clustering
Classification: training set => Fake?
Classification w. Decision Tree
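[Figure: decision tree learned from a training set to answer "Fake?"; internal nodes test "PC has feature A?" (Yes / No), "PC has feature B?", "PC has feature C?", and "PC has feature D?"; leaves are "Regular venue" or "Fake venue"]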
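To make the tree slide concrete, here is a minimal sketch of how a decision-tree learner might pick one test of the form "feature <= threshold". It uses plain information gain, which is a simplification of the gain ratio that C4.5 applies. The feature values, labels, and function names below are hypothetical toy data, not figures from the study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the split 'value <= threshold' with the highest information gain."""
    base = entropy(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if gain > best[1]:
            best = (t, gain)
    return best


# Hypothetical toy data: average publications per PC member for six venues.
avg_pubs = [2.0, 3.5, 4.0, 20.0, 25.0, 31.0]
is_fake = ["fake", "fake", "fake", "regular", "regular", "regular"]
threshold, gain = best_threshold(avg_pubs, is_fake)
print(threshold, gain)
```

A full tree would apply the same search recursively to each side of the chosen split, over all available PC features.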
5 Classification Features
# of PC
# of publications of PC
# of co-authors of PC
Closeness centrality of PC
Betweenness centrality of PC
…
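The first three features can be sketched from a list of papers and a venue's PC list. This is only an illustration: averaging over PC members is an assumption, the function names are made up for this example, and the author names are toy placeholders. The two centrality features need the full collaboration graph and are left out here.

```python
from collections import defaultdict


def build_author_stats(papers):
    """papers: list of author-name lists, one list per paper.

    Returns (publication counts, co-author sets), both keyed by author name.
    """
    pubs = defaultdict(int)
    coauthors = defaultdict(set)
    for authors in papers:
        for a in authors:
            pubs[a] += 1
            coauthors[a].update(b for b in authors if b != a)
    return pubs, coauthors


def venue_features(pc_members, pubs, coauthors):
    """Aggregate per-author statistics into per-venue features (assumed: simple averages)."""
    n = len(pc_members)
    if n == 0:
        return {"num_pc": 0, "avg_pubs": 0.0, "avg_coauthors": 0.0}
    return {
        "num_pc": n,
        # PC members missing from the bibliography count as zero publications.
        "avg_pubs": sum(pubs.get(m, 0) for m in pc_members) / n,
        "avg_coauthors": sum(len(coauthors.get(m, ())) for m in pc_members) / n,
    }


# Toy bibliography with placeholder author names.
papers = [["A", "B"], ["A", "C"], ["B", "C", "D"]]
pubs, coauthors = build_author_stats(papers)
features = venue_features(["A", "D", "Z"], pubs, coauthors)
print(features)
```

PC member "Z" has no papers in the toy bibliography, which mirrors the case of a PC name that cannot be matched to any author record.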
Set-Up
ACM DL: downloaded data of 1950-2004
  0.6M authors, 0.7M articles
  1.2M edges (i.e., collaborations)
Dbworld: 2,979 CFPs (free-text formats)
  16,147 distinct PC names
Hand-selected 20 fake venues (the set Q)
Laborious cleaning process for venue names, PC names, and citations:
  Entity resolution
  Name disambiguation
  Record linkage
  => Another Talk
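As a rough illustration of why the cleaning step is laborious, the sketch below groups spelling variants of a PC name (such as "Dong-Won Lee" and "Dongwon Lee") under one crude key. This is not the cleaning pipeline used in the study. The function names are hypothetical, and the key handles neither initials nor distinct people who share a name.

```python
import re
import unicodedata
from collections import defaultdict


def name_key(name):
    """Crude blocking key: ASCII-folded, lowercased, hyphens and punctuation removed."""
    folded = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Turn "Lee, Dongwon" into "Dongwon Lee".
    if "," in folded:
        last, _, first = folded.partition(",")
        folded = f"{first} {last}"
    # Join hyphenated given names: "Dong-Won" becomes "dongwon".
    folded = folded.lower().replace("-", "")
    return " ".join(re.findall(r"[a-z]+", folded))


def group_variants(names):
    """Group raw name strings that share the same key."""
    groups = defaultdict(list)
    for n in names:
        groups[name_key(n)].append(n)
    return dict(groups)


variants = group_variants(["Dong-Won Lee", "Dongwon Lee", "Lee, Dongwon", "Lee Giles"])
print(variants)
```

A key like this only proposes candidate matches. Deciding whether two records are really the same person is the name disambiguation problem listed above.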
Outline
Motivation
Simple Study
Results
Summary
# of PC
[Figure: histogram over the number of PC members; x-axis: number of pc (10-410); left y-axis: fraction of conferences (0%-35%); right y-axis: probability of Q (0-1)]
# of publications of PC
[Figure: histogram of the number of publications of PC members; y-axis: Percent (0-80); x-axis: 0.0-52.5; legend residue: C LQCQ]
Closeness centrality of PC
[Figure: histogram of the average closeness of PC members; x-axis: average closeness (0.005-0.095); left y-axis: fraction of conferences (0%-14%); right y-axis: "probability of ..." (0-0.3); the labels LQ, C, and Q also appear in the figure]
Closeness centrality:
$$C_C(v) = \frac{n - 1}{\sum_{w \in G} d(v, w)}$$
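Closeness centrality of a node v is commonly defined as (n - 1) divided by the sum of shortest-path distances from v to the other nodes. A minimal sketch on an unweighted collaboration graph can use breadth-first search. How the study handled disconnected authors is not stated, so this sketch assumes n counts only v and the nodes reachable from it. The graph below is a toy example.

```python
from collections import deque


def closeness(graph, v):
    """Closeness of v: (n - 1) / sum of BFS distances from v.

    graph: dict mapping each node to an iterable of its neighbours.
    Assumption: n is the size of v's connected component, v included.
    """
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in graph.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    if total == 0:
        # Isolated node: no other node is reachable.
        return 0.0
    return (len(dist) - 1) / total


# Toy collaboration graph: a path a - b - c.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(closeness(graph, "a"), closeness(graph, "b"))
```

Averaging this value over a venue's PC members would give a per-venue number comparable to the "average closeness" axis of the figure.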
Combining All Features
Method          Precision   Recall
Naïve (C4.5)    0.877       0.965
Bagging         0.899       0.979
Boosting        0.938       0.964
[Figure: decision tree with nodes "PC has feature A?", "PC has feature B?", "PC has feature C?", and "PC has feature D?"]
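For reference, precision and recall can be computed from predicted and true labels as sketched below. Treating "fake" as the positive class is an assumption, and the labels are toy data rather than the study's results.

```python
def precision_recall(predicted, actual, positive="fake"):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    # Guard against empty denominators when nothing is predicted or labeled positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels for five venues.
predicted = ["fake", "fake", "regular", "fake", "regular"]
actual = ["fake", "regular", "regular", "fake", "fake"]
precision, recall = precision_recall(predicted, actual)
print(precision, recall)
```

Precision is the share of venues flagged as fake that really are fake. Recall is the share of truly fake venues that were flagged.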
More than “usual suspects”
Classification detected two additional venues:
  The 2nd International Advanced Database Conference (IADC)
  The 4th International Conference on Computer Science and its Applications (ICCSA)
Neither was part of the original Q
PSU Prank
On Apr. 10, 2006, we generated 3 bogus papers using MIT's SCIgen software:
  P1 by Ethan Patel
  P2 by Simon R. Hathaway
  P3 by Richard Zhang
[Images labeled P2 and P1]
PSU Prank
Indiana’s Inauthentic Paper Detector says:
  P1: 28.9% => inauthentic
  P2: 61.5% => authentic
  P3: 38% => inauthentic
PSU Prank
April 24 – May 1, 2006: submitted (1) P1 to ICCSA on April 24, (2) P2 to IADC on April 26, and (3) P3 to ICCSA on May 1.
May 15, 2006:
  P1 and P2 accepted w/o reviews
  P3 rejected w/o reviews
  Asked for reviews or any rationale => no response so far
“Ethan Patel” made it!
“Richard Zhang” too!
Outline
Motivation
Simple Study
Results
Summary
Summary
Practical setting of outlier detection
  Semantic outlier vs. syntactic outlier
Developing general semantic outlier detection framework
Applying to other practical problems
  E.g., GM counterfeit detection
Developing general venue ranking framework
  AppleRank project