Marianne hoogeveen demo1

15
Jobmash Job searching without the pain Marianne Hoogeveen

Transcript of Marianne hoogeveen demo1

Jobmash

Job searching without the pain

Marianne Hoogeveen

Searching for data science jobs is tiring and depressing

‘Data Scientist’ Ranges from Excel pusher to software engineer

! MANY jobs, but how to find the right kind?

! What IS the right kind (for me)?

Want “more like this” feature

MORE LIKE THIS, PLEASE!

Data: 11000 job postings from Indeed

Specific data challenges:Buzzwords (“world-class”, “driven”, “exciting”, “mission”, “opportunity”)

Difficult to validate: what is ground truth?

Look for “Data Scientist”, “Data Engineer”, “Data Analyst” job descriptions

US Wide

Compute text similarity

Remove low-information words (stop words, dates, locations, numbers, common verbs, …)

Count occurrence of words (and bigrams), weighted negatively if they are common in the corpus of all documents (TF-IDF)

Compare similarity between these weighted TF-IDF vectors using cosine similarity

What are the job titles of top-100 most similar job postings?

Search term: Data Scientist Search term: Data EngineerSearch term: Data Analyst

DA DA DS DEDSDS DE DEDAother other other

Using cosine similarity on TF-IDF vectors, after removing buzzwords; find 100 most similar and compare job titles

Which job titles can we predict from job description?

Removing buzzwords

False positive rate0.0 1.00.4 0.6 0.80.2

1.0 1.0

False positive rate0.0 1.00.4 0.6 0.80.2

Keeping buzzwords

True

pos

itive

rate

True

pos

itive

rate

(“world-class”, “exciting”, “mission”, “opportunity”)

Let’s have a look!

JOBMASH APP

Why is this useful?

Less irrelevant results

Don’t miss similar jobs that don’t have the right job title

About me

PhD Theoretical Physics, King’s College London

Data Science Internship at Cytora Ltd:

Recognising street addresses in newspaper articles

Extra slides

Validation

200 random docs

Compare with same docs cut in half

Compute pairwise cosine similarities

0

0.2

0.4

0.6

0.8

1

Orig

inal

doc

umen

ts s

orte

d by

ID

Half documents sorted by original’s ID

Topic modellingSalient terms:

ClientMachineModelSystemStatusFinancialResearchRiskAnalysisStatisticalEngineerEmploymentSupportBig

LDA topics

e.g. “analysis”, “report”, “statistical” , “excel”, “sas”, “insight”

e.g. “big”, “technology”, “system” , “engineer”, “design”, “build”, “hadoop”, “platform”

Other LDA topics

e.g. “status”, “disability”, “gender” e.g. “benefit”, “dental”, “pay”, “medical”, “401k”