Ml masterclass

Post on 15-Apr-2017


Machine Learning Masterclass

ML Overview

● “AI”, “ML”, lots of hype - but what does it actually mean?

● Systems that learn from their experience over time, without explicit programming

● We aren’t in the business of building brains… (99% of us aren’t, at least) and you shouldn’t be either

ML Fun: Where’s Wally (Waldo)

Classification

The Problem

● Given a set of observations we want to be able to predict what class a new point belongs to

● e.g. does this patient have disease X given a set of measurements

Example Algorithm: Random Forests

● We subdivide the data into two regions with an axis-aligned line

● We continue subdividing the data in the areas which have a bad mix of classes

● We build many of these decision trees

● Each performs poorly individually

● Their combined vote is powerful
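The three steps above (axis-aligned splits, repeated subdivision of mixed regions, many trees voting) can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names, the Gini impurity measure, the depth limit, the tree count and the toy data are all assumptions chosen for the example.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity: 0 when all labels agree, higher for a worse mix of classes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Find the axis-aligned split (feature, threshold) with the lowest impurity."""
    best = None
    for f in features:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold halfway between neighbours
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels, rng, depth=0, max_depth=3):
    """Keep subdividing regions that still contain a mix of classes."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or gini(labels) == 0.0:
        return majority  # leaf: predict the majority class
    n_features = len(rows[0])
    # Each split only looks at a random subset of the features.
    features = rng.sample(range(n_features), max(1, int(n_features ** 0.5)))
    split = best_split(rows, labels, features)
    if split is None:
        return majority
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], rng, depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], rng, depth + 1, max_depth))

def predict_tree(tree, row):
    """Walk down the tree until a leaf (a plain class label) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def build_forest(rows, labels, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx], rng))
    return forest

def predict_forest(forest, row):
    """The forest's prediction is the majority vote of its trees."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]

# Toy data (an assumption for illustration): two well-separated clusters.
data_rng = random.Random(42)
rows = ([(data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)) for _ in range(20)]
        + [(data_rng.uniform(4, 6), data_rng.uniform(4, 6)) for _ in range(20)])
labels = [0] * 20 + [1] * 20
forest = build_forest(rows, labels)
print(predict_forest(forest, (0.0, 0.0)), predict_forest(forest, (5.0, 5.0)))
```

On data this clean a single tree is already enough; the benefit of voting shows up on noisy, overlapping classes, where individual trees overfit their bootstrap samples in different ways.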

Many Algorithms

Regression

The Problem

● Given a set of observations we want to be able to predict the value associated with a new point

● e.g. how profitable will our website be next month? What’s the value of my house?

Example Algorithm: Gaussian Processes

● We pick a method of how we wish to join the dots

● In the simplest case we fit a line to the data

● Infinitely many functions can join the dots - the simpler the better (Occam’s Razor)
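The simplest case, fitting a line, can be sketched with ordinary least squares. The function name and the toy data below are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both up to the same constant factor).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy observations lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
prediction = slope * 4.0 + intercept  # predicted value for a new point at x = 4
print(slope, intercept, prediction)
```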

● The ‘kernel’ describes what type of trends we expect and how to interpolate

https://github.com/jkfitzsimons/IPyNotebook_MachineLearning/blob/master/Just%20Another%20Kernel%20Cookbook....ipynb
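How a kernel drives the interpolation can be sketched with the posterior mean of a Gaussian process. This is a minimal sketch under stated assumptions: a squared-exponential (RBF) kernel, a zero prior mean, a small noise term for numerical stability, and hypothetical helper names. It is not taken from the notebook linked above.

```python
import math

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel: nearby inputs get strongly correlated outputs."""
    return math.exp(-((a - b) ** 2) / (2.0 * length_scale ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def gp_posterior_mean(train_x, train_y, test_x, noise=1e-6, length_scale=1.0):
    """Posterior mean k(x*, X) (K + noise * I)^-1 y of a zero-mean GP."""
    K = [[rbf_kernel(a, b, length_scale) + (noise if i == j else 0.0)
          for j, b in enumerate(train_x)]
         for i, a in enumerate(train_x)]
    alpha = solve(K, train_y)
    return [sum(rbf_kernel(x, a, length_scale) * w for a, w in zip(train_x, alpha))
            for x in test_x]

# Toy observations of sin(x); the GP "joins the dots" between them.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [math.sin(x) for x in train_x]
mean_at_train = gp_posterior_mean(train_x, train_y, train_x)
mean_between = gp_posterior_mean(train_x, train_y, [1.5])[0]
print(mean_between, math.sin(1.5))
```

Changing `length_scale` changes how smooth the interpolation is: a larger value expects slower trends, a smaller one lets the curve bend sharply between observations.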

Feature Learning

Example Algorithm: Autoencoders

● The observations have an extremely complex relationship to the output

● We have a lot of data

● Most of the data is redundant

● We wish to learn the useful latent features

Example Algorithm: PCA (EigenFaces)
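The idea of compressing redundant observations into a few useful latent features can be sketched with PCA on toy 2D data. The sketch below finds the first principal component by power iteration on the covariance matrix; the function name and data are assumptions for illustration, and real EigenFaces work applies the same idea to image pixels with many more dimensions.

```python
def first_principal_component(points, iterations=100):
    """Return the data mean and the direction of greatest variance."""
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    centered = [[p[d] - mean[d] for d in range(dim)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    # Power iteration: repeatedly multiply by cov and renormalise.
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return mean, v

# Toy data: the second coordinate is roughly twice the first, so it is mostly redundant.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
mean, component = first_principal_component(points)

# Each 2D point is compressed to a single latent number...
latent = [sum((p[d] - mean[d]) * component[d] for d in range(2)) for p in points]
# ...and approximately reconstructed from it.
reconstructed = [[mean[d] + z * component[d] for d in range(2)] for z in latent]
print(component)
```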

Yes, but how?

● How does one actually go about using it in any practical setting?

● Many applications are invisible, so it is hard to see the actual process

● There are principles and general concerns

● Four main issues: data, pipelining, error risk, institutionalization

#1: All comes down to data

● Quantity is important, but it’s far from being the only thing

● Hygiene is key - structured is better than unstructured, complete is better than partial

● Bottleneck is often knowing what data is important, matched to goals

● Data scientists spend 80%+ of their time cleaning + preprocessing data, before any analysis is done

● Side note: Data science != machine learning; some highly competent data scientists are skilled in ML methods, but they may not necessarily be able to create new algorithms

#2: Data pipelining

● Having the data is no good if you can’t get it to where it needs to be

● Operating in-place is the ultimate, but extremely difficult

● The data lake problem: lake grows exponentially, replication

● Define streaming vs batch (examples of streaming vs batch)
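One way to make the streaming vs batch distinction concrete is to compute the same statistic both ways. This is a small sketch; the names and the sensor-style readings are assumptions for illustration.

```python
def batch_mean(values):
    """Batch: all data is available up front and processed in one go."""
    return sum(values) / len(values)

class StreamingMean:
    """Streaming: update the answer one record at a time, storing no history."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        # Incremental update: nudge the mean towards the new value.
        self.mean += (value - self.mean) / self.count
        return self.mean

readings = [12.0, 15.0, 11.0, 14.0, 18.0]

stream = StreamingMean()
running = [stream.update(r) for r in readings]  # an answer is available after every record

print(batch_mean(readings), stream.mean)
```

The batch version needs the whole dataset in one place before it can answer; the streaming version keeps only two numbers in memory and always has an up-to-date answer.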

#3: Error risk

● Machine learning models are never 100% accurate

● What happens when the model is wrong?

● Play out consequences, their magnitude, and scope

● The best applications combine low risk with high gain
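Playing out the consequences of a wrong model can start with a back-of-the-envelope expected cost. The sketch below is one possible way to do this for a yes/no classifier; the function name and every number in it are hypothetical and chosen only to show the arithmetic.

```python
def expected_cost_per_prediction(p_positive, false_negative_rate,
                                 false_positive_rate, cost_fn, cost_fp):
    """Average cost of errors per prediction for a binary classifier."""
    # Probability that a case is truly positive and the model misses it.
    p_fn = p_positive * false_negative_rate
    # Probability that a case is truly negative and the model raises a false alarm.
    p_fp = (1 - p_positive) * false_positive_rate
    return p_fn * cost_fn + p_fp * cost_fp

# Hypothetical numbers: 10% of cases are positive, the model misses 20% of
# them and raises false alarms on 5% of negatives; a miss costs 100 units
# and a false alarm costs 10 units.
cost = expected_cost_per_prediction(
    p_positive=0.10,
    false_negative_rate=0.20,
    false_positive_rate=0.05,
    cost_fn=100.0,
    cost_fp=10.0,
)
print(cost)
```

Comparing this figure with the gain per correct prediction gives a rough sense of whether an application sits in the low-risk, high-gain corner.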

#4: Institutionalization

● Every project must consider how the results will be used

● Who will use the results? Will the results be factored into decision-making, or will action be taken automatically?

● It’s not just about “doing machine learning”, it’s about creating a culture that uses ML as a core tool

● Data-driven decision making, only more evolved

● Leaders in the space make it so that every person in their organization can answer the “why” question

A lot of work!

The Upshot

● Google dropped energy usage in data centers by 40%, which translates to $100M USD / year

● Self-driving cars are reality now (Uber, Tesla, countless others)

● IBM Watson being used for developing cancer treatments and providing supporting diagnoses

● Better security: access control at Amazon

● Genome sequencing (makes heavy use of various ML methods)

● CERN, LHC: Collision data (Higgs Boson, anyone?)

● George Washington University: automatically learning optimal climate models

<shameless plug> Dubai Holding: increase profit margins by 25% in real estate businesses, $12B AED</shameless plug>