Software Engineering College 2 - ETL and databases

41
College 2 – ETL and loading data into R

Transcript of Software Engineering College 2 - ETL and databases

Page 1: Software Engineering College 2 - ETL and databases

College 2 – ETL and loading data into R

Page 2: Software Engineering College 2 - ETL and databases

Hoofdstukken

Kortom: Wat is er blijven hangen van het vorige college?

Page 3: Software Engineering College 2 - ETL and databases

Het doel van dit college is• Dat je om kunt gaan met messy data

(DataCamp)• Dat je begrip hebt van het ETL proces• Dat je kennis hebt van verschillende type data

acquisitie• Dat je zelfstandig online data kunt ETL-en in R

Page 4: Software Engineering College 2 - ETL and databases
Page 5: Software Engineering College 2 - ETL and databases

Messy data á la DataCamp

ETL (hfst 12 corporate performance mgt)

Loading data into R (hfst 2)

Er zat een API op een stokkie

Page 6: Software Engineering College 2 - ETL and databases
Page 7: Software Engineering College 2 - ETL and databases

Messy data according to DataCamp

1. Column headers are values, not variable names2. Multiple values are stored in one column3. Variables are stored in both rows and columns4. Multiple types of observational units are stored in the same table5. A single observational unit is stored in multiple tables

Page 8: Software Engineering College 2 - ETL and databases

MessyGatherUnite Seperate

Data.table -> dcast / reshape

Messy data according to DataCamp / tidyR

Page 10: Software Engineering College 2 - ETL and databases
Page 11: Software Engineering College 2 - ETL and databases
Page 12: Software Engineering College 2 - ETL and databases

Definities ETL

Extract The process of pulling data from a source server or systems to an intermediate format

TransformReconciles data type and format differences, resolves uniqueness issues, and ansures conformity of data before the data is loaded into the data warehouse, it can also contain repairing and cleansing data.

LoadMoves data from the staging area into the dimension** and fact tables**

** volgt in college over dwh

Page 13: Software Engineering College 2 - ETL and databases

3 methoden om gegevens op te halen

Batch ETL, • Nadat productiesysteem consistent is• Initiatie bij DWH

Online ETL, • Vanuit productiesysteem contiue update naar DWH

Ad-hoc ETL, Zodra de gebruiker om een rapport vraagt

Page 14: Software Engineering College 2 - ETL and databases

Proces stappen ETLIsoleren Let op

• structuur data (zie verderop presentatie)• Doorlooptijd (1 mB vs 10TB)• Netwerkbelasting • actualiteit

Prepareren Reconciles data type and format differences, resolves uniqueness issues, and ansures conformity of data before the data is loaded into the data warehouse, it can also contain repairing and cleansing data.

Ook aggregeren en verrijken

TransformerenUniform maken volgens DWH (ook wel mappings klaar maken)

LadenInladen in DWH

Page 15: Software Engineering College 2 - ETL and databases

OUT OF 380,000 RECORDS A SUBSET OF ~250,000 WAS SUITABLE FOR ANALYSIS

Data filtering and smart algorithms required for quality of data analysis.

Total

before

clean

sing

Connection time r

epair

Physicly

impossi

ble charg

e sess

ions

Unknown data

Double reco

rds

Short

time

Double provid

er

Net usab

le rec

ords

0

50000

100000

150000

200000

250000

300000

350000

400000

450000

Causes of ~35% data removal per error type

Page 16: Software Engineering College 2 - ETL and databases

EXAMPLE OF SHORT TIME ALGORITHM

SHORT TIME CHARGE SESSIONS FILTERED

Note: Our hypothesis is that loose cable connections and information transfer issues cause this problem.

total charge sesion

Session 1

Session 2

Session 3

Session 4

Session 5

Session 6

Session 7

Session 8

Session 9

The crawling algorithm checks on adjacent short times.

The algorithm influences the # charge sessions as well, and thus the mean session duration.

Source: Charge infrastructure forecast database

Page 17: Software Engineering College 2 - ETL and databases

WITH THE CHARGING DATA AS CENTRAL DATASET, THE DATABASE IS CONTINUOUSLY EXPANDED, EXTENDED AND ENRICHED AND SCRAPED

Data Extension

Data enrichment Data Scraping

Data Expansion

OCPI

Page 18: Software Engineering College 2 - ETL and databases
Page 19: Software Engineering College 2 - ETL and databases

Let op bij https daar wordt R niet blij van dus oplossing

Extract

# Failread.csv("https://raw.github.com/sciruela/Happiness-Salaries/master/data.csv")

# Winread.url <- function(url, ...){ tmpFile <- tempfile() download.file(url, destfile = tmpFile, method = "curl") url.data <- read.csv(tmpFile, ...) return(url.data)}read.url("https://raw.github.com/sciruela/Happiness-Salaries/master/data.csv")

Page 20: Software Engineering College 2 - ETL and databases

Tidy data betekent ook de juiste type variablelen

Page 21: Software Engineering College 2 - ETL and databases

Hoe zou je dit met dlyr aanpakken

En hoe met datatable?-

filter(flights, month == 1, day == 1)flights %>% filter( month == 1, day == 1)#> # A tibble: 842 x 19#> year month day dep_time sched_dep_time dep_delay arr_time#> <int> <int> <int> <int> <int> <dbl> <int>#> 1 2013 1 1 517 515 2 830#> 2 2013 1 1 533 529 4 850#> 3 2013 1 1 542 540 2 923#> 4 2013 1 1 544 545 -1 1004#> ... with 838 more rows, and 12 more variables: sched_arr_time <int>,#> arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,#> origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,#> minute <dbl>, time_hour <time>

Page 22: Software Engineering College 2 - ETL and databases
Page 23: Software Engineering College 2 - ETL and databases

RJDBC # Set JAVA_HOME, set max. memory, and load rJava library Sys.setenv(JAVA_HOME='/path/to/java_home') options(java.parameters="-Xmx2g") library(rJava) # Output Java version .jinit() print(.jcall("java/lang/System", "S", "getProperty", "java.version")) # Load RJDBC library library(RJDBC) # Create connection driver and open connection jdbcDriver <- JDBC(driverClass="oracle.jdbc.OracleDriver", classPath="lib/ojdbc6.jar") jdbcConnection <- dbConnect(jdbcDriver, "jdbc:oracle:thin:@//database.hostname.com:port/service_name_or_sid", "username", "password") # Query on the Oracle instance name. instanceName <- dbGetQuery(jdbcConnection, "SELECT instance_name FROM v$instance") print(instanceName) # Close connection dbDisconnect(jdbcConnection)

Page 24: Software Engineering College 2 - ETL and databases

RODBClibrary(RODBC) channel <- odbcDriverConnect("driver=SQL Server;server=01wh155073") initdata<- sqlQuery(channel,paste("select * from test_DB .. test_vikrant")) dim(initdata) odbcClose(channel)

gaan we nog leren tijdens de werkcolleges

Page 25: Software Engineering College 2 - ETL and databases

File types

Page 26: Software Engineering College 2 - ETL and databases

JSON versus XML

Beide file types zijn ongestructureerd maar toch.. Leg uit!

Page 27: Software Engineering College 2 - ETL and databases
Page 28: Software Engineering College 2 - ETL and databases

Over API’s

Page 29: Software Engineering College 2 - ETL and databases

De basisgedachte achter API’s

Page 30: Software Engineering College 2 - ETL and databases

De volgende statements kun je uitvoeren

Page 31: Software Engineering College 2 - ETL and databases

Twitter doorzoeken met API#Create your own appication key at https://dev.twitter.com/appsconsumer_key = "EZRy5JzOH2QQmVAe9B4j2w";consumer_secret = "OIDC4MdfZJ82nbwpZfoUO4WOLTYjoRhpHRAWj6JMec";

#Use basic authsecret <- openssl::base64_encode(paste(consumer_key, consumer_secret, sep = ":"));req <- httr::POST("https://api.twitter.com/oauth2/token", httr::add_headers( "Authorization" = paste("Basic", secret), "Content-Type" = "application/x-www-form-urlencoded;charset=UTF-8" ), body = "grant_type=client_credentials");

#Extract the access tokentoken <- paste("Bearer", content(req)$access_token)

#Actual API callurl <- "https://api.twitter.com/1.1/statuses/user_timeline.json?count=10&screen_name=Rbloggers"req <- httr::GET(url, add_headers(Authorization = token))json <- httr::content(req, as = "text")tweets <- fromJSON(json)substring(tweets$text, 1, 100)

Page 32: Software Engineering College 2 - ETL and databases
Page 33: Software Engineering College 2 - ETL and databases

Of met een package # Install and Activate Packagesinstall.packages("twitteR", "RCurl", "RJSONIO", "stringr")library(twitteR)library(RCurl)library(RJSONIO)library(stringr) # Declare Twitter API Credentialsapi_key <- "API KEY" # From dev.twitter.comapi_secret <- "API SECRET" # From dev.twitter.comtoken <- "TOKEN" # From dev.twitter.comtoken_secret <- "TOKEN SECRET" # From dev.twitter.com # Create Twitter Connectionsetup_twitter_oauth(api_key, api_secret, token, token_secret) # Run Twitter Search. Format is searchTwitter("Search Terms", n=100, lang="en", geocode="lat,lng", also accepts since and until). tweets <- searchTwitter("Obamacare OR ACA OR 'Affordable Care Act' OR #ACA", n=100, lang="en", since="2014-08-20") # Transform tweets list into a data frametweets.df <- twListToDF(tweets) # Use the searchTwitter function to only get tweets within 50 miles of Los Angelestweets_geolocated <- searchTwitter("Obamacare OR ACA OR 'Affordable Care Act' OR #ACA", n=100, lang="en", geocode='34.04993,-118.24084,50mi', since="2014-08-20")tweets_geoolocated.df <- twListToDF(tweets_geolocated)

Page 34: Software Engineering College 2 - ETL and databases

Google doorzoeken met APIlibrary(gtrendsR)user <- "<Google account email>"psw <- "<Google account password>"gconnect(usr, psw) lang_trend <- gtrends(c("data is", "data are"), res="week")plot(lang_trend)

Page 35: Software Engineering College 2 - ETL and databases
Page 36: Software Engineering College 2 - ETL and databases

Opdracht voor komende vrijdag

Page 37: Software Engineering College 2 - ETL and databases

Opdracht presentatie college vrijdag

Doel: Open data gebruiken voor analyses

Opdracht:Onderzoek of het weer invloed heeft op het eten van ijsjes (en dit delen op sociale media)

Stappen1. Verplicht zoek een nieuwe groep die bestaat uit studenten van verschillende opleidingen. We willen geen eenzijdige teams meer.2. Registreer je voor een Twitter, Google en facebook API key3. Download met R code de KNMI data 4. Zoek de frequentie van relevante ijsjes termen op diverse media5. Maak relevante plots en een lopend verhaal over de relatie tussen het weer ijsconsumptie

Page 38: Software Engineering College 2 - ETL and databases

Loops een relevante ijsjes termen

Maak een lijst relevante termen

maak eerst alle code voor 1 term

Maak vervolgens een loop voor de collectie van termen (lapply)

Maak een dataframe waarin je in de loop steeds een rbind doet van het resultaat aan een totaal dataframe waarover je de analyse doet

Page 39: Software Engineering College 2 - ETL and databases

Typische Tentamenvragen• Leg uit wat het verschil is tussen ELT en ELT? Welk effect op

rekencapaciteit heeft dit?• Welke fases kent het ELT proces en wat gebeurt er in deze

fases?• Geef 3 voorbeelden van messy data en leg uit hoe je met R deze

weer tidy kun maken. • Leg uit wat een API is • Waarom is JSON wel en geen gestructureerd file type?

Page 40: Software Engineering College 2 - ETL and databases

https://www.youtube.com/watch?v=jyju2P-7hPA&list=PLAwxTw4SYaPm4R6j_wzVOCV9fJaiQDYx4

LECTURE

Volgende keer gaan te praten over datawarehouses. Ik verwacht dat je dan hfst 8 (was deze week) en 11 en 12 gelezen hebt. Blended learning tips• Zoek relevante flipjes over DHW en OLAP en ETL

Page 41: Software Engineering College 2 - ETL and databases

Links

Datasets https://vincentarelbundock.github.io/Rdatasets/datasets.html https://github.com/caesar0301/awesome-public-datasets#data-challengeshttps://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html

instructieshttps://www.r-bloggers.com/playing-with-twitter-data/ https://www.r-bloggers.com/how-to-create-a-twitter-sentiment-analysis-using-r-and-shiny/ https://www.r-bloggers.com/accessing-apis-from-r-and-a-little-r-programming/ https://theodi.org/blog/how-to-use-r-to-access-data-on-the-web http://bogdanrau.com/blog/collecting-tweets-using-r-and-the-twitter-search-api/ http://www.ryanpraski.com/google-search-console-api-r-guide-to-get-started/ https://bigdataenthusiast.wordpress.com/2016/03/19/mining-facebook-data-using-r-facebook-api/