CS-695 NoSQL Database PostgreSQL (part 2 of 2)

  • 1/29

    Miscellanea Assignment Extensions Summary Conclusion References Backup slides

    CS-695 NoSQL Database PostgreSQL (part 2 of 2)

    Dr. Chuck Cartledge

    3 Sept. 2015

  • 2/29

    Table of contents I

    1 Miscellanea

    2 Assignment

    3 Extensions

    4 Summary

    5 Conclusion

    6 References

    7 Backup slides

  • 3/29

    Corrections and additions since last lecture.

    Be sure to look at assignment #01.

  • 4/29

    Bits and pieces

    Little things that mean a lot.

    How to “know” what the user intended

    How to measure “sameness”

    How to connect the databases one to another

  • 5/29

    What are they and why should I care?

    PostgreSQL is largely ANSI-SQL:2008 compliant

    “The nice thing about standards is that you have so many to choose from.”

    Andrew S. Tanenbaum [8]

    ANSI-SQL:2011 adds many temporal-related capabilities.

  • 6/29

    What are they and why should I care?

    PostgreSQL is open source

    “. . . is not only a powerful database system capable of running the enterprise, it is a development platform upon which to develop in-house, web, or commercial software products that require a capable DBMS.”

    PostgreSQL Staff [3]

    Programmers are tool makers (among other things). When it is possible to extend something, they will.

  • 7/29

    What are they and why should I care?

    What are extensions?

    A way to package a collection of “loose” objects into a single named entity.

    A collection is called an “extension”

    An extension may have many internal objects

    An extension is loaded via the CREATE EXTENSION command

    An extension is dropped via the DROP EXTENSION command

    A function belonging to an extension can be modified via the CREATE OR REPLACE FUNCTION command

    \dx lists installed extensions

    select * from pg_available_extensions() order by name; lists the extensions available for installation.

  • 8/29

    What are they and why should I care?

    Where can I find information about an extension?

    Like so many other things, it is in the documentation (footnote 1).

    Documentation is terse.

    A few sentences about the extension.

    A list of objects in the collection.

    Maybe an example.

    Footnote 1: http://www.postgresql.org/docs/9.3/static/contrib.html

  • 9/29

    fuzzystrmatch extension

    Overview

    “The fuzzystrmatch module provides several functions to determine similarities and distance between strings.” [5]

    soundex: converts a string to its Soundex code

    metaphone: computes a representative string

    dmetaphone: computes two “sounds like” strings

    levenshtein: computes the “edit distance” between two strings

    \dx+ lists the objects in an extension

  • 10/29

    fuzzystrmatch extension

    Levenshtein algorithm

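    Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other. [7]

    There is a source and a target word/string (s, t), of length |s| and |t| respectively.

    There is a matrix \(\mathrm{lev}_{s,t}(i, j)\), with \(0 \le i \le |s|\) and \(0 \le j \le |t|\), where:

    \[
    \mathrm{lev}_{s,t}(i, j) =
    \begin{cases}
    \max(i, j) & \text{if } \min(i, j) = 0, \\
    \min
    \begin{cases}
    \mathrm{lev}_{s,t}(i - 1, j) + 1 & \text{(deletion)} \\
    \mathrm{lev}_{s,t}(i, j - 1) + 1 & \text{(insertion)} \\
    \mathrm{lev}_{s,t}(i - 1, j - 1) + 1_{(s_i \neq t_j)} & \text{(substitution)}
    \end{cases}
    & \text{otherwise.}
    \end{cases}
    \]

    Here \(1_{(s_i \neq t_j)}\) is 1 when the i-th character of s differs from the j-th character of t, and 0 when they match.

    The Levenshtein distance is in cell \(\mathrm{lev}_{s,t}(|s|, |t|)\).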
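The recurrence above can be sketched in a few lines of Python. This is a minimal illustration of the matrix definition, not the code the fuzzystrmatch extension uses; the function name `levenshtein` and the test words are chosen here for illustration only.

```python
def levenshtein(s: str, t: str) -> int:
    """Return the edit distance between s and t using the full matrix."""
    rows, cols = len(s) + 1, len(t) + 1
    # lev[i][j] holds the distance between the first i characters of s
    # and the first j characters of t.
    lev = [[0] * cols for _ in range(rows)]

    # Base case: min(i, j) == 0, so the distance is max(i, j).
    for i in range(rows):
        lev[i][0] = i
    for j in range(cols):
        lev[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            # Substitution costs 1 only when the characters differ.
            cost = 0 if s[i - 1] == t[j - 1] else 1
            lev[i][j] = min(
                lev[i - 1][j] + 1,         # deletion
                lev[i][j - 1] + 1,         # insertion
                lev[i - 1][j - 1] + cost,  # substitution (or match)
            )

    # The answer sits in the bottom-right cell.
    return lev[len(s)][len(t)]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

The full matrix is kept for clarity; only the previous row is actually needed, so a production version could reduce memory to O(|t|).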

  • 11/29

    fuzzystrmatch extension

    levenshtein example

    SELECT *
    FROM some_table
    -- the cutoff of 2 edits is an illustrative assumption
    WHERE levenshtein(storedValue, 'userInput') <= 2;

  • 12/29

    fuzzystrmatch extension

    A different type of similarity (footnote 2)

    How similar are these sentences?

    Julie loves me more than Linda loves me

    Jane likes me more than Julie loves me

    \[
    A \cdot B = \|A\| \, \|B\| \cos\theta
    \]

    \[
    \cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}
    = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2} \times \sqrt{\sum_{i=1}^{n} (B_i)^2}}
    = \frac{A \cdot B}{\sqrt{A \cdot A} \times \sqrt{B \cdot B}}
    \tag{1}
    \]

    Math has the answer!

    Footnote 2: http://stackoverflow.com/questions/1746501

  • 13/29

    fuzzystrmatch extension

    The mechanics (part 1).

    Start with term frequency:

    A = Julie loves me more than Linda loves me

    B = Jane likes me more than Julie loves me

    The set of all words (as lower case):

    words = me, julie, loves, linda, than, more, likes, jane

  • 14/29

    fuzzystrmatch extension

    The mechanics (part 2).

    How often do the original strings use the full set of words?

    Word    A   B
    me      2   2
    julie   1   1
    loves   2   1
    linda   1   0
    than    1   1
    more    1   1
    likes   0   1
    jane    0   1

    Convert sentences to vectors.

    A = (2, 1, 2, 1, 1, 1, 0, 0)

    B = (2, 1, 1, 0, 1, 1, 1, 1)

    \[
    \cos\theta = \frac{9}{\sqrt{12} \times \sqrt{10}} \approx \frac{9}{3.46 \times 3.16} \approx 0.822
    \tag{2}
    \]

    For term-frequency vectors, the range of cos θ is 0 (no match) to 1 (perfect match).
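The term-frequency calculation above can be reproduced with a short Python sketch. It assumes words are separated by whitespace and compared in lower case, as on the slides; the helper names `term_frequencies` and `cosine_similarity` are invented here for illustration.

```python
import math
from collections import Counter


def term_frequencies(sentence: str) -> Counter:
    """Count how often each lower-cased word occurs in the sentence."""
    return Counter(sentence.lower().split())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the term-frequency vectors of a and b."""
    tf_a, tf_b = term_frequencies(a), term_frequencies(b)
    # The union of all words defines the dimensions of both vectors.
    words = sorted(set(tf_a) | set(tf_b))
    vec_a = [tf_a[w] for w in words]
    vec_b = [tf_b[w] for w in words]

    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence matches nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    a = "Julie loves me more than Linda loves me"
    b = "Jane likes me more than Julie loves me"
    print(round(cosine_similarity(a, b), 3))
```

The order of the words in the union does not matter, as long as both vectors use the same order.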

  • 15/29

    fuzzystrmatch extension

    Another example

    The input strings:

    He hates term frequency

    She loves math

    Word        A   B
    he          1   0
    hates       1   0
    term        1   0
    frequency   1   0
    she         0   1
    loves       0   1
    math        0   1

    Convert sentences to vectors.

    A = (1, 1, 1, 1, 0, 0, 0)

    B = (0, 0, 0, 0, 1, 1, 1)

    \[
    \cos\theta = \frac{0}{\sqrt{4} \times \sqrt{3}} = 0.0
    \tag{3}
    \]

    Process works well for unknown terms (i.e., great flexibility).

  • 16/29

    fuzzystrmatch extension

    Using “distance” to compute similarity

    From the term-frequency (tf) discussion, we have the idea of converting terms (tokens) to a numerical vector

    The tf vectors are created “on the fly”

    What if the union vector were known in advance?

  • 17/29

    fuzzystrmatch extension

    As records were added, their “word vector” could be computed

    union vector = (me, julie, loves, linda, than, more, likes, jane)

    input vector = Julie loves me more than Linda loves me

    word vector = (2,1,2,1,1,1,0,0)

    Each input vector now “lives” at a point on a multi-dimensional plane

    Image from [6].

    All documents would “live” on the same multi-dimensional plane.
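A minimal Python sketch of computing a word vector against a union vector that is known in advance. The name `UNION_VECTOR` and the function `word_vector` are hypothetical, and the choice to silently ignore words outside the union vector is an assumption, not something the slides specify.

```python
from collections import Counter

# The union vector from the slide: one dimension per known word.
UNION_VECTOR = ("me", "julie", "loves", "linda", "than", "more", "likes", "jane")


def word_vector(text: str, vocabulary=UNION_VECTOR) -> tuple:
    """Count occurrences of each vocabulary word in the text.

    Words that are not in the vocabulary are ignored (an assumption of
    this sketch); a fixed vocabulary cannot grow as records arrive.
    """
    counts = Counter(text.lower().split())
    return tuple(counts[word] for word in vocabulary)


if __name__ == "__main__":
    print(word_vector("Julie loves me more than Linda loves me"))
```

Because every record is measured against the same vocabulary, all the resulting vectors have the same number of dimensions and can be compared directly.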

  • 18/29

    fuzzystrmatch extension

    Now that everything “lives” on the same plane . . .

    We can compute how close they are to each other.

    Distance d(p, q) between points p and q:

    1D: \( \sqrt{(p - q)^2} \)

    2D: \( \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2} \)

    3D: \( \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + (p_3 - q_3)^2} \)

    nD: \( \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \)

    Wouldn’t it be nice if PostgreSQL could help us with this math??
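Before handing the math to the database, the n-dimensional formula can be checked with a small Python sketch. The function name `euclidean_distance` is chosen here for illustration; the same function covers the 1D, 2D, 3D and nD cases.

```python
import math


def euclidean_distance(p, q) -> float:
    """Straight-line distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared differences per dimension, then take the square root.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


if __name__ == "__main__":
    print(euclidean_distance([1], [4]))        # 1D
    print(euclidean_distance([0, 0], [3, 4]))  # 2D
```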

  • 19/29

    cube extension

    The “cube” extension (part 1 of 2)

    “This module implements a data type cube for representing multidimensional cubes.” [4]

    Adds a custom data type called cube

    A cube type expects a vector, e.g. '(0,0,0,0,0,0,0,7,0,0,0,0,0,0,0,0,0,0)'

    The vector contains n values (i.e., n dimensions)

    Values are in user units: notional, rather than real.

    Image from [2].

  • 20/29

    cube extension

    The “cube” extension (part 2 of 2)

    'Oedipus', '(2,0,0,0,0,9,0,9,9,0,0,0,0,0,0,0,0,0)'
    'Gone with the Wind', '(0,0,0,3,0,0,0,5,0,0,0,0,0,0,0,0,0,0)'
    'The 40 Year Old Virgin', '(0,0,0,5,5,0,0,0,0,0,0,0,0,0,0,0,0,0)'
    'Animal House', '(0,0,0,5,9,0,0,0,0,0,0,0,0,0,0,0,0,0)'
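To see what these 18-dimensional genre vectors buy us, the following Python sketch ranks the other three movies by their distance from a chosen one. It only mimics, outside the database, the kind of nearest-neighbour query the cube type makes possible; the `MOVIES` dictionary and the `nearest` function are invented for this illustration, and `math.dist` requires Python 3.8 or later.

```python
import math

# Genre vectors copied from the slide (18 dimensions, one per genre).
MOVIES = {
    "Oedipus": (2, 0, 0, 0, 0, 9, 0, 9, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    "Gone with the Wind": (0, 0, 0, 3, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    "The 40 Year Old Virgin": (0, 0, 0, 5, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    "Animal House": (0, 0, 0, 5, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
}


def nearest(title: str, movies=MOVIES):
    """Return (other_title, distance) pairs, closest first."""
    origin = movies[title]
    others = (
        (name, math.dist(origin, vector))
        for name, vector in movies.items()
        if name != title
    )
    return sorted(others, key=lambda pair: pair[1])


if __name__ == "__main__":
    for name, distance in nearest("Animal House"):
        print(f"{name}: {distance:.2f}")
```

The two comedies differ in only one dimension, so they land closest together, which matches the intuition that they are the most similar pair.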

  • 21/29

    postgres fdw extension

    Working with more than one database (part 1 of 2)

    The original database design was envisaged as standalone. Needs changed over time.

    Two extensions address this: dblink (pre-9.3) and postgres_fdw (9.3+).

    PostgreSQL pre-9.3 used dblink, which primarily executes a SELECT in the remote database and returns the resulting rows.

    SELECT *
    FROM table1 tb1
    LEFT JOIN (
        SELECT *
        FROM dblink('dbname=db2', 'SELECT id, code FROM table2')
            AS remote(id int, code text)
    ) AS tb2 ON tb2.id = tb1.id;  -- join column is illustrative; assumes table1 has an id column

  • 22/29

    postgres fdw extension

    Working with more than one database (part 2 of 2)

    PostgreSQL 9.3 and later uses postgres_fdw (footnote 3)

    CREATE SERVER: connect to a remote database

    CREATE USER MAPPING: authenticate to the remote database

    CREATE FOREIGN TABLE: local table connected to a remote table

    CREATE SERVER book_server
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'localhost', port '5432', dbname 'postgis_in_action');

    CREATE USER MAPPING FOR public SERVER book_server
        OPTIONS (user 'book_guest', password 'whatever');

    CREATE FOREIGN TABLE ch01.ft_restaurants
        (id integer, franchise character(3), geom geometry(Point,2163))
        SERVER book_server
        OPTIONS (schema_name 'ch01', table_name 'restaurants');

    Footnote 3: http://www.postgresql.org/docs/9.3/static/postgres-fdw.html

    http://www.postgresonline.com/journal/archives/322-Generating-Create-

  • 23/29

    Strengths and weaknesses

    Good and not so good

    Strengths

    Age: many years of active development

    Lots of language-specific drivers

    Extensibility

    Open source

    Weaknesses

    Partitionability (re. CAP Theorem)

    Data must be “neat and tidy”

  • 24/29

    Applicabilities

    Good for, and not so good for

    Good fit

    Well structured data

    Data known in advance

    Data use not known in advance

    Not so good fit

    Highly variable data

    Hierarchical or “object oriented” data

    Extremely sparse data

  • 25/29

    What have we covered?

    Reviewed assignment #01

    Covered PostgreSQL extensions

    Covered different ways to compute and measure document “sameness”

    Remember: Assignment #01 is due before next class

    Next time: CRUDy Riak

  • 26/29

    References I

    [1] Thomas Lockhart, PostgreSQL programmer's guide, Thomas Lockhart (editor), 2001.

    [2] Eric Redmond and Jim R. Wilson, Seven databases in seven weeks, Pragmatic Bookshelf, 2012.

    [3] PostgreSQL Staff, About, http://www.postgresql.org/about/, 2015.

    [4] PostgreSQL Staff, cube, http://www.postgresql.org/docs/9.3/static/cube.html, 2015.

    [5] PostgreSQL Staff, fuzzystrmatch, http://www.postgresql.org/docs/9.3/static/fuzzystrmatch.html, 2015.

    [6] WikiHow Staff, How to plot points in three dimensions, http://www.wikihow.com/Plot-Points-in-Three-Dimensions, 2015.

  • 27/29

    References II

    [7] Wikipedia Staff, Levenshtein distance, https://en.wikipedia.org/wiki/Levenshtein_distance, 2015.

    [8] Andrew S. Tanenbaum, Computer networks, Prentice Hall, 2003.

  • 28/29

    Slides

    Genres from [2]

    INSERT INTO genres (name, position) VALUES
        ('Action', 1),
        ('Adventure', 2),
        ('Animation', 3),
        ('Comedy', 4),
        ('Crime', 5),
        ('Disaster', 6),
        ('Documentary', 7),
        ('Drama', 8),
        ('Eastern', 9),
        ('Fantasy', 10),
        ('History', 11),
        ('Horror', 12),
        ('Musical', 13),
        ('Romance', 14),
        ('SciFi', 15),
        ('Sport', 16),
        ('Thriller', 17),
        ('Western', 18);

  • 29/29

    Slides

    Connection architecture

    Image from [1].
