Prof. Elabassi 11/17/2016
Quantitative and Qualitative Data Analysis and Data Mining for Decision Making
Prepared by Prof. Dr. Abdelhamid Elabassi, Dean of the Institute
November 2016
Institute of Statistical Studies and Research, Cairo University
The Decision Making Process
Begin Here: Identify the Problem
Data → Information → Knowledge → Decision
Descriptive Statistics, Probability, Computers
Experience, Theory, Literature, Inferential Statistics, Computers
Inferential Statistics
• Making statements about a population by examining sample results
Sample statistics (known) → Inference → Population parameters (unknown, but can be estimated from sample evidence)
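As a minimal sketch of inference from a sample, the snippet below estimates a population mean with an approximate 95% confidence interval. The sample values and the use of the normal critical value z = 1.96 are assumptions made for illustration; with so few observations a t critical value would normally be preferred.

```python
import math
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Return (mean, lower, upper) for an approximate 95% interval.

    z = 1.96 is the normal critical value (an assumption of this sketch).
    """
    n = len(sample)
    mean = statistics.mean(sample)      # sample statistic (known)
    s = statistics.stdev(sample)        # sample standard deviation
    margin = z * s / math.sqrt(n)       # critical value times standard error
    return mean, mean - margin, mean + margin

# Hypothetical sample drawn from a larger population
sample = [52, 48, 55, 61, 47, 50, 58, 49, 53, 57]
mean, low, high = mean_confidence_interval(sample)
print(f"sample mean = {mean:.2f}, interval = ({low:.2f}, {high:.2f})")
```

The interval is a statement about the unknown population mean, built only from sample evidence.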
Sampling Techniques
Nonstatistical Sampling: Convenience, Judgment
Statistical Sampling: Simple Random, Systematic, Stratified, Cluster
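Three of the statistical sampling techniques listed on this slide can be sketched with the Python standard library. The population of 100 numbered units, the two strata, and the function names are hypothetical and only serve the illustration.

```python
import random

def simple_random_sample(population, n, seed=0):
    """Every unit has the same chance of selection."""
    return random.Random(seed).sample(population, n)

def systematic_sample(population, n, seed=0):
    """Take every k-th unit after a random start, with k = len(population) // n."""
    k = len(population) // n
    start = random.Random(seed).randrange(k)
    return population[start::k][:n]

def stratified_sample(strata, n_per_stratum, seed=0):
    """Draw a simple random sample of fixed size from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(units, n_per_stratum) for name, units in strata.items()}

population = list(range(1, 101))                     # hypothetical units 1..100
strata = {"low": population[:50], "high": population[50:]}

srs = simple_random_sample(population, 10)
systematic = systematic_sample(population, 10)
stratified = stratified_sample(strata, 5)
print(srs, systematic, stratified, sep="\n")
```

Cluster sampling would instead select whole groups at random and keep every unit inside the chosen groups.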
Data Types
Data
Qualitative (Categorical)
  Examples: Marital Status, Political Party, Eye Color (defined categories)
Quantitative (Numerical)
  Discrete. Examples: Number of Children, Defects per hour (counted items)
  Continuous. Examples: Weight, Voltage (measured characteristics)
Levels of Measurement and Measurement Scales
Ratio Data: differences between measurements, true zero exists
  EXAMPLES: Height, Age, Weekly Food Spending
Interval Data: differences between measurements but no true zero
  EXAMPLES: Temperature in Fahrenheit, Standardized exam score
Ordinal Data: ordered categories (rankings, order, or scaling)
  EXAMPLES: Service quality rating, Standard & Poor's bond rating, Student letter grades
Nominal Data: categories (no ordering or direction)
  EXAMPLES: Marital status, Type of car owned
Pie Chart Example
Current Investment Portfolio
Investment Type   Amount (in thousands $)   Percentage
Stocks            46.5                      42.27
Bonds             32.0                      29.09
CD                15.5                      14.09
Savings           16.0                      14.55
Total             110                       100
(Variables are Qualitative)
[Pie chart: Stocks 42%, Bonds 29%, Savings 15%, CD 14%. Percentages are rounded to the nearest percent.]
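The percentages in the portfolio table follow from dividing each amount by the total. A short check, using the amounts from the slide (variable names are illustrative only):

```python
# Amounts in thousands of dollars, taken from the table on this slide
amounts = {"Stocks": 46.5, "Bonds": 32.0, "CD": 15.5, "Savings": 16.0}

total = sum(amounts.values())
# Share of each investment type, to two decimals as in the table
percentages = {name: round(100 * value / total, 2) for name, value in amounts.items()}
# Pie-chart labels, rounded to the nearest percent
labels = {name: round(100 * value / total) for name, value in amounts.items()}

print(total)        # 110.0
print(percentages)  # {'Stocks': 42.27, 'Bonds': 29.09, 'CD': 14.09, 'Savings': 14.55}
print(labels)       # {'Stocks': 42, 'Bonds': 29, 'CD': 14, 'Savings': 15}
```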
Bar Chart Example 2
Newspaper readership per week
[Bar chart: x-axis "Number of days newspaper is read per week" (0 to 7); y-axis "Frequency" (0 to 50)]
Side-by-Side Chart Example
• Sales by quarter for three sales territories:
        1st Qtr   2nd Qtr   3rd Qtr   4th Qtr
East    20.4      27.4      59        20.4
West    30.6      38.6      34.6      31.6
North   45.9      46.9      45        43.9
[Side-by-side bar chart: East, West and North for each quarter; y-axis 0 to 60]
Summary Measures
Describing Data Numerically
Center and Location: Mean, Median, Mode, Weighted Mean
Other Measures of Location: Percentiles, Quartiles
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
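The summary measures named on this slide can be computed with Python's `statistics` module. The data values and weights below are hypothetical; the variance and standard deviation are the sample versions (divisor n − 1), and `statistics.quantiles` is used with its default method, which is only one of several quartile conventions.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample

# Center and location
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Weighted mean: hypothetical scores with weights 1 and 3
scores, weights = [80, 90], [1, 3]
weighted_mean = sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# Other measures of location: quartiles
q1, q2, q3 = statistics.quantiles(data, n=4)

# Variation
value_range = max(data) - min(data)
iqr = q3 - q1
variance = statistics.variance(data)        # sample variance
std_dev = statistics.stdev(data)            # sample standard deviation
coeff_variation = 100 * std_dev / mean      # in percent

print(mean, median, mode, weighted_mean)
print(value_range, iqr, round(variance, 3), round(std_dev, 3), round(coeff_variation, 1))
```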
Measures of Variation
Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
Measures of variation give information on the spread or variability of the data values.
[Figure: two distributions with the same center, different variation]
[Figures: Frequency polygon; Histogram; Frequency curve; Normal distribution]
A Classification of Univariate Techniques
Univariate Techniques
Metric Data
  One Sample: t test, Z test
  Two or More Samples
    Independent: Two-Group t test, Z test, One-Way ANOVA
    Related: Paired t test
Non-numeric Data
  One Sample: Frequency, Chi-Square, K-S, Runs, Binomial
  Two or More Samples
    Independent: Chi-Square, Mann-Whitney, Median, K-S, K-W ANOVA
    Related: Sign, Wilcoxon, McNemar, Chi-Square
Correlation coefficients according to the measurement level of the variables
Scatter Plots of Data with Various Correlation Coefficients
[Six scatter plots of Y against X: r = -1, r = -.6, r = 0, r = +.3, r = +1, and r = 0]
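Simple Linear Regression Model (continued)
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
Intercept = β0
Slope = β1
εi = random error for this Xi value
[Figure: plot of Y against X marking, at a given Xi, the observed value of Y, the predicted value of Y on the regression line, and the random error εi between them]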
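A minimal sketch of how the intercept and slope of the simple linear regression model might be estimated by ordinary least squares. The slide does not state the estimation method, so least squares is an assumption here, and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for Y = b0 + b1*X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx                 # estimated slope
    b0 = y_bar - b1 * x_bar        # estimated intercept
    return b0, b1

# Hypothetical observations
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
# Residual = observed value of Y minus predicted value of Y for each Xi
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(round(b0, 2), round(b1, 2))   # 0.05 1.99
```

The residuals play the role of the random error term: each is the vertical distance between an observed point and the fitted line.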
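The Multiple Regression Model
Idea: Examine the linear relationship between 1 dependent (Y) & 2 or more independent variables (Xi)
Multiple Regression Model with k Independent Variables:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$
(β0 = Y-intercept; β1, ..., βk = population slopes; εi = random error)
Estimated equation:
$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki}$
(Ŷi = estimated (or predicted) value of Y; b0 = estimated intercept; b1, ..., bk = estimated slope coefficients)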
Logistic Regression
• Used when the dependent variable Y is binary (i.e., Y takes on only two values)
• Examples
– Customer prefers Brand A or Brand B
– Employee chooses to work full‐time or part‐time
– Loan is delinquent or is not delinquent
– Person voted in last election or did not
• Logistic regression allows you to predict the probability of a particular categorical response
Logistic Regression (continued)
• Logistic regression is based on the odds ratio, which represents the probability of a success compared with the probability of failure
• The logistic regression model is based on the natural log of this odds ratio
$\text{Odds ratio} = \dfrac{\text{probability of success}}{1 - \text{probability of success}}$
Logistic Regression (continued)
Logistic Regression Model:
$\ln(\text{odds ratio}) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i$
where k = number of independent variables in the model
εi = random error in observation i
Logistic Regression Equation:
$\ln(\text{estimated odds ratio}) = b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_k X_{ki}$
Estimated Odds Ratio and Probability of Success
• Once you have the logistic regression equation, compute the estimated odds ratio:
$\text{Estimated odds ratio} = e^{\ln(\text{estimated odds ratio})}$
• The estimated probability of success is
$\text{Estimated probability of success} = \dfrac{\text{estimated odds ratio}}{1 + \text{estimated odds ratio}}$
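The two steps on this slide (exponentiate the estimated log odds, then divide the odds by one plus the odds) can be sketched as follows. The coefficients b0 = -2.0 and b1 = 0.5 and the input X1 = 4 are hypothetical, chosen so that the log odds come out to zero.

```python
import math

def estimated_probability(coefficients, x_values):
    """Turn a fitted logistic regression equation into a probability of success.

    coefficients: [b0, b1, ..., bk]; x_values: [X1, ..., Xk].
    """
    b0, *slopes = coefficients
    log_odds = b0 + sum(b * x for b, x in zip(slopes, x_values))  # ln(estimated odds ratio)
    odds = math.exp(log_odds)                                      # estimated odds ratio
    return odds / (1 + odds)                                       # estimated probability

# Hypothetical fitted equation: ln(odds) = -2.0 + 0.5 * X1
p = estimated_probability([-2.0, 0.5], [4])
print(p)   # 0.5, since the log odds are 0 and the odds are 1
```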
Nonlinear Relationships
• The relationship between the dependent variable and an independent variable may not be linear
• Can review the scatter diagram to check for non-linear relationships
• Example: Quadratic model
  – The second independent variable is the square of the first variable
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \varepsilon_i$
Quadratic Regression Model
Model form:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \varepsilon_i$
• where:
β0 = Y intercept
β1 = regression coefficient for linear effect of X on Y
β2 = regression coefficient for quadratic effect on Y
εi = random error in Y for observation i
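One way to estimate the quadratic model is to treat X and X² as two independent variables and solve the least-squares normal equations. The sketch below does this in plain Python; the estimation method and the data (generated exactly from b0 = 1, b1 = 2, b2 = 3, with no error term) are assumptions for illustration.

```python
def fit_quadratic(xs, ys):
    """Least-squares estimates (b0, b1, b2) for Y = b0 + b1*X + b2*X^2."""
    # Each design-matrix row is [1, X, X^2]
    rows = [[1.0, x, x * x] for x in xs]
    # Augmented normal equations: (X'X | X'y), a 3x4 matrix
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * y for r, y in zip(rows, ys))]
        for i in range(3)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return tuple(a[i][3] / a[i][i] for i in range(3))

xs = [-2, -1, 0, 1, 2, 3]
ys = [1 + 2 * x + 3 * x * x for x in xs]   # hypothetical, error-free data

b0, b1, b2 = fit_quadratic(xs, ys)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

With real data the estimates would not reproduce the generating coefficients exactly, because the random error term is present.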
Quadratic Regression Model
Quadratic models may be considered when the scatter diagram takes on one of the following shapes:
[Four scatter diagrams of Y against X1: (β1 < 0, β2 > 0); (β1 > 0, β2 > 0); (β1 < 0, β2 < 0); (β1 > 0, β2 < 0)]
β1 = the coefficient of the linear term
β2 = the coefficient of the squared term
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \varepsilon_i$
The Importance of Forecasting
Governments forecast unemployment, interest rates, and expected revenues from income taxes for policy purposes
Marketing executives forecast demand, sales, and consumer preferences for strategic planning
College administrators forecast enrollments to plan for facilities and for faculty recruitment
Retail stores forecast demand to control inventory levels, hire employees and provide training
Time-Series Forecasting
Common Approaches to Forecasting
Qualitative forecasting methods
  Used when historical data are unavailable
  Considered highly subjective and judgmental
Quantitative forecasting methods
  Time Series: use past data to predict future values
  Causal
Time-Series Components
Trend Component: overall, persistent, long-term movement
Seasonal Component: regular periodic fluctuations, usually within a 12-month period
Cyclical Component: repeating swings or movements over more than one year
Irregular Component: erratic or residual fluctuations
Multiplicative Time-Series Model with a Seasonal Component
Used primarily for forecasting
Allows consideration of seasonal variation
$Y_i = T_i \times S_i \times C_i \times I_i$
where Ti = Trend value at time i
Si = Seasonal value at time i
Ci = Cyclical value at time i
Ii = Irregular (random) value at time i
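A small numeric illustration of the multiplicative model, in which the observed value is the product of the four components. All component values below are hypothetical; seasonal, cyclical and irregular values are expressed as ratios around 1.

```python
def multiplicative_value(trend, seasonal, cyclical, irregular):
    """Y = T * S * C * I for one time period."""
    return trend * seasonal * cyclical * irregular

# Hypothetical period: trend of 200 units, a season 20% above average,
# a cycle 5% below trend, and no irregular disturbance
trend, seasonal, cyclical, irregular = 200.0, 1.2, 0.95, 1.0

y = multiplicative_value(trend, seasonal, cyclical, irregular)
# Dividing by the seasonal value removes the seasonal effect
deseasonalized = y / seasonal

print(round(y, 2), round(deseasonalized, 2))   # 228.0 190.0
```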
Sales vs. Smoothed Sales
Fluctuations have been smoothed
NOTE: the smoothed value in this case is generally a little low, since the trend is upward sloping and the weighting factor is only .2
[Line chart: Sales and Smoothed series against Time Period (1 to 10); y-axis "Sales" (0 to 60)]
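The smoothing shown in this chart is consistent with simple exponential smoothing using a weighting factor of .2. The sketch below assumes the usual recursion (first smoothed value equals the first observation, then E_i = W·Y_i + (1 − W)·E_{i−1}); the sales figures are hypothetical, not read from the chart.

```python
def exponential_smoothing(series, w):
    """Return the exponentially smoothed series for weighting factor w (0 < w < 1)."""
    smoothed = [series[0]]                     # E_1 = Y_1
    for y in series[1:]:
        smoothed.append(w * y + (1 - w) * smoothed[-1])
    return smoothed

# Hypothetical sales for time periods 1 to 10, with an upward trend
sales = [23, 40, 25, 27, 32, 48, 33, 37, 37, 50]
smoothed = exponential_smoothing(sales, 0.2)

for period, (y, e) in enumerate(zip(sales, smoothed), start=1):
    print(period, y, round(e, 2))
```

Because only 20% of each new observation enters the smoothed value, the smoothed series lags behind an upward trend, which is the effect described in the note on the slide.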
Trend-Based Forecasting (continued)
Forecast for time period 6:
$\hat{Y} = 21.905 + 9.5714(6) = 79.33$
[Line chart: Sales trend; x-axis "Year" (0 to 6); y-axis "sales" (0 to 80)]
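The forecast on this slide is the fitted trend line evaluated at time period 6. A short check of the arithmetic, using the intercept and slope shown on the slide (the function name is illustrative):

```python
def trend_forecast(intercept, slope, period):
    """Evaluate the linear trend line at a given time period."""
    return intercept + slope * period

forecast = trend_forecast(21.905, 9.5714, 6)
print(round(forecast, 2))   # 79.33
```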
[Flowchart: choosing an inference method]
Steps: Level of Measurement → Number of Populations → Claim or Parameter → Inference
Starting question: What is the level of measurement of the data?
Level of Measurement:
  Interval or Ratio (such as heights, weights)
  Ordinal (such as data consisting of ranks)
  Nominal (data consisting of proportions or frequency counts for different categories)
Number of Populations: One Population; Two Populations; More than Two Populations
Claim or Parameter: Mean; Means (Independent, Matched Pairs); Variance; Variances; Correlation, Regression; Proportions; Frequency Counts for Categories (Multinomial (one row), Contingency Table (multiple rows, columns))
Inference: Estimating with Confidence Interval; Hypothesis Testing; Hypothesis Testing with Large Sample; Estimating Proportion with Confidence Interval
Data in a single model (equation) must be at the same level
Use multilevel analysis (Multilevel)
Observe the quality criteria of the model
Introduction: What is SPSS?
• Originally an acronym of Statistical Package for the Social Sciences; it was later said to stand for Statistical Product and Service Solutions.
• With release 17 the product was rebranded PASW (Predictive Analytics SoftWare) Statistics; since release 19 it has been named IBM SPSS Statistics.
• One of the most popular statistical packages; it can perform highly complex data manipulation and analysis with simple instructions.
The Basic Analysis
1- Frequencies
2- Descriptives
3- Explore
4- Crosstabs
Test Hypothesis
1- One sample
2- Two Samples (Related, Independent)
3- Anova (F test)
4- Ancova
Modelling
1- Correlation & Regression
2- Logistic & Discriminant
Regression Analysis
• Click ‘Analyze,’ ‘Regression,’ then click ‘Linear’ from the main menu.
Regression Analysis
• Clicking OK gives the result
Data Mining
May be defined as follows: data mining is a collection of techniques for efficient automated discovery of previously unknown, valid, novel, useful and understandable patterns in large databases. The patterns must be actionable so they may be used in an enterprise's decision making.
Why Data Mining Now?
Data are being produced
Data are being stored in data warehouses
Computing power is more affordable
Competitive pressures are enormous
Availability of easy to use data mining software
Cross Industry Standard Process for Data Mining (CRISP-DM)
Iterative CRISP‐DM process shown in outer circle
Most significant dependencies between phases shown
Next phase depends on results from preceding phase
Returning to earlier phase possible before moving forward
Data Mining Tasks
• Description
• Estimation
• Classification
• Prediction
• Clustering
• Affinity Analysis
Supervised (directed): Estimation, Classification, Prediction
Unsupervised (undirected): Clustering, Affinity Analysis
Difference between estimation and classification: the target variable is numeric (estimation) or categorical (classification)
Difference between prediction and classification or estimation: prediction concerns future values
Matching Data Mining Tasks to Data Mining Algorithms
Estimation: Multiple Linear Regression, Neural Networks
Classification: Decision Trees, Logistic Regression, Neural Networks, k-NN
Prediction: Estimation & Classification for future values
Clustering: k-means, Kohonen Self Organizing Maps
Affinity Analysis: Association Analysis, sometimes referred to as Market Basket Analysis
Data Mining: Confluence of Multiple Disciplines
[Diagram: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Distributed Computing, Visualization]
Knowledge Discovery Process
[Diagram: Raw Data → (Integration) → DATA Warehouse → Target Data → Transformed Data → Patterns and Rules → (Interpretation & Evaluation) → Knowledge; side label "Understanding"]
Knowledge Discovery Process
– Data mining: the core of the knowledge discovery process.
[Diagram: Databases → Data Cleaning, Data Integration → Preprocessed Data → Selection, Data transformations → Task-relevant Data → Data Mining → Knowledge Interpretation]
Data Mining Models and Tasks
KDD Process Ex: Shuttle Data
• Selection:
  – Select data (which missions etc) to use
• Preprocessing:
  – Remove Spikes
• Transformation:
  – DFT, DWT, PAA etc
• Data Mining:
  – Look for Rules…
• Interpretation/Evaluation:
  – Show rules to domain experts
• Potential User Applications:
  – Prediction of Failures
[Two time-series plots: x-axis 0 to 1000; y-axis 0 to 1]
Data Mining Development
• Similarity Measures • Hierarchical Clustering • IR Systems • Imprecise Queries • Textual Data • Web Search Engines
• Bayes Theorem • Regression Analysis • EM Algorithm • K-Means Clustering • Time Series Analysis
• Neural Networks • Decision Tree Algorithms
• Algorithm Design Techniques • Algorithm Analysis • Data Structures
• Relational Data Model • SQL • Association Rule Algorithms • Data Warehousing • Scalability Techniques
A Numeric Example
• Feed forward restricts network flow to single direction
• Fully connected
• Flow does not loop or cycle
• Network composed of two or more layers
[Network diagram: Input Layer (Node 1, Node 2, Node 3 with inputs x1, x2, x3, plus constant input x0); Hidden Layer (Node A, Node B); Output Layer (Node Z). Weights W0A, W1A, W2A, W3A feed Node A; W0B, W1B, W2B, W3B feed Node B; W0Z, WAZ, WBZ feed Node Z]
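A forward pass through a network of this shape (three inputs plus a constant x0, two hidden nodes A and B, one output node Z) might look like the sketch below. The slide gives no activation function or weight values, so the sigmoid activation, x0 = 1, and every number here are assumptions for illustration.

```python
import math

def sigmoid(value):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))

def forward_pass(inputs, hidden_weights, output_weights):
    """Propagate inputs through one hidden layer to a single output node.

    hidden_weights: one list [W0, W1, W2, W3] per hidden node (A, then B).
    output_weights: [W0Z, WAZ, WBZ].
    """
    x = [1.0] + list(inputs)                       # x0 = 1 is the constant input
    hidden = [sigmoid(sum(w * xi for w, xi in zip(weights, x)))
              for weights in hidden_weights]       # outputs of Node A and Node B
    z_inputs = [1.0] + hidden                      # constant input for Node Z
    return sigmoid(sum(w * h for w, h in zip(output_weights, z_inputs)))

# Hypothetical weights and inputs
hidden_weights = [
    [0.5, 0.6, 0.8, 0.6],   # W0A, W1A, W2A, W3A
    [0.7, 0.9, 0.8, 0.4],   # W0B, W1B, W2B, W3B
]
output_weights = [0.5, 0.9, 0.9]   # W0Z, WAZ, WBZ

output = forward_pass([0.4, 0.2, 0.7], hidden_weights, output_weights)
print(round(output, 4))
```

Information moves only from the input layer to the hidden layer to the output layer, which is what "feed forward" refers to.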
Linear neurons
• These are simple but computationally limited
  – If we can make them learn we may get insight into more complicated neurons.
$y = b + \sum_i x_i w_i$
where y = output, b = bias, i = index over input connections, x_i = i-th input, w_i = weight on i-th input
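The linear neuron's output is the bias plus the weighted sum of its inputs. A direct translation of that formula, with hypothetical inputs, weights and bias:

```python
def linear_neuron(inputs, weights, bias):
    """Output y = b + sum over i of x_i * w_i."""
    return bias + sum(x * w for x, w in zip(inputs, weights))

# Hypothetical values: three input connections
inputs = [2.0, 5.0, 3.0]
weights = [0.5, -1.0, 2.0]
bias = 1.0

y = linear_neuron(inputs, weights, bias)
print(y)   # 3.0, i.e. 1.0 + (1.0 - 5.0 + 6.0)
```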
Given the data in the attached file MV.SAV, review information about it (n, #VAR, …), then:
1. Randomly generate an age variable, N~(30,10), for the data.
2. Replace cases whose age is below 20 with the mean.
3. Compute the basic measures for all variables.
4. Weight the data, treating each case as representing 4 observations (replicate the data 4 times).
5. Compute the basic measures for all variables after weighting and compare them with the unweighted results.
6. Assuming the data measure opinions and attitudes, compute the appropriate reliability and validity coefficient.
7. Test that the mean of Y (GPA) = 60, then 75.
8. It is claimed that the mean of Y differs by, and is related to, work (WOR) and grade (GRD).
9. There is a correlation for each of work (WOR) and grade (GRD).
10. It is claimed that there is a statistically significant relationship between the means and values of X1 and X5.
11. What are the most important quantitative variables affecting Y (GPA), and what is their order of importance?
12. What are the most important quantitative and qualitative variables affecting Y (GPA), and what is their order of importance?
13. What are the most important variables affecting WOR, and what is their order of importance?
14. What are the most important variables affecting GRD, and what is their order of importance?
15. Carry out items 12 and 14 using neural networks and compare the results.
16. Place the cases and the variables into 4 clusters (groups), and merge (reduce) the variables into a smaller number.
17. Estimate the best-fitting curve for X3 and Y; sort Y in ascending order, treat it as a time series, and estimate its trend.