Mainframe Fine Tuning - Fabio Massimo Ottaviani
Mainframe Fine Tuning
Fabio Massimo Ottaviani, EPV Technologies
NRB Mainframe Day 2015
Disclaimer, copyright & trademarks
Disclaimer: THE INFORMATION CONTAINED IN THIS PRESENTATION HAS NOT BEEN SUBMITTED TO ANY FORMAL REVIEW AND IS DISTRIBUTED ON AN “AS IS” BASIS WITHOUT ANY WARRANTY EITHER EXPRESS OR IMPLIED. THE USE OF THIS INFORMATION OR THE IMPLEMENTATION OF ANY OF THESE TECHNIQUES IS A USER RESPONSIBILITY AND DEPENDS ON THE USER’S ABILITY TO EVALUATE AND INTEGRATE THEM INTO THE USER’S OPERATIONAL ENVIRONMENT. WHILE EACH ITEM MAY HAVE BEEN REVIEWED FOR ACCURACY IN A SPECIFIC SITUATION, THERE IS NO GUARANTEE THAT THE SAME OR SIMILAR RESULTS WILL BE OBTAINED ELSEWHERE. USERS ATTEMPTING TO ADAPT THESE TECHNIQUES TO THEIR OWN ENVIRONMENTS DO SO AT THEIR OWN RISK.
Copyright Notice: © EPV Technologies. All rights reserved.
Trademarks: All the trademarks mentioned here belong to their respective companies.
Introduction
Reducing mainframe costs while improving application performance is still one of the most important goals of companies running z/OS applications
In many situations the needed actions require both a technical analysis and a management decision
In this presentation, starting from real-life examples, we will focus on the most common tuning opportunities we have found at many sites
Agenda
1. Who’s Using My CPU?
2. The Best I/O is no I/O
3. Large Memory Pages
4. WLC Checks for Managers
Who’s Using My CPU?
This is an example of the abnormal behaviour of a monitoring tool
It normally uses few MIPS, but for some reason on Saturday morning it started to loop, using almost a full CPU
The customer's technical team restarted the STC, which worked; in the meantime they asked the ISV for a correction
Two heavy TSO users in the peak hours
The customer created a Type 3 WLM Resource Group with a maximum limit of 30%, including the ALLTSO service class
A management decision may be needed
AVG CPU seconds per Execution (columns 8 to 17 are hours of the day)
APPLID DATE TRANNAME FREQ 8 9 10 11 12 13 14 15 16 17
CICSP1 15/12/2014 TRX7 147.268 0,058 0,067 0,059 0,059 0,063 0,055 0,060 0,064 0,065 0,058
CICSP1 16/12/2014 TRX7 148.083 0,062 0,062 0,059 0,057 0,061 0,052 0,051 0,058 0,058 0,052
CICSP1 17/12/2014 TRX7 130.336 0,061 0,062 0,057 0,056 0,059 0,052 0,051 0,059 0,059 0,047
CICSP1 18/12/2014 TRX7 129.313 0,061 0,063 0,058 0,057 0,059 0,051 0,055 0,059 0,060 0,052
CICSP1 19/12/2014 TRX7 134.382 0,062 0,062 0,057 0,064 0,063 0,057 0,056 0,063 0,062 0,053

CPU seconds
APPLID DATE TRANNAME FREQ 8 9 10 11 12 13 14 15 16 17
CICSP1 15/12/2014 TRX7 147.268 76 1.502 1.634 1.098 997 480 460 759 797 531
CICSP1 16/12/2014 TRX7 148.083 89 1.892 1.558 778 599 658 528 605 678 435
CICSP1 17/12/2014 TRX7 130.336 78 1.494 1.492 766 539 373 341 580 716 327
CICSP1 18/12/2014 TRX7 129.313 87 1.387 1.421 942 567 333 322 650 601 408
CICSP1 19/12/2014 TRX7 134.382 78 1.763 1.555 746 699 376 311 549 724 355

MIPS
APPLID DATE TRANNAME FREQ 8 9 10 11 12 13 14 15 16 17
CICSP1 15/12/2014 TRX7 147.268 21 415 451 303 275 133 127 210 220 147
CICSP1 16/12/2014 TRX7 148.083 25 523 430 215 165 182 146 167 187 120
CICSP1 17/12/2014 TRX7 130.336 22 413 412 212 149 103 94 160 198 90
CICSP1 18/12/2014 TRX7 129.313 24 383 393 260 157 92 89 180 166 113
CICSP1 19/12/2014 TRX7 134.382 22 487 430 206 193 104 86 152 200 98
Application tuning requires a joint effort between technical and development teams
Most of the time, a management decision and commitment are needed
The Best I/O is no I/O
Accessing data in memory provides better performance and less CPU usage
Many Data In Memory possibilities are available in z/OS; most of them have been available for many years
Because of current disk performance, most sites don’t care about the number of I/Os they do
To understand whether the system I/O load is excessive, we suggest using the IOC index (calculated by dividing the AVERAGE DISK I/O RATE by the AVERAGE MIPS USED)
Values higher than 3 should be investigated
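The IOC check described above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample values for SYSX and SYSY are hypothetical, and the I/O rate is assumed to be expressed in I/Os per second.

```python
def ioc_index(avg_disk_io_rate: float, avg_mips_used: float) -> float:
    """IOC index = average disk I/O rate divided by average MIPS used."""
    if avg_mips_used <= 0:
        raise ValueError("average MIPS used must be positive")
    return avg_disk_io_rate / avg_mips_used


def needs_investigation(ioc: float, threshold: float = 3.0) -> bool:
    """Values higher than 3 should be investigated (threshold from the slide)."""
    return ioc > threshold


# Hypothetical systems: (average disk I/O rate, average MIPS used)
samples = {"SYSX": (12000.0, 5000.0), "SYSY": (21000.0, 6000.0)}

for name, (io_rate, mips) in samples.items():
    ioc = ioc_index(io_rate, mips)
    status = "investigate" if needs_investigation(ioc) else "ok"
    print(f"{name}: IOC = {ioc:.2f} ({status})")
```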
[Chart: I/O rate - MIPS ratio by week, from 2014-W49 to 2015-W09, for systems PRDA and PRDB; vertical axis from 0 to 4,50]
Most common reasons for excessive I/Os:
• Library not included in LLA/VLF or not frozen
HOUR SSID VOLSER DEVNR HPAV UCBS IORATE DS ALLOC %WRITE
8 309 IMS10A 1947 Y 2,1 686 4 0,0
9 309 IMS10A 1947 Y 1,4 1.148 4 0,0
10 309 IMS10A 1947 Y 1,5 1.184 4 0,0
11 309 IMS10A 1947 Y 1,6 1.332 4 0,0
12 309 IMS10A 1947 Y 1,2 873 4 0,0
13 309 IMS10A 1947 Y 1,1 603 4 0,0
14 309 IMS10A 1947 Y 1,3 649 4 0,0
15 309 IMS10A 1947 Y 1,3 1.026 4 0,0
16 309 IMS10A 1947 Y 1,1 622 4 0,0
17 309 IMS10A 1947 Y 1 463 4 0,0
8 412 IMS20A 122D Y 3,1 1.099 4 0,0
9 412 IMS20A 122D Y 4,3 1.623 4 0,0
10 412 IMS20A 122D Y 4,4 1.783 4 0,0
11 412 IMS20A 122D Y 4,4 1.901 4 0,0
12 412 IMS20A 122D Y 4,2 1.306 4 0,0
13 412 IMS20A 122D Y 3,1 985 4 0,0
14 412 IMS20A 122D Y 3,2 1.041 4 0,0
15 412 IMS20A 122D Y 4,2 1.628 4 0,0
16 412 IMS20A 122D Y 3,1 882 4 0,0
17 412 IMS20A 122D Y 2 656 4 0,0
Most common reasons for excessive I/Os:
• Library not included in LLA/VLF or not frozen
• Small DB2 Buffer Pools
HOUR SSID VOLSER DEVNR HPAV UCBS IORATE DS ALLOC %WRITE
8 325 DB1111 9D0C Y 9,3 14.696 160 0,0
9 325 DB1111 9D0C Y 11,9 14.379 125 0,0
10 325 DB1111 9D0C Y 11,5 13.852 136 0,0
11 325 DB1111 9D0C Y 15 16.619 126 0,0
12 325 DB1111 9D0C Y 9,7 11.784 166 0,0
13 325 DB1111 9D0C Y 7,2 9.323 220 0,0
14 325 DB1111 9D0C Y 13,2 11.294 200 0,0
15 325 DB1111 9D0C Y 11,7 15.884 203 0,0
16 325 DB1111 9D0C Y 5,8 7.324 225 0,0
17 325 DB1111 9D0C Y 3,3 3.622 197 0,1
Most common reasons for excessive I/Os:
• Library not included in LLA/VLF or not frozen
• Small DB2 Buffer Pools
• Bad access paths
• Bad SQL
• ...
How much CPU does an I/O cost?
• Our study (some years ago) estimated 1 MIPS for every 50 I/Os per second for directory reads
  1000 I/Os per second = 1000 / 50 = 20 MIPS
• A recent IBM study (Feb 2015) estimated 35 CPU microseconds (on a 2827-712) per DB2 synchronous I/O
  1000 I/Os per second = 0,035 * 14166 / 12 = 41 MIPS
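The two estimates above can be reproduced with a short Python sketch. It reads 14166 as the total MIPS rating of the 2827-712 and 12 as its number of CPs; that reading is an assumption taken from the slide arithmetic, and the function names are hypothetical.

```python
def mips_from_rule_of_thumb(io_per_second: float, io_per_mips: float = 50.0) -> float:
    """Older estimate: 1 MIPS for every 50 I/Os per second (directory reads)."""
    return io_per_second / io_per_mips


def mips_from_cpu_time(io_per_second: float, cpu_us_per_io: float,
                       machine_mips: float, n_cps: int) -> float:
    """Estimate based on CPU microseconds per I/O.

    CPU seconds consumed per elapsed second, multiplied by the MIPS of one CP
    (assumed here to be machine_mips / n_cps).
    """
    cpu_seconds_per_second = io_per_second * cpu_us_per_io / 1_000_000
    mips_per_cp = machine_mips / n_cps
    return cpu_seconds_per_second * mips_per_cp


old_estimate = mips_from_rule_of_thumb(1000)
new_estimate = mips_from_cpu_time(1000, cpu_us_per_io=35,
                                  machine_mips=14166, n_cps=12)
print(f"rule of thumb: {old_estimate:.0f} MIPS")
print(f"35 us per I/O: {new_estimate:.0f} MIPS")
```

At 1000 I/Os per second the second formula gives 0,035 CPU seconds per second, which matches the 0,035 factor on the slide.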
Exploiting Large Pages
Virtual memory above 2 GB can only be allocated by using memory objects
A memory object is a contiguous range of virtual addresses that is allocated in units of megabytes on a megabyte boundary
Memory objects can be backed by 4 KB, 1 MB and 2 GB pages (2 GB pages are available since zEC12)
1 MB and 2 GB pages are called large memory pages
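To give a feeling for the numbers involved, the following Python sketch counts how many pages of each size are needed to back a memory object. It is plain arithmetic, not a z/OS interface; the function names and the 4 GB example size are hypothetical.

```python
KB, MB, GB = 1024, 1024 ** 2, 1024 ** 3
PAGE_SIZES = {"4 KB": 4 * KB, "1 MB": MB, "2 GB": 2 * GB}


def memory_object_size(requested_bytes: int) -> int:
    """Round a request up to whole megabytes (memory objects are allocated in MB units)."""
    return -(-requested_bytes // MB) * MB


def pages_needed(object_bytes: int, page_bytes: int) -> int:
    """Number of pages of the given size needed to back the whole object."""
    return -(-object_bytes // page_bytes)


obj = memory_object_size(4 * GB)  # hypothetical 4 GB memory object
for label, size in PAGE_SIZES.items():
    print(f"{label} pages: {pages_needed(obj, size)}")
```

Fewer, larger pages mean fewer address translations to keep in the TLB, which is where the performance benefit of large pages comes from.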
64-bit addressing (from “ABCs of z/OS System Programming - Volume 1”)
In addition to Segment and Page tables:
• Region 3 tables to map 2048 segment tables (up to 4 TB)
• Region 2 tables to map 2048 Region 3 tables (up to 8 PB)
• Region 1 tables to map 2048 Region 2 tables (up to 16 EB)
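The sizes quoted in the list above follow from multiplying by 2048 at each table level. The Python sketch below redoes that arithmetic; it assumes a 1 MB segment (256 pages of 4 KB) and 2048 entries per segment table, which are not stated on the slide, and all names are hypothetical.

```python
PAGE_BYTES = 4 * 1024
PAGES_PER_SEGMENT = 256                           # assumption: 256 x 4 KB = 1 MB segment
SEGMENT_BYTES = PAGE_BYTES * PAGES_PER_SEGMENT
ENTRIES_PER_TABLE = 2048                          # each table maps 2048 lower-level units

TABLES = ["segment table", "region 3 table", "region 2 table", "region 1 table"]


def human(n_bytes: int) -> str:
    """Format a power-of-two byte count with binary units."""
    units = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]
    i = 0
    while n_bytes >= 1024 and i < len(units) - 1:
        n_bytes //= 1024
        i += 1
    return f"{n_bytes} {units[i]}"


reach = {}
span = SEGMENT_BYTES
for table in TABLES:
    span *= ENTRIES_PER_TABLE   # one table maps 2048 units of the level below
    reach[table] = span

for table, size in reach.items():
    print(f"one {table} maps up to {human(size)}")
```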
[Chart: %CPU cycles due to TLB1 miss (axis from 0,0% to 10,0%) and CPU cycles/TLB1 miss (axis from 0 to 35), daily from 06MAY13 to 10MAY13]
As a general rule, large pages may provide performance value to long-running, memory access-intensive applications
First large memory pages exploiters:
• the z/OS nucleus (since z/OS 1.12)
• DB2 buffer pools (since V10) when the PGFIX=YES parameter is specified
• JVM can use large memory pages (both for code-cache and heap) by specifying the -Xlp option; more recent JVM versions will automatically use large memory pages if they are available
• ADABAS
Additional exploiters:
• DB2 executable code (since V11)
• IMS CQS (since V12)
• Various IMS pools (since V13)
• IMS OLDS (since V13)
• System Logger (since z/OS 1.13)
• USS
WLC Checks for Managers
Customers have the primary responsibility for preventing uncontrolled loops, operator errors, or unwanted utilization spikes. However, IBM understands that, occasionally, situations that could not be prevented (especially situations related to disaster recovery) might cause exceptional utilization values. In these situations, IBM does not normally expect customers to pay for the increased utilization associated with the unusual situation. Use your best judgement to determine if an unusual situation has occurred. IBM does not publish a list of unusual situations because, by their nature, they will be unpredictable.
From the “Using the Sub-Capacity Reporting Tool” manual.
Not a “beautiful” day?
• Machine is a 2097-717 rated at 1,329 MSUs
• Report refers to February 2012
• 4-hour rolling average monthly peak is 1,309 MSUs
• It happened on a Sunday
• Note the big difference with the second peak value (354 MSUs)
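As a reminder of how a 4-hour rolling average peak behaves, here is a simplified Python sketch. It works on hourly MSU values, which is a simplification of what the sub-capacity reporting actually does, and the sample data are hypothetical (they are not the customer's figures).

```python
def rolling_4h_average(hourly_msu):
    """4-hour rolling average for every hour that has 4 hours of history."""
    window = 4
    return [
        sum(hourly_msu[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(hourly_msu))
    ]


# Hypothetical hourly MSU consumption with a three-hour spike
hourly = [300, 320, 310, 330, 900, 1300, 1320, 1310, 400, 350]

averages = rolling_4h_average(hourly)
peak = max(averages)
print("4-hour rolling averages:", averages)
print("peak:", peak)
```

A spike only reaches its full height in the rolling average if it lasts about four hours, which is why long recovery activities or loops are the typical cause of abnormal monthly peaks.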
Bad news
At this customer site Saturday and Sunday are not business days, so such a high value on a Sunday has to be considered abnormal
In this case it was caused by a long recovery activity needed to fix a data corruption issue following the migration to new storage processors, which had happened on the previous day
(un)Happy Hour
• Machine is a 2827-711 rated at 1.593 MSUs
• Report refers to December 2014
• 4-hour rolling average monthly peak is 1.017 MSUs
• It happened on a Friday
• The difference with the second peak value is 97 MSUs

DATE DAY TYPE MODEL MSU USED
19/12/2014 Fri 2827 711 1.593 1.017
03/12/2014 Wed 2827 711 1.593 914
04/12/2014 Thu 2827 711 1.593 866
15/12/2014 Mon 2827 711 1.593 836
30/12/2014 Tue 2827 711 1.593 827
29/12/2014 Mon 2827 711 1.593 824
16/12/2014 Tue 2827 711 1.593 824
18/12/2014 Thu 2827 711 1.593 823
23/12/2014 Tue 2827 711 1.593 809
17/12/2014 Wed 2827 711 1.593 782
24/12/2014 Wed 2827 711 1.593 774
02/12/2014 Tue 2827 711 1.593 738
22/12/2014 Mon 2827 711 1.593 728
05/12/2014 Fri 2827 711 1.593 722
31/12/2014 Wed 2827 711 1.593 702
19/12/2014 Fri 2827 711 1.593 621
01/01/2015 Thu 2827 711 1.593 584
06/12/2014 Sat 2827 711 1.593 574
20/12/2014 Sat 2827 711 1.593 572
25/12/2014 Thu 2827 711 1.593 532
13/12/2014 Sat 2827 711 1.593 261
22/12/2014 Mon 2827 711 1.593 257
28/12/2014 Sun 2827 711 1.593 218
21/12/2014 Sun 2827 711 1.593 213
Looking at the different systems’ contributions, it was clear that the peak was due to something running inside the SYS2 system
Our customer asked the technical team for a deeper analysis

SYSTEM 12 13 14 15 16 17 18 19 20 21 22 23
SYS1 96 103 120 130 130 125 106 87 75 69 56 21
SYS2 699 720 746 538 549 594 736 878 898 867 746 580
SYS3 4 4 3 4 5 3 4 4 4 4 3 3
SYS4 44 43 38 35 38 43 49 48 40 39 30 23
TOTAL 843 870 907 707 722 765 895 1017 1017 979 835 627
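The same reading of the report can be automated. The Python sketch below uses the per-system values from the table above, recomputes the totals, finds the peak hour and ranks the systems by their share of it; the variable names are hypothetical.

```python
hours = list(range(12, 24))

# MSUs per system and hour, copied from the report above
msu = {
    "SYS1": [96, 103, 120, 130, 130, 125, 106, 87, 75, 69, 56, 21],
    "SYS2": [699, 720, 746, 538, 549, 594, 736, 878, 898, 867, 746, 580],
    "SYS3": [4, 4, 3, 4, 5, 3, 4, 4, 4, 4, 3, 3],
    "SYS4": [44, 43, 38, 35, 38, 43, 49, 48, 40, 39, 30, 23],
}

# Column totals, one per hour
totals = [sum(column) for column in zip(*msu.values())]

# First hour at which the total reaches its maximum
peak_index = totals.index(max(totals))
peak_hour = hours[peak_index]

# Share of the peak attributable to each system
shares = {name: values[peak_index] / totals[peak_index] for name, values in msu.items()}
top_system = max(shares, key=shares.get)

print(f"peak of {totals[peak_index]} MSUs at hour {peak_hour}")
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name}: {share:.1%}")
```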
The late afternoon peak was caused by a TSO user running into a loop
As you can see in the report below, TSO001 used almost all the MSUs of 1 CP continuously for about 5 hours

WKL ADDRESS SPACE SRVCLASS MEAN 12 13 14 15 16 17 18 19 20 21 22 23
(hours with no activity are blank in the report, so the values in each row are listed in order and are not aligned to the hour columns)
TSO TSO001 TSO 71 97 138 143 142 141 142 47
JOB BATCH001 BATCHHI 31 4,8 56,3
JOB BATCH002 BATCHHI 25 27,3 49,2 8,6 14,5
JOB BATCH005 BATCHHI 24 31,3 38,8 0,5
JOB BATCH006 BATCHHI 23 29,9 38,8 0,5
JOB BATCH008 BATCHHI 22 8,3 18,8 28,2 49,4 29 19,1 22,7 1,3
DB2 DB2DIST DDFDB2 22 29,5 22,7 33,4 52,4 63 30,7 6,6 5,6 3,3 8,7 3,8 1,2
The system you don’t expect
• The ZNET workload used to be very stable
• Something happened on October 24th
• It was a Monday!
• The first idea was to check for maintenance activities performed over the weekend

DATE DAY MSU SYSA TST1 TST2 ZNET TOT
16/10/2011 Sun 1.139 395 5 5 18 423
17/10/2011 Mon 1.139 886 7 7 43 942
18/10/2011 Tue 1.139 896 8 7 43 954
19/10/2011 Wed 1.139 869 9 8 43 928
20/10/2011 Thu 1.139 851 8 7 45 910
21/10/2011 Fri 1.139 796 7 7 41 850
22/10/2011 Sat 1.139 684 5 5 24 718
23/10/2011 Sun 1.139 376 5 5 16 402
24/10/2011 Mon 1.139 863 7 7 79 955
25/10/2011 Tue 1.139 891 9 7 78 985
26/10/2011 Wed 1.139 900 10 8 78 996
27/10/2011 Thu 1.139 892 8 8 79 987
28/10/2011 Fri 1.139 842 7 7 75 931
29/10/2011 Sat 1.139 698 5 5 40 748
30/10/2011 Sun 1.139 385 5 5 38 433
31/10/2011 Mon 1.139 979 7 7 84 1077
01/11/2011 Tue 1.139 988 10 8 86 1092
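The step in the ZNET column can be quantified with a simple before/after comparison. The Python sketch below uses the ZNET values from the table above and compares weekday averages before and from October 24th; taking weekdays only and using a plain mean are choices made for this illustration, not the method used in the original analysis.

```python
from datetime import date
from statistics import mean

# (date, ZNET MSUs) copied from the table above
znet = [
    (date(2011, 10, 16), 18), (date(2011, 10, 17), 43), (date(2011, 10, 18), 43),
    (date(2011, 10, 19), 43), (date(2011, 10, 20), 45), (date(2011, 10, 21), 41),
    (date(2011, 10, 22), 24), (date(2011, 10, 23), 16), (date(2011, 10, 24), 79),
    (date(2011, 10, 25), 78), (date(2011, 10, 26), 78), (date(2011, 10, 27), 79),
    (date(2011, 10, 28), 75), (date(2011, 10, 29), 40), (date(2011, 10, 30), 38),
    (date(2011, 10, 31), 84), (date(2011, 11, 1), 86),
]

CHANGE_DATE = date(2011, 10, 24)


def weekday_mean(rows):
    """Average MSUs over Monday-Friday rows only."""
    return mean(value for day, value in rows if day.weekday() < 5)


before = weekday_mean([row for row in znet if row[0] < CHANGE_DATE])
after = weekday_mean([row for row in znet if row[0] >= CHANGE_DATE])

print(f"weekday average before: {before:.1f} MSUs")
print(f"weekday average after:  {after:.1f} MSUs")
print(f"increase:               {after - before:.1f} MSUs")
```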
A more detailed ZNET workload analysis showed a corresponding CPU increase in the session manager address space
The new version of the session manager caused this big increase (about 40 MSUs)
In this case most of these MSUs were recovered thanks to some PTFs
Being able to measure and report this issue gave the customer the possibility of discussing the October and November monthly bills with IBM in order to reduce them
Could we save more money with zIIP?

DATE CURR NO IIPCP
2014-10 1267 1192
2014-09 1218 1092
2014-08 1182 1076
2014-07 1206 1146
2014-06 1200 1140
2014-05 1194 1134
2014-04 1188 1129
2014-03 1152 1094
2014-02 1134 1077
2014-01 1128 1128
2013-12 1140 1140
2013-11 1110 1110
2013-10 1120 1120

• From February 2014 onwards, NO IIPCP was always substantially less than CURR
• In the October 2014 peak hour the difference is 75 MSUs
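The monthly difference between the two columns can be computed as follows. In this Python sketch NO IIPCP is read as the monthly peak that would have been measured if no zIIP-eligible work had run on general purpose CPs; that reading is an assumption, and the variable names are hypothetical.

```python
# (month, CURR, NO IIPCP) in MSUs, copied from the table above
rows = [
    ("2014-10", 1267, 1192), ("2014-09", 1218, 1092), ("2014-08", 1182, 1076),
    ("2014-07", 1206, 1146), ("2014-06", 1200, 1140), ("2014-05", 1194, 1134),
    ("2014-04", 1188, 1129), ("2014-03", 1152, 1094), ("2014-02", 1134, 1077),
    ("2014-01", 1128, 1128), ("2013-12", 1140, 1140), ("2013-11", 1110, 1110),
    ("2013-10", 1120, 1120),
]

# MSUs in the monthly peak attributed to zIIP-eligible work running on CPs (assumed meaning)
potential_saving = {month: curr - no_iipcp for month, curr, no_iipcp in rows}

for month, saving in potential_saving.items():
    print(f"{month}: {saving} MSUs")

months_without_difference = [m for m, s in potential_saving.items() if s == 0]
print("months with no difference:", months_without_difference)
```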
The importance of “importance”
IBM 2827 Model 709 - 1.350 MSU

DATE IMP 0-6 IMP 0-5 IMP 0-4 IMP 0-3 IMP 0-2 IMP 0-1 IMP 0
2015-01 1.087 954 713 628 440 387 140
2014-12 1.167 1.017 590 513 373 323 143
2014-11 1.163 1.013 655 565 351 301 146
2014-10 1.028 847 485 426 291 223 117
2014-09 1.004 841 512 461 302 251 124
2014-08 1.006 833 498 459 314 257 133
2014-07 1.044 862 410 359 283 224 129
2014-06 1.090 924 499 414 295 239 136
2014-05 1.114 952 603 523 323 260 123
2014-04 1.117 946 554 513 299 249 150
2014-03 1.128 970 548 465 302 245 182
2014-02 1.071 900 486 392 292 232 149
2014-01 1.095 899 569 525 293 234 133

• The difference between the first two columns (IMP 0-6 and IMP 0-5) is the contribution of discretionary workloads to the software bill
• At this customer site, this difference was very high in every month
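The discretionary contribution can be derived month by month from the table above. This Python sketch assumes that the two columns referred to are IMP 0-6 and IMP 0-5, with importance 6 standing for discretionary work; that mapping is an assumption, and the variable names are hypothetical.

```python
# month: (IMP 0-6, IMP 0-5) in MSUs, copied from the table above
peaks = {
    "2015-01": (1087, 954), "2014-12": (1167, 1017), "2014-11": (1163, 1013),
    "2014-10": (1028, 847), "2014-09": (1004, 841), "2014-08": (1006, 833),
    "2014-07": (1044, 862), "2014-06": (1090, 924), "2014-05": (1114, 952),
    "2014-04": (1117, 946), "2014-03": (1128, 970), "2014-02": (1071, 900),
    "2014-01": (1095, 899),
}

# Assumed meaning: MSUs added to the monthly peak by discretionary work
discretionary = {month: imp_0_6 - imp_0_5 for month, (imp_0_6, imp_0_5) in peaks.items()}

for month, msus in discretionary.items():
    share = msus / peaks[month][0]
    print(f"{month}: {msus} MSUs ({share:.0%} of the monthly peak)")
```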
Questions?