Mainframe Fine Tuning - Fabio Massimo Ottaviani


Fabio Massimo Ottaviani, EPV Technologies


NRB Mainframe Day 2015

Disclaimer, copyright & trademarks

Disclaimer: THE INFORMATION CONTAINED IN THIS PRESENTATION HAS NOT BEEN SUBMITTED TO ANY FORMAL REVIEW AND IS DISTRIBUTED ON AN “AS IS” BASIS WITHOUT ANY WARRANTY EITHER EXPRESS OR IMPLIED. THE USE OF THIS INFORMATION OR THE IMPLEMENTATION OF ANY OF THESE TECHNIQUES IS A USER RESPONSIBILITY AND DEPENDS ON THE USER’S ABILITY TO EVALUATE AND INTEGRATE THEM INTO THE USER’S OPERATIONAL ENVIRONMENT. WHILE EACH ITEM MAY HAVE BEEN REVIEWED FOR ACCURACY IN A SPECIFIC SITUATION, THERE IS NO GUARANTEE THAT THE SAME OR SIMILAR RESULTS WILL BE OBTAINED ELSEWHERE. USERS ATTEMPTING TO ADAPT THESE TECHNIQUES TO THEIR OWN ENVIRONMENTS DO SO AT THEIR OWN RISK.

Copyright Notice: © EPV Technologies. All rights reserved.

Trademarks: All the trademarks mentioned here belong to their respective companies.


Introduction

Reducing mainframe cost while improving application performance is still one of the most important goals of companies running z/OS applications

In many situations the needed actions require both a technical analysis and a management decision

In this presentation, starting from real-life examples, we will focus on the most common tuning opportunities we have found at many sites


Agenda

1. Who’s Using My CPU?
2. The Best I/O is no I/O
3. Large Memory Pages
4. WLC Checks for Managers


Who’s Using My CPU?


This is an example of the abnormal behaviour of a monitoring tool

It normally uses few MIPS, but for some reason on Saturday morning it started to loop, using almost a full CPU

The customer technical team tried restarting the STC, and it worked; in the meantime they asked the ISV for a correction


Two heavy TSO users in the peak hours

Customer created a Type 3 WLM Resource Group with a maximum limit of 30% including the ALLTSO service class

A management decision may be needed



AVG CPU seconds per Execution

APPLID  DATE        TRANNAME  FREQ     8      9      10     11     12     13     14     15     16     17
CICSP1  15/12/2014  TRX7      147.268  0,058  0,067  0,059  0,059  0,063  0,055  0,060  0,064  0,065  0,058
CICSP1  16/12/2014  TRX7      148.083  0,062  0,062  0,059  0,057  0,061  0,052  0,051  0,058  0,058  0,052
CICSP1  17/12/2014  TRX7      130.336  0,061  0,062  0,057  0,056  0,059  0,052  0,051  0,059  0,059  0,047
CICSP1  18/12/2014  TRX7      129.313  0,061  0,063  0,058  0,057  0,059  0,051  0,055  0,059  0,060  0,052
CICSP1  19/12/2014  TRX7      134.382  0,062  0,062  0,057  0,064  0,063  0,057  0,056  0,063  0,062  0,053

CPU seconds

APPLID  DATE        TRANNAME  FREQ     8      9      10     11     12     13     14     15     16     17
CICSP1  15/12/2014  TRX7      147.268  76     1.502  1.634  1.098  997    480    460    759    797    531
CICSP1  16/12/2014  TRX7      148.083  89     1.892  1.558  778    599    658    528    605    678    435
CICSP1  17/12/2014  TRX7      130.336  78     1.494  1.492  766    539    373    341    580    716    327
CICSP1  18/12/2014  TRX7      129.313  87     1.387  1.421  942    567    333    322    650    601    408
CICSP1  19/12/2014  TRX7      134.382  78     1.763  1.555  746    699    376    311    549    724    355

MIPS

APPLID  DATE        TRANNAME  FREQ     8      9      10     11     12     13     14     15     16     17
CICSP1  15/12/2014  TRX7      147.268  21     415    451    303    275    133    127    210    220    147
CICSP1  16/12/2014  TRX7      148.083  25     523    430    215    165    182    146    167    187    120
CICSP1  17/12/2014  TRX7      130.336  22     413    412    212    149    103    94     160    198    90
CICSP1  18/12/2014  TRX7      129.313  24     383    393    260    157    92     89     180    166    113
CICSP1  19/12/2014  TRX7      134.382  22     487    430    206    193    104    86     152    200    98


Application tuning requires a joint effort between technical and development teams

Most of the time, management decision and commitment are needed



The Best I/O is no I/O


Accessing data in memory provides better performance and less CPU usage

Many Data In Memory possibilities are available in z/OS, most of them for many years

Because of current disk performance, most sites don’t care about the number of I/Os they do

To understand whether the system I/O load is excessive, we suggest using the IOC index (calculated by dividing the AVERAGE DISK I/O RATE by the AVERAGE MIPS USED)

Values higher than 3 should be investigated
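The IOC index described above can be sketched in a few lines of Python. This is a minimal illustration, not a product implementation: the function name, the system names SYSX and SYSY and their figures are hypothetical; only the formula and the threshold of 3 come from the slide.

```python
IOC_THRESHOLD = 3.0  # values higher than this should be investigated


def ioc_index(avg_disk_io_rate, avg_mips_used):
    """IOC index: average disk I/O rate (I/Os per second) divided by average MIPS used."""
    if avg_mips_used <= 0:
        raise ValueError("average MIPS used must be positive")
    return avg_disk_io_rate / avg_mips_used


# Hypothetical systems and figures, for illustration only.
samples = {"SYSX": (12000.0, 5000.0), "SYSY": (21000.0, 6000.0)}

for name, (io_rate, mips) in samples.items():
    index = ioc_index(io_rate, mips)
    verdict = "investigate" if index > IOC_THRESHOLD else "ok"
    print(f"{name}: IOC = {index:.2f} ({verdict})")
```

With these made-up inputs SYSX stays below the threshold and SYSY exceeds it.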

[Chart: I/O rate - MIPS ratio by week, from 2014-W49 to 2015-W09, for the series PRDA and PRDB; vertical axis from 0 to 4,50]


Most common reasons for excessive I/Os:
• Library not included in LLA/VLF or not frozen

HOUR  SSID  VOLSER  DEVNR  HPAV  UCBS  IORATE  DS ALLOC  %WRITE
8     309   IMS10A  1947   Y     2,1   686     4         0,0
9     309   IMS10A  1947   Y     1,4   1.148   4         0,0
10    309   IMS10A  1947   Y     1,5   1.184   4         0,0
11    309   IMS10A  1947   Y     1,6   1.332   4         0,0
12    309   IMS10A  1947   Y     1,2   873     4         0,0
13    309   IMS10A  1947   Y     1,1   603     4         0,0
14    309   IMS10A  1947   Y     1,3   649     4         0,0
15    309   IMS10A  1947   Y     1,3   1.026   4         0,0
16    309   IMS10A  1947   Y     1,1   622     4         0,0
17    309   IMS10A  1947   Y     1     463     4         0,0
8     412   IMS20A  122D   Y     3,1   1.099   4         0,0
9     412   IMS20A  122D   Y     4,3   1.623   4         0,0
10    412   IMS20A  122D   Y     4,4   1.783   4         0,0
11    412   IMS20A  122D   Y     4,4   1.901   4         0,0
12    412   IMS20A  122D   Y     4,2   1.306   4         0,0
13    412   IMS20A  122D   Y     3,1   985     4         0,0
14    412   IMS20A  122D   Y     3,2   1.041   4         0,0
15    412   IMS20A  122D   Y     4,2   1.628   4         0,0
16    412   IMS20A  122D   Y     3,1   882     4         0,0
17    412   IMS20A  122D   Y     2     656     4         0,0



Most common reasons for excessive I/Os (continued):
• Small DB2 Buffer Pools

HOUR  SSID  VOLSER  DEVNR  HPAV  UCBS  IORATE  DS ALLOC  %WRITE
8     325   DB1111  9D0C   Y     9,3   14.696  160       0,0
9     325   DB1111  9D0C   Y     11,9  14.379  125       0,0
10    325   DB1111  9D0C   Y     11,5  13.852  136       0,0
11    325   DB1111  9D0C   Y     15    16.619  126       0,0
12    325   DB1111  9D0C   Y     9,7   11.784  166       0,0
13    325   DB1111  9D0C   Y     7,2   9.323   220       0,0
14    325   DB1111  9D0C   Y     13,2  11.294  200       0,0
15    325   DB1111  9D0C   Y     11,7  15.884  203       0,0
16    325   DB1111  9D0C   Y     5,8   7.324   225       0,0
17    325   DB1111  9D0C   Y     3,3   3.622   197       0,1



Most common reasons for excessive I/Os (continued):
• Bad access paths
• Bad SQL
• ...



How much CPU does an I/O cost?

Our study (some years ago) estimated 1 MIPS every 50 I/Os per second for directory reads

1000 I/Os per second = 1000 / 50 = 20 MIPS

A recent IBM study (Feb 2015) estimated 35 CPU microseconds (on a 2827-712) per DB2 synchronous I/O

1000 I/Os per second = 0,035 * 14166 / 12 = 41 MIPS
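The two estimates above can be reproduced with a short Python sketch. The function and parameter names are illustrative. Reading 14166 as the total MIPS of the 2827-712 and 12 as its number of CPs is an interpretation of the formula on the slide, not something the slide states explicitly.

```python
def mips_from_rule_of_thumb(io_per_second, io_per_mips=50.0):
    """First estimate: 1 MIPS every 50 I/Os per second (directory reads)."""
    return io_per_second / io_per_mips


def mips_from_cpu_time(io_per_second, cpu_microseconds_per_io, machine_mips, machine_cps):
    """Second estimate: CPU time per I/O converted into MIPS.

    machine_mips and machine_cps are assumed to be the total capacity and
    the number of CPs of the machine the CPU time was measured on.
    """
    # Fraction of one CP kept busy by the I/O activity.
    busy_cps = io_per_second * cpu_microseconds_per_io / 1_000_000
    # Capacity of one CP, in MIPS.
    mips_per_cp = machine_mips / machine_cps
    return busy_cps * mips_per_cp


print(mips_from_rule_of_thumb(1000))                    # 20.0
print(round(mips_from_cpu_time(1000, 35, 14166, 12)))   # 41
```

Both functions scale linearly with the I/O rate, so halving the number of I/Os halves the estimated CPU cost.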



Exploiting Large Pages


Virtual memory above 2 GB can only be allocated by using memory objects

A memory object is a contiguous range of virtual addresses that is allocated in units of megabytes on a megabyte boundary

Memory objects can be backed by 4K, 1MB and 2GB pages (2GB pages are available since zEC12)

1MB and 2GB pages are called large memory pages


From “ABCs of z/OS System Programming - Volume 1”

64 bit addressing

In addition to Segment and Page tables:
• Region 3 tables to map 2048 segment tables (up to 4 TB)
• Region 2 tables to map 2048 Region 3 tables (up to 8 PB)
• Region 1 tables to map 2048 Region 2 tables (up to 16 EB)
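The limits quoted above follow from simple arithmetic, sketched below in Python. The sketch assumes 1 MB segments and 2048 entries in a segment table (so that one segment table maps 2 GB); these two values are not on the slide and are stated here as assumptions. The helper name is illustrative.

```python
ENTRIES_PER_TABLE = 2048       # entries per table, as on the slide for region tables
SEGMENT_SIZE = 1 << 20         # assumption: one segment is 1 MB

UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]


def human(size_in_bytes):
    """Format a power-of-two size with binary units (1 KB = 1024 B)."""
    index = 0
    while size_in_bytes >= 1024 and index < len(UNITS) - 1:
        size_in_bytes //= 1024
        index += 1
    return f"{size_in_bytes} {UNITS[index]}"


coverage = {}
span = SEGMENT_SIZE
for level in ["Segment table", "Region 3 table", "Region 2 table", "Region 1 table"]:
    # Each table level maps 2048 entries of the level below it.
    span *= ENTRIES_PER_TABLE
    coverage[level] = span

for level, size in coverage.items():
    print(f"{level}: up to {human(size)}")
```

Each level multiplies the addressable range by 2048, which is why three region table levels are enough to go from 2 GB to the full 64-bit range.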


[Chart: %CPU cycles due to TLB1 miss (axis from 0,0% to 10,0%) and CPU cycles/TLB1 miss (axis from 0 to 35), from 06MAY13 to 10MAY13]


As a general rule, large pages may provide performance value to long-running, memory access-intensive applications

First large memory pages exploiters:
• the z/OS nucleus (since z/OS 1.12)
• DB2 buffer pools (since V10) when the PGFIX=YES parameter is specified
• JVM can use large memory pages (both for code-cache and heap) by specifying the -Xlp option; more recent JVM versions will automatically use large memory pages if they are available
• ADABAS

Additional exploiters:
• DB2 executable code (since V11)
• IMS CQS (since V12)
• Various IMS pools (since V13)
• IMS OLDS (since V13)
• System Logger (since z/OS 1.13)
• USS


WLC Checks for Managers


Customers have the primary responsibility for preventing uncontrolled loops, operator errors, or unwanted utilization spikes. However, IBM understands that, occasionally, situations that could not be prevented (especially situations related to disaster recovery) might cause exceptional utilization values. In these situations, IBM does not normally expect customers to pay for the increased utilization associated with the unusual situation. Use your best judgement to determine if an unusual situation has occurred. IBM does not publish a list of unusual situations because, by their nature, they will be unpredictable.

From the “Using the Sub-Capacity Reporting Tool” manual.


Not a “beautiful” day ?

• Machine is a 2097-717 valued 1,329 MSUs

• Report refers to February 2012

• 4-hour rolling average monthly peak is 1,309 MSUs

• It happened on Sunday

• Note the big difference with the second peak value (354 MSUs)
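A simplified way to look at a 4-hour rolling average peak, and at its distance from the second peak, is sketched below in Python. The hourly values are hypothetical and the function names are illustrative; the way SCRT actually derives these values from SMF data is not reproduced here. Only the two peak values (1309 and 354 MSUs) come from the slide.

```python
def rolling_4h_average(hourly_msu):
    """4-hour rolling average, one value per hour once four hours are available."""
    window = 4
    return [
        sum(hourly_msu[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(hourly_msu))
    ]


def gap_to_second_peak(peaks):
    """Difference between the highest and the second highest peak."""
    first, second = sorted(peaks, reverse=True)[:2]
    return first - second


# Hypothetical hourly MSU values, for illustration only.
hourly = [100, 120, 400, 800, 900, 850, 300, 150]
averages = rolling_4h_average(hourly)
print("rolling averages:", averages)
print("peak:", max(averages))

# The two peak values quoted on the slide.
print("gap to second peak:", gap_to_second_peak([1309, 354]), "MSUs")
```

A large gap between the first and the second peak is a simple hint that the monthly peak may come from an unusual situation rather than from normal business activity.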


Bad news

At this customer site Saturday and Sunday are not business days, so such a high value on Sunday has to be considered abnormal

In this case it was caused by a long recovery activity needed to fix a data corruption issue following the migration to new storage processors, which happened on the previous day


(un)Happy Hour

• Machine is a 2827-711 valued 1.593 MSUs

• Report refers to December 2014

• 4-hour rolling average monthly peak is 1.017 MSUs

• It happened on Friday

• The difference with the second peak value is 97 MSUs

DATE        DAY  TYPE  MODEL  MSU    USED
19/12/2014  Fri  2827  711    1.593  1.017
03/12/2014  Wed  2827  711    1.593  914
04/12/2014  Thu  2827  711    1.593  866
15/12/2014  Mon  2827  711    1.593  836
30/12/2014  Tue  2827  711    1.593  827
29/12/2014  Mon  2827  711    1.593  824
16/12/2014  Tue  2827  711    1.593  824
18/12/2014  Thu  2827  711    1.593  823
23/12/2014  Tue  2827  711    1.593  809
17/12/2014  Wed  2827  711    1.593  782
24/12/2014  Wed  2827  711    1.593  774
02/12/2014  Tue  2827  711    1.593  738
22/12/2014  Mon  2827  711    1.593  728
05/12/2014  Fri  2827  711    1.593  722
31/12/2014  Wed  2827  711    1.593  702
19/12/2014  Fri  2827  711    1.593  621
01/01/2015  Thu  2827  711    1.593  584
06/12/2014  Sat  2827  711    1.593  574
20/12/2014  Sat  2827  711    1.593  572
25/12/2014  Thu  2827  711    1.593  532
13/12/2014  Sat  2827  711    1.593  261
22/12/2014  Mon  2827  711    1.593  257
28/12/2014  Sun  2827  711    1.593  218
21/12/2014  Sun  2827  711    1.593  213


Looking at the different systems’ contributions, it was clear that the peak was due to something running inside the SYS2 system

Our customer asked the technical team for a deeper analysis

SYSTEM  12   13   14   15   16   17   18   19    20    21   22   23
SYS1    96   103  120  130  130  125  106  87    75    69   56   21
SYS2    699  720  746  538  549  594  736  878   898   867  746  580
SYS3    4    4    3    4    5    3    4    4     4     4    3    3
SYS4    44   43   38   35   38   43   49   48    40    39   30   23
TOTAL   843  870  907  707  722  765  895  1017  1017  979  835  627
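The kind of check described above, finding which system drives the peak, can be sketched in Python using the hourly values from the table. The variable names are illustrative; the figures are those of the table.

```python
hours = list(range(12, 24))

# Hourly MSU values per system, copied from the table above.
msu = {
    "SYS1": [96, 103, 120, 130, 130, 125, 106, 87, 75, 69, 56, 21],
    "SYS2": [699, 720, 746, 538, 549, 594, 736, 878, 898, 867, 746, 580],
    "SYS3": [4, 4, 3, 4, 5, 3, 4, 4, 4, 4, 3, 3],
    "SYS4": [44, 43, 38, 35, 38, 43, 49, 48, 40, 39, 30, 23],
}

# Total per hour, summing the systems column by column.
totals = [sum(column) for column in zip(*msu.values())]

# First hour at which the total reaches its maximum.
peak_index = totals.index(max(totals))
peak_hour = hours[peak_index]

# Share of each system in the peak hour.
shares = {name: values[peak_index] / totals[peak_index] for name, values in msu.items()}
top_system = max(shares, key=shares.get)

print(f"peak of {totals[peak_index]} MSUs at hour {peak_hour}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print("largest contributor:", top_system)
```

The computed totals match the TOTAL row of the table, and SYS2 accounts for most of the peak hour.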


The late afternoon peak was caused by a TSO user running into a loop

As the report shows, TSO001 used almost all the MSUs of 1 CP continuously for about 5 hours

WKL  ADDRESS SPACE  SRVCLASS  MEAN  HOURLY VALUES (hours 12 to 23, in order; hours with no value omitted)
TSO  TSO001         TSO       71    97 138 143 142 141 142 47
JOB  BATCH001       BATCHHI   31    4,8 56,3
JOB  BATCH002       BATCHHI   25    27,3 49,2 8,6 14,5
JOB  BATCH005       BATCHHI   24    31,3 38,8 0,5
JOB  BATCH006       BATCHHI   23    29,9 38,8 0,5
JOB  BATCH008       BATCHHI   22    8,3 18,8 28,2 49,4 29 19,1 22,7 1,3
DB2  DB2DIST        DDFDB2    22    29,5 22,7 33,4 52,4 63 30,7 6,6 5,6 3,3 8,7 3,8 1,2


The system you don’t expect

• ZNET workload used to be very stable

• Something happened on October 24th

• It was a Monday!

• First idea was to check for maintenance activities performed over the weekend

DATE        DAY  MSU    SYSA  TST1  TST2  ZNET  TOT
16/10/2011  Sun  1.139  395   5     5     18    423
17/10/2011  Mon  1.139  886   7     7     43    942
18/10/2011  Tue  1.139  896   8     7     43    954
19/10/2011  Wed  1.139  869   9     8     43    928
20/10/2011  Thu  1.139  851   8     7     45    910
21/10/2011  Fri  1.139  796   7     7     41    850
22/10/2011  Sat  1.139  684   5     5     24    718
23/10/2011  Sun  1.139  376   5     5     16    402
24/10/2011  Mon  1.139  863   7     7     79    955
25/10/2011  Tue  1.139  891   9     7     78    985
26/10/2011  Wed  1.139  900   10    8     78    996
27/10/2011  Thu  1.139  892   8     8     79    987
28/10/2011  Fri  1.139  842   7     7     75    931
29/10/2011  Sat  1.139  698   5     5     40    748
30/10/2011  Sun  1.139  385   5     5     38    433
31/10/2011  Mon  1.139  979   7     7     84    1077
01/11/2011  Tue  1.139  988   10    8     86    1092
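A change like the one in the ZNET column can be spotted by comparing the working days of two consecutive weeks. The Python sketch below uses the ZNET values from the table; the alert threshold of 25% is a hypothetical choice made only for illustration.

```python
from statistics import mean

# ZNET values (MSUs) on working days, copied from the table above.
week_before = [43, 43, 43, 45, 41]   # 17/10/2011 to 21/10/2011
week_after = [79, 78, 78, 79, 75]    # 24/10/2011 to 28/10/2011

baseline = mean(week_before)
current = mean(week_after)
increase = current - baseline

# Hypothetical alert threshold: flag a week whose average is 25% above the previous one.
ALERT_RATIO = 1.25
shifted = current > baseline * ALERT_RATIO

print(f"baseline {baseline:.1f} MSUs, current {current:.1f} MSUs, increase {increase:.1f} MSUs")
print("workload level changed" if shifted else "workload stable")
```

Comparing like with like (working days against working days) avoids false alarms caused by the normal weekend drop visible in the table.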


A more detailed ZNET workload analysis showed a corresponding CPU increase in the session manager address space

The new version of the session manager caused this big increase (about 40 MSUs).

In this case most of these MSUs were recovered thanks to some PTFs

Being able to measure and report this issue gave the customer the possibility of discussing the October and November monthly bills with IBM in order to reduce them


Could we save more money with zIIP ?

DATE     CURR  NO IIPCP
2014-10  1267  1192
2014-09  1218  1092
2014-08  1182  1076
2014-07  1206  1146
2014-06  1200  1140
2014-05  1194  1134
2014-04  1188  1129
2014-03  1152  1094
2014-02  1134  1077
2014-01  1128  1128
2013-12  1140  1140
2013-11  1110  1110
2013-10  1120  1120

• IIPCP was always substantially less than CURR

• In October 2014 peak hour the difference is 75 MSUs
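The monthly difference between the CURR and NO IIPCP columns can be computed with a few lines of Python. The figures are those of the table; reading the difference as the MSUs that could be saved if the zIIP-eligible work did not run on CPs is an interpretation of the column names, not something the slide spells out.

```python
# (month, CURR, NO IIPCP), copied from the table above.
rows = [
    ("2014-10", 1267, 1192), ("2014-09", 1218, 1092), ("2014-08", 1182, 1076),
    ("2014-07", 1206, 1146), ("2014-06", 1200, 1140), ("2014-05", 1194, 1134),
    ("2014-04", 1188, 1129), ("2014-03", 1152, 1094), ("2014-02", 1134, 1077),
    ("2014-01", 1128, 1128), ("2013-12", 1140, 1140), ("2013-11", 1110, 1110),
    ("2013-10", 1120, 1120),
]

# MSUs of difference between the two columns, month by month.
differences = {month: curr - no_iipcp for month, curr, no_iipcp in rows}

for month, diff in differences.items():
    print(f"{month}: {diff} MSUs")

largest = max(differences, key=differences.get)
print("largest difference:", largest, differences[largest], "MSUs")
```

For October 2014 the result is 75 MSUs, the value quoted on the slide.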

The importance of “importance”

IBM 2827 Model 709 - 1.350 MSU

DATE     IMP 0-6  IMP 0-5  IMP 0-4  IMP 0-3  IMP 0-2  IMP 0-1  IMP 0
2015-01  1.087    954      713      628      440      387      140
2014-12  1.167    1.017    590      513      373      323      143
2014-11  1.163    1.013    655      565      351      301      146
2014-10  1.028    847      485      426      291      223      117
2014-09  1.004    841      512      461      302      251      124
2014-08  1.006    833      498      459      314      257      133
2014-07  1.044    862      410      359      283      224      129
2014-06  1.090    924      499      414      295      239      136
2014-05  1.114    952      603      523      323      260      123
2014-04  1.117    946      554      513      299      249      150
2014-03  1.128    970      548      465      302      245      182
2014-02  1.071    900      486      392      292      232      149
2014-01  1.095    899      569      525      293      234      133

• The difference between the first two columns (IMP 0-6 and IMP 0-5) is the contribution of discretionary workloads to the software bill

• At this customer site, this difference was very high in every month
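The contribution of discretionary workloads can be quantified with a short Python sketch. It assumes that the two columns the slide refers to are IMP 0-6 and IMP 0-5, with importance 6 standing for discretionary work; that reading is an assumption. The figures are those of the table, written without thousands separators.

```python
# (month, IMP 0-6, IMP 0-5), copied from the table above.
rows = [
    ("2015-01", 1087, 954), ("2014-12", 1167, 1017), ("2014-11", 1163, 1013),
    ("2014-10", 1028, 847), ("2014-09", 1004, 841), ("2014-08", 1006, 833),
    ("2014-07", 1044, 862), ("2014-06", 1090, 924), ("2014-05", 1114, 952),
    ("2014-04", 1117, 946), ("2014-03", 1128, 970), ("2014-02", 1071, 900),
    ("2014-01", 1095, 899),
]

# Assumed discretionary contribution: IMP 0-6 minus IMP 0-5.
discretionary = {month: imp06 - imp05 for month, imp06, imp05 in rows}

# Same contribution expressed as a share of the IMP 0-6 value.
share = {month: (imp06 - imp05) / imp06 for month, imp06, imp05 in rows}

for month in discretionary:
    print(f"{month}: {discretionary[month]} MSUs ({share[month]:.1%})")
```

Under this assumption the difference never falls below 133 MSUs in the thirteen months shown.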

Questions?
