# Introduction ne of the most important requirements of an IT organization is to maximize the uptime and availability of the network and its resources. Without having constant monitoring, the ability to look at historical data, proactive monitoring, and change management this becomes almost impossible to accomplish. Many times without a management solution you first find out about a problem when users start calling the helpdesk with complaints. This wastes time and extends the length of the outage costing companies anywhere from hundreds to hundreds of thousands of dollars an hour. In this paper we will use current monitoring system response data to demonstrate how efficient a managed WAN can be and compare that to possible scenarios that show what could happen using a manual process. The data has been derived from the WAN ticketing system provided by an outside vendor and compared to a separate internal help desk system. The midsize company network I will be using as an example has 300 switches, 100 access points, 20 routers, 3 wireless LAN controllers and 16 WAN accelerators. There are also multiple firewalls and IPS' that are not part of this paper but are logged and monitored. 90% of the devices are Cisco hardware or software running on IBM or Cisco UCS systems. From this point forward they will be referred to ABC Company. Author : East Carolina University. E-mail : thomasons09@students.ecu.edu Why have managed networks? First of all to define what is a network management tool we have rfc1470 i Nowadays networks carry more than just data and are much more critical to the wellbeing of the business , which states that a network management tool is a tool that is used for monitoring and debugging TCP/IP Internets and interconnected devices. With an unmanaged network the only method you have of knowing when there is a problem is by word of mouth; a user calls the helpdesk after they lose connectivity to their application or someone shows up at your door asking if there is an issue with the network. The helpdesk personnel then do some basic trouble shooting which takes time. After determining that there is an actual problem and having determined whose area of responsibility it probably is then passes the ticket to the networking group. The ticketing system then generates an email to the person in the group designated to assign the tickets. At this point the networking group finally knows that there exists a problem with the communications network. The average time from initial call to making the appropriate people aware of the problem is between 30 minutes to 4 hours and this is only during normal 8 to 5 work hours. If an entire segment is down the time is closer to 30 minute. The more people affected the faster the problem is passed on. After the networking group has ownership of the issue they then begin their trouble shooting procedures. This again takes additional time, which translates to lost productivity, which translates to money. ii O . There are data, voice, and video requirements placed on the network infrastructure at an ever-increasing rate. If the network goes down and carries the companies voice traffic problems are compounded since no one may be able to report any problems. So what is needed to help make the networking group more efficient and the network more reliable? At a minimum, networking groups need the ability to monitor and react to any issue that can affect The primary purpose of having a network management system is to alert on network issues and aid in the quick recovery of the network and its services. Alerts can instantly notify personnel of a problem quickly and at times with the actual issue identified so that resolution can take place quickly. Fault management increases the ability to help identify faults and problems before they lead to actual downtime. Configuration management is necessary to be able to keep track of the configurations on the devices running the network. By tracking changes and keeping historical copies of device configurations the system helps to restore devices that have had issues or need replacing. It also helps in the trouble shooting process in that you can determine when the last change was made, who made the change, and what that change was. Performance management gives the networking team the ability to track device and component utilization, network throughputs, error rates, and other statics that help track usage and identify possible problems and bottlenecks. You can use this to answer questions such as why is the network slow? Is it due to a hardware issue or is an unauthorized process over utilizing it? Two functional areas not addressed by this paper are accounting management, which involves tracking user utilization of network services and security management, which involves the protection of information through the network. While they are part of the ISO definition of network management they are not always part of most network management systems software. Individual logons for the management system addresses some of the concerns. Below are two real life examples for ABC Company where monitoring would have aided in the ability to keep the network running as efficiently as possible. In the first example demonstrating the need for network monitoring, there was an access point in a manufacturing area that saw very little usage. That access point stopped working. Users in that area saw slower performance but did not report the problem because they were still connected to a distant access point in another isle. Without proactive monitoring the performance would remain slow and if the next closest AP had an issue users would not be able to connect at all. A proactive system would have shown that the AP was down and made it possible to address the issue before it affected production. The down access point was found out about when the switch it was attached to had an issue. Another issue that network management would help with is looking at what is using a network segment and how it is affecting the network. One instance occurred when the connection between a distribution center and the main server room experienced very poor performance. By connecting to the switches and routers at that site using SSH you could see that the interfaces were being highly utilized and there were no errors being generated. The line was up and VoIP phone calls were not affected. Knowing this information you could deduce that QoS was working correctly as calls were not affected but you did not have any idea what type of traffic was on the circuit and what it was doing. Was it a denial of service attack or someone copying a very large file or virus updates being pushed to all of the computers at an inappropriate time of day? Management systems measure and keep track of metrics such as availability, uptime, latency, error rates, and other network characteristics. They give you the ability to manage very large networks from a central location iv ABC Company has several critical applications that cannot tolerate even brief drops lasting less than a few seconds. Handheld scan guns being used for warehouse management do not tolerate any downtime. The main ERP program will not handle any drop that lasts for more than a few milliseconds. Uptime is critical. A single blip in connectivity can cause the need to reprocess orders and shipments and manufacturing processes to be reentered or restarted. . Networks that carry voice traffic also need to have the ability to look at voice statics such as jitter and echo. A good management system also has the ability to create a graphical map. A big advantage of this is the ability to see where the break in the network is located. Without this you would need to go device to device until you found the end of the working segment. Trace route is an example of command that can help determine where there is a break in communication but that only works for layer 3 devices. A third party v that works with the MPLS provider monitors wide area network and WAN accelerators for ABC Company. They are responsible for monitoring the connections and performance of the WAN and its components. The third party has a system within the ABC Company network that gathers and monitors network statics. If an alert is received they review that alert with a combination of filters and human interaction. All interactions are logged in their management system. ABC Company looked at having them monitor the entire infrastructure but that cost was prohibitive and harder to justify. The performance of the monitoring company's system is compared to the internal systems within ABC # Year Company. This consists of a manual entry help desk system dependent on user interaction and an older network management system with limited abilities that monitors the LAN. It has the ability to determine if a device is completely down but not what types of issues may be occurring. Reporting data for this paper was taken from an existing WAN monitoring system that is monitored 24x7x365 by a third party system. The data points used included the following: ticket creation time, 1 st human response time, the number of human interactions per ticket, whether the local site contact or Telco was contacted, the time the site was fully functional and the total number of minutes before the ticket was closed. The network has a monitoring station that receives snmp and icmp packets to track uptime and interface errors. If there is an error then the system sends out an alert to the monitoring company where the alert is evaluated and the appropriate predefine action takes place. An example of a predefined action might be to ignore the error until 8:00 am due to the fact that the site is remote and no one is available until that time, call a local site contact, or wake someone up. An alert can signify anything that can lead to degradation in performance or availability. The management company first tries to resolve the issue if possible. If it looks like as if the problem is with a circuit the local Telco is notified and a technician is dispatched per predefined scripts. In 2012 there were 75 interactions that generated alerts that required examination. The average response time it took for an actual person to review the alert was 13.3 minutes. The average time it took to resolve the issue was 481.9 minutes. An alert did not necessarily mean that a circuit or connection was down but it did mean that performance was degraded. The percentage of interactions that required local site personal to be contacted was 21 and 35% of the alerts required a technician from the local circuit provider. Data from the first 10 tickets. Sample ticket at the end of the document. # Manual Help Desk System ? Due to the way the help desk system functions internal tickets can take well over 30 minutes before the 1 st person in the networking groups is even aware of the issue. ? Percent of human involvement taking greater than 30 minutes -95%. ? Help desk system creates tickets that are hard to research for past issues due to a lack of historical indexing. ? Keeps track of tickets but because it is not a dedicated network system it is hard to research past issues. Finding out what the root cause of an alert is required and having the issue resolved was is the first step in the process. The average man-minutes spent until the ticket was closed was 800.6 meaning that 40% of the time on ticket was spent after resolution determining what the cause was. By having the monitoring system in place for the WAN it meant that 76% of all tickets were initially addressed in less than 15 minutes. 91% were addressedin less than 30 minutes. Comparing this to an existing system within the company for the local area network shows a major difference in time to reaction. The average time before being notified using the internal helpdesk system ranged from 30 minutes to over 6 hours especially if the issue occurred after hours and the after hour answering service was not able to notify the appropriate personnel. This system depended upon user input for notifications. A completely down system was easy to detect but in instances where performance was degraded it was not as easy to determine since often the end user did not report it until it became severe. A management notification system could have seen the performance issue and notified someone so that the issue could have been addressed before any downtime had taken place. Management systems can vary quite a bit depending on the vendor and features. Retail cost for the Cisco Prime infrastructure and Collaboration software is approximately 85 thousand dollars for an average SMB. Using the data taken from the WAN monitoring service we found that out of 75 incidents there were 16 that took over 4 hours to address. ABC Company has determined that after 4 hours of downtime to manufacturing processes can cost over $100,000 per hour. If you take the best effort of the manual helpdesk system and add 30 minutes to each of those tickets you would have increased the loss by an additional $200,000. This would more than cover the cost of installing a management system for the network and the example does not even address the internal LANs in each location. Cisco Prime Infrastructurevi, being reviewed by ABC Company, is only one of many management systems available. One of the main reasons for looking at Prime is the current investment in Cisco equipment, training, and services. Some of the other more common management systems are Solarwinds vii , SpiceWorks viii , Paessler ix , Tivoli Framework x , and WhatsUp Gold xi A good management system allows for a smart notification system that analyses alerts and prevents alert overload. Is the alert identifying a critical issue or a temporary over utilization of a port?This way only the critical alerts get through so any notifications that are sent to a technician are the ones that are really important. The system needs to be able to differentiate between an end user reboot their PC and a connection to a MPLS network. The alerts not defined as critical are logged so that they can be review from the management console. . This is not an exhaustive list as there are many different vendors each with its own set of features. Prime manages wired and wireless access over local and wide area networks. It gives the user a combination of inventory, configuration management, visibility, and wireless management features. The system allows for the use of Netflow, NBAR, and Medianet agents for reporting. Prime also keeps historical data and gives the user a drill down functionality to help with trouble shooting issues. When combined with Cisco ISE -Integrated Service Engine it also give the user security and accounting functionality meeting all of the ISOs five functional areas. With budgets tight and executives wanting a positive ROI network monitoring needs to show it savings potential. According to white paper by Intermapper xii a) Savings . There are several areas of potential savings that are generated by a network monitoring system. # ? Staff/salary time savings ? Automating processes and monitoring frees staff for other tasks and projects. ? May not need to increase the number of staff personnel as the network grows. # ? Minimizing or avoiding outages ? Proactive actions when issues appear allow for action to take place before downtime occurs. # ? Reducing support calls ? If the network is up and functioning correctly there will be fewer trouble tickets created. # ? Reducing time to fix ? The sooner IT is alerted to a problem, the sooner the issue can be resolved. # ? Guaranteeing and managing SLAs ? Using the historical data retained by the management system it can be determined whether or not SLAs are being met. ? Reducing downtime ? Configuration and change management allows for quicker restore of down devices and services. Network management systems are not in expensive. They range from just sending alerts when a ping to a device fails to be able to meet all of the requirements for the 5 functional areas listed by the ISO. ? While the solution can save many hours doing tedious tasks such as device management, configuration, and changes, it still requires human interaction. To be successful a management system needs to have several metrics that can be used to justify its purchase and operation. According to a study by Solarwindsxiii one client determined the metrics to the number of man-hours/time spend each day trouble shooting issues, revenue generated, and costs saved. Each issue or ticket then was tagged to any SLA agreements. Having a network performance monitor allowed them to save over 20% in equipment replacement costs and gave them the ability to keep historical data so it became much easier to determine root cause of each issue. Another side benefit of using a management system is the reduction in human caused errors. Over 80% of critical network outages are caused by human error xiv As companies grow and expand manufacturing can expand to become a 24-hour endeavor. As production schedules expand it becomes more critical to either have staff working all 3 shifts or a system that monitors all of the networking components and can alert staff of any issues at any time of the day. Year Other functionality that can greatly reduce staff time is the ability to manage multiple devices from one central system. Using ABC Company as an example, they have 300 switches that at some point will need to be upgraded due to an increase requirement of security for functionality. Based on the average time of 30 minutes spent by a network administrator to manually upgrade each switch, it would take 15 hours to complete these tasks. This does not include the time it takes to research what the latest approved version is, can the switch handle the upgrade, downloading the software, or setting up the tftp server for distribution. Another part of the process would be to backup each configuration and save the current running configuration before even starting. Using a management system such as Prime Infrastructure xv In another example the company managing the WAN devices that consisted of Cisco routers and Riverbed accelerators were able to automate the upgrade of all 13 WAN accelerators form version 6.x to version 7.0.x. This was done in order to keep the devices up to date for support requirements. Using an automated management console every Riverbed was upgraded to the latest code using a simple script reducing downtime and labor costs. Setting up the script took approximately 1 hour, running the upgrade on 15 devices took 30 minutes for a total of 2 manhours. Manually upgrading 15 devices would take approximately 45 minutes per device resulting in a requirement of 11.25 man-hours. The ability to have process take place in a batch mode further demonstrates the usefulness and cost savings ability of management software. , the network administrator would only need to run a report to see if everything could be upgraded, plan on when the upgrades could take place, and create a job that would handle all of the upgrades automatically. Using the data derived from the third party monitoring service and comparing that to the information derived from the internal help desk system, can be shown that an automated system is obviously faster than a manual system. Adding on average over 30 minutes to every ticket costs time and as stated before when an incident is affecting down time that runs over 4 hours the costs to the company is well over the cost of using a network management system. As with the example above assuming that you have to upgrade the companies 300 switches once a year, which at 30 minutes per switch takes 150 man hours automating that saves you over 100 hours of a person's time based on a 75% savings in time. At $25 per hour that is a savings of over $2,800 and that is being conservative. So anyway you look at the benefits of using a network management system, you will save time and money, whether in reducing downtime, meeting SLAs, or reducing maintenance downtime. To review, network management systems have the ability to save companies money and increase profitability and productivity. This gives existing employees more time to be proactive or keeps management from having to hire more highly trained End Notes: Using historical data, it also becomes much easier to justify increasing bandwidth when there is data demonstrating the overutilization of existing circuits. You also have the ability to remove any underutilized circuits to save money. Monitoring can also help to point out potential problems by measuring network metrics such as packet loss and error rates. Without monitoring you do not have the ability to know if the number of errors on an interface is from a temporary problem or a potential issue. Network Management software is not only cost affective but critical to the well being of a company's communication infrastructure. ------------------------------ This page is intentionally left blank ![](image-2.png ".") xiWhatsUpGold.Web.20March2013.http://www.whatsupgold.com/products/whatsup-gold-core/index.aspx,xii Intermapper. 6 Ways to Calculate Returns from yourNetwork Monitoring Investment. Web Feb. 2013.http://www.intermapper.com/uploads/pdf/InterMapper%20ROI%20Whitepaper.pdfxiii Network Management. Solarwinds, Web. 20 March2013.http://www.solarwinds.com/network-manage-ment-software.aspxxiv HP. Your network enables your business. Web Jan.2013. http://h20195.www2.hp.com/V2/GetPDF.aspx/4AA1-6185ENW.pdfxv Cisco Prime Infrastructure Configuration Guide, CiscoSystems, Web Jan, 2013 http://www.cisco.com/en/US/docs/wireless/prime_infrastructure/1.2/configuration/guide/pi_12_cg.html 04/2504/25/2012 04:47:29 --log --Smith, Jeromie========= STM_RTR_Catawba -Router -Cisco -2811 Event Logs for this case : ======================== 04/25/2012 11:31:01 --close --Smith, Amber Ticket Closure: ATT ticket had no further updates (closed ticket)Opening ticket with the carrier:BGP uptime 06:55:28013ticket was already openno errors01322Year ( D D D D D D D D ) ESerial0/0/0 is up, line protocol is up Hardware is GT96K with integrated T1 CSU/DSU Description: AT&T Circuit ID DHEC.918563.ATI Internet address is 192.168.199.37/30 MTU 1500 bytes, BW 1544 Kbit/sec, DLY 20000 usec, reliability 255/255, txload 12/255, rxload 72/255 Encapsulation PPP, LCP Open Listen: CDPCP Open: IPCP, loopback not set Keepalive set (10 sec) Last input 00:00:00, output 00:00:00, output hang never Last clearing of "show interface" counters 06:48:04 Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0 Queueing strategy: Class-based queueing Output queue: 0/1000/0 (size/max total/drops) Activity Type Code: ( Checking carrier ticket: Year 04/25/2012,01:42:12:[AT&T ] Activity Type Code Desc: PROGRESS COMMENTS D D D D D D D D ) E30 second input rate 442000 bits/sec, 76This is a summary of your case and its currentpackets/secstatus30 second output rate 76000 bits/sec, 78packets/secCustomer Name :980698 packets input, 487014126 bytes, 0 no=============bufferSTM Industries (AT&T NI)Received 0 broadcasts, 0 runts, 0 giants, 0throttlesSite Name :0 input errors, 0 CRC, 0 frame, 0 overrun, 0=========ignored, 0 abortCatawba, NC (Oxford DC)1086429 packets output, 112190832 bytes, 0underrunsStatus :0 output errors, 0 collisions, 0 interface=======resetsClosed0 unknown protocol drops0 output buffer failures, 0 output buffersPriority :swapped out=======0 carrier transitionsInformationalDCD=up DSR=up DTR=up RTS=up CTS=upOpening Summary :04/25/2012 04:50:28 --priority change --Smith,=======JeromieNetwork alarm occured.Priority change, from New Investigation toInformational.Opening Description :===========04/25/2012 04:50:17 --log --Smith, JeromieTicket generation based on a network alarm. NOCcontinue to monitor and check carrier ticket atinvestigating.scheduled timeEquipment : E Are Network Management Systems, Which Are Becoming More and More Critical to the Reliability, Availability and Recoverability of Today's Data and Voice Networks Cost Effective? E Are Network Management Systems, Which Are Becoming More and More Critical to the Reliability, Availability and Recoverability of Today's Data and Voice Networks Cost Effective? E Are Network Management Systems, Which Are Becoming More and More Critical to the Reliability, Availability and Recoverability of Today's Data and Voice Networks Cost Effective? © 2013 Global Journals Inc. (US) * Computer Network Routing with a Fuzzy Neural Network JuliaKIi Brande November 7, 1997. 15 Jan. 2013 AIRLINX Communications, Inc. Network Management * Wireless Sensor Network Management and Functionality: An Overview, Wireless Sensor Network GeorgoulasDimitrios BlowKeith 10.4236/wsn.2009.14032 2009. 2009 1 Published Online November * Global Linking Solutions.www.GLS.com vi TechTarget. A new Cisco network management strategy: Cisco Prime. Web Gls 15 Jan. 2013 * -software.aspx viii SpiceWorks Network Management Software. Web. 22 Web. 20 March 2013. March 2013 /A-new-Cisco-network-management-strategy-Cisco-Prime vii Network Management. Solarwinds * 01:41:37 AT&T ] Activity Type Code Desc: TEST COMMENTS Activity Type Code: TEST unable to read the csr's CSU registers 25/2012 * 04/25/2012 05:00 Jeromie Priority Status change, from NOC Working Ticket to NOC Working Ticket. Priority status schedule is set to * Jeromie Logging into device to gather status: STM_Catawba_RTR uptime is 45 weeks, 1 day, 18 hours, 49 minutes System returned to ROM by power-on bgp= 00:06:25 * is up, line protocol is up Hardware is GT96K with integrated T1 CSU/DSU Description: AT&T Circuit ID DHEC.918563.ATI Internet address is 192.168.199.37/30 MTU 1500 bytes, BW 1544 Kbit/sec, DLY 20000 usec Serial0/0/0 reliability 255/255, txload 1/255, rxload 1/255 * output hang never Last clearing of "show interface" counters 32w6d Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 29508 Queueing strategy: Class-based queueing Output queue: 0/1000/0 (size/max total/drops) 30 second input rate 7000 bits/sec, 8 packets/sec 30 second output rate 10000 bits/sec, 10 packets/sec 442685557 packets input PppEncapsulation Lcp Open Listen set (10 sec) Last input 00:00:00, output 00:00:00 CDPCP Open: IPCP, loopback not set Keepalive 3444254842 bytes, 0 no buffer Received 0 broadcasts, 1 runts, 8 giants, 0 throttles 29581 input errors, 29581 CRC, 12182 frame, 5090 overrun, 0 ignored, 18545 abort 529237866 packets output, 968859321 bytes, 0 underruns 0 output errors, 0 collisions, 4 interface resets 0 unknown protocol drops 0 output buffer failures, 0 output buffers swapped out 16 carrier transitions DCD=up DSR=up DTR=up RTS=up CTS=up * 04/25/2012 05:00 Jeromie Priority Status change, from NOC Working Ticket to NOC Working Ticket. Priority status schedule is set to * Healthy, optimisation service: running 04/25/2012 04:36:03 --log --Manager, Event STM_RSH_Catawba -PING -OK -172 /25/2012 04:36:03 --log --Manager Event STM_RSH_Catawba -Riverbed Status -Steelhead 250 250M 0 25.1.4: rta 15.498ms, lost * /25/2012 04:36:03 Manager, Event STM_RSH_Catawba -HOST -OK -172.25.1.4: rta 20.992ms, lost 0 * 04/25/2012 04:30 Jeromie Priority Status change, from System Generated to NOC Working Ticket. Priority status schedule is set to Jeromie Reviewing ticket and ticket procedures * /25/2012 03:56:04 Manager, Event STM_RTR_Catawba -PING -CRITICAL -192.168.199.37: rta nan 100 * /25/2012 03:56:03 Manager, Event STM_RSH_Catawba -PING -CRITICAL -172.25.1.4: rta nan 100 * /25/2012 03:56:03 Manager, Event STM_RSH_Catawba -HOST -CRITICAL -172.25.1.4: rta nan 100