PathFinder

PathFinder 1.0.5 Reference Manual

This is confidential corporate information of siteROCK Corporation. Do not distribute without explicit permission.

1 Introduction

1.1 Abstract

Pathfinder is a monitoring tool focused exclusively on gathering information about the network infrastructure via ICMP probes. Pathfinder can discover and maintain data from an observation point to a set of given endpoints. Alerts can be generated on up/down transitions or on a variety of other conditions. Pathfinder provides a queryable RAM-based database and an interface that scripts and other programs can use to query it. Alert frequency limits, root cause analysis and limits on the number of forked alert scripts are integrated into the alert system to keep the alert frequency down and to avoid overloading the system during alert storms.

Pathfinder comes with a set of scripts built around the pathfinder engine for easy access to data from the web. A set of example alert scripts is provided to demonstrate the integration with the Netsaint diagnostic tool.

1.2 Features

  1. One pathfinder process can probe and monitor the routes to over one million IP addresses given enough RAM.
  2. Written in C and therefore easily portable to other platforms.
  3. Configurable history length that is kept in RAM.
  4. Extremely fast reporting and database operations because the database is realized as a linked list and optimized for IP data retrieval. The size of data elements is minimal.
  5. Can execute a script based on conditions encountered.
  6. Can alert on route changes.
  7. Can alert on failure rates, failure sequences or performance degradation to individual hosts or groups of hosts.
  8. Can be easily placed on a remote system and remotely controlled.
  9. Integrated alert system allowing the control of the frequency of alerts and the queuing of alerts to avoid overloading the system.
  10. Alert system allows alerting on root cause. If a router to 10000 systems fails then only one alert, for the failing router, will be generated.
  11. Fully Client-Server. One web-interface can retrieve data from multiple pathfinder servers.

2 Sample Session

This is a sample session with a limited number of hosts. The session assumes that the pathfinder process has been started without any options. Simply starting pfd should usually give the configuration needed to do the sample session.

Pathfinder usually listens on port 89. A connection can be established by telnetting to this port. This can be done simply by:

TELNET LOCALHOST 89

Pathfinder then answers with the HELLO string:

PathFinder V1.0.5 0 IPs 0 Paths 0 Checks

This means you are talking to PathFinder version 1.0.5. No IP addresses are defined yet and no checks are operating on those IP addresses.

We now define two groups, A and B, and place hosts in them.

GROUP A

http://www.siterock.com/

http://www.openrock.net/

http://www.siterock.co.jp/

GROUP B

http://www.abcnews.com/

http://www.cnn.com/

The current status of pathfinder can be found by typing

STATUS

This will show that Pathfinder is in state STOPPED although it has some IP addresses defined. We need to tell it to start monitoring, which is done with the START command.

START

If STATUS is issued again we can see that pings are now being sent out to those IP addresses.

We now want to see which hosts are defined.

LIST *

This will give a list of all IP addresses. Note that the IP addresses are marked with the groups they belong to, followed by either @ or $. An IP address with a @ is one of those we defined. An IP address with only a $ is a router that pathfinder found on the way to the IPs marked with @.

Note that data will not be available until the completion of the first cycle. The results of LIST * look empty until that happens.

Now let's set up a check on all groups: alert after 7 consecutive pings to a host in any group have failed.

GROUP *

VIEW 7

FAIL

Now the checks can be listed with:

CHECKS

Pathfinder can now be left running and the controlling session disconnected using

EXIT

The connection to a running pathfinder can be reestablished with the TELNET command given earlier and then the accumulated data can be queried and looked at.

3 Concepts

3.1 Measurement

Measurements are done by sending out an ICMP probe to a host. Probes are used both for the discovery of routes and for measuring the RTA times to those hosts. The result of a measurement can be:

A) The return time in microseconds (in the range 2 microseconds to 4 minutes)

B) A timeout or a failure

C) Unknown

An unknown measurement is printed as a question mark "?". A timeout is printed as "FAIL".

A measurement will be in an unknown state if certain systems cannot be reached. If 5 routers to a target cannot be reached then pathfinder will not check the rest of the path and the IPs on the rest of the path will be marked as having an unknown measurement.

An unknown state can also result from a router no longer being used by any of the targets being monitored. Pathfinder will drop the router from its list when it has a history of only unknown values.

In certain reports sequences of measurements are printed with each measurement represented by a single character:

Measurement            Symbol
Failure                *
Response time <10ms    +
Response time <100ms   =
Response time <1s      -
Response time >1s      ~
Unknown                ?
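The mapping from a measurement to its sequence character can be sketched in Python as follows. This is an illustration only, not pathfinder's own code. The function names are hypothetical, a measurement of exactly one second is assumed to fall into the ">1s" class, and the "-" symbol for the "<1s" class is an assumption taken from the sample LIST output in section 3.13, where a sequence containing a 206 ms measurement shows a "-".

```python
from typing import Iterable, Union

FAIL = "FAIL"  # how a timeout or failure is printed in reports

Measurement = Union[int, str, None]  # microseconds, "FAIL", or None for unknown


def sequence_char(measurement: Measurement) -> str:
    """Return the single report character for one measurement."""
    if measurement is None:
        return "?"                      # unknown
    if measurement == FAIL:
        return "*"                      # timeout or failure
    usec = int(measurement)
    if usec < 10_000:                   # under 10 ms
        return "+"
    if usec < 100_000:                  # under 100 ms
        return "="
    if usec < 1_000_000:                # under 1 s
        return "-"                      # assumed symbol, see the text above
    return "~"                          # 1 s or more (boundary is an assumption)


def sequence(measurements: Iterable[Measurement]) -> str:
    """Render a series of measurements as a sequence string."""
    return "".join(sequence_char(m) for m in measurements)
```

For example, sequence([None, "FAIL", 5000]) yields one character per measurement, in the order given.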

Measurements are usually specified and printed in microseconds (1/1000000 second). The USER display mode has some convenient conversions to milliseconds to make the output easier to read. Those conversions are not performed in BATCH mode.

Pathfinder tries to conserve memory by packing the representation of measurements as densely as possible. Measurements are done in microseconds but the accuracy varies. The longer the measured time, the lower the resolution in microseconds. A couple of microseconds do not matter if we are dealing with a timeframe of 2 minutes. On the other hand one would like to know the time to the microsecond if the response time was less than 200 microseconds. The minimum time measurable is 2 microseconds and the maximum time is around 4 minutes. Everything above the maximum is considered to be a timeout.

3.2 Cycle

Pathfinder does measurements in cycles. Every IP address is visited in each cycle. During a cycle there is a period of sending ICMP probes (and also accepting replies) that finishes around 15% before the end of the cycle. The rest of the cycle is then spent only on listening for replies. At the end of the cycle all outstanding replies are marked as timed out. All IPs that were not visited are marked as having an unknown measurement.

At the beginning of the cycle pathfinder calculates the ping frequency needed. The first and second cycles are usually more ping intensive than later cycles because pathfinder does not know the router structure yet and will send multiple probes to a single IP address. Once the router structure is known only one probe is sent to a router per cycle.

If pathfinder is started with a large set of IPs then the routers one or two hops away will be hit with intensive ICMP traffic during the first cycle. This can be avoided to some extent by excluding early routers from observation with the SKIP command.
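The timing of a cycle can be illustrated with a small Python sketch. How pathfinder actually calculates the ping frequency is not described here, so the sketch simply assumes that the probes are spread evenly over the sending period (the first 85% of the cycle). The function name and the parameters are hypothetical.

```python
from typing import Tuple


def cycle_plan(cycle_length_s: float, probes: int,
               send_fraction: float = 0.85) -> Tuple[float, float, float]:
    """Split a cycle into a sending and a listen-only period.

    Returns (send_window_s, listen_only_s, probes_per_second), assuming
    the probes are sent at a constant rate during the sending period.
    """
    send_window = cycle_length_s * send_fraction
    listen_only = cycle_length_s - send_window
    rate = probes / send_window
    return send_window, listen_only, rate
```

With a 60 second cycle this gives a sending period of about 51 seconds followed by about 9 seconds of listening only.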

3.3 Groups

Pathfinder separates the IP addresses into groups. These can be partially assigned by the user (Group A-Z) or are assigned by the system (@#$%!).

IP addresses can be assigned to multiple groups so that multiple checks can be done on them.

Checks also can be assigned multiple groups so that they can work on different sets of hosts.

Groups are specified by one or more characters that describe the group membership. For example

A

selects the IP addresses belonging to group A. Groups can be combined:

AB

selects the IP addresses belonging to groups A and B. Some commands (such as the LIST command) accept two group specifications: the first lists the groups to include, the second the groups to exclude. For example

LIST A B

shows the members of group A that are not members of group B.

The GROUP command can be used to set default group specifications for all future commands and definitions. For example

LIST A

and

GROUP A

LIST

produce the same output.
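The include/exclude selection can be sketched in Python as follows. The sketch makes one assumption that the text above leaves open: an IP address matches a combined specification such as AB if it belongs to at least one of the listed groups. The function names and the dictionary layout (IP address mapped to a string of group characters) are hypothetical and only serve the illustration.

```python
from typing import Dict, List


def in_selection(member_groups: str, include: str = "*", exclude: str = "") -> bool:
    """Decide whether an IP with the given group characters is selected.

    include "*" selects every group; otherwise the IP must be in at least
    one included group (assumption) and in none of the excluded groups.
    """
    groups = set(member_groups)
    if include != "*" and not groups & set(include):
        return False
    return not (groups & set(exclude))


def list_ips(database: Dict[str, str], include: str = "*", exclude: str = "") -> List[str]:
    """Return the IPs of the database that match the group specification."""
    return [ip for ip, groups in database.items()
            if in_selection(groups, include, exclude)]
```

With this sketch list_ips(db, "A", "B") corresponds to LIST A B: members of group A that are not members of group B.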

List of special group symbols:

Symbol  Meaning / Function
*       Means all groups.
@       Target. All IPs that were defined by the user have an @ symbol.
$       Router. The IP address was encountered on the path to a target. Cannot be manually assigned.
#       Inheritance. If a target is a member of this group then all the routers encountered become members of the same groups as the target. This allows the instrumentation of alerts on a dynamically changing route path. This is not a real characteristic of an IP address. The participation of user groups in inheritance is set with the INHERITGROUPS command.
%       Members of this group are PING only. No path discovery is done for hosts in this group. The participation of user groups in PING mode is set with the PINGGROUPS command.
~       The target is a member of one of the randomized groups. Its IP address will be randomly reassigned if no response was obtained. Membership in randomization is set through the RANDOMGROUPS command.
+       An operation marked this IP address.

3.4 Memory Space use

Pathfinder keeps all its history of measurements of probes to IPs in RAM. If pathfinder is swapped out performance can be severely degraded and pathfinder could abort cycles if they are not completed.

The number of IPs that can be monitored from a system is determined by the number of measurements to be kept in memory and the memory size of the system. The following formula should help to figure out how many IP addresses a pathfinder process can monitor (this is a very conservative estimate):

Number of IP addresses = Memory / 2 / (200+2*measurements per target)

With a memory of 128M and 64 measurements per target this would mean around 200000 IP addresses.

A memory size of 128M and 1440 measurements (one per minute of the day per target) would allow around 20000 IPs.

To monitor one million IP addresses we can limit the observation to once every hour and keep records for 2 days (48 measurements) in memory. Given the above we have

1 million = Memory / 2 / (200+2*48)

So we need a system with roughly 600 Megabytes of memory (1 million * 2 * 296 = 592 million bytes) to observe 1 million systems over 2 days.
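The estimate above is easy to reproduce in a few lines of Python. The function names are hypothetical; the arithmetic is the formula given in this section, solved once for the number of IP addresses and once for the memory. Memory sizes are taken in bytes and 128M is read as 128 * 2**20 bytes.

```python
def max_ip_addresses(memory_bytes: int, measurements_per_target: int) -> int:
    """Number of IP addresses = Memory / 2 / (200 + 2 * measurements)."""
    return memory_bytes // 2 // (200 + 2 * measurements_per_target)


def memory_needed(ip_addresses: int, measurements_per_target: int) -> int:
    """The same formula solved for the memory size in bytes."""
    return ip_addresses * 2 * (200 + 2 * measurements_per_target)


if __name__ == "__main__":
    megabyte = 2 ** 20
    print(max_ip_addresses(128 * megabyte, 64))     # the "around 200000" case
    print(max_ip_addresses(128 * megabyte, 1440))   # the "around 20000" case
    print(memory_needed(1_000_000, 48))             # one million IPs, 2 days
```

For one million IP addresses and 48 measurements the formula gives 592,000,000 bytes.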

3.5 Bandwidth use

After a few monitoring cycles pathfinder usually stabilizes at around one ICMP packet per IP address per cycle. It turns out that mass monitoring attempts (>1000 targets) usually result in a 1:1 relationship between targets and discovered routers. So if you want to monitor N targets then there will be roughly N routers to those targets. Therefore the number of pings needed per second is 2*N/Interval-Length.

The bandwidth can be calculated (this is conservatively overestimating):

Bandwidth in bytes/sec = 200 * Number of Targets / Interval-Length in seconds

To monitor 1 million targets (plus 1 million routers to those targets) every 5 minutes one would need 200 million / 300 seconds ~ 670 kByte/second of bandwidth. It might be more reasonable to monitor those hosts only every hour. In that case the picture is much better: 200 million / 3600 seconds ~ 56 kByte/second.

Another limitation on the ability to monitor is the number of packets sent per second. That is a limitation of the TCP/IP stack of the operating system. A safe area is less than 30000 probes per second. The rate can be calculated by dividing the number of IPs (2*N) by the interval length. For the 5 minute case with 1 million addresses we would need around 6700 probes per second. The hourly case would need less than 600 probes per second.

Given our numbers we could probably terrorize the world with probes at 1 minute intervals to 1 million hosts from a single pathfinder server. We would need around 30000 probes per second. If we wanted to keep a minute by minute record of those for a day then the memory use would be a staggering 6 Gigabytes, and the bandwidth needed would be 3 MByte/second, which would probably cause trouble with network providers and the switching fabric on the net.

CAUTION: Routers are frequently configured to shut down ICMP if they detect too high a frequency of probes originating from one system. Also note that high ICMP traffic can trigger the DoS attack detectors of network providers or expose routing problems in Internet routers. Some people do not like ICMP probing of their router structure. You can cause serious trouble on the Internet with PathFinder.
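The bandwidth and probe rate figures of this section can be recomputed with the following Python sketch. The function names are hypothetical. The result of the bandwidth formula is read as bytes per second, which is what the worked examples above imply (200 million / 300 seconds is about 670 kByte/second).

```python
def bandwidth_bytes_per_second(targets: int, interval_s: float) -> float:
    """Conservative estimate: 200 bytes per target and interval."""
    return 200 * targets / interval_s


def probes_per_second(targets: int, interval_s: float) -> float:
    """One probe per target plus one per router, with roughly N routers."""
    return 2 * targets / interval_s


if __name__ == "__main__":
    for interval in (60, 300, 3600):
        print(interval,
              round(bandwidth_bytes_per_second(1_000_000, interval)),
              round(probes_per_second(1_000_000, interval)))
```

Comparing the probe rate against the safe area of 30000 probes per second shows quickly whether a planned interval is realistic.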

I have tested pathfinder with 600000 IP addresses (300000 generated as random IPs and 300000 discovered). The attempt to monitor those IP addresses at an interval of 1 minute failed because the router detected a ping flood attack (around 5000 probes/second on startup) and stopped forwarding ICMP traffic from pathfinder. When the interval was lengthened to 5 minutes the router let the traffic through (1000 probes/second), but soon after my test finished I heard from the network abuse hotline of our connectivity provider and got a detailed report on how pathfinder probes the network. The major problem with 600000 IPs was memory. The system I used was a Celeron-433 with 128M RAM. Pathfinder needed around 90 megabytes of that, which did not leave much for the operating system.

3.6 Characteristics of IP addresses

IP address

Group Membership

Last Route / Or route to target for routers

Cycle last measurement was taken in

How many targets use this router

Individual measurements

3.7 Characteristics of Network Paths

Pointer to alternate path

Use of route (frequency of it having been measured)

Cycle the last measurement was made

Path elements

3.8 TIME selections

Pathfinder can display data from certain time frames. It will recalculate the times to find out to what cycles they refer and then display the time for those cycles.

If times before or after what is stored in memory are specified then those measurements will be displayed as unknown.

3.9 Alert Generation

Every measurement immediately leads to a check of the alert conditions on individual hosts. If an alert condition is satisfied then pathfinder will start a script:

pfd_alert [UP/DOWN/GUP/GDOWN/ROUTECHANGE]

The good router / bad router address is specified if a ROOTCAUSE clause was given for a host group and if such a cause could be localized. The good router is the last router reachable and the bad router is the first router failing. If a ROOTCAUSE is given then all hosts depending on the bad router address have been marked as being in alert state. The alert script should query pathfinder for the list of IPs depending on the bad router IP to determine the exact impact of the failure.

A DOWN alert means that at least one IP address was marked as being in alert state.

Group alerts will have zero (UNKNOWN) for all ip addresses.

3.10 Startup

Pathfinder can be invoked with a parameter specifying a script with commands to execute:

pfd

Typically these scripts begin with a LISTEN and an INTERVAL command and finish with a START command. Here is an example of a script to monitor route changes on two hosts in group A:

LISTEN 127.0.0.1 89

INTERVAL 60 64

GROUP A

Hosta.com

Hostb.com

ROUTECHANGE

PROTECT password

START

There is no need for remote configuration if pathfinder is invoked that way. If there is no LISTEN command in the command file then pathfinder will not listen for commands and the configuration cannot be changed remotely. This can be used for a locked-down configuration that only generates alerts.
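If many pathfinder instances have to be set up, a startup script like the one above can be generated by a program. The following Python sketch writes the commands in the same order as the example: LISTEN, INTERVAL, then per group the hosts and checks, then PROTECT and START. The function name, its parameters and the dictionary layout are hypothetical; only the commands themselves come from this manual.

```python
from typing import Dict, List, Optional


def build_startup_script(
    hosts_by_group: Dict[str, List[str]],
    checks_by_group: Dict[str, List[str]],
    listen_address: str = "127.0.0.1",
    port: int = 89,
    cycle_length: int = 60,
    history_size: int = 64,
    password: Optional[str] = None,
) -> str:
    """Return the text of a pathfinder startup script."""
    lines = [
        f"LISTEN {listen_address} {port}",
        f"INTERVAL {cycle_length} {history_size}",
    ]
    for group, hosts in hosts_by_group.items():
        lines.append(f"GROUP {group}")                   # default group for what follows
        lines.extend(hosts)                              # one host per line
        lines.extend(checks_by_group.get(group, []))     # checks on that group
    if password is not None:
        lines.append(f"PROTECT {password}")
    lines.append("START")
    return "\n".join(lines) + "\n"
```

Called with {"A": ["Hosta.com", "Hostb.com"]}, {"A": ["ROUTECHANGE"]} and password="password" it reproduces the example script above.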

If no parameter is given on the command line then pathfinder will act as if the command

LISTEN 0.0.0.0 89

had been specified. This will allow a remote connection to pathfinder and then a remote configuration.

3.11 IP addresses

IP addresses are printed in 3 different formats:

AAA.BBB.CCC.DDD   Default format
XXXXXX            Batch mode. This is the IP address in network byte order as a 4 byte integer as used in databases.
DNS name          DNS mode

IP addresses can be specified in any of these 3 formats as well.
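A script reading batch output has to convert between the integer form and the dotted form. The following Python sketch shows one way to do this with the standard library. It assumes that the integer is the value of the four address bytes read in network byte order (most significant byte first); the function names are hypothetical.

```python
import socket
import struct


def ip_to_int(dotted: str) -> int:
    """Convert AAA.BBB.CCC.DDD to a 4 byte integer (network byte order)."""
    return struct.unpack("!I", socket.inet_aton(dotted))[0]


def int_to_ip(value: int) -> str:
    """Convert a 4 byte integer back to the dotted default format."""
    return socket.inet_ntoa(struct.pack("!I", value))
```

Under this assumption ip_to_int("127.0.0.1") is 2130706433, and int_to_ip reverses the conversion.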

3.12 Command Abbreviations

All commands can be abbreviated to what is necessary to distinguish them from others. For example, the ADD command can be abbreviated to A, and the CHECK command works with just C.

3.13 List output format

Nr  Grp  Host            Seq            Last  Use  R  F  U  Di  RR%  Min  Avg  Max
----------------------------------------------------------------------------------
 1  @    207.82.242.33   ???????????=-    10    1  1  0  0   8  100   10  108  206
 2  @    206.79.179.198  ???????????==    11    1  1  0  0   9  100   11   11   11
 3  @    216.32.132.141  ???????????==    50    1  1  0  0   8  100   49   49   50
 4  @    209.1.169.33    ???????????==    10    1  1  0  0   7  100   10   10   11
 5  @    216.32.132.222  ???????????==    88    1  1  0  0   9  100   88   88   88
 6  @    209.1.169.114   ???????????==    10    1  1  0  0   6  100   10   10   10
 7  @    209.1.169.195   ?????????????     ?    0  1  0  2  30  100    ?    ?    ?
 8  $    209.185.84.102  ???????????==    12    1  1  0  0   8  100   11   11   12
 9  $@   216.34.2.92     ???????????==    11    4  1  0  0   8  100   11   11   11
10  $    209.10.12.49    ???????????==    40    1  1  0  0   5  100   39   39   40

Column  Meaning
Nr      Running number
Grp     Groups an IP address belongs to
Host    The IP address
Seq     Display of the last measurements in sequence characters
Last    Last measurement
Use     Number of targets (@) using this IP address
R       Number of routes to this IP address in the current timeframe
F       Number of consecutive failures
U       Number of unknown measurements
Di      Distance from the pathfinder host in hops
RR%     Return rate (how many packets out of 100 have been returned)
Min     Minimum round trip time measured in the current time interval
Avg     Average round trip time measured in the current time interval
Max     Maximum round trip time measured in the current time interval

4 Command Reference

4.1 ADD

ADD ipaddress [groups]

Adds an IP address to the database. If groups are specified then assign the IP address to the groups given. If no group is specified then the IP address is assigned to the group set with the GROUP command.

4.2 ALERTS

ALERTS [include-groups [exclude-groups]]

List current alert state. The output format includes the time the alert occurred and the text generated for the alert. These are alert states memorized by pathfinder. Alert states are only kept for down hosts and down group conditions. A routechange is an event and not an alert state and therefore will never show up in the alert state list. Alert histories should be kept by scripts invoked by pathfinder but not by pathfinder itself.

4.3 BATCH

No parameters.

Switch the output format to BATCH format. Batch format is optimized for scripts retrieving output. Tabs separate fields and records are separated by newline. All measurements are displayed in microseconds avoiding pretty printing. All IP addresses are displayed as one integer in network byte order. This format can be imported directly into databases such as mysql or used to generate associative arrays in PERL or PHP. Batch format is not very readable by humans since the output is not justified and might overflow lines.

The default mode is USER. Batch mode has to be switched on for every connection established.
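Because batch output is tab separated with one record per line, a script can turn it into records with very little code. The following Python sketch is an illustration only. The field names have to be supplied by the caller, since the columns depend on the command that produced the output; the function name and the names used in the usage note are hypothetical.

```python
from typing import Dict, List, Sequence


def parse_batch(text: str, field_names: Sequence[str]) -> List[Dict[str, str]]:
    """Split BATCH output into one dictionary per record.

    Fields are separated by tabs and records by newlines. Values are kept
    as strings; converting them is left to the caller.
    """
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip empty lines
        values = line.split("\t")
        records.append(dict(zip(field_names, values)))
    return records
```

A call such as parse_batch(output, ["nr", "grp", "host", "last"]) returns a list of dictionaries that can be fed to a database or used like the associative arrays mentioned above.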

4.4 CHECKDEL

CHECKDEL groups

Removes all checks that match the groups given. If a check is a member of multiple groups then the group association to the groups given is removed. If a check is no longer member of any group then the check is removed.

4.5 CHECKS

CHECKS [include-groups [exclude-groups]]

List checks. If no groups are specified then list the checks for the groups set with the GROUP command.

4.6 CLEAR

No parameters

Wipe out the database. Pathfinder must be in a stopped state for this to work. It is necessary to issue a CLEAR if the number of measurements per IP needs to be increased.

4.7 CONCURRENTALERTS

CONCURRENTALERTS number [groups]

Allow number alert scripts to run simultaneously. The default configuration is to allow only one alert script to run. Other alerts generated while the script is running are ignored and counted as ignoredalerts (displayed by the CYCLE command). An alert that is being processed shows up in the ALERTS display with the process ID of the forked script for as long as it is active.

4.8 COUNT

COUNT [includegroups [excludegroups]]

Count the matching items and display the number of matches.

4.9 CYCLE

No parameters

Print information about the monitoring cycles in the current time/viewframe.

4.10 DATA

DATA ipaddress

Prints the measurements taken in the current time/viewframe for a certain IP address.

4.11 DEL

DEL ipaddress

Remove an IP address. Only IP addresses that are marked as @ (targets) can be removed. If the IP address is also marked as $ then the target mode is removed but the IP will stay in the database until the router is no longer used.

4.12 DEPENDS

DEPENDS ipaddress [hops]

Print IPs that depend on a given IP address. If no hops are specified then print all targets (@ marked IPs) going through this router.

If hops are specified then print all routers using this router at the specified distance of hops from the specified router. For example DEPENDS ipaddress 1 would display all routers one hop to the target from ipaddress.

4.13 DNS

No parameters.

Switch on reverse DNS lookups and the display of DNS names for all reports.

WARNING: DNS resolution can take a long time, especially if no reverse DNS is set up for some IP addresses.

4.14 EXIT

No parameters

Terminate the current session. Session parameters (view/timeframe, batch/dns mode) are lost but the IP/check configuration will be kept and pathfinder will continue to run.

4.15 FAIL

FAIL [group]

Generate an alert if a member of the specified group (or if no groups are specified the groups specified through the GROUP command) has more than the current viewlength (See VIEW command) consecutive failures.

4.16 GFAIL

GFAIL number [group]

Generate an alert if more than number of hosts are failing in the group specified (or if not specified the groups set by the GROUP command). Failures are determined to exist if viewlength (See VIEW command) consecutive failures have occurred.

4.17 GRATE

GRATE returnrate [group]

Generate an alert if the returnrate of the group specified (or the group specified with the GROUP command if not specified in this command) is lower than the percentage returnrate specified over the time period set with the VIEW command.

4.18 GRESPONSE

GRESPONSE ms [group]

Generate an alert if the average response time of the group specified (or the group specified with the GROUP command if unspecified) exceeds the responsetime ms set over the time period set with the VIEW command.

4.19 GROUP

GROUP [include-group [exclude-group]]

Set the groups to be displayed by display commands or to be used for definitions of checks and IP addresses. If no parameters are specified reset to the display of all IPs. Otherwise include and exclude the groups specified.

WARNING: If some IP addresses just won't show up then it could be that a group setting is active. Get rid of it by just typing

GROUP

4.20 GROUPDEL

GROUPDEL include-group [exclude-group]

Remove IPs belonging to the groups specified.

4.21 HELP

No parameters

Display information about available commands and their syntax.

4.22 INHERITGROUPS

INHERITGROUPS groups

Mark all routers in the groups specified as being members of the same groups as the target. The checks defined for those groups will then also apply to the routers.

4.23 INTERVAL

INTERVAL cyclelength [history-size]

Set the length of a cycle in seconds and set the number of measurements kept in memory. This command should be given when pathfinder is stopped. Reducing the number of measurements kept is possible at all times but will lead to wasted memory. Enlarging the number of measurements kept is only possible if there are no hosts defined and no data available yet. The CLEAR command can be used to wipe out everything that has been defined so far and to allocate memory correctly. Use CLEAR before redefining the history-size if IPs are already defined.

4.24 IP

IP address

Prints all characteristics about an IP address and a summary of the data in the current view.

4.25 IPS

IPS [includegroups [excludegroups]]

List only the IPs satisfying the conditions. Do not list additional information.

4.26 LIMIT

LIMIT [field [lower-boundary [upper-boundary]]]

LIMIT restricts the items displayed by the list commands and the SORT command to those with values in certain ranges. All ranges specified must be satisfied in order for data to be displayed.

Restrictions can be placed on the following values:

Field     Purpose and comments
TARGETS   The number of targets reached through a router
DISTANCE  Distance in hops of the IP address
ROUTES    Number of alternative routes encountered in the view interval.
MIN       Minimum RTA
MAX       Maximum RTA
AVG       Average RTA
FAIL      Number of consecutive failures
UNKNOWN   Number of measurements not performed
COUNTER   The number of the item to be displayed. Restrictions on COUNTER can be used to display the 200th to 400th element of a list. Simply setting the upper boundary limits the number of items displayed by commands listing IPs.

If the LIMIT command is given without any parameters then all the boundary limits currently set are displayed. If LIMIT is used with one parameter of the field then the field boundaries are reset so that all IP addresses match.

If just a lower limit is specified then the upper limit is not enforced.
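The effect of several LIMIT settings can be pictured as a filter in which every configured range has to be satisfied. The following Python sketch illustrates this. It assumes that the boundaries are inclusive, which is not stated above, and represents a missing upper boundary as None. The function names and the record layout are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Limits = Dict[str, Tuple[float, Optional[float]]]  # field -> (lower, upper or None)


def within_limits(record: Dict[str, float], limits: Limits) -> bool:
    """True if the record satisfies every configured range."""
    for field, (lower, upper) in limits.items():
        value = record[field]
        if value < lower:
            return False
        if upper is not None and value > upper:   # upper limit only if given
            return False
    return True


def apply_limits(records: List[Dict[str, float]], limits: Limits) -> List[Dict[str, float]]:
    """Keep only the records that pass all limits."""
    return [r for r in records if within_limits(r, limits)]
```

For example, the limits {"DISTANCE": (5, 8), "AVG": (40, None)} keep only IPs that are 5 to 8 hops away and have an average RTA of at least 40.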

4.27 LIST

LIST [includegroups [excludegroups]]

List all the IPs satisfying the group conditions or the conditions set by the GROUP command. Each IP has a full line of information added to it.

4.28 LISTEN

LISTEN listen-address [port [allow-ip [allow-mask]]]

Set the address and port on which pathfinder listens for connections. The IP addresses from which connections are allowed can be restricted with allow-ip and allow-mask to secure the machine.

4.29 LOG

LOG loglevel

Change the log level. Pathfinder usually starts with loglevel 0, at which only critical errors are logged. Log output goes to a system log file (see the Ports section). The following loglevels exist:

Loglevel  What is logged
0         Only critical errors
1         Very important problems. Script failures.
2         Commands given to pathfinder are logged. ALERTs generated.
3         Rootcause analysis
4         New routers, invalid responses, removal of routers.
5         Establishing and termination of remote connections to pathfinder
6         Rerandomization. Target/router conversion
7         Statistics about monitoring cycles, HASH growth, authentication
8         Sending and receiving ICMP traffic
9         Decision making on the processing of responses

CAUTION: High levels of logging can create a need for I/O that might cause delays in pathfinder's processing. Keep logging to a minimum in high IP use situations.

4.30 LSEQ

LSEQ [includegroups [excludegroups]]

Print the data for the groups using sequence characters to get an overview of the measurements.

4.31 NODNS

No parameters.

Switch off the reverse DNS resolution. This is the default.

4.32 PATH

PATH ipaddress [number]

Print all the routes to an ipaddress in the given time/viewframe. If a number is given then just print the indicated alternate route.

4.33 PINGGROUPS

PINGGROUPS groups

Do not perform path analysis for the groups defined.

4.34 PROTECT

PROTECT password

If a session is unprotected then protect it with a password. Future sessions with pathfinder will by default not allow users to make changes. Users can make changes once they give a PROTECT command with the same password.

This can be used so that only certain scripts can modify the pathfinder configuration. Simply precede each change with PROTECT password.

4.35 RANDOM

RANDOM number [groups]

Generates number IP addresses using a random IP generator and puts them into the groups given. Note that some sysadmins are easily offended if one probes random IP addresses on the Internet. Be careful.

RANDOM implicitly sets RANDOMGROUPS to groups.

4.36 RANDOMGROUPS

RANDOMGROUPS groups

Rerandomize the IP addresses of members of groups if the targets do not respond. Rerandomization is repeated until a response is received.

4.37 RANGE

RANGE ipaddress number [groups]

Define a range of ipaddresses to be in group groups. The ipaddress is the starting IP. It is then incremented number times to generate additional IPs.
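The expansion of a range can be sketched in Python with the standard ipaddress module. The sketch follows the wording above literally: the starting IP is kept and then incremented number times, which yields number + 1 addresses in total. Whether pathfinder counts the starting IP in number is an assumption. The function name is hypothetical.

```python
import ipaddress
from typing import List


def expand_range(start: str, number: int) -> List[str]:
    """Return the starting IP followed by `number` incremented IPs."""
    first = ipaddress.IPv4Address(start)
    # Adding an integer to an IPv4Address carries over octet boundaries.
    return [str(first + offset) for offset in range(number + 1)]
```

Note that the increment carries into the next octet, so a range starting at 10.0.0.254 continues with 10.0.0.255 and then 10.0.1.0.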

4.38 RATE

RATE percent [groups]

Alert if the return rate of a host in the groups (or the groups specified with the GROUP command) is lower than specified over the timeperiod specified with the VIEW command.

4.39 RECOVERYALERT

RECOVERYALERT [groups]

Generate an alert if the failure condition goes away. Without this command no notification of a recovery is performed.

4.40 REPEATALERT

REPEATALERT cycles [groups]

Repeat an alert for a condition if it has not cleared within cycles monitoring cycles.

4.41 RESPONSE

RESPONSE usec [groups]

Alert if the response time is higher than indicated over the timeperiod specified with the VIEW command for the groups specified (or the groups specified with the GROUP command).

4.42 ROOTCAUSEALERT

ROOTCAUSEALERT [usec [groups]]

If this command is given then the alerts are not generated for individual targets but for the router that caused a failure. All targets affected are marked as being in alert state before the one alert on the router is generated. The ROOTCAUSEALERT allows an effective reduction of alert frequency.

Root cause analysis can only be performed on host alerts not on group based alerts. The first parameter given to the directive specifies the criteria used to decide if the host is part of the problem. If the response time of a host is greater than the time specified or if the host did not respond at all then the host is considered part of the down part of the path and can then eventually be flagged by the root cause analysis as the culprit for the failure. The default for the time period is 1 second or 1000000 microseconds.

Rootcause works as an additional instrumentation of FAIL, RESPONSE or RATE. It makes most sense to use it with FAIL. The most reliable detection requires a VIEW length of 2 or more cycles on FAIL or RESPONSE. If 1 cycle is specified then flukes can result in the detection of a root cause failure. The fault might also have occurred during the path probing that led to the detection of the first failure. At that point the network path reflects a path in transition into failure, which is not very useful to analyze.
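The localization of the good and the bad router can be illustrated with a short Python sketch. It walks the path from the pathfinder host towards the target and applies the criterion described above: a hop counts as down if it did not respond or if its response time exceeds the threshold (default 1000000 microseconds). This is an illustration of the rule, not pathfinder's implementation; the function name and the path representation are hypothetical.

```python
from typing import List, Optional, Tuple

Hop = Tuple[str, Optional[int]]  # (ip address, response time in usec or None)


def locate_root_cause(path: List[Hop],
                      threshold_usec: int = 1_000_000) -> Tuple[Optional[str], Optional[str]]:
    """Return (good_router, bad_router) for a path ordered towards the target.

    good_router is the last hop reachable before the failure, bad_router
    the first failing hop. bad_router is None if no hop on the path is down.
    """
    good = None
    for ip, rtt in path:
        if rtt is None or rtt > threshold_usec:
            return good, ip
        good = ip
    return good, None
```

An alert script can then ask pathfinder with DEPENDS which targets go through the bad router.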

4.43 ROUTECHANGE

ROUTECHANGE [groups]

Generate an alert if a route change happens to one of the targets in the groups specified or in the groups specified with the GROUP command.

4.44 ROUTES

ROUTES ipaddress

Print the number of routes to an ipaddress in the specified timeframe.

4.45 SEQ

SEQ ipaddress

Print all measurements in the timeframe given to an IP address using sequence characters.

4.46 SHUTDOWN

No parameters.

Shut down pathfinder remotely.

4.47 SKIP

SKIP number-of-hops

Do not analyze or trace the route over the first number-of-hops hops. The first hop recorded is the one after number-of-hops.

4.48 SORT

SORT data-item [order [number-of-items [include-groups [exclude-groups]]]]

data-item = USE | ROUTES | DISTANCE | MIN | AVG | MAX | FAIL | UNKNOWN

order = ASCENDING | DESCENDING

Sort the values given the criteria and display at maximum number-of-items. The smaller the number-of-items the faster the sort. It is advisable not to specify a length longer than 1000. Display is in LIST format.
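Why a small number-of-items makes the sort faster can be seen in the following Python sketch, which selects only the requested number of items instead of ordering the whole database. It uses heapq from the standard library as a stand-in technique; how pathfinder implements SORT internally is not described in this manual. The function name and the record layout are hypothetical.

```python
import heapq
from typing import Dict, List


def top_items(records: List[Dict[str, float]], data_item: str,
              number_of_items: int, descending: bool = False) -> List[Dict[str, float]]:
    """Return at most number_of_items records ordered by data_item.

    heapq keeps only number_of_items candidates while scanning, so the
    cost grows with the size of the result rather than with a full sort.
    """
    pick = heapq.nlargest if descending else heapq.nsmallest
    return pick(number_of_items, records, key=lambda record: record[data_item])
```

A call such as top_items(records, "AVG", 10, descending=True) corresponds to SORT AVG DESCENDING 10.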

4.49 START

No parameters

Begin ICMP probing. Hosts must have been defined for this to work. Hosts can be defined after a START command but the data collection might not be very consistent for the cycle in which they are defined.

4.50 STATUS

No parameters

Shows the current session and pathfinder parameters. Note that some of the parameters are the current running parameters within a probe cycle. The CYCLE command displays the final counts from cycles.

4.51 STOP

No parameters

Stops ICMP probing. This should only be done in emergencies. STOP aborts the current cycle, so cycle data might be inaccurate. If possible do not START after STOP but use CLEAR to wipe out the data first.

4.52 TIME

TIME [from [to]]

Set the current view window to begin at the time indicated. If to is also specified then calculate the length of the timeframe in cycles and set it (overrides the length set with the VIEW command).

TIME with no parameters sets the view to the most recent cycles.

4.53 USER

No parameters.

Switch from BATCH mode to user mode.

4.54 VIEW

VIEW [cycles]

Sets the number of measurement cycles to be displayed or considered for statistics and checks.

VIEW without a parameter sets the number of cycles to the number of measurements kept in memory.

5 Ports

5.1 Linux

Output of the processing can be found in /var/log/pathfinder.err and /var/log/pathfinder.out.

Pathfinder needs a raw socket in order to send its ICMP probes. Raw socket allocation requires superuser privileges.

Pathfinder can securely run setuid. It drops superuser rights immediately and reestablishes them only to open the raw socket and the log files for output and errors, which have a fixed location.

5.2 Microsoft Operating Systems: CygWIN environment

So far Pathfinder has only been run in debugging mode (without forking) in this environment.

6 Writing Scripts

6.1 Introduction

Pathfinder only becomes really powerful when its information is used for other purposes such as alarm correlation or integration into a monitoring framework (such as Netsaint). That integration is achieved through scripts that hold a dialogue with pathfinder to configure it or to retrieve data from it. Pathfinder is designed to be scripted in this way. A special operating mode, BATCH mode, exists to allow easy interaction from PHP, PERL and other scripting languages.

Scripts can be invoked from Pathfinder based on certain alert events. Those scripts can then investigate the event further by querying other pathfinder information and then take actions as needed.

6.2 Connecting to Pathfinder

The connection to pathfinder is a simple TCP connection. Usually pathfinder uses port 89 to accept a connection but this can be reconfigured with the LISTEN command (for example to run multiple pathfinder processes on one machine or to deal with firewall issues).

Here is a fragment of a PHP script establishing a connection:

/* Establish TCP connection to pathfinder */
$cid = fsockopen($pfd_address, $pfd_port, $errno, $errstr);
if (!$cid) {
    echo "The connection to the pathfinder server on $pfd_address port $pfd_port failed: $errstr";
} else {
    /* . . . Successfully connected . . . */
}

After the initial connection it is good to do a handshake to see that the right connection has been established. For that purpose the hello string of pathfinder should be parsed:

/* Split the tab-separated hello line; the unnamed positions are skipped */
list($hello,$version,$ips,,$paths,,$checks,)=explode("\t",fgets($cid,200));
if ($hello == "PathFinder") {
    /* . . . all is well . . . */
} else {
    /* . . . this is not a pathfinder connection . . . */
}

After this is complete some information about pathfinder is already available. Please check the version number if possible. Significant changes will result in a change of the major version number. The commands described in here should work with all pathfinder 1.x.y versions.
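The handshake and the version check can also be done from other scripting languages. Below is a minimal Python sketch under stated assumptions: the field positions are copied from the PHP list() call above, the function names parse_hello and is_supported are hypothetical, and the sample hello line (including the text in the skipped fields) is made up for illustration and not captured from a real server.

```python
def parse_hello(line):
    """Split the tab-separated hello line into the fields the PHP example reads."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 7 or fields[0] != "PathFinder":
        raise ValueError("this is not a pathfinder connection")
    # Positions 3 and 5 are skipped, exactly as in the PHP list() call.
    return {
        "version": fields[1],
        "ips": fields[2],
        "paths": fields[4],
        "checks": fields[6],
    }


def is_supported(version):
    """The commands in this manual are described for all 1.x.y versions."""
    return version.split(".")[0] == "1"


# Made-up sample line; the real content of the skipped fields is not known here.
sample = "PathFinder\t1.0.5\t120\tIPs\t340\tPaths\t5\tChecks\n"
info = parse_hello(sample)
print(info["version"], is_supported(info["version"]))
```

Checking only the major version number follows the rule that significant changes result in a new major version.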

6.3 Configuring the Connection

After the connection has been established pathfinder needs to know what information we want and in what format we want it. The first step of a connection from a script is usually to establish BATCH mode. Batch mode avoids all the commandline niceties that just cause trouble while parsing and allows scripts to stay very simple.

fputs($cid,"BATCH\n");
$result=fgets($cid,200);

With that we have already issued a first command to pathfinder. Note that all commands must be terminated with a newline (otherwise your script will seem to hang). All commands return an empty line on success, so $result should contain only a newline. If it does not, then it contains an error message.
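The rule "an empty line means success, anything else is an error message" can be wrapped in a small helper. The Python sketch below is one possible shape, not part of pathfinder itself: send_command and PathfinderError are hypothetical names, and the error text in the demonstration is invented. The helper works on file-like objects, so with a real server they would come from a TCP socket (for example via socket.create_connection and sock.makefile), while the demonstration uses in-memory streams.

```python
import io


class PathfinderError(Exception):
    """Raised when pathfinder answers a command with an error message."""


def send_command(rfile, wfile, command):
    """Send one configuration command and expect the empty success line."""
    wfile.write(command + "\n")  # every command must be newline-terminated
    wfile.flush()
    reply = rfile.readline()
    if reply == "":
        raise PathfinderError("connection closed by pathfinder")
    if reply.strip() != "":
        raise PathfinderError(reply.strip())


# Demonstration with in-memory streams standing in for the socket.
fake_server_output = io.StringIO("\n")  # a single empty line: success
sent = io.StringIO()
send_command(fake_server_output, sent, "BATCH")
print(repr(sent.getvalue()))
```

Raising an exception on any non-empty reply keeps the calling script from silently continuing after a rejected command.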

Now we need to select which data we want to see. This is usually done with the TIME, VIEW and GROUP commands. Some examples:

if ($pfd_viewlength) {
    fputs($cid,"VIEW $pfd_viewlength\n");
    fgets($cid,200);
}
if ($pfd_time) {
    fputs($cid,"TIME $pfd_time\n");
    fgets($cid,200);
}

If the user has set a view length or a time then issue a command to pathfinder to set the viewlength or timeframe.

6.4 Issuing Commands and retrieving Information

During configuration, commands are only sent to pathfinder and nothing but the empty success line comes back. Retrieving information is more complicated. A command that retrieves information returns the information itself, followed by an empty line that marks the completion of the command. Each line read must therefore be checked to see whether it is that final empty line.

fputs($cid,"CHECKS\n");
while (!feof($cid)) {
    $y=chop(fgets($cid,200));
    if (empty($y)) break;   /* empty line: end of the response */
    list($nr,$group,$text)=explode("\t",$y);
    if (empty($text)) break;   /* incomplete line: stop processing */
    /* . . . process the information . . . */
}

This queries all the defined conditions that can lead to alerts. It's a bit paranoid about checking for invalid data, but better safe than sorry. All data from all informational commands can be processed in the above way. See the pfd.php script for more examples and code pieces that could be used elsewhere.
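The same read loop can be sketched in Python. This is an illustration under assumptions, not the pfd.php code: read_rows and parse_checks are hypothetical names, and the sample CHECKS output (check numbers, group names and texts) is invented. As in the PHP fragment, reading stops at the empty line and incomplete rows end the processing.

```python
import io


def read_rows(rfile):
    """Read tab-separated lines until the empty line that ends a response."""
    rows = []
    while True:
        line = rfile.readline()
        if line == "":  # end of file: the connection was closed
            break
        line = line.rstrip("\r\n")
        if line == "":  # empty line: the command is complete
            break
        rows.append(line.split("\t"))
    return rows


def parse_checks(rows):
    """Turn CHECKS rows into (number, group, text) tuples, stopping on bad data."""
    checks = []
    for row in rows:
        if len(row) < 3 or row[2] == "":
            break  # same paranoia as the PHP fragment
        checks.append((row[0], row[1], row[2]))
    return checks


# Invented sample response, followed by data that must not be consumed.
response = io.StringIO("1\tcore\tFAIL\n2\tedge\tROUTECHANGE\n\nleftover\n")
for nr, group, text in parse_checks(read_rows(response)):
    print(nr, group, text)
```

Separating the line reader from the row parser lets the same read_rows function serve every informational command.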

Pathfinder is a single-threaded design that processes commands in a busy loop. Some of the commands that retrieve information cannot be satisfied within the busy loop without pathfinder starting to hang, so pathfinder forks for these commands. This means that pathfinder is able to accept a new command while the old one is still generating output. Do not send multiple commands that retrieve large amounts of information at the same time. Pathfinder will silently ignore concurrent informational commands. Only send a new command after the final empty line has been received.
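One way for a script to respect this rule is to route every informational command through a single function that does not return until the final empty line has arrived. The Python sketch below assumes a script that may use several threads and therefore adds a lock. The class name Session is hypothetical and the response lines in the demonstration are invented.

```python
import io
import threading


class Session:
    """Serializes informational commands on one pathfinder connection."""

    def __init__(self, rfile, wfile):
        self._rfile = rfile
        self._wfile = wfile
        self._lock = threading.Lock()

    def query(self, command):
        """Send a command and return its lines; blocks until the empty line."""
        with self._lock:  # no second command until this one is complete
            self._wfile.write(command + "\n")
            self._wfile.flush()
            lines = []
            for line in iter(self._rfile.readline, ""):
                line = line.rstrip("\r\n")
                if line == "":
                    break  # final empty line received
                lines.append(line)
            return lines


# Invented responses for two consecutive commands.
server = io.StringIO("1\tcore\tFAIL\n\nsome status line\n\n")
sent = io.StringIO()
session = Session(server, sent)
print(session.query("CHECKS"))
print(session.query("STATUS"))
```

Since query only returns after the terminating empty line, a second command can never overlap the first on the same connection.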

6.5 Special Considerations for Alert Scripts

Alerts cause pathfinder to execute the pfd_alert script. Pathfinder is designed to monitor masses of IP addresses. There is the potential of generating massive amounts of alerts and therefore massive amounts of processes on a system. To guard against that pathfinder has a queuing mechanism for script invocations and a limit on concurrent script execution. The default limit is to only allow one script for a given group to run at a time. Other alerts are queued and executed when that script terminates. The limitations can be changed with the CONCURRENTALERTS command.

No new script will be executed until the alert script for a group has terminated. The design of the alert scripts needs to take that into account. Do not simply fork and return to make pathfinder think that the script is complete; let the script really finish its work. If you bypass Pathfinder's safeguards for limiting the number of alerts generated then the potential for overloading the system exists again. Alert scripts need to do their job quickly, and they should not fork other jobs that continue running.
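As an illustration of this advice, an alert script written in Python could run each follow-up step in the foreground and wait for it, with a time limit so that the script stays fast. This is only a sketch: run_step is a hypothetical helper, the 30 second limit is an arbitrary assumption, and the arguments that pathfinder passes to pfd_alert are not covered here. The demonstration runs a harmless Python one-liner in place of a real notification command.

```python
import subprocess
import sys


def run_step(argv, timeout_seconds=30):
    """Run one step in the foreground and wait until it has finished.

    Returns (returncode, output); returncode is None if the step timed out.
    """
    try:
        # subprocess.run blocks until the child exits, so nothing keeps
        # running after the alert script itself has terminated.
        done = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return None, ""  # the child is killed by subprocess.run on timeout
    return done.returncode, done.stdout


# Stand-in for a real notification command.
code, output = run_step([sys.executable, "-c", "print('notified')"])
print(code, output.strip())
```

The pattern to avoid is starting a child process and returning without waiting for it, because pathfinder would then start the next queued alert while the previous work is still running.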

7 The PHP Web interface

7.1 Introduction

The PHP web interface makes it easy to display and handle pathfinder information. In particular it offers:

  • Fast Reverse DNS resolution caching DNS information in a MySQL table. This means that the information is much easier for humans to handle.
  • Display of measurements in milliseconds.
  • Point and Click for browsing through the information on the network status.
  • Ability to connect to any pathfinder server from the Web Interface.

7.2 Connecting

There is a link on the Tbox homepage to the PHP interface to Pathfinder. Click on that or go directly to the interface using http://hostname/pfd.

The first screen lets you specify the location and port of a pathfinder server and restrict the timeframe to be investigated. The default values connect to a pathfinder server on the TBOX we are connecting to (should a pathfinder be running there).