watchme

The gpu watcher will help to monitor gpu devices (memory, power, versions, limits). This watcher must obviously be run on a machine with one or more GPU devices.

Let’s first create a watcher for your tasks to live under:

$ watchme create gpu

Now let’s walk through the gpu_task, which is currently the only one available for the gpu watcher. If you aren’t familiar with how to add a task and otherwise manage them, see the getting started.

Add a Task

Remember that we are going to be added tasks to our watcher, “gpu” above. The general format to add a task looks like this:

$ watchme add-task <watcher> <task-name> key1@value1 key2@value2

The key and value pairs are going to vary based on the watcher task.

Task Parameters

The gpu task doesn’t have any required arguments, but you can define a comma separated list of entries to skip.

Name Required Default Example Notes
skip No undefined skip@ varies by task (see below)

Also note that for all of the tasks below, you need to select the type of task with --type gpu.

Task Environment

For any task, if you have a variable exporter to the environment that starts with WATCHMEENV_, it will be found and added to the result. For example, WATCHMEENV_result_id=123 will be saved to the results dictionary with key “result_id” set to “123.”

Return Values

All of the return values here will be dictionaries, meaning appropriate to export to json. You shouldn’t need to define a file_name parameter, as it will be set automatically to be your host plus username. However, you are free to set this parameter to change this value.

1. The GPU Task

This task is useful for monitoring a specific process that uses a GPU. You can run it as a task (discussed first) or a decorator to a function to enable continuous monitoring. Either way, you will want to create your watcher first:

$ watchme create gpu

Run on the Fly, Python

It’s most likely you want to run and monitor a command on the fly, either from within Python or the command line. First, let’s take a look at running from within Python. We are going to use a TerminalRunner to run the command:

from watchme.tasks.decorators import TerminalRunner
runner = TerminalRunner('sleep 5')
runner.run()
timepoints = runner.wait('gpu_task')

You’ll get a list of timepoints, collected at intervals of 3 seconds! Here is a look at the first timepoint:

In [3]: timepoints[0]                                                                                                       
Out[3]: 
{'nvml_driver_version': '418.67',
 'nvml_system_nvml_version': '10.418.67',
 'nvml_deviceCount': 1,
 'nvml_unit_count': 0,
 'devices': {'Tesla V100-SXM2-32GB': {'nvml_device_board_id': 6656,
   'nvml_device_multi_gpu_board': 0,
   'nvml_device_brand': 2,
   'nvml_device_serial': '0323518083147',
   'nvml_device_set_cpu_affinite': None,
   'nvml_device_minor_number': 0,
   'nvml_device_uuid': 'GPU-be7b9ac8-f75e-1960-c52d-85429b4c86b1',
   'nvml_device_inforom_version': 'G503.0203.00.04',
   'nvml_device_inforam_checksum': 0,
   'nvml_device_display_mode': 1,
   'nvml_device_display_active': 0,
   'nvml_device_persistence_mode': 1,
   'nvml_device_supported_memory_clocks': [877],
   'nvml_device_performance_state': 0,
   'nvml_device_management_mode': 1,
   'nvml_device_power_managerment_mode': 1,
   'nvml_device_power_management_limit': 300000,
   'nvml_device_power_management_limit_constraints': [150000, 300000],
   'nvml_device_power_management_default_limit': 300000,
   'nvml_device_enforced_power_limit': 300000,
   'nvml_device_power_usage': 45297,
   'nvml_device_memory_info': {'free': 34058207232,
    'total': 34058272768,
    'used': 65536},
   'nvml_device_bar1_memory_info': {'bar1Free': 34357047296,
    'bar1Total': 34359738368,
    'bar1Used': 2691072},
   'nvml_device_compute_mode': 3,
   'nvml_device_ecc_mode': [1, 1],
   'nvml_device_current_ecc_mode': 1,
   'nvml_device_pending_ecc_mode': 1,
   'nvml_device_utilization_rates': {'gpu': 0, 'memory': 0},
   'nvml_device_encoder_utilization': [0, 167000],
   'nvml_device_decoder_utilization': [0, 167000],
   'nvml_device_pci_replay_counter': 0,
   'nvml_device_vbios_version': '88.00.43.00.03',
   'nvml_device_compute_running_processes': [],
   'nvml_device_grapics_running_processes': [],
   'nvml_device_supported_event_types': 31,
   'nvml_device_current_pcie_link_generation': 3,
   'nvml_device_max_pcie_link_generation': 3,
   'nvml_device_curr_pcie_link_width': 16,
   'nvml_device_max_pcie_link_width': 16,
   'nvml_device_supported_clocks_throttle_reasons': 511,
   'nvml_device_current_clocks_throttle_reasons': 1,
   'nvml_device_index': 0,
   'nvml_device_accounting_mode': 0,
   'nvml_device_accounting_pids': [],
   'nvml_device_accounting_buffer_size': 4000}},
 'SECONDS': '3'}

Do you need to add data to the structure? Just export it with prefix WATCHMEENV_*

os.environ['WATCHMEENV_AVOCADO'] = '5'

and it will appear in the result.

Run on the Fly, Command Line

If you choose, the same function can be run via the watchme command line client. If you provide no additional arguments, it will print the data structure to the screen:

$ watchme monitor --func gpu_task sleep 2
[{"nvml_driver_version": "418.67", "nvml_system_nvml_version": "10.418.67", "nvml_deviceCount": 1, "nvml_unit_count": 0, "devices": {"Tesla V100-SXM2-32GB": {"nvml_device_board_id": 6656, "nvml_device_multi_gpu_board": 0, "nvml_device_brand": 2, "nvml_device_serial": "0323518083147", "nvml_device_set_cpu_affinite": null, "nvml_device_minor_number": 0, "nvml_device_uuid": "GPU-be7b9ac8-f75e-1960-c52d-85429b4c86b1", "nvml_device_inforom_version": "G503.0203.00.04", "nvml_device_inforam_checksum": 0, "nvml_device_display_mode": 1, "nvml_device_display_active": 0, "nvml_device_persistence_mode": 1, "nvml_device_supported_memory_clocks": [877], "nvml_device_performance_state": 0, "nvml_device_management_mode": 1, "nvml_device_power_managerment_mode": 1, "nvml_device_power_management_limit": 300000, "nvml_device_power_management_limit_constraints": [150000, 300000], "nvml_device_power_management_default_limit": 300000, "nvml_device_enforced_power_limit": 300000, "nvml_device_power_usage": 45327, "nvml_device_memory_info": {"free": 34058207232, "total": 34058272768, "used": 65536}, "nvml_device_bar1_memory_info": {"bar1Free": 34357047296, "bar1Total": 34359738368, "bar1Used": 2691072}, "nvml_device_compute_mode": 3, "nvml_device_ecc_mode": [1, 1], "nvml_device_current_ecc_mode": 1, "nvml_device_pending_ecc_mode": 1, "nvml_device_utilization_rates": {"gpu": 0, "memory": 0}, "nvml_device_encoder_utilization": [0, 167000], "nvml_device_decoder_utilization": [0, 167000], "nvml_device_pci_replay_counter": 0, "nvml_device_vbios_version": "88.00.43.00.03", "nvml_device_compute_running_processes": [], "nvml_device_grapics_running_processes": [], "nvml_device_supported_event_types": 31, "nvml_device_current_pcie_link_generation": 3, "nvml_device_max_pcie_link_generation": 3, "nvml_device_curr_pcie_link_width": 16, "nvml_device_max_pcie_link_width": 16, "nvml_device_supported_clocks_throttle_reasons": 511, "nvml_device_current_clocks_throttle_reasons": 1, "nvml_device_index": 0, "nvml_device_accounting_mode": 0, "nvml_device_accounting_pids": [], "nvml_device_accounting_buffer_size": 4000}}, "SECONDS": "3"}]

Want to add an environment variable? The same applies - you can export WATCHMEENV_* and they will be added to results.

$ export WATCHMEENV_AVOCADOS=3
$ watchme monitor --func gpu_task sleep 2

If you want to save to a watcher, then provide the watcher name as the first argument. For example, here we run a task on the fly, and save the result to the watcher “decorator.” Since we don’t provide a --name argument, the name defaults to a derivation of the command run.

$ watchme create decorators
$ watchme monitor --func gpu_task decorators sleep 5

List the folders in the watcher named “decorators” to see the newly added result:

$ watchme list decorators
watcher: /home/users/vsochat/.watchme/decorators
  decorator-gpu-sleep-5
  .git
  watchme.cfg

And then use export to export the data!

$ watchme export decorators decorator-gpu-sleep-5 result.json --json
git log --all --oneline --pretty=tformat:"%H" --grep "ADD results" fd52cbf5d3cd325acdd2709bfc202d32ab721327..b579418f862e69e58d9e3464bc7981580a04cb3a -- decorator-gpu-sleep-5/result.json
{
    "commits": [
        "b579418f862e69e58d9e3464bc7981580a04cb3a"
    ],
    "dates": [
        "2019-07-11 09:31:55 -0700"
    ],
    "content": [
        {
            "nvml_driver_version": "418.67",
            "nvml_system_nvml_version": "10.418.67",
            "nvml_deviceCount": 1,
            "nvml_unit_count": 0,
            "devices": {
                "Tesla V100-SXM2-32GB": {
                    "nvml_device_board_id": 6656,
                    "nvml_device_multi_gpu_board": 0,
                    "nvml_device_brand": 2,
                    "nvml_device_serial": "0323518083147",
                    "nvml_device_set_cpu_affinite": null,
                    "nvml_device_minor_number": 0,
                    "nvml_device_uuid": "GPU-be7b9ac8-f75e-1960-c52d-85429b4c86b1",
                    "nvml_device_inforom_version": "G503.0203.00.04",
                    "nvml_device_inforam_checksum": 0,
                    "nvml_device_display_mode": 1,
                    "nvml_device_display_active": 0,
                    "nvml_device_persistence_mode": 1,
                    "nvml_device_supported_memory_clocks": [
                        877
                    ],
...    
    ]
}

For both of the command line above, you can define --name to give a custom name, or --seconds to set the interval at which to collect metrics (default is 3).

$ watchme monitor --func gpu_task sleep 5 --seconds 1

And along with the interactive Python version, you can optionally specify a comma separated value string of keys to include, skip, or only use. Here we skip two fields:

$ watchme monitor --func gpu_task sleep 5 --seconds 1 --skip nvml_device_supported_memory_clocks

And don’t forget you can use the default “func” argument (monitor_pid_task) to look at cpu metrics:

$ watchme monitor sleep 2
[{"ionice": {"value": 4, "ioclass": "IOPRIO_CLASS_NONE"}, "nice": 0, "open_files": 0, "cpu_affinity": [0], "cpu_times": {"user": 0.0, "system": 0.0, "children_user": 0.0, "children_system": 0.0}, "memory_full_info": {"rss": 376832, "vms": 110555136, "shared": 290816, "text": 24576, "lib": 0, "data": 331776, "dirty": 0, "uss": 106496, "pss": 120832, "swap": 0}, "create_time": 1562862319.59, "username": "vsochat", "pid": 124761, "ppid": 124759, "cpu_percent": 0.0, "gids": {"real": 32264, "effetive": 32264, "saved": 32264}, "status": "sleeping", "memory_percent": 0.00018733631362330647, "cwd": "/scratch/users/vsochat/WORK/watchme", "uids": {"real": 9182, "effetive": 9182, "saved": 9182}, "name": "sleep", "terminal": "/dev/pts/0", "cmdline": ["sleep", "2"], "exe": "/usr/bin/sleep", "num_ctx_switches": {"voluntary": 13, "involuntary": 3}, "io_counters": {"read_count": 6, "write_count": 0, "read_bytes": 0, "write_bytes": 0, "read_chars": 2012, "write_chars": 0}, "num_fds": 3, "cpu_num": 0, "num_threads": 1, "connections": [], "SECONDS": "3"}]

And there you have it! With these methods to monitor any process on the fly at a particular interval, you are good to go!

Run as a Task

To run as a task, you will want to provide func@gpu_task when you create the task. You are also required to add a pid, and this can be a number, or the name of the process. Either way, likely you would be running this for a long running process, and in the case of providing an integer, you would need to update the pid of the process when you restart your computer. Here is an example adding a process name I know will be running when my computer starts:

$ watchme add-task gpu task-monitor-python --type gpu func@gpu_task pid@python
[task-monitor-python]
func  = gpu_task
pid  = python
skip  = nvml_device_uuid
active  = true
type  = gpu

If you choose a process name, remember that different processes can have the same name (for example, think about the case of having multiple Python interpreters open!) This means that watchme will select the first in the list that it finds. If you have preference for a specific one, then it’s recommended that you provide the process id directly.

Customize

You can also add a parameter called “skip”, including one or more (comma separate) indices in the results to skip.

$ watchme add-task gpu task-monitor-python --type gpu func@gpu_task pid@python skip@nvml_device_uuid

To test out the task, you can do the following:

$ watchme run gpu task-monitor-python --test

You’ll see the results.json print to the screen! When it’s time to use the watcher, you can active and schedule it.

Use as a Decorator

Although the parameters are not stored in the watchme.cfg, if you use the decorator to run the same task, the .git repository is still used as a database, and you can collect data to monitor your different Python functions on the fly. Specifically, the decorator is going to use multiprocessing to run your function, and then watch it via the process id (pid). You get to choose how often (in seconds) you want to record metrics like memory, io counters, and cpu, and threads. See here for an example of default output for one timepoint. This decorator function uses the same task function, that we discussed first, but with a different invocation.

from watchme.watchers.gpu.decorators import monitor_gpu
from time import sleep

@monitor_gpu('gpu', seconds=3)
def myfunc():
    long_list = []
    for i in range(5):
        long_list = long_list + (i*10)*['pancakes']
        print("i is %s, sleeping 10 seconds" % i)
        sleep(10)

The first argument is the name of the watcher (e.g., gpu) and you are also allowed to specify the following arguments (not shown):

Why don’t you need specify a pid? Your running function will produce the process id, so you don’t need to provide it. Let’s run this script. You can get the full thing from the gist here. You’ll notice in the gist example we demonstrate “myfunc” taking arguments too.

$ python test_monitor_gpu.py 
Calling myfunc with 2 iters
Generating a long list, pause is 2 and iters is 2
i is 0, sleeping 2 seconds
i is 1, sleeping 2 seconds
Result list has length 10

Great! So it ran for the watcher we created called gpu, but where are the results? Let’s take a look in our watcher folder:

~/.watchme/gpu$ tree
.
├── decorator-gpu-myfunc
│   ├── result.json
│   └── TIMESTAMP
├── task-monitor-python
└── watchme.cfg

In addition to the task that we ran, “task-monitor-python,” we also have results in a new “decorator-gpu-myfunc” folder. You’ve guessed it right - the decorator namespace creates folders of the format decorator-gpu-<name>, where name is the name of the function, or a “name” parameter you provide to the decorator.

What is a result?

Remember that we are monitoring our function every 3 seconds, so for a function that lasts about 10, we will record process metrics three times. How would we export that data? Like this:

$ watchme export gpu decorator-gpu-myfunc result.json --json

We ask for --json because we know the result is provided in json. For the above export, we will find three commits, each with a commit id, timestamp, and full result.

Verify the Addition

Once you add one or more tasks, you can inspect your watcher configuration file at $HOME/.watchme/gpu/watchme.cfg:

$ cat $HOME/.watchme/gpu/watchme.cfg

You can also use inspect:

$ watchme inspect gpu

At this point, you can test running the watcher with the --test flag for run:

$ watchme run gpu --test

and when you are ready, activate the entire watcher to run:

$ watchme activate gpu
$ watchme run gpu

And don’t forget to set a schedule to automated it (if appropriate)

$ watchme schedule gpu @hourly
$ crontab -l
@hourly /home/vanessa/anaconda3/bin/watchme run gpu # watchme-gpu

Every time your task runs, new files (or old files will be updated) and you can choose to push the entire watcher folder to GitHub to have reproducible monitoring! Since the parameter result files are named based on your host and username, others could fork the repo, run on their system, and pull request to combine data.

When you are ready, read more about exporting data.