Agent¶
An agent can be described as a gateway to a computational or storage site. It knows how to manage the resources under its control: an agent is able to provision/deprovision resources, deploy/terminate workers, and forward tasks and results to/from workers. The idea is to deploy an Agent on the front-end of a cluster or on some gateway machine (for example, a VM) to interact with the resources of that site. We can have as many Agents as we want as part of our federation.
An agent is composed of multiple services, including:
- File Server, which manages a local staging area and is able to retrieve and send files upon request from the workers, or to prefetch them after a workload is scheduled. Currently it uses rsync over SSH, hence we need to make sure that the Agent machine is able to SSH without a password to the input data source machines.
- Task Service, which interacts with the Task Manager to request tasks and send results of completed tasks.
- Local Resource Management Service, which manages local resources and receives commands from the Autonomic Scheduler regarding which resources have to be provisioned or deprovisioned. It also starts workers to run the application described in the workflow definition.
Next we describe how to configure an Agent on a site called machine3.domain.com and detail the different configuration files involved. We will show how to configure two types of infrastructure, namely Cloud and Cluster. More information about the configuration files used in this section can be found in the AGENT section. For simplicity, we keep the scripts and example configuration files in the directory simple_run/agent/.
Requirements¶
This service requires Java runtime 1.7+, rsync, and python 2.6+.
Configure Agent machine¶
Configure the Agent machine to automatically trust worker machines (SSH first handshake). First make sure you have a directory called .ssh in your home directory; if not, you can create it by executing mkdir ~/.ssh. Then create a file inside that directory called config (the file should be ~/.ssh/config) with the following content:
Host *
    StrictHostKeyChecking no
    ForwardAgent yes
Typically we want to retrieve data from other Agents. In this way we can make use of intermediate results that are available at remote sites, or we can exploit data locality by retrieving the copy of a file that is closest to us. For this, Agents must be able to SSH to each other without a password.
- Generate an SSH key on each Agent by executing ssh-keygen. Press Enter at every prompt and do not set a passphrase.
- Add the public SSH key to all other Agents and to the data storage sites from which Agents may retrieve data. Copy the content of the file ~/.ssh/id_rsa.pub and append it to the file ~/.ssh/authorized_keys of every other Agent. If ~/.ssh/authorized_keys does not exist, you can create it; make sure that its permissions are 0600 (chmod 0600 ~/.ssh/authorized_keys).
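As a reference, the following is a minimal sketch of this key exchange between two Agents; the hostnames agentA.domain.com and agentB.domain.com and the remote user are placeholders, so adapt them to your setup.
# on agentA.domain.com: generate a key pair without a passphrase
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# append agentA's public key to agentB's authorized_keys (user and host are placeholders)
cat ~/.ssh/id_rsa.pub | ssh user@agentB.domain.com \
    "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 0600 ~/.ssh/authorized_keys"

# verify: this should now log in without asking for a password
ssh user@agentB.domain.com hostname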
Edit comet.properties file¶
This file must be exactly the same as the Task Manager's comet.properties file. As we described before, it contains information required by CometCloud. The property IsolatedProxy tells the Agent where to request tasks from. As we mentioned in the Customizing Additional CometCloud Ports section, additional properties are required if we want to customize the CometCloud ports.
MasterClass=tassl.application.cometcloud.WorkflowMaster
WorkerClass=tassl.application.cometcloud.AppWorker
TaskClass=tassl.application.cometcloud.WorkflowTaskTuple
RoutingKeys=TaskId
TaskMonitoringPeriod=50000
IsolatedProxy=machine2.domain.com

Note
Remember that if you change comet.properties in the Task Manager/WorkflowMaster, you need to copy it to every Agent and restart the Agent.
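For example, a minimal way to push an updated comet.properties from the Task Manager machine to this Agent might look like the following; the destination path and user are assumptions based on the simple_run/agent/ layout used in this guide, so adjust them to wherever your Agent configuration lives.
# copy the updated file from the Task Manager machine to the Agent machine
scp comet.properties user@machine3.domain.com:~/simple_run/agent/

# then restart the Agent on machine3.domain.com (e.g., by re-running startAgent.sh)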
Edit agent.properties file¶
This file contains the information required to configure the different services of an Agent. We are going to explain the content of this file by sections. More information about the configuration files used in this section can be found in the AGENT section.
Configure the Task Service. The following properties define the machine where the Agent is running (publicIpAgent) and the port of the Task Service (portAgent). Two optional parameters, IpAgentForWorkers and portAgentForWorkers, specify a local IP and port that the workers can use to interact with the Task Service. The property MaxNoTaskRetries allows Agents to send workers to sleep when there are no more tasks of the type that the worker is requesting. Agents can wake workers up if any change occurs, such as a rescheduling.
publicIpAgent=machine3.domain.com
portAgent=8880
IpAgentForWorkers=192.0.1.2
portAgentForWorkers=6666
#Number of times a worker uses a query unsuccessfully.
MaxNoTaskRetries=100
logFile=Agent.log
File Server configuration. This server is used to transfer files automatically to the resources when the workers need them; it uses rsync for that. The server will listen on the port FileServerPort and will be able to handle FileServerMaxThreads concurrent requests. We also need to define the staging directory (StageFileDir) where input files and intermediate results can be temporarily stored.
StartFileServer=true
FileServerPort=6668
FileServerMaxThreads=10
StageFileDir=/tmp/stage/
Local Resource Management Service configuration. This service is used to dynamically provision resources. It will listen on the port MgmtPort, and it defines a monitoring interval (MonitorInterval) that is used to monitor the status of resources. An Agent also needs to know which resource manager (i.e., Autonomic Scheduler) it has to register with. This is specified by the property CentralManagerServer; as we can see, it has the same value as the one defined in the Workflow Manager configuration section.
StartManager=true
MgmtPort=9889
CentralManagerServer=machine1.domain.com:7778
MonitorInterval=30
Resource configuration. In this property we define the type of resources this Agent is going to manage. Currently we support two types, namely cloud and cluster. A cloud requires provisioning virtual machines (VMs) before workers can start, while on a cluster we assume that the machines are already available (e.g., a pilot job has booked machines) and we can simply start workers.
- Cluster definition. clusterDell is the name of the file that has the specific details of this cluster.
Resources=cluster:clusterDell
- Cloud definition. cloudOSSierra is the name of the file that has the specific details of this cloud.
Resources=cloud:cloudOSSierra
Metrics service configuration. Next we specify the information regarding the metrics service, which will be used to store runtime information about this Agent (e.g., application execution times, failure rates, etc.). More information about the metrics service can be found in the Metrics Service configuration section. The use of the metrics service is optional (disabled by default), as the system can operate using static information. To enable the metrics service we define the following properties: UseMeticsService equal to true, AgentMetricsAddress equal to the IP or name of the machine hosting the metrics service, AgentMetricsPort equal to the port where the metrics service is listening, and DBName, which indicates the name of the database that will store the information of this Agent.
UseMeticsService=true
AgentMetricsAddress=metrics.domain.com
AgentMetricsPort=8891
DBName=agentmetrics
Additional Configuration for Firewall Issues (Optional)¶
There are special cases where an Agent and its workers can only communicate via SSH due to firewall configuration. For these cases, we have added functionality that can automatically create SSH tunnels to enable the communication. We identify two cases, which can be configured simultaneously.
Agent cannot communicate with workers. In this case we need to define the following variables in the agent.properties file. Every time a worker is started, an SSH tunnel will be created using a port from the range specified in the SshTunnelAgentPortRange variable. This port has to be available on the Agent machine.
StartSshTunnelAgent=true
SshTunnelAgentPortRange=10000:10100
Workers cannot communicate with the Agent. In this case we need to define the following variables in the agent.properties file, and the ports specified in the portAgent, FileServerPort, and MgmtPort variables should be available on the worker machine.
StartSshTunnelWorkers=true
IpAgentForWorkers=localhost
Agent machine cannot communicate with itself. Here the ports of the machine are closed and we cannot use its public IP or FQDN. We need to set publicIpAgent equal to localhost and define a new parameter, publicIPSite, to ensure that files generated at this site are labeled with the public IP so that they are accessible to others (SSH is typically open).
publicIpAgent=localhost
publicIPSite=machine3.domain.com
In this case you will need to deploy SSH tunnels between the Workflow Manager machine and this Agent machine, and between the Task Generator machine and this Agent machine. If you have followed the Administrator Documentation configuration up to this point, you will need to execute the following commands:
# create tunnels from Workflow Manager to the Agent (MgmtPort port)
ssh machine1.domain.com
ssh -f machine3.domain.com -L 9889:machine3.domain.com:9889 -N
exit
# create tunnels from Task Generator to the Agent (portAgent port)
ssh machine2.domain.com
ssh -f machine3.domain.com -L 8880:machine3.domain.com:8880 -N
exit
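As a quick sanity check (not part of the original scripts), you might verify on machine1.domain.com and machine2.domain.com that the local end of each tunnel is listening, for example with nc:
# on machine1.domain.com: check that the tunnel to the Agent's MgmtPort is up
nc -zv localhost 9889

# on machine2.domain.com: check that the tunnel to the Agent's portAgent is up
nc -zv localhost 8880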
Configuring a Cluster (clusterDell)¶
This configuration file shows an example of how to configure a cluster. We are going to explain the content of this file by sections.
General information about the site. It has the Name and Zone, which identify the site and are used to enforce the constraints established in the input data sources (see Workflow Definition). Currently, we assume that the machines of the cluster are reserved for us (e.g., by a pilot job), hence QueueType and QueueName are not in use. In the future they will be used to enable dynamic reservation of machines.
Name=siteDell02
Zone=zoneB
QueueType=Torque
QueueName=regular
- Resource information. In this section we specify attributes of the site:
Overhead=5
WorkerLimit=192.168.2.156:1;192.168.2.157:1
Cost=192.168.2.156:0.6;192.168.2.157:0.12
BenchmarkScore=192.168.2.156:30000;192.168.2.157:30000
workerPortRange=7777:7788
- Overhead. Overhead to provision a machine. This value might change over time if the Metrics service is enabled.
- WorkerLimit. Number of workers per machine. It is recommended to use only one per machine; more than one is experimentally supported and might have bugs. In the example we have two machines and we can provision one worker on each one.
- Cost. Cost of a machine per unit of time (i.e., per hour). We can specify a different cost for each machine. In this case the machine 192.168.2.156 costs 0.6 dollars/SUs per hour and the machine 192.168.2.157 costs 0.12 dollars/SUs per hour.
- BenchmarkScore. Benchmark score of each machine. This is used to estimate the performance of each machine. Any benchmark can be used; e.g., the Whetstone score (UnixBench) for a t2.medium instance in AWS is 8200. This value is only used when there is no real data regarding a specific application; once the Agent has real data, this score is ignored. Note that in order for the Agent to obtain real data from application executions, the Metrics service must be enabled.
- workerPortRange. The ports to be used inside the provisioned machines to start the workers.
Note
The Agent machine must be able to contact the workers on the ports in the workerPortRange range (i.e., these ports must be OPEN on the machines hosting workers).
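As an illustration (not part of the documented setup), you could check from the Agent machine that a worker port is reachable using a tool such as nc; the IP and port below are taken from the example configuration above:
# from the Agent machine: check that a port in workerPortRange is open on a worker machine
nc -zv 192.168.2.156 7777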
Application information. We specify a list of applications this resource can handle, and the CometCloud worker class that contains the logic for each one. The names of these applications are the ones that need to be used in the Application field of the workflow definition (see Defining Workflows).
SupportedApps=montage,mapreduce
montage=tassl.application.cometcloud.sampleMontage.AppWorkerMontage
mapreduce=tassl.application.cometcloud.sample.AppWorker
- Internal Resource Information. This information refers to the worker machines; therefore, make sure that the paths are valid inside those machines.
SoftwareDirWorker=/cac/soft/cometcloud/
UserWorker=med208
WorkingDir=/tmp/work/
- SoftwareDirWorker. Software directory on the worker machine. Directory where the software is expected to be found on the resource for each application. For example, the software of montage is expected to be in the path SoftwareDirWorker/montage/. Inside that directory we must have any binary required by our application, plus the dist and lib directories with the CometCloud jar files as well as the specific worker application jar.
- UserWorker. User on the worker machine. This is the user that we will use to SSH into the worker machine. It must be configured to enable SSH with no password.
- WorkingDir. Working directory on the worker machine. Directory inside the worker machine that will be used to transfer input files and to write any data generated during the execution of the application.
Note
Remember that Agents must be able to SSH with no password to the worker machines (i.e., ssh med208@192.168.2.156 should not ask for a password).
Worker Cluster configuration¶
Previously we specified that our software directory on the workers was /cac/soft/cometcloud/ (SoftwareDirWorker). We assume that the directory /cac/soft/ is shared across all machines and therefore they can access any software we copy into /cac/soft/cometcloud/ from our cluster's login machine.
- Create directories with the application names in the directory specified as SoftwareDirWorker in the configuration file (clusterDell). Inside each application directory place the dist and lib directories with the CometCloud jar files as well as your application's jar. You can also put any other script/binary file that your application may require (see the sketch below).
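For instance, a minimal sketch for the montage application, run from the cluster's login machine and assuming the CometCloud dist and lib directories and an application jar are available in the current directory (the jar name is a placeholder), could look like this:
# create the application directory under SoftwareDirWorker
mkdir -p /cac/soft/cometcloud/montage

# copy the CometCloud jars and the application jar (source paths are illustrative)
cp -r dist lib /cac/soft/cometcloud/montage/
cp montage-worker.jar /cac/soft/cometcloud/montage/   # hypothetical application jar name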
Configuring a Cloud (cloudOSSierra)¶
This configuration file shows an example of how to configure a cloud. We are going to explain the content of this file by sections.
General information about the site.
Name=siteSierra
Zone=zoneA
key=/N/u/jdiaz/OS-grizzly/ec2/jdiaznova.pem
CloudScript=../../scripts/cloud.py
ProviderType=openstack_ec2
Region=nova
ProviderConfigFile=/N/u/jdiaz/OS-grizzly/ec2/eucarc
Name and Zone identify the site and are used to enforce the constraints established in the input data sources (see Workflow Definition).
key. This key is used to interact with the virtual machines (VMs) of the cloud. Typically, when using cloud technology, you need to specify a keypair that will allow SSH access to the VM.
CloudScript. Path to the plugin that enables provisioning and deprovisioning of cloud resources. Currently we provide one called cloud.py that supports some of the major cloud platforms.
ProviderType. Since the CloudScript supports multiple platforms, we need to specify which one we will be using. The following table summarizes the current options.
ProviderType     Platform         API used
openstack_ec2    OpenStack EC2    boto
openstack_nova   OpenStack Nova   novaclient
nimbus_ec2       Nimbus EC2       boto
aws_ec2          AWS EC2          boto

Region. Cloud providers usually require you to specify a region in which you will be operating. Some examples for current platforms are described in the following table.
Provider                   Region
FutureSystems              nova
Chameleon (Alamo)          regionOne
AWS (US West - Oregon)     us-west-2

ProviderConfigFile. This file contains the configuration provided for your cloud provider. Some examples are provided in the Cloud configuration Files section.
Note
The key file must have the same name as the keypair in the cloud (you can see your cloud keypairs in the portal's security section or using the command line, e.g., euca-describe-keypairs, nova keypair-list). The extension (.pem in this case) is not important. Also make sure that the permissions of the key file are correct (e.g., chmod 0600 jdiaznova.pem).
Resource information. In this section we specify attributes of the site:
Overhead=20
VMLimits=m1.small:2;m1.medium:5;m1.large:3
WorkerLimit=m1.small:1;m1.medium:1;m1.large:1
Cost=m1.small:0.06;m1.medium:0.12;m1.large:0.24
BenchmarkScore=t2.small:4050;t2.medium:8200;t2.large:16000
CostDataIn=0.01
CostDataOut=0.12
workerPortRange=7777:7888
- Overhead. It is the overhead to provision a machine. This value might change over time if the Metrics service is enabled.
- VMLimits. Determines the types of virtual machine (VM) supported and the number of each type that is available. The format is <type>:<number>;<type>:<number>. In our example, this site will be able to launch a maximum of 2 VMs of type m1.small, 5 VMs of type m1.medium, and 3 VMs of type m1.large.
- WorkerLimit. Number of workers per type of VM. It is recommended to use only one per machine; more than one is experimentally supported and might have bugs. In the example we have three types of VMs and we can provision one worker in each one.
- Cost. Cost of a VM per unit of time (i.e., per hour). We can specify a different cost for each VM type. In this case, VMs of type m1.small cost 0.06 dollars per hour, m1.medium VMs cost 0.12 dollars per hour, and m1.large VMs cost 0.24 dollars per hour.
- BenchmarkScore. Benchmark score of each machine. This is used to estimate the performance of each machine. Any benchmark can be used; e.g., the Whetstone score (UnixBench) for a t2.medium instance in AWS is 8200. This value is only used when there is no real data regarding a specific application; once the Agent has real data, this score is ignored. Note that in order for the Agent to obtain real data from application executions, the Metrics service must be enabled.
- CostDataIn. Cost of transferring data into the site, per GB of data.
- CostDataOut. Cost of transferring data out of the site, per GB of data.
- workerPortRange. The ports to be used inside the provisioned machines to start the workers. The format is <initialPort>:<lastPort>.
Note
The Agent machine must be able to contact the workers on the ports in the workerPortRange range (i.e., these ports must be OPEN on the VMs hosting workers). This is typically done by changing options in the security groups of your cloud.
Application information. We specify which applications this resource can handle, and the CometCloud worker class that contains the logic for each one. The names of these applications are the ones that need to be used in the Application field of the workflow definition (see Defining Workflows).
SupportedApps=montage,mapreduce
montage=tassl.application.cometcloud.sampleMontage.AppWorkerMontage
mapreduce=tassl.application.cometcloud.sample.AppWorker
defaultImageId=ami-0000003a
mapreduceImageId=ami-0000003b
- defaultImageId. This is the id of the VM image that will be deployed by default as a worker machine. This VM image has to be prepared beforehand with the proper software and libraries. More information is provided in Cloud VM image configuration section.
- We can also specify different VM images for different applications. The format is <application>ImageId. In our example, we have a specific image for the mapreduce application (mapreduceImageId), while the montage application uses the default one.
- Use the right VM image ID: if you use EC2, you need to obtain the ID using euca-describe-images, and if you use nova, you need to obtain it using nova image-list.
Internal Resource Information. This information refers to the worker machines; therefore, make sure that the paths are valid inside those machines.
SoftwareDirWorker=/home/ubuntu/
UserWorker=ubuntu
WorkingDir=/home/ubuntu/
- SoftwareDirWorker. Software directory on the worker machine. Directory where the software is expected to be found on the resource for each application. For example, the software of montage is expected to be in the path SoftwareDirWorker/montage/. Inside that directory we must have any binary required by our application, plus the dist and lib directories with the CometCloud jar files as well as the specific worker application jar.
- UserWorker. User on the worker machine. This is the user that we will use to SSH into the worker machine.
- WorkingDir. Working directory on the worker machine.
Note
Remember that Agents must be able to SSH with no password to the worker VMs. This is typically achieved automatically in cloud infrastructures using the previously configured keypair (e.g., ssh -i jdiazkey.pem ubuntu@54.187.182.145).
Cloud configuration files¶
Example for FutureSystems, India site. We use the openstack_ec2 provider with this configuration. Make sure that your configuration works with euca2ools. More information can be found in the FutureSystems OpenStack Manual, EC2 section.
export NOVA_KEY_DIR=$(cd $(dirname ${BASH_SOURCE[0]}) && pwd)
export EC2_ACCESS_KEY="accesskey"
export EC2_SECRET_KEY="secretkey"
export EC2_URL="http://i5r.idp.iu.futuregrid.org:8773/services/Cloud"
export S3_URL="http://i5r.idp.iu.futuregrid.org:3333"
export EC2_USER_ID=15
export EC2_PRIVATE_KEY=${NOVA_KEY_DIR}/pk.pem
export EC2_CERT=${NOVA_KEY_DIR}/cert.pem
export NOVA_CERT=${NOVA_KEY_DIR}/cacert.pem
export EUCALYPTUS_CERT=${NOVA_CERT}
alias ec2-bundle-image="ec2-bundle-image --cert ${EC2_CERT} --privatekey ${EC2_PRIVATE_KEY} --user 42 --ec2cert ${NOVA_CERT}"
alias ec2-upload-bundle="ec2-upload-bundle -a ${EC2_ACCESS_KEY} -s ${EC2_SECRET_KEY} --url ${S3_URL} --ec2cert ${NOVA_CERT}"
Example for Chameleon, Alamo site. We use the openstack_nova provider with this configuration. Once you create the configuration file, make sure that it works using the nova command line interface. The FutureSystems OpenStack Manual is a good tutorial for learning how to use the nova command line interface. You can obtain your nova configuration file from the online portal (see the Chameleon User Guide).
#!/bin/bash

# To use an Openstack cloud you need to authenticate against keystone, which
# returns a **Token** and **Service Catalog**. The catalog contains the
# endpoint for all services the user/tenant has access to - including nova,
# glance, keystone, swift.
#
# *NOTE*: Using the 2.0 *auth api* does not mean that compute api is 2.0. We
# will use the 1.1 *compute api*
export OS_AUTH_URL=https://proxy.chameleon.tacc.utexas.edu:5000/v2.0

# With the addition of Keystone we have standardized on the term **tenant**
# as the entity that owns the resources.
export OS_TENANT_ID=FG-337
export OS_TENANT_NAME="FG-337"

# In addition to the owning entity (tenant), openstack stores the entity
# performing the action as the **user**.
export OS_USERNAME="javidiaz"

# With Keystone you pass the keystone password.
#echo "Please enter your OpenStack Password: "
#read -sr OS_PASSWORD_INPUT
export OS_PASSWORD="mypassword"

# If your configuration has multiple regions, we set that information here.
# OS_REGION_NAME is optional and only valid in certain environments.
export OS_REGION_NAME="regionOne"
# Don't leave a blank variable, unset it if it was empty
if [ -z "$OS_REGION_NAME" ]; then unset OS_REGION_NAME; fi
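As a quick check before pointing the Agent at either configuration file, you might source it and run a simple listing command; this is only a suggested verification, where eucarc matches the ProviderConfigFile used in the example above and openrc.sh is a hypothetical name for the Chameleon file shown here.
# EC2-style configuration (FutureSystems example): check with euca2ools
source eucarc
euca-describe-images

# Nova-style configuration (Chameleon example): check with the nova CLI
source openrc.sh        # hypothetical file name for the script shown above
nova image-list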
Worker Cloud VM image configuration¶
To create a VM image that will be used as the worker VM, we recommend starting with one of the default images offered by your cloud provider. You will want to start the smallest instance type possible (e.g., m1.tiny) to minimize the size of the image. For example, we can start a typical Ubuntu VM image.
- Launch a VM with an Ubuntu image
- Connect to the VM via ssh (ssh -i jdiaznova.pem ubuntu@<ipaddress>)
- Install java and rsync (sudo apt-get install openjdk-7-jre-headless rsync)
- Create directories with the application names in the directory specified as SoftwareDirWorker in the configuration file (cloudOSSierra). Inside each application directory place the dist and lib directories with the CometCloud jar files as well as your application's jar. You can also put any other script/binary file that your application may require. To send these files to the VM you can use scp from the machine where your files are located. For example, if our SoftwareDirWorker is /home/ubuntu/ and we want to install an application called mapreduce, the command would be scp -r -i jdiaznova.pem dist lib ubuntu@<ipaddress>:/home/ubuntu/mapreduce/. This assumes that dist and lib have all the jar files needed and that the directory /home/ubuntu/mapreduce/ exists inside the VM.
- Once everything is configured, we can create a snapshot of this VM, which becomes our worker VM image. You can do this from a portal or from the command line using nova (see the sketch after this list).
- After the snapshot is created, you can obtain the ID of your new image and terminate your VM.
- From now on, the Agent should be able to start VMs using the VM image that you created.
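For the snapshot step, a minimal sketch using the nova command line could look like the following; the instance and image names are placeholders, and nova image-create is one way to take a snapshot when the nova CLI is configured.
# create a snapshot of the configured VM (placeholder instance/image names)
nova image-create my-worker-vm my-worker-image

# list images to obtain the ID of the new snapshot; use it as defaultImageId (or <app>ImageId)
nova image-list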
Agent Cloud VM image configuration¶
To create a VM image that will be used as an Agent VM, we recommend starting with one of the default images offered by your cloud provider. You will want to start the smallest instance type possible (e.g., m1.tiny) to minimize the size of the image. For example, we can start a typical Ubuntu VM image.
- Launch a VM with an Ubuntu image
- Connect to the VM via ssh (ssh -i jdiaznova.pem ubuntu@<ipaddress>)
- Install java, rsync, python 2.6+ (sudo apt-get install openjdk-7-jre-headless rsync python)
- Install the Python EC2 API, boto (sudo apt-get install python-boto), or the Python Nova API, novaclient (sudo apt-get install python-novaclient)
- Create a directory with the CometWorkflow software. Inside this directory we have the dist and lib directories with the CometCloud jar files, the scripts directory with the cloud provisioning script, and simple_run with examples of configuration files and starting scripts, as well as your application jar.
- Once everything is configured, we can create a snapshot of this VM, which becomes our Agent VM image. You can do this from a portal or from the command line using nova.
- After the snapshot is created, you can obtain the ID of your new image and reuse it in the future to start your Agent VM.
Starting the Service¶
Once the configuration is ready, you can start the Agent by executing the script startAgent.sh. This script contains the following code, which defines the CLASSPATH and executes the appropriate Java class.
export CLASSPATH=../../dist/*:../../lib/*
java -Xms256m -Xmx1024m -cp $CLASSPATH tassl.application.cometcloud.AgentLite -propertyFile comet.properties -propertyFile agent.properties

Note
Agents register with the Autonomic Scheduler, hence you need to make sure that the Workflow Manager and Autonomic Scheduler are running before you start an Agent.