---
layout: default
title: Running Your First CHTC Job
---
<p>
So, you have access to a submit machine,
have all the right accounts,
and you are ready to run your first job in the CHTC.
As we described in <a href="{{ '/approach' | relative_url}}">Our Approach</a>,
the CHTC is a collection of distributed resources.
The magic that enables you to run jobs on these resources is software
called <a href="http://research.cs.wisc.edu/htcondor/">HTCondor</a>,
developed at UW-Madison.
</p>
<h2>Let's first do, and then ask why</h2>
<p>
Rather than having you read a bunch of stuff beforehand,
let's just run some jobs, and <i>then</i> you can do the reading.
We are going to run the traditional 'hello world' program with a CHTC twist.
In order to demonstrate the distributed resource nature of the CHTC,
we will run 'hello CHTC' 3 times, with each run as its own job.
Since you are not directly invoking the execution of each job,
you need to tell HTCondor <i>how</i> to run the jobs.
The information needed is placed into a <i>submit file</i>.
</p>
<p>
<strong><i>Note: You must be logged into an HTCondor submit machine
for the following example to work.</i></strong>
</p>
<p>
Copy the highlighted text below,
and paste it into a file called <code>hello-chtc.sub</code>, the submit file,
in your home directory on the submit machine.
</p>
<pre>
# hello-chtc.sub
# My very first HTCondor submit file
#
# Submit files require that you define certain aspects of your job,
# but you can also specify custom variables. We're going to specify
# our own variable that will be used later in the submit file as
# $(job).
job = hello-chtc
#
# Specify the HTCondor Universe (vanilla is typical for almost all
# jobs) AND the desired name of the HTCondor log file. $(Cluster)
# expands to the cluster number HTCondor assigns to our set of 3 jobs.
universe = vanilla
log = $(job)_$(Cluster).log
#
# Specify your executable (single binary or a script that runs several
# commands), arguments, and files for system output and system error.
# $(Process) will be a unique number for each job, starting with "0".
executable = $(job)
arguments = $(Process)
output = $(job)_$(Cluster)_$(Process).out
error = $(job)_$(Cluster)_$(Process).err
#
# Specify that HTCondor should transfer files to and from the
# computer where each job runs. The last of these lines *would* be
# used if there were any other files needed for the executable to run.
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
# transfer_input_files = file1,file2,file3,etc
#
# Tell HTCondor how many CPU cores, and how much memory and disk space
# each job will need on the computer where it runs.
# (Memory is in MB; we'll ask for about 1 GB)
# (Disk is in KB; we'll ask for about 1 MB)
request_cpus = 1
request_memory = 1000
request_disk = 1000
#
# Tell HTCondor to run 3 instances of our job:
queue 3
</pre>
<P>
Now, copy the text below and paste it into a file
called <code>hello-chtc</code>, again placing this file into
your home directory on the submit machine. This bash script
will be our executable.
<P>
<pre>
#!/bin/bash
#
# hello-chtc
# My very first CHTC job
#
echo "Hello CHTC from Job $1 running on `whoami`@`hostname`"
</pre>
<P>
OK, run your very first job by entering these two commands:
<P>
<pre>
chmod +x hello-chtc
condor_submit hello-chtc.sub
</pre>
<P>
The <code>chmod</code> command tells the system that
the <code>hello-chtc</code> file is an executable program.
You only need to run this command once for each new
executable file.
The <code>condor_submit</code> command
actually submits your jobs to HTCondor.
If all goes well, you will see output from
the <code>condor_submit</code> command that appears as:
<P>
<pre>
Submitting job(s).....
3 job(s) submitted to cluster 845638.
</pre>
<P>
To check on the status of your jobs, run the following command:
<P>
<pre>
condor_q $USER
</pre>
<P>
The <code>$USER</code> argument limits the listing to your own jobs;
your shell expands it to your username, so you can also type your username directly. You will see output that looks like:
<P>
<pre>
-- Submitter: [email protected] : <128.104.100.43:9618?sock=5235_1ed5_2> : submit-1.chtc.wisc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
845638.0 user 3/12 12:46 0+00:00:00 I 0 0.0 hello-chtc 0
845638.1 user 3/12 12:46 0+00:00:00 I 0 0.0 hello-chtc 1
845638.2 user 3/12 12:46 0+00:00:00 I 0 0.0 hello-chtc 2
3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended
</pre>
<P>
You can run the <code>condor_q</code> command periodically to see the progress of your jobs. In a few minutes,
the queue should be empty, implying that your jobs have completed. If you do a listing of your home directory,
you should see something like:
<P>
<pre>
-rwxrwxr-x 1 user user 92 Mar 12 12:45 hello-chtc
-rw------- 1 user user 17 Mar 12 12:48 hello-chtc_845638_0.out
-rw------- 1 user user 17 Mar 12 12:48 hello-chtc_845638_1.out
-rw------- 1 user user 17 Mar 12 12:48 hello-chtc_845638_2.out
-rw-rw-r-- 1 user user 3180 Mar 12 12:48 hello-chtc_845638.log
-rw-rw-r-- 1 user user 580 Mar 12 12:40 hello-chtc.sub
</pre>
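<P>
If you would rather not rerun <code>condor_q</code> by hand, HTCondor's
<code>condor_wait</code> command can watch a job log and return once the jobs
recorded in it have finished. A minimal sketch, assuming the log file name
from the example listing above (your cluster number will differ):
<P>
<pre>
condor_wait hello-chtc_845638.log
</pre>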
<P>
<b>Useful information is provided in the user log and the output/error
files. </b>
<P>
HTCondor creates a transaction log of everything that happens to your jobs.
Looking at the log file is very useful for
debugging problems that may arise.
An excerpt from
<code>hello-chtc_845638.log</code>
produced by the submission of the 3 jobs will look something like this:
</P>
<pre>
000 (845638.000.000) 03/12 12:46:29 Job submitted from host: <128.104.100.43:9618?sock=5235_1ed5_2>
...
000 (845638.001.000) 03/12 12:46:29 Job submitted from host: <128.104.100.43:9618?sock=5235_1ed5_2>
...
000 (845638.002.000) 03/12 12:46:29 Job submitted from host: <128.104.100.43:9618?sock=5235_1ed5_2>
...
001 (845638.000.000) 03/12 12:48:06 Job executing on host: <128.104.58.85:49163>
...
005 (845638.000.000) 03/12 12:48:06 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
17 - Run Bytes Sent By Job
92 - Run Bytes Received By Job
17 - Total Bytes Sent By Job
92 - Total Bytes Received By Job
Partitionable Resources : Usage Request Allocated
Cpus : 1 1
Disk (KB) : 12 1000 26703078
Memory (MB) : 0 1000 1000
...
</pre>
<P>
If you look at one of the output files,
you should see something like this:
</P>
<pre>
Hello CHTC from Job 0 running on [email protected]
</pre>
<strong>Congratulations.</strong> You've run your first jobs in the CHTC!
<h1>What Else?</h1>
<h2>Removing Jobs</h2>
<p>
To remove a specific job, specify its job ID as shown in the queue (format:
Cluster.Process). Example:
<pre>
condor_rm 845638.0
</pre>
You can even remove all of the jobs of the same cluster by specifying
only the Cluster value for that batch.<br>
<br>
To remove <b>all of your jobs</b>:
<pre>
condor_rm $USER
</pre>
</p>
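<p>
As an illustration of removing a whole cluster, assuming the cluster number
845638 from the example submission above, the following would remove all
three of its jobs at once:
</p>
<pre>
condor_rm 845638
</pre>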
<h2>The Importance of Testing</h2>
<P>
1. <b>Examining Job Success.</b> Within the log file, you can see
information about the completion of each job, including the job's
exit code (shown as "return value 0" in the log excerpt above). You can use this code, as well
as information in your ".err" file and other output files, to determine what
issues your job(s) may have had, if any.<br>
<br>
2. <b>Determining Memory and Disk Requirements.</b> The log file also
indicates how much memory and disk each job used, so that you
can first test a few jobs before submitting many more with more accurate
request values. When you request too little, your jobs will be "evicted"
from the computer they're running on, and HTCondor will have to try to
rerun them (maybe many times) until the request is large enough.
When you request too much, your jobs may not match to as many available
"slots" as they could otherwise, and your overall throughput will suffer
in that case as well.<br>
<br>
3. <b>Determining Run Time.</b> Depending on how long each of your jobs
runs (determined by comparing when the job began executing with when it
completed), you can send your jobs to even more computers than are in the
CHTC Pool (where your jobs will run, by default). Refer to the table below
for tips on how to send your jobs to the rest of the UW Grid and to the
national <a href="http://www.opensciencegrid.org/">Open Science Grid</a>.
</P>
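<P>
The three checks above can all be done by searching the log file. Below is a
minimal sketch using <code>grep</code>. The function name
<code>job_summary</code> is our own invention (it is not an HTCondor command),
and the search strings are taken from the log excerpt shown earlier, so they
may need adjusting if your log looks different.
</P>
<pre>
#!/bin/bash
#
# job_summary: hypothetical helper, not part of HTCondor.
# Prints the lines of an HTCondor job log that matter when testing.
# Usage: job_summary LOGFILE
job_summary() {
    local log_file="$1"
    # Run time: when each job started executing and when it terminated
    grep -E "Job executing|Job terminated" "$log_file"
    # Job success: the exit code of each finished job
    grep "return value" "$log_file"
    # Memory and disk: the Usage / Request / Allocated header line
    # plus the 3 resource rows that follow it
    grep -A 3 "Partitionable Resources" "$log_file"
}
</pre>
<P>
For example, after loading the function into your shell,
<code>job_summary hello-chtc_845638.log</code> would print the matching lines
for every job recorded in that log.
</P>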
<h2>Getting the Right Resources</h2>
<P>
After running a few tests, be sure to add or modify the following lines in your submit files,
as appropriate.
<table class="gtable">
<tr><th>Submit file entry</th><th>Resources your jobs will run on</th></tr>
<tr>
<td>request_cpus = <i>cpus</i></td>
<td>
Matches each job to a computer "slot" with at least this many CPU cores.
</td>
</tr>
<tr>
<td>request_disk = <i>kilobytes</i></td>
<td>
Matches each job to a slot with at least this much disk space, in units of KB.<br>
</td>
</tr>
<tr>
<td>request_memory = <i>megabytes</i></td>
<td>
Matches each job to a slot with at least this much memory (RAM), in units of MB.<br>
</td>
</tr>
<tr>
<td>+WantFlocking = true</td>
<td>
Also send jobs to other HTCondor Pools on campus (UW Grid)<br>
Good for jobs that run less than ~4 hours, on average, or for jobs that checkpoint.
</td>
</tr>
<tr>
<td>+WantGlideIn = true</td>
<td>
Also send jobs to the Open Science Grid (OSG).<br>
Good for jobs that run less than ~2 hours, on average, or for jobs that checkpoint.
</td>
</tr>
</table>
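<P>
Put together, the resource-related part of a submit file might look like the
sketch below. The values are the placeholders from the hello-chtc example,
not recommendations; replace them with what your own tests show, and include
the last two entries only if your jobs are short enough for those pools.
</P>
<pre>
# Resource requests (placeholder values; adjust after testing)
request_cpus = 1
request_memory = 1000
request_disk = 1000
#
# Optional: also send jobs to the UW Grid and to the OSG
+WantFlocking = true
+WantGlideIn = true
</pre>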
<h2>Now, time for a little homework</h2>
<P>
To get the most out of the CHTC,
you will want to have a good understanding of how HTCondor works.
The full HTCondor manual is comprehensive,
but the links below guide you to the most important sections to read
in order to get started.
You can always dig into more details as you become more experienced.
<P>
<ul>
<li><a href="http://research.cs.wisc.edu/htcondor/manual/v7.8/2_4Road_map_Running.html">Road-map for Running Jobs</a>
</li>
<li><a href="http://research.cs.wisc.edu/htcondor/manual/v7.8/2_5Submitting_Job.html">Submitting a Job</a>
</li>
<li><a href="http://research.cs.wisc.edu/htcondor/manual/v7.8/condor_submit.html"><code>condor_submit</code> manual page</a>
</li>
<li><a href="http://research.cs.wisc.edu/htcondor/manual/v7.8/2_6Managing_Job.html">Managing a Job</a>
</li>
<li><a href="http://research.cs.wisc.edu/htcondor/manual/v7.8/condor_q.html"><code>condor_q</code> manual page</a>
</li>
<li><a href="http://research.cs.wisc.edu/htcondor/manual/v7.8/2_3Matchmaking_with.html">Matchmaking with ClassAds</a>
</li>
</ul>
</P>
<h2>Now you are ready for some real work</h2>
<p>
OK, you have the basics!
This should be enough background to get you started using the CHTC for the
<i>real</i> problems you came to us for.
Remember, we are here to help.
Don't hesitate to contact us at
<a href="[email protected]">[email protected]</a> with questions.
</p>