---
layout: default
title: About CHTC Resources
---
<p>As you saw from the <a href="{{ 'approach' | relative_url}}">About Our Approach</a> document,
the CHTC's resources are distributed across
campus in several <i>pools</i>.
A sample of resources (slots) available on a typical day in February of 2012
is shown in this table:</p>
<table align="center" class="gtable">
<tr>
<th>Platform</th> <th>GLOW</th> <th>CHTC</th> <th>CS</th> <th>CAE</th> <th>Total</th>
</tr>
<tr>
<td>32-bit Linux</td> <td class="num">331</td> <td class="num">0</td> <td class="num">199</td> <td class="num">0</td> <td class="num">530</td>
</tr>
<tr>
<td>64-bit Linux</td> <td class="num">10,168</td> <td class="num">1,976</td> <td class="num">904</td> <td class="num">392</td> <td class="num">13,440</td>
</tr>
<tr>
<td>Windows 7</td> <td class="num">0</td> <td class="num">18</td> <td class="num">0</td> <td class="num">1,206</td> <td class="num">1,224</td>
</tr>
</table>
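<p>
The table above is a snapshot from 2012; for a current view, you can query the pools yourself with
<i>condor_status</i> from any submit node. A couple of illustrative invocations (the constraint is just
an example):
<pre>
# Summarize available slots by architecture and operating system
condor_status -total

# Count only the Linux slots
condor_status -total -constraint 'OpSys == "LINUX"'
</pre>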
<p>
<strong>Of nodes, CPUs, and slots</strong>
<p>
In the 1990s, things were simple: a node was the same as a CPU, and on each CPU HTCondor would run one job.
When hyper-threaded and multi-core CPUs came on the scene, things got more complicated. To complicate
things further (because that's what we do at research universities, right?), we coined a new term, the <i>slot</i>.
The one simple part is that 1 job runs on 1 slot. The complicated part is understanding what a slot is,
because the answer turns out to be "it depends."
<p>
Let's say we have a 16-core machine with 64 GBytes of RAM. The old way, we could run 1 job on this machine.
With slots, we can make an arbitrary assignment of machine resources. For example, we could say that 1 slot is 1 CPU
with 4 GBytes of RAM. Then we could 'provision' 16 slots on that machine and run 16 jobs simultaneously. We could instead
provision 8 such slots and create a 9<sup>th</sup> slot from the remainder of the machine, which would be 8 cores
with 32 GBytes of RAM. So the slot concept gives us flexibility in dividing physical machine resources into logical
machine resources.
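<p>
To make that concrete, here is a sketch of how an administrator might express the 8-plus-1 split above
in the HTCondor configuration (the numbers simply mirror the example; real pools are tuned to their hardware):
<pre>
# Eight 1-core slots with 4 GBytes (4096 MBytes) of RAM each
SLOT_TYPE_1      = cpus=1, memory=4096
NUM_SLOTS_TYPE_1 = 8

# One 8-core slot with the remaining 32 GBytes of RAM
SLOT_TYPE_2      = cpus=8, memory=32768
NUM_SLOTS_TYPE_2 = 1
</pre>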
<p>
<strong>How many slots are available in the CHTC, and with how much RAM?</strong>
<p>
Because of the way we can configure slots with arbitrary amounts of available memory, this is a bit of a tricky question.
We try to maximize the number of available slots given <i>most</i> jobs' needs.
We can adjust, so if you have needs beyond what is currently offered, please get in touch with us. Speaking of adjustments,
we are working on an automated way to configure machine resources to match current job needs. We use a technique that we
call <i>dynamic slots</i>. Hopefully in the near future, you will not have to ask this question, because we will just make
enough properly configured slots to meet the needs of your jobs.
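<p>
With dynamic slots, "asking" mostly amounts to stating what your job needs in its submit description, and
HTCondor carves a slot of that size out of a machine. A minimal sketch (the executable name and the numbers
are placeholders; memory is in MBytes and disk in KBytes by default):
<pre>
universe       = vanilla
executable     = my_analysis

# Resources the job needs; a matching slot is created dynamically
request_cpus   = 1
request_memory = 4096
request_disk   = 10000000

queue
</pre>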
<p>
<strong>How are priorities determined?</strong>
<p>
We use a concept called <i>fair share</i>. This means that your priority decreases as you use more resources over time in
order to let newer, less frequent users access the resources too. We can override the automatic priority scheme if you need
high priority in order to meet a paper or other deadline.
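<p>
You can see where you stand at any time with the <i>condor_userprio</i> tool (smaller numbers mean better
priority):
<pre>
# Show effective priorities and recent usage for active users
condor_userprio

# Include users who have no jobs running right now
condor_userprio -allusers
</pre>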
<p>
<strong>How long are jobs allowed to run?</strong>
<p>
Maybe forever, as long as a higher-priority user does not need the resources you have claimed. When there is
competition for resources, we guarantee that your jobs will run for at least 24 hours. If you need longer run times,
the best approach is to make sure that your jobs can produce a <i>checkpoint</i>: the job periodically records its
progress, and when it restarts, it uses the checkpoint to pick up where it left off. We can help
you implement making and using checkpoints if you feel this is something you need.
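<p>
One low-effort route to checkpoints, where it applies, is HTCondor's <i>standard universe</i>: relink your
program with <i>condor_compile</i> and HTCondor will checkpoint it periodically and when it is evicted.
A sketch (the program name is a placeholder; application-level checkpointing in the vanilla universe is
another option we can help with):
<pre>
# First, relink the program so HTCondor can checkpoint it transparently:
#   condor_compile gcc -o my_sim my_sim.c

# Then submit it in the standard universe
universe   = standard
executable = my_sim
output     = my_sim.out
error      = my_sim.err
log        = my_sim.log
queue
</pre>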
<p>
<strong>How much local disk space is available for jobs?</strong>
<p>
15 GBytes per slot is the average.
<p>
<strong>How fast is the network that connects all these nodes?</strong>
<p>
Due to the distributed nature of our resources, and the fact that they fall under different administrative domains all over
campus, this is a trickier question to answer than you probably hoped. Most compute nodes have 1 Gbit per second network
interfaces. We generally have 10 Gbit per second interfaces between pools.</p>
<p>
<strong>When do you expand these resources?</strong>
<p>
We monitor the use of all our campus pools, as well as the OSG, to see if we are delivering all the cycles that UW researchers
need. When total demand exceeds total supply, we will add resources. We also add resources when researchers have additional
requirements that cannot be met with our current resources.
<p>