
CRABServer REST should not talk to CRIC #6917

Open
belforte opened this issue Dec 14, 2021 · 12 comments · May be fixed by #8943
Comments

@belforte

belforte commented Dec 14, 2021

Better to avoid talking to any external service from the REST: supporting proper authentication and debugging problems is too much of a pain.
IIRC this access is only used to fill the site whitelist for MC with "all sites", which in principle can be done in the TW.
I looked at this some time ago and the change seemed too big to be worthwhile, but now I think it is worth the effort.
@mapellidario FYI

@belforte

belforte commented Feb 8, 2022

We just had a small storm of CRABServer failures over 1 h, all due to errors in talking to CRIC, so I am increasing the priority.

@belforte

The first thing should be to make this capable of dealing with wildcards in the site list:

siteWhitelist = set(kwargs['task']['tm_site_whitelist'])
siteBlacklist = set(kwargs['task']['tm_site_blacklist'])
self.logger.debug("Site whitelist: %s", list(siteWhitelist))
self.logger.debug("Site blacklist: %s", list(siteBlacklist))

Then we will worry about modifying the REST to pass the list with the *'s in it.
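For illustration, wildcard handling against the full site list could be sketched with Python's fnmatch; the names expandSiteList and allPSNs below are hypothetical, not the actual CRAB code:

```python
import fnmatch

def expandSiteList(patterns, allPSNs):
    """Expand entries like 'T2_IT_*' against the full PSN list from CRIC."""
    expanded = set()
    for pattern in patterns:
        matches = fnmatch.filter(allPSNs, pattern)
        if not matches:
            # mirrors the "Cannot expand site ... to anything" failure mode
            raise ValueError(f"Cannot expand site {pattern} to anything")
        expanded.update(matches)
    return expanded

allPSNs = ['T2_IT_Bari', 'T2_IT_Pisa', 'T2_US_MIT']
print(sorted(expandSiteList(['T2_IT_*'], allPSNs)))  # ['T2_IT_Bari', 'T2_IT_Pisa']
```

Plain site names without wildcards pass through unchanged, since fnmatch treats them as literal patterns.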

@belforte

belforte commented Feb 10, 2022

Current code in the REST relies on

def _expandSites(self, sites, pnn=False):

and

def wrap(func):
    def wrapped_func(*args, **kwargs):
        if 'cric' in services and (not args[0].allCMSNames.sites or (args[0].allCMSNames.cachetime+1800 < mktime(gmtime()))):
            args[0].allCMSNames = CMSSitesCache(sites=CRIC().getAllPSNs(), cachetime=mktime(gmtime()))
            args[0].allPNNNames = CMSSitesCache(sites=CRIC().getAllPhEDExNodeNames(), cachetime=mktime(gmtime()))

to get the list of sites from CRIC every 30 min and cache it in memory.
(By the way, that's hard to understand, since the WMCore CRIC class already has a 1 h default cache inside... oh well...)

TaskWorker actions are done in independent processes, so it would make sense to reuse _expandSites but cache the site list in a file on disk instead (like we used to do with SiteDB info a long time ago). The list of sites from CRIC does not need to be refreshed any faster than once a day!! Anyhow, since it is one call per task, we may even do it every time; the rate is low, it is only a matter of riding out outages. There should be two times set:

  • when to refresh
  • how long to use a stale cache

The current caching reduces the number of calls to the external service, but makes things fail miserably if the server is down when the cache expires.

We should definitely combine this with the call to CRIC in DataDiscovery and have a single cache file, ref. #6946. Or at least a common access method with the refresh+use-stale policy.
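The refresh+use-stale policy sketched above could look roughly like this; the file path, constants, and function names are assumptions for illustration, not the actual implementation:

```python
import json
import os
import time

REFRESH_AGE = 24 * 3600        # try to refresh once a day
MAX_STALE_AGE = 7 * 24 * 3600  # tolerate a week-old cache during outages
CACHE_FILE = '/tmp/tw-cache/cric_sites.json'

def getSiteList(fetchFromCRIC):
    """Return the cached site list, refreshing from CRIC when it is old."""
    age = None
    if os.path.exists(CACHE_FILE):
        age = time.time() - os.path.getmtime(CACHE_FILE)
    if age is None or age > REFRESH_AGE:
        try:
            sites = fetchFromCRIC()
            os.makedirs(os.path.dirname(CACHE_FILE), exist_ok=True)
            with open(CACHE_FILE, 'w') as fd:
                json.dump(sites, fd)
            return sites
        except Exception:
            # ride out the outage on a stale cache, within the tolerance window
            if age is not None and age < MAX_STALE_AGE:
                with open(CACHE_FILE) as fd:
                    return json.load(fd)
            raise
    with open(CACHE_FILE) as fd:
        return json.load(fd)
```

With this scheme a task fails only when CRIC is down and the cache is older than MAX_STALE_AGE, rather than every time an in-memory cache happens to expire during an outage.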

@belforte

Let's move the info from #6946 in here, to simplify tracking.
From the TW log on Jan 3, 2022:

2022-01-03 15:55:02,724:INFO:DBSDataDiscovery:Looking up data location with Rucio in cms scope.
2022-01-03 15:55:03,132:DEBUG:DataDiscovery: Formatting data discovery output 
2022-01-03 15:56:46,259:ERROR:DataDiscovery:Impossible translating ['T2_US_UCSD', 'T2_PK_NCP', 'T2_RU_IHEP', 'T2_UA_KIPT', 'T1_FR_CCIN2P3_Disk', 'T1_ES_PIC_Disk', 'T3_KR_UOS', 'T2_AT_Vienna', 'T1_US_FNAL_Disk', 'T2_FR_IPHC', 'T3_US_Colorado', 'T2_IT_Bari', 'T3_TW_NTU_HEP', 'T2_UK_SGrid_RALPP', 'T3_IT_Trieste', 'T2_BR_SPRACE', 'T1_DE_KIT_Disk', 'T2_US_Caltech', 'T2_UK_London_Brunel', 'T2_IT_Legnaro', 'T2_IT_Rome', 'T2_CH_CSCS', 'T2_BE_UCL', 'T2_GR_Ioannina', 'T3_KR_KNU', 'T2_UK_London_IC', 'T3_US_UMiss', 'T2_UK_SGrid_Bristol', 'T1_IT_CNAF_Disk', 'T2_HU_Budapest', 'T0_CH_CERN_Disk', 'T2_US_MIT', 'T3_CH_PSI', 'T1_UK_RAL_Disk', 'T2_US_Caltech_Ceph', 'T3_BG_UNI_SOFIA', 'T2_RU_JINR', 'T2_BR_UERJ', 'T3_US_NotreDame', 'T2_FR_GRIF_LLR', 'T2_ES_IFCA', 'T2_US_Wisconsin', 'T3_FR_IPNL', 'T3_US_NERSC', 'T2_FR_GRIF_IRFU', 'T2_FI_HIP', 'T2_PL_Swierk', 'T3_US_Rutgers', 'T2_TR_METU', 'T3_US_MIT', 'T2_US_Nebraska', 'T2_KR_KISTI', 'T2_CN_Beijing', 'T2_EE_Estonia', 'T3_US_Baylor', 'T2_US_Florida', 'T1_RU_JINR_Disk', 'T2_US_Vanderbilt', 'T2_DE_DESY', 'T2_BE_IIHE', 'T2_RU_INR', 'T2_US_Purdue', 'T2_CH_CERN', 'T2_IT_Pisa', 'T3_US_FNALLPC', 'T2_DE_RWTH', 'T2_ES_CIEMAT', 'T3_US_CMU', 'T2_PT_NCG_Lisbon', 'T2_FR_CCIN2P3', 'T3_KR_KISTI'] to a CMS name through CMS Resource Catalog
2022-01-03 15:56:46,264:ERROR:DataDiscovery:got this exception:
 (35, 'OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to cms-cric.cern.ch:443 ')
2022-01-03 15:56:46,397:ERROR:Handler:Problem handling 220103_144838:cmsbot_crab_outputFiles because of (35, 'OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to cms-cric.cern.ch:443 ') failure, traceback follows
Traceback (most recent call last):
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/TaskWorker/Actions/Handler.py", line 80, in executeAction
    output = work.execute(nextinput, task=self._task, tempDir=self.tempDir)
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/TaskWorker/Actions/DBSDataDiscovery.py", line 243, in execute
    result = self.executeInternal(*args, **kwargs)
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/TaskWorker/Actions/DBSDataDiscovery.py", line 462, in executeInternal
    result = self.formatOutput(task=kwargs['task'], requestname=self.taskName,
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/TaskWorker/Actions/DataDiscovery.py", line 62, in formatOutput
    wmfile['locations'] = resourceCatalog.PNNstoPSNs(locations[wmfile['block']])
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/CRIC/CRIC.py", line 159, in PNNstoPSNs
    mapping = self._CRICSiteQuery(callname='data-processing')
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/CRIC/CRIC.py", line 91, in _CRICSiteQuery
    sitenames = self._getResult(uri, callname=callname, args=extraArgs)
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/CRIC/CRIC.py", line 64, in _getResult
    data = self.refreshCache(cachedApi, apiUrl)
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/Service.py", line 206, in refreshCache
    self.getData(cachefile, url, inputdata, incoming_headers, encoder, decoder, verb, contentType)
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/Service.py", line 279, in getData
    data, dummyStatus, dummyReason, from_cache = self["requests"].makeRequest(uri=url,
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 159, in makeRequest
    result, response = self.makeRequest_pycurl(uri, data, verb, headers)
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/Requests.py", line 176, in makeRequest_pycurl
    response, result = self.reqmgr.request(uri, data, headers, verb=verb,
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/Utils/PortForward.py", line 69, in portMangle
    return callFunc(callObj, url, *args, **kwargs)
  File "/data/srv/TaskManager/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/slc7_amd64_gcc630/cms/crabtaskworker/py3.211222patch1-b387affe8225aa14684125e5aa1a74e0/lib/python3.8/site-packages/WMCore/Services/pycurl_manager.py", line 283, in request
    curl.perform()
pycurl.error: (35, 'OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to cms-cric.cern.ch:443 ')

@belforte

Maybe a good topic for @mapellidario next month? I am not comfortable with namedtuple and decorators (see

def conn_handler(services):

).
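For reference, a stripped-down sketch of the pattern behind conn_handler: a namedtuple holds the cached site list plus a timestamp, and a decorator refreshes it before the wrapped method runs. The refreshSites parameter and the RestAPI class here are hypothetical stand-ins, not the actual CRABServer code:

```python
import time
from collections import namedtuple
from functools import wraps

CMSSitesCache = namedtuple('CMSSitesCache', ['sites', 'cachetime'])

def conn_handler(refreshSites):
    """Decorator factory: refresh self.allCMSNames if empty or older than 30 min."""
    def wrap(func):
        @wraps(func)
        def wrapped_func(self, *args, **kwargs):
            if not self.allCMSNames.sites or self.allCMSNames.cachetime + 1800 < time.time():
                self.allCMSNames = CMSSitesCache(sites=refreshSites(), cachetime=time.time())
            return func(self, *args, **kwargs)
        return wrapped_func
    return wrap

class RestAPI:
    def __init__(self):
        # start with an empty, already-expired cache
        self.allCMSNames = CMSSitesCache(sites=[], cachetime=0)

    @conn_handler(refreshSites=lambda: ['T2_IT_Pisa', 'T2_US_MIT'])
    def sites(self):
        return self.allCMSNames.sites
```

The first call to sites() triggers a refresh; later calls within 30 minutes reuse the namedtuple held on the instance.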

@belforte

On hold, waiting for a decision on who will work on it.

@mapellidario

I will need some guidance (as always), but I will happily take this!

@belforte

Thanks Dario. Certainly! Let's assume that we can get to this sometime in March.

@mapellidario self-assigned this Feb 11, 2022
@belforte

We forgot to remove the on-hold label, doing it now.

@sinonkt

sinonkt commented Feb 18, 2025

Regarding the #8937 changes.

Summary

  1. CRIC-related requests (calls) and the site checking/expanding logic are now handled by the TaskWorker [1][2].
  2. CRABServer no longer talks to CRIC.
  3. In the TaskWorker, we minimize CRIC-related requests via file-based caching at /tmp/tw-cache, refreshing the cache once per day [3].

(See also test results [1][2][3])


Test Results

[1] When the user submits an invalid site (e.g. T2_TH_Bangkok), or a glob pattern to expand (e.g. T2_TH_*), in either the whitelist or the blacklist.

Previously, CRABServer responded right away to the user with a CherryPy 400 Bad Request, as follows:

[kphornsi@lxplus905 workspace]$ ./submit.sh
Will use CRAB configuration file HC-1kj.T2_TH_Bangkok.py
Importing CMSSW configuration pset.py
Finished importing CMSSW configuration pset.py
...
HTTP code/reason = 400/Bad Request .  stdout:
...
        <h2>400 Bad Request</h2>
        <p>Invalid input parameter</p>
        <pre id="traceback"></pre>
...
The server answered with an error.
Server answered with: Invalid input parameter
Reason is: The parameter T2_TH_Bangkok is not in the list of known CMS Processing Site Names
Error Id: d85de5577b8e8ecfdcaef922cb54f18e
The server answered with an error.
Server answered with: Invalid input parameter
Reason is: Cannot expand site T2_TH_* to anything
Error Id: 114f34c56274730ced670100d7691d26

Now the task is rejected later by the TaskWorker, with a SUBMITFAILED task status, as follows.
(PS. This also shows that decommissioning the CRIC-related functionality on CRABServer works as expected.)

[kphornsi@lxplus905 workspace]$ crab status -d /tmp/crabStatusTracking/crab_20250218_031815
Rucio client intialized for account kphornsi
CRAB project directory:		/tmp/crabStatusTracking/crab_20250218_031815
Task name:			250218_021818:kphornsi_crab_20250218_031815
Grid scheduler - Task Worker:	[email protected] - crab-dev-tw03
Status on the CRAB server:	SUBMITFAILED
Task URL to use for HELP:	https://cmsweb-test12.cern.ch/crabserver/ui/task/250218_021818%3Akphornsi_crab_20250218_031815
Dashboard monitoring URL:	https://monit-grafana.cern.ch/d/cmsTMDetail/cms-task-monitoring-task-view?orgId=11&var-user=kphornsi&var-task=250218_021818%3Akphornsi_crab_20250218_031815&from=1739841498000&to=now
Failure message from server:	A site name T2_TH_Bangkok that user specified is not in the list of known CMS Processing Site Names
Log file is /tmp/crabStatusTracking/crab_20250218_031815/crab.log
[kphornsi@lxplus905 workspace]$ crab status -d /tmp/crabStatusTracking/crab_20250218_033144
Rucio client intialized for account kphornsi
CRAB project directory:		/tmp/crabStatusTracking/crab_20250218_033144
Task name:			250218_023148:kphornsi_crab_20250218_033144
Grid scheduler - Task Worker:	[email protected] - crab-dev-tw03
Status on the CRAB server:	SUBMITFAILED
Task URL to use for HELP:	https://cmsweb-test12.cern.ch/crabserver/ui/task/250218_023148%3Akphornsi_crab_20250218_033144
Dashboard monitoring URL:	https://monit-grafana.cern.ch/d/cmsTMDetail/cms-task-monitoring-task-view?orgId=11&var-user=kphornsi&var-task=250218_023148%3Akphornsi_crab_20250218_033144&from=1739842308000&to=now
Failure message from server:	Remote output data site not valid, Cannot expand site T2_TH_* to anything
Log file is /tmp/crabStatusTracking/crab_20250218_033144/crab.log

[2] Likewise, when the user submits a banned storage site, according to the banned-out-dest list from the external central config. (PS. At the time of writing, the banned-out-dest list is empty; here I use ['T2_CH_CERN'] for mock testing.)
(PS. This implies that checkASODestination works as expected in the TaskWorker.)

[kphornsi@lxplus905 workspace]$ crab status -d /tmp/crabStatusTracking/crab_20250218_050327
Rucio client intialized for account kphornsi
CRAB project directory:		/tmp/crabStatusTracking/crab_20250218_050327
Task name:			250218_040331:kphornsi_crab_20250218_050327
Grid scheduler - Task Worker:	[email protected] - crab-dev-tw03
Status on the CRAB server:	SUBMITFAILED
Task URL to use for HELP:	https://cmsweb-test12.cern.ch/crabserver/ui/task/250218_040331%3Akphornsi_crab_20250218_050327
Dashboard monitoring URL:	https://monit-grafana.cern.ch/d/cmsTMDetail/cms-task-monitoring-task-view?orgId=11&var-user=kphornsi&var-task=250218_040331%3Akphornsi_crab_20250218_050327&from=1739847811000&to=now
Failure message from server:	Remote output data site is banned, The output site you specified in the Site.storageSite parameter (T2_CH_CERN) is blacklisted (banned sites: ['T2_CH_CERN'])
Log file is /tmp/crabStatusTracking/crab_20250218_050327/crab.log

[3] The test output shows that CRIC requests are indeed being cached.
(PS. Our CRIC cache time-to-live defaults to 60 * 60 * 24 seconds.)

crab3@crab-dev-tw03:/data/srv/current/lib/python/site-packages/TaskWorker$ CRIC_TTL=120 python3 ExternalService.py
===== Test::Begining with 3 cache keys and [354/18](hits/misses) =====
DEBUG:root:Fetching data from /api/cms/site/query/?json&preset=site-names, with args {'rcsite_state': 'ANY'}
DEBUG:root:getData:
	url: /api/cms/site/query/?json&preset=site-names&rcsite_state=ANY
	verb: GET
	incoming_headers: {}
	data: {}
DEBUG:root:Fetching data from /api/cms/site/query/?json&preset=site-names, with args {'rcsite_state': 'ANY'}
DEBUG:root:Data is from the Service cache
DEBUG:root:Fetching data from /api/cms/site/query/?json&preset=data-processing, with args {'rcsite_state': 'ANY'}
DEBUG:root:getData:
	url: /api/cms/site/query/?json&preset=data-processing&rcsite_state=ANY
	verb: GET
	incoming_headers: {}
	data: {}
===== Test::CacheKeys::[('__main__.CachedCRICService.getAllPSNs', None), ('__main__.CachedCRICService.PNNstoPSNs', [], None), ('__main__.CachedCRICService.getAllPhEDExNodeNames', None)] =====
===== Test::Success::CachedCRICService works as expected with Total/Hits/Misses == (12/9/3) =====
===== Test::Success::with accumulated Cache Stats:: (Hits/Misses) == (363/21) =====

The new caching also mitigates the flooding of CRIC debugging logs from WMCore, which we previously had to temporarily turn on and off with the context manager with tempSetLogLevel(logger=self.logger, level=logging.ERROR): throughout the codebase every time we called CRIC-related functions.
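The TTL cache exercised in the test above could look roughly like this sketch; the class name and counters are illustrative, not the actual CachedCRICService code:

```python
import time

class TTLCache:
    """Memoize fetch results per key for a fixed time-to-live, with hit/miss stats."""
    def __init__(self, ttl):
        self.ttl = ttl
        self.store = {}   # key -> (value, timestamp)
        self.hits = 0
        self.misses = 0

    def get(self, key, fetch):
        entry = self.store.get(key)
        if entry and time.time() - entry[1] < self.ttl:
            self.hits += 1
            return entry[0]
        self.misses += 1  # absent or expired: call the real service
        value = fetch()
        self.store[key] = (value, time.time())
        return value

cache = TTLCache(ttl=120)
for _ in range(4):
    psns = cache.get('getAllPSNs', lambda: ['T2_IT_Bari'])
# one miss on the first call, three hits afterwards
```

Only the misses reach the external service, which is also why the WMCore debug logging fires far less often.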

@sinonkt

sinonkt commented Feb 19, 2025

NOTES: regarding #8937 (comment), the changes are more extensive than we expected.
@belforte Please🥹, let me try explaining things here with my limited experience but a strong willingness to embrace clean code discipline.

Problem Definition:

Throughout the TaskWorker code base, every time we wish to fetch CRIC-related site info in any TaskAction, we have to instantiate CRIC(logger=self.logger, configDict=configDict) within either the envForCMSWEB or the tempSetLogLevel context manager, which leaves us with boilerplate code in many places, as we can see in [1][2][3][4].
(PS. We also have a high tendency to repeat these patterns elsewhere; e.g. in this issue there are _checkSite, _expandSites and _checkASODestination, which make extensive use of multiple CRIC functions.)

[1]

with self.config.TaskWorker.envForCMSWEB:
    configDict = {"cacheduration": 1, "pycurl": True}  # cache duration is in hours
    resourceCatalog = CRIC(logger=self.logger, configDict=configDict)
    try:
        possiblesites = set(resourceCatalog.getAllPSNs())

[2]
## Loop over the sorted list of files.
configDict = {"cacheduration": 1, "pycurl": True}  # cache duration is in hours
with tempSetLogLevel(logger=self.logger, level=logging.ERROR):
    resourceCatalog = CRIC(logger=self.logger, configDict=configDict)
    # can't afford one message from CRIC per file, unless critical!

[3]
with self.config.TaskWorker.envForCMSWEB:
    configDict = {"cacheduration": 1, "pycurl": True}  # cache duration is in hours
    resourceCatalog = CRIC(logger=self.logger, configDict=configDict)
    locations = resourceCatalog.getAllPSNs()

[4]
with self.config.TaskWorker.envForCMSWEB:
    configDict = {"cacheduration": 1, "pycurl": True}  # cache duration is in hours
    self.resourceCatalog = CRIC(logger=self.logger, configDict=configDict)

def getListOfSites(self):
    """ Get the list of sites to use for PrivateMC workflows.
    For the moment we are filtering out T1_ since they are precious resources
    and we don't want to overtake production (WMAgent) jobs there. In the
    future we would like to take this list from the SSB.
    """
    with self.config.TaskWorker.envForCMSWEB:
        sites = self.resourceCatalog.getAllPSNs()
    filteredSites = [site for site in sites if not site.startswith("T1_")]


Solution:

As you might know, what I'm trying to do is Dependency Injection, a weak form of Inversion of Control (IoC), like we did with config [WMCore.Configuration] in a higher context manager (e.g. Actions/Handler.py), but without overdoing it to the point of becoming a web framework like Angular, Laravel or NestJS.

  1. Instantiate CachedCRICService in Handler and PreDag instead, so that we can easily inject the CachedCRICService instance into any TaskAction. (PS. See also this commit 11bd4b1.)
  2. Generalize the related classes' constructor signatures: since the DataDiscovery class is the one that uses CRIC, and it is the parent class of DBSDataDiscovery/RucioDataDiscovery/UserDataDiscovery, refactoring to a more generic constructor is a necessity, and it also makes it easier to pass new positional or keyword args later. (PS. As I did here: 3a59622.)
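The two steps above amount to something like the following sketch; the class names echo the ones mentioned in this thread, but the bodies are simplified illustrations, not the actual commits:

```python
class CachedCRICService:
    """Fetches the PSN list at most once and serves it from memory afterwards."""
    def __init__(self, fetchPSNs):
        self._fetchPSNs = fetchPSNs
        self._psns = None

    def getAllPSNs(self):
        if self._psns is None:
            self._psns = self._fetchPSNs()
        return self._psns

class MyAction:
    """A TaskAction-like class that receives its CRIC service instead of building it."""
    def __init__(self, config, crabserver=None, cricService=None, **kwargs):
        self.config = config
        self.cricService = cricService   # injected by the Handler, not constructed here

    def execute(self):
        # e.g. the PrivateMC site list, with T1_* filtered out
        return [s for s in self.cricService.getAllPSNs() if not s.startswith('T1_')]

# Handler-side wiring: one service instance shared by all actions in a run
cric = CachedCRICService(fetchPSNs=lambda: ['T1_US_FNAL', 'T2_IT_Pisa'])
action = MyAction(config=None, cricService=cric)
```

Because the Handler owns the single service instance, every action in the same run shares one cache, and tests can inject a stub service without touching CRIC at all.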

Clean Code Principles that we embrace/emphasize here:

  1. Boy Scout Rule: every day, leave the campground cleaner than you found it.🥹
  2. Use Dependency Injection: I think in this issue it leaves us (future me and the CRAB folks) with a tolerable practice/pattern to follow.
  3. Don't Repeat Yourself (DRY): reduce boilerplate/redundant code.

🥹 I'm by no means an expert at this; I used to have discipline once in a while, but I wish to be a disciplined Boy Scout someday! Of course, not overdoing it and not breaking things are the number one priority!
(See also in this gist Summary of 'Clean code' by Robert C. Martin)

@sinonkt

sinonkt commented Feb 26, 2025

In response to Stefano's review, I've split PR #8937 into 2 stages, as follows.
