Hardware:Status


This page shows the status of systems at the Centre for Advanced Computing. It is updated as new events arise.

{| class="wikitable"
|+ System Status Messages
! Date/Time !! Affected Systems !! Issue !! Details !! Resolved?
|-
| 3/21/2017 - 1:30 PM || All Compute / Login || Power blip / outage || Shutdown of all compute clusters and login nodes. || Yes
|-
| 3/22/2017 - 10:30 AM || All Compute / Login || Recovery from power outage || Login nodes, system, and data access restored. Compute cluster still down, scheduler queues disabled. || Yes
|-
| 3/24/2017 - 8:00 AM || All Compute || Recovery from power outage || Compute cluster nodes cac013-cac099 up and running. Scheduler queues restricted/disabled. || Yes
|-
| 3/24/2017 - 2:00 PM || All Compute || Recovery from power outage || Scheduler queues for SW (Linux) compute cluster re-opened. Cluster is up and running. SNO (SX) cluster queues still disabled. || Yes
|-
| 3/24/2017 - 3:00 PM || All Compute || Recovery from power outage || Scheduler queues for SX (SNO, Linux) compute cluster re-opened. Cluster is up and running. || Yes
|-
| 3/27/2017 - 2:00 PM || File system (disk arrays 1 and 2) || Troubleshooting on disk arrays || Replacing disks, rebooting head units; intermittent login and disk access issues to be expected. || Yes
|-
| 3/28/2017 - 2:00 PM || cac029 (compute) || cac029 off-line || cac029 is undergoing memory maintenance. || Yes
|-
| 4/13/2017 - 8:00 AM || swlogin1 (login/workup) || Login problems || Connectivity issues on swlogin1 prevent or delay login from sflogin0. || Yes
|-
| 5/02/2017 - 8:00 AM || All nodes (login & "SW" production) || Grid Engine scheduler not functional || The Grid Engine scheduler is currently not functional; qstat/qsub/qmon not available. || Yes
|-
| 5/02/2017 - 1:00 PM || swlogin1 || Grid Engine scheduler not functional || Network issues on SGE_PROD. || Yes
|-
| 5/10/2017 - 8:00 AM || swlogin1 / cac012-26 || Connectivity issues || Access to storage temporarily lost. || Yes
|-
| 5/10/2017 - 9:00 AM || swlogin1 || Connectivity restored || Login restored. || Yes
|-
| 5/10/2017 - 12:00 PM || cac012-27 || Reboot || Queues temporarily disabled. || Yes
|-
| 7/13/2017 - 10:00 PM || All systems || Issues with Grid Engine qmaster || Resolved. || Yes
|-
| 7/13/2017 - 10:00 PM || swlogin1 || Unreachable through ssh || Resolved. || Yes
|-
| 10/02/2017 - 8:00 AM || head-6b || Disk array full || Partly resolved (freed 4 TB). || Yes
|-
| 10/03/2017 - 8:00 AM || head-6b || Disk array near capacity || Working on reducing usage. || Yes
|-
| 10/30/2017 - 8:00 AM || swlogin1 (login node) || No login possible || Login restored. || Yes
|-
| 10/30/2017 - 8:00 AM || multiple production nodes || Scheduler lost contact with production nodes || Nodes will be transferred to Frontenac. || Yes
|-
!colspan="5"| Issues with scheduler-to-node connection; multiple production nodes unreachable
|-
| 11/21/2017 - 11:00 PM || Frontenac (all nodes) || Temporary unmount of /global file system || Remounted, but production jobs affected; scheduler inactive. || '''No'''
|}