Difference between revisions of "Hardware:Status"

Latest revision as of 19:39, 5 July 2023

This page shows information about the status of systems at the Centre for Advanced Computing. It will be updated with additional information as new events arise.

System Status Messages
Date	Affected systems	Details/reason	Resolution
05/07/23 - Continuing	Frontenac Signup	Frontenac signup unavailable due to website work	New CAC website is currently being rolled out and troubleshooting and fixes are underway.
22/01/22 - 8-4:30	Frontenac Cluster	Onsite Power work	Electrical work will require Frontenac Cluster to run off of generator. No disruptions are expected.
07/09/2019 - 9.30 am	caclogin03	caclogin03 down	login node requires reboot after becoming non-responsive. Rebooted and updated this host
07/02/2019 - 9.40 am	caclogin04	caclogin04 unresponsive	login node requires reboot after becoming unresponsive. Reboot fixed the issue and updated this host
05/17/2019 - 11 am	caclogin03	caclogin03 down	login node requires reboot after becoming non-responsive; reboot resolved the issue
02/27/2019 - 10 am	caclogin02	caclogin02 VM down	login traffic directed to caclogin03/4, reboot resolved issue
12/10/2018 - 8 am	caclogin02	caclogin02 VM down	login traffic directed to caclogin03/4
12/04/2018 - 8 pm	caclogin03/04	issues with login nodes; caclogin02 works	resolved after reboot
10/29/2018 - 4 pm	caclogin03	/ file system full	resolved
08/06/2018 - 08/10/2018	Cluster downtime	Scheduled filesystem upgrade	planned downtime
07/12/2018 - 10:30 AM	GPFS outage	Filesystem temporarily unavailable	resolved
06/27/2018 - 9:00 AM	Login node shutdown	Maintenance (unscheduled)	node back in service
06/20/2018 - 8:00 AM	Login node non-responsive	Cause : out of memory	resolved, login restored (take-down, reboot)
05/01/2018 - 9:00 AM	Scheduler maintenance	Scheduled upgrade/downtime of scheduler	resolved
04/23/2018 - 7:00 AM	Frontenac login node	login issues, reboot	functional after reboot
04/19/2018 - 3:30 PM	Frontenac login node	lost access to file system, reboot	resolved after reboot
03/16/2018 - 11:00 AM	Scheduler upgrade	Scheduled upgrade/downtime of scheduler	Upgrade complete, working on x11 support
01/28/2018 - 5:00 AM	Frontenac login node caclogin02	Node went down out of schedule	login restored, investigating causes
01/18/2018 - 11:30 AM	Frontenac login node caclogin01	Out-of-schedule shutdown / reboot (~45min)	updates / maintenance
11/21/2017 - 11:00 PM	Frontenac (all nodes)	Temporary unmount of /global file system	re-mounted, file system accessible
10/30/2017 - 8:00 AM	multiple production nodes unreachable	scheduler lost contact to production nodes	nodes will be transferred to Frontenac
10/30/2017 - 8:00 AM	swlogin1 (login node)	No login possible	login restored
10/03/2017 - 8:00 AM	head-6b	disk array at near capacity	working on reducing usage
10/02/2017 - 8:00 AM	head-6b	disk array full	partly resolved (freed 4 TB)
7/13/2017 - 10:00 AM	swlogin1	unreachable through ssh	resolved
7/13/2017 - 8:00 AM	caclogin01	temporary maintenance shutdown	back up

@@ Line 5: / Line 5: @@
 !colspan="5"| '''System Status Messages'''
 |-
-|'''Date/Time'''
+| '''Date'''
-|'''Affected Systems'''
+| '''Affected systems'''
-|'''Issue'''
+| '''Details/reason'''
-|'''Details'''
+| '''Resolution'''
-|'''Resolved ?'''
 |-
-| 3/21/2017 - 1:30 PM
+| 05/07/23 - Continuing
-| All Compute / Login
+| Frontenac Signup
-| Power blip / outage
+| Frontenac signup unavailable due to website work
-| Shutdown of all compute clusters and login nodes.
+| New CAC website is currently being rolled out and troubleshooting and fixes are underway.
-| Yes
 |-
-| 3/22/2017 - 10:30 AM
+| 22/01/22 - 8-4:30
-| All Compute / Login
+| Frontenac Cluster
-| Recovery from power outage
+| Onsite Power work
-| Login nodes, system, and data access restored. Compute cluster still down, scheduler queues disabled.
+| Electrical work will require Frontenac Cluster to run off of generator. No disruptions are expected.
-| Yes
 |-
-| 3/24/2017 - 8:00 AM
-| All Compute
-| Recovery from power outage
-| Compute cluster nodes cac013-cac099 up and running. Scheduler queues restricted/disabled.
-| Yes
 |-
-| 3/24/2017 - 2:00 PM
+| 07/09/2019 - 9.30 am
-| All Compute
+| caclogin03
-| Recovery from power outage
+| caclogin03 down
-| Scheduler queues for SW (Linux) compute cluster re-opened. Cluster is up and running. SNO (SX) cluster queues still disabled.
+| login node requires reboot after becoming non-responsive. Rebooted and updated this host
-| Yes
 |-
-| 3/24/2017 - 3:00 PM
-| All Compute
-| Recovery from power outage
-| Scheduler queues for SX (SNO, Linux) compute cluster re-opened. Cluster is up and running.
-| Yes
 |-
-| 3/27/2017 - 2:00 PM
+| 07/02/2019 - 9.40 am
-| File system (disk arrays 1 and 2)
+| caclogin04
-| Troubleshooting on disk arrays
+| caclogin04 unresponsive
-| Replacing disks, rebooting head units; intermittent login and disk access issues to be expected.
+| login node requires reboot after becoming unresponsive. Reboot fixed the issue and updated this host
-| Yes
 |-
-| 3/28/2017 - 2:00 PM
-| cac029 (compute)
-| cac029 off-line
-| cac029 is undergoing memory maintenance.
-| Yes
 |-
-| 4/13/2017 - 8:00 AM
+| 05/17/2019 - 11 am
-| swlogin1 (login/workup)
+| caclogin03
-| login problems
+| caclogin03 down
-| connectivity issues on swlogin1 prevent or delay login from sflogin0
+| login node requires reboot after becoming non-responsive; reboot resolved the issue
-| Yes
 |-
-| 5/02/2017 - 8:00 AM
+| 02/27/2019 - 10 am
-| all nodes (login & "SW" production)
+| caclogin02
-| Grid Engine scheduler not functional
+| caclogin02 VM down
-| The Grid Engine scheduler is currently not functional; qstat/qsub/qmon not available
+| login traffic directed to caclogin03/4, reboot resolved issue
-| Yes
 |-
-| 5/02/2017 - 1:00 PM
+| 12/10/2018 - 8 am
-| swlogin1
+| caclogin02
-| Grid Engine scheduler not functional
+| caclogin02 VM down
-| Network issues on SGE_PROD
+| login traffic directed to caclogin03/4
-| Yes
 |-
-| 5/10/2017 - 8:00 AM
+| 12/04/2018 - 8 pm
-| swlogin1  / cac012-26
+| caclogin03/04
-| Connectivity issues
+| issues with login nodes; caclogin02 works
-| Access to storage temporarily lost
+| resolved after reboot
-| Yes
 |-
-| 5/10/2017 - 9:00 AM
+| 10/29/2018 - 4 pm
-| swlogin1
+| caclogin03
-| Connectivity restored
+| / file system full
-| Login restored
+| resolved
-| Yes
 |-
-| 5/10/2017 - 12:00 PM
+| 08/06/2018 - 08/10/2018
-| cac012-27
+| Cluster downtime
-| Reboot
+| Scheduled filesystem upgrade
-| Queues temporarily disabled
+| planned downtime
-| Yes
 |-
-| 7/13/2017 - 10:00 PM
+| 07/12/2018 - 10:30 AM
-| all systems
+| GPFS outage
-| issues with Grid Engine qmaster
+| Filesystem temporarily unavailable
 | resolved
-| Yes
 |-
-| 7/13/2017 - 10:00 PM
+| 06/27/2018 - 9:00 AM
-| swlogin1
+| Login node shutdown
-| unreachable through ssh
+| Maintenance (unscheduled)
+| node back in service
+|-
+| 06/20/2018 - 8:00 AM
+| Login node non-responsive
+| Cause : out of memory
+| resolved, login restored (take-down, reboot)
+|-
+| 05/01/2018 - 9:00 AM
+| Scheduler maintenance
+| Scheduled upgrade/downtime of scheduler
 | resolved
-| Yes
 |-
-| 10/02/2017 - 8:00 AM
+| 04/23/2018 - 7:00 AM
-| head-6b
+| Frontenac login node
-| disk array full
+| login issues, reboot
-| partly resolved (freed 4 TB)
+| functional after reboot
-| Yes
 |-
-| 10/03/2017 - 8:00 AM
+| 04/19/2018 - 3:30 PM
-| head-6b
+| Frontenac login node
-| disk array at near capacity
+| lost access to file system, reboot
-| working on reducing usage
+| resolved after reboot
-| yes
 |-
-| 10/30/2017 - 8:00 AM
+| 03/16/2018 - 11:00 AM
-| swlogin1 (login node)
+| Scheduler upgrade
-| No login possible
+| Scheduled upgrade/downtime of scheduler
-| login restored
+| Upgrade complete, working on x11 support
-| yes
 |-
-| 10/30/2017 - 8:00 AM
+| 01/28/2018 - 5:00 AM
-| multiple production nodes unreachable
+| Frontenac login node caclogin02
-| scheduler lost contact to production nodes
+| Node went down out of schedule
-| nodes will be transferred to Frontenac
+| login restored, investigating causes
-| yes
+|-
+| 01/18/2018 - 11:30 AM
+| Frontenac login node caclogin01
+| Out-of-schedule shutdown / reboot (~45min)
+| updates / maintenance
 |-
 | 11/21/2017 - 11:00 PM
@@ Line 129: / Line 112: @@
 | Temporary unmount of /global file system
 | re-mounted, file system accessible
-| yes
 |-
-| 11/22/2017 - 11:00 AM
+| 10/30/2017 - 8:00 AM
-| '''Frontenac (all nodes)'''
+| multiple production nodes unreachable
-| '''production jobs terminated due to FS issues'''
+| scheduler lost contact to production nodes
-| '''please re-submit your jobs'''
+| nodes will be transferred to Frontenac
-| '''no'''
+|-
+| 10/30/2017 - 8:00 AM
+| swlogin1 (login node)
+| No login possible
+| login restored
+|-
+| 10/03/2017 - 8:00 AM
+| head-6b
+| disk array at near capacity
+| working on reducing usage
+|-
+| 10/02/2017 - 8:00 AM
+| head-6b
+| disk array full
+| partly resolved (freed 4 TB)
+|-
+| 7/13/2017 - 10:00 AM
+| swlogin1
+| unreachable through ssh
+| resolved
+|-
+| 7/13/2017 - 8:00 AM
+| caclogin01
+| temporary maintenance shutdown
+| back up
 |-
 |}
