Hardware:Status
From CAC Wiki
Latest revision as of 19:39, 5 July 2023
This page shows information about the status of systems at the Centre for Advanced Computing. It will be updated with additional information as new events arise.
System Status Messages

| Date | Affected systems | Details/reason | Resolution |
|---|---|---|---|
| 05/07/23 - continuing | Frontenac Signup | Frontenac signup unavailable due to website work | New CAC website is being rolled out; troubleshooting and fixes are underway. |
| 22/01/22 - 8:00-4:30 | Frontenac Cluster | On-site power work | Electrical work requires the Frontenac Cluster to run off a generator. No disruptions are expected. |
| 07/09/2019 - 9:30 am | caclogin03 | caclogin03 down | Login node became non-responsive and required a reboot; rebooted and updated this host. |
| 07/02/2019 - 9:40 am | caclogin04 | caclogin04 unresponsive | Login node became unresponsive and required a reboot; reboot fixed the issue and the host was updated. |
| 05/17/2019 - 11 am | caclogin03 | caclogin03 down | Login node became non-responsive; reboot resolved the issue. |
| 02/27/2019 - 10 am | caclogin02 | caclogin02 VM down | Login traffic directed to caclogin03/4; reboot resolved the issue. |
| 12/10/2018 - 8 am | caclogin02 | caclogin02 VM down | Login traffic directed to caclogin03/4. |
| 12/04/2018 - 8 pm | caclogin03/04 | Issues with login nodes; caclogin02 works | Resolved after reboot. |
| 10/29/2018 - 4 pm | caclogin03 | / file system full | Resolved. |
| 08/06/2018 - 08/10/2018 | Cluster downtime | Scheduled filesystem upgrade | Planned downtime. |
| 07/12/2018 - 10:30 AM | GPFS outage | Filesystem temporarily unavailable | Resolved. |
| 06/27/2018 - 9:00 AM | Login node shutdown | Maintenance (unscheduled) | Node back in service. |
| 06/20/2018 - 8:00 AM | Login node non-responsive | Cause: out of memory | Resolved; login restored (take-down, reboot). |
| 05/01/2018 - 9:00 AM | Scheduler maintenance | Scheduled upgrade/downtime of scheduler | Resolved. |
| 04/23/2018 - 7:00 AM | Frontenac login node | Login issues; reboot | Functional after reboot. |
| 04/19/2018 - 3:30 PM | Frontenac login node | Lost access to file system; reboot | Resolved after reboot. |
| 03/16/2018 - 11:00 AM | Scheduler upgrade | Scheduled upgrade/downtime of scheduler | Upgrade complete; working on X11 support. |
| 01/28/2018 - 5:00 AM | Frontenac login node caclogin02 | Node went down out of schedule | Login restored; investigating causes. |
| 01/18/2018 - 11:30 AM | Frontenac login node caclogin01 | Out-of-schedule shutdown/reboot (~45 min) | Updates/maintenance. |
| 11/21/2017 - 11:00 PM | Frontenac (all nodes) | Temporary unmount of /global file system | Re-mounted; file system accessible. |
| 10/30/2017 - 8:00 AM | Multiple production nodes unreachable | Scheduler lost contact with production nodes | Nodes will be transferred to Frontenac. |
| 10/30/2017 - 8:00 AM | swlogin1 (login node) | No login possible | Login restored. |
| 10/03/2017 - 8:00 AM | head-6b | Disk array near capacity | Working on reducing usage. |
| 10/02/2017 - 8:00 AM | head-6b | Disk array full | Partly resolved (freed 4 TB). |
| 7/13/2017 - 10:00 AM | swlogin1 | Unreachable through SSH | Resolved. |
| 7/13/2017 - 8:00 AM | caclogin01 | Temporary maintenance shutdown | Back up. |