
Rapid Recovery: Resolving MES Server Downtime After the CrowdStrike Incident
Optimizing IT Support for MES Server Recovery Through Knowledge Management
Executive Summary
In today’s highly automated Life Sciences industry, server downtime can significantly disrupt manufacturing operations and lead to costly delays. This case study demonstrates how our 24/7 IT support team efficiently managed critical MES (Manufacturing Execution System) server issues after the CrowdStrike incident on July 19, 2024, which impacted industries worldwide. By leveraging our knowledge management strategy, we resolved communication breakdowns and application locks and restarted the impacted servers within three hours, minimizing operational downtime and ensuring business continuity.
Challenges Overview
Following the CrowdStrike incident, our client's manufacturing operations experienced widespread server failures, affecting MES application servers, database servers, and related services:
- ERP to MES Communication Interruption
  Issue: The interface that transfers process order messages, material data, and order management messages between ERP and MES was interrupted.
  Impact: Immediate disruption of the manufacturing process.
- PCS to MES Communication Interruption
  Issue: Server downtime compromised the services that manage communication between ANSI/ISA-95 Level 2 data sources and MES.
  Impact: Halted data flow between PCS and MES, affecting real-time process controls.
- MES Application Lock
  Issue: Open MES sessions were interrupted, causing an application lock that prevented access to critical functionalities.
  Impact: Blocked access delayed manufacturing activities.
- Missing Pallets
  Issue: Pallet data showed discrepancies, such as pallets missing in ERP and count mismatches.
  Impact: Disrupted shipping and packaging processes due to inconsistencies between MES and ERP data.
- Load Balancing Issues
  Issue: Multiple database nodes were impacted, leading to high loads on the remaining servers.
  Impact: Unbalanced server load reduced MES performance, causing delays in manufacturing processes.
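To make the pallet discrepancies described above more concrete, the following is a minimal sketch of how a reconciliation check between MES and ERP pallet records could look. It is an illustration only and not the procedure used during the incident: the function name `reconcile_pallets`, the pallet IDs, and the idea of comparing simple ID-to-quantity exports are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PalletDiff:
    """Result of comparing pallet records exported from MES and ERP."""
    missing_in_erp: list[str]                     # present in MES only
    missing_in_mes: list[str]                     # present in ERP only
    count_mismatches: dict[str, tuple[int, int]]  # pallet_id -> (mes_qty, erp_qty)


def reconcile_pallets(mes: dict[str, int], erp: dict[str, int]) -> PalletDiff:
    """Compare hypothetical pallet_id -> quantity exports from both systems."""
    missing_in_erp = sorted(mes.keys() - erp.keys())
    missing_in_mes = sorted(erp.keys() - mes.keys())
    # Pallets known to both systems but with different quantities.
    count_mismatches = {
        pid: (mes[pid], erp[pid])
        for pid in sorted(mes.keys() & erp.keys())
        if mes[pid] != erp[pid]
    }
    return PalletDiff(missing_in_erp, missing_in_mes, count_mismatches)


if __name__ == "__main__":
    # Made-up sample data for illustration.
    mes_export = {"P001": 40, "P002": 36, "P003": 40}
    erp_export = {"P001": 40, "P003": 38, "P004": 10}
    print(reconcile_pallets(mes_export, erp_export))
```

A check like this only reports differences; deciding which system holds the correct value remains a manual or process-specific step.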
Existing Business Continuity Plans (BCPs) provided sufficient guidance for server restoration but proved ineffective for application recovery. Leveraging Inteldot’s Knowledge Base (KB), our System Administration team resolved each of these issues without contacting the software vendor, resulting in significantly faster recovery than at our client’s sister sites.
The KB gave the entire System Administration team access to collective expertise, allowing newer team members to confidently resolve complex issues using detailed, step-by-step guides.
[Figure: Inteldot’s Knowledge-Based Process in action]

By following this structured, repeatable process, our team resolves issues faster and enhances operational resilience, ensuring efficient management of future incidents. This approach also provides several key benefits:
- Reduces dependency on external support.
- Allows for rapid troubleshooting and minimized downtime.
- Enables scalable knowledge management.
Lessons Learned & Future Improvements
While the incident was successfully managed, we identified several opportunities for improvement:
- Enhanced Server Monitoring: Implementing additional real-time monitoring tools to detect potential server overloads before they escalate.
- Proactive Communication with Database Administration Teams: Strengthening coordination with database teams to ensure faster restarts and more effective load balancing during incidents.
- Updating the BCP: Revising the BCP to include specific protocols for application recovery, incorporating new inputs from the SMEs who participated in the disaster recovery process.
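As a rough illustration of the kind of overload detection mentioned under Enhanced Server Monitoring, the sketch below flags database nodes whose recent average utilization exceeds a threshold. It does not describe any specific monitoring tool: the node names, the 0.8 threshold, the three-sample window, and the assumption that utilization arrives as values between 0 and 1 are all hypothetical.

```python
from statistics import mean


def overloaded_nodes(
    samples: dict[str, list[float]],
    threshold: float = 0.8,  # assumed utilization limit (0.0 to 1.0)
    window: int = 3,         # assumed number of most recent samples to average
) -> list[str]:
    """Return nodes whose average utilization over the last `window` samples exceeds `threshold`."""
    flagged = []
    for node, values in samples.items():
        recent = values[-window:]
        # Skip nodes that have not yet reported a full window of samples.
        if len(recent) == window and mean(recent) > threshold:
            flagged.append(node)
    return sorted(flagged)


if __name__ == "__main__":
    # Made-up utilization readings for illustration.
    readings = {
        "db-node-1": [0.55, 0.90, 0.95, 0.92],
        "db-node-2": [0.40, 0.45, 0.50, 0.42],
    }
    print(overloaded_nodes(readings))
```

Averaging over a short window rather than alerting on a single reading is one simple way to avoid reacting to momentary spikes; the right window and threshold depend on the environment.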
Conclusion
This case study demonstrates the power of a well-documented, knowledge-based approach in managing critical server downtimes. By quickly resolving MES and Data Historian server issues, we protected our client's revenue, compliance, and operational continuity. A strong culture of knowledge management, coupled with a dynamic approach to revisiting and updating Business Continuity Plans based on real incidents, ensures resilience and preparedness.
Next Steps
If you’re interested in learning how our IT support team can help safeguard your operations with proactive, knowledge-based solutions, we invite you to connect with us for a more in-depth discussion.