‘The Long Game’ by NGD’s Engineering Director Mark Campion, published in Data Economy, discusses the engineering skills and financial requirements necessary for ensuring a data centre’s longevity in the face of growing compute and storage demands from the Cloud, HPC and the IoT.
Maximising a data centre’s longevity and ensuring it remains fit for purpose is an ongoing process.
Rather like painting the Forth Bridge, it's never finished. Many factors have the potential to impact IT systems' uptime, availability and security: power distribution management, cooling, UPS and network availability among them.
Consequently, when it comes to the demands of supporting Cloud, IoT and HPC applications and environments, many organisations with in-house facilities are finding it increasingly cost-prohibitive and technically demanding to keep them fit for purpose. The growing shift towards colocation demonstrates this.
From a business perspective, this would appear to be good news for all third party operators. But for many it increases the pressure to maximise ROI on existing premises and critical infrastructure, while ensuring facilities remain fit for purpose.
Faced with such challenges, some operators will simply take a short-term view, seeking to maximise profit and then exit. Others will be unable to compete because they become increasingly power-strapped, or because they lack sufficient high speed carrier, ISP and public cloud connectivity options. And many will struggle with infrastructure that lacks the resilience to support concurrent maintenance, not to mention the engineering skills needed to undertake rigorous preventative maintenance.
For those operators willing and able to play the long game, an ongoing infrastructure preventative maintenance and repair programme is an essential part of longevity and resilience.
A long life starts with a highly duplicated plant architecture that supports concurrent maintenance with minimal risk to customers. Without this, thorough lifecycle maintenance is problematic. A minimum of N + 1 redundancy infrastructure is therefore a prerequisite.
However, it is not safe to assume when a component failure occurs that N+1 will solve all problems. Ensuring a quick repair places a premium on staff training, predictive diagnostics, good support contracts and on-site spares. A component’s design MTBF will only be achieved if it is regularly and competently maintained.
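The value of N+1 can be made concrete with some simple availability arithmetic. The sketch below, a minimal illustration assuming independent unit failures and a hypothetical 99% per-unit availability (not an NGD figure), shows how adding one redundant unit improves the odds of the plant staying up:

```python
from math import comb

def system_availability(unit_availability: float, needed: int, total: int) -> float:
    """Probability that at least `needed` of `total` independent units are up."""
    a = unit_availability
    return sum(
        comb(total, k) * a**k * (1 - a)**(total - k)
        for k in range(needed, total + 1)
    )

# A single unit at a hypothetical 99% availability: the system is only
# as good as that one unit.
single = system_availability(0.99, needed=1, total=1)   # 0.99

# N+1: two units are needed to carry the load, three are installed.
# One unit can fail (or be taken out for maintenance) with no outage.
n_plus_1 = system_availability(0.99, needed=2, total=3)  # ~0.9997
```

As the article notes, though, this arithmetic only holds while each unit actually achieves its design availability, which is exactly why regular, competent maintenance and quick repair matter: a failed unit that sits unrepaired quietly turns N+1 back into N.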
Data centre lifecycle maintenance is similar to aircraft maintenance: a full regime requires detailed and continuous testing, inspection and replacement. This ranges from regular replacement of capacitors, fans and batteries, through stripping and greasing breakers, to high tech thermal imaging and component testing for minute flaws.
Securing inclusive OEM maintenance packages at the point of procurement is another important consideration. Plant manufacturers are focussed on new equipment sales, so the challenge for operators is to push OEMs to provide long term support as part of a combined package, in partnership.
Operating parameters will also influence life span and mean time between failure (MTBF). Best practice is to avoid high load factors, whether caused by a lack of resilience or by selling to maximum capacity.
Slowly but surely operators are better understanding the balance between financial returns, service continuity and reputation. Best practice extends also to operating exactly within OEM stated environmental parameters.
Finally, regular testing of the UPS and generator back-up systems, rather than waiting for a real-life loss of mains power, is the final proof of whether or not the critical facilities infrastructure is still fit for purpose. Absolute proof comes with 'Black Testing'. It's not for the faint hearted and many data centres simply don't do it.
Every six months NGD isolates incoming mains grid power and, for up to sixteen seconds, the UPS takes the full load while the emergency backup generators kick in. Clearly, we are only cutting the power to one side of a 2N+2 infrastructure, and it's done under strictly controlled conditions.
The extent to which operators take on all of the above varies enormously. Any potential customer organisation should therefore ensure comprehensive maintenance and testing is factored into their buying criteria.
When it’s time for change
Inevitably, the time will come for operators to replace/upgrade outdated or no longer supported electrical and mechanical plant. The challenge here is both financial and logistical.
The prudent approach is to budget for short term and longer term infrastructure needs as part of an ongoing planned and preventative maintenance programme. This shifts all or some major purchase expenses from capex to opex.
But making this viable assumes an operator business has sufficient long term contracted revenues, excellent economies of scale, good cash flow, and if needs be, ready access to additional funding. For many this is far easier said than done.
On the logistics side, there will need to be thorough planning with customers, coupled with provisions for parallel build, logistics support and swing space. Larger facilities have the advantage of free space for refit, and because they generally built out their plant incrementally, it can be refitted incrementally too.
In summary, aside from financial resources, the essentials for maximising a data centre's lifespan are abundant space and power and deep engineering expertise: the ability to scale while continuing to resiliently meet the ever increasing rack density and cooling requirements driven by Cloud, the IoT and HPC. This is proving a step too far for many organisations, creating a surge in demand for third party data centre facilities that have what it takes to keep pace.